Tag Archives: Amazon QuickSight

Extracting key insights from Amazon S3 access logs with AWS Glue for Ray

2023-09-07 Cristiane de Melo

Post Syndicated from Cristiane de Melo original https://aws.amazon.com/blogs/big-data/extracting-key-insights-from-amazon-s3-access-logs-with-aws-glue-for-ray/

Customers of all sizes and industries use Amazon Simple Storage Service (Amazon S3) to store data globally for a variety of use cases. Customers want to know how their data is being accessed, when it is being accessed, and who is accessing it. With exponential growth in data volume, centralized monitoring becomes challenging. It is also crucial to audit granular data access for security and compliance needs.

This blog post presents an architecture solution that allows customers to extract key insights from Amazon S3 access logs at scale. We will partition and format the server access logs with Amazon Web Services (AWS) Glue, a serverless data integration service, to generate a catalog for access logs and create dashboards for insights.

Amazon S3 access logs

Amazon S3 access logs monitor and log Amazon S3 API requests made to your buckets. These logs can track activity, such as data access patterns, lifecycle and management activity, and security events. For example, server access logs could answer a financial organization’s question about how many requests are made and who is making what type of requests. Amazon S3 access logs provide object-level visibility and incur no additional cost besides storage of logs. They store attributes such as object size, total time, turn-around time, and HTTP referer for log records. For more details on the server access log file format, delivery, and schema, see Logging requests using server access logging and Amazon S3 server access log format.

Key considerations when using Amazon S3 access logs:

Amazon S3 delivers server access log records on a best-effort basis. Amazon S3 does not guarantee the completeness and timeliness of them, although delivery of most log records is within a few hours of the recorded time.
A log file delivered at a specific time can contain records written at any point before that time. A log file may not capture all log records for requests made up to that point.
Amazon S3 access logs provide small unpartitioned files stored as space-separated, newline-delimited records. They can be queried using Amazon Athena, but this solution poses high latency and increased query cost for customers generating logs in petabyte scale. Top 10 Performance Tuning Tips for Amazon Athena include converting the data to a columnar format like Apache Parquet and partitioning the data in Amazon S3.
Amazon S3 listing can become a bottleneck even if you use a prefix, particularly with billions of objects. Amazon S3 uses the following object key format for log files:
TargetPrefixYYYY-mm-DD-HH-MM-SS-UniqueString/

TargetPrefix is optional and makes it simpler for you to locate the log objects. We use the YYYY-mm-DD-HH format to generate a manifest of logs matching a specific prefix.

Architecture overview

The following diagram illustrates the solution architecture. The solution uses AWS Serverless Analytics services such as AWS Glue to optimize data layout by partitioning and formatting the server access logs to be consumed by other services. We catalog the partitioned server access logs from multiple Regions. Using Amazon Athena and Amazon QuickSight, we query and create dashboards for insights.

Architecture Diagram

As a first step, enable server access logging on S3 buckets. Amazon S3 recommends delivering logs to a separate bucket to avoid an infinite loop of logs. Both the user data and logs buckets must be in the same AWS Region and owned by the same account.

AWS Glue for Ray, a data integration engine option on AWS Glue, is now generally available. It combines AWS Glue’s serverless data integration with Ray (ray.io), a popular new open-source compute framework that helps you scale Python workloads. The Glue for Ray job will partition and store the logs in parquet format. The Ray script also contains checkpointing logic to avoid re-listing, duplicate processing, and missing logs. The job stores the partitioned logs in a separate bucket for simplicity and scalability.

The AWS Glue Data Catalog is a metastore of the location, schema, and runtime metrics of your data. AWS Glue Data Catalog stores information as metadata tables, where each table specifies a single data store. The AWS Glue crawler writes metadata to the Data Catalog by classifying the data to determine the format, schema, and associated properties of the data. Running the crawler on a schedule updates AWS Glue Data Catalog with new partitions and metadata.

Amazon Athena provides a simplified, flexible way to analyze petabytes of data where it lives. We can query partitioned logs directly in Amazon S3 using standard SQL. Athena uses AWS Glue Data Catalog metadata like databases, tables, partitions, and columns under the hood. AWS Glue Data Catalog is a cross-Region metadata store that helps Athena query logs across multiple Regions and provide consolidated results.

Amazon QuickSight enables organizations to build visualizations, perform case-by-case analysis, and quickly get business insights from their data anytime, on any device. You can use other business intelligence (BI) tools that integrate with Athena to build dashboards and share or publish them to provide timely insights.

Technical architecture implementation

This section explains how to process Amazon S3 access logs and visualize Amazon S3 metrics with QuickSight.

Before you begin

There are a few prerequisites before you get started:

Create an IAM role to use with AWS Glue. For more information, see Create an IAM Role for AWS Glue in the AWS Glue documentation.
Ensure that you have access to Athena from your account.
Enable access logging on an S3 bucket. For more information, see How to Enable Server Access Logging in the Amazon S3 documentation.

Run AWS Glue for Ray job

The following screenshots guide you through creating a Ray job on Glue console. Create an ETL job with Ray engine with the sample Ray script provided. In the Job details tab, select an IAM role.

Create AWS Glue job

AWS Glue job details

Pass required arguments and any optional arguments with `--{arg}` in the job parameters.

AWS Glue job parameters

Save and run the job. In the Runs tab, you can select the current execution and view the logs using the Log group name and Id (Job Run Id). You can also graph job run metrics from the CloudWatch metrics console.

CloudWatch metrics console

Alternatively, you can select a frequency to schedule the job run.

AWS Glue job run schedule

Note: Schedule frequency depends on your data latency requirement.

On a successful run, the Ray job writes partitioned log files to the output Amazon S3 location. Now we run an AWS Glue crawler to catalog the partitioned files.

Create an AWS Glue crawler with the partitioned logs bucket as the data source and schedule it to capture the new partitions. Alternatively, you can configure the crawler to run based on Amazon S3 events. Using Amazon S3 events improves the re-crawl time to identify the changes between two crawls by listing all the files from a partition instead of listing the full S3 bucket.

AWS Glue Crawler

You can view the AWS Glue Data Catalog table via the Athena console and run queries using standard SQL. The Athena console displays the Run time and Data scanned metrics. In the following screenshots below, you will see how partitioning improves performance by reducing the amount of data scanned.

There are significant wins when we partition and format server access logs as parquet. Compared to the unpartitioned raw logs, the Athena queries 1/scanned 99.9 percent less data, and 2/ran 92 percent faster. This is evident from the following Athena SQL queries, which are similar but on unpartitioned and partitioned server access logs respectively.

SELECT “operation”, “requestdatetime”
FROM “s3_access_logs_db”.”unpartitioned_sal”
GROUP BY “requestdatetime”, “operation”

Amazon Athena query

Note: You can create a table schema on raw server access logs by following the directions at How do I analyze my Amazon S3 server access logs using Athena?

SELECT “operation”, “requestdate”, “requesthour” 
FROM “s3_access_logs_db”.”partitioned_sal” 
GROUP BY “requestdate”, “requesthour”, “operation”

Amazon Athena query

You can run queries on Athena or build dashboards with a BI tool that integrates with Athena. We built the following sample dashboard in Amazon QuickSight to provide insights from the Amazon S3 access logs. For additional information, see Visualize with QuickSight using Athena.

Amazon QuickSight dashboard

Clean up

Delete all the resources to avoid any unintended costs.

Disable the access log on the source bucket.
Disable the scheduled AWS Glue job run.
Delete the AWS Glue Data Catalog tables and QuickSight dashboards.

Why we considered AWS Glue for Ray

AWS Glue for Ray offers scalable Python-native distributed compute framework combined with AWS Glue’s serverless data integration. The primary reason for using the Ray engine in this solution is its flexibility with task distribution. With the Amazon S3 access logs, the largest challenge in processing them at scale is the object count rather than the data volume. This is because they are stored in a single, flat prefix that can contain hundreds of millions of objects for larger customers. In this unusual edge case, the Amazon S3 listing in Spark takes most of the job’s runtime. The object count is also large enough that most Spark drivers will run out of memory during listing.

In our test bed with 470 GB (1,544,692 objects) of access logs, large Spark drivers using AWS Glue’s G.8X worker type (32 vCPU, 128 GB memory, and 512 GB disk) ran out of memory. Using Ray tasks to distribute Amazon S3 listing dramatically reduced the time to list the objects. It also kept the list in Ray’s distributed object store preventing out-of-memory failures when scaling. The distributed lister combined with Ray data and map_batches to apply a pandas function against each block of data resulted in a highly parallel and performant execution across all stages of the process. With Ray engine, we successfully processed the logs in ~9 minutes. Using Ray reduces the server access logs processing cost, adding to the reduced Athena query cost.

Ray job run details:

Ray job logs

Ray job run details

Please feel free to download the script and test this solution in your development environment. You can add additional transformations in Ray to better prepare your data for analysis.

Conclusion

In this blog post, we detailed a solution to visualize and monitor Amazon S3 access logs at scale using Athena and QuickSight. It highlights a way to scale the solution by partitioning and formatting the logs using AWS Glue for Ray. To learn how to work with Ray jobs in AWS Glue, see Working with Ray jobs in AWS Glue. To learn how to accelerate your Athena queries, see Reusing query results.

About the Authors

Cristiane de Melo is a Solutions Architect Manager at AWS based in Bay Area, CA. She brings 25+ years of experience driving technical pre-sales engagements and is responsible for delivering results to customers. Cris is passionate about working with customers, solving technical and business challenges, thriving on building and establishing long-term, strategic relationships with customers and partners.

Archana Inapudi is a Senior Solutions Architect at AWS supporting Strategic Customers. She has over a decade of experience helping customers design and build data analytics, and database solutions. She is passionate about using technology to provide value to customers and achieve business outcomes.

Nikita Sur is a Solutions Architect at AWS supporting a Strategic Customer. She is curious to learn new technologies to solve customer problems. She has a Master’s degree in Information Systems – Big Data Analytics and her passion is databases and analytics.

Zach Mitchell is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop their enterprise data architecture on AWS.

AWS Weekly Roundup – AWS Dedicated Zones, Events and More – August 28, 2023

2023-08-29 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-aws-dedicated-zones-events-and-more-august-28-2023/

This week, I will meet our customers and partners at the AWS Summit Mexico. If you are around, please come say hi at the community lounge and at the F1 Game Day where I will spend most of my time. I would love to discuss your developer experience on AWS and listen to your stories about building on AWS.

Last Week’s Launches
I am amazed at how quickly service teams are deploying services to the new il-central-1 Region, aka AWS Israel (Tel-Aviv) Region. I counted no fewer than 25 new service announcements since we opened the Region on August 1, including ten just for last week!

In addition to these developments in the new Region, here are some launches that got my attention during the previous week.

AWS Dedicated Local Zones – Just like Local Zones, Dedicated Local Zones are a type of AWS infrastructure that is fully managed by AWS. Unlike Local Zones, they are built for exclusive use by you or your community and placed in a location or data center specified by you to help comply with regulatory requirements. I think about them as a portion of AWS infrastructure dedicated to my exclusive usage.

Enhanced search on AWS re:Post – AWS re:Post is a cloud knowledge service. The enhanced search experience helps you locate answers and discover articles more quickly. Search results are now presenting a consolidated view of all AWS knowledge on re:Post. The view shows AWS Knowledge Center articles, question and answers, and community articles that are relevant to the user’s search query.

Amazon QuickSight supports scheduled programmatic export to Microsoft Excel – Amazon QuickSight now supports scheduled generation of Excel workbooks by selecting multiple tables and pivot table visuals from any sheet of a dashboard. Snapshot Export APIs will now also support programmatic export to Excel format, in addition to Paginated PDF and CSV.

Amazon WorkSpaces announced a new client to support Ubuntu 20.04 and 22.04 – The new client, powered by WorkSpaces Streaming Protocol (WSP), improves the remote desktop experience by offering enhanced web conferencing functionality, better multi-monitor support, and a more user-friendly interface. To get started, simply download the new Linux client versions from Amazon WorkSpaces client download website.

Amazon Sagemaker CPU/GPU profiler – We launched the preview of Amazon SageMaker Profiler, an advanced observability tool for large deep learning workloads. With this new capability, you are able to access granular compute hardware-related profiling insights for optimizing model training performance.

Amazon Sagemaker rolling deployments strategy – You can now update your Amazon SageMaker Endpoints using a rolling deployment strategy. Rolling deployment makes it easier for you to update fully-scaled endpoints that are deployed on hundreds of popular accelerated compute instances.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you might have missed:

On-demand Container Loading in AWS Lambda – This one is not new from this week, but I spotted it while I was taking a few days of holidays. Marc Brooker and team were awarded Best Paper by USENIX Association for On-demand Container Loading in AWS Lambda (pdf). They explained in detail the challenges of loading (huge) container images in AWS Lambda. A must-read if you’re curious how Lambda functions work behind the scenes (pdf).

The Official AWS Podcast – Listen each week for updates on the latest AWS news and deep dives into exciting use cases. There are also official AWS podcasts in several languages. Check out the ones in French, German, Italian, and Spanish.

AWS Open Source News and Updates – This is a newsletter curated by my colleague Ricardo to bring you the latest open source projects, posts, events, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS Hybrid Cloud & Edge Day (August 30) – Join a free-to-attend one-day virtual event to hear the latest hybrid cloud and edge computing trends, emerging technologies, and learn best practices from AWS leaders, customers, and industry analysts. To learn more, see the detail agenda and register now.

AWS Global Summits – The 2023 AWS Summits season is almost ending with the last two in-person events in Mexico City (August 30) and Johannesburg (September 26).

AWS re:Invent – But don’t worry because re:Invent season (November 27–December 1) is coming closer. Join us to hear the latest from AWS, learn from experts, and connect with the global cloud community. Registration is now open.

AWS Community Days – Join a community-led conference run by AWS user group leaders in your region: Aotearoa (September 6), Lebanon (September 9), Munich (September 14), Argentina (September 16), Spain (September 23), and Chile (September 30). Visit the landing page to check out all the upcoming AWS Community Days.

CDK Day (September 29) – A community-led fully virtual event with tracks in English and Spanish about CDK and related projects. Learn more at the website.

That’s all for this week. Check back next Monday for another Week in Review!

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

— seb

Create an Apache Hudi-based near-real-time transactional data lake using AWS DMS, Amazon Kinesis, AWS Glue streaming ETL, and data visualization using Amazon QuickSight

2023-08-03 Raj Ramasubbu

Post Syndicated from Raj Ramasubbu original https://aws.amazon.com/blogs/big-data/create-an-apache-hudi-based-near-real-time-transactional-data-lake-using-aws-dms-amazon-kinesis-aws-glue-streaming-etl-and-data-visualization-using-amazon-quicksight/

With the rapid growth of technology, more and more data volume is coming in many different formats—structured, semi-structured, and unstructured. Data analytics on operational data at near-real time is becoming a common need. Due to the exponential growth of data volume, it has become common practice to replace read replicas with data lakes to have better scalability and performance. In most real-world use cases, it’s important to replicate the data from the relational database source to the target in real time. Change data capture (CDC) is one of the most common design patterns to capture the changes made in the source database and reflect them to other data stores.

We recently announced support for streaming extract, transform, and load (ETL) jobs in AWS Glue version 4.0, a new version of AWS Glue that accelerates data integration workloads in AWS. AWS Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds. AWS also offers a broad selection of services to support your needs. A database replication service such as AWS Database Migration Service (AWS DMS) can replicate the data from your source systems to Amazon Simple Storage Service (Amazon S3), which commonly hosts the storage layer of the data lake. Although it’s straightforward to apply updates on a relational database management system (RDBMS) that backs an online source application, it’s difficult to apply this CDC process on your data lakes. Apache Hudi, an open-source data management framework used to simplify incremental data processing and data pipeline development, is a good option to solve this problem.

This post demonstrates how to apply CDC changes from Amazon Relational Database Service (Amazon RDS) or other relational databases to an S3 data lake, with flexibility to denormalize, transform, and enrich the data in near-real time.

Solution overview

We use an AWS DMS task to capture near-real-time changes in the source RDS instance, and use Amazon Kinesis Data Streams as a destination of the AWS DMS task CDC replication. An AWS Glue streaming job reads and enriches changed records from Kinesis Data Streams and performs an upsert into the S3 data lake in Apache Hudi format. Then we can query the data with Amazon Athena visualize it in Amazon QuickSight. AWS Glue natively supports continuous write operations for streaming data to Apache Hudi-based tables.

The following diagram illustrates the architecture used for this post, which is deployed through an AWS CloudFormation template.

Prerequisites

Before you get started, make sure you have the following prerequisites:

An AWS account
A basic understanding of Amazon S3
A basic understanding of QuickSight to create dashboards
An AWS Identity and Access Management (IAM) role with permissions to create the Amazon RDS database, AWS DMS instance and tasks, Kinesis data stream, S3 buckets, AWS Glue job, AWS Glue Data Catalog, and QuickSight dashboards, and run SQL queries using Athena (see Adding and removing IAM identity permissions for reference)

Source data overview

To illustrate our use case, we assume a data analyst persona who is interested in analyzing near-real-time data for sport events using the table ticket_activity. An example of this table is shown in the following screenshot.

Apache Hudi connector for AWS Glue

For this post, we use AWS Glue 4.0, which already has native support for the Hudi framework. Hudi, an open-source data lake framework, simplifies incremental data processing in data lakes built on Amazon S3. It enables capabilities including time travel queries, ACID (Atomicity, Consistency, Isolation, Durability) transactions, streaming ingestion, CDC, upserts, and deletes.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

An RDS database instance (source).
An AWS DMS replication instance, used to replicate the data from the source table to Kinesis Data Streams.
A Kinesis data stream.
Four AWS Glue Python shell jobs:
- rds-ingest-rds-setup-<CloudFormation Stack name> – creates one source table called ticket_activity on Amazon RDS.
- rds-ingest-data-initial-<CloudFormation Stack name> – Sample data is automatically generated at random by the Faker library and loaded to the ticket_activity table.
- rds-ingest-data-incremental-<CloudFormation Stack name> – Ingests new ticket activity data into the source table ticket_activity continuously. This job simulates customer activity.
- rds-upsert-data-<CloudFormation Stack name> – Upserts specific records in the source table ticket_activity. This job simulates administrator activity.
AWS Identity and Access Management (IAM) users and policies.
An Amazon VPC, a public subnet, two private subnets, internet gateway, NAT gateway, and route tables.
- We use private subnets for the RDS database instance and AWS DMS replication instance.
- We use the NAT gateway to have reachability to pypi.org to use the MySQL connector for Python from the AWS Glue Python shell jobs. It also provides reachability to Kinesis Data Streams and an Amazon S3 API endpoint

To set up these resources, you must have the following prerequisites:

IAM roles dms-vpc-role, dms-cloudwatch-logs-role, and dms-access-for-endpoint. If you haven’t used AWS DMS before, you need to create these special IAM roles via the IAM console or the AWS Command Line Interface (AWS CLI). For instructions, see Creating the IAM roles to use with the AWS CLI and AWS DMS API.
If you already deselected Use only IAM access control for new databases and Use only IAM access control for new table in new databases on the AWS Lake Formation console Settings page, you need to select these two check boxes again and save your settings. For more information, see Changing the default settings for your data lake.

The following diagram illustrates the architecture of our provisioned resources.

To launch the CloudFormation stack, complete the following steps:

Sign in to the AWS CloudFormation console.
Choose Launch Stack
Choose Next.
For S3BucketName, enter the name of your new S3 bucket.
For VPCCIDR, enter a CIDR IP address range that doesn’t conflict with your existing networks.
For PublicSubnetCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
For PrivateSubnetACIDR and PrivateSubnetBCIDR, enter the CIDR IP address range within the CIDR you gave for VPCCIDR.
For SubnetAzA and SubnetAzB, choose the subnets you want to use.
For DatabaseUserName, enter your database user name.
For DatabaseUserPassword, enter your database user password.
Choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

Stack creation can take about 20 minutes.

Set up an initial source table

The AWS Glue job rds-ingest-rds-setup-<CloudFormation stack name> creates a source table called event on the RDS database instance. To set up the initial source table in Amazon RDS, complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose rds-ingest-rds-setup-<CloudFormation stack name> to open the job.
Choose Run.
Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

This job will only create the one table, ticket_activity, in the MySQL instance (DDL). See the following code:

CREATE TABLE ticket_activity (
ticketactivity_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
sport_type VARCHAR(256) NOT NULL,
start_date DATETIME NOT NULL,
location VARCHAR(256) NOT NULL,
seat_level VARCHAR(256) NOT NULL,
seat_location VARCHAR(256) NOT NULL,
ticket_price INT NOT NULL,
customer_name VARCHAR(256) NOT NULL,
email_address VARCHAR(256) NOT NULL,
created_at DATETIME NOT NULL,
updated_at DATETIME NOT NULL )

Ingest new records

In this section, we detail the steps to ingest new records. Implement following steps to star the execution of the jobs.

Start data ingestion to Kinesis Data Streams using AWS DMS

To start data ingestion from Amazon RDS to Kinesis Data Streams, complete the following steps:

On the AWS DMS console, choose Database migration tasks in the navigation pane.
Select the task rds-to-kinesis-<CloudFormation stack name>.
On the Actions menu, choose Restart/Resume.
Wait for the status to show as Load complete and Replication ongoing.

The AWS DMS replication task ingests data from Amazon RDS to Kinesis Data Streams continuously.

Start data ingestion to Amazon S3

Next, to start data ingestion from Kinesis Data Streams to Amazon S3, complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose streaming-cdc-kinesis2hudi-<CloudFormation stack name> to open the job.
Choose Run.

Do not stop this job; you can check the run status on the Runs tab and wait for it to show as Running.

Start the data load to the source table on Amazon RDS

To start data ingestion to the source table on Amazon RDS, complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose rds-ingest-data-initial-<CloudFormation stack name> to open the job.
Choose Run.
Navigate to the Runs tab and wait for Run status to show as SUCCEEDED.

Validate the ingested data

After about 2 minutes from starting the job, the data should be ingested into the Amazon S3. To validate the ingested data in the Athena, complete the following steps:

On the Athena console, complete the following steps if you’re running an Athena query for the first time:
- On the Settings tab, choose Manage.
- Specify the stage directory and the S3 path where Athena saves the query results.
- Choose Save.

On the Editor tab, run the following query against the table to check the data:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

Note that AWS Cloud Formation will create the database with the account number as database_<your-account-number>_hudi_cdc_demo.

Update existing records

Before you update the existing records, note down the ticketactivity_id value of a record from the ticket_activity table. Run the following SQL using Athena. For this post, we use ticketactivity_id = 46 as an example:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" limit 10;

To simulate a real-time use case, update the data in the source table ticket_activity on the RDS database instance to see that the updated records are replicated to Amazon S3. Complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose rds-ingest-data-incremental-<CloudFormation stack name> to open the job.
Choose Run.
Choose the Runs tab and wait for Run status to show as SUCCEEDED.

To upsert the records in the source table, complete the following steps:

On the AWS Glue console, choose Jobs in the navigation pane.
Choose the job rds-upsert-data-<CloudFormation stack name>.
On the Job details tab, under Advanced properties, for Job parameters, update the following parameters:
- For Key, enter --ticketactivity_id.
- For Value, replace 1 with one of the ticket IDs you noted above (for this post, 46).

Choose Save.
Choose Run and wait for the Run status to show as SUCCEEDED.

This AWS Glue Python shell job simulates a customer activity to buy a ticket. It updates a record in the source table ticket_activity on the RDS database instance using the ticket ID passed in the job argument --ticketactivity_id. It will update ticket_price=500 and updated_at with the current timestamp.

To validate the ingested data in Amazon s3, run the same query from Athena and check the ticket_activity value you noted earlier to observe the ticket_price and updated_at fields:

SELECT * FROM "database_<account_number>_hudi_cdc_demo"."ticket_activity" where ticketactivity_id = 46 ;

Visualize the data in QuickSight

After you have the output file generated by the AWS Glue streaming job in the S3 bucket, you can use QuickSight to visualize the Hudi data files. QuickSight is a scalable, serverless, embeddable, ML-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

Build a QuickSight dashboard

To build a QuickSight dashboard, complete the following steps:

Open the QuickSight console.

You’re presented with the QuickSight welcome page. If you haven’t signed up for QuickSight, you may have to complete the signup wizard. For more information, refer to Signing up for an Amazon QuickSight subscription.

After you have signed up, QuickSight presents a “Welcome wizard.” You can view the short tutorial, or you can close it.

On the QuickSight console, choose your user name and choose Manage QuickSight.
Choose Security & permissions, then choose Manage.
Select Amazon S3 and select the buckets that you created earlier with AWS CloudFormation.
Select Amazon Athena.
Choose Save.
If you changed your Region during the first step of this process, change it back to the Region that you used earlier during the AWS Glue jobs.

Create a dataset

Now that you have QuickSight up and running, you can create your dataset. Complete the following steps:

On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.
Choose Athena.
For Data source name, enter a name (for example, hudi-blog).
Choose Validate.
After the validation is successful, choose Create data source.
For Database, choose database_<your-account-number>_hudi_cdc_demo.
For Tables, select ticket_activity.
Choose Select.
Choose Visualize.
Choose hour and then ticket_activity_id to get the count of ticket_activity_id by hour.

Clean up

To clean up your resources, complete the following steps:

Stop the AWS DMS replication task rds-to-kinesis-<CloudFormation stack name>.
Navigate to the RDS database and choose Modify.
Deselect Enable deletion protection, then choose Continue.
Stop the AWS Glue streaming job streaming-cdc-kinesis2redshift-<CloudFormation stack name>.
Delete the CloudFormation stack.
On the QuickSight dashboard, choose your user name, then choose Manage QuickSight.
Choose Account settings, then choose Delete account.
Choose Delete account to confirm.
Enter confirm and choose Delete account.

Conclusion

In this post, we demonstrated how you can stream data—not only new records, but also updated records from relational databases—to Amazon S3 using an AWS Glue streaming job to create an Apache Hudi-based near-real-time transactional data lake. With this approach, you can easily achieve upsert use cases on Amazon S3. We also showcased how to visualize the Apache Hudi table using QuickSight and Athena. As a next step, refer to the Apache Hudi performance tuning guide for a high-volume dataset. To learn more about authoring dashboards in QuickSight, check out the QuickSight Author Workshop.

About the Authors

Raj Ramasubbu is a Sr. Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Sundeep Kumar is a Sr. Data Architect, Data Lake at AWS, helping customers build data lake and analytics platform and solutions. When not building and designing data lakes, Sundeep enjoys listening music and playing guitar.

Estimating Scope 1 Carbon Footprint with Amazon Athena

2023-08-02 Thomas Burns

Post Syndicated from Thomas Burns original https://aws.amazon.com/blogs/big-data/estimating-scope-1-carbon-footprint-with-amazon-athena/

Today, more than 400 organizations have signed The Climate Pledge, a commitment to reach net-zero carbon by 2040. Some of the drivers that lead to setting explicit climate goals include customer demand, current and anticipated government relations, employee demand, investor demand, and sustainability as a competitive advantage. AWS customers are increasingly interested in ways to drive sustainability actions. In this blog, we will walk through how we can apply existing enterprise data to better understand and estimate Scope 1 carbon footprint using Amazon Simple Storage Service (S3) and Amazon Athena, a serverless interactive analytics service that makes it easy to analyze data using standard SQL.

The Greenhouse Gas Protocol

The Greenhouse Gas Protocol (GHGP) provides standards for measuring and managing global warming impacts from an organization’s operations and value chain.

The greenhouse gases covered by the GHGP are the seven gases required by the UNFCCC/Kyoto Protocol (which is often called the “Kyoto Basket”). These gases are carbon dioxide (CO₂), methane (CH₄), nitrous oxide (N₂O), the so-called F-gases (hydrofluorocarbons and perfluorocarbons), sulfur hexafluoride (SF₆) nitrogen trifluoride (NF3). Each greenhouse gas is characterized by its global warming potential (GWP), which is determined by the gas’s greenhouse effect and its lifetime in the atmosphere. Since carbon dioxide (CO₂) accounts for about 76 percent of total man-made greenhouse gas emissions, the global warming potential of greenhouse gases are measured relative to CO₂, and are thus expressed as CO₂-equivalent (CO₂e).

The GHGP divides an organization’s emissions into three primary scopes:

Scope 1 – Direct greenhouse gas emissions (for example from burning fossil fuels)
Scope 2 – Indirect emissions from purchased energy (typically electricity)
Scope 3 – Indirect emissions from the value chain, including suppliers and customers

How do we estimate greenhouse gas emissions?

There are different methods to estimating GHG emissions that includes the Continuous Emissions Monitoring System (CEMS) Method, the Spend-Based Method, and the Consumption-Based Method.

Direct Measurement – CEMS Method

An organization can estimate its carbon footprint from stationary combustion sources by performing a direct measurement of carbon emissions using the CEMS method. This method requires continuously measuring the pollutants emitted in exhaust gases from each emissions source using equipment such as gas analyzers, gas samplers, gas conditioning equipment (to remove particulate matter, water vapor and other contaminants), plumbing, actuated valves, Programmable Logic Controllers (PLCs) and other controlling software and hardware. Although this approach may yield useful results, CEMS requires specific sensing equipment for each greenhouse gas to be measured, requires supporting hardware and software, and is typically more suitable for Environment Health and Safety applications of centralized emission sources. More information on CEMS is available here.

Spend-Based Method

Because the financial accounting function is mature and often already audited, many organizations choose to use financial controls as a foundation for their carbon footprint accounting. The Economic Input-Output Life Cycle Assessment (EIO LCA) method is a spend-based method that combines expenditure data with monetary-based emission factors to estimate the emissions produced. The emission factors are published by the U.S. Environment Protection Agency (EPA) and other peer-reviewed academic and government sources. With this method, you can multiply the amount of money spent on a business activity by the emission factor to produce the estimated carbon footprint of the activity.

For example, you can convert the amount your company spends on truck transport to estimated kilograms (KG) of carbon dioxide equivalent (CO₂e) emitted as shown below.

Estimated Carbon Footprint = Amount of money spent on truck transport * Emission Factor ^[1]

Although these computations are very easy to make from general ledgers or other financial records, they are most valuable for initial estimates or for reporting minor sources of greenhouse gases. As the only user-provided input is the amount spent on an activity, EIO LCA methods aren’t useful for modeling improved efficiency. This is because the only way to reduce EIO-calculated emissions is to reduce spending. Therefore, as a company continues to improve its carbon footprint efficiency, other methods of estimating carbon footprint are often more desirable.

Consumption-Based Method

From either Enterprise Resource Planning (ERP) systems or electronic copies of fuel bills, it’s straightforward to determine the amount of fuel an organization procures during a reporting period. Fuel-based emission factors are available from a variety of sources such as the US Environmental Protection Agency and commercially-licensed databases. Multiplying the amount of fuel procured by the emission factor yields an estimate of the CO₂e emitted through combustion. This method is often used for estimating the carbon footprint of stationary emissions (for instance backup generators for data centers or fossil fuel ovens for industrial processes).

If for a particular month an enterprise consumed a known amount of motor gasoline for stationary combustion, the Scope 1 CO₂e footprint of the stationary gasoline combustion can be estimated in the following manner:

Estimated Carbon Footprint = Amount of Fuel Consumed * Stationary Combustion Emission Factor^[2]

Organizations may estimate their carbon emissions by using existing data found in fuel and electricity bills, ERP data, and relevant emissions factors, which are then consolidated in to a data lake. Using existing analytics tools such as Amazon Athena and Amazon QuickSight an organization can gain insight into its estimated carbon footprint.

The data architecture diagram below shows an example of how you could use AWS services to calculate and visualize an organization’s estimated carbon footprint.

Customers have the flexibility to choose the services in each stage of the data pipeline based on their use case. For example, in the data ingestion phase, depending on the existing data requirements, there are many options to ingest data into the data lake such as using the AWS Command Line Interface (CLI), AWS DataSync, or AWS Database Migration Service.

Example of calculating a Scope 1 stationary emissions footprint with AWS services

Let’s assume you burned 100 standard cubic feet (scf) of natural gas in an oven. Using the US EPA emission factors for stationary emissions we can estimate the carbon footprint associated with the burning. In this case the emission factor is 0.05449555 Kg CO₂e /scf.^[3]

Amazon S3 is ideal for building a data lake on AWS to store disparate data sources in a single repository, due to its virtually unlimited scalability and high durability. Athena, a serverless interactive query service, allows the analysis of data directly from Amazon S3 using standard SQL without having to load the data into Athena or run complex extract, transform, and load (ETL) processes. Amazon QuickSight supports creating visualizations of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to extract a subset of the data. QuickSight dashboards can provide you with insights (such as your company’s estimated carbon footprint) quickly, and also provide the ability to generate standardized reports for your business and sustainability users.

In this example, the sample data is stored in a file system and uploaded to Amazon S3 using the AWS Command Line Interface (CLI) as shown in the following architecture diagram. AWS recommends creating AWS resources and managing CLI access in accordance with the Best Practices for Security, Identity, & Compliance guidance.

The AWS CLI command below demonstrates how to upload the sample data folders into the S3 target location.

aws s3 cp /path/to/local/file s3://bucket-name/path/to/destination

The snapshot of the S3 console shows two newly added folders that contains the files.

To create new table schemas, we start by running the following script for the gas utilization table in the Athena query editor using Hive DDL. The script defines the data format, column details, table properties, and the location of the data in S3.

CREATE EXTERNAL TABLE `gasutilization`(
`fuel_id` int,
`month` string,
`year` int,
`usage_therms` float,
`usage_scf` float,
`g-nr1_schedule_charge` float,
`accountfee` float,
`gas_ppps` float,
`netcharge` float,
`taxpercentage` float,
`totalcharge` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gasutilization'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

The script below shows another example of using Hive DDL to generate the table schema for the gas emission factor data.

CREATE EXTERNAL TABLE `gas_emission_factor`(
`fuel_id` int,
`gas_name` string,
`emission_factor` float)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://<bucketname>/Scope 1 Sample Data/gas_emission_factor'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')

After creating the table schema in Athena, we run the below query against the gas utilization table that includes details of gas bills to show the gas utilization and the associated charges, such as gas public purpose program surcharge (PPPS) and total charges after taxes for the year of 2020:

SELECT * FROM "gasutilization" where year = 2020;

We are also able to analyze the emission factor data showing the different fuel types and their corresponding CO2e emission as shown in the screenshot.

With the emission factor and the gas utilization data, we can run the following query below to get an estimated Scope 1 carbon footprint alongside other details. In this query, we joined the gas utilization table and the gas emission factor table on fuel id and multiplied the gas usage in standard cubic foot (scf) by the emission factor to get the estimated CO₂e impact. We also selected the month, year, total charge, and gas usage measured in therms and scf, as these are often attributes that are of interest for customers.

SELECT "gasutilization"."usage_scf" * "gas_emission_factor"."emission_factor" 
AS "estimated_CO2e_impact", 
"gasutilization"."month", 
"gasutilization"."year", 
"gasutilization"."totalcharge", 
"gasutilization"."usage_therms", 
"gasutilization"."usage_scf" 
FROM "gasutilization" 
JOIN "gas_emission_factor" 
on "gasutilization"."fuel_id"="gas_emission_factor"."fuel_id";

Lastly, Amazon QuickSight allows visualization of different data sources, including Amazon S3 and Athena, and the flexibility to use custom SQL to get a subset of the data. The following is an example of a QuickSight dashboard showing the gas utilization, gas charges, and estimated carbon footprint across different years.

We have just estimated the Scope 1 carbon footprint for one source of stationary combustion. If we were to do the same process for all sources of stationary and mobile emissions (with different emissions factors) and add the results together, we could roll up an accurate estimate of our Scope 1 carbon emissions for the entire business by only utilizing native AWS services and our own data. A similar process will yield an estimate of Scope 2 emissions, with grid carbon intensity in the place of Scope 1 emission factors.

Summary

This blog discusses how organizations can use existing data in disparate sources to build a data architecture to gain better visibility into Scope 1 greenhouse gas emissions. With Athena, S3, and QuickSight, organizations can now estimate their stationary emissions carbon footprint in a repeatable way by applying the consumption-based method to convert fuel utilization into an estimated carbon footprint.

Other approaches available on AWS include Carbon Accounting on AWS, Sustainability Insights Framework, Carbon Data Lake on AWS, and general guidance detailed at the AWS Carbon Accounting Page.

If you are interested in information on estimating your organization’s carbon footprint with AWS, please reach out to your AWS account team and check out AWS Sustainability Solutions.

References

An example from page four of Amazon’s Carbon Methodology document illustrates this concept.
Amount spent on truck transport: $100,000
EPA Emission Factor: 1.556 KG CO2e /dollar of truck transport
Estimated CO₂e emission: $100,000 * 1.556 KG CO₂e/dollar of truck transport = 155,600 KG of CO2e ↑
For example,
Gasoline consumed: 1,000 US Gallons
EPA Emission Factor: 8.81 Kg of CO2e /gallon of gasoline combusted
Estimated CO2e emission = 1,000 US Gallons * 8.81 Kg of CO2e per gallon of gasoline consumed= 8,810 Kg of CO2e.
EPA Emissions Factor for stationary emissions of motor gasoline is 8.78 kg CO2 plus .38 grams of CH4, plus .08 g of N2O.
Combining these emission factors using 100-year global warming potential for each gas (CH4:25 and N2O:298) gives us Combined Emission Factor = 8.78 kg + 25*.00038 kg + 298 *.00008 kg = 8.81 kg of CO2e per gallon.↑
The Emission factor per scf is 0.05444 kg of CO₂ plus 0.00103 g of CH₄ plus 0.0001 g of N₂O. To get this in terms of CO2e we need to multiply the emission factor of the other two gases by their global warming potentials (GWP). The 100-year GWP for CH₄and N₂O are 25 and 298 respectively. Emission factors and GWPs come from the US EPA website. ↑

About the Authors

Thomas Burns, SCR, CISSP is a Principal Sustainability Strategist and Principal Solutions Architect at Amazon Web Services. Thomas supports manufacturing and industrial customers world-wide. Thomas’s focus is using the cloud to help companies reduce their environmental impact both inside and outside of IT.

Aileen Zheng is a Solutions Architect supporting US Federal Civilian Sciences customers at Amazon Web Services (AWS). She partners with customers to provide technical guidance on enterprise cloud adoption and strategy and helps with building well-architected solutions. She is also very passionate about data analytics and machine learning. In her free time, you’ll find Aileen doing pilates, taking her dog Mumu out for a hike, or hunting down another good spot for food! You’ll also see her contributing to projects to support diversity and women in technology.

AWS Week in Review – Agents for Amazon Bedrock, Amazon SageMaker Canvas New Capabilities, and More – July 31, 2023

2023-08-01 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-week-in-review-agents-for-amazon-bedrock-amazon-sagemaker-canvas-new-capabilities-and-more-july-31-2023/

This July, AWS communities in ASEAN wrote a new history. First, the AWS User Group Malaysia recently held the first AWS Community Day in Malaysia.

Another significant milestone has been achieved by the AWS User Group Philippines. They just celebrated their tenth anniversary by running 2 days of AWS Community Day Philippines. Here are a few photos from the event, including Jeff Barr sharing his experiences attending AWS User Group meetup, in Manila, Philippines 10 years ago.

Big congratulations to AWS Community Heroes, AWS Community Builders, AWS User Group leaders and all volunteers who organized and delivered AWS Community Days! Also, thank you to everyone who attended and help support our AWS communities.

Last Week’s Launches
We had interesting launches last week, including from AWS Summit, New York. Here are some of my personal highlights:

(Preview) Agents for Amazon Bedrock – You can now create managed agents for Amazon Bedrock to handle tasks using API calls to company systems, understand user requests, break down complex tasks into steps, hold conversations to gather more information, and take actions to fulfill requests.

(Coming Soon) New LLM Capabilities in Amazon QuickSight Q – We are expanding the innovation in QuickSight Q by introducing new LLM capabilities through Amazon Bedrock. These Generative BI capabilities will allow organizations to easily explore data, uncover insights, and facilitate sharing of insights.

AWS Glue Studio support for Amazon CodeWhisperer – You can now write specific tasks in natural language (English) as comments in the Glue Studio notebook, and Amazon CodeWhisperer provides code recommendations for you.

(Preview) Vector Engine for Amazon OpenSearch Serverless – This capability empowers you to create modern ML-augmented search experiences and generative AI applications without the need to handle the complexities of managing the underlying vector database infrastructure.

Last week, Amazon SageMaker Canvas also released a set of new capabilities:

AWS Open-Source Updates
As always, my colleague Ricardo has curated the latest updates for open-source news at AWS. Here are some of the highlights.

cdk-aws-observability-accelerator is a set of opinionated modules to help you set up observability for your AWS environments with AWS native services and AWS-managed observability services such as Amazon Managed Service for Prometheus, Amazon Managed Grafana, AWS Distro for OpenTelemetry (ADOT) and Amazon CloudWatch.

iac-devtools-cli-for-cdk is a command line interface tool that automates many of the tedious tasks of building, adding to, documenting, and extending AWS CDK applications.

Upcoming AWS Events
There are upcoming events that you can join to learn. Let’s start with AWS events:

AWS Storage Day (August 9)
AWS Summit Taiwan (August 2-3)
AWS Summit São Paulo (August 3)
AWS Summit Mexico City (August 30)

And let’s learn from our fellow builders and join AWS Community Days:

AWS Community Day Colombia (August 12)
AWS Community Day West Africa (August 19)

Open for Registration for AWS re:Invent
We want to be sure you know that AWS re:Invent registration is now open!

This learning conference hosted by AWS for the global cloud computing community will be held from November 27 to December 1, 2023, in Las Vegas.

Pro-tip: You can use information on the Justify Your Trip page to prove the value of your trip to AWS re:Invent trip.

Give Us Your Feedback
We’re focused on improving our content to provide a better customer experience, and we need your feedback to do so. Please take this quick survey to share insights on your experience with the AWS Blog. Note that this survey is hosted by an external company, so the link does not lead to our website. AWS handles your information as described in the AWS Privacy Notice.

That’s all for this week. Check back next Monday for another Week in Review.

Happy building!

— Donnie

This post is part of our Week in Review series. Check back each week for a quick round-up of interesting news and announcements from AWS!

P.S. We’re focused on improving our content to provide a better customer experience, and we need your feedback to do so. Please take this quick survey to share insights on your experience with the AWS Blog. Note that this survey is hosted by an external company, so the link does not lead to our website. AWS handles your information as described in the AWS Privacy Notice.

Dimensional modeling in Amazon Redshift

2023-07-19 Bernard Verster

Post Syndicated from Bernard Verster original https://aws.amazon.com/blogs/big-data/dimensional-modeling-in-amazon-redshift/

Amazon Redshift is a fully managed and petabyte-scale cloud data warehouse that is used by tens of thousands of customers to process exabytes of data every day to power their analytics workload. You can structure your data, measure business processes, and get valuable insights quickly can be done by using a dimensional model. Amazon Redshift provides built-in features to accelerate the process of modeling, orchestrating, and reporting from a dimensional model.

In this post, we discuss how to implement a dimensional model, specifically the Kimball methodology. We discuss implementing dimensions and facts within Amazon Redshift. We show how to perform extract, transform, and load (ELT), an integration process focused on getting the raw data from a data lake into a staging layer to perform the modeling. Overall, the post will give you a clear understanding of how to use dimensional modeling in Amazon Redshift.

Solution overview

The following diagram illustrates the solution architecture.

In the following sections, we first discuss and demonstrate the key aspects of the dimensional model. After that, we create a data mart using Amazon Redshift with a dimensional data model including dimension and fact tables. Data is loaded and staged using the COPY command, the data in the dimensions is loaded using the MERGE statement, and facts will be joined to the dimensions where insights are derived from. We schedule the loading of the dimensions and facts using the Amazon Redshift Query Editor V2. Lastly, we use Amazon QuickSight to gain insights on the modeled data in the form of a QuickSight dashboard.

For this solution, we use a sample dataset (normalized) provided by Amazon Redshift for event ticket sales. For this post, we have narrowed down the dataset for simplicity and demonstration purposes. The following tables show examples of the data for ticket sales and venues.

According to the Kimball dimensional modeling methodology, there are four key steps in designing a dimensional model:

Identify the business process.
Declare the grain of your data.
Identify and implement the dimensions.
Identify and implement the facts.

Additionally, we add a fifth step for demonstration purposes, which is to report and analyze business events.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An Amazon Redshift cluster or an Amazon Redshift Serverless endpoint
An Amazon Simple Storage Service (Amazon S3) bucket where files that will be loaded into Amazon Redshift are stored
A QuickSight account (optional)
Some degree of data warehousing knowledge will be helpful to understand the topics

Identify the business process

In simple terms, identifying the business process is identifying a measurable event that generates data within an organization. Usually, companies have some sort of operational source system that generates their data in its raw format. This is a good starting point to identify various sources for a business process.

The business process is then persisted as a data mart in the form of dimensions and facts. Looking at our sample dataset mentioned earlier, we can clearly see the business process is the sales made for a given event.

A common mistake made is using departments of a company as the business process. The data (business process) needs to be integrated across various departments, in this case, marketing can access the sales data. Identifying the correct business process is critical—getting this step wrong can impact the entire data mart (it can cause the grain to be duplicated and incorrect metrics on the final reports).

Declare the grain of your data

Declaring the grain is the act of uniquely identifying a record in your data source. The grain is used in the fact table to accurately measure the data and enable you to roll up further. In our example, this could be a line item in the sales business process.

In our use case, a sale can be uniquely identified by looking at the transaction time when the sale took place; this will be the most atomic level.

Identify and implement the dimensions

Your dimension table describes your fact table and its attributes. When identifying the descriptive context of your business process, you store the text in a separate table, keeping the fact table grain in mind. When joining the dimensions table to the fact table, there should only be a single row associated to the fact table. In our example, we use the following table to be separated into a dimensions table; these fields describe the facts that we will measure.

When designing the structure of the dimensional model (the schema), you can either create a star or snowflake schema. The structure should closely align with the business process; therefore, a star schema is best fit for our example. The following figure shows our Entity Relationship Diagram (ERD).

In the following sections, we detail the steps to implement the dimensions.

Stage the source data

Before we can create and load the dimensions table, we need source data. Therefore, we stage the source data into a staging or temporary table. This is often referred to as the staging layer, which is the raw copy of the source data. To do this in Amazon Redshift, we use the COPY command to load the data from the dimensional-modeling-in-amazon-redshift public S3 bucket located on the us-east-1 Region. Note that the COPY command uses an AWS Identity and Access Management (IAM) role with access to Amazon S3. The role needs to be associated with the cluster. Complete the following steps to stage the source data:

Create the venue source table:

CREATE TABLE public.venue (
    venueid bigint,
    venuename character varying(100),
    venuecity character varying(30),
    venuestate character(2),
    venueseats bigint
) DISTSTYLE AUTO
        SORTKEY
    (venueid);

Load the venue data:

COPY public.venue
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/venue.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ','
REGION 'us-east-1'
IGNOREHEADER 1

Create the sales source table:

CREATE TABLE public.sales (
    salesid integer,
    venueid character varying(256),
    saletime timestamp without time zone,
    qtysold BIGINT,
    commission numeric(18,2),
    pricepaid numeric(18,2)
) DISTSTYLE AUTO;

Load the sales source data:

COPY public.sales
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/sales.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ','
REGION 'us-east-1'
IGNOREHEADER 1

Create the calendar table:

CREATE TABLE public.DimCalendar(
    dateid smallint,
        caldate date,
        day varchar(20),
        week smallint,
        month varchar(20),
        qtr varchar(20),
        year smallint,
        holiday boolean
) DISTSTYLE AUTO
SORTKEY
    (dateid);

Load the calendar data:

COPY public.DimCalendar
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/date.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ',' 
REGION 'us-east-1'
IGNOREHEADER 1

Create the dimensions table

Designing the dimensions table can depend on your business requirement—for example, do you need to track changes to the data over time? There are seven different dimension types. For our example, we use type 1 because we don’t need to track historical changes. For more about type 2, refer to Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift. The dimensions table will be denormalized with a primary key, surrogate key, and a few added fields to indicate changes to the table. See the following code:

create schema SalesMart;

CREATE TABLE SalesMart.DimVenue( 
    "VenueSkey" int IDENTITY(1,1) primary key
    ,"VenueId" VARCHAR NOT NULL
    ,"VenueName" VARCHAR NULL
    ,"VenueCity" VARCHAR NULL
    ,"VenueState" VARCHAR NULL
    ,"VenueSeats" INT NULL
    ,"InsertedDate" DATETIME NOT NULL
    ,"UpdatedDate" DATETIME NOT NULL
) 
diststyle AUTO;

A few notes on creating the dimensions table creation:

The field names are transformed into business-friendly names
Our primary key is VenueID, which we use to uniquely identify a venue at which the sale took place
Two additional rows will be added, indicating when a record was inserted and updated (to track changes)
We are using an AUTO distribution style to give Amazon Redshift the responsibility to choose and adjust the distribution style

Another important factor to consider in dimensional modelling is the usage of surrogate keys. Surrogate keys are artificial keys that are used in dimensional modelling to uniquely identify each record in a dimension table. They are typically generated as a sequential integer, and they don’t have any meaning in the business domain. They offer several benefits, such as ensuring uniqueness and improving performance in joins, because they’re typically smaller than natural keys and as surrogate keys they don’t change over time. This allows us to be consistent and join facts and dimensions more easily.

In Amazon Redshift, surrogate keys are typically created using the IDENTITY keyword. For example, the preceding CREATE statement creates a dimension table with a VenueSkey surrogate key. The VenueSkey column is automatically populated with unique values as new rows are added to the table. This column can then be used to join the venue table to the FactSaleTransactions table.

A few tips for designing surrogate keys:

Use a small, fixed-width data type for the surrogate key. This will improve performance and reduce storage space.
Use the IDENTITY keyword, or generate the surrogate key using a sequential or GUID value. This will ensure that the surrogate key is unique and can’t be changed.

Load the dim table using MERGE

There are numerous ways to load your dim table. Certain factors need to be considered—for example, performance, data volume, and perhaps SLA loading times. With the MERGE statement, we perform an upsert without needing to specify multiple insert and update commands. You can set up the MERGE statement in a stored procedure to populate the data. You then schedule the stored procedure to run programmatically via the query editor, which we demonstrate later in the post. The following code creates a stored procedure called SalesMart.DimVenueLoad:

CREATE OR REPLACE PROCEDURE SalesMart.DimVenueLoad()
AS $$
BEGIN
MERGE INTO SalesMart.DimVenue USING public.venue as MergeSource
ON SalesMart.DimVenue.VenueId = MergeSource.VenueId
WHEN MATCHED
THEN
UPDATE
SET VenueName = ISNULL(MergeSource.VenueName, 'Unknown')
, VenueCity = ISNULL(MergeSource.VenueCity, 'Unknown')
, VenueState = ISNULL(MergeSource.VenueState, 'Unknown')
, VenueSeats = ISNULL(MergeSource.VenueSeats, -1)
, UpdatedDate = GETDATE()
WHEN NOT MATCHED
THEN
INSERT (
VenueId
, VenueName
, VenueCity
, VenueState
, VenueSeats
, UpdatedDate
, InsertedDate
)
VALUES (
ISNULL(MergeSource.VenueId, -1)
, ISNULL(MergeSource.VenueName, 'Unknown')
, ISNULL(MergeSource.VenueCity, 'Unknown')
, ISNULL(MergeSource.VenueState, 'Unknown')
, ISNULL(MergeSource.VenueSeats, -1)
, ISNULL(GETDATE() , '1900-01-01')
, ISNULL(GETDATE() , '1900-01-01')
);
END;
$$
LANGUAGE plpgsql;

A few notes on the dimension loading:

When a record in inserted for the first time, the inserted date and updated date will be populated. When any values change, the data is updated and the updated date reflects the date when it was changed. The inserted date remains.
Because the data will be used by business users, we need to replace NULL values, if any, with more business-appropriate values.

Identify and implement the facts

Now that we have declared our grain to be the event of a sale that took place at a specific time, our fact table will store the numeric facts for our business process.

We have identified the following numerical facts to measure:

Quantity of tickets sold per sale
Commission for the sale

Implementing the Fact

There are three types of fact tables (transaction fact table, periodic snapshot fact table, and accumulating snapshot fact table). Each serves a different view of the business process. For our example, we use a transaction fact table. Complete the following steps:

Create the fact table

CREATE TABLE SalesMart.FactSaleTransactions( 
    CalendarDate date NOT NULL
    ,SaleTransactionTime DATETIME NOT NULL
    ,VenueSkey INT NOT NULL
    ,QuantitySold BIGINT NOT NULL
    ,SaleComission NUMERIC NOT NULL
    ,InsertedDate DATETIME DEFAULT GETDATE()
) diststyle AUTO;

An inserted date with a default value is added, indicating if and when a record was loaded. You can use this when reloading the fact table to remove the already loaded data to avoid duplicates.

Loading the fact table consists of a simple insert statement joining your associated dimensions. We join from the DimVenue table that was created, which describes our facts. It’s best practice but optional to have calendar date dimensions, which allow the end-user to navigate the fact table. Data can either be loaded when there is a new sale, or daily; this is where the inserted date or load date comes in handy.

We load the fact table using a stored procedure and use a date parameter.

Create the stored procedure with the following code. To keep the same data integrity that we applied in the dimension load, we replace NULL values, if any, with more business appropriate values:

create or replace procedure SalesMart.FactSaleTransactionsLoad(loadate datetime)
language plpgsql
as
    $$
begin
--------------------------------------------------------------------
/*** Delete records loaded for the day, should there be any ***/
--------------------------------------------------------------------
Delete from SalesMart.FactSaleTransactions
where cast(InsertedDate as date) = CAST(loadate as date);
RAISE INFO 'Deleted rows for load date: %', loadate;
--------------------------------------------------------------------
/*** Insert records ***/
--------------------------------------------------------------------
INSERT INTO SalesMart.FactSaleTransactions (
CalendarDate    
,SaleTransactionTime    
,VenueSkey  
,QuantitySold  
,Salecomission
)
SELECT DISTINCT
    ISNULL(c.caldate, '1900-01-01') as CalendarDate
    ,ISNULL(a.saletime, '1900-01-01') as SaleTransactionTime
    ,ISNULL(b.VenueSkey, -1) as VenueSkey
    ,ISNULL(a.qtysold, 0) as QuantitySold
    ,ISNULL(a.commission, 0) as SaleComission
FROM
    public.sales as a
 
LEFT JOIN SalesMart.DimVenue as b
on a.venueid = b.venueid
 
LEFT JOIN public.DimCalendar as c
on to_char(a.saletime,'YYYYMMDD') = to_char(c.caldate,'YYYYMMDD');
--Optional filter, should you want to load only the latest data from source
--where cast(a.saletime as date) = cast(loadate as date);
  
end;
$$;

Load the data by calling the procedure with the following command:

call SalesMart.FactSaleTransactionsLoad(getdate())

Schedule the data load

We can now automate the modeling process by scheduling the stored procedures in Amazon Redshift Query Editor V2. Complete the following steps:

We first call the dimension load and after the dimension load runs successfully, the fact load begins:

BEGIN;
----Insert Dim Loads
call SalesMart.DimVenueLoad();

----Insert Fact Loads. They will only run if the DimLoad is successful
call SalesMart.FactSaleTransactionsLoad(getdate());
END;

If the dimension load fails, the fact load will not run. This ensures consistency in the data because we don’t want to load the fact table with outdated dimensions.

To schedule the load, choose Schedule in Query Editor V2.

We schedule the query to run every day at 5:00 AM.
Optionally, you can add failure notifications by enabling Amazon Simple Notification Service (Amazon SNS) notifications.

Report and analysis the data in Amazon Quicksight

QuickSight is a business intelligence service that makes it easy to deliver insights. As a fully managed service, QuickSight lets you easily create and publish interactive dashboards that can then be accessed from any device and embedded into your applications, portals, and websites.

We use our data mart to visually present the facts in the form of a dashboard. To get started and set up QuickSight, refer to Creating a dataset using a database that’s not autodiscovered.

After you create your data source in QuickSight, we join the modeled data (data mart) together based on our surrogate key skey. We use this dataset to visualize the data mart.

Our end dashboard will contain the insights of the data mart and answer critical business questions, such as total commission per venue and dates with the highest sales. The following screenshot shows the final product of the data mart.

Clean up

To avoid incurring future charges, delete any resources you created as part of this post.

Conclusion

We have now successfully implemented a data mart using our DimVenue, DimCalendar, and FactSaleTransactions tables. Our warehouse is not complete; as we can expand the data mart with more facts and implement more marts, and as the business process and requirements grow over time, so will the data warehouse. In this post, we gave an end-to-end view on understanding and implementing dimensional modeling in Amazon Redshift.

Get started with your Amazon Redshift dimensional model today.

About the Authors

Bernard Verster is an experienced cloud engineer with years of exposure in creating scalable and efficient data models, defining data integration strategies, and ensuring data governance and security. He is passionate about using data to drive insights, while aligning with business requirements and objectives.

Abhishek Pan is a WWSO Specialist SA-Analytics working with AWS India Public sector customers. He engages with customers to define data-driven strategy, provide deep dive sessions on analytics use cases, and design scalable and performant analytical applications. He has 12 years of experience and is passionate about databases, analytics, and AI/ML. He is an avid traveler and tries to capture the world through his camera lens.

Introducing field-based coloring experience for Amazon QuickSight

2023-07-18 Bhupinder Chadha

Post Syndicated from Bhupinder Chadha original https://aws.amazon.com/blogs/big-data/introducing-field-based-coloring-experience-for-amazon-quicksight/

Color plays a crucial role in visualizations. It conveys meaning, captures attention, and enhances aesthetics. You can quickly grasp important information when key insights and data points pop with color. However, it’s important to use color judiciously to enhance readability and ensure correct interpretation. Color should also be accessible and consistent to enable users to establish visual patterns and comprehend data effectively.

In line with data visualization best practices, Amazon QuickSight is announcing the launch of field-based coloring options, which provides a fresh approach to configuring colors across visuals in addition to the visual-level color settings. With field-based colors, you can now enjoy the following benefits:

Consistent coloring across visuals using the same Color field
The ability to assign custom colors to dimension values at the field level
The ability to persist default color consistency during visual interactions, such as filtering and sorting

Consistent coloring experience across visuals

At present, users in QuickSight can either assign colors to their charts using themes or the on-visual menu. In addition to these options, the launch of field-based coloring allows authors to specify colors on a per-field basis, simplifying the process of setting colors and ensuring consistency across all visuals that use the same field. The following example shows that, before this feature was available, both charts using the color field Ship region displayed different colors across the field values.

With the implementation of field colors, authors now have the capability to maintain consistent color schemes across visuals that utilize the same field. This is achieved by defining distinct colors for each field value, which ensures uniformity throughout. In contrast to the previous example, both charts now showcase consistent colors for the Ship region field.

Consistent coloring experience with visual interaction

In the past, the default coloring logic used to be based on the sorting order, which means that colors would stay the same for a given sort order. However, this caused inconsistency because the same values could display different colors when the sorting order changed or when they were filtered. The following example shows that the colors for each segment field (Online, In-Store, and Catalog) on the donut chart differ from the colors on the bar chart after sorting.

The assigned colors persist and remain unchanged during any visual interaction, such as sorting or filtering, by defining field-based colors. Notice that, after sorting the donut chart another way, the legend order changes, but the colors remain the same.

How to customize field colors

In this section, we demonstrate the various ways you can customize field colors.

Edit field color

There are two ways to add or edit field-based color:

Fields list pane – Select the field in your analysis and choose Edit field colors from the context menu. This allows you to choose your own colors for each value.
On-visual menu – To define or modify colors another way, you can simply select the legend or the desired data point. Access the context menu and choose Edit field colors. This opens the Edit field colors pane, which is filtered to display the selected value and allows for easy and convenient color customization.

Note the following considerations:

Colors defined at a visual level override field-based colors.
You can assign colors to a maximum of 50 values per field. If you want more than 50, you’ll need to reset a previously assigned color to continue.

Reset visual color

If your visuals have colors assigned through the on-visual menu, the field-based colors aren’t visible. This is because on-visual colors take precedence over the field-based color settings. However, you can easily reset the visual-based colors to reveal the underlying field-based colors in such cases.

Reset field colors

If you want to change the color of a specific value, simply choose the reset icon next to the edited color. Alternatively, if you want to reset all colors, choose Reset colors at the bottom. This restores all edited values to their default color assignment.

Unused color (stale color assignment)

When values that you’ve assigned colors to no longer appear in data, QuickSight labels the values as unused. You can view the unused color assignments and choose to delete them if you’d like.

Conclusion

Field-based coloring options in QuickSight simplify the process of achieving consistent and visually appealing visuals. The persistence of default colors during interactions, such as filtering and sorting, enhances the user experience. Start using field-based coloring today for consistent coloring experience and to enable better comparisons and pattern recognition for effective data interpretation and decision-making.

About the author

Bhupinder Chadha is a senior product manager for Amazon QuickSight focused on visualization and front end experiences. He is passionate about BI, data visualization and low-code/no-code experiences. Prior to QuickSight he was the lead product manager for Inforiver, responsible for building a enterprise BI product from ground up. Bhupinder started his career in presales, followed by a small gig in consulting and then PM for xViz, an add on visualization product.

Build and share a business capability model with Amazon QuickSight

2023-07-14 Abdul Qadir

Post Syndicated from Abdul Qadir original https://aws.amazon.com/blogs/big-data/build-and-share-a-business-capability-model-with-amazon-quicksight/

The technology landscape has been evolving rapidly, with waves of change impacting IT from every angle. It is causing a ripple effect across IT organizations and shifting the way IT delivers applications and services.

The change factors impacting IT organizations include:

The shift from a traditional application model to a services-based application model (SaaS, PaaS)
The shift from a traditional infrastructure and hardware costing model to cloud-based containers (private and public clouds) with metered usage for resources (IaaS)
The shift from the lengthy traditional development and delivery cycles to continuous development and integration (DevOps)
The shift in application architecture from N-Tier to loosely coupled services

The portfolio of services delivering business capabilities are the new assets of IT organizations that need to be cataloged in a repository. The system must follow a well-defined business taxonomy that enhances discovery, analysis, and reuse by potential consumers, and avoids building redundant services. The traditional portfolio management tools within the organization need to be augmented with additional components that can manage the complexity of the services ecosystem.

This post provides a simple and quick way of building an extendable analytical system using Amazon QuickSight to better manage lines of business (LOBs) with a detailed list of business capabilities and APIs, deep analytical insights, and desired graphical visualizations from different dimensions. In addition, this tool enhances the discovery and reuse of existing business capabilities, avoids duplication of services, and shortens time-to-market.

Use case overview

Bob is a Senior Enterprise Architect. He recently joined a Tier 1 bank. His first assignment is to assess the bank’s capabilities to offer new financial products to its high-value retail clients. The only document given to Bob was PowerPoint slides and the names of the head of each department to get more information. The PowerPoint presentation provided high-level information, but it didn’t give an insight into how capable each department is to provide the required data through APIs for the new products. To collect that information, Bob gets in touch with the head of each department, who in turn refer him to their development leads, who in turn give him a bunch of technical documents that explain how APIs are being used.

Relevance

Business analysts are familiar with business terminology and taxonomy, and often depend on the technology team to explain the technical assets associated with business capabilities. The business capabilities are the assets of the IT organization that need to be cataloged in a repository. The catalog must follow a well-defined business taxonomy that enhances discovery and reuse by consumers, and avoids building redundant services.

The better organized the catalog is, the higher the potential for reuse and the return on investment for the services transformation strategy. The catalog needs to be organized using some business functions taxonomy with a detailed list of capabilities and sub-capabilities. The following diagram illustrates an example of services information and interdependencies.

Example of services information and interdependencies

Defining and capturing a business capability model

If an enterprise doesn’t have a system to capture the business capability model, consider defining and finding a way to capture the model for better insight and visibility, and then map it with digital assets like APIs. The model should be able to showcase to LOBs their categories and capabilities. The following table includes some sample LOBs and their associations for a business that sells the services.

LOB	Category	Capability
Recruitment	Manage Applicant Experience	Manage Application Activities
	Process Application	Follow-Ups
		Pursue Automated Leads
Sale Service	Engage Customer	Provide Needs Assessment Tools
		Provide Service Information

After the map is defined and captured, each business capability can be mapped to APIs that are implemented for it. Each business capability then has visibility into all the associated digital assets and mapped metadata of the services, such as consumers of the API.

To capture the model, you can define a simple table to capture the information, and then you can perform further analysis on it with an analytical tool such as QuickSight.

In the following sample data model, each business LOB has several business categories and capabilities, and each capability can be mapped to multiple APIs. Also note that there’s not always a 1:1 mapping between a business capability, an API, and a service.

Business LOB – Recruitment, Sale Service
Business category – Process Application, Engage Customer
Business capabilities – Complete an Application, Follow-Ups
Digital assets – Recruitment API, Sale Service API

There are sets of other standard information that you can include in a data model, such as API consumers.

The following example shows a table structure to capture this information.

LOB table structure

The following figure visualizes the business capabilities and associated APIs.

Visualization of business capabilities and associated APIs

The remainder of the post highlights the key components to build the full solution end to end. The UI captures the business capabilities and associated APIs, and publishes the service information through a DevOps process. The solution also includes storage and a reporting tool that complement the applications portfolio management capability in place and expand its capabilities with the services portfolio.

Aligning APIs to a business capability model

To align APIs to a business capability model, you can follow these steps:

Understand the business capabilities – Identify the key business capabilities of your organization and understand how they support the overall business strategy.
Map the APIs to the capabilities – Review the existing APIs and map them to the corresponding business capabilities. This will help identify any gaps in the capabilities that can be addressed through new or updated APIs.
Prioritize the APIs – Prioritize the development of new or updated APIs based on their importance to the business capabilities. This will ensure that the most critical capabilities are supported by the APIs.
Implement governance – Implement a governance process to ensure that the APIs are aligned with the business capabilities and are used correctly. This can include setting standards for how the APIs are designed, developed, and deployed.
Monitor and measure – Monitor the usage and performance of the APIs to measure their impact on the business capabilities. Use this information to make decisions about changes to the APIs over time.
Regularly review and update – Review and update the mapping of the APIs to the business capabilities on a regular basis to ensure they remain aligned with the organization’s goals and objectives.

Maintenance and evolution of a business capability model

Building a business capability model is not a one-time exercise. It keeps evolving with business requirements and usage. Data management best practices should be followed as per your company’s guidelines to have consistent data end to end.

Solution overview

In this section, we introduce the ability to capture the business capabilities and associated APIs and make them available using the QuickSight business intelligence (BI) tool, and highlight its features.

The following approach provides the ability to manage business capability models and enable them to link business capabilities with enterprise digital assets, including services, APIs, and IT systems. This solution enables IT and business teams to further drill down into the model to see what has been implemented. These details provide value to architects and analysts to assess which services can be combined to provide new offerings and shorten time-to-market, enable reusability by consumers, and avoid building redundant services.

The following key components are required:

Reporting tool – Amazon QuickSight
Query service – Amazon Athena
Serverless web application – This includes the following:
- Front end – AWS Amplify
- API – AWS AppSync
- Database – Amazon DynamoDB
- Authentication – Amazon Cognito

Organizations can use their existing UI framework (if available) to capture the information, or they can use one of the open-source services available in the market. Depending on the selection and capability of the open-source product, a user interface can be generated and customized.

Let’s look at each service in our solution in more detail:

Amplify – Amplify is a set of tools and services that can be used together or on their own, to help front-end web and mobile developers build scalable full stack applications, powered by AWS. With Amplify, you can configure app backends and connect your app in minutes, deploy static web apps in a few clicks, and easily manage app content outside the AWS Management Console. Amplify supports popular web frameworks including JavaScript, React, Angular, Vue, and Next.js, and mobile platforms including Android, iOS, React Native, Ionic, and Flutter. Get to market faster with AWS Amplify.
AppSync – AWS AppSync simplifies application development by creating a universal API for securely accessing, modifying, and combining data from multiple sources. AWS AppSync is a managed service that uses GraphQL so that applications can easily get only the data they need.
Athena – Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. In this solution, we use Athena as a data source for QuickSight.
Amazon Cognito – Amazon Cognito delivers frictionless customer identity and access management (CIAM) with a cost-effective and customizable platform. It easily connects the web application to the backend resources and web services.
DynamoDB – DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools.
QuickSight – QuickSight is a serverless, cloud-based BI and reporting service that brings data insights to your teams and end-users through machine learning (ML)-powered dashboards and data visualizations, which can be accessed via QuickSight or embedded in apps and portals that your users access.

The following diagram illustrates the solution architecture.

Business capabilities insights solution architecture

In the following sections, we walk through the implementation and end-to-end integration steps.

Build a serverless web application with Amplify

The open-source Amplify provides a CLI, libraries, UI components and Amplify hosting to build full stack iOS, Android, Flutter, Web, and React Native apps. For instructions on building a serverless web application, refer to the following tutorial. For this post, we created the following GraphQL schema with amplify add api:

type BusinessCapability @model {
  company_id: ID!
  company_name: String!
  company_desc: String!
  lob_name: String!
  categoray: String!
  capability: String!
  digital_asset_type: String!
  digital_asset_name: String!
  digital_asset_info: String!
}

After we use Amplify to deploy the API in the cloud, a corresponding AppSync API and a DynamoDB table are created automatically.

You can use the Amplify UI library to generate a business capability intake form and bind the fields to your front-end code.

Amplify studio generated form

You can add authentication to your application using Amazon Cognito by running amplify add auth.

With that, you are now hosting a serverless web application for your business capabilities securely and at scale.

Set up Athena and the Athena DynamoDB data connector

The DynamoDB table generated by Amplify stores all the business capabilities. You can set up Athena and the Athena DynamoDB data connector so that you can query your tables with SQL. For more information, refer to Amazon Athena DynamoDB connector.

Enable QuickSight

Enable QuickSight in your AWS account and create the datasets. The source dataset is the Athena database and table that you created earlier. To connect, you need to allow access to query Athena and Amazon S3 via the admin user interface in QuickSight. Refer to accessing AWS resources for access requirements.

Sample reports

When all the components are up and running, you can design analyses and generate reports. For more information about gathering insights from the captured data, refer to Tutorial: Create an Amazon QuickSight analysis. You can export reports in PDF, and share analyses and reports with other users. The following screenshots are reports that reflects the relationship among LOBs, business capabilities, and APIs.

The first screenshot visualizes the capabilities and associated APIs. This enables the user to identify a set of APIs, and use the same API in new similar business functions.

Business Capability Visualization 1

The following screenshot visualizes LOBs, category, and capabilities. This enables the user to easily gain insights on these relationships.

Business Capabilities Visualization 2

Best practices

The following are some best practices for business capability modeling:

Define clear and measurable capabilities – Each capability should be defined in a way that is clear and measurable, so that it can be tracked and improved over time.
Involve key stakeholders – Involve key stakeholders in the modeling process to ensure that the capabilities accurately reflect the needs of the organization.
Use a consistent framework – Use a consistent framework to ensure that capabilities are defined and organized in a way that makes sense for the organization.
Regularly review and update – Review and update the capabilities regularly to ensure they remain relevant and aligned with the organization’s goals and objectives.
Use visual representations – Use visual representations, like diagrams or models, to help stakeholders understand and communicate the capabilities.
Implement a governance process – Implement a governance process to ensure that the capabilities are being used correctly and to make decisions about changes to the capabilities over time.

Conclusion

In this post, you learned how to build a system to manage a business capability model, and discover and visualize the results in QuickSight.

We hope that companies can use this solution to manage their enterprise capability model and enable users to explore business functions available for them to use within the organization. Business users and technical architects can now easily discover business capabilities and APIs, helping accelerate the creation and orchestration of new features. With the QuickSight web interface, you can filter through thousands of business capabilities, analyze the data for your business needs, and understand the technical requirements and how to combine existing technical capabilities into a new business capability.

Furthermore, you can use your data source to gain further insights from your data by setting up ML Insights in QuickSight and create graphical representations of your data using QuickSight visuals.

To learn more about how you can create, schedule, and share reports and data exports, see Amazon QuickSight Paginated Reports.

About the authors

Abdul Qadir is an AWS Solutions Architect based in New Jersey. He works with independent software vendors in the Northeast and provides customer guidance to build well-architected solutions on the AWS cloud platform.

Sharon Li is a solutions architect at AWS, based in the Boston, MA area. She works with enterprise customers, helping them solve difficult problems and build on AWS. Outside of work, she likes to spend time with her family and explore local restaurants.

Create a comprehensive view of AWS support cases with Amazon QuickSight

2023-07-13 Yash Bindlish

Post Syndicated from Yash Bindlish original https://aws.amazon.com/blogs/big-data/create-a-comprehensive-view-of-aws-support-cases-with-amazon-quicksight/

AWS customers are looking for an efficient tracking method of support cases raised with AWS Support across their multiple interconnected accounts. Having a unified view lets the cloud operations team derive actionable insights across the support cases raised by different business units and accounts. This helps ensure that the team has a comprehensive understanding of the state of existing support cases and can quickly identify and work with teams to resolve them. The team can also prioritize their responses based on the severity of impact of the issues and take action on cases that need acknowledgement or additional information. AWS Systems Manager is the operations hub for your AWS applications and resources and a secure end-to-end management solution for hybrid cloud environments that enables secure operations at scale. AWS Systems Manager Explorer provides a summary of support cases across your AWS accounts to help you get better visibility into the operational health of your AWS environment.

This post describes how Amazon QuickSight dashboards can help you visualize your support cases in a single pane of glass using data extracts from Systems Manager. QuickSight meets varying analytic needs from the same source of truth through modern interactive dashboards, paginated reports, embedded analytics, and natural language queries.

Solution overview

The following architecture diagram illustrates the use of Systems Manager to provide a summary of support cases across your AWS accounts. The solution automates the collection process using a Systems Manager Automation document, scheduling automations within a maintenance window. When the Systems Manager configuration is done, the automation extracts the all support cases across the organization and creates a CSV file in an Amazon Simple Storage Service (Amazon S3) bucket. From the S3 bucket, we integrate with Amazon Athena to create a table, and lastly we visualize all support cases in QuickSight. Note that for aggregating data across multiple accounts, they must reside within a single AWS Organization. Implementing the solution requires the following steps:

Set up a Systems Manager maintenance window.
Register an automation task in the maintenance window.
Create a database in the AWS Glue Data Catalog.
Create a custom classifier for an AWS Glue crawler.
Create and run an AWS Glue crawler.
Create views in Athena.
Visualize AWS support cases in QuickSight.

Prerequisites

Before you get started, complete the following prerequisites:

Have a Business or Enterprise support plan for your AWS accounts.
Enable and set up Athena.
Enable QuickSight in your data collection account. For instructions, refer to Setting up for Amazon QuickSight.
Create an S3 bucket where Systems Manager Automation will export support cases.
Follow the steps in Centralized view of support cases opened from multiple AWS accounts using AWS Systems Manager to establish Systems Manager Explorer and create a resource data sync for data aggregation.
Create an Amazon Simple Notification Service (SNS) topic. Use the following command to create an SNS topic named SSM-supportcases-notification and subscribe an email address:

aws sns create-topic --name SSM-supportcases-notification

You should see the following output:

{
	"SubscriptionArn": "arn:aws:sns:us-east-1:12345678901A:SSM-supportcases-notification:5d906xxxx-7c8x-45dx-a9dx-0484e31c98xx"
}

For more information, refer to Creating an Amazon SNS topic.

Have an AWS Identity and Access Manager (IAM) Systems Manager Explorer Exporting OpsData role. The role AmazonSSMExplorerExport allows Explorer to export OpsData to a CSV file. For more information, refer to Exporting OpsData from Systems Manager Explorer.
Have Systems Manager permissions for maintenance windows. For more information, refer to Use the console to configure permissions for maintenance windows.

After you have all the prerequisites in place, follow the step-by-step instructions in the rest of this post.

Set up a Systems Manager maintenance window

Maintenance windows, a capability of Systems Manager, help you define a schedule for AWS support cases to extract at a predefined schedule. For instructions on creating a maintenance window, see Create a maintenance window (console).

Register an automation task with a maintenance window

In this step, you add a task to a maintenance window. Tasks are the actions performed when a maintenance window runs. For instructions on registering an automation task to a maintenance window, see Schedule automations with maintenance windows.

Provide a name for the maintenance task and choose the automation document AWS-ExportOpsDataToS3

2. Enter the following details in the Input parameters section.

Variable	Description	Value
`assumeRole`	(Required) The role ARN to assume during the automation run	The role you created as a prerequisite
`filters`	(Optional) Filters for the `getOpsSummary` request	Leave blank
`syncName`	(Optional) The name of the resource data sync	The sync name that you created as a prerequisite
`resultAttribute`	(Optional) The result attribute for the `getOpsSummary` request	AWS:SupportCenterCase
`columnFields`	(Optional) The column fields to write to the output file	“DisplayId”,”SourceAccountId”,”Subject”,”Status”,”ServiceCode”,”CategoryCode”,”SeverityCode”,”TimeCreated”
`s3BucketName`	(Required) The S3 bucket where you want to download the output file	The S3 bucket that you created as a prerequisite
`snsTopicArn`	(Required) The SNS topic ARN to notify when the download is complete	The ARN for the SNS topic that you created as a prerequisite
`snsSuccessMessage`	(Optional) The message to send when a document is complete	Leave blank
`columnFieldsWithType`	(Optional) The fully qualified column fields to write to the output file	Leave blank
`resultAttributeList`	(Optional) The multiple result attributes for the `getOpsSummary` request	Leave blank

Choose the IAM service role you created as a prerequisite.
Choose Register Automation task.

After you successfully register the task, the automation will run, and you will see CSV files getting created in your S3 bucket. In our use case, we set the rate expression as 1 day. However, you can use a lesser frequency such as 1 hour or even 5 minutes to test the functionality.

Create a database in the AWS Glue Data Catalog

Before you can create an AWS Glue crawler, you need to create a database in the Data Catalog, which is a container that holds tables. You use databases to organize your tables into separate categories. In our use case, support cases data resides in an S3 bucket.

On the AWS Glue console, create a new database.
For Name, enter a name (for example, aws_support_cases).
Add an optional location and description.
Choose Create database.

For more information about AWS Glue databases, refer to Working with databases on the AWS Glue console.

Create a custom classifier

Crawlers invoke classifiers to infer the schema of your data. We need to create a custom classifier because when we extract the support cases, every column in a potential header parses as a string data type. When creating your classifier, choose Has headings and add the following:

number,DisplayId,SourceAccountId,Subject,Status,ServiceCode,CategoryCode,SeverityCode,TimeCreated

For more information on classifiers, refer to Adding classifiers to a crawler in AWS Glue.

Create an AWS Glue crawler

To create a crawler that reads files stored on Amazon S3, complete the following steps:

On the AWS Glue console, in the navigation pane, choose Crawlers.
On the Crawlers page, choose Add crawler.
For Crawler name, enter support cases extract, then choose Next.
For the crawler source type, choose Data stores, then choose Next.

Now let’s point the crawler to your data.

On the Add a data store page, choose the Amazon S3 data store.
For Crawl data in, choose Specified path in this account.
For Include path, enter the path where the crawler can find the support cases data, which is s3://S3_BUCKET_PATH. After you enter the path, the title of this field changes to Include path.
Choose Next.

The crawler also needs permissions to access the data store and create objects in the Data Catalog.

To configure these permissions, choose Create an IAM role. The IAM role name starts with AWSGlueServiceRole-; you enter the last part of the role name (for this post, we enter Crawlercases).
Choose Next.

Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog.

Choose Target database and select the database you created.

Now we create a schedule for the crawler.

For Frequency, choose Daily
Choose Next.
Verify the choices you made. If you see any mistakes, you can choose Edit to return to previous pages and make changes.
After you have reviewed the information, choose Finish to create the crawler.

For more information on creating an AWS Glue crawler, refer to Adding an AWS Glue crawler.

Create views in Athena

After the AWS Glue crawler is configured successfully, we query the data from the database and table created by the crawler and create views in Athena. The data source for the dashboard will be an Athena view of your existing support_cases database. We create a view in Athena with a group by condition.

Create the view case_summary_view by modifying the table name support_cases from the following code and run the query in the Athena query editor:

CREATE OR REPLACE VIEW "case_summary_view" AS SELECT DISTINCT DisplayId caseid , SourceAccountId accountid , Subject case_subject , Status case_status , ServiceCode case_service , CategoryCode case_category , CAST("substring"(TimeCreated, 1, 10) AS date) case_created_on FROM "AwsDataCatalog"."aws_support_cases"."aws_support_cases_report" GROUP BY DisplayId, SourceAccountId, Subject, Status, ServiceCode, CategoryCode, SeverityCode, TimeCreated

Visualize AWS support cases in QuickSight

After we create the Athena view, we can create a dashboard in QuickSight. Before connecting QuickSight to Athena, make sure to grant QuickSight access to Athena and the associated S3 buckets in your account. For details, refer to Authorizing connections to Amazon Athena.

On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.
Choose Athena as your data source.
For Data source name¸ enter AWS_Support_Cases.
Choose Create data source.
For Database, choose the aws_support_cases database, which contains the views you created (refer to the Athena console if you are unsure which ones to select)
For Tables, select the case_summary_view table that we created as part of the steps in Athena.
Choose Edit/Preview data.
Select SPICE to change your query mode.

Now you can create the sheet aws_support_cases in the analysis.

Choose Publish & Visualize.
Select the sheet type that you want (Interactive sheet or Paginated report). For this post, we select Interactive sheet.
Choose Add.

Refer to Starting an analysis in Amazon QuickSight for more information about creating an analysis.

In Sheet 1 of the newly created analysis, under Fields list, choose case_category and case_status.
For Visual types, choose a clustered bar combo chart.

This type of visual returns the count of records by case category.

To add more visuals to the workspace, choose Add, then Add visual.

In the second visual, we create a donut chart with the field case_status to count the number of overall cases.

Next, we create a word cloud to display how often AWS support cases have been raised by which AWS account.

The word cloud shows the top 100 accounts by default (if you have data for more than one account) and displays the account with the maximum number of entries in a higher font size. If you wanted to show just the top account, you would have to configure a top 1 filter.

Next, we create a stacked bar combo chart to display cases with service type, using the fields case_created_on, caseid, and case_service.
Next, we create a table visual to display all case details in table format (select all available fields).

The following screenshot shows a visualization of all fields of support cases in tabular format.

19. Adjust the size and position of the visuals to fit the layout of your analysis.

The following screenshot shows our final dashboard for support cases.

You’ve now set up a fully functional AWS support cases dashboard at an organizational view. You can share the dashboard with your cloud platform and operations teams. For more information, refer to Sharing Amazon QuickSight dashboards.

Clean up

When you don’t need this dashboard anymore, complete the following steps to delete the AWS resources you created to avoid ongoing charges to your account:

Delete the Amazon S3 Bucket
Delete the SNS topic.
Delete the IAM roles.
Cancel your QuickSight subscription. You should only delete your QuickSight account if you explicitly set it up to follow this post and are absolutely sure that it’s not being used by any other users.

Conclusion

This post outlined the steps and resources required to construct a customized analytics dashboard in QuickSight, empowering you to attain comprehensive visibility and valuable insights into support cases generated across multiple accounts within your organization. To learn more about how QuickSight can help your business with dashboards, reports, and more, visit Amazon QuickSight.

About the authors

Yash Bindlish is a Enterprise Support Manager at Amazon Web Services. He has more than 17 years of industry experience including roles in cloud architecture, systems engineering, and infrastructure. He works with Global Enterprise customers and help them build, scalable, modern and cost effective solutions on their growth journey with AWS. He loves solving complex problems with his solution-oriented approach.

Shivani Reddy is a Technical Account Manager (TAM) at AWS with over 12 years of IT experience. She has worked in a variety of roles, including application support engineer, Linux systems engineer, and administrator. In her current role, she works with global customers to help them build sustainable software solutions. She loves the customer management aspect of her job and enjoys working with customers to solve problems and find solutions that meet their specific needs.

How AWS helped Altron Group accelerate their vision for optimized customer engagement

2023-07-13 Jason Yung

Post Syndicated from Jason Yung original https://aws.amazon.com/blogs/big-data/how-aws-helped-altron-group-accelerate-their-vision-for-optimized-customer-engagement/

This is a guest post co-authored by Jacques Steyn, Senior Manager Professional Services at Altron Group.

Altron is a pioneer of providing data-driven solutions for their customers by combining technical expertise with in-depth customer understanding to provide highly differentiated technology solutions. Alongside their partner AWS, they participated in AWS Data-Driven Everything (D2E) workshops and a bespoke AWS Immersion Day workshop that catered to their needs to improve their engagement with their customers.

This post discusses the journey that took Altron from their initial goals, to technical implementation, to the business value created from understanding their customers and their unique opportunities better. They were able to think big but start small with a working solution involving rich business intelligence (BI) and insights provided to their key business areas.

Data-Driven Everything engagement

Altron has provided information technology services since 1965 across South Africa, the Middle East, and Australia. Although the group saw strong results at 2022-year end, the region continues to experience challenging operating conditions with global supply chains disrupted, electronic component shortages, and scarcity of IT talent.

To reflect the needs of their customers spread across different geographies and industries, Altron has organized their operating model across individual Operating Companies (OpCos). These are run autonomously with different sales teams, creating siloed operations and engagement with customers and making it difficult to have a holistic and unified sales motion.

To succeed further, their vision of data requires it to be accessible and actionable to all, with key roles and responsibilities defined by those who produce and consume data, as shown in the following figure. This allows for transparency, speed to action, and collaboration across the group while enabling the platform team to evangelize the use of data:

Altron engaged with AWS to seek advice on their data strategy and cloud modernization to bring their vision to fruition. The D2E program was selected to help identify the best way to think big but start small by working collaboratively to ideate on the opportunities to build data as a product, particularly focused on federating customer profile data in an agile and scalable approach.

Amazon mechanisms such as Working Backwards were employed to devise the most delightful and meaningful solution and put customers at the heart of the experience. The workshop helped devise the think big solution that starting with the Systems Integration (SI) OpCo as the first flywheel turn would be the best way to start small and prototype the initial data foundation collaboratively with AWS Solutions Architects.

Preparing for an AWS Immersion Day workshop

The typical preparation of an AWS Immersion Day involves identifying examples of common use case patterns and utilizing demonstration data. To maximize its success, the Immersion Day was stretched across multiple days as a hands-on workshop to enable Altron to bring their own data, build a robust data pipeline, and scale their long-term architecture. In addition, AWS and Altron identified and resolved any external dependencies, such as network connectivity to data sources and targets, where Altron was able to put the connectivity to the sources in place.

Identifying key use cases

After a number of preparation meetings to discuss business and technical aspects of the use case, AWS and Altron identified two uses cases to resolve their two business challenges:

Business intelligence for business-to-business accounts – Altron wanted to focus on their business-to-business (B2B) accounts and customer data. In particular, they wanted to enable their account managers, sales executives, and analysts to use actual data and facts to get a 360 view of their accounts.
- Goals – Grow revenue, increase the conversion ratio of opportunities, reduce the average sales cycle, improve the customer renewal rate.
- Target – Dashboards to be refreshed on a daily basis that would provide insights on sales, gross profit, sales pipelines, and customers.
Data quality for account and customer data – Altron wanted to enable data quality and data governance best practices.
- Goals – Lay the foundation for a data platform that can be used in the future by internal and external stakeholders.

Conducting the Immersion Day workshop

Altron set aside 4 days for their Immersion Day, during which time AWS had assigned a dedicated Solutions Architect to work alongside them to build the following prototype architecture:

This solution includes the following components:

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning, and application development. The Altron team created an AWS Glue crawler and configured it to run against Azure SQL to discover its tables. The AWS Glue crawler populates the table definition with its schema in AWS Glue Data Catalog.
This step consists of two components:
1. A set of AWS Glue PySpark jobs reads the source tables from Azure SQL and outputs the resulting data in the Amazon Simple Storage Service “Raw Zone”. Basic formatting and readability of the data is standardized here. The jobs run on a scheduled basis, according to the upstream data availability (which currently is daily).
2. Users are able to manually upload reference files (CSV and Excel format) via the Amazon Web Services console directly to the Amazon S3 buckets. Depending on the frequency of upload, the Altron team will consider automated mechanisms and remove manual upload.
The reporting zone is based on a set of Amazon Athena views, which are consumed for BI purposes. The Altron team used Athena to explore the source tables and create the views in SQL language. Depending on the needs, the Altron team will materialize these views or create corresponding AWS Glue jobs.
Athena exposes the content of the reporting zone for consumption.
The content of the reporting zone is ingested via SPICE in Amazon QuickSight. BI users create dashboards and reports in QuickSight. Business users can access QuickSight dashboards from their mobile, thanks to the QuickSight native application, configured to use single sign-on (SSO).
An AWS Step Functions state machine orchestrates the run of the AWS Glue jobs. The Altron team will expand the state machine to include automated refresh of QuickSight SPICE datasets.
To verify the data quality of the sources through statistically-relevant metrics, AWS Glue Data Quality runs data quality tasks on relevant AWS Glue tables. This can be run manually or scheduled via Amazon Eventbridge (Optional).

Generating business outcomes

In 4 days, the Altron SI team left the Immersion Day workshop with the following:

A data pipeline ingesting data from 21 sources (SQL tables and files) and combining them into three mastered and harmonized views that are cataloged for Altron’s B2B accounts.
A set of QuickSight dashboards to be consumed via browser and mobile.
Foundations for a data lake with data governance controls and data quality checks. The datasets used for the workshop originate from different systems; by integrating the datasets during the workshop implementation, the Altron team can have a comprehensive overview of their customers.

Altron’s sales teams are now able to quickly refresh dashboards encompassing previously disparate datasets that are now centralized to get insights about sales pipelines and forecasts on their desktop or mobile. The technical teams are equally adept at adjusting to business needs by autonomously onboarding new data sources and further enriching the user experience and trust in the data.

Conclusion

In this post, we walked you through the journey the Altron team took with AWS. The outcomes to identify the opportunities that were most pressing to Altron, applying a working backward approach and coming up with a best-fit architecture, led to the subsequent AWS Immersion Day to implement a working prototype that helped them become more data-driven.

With their new focus on AWS skills and mechanisms, increasing trust in their internal data, and understanding the importance of driving change in data literacy and mindset, Altron is better set up for success to best serve their customers in the region.

To find out more about how Altron and AWS can help work backward on your data strategy and employ the agile methodologies discussed in this post, check out Data Management. To learn more about how can help you turn your ideas into solutions, visit the D2E website and the series of AWS Immersion Days that you can choose from. For more hands-on bespoke options, contact your AWS Account Manager, who can provide more details.

Special thanks to everyone at Altron Group who helped contribute to the success of the D2E and Build Lab workshops:

The Analysts (Liesl Kok, Carmen Kotze)
Data Engineers (Banele Ngemntu, James Owen, Andrew Corry, Thembelani Mdlankomo)
QuickSight BI Developers (Ricardo De Gavino Dias, Simei Antoniades)
Cloud Administrator (Shamiel Galant)

About the authors

Jacques Steyn runs the Altron Data Analytics Professional Services. He has been leading the building of data warehouses and analytic solutions for the past 20 years. In his free time, he spends time with his family, whether it be on the golf , walking in the mountains, or camping in South Africa, Botswana, and Namibia.

Jason Yung is a Principal Analytics Specialist with Amazon Web Services. Working with Senior Executives across the Europe and Asia-Pacific Regions, he helps customers become data-driven by understanding their use cases and articulating business value through Amazon mechanisms. In his free time, he spends time looking after a very active 1-year-old daughter, alongside juggling geeky activities with respectable hobbies such as cooking sub-par food.

Michele Lamarca is a Senior Solutions Architect with Amazon Web Services. He helps architect and run Solutions Accelerators in Europe to enable customers to become hands-on with AWS services and build prototypes quickly to release the value of data in the organization. In his free time, he reads books and tries (hopelessly) to improve his jazz piano skills.

Hamza is a Specialist Solutions Architect with Amazon Web Services. He runs Solutions Accelerators in EMEA regions to help customers accelerate their journey to move from an idea into a solution in production. In his free time, he spends time with his family, meets with friends, swims in the municipal swimming pool, and learns new skills.

Vega Cloud brings FinOps solutions to their customers faster by embedding Amazon QuickSight

2023-07-13 Kris Bliesner

Post Syndicated from Kris Bliesner original https://aws.amazon.com/blogs/big-data/vega-cloud-brings-finops-solutions-to-their-customers-faster-by-embedding-amazon-quicksight/

This is a guest post authored by Kris Bliesner and Mike Brown from Vega Cloud.

Vega Cloud is a premier member of the FinOps Foundation, a program by Linux Foundation supporting FinOps practitioners on cloud financial management best practices. Vega Cloud provides a place where finance, engineers, and innovators come together to accelerate the business value of the cloud with concrete curated data, context-relevant recommendations, and automation to achieve cost savings. Vega Cloud’s platform is based on the FinOps Foundation’s best practices and removes the confusion behind the business value of cloud services and accelerates strategic decisions while maintaining cost optimization. Vega’s curated reports provide actionable insights to accelerate time-to-value from years into days. On average, Vega’s customers immediately identify 15–25% of underutilized cloud spend with a clear direction on how to reallocate the funds to maximize business impact.

Vega Cloud has been growing rapidly and saw an opportunity to accelerate cloud intelligence at hyperscale. Engineering leadership chose Amazon QuickSight, which allowed Vega to add insightful analytics into its platform with customized interactive visuals and dashboards, while scaling at a lower cost without the need to manage infrastructure.

In this post, we discuss how Vega uses QuickSight to bring cloud intelligence solutions to our customers.

Bringing solutions to market at a fast pace

The Vega Cloud Platform was designed from the start by cloud pioneers to enable businesses to get the most value out of their cloud spend. This is done by a multi-step process that starts with data analytics and ends with automation to remediate inefficiencies. The Vega Cloud Platform consumes customer billing and usage information from cloud providers and third parties, and uses that data to show customers what they are spending and which services they are consuming, with business context to help the customer with chargebacks and cost inquiries. The Vega Cloud Platform then analyzes the data collected and produces context relevant recommendations across five major categories: financial, waste elimination, utilization, process, and architecture. Finally, the Vega Cloud Platform enables customers to choose which recommendations to implement through automated processes and immediately receive cost benefits without massive amounts of work by end developers or app teams.

Vega is constantly updating and improving the platform by adding more recommendation types, deeper analytics, and easier automation to save time. When looking for an embedded analytics solution to bring these insights to customers, Vega looked for a tool that would allow us to keep up with our rapid growth and iterate quickly. With QuickSight, Vega has been able to scale from the proof of concept stage to enterprise-level analytics and visualizations as the company grows. QuickSight enables our product team to ship product quickly and rapidly test customer feedback and assumptions. Vega has tremendously reduced the time from idea to implementation for the analytics solutions by using QuickSight.

Using embedded QuickSight saved Vega 6–12 months of development time, allowing us to go to market sooner. Vega Cloud’s team of certified FinOps practitioners—a unique combination of finance professionals, architects, FinOps practitioners, educators, engineers, financial analysts, and data analysts with deep expertise in multi-cloud environments—can focus on driving business growth and meeting customer needs. QuickSight gives the Vega team one place to build reports and dashboards, allowing the Vega Cloud Platform to deliver data analytics to customers quickly and consistently. The Vega Cloud Platform uses QuickSight APIs to seamlessly onboard new users. In addition to the cost savings Vega Cloud has achieved by saving development time, QuickSight doesn’t require licensing or maintenance costs. The AWS pay-as-you-go pricing model allowed Vega Cloud to hit the ground running and scale with real-time demand.

Creating a powerful array of solutions for customers

By embedding QuickSight, Vega Cloud has been able to bring a wealth of information to cloud consumers, helping them gain value and efficiencies. For example, an enterprise customer in the energy industry engaged Vega for cloud cost optimization and saw monthly savings exceeding 25% with cumulative savings of over $1.36 million over 11 months. They moved from 24% Reserved Instance/Savings Plans (RI/SP) coverage to 53% coverage. Their optimization efforts led them to increase their cloud commit by 10 times over 5 years.

The Vega Cloud Platform has two SKUs that use embedded QuickSight: Vega Inform and Vega Optimize. Vega Inform is about cost allocation, chargebacks and showbacks, anomaly detection, spend analysis, and deep usage analytics. Vega Optimize is an easy-to-use set of dashboards to help customers better understand the optimization opportunities they have across their entire enterprise. In the Vega Inform SKU, the Vega Cloud Platform provides true multi-cloud cost reporting with cash and fiscal views and the ability to switch between them seamlessly. The Vega Cloud Platform is a curated data platform to ensure customers avoid garbage in/garbage out scenarios. Vega curates customer usage and billing data to verify billing rates, usage, and credit allocation, and then enables retroactive cleanups to historical spend.

Vega Optimize is a core piece of the Vega Cloud Platform and allows end-users to see cost-optimization recommendations with business context added using the embedded QuickSight dashboards. The Vega Cloud Platform enables end-users to self-manage and approve optimization recommendations for implementation—ensuring that businesses are taking the actions they need to better manage their cloud investments.

Vega customers can identify and act upon near-term optimization opportunities prioritized by business impact and level of effort, as well as identify, purchase, and track committed use resources. QuickSight enables end-users to easily filter down data to exactly what the user wants to see. Doing so enables questions to be answered more quickly, which ensures the right optimization takes place in a timely manner. The depth and breadth of data the Vega Cloud Platform consumes and surfaces to end-users via QuickSight provides customers with a platform approach to enabling FinOps within their organizations.

Powerful and dynamic QuickSight features

The Vega Cloud Platform ingests billions of lines of data on behalf of customers, which must be converted into actionable insight and personalized to a decision-maker’s role inside the organization. Processing terabytes of data can lead to delayed and infrequent reports, slowing down an organization’s ability to respond and compete in their respective markets. It is paramount that customers have consistent, dependable access to timely reports with the correct business context. QuickSight is powered by SPICE (Super-fast, Parallel, In-memory Calculation Engine), a robust in-memory engine that now supports up to 1 billion rows of data. Thanks to the Vega Cloud Platform’s implementation of QuickSight, Vega has offloaded the responsibility including engineering from customers to ingest and curate billions of rows of data every day. The Vega Cloud Platform uses role-based permissions, with row-level security from QuickSight, to centralize and tailor data to prioritize actionable insights with the ability to quickly investigate details to provide evidence-based decisions to customers.

QuickSight allows Vega to highlight data that is considered high-cost or high-value to its customers so that they can take quick action. This is accomplished by purpose-built dashboards based on over a decade of experience to ensure customers see the items that will be most impactful in their optimization efforts. The advanced visualizations, sorting, and filtering capabilities of QuickSight allow the Vega Cloud Platform to scale usage by multiple groups within a business, including finance, DevOps, IT and many others. Along with QuickSight, the Vega Cloud Platform uses many other AWS services, including but not limited to Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon Athena, AWS Glue, and more.

Scaling into the future with QuickSight

Vega is focused on continuing to provide customers with cloud intelligence at hyperscale using QuickSight. The Vega Cloud Platform roadmap includes a proof of concept for Amazon QuickSight Q, which would give customers the ability to ask questions in natural language and receive accurate answers with relevant visualizations that help users gain insights from the data. This also includes paginated reports, which makes it easier for customers to create, schedule, and share reports.

QuickSight has enabled Vega Cloud to grow rapidly, while saving time and money and delivering FinOps solutions to businesses in any and every vertical industry consuming cloud at scale.

To learn more about how you can embed customized data visuals and interactive dashboards into any application, visit Amazon QuickSight Embedded.

About the Authors

Kris Bliesner, CEO, Vega Cloud is a seasoned technology leader with over 25 years of experience in IT management, cloud computing, and consumer-based technology. As the co-founder and CEO of Vega Cloud, Kris continues to be at the forefront of revolutionizing cloud infrastructure optimization.

Mike Brown, CTO, Vega Cloud is a highly-skilled technology leader and co-founder of Vega Cloud, where he currently serves as the Chief Technology Officer (CTO). With a proven track record in driving technological innovation, Mike has been instrumental in shaping the application architecture and solutions for the company.

Level up your React app with Amazon QuickSight: How to embed your dashboard for anonymous access

2023-07-11 Adrianna Kurzela

Post Syndicated from Adrianna Kurzela original https://aws.amazon.com/blogs/big-data/level-up-your-react-app-with-amazon-quicksight-how-to-embed-your-dashboard-for-anonymous-access/

Using embedded analytics from Amazon QuickSight can simplify the process of equipping your application with functional visualizations without any complex development. There are multiple ways to embed QuickSight dashboards into application. In this post, we look at how it can be done using React and the Amazon QuickSight Embedding SDK.

Dashboard consumers often don’t have a user assigned to their AWS account and therefore lack access to the dashboard. To enable them to consume data, the dashboard needs to be accessible for anonymous users. Let’s look at the steps required to enable an unauthenticated user to view your QuickSight dashboard in your React application.

Solution overview

Our solution uses the following key services:

Amazon API Gateway
AWS Identity and Access Management (IAM)
AWS Lambda
Amazon QuickSight (with session capacity pricing enabled)

After loading the web page on the browser, the browser makes a call to API Gateway, which invokes a Lambda function that calls the QuickSight API to generate a dashboard URL for an anonymous user. The Lambda function needs to assume an IAM role with the required permissions. The following diagram shows an overview of the architecture.

Prerequisites

You must have the following prerequisites:

An AWS account
A QuickSight account with session capacity pricing enabled
QuickSight dashboards (for setup instructions, refer to Tutorial: Create an Amazon QuickSight dashboard using sample data)
A sample React application (to get started, refer to Start a New React Project)

Set up permissions for unauthenticated viewers

In your account, create an IAM policy that your application will assume on behalf of the viewer:

On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
On the JSON tab, enter the following policy code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "quicksight:GenerateEmbedUrlForAnonymousUser"
            ],
            "Resource": [
                "arn:aws:quicksight:*:*:namespace/default",
				"arn:aws:quicksight:*:*:dashboard/<YOUR_DASHBOARD_ID1>",
				"arn:aws:quicksight:*:*:dashboard/<YOUR_DASHBOARD_ID2>"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Make sure to change the value of <YOUR_DASHBOARD_ID> to the value of the dashboard ID. Note this ID to use in a later step as well.

For the second statement object with logs, permissions are optional. It allows you to create a log group with the specified name, create a log stream for the specified log group, and upload a batch of log events to the specified log stream.

In this policy, we allow the user to perform the GenerateEmbedUrlForAnonymousUser action on the dashboard ID within the list of dashboard IDs inserted in the placeholder.

Enter a name for your policy (for example, AnonymousEmbedPolicy) and choose Create policy.

Next, we create a role and attach this policy to the role.

Choose Roles in the navigation pane, then choose Create role.

Identity and access console

Choose Lambda for the trusted entity.
Search for and select AnonymousEmbedPolicy, then choose Next.
Enter a name for your role, such as AnonymousEmbedRole.
Make sure the policy name is included in the Add permissions section.
Finish creating your role.

You have just created the AnonymousEmbedRole execution role. You can now move to the next step.

Generate an anonymous embed URL Lambda function

In this step, we create a Lambda function that interacts with QuickSight to generate an embed URL for an anonymous user. Our domain needs to be allowed. There are two ways to achieve the integration of Amazon QuickSight:

By adding the URL to the list of allowed domains in the Amazon QuickSight admin console (explained later in [Optional] Add your domain in QuickSight section).
[Recommended] By adding the embed URL request during runtime in the API call. Option 1 is recommended when you need to persist the allowed domains. Otherwise, the domains will be removed after 30 minutes, which is equivalent to the session duration. For other use cases, it is recommended to use the second option (described and implemented below).

On the Lambda console, create a new function.

Select Author from scratch.
For Function name, enter a name, such as AnonymousEmbedFunction.
For Runtime¸ choose Python 3.9.
For Execution role¸ choose Use an existing role.
Choose the role AnonymousEmbedRole.
Choose Create function.
On the function details page, navigate to the Code tab and enter the following code:


import json, boto3, os, re, base64

def lambda_handler(event, context):
print(event)
try:
def getQuickSightDashboardUrl(awsAccountId, dashboardIdList, dashboardRegion, event):
#Create QuickSight client
quickSight = boto3.client('quicksight', region_name=dashboardRegion);

#Construct dashboardArnList from dashboardIdList
dashboardArnList=[ 'arn:aws:quicksight:'+dashboardRegion+':'+awsAccountId+':dashboard/'+dashboardId for dashboardId in dashboardIdList]
#Generate Anonymous Embed url
response = quickSight.generate_embed_url_for_anonymous_user(
AwsAccountId = awsAccountId,
Namespace = 'default',
ExperienceConfiguration = {'Dashboard':{'InitialDashboardId': dashboardIdList[0]}},
AuthorizedResourceArns = dashboardArnList,
SessionLifetimeInMinutes = 60,
AllowedDomains = ['http://localhost:3000']
)
return response

#Get AWS Account Id
awsAccountId = context.invoked_function_arn.split(':')[4]

#Read in the environment variables
dashboardIdList = re.sub(' ','',os.environ['DashboardIdList']).split(',')
dashboardNameList = os.environ['DashboardNameList'].split(',')
dashboardRegion = os.environ['DashboardRegion']

response={}

response = getQuickSightDashboardUrl(awsAccountId, dashboardIdList, dashboardRegion, event)

return {'statusCode':200,
'headers': {"Access-Control-Allow-Origin": "http://localhost:3000",
"Content-Type":"text/plain"},
'body':json.dumps(response)
}

except Exception as e: #catch all
return {'statusCode':400,
'headers': {"Access-Control-Allow-Origin": "http://localhost:3000",
"Content-Type":"text/plain"},
'body':json.dumps('Error: ' + str(e))
}

If you don’t use localhost, replace http://localhost:3000 in the returns with the hostname of your application. To move to production, don’t forget to replace http://localhost:3000 with your domain.

On the Configuration tab, under General configuration, choose Edit.
Increase the timeout from 3 seconds to 30 seconds, then choose Save.
Under Environment variables, choose Edit.
Add the following variables:
1. Add DashboardIdList and list your dashboard IDs.
2. Add DashboardRegion and enter the Region of your dashboard.
Choose Save.

Your configuration should look similar to the following screenshot.

On the Code tab, choose Deploy to deploy the function.

Environment variables console

Set up API Gateway to invoke the Lambda function

To set up API Gateway to invoke the function you created, complete the following steps:

On the API Gateway console, navigate to the REST API section and choose Build.
Under Create new API, select New API.
For API name, enter a name (for example, QuicksightAnonymousEmbed).
Choose Create API.

API gateway console

On the Actions menu, choose Create resource.
For Resource name, enter a name (for example, anonymous-embed).

Now, let’s create a method.

Choose the anonymous-embed resource and on the Actions menu, choose Create method.
Choose GET under the resource name.
For Integration type, select Lambda.
Select Use Lambda Proxy Integration.
For Lambda function, enter the name of the function you created.
Choose Save, then choose OK.

API gateway console

Now we’re ready to deploy the API.

On the Actions menu, choose Deploy API.
For Deployment stage, select New stage.
Enter a name for your stage, such as embed.
Choose Deploy.

[Optional] Add your domain in QuickSight

If you added Allowed domains in Generate an anonymous embed URL Lambda function part, feel free to move to Turn on capacity pricing section.

To add your domain to the allowed domains in QuickSight, complete the following steps:

On the QuickSight console, choose the user menu, then choose Manage QuickSight.

Quicksight dropdown menu

Choose Domains and Embedding in the navigation pane.
For Domain, enter your domain (http://localhost:<PortNumber>).

Make sure to replace <PortNumber> to match your local setup.

Choose Add.

Quicksight admin console

Make sure to replace the localhost domain with the one you will use after testing.

Turn on capacity pricing

If you don’t have session capacity pricing enabled, follow the steps in this section. It’s mandatory to have this function enabled to proceed further.

Capacity pricing allows QuickSight customers to purchase reader sessions in bulk without having to provision individual readers in QuickSight. Capacity pricing is ideal for embedded applications or large-scale business intelligence (BI) deployments. For more information, visit Amazon QuickSight Pricing.

To turn on capacity pricing, complete the following steps:

On the Manage QuickSight page, choose Your Subscriptions in the navigation pane.
In the Capacity pricing section, select Get monthly subscription.
Choose Confirm subscription.

To learn more about capacity pricing, see New in Amazon QuickSight – session capacity pricing for large scale deployments, embedding in public websites, and developer portal for embedded analytics.

Set up your React application

To set up your React application, complete the following steps:

In your React project folder, go to your root directory and run npm i amazon-quicksight-embedding-sdk to install the amazon-quicksight-embedding-sdk package.
In your App.js file, replace the following:
1. Replace YOUR_API_GATEWAY_INVOKE_URL/RESOURCE_NAME with your API Gateway invoke URL and your resource name (for example, https://xxxxxxxx.execute-api.xx-xxx-x.amazonaws.com/embed/anonymous-embed).
2. Replace YOUR_DASHBOARD1_ID with the first dashboardId from your DashboardIdList. This is the dashboard that will be shown on the initial render.
3. Replace YOUR_DASHBOARD2_ID with the second dashboardId from your DashboardIdList.

The following code snippet shows an example of the App.js file in your React project. The code is a React component that embeds a QuickSight dashboard based on the selected dashboard ID. The code contains the following key components:

State hooks – Two state hooks are defined using the useState() hook from React:
- dashboard – Holds the currently selected dashboard ID.
- quickSightEmbedding – Holds the QuickSight embedding object returned by the embedDashboard() function.
Ref hook – A ref hook is defined using the useRef() hook from React. It’s used to hold a reference to the DOM element where the QuickSight dashboard will be embedded.
useEffect() hook – The useEffect() hook is used to trigger the embedding of the QuickSight dashboard whenever the selected dashboard ID changes. It first fetches the dashboard URL for the selected ID from the QuickSight API using the fetch() method. After it retrieves the URL, it calls the embed() function with the URL as the argument.
Change handler – The changeDashboard() function is a simple event handler that updates the dashboard state whenever the user selects a different dashboard from the drop-down menu. As soon as new dashboard ID is set, the useEffect hook is triggered.
10-millisecond timeout – The purpose of using the timeout is to introduce a small delay of 10 milliseconds before making the API call. This delay can be useful in scenarios where you want to avoid immediate API calls or prevent excessive requests when the component renders frequently. The timeout gives the component some time to settle before initiating the API request. Because we’re building the application in development mode, the timeout helps avoid errors caused by the double run of useEffect within StrictMode. For more information, refer to Updates to Strict Mode.

See the following code:

import './App.css';
import * as React from 'react';
import { useEffect, useRef, useState } from 'react';
import { createEmbeddingContext } from 'amazon-quicksight-embedding-sdk';

 function App() {
  const dashboardRef = useRef([]);
  const [dashboardId, setDashboardId] = useState('YOUR_DASHBOARD1_ID');
  const [embeddedDashboard, setEmbeddedDashboard] = useState(null);
  const [dashboardUrl, setDashboardUrl] = useState(null);
  const [embeddingContext, setEmbeddingContext] = useState(null);

  useEffect(() => {
    const timeout = setTimeout(() => {
      fetch("YOUR_API_GATEWAY_INVOKE_URL/RESOURCE_NAME"
      ).then((response) => response.json()
      ).then((response) => {
        setDashboardUrl(response.EmbedUrl)
      })
    }, 10);
    return () => clearTimeout(timeout);
  }, []);

  const createContext = async () => {
    const context = await createEmbeddingContext();
    setEmbeddingContext(context);
  }

  useEffect(() => {
    if (dashboardUrl) { createContext() }
  }, [dashboardUrl])

  useEffect(() => {
    if (embeddingContext) { embed(); }
  }, [embeddingContext])

  const embed = async () => {

    const options = {
      url: dashboardUrl,
      container: dashboardRef.current,
      height: "500px",
      width: "600px",
    };

    const newEmbeddedDashboard = await embeddingContext.embedDashboard(options);
    setEmbeddedDashboard(newEmbeddedDashboard);
  };

  useEffect(() => {
    if (embeddedDashboard) {
      embeddedDashboard.navigateToDashboard(dashboardId, {})
    }
  }, [dashboardId])

  const changeDashboard = async (e) => {
    const dashboardId = e.target.value
    setDashboardId(dashboardId)
  }

  return (
    <>
      <header>
        <h1>Embedded <i>QuickSight</i>: Build Powerful Dashboards in React</h1>
      </header>
      <main>
        <p>Welcome to the QuickSight dashboard embedding sample page</p>
        <p>Please pick a dashboard you want to render</p>
        <select id='dashboard' value={dashboardId} onChange={changeDashboard}>
          <option value="YOUR_DASHBOARD1_ID">YOUR_DASHBOARD1_NAME</option>
          <option value="YOUR_DASHBOARD2_ID">YOUR_DASHBOARD2_NAME</option>
        </select>
        <div ref={dashboardRef} />
      </main>
    </>
  );
};

export default App

Next, replace the contents of your App.css file, which is used to style and layout your web page, with the content from the following code snippet:

body {
  background-color: #ffffff;
  font-family: Arial, sans-serif;
  margin: 0;
  padding: 0;
}

header {
  background-color: #f1f1f1;
  padding: 20px;
  text-align: center;
}

h1 {
  margin: 0;
}

main {
  margin: 20px;
  text-align: center;
}

p {
  margin-bottom: 20px;
}

a {
  color: #000000;
  text-decoration: none;
}

a:hover {
  text-decoration: underline;
}

i {
  color: orange;
  font-style: normal;
}

Now it’s time to test your app. Start your application by running npm start in your terminal. The following screenshots show examples of your app as well as the dashboards it can display.

Example application page with the visualisation

Conclusion

In this post, we showed you how to embed a QuickSight dashboard into a React application using the AWS SDK. Sharing your dashboard with anonymous users allows them to access your dashboard without granting them access to your AWS account. There are also other ways to share your dashboard anonymously, such as using 1-click public embedding.

Join the Quicksight Community to ask, answer and learn with others and explore additional resources.

About the Author

author headshot

Adrianna is a Solutions Architect at AWS Global Financial Services. Having been a part of Amazon since August 2018, she has had the chance to be involved both in the operations as well as the cloud business of the company. Currently, she builds software assets which demonstrate innovative use of AWS services, tailored to a specific customer use cases. On a daily basis, she actively engages with various aspects of technology, but her true passion lies in combination of web development and analytics.

Amazon Mexico FP&A dives deep into financial data with Amazon QuickSight

2023-07-11 Gonzalo Lezma

Post Syndicated from Gonzalo Lezma original https://aws.amazon.com/blogs/big-data/amazon-mexico-fpa-dives-deep-into-financial-data-with-amazon-quicksight/

This is a guest post by Gonzalo Lezma from Amazon Mexico FP&A.

The Financial Planning and Analysis (FP&A) team in Mexico provides strategic support to Amazon’s CFO and executive team on planning, analysis, and reporting related to Amazon Mexico. We produce and manage key finance deliverables, such as internal profit and loss (P&L) reports for all business groups. We are also involved in planning processes, such as monthly forecast estimates, annual operating plans, and 3-year forecasts.

Our team needed to address five key challenges: manual recurrent reporting, managing data from different sources, moving away from large and slow spreadsheets, enabling ad hoc data insights extraction by business users, and variance analysis. To tackle these issues, we chose Amazon QuickSight for our business intelligence (BI) needs.

In this post, I discuss how QuickSight has enabled us to focus on financial and business analysis that helps drive business strategy.

Fully automating recurrent reporting

Creating and maintaining reports manually is time-consuming due to dense data granularity and multiple business groups and sub-products. This involves a lot of data to process and numerous stakeholders to please. Therefore, recurrent reporting requires allocating human hours to elaborate those reports, check the spreadsheet formulas, and rely on attention to detail to validate the numbers to report.

QuickSight dashboards show the Profit and Loss (P&L), which update as soon as databases refresh, simplifying recurrent reporting dramatically. There is no need for human intervention, which eliminates any risk of error during the elaboration of recurrent reporting. The maintenance and elaboration time has decreased from a week to zero for processes that are currently in QuickSight. We employ the QuickSight alerts feature (as shown in the following screenshot) to remain informed when specific metrics exceed a predefined threshold. This enables us to stay cognizant of significant changes in our P&L at a granular level.

Data from different sources

With new marketplaces and channels constantly emerging in Mexico, not all of them are integrated into the financial planning system; therefore, shadow P&Ls and reports are common and unavoidable. Therefore, the team has to find ways to track them without compromising accuracy or consistency, which poses significant additional challenges. Moreover, with multiple channels and teams reporting those numbers, it’s time-consuming to manually update the data source from every team we work with.

QuickSight can onboard data on channels and products that are relatively new and haven’t been onboarded to official planning systems. The team has numerous options to load data, including Amazon Redshift, CSV files, and Excel spreadsheets. There is virtually no limit on the granularity and scope of our reports.

Large and slow spreadsheets

Although spreadsheets are a popular tool for financial analysis, they have limitations for large and complex datasets. This affects performance, reliability, and validation. Spreadsheets become slow, bulky, and prone to errors, making it challenging to manage large datasets efficiently.

The SPICE (Super-fast, Parallel, In-memory Calculation Engine) in-memory engine that QuickSight uses is unparalleled, compared with other solutions the team tried in the past such as Tableau and Excel, eliminating the need for large spreadsheets dramatically. In addition to the time spent in elaborating the reports, the team was having a hard time reading and visualizing them.

The MX Financial Planning and Analysis Dashboard shown below shows the main contributors to the Gross Merchandise Sales for our business. If sales growth is at 20.92% as it says in the graph, we know that 9.09% is due to our NAFN channel. The graph at the bottom shows which products drove the sales increase.

Ad hoc data insights extraction

The finance space frequently requires ad hoc financial data on recent historic trends for a particular product and timespan. Given the number of channels, products, and scenarios the team works on, this creates a big problem to tackle. Extracting these data insights requires significant bandwidth, which can take away from other essential tasks the team needs to focus on.

Amazon QuickSight Q can answer any simple question about the data in a straightforward and nimble manner, allowing the team to handle ad hoc data insights using natural language requests. The following screenshot shows a graph we frequently get using Q to report shipping costs.

Variance analysis

Providing accurate and insightful variance analysis is a significant challenge for anyone working in financial analysis (for example, explaining price or profit per unit by separating mix effect and rate effect). Huge and difficult-to-understand spreadsheets might sound familiar to anyone who has tried to tackle this problem in the finance space.

With QuickSight URL actions (as shown in the following gif and screenshot), the team can right-click on the variance they want to dissect and link to another sheet with granular detail that has a decomposition of the main drivers explaining that particular variance, replacing the huge and cumbersome Excel variance analysis tool the team used to have.

Summary

All in all, the dynamic and interactive nature of the dashboards allows our internal users to go deeper into the data with just a click of the mouse. Now, building visualizations is intuitive, insightful, and fast. In fact, the whole solution and tools were built without the need of a dedicated BI team. In addition to this, we developed internal QuickSight dashboards to view our own customers’ QuickSight usage, so we have perfect visibility on which areas and users are more active and which features are more used by our partners.

With our QuickSight solution, we have automation, self-service, speedy reaction to requests, and flexibility.

To learn more about how QuickSight can help your business with dashboards, reports, and more, visit Amazon QuickSight.

About the Author

Gonzalo Lezma is the Mexico Finance Manager for the Amazon LATAM Finance Team. He is a lifelong learner, tech and data lover.

New Solution – Clickstream Analytics on AWS for Mobile and Web Applications

2023-07-07 Sébastien Stormacq

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-solution-clickstream-analytics-on-aws-for-mobile-and-web-applications/

Starting today, you can deploy on your AWS account an end-to-end solution to capture, ingest, store, analyze, and visualize your customers’ clickstreams inside your web and mobile applications (both for Android and iOS). The solution is built on top of standard AWS services.

This new solution Clickstream Analytics on AWS allows you to keep your data in the security and compliance perimeter of your AWS account and customize the processing and analytics as you require, giving you the full flexibility to extract value for your business. For example, many business line owners want to combine clickstream analytics data with business system data to gain more comprehensive insights. Storing clickstream analysis data in your AWS account allows you to cross reference the data with your existing business system, which is complex to implement when you use a third-party analytics solution that creates an artificial data silo.

Clickstream Analytics on AWS is available from the AWS Solutions Library at no cost, except for the services it deploys on your account.

Why Analyze Your Applications Clickstreams?
Organizations today are in search of vetted solutions and architectural guidance to rapidly solve business challenges. Whether you prefer off-the-shelf deployments or customizable architectures, the AWS Solutions Library carries solutions built by AWS and AWS Partners for a broad range of industry and technology use cases.

When I talk with mobile and web application developers or product owners, you often tell me that you want to use a clickstream analysis solution to understand your customers’ behavior inside your application. Click stream analysis solutions help you to identify popular and frequently visited screens, analyze navigation patterns, identify bottlenecks and drop-off points, or perform A/B testing of functionalities such as the pay wall, but you face two challenges to adopt or build a click stream analysis solution.

Either you use a third-party library and analytics solution that sends all your application and customer data to an external provider, which causes security and compliance risks and makes it more difficult to reference your existing business data to enrich the analysis, or you dedicate time and resources to build your own solution based on AWS services, such as Amazon Kinesis (for data ingestion), Amazon EMR (for processing), Amazon Redshift (for storage), and Amazon QuickSight (for visualization). Doing so ensures your application and customer data stay in the security perimeter of your AWS account, which is already approved and vetted by your information and security team. Often, building such a solution is an undifferentiated task that drives resources and budget away from developing the core business of your application.

Introducing Clickstream Analytics on AWS
The new solution Clickstream Analytics on AWS provides you with a backend for data ingestion, processing, and visualization of click stream data. It’s shipped as an AWS CloudFormation template that you can easily deploy into the AWS account of your choice.

In addition to the backend component, the solution provides you with purpose-built Java and Swift SDKs to integrate into your mobile applications (for both Android and iOS). The SDKs automatically collects data and provide developers with an easy-to-use API to collect application-specific data. They manage the low-level tasks of buffering the data locally, sending them to the backend, managing the retries in case of communication errors, and more.

The following diagram shows you the high-level architecture of the solution.

The solution comes with an easy-to-use console to configure your solution. For example, it allows you to choose between three AWS services to ingest the application clickstream data: Amazon Managed Streaming for Apache Kafka, Amazon Kinesis Data Streams, or Amazon Simple Storage Service (Amazon S3). You can create multiple data pipelines for multiple applications or teams, each using a different configuration. This allows you to adjust the backend to the application user base and requirements.

You can use plugins to transform the data during the processing phase. The solution comes with two plugins preinstalled: User-Agent enrichment and IP address enrichment to add additional data that’s related to the User-Agent and the geolocation of the IP address used by the client applications.

By default, it provides a Amazon Redshift Serverless cluster to minimize the costs, but you can select a provisioned Amazon Redshift configuration to meet your performance and budget requirements.

Finally, the solution provides you with a set of pre-assembled visualization dashboards to report on user acquisition, user activity, and user engagement. The dashboard consumes the data available in Amazon Redshift. You’re free to develop other analytics and other dashboards using the tools and services of your choice.

Let’s See It in Action
The best way to learn how to deploy and to configure Clickstream Analytics on AWS is to follow the tutorial steps provided by the Clickstream Analytics on AWS workshop.

The workshop goes into great detail about each step. Here are the main steps I did to deploy the solution:

1. I create the control plane (the management console) of the solution using this CloudFormation template. The output of the template contains the URL to the management console. I later receive an email with a temporary password for the initial connection.

2. On the Clickstream Analytics console, I create my first project and define various network parameters such as the VPC, subnets, and security groups. I also select the service to use for data ingestion and my choice of configuration for Amazon Redshift.

3. When I enter all configuration data, the console creates the data plane for my application.

AWS services and solutions are usually built around a control plane and one or multiple data planes. In the context of Clickstream Analytics, the control plane is the console that I use to define my data acquisition and analysis project. The data plane is the infrastructure to receive, analyze, and visualize my application data. Now that I define my project, the console generates and launches another CloudFormation template to create and manage the data plane.

4. The Clickstream Analytics console generates a JSON configuration file to include into my application and it shares the Java or Swift code to include into my Android or iOS application. The console provides instructions to add the clickstream analysis as a dependency to my application. I also update my application code to insert the code suggested and start to deploy.

5. After my customers start to use the mobile app, I access the Clickstream Analytics dashboard to visualize the data collected.

The Dashboards
Clickstream Analytics dashboards are designed to provide a holistic view of the user lifecycle: the acquisition, the engagement, the activity, and the retention. In addition, it adds visibility into user devices and geographies. The solution automatically generates visualizations in these six categories: Acquisition, Engagement, Activity, Retention, Devices, and Navigation path. Here are a couple of examples.

The Acquisition dashboard reports the total number of users, the registered number of users (the ones that signed in), and the number of users by traffic source. It also computes the new users and registered users’ trends.

The Engagement dashboard reports the user engagement level (the number of user sessions versus the time users spent on my application). Specifically, I have access to the number of engaged sessions (sessions that last more than 10 seconds or have at least two screen views), the engagement rate (the percentage of engaged sessions from the total number of sessions), and the average engagement time.

The Activity dashboard shows the event and actions taken by my customers in my application. It reports data, such as the number of events and number of views (or screens) shown, with the top events and views shown for a given amount of time.

The Retention tab shows user retention over time: the user stickiness for your daily, weekly, and monthly active users. It also shows the rate of returning users versus new users.

The Device tab shows data about your customer’s devices: operating systems, versions, screen sizes, and language.

And finally, the Path explorer dashboard shows your customers’ navigation path into the screens of your applications.

As I mentioned earlier, all the data are available in Amazon Redshift, so you’re free to build other analytics and dashboards.

Pricing and Availability
The Clickstream Analytics solution is available free of charge. You pay for the AWS services provisioned for you, including Kinesis or Amazon Redshift. Cost estimates depend on the configuration that you select. For example, the size of the Kinesis and Amazon Redshift cluster you select for your data ingestion and analytics needs, or the volume of data your applications send to the pipeline both affect the monthly cost of the solution.

To learn how to get started with this solution, take the Clickstream Analytics workshop today and stop sharing your customer and application clickstream data with third-party solutions.

— seb

Position2’s Arena Calibrate helps customers drive marketing efficiency with Amazon QuickSight Embedded

2023-06-30 Vinod Nambiar

Post Syndicated from Vinod Nambiar original https://aws.amazon.com/blogs/big-data/position2s-arena-calibrate-helps-customers-drive-marketing-efficiency-with-amazon-quicksight-embedded/

This is a guest post by Vinod Nambiar from Position².

Position² is a leading US-based growth marketing services provider focused on data-driven strategy and technology to deliver growth with improved return on investment (ROI).

Position² was established in 2006 in Silicon Valley and has a clientele spanning American Express, Lenovo, Fujitsu, and Thales. We work with clients ranging from VC-funded startups to Fortune 500 firms. Our 200-member team is based in the US and Bangalore, comprising marketing and software experts, engineers, data scientists, client management, creative writers, and designers. The team brings deep domain expertise in digital, B2B, B2C, analytics, technology, mobile, marketing automation, and UX/UI domain. Our integrated campaigns are powered by cutting-edge content creation, digital advertising, web design/development, marketing automation, and analytics.

We have built two software products to help our customers drive digital marketing success and growth:

Arena is an easy-to-use, intuitive platform that provides clients with a single interface to engage with Position². It includes project management, process workflows, collaboration, and performance tracking, helping deliver superior customer experience, transparency, and real-time data.
Arena Calibrate is a customizable digital marketing dashboard that helps marketers track their cross-platform performance at a glance, saving them hours of manual work. Its machine learning (ML) engine provides automated insights to improve campaign performance and ROI.

In this post, we share how Arena Calibrate helps our customers drive marketing efficiency using Amazon QuickSight Embedded.

Beyond services to a data automation and reporting platform

Very early on, we realized that in order to move the needle on marketing efficiency, we needed to go beyond marketing services and help clients master their data analytics and reporting.

Marketing data challenges are real, and we previously faced them every day across our agency. We’ve used a wide variety of tools in the market, but most of them ultimately expect the user to spend significant energy performing time-consuming analytics activities, like configuring data/platform connectors and building datasets. After all that, we were still missing out on proactive analysis that identifies trends and uncovers optimization opportunities.

We know there are many businesses out there facing similar challenges. This led us to build Arena Calibrate by using the best ML algorithms, data connectors, and business intelligence (BI) platforms.

Arena Calibrate helps marketers leap ahead with the best tech stack available without breaking their budget.

In-depth insights in time: Enter Amazon QuickSight

The challenge was to get data from multiple platforms into one place accurately, automatically, and easily. Each platform has data in different formats. We wanted to help customers answer questions such as, “How do you track your customers’ journey through the funnel? Can you leverage advanced BI techniques to get meaningful insights when you have the data?”

We evaluated a range of products before zeroing in on Amazon QuickSight, a unified, serverless BI service enabling organizations to deliver data-driven insights to all users.

The three primary reasons we selected QuickSight were:

Better together with AWS – We were already using AWS for our Arena platform, including services such as Amazon Simple Storage Service (Amazon S3), Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Block Store (Amazon EBS), and Application Load Balancer. As a result, we wanted to continue using AWS services to ensure seamless integration within our technology stack.
Customization and automation flexibility – The QuickSight API allowed us to have customization using AWS Identity Access Management (IAM). APIs helped us with authorization and authentication. With IAM APIs, we were able to create users and map them to the required permissions and roles, thereby giving the right permissions to the right dashboards.
Pay as you go model – Consumption-based pricing ensured flexibility for our customers and us by allowing us to innovate with no monetary risks. We could start small and grow with time without being locked into expensive user-based and capacity-based contracts.

Enabling customers to visualize marketing performance and success

The focus of marketing professionals today is twofold: brand building and revenue growth marketing, which ensures that all your marketing operations can be tracked back to revenue and sales. While the former is a longer-term focus, the latter needs operations to be tracked minutely and any changes need to be incorporated immediately to ensure Return on Ad Spend (ROAS).

Arena Calibrate’s advanced software and BI service package allows you to quickly understand the overall health of your marketing funnel and drive impact on lead generation, marketing-qualified leads (MQLs), sales-qualified leads (SQLs), Customer Acquisition Cost (CAC), and ROAS. We provide insights across the funnel from lead to revenue. Our clients use our dashboards to make marketing decisions like campaign performance, website traffic changes, and audience segmentation. The built-in ML engine helps optimize marketing campaigns by adjusting the targeting, messaging, and timing of campaigns based on their performance.

We use QuickSight as the base for our visualization platform, and clients can visualize predictive forecasting models on ad spend and revenue; make decisions based on optimizing spend and arresting churn; and identify high-value customer segments, marketing channel effectiveness, conversion rates, customer acquisition costs, or ROI. We also use QuickSight ML insights to augment trend analysis and automate anomaly detection for our users.

Arena Calibrate enables users to connect to their data sources in minutes and launch their dashboards quickly with ready-made templates. Our BI team is then available to customize the dashboard for deeper insights and gather accurate ROI.

QuickSight is embedded into Arena Calibrate. The QuickSight API allowed us to generate the signed dashboard URL, which was then embedded into Arena Calibrate.

The following screenshot of Arena Calibrate shows the Campaign Performance dashboard.

The Pay-per-click (PPC) & Search Engine Marketing (SEM) dashboard lets you deep dive into key metrics by campaigns, assets, ad formats, landing pages, and device types to better gauge performance.

The E-Commerce dashboard lets you monitor the performance of your online store through key metrics such as abandoned cart rate, average order value, cart conversion rate, and more.

The Cross Channel Campaign Performance dashboard lets you consolidate performance of all relevant metrics across your ad campaigns, enabling you to evaluate campaign ROAS and revenue from one place.

Today, there are dozens of customer dashboards powered by QuickSight across diverse sectors ranging from software as a service (SaaS) to FinTech, IT to healthcare.

Position²’s customer focus

We broadly serve two segments of users. Our Position² services clients automatically have access to the Arena Calibrate dashboard for their business. The pricing for the dashboard is included in the service fee, in most cases.

Arena Calibrate is also available as a standalone BI platform. Here we combine product-led growth (PLG) and sales to attract prospects largely in the SMB and enterprise markets. Prospects can sign up online, connect their platforms, and test-drive Arena Calibrate by integrating up to five standard data sources for free.

We follow a tiered pricing approach based on the number of platforms and the types of platforms (for example, advertising, marketing automation, or CRM) the user needs to integrate, as well as customization needs.

Cost reduction by 50% and accelerated time-to-value with QuickSight

QuickSight allows us to centralize our organization’s BI reporting—both for internal and external purposes. We pull data from the various sources, then use Snowflake as a database and QuickSight as our visualization tool. The QuickSight in-memory engine SPICE (Super-fast, Parallel, In-memory Calculation Engine) on top of its native integration with Snowflake has improved the dashboard load times by up to 20%. Also, the QuickSight console makes it simple to set up datasets for SPICE. There were direct cost savings when we moved to QuickSight because billing is usage-based. We dropped our costs by approximately 50%.

The comprehensive access and security capabilities of QuickSight allow us to support our enterprise customers by providing the right access to relevant dashboards to the right users with ease. QuickSight allows us to make customizations tailored to each customer’s unique requirements. For example, for newer users, we create dashboards by reusing templates and changing the datasets relevant to them—a process we are currently in the process of automating using QuickSight APIs.

QuickSight, along with Snowflake’s transformation and automation capabilities, has allowed us to reduce our dashboard publishing time from 2 days to 2 hours. Templates have helped us reuse the dashboard layout. Customers appreciate seeing various dashboards in one place—across product lines, business lines, and more. With no or very less engineering effort, the BI team is able to build dashboards. We could also enable author embedding with ease, because it doesn’t require any extra coding or licensing with QuickSight.

The BI team also prefers the features such as iFrame-based embedding and reliability of QuickSight over our previous tool. Our prior tool was slow to render, had stability issues, and had lot of downtime, unlike QuickSight, a serverless, auto-scaling BI service.

We have been able to drive effectiveness through a better understanding of spend and the ability to channel the spend to the best possible outcome drivers. In addition, we have seen efficiency gains by faster time to insights and reduced the need to move across multiple tools to access marketing operations data. Overall, we have seen an increase in ROAS across our client base by 3–5% and productivity gains across clients and services teams by 20–25%.

Calibrating the data-driven future

The long-term vision for Arena Calibrate is to empower the data-driven marketer. With advancements in AI and data analytics, Position² is well placed to analyze customers’ data and provide actionable insights and recommendations to improve the performance of campaigns and drive growth. We look forward to the continued collaboration with QuickSight to enable this journey.

To learn more about how QuickSight can help your business with dashboards, reports, and more, visit Amazon QuickSight.

About the Author

Vinod Nambiar is the co-founder at Arena and Managing Director of Position². An engineer with a passion for advertising, Vinod has been instrumental in designing all processes for delivery operations. His passion is to explore how the latest developments in technology can transform digital marketing. He is associated with various global forums in digital marketing and has been part of the faculty at leading marketing institutes in India like Northpoint and Mudra Institute of Communications. When not thinking digital, he can be found doing yoga and reading books ranging from spiritual to fiction. He lives with his wife and two children in Bangalore.

Centralize near-real-time governance through alerts on Amazon Redshift data warehouses for sensitive queries

2023-06-29 Rajdip Chaudhuri

Post Syndicated from Rajdip Chaudhuri original https://aws.amazon.com/blogs/big-data/centralize-near-real-time-governance-through-alerts-on-amazon-redshift-data-warehouses-for-sensitive-queries/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud that delivers powerful and secure insights on all your data with the best price-performance. With Amazon Redshift, you can analyze your data to derive holistic insights about your business and your customers. In many organizations, one or multiple Amazon Redshift data warehouses run daily for data and analytics purposes. Therefore, over time, multiple Data Definition Language (DDL) or Data Control Language (DCL) queries, such as CREATE, ALTER, DROP, GRANT, or REVOKE SQL queries, are run on the Amazon Redshift data warehouse, which are sensitive in nature because they could lead to dropping tables or deleting data, causing disruptions or outages. Tracking such user queries as part of the centralized governance of the data warehouse helps stakeholders understand potential risks and take prompt action to mitigate them following the operational excellence pillar of the AWS Data Analytics Lens. Therefore, for a robust governance mechanism, it’s crucial to alert or notify the database and security administrators on the kind of sensitive queries that are run on the data warehouse, so that prompt remediation actions can be taken if needed.

To address this, in this post we show you how you can automate near-real-time notifications over a Slack channel when certain queries are run on the data warehouse. We also create a simple governance dashboard using a combination of Amazon DynamoDB, Amazon Athena, and Amazon QuickSight.

Solution overview

An Amazon Redshift data warehouse logs information about connections and user activities taking place in databases, which helps monitor the database for security and troubleshooting purposes. These logs can be stored in Amazon Simple Storage Service (Amazon S3) buckets or Amazon CloudWatch. Amazon Redshift logs information in the following log files, and this solution is based on using an Amazon Redshift audit log to CloudWatch as a destination:

Connection log – Logs authentication attempts, connections, and disconnections
User log – Logs information about changes to database user definitions
User activity log – Logs each query before it’s run on the database

The following diagram illustrates the solution architecture.

Solution Architecture

The solution workflow consists of the following steps:

Audit logging is enabled in each Amazon Redshift data warehouse to capture the user activity log in CloudWatch.
Subscription filters on CloudWatch capture the required DDL and DCL commands by providing filter criteria.
The subscription filter triggers an AWS Lambda function for pattern matching.
The Lambda function processes the event data and sends the notification over a Slack channel using a webhook.
The Lambda function stores the data in a DynamoDB table over which a simple dashboard is built using Athena and QuickSight.

Prerequisites

Before starting the implementation, make sure the following requirements are met:

You have an AWS account.
The AWS Region used for this post is us-east-1. However, this solution is relevant in any other Region where the necessary AWS services are available.
Permissions to create Slack a workspace.

Create and configure an Amazon Redshift cluster

To set up your cluster, complete the following steps:

Create a provisioned Amazon Redshift data warehouse.

For this post, we use three Amazon Redshift data warehouses: demo-cluster-ou1, demo-cluster-ou2, and demo-cluster-ou3. In this post, all the Amazon Redshift data warehouses are provisioned clusters. However, the same solution applies for Amazon Redshift Serverless.

To enable audit logging with CloudWatch as the log delivery destination, open an Amazon Redshift cluster and go to the Properties tab.
On the Edit menu, choose Edit audit logging.

Redshift edit audit logging

Select Turn on under Configure audit logging.
Select CloudWatch for Log export type.
Select all three options for User log, Connection log, and User activity log.
Choose Save changes.

Create a parameter group for the clusters with enable_user_activity_logging set as true for each of the clusters.
Modify the cluster to attach the new parameter group to the Amazon Redshift cluster.

For this post, we create three custom parameter groups: custom-param-grp-1, custom-param-grp-2, and custom-param-grp-3 for three clusters.

Note, if you enable only the audit logging feature, but not the associated parameter, the database audit logs log information for only the connection log and user log, but not for the user activity log.

On the CloudWatch console, choose Log groups under Logs in the navigation pane.
Search for /aws/redshift/cluster/demo.

This will show all the log groups created for the Amazon Redshift clusters.

Create a DynamoDB audit table

To create your audit table, complete the following steps:

On the DynamoDB console, choose Tables in the navigation pane.
Choose Create table.
For Table name, enter demo_redshift_audit_logs.
For Partition key, enter partKey with the data type as String.

Keep the table settings as default.
Choose Create table.

Create Slack resources

Slack Incoming Webhooks expect a JSON request with a message string corresponding to a "text" key. They also support message customization, such as adding a user name and icon, or overriding the webhook’s default channel. For more information, see Sending messages using Incoming Webhooks on the Slack website.

The following resources are created for this post:

A Slack workspace named demo_rc
A channel named #blog-demo in the newly created Slack workspace
A new Slack app in the Slack workspace named demo_redshift_ntfn (using the From Scratch option)
Note down the Incoming Webhook URL, which will be used in this post for sending the notifications

Create an IAM role and policy

In this section, we create an AWS Identity and Access Management (IAM) policy that will be attached to an IAM role. The role is then used to grant a Lambda function access to a DynamoDB table. The policy also includes permissions to allow the Lambda function to write log files to Amazon CloudWatch Logs.

On the IAM console, choose Policies in navigation pane.
Choose Create policy.
In the Create policy section, choose the JSON tab and enter the following IAM policy. Make sure you replace your AWS account ID in the policy (replace XXXXXXXX with your AWS account ID).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadWriteTable",
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGetItem",
                "dynamodb:BatchWriteItem",
                "dynamodb:PutItem",
                "dynamodb:GetItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:UpdateItem",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:dynamodb:us-east-1:XXXXXXXX:table/demo_redshift_audit_logs",
                "arn:aws:logs:*:XXXXXXXX:log-group:*:log-stream:*"
            ]
        },
        {
            "Sid": "WriteLogStreamsAndGroups",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:CreateLogGroup"
            ],
            "Resource": "arn:aws:logs:*:XXXXXXXX:log-group:*"
        }
    ]
}

Choose Next: Tags, then choose Next: Review.
Provide the policy name demo_post_policy and choose Create policy.

To apply demo_post_policy to a Lambda function, you first have to attach the policy to an IAM role.

On the IAM console, choose Roles in the navigation pane.
Choose Create role.
Select AWS service and then select Lambda.
Choose Next.

On the Add permissions page, search for demo_post_policy.
Select demo_post_policy from the list of returned search results, then choose Next.

On the Review page, enter demo_post_role for the role and an appropriate description, then choose Create role.

Create a Lambda function

We create a Lambda function with Python 3.9. In the following code, replace the slack_hook parameter with the Slack webhook you copied earlier:

import gzip
import base64
import json
import boto3
import uuid
import re
import urllib3

http = urllib3.PoolManager()
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table("demo_redshift_audit_logs")
slack_hook = "https://hooks.slack.com/services/xxxxxxx"

def exe_wrapper(data):
    cluster_name = (data['logStream'])
    for event in data['logEvents']:
        message = event['message']
        reg = re.match(r"'(?P<ts>\d{4}-\d\d-\d\dT\d\d:\d\d:\d\dZ).*?\bdb=(?P<db>\S*).*?\buser=(?P<user>\S*).*?LOG:\s+(?P<query>.*?);?$", message)
        if reg is not None:
            filter = reg.groupdict()
            ts = filter['ts']
            db = filter['db']
            user = filter['user']
            query = filter['query']
            query_type = ' '.join((query.split(" "))[0 : 2]).upper()
            object = query.split(" ")[2]
            put_dynamodb(ts,cluster_name,db,user,query,query_type,object,message)
            slack_api(cluster_name,db,user,query,query_type,object)
            
def put_dynamodb(timestamp,cluster,database,user,sql,query_type,object,event):
    table.put_item(Item = {
        'partKey': str(uuid.uuid4()),
        'redshiftCluster': cluster,
        'sqlTimestamp' : timestamp,
        'databaseName' : database,
        'userName': user,
        'sqlQuery': sql,
        'queryType' : query_type,
        'objectName': object,
        'rawData': event
        })
        
def slack_api(cluster,database,user,sql,query_type,object):
    payload = {
	'channel': '#blog-demo',
	'username': 'demo_redshift_ntfn',
	'blocks': [{
			'type': 'section',
			'text': {
				'type': 'mrkdwn',
				'text': 'Detected *{}* command\n *Affected Object*: `{}`'.format(query_type, object)
			}
		},
		{
			'type': 'divider'
		},
		{
			'type': 'section',
			'fields': [{
					'type': 'mrkdwn',
					'text': ':desktop_computer: *Cluster Name:*\n{}'.format(cluster)
				},
				{
					'type': 'mrkdwn',
					'text': ':label: *Query Type:*\n{}'.format(query_type)
				},
				{
					'type': 'mrkdwn',
					'text': ':card_index_dividers: *Database Name:*\n{}'.format(database)
				},
				{
					'type': 'mrkdwn',
					'text': ':technologist: *User Name:*\n{}'.format(user)
				}
			]
		},
		{
			'type': 'section',
			'text': {
				'type': 'mrkdwn',
				'text': ':page_facing_up: *SQL Query*\n ```{}```'.format(sql)
			}
		}
	]
	}
    encoded_msg = json.dumps(payload).encode('utf-8')
    resp = http.request('POST',slack_hook, body=encoded_msg)
    print(encoded_msg) 

def lambda_handler(event, context):
    print(f'Logging Event: {event}')
    print(f"Awslog: {event['awslogs']}")
    encoded_zipped_data = event['awslogs']['data']
    print(f'data: {encoded_zipped_data}')
    print(f'type: {type(encoded_zipped_data)}')
    zipped_data = base64.b64decode(encoded_zipped_data)
    data = json.loads(gzip.decompress(zipped_data))
    exe_wrapper(data)

Create your function with the following steps:

On the Lambda console, choose Create function.
Select Author from scratch and for Function name, enter demo_function.
For Runtime, choose Python 3.9.
For Execution role, select Use an existing role and choose demo_post_role as the IAM role.
Choose Create function.

On the Code tab, enter the preceding Lambda function and replace the Slack webhook URL.
Choose Deploy.

Create a CloudWatch subscription filter

We need to create the CloudWatch subscription filter on the useractivitylog log group created by the Amazon Redshift clusters.

On the CloudWatch console, navigate to the log group /aws/redshift/cluster/demo-cluster-ou1/useractivitylog.
On the Subscription filters tab, on the Create menu, choose Create Lambda subscription filter.

Choose demo_function as the Lambda function.
For Log format, choose Other.
Provide the subscription filter pattern as ?create ?alter ?drop ?grant ?revoke.
Provide the filter name as Sensitive Queries demo-cluster-ou1.
Test the filter by selecting the actual log stream. If it has any queries with a match pattern, then you can see some results. For testing, use the following pattern and choose Test pattern.

'2023-04-02T04:18:43Z UTC [ db=dev user=awsuser pid=100 userid=100 xid=100 ]' LOG: alter table my_table alter column string type varchar(16);
'2023-04-02T04:06:08Z UTC [ db=dev user=awsuser pid=100 userid=100 xid=200 ]' LOG: create user rs_user with password '***';

Choose Start streaming.

Repeat the same steps for /aws/redshift/cluster/demo-cluster-ou2/useractivitylog and /aws/redshift/cluster/demo-cluster-ou3/useractivitylog by giving unique subscription filter names.
Complete the preceding steps to create a second subscription filter for each of the Amazon Redshift data warehouses with the filter pattern ?CREATE ?ALTER ?DROP ?GRANT ?REVOKE, ensuring uppercase SQL commands are also captured through this solution.

Test the solution

In this section, we test the solution in the three Amazon Redshift clusters that we created in the previous steps and check for the notifications of the commands on the Slack channel as per the CloudWatch subscription filters as well as data getting ingested in the DynamoDB table. We use the following commands to test the solution; however, this is not restricted to these commands only. You can check with other DDL commands as per the filter criteria in your Amazon Redshift cluster.

create schema sales;
create schema marketing;
create table dev.public.demo_test_table_1  (id int, string varchar(10));
create table dev.public.demo_test_table_2  (empid int, empname varchar(100));
alter table dev.public.category alter column catdesc type varchar(65);
drop table dev.public.demo_test_table_1;
drop table dev.public.demo_test_table_2;

In the Slack channel, details of the notifications look like the following screenshot.

To get the results in DynamoDB, complete the following steps:

On the DynamoDB console, choose Explore items under Tables in the navigation pane.
In the Tables pane, select demo_redshift_audit_logs.
Select Scan and Run to get the results in the table.

Athena federation over the DynamoDB table

The Athena DynamoDB connector enables Athena to communicate with DynamoDB so that you can query your tables with SQL. As part of the prerequisites for this, deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more details, refer to Deploying a data source connector or Using the AWS Serverless Application Repository to deploy a data source connector. For this post, we use the Athena console.

On the Athena console, under Administration in the navigation pane, choose Data sources.
Choose Create data source.

Select the data source as Amazon DynamoDB, then choose Next.

For Data source name, enter dynamo_db.
For Lambda function, choose Create Lambda function to open a new window with the Lambda console.

Under Application settings, enter the following information:
- For Application name, enter AthenaDynamoDBConnector.
- For SpillBucket, enter the name of an S3 bucket.
- For AthenaCatalogName, enter dynamo.
- For DisableSpillEncryption, enter false.
- For LambdaMemory, enter 3008.
- For LambdaTimeout, enter 900.
- For SpillPrefix, enter athena-spill-dynamo.

Select I acknowledge that this app creates custom IAM roles and choose Deploy.
Wait for the function to deploy, then return to the Athena window and choose the refresh icon next to Lambda function.
Select the newly deployed Lambda function and choose Next.

Review the information and choose Create data source.
Navigate back to the query editor, then choose dynamo_db for Data source and default for Database.
Run the following query in the editor to check the sample data:

SELECT partkey,
       redshiftcluster,
       databasename,
       objectname,
       username,
       querytype,
       sqltimestamp,
       sqlquery,
       rawdata
FROM dynamo_db.default.demo_redshift_audit_logs limit 10;

Visualize the data in QuickSight

In this section, we create a simple governance dashboard in QuickSight using Athena in direct query mode to query the record set, which is persistently stored in a DynamoDB table.

Sign up for QuickSight on the QuickSight console.
Select Amazon Athena as a resource.
Choose Lambda and select the Lambda function created for DynamoDB federation.

Create a new dataset in QuickSight with Athena as the source.
Provide the name of the data source name as demo_blog.
Choose dynamo_db for Catalog, default for Database, and demo_redshift_audit_logs for Table.
Choose Edit/Preview data.

Choose String in the sqlTimestamp column and choose Date.

In the dialog box that appears, enter the data format yyyy-MM-dd'T'HH:mm:ssZZ.
Choose Validate and Update.

Choose PUBLISH & VISUALIZE.
Choose Interactive sheet and choose CREATE.

This will take you to the visualization page to create the analysis on QuickSight.

Create a governance dashboard with the appropriate visualization type.

Refer to the Amazon QuickSight learning videos in QuickSight community for basic to advanced level of authoring. The following screenshot is a sample visualization created on this data.

Clean up

Clean up your resources with the following steps:

Conclusion

Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. In this post, we discussed a pattern to implement a governance mechanism to identify and notify sensitive DDL/DCL queries on an Amazon Redshift data warehouse, and created a quick dashboard to enable the DBA and security team to take timely and prompt action as required. Additionally, you can extend this solution to include DDL commands used for Amazon Redshift data sharing across clusters.

Operational excellence is a critical part of the overall data governance on creating a modern data architecture, as it’s a great enabler to drive our customers’ business. Ideally, any data governance implementation is a combination of people, process, and technology that organizations use to ensure the quality and security of their data throughout its lifecycle. Use these instructions to set up your automated notification mechanism as sensitive queries are detected as well as create a quick dashboard on QuickSight to track the activities over time.

About the Authors

Rajdip Chaudhuri is a Senior Solutions Architect with Amazon Web Services specializing in data and analytics. He enjoys working with AWS customers and partners on data and analytics requirements. In his spare time, he enjoys soccer and movies.

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

iostudio delivers key metrics to public sector recruiters with Amazon QuickSight

2023-06-27 Jon Walker

Post Syndicated from Jon Walker original https://aws.amazon.com/blogs/big-data/iostudio-delivers-key-metrics-to-public-sector-recruiters-with-amazon-quicksight/

This is a guest post by Jon Walker and Ari Orlinsky from iostudio written in collaboration with Sumitha AP from AWS.

iostudio is an award-winning marketing agency based in Nashville, TN. We build solutions that bring brands to life, making content and platforms work together. We serve our customers, who range from small technology startups to government agencies, as a social media strategy partner with in-house video production capabilities, a creative resource that provides data-driven insights about a campaign’s performance, a content marketing machine with connections across the United States, and a sophisticated customer engagement partner.

We wanted to include interactive, real-time visualizations to support recruiters from one of our government clients. Our previous solution offered visualization of key metrics, but point-in-time snapshots produced only in PDF format. We chose Amazon QuickSight because it gave us dynamic and interactive dashboards embedded in our application, while saving us money and development time.

In this post, we discuss how we built a solution using QuickSight that delivers real-time visibility of key metrics to public sector recruiters.

Modernized analytics and reporting

At iostudio, we faced the challenge of modernizing our government client’s static recruitment marketing analytics solution. Given the limitations of the static PDF charts used for its recruitment marketing data, we recognized the opportunity to introduce real-time interactive dashboards to improve insights needed to drive recruitment marketing initiatives. With the solution we built using QuickSight, recruiters are given access to rich visualizations on interactive dashboards in real time, eliminating uncertainty about whether the information they are looking at is accurate.

We created a QuickSight dashboard as a proof of concept, and it surpassed our expectations because of its advanced visualizations. We embedded QuickSight dashboards into our web application, making it seamless for recruiters to log in and get the insights they need. Because we used QuickSight anonymous embedding APIs, we were able to do this without registering and managing all our users in QuickSight. After this initial proof of concept, we gained confidence in our solution quickly, and we were able to build and launch our solution to production within 4–6 weeks. As a result, we reduced our development time by 60%, allowing us to bring this embedded analytics solution to our government client faster. We also saved 75% on our annual external software costs. Switching to QuickSight has enabled us to better serve our customers.

Taking care of sensitive data

iostudio operates in the AWS GovCloud environment because many of our customers are government agencies. This makes protecting customer data even more important. When building our solution for recruiters, we needed to ensure that recruiters can only see data related to the marketing campaigns that are assigned to them. We used row-level security with tag-based rules in QuickSight to restrict data on a per-user basis.

Integrating with AWS

With the AWS technology stack, we were able to create a custom-fit solution using a regionally diverse, service-oriented model to improve our time to deliver interactive reporting to our customers while also trimming costs. With AWS, we aren’t forced to pay for a bundle with services that we don’t use. We can pick what we need, and use what we need with pay-as-you-go pricing.

Our client had previously been using a data integration tool called Pentaho to get data from different sources into one place, which wasn’t an optimal solution. The following diagram illustrates our updated solution architecture using AWS services.

The custom-fit solution that we built uses AWS Lambda to extract data from three key data sources: a referral marketing tool, Google analytics data, and call center operations data. The data lands in an Amazon Simple Storage Service (Amazon S3) bucket, and from there AWS Glue jobs are used to transform the data and load it into another S3 bucket. QuickSight is connected to this data using Amazon Athena, helping us create real-time and interactive dashboards. This end-to-end extract, transform, and load (ETL) process is run with the help of AWS Step Functions, giving us the ability to orchestrate and monitor all the steps of the ETL process seamlessly.

Conclusion

By switching to QuickSight, we were able to provide our client’s recruiters with key metrics in real time, while reducing our development time and cutting costs significantly. Because the components in the architecture are reusable and interoperable, we were able to extend this solution to even more of our customers.

To learn more about how you can embed customized data visuals and interactive dashboards into any application, visit Amazon QuickSight Embedded.

About the Authors

Jon Walker, Senior Director of Engineering, is a native Nashvillian who has been in the technology field for over 19 years. He oversees enterprise-wide system engineering, development, and technology programs for large federal and DoD clients, as well as iostudio’s commercial clients.

Ari Orlinsky, Director of Information Services, leads iostudio’s Information Systems Department, responsible for AWS Cloud, SaaS applications, on-premises technology, risk assessment, compliance, budgeting, and human resource management. With nearly 20 years’ experience in strategic IS and technology operations, Ari has developed a keen enthusiasm for emerging technologies, DOD security and compliance, large format interactive experiences, and customer service communication technologies. As iostudio’s Technical Product Owner across internal and client-facing applications including a cloud-based omni-channel contact center platform, he advocates for secure deployment of applicable technologies to the cloud while ensuring resilient on-premises data center solutions.

Sumitha AP is a Sr. Solutions Architect at AWS. Sumitha works with SMB customers to help them design secure, scalable, reliable, and cost-effective solutions in the AWS Cloud. She has a focus on data and analytics and provides guidance on building analytics solutions on AWS.

Enable business users to analyze large datasets in your data lake with Amazon QuickSight

2023-06-23 Eliad Maimon

Post Syndicated from Eliad Maimon original https://aws.amazon.com/blogs/big-data/enable-business-users-to-analyze-large-datasets-in-your-data-lake-with-amazon-quicksight/

This blog post is co-written with Ori Nakar from Imperva.

Imperva Cloud WAF protects hundreds of thousands of websites and blocks billions of security events every day. Events and many other security data types are stored in Imperva’s Threat Research Multi-Region data lake.

Imperva harnesses data to improve their business outcomes. To enable this transformation to a data-driven organization, Imperva brings together data from structured, semi-structured, and unstructured sources into a data lake. As part of their solution, they are using Amazon QuickSight to unlock insights from their data.

Imperva’s data lake is based on Amazon Simple Storage Service (Amazon S3), where data is continually loaded. Imperva’s data lake has a few dozen different datasets, in the scale of petabytes. Each day, TBs of new data is added to the data lake, which is then transformed, aggregated, partitioned, and compressed.

In this post, we explain how Imperva’s solution enables users across the organization to explore, visualize, and analyze data using Amazon Redshift Serverless, Amazon Athena, and QuickSight.

Challenges and needs

A modern data strategy gives you a comprehensive plan to manage, access, analyze, and act on data. AWS provides the most complete set of services for the entire end-to-end data journey for all workloads, all types of data, and all desired business outcomes. In turn, this makes AWS the best place to unlock value from your data and turn it into insight.

Redshift Serverless is a serverless option of Amazon Redshift that allows you to run and scale analytics without having to provision and manage data warehouse clusters. Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver high performance for all your analytics. You just need to load and query your data, and you only pay for the compute used for the duration of the workloads on a per-second basis. Redshift Serverless is ideal when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, straightforward to use, and makes it simple for anyone with SQL skills to quickly analyze large-scale datasets in multiple Regions.

QuickSight is a cloud-native business intelligence (BI) service that you can use to visually analyze data and share interactive dashboards with all users in the organization. QuickSight is fully managed and serverless, requires no client downloads for dashboard creation, and has a pay-per-session pricing model that allows you to pay for dashboard consumption. Imperva uses QuickSight to enable users with no technical expertise, from different teams such as marketing, product, sales, and others, to extract insight from the data without the help of data or research teams.

QuickSight offers SPICE, an in-memory, cloud-native data store that allows end-users to interactively explore data. SPICE provides consistently fast query performance and automatically scales for high concurrency. With SPICE, you save time and cost because you don’t need to retrieve data from the data source (whether a database or data warehouse) every time you change an analysis or update a visual, and you remove the load of concurrent access or analytical complexity off the underlying data source with the data.

In order for QuickSight to consume data from the data lake, some of the data undergoes additional transformations, filters, joins, and aggregations. Imperva cleans their data by filtering incomplete records, reducing the number of records by aggregations, and applying internal logic to curate millions of security incidents out of hundreds of millions of records.

Imperva had the following requirements for their solution:

High performance with low query latency to enable interactive dashboards
Continuously update and append data to queryable sources from the data lake
Data freshness of up to 1 day
Low cost
Engineering efficiency

The challenge faced by Imperva and many other companies is how to create a big data extract, transform, and load (ETL) pipeline solution that fits these requirements.

In this post, we review two approaches Imperva implemented to address their challenges and meet their requirements. The solutions can be easily implemented while maintaining engineering efficiency, especially with the introduction of Redshift Serverless.

Imperva’s solutions

Imperva needed to have the data lake’s data available through QuickSight continuously. The following solutions were chosen to connect the data lake to QuickSight:

QuickSight caching layer, SPICE – Use Athena to query the data into a QuickSight SPICE dataset
Redshift Serverless – Copy the data to Redshift Serverless and use it as a data source

Our recommendation is to use a solution based on the use case. Each solution has its own advantages and challenges, which we discuss as part of this post.

The high-level flow is the following:

Data is continuously updated from the data lake into either Redshift Serverless or the QuickSight caching layer, SPICE
An internal user can create an analysis and publish it as a dashboard for other internal or external users

The following architecture diagram shows the high-level flow.

High-level flow

In the following sections, we discuss the details about the flow and the different solutions, including a comparison between them, which can help you choose the right solution for you.

Solution 1: Query with Athena and import to SPICE

QuickSight provides inherent capabilities to upload data using Athena into SPICE, which is a straightforward approach that meets Imperva’s requirements regarding simple data management. For example, it suits stable data flows without frequent exceptions, which may result in SPICE full refresh.

You can use Athena to load data into a QuickSight SPICE dataset, and then use the SPICE incremental upload option to load new data to the dataset. A QuickSight dataset will be connected to a table or a view accessible by Athena. A time column (like day or hour) is used for incremental updates. The following table summarizes the options and details.

Option	Description	Pros/Cons
Existing table	Use the built-in option by QuickSight.	Not flexible—the table is imported as is in the data lake.
Dedicated view	A view will let you better control the data in your dataset. It allows joining data, aggregation, or choosing a filter like the date you want to start importing data from. Note that QuickSight allows building a dataset based on custom SQL, but this option doesn’t allow incremental updates.	Large Athena resource consumption on a full refresh.
Dedicated ETL	Create a dedicated ETL process, which is similar to a view, but unlike the view, it allows reuse of the results in case of a full refresh. In case your ETL or view contains grouping or other complex operations, you know that these operations will be done only by the ETL process, according to the schedule you define.	Most flexible, but requires ETL development and implementation and additional Amazon S3 storage.

The following architecture diagram details the options for loading data by Athena into SPICE.

Architecture diagram details the options for loading data by Athena into SPICE

The following code provides a SQL example for a view creation. We assume the existence of two tables, customers and events, with one join column called customer_id. The view is used to do the following:

Aggregate the data from daily to weekly, and reduce the number of rows
Control the start date of the dataset (in this case, 30 weeks back)
Join the data to add more columns (customer_type) and filter it

CREATE VIEW my_dataset AS
SELECT DATE_ADD('day', -DAY_OF_WEEK(day) + 1, day) AS first_day_of_week,
       customer_type, event_type, COUNT(events) AS total_events
FROM my_events INNER JOIN my_customers USING (customer_id)
WHERE customer_type NOT IN ('Reseller')
      AND day BETWEEN DATE_ADD('DAY',-7 * 30 -DAY_OF_WEEK(CURRENT_DATE) + 1, CURRENT_DATE)
      AND DATE_ADD('DAY', -DAY_OF_WEEK(CURRENT_DATE), CURRENT_DATE)
GROUP BY 1, 2, 3

Solution 2: Load data into Redshift Serverless

Redshift Serverless provides full visibility to the data, which can be viewed or edited at any time. For example, if there is a delay in adding data to the data lake or the data isn’t properly added, with Redshift Serverless, you can edit data using SQL statements or retry data loading. Redshift Serverless is a scalable solution that doesn’t have a dataset size limitation.

Redshift Serverless is used as a serving layer for the datasets that are to be used in QuickSight. The pricing model for Redshift Serverless is based on storage utilization and the run of queries; idle compute resources have no associated cost. Setting up a cluster is simple and doesn’t require you to choose node types or amount of storage. You simply load the data to tables you create and start working.

To create a new dataset, you need to create an Amazon Redshift table and run the following process every time data is added:

Transform the data using an ETL process (optional):
- Read data from the tables.
- Transform to the QuickSight dataset schema.
- Write the data to an S3 bucket and load it to Amazon Redshift.
Delete old data if it exists to avoid duplicate data.
Load the data using the COPY command.

The following architecture diagram details the options to load data into Redshift Serverless with or without an ETL process.

Architecture diagram details the options to load data into Redshift Serverless with or without an ETL process

The Amazon Redshift COPY command is simple and fast. For example, to copy daily partition Parquet data, use the following code:

COPY my_table
FROM 's3://my_bucket/my_table/day=2022-01-01'
IAM_ROLE 'my_role' 
FORMAT AS PARQUET

Use the following COPY command to load the output file of the ETL process. Values will be truncated according to Amazon Redshift column size. The column truncation is important because, unlike in the data lake, in Amazon Redshift, the column size must be set. This option prevents COPY failures:

COPY my_table
FROM 's3://my_bucket/my_table/day=2022-01-01'
IAM_ROLE 'my_role' 
FORMAT AS JSON GZIP TRUNCATECOLUMNS

The Amazon Redshift COPY operation provides many benefits and options. It supports multiple formats as well as column mapping, escaping, and more. It also allows more control over data format, object size, and options to tune the COPY operation for improved performance. Unlike data in the data lake, Amazon Redshift has column length specifications. We use TRUNCATECOLUMNS to truncates the data in columns to the appropriate number of characters so that it fits the column specification.

Using this method provides full control over the data. In case of a problem, we can repair parts of the table by deleting old data and loading the data again. It’s also possible to use the QuickSight dataset JOIN option, which is not available in SPICE when using incremental update.

Additional benefit of this approach is that the data is available for other clients and services looking to use the same data, such as SQL clients or notebooks servers such as Apache Zeppelin.

Conclusion

QuickSight allows Imperva to expose business data to various departments within an organization. In the post, we explored approaches for importing data from a data lake to QuickSight, whether continuously or incrementally.

However, it’s important to note that there is no one-size-fits-all solution; the optimal approach will depend on the specific use case. Both options—continuous and incremental updates—are scalable and flexible, with no significant cost differences observed for our dataset and access patterns.

Imperva found incremental refresh to be very useful and uses it for simple data management. For more complex datasets, Imperva has benefitted from the greater scalability and flexibility provided by Redshift Serverless.

In cases where a higher degree of control over the datasets was required, Imperva chose Redshift Serverless so that data issues could be addressed promptly by deleting, updating, or inserting new records as necessary.

With the integration of dashboards, individuals can now access data that was previously inaccessible to them. Moreover, QuickSight has played a crucial role in streamlining our data distribution processes, enabling data accessibility across all departments within the organization.

To learn more, visit Amazon QuickSight.

About the Authors

Eliad Maimon is a Senior Startups Solutions Architect at AWS in Tel-Aviv with over 20 years of experience in architecting, building, and maintaining software products. He creates architectural best practices and collaborates with customers to leverage cloud and innovation, transforming businesses and disrupting markets. Eliad is specializing in machine learning on AWS, with a focus in areas such as generative AI, MLOps, and Amazon SageMaker.

Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure.

Optimize queries using dataset parameters in Amazon QuickSight

2023-06-19 Anwar Ali

Post Syndicated from Anwar Ali original https://aws.amazon.com/blogs/big-data/optimize-queries-using-dataset-parameters-in-amazon-quicksight/

Amazon QuickSight powers data-driven organizations with unified business intelligence (BI) at hyperscale. With QuickSight, all users can meet varying analytic needs from the same source of truth through modern interactive dashboards, paginated reports, embedded analytics and natural language queries.

We have introduced dataset parameters, a new kind of parameter in QuickSight that can help you create interactive experiences in your dashboards. In this post, we dive deeper into what dataset parameters are, explain the key differences between dataset and analysis parameters, and discuss different use cases for dataset parameters along their benefits.

Introduction to dataset parameters

Before going deep into dataset parameters, let’s first discuss QuickSight analysis parameters. QuickSight analysis parameters are named variables that can transfer a value for use by an action or an object. Parameters help users create interactive experiences in their dashboards. You can tie parameters with other features in the QuickSight analysis. For example, a dashboard user can reference a parameter value in multiple places, using controls, filters, and actions, and also within calculated fields, narratives, and dynamic titles. Then the visuals in the dashboard react to the user’s selection of parameter value. Parameters can also help connect one dashboard to another, allowing a dashboard user to drill down into data that’s in a different analysis.

Dataset parameters, on the other hand, are defined at the dataset level. With dataset parameters, authors can optimize the experience and load time of dashboards that are connected live to external SQL-based sources. When readers interact with their data, the selection and actions they make in controls, filters, and visuals can be propagated to the data sources via live, custom, parameterized SQL queries. By mapping multiple dataset parameters to analysis parameters, users can create a wide variety of experiences using controls, user actions, parameterized URLs, and calculated fields, as well as dynamic visuals’ titles and insights.

In the following example, dataset owners connected via direct query to a table containing data about taxi rides in New York. They can add a WHERE clause in their custom SQL to filter the dataset based on the end-user’s input of a specific pickup date that will be later provided by the dashboard readers. In the SQL query, the rows are filtered by the date in the dataset parameter <<$pPickupDate>> if it matches the date in the pickupdate column. This way, the dataset size can be significantly smaller for users that are only interested in data for a specific taxi ride date. See the following code:

SELECT *
FROM nytaxidata
WHERE pickupdate = <<$pPickupDate>>

To allow users to provide multiple values in the parameter, you can create a multi-value parameter (for example, pPickupDates), and insert the parameter into an IN phrase as follows:

SELECT *
FROM nytaxidata
WHERE pickupdate in (<<$pPickupDates>>)

Use cases for dataset parameters

In this section, we discuss common use cases using dataset parameters and their benefits.

Optimized custom SQL in direct queries

With dataset parameters, you no longer have to trade-off between the flexibility of using custom SQL logic and the performance of an optimized SQL query. Parameterized datasets can be filtered to a relatively smaller result set when loaded. Authors and readers can benefit from the faster load of analyses and dashboards for the first time using default values, as well as for later queries when data is sliced and diced using filter controls on the dashboard. Also, data owners benefit from their datasets putting less load on backend database resources, making it more scalable and performant to serve higher user concurrency.

The performance gains will be evident when you work with direct query datasets that have complex custom SQL, such as nested queries that have to filter the data in the inner sections of the query.

Generic datasets reusable across analyses

Dataset parameters can enable datasets to be largely reused across various analyses, thereby reducing the effort for the data owners to prepare and maintain the datasets. Whether you have a SPICE dataset or direct query dataset, with dataset parameters, you can port calculated field referencing parameters from the analysis to the dataset. Authors can now reuse calculated fields referencing parameters created by dataset owners in a dataset, rather than recreate these fields across multiple analysis.

With the option to port parameter-dependent calculated fields from the analysis to the underlying datasets, dataset parameters can help you create the same calculated fields in the dataset and reuse them across multiple analyses. This is important for governance use cases as well: dataset owners can move the parameter-dependent calculated fields from the analysis to protect the business logic, ensuring that their calculated fields can’t be modified by analyses’ authors.

Simpler dataset maintenance with repeatable variables

When you have a dataset that refers to a static value (placeholder) in multiple places in custom SQL and calculated fields, you can now create a dataset parameter and reuse it in multiple places. This will help in better code maintainability. (Note that inserting parameters in custom SQL is only available in direct query.)

Solution overview

In this scenario, we create a custom SQL direct query dataset to observe unoptimized SQL queries that are generated without dataset parameters, and demonstrate how your current custom SQL queries run if you don’t use dataset parameters. Then we modify the custom SQL, add the dataset parameter, and show the optimized query generated for the same dataset if we use dataset parameters.

In this example, we use an Amazon RDS for PostgreSQL database. However, this feature will work with any SQL-based data source in QuickSight.

Query your data with analysis parameters

To set up your data source, dataset, and analysis, complete the following steps. If you’re using real data, you can skip to the next section.

Create a QuickSight data source.

The following screenshot shows sample connection details.

create a datasource

Create a new direct query custom SQL dataset.

We are using sample data from NYC OpenData for New York taxi rides with a subset of approximately 1 million records. The data is loaded in an RDS for PostgreSQL database table called nytaxidata.

create a sample dataset nytaxidata

Create a sample analysis using the dataset you just created. Choose the table visual and add a few columns from the Fields list.

create a sample analysis using nytaxidata dataset

Reload the analysis and observe the query generated on the PostgreSQL database.

You will notice it loads the full dataset (select * from nytaxidata) as referenced in the screenshot below from RDS Performance Insight.

SQL from performance insight, unoptimized SQL inner query without where clause

Add an analysis parameter-based filter control to the QuickSight analysis. Change the value of this filter control (analysis parameter in this case).

creating analysis parameter with a control

The inner query over the dataset still uses custom SQL without using the filter in the WHERE clause. This filter control parameter is still part of the WHERE clause of the outer query, so the custom SQL fetches the complete result set as part of the inner query. This may not be the case if you use database tables as a dataset rather than a custom SQL query as a dataset. With a dataset based directly on tables, parameter values are passed to the database in the WHERE clause.

SQL from performance insight, unoptimized SQL inner query without where clause with analysis parameter

So how do we overcome the challenge of being able to include the parameter in the WHERE clause in custom SQL datasets? With dataset parameters!

Optimize your query with dataset parameters

Let’s look at a few scenarios where we can use dataset parameters to send more optimized queries to the database.

Create a dataset parameter (for example, pDSfareamount) and add it to the WHERE clause with an equality predicate in the custom SQL.Observe if there is any change in the SQL query that was passed to the database.

creating dataset parameter

This time, you will see optimized SQL generated using the default parameter value in the WHERE clause of the inner query (select * from nytaxidata where fare_amount=0). This results in better query performance for direct query datasets.

optimized sql generated with dataset parameter

Map dataset parameters with analysis parameters

Dataset parameters can be mapped to analysis parameters and user-selected values can pass to the dataset parameters from the interactions on the dashboard at run time.

You can use a single analysis parameter and map it to multiple dataset parameters. The parent analysis parameter can now be linked with a filter control or an action, and can help you filter multiple datasets based on custom SQL.

In this section, we map a dataset parameter with an analysis parameter and bind it with a filter control at runtime.

First, we create an analysis parameter and map it to a dataset parameter (we use the dataset parameter we created earlier).

mapping analysis parameter with a dataset parameter

Now the analysis parameter (for this example, pAfareamount) is created. You can create the control object Fare Amount to dynamically change the dataset parameter value from the analysis or dashboard using a parameter control. You can bind pAfareamount with a QuickSight filter to pass values to the dataset parameter dynamically. When you’re changing values in a parameter control, you will find optimized SQL on the backend database with the WHERE predicate in inner query generated.

chaing value of analysis parameter mapped to dataste parameter via filter control

Additional examples using dataset parameters

So far, we have used dataset parameters with an equality predicate.Let’s look at a few more scenarios using dataset parameters.

The following screenshot demonstrates using a dataset parameter with a range predicate of custom SQL.

dataset parameter with non equality predicate

The following example illustrates using two dataset parameters with a between operator.

two dataset parameters with between operator

The following example shows using a dataset parameter within a calculation.

dataset parameter used in calculated field based on ifelse condition

We can also use a dataset parameter with a scalar user-defined function (UDF). In the following example, we have a scalar function is_holiday(pickupdate), which takes a pickupdate as a parameter and returns a flag of 0 or 1 based on whether pickupdate is a public holiday.

dataset parameter used with scalar user defined function

Additionally, we can use a dataset parameter to derive a calculated field. In the following example, we need to calculate the surcharge_amount dynamically based on a value specified at runtime and the number of passengers. We use a dataset parameter along with a case statement to calculate the desired surcharge_amount.

dataset paramter with calculated field case statement

The final example illustrates how to move calculations using parameters in the analysis to the dataset for reusability.

porting dataset parameter from analysis to dataset

Dataset parameter limitations

The following are the known limitations (as of this writing) that you may encounter when working with dataset parameters in QuickSight:

Dataset parameters can’t be inserted into custom SQL of datasets stored in SPICE.
Dynamic defaults can only be configured on the analysis page of the analysis that is using the dataset. You can’t configure a dynamic default at the dataset level.
The Select all option is not supported on multi-value controls of analysis parameters that are mapped to dataset parameters (but there is a workaround that you can follow).
Cascading controls are not supported for dataset parameters.
Dataset parameters can only be used by dataset filters when the dataset is using a direct query.
When dashboard readers schedule emailed reports, selected controls don’t propagate to the dataset parameters that are included in the report that is attached to the email. Instead, the default values of the parameters are used.

Refer to Using dataset parameters in Amazon QuickSight for more information.

Conclusion

In this post, we showed you how to create QuickSight dataset parameters and map them to analysis parameters. Dataset parameters help improve your QuickSight dashboard performance for direct query custom SQL datasets by generating optimized SQL queries. We also showed a few examples of how to use dataset parameters in SQL range predicates, calculated fields, scalar UDFs, and case statements.

Dataset parameters enable dataset owners to centrally create and govern parameter-dependent calculated fields at the dataset level. Such calculated fields can be reused across multiple analyses, and cannot be tampered with by analysis authors.

We hope you will find dataset parameters in QuickSight useful. We have already seen how the feature is creatively used in a wide range of use cases. We recommend that you review your existing direct query custom SQL datasets in your QuickSight deployment to look for candidates for optimization, or take advantage of the other benefits of dataset parameters. For example, BI teams can benefit from dataset parameters by reusing the same dataset with different values in the parameter to analyze different slices of the same data, such as different regions, products, or customers by industry segments.

Are you considering migrating legacy reports to QuickSight? Dataset parameters can help enterprise BI developers reduce the migration effort of legacy reports that already have parameterized SQL queries in the legacy queries. These SQL queries can be passed along their parameters to QuickSight datasets via automations with the help of QuickSight APIs (and a few adjustments to the queries if the parameters are marked differently).

For more information on dataset parameters, refer to Using dataset parameters in Amazon QuickSight.

About the authors

Anwar Ali is a Specialist Solutions Architect for Amazon QuickSight. Anwar has over 18 years of experience implementing enterprise business intelligence (BI), data analytics and database solutions . He specializes in integration of BI solutions with business applications, helping customers in BI architecture design patterns and best practices.

Salim Khan is a Specialist Solutions Architect for Amazon QuickSight. Salim has over 16 years of experience implementing enterprise business intelligence (BI) solutions. Prior to AWS, Salim worked as a BI consultant catering to industry verticals like Automotive, Healthcare, Entertainment, Consumer, Publishing and Financial Services. He has delivered business intelligence, data warehousing, data integration and master data management solutions across enterprises.

Gil Raviv is a Principal Product Manager for Amazon QuickSight, AWS’ cloud-native, fully managed SaaS BI service. As a thought-leader in BI, Gil accelerated the growth of global BI practices at AWS and Avanade, and has guided Fortune 1000 enterprises in their Data & AI journey. As a passionate evangelist, author and blogger of low-code/no-code data prep and analytic tools, Gil was awarded 5 times as a Microsoft MVP (Most Valuable Professional).

Best practices for enabling business users to answer questions about data using natural language in Amazon QuickSight

2023-06-16 Amy Laresch

Post Syndicated from Amy Laresch original https://aws.amazon.com/blogs/big-data/best-practices-for-enabling-business-users-to-answer-questions-about-data-using-natural-language-in-amazon-quicksight/

In this post, we explain how you can enable business users to ask and answer questions about data using their everyday business language by using the Amazon QuickSight natural language query function, Amazon QuickSight Q.

QuickSight is a unified BI service providing modern interactive dashboards, natural language querying, paginated reports, machine learning (ML) insights, and embedded analytics at scale. Powered by ML, Q uses natural language processing (NLP) to answer your business questions quickly. Q empowers any user in an organization to start asking questions using their own language. Q uses the same QuickSight datasets you use for your dashboards and reports so your data is governed and secured. Just as data is prepared visually using dashboards and reports, it can be readied for language-based interactions using a topic. Topics are collections of one or more datasets that represent a subject area that your business users can ask questions about. To learn how to create a topic, refer to Creating Amazon QuickSight Q topics.

With automated data preparation in QuickSight Q, the model will do a lot of the topic setup for you, but there is some context that is specific to your business that you need to provide. To learn more about the initial setup work that Q does behind the scenes, check out New – Announcing Automated Data Preparation for Amazon QuickSight Q.

Business users can access Q from the QuickSight console or embedded in your website or application. To learn how to embed the Q bar, refer to Embedding the Amazon QuickSight Q search bar for registered users or anonymous (unregistered) users. To see examples of embedded dashboards with Q, refer to the QuickSight DemoCentral.

Once you have a topic shared with your business users, they can ask their own questions and save questions to their pinboard as seen in GIF 1.

QuickSight authors can also add their Q visuals straight to an analysis to speed up dashboard creation, as seen in GIF 2.

This post assumes you’re familiar with building visual analytics in dashboards or reports, and shares new and different strategies needed to build natural language interfaces that are simple to use.

In this post, we discuss the following:

The importance of starting with a narrow and focused use case
Why and how to teach the system your unique business language
How to get success by providing support and having a feedback loop

If you don’t have Q enabled yet, refer to Getting started with Amazon QuickSight Q or watch the following video.

Follow along

In the following examples, we often refer to two out-of-the-box sample topics, Product Sales and Student Enrollment Statistics, so you can follow along as you go. We recommend creating the topics now before continuing with this post, because they take a few minutes to be ready.

Understand your users

Before we jump into solutions, let’s talk about when natural language query (NLQ) capabilities are right for your use case. NLQ is a fast way for a business user who is an expert in their business area to flexibly answer a large variety of questions from a scoped data domain. NLQ doesn’t replace the need for dashboards. Instead, when designed to augment a dashboard or reporting use case, NLQ helps business users get customized answers about specific details without asking a business analyst for help.

It’s critical to have a well-understood use case because language is inherently complex. There are many ways to refer to the same concept. For example, a university might refer to “classes” several ways, such as “courses,” “programs,” or “enrollments.” Language also has inherent ambiguity—“top students” might mean by highest GPA to one person and highest number of extracurriculars to another. By understanding the use case up front, you can uncover areas of potential ambiguity and build that knowledge directly into the topic.

For example, the AWS Analytics sales leadership team uses QuickSight and Q to track key metrics for their region as part of their monthly business review. When I worked with the sales leaders, I learned their preferred terminology and business language through our usability sessions. One observation I made was that they referred to the data field Sales Amortized Revenue as “adrr”. With these learnings, I could easily add this context to the topic using synonyms, which I cover in detail below. One of the sales leaders shared, “This will be awesome for next month when I write my MBR. What previously took a couple of hours, I can now do in a few minutes. Now I can spend more time working to deliver my customer’s outcomes.” If the sales leader asked a question about “adrr” but that connection was not included in their Q topic, then the leader would feel misunderstood and revert back to their original, but slower, ways of finding the answer. Check out more QuickSight use cases and success stories on the AWS Big Data Blog.

Start small

In this section, we share a few common challenges and considerations when getting started with Q.

Data can contain overlapping words

One pitfall to look out for is any fields with long strings, like survey write-in responses, product descriptions, and so on. This type of data introduces additional lexical complexity for readers to navigate. In other words, when an end-user asks a question, there is a higher chance that a word in one of the strings will overlap with other relevant fields, such as a survey write-in that mentions a product name in your Product field. Other non-descriptor fields can also contain overlaps. You can have two or more field names with lexical overlap, and the same across values, and even between fields and values. For example, let’s say you have a topic with a Product Order Status field with the values Open and Closed and a Customer Complaint Status field also with the values Open and Closed. To help avoid this overlap, consider alternate names that would be natural to your end-users to avoid the potential ambiguity. In our example, I’d keep the Product Order Status values and change the Customer Complaint Status to Resolved and Unresolved.

Avoid including aggregation names in your fields and values

Another common pitfall that introduces unnecessary ambiguity is including calculated fields for basic aggregations that Q can do on the fly. For example, business users might track average clickthrough rates for a website or month-to-date free to paid conversions. Although these types of calculations are necessary in a dashboard, with Q, these calculated fields are not needed. Q can aggregate metrics using natural language, like simply asking “year over year sales” or “top customers by sales” or “average product discount,” as you can see in Figure 1. Defining a field with the name YoY Sales adds an additional potential answer choice to your topic, leaving end-users to select between the pre-defined YoY Sales field, or using Q’s built-in YoY aggregation capability, whereas you may already know which of these choices is likely to bring them the best outcome. If you have complex business logic built into calculated fields, those are still relevant to include (and if you create the topic from your existing analysis, then Q will bring them over.)

Q answer showing MoM sales

Figure 1: Q visual showing MoM sales for EMEA

Start with a single use case

For this post, we recommend defining a use case as a well-defined set of questions that actual business users will ask. Q gives the ability to answer questions not already answered in dashboards and reports, so simply having a dashboard or a dataset doesn’t mean you necessarily have a Q-ready use case. These questions are the real words and phrases used by business users, like “how are my customers performing?” where the word “performing” might map in the data to “sales amortized revenue,” but a business user might not ask questions using the precise data names.

Start with a single use case and the minimum number of fields to meet it. Then incrementally layer in more as needed. It’s better to introduce a topic with, for example, 10 fields and a 100% success rate of answering questions as expected vs. starting with 30 fields and a 70% success rate to help users feel confident.

To help you start small, Q enables you to create your topic in one click from your existing analysis (Figure 2).

Enable a Q topic from a QuickSight analysis

Figure 2: Enable a Q topic from a QuickSight analysis

Q will scan the underlying metadata in your analysis and automatically select high-value columns based on how they are used in the analysis. You’ll also get all your existing calculated fields ported over to the new topic so you don’t have to re-create them.

Add lexical context

Q knows English well. It understands a variety of phrases and different forms of the same word. What it doesn’t know is the unique terms from your business, and only you can teach it.

There are some key ways to provide Q this context, including adding synonyms, semantic types, default aggregations, primary date, named filters, and named entities. If you created your Q topic as described in the previous section, you will be a few steps ahead, but it’s always good to check the model’s work.

Add synonyms

In a dashboard, authors use visual titles, text boxes, and filter names to help business users navigate and find their answers. With NLQ, language is the interface. NLQ empowers business users to ask their questions in their own words. The author needs to make those business lexicon connections for Q using synonyms. Your business users might refer to revenue as “gross sales,” “amortized revenue,” or any number of terms specific to your business. From the topic authoring page, you can add relevant terms (Figure 3).

Add Q synonyms

Figure 3: Adding relevant synonyms

If your business users refer to the data values in multiple ways, you can use value synonyms to create those connections for Q (Figure 4). For example, in the Student Enrollment topic, let’s say your business users sometimes use First Years to map to Freshmen and so on for each classification type. If you don’t have that data directly in your dataset, you can create those mappings using value synonyms (Figure 5).

Configure Q value synonyms

Figure 4: Configure field value synonyms

Add Q value synonyms

Figure 5: Example value synonyms for Student Enrollment topic

Check semantic types

When you create a topic using automatic data prep, Q will automatically select relevant semantic types that it can detect. Q uses semantic types to understand which specific fields to use to answer vague question like who, where, when, and how many. For example, in the student enrollment statistics example, Q already set Home of Origin as Location so if someone asks “where,” Q knows to use this field (Figure 6). Another example is adding Person for the Student Name and Professor fields so Q knows what fields to use when your business users ask for “who.”

Home of origin semantic type

Figure 6: Semantic Type set to “Location”

Another important semantic type is the Identifier. This tells Q what to count when your business users ask questions like “How many were enrolled in biology in 2021?” (Figure 7). In this example, Student ID is set as the Identifier.

Q answer showing a KPI of 3

Figure 7: Q visual showing a “how many” question

Here is a list of semantic types that map to implicit question phrases:

Location: Where?
Person or Organization: Who?
- If there are no person or organization fields, then Q will use the identifier
Identifier: How many? What is the number of?
Duration: How long?
Date Part: When?
Age: How old?
Distance: How far?

Semantic types also help the model in several other ways, including mapping terms like “most expensive” or “cheapest” to Currency. There is not always a relevant semantic type, so it’s okay to leave those empty.

Set default aggregations

Q will always aggregate measure values a business user asks for, so it’s important to use measures that retain their meaning when brought together with other values. As of this writing, Q works best with underlying data that is summative, for example, a currency value or a count. Examples of metrics that are not summative are percentages, percentiles, and medians. Measures of this type can produce misleading or statistically inaccurate results when added with one another. Q can be used to produce averages, percentiles, and medians by end-users without first performing those calculations in underlying data.

Help Q understand the business logic behind your data by setting default aggregations. For example, in the Student Enrollment topic, we have student test scores for every course, which should be averaged and not summed, because it’s a percentage. Therefore, we set Average as the default and set Sum as a not allowed aggregation type (Figure 8).

Percentage semantic type

Figure 8: Setting “Sum” as a “Not allowed aggregation” for a percentage data field

To ensure end-users get a correct count, consider whether the default aggregation type for each dimensional field should be Distinct Count or Count and set accordingly. For example, if we wanted to ask “how many courses do we offer,” we would want to set Courses to Distinct Count because the underlying data contains multiple records for the same course to track each student enrolled.

If we have a count, we get over 6,000 courses, which is a count of all rows that have data in the Courses field, covering every student in the dataset (Figure 9).

Q KPI showing 6,277

Figure 9: Q visual showing a count of courses

If we set the default aggregation to Distinct Count, we get the count of unique course names, which is more likely to be what the end-user expects (Figure 10).

Q KPI showing 15

Figure 10: Q visual showing the unique count of courses

Review the primary date field

Q will automatically select a primary date field for answering time related questions like “when” or “yoy”. If your data includes more than one date field, you may want to choose a different date than Q’s default choice. End-users can also ask about additional date fields by explicitly naming them (Figure 12). You can always specify a different date if you’d like. To review or change the primary date, go to the topic page, navigate to the Data section, and choose the Datasets tab. Expand the dataset and review the value for Default date (Figure 11).

Reviewing the Q default date

Figure 11: Reviewing the default date

You can change the date as needed.

Changing the date field

Figure 12: Asking about non-default dates

Add named filters

In a dashboard, filters are critical to allow users to focus in on their area of interest. With Q, traditional filters aren’t required because users can automatically ask to filter any field values included in the Q topic. For example, you could ask “What were sales last week for Acme Inc. for returning shoppers?” Instead of building the filters in a dashboard (date, customer name, and returning vs. new customer), Q does the filtering on the fly to instantly provide the answer.

With Q, a filter is a specific word or phrase your business users will use to instruct Q to filter returned results. For example, you have student test scores but you want a way for your users to ask about failing test scores. You can set up a filter for “Failing” defined as test scores less than 70% (Figure 13).

Q filter configuration

Figure 13: Filter configuration example using a measure

Additionally, maybe you have a field for Student Classification, which includes Freshmen, Sophomore, Junior, Senior, and Graduate, and you want to let users ask about “undergrads” vs. “graduates” (Figure 14). You can make a filter that includes the relevant values.

Q undergrad named filter example

Figure 14: Filter configuration example using a dimension

Add named entities

Named entities are a way to get Q to return a set of fields as a table visual when a user asks for a specific word or phrase. If someone wanted to know “sales for retail december” and they get a KPI saying $6,169 without any extra context, it is hard to understand all data this number includes (Figure 15).

Q KPI showing $6,169

Figure 15: A Q visual showing “sales for retail december”

By presenting the KPI in a table view with other relevant dimensions, the data includes additional context making it easier to understand meaning (Figure 16).

Q named entity as a table visual

Figure 16: A Q visual showing “sales details for retail december”

By building these table views, you can happily surprise your business users by anticipating the information they want to see without having to explicitly ask for each piece of data. The best part is your business users can easily filter the table using language to answer their own data questions. For example, in the Student Enrollment topic, we created a Student information named entity with some important student details like their name, major, email, and test scores per course.

Q named entity for student information

Figure 17: Named entity example

If a university administrator wanted to reach out to students who are failing biology, they can simply ask for “student information for failing biology majors.” In one step, they get a filtered list that already includes their emails and test scores so they can reach out (Figure 18).

Q named entity filtered down for failing biology students

Figure 18: Filtering a named entity

If the university administrator wanted to also see the phone numbers of the students to send texts offering free tutoring, they could simply ask Q “Student information for failing biology majors with phone numbers.” Now, Mobile is added as the first column (Figure 19).

Q named entity adding phone number

Figure 19: Adding a column to a named entity

Entities can also be referenced using synonyms in order to capture all the ways your business users might refer to this group of data. In our example, we could also add “student contact info” and “academic details” based on the common terminology the university admins use.

Besides looking for patterns in the data fields, ask yourself about what your business users care about. For example, let’s assume we have data for our HR specialists, and we know they care about job postings, candidates, and recruiters. Each author might think of the groups slightly differently, but as long as it’s rooted in your business jobs to be done, then your groupings are providing value. With those three groups in mind, we can sort all the data into one of those buckets. For this use case, our Candidate bucket is pretty large, with about 20 fields. We can scan the list and notice that we track information for rejected and accepted candidates, so we start splitting the metrics into two groups: Successful Candidates and Rejected Candidates. Now information like Offer Letter Date, Accept Date, and Final Salary are all in the Successful Candidate group, and related fields about Rejected Candidates are clearly grouped together.

If you’re curious about strategies for how to create entities, check out card sorting techniques.

In the Product Sales sample topic, after scanning the data, we would start with Sales, Product, and Customer as three key groupings of information to analyze. Try out the exercise on your own data and feel free to ask any questions on the QuickSight Community. To learn how to create named entities, refer to Adding named entities to a topic dataset.

Drive NLQ adoption

After you have refined your topic, tested it out with some readers, and made it available for a larger audience, it’s important to follow two strategies to drive adoption.

First, provide your business users with support. Support might look like a short tutorial video or newsletter announcement. Consider keeping an open channel like a Slack or Teams chat where active users can post questions or enhancements.

Here at Amazon, the Prime team has a dedicated Product Manager (PM) for their embedded Q application that they call PrimeQ. The PM hosts regular demo and training sessions where the Prime team can ask them any questions and get ideas about what types of answers they can get. The PM also sends out a monthly newsletter to announce the availability of new data and topics along with sample questions, FAQs, and quotes from Prime team members who get value out of Q. The PM also has an active Slack channel where every single question gets answered within 24 hours, either by the PM or a data engineer on the Prime team.

Pro tip: Make sure your business users know who they can reach out to if they get stuck. Avoid the black box of “reach out to your author” so readers feel confident their questions will be answered by a known person. For embedded applications, be sure to build an easy way to get support.

Second, maintain a healthy feedback loop. Look at the usage data directly in the product and schedule 1-on-1 sessions with your readers. Use the usage data to track adoption and identify readers who are asking unanswerable questions (Figure 20). Engage with both your successful and struggling readers to learn how to continue to iterate and improve the experience. Talking to business users is especially important to uncover the implicit ambiguity of language.

Another example here at Amazon, after first launching the Revenue Insights topic for the AWS Analytics sales team, a QuickSight Solution Architect (SA) and myself checked the usage tab on a daily basis to track unanswerable questions and directly reach out to the sales team member to let them know how to adjust their question or that we made a change so their question would now work. For example, we initially had a field turned off for Market Segment and noticed a question from a sales leader asking about sales by segment. We turned the field on and let him know those questions would now work. The SA and I have a Slack channel with other stakeholders so we can troubleshoot asynchronously with ease. Now that the topic has been available for several months, we check the usage tab on a weekly basis.

Q user activity tab

Figure 20: User Activity tab in Q

Conclusion

In this post, we discussed how language is inherently complex and what context you need to provide Q to teach the system about your unique business language. Q’s automated data prep will get you started, but you need to add the context that is specific to your business user’s language. As we mentioned at the start of the post, consider the following:

Start with a narrow and focused use case
Teach the system your unique business language
Get success by providing support and having a feedback loop

Follow this post to enable your business users to answer questions of data using natural language in QuickSight.

Ready to get started with Q? Watch our quick tutorial on enabling QuickSight Q.

Want some tutorial videos to share with your team? Check out the following:

To see how Q can answer the “Why” behind data changes and forecast future business performance, refer to New analytical questions available in Amazon QuickSight Q: “Why” and “Forecast”.

About the Author

Amy Laresch is a product manager for Amazon QuickSight Q. She is passionate about analytics and is focused on delivering the best experience for every QuickSight Q reader. Check out her videos on the @AmazonQuickSight YouTube channel for best practices and to see what’s new for QuickSight Q.