Tag Archives: Analytics

How Cloudflare instruments services using Workers Analytics Engine

Post Syndicated from Jen Sells original https://blog.cloudflare.com/using-analytics-engine-to-improve-analytics-engine/

Workers Analytics Engine is a new tool, announced earlier this year, that enables developers and product teams to build time series analytics about anything, with high dimensionality, high cardinality, and effortless scaling. We built Analytics Engine for teams to gain insights into their code running in Workers, provide analytics to end customers, or even build usage based billing.

In this blog post we’re going to tell you about how we use Analytics Engine to build Analytics Engine. We’ve instrumented our own Analytics Engine SQL API using Analytics Engine itself and use this data to find bugs and prioritize new product features. We hope this serves as inspiration for other teams who are looking for ways to instrument their own products and gather feedback.

Why do we need Analytics Engine?

Analytics Engine enables you to generate events (or “data points”) from Workers with just a few lines of code. Using the GraphQL or SQL API, you can query these events and create useful insights about the business or technology stack. For more about how to get started using Analytics Engine, check out our developer docs.

Since we released the Analytics Engine open beta in September, we’ve been adding new features at a rapid clip based on feedback from developers. However, we’ve had two big gaps in our visibility into the product.

First, our engineering team needs to answer classic observability questions, such as: how many requests are we getting, how many of those requests result in errors, and what is the nature of those errors? They need to be able to view both aggregated data (like average error rate or p99 response time) and drill into individual events.

Second, because this is a newly launched product, we are looking for product insights. By instrumenting the SQL API, we can understand the queries our customers write, and the errors they see, which helps us prioritize missing features.

We realized that Analytics Engine would be an amazing tool both for answering our technical observability questions and for gathering product insight. That’s because we can log an event for every query to our SQL API, and then query for aggregated performance issues as well as the individual errors and queries that our customers run.

In the next section, we’re going to walk you through how we use Analytics Engine to monitor that API.

Adding instrumentation to our SQL API

The Analytics Engine SQL API lets you query events data in the same way you would an ordinary database. For decades, SQL has been the most common language for querying data. We wanted to provide an interface that allows you to immediately start asking questions about your data without having to learn a new query language.

Our SQL API parses user SQL queries, transforms and validates them, and then executes them against backend database servers. We then write information about the query back into Analytics Engine so that we can run our own analytics.

Writing data into Analytics Engine from a Cloudflare Worker is very simple and explained in our documentation. We instrument our SQL API in the same way our users do, and this code excerpt shows the data we write into Analytics Engine:

Code excerpt: the fields recorded for each SQL API query

With that data now being stored in Analytics Engine, we can then pull out insights about every field we’re reporting.

Querying for insights

Having your analytics in an SQL database gives you the freedom to write any query you might want. Compared to metrics, which are often predefined and purpose-specific, you can define any custom dataset you need and interrogate your data to ask new questions with ease.

We need to support datasets comprising trillions of data points. In order to accomplish this, we have implemented a sampling method called Adaptive Bit Rate (ABR). With ABR, if you have large amounts of data, your queries may return sampled events in order to respond in a reasonable time. If you have more typical amounts of data, Analytics Engine will query all your data. This allows you to run any query you like and still get responses quickly. Right now, you have to account for sampling in how you make your queries, but we are exploring making it automatic.
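
For example, to count requests under ABR you weight each returned row by its sample interval rather than counting rows. A minimal sketch, assuming a dataset named sql_api_requests (the dataset name is an assumption):

-- Each returned row stands for _sample_interval underlying events, so summing
-- that column estimates the true event count even when results are sampled.
SELECT SUM(_sample_interval) AS request_count
FROM sql_api_requests
WHERE timestamp > NOW() - INTERVAL '1' DAY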

Any data visualization tool can be used to visualize your analytics. At Cloudflare, we heavily use Grafana (and you can too!). This is particularly useful for observability use cases.

Observing query response times

One query we pay attention to gives us information about the performance of our backend database clusters:

Chart: average and p99 response times for queries against the backend database clusters

As you can see, the 99th percentile (corresponding to the 1% of queries that are most complex to execute) sometimes spikes up to about 300ms, but on average our backend responds to queries within 100ms.

This visualization is itself generated from an SQL query:

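A query along these lines could drive that chart. This is only a sketch, not the exact query: it assumes the response time in milliseconds is written to double1 and that the dataset is named sql_api_requests.

-- Bucket requests into 5-minute intervals and compute the average and p99
-- response time, weighting by _sample_interval to account for ABR sampling.
SELECT
  intDiv(toUInt32(timestamp), 300) * 300 AS t,
  SUM(double1 * _sample_interval) / SUM(_sample_interval) AS avg_response_ms,
  quantileWeighted(0.99)(double1, _sample_interval) AS p99_response_ms
FROM sql_api_requests
WHERE timestamp > NOW() - INTERVAL '1' DAY
GROUP BY t
ORDER BY t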

Customer insights from high-cardinality data

Another use of Analytics Engine is to draw insights out of customer behavior. Our SQL API is particularly well-suited for this, as you can take full advantage of the power of SQL. Thanks to our ABR technology, even expensive queries can be carried out against huge datasets.

We use this ability to help prioritize improvements to Analytics Engine. Our SQL API supports a fairly standard dialect of SQL but isn’t feature-complete yet. If a user tries to do something unsupported in an SQL query, they get back a structured error message. Those error messages are reported into Analytics Engine. We’re able to aggregate the kinds of errors that our customers encounter, which helps inform which features to prioritize next.

Chart: the most common SQL API error types encountered by customers

The SQL API returns errors in the format “type of error: more details”, so we can take the portion before the colon as the error type. We group by that, and get a count of how many times each error happened and how many users it affected:

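A sketch of such a query, assuming the error message is written to blob1 and the customer (account) identifier to blob2 in a dataset named sql_api_requests (the column names and exact function availability are assumptions):

-- Group errors by the text before the first colon and estimate how often each
-- occurs and how many distinct customers hit it.
SELECT
  substring(blob1, 1, position(blob1, ':') - 1) AS error_type,
  SUM(_sample_interval) AS error_count,
  COUNT(DISTINCT blob2) AS customers_affected
FROM sql_api_requests
WHERE timestamp > NOW() - INTERVAL '1' DAY
  AND blob1 != ''
GROUP BY error_type
ORDER BY error_count DESC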

To perform the above query using an ordinary metrics system, we would need to represent each error type with a different metric. Reporting that many metrics from each microservice creates scalability challenges. That problem doesn’t happen with Analytics Engine, because it’s designed to handle high-cardinality data.

Another big advantage of a high-cardinality store like Analytics Engine is that you can dig into specifics. If there’s a large spike in SQL errors, we may want to find which customers are having a problem in order to help them, or to identify which unsupported function they are trying to use. That’s easy to do with another SQL query:

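One possible form, again assuming blob1 holds the error message and blob2 the customer identifier:

-- List the customers hitting errors over the last hour, along with the errors
-- they saw, so we can reach out or spot the unsupported feature they need.
SELECT
  blob2 AS customer_id,
  blob1 AS error_message,
  SUM(_sample_interval) AS occurrences
FROM sql_api_requests
WHERE timestamp > NOW() - INTERVAL '1' HOUR
  AND blob1 != ''
GROUP BY blob2, blob1
ORDER BY occurrences DESC
LIMIT 20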

Inside Cloudflare, we have historically relied on querying our backend database servers for this type of information. Analytics Engine’s SQL API now enables us to open up our technology to our customers, so they can easily gather insights about their services at any scale!

Conclusion and what’s next

The insights we gathered about usage of the SQL API are a super helpful input to our product prioritization decisions. We have already added support for the substring and position functions, which are used in the queries above.

Looking at the top SQL errors, we see numerous errors related to selecting columns. These errors mostly stem from usability issues with the Grafana plugin. Adding support for the DESCRIBE function should alleviate this, because without it the Grafana plugin cannot discover the table structure. This, along with other improvements to our Grafana plugin, is on our roadmap.

We also can see that users are trying to query time ranges for older data that no longer exists. This suggests that our customers would appreciate having extended data retention. We’ve recently extended our retention from 31 to 92 days, and we will keep an eye on this to see if we should offer further extension.

We saw lots of errors related to common mistakes or misunderstandings of proper SQL syntax. This indicates that we could provide better examples or error explanations in our documentation to assist users with troubleshooting their queries.

Stay tuned to our developer docs to be informed as we continue to iterate and add more features!

You can start using Workers Analytics Engine now! Analytics Engine is in open beta with free 90-day retention. Start using it today, or join our Discord community to talk with the team.

Announcing the AWS Well-Architected Data Analytics Lens

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/announcing-the-aws-well-architected-data-analytics-lens/

We are delighted to announce the release of the Data Analytics Lens. The lens consists of a lens whitepaper and an AWS-created lens available in the Lens Catalog of the AWS Well-Architected Tool. The AWS Well-Architected Framework provides a consistent approach to evaluate architectures and implement scalable designs. With the AWS Well-Architected Framework, cloud architects, system architects, engineers, and developers can build secure, high-performance, resilient, and efficient infrastructure for their applications and workloads.

Using the Lens in the Tool’s Lens Catalog, you can directly assess your Analytics workload in the console, and produce a set of actionable results for customized improvement plans recommended by the Tool.

The updated Data Analytics Lens outlines the most up-to-date steps for performing an AWS Well-Architected review that empowers you to assess and identify technical risks of your data analytics platforms. The new whitepaper and Lens cover multiple analytics use cases and scenarios, and provide comprehensive guidance to help you design your analytics applications in accordance with AWS best practices.

The new Data Analytics Lens offers implementation guidance you can use to deliver secure, high-performance, and reliable workloads, all with an eye toward maintaining cost-effectiveness and sustainability.

For more information on AWS Well-Architected Lenses, refer to AWS Well-Architected.

What’s new in the Data Analytics Lens?

The Data Analytics Lens is a collection of customer-proven design principles, best practices, and prescriptive guidance to help you adopt a cloud-centered approach to running analytics on AWS. These recommendations are based on insights that AWS has gathered from customers, AWS Partners, the field, and our own analytics technical specialist communities.

This version covers the following topics:

  • New Lens for the Well-Architected Tool in the Lens Catalog
  • New Data Mesh analytics user scenario
  • Included guidance on building ACID compliant data lakes using Iceberg
  • Included guidance on adding business context to your data catalog to improve searchability and access
  • How best to leverage Serverless to build sustainable data pipelines
  • Expanded advanced performance tuning techniques
  • Additional content for analytics scenario use cases
  • Links to updated blogs and product documentation, partner solutions, training content, and how-to videos

The lens highlights some of the most common areas for assessment and improvement. It’s designed to align with and provide insights across the six pillars of the AWS Well-Architected Framework:

  • Operational excellence – Includes the ability to support development and run workloads effectively, gain insight into your operations, and continually improve supporting processes and procedures to deliver business value.
  • Security – Includes the ability to protect data, systems, and assets to take advantage of cloud technologies to improve your security.
  • Reliability – Includes the ability of a system to automatically recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfiguration or transient network issues.
  • Performance efficiency – Includes the efficient use of computing resources to meet requirements and the maintenance of that efficiency as demand changes and technologies evolve.
  • Cost optimization – Includes the continual process of system refinement and improvement over the entire lifecycle to optimize cost, from the initial design of your first proof of concept to the ongoing operation of production workloads.
  • Sustainability – Includes minimizing the environmental impacts of running cloud workloads. Topics include benchmarking, trading data accuracy for carbon, encouraging a data minimization culture, implementing data retention processes, optimizing data modeling, preventing unnecessary data movement, and efficiently managing analytics infrastructure.

The new Data Analytics Lens provides guidance that can help you make appropriate design decisions in line with your business requirements. By applying the techniques detailed in this lens to your architecture, you can validate the resiliency and efficiency of your design. This lens also provides recommendations to address any gaps you may identify.

Who should use the Data Analytics Lens?

The Data Analytics Lens is intended for all AWS customers who use analytics processes to run their workloads.

We believe that the lens will be valuable regardless of your stage of cloud adoption: whether you’re launching your first analytics workloads on AWS, migrating existing services to the cloud, or working to extend and improve existing AWS analytics workloads.

The material is intended to support customers in roles such as architects, developers, and operations team members.

Conclusion

Applying the Data Analytics Lens to your existing architectures can validate the stability and efficiency of your design and provide recommendations to address identified gaps.

For more information about building your own Well-Architected systems using the Data Analytics Lens, see the Data Analytics Lens whitepaper. For information on the new Lens, see the AWS Well-Architected Tool and its Lens Catalog. If you require additional expert guidance, contact your AWS account team to engage a Specialist Solutions Architect.

To learn more about supported analytics solutions, customer case studies, and additional resources, refer to Architecture Best Practices for Analytics & Big Data.


About the authors

Russell Jackson is a Senior Solutions Architect at AWS based in the UK. Russell has over 15 years of analytics experience and is passionate about big data, event-driven architectures, and building environmentally sustainable data pipelines. Outside of work, Russell enjoys road cycling, wild swimming and traveling.

Theo Tolv is a Senior Analytics Architect based in Stockholm, Sweden. He’s worked with small and big data for most of his career, and has built applications running on AWS since 2008. In his spare time he likes to tinker with electronics and read space opera.

Bruce Ross is a Senior Solutions Architect at AWS in the New York Area. Bruce is the Lens Leader for the Well-Architected Framework. He has been involved in IT and Content Development for over 20 years. He is an avid sailor and angler, and enjoys R&B, jazz, and classical music.

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Pragnesh Shah is a Solutions Architect in the Partner Organisation. He is a specialist in migration, modernisation, cloud strategy, and designing and delivering data and analytics capabilities. Outside of work, he spends time with family and in nature. He likes to record nature sounds and practice Zen meditation.

Share and publish your Snowflake data to AWS Data Exchange using Amazon Redshift data sharing

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/share-and-publish-your-snowflake-data-to-aws-data-exchange-using-amazon-redshift-data-sharing/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

In this post, we discuss a customer who is currently using Snowflake to store analytics data. The customer needs to offer this data to clients who are using Amazon Redshift via AWS Data Exchange, the world’s most comprehensive service for third-party datasets. We explain in detail how to implement a fully integrated process that will automatically ingest data from Snowflake into Amazon Redshift and offer it to clients via AWS Data Exchange.

Overview of the solution

The solution consists of four high-level steps:

  1. Configure Snowflake to push the changed data for identified tables into an Amazon Simple Storage Service (Amazon S3) bucket.
  2. Use a custom-built Redshift Auto Loader to load this Amazon S3 landed data to Amazon Redshift.
  3. Merge the data from the change data capture (CDC) S3 staging tables to Amazon Redshift tables.
  4. Use Amazon Redshift data sharing to license the data to customers via AWS Data Exchange as a public or private offering.

The following diagram illustrates this workflow.

Solution Architecture Diagram

Prerequisites

To get started, you need the following prerequisites:

Configure Snowflake to track the changed data and unload it to Amazon S3

In Snowflake, identify the tables that you need to replicate to Amazon Redshift. For the purpose of this demo, we use the data in the TPCH_SF1 schema’s Customer, LineItem, and Orders tables of the SNOWFLAKE_SAMPLE_DATA database, which comes out of the box with your Snowflake account.

  1. Make sure that the Snowflake external stage name unload_to_s3 created in the prerequisites is pointing to the S3 prefix s3-redshift-loader-source created in the previous step.
  2. Create a new schema BLOG_DEMO in the DEMO_DB database:
    CREATE SCHEMA demo_db.blog_demo;
  3. Duplicate the Customer, LineItem, and Orders tables in the TPCH_SF1 schema to the BLOG_DEMO schema:
    CREATE TABLE CUSTOMER AS 
    SELECT * FROM snowflake_sample_data.tpch_sf1.CUSTOMER;
    CREATE TABLE ORDERS AS
    SELECT * FROM snowflake_sample_data.tpch_sf1.ORDERS;
    CREATE TABLE LINEITEM AS 
    SELECT * FROM snowflake_sample_data.tpch_sf1.LINEITEM;

  4. Verify that the tables have been duplicated successfully:
    SELECT table_catalog, table_schema, table_name, row_count, bytes
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'BLOG_DEMO'
    ORDER BY ROW_COUNT;


  5. Create table streams to track data manipulation language (DML) changes made to the tables, including inserts, updates, and deletes:
    CREATE OR REPLACE STREAM CUSTOMER_CHECK ON TABLE CUSTOMER;
    CREATE OR REPLACE STREAM ORDERS_CHECK ON TABLE ORDERS;
    CREATE OR REPLACE STREAM LINEITEM_CHECK ON TABLE LINEITEM;

  6. Perform DML changes to the tables (for this post, we run UPDATE on all tables and MERGE on the customer table):
    UPDATE customer 
    SET c_comment = 'Sample comment for blog demo' 
    WHERE c_custkey between 0 and 10; 
    UPDATE orders 
    SET o_comment = 'Sample comment for blog demo' 
    WHERE o_orderkey between 1800001 and 1800010; 
    UPDATE lineitem 
    SET l_comment = 'Sample comment for blog demo' 
    WHERE l_orderkey between 3600001 and 3600010;
    MERGE INTO customer c 
    USING 
    ( 
    SELECT n_nationkey 
    FROM snowflake_sample_data.tpch_sf1.nation s 
    WHERE n_name = 'UNITED STATES') n 
    ON n.n_nationkey = c.c_nationkey 
    WHEN MATCHED THEN UPDATE SET c.c_comment = 'This is US based customer1';

  7. Validate that the stream tables have recorded all changes:
    SELECT * FROM CUSTOMER_CHECK; 
    SELECT * FROM ORDERS_CHECK; 
    SELECT * FROM LINEITEM_CHECK;

    For example, we can query the following customer key value to verify how the events were recorded for the MERGE statement on the customer table:

    SELECT * FROM CUSTOMER_CHECK where c_custkey = 60027;

    We can see the METADATA$ISUPDATE column as TRUE, and we see DELETE followed by INSERT in the METADATA$ACTION column.

  8. Run the COPY command to offload the CDC from the stream tables to the S3 bucket using the external stage name unload_to_s3. In the following code, we’re also copying the data to S3 folders ending with _stg to ensure that when Redshift Auto Loader automatically creates these tables in Amazon Redshift, they get created and marked as staging tables:
    COPY INTO @unload_to_s3/customer_stg/
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.customer_check)
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE HEADER = TRUE;

    COPY INTO @unload_to_s3/orders_stg/
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.orders_check)
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE HEADER = TRUE;

    COPY INTO @unload_to_s3/lineitem_stg/ 
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.lineitem_check) 
    FILE_FORMAT = (TYPE = PARQUET) 
    OVERWRITE = TRUE HEADER = TRUE;

  9. Verify the data in the S3 bucket. There will be three sub-folders created in the s3-redshift-loader-source folder of the S3 bucket, and each will have .parquet data files. You can also automate the preceding COPY commands using tasks, which can be scheduled to run at a set frequency for automatic copy of CDC data from Snowflake to Amazon S3.
  10. Use the ACCOUNTADMIN role to assign the EXECUTE TASK privilege. In this scenario, we’re assigning the privileges to the SYSADMIN role:
    USE ROLE accountadmin;
    GRANT EXECUTE TASK, EXECUTE MANAGED TASK ON ACCOUNT TO ROLE sysadmin;

  11. Use the SYSADMIN role to create three separate tasks to run three COPY commands every 5 minutes:
    USE ROLE sysadmin;
    /* Task to offload Customer CDC table */ 
    CREATE TASK sf_rs_customer_cdc 
    WAREHOUSE = SMALL 
    SCHEDULE = 'USING CRON */5 * * * * UTC' 
    AS 
    COPY INTO @unload_to_s3/customer_stg/ 
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.customer_check) 
    FILE_FORMAT = (TYPE = PARQUET) 
    OVERWRITE = TRUE 
    HEADER = TRUE;
    /*Task to offload Orders CDC table */ 
    CREATE TASK sf_rs_orders_cdc 
    WAREHOUSE = SMALL 
    SCHEDULE = 'USING CRON */5 * * * * UTC' 
    AS 
    COPY INTO @unload_to_s3/orders_stg/ 
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.orders_check)
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE HEADER = TRUE;

    /* Task to offload Lineitem CDC table */ 
    CREATE TASK sf_rs_lineitem_cdc 
    WAREHOUSE = SMALL 
    SCHEDULE = 'USING CRON */5 * * * * UTC' 
    AS 
    COPY INTO @unload_to_s3/lineitem_stg/ 
    FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.lineitem_check)
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE HEADER = TRUE;

    When the tasks are first created, they’re in a SUSPENDED state.

  12. Alter the three tasks and set them to RESUME state:
    ALTER TASK sf_rs_customer_cdc RESUME;
    ALTER TASK sf_rs_orders_cdc RESUME;
    ALTER TASK sf_rs_lineitem_cdc RESUME;

  13. Validate that all three tasks have been resumed successfully:
    SHOW TASKS;

    Now the tasks will run every 5 minutes and look for new data in the stream tables to offload to Amazon S3. As soon as data is migrated from Snowflake to Amazon S3, Redshift Auto Loader automatically infers the schema and instantly creates corresponding tables in Amazon Redshift. Then, by default, it starts loading data from Amazon S3 to Amazon Redshift every 5 minutes. You can also change the default setting of 5 minutes.
  14. On the Amazon Redshift console, launch the query editor v2 and connect to your Amazon Redshift cluster.
  15. Browse to the dev database, public schema, and expand Tables.
    You can see three staging tables created with the same name as the corresponding folders in Amazon S3.
  16. Validate the data in one of the tables by running the following query:
    SELECT * FROM "dev"."public"."customer_stg";

Configure the Redshift Auto Loader utility

The Redshift Auto Loader makes data ingestion to Amazon Redshift significantly easier because it automatically loads data files from Amazon S3 to Amazon Redshift. The files are mapped to the respective tables by simply dropping files into preconfigured locations on Amazon S3. For more details about the architecture and internal workflow, refer to the GitHub repo.

We use an AWS CloudFormation template to set up Redshift Auto Loader. Complete the following steps:

  1. Launch the CloudFormation template.
  2. Choose Next.
  3. For Stack name, enter a name.
  4. Provide the parameters listed in the following table.

    Parameter (allowed values) – Description
    RedshiftClusterIdentifier (Amazon Redshift cluster identifier) – Enter the Amazon Redshift cluster identifier.
    DatabaseUserName (database user name in the Amazon Redshift cluster) – The Amazon Redshift database user name that has access to run the SQL script.
    DatabaseName (database name in the Amazon Redshift cluster) – The name of the Amazon Redshift primary database where the SQL script is run.
    DatabaseSchemaName (schema name in the Amazon Redshift cluster) – The Amazon Redshift schema name where the tables are created.
    RedshiftIAMRoleARN (default, or a valid IAM role ARN attached to the Amazon Redshift cluster) – The IAM role ARN associated with the Amazon Redshift cluster. If your default IAM role is set for the cluster and has access to your S3 bucket, leave it at the default.
    CopyCommandOptions (COPY options; default is delimiter '|' gzip) – Additional COPY command data format parameters. If InitiateSchemaDetection = Yes, the process attempts to detect the schema and automatically set suitable COPY command options. If schema detection fails, or if InitiateSchemaDetection = No, this value is used as the default COPY command options to load data.
    SourceS3Bucket (S3 bucket name) – The S3 bucket where the data is stored. Make sure the IAM role associated with the Amazon Redshift cluster has access to this bucket.
    InitiateSchemaDetection (Yes/No) – Set to Yes to dynamically detect the schema prior to file load and create a table in Amazon Redshift if it doesn’t already exist. If a table already exists, it won’t drop or recreate the table. If schema detection fails, the process uses the default COPY options as specified in CopyCommandOptions.

    The Redshift Auto Loader uses the COPY command to load data into Amazon Redshift. For this post, set CopyCommandOptions as follows, and configure any supported COPY command options:

    delimiter '|' dateformat 'auto' TIMEFORMAT 'auto'


  5. Choose Next.
  6. Accept the default values on the next page and choose Next.
  7. Select the acknowledgement check box and choose Create stack.
  8. Monitor the progress of the Stack creation and wait until it is complete.
  9. To verify the Redshift Auto Loader configuration, sign in to the Amazon S3 console and navigate to the S3 bucket you provided.
    You should see that a new directory, s3-redshift-loader-source, has been created.

Copy all the data files exported from Snowflake under s3-redshift-loader-source.

Merge the data from the CDC S3 staging tables to Amazon Redshift tables

To merge your data from Amazon S3 to Amazon Redshift, complete the following steps:

  1. Create a temporary staging table merge_stg and insert all the rows from the S3 staging table that have metadata$action as INSERT, using the following code. This includes all the new inserts as well as the updates.
    CREATE TEMP TABLE merge_stg
    AS
    SELECT * FROM
    (
    SELECT *, DENSE_RANK() OVER (PARTITION BY c_custkey ORDER BY last_updated_ts DESC) AS rnk
    FROM customer_stg
    )
    WHERE rnk = 1 AND metadata$action = 'INSERT';

    The preceding code uses the window function DENSE_RANK() to select the latest entry for each c_custkey, assigning a rank to each row for a given c_custkey ordered by last_updated_ts in descending order. We then select the rows with rnk = 1 and metadata$action = 'INSERT' to capture all the inserts.

  2. Use the S3 staging table customer_stg to delete the records from the base table customer, which are marked as deletes or updates:
    DELETE FROM customer 
    USING customer_stg 
    WHERE customer.c_custkey = customer_stg.c_custkey;

    This deletes all the rows that are present in the CDC S3 staging table, which takes care of rows marked for deletion and updates.

  3. Use the temporary staging table merge_stg to insert the records marked for updates or inserts:
    INSERT INTO customer 
    SELECT c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment 
    FROM merge_stg;

  4. Truncate the staging table, because we have already updated the target table:
    truncate customer_stg;
  5. You can also run the preceding steps as a stored procedure:
    CREATE OR REPLACE PROCEDURE merge_customer()
    AS $$
    BEGIN
    /*CREATING TEMP TABLE TO GET THE MOST LATEST RECORDS FOR UPDATES/NEW INSERTS*/
    CREATE TEMP TABLE merge_stg AS
    SELECT * FROM
    (
    SELECT *, DENSE_RANK() OVER (PARTITION BY c_custkey ORDER BY last_updated_ts DESC ) AS rnk
    FROM customer_stg
    )
    WHERE rnk = 1 AND metadata$action = 'INSERT';
    /* DELETING FROM THE BASE TABLE USING THE CDC STAGING TABLE ALL THE RECORDS MARKED AS DELETES OR UPDATES*/
    DELETE FROM customer
    USING customer_stg
    WHERE customer.c_custkey = customer_stg.c_custkey;
    /*INSERTING NEW/UPDATED RECORDS IN THE BASE TABLE*/ 
    INSERT INTO customer
    SELECT c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
    FROM merge_stg;
    truncate customer_stg;
    END;
    $$ LANGUAGE plpgsql;

    For example, let’s look at the before and after states of the customer table when there’s been a change in data for a particular customer.

    The following screenshot shows the new changes recorded in the customer_stg table for c_custkey = 74360.
    We can see two records for the customer with c_custkey = 74360: one with metadata$action as DELETE and one with metadata$action as INSERT. That means the record was updated at the source, and these changes need to be applied to the target customer table in Amazon Redshift.

    The following screenshot shows the current state of the customer table before these changes have been merged using the preceding stored procedure:

  6. Now, to update the target table, we can run the stored procedure as follows:
    CALL merge_customer()

    The following screenshot shows the final state of the target table after the stored procedure is complete.

Run the stored procedure on a schedule

You can also run the stored procedure on a schedule via Amazon EventBridge. The scheduling steps are as follows:

  1. On the EventBridge console, choose Create rule.
  2. For Name, enter a meaningful name, for example, Trigger-Snowflake-Redshift-CDC-Merge.
  3. For Event bus, choose default.
  4. For Rule Type, select Schedule.
  5. Choose Next.
  6. For Schedule pattern, select A schedule that runs at a regular rate, such as every 10 minutes.
  7. For Rate expression, enter Value as 5 and choose Unit as Minutes.
  8. Choose Next.
  9. For Target types, choose AWS service.
  10. For Select a Target, choose Redshift cluster.
  11. For Cluster, choose the Amazon Redshift cluster identifier.
  12. For Database name, choose dev.
  13. For Database user, enter a user name with access to run the stored procedure. It uses temporary credentials to authenticate.
  14. Optionally, you can also use AWS Secrets Manager for authentication.
  15. For SQL statement, enter CALL merge_customer().
  16. For Execution role, select Create a new role for this specific resource.
  17. Choose Next.
  18. Review the rule parameters and choose Create rule.

After the rule has been created, it automatically triggers the stored procedure in Amazon Redshift every 5 minutes to merge the CDC data into the target table.

Configure Amazon Redshift to share the identified data with AWS Data Exchange

Now that you have the data stored inside Amazon Redshift, you can publish it to customers using AWS Data Exchange.

  1. In Amazon Redshift, using any query editor, create the data share and add the tables to be shared:
    CREATE DATASHARE salesshare MANAGEDBY ADX;
    ALTER DATASHARE salesshare ADD SCHEMA tpch_sf1;
    ALTER DATASHARE salesshare ADD TABLE tpch_sf1.customer;


  2. On the AWS Data Exchange console, create your dataset.
  3. Select Amazon Redshift datashare.
  4. Create a revision in the dataset.
  5. Add assets to the revision (in this case, the Amazon Redshift data share).
  6. Finalize the revision.

After you create the dataset, you can publish it to the public catalog or directly to customers as a private product. For instructions on how to create and publish products, refer to NEW – AWS Data Exchange for Amazon Redshift.
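
On the consumer side, a subscriber’s Amazon Redshift cluster can attach the share and query it directly. The following is a rough sketch using standard Amazon Redshift data sharing syntax; the database name, account ID, and namespace GUID are placeholders, and the real values come from SHOW DATASHARES once the subscription is active:

-- Run on the consumer cluster: discover the datashare, attach it as a local
-- database, and query the shared table using three-part notation.
SHOW DATASHARES;

CREATE DATABASE sales_share_db
FROM DATASHARE salesshare
OF ACCOUNT '123456789012' NAMESPACE 'producer-namespace-guid';

SELECT COUNT(*) FROM sales_share_db.tpch_sf1.customer;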

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the CloudFormation stack used to create the Redshift Auto Loader.
  2. Delete the Amazon Redshift cluster created for this demonstration.
  3. If you were using an existing cluster, drop the created external table and external schema.
  4. Delete the S3 bucket you created.
  5. Delete the Snowflake objects you created.

Conclusion

In this post, we demonstrated how you can set up a fully integrated process that continuously replicates data from Snowflake to Amazon Redshift and then uses Amazon Redshift to offer data to downstream clients over AWS Data Exchange. You can use the same architecture for other purposes, such as sharing data with other Amazon Redshift clusters within the same account, cross-accounts, or even cross-Regions if needed.


About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Ekta Ahuja is a Senior Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

Tahir Aziz is an Analytics Solution Architect at AWS. He has been building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms, and is passionate about helping customers build efficient, performant, and scalable analytics solutions.

Use Karpenter to speed up Amazon EMR on EKS autoscaling

Post Syndicated from Changbin Gong original https://aws.amazon.com/blogs/big-data/use-karpenter-to-speed-up-amazon-emr-on-eks-autoscaling/

Amazon EMR on Amazon EKS is a deployment option for Amazon EMR that allows organizations to run Apache Spark on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, the Spark jobs run on the Amazon EMR runtime for Apache Spark. This increases the performance of your Spark jobs so that they run faster and cost less than open source Apache Spark. Also, you can run Amazon EMR-based Apache Spark applications with other types of applications on the same EKS cluster to improve resource utilization and simplify infrastructure management.

Karpenter was introduced at AWS re:Invent 2021 to provide a dynamic, high-performance, open-source cluster auto scaling solution for Kubernetes. It automatically provisions new nodes in response to unschedulable pods. It observes the aggregate resource requests of unscheduled pods and makes decisions to launch new nodes and terminate them to reduce scheduling latencies as well as infrastructure costs.

To configure Karpenter, you create provisioners that define how Karpenter manages unschedulable pods and expires nodes. Although most use cases are addressed with a single provisioner, multiple provisioners are useful in multi-tenant use cases such as isolating nodes for billing, using different node constraints (such as no GPUs for a team), or using different deprovisioning settings. Karpenter launches nodes with minimal compute resources to fit unschedulable pods for efficient binpacking. It works in tandem with the Kubernetes scheduler to bind unschedulable pods to the new nodes that are provisioned. The following diagram illustrates how it works.

This post shows how to integrate Karpenter into your EMR on EKS architecture to achieve faster and capacity-aware auto scaling capabilities to speed up your big data and machine learning (ML) workloads while reducing costs. We run the same workload using both Cluster Autoscaler and Karpenter, to see some of the improvements we discuss in the next section.

Improvements compared to Cluster Autoscaler

Like Karpenter, Kubernetes Cluster Autoscaler (CAS) is designed to add nodes when requests come in to run pods that can’t be met by current capacity. Cluster Autoscaler is part of the Kubernetes project, with implementations by major Kubernetes cloud providers. By taking a fresh look at provisioning, Karpenter offers the following improvements:

  • No node group management overhead – Because you have different resource requirements for different Spark workloads along with other workloads in your EKS cluster, you need to create separate node groups that can meet your requirements, like instance sizes, Availability Zones, and purchase options. This can quickly grow to tens and hundreds of node groups, which adds additional management overhead. Karpenter manages each instance directly, without the use of additional orchestration mechanisms like node groups, taking a group-less approach by calling the EC2 Fleet API directly to provision nodes. This allows Karpenter to use diverse instance types, Availability Zones, and purchase options by simply creating a single provisioner, as shown in the following figure.
  • Quick retries – If Amazon Elastic Compute Cloud (Amazon EC2) capacity isn’t available, Karpenter can retry in milliseconds instead of minutes. This can be really useful if you’re using EC2 Spot Instances and are unable to get capacity for specific instance types.
  • Designed to handle the full flexibility of the cloud – Karpenter has the ability to efficiently address the full range of instance types available through AWS. Cluster Autoscaler wasn’t originally built with the flexibility to handle hundreds of instance types, Availability Zones, and purchase options. We recommend being as flexible as you can to enable Karpenter to get the just-in-time capacity you need.
  • Improves the overall node utilization by binpacking – Karpenter batches pending pods and then binpacks them based on CPU, memory, and GPUs required, taking into account node overhead (for example, daemon set resources required). After the pods are binpacked on the most efficient instance type, Karpenter takes other instance types that are similar to or larger than the most efficient packing, and passes the instance type options to an API called EC2 Fleet, following some of the best practices of instance diversification to improve the chances of getting the requested capacity.

Best practices using Karpenter with EMR on EKS

For general best practices with Karpenter, refer to Karpenter Best Practices. The following are additional things to consider with EMR on EKS:

  • Avoid inter-AZ data transfer cost by either configuring the Karpenter provisioner to launch in a single Availability Zone or use node selector or affinity and anti-affinity to schedule the driver and the executors of the same job to a single Availability Zone. See the following code:
    nodeSelector:
      topology.kubernetes.io/zone: us-east-1a

  • Cost optimize Spark workloads using EC2 Spot Instances for executors and On-Demand Instances for the driver by using the node selector with the label karpenter.sh/capacity-type in the pod templates. We recommend using pod templates to specify driver pods to run on On-Demand Instances and executor pods to run on Spot Instances. This allows you to consolidate provisioner specs because you don’t need two specs per job type. It also follows the best practice of using customization defined on workload types and to keep provisioner specs to support a broader number of use cases.
  • When using EC2 Spot Instances, maximize the instance diversification in the provisioner configuration to adhere to the best practices. To select suitable instance types, you can use the ec2-instance-selector, a CLI tool and go library that recommends instance types based on resource criteria like vCPUs and memory.

Solution overview

This post provides an example of how to set up both Cluster Autoscaler and Karpenter in an EKS cluster and compare the auto scaling improvements by running a sample EMR on EKS workload.

The following diagram illustrates the architecture of this solution.

We use the Transaction Processing Performance Council-Decision Support (TPC-DS), a decision support benchmark to sequentially run three Spark SQL queries (q70-v2.4, q82-v2.4, q64-v2.4) with a fixed number of 50 executors, against 17.7 billion records, approximately 924 GB compressed data in Parquet file format. For more details on TPC-DS, refer to the eks-spark-benchmark GitHub repo.

We submit the same job with different Spark driver and executor specs to mimic different jobs solely to observe the auto scaling behavior and binpacking. We recommend you right-size your Spark executors based on the workload characteristics for production workloads.

The following code is an example Spark configuration that results in pod spec requests of 4 vCPU and 15 GB:

--conf spark.executor.instances=50 --conf spark.driver.cores=4 --conf spark.driver.memory=10g --conf spark.driver.memoryOverhead=5g --conf spark.executor.cores=4 --conf spark.executor.memory=10g  --conf spark.executor.memoryOverhead=5g

We use pod templates to schedule Spark drivers on On-Demand Instances and executors on EC2 Spot Instances (which can save up to 90% over On-Demand Instance prices). Spark’s inherent resiliency has the driver launch new executors to replace the ones that fail due to Spot interruptions. See the following code:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  containers:
  - name: spark-kubernetes-executor


apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
  containers:
  - name: spark-kubernetes-driver

Prerequisites

We use an AWS Cloud9 IDE to run all the instructions throughout this post.

To create your IDE, run the following commands in AWS CloudShell. The default Region is us-east-1, but you can change it if needed.

# clone the repo
git clone https://github.com/black-mirror-1/karpenter-for-emr-on-eks.git
cd karpenter-for-emr-on-eks
./setup/create-cloud9-ide.sh

Navigate to the AWS Cloud9 IDE using the URL from the output of the script.

Install tools on the AWS Cloud9 IDE

Install the following tools required on the AWS Cloud9 environment by running a script:

Run the following instructions in your AWS Cloud9 environment and not CloudShell.

  1. Clone the GitHub repository:
    cd ~/environment
    git clone https://github.com/black-mirror-1/karpenter-for-emr-on-eks.git
    cd ~/environment/karpenter-for-emr-on-eks

  2. Set up the required environment variables. Feel free to adjust the following code according to your needs:
    # Install jq, envsubst (from GNU gettext), bash-completion, and moreutils
    sudo yum -y install jq gettext bash-completion moreutils
    
    # Setup env variables required
    export EKSCLUSTER_NAME=aws-blog
    export EKS_VERSION="1.23"
    # get the link to the same version as EKS from here https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html
    export KUBECTL_URL="https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.7/2022-06-29/bin/linux/amd64/kubectl"
    export HELM_VERSION="v3.9.4"
    export KARPENTER_VERSION="v0.18.1"
    # get the most recent matching version of the Cluster Autoscaler from here https://github.com/kubernetes/autoscaler/releases
    export CAS_VERSION="v1.23.1"

  3. Install the AWS Cloud9 CLI tools:
    cd ~/environment/karpenter-for-emr-on-eks
    ./setup/c9-install-tools.sh

Provision the infrastructure

We set up the required resources using the provision infrastructure script. Complete the following steps:

  1. Create the EMR on EKS and Karpenter infrastructure:
    cd ~/environment/karpenter-for-emr-on-eks
    ./setup/create-eks-emr-infra.sh

  2. Validate the setup:
    # Should have results that are running
    kubectl get nodes
    kubectl get pods -n karpenter
    kubectl get po -n kube-system -l app.kubernetes.io/instance=cluster-autoscaler
    kubectl get po -n prometheus

Understanding Karpenter configurations

Because the sample workload has driver and executor specs that are of different sizes, we have identified the instances from c5, c5a, c5d, c5ad, c6a, m4, m5, m5a, m5d, m5ad, and m6a families of sizes 2xlarge, 4xlarge, 8xlarge, and 9xlarge for our workload using the amazon-ec2-instance-selector CLI. With CAS, we need to create a total of 12 node groups, as shown in eksctl-config.yaml, but can define the same constraints in Karpenter with a single provisioner, as shown in the following code:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  provider:
    launchTemplate: {EKSCLUSTER_NAME}-karpenter-launchtemplate
    subnetSelector:
      karpenter.sh/discovery: {EKSCLUSTER_NAME}
  labels:
    app: kspark
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand","spot"]
    - key: "kubernetes.io/arch" 
      operator: In
      values: ["amd64"]
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: [c5, c5a, c5d, c5ad, m5, c6a]
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: [2xlarge, 4xlarge, 8xlarge, 9xlarge]
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["{AWS_REGION}a"]

  limits:
    resources:
      cpu: "2000"

  ttlSecondsAfterEmpty: 30

We have set up both auto scalers to scale down nodes that are empty for 30 seconds using ttlSecondsAfterEmpty in Karpenter and --scale-down-unneeded-time in CAS.

Karpenter by design will try to achieve the most efficient packing of the pods on a node based on CPU, memory, and GPUs required.

Run a sample workload

To run a sample workload, complete the following steps:

  1. Let’s review the AWS Command Line Interface (AWS CLI) command to submit a sample job:
    aws emr-containers start-job-run \
      --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
      --name karpenter-benchmark-${CORES}vcpu-${MEMORY}gb  \
      --execution-role-arn $EMR_ROLE_ARN \
      --release-label emr-6.5.0-latest \
      --job-driver '{
      "sparkSubmitJobDriver": {
          "entryPoint": "local:///usr/lib/spark/examples/jars/eks-spark-benchmark-assembly-1.0.jar",
          "entryPointArguments":["s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned","s3://'$S3BUCKET'/EMRONEKS_TPCDS-TEST-3T-RESULT-KA","/opt/tpcds-kit/tools","parquet","3000","1","false","q70-v2.4,q82-v2.4,q64-v2.4","true"],
          "sparkSubmitParameters": "--class com.amazonaws.eks.tpcds.BenchmarkSQL --conf spark.executor.instances=50 --conf spark.driver.cores='$CORES' --conf spark.driver.memory='$EXEC_MEMORY'g --conf spark.executor.cores='$CORES' --conf spark.executor.memory='$EXEC_MEMORY'g"}}' \
      --configuration-overrides '{
        "applicationConfiguration": [
          {
            "classification": "spark-defaults", 
            "properties": {
              "spark.kubernetes.node.selector.app": "kspark",
              "spark.kubernetes.node.selector.topology.kubernetes.io/zone": "'${AWS_REGION}'a",
    
              "spark.kubernetes.container.image":  "'$ECR_URL'/eks-spark-benchmark:emr6.5",
              "spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/pod-template/karpenter-driver-pod-template.yaml",
              "spark.kubernetes.executor.podTemplateFile": "s3://'$S3BUCKET'/pod-template/karpenter-executor-pod-template.yaml",
              "spark.network.timeout": "2000s",
              "spark.executor.heartbeatInterval": "300s",
              "spark.kubernetes.executor.limit.cores": "'$CORES'",
              "spark.executor.memoryOverhead": "'$MEMORY_OVERHEAD'G",
              "spark.driver.memoryOverhead": "'$MEMORY_OVERHEAD'G",
              "spark.kubernetes.executor.podNamePrefix": "karpenter-'$CORES'vcpu-'$MEMORY'gb",
              "spark.executor.defaultJavaOptions": "-verbose:gc -XX:+UseG1GC",
              "spark.driver.defaultJavaOptions": "-verbose:gc -XX:+UseG1GC",
    
              "spark.ui.prometheus.enabled":"true",
              "spark.executor.processTreeMetrics.enabled":"true",
              "spark.kubernetes.driver.annotation.prometheus.io/scrape":"true",
              "spark.kubernetes.driver.annotation.prometheus.io/path":"/metrics/executors/prometheus/",
              "spark.kubernetes.driver.annotation.prometheus.io/port":"4040",
              "spark.kubernetes.driver.service.annotation.prometheus.io/scrape":"true",
              "spark.kubernetes.driver.service.annotation.prometheus.io/path":"/metrics/driver/prometheus/",
              "spark.kubernetes.driver.service.annotation.prometheus.io/port":"4040",
              "spark.metrics.conf.*.sink.prometheusServlet.class":"org.apache.spark.metrics.sink.PrometheusServlet",
              "spark.metrics.conf.*.sink.prometheusServlet.path":"/metrics/driver/prometheus/",
              "spark.metrics.conf.master.sink.prometheusServlet.path":"/metrics/master/prometheus/",
              "spark.metrics.conf.applications.sink.prometheusServlet.path":"/metrics/applications/prometheus/"
             }}
        ]}'

  2. Submit four jobs with different driver and executor vCPUs and memory sizes on Karpenter:
    # the arguments are vcpus and memory
    export EMRCLUSTER_NAME=${EKSCLUSTER_NAME}-emr
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 4 7
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 8 15
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 4 15
    ./sample-workloads/emr6.5-tpcds-karpenter.sh 8 31 

  3. To monitor the pods’ auto scaling status in real time, open a new terminal in the Cloud9 IDE and run the following command (nothing is returned at the start):
    watch -n1 "kubectl get pod -n emr-karpenter"

  4. Observe the EC2 instance and node auto scaling status in a second terminal tab by running the following command (by design, Karpenter schedules in Availability Zone a):
    watch -n1 "kubectl get node --label-columns=node.kubernetes.io/instance-type,karpenter.sh/capacity-type,topology.kubernetes.io/zone,app -l app=kspark"

Compare with Cluster Autoscaler (Optional)

We have set up Cluster Autoscaler during the infrastructure setup step with the following configuration:

  • Launch EC2 nodes in Availability Zone b
  • Contain 12 node groups (6 each for On-Demand and Spot)
  • Scale down unneeded nodes after 30 seconds with --scale-down-unneeded-time
  • Use the least-waste expander on CAS, which can select the node group that will have the least idle CPU for binpacking efficiency
  1. Submit four jobs with different driver and executor vCPUs and memory sizes on CAS:
    # the arguments are vcpus and memory
    ./sample-workloads/emr6.5-tpcds-ca.sh 4 7
    ./sample-workloads/emr6.5-tpcds-ca.sh 8 15
    ./sample-workloads/emr6.5-tpcds-ca.sh 4 15
    ./sample-workloads/emr6.5-tpcds-ca.sh 8 31

  2. To monitor the pods’ auto scaling status in real time, open a new terminal in the Cloud9 IDE and run the following command (nothing is returned at the start):
    watch -n1 "kubectl get pod -n emr-ca"

  3. Observe the EC2 instance and node auto scaling status in a second terminal tab by running the following command (by design, CAS schedules in Availability Zone b):
    watch -n1 "kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone,app -l app=caspark"

Observations

On average, the time from pod creation to being scheduled is lower with Karpenter than with CAS, as shown in the following figure; you can see a noticeable difference when you run large-scale workloads.

As shown in the following figures, as the jobs were completed, Karpenter was able to scale down the nodes that aren’t needed within seconds. In contrast, CAS takes minutes, because it sends a signal to the node groups, adding additional latency. This in turn helps reduce overall costs by reducing the number of seconds unneeded EC2 instances are running.

Clean up

To clean up your environment, delete all the resources created in reverse order by running the cleanup script:

export EKSCLUSTER_NAME=aws-blog
cd ~/environment/karpenter-for-emr-on-eks
./setup/cleanup.sh

Conclusion

In this post, we showed you how to use Karpenter to simplify EKS node provisioning, and speed up auto scaling of EMR on EKS workloads. We encourage you to try Karpenter and provide any feedback by creating a GitHub issue.

Further reading


About the Authors

Changbin Gong is a Principal Solutions Architect at Amazon Web Services. He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Changbin enjoys reading, running, and traveling.

Sandeep Palavalasa is a Sr. Specialist Containers SA at Amazon Web Services. He is a software technology leader with over 12 years of experience in building large-scale, distributed software systems. His professional career started with a focus on monitoring and observability and he has a strong cloud architecture background. He likes working on distributed systems and is excited to talk about microservice architecture design. His current interests are in the areas of container services and serverless technologies.

Query expansion based on user behaviour

Post Syndicated from Grab Tech original https://engineering.grab.com/query-expansion-based-on-user-behaviour

Introduction

Our consumers used to face a few common pain points while searching for food with the Grab app. Sometimes, the results would include merchants that were not yet operational or locations that were out of the delivery radius. Other times, no alternatives were provided. The search system would also have difficulties handling typos, keywords in different languages, synonyms, and even word spacing issues, resulting in a suboptimal user experience.

Over the past few months, our search team has been building a query expansion framework that can solve these issues. When a user query comes in, it expands the query to a few related keywords based on semantic relevance and user intention. These expanded words are then searched with the original query to recall more results that are high-quality and diversified. Now let’s take a deeper look at how it works.

Query expansion framework

Building the query expansion corpus

We used two different approaches to produce query expansion candidates: manual annotation for top keywords and data mining based on user rewrites.

Manual annotation for top keywords

Search traffic has a pronounced fat-head distribution: the most frequent thousand keywords account for more than 70% of total search traffic. Handling these keywords well can therefore substantially improve overall search quality. We manually annotated the possible expansion candidates for these common keywords to cover the most popular merchants, items, and alternatives. For instance, “McDonald’s” is annotated with {“burger”, “western”}.

Data mining based on user rewrites

We observed that sometimes users tend to rewrite their queries if they are not satisfied with the search result. As a pilot study, we checked the user rewrite records within some user search sessions and found several interesting samples:

{Ya Kun Kaya Toast,Starbucks}
{healthy,Subway}
{Muni,Muji}
{奶茶,koi}
{Roti,Indian}

We can see that besides spelling corrections, users’ rewrite behaviour also reveals deep semantic relations between these pairs that cannot be easily captured by lexical similarity, such as similar merchants, merchant attributes, language differences, cuisine types, and so on. We can leverage the user’s knowledge to build a query expansion corpus to improve the diversity of the search result and user experience. Furthermore, we can use the wisdom of the crowd to find some common patterns with higher confidence.

Based on this intuition, we leveraged the high volume of search click data available in Grab to generate high-quality expansion pairs at the user session level. To augment the original queries, we collected rewrite pairs that occurred across multiple users and multiple times within a time period. Specifically, we used the heuristic rules below to collect the rewrite pairs (a simplified sketch follows the list):

  • Select the sessions where there are at least two distinct queries (rewrite session)
  • Collect adjacent query pairs in the search session where the second query leads to a click but the first does not (effective rewrite)
  • Filter out pairs with more than 30 seconds between the two queries, as users are more likely to have changed their mind about what to look for in these cases (single intention)
  • Count the occurrences and filter out the low-frequency pairs (confidence management)
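
To make the rules concrete, here is a minimal sketch of how such mining could be implemented; the session data shape, thresholds, and function name are our own illustrative assumptions rather than the production pipeline:

from collections import Counter

def mine_rewrite_pairs(sessions, max_gap_seconds=30, min_count=5):
    """sessions: one list per search session of (timestamp, query, clicked) tuples, sorted by time."""
    counts = Counter()
    for events in sessions:
        if len({q for _, q, _ in events}) < 2:        # rewrite session: at least two distinct queries
            continue
        for (t1, q1, clicked1), (t2, q2, clicked2) in zip(events, events[1:]):
            if q1 == q2:
                continue
            # effective rewrite with a single intention: second query clicked, first not, short gap
            if clicked2 and not clicked1 and (t2 - t1) <= max_gap_seconds:
                counts[(q1, q2)] += 1
    # confidence management: keep only pairs that occur frequently enough
    return {pair: n for pair, n in counts.items() if n >= min_count}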

After we have the mining pairs, we categorised and annotated the rewrite types to gain a deeper understanding of the user’s rewrite behaviour. A few samples mined from the Singapore area data are shown in the table below.

Original query | Rewrite query | Frequency in a month | Distinct user count | Type
playmade by 丸作 | playmade | 697 | 666 | Drop keywords
mcdonald’s | burger | 573 | 535 | Merchant -> Food
Bubble tea | koi | 293 | 287 | Food -> Merchant
Kfc | McDonald’s | 238 | 234 | Merchant -> Merchant
cake | birthday cake | 206 | 205 | Add words
麦当劳 | mcdonald’s | 205 | 199 | Locale change
4 fingers | 4fingers | 165 | 162 | Space correction
krc | kfc | 126 | 124 | Spelling correction
5 guys | five guys | 120 | 120 | Number synonym
koi the | koi thé | 45 | 44 | Tone change

We further computed the percentages of some categories, as shown in the figure below.

Figure 1. The donut chart illustrates the percentages of the distinct user counts for different types of rewrites.

Apart from adding words, dropping words, and spelling corrections, a significant portion of the rewrites fall into the Other category. These are more semantically driven, such as merchant to merchant or merchant to cuisine. Such rewrites capture deeper connections between queries and act as a powerful diversifier for query expansion.

Grouping

After the rewrite pairs were discovered offline through data mining, we grouped them by the original query to obtain the expansion candidates for each query. For serving efficiency, we limited the maximum number of expansion candidates to three.

Query expansion serving

Expansion matching architecture

The expansion matching architecture benefits from the recent search architecture upgrade, which changed the system flow to query understanding, multi-recall, and result fusion. A query first goes through the query understanding module and gets augmented with additional information. In this case, the module takes in the keyword and expands it to multiple synonyms; for example, KFC is expanded to fried chicken. The original query together with its expansions is then sent to the search engine under the multi-recall framework. After that, results from the multiple recallers with different keywords are fused together.
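
As a rough illustration of this flow, the serving logic can be thought of as the following sketch; the function and field names here are hypothetical simplifications, not the actual service interfaces:

def expand_and_recall(query, expansion_table, search_fn, max_expansions=3):
    """expansion_table: maps a normalised query to its expansion candidates.
    search_fn: calls the search engine for one keyword and returns a list of results (with an id field)."""
    expansions = expansion_table.get(query.lower(), [])[:max_expansions]
    fused, seen = [], set()
    for keyword in [query] + expansions:       # original query first, then its expansions
        for item in search_fn(keyword):
            if item.id not in seen:            # fuse recalls and de-duplicate
                seen.add(item.id)
                fused.append(item)
    return fused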

Continuous monitoring and feedback loop

It’s important to make sure the expansion pairs stay relevant and up to date. We run the data mining pipeline periodically to capture new user rewrite behaviours. Meanwhile, we monitor each expansion pair’s contribution to the search result by measuring the net recall or user interaction that the expanded query brings, and automatically eliminate obsolete pairs. This keeps the system adaptive.
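
Conceptually, the pruning step looks like the following sketch, where the contribution scores and threshold are placeholders for whatever net recall or interaction metric is measured in production:

def prune_expansions(expansion_table, contribution, min_net_contribution=0.0):
    """contribution: maps (original query, expansion) to its measured net uplift."""
    pruned = {}
    for query, candidates in expansion_table.items():
        kept = [c for c in candidates
                if contribution.get((query, c), 0.0) > min_net_contribution]
        if kept:
            pruned[query] = kept
    return pruned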

Results

We conducted online A/B experiments across 6 countries in Southeast Asia to evaluate the expanded queries generated by our system. We set up 3 groups:

  • Control group, where no query is expanded.
  • Treatment group 1, where we expanded the queries based on manual annotations only.
  • Treatment group 2, where we expanded the queries using the data mining approach.

We observed a decent uplift in click-through rate and conversion rate from both treatment groups. Furthermore, treatment group 2, which used the data mining approach, produced even better results.

Future work

Data mining enhancement

Currently, the data mining approach can only identify the pairs from the same search session by one user. This limits the number of linked pairs. Some potential enhancements include:

  • Augment expansion pairs by associating queries from different users who click on the same merchant/item, for example, using a click graph. This can capture relevant queries across user sessions.
  • Build a probabilistic model on top of the current transition pairs. Currently, all the transition pairs are equally weighted, but transitions that happen more often should intuitively carry higher probability/weight.

Ads application

Query expansion can be applied to advertising and would increase ads fill rate. With “KFC” expanded to “fried chicken”, the sponsored merchants who buy the keyword “fried chicken” would be eligible to show up when the user searches “KFC”. This would enable Grab to provide more relevant sponsored content to our users, which helps not only the consumers but also the merchants.

Special thanks to Zhengmin Xu and Daniel Ng for proofreading this article.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Your guide to streaming data & real-time analytics at re:Invent 2022

Post Syndicated from Anna Montalat original https://aws.amazon.com/blogs/big-data/your-guide-to-streaming-data-real-time-analytics-at-reinvent-2022/

Mark your calendars for November 28 through December 2, 2022 to attend AWS re:Invent in Las Vegas – a learning conference hosted by AWS for the global cloud computing community.

To maximize the value of your data, you need to act upon it in real time, instead of waiting for hours, days, or weeks. AWS streaming data services offer end-to-end capabilities to build real-time streaming data pipelines and applications so you can maximize the value of your data and act upon it in real time. You can use Kinesis Data Streams, Kinesis Video Streams, and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to collect and store data streams at scale; Kinesis Data Firehose to load real-time streams into data lakes, warehouses, and analytics services; and Kinesis Data Analytics to analyze streaming data in real time using Apache Flink. With streaming data architectures, customers can analyze data as soon as it is produced, get timely insights, and make real-time decisions to capitalize on opportunities, enhance customer experiences, prevent networking failures, or update critical business metrics, just to name a few. This post walks you through the key sessions on streaming data and analytics that you cannot miss this year at re:Invent to help you plan your conference week accordingly.

To access the session catalog and reserve your seat for one of our streaming data and analytics sessions, you must be registered for re:Invent. Register now!

Keynotes and leadership sessions you cannot miss!

Speakers have always been a key piece of the re:Invent puzzle. This year is no different, and you’ll have the chance to hear from some of the leading voices at AWS.

Adam Selipsky, Chief Executive Officer of Amazon Web Services – Keynote

Tuesday November 29 | 8:30 AM – 10:30 AM PST | The Venetian

Join Adam Selipsky, CEO of Amazon Web Services, as he looks at the ways that forward-thinking builders are transforming industries and even our future, powered by AWS. He highlights innovations in data, infrastructure, security, and more that are helping customers achieve their goals faster, take advantage of untapped potential, and create a better future with AWS.

Reserve your seat now!

Swami Sivasubramanian, Vice President of AWS Data and Machine Learning – Keynote

Wednesday November 30 | 8:30 AM – 10:30 AM PST | The Venetian

Join Swami Sivasubramanian, Vice President of AWS Data and Machine Learning, as he unveils some of the latest AWS innovations, designed to help you transform data into meaningful insights. Hear from leading AWS customers who are using data to bring new experiences to life for their customers.

Reserve your seat now!

AWS storage innovations at exabyte scale – Leadership session

Tuesday November 29 | 11:00 AM – 12:00 PM PST | The Venetian

Data is the change agent driving digital transformation. In this session, Mai-Lan Tomsen Bukovec, AWS Tech VP, and Andy Warfield, AWS Distinguished Engineer, share the latest AWS storage innovations and an inside look at how customers drive modern business on data lakes and with high-performance data.

Reserve your seat now!

Unlock the value of your data with AWS analytics – Leadership session

Wednesday November 30 | 2:30 PM – 3:30 PM PST | The Venetian

Data fuels digital transformation and drives effective business decisions. In this session, G2 Krishnamoorthy, VP of AWS Analytics, addresses the current state of analytics on AWS, covers the latest service innovations around data, and highlights customer successes with AWS analytics.

Reserve your seat now!

Customer sessions

Join our customer sessions to learn first-hand how other organizations are maximizing the value of their data with real-time streaming data architectures, enabling them to untap new business opportunities, enhance processes, and deliver delightful customer experiences.

  • How Riot Games processes 20 TB of analytics data daily on AWS – Riot Games ingests about 20 TB of data every day on AWS. This data powers a wide range of services, including game matchmaking, in-game personalization, analytics, security, and player behavior management. Join this session to learn how Riot Games transformed their data ingestion pipeline to query data from 6 hours after it was produced down to just 5 minutes. Reserve your seat now!
  • How Samsung modernized architecture for real-time analytics – In this session, Samsung SmartThings shares how they modernized their streaming data analytics architecture for real-time analytics. Originally, Samsung developers spent most of their time managing infrastructure. After migrating to Amazon Kinesis Data Analytics, developers were able to focus on delivering business value without needing to worry about infrastructure management. Reserve your seat now!
  • Leveling up computer vision and artificial intelligence development – Seeing is believing, and Kami Vision is proof! In this session, Kami Vision speaks to how they utilized Amazon Kinesis Video Streams to do the undifferentiated video lifting so that they could develop KamiCare fall detection—an accurate way to monitor if a person has fallen to the floor and cannot get up. Reserve your seat now!
  • How Sony Orchard accelerated innovation with Amazon MSK – The Orchard, a subsidiary of Sony Music Entertainment, built a high-performing data synchronization solution using Amazon Managed Streaming for Apache Kafka (Amazon MSK). Learn how their data synchronization and search capabilities improved using this solution. Reserve your seat now!
  • How Poshmark accelerates growth via real-time analytics & personalization – Find out how Poshmark designed real-time personalization using real-time event capture to deliver tailored customer experiences, reduce security risks, and enable end-users to more confidently interact with the Poshmark app. Reserve your seat now!
  • Building and operating at scale with feature management (sponsored by LaunchDarkly) – LaunchDarkly customers deliver software applications that support millions of end-users at any given time. They rely on LaunchDarkly to launch, control, and measure those applications in real time without negative customer impact. In this session, we’ll discuss key architecture decisions and LaunchDarkly best practices. Reserve your seat now!

Breakout sessions

AWS re:Invent breakout sessions are lecture-style and one hour long. These sessions take place across the re:Invent campus and cover all topics at all levels.

  • What’s new in AWS streaming – Streaming data and analytics help your business make real-time contextual decisions, deliver personalized customer experiences, and untap new opportunities in real time. Join us to find out about the latest innovations in the AWS streaming portfolio. Reserve your seat now!
  • Build a managed analytics platform for your ecommerce business – With the increase in popularity of online shopping, building an analytics platform for ecommerce is important for any organization because it provides insights about the business, trends, and customer behavior. Join us to learn how to build a complete analytics platform in batch and real-time mode. Reserve your seat now!
  • Publishing real-time financial data feeds using Kafka – This session describes how to offer a real-time financial data feed as a service on AWS. With Amazon MSK, you can use Kafka to allow your customers to subscribe to message topics containing the financial data of interest. We will cover connectivity best practices for scalability, security options for a secure architecture, and lessons learned from customers that are using AWS to distribute financial data on AWS. Reserve your seat now!
  • Interact with streaming data using Amazon Kinesis Data Analytics Studio – Join us in this theater session to learn how analyzing streaming data provides the timely, actionable insights a business needs to grow. Reserve your seat now!

Chalk talks

Chalk talks are a highly interactive content format with a small audience. Each begins with a short lecture delivered by an AWS expert followed by a Q&A session with the audience.

  • Modern data exchange using AWS data streaming – We’ll explore how different systems sync low-latency data changes using Apache Hudi backed by Amazon Simple Storage Service (Amazon S3) in a data mesh architecture. This modern architecture allows developers to build streaming jobs that read, join, and aggregate data from multiple datasets and sync data changes to downstream data stores. Reserve your seat now!
  • Build a serverless streaming workload with Amazon Kinesis – Collecting, processing, and analyzing streaming data is easy with Amazon Kinesis services. Make plans for this chalk talk that will take your streaming capabilities to the next level. Reserve your seat now!

Workshops

Workshops are two-hour hands-on sessions where you work in teams to solve problems using AWS services. Workshops organize attendees into small groups and provide scenarios to encourage interaction, giving you the opportunity to learn from and teach each other. Don’t forget to bring your laptop!

  • Building a serverless Apache Kafka data pipeline – Serverless means “focus on what matters”! In this workshop, we’ll show how you can build a serverless data pipeline using Amazon MSK Serverless, deploy a Kafka client container-based AWS Lambda function, and much more! Reserve your seat now!
  • Event detection with Amazon MSK and Amazon Kinesis Data Analytics – When in Las Vegas, you do as Las Vegans do! In this workshop, you’ll see how casinos use Amazon MSK, Amazon Kinesis Data Analytics Studio, and AWS Lambda to enhance customer experiences. Reserve your seat now!
  • Build smart camera applications using Amazon Kinesis Video Streams WebRTC – Amazon Kinesis Video Streams WebRTC helps users to easily build low-latency video solutions such as smart doorbells, connected vehicles, surveillance cameras, and more. Join this workshop for hands-on experience building a complete real-world video solution, including setting up a device with a camera to transmit video. Reserve your seat now!

Fun, fun, and more fun!

All work and no play … not at re:Invent! Sure, we’ll work hard and learn a lot, but we also plan to have a great time while we’re together. Our gamified learning sessions will give you real-life learning opportunities through interactive events that promise to be fun and entertaining!

The fun continues with AWS Builder Labs, where you’ll have the opportunity to test your skills in sandbox settings while working alongside some of the leading minds from AWS!

And don’t forget to visit the Analytics kiosk within the AWS Village to meet with experts to dive deeper into AWS streaming data services such as Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics and Amazon MSK. You will be able to ask our experts questions and experience live demos for our newly launched capabilities. Make sure to stop by the swag distribution table to grab free Analytics swag if you have attended either the Analytics kiosk or one of our breakout sessions, chalk talks, or workshops.

Register today

Keep your eyes on this post for more updates and exciting news. It’s going to be a simply amazing event and we can’t wait to see you at re:Invent 2022, the world’s premier tech event! Register now to secure your spot!


About the author

Anna Montalat is a Senior Product Marketing Manager for AWS streaming data services which includes Amazon Managed Streaming for Apache Kafka (MSK), Kinesis Data Streams, Kinesis Video Streams, Kinesis Data Firehose, and Kinesis Data Analytics. She is passionate about bringing new and emerging technologies to market, working closely with service teams and enterprise customers. Outside of work, Anna skis through winter time and sails through summer.

Use an event-driven architecture to build a data mesh on AWS

Post Syndicated from Jan Michael Go Tan original https://aws.amazon.com/blogs/big-data/use-an-event-driven-architecture-to-build-a-data-mesh-on-aws/

In this post, we take the data mesh design discussed in Design a data mesh architecture using AWS Lake Formation and AWS Glue, and demonstrate how to initialize data domain accounts to enable managed sharing; we also go through how we can use an event-driven approach to automate processes between the central governance account and data domain accounts (producers and consumers). We build a data mesh pattern from scratch as Infrastructure as Code (IaC) using AWS CDK and use an open-source self-service data platform UI to share and discover data between business units.

The key advantage of this approach is being able to add actions in response to data mesh events such as permission management, tag propagation, search index management, and to automate different processes.

Before we dive into it, let’s look at AWS Analytics Reference Architecture, an open-source library that we use to build our solution.

AWS Analytics Reference Architecture

AWS Analytics Reference Architecture (ARA) is a set of analytics solutions put together as end-to-end examples. It brings together AWS best practices for designing, implementing, and operating analytics platforms through different purpose-built patterns, handling common requirements, and solving customers’ challenges.

ARA exposes reusable core components in an AWS CDK library, currently available in TypeScript and Python. This library contains AWS CDK constructs (L3) that can be used to quickly provision analytics solutions in demos, prototypes, proofs of concept, and end-to-end reference architectures.

The following table lists data mesh specific constructs in the AWS Analytics Reference Architecture library.

Construct Name | Purpose
CentralGovernance | Creates an Amazon EventBridge event bus for the central governance account that is used to communicate with data domain accounts (producer/consumer). Creates workflows to automate data product registration and sharing.
DataDomain | Creates an Amazon EventBridge event bus for a data domain account (producer/consumer) to communicate with the central governance account. It creates data lake storage (Amazon S3) and a workflow to automate data product registration. It also creates a workflow to populate AWS Glue Catalog metadata for newly registered data products.

You can find AWS CDK constructs for the AWS Analytics Reference Architecture on Construct Hub.

In addition to ARA constructs, we also use an open-source Self-service data platform (User Interface). It is built using AWS Amplify, Amazon DynamoDB, AWS Step Functions, AWS Lambda, Amazon API Gateway, Amazon EventBridge, Amazon Cognito, and Amazon OpenSearch. The frontend is built with React. Through the self-service data platform you can: 1) manage data domains and data products, and 2) discover and request access to data products.

Central Governance and data sharing

For the governance of our data mesh, we will use AWS Lake Formation. AWS Lake Formation is a fully managed service that simplifies data lake setup, supports centralized security management, and provides transactional access on top of your data lake. Moreover, it enables data sharing across accounts and organizations. This centralized approach has a number of key benefits, such as: centralized audit; centralized permission management; and centralized data discovery. More importantly, this allows organizations to gain the benefits of centralized governance while taking advantage of the inherent scaling characteristics of decentralized data product management.

There are two ways to share data resources in Lake Formation: 1) Named Resource Access Control (NRAC), and 2) Tag-Based Access Control (LF-TBAC). NRAC uses AWS Resource Access Manager (AWS RAM) to share data resources across accounts; these are consumed via resource links that are based on the created resource shares. LF-TBAC is another approach to sharing data resources in AWS Lake Formation that defines permissions based on attributes, called LF-Tags. You can read this blog to learn about LF-TBAC in the context of data mesh.
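
For illustration, an LF-TBAC grant to a consumer account can be expressed in a few lines of boto3; the account ID, tag key, and tag value below are placeholders, and the solution itself drives these grants through its workflows rather than ad hoc scripts:

import boto3

lakeformation = boto3.client("lakeformation")

# Grant SELECT on all tables carrying the LOB tag of data domain N to a consumer account
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "111122223333"},  # consumer account (placeholder)
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "LOB", "TagValues": ["domain-n"]}],
        }
    },
    Permissions=["SELECT"],
)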

The following diagram shows how NRAC and LF-TBAC data sharing work. In this example, a data domain is registered as a node on the mesh, and therefore we create two databases in the central governance account. The NRAC database is shared with the data domain via AWS RAM; access to data products that we register in this database is handled through NRAC. The LF-TBAC database is tagged with the data domain N line of business (LOB) LF-Tag: <LOB:N>. The LOB tag is automatically shared with the data domain N account, which makes the database available in that account; access to data products in this database is handled through LF-TBAC.


In our solution, we demonstrate both the NRAC and LF-TBAC approaches. With the NRAC approach, we build an event-based workflow that automatically accepts the RAM share in the data domain accounts and automates the creation of the necessary metadata objects (for example, local database, resource links). With the LF-TBAC approach, we rely on permissions associated with the shared LF-Tags to allow producer data domains to manage their data products, and to give consumer data domains read access to the data products associated with the LF-Tags they requested access to.

We use the CentralGovernance construct from the ARA library to build the central governance account. It creates an EventBridge event bus to enable communication with the data domain accounts that register as nodes on the mesh. For each registered data domain, specific event bus rules are created that route events towards that account. The central governance account has a central metadata catalog that allows data to be stored in different data domains, as opposed to a single central lake. For each registered data domain, we create two separate databases in the central governance catalog to demonstrate both NRAC and LF-TBAC data sharing. The CentralGovernance construct creates workflows for data product registration and data product sharing. We also deploy a self-service data platform UI to enable a good user experience for managing data domains and data products, and to simplify data discovery and sharing.


A data domain: producer and consumer

We use the DataDomain construct from the ARA library to build a data domain account that can be a producer, a consumer, or both. Producers manage the lifecycle of their respective data products in their own AWS accounts. Typically, this data is stored in Amazon Simple Storage Service (Amazon S3). The DataDomain construct creates data lake storage with a cross-account bucket policy that enables the central governance account to access the data. Data is encrypted using AWS KMS, and the central governance account has permission to use the key. A config secret in AWS Secrets Manager contains all the necessary information to register the data domain as a node on the mesh in central governance. It includes: 1) the data domain name, 2) the S3 location that holds data products, and 3) the encryption key ARN. The DataDomain construct also creates data domain and crawler workflows to automate data product registration.


Creating an event-driven data mesh

Data mesh architectures typically require some level of communication and trust policy management to maintain least-privilege access for the relevant principals between the different accounts (for example, central governance to producer, central governance to consumer). We use an event-driven approach via EventBridge to securely forward events from an event bus in one account to an event bus in another account while maintaining least-privilege access. When we register a data domain with the central governance account through the self-service data platform UI, we establish bi-directional communication between the accounts via EventBridge. The domain registration process also creates a database in the central governance catalog to hold data products for that particular domain. The registered data domain is now a node on the mesh and we can register new data products.
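
As a simplified sketch of the forwarding rules created during registration, the central bus can target the data domain’s event bus directly; the bus names, ARNs, and event pattern below are illustrative placeholders (the solution creates these rules through its CDK constructs):

import boto3

events = boto3.client("events")

# Rule on the central governance bus that matches events destined for data domain N
events.put_rule(
    Name="forward-to-domain-n",
    EventBusName="central-governance-bus",
    EventPattern='{"source": ["com.central.governance"], "detail": {"domain": ["domain-n"]}}',
    State="ENABLED",
)

# The target is the event bus in the data domain account; RoleArn must allow events:PutEvents on it
events.put_targets(
    Rule="forward-to-domain-n",
    EventBusName="central-governance-bus",
    Targets=[{
        "Id": "domain-n-bus",
        "Arn": "arn:aws:events:us-east-1:444455556666:event-bus/data-domain-bus",
        "RoleArn": "arn:aws:iam::111122223333:role/central-to-domain-forwarding",
    }],
)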

The following diagram shows the data product registration process:


  1. Registration starts the Register Data Product workflow, which creates an empty table (the schema is managed by the producers in their respective producer account). This workflow also grants a cross-account permission to the producer account that allows the producer to manage the schema of the table.
  2. When complete, this emits an event into the central event bus.
  3. The central event bus contains a rule that forwards the event to the producer’s event bus. This rule was created during the data domain registration process.
  4. When the producer’s event bus receives the event, it triggers the Data Domain workflow, which creates resource-links and grants permissions.
  5. Still in the producer account, the Crawler workflow is triggered when the Data Domain workflow state changes to Successful. It creates the crawler, runs it, waits for it to finish, and deletes it when it’s complete. This workflow is responsible for populating the tables’ schemas.

Now other data domains can find the newly registered data products using the self-service data platform UI and request access. The sharing process works in the same way as product registration: events are sent from the central governance account to the consumer data domain, triggering specific workflows.

Solution Overview

The following high-level solution diagram shows how everything fits together and how event-driven architecture enables multiple accounts to form a data mesh. You can follow the workshop that we released to deploy the solution that we covered in this blog post. You can deploy multiple data domains and test both data registration and data sharing. You can also use self-service data platform UI to search through data products and request access using both LF-TBAC and NRAC approaches.


Conclusion

Implementing a data mesh on top of an event-driven architecture provides both flexibility and extensibility. A data mesh by itself has several moving parts to support various functionalities, such as onboarding, search, access management and sharing, and more. With an event-driven architecture, we can implement these functionalities in smaller components to make them easier to test, operate, and maintain. Future requirements and applications can use the event stream to provide their own functionality, making the entire mesh much more valuable to your organization.

To learn more how to design and build applications based on event-driven architecture, see the AWS Event-Driven Architecture page. To dive deeper into data mesh concepts, see the Design a Data Mesh Architecture using AWS Lake Formation and AWS Glue blog.

If you’d like our team to run a data mesh workshop with you, please reach out to your AWS team.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.

Dzenan Softic is a Senior Solutions Architect at AWS. He works with startups to help them define and execute their ideas. His main focus is in data engineering and infrastructure.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.
Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers’ analytics, NoSQL, and streaming challenges. He has strong expertise in distributed data processing engines and resource orchestration platforms.

Build an optimized self-service interactive analytics platform with Amazon EMR Studio

Post Syndicated from Pablo Redondo Sanchez original https://aws.amazon.com/blogs/big-data/build-an-optimized-self-service-interactive-analytics-platform-with-amazon-emr-studio/

Data engineers and data scientists depend on distributed data processing infrastructure like Amazon EMR to perform data processing and advanced analytics jobs on large volumes of data. In most mid-size and enterprise organizations, cloud operations teams own procuring, provisioning, and maintaining the IT infrastructure, and their objectives and best practices differ from those of the data engineering and data science teams. Enforcing infrastructure best practices and governance controls presents interesting challenges for analytics teams:

  • Limited agility – Designing and deploying a cluster with the required networking, security, and monitoring configuration requires significant expertise in cloud infrastructure. This creates a high dependency on operations teams for simple experimentation and development tasks, and it typically takes weeks or months to deploy an environment.
  • Security and performance risks – Experimentation and development activities typically require sharing existing environments with other teams, which presents security and performance risks due to lack of workload isolation.
  • Limited collaboration – The security complexity of running shared environments and the lack of a shared web UI limits the analytics team’s ability to share and collaborate during development tasks.

To promote experimentation and solve the agility challenge, organizations need to reduce deployment complexity and remove dependencies to cloud operations teams while maintaining guardrails to optimize cost, security, and resource utilization. In this post, we walk you through how to implement a self-service analytics platform with Amazon EMR and Amazon EMR Studio to improve the agility of your data science and data engineering teams without compromising on the security, scalability, resiliency, and cost efficiency of your big data workloads.

Solution overview

A self-service data analytics platform with Amazon EMR and Amazon EMR Studio provides the following advantages:

  • It’s simple to launch and access for data engineers and data scientists.
  • The robust integrated development environment (IDE) is interactive, makes data easy to explore, and provides all the tooling necessary to debug, build, and schedule data pipelines.
  • It enables collaboration for analytics teams with the right level of workload isolation for added security.
  • It removes dependency from cloud operations teams by allowing administrators within each analytics organization to self-provision, scale, and de-provision resources from within the same UI, without exposing the complexities of the EMR cluster infrastructure and without compromising on security, governance, and costs.
  • It simplifies moving from prototyping into a production environment.
  • Cloud operations teams can independently manage EMR cluster configurations as products and continuously optimize for cost and improve the security, reliability, and performance of their EMR clusters.

Amazon EMR Studio is a web-based IDE that provides fully managed Jupyter notebooks where teams can develop, visualize, and debug applications written in R, Python, Scala, and PySpark, and tools such as Spark UI to provide an interactive development experience and simplify debugging of jobs. Data scientists and data engineers can directly access Amazon EMR Studio through a single sign-on enabled URL and collaborate with peers using these notebooks within the concept of an Amazon EMR Studio Workspace, version code with repositories such as GitHub and Bitbucket, or run parameterized notebooks as part of scheduled workflows using orchestration services. Amazon EMR Studio notebook applications run on EMR clusters, so you get the benefit of a highly scalable data processing engine using the performance optimized Amazon EMR runtime for Apache Spark.

The following diagram illustrates the architecture of the self-service analytics platform with Amazon EMR and Amazon EMR Studio.

Self Service Analytics Architecture

Cloud operations teams can assign one Amazon EMR Studio environment to each team for isolation, and provision Amazon EMR Studio developer and administrator users within each team. Cloud operations teams have full control over the permissions each Amazon EMR Studio user has via Amazon EMR Studio permissions policies, and control the EMR cluster configurations that Amazon EMR Studio administrators can deploy via cluster templates. Amazon EMR Studio administrators within each team can assign workspaces to each developer and attach them to existing EMR clusters or, if allowed, self-provision EMR clusters from predefined templates. Each workspace is a serverless Jupyter instance with notebook files backed up continuously into an Amazon Simple Storage Service (Amazon S3) bucket. Users can attach to or detach from provisioned EMR clusters, and you only pay for the EMR cluster compute capacity used.

Cloud operations teams organize EMR cluster configurations as products within AWS Service Catalog. In AWS Service Catalog, EMR cluster templates are organized as products in a portfolio that you share with Amazon EMR Studio users. Templates hide the complexities of the infrastructure configuration and can have custom parameters to allow for further optimization based on the workload requirements. After you publish a cluster template, Amazon EMR Studio administrators can launch new clusters and attach them to new or existing workspaces within Amazon EMR Studio without a dependency on cloud operations teams. This makes it easier for teams to test upgrades, share predefined templates across teams, and allow analytics users to focus on achieving business outcomes.

The following diagram illustrates the decoupling architecture.

Decoupling Architecture

You can decouple the definition of the EMR clusters configurations as products and enable independent teams to deploy serverless workspaces and attach self-provisioned EMR clusters within Amazon EMR Studio in minutes. This enables organizations to create an agile and self-service environment for data processing and data science at scale while maintaining the proper level of security and governance.

As a cloud operations engineer, the main task is making sure your templates follow proper cluster configurations that are secure, run at optimal cost, and are easy to use. The following sections discuss key recommendations for security, cost optimization, and ease of use when defining EMR cluster templates for use within Amazon EMR Studio. For additional Amazon EMR best practices, refer to the EMR Best Practices Guide.

Security

Security is mission critical for any data science and data prep workload. Ensure you follow these recommendations:

  • Team-based isolation – Maintain workload isolation by provisioning an Amazon EMR Studio environment per team and a workspace per user.
  • Authentication – Use AWS IAM Identity Center (successor for AWS Single Sign-On) or federated access with AWS Identity and Access Management (IAM) to centralize user management.
  • Authorization – Set fine-grained permissions within your Amazon EMR Studio environment. Set limited (1–2) users with the Amazon EMR Studio admin role to allow workspace and cluster provisioning. Most data engineers and data scientists will have a developer role. For more information on how to define permissions, refer to Configure EMR Studio user permissions.
  • Encryption – When defining your cluster configuration templates, ensure encryption is enforced both in transit and at rest. For example, traffic to and from the data lake should use the latest version of TLS, and data at rest should be encrypted with AWS Key Management Service (AWS KMS) for Amazon S3, Amazon Elastic Block Store (Amazon EBS), and Amazon Relational Database Service (Amazon RDS).

Cost

To optimize cost of your running EMR cluster, consider the following cost-optimization options in your cluster templates:

  • Use EC2 Spot Instances – Spot Instances let you take advantage of unused Amazon Elastic Compute Cloud (Amazon EC2) capacity in the AWS Cloud and offer up to a 90% discount compared to On-Demand prices. Spot is best suited for workloads that can be interrupted or have flexible SLAs, like testing and development workloads.
  • Use instance fleets – Use instance fleets when using EC2 Spot to increase the likelihood of Spot availability. An instance fleet is a group of EC2 instances that host a particular node type (primary, core, or task) in an EMR cluster. Because instance fleets can consist of a mix of instance types, both On-Demand and Spot, this will increase the likelihood of Spot Instance availability when reaching your target capacity. Consider at least 10 instance types across all Availability Zones.
  • Use Spark cluster mode and ensure that application masters run on On-Demand nodes – The application master (AM) is the main container launching and monitoring the application executors. Therefore, it’s important to ensure the AM is as resilient as possible. In an Amazon EMR Studio environment, you can expect users running multiple applications concurrently. In cluster mode, your Spark applications can run as independent sets of processes spread across your worker nodes within the AMs. By default, an AM can run on any of the worker nodes. Modify the behavior to ensure AMs run only in On-Demand nodes. For details on this setup, see Spot Usage.
  • Use Amazon EMR managed scaling – This avoids overprovisioning clusters and automatically scales your clusters up or down based on resource utilization. With Amazon EMR managed scaling, AWS manages the automatic scaling activity by continuously evaluating cluster metrics and making optimized scaling decisions.
  • Implement an auto-termination policy – This avoids idle clusters or the need to manually monitor and stop unused EMR clusters. When you set an auto-termination policy, you specify the amount of idle time after which the cluster should automatically shut down (a short API sketch follows this list).
  • Provide visibility and monitor usage costs – You can provide visibility of EMR clusters to Amazon EMR Studio administrators and cloud operations teams by configuring user-defined cost allocation tags. These tags help create detailed cost and usage reports in AWS Cost Explorer for EMR clusters across multiple dimensions.
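
The managed scaling and auto-termination recommendations can also be applied to an already-running cluster through the EMR API; the following boto3 sketch shows the idea, with the cluster ID and capacity limits as placeholders:

import boto3

emr = boto3.client("emr")

# Auto-terminate the cluster after one hour of idleness
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",                  # placeholder cluster ID
    AutoTerminationPolicy={"IdleTimeout": 3600},  # seconds
)

# Enable managed scaling between 2 and 20 instance fleet units
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "InstanceFleetUnits",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 20,
        }
    },
)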

Ease of use

With Amazon EMR Studio, administrators within data science and data engineering teams can self-provision EMR clusters from templates pre-built with AWS CloudFormation. Templates can be parameterized to optimize cluster configuration according to each team’s workload requirements. For ease of use and to avoid dependencies to cloud operations teams, the parameters should avoid requesting unnecessary details or expose infrastructure complexities. Here are some tips to abstract the input values:

  • Keep the number of questions to a minimum (fewer than five).
  • Hide network and security configurations. Be opinionated when defining your cluster according to your security and network requirements following Amazon EMR best practices.
  • Avoid input values that require knowledge of AWS Cloud-specific terminology, such as EC2 instance types, Spot vs. On-Demand Instances, and so on.
  • Abstract input parameters considering information available to data engineering and data science teams. Focus on parameters that will help further optimize the size and costs of your EMR clusters.

The following screenshot is an example of input values you can request from a data science team and how to resolve them via CloudFormation template features.

EMR Studio IDE

The input parameters are as follows:

  • User concurrency – Knowing how many users are expected to run jobs simultaneously will help determine the number of executors to provision
  • Optimized for cost or reliability – Use Spot Instances to optimize for cost; for SLA sensitive workloads, use only On-Demand nodes
  • Workload memory requirements (small, medium, large) – Determine the ratio of memory per Spark executor in your EMR cluster

The following sections describe how to resolve the EMR cluster configurations from these input parameters and what features to use in your CloudFormation templates.

User concurrency: How many concurrent users do you need?

Knowing the expected user concurrency will help determine the target node capacity of your cluster or the min/max capacity when using the Amazon EMR auto scaling feature. Consider how much capacity (CPU cores and memory) each data scientist needs to run their average workload.

For example, let’s say you want to provision 10 executors to each data scientist in the team. If the expected concurrency is set to 7, then you need to provision 70 executors. An r5.2xlarge instance type has 8 cores and 64 GiB of RAM. With the default configuration, the core count (spark.executor.cores) is set to 1 and memory (spark.executor.memory) is set to 6 GiB. One core is reserved for running the Spark application, leaving seven executors per node. You therefore need a total of 10 r5.2xlarge nodes to meet the demand. The target capacity can dynamically resolve to 10 from the user concurrency input, and the capacity weights in your fleet make sure the same capacity is met if different instance sizes are provisioned to meet the expected capacity.
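
The sizing arithmetic above can be captured in a small helper; the 10-executors-per-user and 7-usable-executors-per-node figures are simply the assumptions from this example:

import math

def nodes_needed(concurrent_users, executors_per_user=10, executors_per_node=7):
    """Rough capacity sizing: total executors required divided by executors available per node."""
    return math.ceil(concurrent_users * executors_per_user / executors_per_node)

print(nodes_needed(7))  # 10 r5.2xlarge nodes, matching the example above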

Using a CloudFormation transform allows you to resolve the target capacity based on a numeric input value. A transform passes your template script to a custom AWS Lambda function so you can replace any placeholder in your CloudFormation template with values resolved from your input parameters.

The following CloudFormation script calls the emr-size-macro transform that replaces the custom::Target placeholder in the TargetSpotCapacity object based on the UserConcurrency input value:

Parameters:
...
  UserConcurrency:
    Description: "How many users you expect to run jobs simultaneously"
    Type: "Number"
    Default: "5"
...
Resources:
  EMRClusterTaskSpot:
    'Fn::Transform':
      Name: emr-size-macro
      Parameters:
        FleetType: task
        InputSize: !Ref UserConcurrency
    Type: AWS::EMR::InstanceFleetConfig
    Condition: UseSpot
    Properties:
      ClusterId: !Ref EMRCluster
      Name: cfnTask
      InstanceFleetType: TASK
      TargetOnDemandCapacity: 0
      TargetSpotCapacity: "custom::Target"
      LaunchSpecifications:
        OnDemandSpecification:
          AllocationStrategy: lowest-price
        SpotSpecification:
          AllocationStrategy: capacity-optimized
          TimeoutAction: SWITCH_TO_ON_DEMAND
          TimeoutDurationMinutes: 5
      InstanceTypeConfigs: !FindInMap [InstanceTypes, !Ref MemoryProfile, taskfleet]
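
Behind the transform, the emr-size-macro is backed by a Lambda function that receives the template fragment and returns it with the placeholder resolved. The following is only a hypothetical sketch of such a handler, reusing the sizing assumptions from the earlier example; the real macro implementation may differ:

import math

def handler(event, context):
    """Hypothetical CloudFormation macro: resolve the custom::Target placeholder."""
    fragment = event["fragment"]
    concurrency = int(event["templateParameterValues"].get("UserConcurrency", 5))

    # Same rule as the worked example: 10 executors per user, 7 usable executors per node
    target = math.ceil(concurrency * 10 / 7)

    properties = fragment.get("Properties", {})
    if properties.get("TargetSpotCapacity") == "custom::Target":
        properties["TargetSpotCapacity"] = target

    return {"requestId": event["requestId"], "status": "success", "fragment": fragment}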

Optimized for cost or reliability: How do you optimize your EMR cluster?

This parameter determines if the cluster should use Spot Instances for task nodes to optimize cost or provision only On-Demand nodes for SLA sensitive workloads that need to be optimized for reliability.

You can use the CloudFormation Conditions feature in your template to resolve your desired instance fleet configurations. The following code shows how the Conditions feature looks in a sample EMR template:

Parameters:
  ...
  Optimization: 
    Description: "Choose reliability if your jobs need to meet specific SLAs" 
    Type: "String" 
    Default: "cost" 
    AllowedValues: [ 'cost', 'reliability']
...
Conditions: 
  UseSpot: !Equals 
    - !Ref Optimization 
    - cost 
  UseOnDemand: !Equals 
    - !Ref Optimization 
    - reliability
Resources:
...
EMRClusterTaskSpot:
    Type: AWS::EMR::InstanceFleetConfig
    Condition: UseSpot
    Properties:
      ClusterId: !Ref EMRCluster
      Name: cfnTask
      InstanceFleetType: TASK
      TargetOnDemandCapacity: 0
      TargetSpotCapacity: 6
      LaunchSpecifications:
        OnDemandSpecification:
          AllocationStrategy: lowest-price
        SpotSpecification:
          AllocationStrategy: capacity-optimized
          TimeoutAction: SWITCH_TO_ON_DEMAND
          TimeoutDurationMinutes: 5
      InstanceTypeConfigs:
        - InstanceType: !FindInMap [ InstanceTypes, !Ref ClusterSize, taskfleet]
          WeightedCapacity: 1
EMRClusterTaskOnDemand:
    Type: AWS::EMR::InstanceFleetConfig
    Condition: UseOnDemand
    Properties:
      ClusterId: !Ref EMRCluster
      Name: cfnTask
      InstanceFleetType: TASK
      TargetOnDemandCapacity: 6
      TargetSpotCapacity: 0
 ...

Workload memory requirements: How big a cluster do you need?

This parameter helps determine the amount of memory and CPU to allocate to each Spark executor. The memory-to-CPU ratio allocated to each executor should be set appropriately to avoid out-of-memory errors. You can map the input parameter (small, medium, large) to specific instance types to select the CPU/memory ratio. Amazon EMR has default configurations (spark.executor.cores, spark.executor.memory) for each instance type. For example, a small cluster request could resolve to general purpose instances like m5 (default: 2 cores and 4 GB per executor), whereas a medium workload could resolve to an R type (default: 1 core and 6 GB per executor). You can further tune the default Amazon EMR memory and CPU core allocation for each instance type by following the best practices outlined in the Spark section of the EMR Best Practices Guide.

Use the CloudFormation Mappings section to resolve the cluster configuration in your template:

Parameters:
…
   MemoryProfile: 
    Description: "What is the memory profile you expect in your workload." 
    Type: "String" 
    Default: "small" 
    AllowedValues: ['small', 'medium', 'large']
…
Mappings:
  InstanceTypes:
    small:
      master: "m5.xlarge"
      core: "m5.xlarge"
      taskfleet:
        - InstanceType: m5.2xlarge
          WeightedCapacity: 1
        - InstanceType: m5.4xlarge
          WeightedCapacity: 2
        - InstanceType: m5.8xlarge
          WeightedCapacity: 3
          ...
    medium:
      master: "m5.xlarge"
      core: "r5.2xlarge"
      taskfleet:
        - InstanceType: r5.2xlarge
          WeightedCapacity: 1
        - InstanceType: r5.4xlarge
          WeightedCapacity: 2
        - InstanceType: r5.8xlarge
          WeightedCapacity: 3
...
Resources:
...
  EMRClusterTaskSpot:
    Type: AWS::EMR::InstanceFleetConfig
    Properties:
      ClusterId: !Ref EMRCluster
      InstanceFleetType: TASK    
      InstanceTypeConfigs: !FindInMap [InstanceTypes, !Ref MemoryProfile, taskfleet]
      ...

Conclusion

In this post, we showed how to create a self-service analytics platform with Amazon EMR and Amazon EMR Studio to take full advantage of the agility the AWS Cloud provides by considerably reducing deployment times without compromising governance. We also walked you through best practices in security, cost, and ease of use when defining your Amazon EMR Studio environment so data engineering and data science teams can speed up their development cycles by removing dependencies from Cloud Operations teams when provisioning their data processing platforms.

If this is your first time exploring Amazon EMR Studio, we recommend checking out the Amazon EMR workshops and referring to Create an EMR Studio. Continue referencing the Amazon EMR Best Practices Guide when defining your templates and check out the Amazon EMR Studio sample repo for EMR cluster template references.


About the Authors

Pablo Redondo is a Principal Solutions Architect at Amazon Web Services. He is a data enthusiast with over 16 years of fintech and healthcare industry experience and is a member of the AWS Analytics Technical Field Community (TFC). Pablo has been leading the AWS Gain Insights Program to help AWS customers achieve better insights and tangible business value from their data analytics initiatives.

Malini Chatterjee is a Senior Solutions Architect at AWS. She provides guidance to AWS customers on their workloads across a variety of AWS technologies with a breadth of expertise in data and analytics. She is very passionate about semi-classical dancing and performs in community events. She loves traveling and spending time with her family.

Avijit Goswami is a Principal Solutions Architect at AWS, specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS-managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike San Francisco Bay Area trails, watch sports, and listen to music.

How Hudl built a cost-optimized AWS Glue pipeline with Apache Hudi datasets

Post Syndicated from Indira Balakrishnan original https://aws.amazon.com/blogs/big-data/how-hudl-built-a-cost-optimized-aws-glue-pipeline-with-apache-hudi-datasets/

This is a guest blog post co-written with Addison Higley and Ramzi Yassine from Hudl.

Hudl Agile Sports Technologies, Inc. is a Lincoln, Nebraska-based company that provides tools for coaches and athletes to review game footage and improve individual and team play. Its initial product line served college and professional American football teams. Today, the company provides video services to youth, amateur, and professional teams in American football as well as other sports, including soccer, basketball, volleyball, and lacrosse. It now serves 170,000 teams in 50 different sports around the world. Hudl’s overall goal is to capture and bring value to every moment in sports.

Hudl’s mission is to make every moment in sports count. Hudl does this by expanding access to more moments through video and data and putting those moments in context. Our goal is to increase access by different people and increase context with more data points for every customer we serve. Using data to generate analytics, Hudl is able to turn data into actionable insights, telling powerful stories with video and data.

To best serve our customers and provide the most powerful insights possible, we need to be able to compare large sets of data between different sources. For example, enriching our MongoDB and Amazon DocumentDB (with MongoDB compatibility) data with our application logging data leads to new insights. This requires resilient data pipelines.

In this post, we discuss how Hudl has iterated on one such data pipeline using AWS Glue to improve performance and scalability. We talk about the initial architecture of this pipeline, and some of the limitations associated with this approach. We also discuss how we iterated on that design using Apache Hudi to dramatically improve performance.

Problem statement

A data pipeline that makes high-quality MongoDB and Amazon DocumentDB statistics data available in our central data lake is a requirement for Hudl to deliver sports analytics. It’s important to maintain the integrity of the MongoDB and Amazon DocumentDB transactional data, with the data lake capturing changes in near-real time along with upserts to records in the data lake. Because Hudl statistics are backed by MongoDB and Amazon DocumentDB databases, in addition to a broad range of other data sources, it’s important that the relevant MongoDB and Amazon DocumentDB data is available in a central data lake where we can run analytics queries to compare statistics data between sources.

Initial design

The following diagram demonstrates the architecture of our initial design.

Initial Ingestion Pipeline Design

Let’s discuss the key AWS services of this architecture:

  • AWS Database Migration Service (AWS DMS) allowed our team to move quickly in delivering this pipeline. AWS DMS gives our team a full snapshot of the data, and also offers ongoing change data capture (CDC). By combining these two datasets, we can ensure our pipeline delivers the latest data.
  • Amazon Simple Storage Service (Amazon S3) is the backbone of Hudl’s data lake because of its durability, scalability, and industry-leading performance.
  • AWS Glue allows us to run our Spark workloads in a serverless fashion, with minimal setup. We chose AWS Glue for its ease of use and speed of development. Additionally, features such as AWS Glue bookmarking simplified our file management logic.
  • Amazon Redshift offers petabyte-scale data warehousing. Amazon Redshift provides consistently fast performance, and easy integrations with our S3 data lake.

The data processing flow includes the following steps:

  1. Amazon DocumentDB holds the Hudl statistics data.
  2. AWS DMS gives us a full export of statistics data from Amazon DocumentDB, and ongoing changes in the same data.
  3. In the S3 Raw Zone, the data is stored in JSON format.
  4. An AWS Glue job merges the initial load of statistics data with the changed statistics data to give a snapshot of statistics data in JSON format for reference, eliminating duplicates.
  5. In the S3 Cleansed Zone, the JSON data is normalized and converted to Parquet format.
  6. AWS Glue uses a COPY command to insert Parquet data into Amazon Redshift consumption base tables (a minimal sketch of this step follows the pipeline snippet below).
  7. Amazon Redshift stores the final table for consumption.

The following is a sample code snippet from the AWS Glue job in the initial data pipeline:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import coalesce
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_context = spark.sparkContext
gc = GlueContext(spark_context)

# Load the entire existing dataset from the S3 Cleansed Zone (placeholder helper)
full_df = read_full_data()

# Read new CDC data, which represents the delta in the source MongoDB/DocumentDB (placeholder helper)
cdc_df = read_cdc_data()

# Calculate the final snapshot by joining the existing data with the delta
joined_df = full_df.join(cdc_df, '_id', 'full_outer')

# Drop deleted records and keep the most recent version of each document
result = joined_df.filter((joined_df.Op != 'D') | (joined_df.Op.isNull())) \
    .select(coalesce(cdc_df._doc, full_df._doc).alias('_doc'))

# Write the snapshot back to the S3 Cleansed Zone in Parquet format
gc.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(result, gc, "result"),
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet",
    transformation_ctx="ctx4")
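
The snippet above covers the snapshot merge (steps 4 and 5). The load into Amazon Redshift (step 6) isn't shown in the post's code; the following is a minimal sketch of how such a load from an AWS Glue job could look, where the catalog connection name, database, table, and S3 paths are hypothetical placeholders. Glue's Redshift writer stages the data in the temporary S3 directory and issues a COPY into the target table.

from awsglue.context import GlueContext
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
gc = GlueContext(spark.sparkContext)

# Read the cleansed Parquet snapshot produced by the previous step (placeholder path)
snapshot_dyf = gc.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/cleansed-zone/statistics/"]},
    format="parquet")

# Load into a Redshift consumption base table through a Glue catalog connection;
# Glue stages the data in S3 and runs a COPY into the target table
gc.write_dynamic_frame.from_jdbc_conf(
    frame=snapshot_dyf,
    catalog_connection="redshift-connection",   # hypothetical connection name
    connection_options={"dbtable": "stats_base", "database": "analytics"},
    redshift_tmp_dir="s3://bucket/temp-dir/")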

Challenges

Although this initial solution met our need for data quality, we felt there was room for improvement:

  • The pipeline was slow – The pipeline took over 2 hours to run because, for each batch, the whole dataset was compared. Every record had to be compared, flattened, and converted to Parquet, even when only a few records had changed since the previous daily run.
  • The pipeline was expensive – As the data size grew daily, the job duration also grew significantly (especially in step 4). To mitigate the impact, we needed to allocate more AWS Glue DPUs (Data Processing Units) to scale the job, which led to higher cost.
  • The pipeline limited our ability to scale – Hudl’s data has a long history of rapid growth with increasing customers and sporting events. Given this trend, our pipeline needed to run as efficiently as possible to handle only changing datasets to have predictable performance.

New design

The following diagram illustrates our updated pipeline architecture.

Although the overall architecture looks roughly the same, the internal logic in AWS Glue was significantly changed, along with addition of Apache Hudi datasets.

In step 4, AWS Glue now interacts with Apache Hudi datasets in the S3 Cleansed Zone to upsert or delete changed records as identified by AWS DMS CDC. The AWS Glue to Apache Hudi connector helps convert JSON data to Parquet format and upserts it into the Apache Hudi dataset. Retaining the full documents in our Apache Hudi dataset allows us to easily make schema changes to our final Amazon Redshift tables without needing to re-export data from our source systems.

The following is a sample code snippet from the new AWS Glue pipeline:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_context = spark.sparkContext
gc = GlueContext(spark_context)

# Hudi write configuration for upserting CDC records into the dataset in Amazon S3
upsert_conf = {
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': 'write_ts',
    'hoodie.datasource.write.recordkey.field': '_id',
    'hoodie.table.name': 'glue_table',
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': 'glue_database',
    'hoodie.datasource.hive_sync.table': 'glue_table',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.support_timestamp': 'true',
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'path': 's3://bucket/prefix/',
    'hoodie.compact.inline': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.upsert.shuffle.parallelism': 200,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS',
    'hoodie.cleaner.commits.retained': 10
}

# cdc_upserts_df holds the new and changed records identified by AWS DMS CDC
gc.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(cdc_upserts_df, gc, "cdc_upserts_df"),
    connection_type="marketplace.spark",
    connection_options=upsert_conf)

Results

With this new approach using Apache Hudi datasets with AWS Glue deployed after May 2022, the pipeline runtime was predictable and less expensive than the initial approach. Because we only handled new or modified records by eliminating the full outer join over the entire dataset, we saw an 80–90% reduction in runtime for this pipeline, thereby reducing costs by 80–90% compared to the initial approach. The following diagram illustrates our processing time before and after implementing the new pipeline.

Conclusion

With Apache Hudi’s open-source data management framework, we simplified incremental data processing in our AWS Glue data pipeline to manage data changes at the record level in our S3 data lake with CDC from Amazon DocumentDB.

We hope that this post will inspire your organization to build AWS Glue pipelines with Apache Hudi datasets that reduce cost and bring performance improvements using serverless technologies to achieve your business goals.


About the authors

Addison Higley is a Senior Data Engineer at Hudl. He manages over 20 data pipelines to help ensure data is available for analytics so Hudl can deliver insights to customers.

Ramzi Yassine is a Lead Data Engineer at Hudl. He leads the architecture, implementation of Hudl’s data pipelines and data applications, and ensures that our data empowers internal and external analytics.

Swagat Kulkarni is a Senior Solutions Architect at AWS and an AI/ML enthusiast. He is passionate about solving real-world problems for customers with cloud-native services and machine learning. Swagat has over 15 years of experience delivering several digital transformation initiatives for customers across multiple domains, including retail, travel and hospitality, and healthcare. Outside of work, Swagat enjoys travel, reading, and meditating.

Indira Balakrishnan is a Principal Solutions Architect in the AWS Analytics Specialist SA Team. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems using data-driven decisions. Outside of work, she volunteers at her kids’ activities and spends time with her family.

How SOCAR built a streaming data pipeline to process IoT data for real-time analytics and control

Post Syndicated from DoYeun Kim original https://aws.amazon.com/blogs/big-data/how-socar-built-a-streaming-data-pipeline-to-process-iot-data-for-real-time-analytics-and-control/

SOCAR is the leading Korean mobility company with strong competitiveness in car-sharing. SOCAR has become a comprehensive mobility platform in collaboration with Nine2One, an e-bike sharing service, and Modu Company, an online parking platform. Backed by advanced technology and data, SOCAR solves mobility-related social problems, such as parking difficulties and traffic congestion, and changes the car ownership-oriented mobility habits in Korea.

SOCAR is building a new fleet management system to manage the many actions and processes that must occur in order for fleet vehicles to run on time, within budget, and at maximum efficiency. To achieve this, SOCAR is looking to build a highly scalable data platform using AWS services to collect, process, store, and analyze internet of things (IoT) streaming data from various vehicle devices and historical operational data.

This in-car device data, combined with operational data such as car details and reservation details, will provide a foundation for analytics use cases. For example, SOCAR will be able to notify customers if they have forgotten to turn their headlights off or to schedule a service if a battery is running low. Unfortunately, the previous architecture didn’t enable the enrichment of IoT data with operational data and couldn’t support streaming analytics use cases.

AWS Data Lab offers accelerated, joint-engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives. The Build Lab is a 2–5-day intensive build with a technical customer team.

In this post, we share how SOCAR engaged the Data Lab program to assist them in building a prototype solution to overcome these challenges, and to build the basis for accelerating their data project.

Use case 1: Streaming data analytics and real-time control

SOCAR wanted to utilize IoT data for a new business initiative. A fleet management system, where data comes from IoT devices in the vehicles, is a key input to drive business decisions and derive insights. This data is captured by AWS IoT and sent to Amazon Managed Streaming for Apache Kafka (Amazon MSK). By joining the IoT data to other operational datasets, including reservations, car information, device information, and others, the solution can support a number of functions across SOCAR’s business.

An example of real-time monitoring is when a customer turns off the car engine and closes the car door, but the headlights are still on. By using IoT data related to the car light, door, and engine, a notification is sent to the customer to inform them that the car headlights should be turned off.

Although this real-time control is important, they also want to collect historical data—both raw and curated data—in Amazon Simple Storage Service (Amazon S3) to support historical analytics and visualizations by using Amazon QuickSight.

Use case 2: Detect table schema change

The first challenge SOCAR faced was existing batch ingestion pipelines that were prone to breaking when schema changes occurred in the source systems. Additionally, these pipelines didn’t deliver data in a way that was easy for business analysts to consume. In order to meet the future data volumes and business requirements, they needed a pattern for the automated monitoring of batch pipelines with notification of schema changes and the ability to continue processing.

The second challenge was related to the complexity of the JSON files being ingested. The existing batch pipelines weren't flattening the five-level nested structure, which made it difficult for business users and analysts to gain business insights without additional data preparation effort on their end.

Overview of solution

In this solution, we followed the serverless data architecture to establish a data platform for SOCAR. This serverless architecture allowed SOCAR to run data pipelines continuously and scale automatically with no setup cost and without managing servers.

AWS Glue is used for both the streaming and batch data pipelines. Amazon Kinesis Data Analytics is used to deliver streaming data with subsecond latencies. In terms of storage, data is stored in Amazon S3 for historical data analysis, auditing, and backup. However, when frequent reading of the latest snapshot data is required by multiple users and applications concurrently, the data is stored and read from Amazon DynamoDB tables. DynamoDB is a key-value and document database that can support tables of virtually any size with horizontal scaling.

Let’s discuss the components of the solution in detail before walking through the steps of the entire data flow.

Component 1: Processing IoT streaming data with business data

The first data pipeline (see the following diagram) processes IoT streaming data with business data from an Amazon Aurora MySQL-Compatible Edition database.

Whenever a transaction occurs in two tables in the Aurora MySQL database, this transaction is captured as data and then loaded into two MSK topics via AWS Database Migration Service (AWS DMS) tasks. One topic conveys the car information table, and the other topic is for the device information table. This data is loaded into a single DynamoDB table that contains all the attributes (or columns) that exist in the two tables in the Aurora MySQL database, along with a primary key. This single DynamoDB table holds the latest snapshot data from the two DB tables, and is important because it provides the latest information about all the cars and devices for the lookup against the streaming IoT data. If the lookup were done on the database directly with the streaming data, it would impact the production database performance.

When the snapshot is available in DynamoDB, an AWS Glue streaming job runs continuously to collect the IoT data and join it with the latest snapshot data in the DynamoDB table to produce the up-to-date output, which is written into another DynamoDB table.
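
For illustration, an AWS Glue streaming job of this shape could be sketched roughly as follows. This is not SOCAR's actual job: the catalog database and table, DynamoDB table names, join key, and window size are hypothetical, and re-reading the snapshot on every micro-batch is a simplification.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
gc = GlueContext(spark.sparkContext)

# Streaming DataFrame over the IoT Kafka topic registered in the Glue Data Catalog
iot_stream = gc.create_data_frame.from_catalog(
    database="iot_db",                  # hypothetical catalog database
    table_name="iot_topic",             # hypothetical catalog table backed by Amazon MSK
    additional_options={"startingOffsets": "latest"})

def process_batch(batch_df, batch_id):
    # Load the latest car/device snapshot from DynamoDB for this micro-batch
    snapshot = gc.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={"dynamodb.input.tableName": "car_device_snapshot"}).toDF()

    # Enrich the IoT records with the snapshot attributes (hypothetical join key)
    enriched = batch_df.join(snapshot, "device_id", "left")

    # Write the up-to-date output into the target DynamoDB table
    gc.write_dynamic_frame_from_options(
        frame=DynamicFrame.fromDF(enriched, gc, "enriched"),
        connection_type="dynamodb",
        connection_options={"dynamodb.output.tableName": "iot_enriched"})

gc.forEachBatch(
    frame=iot_stream,
    batch_function=process_batch,
    options={"windowSize": "10 seconds",
             "checkpointLocation": "s3://bucket/checkpoints/iot-enrich/"})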

The up-to-date data in DynamoDB is used for real-time monitoring and control that SOCAR’s Data Analytics team performs for safety maintenance and fleet management. This data is ultimately consumed by a number of apps to perform various business activities, including route optimization, real-time monitoring for oil consumption and temperature, and to identify a driver’s driving pattern, tire wear and defect detection, and real-time car crash notifications.

Component 2: Processing IoT data and visualizing the data in dashboards

The second data pipeline (see the following diagram) batch processes the IoT data and visualizes it in QuickSight dashboards.

There are two data sources. The first is the Aurora MySQL database. The two database tables are exported into Amazon S3 from the Aurora MySQL cluster and registered in the AWS Glue Data Catalog as tables. The second data source is Amazon MSK, which receives streaming data from AWS IoT Core. This requires you to create a secure AWS Glue connection for an Apache Kafka data stream. SOCAR’s MSK cluster requires SASL_SSL as a security protocol (for more information, refer to Authentication and authorization for Apache Kafka APIs). To create an MSK connection in AWS Glue and set up connectivity, we use the following CLI command:

aws glue create-connection --connection-input \
'{"Name": "kafka-connection",
  "Description": "kafka connection example",
  "ConnectionType": "KAFKA",
  "ConnectionProperties": {
    "KAFKA_BOOTSTRAP_SERVERS": "<server-ip-addresses>",
    "KAFKA_SSL_ENABLED": "true",
    "KAFKA_SECURITY_PROTOCOL": "SASL_SSL",
    "KAFKA_SKIP_CUSTOM_CERT_VALIDATION": "false",
    "KAFKA_SASL_MECHANISM": "SCRAM-SHA-512",
    "KAFKA_SASL_SCRAM_USERNAME": "<username>",
    "KAFKA_SASL_SCRAM_PASSWORD": "<password>"
  },
  "PhysicalConnectionRequirements": {
    "SubnetId": "subnet-xxx",
    "SecurityGroupIdList": ["sg-xxx"],
    "AvailabilityZone": "us-east-1a"
  }}'

Component 3: Real-time control

The third data pipeline processes the streaming IoT data from Amazon MSK with millisecond latency to produce the output in DynamoDB, and sends a notification in real time if any records are identified as outliers based on business rules.

AWS IoT Core provides integrations with Amazon MSK to set up real-time streaming data pipelines. To do so, complete the following steps:

  1. On the AWS IoT Core console, choose Act in the navigation pane.
  2. Choose Rules, and create a new rule.
  3. For Actions, choose Add action and choose Kafka.
  4. Choose the VPC destination if required.
  5. Specify the Kafka topic.
  6. Specify the TLS bootstrap servers of your Amazon MSK cluster.

You can view the bootstrap server URLs in the client information of your MSK cluster details. The AWS IoT rule was created with the Kafka topic as an action to provide data from AWS IoT Core to Kafka topics.
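
The same rule can also be created programmatically. The following is a hedged sketch using boto3; the MQTT topic filter, destination ARN, Kafka topic, and client properties are placeholders, and the KafkaAction field names should be verified against the current AWS IoT API reference.

import boto3

iot = boto3.client('iot')

# Hypothetical rule that forwards device telemetry from an MQTT topic to an MSK topic
iot.create_topic_rule(
    ruleName='forward_telemetry_to_msk',
    topicRulePayload={
        'sql': "SELECT * FROM 'socar/+/telemetry'",   # hypothetical MQTT topic filter
        'actions': [{
            'kafka': {
                'destinationArn': 'arn:aws:iot:us-east-1:111122223333:ruledestination/vpc/example',
                'topic': 'iot-input',                  # Kafka topic on the MSK cluster
                'clientProperties': {
                    'bootstrap.servers': '<tls-bootstrap-servers>',
                    'security.protocol': 'SSL',
                    'acks': '1'
                }
            }
        }]
    })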

SOCAR used Amazon Kinesis Data Analytics Studio to analyze streaming data in real time and build stream-processing applications using standard SQL and Python. We created one table from the Kafka topic using the following code:

CREATE TABLE table_name (
  column_name1 VARCHAR,
  column_name2 VARCHAR(100),
  column_name3 VARCHAR,
  column_name4 AS TO_TIMESTAMP(`time_column`, 'EEE MMM dd HH:mm:ss z yyyy'),
  WATERMARK FOR column_name4 AS column_name4 - INTERVAL '5' SECOND
)
PARTITIONED BY (column_name5)
WITH (
  'connector' = 'kafka',
  'topic' = 'topic_name',
  'properties.bootstrap.servers' = '<bootstrap servers shown in the MSK client info dialog>',
  'format' = 'json',
  'properties.group.id' = 'testGroup1',
  'scan.startup.mode' = 'earliest-offset'
);

Then we applied a query with business logic to identify a particular set of records that need to be alerted. When this data is loaded back into another Kafka topic, AWS Lambda functions trigger the downstream action: either load the data into a DynamoDB table or send an email notification.
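
As a rough sketch of that downstream step, a Lambda function handling an MSK trigger could look like the following. The post uses separate functions for the DynamoDB load and the email notification; this sketch combines both for brevity, and the table name and SNS topic ARN are hypothetical.

import base64
import json
import os

import boto3

dynamodb = boto3.resource('dynamodb')
sns = boto3.client('sns')

# Hypothetical resource names supplied through environment variables
TABLE_NAME = os.environ.get('OUTLIER_TABLE', 'iot_outliers')
TOPIC_ARN = os.environ.get('ALERT_TOPIC_ARN', 'arn:aws:sns:us-east-1:111122223333:iot-alerts')

def handler(event, context):
    """Triggered by Amazon MSK; records arrive base64-encoded, grouped by topic-partition."""
    table = dynamodb.Table(TABLE_NAME)
    for records in event.get('records', {}).values():
        for record in records:
            payload = json.loads(base64.b64decode(record['value']))
            # Persist the outlier event for downstream apps
            table.put_item(Item=payload)
            # Notify subscribed users by email via Amazon SNS
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject='SOCAR IoT alert',
                Message=json.dumps(payload))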

Component 4: Flattening the nested structure JSON and monitoring schema changes

The final data pipeline (see the following diagram) processes complex, semi-structured, and nested JSON files.

This step uses an AWS Glue DynamicFrame to flatten the nested structure and then land the output in Amazon S3. After the data is loaded, it’s scanned by an AWS Glue crawler to update the Data Catalog table and detect any changes in the schema.
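
A minimal sketch of that flattening step using AWS Glue's Relationalize transform, which flattens nested fields and pivots array columns into separate frames, might look like the following; the S3 paths are placeholders.

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
gc = GlueContext(spark.sparkContext)

# Read the nested JSON files from the raw S3 location (placeholder path)
nested_dyf = gc.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/raw-json/"]},
    format="json")

# Relationalize flattens the nested structure and pivots arrays into separate frames
flattened = Relationalize.apply(
    frame=nested_dyf,
    staging_path="s3://bucket/temp-relationalize/",
    name="root",
    transformation_ctx="relationalize")

# Write the flattened root frame to the curated S3 location for the crawler to pick up
gc.write_dynamic_frame.from_options(
    frame=flattened.select("root"),
    connection_type="s3",
    connection_options={"path": "s3://bucket/flattened/"},
    format="parquet")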

Data flow: Putting it all together

The following diagram illustrates our complete data flow with each component.

Let’s walk through the steps of each pipeline.

The first data pipeline (in red) processes the IoT streaming data with the Aurora MySQL business data:

  1. AWS DMS is used for ongoing replication to continuously apply source changes to the target with minimal latency. The source includes two tables in the Aurora MySQL database tables (carinfo and deviceinfo), and each is linked to two MSK topics via AWS DMS tasks.
  2. Amazon MSK triggers a Lambda function, so whenever a topic receives data, a Lambda function runs to load data into DynamoDB table.
  3. There is a single DynamoDB table with columns that exist from the carinfo table and the deviceinfo table of the Aurora MySQL database. This table consists of all the data from two tables and stores the latest data by performing an upsert operation.
  4. An AWS Glue job continuously receives the IoT data and joins it with data in the DynamoDB table to produce the output into another DynamoDB target table.
  5. This target table contains the final data, which includes all the device and car status information from the IoT devices as well as metadata from the Aurora MySQL table.

The second data pipeline (in green) batch processes IoT data to use in dashboards and for visualization:

  1. The car and reservation data (in two DB tables) is exported via a SQL command from the Aurora MySQL database with the output data available in an S3 bucket. The folders that contain data are registered as an S3 location for the AWS Glue crawler and become available via the AWS Glue Data Catalog.
  2. The MSK input topic continuously receives data from AWS IoT. Each car has a number of IoT devices, and each device captures data and sends it to an MSK input topic. The Amazon MSK S3 sink connector is configured to export data from Kafka topics to Amazon S3 in JSON formats. In addition, the S3 connector exports data by guaranteeing exactly-once delivery semantics to consumers of the S3 objects it produces.
  3. The AWS Glue job runs in a daily batch, joining the historical IoT data in Amazon S3 with the two tables (refer to step 1) to produce the output data in an Enriched folder in Amazon S3.
  4. Amazon Athena is used to query data from Amazon S3 and make it available as a dataset in QuickSight for visualizing historical data.

The third data pipeline (in blue) processes streaming IoT data from Amazon MSK with millisecond latency to produce the output in DynamoDB and send a notification:

  1. An Amazon Kinesis Data Analytics Studio notebook powered by Apache Zeppelin and Apache Flink is used to build and deploy its output as a Kinesis Data Analytics application. This application loads data from Amazon MSK in real time, and users can apply business logic to select particular events coming from the IoT real-time data, for example, the car engine is off and the doors are closed, but the headlights are still on. The particular event that users want to capture can be sent to another MSK topic (Outlier) via the Kinesis Data Analytics application.
  2. Amazon MSK triggers a Lambda function, so whenever a topic receives data, a Lambda function runs to send an email notification to users that are subscribed to an Amazon Simple Notification Service (Amazon SNS) topic. An email is published using an SNS notification.
  3. The Kinesis Data Analytics application loads data from AWS IoT, applies business logic, and then loads it into another MSK topic (output). Amazon MSK triggers a Lambda function when data is received, which loads data into a DynamoDB Append table.
  4. Amazon Kinesis Data Analytics Studio is used to run SQL commands for ad hoc interactive analysis on streaming data.

The final data pipeline (in yellow) processes complex, semi-structured, and nested JSON files, and sends a notification when a schema evolves.

  1. An AWS Glue job runs and reads the JSON data from Amazon S3 (as a source), applies logic to flatten the nested schema using a DynamicFrame, and pivots out array columns from the flattened frame.
  2. The output is stored in Amazon S3 and is automatically registered to the AWS Glue Data Catalog table.
  3. Whenever there is a new attribute or change in the JSON input data at any level in the nested structure, the new attribute and change are captured in Amazon EventBridge as an event from the AWS Glue Data Catalog. An email notification is published using Amazon SNS.
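
As an illustration of step 3, an EventBridge rule that routes Glue Data Catalog table changes to an SNS topic could be sketched as follows; the detail-type and detail fields are our assumptions about the Glue event shape and should be checked against the Glue and EventBridge documentation, and the topic ARN is a placeholder.

import json

import boto3

events = boto3.client('events')

# Match AWS Glue Data Catalog table updates, such as new columns detected by the crawler.
# The detail-type and typeOfChange values below are assumptions; verify them before use.
rule_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Data Catalog Table State Change"],
    "detail": {"typeOfChange": ["UpdateTable"]}
}

events.put_rule(
    Name='glue-schema-change-alert',
    EventPattern=json.dumps(rule_pattern),
    State='ENABLED')

# Route matching events to an SNS topic so subscribers get an email (hypothetical topic ARN)
events.put_targets(
    Rule='glue-schema-change-alert',
    Targets=[{
        'Id': 'sns-email',
        'Arn': 'arn:aws:sns:us-east-1:111122223333:schema-change-alerts'
    }])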

Conclusion

As a result of the four-day Build Lab, the SOCAR team left with a working prototype that is custom fit to their needs, gaining a clear path to production. The Data Lab allowed the SOCAR team to build a new streaming data pipeline, enrich IoT data with operational data, and enhance the existing data pipeline to process complex nested JSON data. This establishes a baseline architecture to support the new fleet management system beyond the car-sharing business.


About the Authors

DoYeun Kim is the Head of Data Engineering at SOCAR. He is a passionate software engineering professional with 19+ years experience. He leads a team of 10+ engineers who are responsible for the data platform, data warehouse and MLOps engineering, as well as building in-house data products.

SangSu Park is a Lead Data Architect in SOCAR’s cloud DB team. His passion is to keep learning, embrace challenges, and strive for mutual growth through communication. He loves to travel in search of new cities and places.

YoungMin Park is a Lead Architect in SOCAR’s cloud infrastructure team. His philosophy in life is-whatever it may be-to challenge, fail, learn, and share such experiences to build a better tomorrow for the world. He enjoys building expertise in various fields and basketball.

Younggu Yun is a Senior Data Lab Architect at AWS. He works with customers around the APAC region to help them achieve business goals and solve technical problems by providing prescriptive architectural guidance, sharing best practices, and building innovative solutions together. In his free time, he and his son are obsessed with building creative models out of Lego blocks.

Vicky Falconer leads the AWS Data Lab program across APAC, offering accelerated joint engineering engagements between teams of customer builders and AWS technical resources to create tangible deliverables that accelerate data analytics modernization and machine learning initiatives.

Learn more about Apache Flink and Amazon Kinesis Data Analytics with three new videos

Post Syndicated from Deepthi Mohan original https://aws.amazon.com/blogs/big-data/learn-more-about-apache-flink-and-amazon-kinesis-data-analytics-with-three-new-videos/

Amazon Kinesis Data Analytics is a fully managed service for Apache Flink that reduces the complexity of building, managing, and integrating Apache Flink applications with other AWS services. Apache Flink is an open-source framework and engine for stateful processing of data streams. It’s highly available and scalable, delivering high throughput and low latency for the most demanding stream-processing applications.

In this post, we highlight three new videos for you to learn more about Apache Flink and Kinesis Data Analytics, including open-source contributions to Apache Flink, our learnings from running thousands of Flink jobs on a managed service, and how we use Kinesis Data Analytics and Apache Flink to enable machine learning (ML) in Alexa.

In Introducing the new Async Sink, we present the new Async Sink framework, an open-source contribution to make it easier than ever to build sink connectors for Apache Flink. You can learn about the need for the Async Sink framework and how we built it, followed by a demo of building a new sink to Amazon CloudWatch to deliver CloudWatch metrics, in under 20 minutes! The Async Sink framework bootstraps development of Flink sinks, is compatible with Apache Flink 1.15 and above, and has already seen usage by the community beyond building new sinks to AWS services.

The video Practical learnings from running thousands of Flink jobs shares insight from running Kinesis Data Analytics, a managed service for Apache Flink that runs tens of thousands of Flink jobs. You can learn lessons based on our experience of operating Apache Flink at very large scale, touching on issues such as out-of-memory errors, timeouts, and stability challenges. The video also covers improving application performance with memory tuning and configuration changes and the approaches to automating job health monitoring and management of Flink jobs at scale.

“Alexa, be quiet!” End-to-end near-real time model building and evaluation in Amazon Alexa discusses how Alexa has built an automated end-to-end solution for incremental model building or fine-tuning ML models through continuous learning, continual learning, or semi-supervised active learning. Alexa uses Apache Flink to transform and discover metrics in real time. In this video, you learn about how Alexa scales infrastructure to meet the needs of ML teams across Alexa, and explore specific use cases that use Apache Flink and Kinesis Data Analytics to improve Alexa experiences to delight customers.

To learn more about Kinesis Data Analytics for Apache Flink, visit our product page.


About the author

Deepthi Mohan is a Principal Product Manager on the Kinesis Data Analytics team.

How Kyligence Cloud uses Amazon EMR Serverless to simplify OLAP

Post Syndicated from Daniel Gu original https://aws.amazon.com/blogs/big-data/how-kyligence-cloud-uses-amazon-emr-serverless-to-simplify-olap/

This post was co-written with Daniel Gu and Yolanda Wang, from Kyligence.

Today, more than ever, organizations realize that modern business runs on data—almost all our interactions with business are based on data, and organizations must use analytics to understand, plan, and improve their operations. That is where Online Analytical Processing (OLAP) comes in. OLAP is designed to manage and analyze big data, enabling organizations to use their data to extract business insights in multiple dimensions.

The Kyligence Cloud OLAP solution offers an Intelligent OLAP Platform to simplify multi-dimensional analytics for cloud data lakes. In the past, Kyligence used to deploy and maintain its own Spark clusters based on Amazon Elastic Compute Cloud (Amazon EC2) to handle the multi-dimensional model pre-computing process, which required users to build their own monitoring and alerting systems to improve the observability and reliability of the Spark cluster. In this post, we present how Kyligence built an end-to-end Kyligence Cloud OLAP solution with Amazon EMR Serverless to simplify deployment and operations, reduce costs, and accelerate time-to-value over the data lake.

What is Amazon EMR Serverless?

Amazon EMR Serverless is a big data cloud platform for running large-scale distributed data processing jobs, and machine learning (ML) applications using open-source analytics frameworks like Apache Spark and Apache Hive. Amazon EMR Serverless makes it easy and cost-effective for data engineers and analysts to run applications without having to tune, operate, optimize, secure, or manage clusters.

What is OLAP?

OLAP is an approach to answering analytics queries quickly on large volumes of data. It provides capabilities for precomputation, sophisticated data modeling, and multi-dimensional analytics by rolling up large, sometimes separate datasets into a multi-dimensional database known as an OLAP cube, which enables "slicing and dicing" of data from different viewpoints for a streamlined query experience. Apache Kylin, Apache Druid, and ClickHouse are some of the popular OLAP tools.

Although OLAP tools have been successfully used in various industries, they still face many challenges:

  • Dependence on IT organizations – Traditional OLAP tools require complex infrastructure to run large-scale data computing. It requires a large team of IT professionals to operate and maintain this infrastructure, resulting in high costs.
  • Need for large compute resources – Traditional OLAP tools need a huge amount of computing resources for processing and transforming data through a series of specific steps toward a concrete goal. A lack of computational capabilities leads to longer response times, limits the amount of data that can be processed, and greatly impedes the flexibility of the OLAP tool. As a result, data analysts are often confined to narrow datasets, incapable of analyzing all the data freely.
  • Inefficient usage of resources in the cloud – When a large-scale data modeling calculation is performed in the cloud, the cost estimation tools estimate and deploy the corresponding computing resources. However, the utilization rate of resources is often not very high, resulting in inefficient usage of resources.

With OLAP integrated with Amazon EMR Serverless, OLAP tools can use Amazon EMR Serverless as a serverless computing resource pool to complete data processing jobs, which simplifies and enhances user experience.

Kyligence approach to OLAP using Amazon EMR Serverless

Kyligence is an AWS ISV partner that offers an Intelligent OLAP Platform to simplify multi-dimensional analytics for cloud data lakes. As a cloud-native OLAP platform, Kyligence Cloud now integrates with Amazon EMR Serverless to automatically provision Spark to run indexing and building jobs. This empowers you to use all the features and benefits of Kyligence’s OLAP with Amazon EMR Serverless.

Kyligence seamlessly connects to major AWS-native data sources including Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Amazon Relational Database Service (Amazon RDS) to get the most out of your data on AWS, building a comprehensive AWS big data solution. During data modeling, Kyligence uses Amazon S3 to store the pre-computed data, and serves it for high concurrency queries. Kyligence also seamlessly interfaces with popular business intelligence (BI) tools such as Tableau, Microsoft Power BI, and Microsoft Excel to provide rich, built-in data visualization and self-service tools.

The following diagram illustrates the Kyligence Cloud architecture on AWS.

What you can expect from Kyligence Cloud on AWS

This solution offers the following benefits:

  • High performance – With AWS’s global infrastructure and the distributed computing capabilities of Amazon EMR, Kyligence offers a scalable, cost-effective, high-performance OLAP engine for multi-dimensional analytics. It enables critical data applications and large-scale interactive analytics, and helps you achieve sub-second query response times and high concurrency on PB-scale data.
  • Auto-scaling – Kyligence Cloud’s computing resources can be expanded with one click, and as load decreases, cluster size can be automatically reduced. This auto-scaling capability provides optimized costs with service stability.
  • High compatibility – Kyligence Cloud provides a rich set of APIs (ODBC, JDBC, Rest API, Python Client) and standard ANSI-SQL and XMLA/MDX interface, which can be easily integrated with popular analytics tools like Tableau, Microsoft Excel, Microsoft Power BI, and data science tools like Python.
  • Security and reliability – With Amazon S3, Amazon RDS, Kyligence enterprise-level security features, and AWS Identity and Access Management (IAM) support, Kyligence Cloud safely manages access to the services and resources deployed on AWS while supporting multi-level access control of data models, tables, and cells to ensure data security and privacy protection.
  • One-click deployment on AWS – Kyligence Cloud is available in AWS Marketplace. The deployment is completed automatically based on an AWS CloudFormation template and parameter settings. Kyligence performs automated cluster operation and maintenance, and elastic rule-based cluster scaling, which lightens the workload for IT administrators and cloud infrastructure teams. Kyligence also offers a quick deployment method in the Kyligence Cloud Portal.

How Amazon EMR Serverless integrates with OLAP

With Amazon EMR Serverless, Kyligence Cloud provides out-of-the-box managed Apache Spark services. The Kyligence engine can distribute the compute job to Apache Spark in Amazon EMR Serverless. With the automatic on-demand provisioning and scaling capabilities of Amazon EMR Serverless, Kyligence can quickly meet changing processing requirements at any data volume.

The following diagram illustrates Kyligence Cloud integrated with Amazon EMR Serverless.

Benefits of using Kyligence Cloud with Amazon EMR Serverless

In the past, Kyligence used to deploy and maintain its own Spark clusters based on Amazon Elastic Compute Cloud (Amazon EC2) to handle the multi-dimensional model pre-computing process that required Kyligence users to build their monitoring and alerting systems to improve the observability and reliability of the Spark clusters.

Now, running Kyligence on Amazon EMR Serverless offers a more cost-effective, and high-performance way to run cloud analytics on AWS:

  • Simplified deployment on the cloud – With managed services, you don’t need to consider the lifecycle of the underlying infrastructure and resources. This greatly reduces application complexity and simplifies the deployment of Kyligence Cloud.
  • Improve performance on the cloud – Amazon EMR Serverless provides a refined scaling strategy that helps Kyligence Cloud spin up and recycle resources faster. In Kyligence performance benchmark testing, we observed 15–20% faster performance compared to an open-source Spark cluster for index building.
  • Reduce the difficulty of operation and maintenance – With the help of Amazon EMR Serverless capabilities, operation and maintenance personnel can easily maintain the capacity and running status of computing resources without having to understand the underlying analysis framework.
  • Cost-optimization on the cloud – Amazon EMR Serverless provides a refined scaling strategy that can automatically determine the resources that the application needs, acquires these resources to process your jobs, and releases the resources when the jobs complete. You only pay for the resources used by the application, which helps reduce the Total Cost of Operations (TCO) on the cloud.

Get started with Kyligence Cloud on Amazon EMR Serverless

You can get started with the full potential of Kyligence Cloud on the AWS Marketplace or quickly test drive Kyligence.

To use Amazon EMR Serverless, you just need to select Serverless Spark on the Build Cluster tab during deployment.

Conclusion

Using managed and scalable services like Amazon EMR Serverless allows Kyligence users to speed up self-service analytics on large volumes of data, and maintain a relatively simplified architecture. With this solution, you can now concentrate on business demands instead of technical issues.

About Kyligence

Kyligence was founded in 2016 by the original creators of Apache Kylin™, the leading open-source OLAP for big data. Kyligence offers an Intelligent OLAP Platform to simplify multi-dimensional analytics for cloud data lakes.

For more information, visit Kyligence.


About the authors

Daniel Gu is a senior product manager on the Kyligence Cloud Team, who manages products and services and conducts research to determine the viability of products in the cloud.

Yolanda Wang is a senior product marketing manager at Kyligence, who owns the positioning, messaging, and branding of Kyligence products and works with various teams to drive go-to-market strategies.

Kiran Guduguntla is a WW Go-to-Market Specialist for Amazon EMR at AWS. He works with AWS customers across the globe to strategize, build, develop, and deploy modern data analytics solutions.

Field-level security in Amazon OpenSearch Service

Post Syndicated from Satyanarayana Adimula original https://aws.amazon.com/blogs/big-data/field-level-security-in-amazon-opensearch-service/

Amazon OpenSearch Service is a managed service for OpenSearch, a fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search.

But what if you have personal identifiable information (PII) data in your log data? How do you control and audit access to that data? For example, what if you need to exclude fields from log search results or anonymize them? Fine-grained access control can manage access to your data depending on the use case—to return results from only one index, hide certain fields in your documents, or exclude certain documents altogether.

Let's say you have users that work on the logistics of online orders placed on Sunday. The users must not have access to a customer's PII data and must be restricted from seeing the customer's email. Additionally, the customer's full name and first name must be anonymized. This post demonstrates implementing this field-level security with OpenSearch Service security controls.

Solution overview

The solution has the following steps to provision OpenSearch Service with Amazon Cognito federation within Amazon Virtual Private Cloud (Amazon VPC), use a proxy server to sign in to OpenSearch Dashboards, and demonstrate the field-level security:

  1. Create an OpenSearch Service domain with VPC access and fine-grained access enabled.
  2. Access OpenSearch Service from outside the VPC and load the sample data.
  3. Create an OpenSearch Service role for field-level security and map it to a backend role.

OpenSearch Service security has three main layers:

  • Network – Determines whether a request can reach an OpenSearch Service domain. Placing an OpenSearch Service domain within a VPC enables secure communication between OpenSearch Service and other services within the VPC without the need for an internet gateway, NAT device, or VPN connection. The associated security groups must permit clients to reach the OpenSearch Service endpoint.
  • Domain access policy – After a request reaches a domain endpoint, the domain access policy allows or denies the request access to a given URI at the edge of the domain. The domain access policy specifies which actions a principal can perform on the domain’s sub-resources, which include OpenSearch Service indexes and APIs. If a domain access policy contains AWS Identity and Access Management (IAM) users or roles, clients must send signed requests using AWS Signature Version 4.
  • Fine-grained access control – After the domain access policy allows a request to reach a domain endpoint, fine-grained access control evaluates the user credentials and either authenticates the user or denies the request. If fine-grained access control authenticates the user, the request is handled based on the OpenSearch Service roles mapped to the user. Additional security levels include:
    • Cluster-level security – To make broad requests such as _mget, _msearch, and _bulk, monitor health, take snapshots, and more. For details, see Cluster permissions.
    • Index-level security – To create new indexes, search indexes, read and write documents, delete documents, manage aliases, and more. For details, see Index permissions.
    • Document-level security – To restrict the documents in an index that a user can see. For details, see Document-level security.
    • Field-level security – To control the document fields a user can see. When creating a role, add a list of fields to either include or exclude. If you include fields, any users you map to that role can see only those fields. If you exclude fields, they can see all fields except the excluded ones. Field-level security affects the number of fields included in hits when you search. For details, see Field-level security.
    • Field masking – To anonymize the data in a field. If you apply the standard masking to a field, OpenSearch Service uses a secure, random hash that can cause inaccurate aggregation results. To perform aggregations on masked fields, use pattern-based masking instead. For details, see Field masking.

The following figure illustrates these layers.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An Amazon Cognito user pool and identity pool

Create an OpenSearch Service domain with VPC access

You first create an OpenSearch Service domain with VPC access, enabling fine-grained access control and choosing the IAM ARN as the master user.

When you use IAM for the master user, all requests to the cluster must be signed using AWS Signature Version 4. For sample code, see Signing HTTP requests to Amazon OpenSearch Service. IAM is recommended if you want to use the same users on multiple clusters, to use Amazon Cognito to access OpenSearch Dashboards, or if you have OpenSearch Service clients that support Signature Version 4 signing.

Fine-grained access control requires HTTPS, node-to-node encryption, and encryption at rest. Node-to-node encryption enables TLS 1.2 encryption for all communications within the VPC. If you send data to OpenSearch Service over HTTPS, node-to-node encryption helps ensure that your data remains encrypted as OpenSearch Service distributes (and redistributes) it throughout the cluster.

Add a domain access policy to allow the specified IAM ARNs to the URI at the edge of the domain.
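
For example, a hedged sketch of applying such a policy with boto3 might look like the following, where the role ARNs, Region, and domain name are placeholders matching the roles used later in this walkthrough.

import json

import boto3

opensearch = boto3.client('opensearch')

# Hypothetical policy that lets the Cognito-authenticated IAM roles call the domain's HTTP APIs
access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": [
            "arn:aws:iam::<account-id>:role/Cognito_<Cognito User Pool>Auth_Role",
            "arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole"
        ]},
        "Action": "es:ESHttp*",
        "Resource": "arn:aws:es:<region>:<account-id>:domain/<domain-name>/*"
    }]
}

opensearch.update_domain_config(
    DomainName='<domain-name>',
    AccessPolicies=json.dumps(access_policy))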

Set up Amazon Cognito to federate into OpenSearch Service

You can authenticate and protect your OpenSearch Service default installation of OpenSearch Dashboards using Amazon Cognito. If you don’t configure Amazon Cognito authentication, you can still protect Dashboards using an IP-based access policy and a proxy server, HTTP basic authentication, or SAML. For more details, see Amazon Cognito authentication for OpenSearch Dashboards.

Create a user called masteruser in the Amazon Cognito user pool that was configured for the OpenSearch Service domain and associate the user with the IAM role Cognito_<Cognito User Pool>Auth_Role, which is a master user in OpenSearch Service. Create another user called ecomuser1 and associate it with a different IAM role, for example OpenSearchFineGrainedAccessRole. Note that ecomuser1 doesn’t have any access by default.

If you want to configure SAML authentication, see SAML authentication for OpenSearch Dashboards.

Access OpenSearch Service from outside the VPC

When you place your OpenSearch Service domain within a VPC, your computer must be able to connect to the VPC. This connection can be VPN, transit gateway, managed network, or proxy server.

Fine-grained access control has an OpenSearch Dashboards plugin that simplifies management tasks. You can use Dashboards to manage users, roles, mappings, action groups, and tenants. The Dashboards sign-in page and underlying authentication method differs depending on how you manage users and configured your domain.

Load sample data into OpenSearch

Sign in as masteruser to access OpenSearch Dashboards and load the sample data for ecommerce orders, flight data, and web logs.

Create an OpenSearch Service role and user mapping

OpenSearch Service roles are the core ways of controlling access to your cluster. Roles contain any combination of cluster-wide permissions, index-specific permissions, document-level and field-level security, and tenants.

You can create new roles for fine-grained access control and map roles to users using OpenSearch Dashboards or the _plugins/_security operation in the REST API. For more information, see Create roles and Map users to roles. Fine-grained access control also includes a number of predefined roles.

Backend roles offer another way of mapping OpenSearch Service roles to users. Rather than mapping the same role to dozens of different users, you can map the role to a single backend role, and then make sure that all users have that backend role. Note that the master user ARN is mapped to the all_access and security_manager roles by default to give the user full access to the data.

Create an OpenSearch Service role for field-level security

For our use case, an ecommerce company has requirements for certain users to see the online orders placed on Sunday. The users need to look at the order fulfillment logistics for only those orders. They don't need to see the customer's email. They also don't have to know the actual name of the customer; the customer's first name and full name must be anonymized when displayed to the user.

Create a role in OpenSearch Service with the following steps:

  1. Log in to OpenSearch Dashboards as masteruser.
  2. Choose Security, Roles, and Create role.
  3. Name the role Orders-placed-on-Sunday.
  4. For Index permissions, specify opensearch_dashboards_sample_data_ecommerce.
  5. For the action group, choose read.
  6. For Document-level security, specify the following query:
    {
      "match": {
        "day_of_week" : "Sunday"
      }
    }

  7. For Field-level security, choose Exclude and specify email.
  8. For Anonymization, specify customer_first_name and customer_full_name.
  9. Choose Create.
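
The same role can also be defined through the security plugin REST API instead of OpenSearch Dashboards. The following is a minimal sketch assuming SigV4 signing with the third-party requests-aws4auth library and the master IAM credentials; the domain endpoint and Region are placeholders, and the payload fields (dls, fls with a ~ prefix for exclusions, masked_fields) follow the OpenSearch Security role API.

import json

import boto3
import requests
from requests_aws4auth import AWS4Auth  # third-party helper for SigV4 signing (assumption)

region = 'us-east-1'                    # placeholder Region
host = 'https://<domain-endpoint>'      # VPC endpoint of the OpenSearch Service domain

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, 'es', session_token=credentials.token)

role = {
    "index_permissions": [{
        "index_patterns": ["opensearch_dashboards_sample_data_ecommerce"],
        "allowed_actions": ["read"],
        # Document-level security: only orders placed on Sunday
        "dls": json.dumps({"match": {"day_of_week": "Sunday"}}),
        # Field-level security: exclude the email field (the ~ prefix means exclude)
        "fls": ["~email"],
        # Anonymize these fields in search results
        "masked_fields": ["customer_first_name", "customer_full_name"]
    }]
}

response = requests.put(
    f"{host}/_plugins/_security/api/roles/Orders-placed-on-Sunday",
    auth=awsauth,
    json=role,
    headers={"Content-Type": "application/json"})
print(response.status_code, response.text)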

You can see the following permissions to the role Orders-placed-on-Sunday.

Choose View expression to see the document-level security.

Map the OpenSearch Service role to the backend role of the Amazon Cognito group

To perform user mapping, complete the following steps:

  1. Go to the OpenSearch Service role Orders-placed-on-Sunday.
  2. Choose Mapped users, Manage mapping.
  3. For Backend roles, enter arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole.
  4. Choose Map.
  5. Return to the list of roles and choose the predefined role opensearch_dashboards_user, which includes the permissions a user needs to work with index patterns, visualizations, dashboards, and tenants.
  6. Map the opensearch_dashboards_user role to arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole.
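
Similarly, the backend role mapping can be applied through the security plugin REST API. The following sketch makes the same assumptions as the previous one (SigV4 signing via the third-party requests-aws4auth library, placeholder endpoint and Region).

import boto3
import requests
from requests_aws4auth import AWS4Auth  # third-party helper for SigV4 signing (assumption)

region = 'us-east-1'                    # placeholder Region
host = 'https://<domain-endpoint>'      # VPC endpoint of the OpenSearch Service domain

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, 'es', session_token=credentials.token)

# Map the IAM backend role to the field-level security role
mapping = {
    "backend_roles": ["arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole"]
}

response = requests.put(
    f"{host}/_plugins/_security/api/rolesmapping/Orders-placed-on-Sunday",
    auth=awsauth,
    json=mapping,
    headers={"Content-Type": "application/json"})
print(response.status_code, response.text)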

Test the solution

To test your fine-grained access control, complete the following steps:

  1. Log in to the OpenSearch Dashboards URL as ecomuser1.
  2. Go to OpenSearch Plugins and choose Query Workbench.
  3. Run the following SQL queries in Query Workbench to verify the fine-grained access applied to ecomuser1, as compared to the same queries run by masteruser.
Results when signed in as masteruser:

  • SHOW tables LIKE %sample%; – Returns opensearch_dashboards_sample_data_ecommerce, opensearch_dashboards_sample_data_flights, and opensearch_dashboards_sample_data_logs.
  • SELECT COUNT(*) FROM opensearch_dashboards_sample_data_flights; – Returns 13059.
  • SELECT day_of_week, count(*) AS total_records FROM opensearch_dashboards_sample_data_ecommerce GROUP BY day_of_week_i, day_of_week ORDER BY day_of_week_i; – Returns totals for every day: Monday 579, Tuesday 609, Wednesday 592, Thursday 775, Friday 770, Saturday 736, Sunday 614.
  • SELECT customer_last_name AS last_name, customer_full_name AS full_name, email FROM opensearch_dashboards_sample_data_ecommerce WHERE day_of_week = 'Sunday' AND order_id = '582936'; – Returns last_name Miller, full_name Gwen Miller, and email [email protected].

Results when signed in as ecomuser1:

  • SHOW tables LIKE %sample%; – Fails with "no permissions for [indices:admin/get] and User [name=Cognito/<cognito pool-id>/ecomuser1, backend_roles=[arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole]". ecomuser1 can't list tables.
  • SELECT COUNT(*) FROM opensearch_dashboards_sample_data_flights; – Fails with "no permissions for [indices:data/read/search] and User [name=Cognito/<cognito pool-id>/ecomuser1, backend_roles=[arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole]". ecomuser1 can't see flights data.
  • SELECT day_of_week, count(*) AS total_records FROM opensearch_dashboards_sample_data_ecommerce GROUP BY day_of_week_i, day_of_week ORDER BY day_of_week_i; – Returns Sunday 614 only. ecomuser1 can see ecommerce orders placed on Sunday only.
  • SELECT customer_last_name AS last_name, customer_full_name AS full_name, email FROM opensearch_dashboards_sample_data_ecommerce WHERE day_of_week = 'Sunday' AND order_id = '582936'; – Returns last_name Miller and full_name f1493b0f9039531ed02c9b1b7855707116beca01c6c0d42cf7398b8d880d555f, with no email. For ecomuser1, the customer's email is excluded and customer_full_name is anonymized.

From these results, you can see OpenSearch Service field-level access controls were applied to ecomuser1, restricting the user from seeing the customer’s email. Additionally, the customer’s full name and first name were anonymized when displayed to the user.

Conclusion

When OpenSearch Service fine-grained access control authenticates a user, the request is handled based on the OpenSearch Service roles mapped to the user. This post demonstrated fine-grained access control restricting a user from seeing a customer’s PII data, as per the business requirements.

Role-based fine-grained access control enables you to control access to your data on OpenSearch Service at the index level, document level, and field level. When your logs or applications data has sensitive data, the field-level security permissions can help you provision the right level of access for your users.


About the author

Satya Adimula is a Senior Data Architect at AWS based in Boston. With extensive experience in data and analytics, Satya helps organizations derive their business insights from the data at scale.

Reduce cost and improve query performance with Amazon Athena Query Result Reuse

Post Syndicated from Theo Tolv original https://aws.amazon.com/blogs/big-data/reduce-cost-and-improve-query-performance-with-amazon-athena-query-result-reuse/

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run on datasets at petabyte scale. You can use Athena to query your S3 data lake for use cases such as data exploration for machine learning (ML) and AI, business intelligence (BI) reporting, and ad hoc querying.

It’s not uncommon for datasets in data lakes to update only daily, or at most a few times per day, yet queries running on these datasets may be repeated more frequently. Previously, all queries resulted in a data scan, even if the same query was repeated again. When the source data hasn’t changed, repeat queries run needlessly, leading to the same results with higher data scan costs and query latency. Wouldn’t it be better if the results of a recent query could be reused instead?

Query Result Reuse is a new feature available in Athena engine version 3 that makes it possible to reuse the results of a previous query. This can improve performance and reduce cost for frequently run queries, by skipping scanning the source data and instead returning a previously calculated result directly. With Query Result Reuse, you can tell Athena that you want to reuse results of a previous query run, with a maximum age setting that controls how recent a previous result has to be.

Athena automatically reuses any previous results that match your query and maximum age setting, or transparently runs the query again if no match is found. If you know that a dataset changes a few times per day, you can, for example, tell Athena to reuse results that are up to an hour old to avoid rerunning most queries, but still get new results when you run a query soon after new data has become available.

In this post, we demonstrate how to reduce cost and improve query performance with the new Query Result Reuse feature.

When should you use Query Result Reuse?

We recommend using Query Result Reuse for every query where the source data doesn’t change frequently. You can configure the maximum age of results to reuse per query, or use the default, which is 60 minutes. In certain cases where queries include non-deterministic functions such as RAND(), the query fetches fresh data from the input source even if the Query Result Reuse feature is enabled.

Query Result Reuse allows results to be shared among users in a workgroup, as long as they have access to the tables and data. This means Query Result Reuse can benefit not only a single user, but also other users in the workgroup who might be running the same queries. One example where this may be especially beneficial is when you have dashboards that are viewed by many users. The dashboard widgets run the same queries for all users, and are therefore accelerated by Query Result Reuse, when enabled.

Another example is if you have a dataset that is updated daily, and many users who all query the most recent data to create reports. Different people might run the same queries as part of their work; with Query Result Reuse, they can collectively avoid running the same query more than once, making everyone more productive and lowering overall cost by avoiding repeated scans of the same data.

Finally, if you have a historical dataset that is frequently queried, but never or very rarely updated, you can configure queries to reuse results that are up to 7 days old to maximize the chances of reusing results and avoid unnecessary costs.

How does Query Result Reuse work?

Query Result Reuse takes advantage of the fact that Athena writes query results to Amazon S3 as a CSV file. Before the introduction of Query Result Reuse, it was possible to reuse query results by reading these files directly. You could also use the ClientRequestToken parameter of the StartQueryExecution API to ensure queries are run only once, and subsequent runs return the same results. With Query Result Reuse, the process of reusing query results is easier and more versatile.

When Athena receives a query with Query Result Reuse enabled, it looks for a result for a query with the same query string that was run in the same workgroup. The query string has to be identical in order to match.

Query Result Reuse is enabled on a per query basis. When you run a query, you specify how old a result can be for it to be reused, from 1 minute up to 7 days. If the query has been run before, and a result exists that matches the request, it’s returned, otherwise the query is run and a new result is calculated. This new result is then available to be reused by subsequent queries.

You can run the query multiple times with different settings for how old a result you can accept. Results can be reused within the same workgroup, even if a different user ran the query previously.

Before a query result is reused, Athena does a few checks to make sure that the user is still allowed to see the results. It checks that the user has access to the tables involved in the query and permission to read the result file on Amazon S3.

There are some situations where query results can't be reused, for example if the query uses non-deterministic functions, or has AWS Lake Formation fine-grained access controls enabled. These limitations are described in more detail later in this post.

Run queries with Query Result Reuse

In this section, we demonstrate how to run queries with the Query Result Reuse feature via the Athena API, the Athena console, and the JDBC and ODBC drivers.

Run queries using the Athena API

For applications that use the Athena API through the AWS Command Line Interface (AWS CLI) or the AWS SDKs, the StartQueryExecution API call now has the additional parameter ResultReuseConfiguration, where you can enable Query Result Reuse and specify the maximum age of results. For example, when using the AWS CLI, you can run a query with Query Result Reuse enabled as follows:

aws athena start-query-execution \
  --work-group "my_work_group" \
  --query-string "SELECT * FROM my_table LIMIT 10" \
  --result-reuse-configuration \
    "ResultReuseByAgeConfiguration={Enabled=true,MaxAgeInMinutes=60}"

The following code shows how to do this with the AWS SDK for Python:

import boto3

client = boto3.client('athena')
response = client.start_query_execution(
    WorkGroup='my_work_group',
    QueryString='SELECT * FROM my_table LIMIT 10',
    ResultReuseConfiguration={
        'ResultReuseByAgeConfiguration': {
            'Enabled': True,
            'MaxAgeInMinutes': 60
        }
    }
)

These examples assume that my_work_group uses Athena engine v3, that the workgroup has an output location configured, and that the AWS Region has been set in the AWS CLI configuration.

When a query result is reused, you can see in the statistics section of the response from the GetQueryExecution API call that no data was scanned and that results were reused:

{
    "QueryExecution": {
        …
        "Statistics": {
            "EngineExecutionTimeInMillis": 272,
            "DataScannedInBytes": 0,
            "TotalExecutionTimeInMillis": 445,
            "QueryQueueTimeInMillis": 143,
            "ServiceProcessingTimeInMillis": 30,
            "ResultReuseInformation": {
                "ReusedPreviousResult": true
            }
        }
    }
}
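
If you want to check for reuse programmatically, a minimal sketch with the AWS SDK for Python (using a placeholder query execution ID) looks like this:

import boto3

athena = boto3.client('athena')

# Placeholder ID; use the QueryExecutionId returned by start_query_execution.
query_execution_id = '11111111-2222-3333-4444-555555555555'

stats = athena.get_query_execution(
    QueryExecutionId=query_execution_id
)['QueryExecution']['Statistics']

reused = stats.get('ResultReuseInformation', {}).get('ReusedPreviousResult', False)
print('Reused previous result:', reused)
print('Data scanned (bytes):', stats['DataScannedInBytes'])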

Run queries using the Athena console

When you run queries on the Athena console, Query Result Reuse is now enabled by default. You can enable and disable Query Result Reuse in the query editor. You can also choose the pen icon to change the maximum age of results. This setting applies to all queries run on the Athena console.

The following screenshot shows an example query run against AWS CloudTrail logs with Query Result Reuse enabled.

When we ran the query again, the results showed up immediately, and we could see the message “using reused query results” in the Query results pane as a confirmation that the results of our first query had been reused. The Data scanned statistic also showed “-” to indicate that no data was scanned.

Run queries using the JDBC and ODBC drivers

If you use the JDBC or ODBC driver to query Athena, you can now add enableResultReuse=1 to your connection parameters to enable Query Result Reuse, and use ageforResultReuse=60 to set the maximum age to 60 minutes. The drivers automatically apply the setting to all queries running in the context of the connection.

For more information on how to connect to Athena via JDBC and ODBC, refer to Connecting to Amazon Athena with ODBC and JDBC drivers.

Limitations and considerations

Query Result Reuse is supported for most Athena queries, but there are some limitations. We want to ensure that reusing results doesn’t create surprising situations, or expose results that a user shouldn’t have access to. For that reason, Athena always runs a fresh query in the following situations:

  • Non-deterministic functions – Some functions and expressions produce different results from query to query, such as CURRENT_TIME and RAND(). Results for queries that use temporal and non-deterministic expressions and functions aren’t reusable because that could create surprising and inconsistent results.
  • Fine-grained access controls – Row-level and column-level permissions are configured in Lake Formation, and Athena can’t know if these have changed since a previous query result was created. Users using the same workgroup can also have different permissions, and checking all permissions would undo many of the cost and performance savings you get from Query Result Reuse.
  • Federated queries, user-defined functions (UDFs), and external Hive metastores – Users using the same workgroup can have different permissions to invoke the AWS Lambda functions that these features rely on. Athena isn’t able to check that a user that wants to reuse a result has permission to invoke these Lambda functions without running the query, which would negate the cost and performance savings.

Athena detects these conditions automatically and runs the query as if Query Result Reuse wasn’t enabled. You won’t get errors, but you can determine that Query Result Reuse wasn’t in effect by inspecting the query status (see our earlier examples).

Query Result Reuse is available in Athena engine version 3 only.

Conclusion

Query Result Reuse is a new feature in Athena that aims to reduce cost and query response times for datasets that change less frequently than they are queried. For teams that often run the same query, or have dashboards that are used more often than the data changes, Query Result Reuse can result in lower costs and faster results. It’s easy to get started with Query Result Reuse via the Athena console, API, and JDBC/ODBC; all you have to do is set the maximum age of results, and run your queries as usual.

We hope that you will like this new feature, and that it will save cost and improve performance for you and your team!


About the authors

Theo Tolv is a Senior Big Data Architect in the Athena team. He’s worked with small and big data for most of his career and often hangs out on Stack Overflow answering questions about Athena.

Vijay Jain is a Senior Product Manager on the Amazon Web Services (AWS) Athena team. He is passionate about building scalable analytics technologies and products, working closely with enterprise customers. Outside of work, Vijay likes running and spending time with his family.

What’s new with Amazon QuickSight at AWS re:Invent 2022

Post Syndicated from Mia Heard original https://aws.amazon.com/blogs/big-data/whats-new-with-amazon-quicksight-at-aws-reinvent-2022/

AWS re:Invent is a learning conference hosted by AWS for the global cloud computing community. This year’s re:Invent will be held in Las Vegas, Nevada, from November 28 to December 2.

Amazon QuickSight is the most popular cloud-native serverless BI service. This post walks you through the details of all QuickSight-related sessions and activities to help you plan your conference week accordingly. These sessions should appeal to data and analytics teams, product and engineering teams, and line of business and technology leaders interested in modernizing their BI capabilities to transform data into actionable insights for all.

To access the session catalog and reserve your seat for one of our BI sessions, you must be registered for re:Invent. Register now!

Keynotes

Adam Selipsky, Chief Executive Officer of Amazon Web Services – Keynote

Tuesday November 29 | 8:30 AM – 10:30 AM PST | The Venetian

Join Adam Selipsky, Chief Executive Officer of Amazon Web Services, as he looks at the ways that forward-thinking builders are transforming industries and even our future, powered by AWS. He highlights innovations in data, infrastructure, and more that are helping customers achieve their goals faster, take advantage of untapped potential, and create a better future with AWS.

Swami Sivasubramanian, Vice President of AWS Data and Machine Learning – Keynote

Wednesday November 30 | 8:30 AM – 10:30 AM PST | The Venetian

Join Swami Sivasubramanian, Vice President of AWS Data and Machine Learning, as he reveals the latest AWS innovations that can help you transform your company’s data into meaningful insights and actions for your business. In this keynote, several speakers discuss the key components of a future-proof data strategy and how to empower your organization to drive the next wave of modern invention with data. Hear from leading AWS customers who are using data to bring new experiences to life for their customers.

Leadership sessions

ANT203-L (LVL 200) Unlock the value of your data with AWS analytics

Wednesday November 30 | 2:30 – 3:30 PM PST | The Venetian

Data fuels digital transformation and drives effective business decisions. To survive in an ever-changing world, organizations are turning to data to derive insights, create new experiences, and reinvent themselves so they can remain relevant today and in the future. AWS offers analytics services that allow organizations to gain faster and deeper insights from all their data. In this session, G2 Krishnamoorthy, VP of AWS Analytics, addresses the current state of analytics on AWS, covers the latest service innovations around data, and highlights customer successes with AWS analytics. Also, learn from organizations like FINRA and more who have turned to AWS for their digital transformation journey.
Reserve your seat now!

BSI201 (LVL 200) Reinvent how you derive value from your data with Amazon QuickSight

Tuesday November 29 | 2:00 PM – 3:00 PM PST | Mandalay Bay

In this session, learn how you can use AWS-native business analytics to provide your users with machine learning-powered interactive dashboards, natural language query (NLQ), and embedded analytics to provide insights to users at scale, when and where they need it. Join this session to also learn more about how Amazon uses QuickSight internally.
Reserve your seat now!

Breakout sessions

BSI202 (LVL 200) Migrate to cloud-native business analytics with Amazon QuickSight

Wednesday November 30 | 2:30 PM – 3:30 PM PST | Encore

Legacy BI systems can hurt agile decision-making in the modern organization, with expensive licensing, outdated capabilities, and expensive infrastructure management. In this session, discover how migrating your BI to the cloud with cloud-native, fully managed business analytics capabilities from QuickSight can help you overcome these challenges. Learn how you can use QuickSight’s interactive dashboards and reporting capabilities to provide insights to every user in the organization, lowering your costs and enabling better decision-making. Join this session to also learn more about Siemens QuickSight use case.
Reserve your seat now!

BSI207 (LVL 200) Get clarity on your data in seconds with Amazon QuickSight Q

Wednesday November 30 | 4:45 PM – 5:45 PM PST | MGM Grand

Amazon QuickSight Q is a machine learning–powered natural language capability that empowers business users to ask questions about all of their data using everyday business language and get answers in seconds. Q interprets questions to understand their intent and generates an answer instantly in the form of a visual without requiring authors to create graphics, dashboards, or analyses. In this session, the QuickSight Q team provides an overview and demonstration of Q in action. Join this session to also learn more about NASDAQ’s QuickSight use case.
Reserve your seat now!

BSI203 (LVL 200) Differentiate your apps with Amazon QuickSight embedded analytics

Thursday December 1 | 12:30 PM – 1:30 PM PST | Caesars Forum

In this session, learn how to enable new monetization opportunities and grow your business with QuickSight embedded analytics. Discover how you can differentiate your end-user experience by embedding data visualizations, dashboards, and ML-powered natural language query into your applications at scale with no infrastructure to manage. Join this session to also learn more about Guardian Life and Showpad’s QuickSight use case.
Reserve your seat now!

BSI304 (LVL 300) Optimize your AWS cost and usage with Cloud Intelligence Dashboards

Thursday December 1 | 3:30 PM – 4:30 PM PST | Encore

Do your engineers know how much they’re spending? Do you have insight into the details of your cost and usage on AWS? Are you taking advantage of all your cost optimization opportunities? Attend this session to learn how organizations are using the Cloud Intelligence Dashboards to start their FinOps journeys and create cost-aware cultures within their organizations. Dive deep into specific use cases and learn how you can use these insights to drive and measure your cost optimization efforts. Discover how unit economics, resource-level visibility, and periodic spend updates make it possible for FinOps practitioners, developers, and business executives to come together to make smarter decisions. Join this session to also learn more about Dolby Laboratories’ QuickSight use case.
Reserve your seat now!

Chalk talks

BSI302 (LVL 300) Deploy your BI assets at scale to thousands with Amazon QuickSight

Tuesday November 29 | 11:45 AM – 12:45 PM PST | Wynn
As your user bases grow to hundreds or thousands of users, managing assets and user permissions at scale becomes increasingly important. In this chalk talk, learn about best practices for content development, promotion, authorization, organization, and cleanup to help ensure that your users are developing and sharing content in a safe and scalable manner.
Reserve your seat now!

BSI301 (LVL 300) Architecting multi-tenancy for your apps with Amazon QuickSight

Tuesday November 29 | 2:45 PM – 3:45 PM PST | Caesars Forum

Whether you are deploying QuickSight internally in a centrally managed single account or developing a SaaS application with multiple external tenants, it is paramount to focus on security and governance and to isolate tenants from each other. In this chalk talk, learn about different architectures and security frameworks that you can use to deploy QuickSight to thousands of departments or clients in a scalable and controlled manner.
Reserve your seat now!

*This session will also be repeated Wednesday November 30 | 7:45 PM – 8:45 PM PST | Wynn

BSI401 (LVL 400) Insightful dashboards through advanced calculations with QuickSight

Monday November 28 | 12:15 PM – 1:15 PM PST | MGM Grand
Loading data into various charting types is very rarely the end goal for your users. When they find interesting patterns or trends, they tend to dig deeper into their data and use calculations to surface more underlying insights. In this chalk talk, learn about various ways to build insightful and creative dashboards using QuickSight’s new advanced calculation capabilities, such as level-aware calculation and period functions.
Reserve your seat now!

Workshops

BSI205 (LVL 200) Build stunning customized dashboards with Amazon QuickSight

Monday November 28 | 10:45 AM – 12:45 PM PST | Wynn

Want to grow your dashboard building skills? In this workshop, the QuickSight team demonstrates the latest authoring functionality designed to empower you to build beautiful layouts and robust interactive experiences with other applications, right from within your dashboard. You must bring your laptop to participate.
Reserve your seat now!

*This session will also be repeated Thursday December 1 | 11:45 AM – 1:45 PM PST | Caesars Forum

BSI303 (LVL 300) Seamlessly embed analytics into your apps with Amazon QuickSight
Wednesday November 30 | 5:30 PM – 7:30 PM PST | Wynn

In this workshop, learn how you can bring data insights to your internal teams and end customers by simply and seamlessly embedding rich, interactive data visualizations and dashboards into your web applications and portals. You must bring your laptop to participate.
Reserve your seat now!

Partner session

PEX307 (LVL 300) Migrating business intelligence systems to Amazon QuickSight

Wednesday November 30 | 9:15 AM – 10:15 AM PST | Encore

QuickSight is a scalable, serverless, embeddable, machine learning–powered BI tool built for the cloud. If you’re building a cloud-native BI solution and are unsure how to migrate on AWS, this session is for you. Dive deep into BI best practices, tools, and methodologies for migrating BI dashboards to QuickSight, and learn how to use APIs and the AWS CLI to automate common migration tasks required to perform BI dashboard migration. This session is intended for AWS Partners.
Reserve your seat now!

Additional activities

Business Intelligence kiosk in the AWS Village

Visit the Business Intelligence kiosk within the AWS Village to meet with experts to dive deeper into QuickSight capabilities such as Q and embedded analytics. You will be able to ask our experts questions and experience live demos for our newly launched capabilities.

Free QuickSight swag

Make sure to stop by the swag distribution table to grab free QuickSight swag if you have attended either the Business Intelligence kiosk or one of our breakout sessions, chalk talks, or workshops.

Useful resources

Whether you plan on attending re:Invent in person or view available content virtually, you can always learn more about QuickSight through these helpful resources:

QuickSight Community Hub – Ask, answer, and learn with others in the QuickSight Community.

QuickSight YouTube channel – Subscribe to stay up to date on the latest QuickSight workshops, how tos, and demo videos.

QuickSight DemoCentral – Experience QuickSight first-hand through interactive dashboards and demos.


About the authors

Mia Heard is a Product Marketing Manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service.

California State University Chancellor’s Office reduces cost and improves efficiency using Amazon QuickSight for streamlined HR reporting in higher education

Post Syndicated from Madi Hsieh original https://aws.amazon.com/blogs/big-data/california-state-university-chancellors-office-reduces-cost-and-improves-efficiency-using-amazon-quicksight-for-streamlined-hr-reporting-in-higher-education/

The California State University Chancellor’s Office (CSUCO) sits at the center of one of America’s most significant and diverse four-year university systems. The California State University (CSU) serves approximately 477,000 students and employs more than 55,000 staff and faculty members across 23 universities and 7 off-campus centers. The CSU provides students with opportunities to develop intellectually and personally, and to contribute back to the communities throughout California. For this large organization, managing a wide system of campuses while maintaining the decentralized autonomy of each is crucial. In 2019, they needed a highly secure tool to streamline the process of pulling HR data. The CSU had been using a legacy central data warehouse based on data from their financial system, but it lacked the robustness to keep up with modern technology. This wasn’t going to work for their HR reporting needs.

Looking for a tool to match the cloud-based infrastructure of their other operations, the Business Intelligence and Data Operations (BI/DO) team within the Chancellor’s Office chose Amazon QuickSight, a fast, easy-to-use, cloud-powered business analytics service that makes it easy for all employees within an organization to build visualizations, perform ad hoc analysis, and quickly get business insights from their data, any time, on any device. The team uses QuickSight to organize HR information across the CSU, implementing a centralized security system.

“It’s easy to use, very straightforward, and relatively intuitive. When you couple the experience of using QuickSight, with a huge cost difference to [the BI platform we had been using], to me, it’s a simple choice,”

– Andy Sydnor, Director Business Intelligence and Data Operations at the CSUCO.

With QuickSight, the team has the capability to harness security measures and deliver data insights efficiently across their campuses.

In this post, we share how the CSUCO uses QuickSight to reduce cost and improve efficiency in their HR reporting.

Delivering BI insights across the CSU’s 23 universities

The CSUCO serves the university system’s faculty, students, and staff by overseeing operations in several areas, including finance, HR, student information, and space and facilities. Since migrating to QuickSight in 2019, the team has built dashboards to support these operations. Dashboards include COVID-related leaves of absence, historical financial reports, and employee training data, along with a large selection of dashboards to track employee data at an individual campus level or from a system-wide perspective.

The team created a process for reading security roles from the ERP system and then translating them using QuickSight groups for internal HR reporting. QuickSight allowed them to match security measures with the benefits of low maintenance and familiarity to their end-users.

With QuickSight, the CSUCO is able to run a decentralized security process where campus security teams can provision access directly and users can get to their data faster. Before transitioning to QuickSight, the BI/DO team spent hours trying to get to specific individual-level data, but with QuickSight, the retrieval time was shortened to just minutes. For the first time, Sydnor and his team were able to pinpoint a specific employee’s work history without having to take additional actions to find the exact data they needed.

Cost savings compared to other BI tools

Sydnor shares that, for a public organization, one of the most attractive qualities of QuickSight is the immense cost savings. The BI/DO team at the Chancellor’s Office estimates that they’re saving roughly 40% on costs since switching from their previous BI platform, which is a huge benefit for a public organization of this scale. Their previous BI tool was costing them extensive amounts of money on licensing for features they didn’t require; the CSUCO felt they weren’t getting the best use of their investment.

The functionality of QuickSight to meet their reporting needs at an affordable price point is what makes QuickSight the CSUCO’s preferred BI reporting tool. Sydnor likes that with QuickSight, “we don’t have to go out and buy a subscription or a license for somebody, we can just provision access. It’s much easier to distribute the product.” QuickSight allows the CSUCO to focus their budget in other areas rather than having to pay for charges by infrequent users.

Simple and intuitive interface

Getting started in QuickSight was a no-brainer for Sydnor and his team. As a public organization, the procurement process can be cumbersome, thereby slowing down valuable time for putting their data to action. As an existing AWS customer, the CSUCO could seamlessly integrate QuickSight into their package of AWS services. An issue they were running into with other BI tools was encountering roadblocks to setting up the system, which wasn’t an issue with QuickSight, because it’s a fully managed service that doesn’t require deploying any servers.

The following screenshot shows an example of the CSUCO security audit dashboard.


Sydnor tells us, “Our previous BI tool had a huge library of visualizations, but we don’t need 95% of those. Our presentations look great with the breadth of visuals QuickSight provides. Most people just want the data and ultimately need a robust vehicle to get data out of a database and onto a table or visualization.”

Converting from their original BI tool to QuickSight was painless for his team. Sydnor tells us that he has “yet to see something we can’t do with QuickSight.” One of Sydnor’s employees who was a user of the previous tool learned QuickSight in just 30 minutes. Now, they conduct QuickSight demos all the time.

Looking to the future: Expanding BI integration and adopting Amazon QuickSight Q

With QuickSight, the Chancellor’s Office aims to roll out more HR dashboards across its campuses and extend the tool for faculty use in the classroom. In the upcoming year, two campuses are joining CSUCO in building their own HR reporting dashboards through QuickSight. The organization is also making plans to use QuickSight to report on student data and implement external-facing dashboards. Some of the data points they’re excited to explore are insights into at-risk students and classroom scheduling on campus.

Thinking ahead, CSUCO is considering Amazon QuickSight Q, a machine learning-powered natural language capability that gives anyone in an organization the ability to ask business questions in natural language and receive accurate answers with relevant visualizations. Sydnor says, “How cool would that be if professors could go in and ask simple, straightforward questions like, ‘How many of my department’s students are taking full course loads this semester?’ It has a lot of potential.”

Summary

The CSUCO is excited to be a champion of QuickSight in the CSU, and is looking for ways to increase its implementation across the organization in the future.

To learn more, visit the website for the California State University Chancellor’s Office. For more on QuickSight, visit the Amazon QuickSight product page, or browse other Big Data Blog posts featuring QuickSight.


About the authors

Madi Hsieh, AWS 2022 Summer Intern, UCLA.

Tina Kelleher, Program Manager at AWS.

Build the next generation, cross-account, event-driven data pipeline orchestration product

Post Syndicated from Maria Guerra original https://aws.amazon.com/blogs/big-data/build-the-next-generation-cross-account-event-driven-data-pipeline-orchestration-product/

This is a guest post by Mehdi Bendriss, Mohamad Shaker, and Arvid Reiche from Scout24.

At Scout24 SE, we love data pipelines, with over 700 pipelines running daily in production, spread across over 100 AWS accounts. As we democratize data and our data platform tooling, each team can create, maintain, and run their own data pipelines in their own AWS account. This freedom and flexibility are required to build scalable organizations, but they are also full of pitfalls. With no rules in place, chaos is inevitable.

We took a long road to get here. We’ve been developing our own custom data platform since 2015, building most tools ourselves. Our self-developed data pipeline orchestration tool, now a legacy system, has been in place since 2016.

The motivation to invest a year of work into a new solution was driven by two factors:

  • Lack of transparency on data lineage, especially dependency and availability of data
  • Little room to implement governance

As a technical platform, our target user base for our tooling includes data engineers, data analysts, data scientists, and software engineers. We share the vision that anyone with relevant business context and minimal technical skills can create, deploy, and maintain a data pipeline.

In this context, in 2015 we created the predecessor of our new tool, which allows users to describe their pipeline in a YAML file as a list of steps. It worked well for a while, but we faced many problems along the way, notably:

  • Our product didn’t support triggering pipelines based on the status of other pipelines; instead, triggering relied on the presence of _SUCCESS files in Amazon Simple Storage Service (Amazon S3), which we checked by polling periodically. In complex organizations, data jobs often have strong dependencies on other work streams.
  • Given the previous point, most pipelines could only be scheduled based on a rough estimate of when their parent pipelines might finish. This led to cascading failures when the parents failed or didn’t finish on time.
  • When a pipeline fails, gets fixed, and is then manually redeployed, all its dependent pipelines must be rerun manually. This means that the data producer bears the responsibility of notifying every single team downstream.

Having data and tooling democratized without the ability to provide insights into which jobs, data, and dependencies exist diminishes synergies within the company, leading to silos and problems in resource allocation. It became clear that we needed a successor for this product that would give more flexibility to the end-user, lower compute costs, and no infrastructure management overhead.

In this post, we describe, through a hypothetical case study, the constraints under which the new solution should perform, the end-user experience, and the detailed architecture of the solution.

Case study

Our case study looks at the following teams:

  • The core-data-availability team has a data pipeline named listings that runs every day at 3:00 AM on the AWS account Account A, and produces on Amazon S3 an aggregate of the listings events published on the platform on the previous day.
  • The search team has a data pipeline named searches that runs every day at 5:00 AM on the AWS account Account B, and exports to Amazon S3 the list of search events that happened on the previous day.
  • The rent-journey team wants to measure a metric referred to as X; they create a pipeline named pipeline-X that runs daily on the AWS account Account C, and relies on the data of both previous pipelines. pipeline-X should only run daily, and only after both the listings and searches pipelines succeed.

User experience

We provide users with a CLI tool that we call DataMario (relating to its predecessor DataWario), and which allows users to do the following:

  • Set up their AWS account with the necessary infrastructure needed to run our solution
  • Bootstrap and manage their data pipeline projects (creating, deploying, deleting, and so on)

When creating a new project with the CLI, we generate (and require) every project to have a pipeline.yaml file. This file describes the pipeline steps, how they should be triggered, alerting, the type of instances and clusters the pipeline will run on, and more.

In addition to the pipeline.yaml file, we allow advanced users with very niche and custom needs to create their pipeline definition entirely using a TypeScript API we provide them, which allows them to use the whole collection of constructs in the AWS Cloud Development Kit (AWS CDK) library.

For the sake of simplicity, we focus on the triggering of pipelines and the alerting in this post, along with the definition of pipelines through pipeline.yaml.

The listings and searches pipelines are triggered as per a scheduling rule, which the team defines in the pipeline.yaml file as follows:

trigger: 
    schedule: 
        hour: 3

pipeline-X is triggered depending on the success of both the listings and searches pipelines. The team defines this dependency relationship in the project’s pipeline.yaml file as follows:

trigger: 
    executions: 
        allOf: 
            - name: listings 
              account: Account_A_ID 
              status: 
                  - SUCCESS 
            - name: searches 
              account: Account_B_ID 
              status: 
                  - SUCCESS

The executions block can define a complex set of relationships by combining the allOf and anyOf blocks, along with a logical operator key (operator: OR / AND) that allows mixing the allOf and anyOf blocks. We focus on the most basic use case in this post.

Accounts setup

To support alerting, logging, and dependencies management, our solution has components that must be pre-deployed in two types of accounts:

  • A central AWS account – This is managed by the Data Platform team and contains the following:
    • A central data pipeline Amazon EventBridge bus receiving all the run status changes of AWS Step Functions workflows running in user accounts
    • An AWS Lambda function logging the Step Functions workflow run changes in an Amazon DynamoDB table to verify if any downstream pipelines should be triggered based on the current event and previous run status changes log
    • A Slack alerting service to send alerts to the Slack channels specified by users
    • A trigger management service that broadcasts triggering events to the downstream buses in the user accounts
  • All AWS user accounts using the service – These accounts contain the following:
    • A data pipeline EventBridge bus that receives Step Functions workflow run status changes forwarded from the central EventBridge bus
    • An S3 bucket to store data pipeline artifacts, along with their logs
    • Resources needed to run Amazon EMR clusters, like security groups, AWS Identity and Access Management (IAM) roles, and more

With the provided CLI, users can set up their account by running the following code:

$ dpc setup-user-account

Solution overview

The following diagram illustrates the architecture of the cross-account, event-driven pipeline orchestration product.

In this post, we refer to the different colored and numbered squares to reference a component in the architecture diagram. For example, the green square with label 3 refers to the EventBridge bus default component.

Deployment flow

This section is illustrated with the orange squares in the architecture diagram.

A user can create a project consisting of a data pipeline or more using our CLI tool as follows:

$ dpc create-project -n 'project-name'

The created project contains several components that allow the user to create and deploy data pipelines, which are defined in .yaml files (as explained earlier in the User experience section).

The workflow of deploying a data pipeline such as listings in Account A is as follows:

  • Deploy listings by running the command dpc deploy in the root folder of the project. An AWS CDK stack with all required resources is automatically generated.
  • The previous stack is deployed as an AWS CloudFormation template.
  • The stack uses custom resources to perform some actions, such as storing information needed for alerting and pipeline dependency management.
  • Two Lambda functions are triggered, one to store the mapping pipeline-X/slack-channels used for alerting in a DynamoDB table, and another one to store the mapping between the deployed pipeline and its triggers (other pipelines that should result in triggering the current one).
  • To decouple alerting and dependency management services from the other components of the solution, we use Amazon API Gateway for two components:
    • The Slack API.
    • The dependency management API.
  • All calls for both APIs are traced in Amazon CloudWatch log groups and handled by two Lambda functions:
    • The Slack channel publisher Lambda function, used to store the mapping pipeline_name/slack_channels in a DynamoDB table.
    • The dependencies publisher Lambda function, used to store the pipelines dependencies (the mapping pipeline_name/parents) in a DynamoDB table.
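
As an illustration of that last step, the following is a minimal sketch of what a dependencies publisher function could store. The table name, attribute names, and event shape are hypothetical; they mirror the description above rather than the actual DataMario implementation.

import boto3

# Hypothetical table and attribute names, used for illustration only.
dependencies_table = boto3.resource('dynamodb').Table('pipeline-dependencies')

def handler(event, context):
    # 'event' is assumed to carry the deployed pipeline's name and its triggers,
    # as posted by the CloudFormation custom resource through API Gateway.
    dependencies_table.put_item(Item={
        'pipeline_name': event['pipeline_name'],   # for example, 'pipeline-X'
        'parents': event['parents']                # for example, ['listings', 'searches']
    })
    return {'statusCode': 200}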

Pipeline trigger flow

This is an event-driven mechanism that ensures that data pipelines are triggered as requested by the user, either following a schedule or a list of fulfilled upstream conditions, such as a group of pipelines succeeding or failing.

This flow relies heavily on EventBridge buses and rules, specifically two types of rules:

  • Scheduling rules.
  • Step Functions event-based rules, with a payload matching the set of statuses of all the parents of a given pipeline. The rules indicate for which set of statuses all the parents of pipeline-X should be triggered.

Scheduling

This section is illustrated with the black squares in the architecture diagram.

The listings pipeline running on Account A is set to run every day at 3:00 AM. The deployment of this pipeline creates an EventBridge rule and a Step Functions workflow for running the pipeline:

  • The EventBridge rule is of type schedule and is created on the default bus (this is the EventBridge bus responsible for listening to native AWS events—this distinction is important to avoid confusion when introducing the other buses). This rule has two main components:
    • A cron-like notation to describe the frequency at which it runs: 0 3 * * ? *.
    • The target, which is the Step Functions workflow that runs the listings pipeline.
  • The listings Step Functions workflow starts immediately when the rule is triggered. (The same happens for the searches pipeline.)

Each user account has a default EventBridge bus, which receives native AWS events (such as the run of any Lambda function) and hosts the scheduled rules.
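
For illustration, the following is a minimal sketch of such a scheduling rule and its target using the AWS SDK for Python; in the real product the equivalent resources are generated by the AWS CDK stack, and the ARNs and rule name below are placeholders.

import boto3

events = boto3.client('events')

# Placeholder ARNs; in practice these are created by the generated AWS CDK stack.
STATE_MACHINE_ARN = 'arn:aws:states:eu-west-1:111111111111:stateMachine:listings'
EVENTS_ROLE_ARN = 'arn:aws:iam::111111111111:role/listings-schedule-role'

# EventBridge cron fields are: minutes hours day-of-month month day-of-week year,
# so 'cron(0 3 * * ? *)' fires every day at 3:00 AM (UTC).
events.put_rule(
    Name='listings-schedule',
    ScheduleExpression='cron(0 3 * * ? *)',
    State='ENABLED'
)

# Start the listings Step Functions workflow whenever the rule fires.
events.put_targets(
    Rule='listings-schedule',
    Targets=[{
        'Id': 'listings-workflow',
        'Arn': STATE_MACHINE_ARN,
        'RoleArn': EVENTS_ROLE_ARN
    }]
)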

Dependency management

This section is illustrated with the green squares in the architecture diagram. The current flow starts after the Step Functions workflow (black square 2) starts, as explained in the previous section.

As a reminder, pipeline-X is triggered when both the listings and searches pipelines are successful. We focus on the listings pipeline for this post, but the same applies to the searches pipeline.

The overall idea is to notify all downstream pipelines that depend on the listings pipeline, in every AWS account, of its change of status, routing the notification through the central orchestration account.

It follows that this flow gets triggered multiple times per pipeline (Step Functions workflow) run as its status changes from RUNNING to either SUCCEEDED, FAILED, TIMED_OUT, or ABORTED, because downstream pipelines could be listening for any of those status change events. The steps are as follows:

  • The event of the Step Functions workflow starting is listened to by the default bus of Account A.
  • The rule export-events-to-central-bus, which specifically listens to the Step Function workflow run status change events, is then triggered.
  • The rule forwards the event to the central bus on the central account.
  • The event is then caught by the rule trigger-events-manager.
  • This rule triggers a Lambda function.
  • The function gets the list of all children pipelines that depend on the current run status of listings.
  • The current run is inserted in the run log Amazon Relational Database Service (Amazon RDS) table, following the schema sfn-listings, time (timestamp), status (SUCCEEDED, FAILED, and so on). You can query the run log RDS table to evaluate the running preconditions of all children pipelines and get all those that qualify for triggering.
  • A triggering event is broadcast in the central bus for each of those eligible children.
  • Those events get broadcast to all accounts through the export rules—including Account C, which is of interest in our case.
  • The default EventBridge bus on Account C receives the broadcasted event.
  • The EventBridge rule gets triggered if the event content matches the expected payload of the rule (notably that both pipelines have a SUCCEEDED status).
  • If the payload is valid, the rule triggers the pipeline-X Step Functions workflow, which provisions the resources it needs (as discussed later in this post).
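
To make the export rule described in the first three steps concrete, here is a minimal sketch using the AWS SDK for Python rather than the AWS CDK; the bus and role ARNs are placeholders, and the event pattern is the standard one that EventBridge emits for Step Functions run status changes.

import json
import boto3

events = boto3.client('events')

# Placeholder ARNs for the central orchestration account's bus and the IAM role
# that allows this account to put events onto it.
CENTRAL_BUS_ARN = 'arn:aws:events:eu-west-1:222222222222:event-bus/central-data-pipeline-bus'
FORWARDING_ROLE_ARN = 'arn:aws:iam::111111111111:role/export-to-central-bus-role'

# Match every Step Functions run status change event on this account's default bus.
events.put_rule(
    Name='export-events-to-central-bus',
    EventBusName='default',
    EventPattern=json.dumps({
        'source': ['aws.states'],
        'detail-type': ['Step Functions Execution Status Change']
    })
)

# Forward matching events to the central bus in the central orchestration account.
events.put_targets(
    Rule='export-events-to-central-bus',
    EventBusName='default',
    Targets=[{
        'Id': 'central-bus',
        'Arn': CENTRAL_BUS_ARN,
        'RoleArn': FORWARDING_ROLE_ARN
    }]
)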

Alerting

This section is illustrated with the gray squares in the architecture diagram.

Teams across the organization handle alerting differently, using Slack messages, email alerts, or OpsGenie alerts.

We decided to allow users to choose their preferred methods of alerting, giving them the flexibility to choose what kind of alerts to receive:

  • At the step level – Tracking the entire run of the pipeline
  • At the pipeline level – When it fails, or when it finishes with a SUCCESS or FAILED status

During the deployment of the pipeline, a new Amazon Simple Notification Service (Amazon SNS) topic gets created with the subscriptions matching the targets specified by the user (URL for OpsGenie, Lambda for Slack or email).

The following code is an example of what it looks like in the user’s pipeline.yaml:

notifications:
    type: FULL_EXECUTION
    targets:
        - channel: SLACK
          addresses:
               - data-pipeline-alerts
        - channel: EMAIL
          addresses:
               - [email protected]

The alerting flow includes the following steps:

  1. As the pipeline (Step Functions workflow) starts (black square 2 in the diagram), the run gets logged into CloudWatch Logs in a log group corresponding to the name of the pipeline (for example, listings).
  2. Depending on the user preference, all the run steps or events may get logged or not thanks to a subscription filter whose target is the execution-tracker-lambda Lambda function. The function gets called anytime a new event gets published in CloudWatch.
  3. This Lambda function parses and formats the message, then publishes it to the SNS topic.
  4. For the email and OpsGenie flows, the flow stops here. For posting the alert message on Slack, the Slack API caller Lambda function gets called with the formatted event payload.
  5. The function then publishes the message to the /messages endpoint of the Slack API Gateway.
  6. The Lambda function behind this endpoint runs, and posts the message in the corresponding Slack channel and under the right Slack thread (if applicable).
  7. The function retrieves the secret Slack REST API key from AWS Secrets Manager.
  8. It retrieves the Slack channels in which the alert should be posted.
  9. It retrieves the root message of the run, if any, so that subsequent messages get posted under the current run thread on Slack.
  10. It posts the message on Slack.
  11. If this is the first message for this run, it stores the mapping with the DB schema execution/slack_message_id to initiate a thread for future messages related to the same run.
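
To illustrate the core of steps 7 through 11 (channel lookup is omitted and passed in directly), here is a minimal sketch of the Slack-posting logic, assuming a bot token stored in AWS Secrets Manager and a hypothetical DynamoDB table holding the execution/slack_message_id mapping; the secret name, table name, and attribute names are illustrative, not the actual implementation.

import json
import urllib.request

import boto3

SLACK_API_URL = 'https://slack.com/api/chat.postMessage'

def post_alert(channel_id, text, execution_id):
    # Placeholder secret and table names.
    secret = boto3.client('secretsmanager').get_secret_value(SecretId='slack/bot-token')
    token = json.loads(secret['SecretString'])['token']

    threads = boto3.resource('dynamodb').Table('execution-slack-threads')
    mapping = threads.get_item(Key={'execution_id': execution_id}).get('Item')

    payload = {'channel': channel_id, 'text': text}
    if mapping:
        # Later messages for the same run are posted under the original thread.
        payload['thread_ts'] = mapping['slack_message_id']

    request = urllib.request.Request(
        SLACK_API_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Authorization': f'Bearer {token}',
                 'Content-Type': 'application/json'}
    )
    body = json.loads(urllib.request.urlopen(request).read())

    if not mapping and body.get('ok'):
        # First message of this run: store its ts so later alerts thread under it.
        threads.put_item(Item={'execution_id': execution_id,
                               'slack_message_id': body['ts']})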

Resource provisioning

This section is illustrated with the light blue squares in the architecture diagram.

To run a data pipeline, we need to provision an EMR cluster, which in turn requires some information like Hive metastore credentials, as shown in the workflow. The workflow steps are as follows:

  • Trigger the Step Functions workflow listings on schedule.
  • Run the listings workflow.
  • Provision an EMR cluster.
  • Use a custom resource to decrypt the Hive metastore password to be used in Spark jobs relying on central Hive tables or views.

End-user experience

After all preconditions are fulfilled (both the listings and searches pipelines succeeded), the pipeline-X workflow runs as shown in the following diagram.

As shown in the diagram, the pipeline description (as a sequence of steps) defined by the user in the pipeline.yaml is represented by the orange block.

The steps before and after this orange section are automatically generated by our product, so users don’t have to take care of provisioning and freeing compute resources. In short, the CLI tool we provide our users synthesizes the user’s pipeline definition in the pipeline.yaml and generates the corresponding DAG.

Additional considerations and next steps

We tried to stay consistent and stick to one programming language for the creation of this product. We chose TypeScript, which played well with AWS CDK, the infrastructure as code (IaC) framework that we used to build the infrastructure of the product.

Similarly, we chose TypeScript for building the business logic of our Lambda functions, and of the CLI tool (using Oclif) we provide for our users.

As demonstrated in this post, EventBridge is a powerful service for event-driven architectures, and it plays a central and important role in our products. Where we ran into its limitations, we found that pairing Lambda with EventBridge fulfilled all our current needs and granted a high level of customization that allowed us to be creative in the features we wanted to serve our users.

Needless to say, we plan to keep developing the product, and have a multitude of ideas, notably:

  • Extend the list of core resources on which workloads run (currently only Amazon EMR) by adding other compute services, such as Amazon Elastic Compute Cloud (Amazon EC2)
  • Use the Constructs Hub to allow users in the organization to develop custom steps to be used in all data pipelines (we currently only offer Spark and shell steps, which suffice in most cases)
  • Use the stored metadata regarding pipeline dependencies for data lineage, to have an overview of the overall health of the data pipelines in the organization, and more

Conclusion

This architecture and product brought many benefits. It allows us to:

  • Have a more robust and clear dependency management of data pipelines at Scout24.
  • Save on compute costs by no longer scheduling pipelines based on a rough estimate of when their predecessors usually finish. By shifting to an event-driven paradigm, no pipeline gets started unless all its prerequisites are fulfilled.
  • Track our pipelines granularly and in real time on a step level.
  • Provide more flexible and alternative business logic by exposing multiple event types that downstream pipelines can listen to. For example, a fallback downstream pipeline might be run in case of a parent pipeline failure.
  • Reduce the cross-team communication overhead in case of failures or stopped runs by increasing the transparency of the whole pipelines’ dependency landscape.
  • Avoid manually restarting pipelines after an upstream pipeline is fixed.
  • Have an overview of all jobs that run.
  • Support the creation of a performance culture characterized by accountability.

We have big plans for this product. We will use DataMario to implement granular data lineage, observability, and governance. It’s a key piece of infrastructure in our strategy to scale data engineering and analytics at Scout24.

We will make DataMario open source towards the end of 2022. This is in line with our strategy to promote our approach to a solution on a self-built, scalable data platform. And with our next steps, we hope to extend this list of benefits and ease the pain in other companies solving similar challenges.

Thank you for reading.


About the authors

Mehdi Bendriss is a Senior Data / Data Platform Engineer with an MSc in Computer Science and over 9 years of experience in software, ML, and data and data platform engineering, designing and building large-scale data and data platform products.

Mohamad Shaker is a Senior Data / Data Platform Engineer, with over 9 years of experience in software and data engineering, designing and building large-scale data and data platform products that enable users to access, explore, and utilize their data to build great data products.

Arvid Reiche is a Data Platform Leader, with over 9 years of experience in data, building a data platform that scales and serves the needs of the users.

Marco Salazar is a Solutions Architect working with Digital Native customers in the DACH region, with over 5 years of experience building and delivering end-to-end, high-impact, cloud-native solutions on AWS for Enterprise and Sports customers across EMEA. He currently focuses on enabling customers to define technology strategies on AWS for the short and long term that allow them to achieve their desired business objectives, specializing in data and analytics engagements. In his free time, Marco enjoys building side projects involving mobile/web apps, microcontrollers and IoT, and most recently wearable technologies.

Your guide to AWS Analytics at re:Invent 2022

Post Syndicated from Imtiaz Sayed original https://aws.amazon.com/blogs/big-data/your-guide-to-aws-analytics-at-reinvent-2022/

Join the global cloud community at AWS re:Invent this year to meet, get inspired, and rethink what’s possible!

Reserved seating is available for registered attendees to secure seats in the sessions of their choice. You can reserve a seat in your favorite sessions by signing in to the attendee portal and navigating to Event Sessions. For those who can’t make it in person, you can get your free online pass to watch live keynotes and leadership sessions by registering for virtual-only access. This curated attendee guide helps data and analytics enthusiasts manage their schedule*, as well as navigate the AWS analytics and business intelligence tracks to get the best out of re:Invent.

For additional session details, visit the AWS Analytics splash page.

#AWSanalytics, #awsfordata, #reinvent22

Keynotes

KEY002 | Adam Selipsky (CEO, Amazon Web Services) | Tuesday, November 29 | 8:30 AM – 10:30 AM

Join Adam Selipsky, CEO of Amazon Web Services, as he looks at the ways that forward-thinking builders are transforming industries and even our future, powered by AWS.

KEY003 | Swami Sivasubramanian (Vice President, AWS Data and Machine Learning) | Wednesday, November 30 | 8:30 AM – 10:30 AM

Join Swami Sivasubramanian, Vice President of AWS Data and Machine Learning, as he reveals the latest AWS innovations that can help you transform your company’s data into meaningful insights and actions for your business.

Leadership sessions

ANT203-L | Unlock the value of your data with AWS analytics | G2 Krishnamoorthy, VP of AWS Analytics | Wednesday, November 30 | 2:30 PM – 3:30 PM

G2 addresses the current state of analytics on AWS, covers the latest service innovations around data, and highlights customer successes with AWS analytics. Also, learn from organizations like FINRA and more who have turned to AWS for their digital transformation journey.

Breakout sessions

AWS re:Invent breakout sessions are lecture-style and one hour long sessions delivered by AWS experts, customers, and partners.

Monday, Nov 28 Tuesday, Nov 29 Wednesday, Nov 30 Thursday, Dec 1 Friday, Dec 2

10:00 AM – 11:00 AM

ANT326 | How BMW, Intuit, and Morningstar are transforming with AWS and Amazon Athena

11:00 AM – 12:00 PM

ANT301 | Democratizing your organization’s data analytics experience

10:00 AM – 11:00 AM

ANT212 | How JPMC and LexisNexis modernize analytics with Amazon Redshift

12:30 PM – 1:30 PM

ANT207 | What’s new in AWS streaming

8:30 AM – 9:30 AM

ANT311 | Building security operations with Amazon OpenSearch Service

11:30 AM – 12:30 PM

ANT206 | What’s new in Amazon OpenSearch Service

12:15 PM – 1:15 PM

ANT334 | Simplify and accelerate data integration and ETL modernization with AWS Glue

10:00 AM – 11:00 AM

ANT209 | Build interactive analytics applications

12:30 PM – 1:30 PM

BSI203 | Differentiate your apps with Amazon QuickSight embedded analytics


12:15 PM – 1:15 PM

ANT337 | Migrating to Amazon EMR to reduce costs and simplify operations

1:15 PM – 2:15 PM

ANT205 | Achieving your modern data architecture

10:45 AM – 11:45 AM

ANT218 | Leveling up computer vision and artificial intelligence development

1:15 PM – 2:15 PM

ANT336 | Building data mesh architectures on AWS


1:00 PM – 2:00 PM

ANT341 | How Riot Games processes 20 TB of analytics data daily on AWS

2:00 PM – 3:00 PM

BSI201 | Reinvent how you derive value from your data with Amazon QuickSight

11:30 AM – 12:30 PM

ANT340 | How Sony Orchard accelerated innovation with Amazon MSK

2:00 PM – 3:00 PM

ANT342 | How Poshmark accelerates growth via real-time analytics and personalization


1:45 PM – 2:45 PM

BSI207 | Get clarity on your data in seconds with Amazon QuickSight Q

2:45 PM – 3:45 PM

ANT339 | How Samsung modernized architecture for real-time analytics

1:00 PM – 2:00 PM

ANT201 | What’s new with Amazon Redshift

3:30 PM – 4:30 PM

ANT219 | Dow Jones and 3M: Observability with Amazon OpenSearch Service


3:15 PM – 4:15 PM

ANT302 | What’s new with Amazon EMR

3:30 PM – 4:30 PM

ANT204 | Enabling agility with data governance on AWS

2:30 PM – 3:30 PM

BSI202 | Migrate to cloud-native business analytics with Amazon QuickSight


4:45 PM – 5:45 PM

ANT335 | How Disney Parks uses AWS Glue to replace thousands of Hadoop jobs

5:00 PM – 6:00 PM

ANT338 | Scaling data processing with Amazon EMR at the speed of market volatility

4:45 PM – 5:45 PM

ANT324 | Modernize your data warehouse


5:30 PM – 6:30 PM

ANT220 | Using Amazon AppFlow to break down data silos for analytics and ML

5:45 PM – 6:45 PM

ANT325 | Simplify running Apache Spark and Hive apps with Amazon EMR Serverless

5:30 PM – 6:30 PM

ANT317 | Self-service analytics with Amazon Redshift Serverless


Chalk talks

Chalk talks are an hour long, highly interactive content format with a small audience. Each begins with a short lecture delivered by an AWS expert, followed by a Q&A session with the audience.

Monday, Nov 28 Tuesday, Nov 29 Wednesday, Nov 30 Thursday, Dec 1 Friday, Dec 2

12:15 PM – 1:15 PM

ANT303 | Security and data access controls in Amazon EMR

11:00 AM – 12:00 PM

ANT318 [Repeat] | Build event-based microservices with AWS streaming services

9:15 AM – 10:15 AM

ANT320 [Repeat] | Get better price performance in cloud data warehousing with Amazon Redshift

11:45 AM – 12:45 PM

ANT329 | Turn data to insights in seconds with secure and reliable Amazon Redshift

9:15 AM – 10:15 AM

ANT314 [Repeat] | Why and how to migrate to Amazon OpenSearch Service

12:15 PM – 1:15 PM

BSI401 | Insightful dashboards through advanced calculations with QuickSight

11:45 AM – 12:45 PM

BSI302 | Deploy your BI assets at scale to thousands with Amazon QuickSight

10:45 AM – 11:45 AM

ANT330 [Repeat] | Run Apache Spark on Kubernetes with Amazon EMR on Amazon EKS

1:15 PM – 2:15 PM

ANT401 | Ingest machine-generated data at scale with Amazon OpenSearch Service

10:00 AM – 11:00 AM

ANT322 [Repeat] | Simplifying ETL migration and data integration with AWS Glue

1:00 PM – 2:00 PM

ANT323 [Repeat] | Break through data silos with Amazon Redshift

1:15 PM – 2:15 PM

ANT327 | Modernize your analytics architecture with Amazon Athena

12:15 PM – 1:15 PM

ANT323 [Repeat] | Break through data silos with Amazon Redshift

2:00 PM – 3:00 PM

ANT333 [Repeat] | Build a serverless data streaming workload with Amazon Kinesis


1:45 PM – 2:45 PM

ANT319 | Democratizing ML for data analysts

2:45 PM – 3:45 PM

ANT320 [Repeat] | Get better price performance in cloud data warehousing with Amazon Redshift

4:00 PM – 5:00 PM

ANT314 [Repeat] | Why and how to migrate to Amazon OpenSearch Service

2:00 PM – 3:00 PM

ANT330 [Repeat] | Run Apache Spark on Kubernetes with Amazon EMR on Amazon EKS


1:45 PM – 2:45 PM

ANT322 [Repeat] | Simplifying ETL migration and data integration with AWS Glue

2:45 PM – 3:45 PM

BSI301 | Architecting multi-tenancy for your apps with Amazon QuickSight

4:45 PM – 5:45 PM

ANT333 [Repeat] | Build a serverless data streaming workload with Amazon Kinesis


5:30 PM – 6:30 PM

ANT315 | Optimizing Amazon OpenSearch Service domains for scale and cost

4:15 PM – 5:15 PM

ANT304 | Run serverless Spark workloads with AWS analytics

4:45 PM – 5:45 PM

ANT331 | Understanding TCO for different Amazon EMR deployment models


5:00 PM – 6:00 PM

ANT328 | Build transactional data lakes using open-table formats in Amazon Athena

4:45 PM – 5:45 PM

ANT321 | What’s new in AWS Lake Formation


7:00 PM – 8:00 PM

ANT318 [Repeat] | Build event-based microservices with AWS streaming services


Builders’ sessions

These are one-hour small-group sessions with up to nine attendees per table and one AWS expert. Each builders’ session begins with a short explanation or demonstration of what you’re going to build. Once the demonstration is complete, bring your laptop to experiment and build with the AWS expert.

Monday, Nov 28 Tuesday, Nov 29 Wednesday, Nov 30 Thursday, Dec 1 Friday, Dec 2

11:00 AM – 12:00 PM

ANT402 | Human vs. machine: Amazon Redshift ML inferences

1:00 PM – 2:00 PM

ANT332 | Build a data pipeline using Apache Airflow and Amazon EMR Serverless

11:00 AM – 12:00 PM

ANT316 [Repeat] | How to build dashboards for machine-generated data


7:00 PM – 8:00 PM

ANT316 [Repeat] | How to build dashboards for machine-generated data


Workshops

Workshops are two-hour interactive sessions where you work in teams or individually to solve problems using AWS services. Each workshop starts with a short lecture, and the rest of the time is spent working the problem. Bring your laptop to build along with AWS experts.

Monday, Nov 28 Tuesday, Nov 29 Wednesday, Nov 30 Thursday, Dec 1 Friday, Dec 2

10:00 AM – 12:00 PM

ANT306 [Repeat] | Beyond monitoring: Observability with operational analytics

11:45 AM – 1:45 PM

ANT313 | Using Apache Spark for data science and ML workflows with Amazon EMR

8:30 AM – 10:30 AM

ANT307 | Improve search relevance with ML in Amazon OpenSearch Service

11:00 AM – 1:00 PM

ANT403 | Event detection with Amazon MSK and Amazon Kinesis Data Analytics

8:30 AM – 10:30 AM

ANT309 [Repeat] | Build analytics applications using Apache Spark with Amazon EMR Serverless

4:00 PM – 6:00 PM

ANT309 [Repeat] | Build analytics applications using Apache Spark with Amazon EMR Serverless

2:45 PM – 4:45 PM

ANT310 [Repeat] | Build a data mesh with AWS Lake Formation and AWS Glue

12:15 PM – 2:15 PM

ANT306 [Repeat] | Beyond monitoring: Observability with operational analytics

11:45 AM – 1:45 PM

BSI205 | Build stunning customized dashboards with Amazon QuickSight


12:15 PM – 2:15 PM

ANT312 | Near real-time ML inferences with Amazon Redshift

2:45 PM – 4:45 PM

ANT308 | Seamless data sharing using Amazon


5:30 PM – 7:30 PM

ANT310 [Repeat] | Build a data mesh with AWS Lake Formation and AWS Glue


5:30 PM – 7:30 PM

BSI303 | Seamlessly embed analytics into your apps with Amazon QuickSight


* All schedules are in PDT time zone.

AWS Analytics & Business Intelligence kiosks

Join us at the AWS Analytics Kiosk in the AWS Village at the Expo. Dive deep into AWS Analytics with AWS subject matter experts, see the latest demos, ask questions, or just drop by to socially connect with your peers.


About the author

Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics. He can be reached via LinkedIn.

Retain more for less with tiered storage for Amazon MSK

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/retain-more-for-less-with-tiered-storage-for-amazon-msk/

Organizations are adopting Apache Kafka and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to capture and analyze data in real time. Amazon MSK allows you to build and run production applications on Apache Kafka without needing Kafka infrastructure management expertise or having to deal with the complex overheads associated with running Apache Kafka on your own. With increasing maturity, customers seek to build sophisticated use cases that combine aspects of real-time and batch processing. For instance, you may want to train machine learning (ML) models based on historic data and then use these models to do real-time inferencing. Or you may want to be able to recompute previous results when the application logic changes, for example, when a new KPI is added to a streaming analytics application or when a bug is fixed that caused incorrect output. These use cases often require storing data for several weeks, months, or even years.

Apache Kafka is well positioned to support these kinds of use cases. Data is retained in the Kafka cluster for as long as required by configuring the retention policy. In this way, the most recent data can be processed in real time for low-latency use cases, while historical data remains accessible in the cluster and can be processed in batch fashion.

However, retaining data in a Kafka cluster can become expensive because storage and compute are tightly coupled in a cluster. To scale storage, you need to add more brokers, but adding brokers for the sole purpose of increasing storage wastes the remaining compute resources, such as CPU and memory. A larger cluster with more nodes also adds operational complexity, with a longer time to recover and rebalance when a broker fails. To avoid that operational complexity and higher cost, you can move your data to Amazon Simple Storage Service (Amazon S3) for long-term access, and with cost-effective storage classes in Amazon S3, you can optimize your overall storage cost. This addresses the cost challenge, but now you have to build and maintain the part of the architecture that moves data to a different data store. You also need different data processing logic that uses different APIs for consuming data (the Kafka API for streaming, the Amazon S3 API for historical reads).

Today, we’re announcing Amazon MSK tiered storage, which brings a virtually unlimited, low-cost storage tier to Amazon MSK, making it simpler and more cost-effective for developers to build streaming data applications. Since the launch of Amazon MSK in 2019, we have added capabilities such as vertical scaling and automatic scaling of broker storage so you can operate your Kafka workloads cost-effectively. Earlier this year, we launched provisioned throughput, which lets you scale I/O seamlessly without having to provision additional brokers. Tiered storage makes it even more cost-effective for you to run Kafka workloads. You can now store data in Apache Kafka without worrying about limits, and you can balance performance and cost by using the performance-optimized primary storage for real-time data and the new low-cost tier for historical data. With a few clicks, you can move streaming data into the lower-cost tier and pay only for what you use.

Tiered storage frees you from making hard trade-offs between the data retention needs of your application teams and the operational complexity that comes with meeting them. You can use the same code to process both real-time and historical data, which minimizes redundant workflows and simplifies architectures. With Amazon MSK tiered storage, you can implement a Kappa architecture – a streaming-first software architecture deployment pattern – and use a single data processing pipeline for correctness and completeness of data over a much longer time horizon for business analysis.

How Amazon MSK tiered storage works

Let’s look at how tiered storage works for Amazon MSK. Apache Kafka stores data in files called log segments. As each segment completes, based on the segment size configured at the cluster or topic level, it’s copied to the low-cost storage tier. Data is held in performance-optimized storage for a specified retention time, or up to a specified size, and then deleted. There is a separate time and size limit setting for the low-cost storage, which must be longer than that of the performance-optimized tier. If clients request data from segments stored in the low-cost tier, the broker reads the data from that tier and serves it in the same way as if it were being served from performance-optimized storage. The APIs and existing clients work with minimal changes. When your application starts reading data from the low-cost tier, you can expect an increase in read latency for the first few bytes. As you continue reading the remaining data sequentially from the low-cost tier, you can expect latencies similar to those of the primary storage tier. With tiered storage, you pay for the amount of data you store and the amount of data you retrieve.

For a pricing example, let’s consider a workload with an ingestion rate of 15 MB/s, a replication factor of 3, and data retained in the Kafka cluster for 7 days. Such a workload requires 6 m5.large brokers with 32.4 TB of EBS storage, which costs $4,755. If you use tiered storage for the same workload, with local retention of 4 hours and overall data retention of 7 days, it requires 3 m5.large brokers with 0.8 TB of EBS storage and 9 TB of tiered storage, which costs $1,584. If you want to read all the historical data at once, it costs $13 ($0.0015 per GB retrieval cost). In this example, tiered storage saves around 66% of your overall cost.
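As a quick sanity check on those numbers, the raw storage requirements follow directly from the ingestion rate. The short calculation below assumes decimal units (1 TB = 1,000,000 MB) and that the low-cost tier stores and bills a single copy of the data; the provisioned EBS figures in the example (32.4 TB and 0.8 TB) include headroom on top of the raw requirement, and the broker counts and dollar amounts come from the example itself rather than from this calculation.

# Back-of-the-envelope check of the tiered storage pricing example.
# Assumptions (not from the MSK pricing page): decimal units and a single
# billed copy of the data in the low-cost tier.

INGEST_MB_PER_S = 15   # ingestion rate from the example
REPLICATION = 3        # replication factor from the example

def storage_tb(seconds, copies):
    """Raw storage needed to hold `seconds` of ingest across `copies` copies, in TB."""
    return INGEST_MB_PER_S * seconds * copies / 1_000_000

WEEK = 7 * 24 * 3600
FOUR_HOURS = 4 * 3600

print(f"EBS only, 7-day retention:   {storage_tb(WEEK, REPLICATION):.1f} TB")       # ~27.2 TB raw (32.4 TB provisioned)
print(f"EBS with 4-hour local tier:  {storage_tb(FOUR_HOURS, REPLICATION):.2f} TB") # ~0.65 TB raw (0.8 TB provisioned)
print(f"Low-cost tier, 7 days:       {storage_tb(WEEK, 1):.1f} TB")                 # ~9.1 TB, billed once
print(f"One-time retrieval of ~9 TB: ${9_000 * 0.0015:.2f}")                        # ~$13.50 at $0.0015 per GB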

Get started using Amazon MSK tiered storage

To enable tiered storage on your existing cluster, upgrade your MSK cluster to Kafka version 2.8.2.tiered and then choose Tiered storage and EBS storage as your cluster storage mode on the Amazon MSK console.

After tiered storage is enabled on the cluster level, run the following command to enable tiered storage on an existing topic. In this example, you’re enabling tiered storage on a topic called msk-ts-topic with 7 days’ retention (local.retention.ms=604800000) for a local high-performance storage tier, setting 180 days’ retention (retention.ms=15550000000) to retain the data in the low-cost storage tier, and updating the log segment size to 48 MB:

bin/kafka-configs.sh --bootstrap-server $bsrv --alter --entity-type topics --entity-name msk-ts-topic --add-config 'remote.storage.enable=true, local.retention.ms=604800000, retention.ms=15550000000, segment.bytes=50331648'

Availability and pricing

Amazon MSK tiered storage is available in all AWS Regions where Amazon MSK is available, excluding the AWS China and AWS GovCloud Regions. This low-cost storage tier scales to virtually unlimited storage and requires no upfront provisioning. You pay only for the volume of data retained in and retrieved from the low-cost tier.

For more information about this feature and its pricing, see the Amazon MSK developer guide and the Amazon MSK pricing page. To find the right size for your cluster, see the best practices page.

Summary

With Amazon MSK tiered storage, you don’t need to provision storage for the low-cost tier or manage the infrastructure. Tiered storage lets you scale to virtually unlimited storage. You can access data in the low-cost tier using the same clients you currently use to read data from the high-performance primary storage tier. Apache Kafka’s consumer API, Streams API, and connectors consume data from both tiers without changes. You can modify the retention limits on the low-cost storage tier in the same way you modify the retention limits on the high-performance storage tier.

Enable tiered storage on your MSK clusters today to retain data longer at a lower cost.


About the Author

Masudur Rahaman Sayem is a Streaming Architect at AWS. He works with AWS customers globally to design and build data streaming architecture to solve real-world business problems. He is passionate about distributed systems. He also likes to read, especially classic comic books.

Measure the adoption of your Amazon QuickSight dashboards and view your BI portfolio in a single pane of glass

Post Syndicated from Maitri Brahmbhatt original https://aws.amazon.com/blogs/big-data/measure-the-adoption-of-your-amazon-quicksight-dashboards-and-view-your-bi-portfolio-in-a-single-pane-of-glass/

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service. If you plan to deploy enterprise-grade QuickSight dashboards, measuring user adoption and usage patterns is an important ingredient in the success of your BI investment. For example, knowing usage patterns by geographic location, department, and job role can help you fine-tune your dashboards for the right audience. Furthermore, to improve the return on your BI investment, dashboard usage data can help you reduce license costs by identifying inactive QuickSight authors.

In this post, we introduce the latest Admin Console, an AWS packaged solution that you can easily deploy and use to create a usage and inventory dashboard for your QuickSight assets. The Admin Console helps identify usage patterns of individual users and dashboards. It also provides details on QuickSight group and user permissions and activities, and on permissions for QuickSight assets (dashboards, analyses, and datasets), so you can track which dashboards and groups you have or need access to and what you can do with that access. With timely access to interactive usage metrics, the Admin Console can help BI leaders and administrators make a cost-efficient plan for dashboard improvements. Another common use case of this dashboard is to provide a centralized repository of QuickSight assets. QuickSight artifacts consist of multiple types of assets (dashboards, analyses, datasets, and more) with dependencies between them. Having a single repository to view all assets and their dependencies can be an important element of your enterprise data dictionary.

This post demonstrates how to build the Admin Console using a serverless data pipeline. With basic AWS knowledge, you can create this solution in your own environment within an hour. Alternatively, you can dive deep into the source code to meet your specific needs.

Admin Console dashboard

The following animation displays the contents of our demo dashboard.

The Admin Console dashboard includes six sheets:

  • Landing Page – Provides drill-downs into each of the detailed tabs.
  • User Analysis – Provides detailed analysis of the user behavior and identifies active and inactive users and authors.
  • Dashboard Analysis – Shows the most commonly viewed dashboards.
  • Assets Access Permissions – Provides information on permissions applied to each asset, such as dashboard, analysis, datasets, data source, and themes.
  • Data Dictionary – Provides information on the relationships between each of your assets, such as which analysis was used to build each dashboard, and which datasets and data sources are being used in each analysis. It also provides details on each dataset, including schema name, table name, columns, and more.
  • Overview – Provides instructions on how to use the dashboard.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

Let’s look at Forwood Safety, an innovative, values-driven company with a laser focus on fatality prevention. An early adopter of QuickSight, they collaborated with AWS to deploy this solution to collect BI application usage insights.

“Our engineers love this admin console solution,” says Faye Crompton, Leader of Analytics and Benchmarking at Forwood. “It helps us to understand how users analyze critical control learnings by helping us to quickly identify the most frequently visited dashboards in Forwood’s self-service analytics and reporting tool, FAST.”

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

  1. The AWS Lambda function Data_Prepare is scheduled to run hourly. This function calls QuickSight APIs to get the QuickSight namespace, group, user, and asset access permissions information.
  2. The Lambda function Dataset_Info is scheduled to run hourly. This function calls QuickSight APIs to get dashboard, analysis, dataset, and data source information.
  3. Both functions save their results to an Amazon Simple Storage Service (Amazon S3) bucket.
  4. AWS CloudTrail logs are stored in an S3 bucket.
  5. Based on the files in Amazon S3 that contain user-group information, dataset information, and QuickSight asset access permissions, as well as the dashboard view and user login events from the CloudTrail logs, five Amazon Athena tables are created. Optionally, the BI engineer can combine these tables with employee information tables to display human resources information about the users.
  6. Four QuickSight datasets fetch the data from the Athena tables created in Step 5 and import them into SPICE. Then, based on these datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Create solution resources

We can create all the resources needed for this dashboard using three CloudFormation templates: one for Lambda functions, one for Athena tables, and one for QuickSight objects.

CloudFormation template for Lambda functions

This template creates the Lambda functions data_prepare and dataset_info.

  • Choose Launch Stack and follow the steps to create these resources.

After the stack creation is successful, you have two Lambda functions, data_prepare and dataset_info, and one S3 bucket named admin-console[AWS-account-ID]. You can verify that the Lambda functions run successfully and that the group_membership, object_access, datasets_info, and data_dictionary folders are created in the S3 bucket under admin-console[AWS-account-ID]/monitoring/quicksight/, as shown in the following screenshots.

The Data_Prepare Lambda function is scheduled to run hourly with the CloudWatch Events rule admin-console-every-hour. This function calls the QuickSight APIs to get QuickSight users, assets, and access permissions information. It then creates two files, group_membership.csv and object_access.csv, and saves them to the S3 bucket.
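If you’re curious what those API calls look like, here is a minimal sketch of the idea behind the Data_Prepare function written with boto3. It isn’t the deployed function itself: the account ID and bucket name are placeholders based on the stack output described above, the CSV columns are simplified, and pagination and error handling are omitted.

import csv
import io

import boto3

ACCOUNT_ID = "123456789012"            # placeholder: your AWS account ID
NAMESPACE = "default"
BUCKET = f"admin-console{ACCOUNT_ID}"  # assumed to match the bucket created by the stack

quicksight = boto3.client("quicksight")
s3 = boto3.client("s3")

def group_membership_rows():
    # Mirrors the idea behind group_membership.csv: one row per (namespace, group, user).
    for group in quicksight.list_groups(AwsAccountId=ACCOUNT_ID, Namespace=NAMESPACE)["GroupList"]:
        members = quicksight.list_group_memberships(
            AwsAccountId=ACCOUNT_ID, Namespace=NAMESPACE, GroupName=group["GroupName"]
        )["GroupMemberList"]
        for member in members:
            yield NAMESPACE, group["GroupName"], member["MemberName"]

def object_access_rows():
    # Mirrors the idea behind object_access.csv: one row per (dashboard, principal, actions).
    for dashboard in quicksight.list_dashboards(AwsAccountId=ACCOUNT_ID)["DashboardSummaryList"]:
        permissions = quicksight.describe_dashboard_permissions(
            AwsAccountId=ACCOUNT_ID, DashboardId=dashboard["DashboardId"]
        )["Permissions"]
        for permission in permissions:
            yield dashboard["DashboardId"], permission["Principal"], "|".join(permission["Actions"])

def write_csv(rows, key):
    # Serialize the rows as CSV and upload them to the monitoring folder in S3.
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue().encode("utf-8"))

write_csv(group_membership_rows(), "monitoring/quicksight/group_membership/group_membership.csv")
write_csv(object_access_rows(), "monitoring/quicksight/object_access/object_access.csv")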

The Dataset_Info Lambda function is scheduled to run hourly and calls the QuickSight APIs to get dataset, schema, table, and field (column) information. It then creates two files, datasets_info.csv and data_dictionary.csv, and saves them to the S3 bucket.
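Again purely as an illustration (not the actual Dataset_Info code), data dictionary rows can be assembled from the dataset APIs roughly as follows. Only relational physical tables are handled in this sketch, pagination is omitted, and some dataset types (for example, file uploads) can’t be described through the API, so those are skipped.

import boto3
from botocore.exceptions import ClientError

ACCOUNT_ID = "123456789012"  # placeholder: your AWS account ID
quicksight = boto3.client("quicksight")

def data_dictionary_rows():
    # One row per column: (dataset_id, dataset_name, schema, table, column, type).
    for summary in quicksight.list_data_sets(AwsAccountId=ACCOUNT_ID)["DataSetSummaries"]:
        try:
            dataset = quicksight.describe_data_set(
                AwsAccountId=ACCOUNT_ID, DataSetId=summary["DataSetId"]
            )["DataSet"]
        except ClientError:
            continue  # some dataset types (e.g., file uploads) can't be described through the API
        for table in dataset.get("PhysicalTableMap", {}).values():
            relational = table.get("RelationalTable")
            if not relational:
                continue  # custom SQL and S3 sources would need their own handling
            for column in relational["InputColumns"]:
                yield (
                    dataset["DataSetId"],
                    dataset["Name"],
                    relational.get("Schema", ""),
                    relational["Name"],
                    column["Name"],
                    column["Type"],
                )

for row in data_dictionary_rows():
    print(row)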

  •  Create a CloudTrail log if you don’t already have one and note down the S3 bucket name of the log files for future use.
  •  Note down all the resources created from the previous steps. If the S3 bucket name for the CloudTrail log from step 2 is different from the one in step 1’s output, use the S3 bucket from step 2.

The following keys and values are used when creating the Athena tables with the next CloudFormation stack:

  • cloudtraillog – s3://cloudtrail-awslogs-[aws-account-id]-do-not-delete/AWSLogs/[aws-account-id]/CloudTrail (the Amazon S3 location of the CloudTrail log)
  • cloudtraillogtablename – cloudtrail_logs (the table name of the CloudTrail log)
  • groupmembership – s3://admin-console[aws-account-id]/monitoring/quicksight/group_membership (the Amazon S3 location of group_membership.csv)
  • objectaccess – s3://admin-console[aws-account-id]/monitoring/quicksight/object_access (the Amazon S3 location of object_access.csv)
  • dataset info – s3://admin-console[aws-account-id]/monitoring/quicksight/datasets_info (the Amazon S3 location of datasets_info.csv)
  • datadict – s3://admin-console[aws-account-id]/monitoring/quicksight/data_dictionary (the Amazon S3 location of data_dictionary.csv)

CloudFormation template for Athena tables

To create your Athena tables, complete the following steps:

  • Download the following JSON file.
  • Edit the file and replace the corresponding fields with the keys and values you noted in the previous section.

For example, search for the groupmembership keyword.

Then replace the location value with the Amazon S3 location for the groupmembership folder.

  • Create Athena tables by deploying this edited file as a CloudFormation template. For instructions, refer to Get started.

After a successful deployment, you have a database called admin-console created in AwsDataCatalog in Athena and five tables in the database: cloudtrail_logs, group_membership, object_access, datasets_info, and data_dict.

  • Confirm the tables via the Athena console.

The following screenshot shows sample data of the group_membership table.

The following screenshot shows sample data of the object_access table.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

  • After all five tables are created in Athena, go to the security permissions on the QuickSight console to enable bucket access for s3://admin-console[AWS-account-ID] and s3://cloudtrail-awslogs-[aws-account-id]-do-not-delete.
  • Enable Athena access under Security & Permissions.

Now QuickSight can access all five tables through Athena.

CloudFormation template for QuickSight objects

To create the QuickSight objects, complete the following steps:

  • Get the QuickSight admin user’s ARN by running the following command in the AWS Command Line Interface (AWS CLI):
    aws quicksight describe-user --aws-account-id [aws-account-id] --namespace default --user-name [admin-user-name]

    For example: arn:aws:quicksight:us-east-1:12345678910:user/default/admin/xyz.

  • Choose Launch Stack to create the QuickSight datasets and dashboard:

  • Provide the ARN you noted earlier.

After a successful deployment, four datasets named Admin-Console-Group-Membership, Admin-Console-dataset-info, Admin-Console-Object-Access, and Admin-Console-CFN-Main are created, along with a dashboard named admin-console-dashboard. If you prefer to modify the dashboard, use the dashboard save-as option to recreate the analysis, make your modifications, and publish a new dashboard.

  • Set your preferred SPICE refresh schedule for the four SPICE datasets, and share the dashboard in your organization as needed.
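The refresh schedule itself is configured on the QuickSight console. If you’d rather trigger a SPICE refresh programmatically (for example, right after the hourly Lambda functions finish), you can start an ingestion with a call along the following lines; the account ID and dataset ID are placeholders.

import time

import boto3

ACCOUNT_ID = "123456789012"                         # placeholder: your AWS account ID
DATASET_ID = "<Admin-Console-CFN-Main dataset ID>"  # placeholder: look it up with list_data_sets

quicksight = boto3.client("quicksight")

response = quicksight.create_ingestion(
    AwsAccountId=ACCOUNT_ID,
    DataSetId=DATASET_ID,
    IngestionId=f"admin-console-refresh-{int(time.time())}",  # ingestion IDs must be unique per dataset
)
print(response["IngestionStatus"])  # e.g. INITIALIZED; poll describe_ingestion to watch it complete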

Dashboard demo

The following screenshot shows the Admin Console Landing page.

The following screenshot shows the User Analysis sheet.

The following screenshot shows the Dashboards Analysis sheet.

The following screenshot shows the Access Permissions sheet.

The following screenshot shows the Data Dictionary sheet.

The following screenshot shows the Overview sheet.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

You can reference the public template of the preceding dashboard in create-template, create-analysis, and create-dashboard API calls to create this dashboard and analysis in your account. The ARN of the public template for this dashboard is arn:aws:quicksight:us-east-1:889399602426:template/admin-console.
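As a rough sketch of what such a call looks like with boto3 (the dataset placeholder names below are hypothetical and must match the placeholders defined in the template, and each ARN must point to the corresponding dataset in your own account):

import boto3

ACCOUNT_ID = "123456789012"  # placeholder: your AWS account ID
TEMPLATE_ARN = "arn:aws:quicksight:us-east-1:889399602426:template/admin-console"

quicksight = boto3.client("quicksight")

def dataset_arn(dataset_id):
    # Helper for building dataset ARNs in your own account.
    return f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:dataset/{dataset_id}"

quicksight.create_dashboard(
    AwsAccountId=ACCOUNT_ID,
    DashboardId="admin-console-dashboard-copy",
    Name="Admin Console (copy)",
    SourceEntity={
        "SourceTemplate": {
            "Arn": TEMPLATE_ARN,
            # The placeholder names below are hypothetical; they must match the dataset
            # placeholders defined in the template, one reference per placeholder.
            "DataSetReferences": [
                {"DataSetPlaceholder": "Admin-Console-CFN-Main",
                 "DataSetArn": dataset_arn("<main-dataset-id>")},
                {"DataSetPlaceholder": "Admin-Console-Group-Membership",
                 "DataSetArn": dataset_arn("<group-membership-dataset-id>")},
            ],
        }
    },
)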

Tips and tricks

Here are some advanced tips and tricks for building a dashboard like the Admin Console to analyze usage metrics. The following steps are based on the dataset admin_console. You can apply the same logic to create calculated fields to analyze user login activities.

  • Create parameters – For example, we can create a parameter called InActivityMonths, as in the following screenshot. Similarly, we can create other parameters such as InActivityDays, Start Date, and End Date.

  • Create controls based on the parameters – In the following screenshot, we create controls based on the start and end date.

  • Create calculated fields – For instance, we can create a calculated field to detect whether a QuickSight author is active or inactive. If the time span between the author’s latest dashboard view activity and now is greater than or equal to the number defined in the InActivityMonths control, the author’s status is Inactive. The following screenshot shows the relevant code. Depending on the end-user’s requirements, we can define several calculated fields to perform the analysis.

  • Create visuals – For example, we create an insight to display the top three dashboard views by reader and a visual to display the authors of these dashboards.

  • Add URL actions – You can add a URL action to provide extra features, such as emailing inactive authors or checking user details.

The following sample code defines the action to email inactive authors:

mailto:<<email>>?subject=Alert to inactive author! &body=Hi, <<username>>, any author without activity for more than a month will be deleted. Please log in to your QuickSight account to continue accessing and building analyses and dashboards!

Clean up

To avoid incurring future charges, delete all the resources you created with the CloudFormation templates.

Conclusion

This post discussed how BI administrators can use QuickSight, CloudTrail, and other AWS services to create a centralized view to analyze QuickSight usage metrics. We also presented a serverless data pipeline to support the Admin Console dashboard.

If you would like to have a demo, please email us.

Appendix

We can perform some additional sophisticated analysis to collect advanced usage metrics. For example, Forwood Safety raised a unique request to analyze readers who log in but don’t perform any dashboard view actions (see the following code). This helps their clients identify and prevent wasted reader session fees. Leadership teams value the ability to minimize uneconomical user activity.

CREATE OR REPLACE VIEW "loginwithoutviewdashboard" AS
WITH login AS (
  SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS user_name,
         awsregion,
         date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
  FROM cloudtrail_logs
  WHERE eventname = 'AssumeRoleWithSAML'
  GROUP BY 1, 2, 3
),
dashboard AS (
  SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS user_name,
         awsregion,
         date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
  FROM cloudtrail_logs
  WHERE eventsource = 'quicksight.amazonaws.com'
    AND eventname = 'GetDashboard'
  GROUP BY 1, 2, 3
),
users AS (
  SELECT namespace,
         "group",
         "user" AS user_name,
         (CASE
            WHEN "group" IN ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin') THEN 'Author'
            ELSE 'Reader'
          END) AS author_status
  FROM "group_membership"
)
SELECT l.*
FROM login AS l
JOIN users AS u
  ON l.user_name = u.user_name
LEFT JOIN dashboard AS d
  ON d.user_name = l.user_name
 AND d.awsregion = l.awsregion
 AND d.event_time BETWEEN l.event_time AND l.event_time + interval '30' minute
WHERE u.author_status = 'Reader'
  AND d.user_name IS NULL

About the Authors

Ying Wang is a Software Development Engineering Manager. She has 12 years of expertise in data analytics and data science. During her time as a data architect, she assisted customers with enterprise data architecture solutions to scale their data analytics in the cloud. Currently, she helps customers unlock the power of data with QuickSight by delivering new features from the engineering side.

Ian Liao is a Senior Data Visualization Architect at AWS Professional Services. Before AWS, Ian spent years building startups in data and analytics. Now he enjoys helping customers scale their data applications in the cloud.

Maitri Brahmbhatt is a Business Intelligence Engineer at AWS. She helps customers and partners leverage their data to gain insights into their business and make data-driven decisions by developing QuickSight dashboards.