Tag Archives: Analytics

Send custom branded email reports from Amazon QuickSight

2021-11-24 Kareem Syed-Mohammed

Post Syndicated from Kareem Syed-Mohammed original https://aws.amazon.com/blogs/big-data/send-custom-branded-email-reports-from-amazon-quicksight/

Amazon QuickSight is a fully-managed, cloud-native business intelligence (BI) service that makes it easy to connect to your data, create interactive dashboards, and share these with tens of thousands of users, either directly within QuickSight application, or embedded in web apps and portals.

QuickSight Enterprise Edition now supports the ability to send custom branded email reports. You can customize the email sender domain for email reports sent from QuickSight, along with the logo and header color of the email, as well as footer text of the email. If you have your dashboard embedded in your own application, you can also customize the URL to open the dashboard from the email to the URL of your application. This lets you customize emails to reflect your corporate branding, whether you want to send these reports to 1000s of your internal users or external customers.

In this post, we will go through the following:

Steps to implement the solution
1. Create a customized email template
2. Create an email schedule and subscribe email recipients
End user experience
Sample use case

Solution overview

Step 1: Create a customized email template

This new feature lets you customize your email with the following customization options:

Custom sender email address
Custom logo in the email header and custom header color
Custom link to open the dashboard (if your dashboard is embedded in your own application)
Custom footer

You can customize all or any of these options. To customize, create an email template in your QuickSight account, which will be used when sending email reports for any dashboard to any user. This email template is specific to the AWS region and account it is created in.

Log in to QuickSight as an admin, and select your name in the top right, then in the menu select “Manage QuickSight” as shown in the following screenshot:

In the next screen, select “Account Customization”, and you will see the available account customization options. Under the “Email report template” section, select “Update” as shown in the following screenshot. You must have the right IAM Identity-Based Policies assigned to you to create or edit the template.

In the next screen, you can set customizations that we will see one by one.

Customize sender email address

This option lets you set a custom email address or use QuickSight’s email address <[email protected]> to send email reports. To select sending via QuickSight email address, select the radio button for QuickSight.

To send a custom email, select the radio button for custom email setting. At this time, only verified email addresses can be used for a custom email address. SES and QuickSight must be in the same AWS account and region. If you do not have an SES account, then you can get started <HERE> with SES’ free tier of XX. Steps to add a custom email address.

Add a verified SES email address and click “Verify email”. If you get an error, then refer here for creating a verified SES email address.
Once the email address is verified, you must authorize QuickSight to send emails on your behalf. To do this, copy the given “Authorization Policy”, and add it as a “Sending authorization policy” for your verified email address in SES. Refer here to learn about SES sending authorization policy.

As we can see in the screenshot above, once the authorization policy is verified, QuickSight is authorized to send email using the SES email address.
You can set a friendly name for the email address as shown in the following screenshot.

Customize logo

Email reports from QuickSight have a QuickSight logo in the header of the email body. You can choose to select a custom logo, use QuickSight logo, or have no logo by selecting the corresponding radio button.

When you select the “Custom logo” option, you can select your own logo (for format jpg, jpeg, or png) and a maximum file size of 1MB. Your logo will be scaled to a height of 32px, maintaining the aspect ratio. When you upload the logo image, you get an option to set the background color (as a HEX code) of the header in the email report.

Select where the dashboard opens

Email reports have an image of the first sheet of the QuickSight dashboards. In order for the recipient to interact with the dashboard, email reports also provide a link to open the dashboard. By default, this link opens the dashboard in the QuickSight application. Now you can select where the dashboard opens. If you have embedded the dashboard in your application, then you can provide the URL of your application. Moreover, you can choose to hide the option to disable opening the dashboard from the email entirely. Please see the following screenshot for reference.

If you want to add your custom link, then you will have to add the following query parameters – account-id, dashboard-id, and region – to your link. QuickSight will populate these parameters at runtime, and when your customers select the open dashboard link from the email, they will be taken to the link you have provided. With the account-id, dashboard-id, and region now available with the link, you can provide logic to take your customers to where you have embedded the dashboard in your application.

Custom footer

Email reports default QuickSight footers have content and a link related to QuickSight and QuickSight application. You have an option to customize the footer or hide the entire footer. Please refer to the following screenshot for reference.

If you select the option to set a custom footer, then you can provide custom text and hyperlink content in the textbox. At this time, we only allow plain text.

Step 2: Create an email schedule and subscribe recipients

Once your QuickSight account has an email template saved, any email report sent in the same AWS region will use this template. To send an email report, the author of the dashboard should create an email schedule for the dashboard and assign recipients to that schedule.

To set a schedule, the dashboard author should open the dashboard in QuickSight application, select “Share” in the top right, and select “Email report” in the menu. Please refer to the following screenshot for reference.

You will be taken to the “Edit email report” screen, where you can create a schedule for the email to be sent and add email recipients. Please refer to this documentation on sending reports by email and this post for sending personalized email reports.

If you are embedding dashboards in your application, then your readers cannot subscribe to the schedule from the embedded dashboard. Authors must add those readers to the recipient list through the steps stated above. Therefore, your readers must be provisioned in QuickSight.

End user experience

The end user gets the email as per the schedule set. If the email template has been set, then recipients get the look and feel of the email based on the customization done on the template. The following screenshot shows the email with a custom look and feel.

As you can see, this email has the following:

From address customized to [email protected] with a friendly name, “data-insights-team”
Logo customized to a brand logo, and header customized to the brand green shade
Dashboard open link customized to take customers to your app if the dashboard is embedded in that app
Footer customized with a custom message

Use case

ShipPronto is a logistics service provider for heavy machinery. It has many customers that store their heavy machinery at ShipPronto’s warehouse. When customers get purchase orders on these machineries, they have ShipPronto fulfill those orders on their behalf from its warehouse. ShipPronto has an application where each customer can login and see rich data on their order shipment and machinery quantity at the warehouse. ShipPronto uses QuickSight dashboard embedded in its application to provide the insights. Furthermore, it sends daily emails to its customers on this dashboard. It’s using the email customization feature of QuickSight to customize the look and feel of the email so that customers receiving the email get a seamless experience.

Below is the customized email that their customers receive daily with the sender email address, logo, header color, and footer customized.

When customers click on the “Open Dashboard” link in the email, they are taken to ShipPronto’s app, on which they must log in, as shown in the following screenshot.

Once the customers log in, based on the query string parameters that were passed along with the custom URL (which was set as part of the URL, to open the dashboard in the email template), ShipPronto can take its customers to the page where they have this dashboard embedded.

This experience means that ShipPronto’s end users see the ShipPronto branded email and get a seamless experience where they access the embedded dashboard, in the application, from the email.

Conclusion

Email customizations let you send branded email reports to your customers, thereby enabling a seamless experience when customers are accessing the email or the application where the dashboard is embedded. And all of this is done without any infrastructure setup or management, while scaling to millions of users. For more updates from QuickSight embedded analytics, see What’s New in the Amazon QuickSight User Guide.

About the Author

Kareem Syed-Mohammed is a Product Manager at Amazon QuickSight. He focuses on embedded analytics, APIs, and developer experience. Prior to QuickSight he has been with AWS Marketplace and Amazon retail as a PM. Kareem started his career as a developer and then PM for call center technologies, Local Expert and Ads for Expedia. He worked as a consultant with McKinsey and Company for a short while.

Kenz Shane is a UI Designer for Amazon QuickSight. As part of the product’s Business Intelligence User Experience (BIUX) team, she specializes in creating customer-focused visual interfaces. Previously, she worked with the Experience Innovation Group at Dell, serving as a subject matter expert in enterprise-grade user interface (UI) design, accessible data visualization, and design systems. Kenz has provided art direction and design for clients across multiple industries, including Nordstrom, Columbia Hospitality, AIGA, and Warner Bros.

Raji Sivasubramaniam is a Specialist Solutions Architect at AWS, focusing on Analytics. Raji has 20 years of experience in architecting end-to-end Enterprise Data Management, Business Intelligence and Analytics solutions for Fortune 500 and Fortune 100 companies across the globe. She has in-depth experience in integrated healthcare data and analytics with wide variety of healthcare datasets including managed market, physician targeting and patient analytics. In her spare time, Raji enjoys hiking, yoga and gardening.

Srikanth Baheti is a Specialized World Wide Sr. Solution Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high traffic web applications, highly scalable and maintainable data pipelines for reporting platforms using AWS services and Serverless computing.-

Introducing PII data identification and handling using AWS Glue DataBrew

2021-11-20 Harsh Vardhan Singh Gaur

Post Syndicated from Harsh Vardhan Singh Gaur original https://aws.amazon.com/blogs/big-data/introducing-pii-data-identification-and-handling-using-aws-glue-databrew/

AWS Glue DataBrew, a visual data preparation tool, can now identify and handle sensitive data by applying advance transformations like redaction, replacement, encryption, and decryption on your personally identifiable information (PII) data. With exponential growth of data, companies are handling huge volumes and a wide variety of data coming into their platform, including PII data. Identifying and protecting sensitive data at scale has become increasingly complex, expensive, and time-consuming. Organizations have to adhere to data privacy, compliance, and regulatory needs such as GDPR and CCPA. They need to identify sensitive data, including PII such as name, SSN, address, email, driver’s license, and more. Even after identification, it’s cumbersome to implement redaction, masking, or encryption of sensitive personal information at scale.

To enable data privacy and protection, DataBrew has launched PII statistics, which identifies PII columns and provide their data statistics when you run a profile job on your dataset. Furthermore, DataBrew has introduced PII data handling transformations, which enable you to apply data masking, encryption, decryption, and other operations on your sensitive data.

In this post, we walk through a solution in which we run a data profile job to identify and suggest potential PII columns present in a dataset. Next, we target PII columns in a DataBrew project and apply various transformations to handle the sensitive columns existing in the dataset. Finally, we run a DataBrew job to apply the transformations on the entire dataset and store the processed, masked, and encrypted data securely in Amazon Simple Storage Service (Amazon S3).

Solution overview

We use a public dataset that is available for download at Synthetic Patient Records with COVID-19. The data hosted within SyntheticMass has been generated by SyntheaTM, an open-source patient population simulation made available by The MITRE Corporation.

Download the zipped file 10k_synthea_covid19_csv.zip for this solution and unzip it locally. The solution uses the dummy data in the file patient.csv to demonstrate data redaction and encryption capability. The file contains 10,000 synthetic patient records in CSV format, including PII columns like driver’s license, birth date, address, SSN, and more.

The following diagram illustrates the architecture for our solution.

The steps in this solution are as follows:

The sensitive data is stored in an S3 bucket. You create a DataBrew dataset by connecting to the data in Amazon S3.
Run a DataBrew profile job to identify the PII columns present in the dataset by enabling PII statistics.
After identification of PII columns, apply transformations to redact or encrypt column values as a part of your recipe.
A DataBrew job runs the recipe steps on the entire data and generates output files with sensitive data redacted or encrypted.
After the output data is written to Amazon S3, we create an external table on top in Amazon Athena. Data consumers can use Athena to query the processed and cleaned data.

Prerequisites

For this walkthrough, you need an AWS account. Use us-east-1 as your AWS Region to implement this solution.

Set up your source data in Amazon S3

Create an S3 bucket called databrew-clean-pii-data-<Your-Account-ID> in us-east-1 with the following prefixes:

sensitive_data_input
cleaned_data_output
profile_job_output

Upload the patient.csv file to the sensitive_data_input prefix.

Create a DataBrew dataset

To create a DataBrew dataset, complete the following steps:

On the DataBrew console, in the navigation pane, choose Datasets.
Choose Connect new dataset.
For Dataset name, enter a name (for this post, Patients).
Under Connect to new dataset, select Amazon S3 as your source.
For Enter your source from S3, enter the S3 path to the patient.csv file. In our case, this is s3://databrew-clean-pii-data-<Account-ID>/ sensitive_data_input/patients.csv.
Scroll to the bottom of the page and choose Create dataset.

Run a data profile job

You’re now ready to create your profile job.

In the navigation pane, choose Datasets.
Select the Patients dataset.
Choose Run data profile and choose Create profile job.
Name the job Patients - Data Profile Job.
We run the data profile on the entire dataset, so for Data sample, select Full dataset.
In the Job output settings section, point to the profile_job_output S3 prefix where the data profile output is stored when the job is complete.
Expand Data profile configurations, and select Enable PII statistics to identify PII columns when running the data profile job.

This option is disabled by default; you must enable it manually before running the data profile job.

For PII categories, select All categories.
Keep the remaining settings at their default.
In the Permissions section, create a new AWS Identity and Access Management (IAM) role that is used by the DataBrew job to run the profile job, and use PII-DataBrew-Role as the role suffix.
Choose Create and run job.

The job runs on the sample data and takes a few minutes to complete.

Now that we’ve run our profile job, we can review data profile insights about our dataset by choosing View data profile. We can also review the results of the profile through the visualizations on the DataBrew console and view the PII widget. This section provides a list of identified PII columns mapped to PII categories with column statistics. Furthermore, it suggests potential PII data that you can review.

Create a DataBrew project

After we identify the PII columns in our dataset, we can focus on handling the sensitive data in our dataset. In this solution, we perform redaction and encryption in our DataBrew project using the Sensitive category of transformations.

To create a DataBrew project for handling our sensitive data, complete the following steps:

On the DataBrew console, choose Projects.
Choose Create project.
For Project name, enter a name (for this post, patients-pii-handling).
For Select a dataset, select My datasets.
Select the Patients dataset.
Under Permissions, for Role name, choose the IAM role that we created previously for our DataBrew profile job AWSGlueDataBrewServiceRole-PII-DataBrew-Role.
Choose Create project.

The dataset takes few minutes to load. When the dataset is loaded, we can start performing redactions. Let us start with the column SSN.

For the SSN column, on the Sensitive menu, choose Redact data.
Under Apply redaction, select Full string value.
We redact all the non-alphanumeric characters and replace them with #.
Choose Preview changes to compare the redacted values.
Choose Apply.

On the Sensitive menu, all the data masking transformations—redact, replace, and hash data—are irreversible. After we finalize our recipe and run the DataBrew job, the job output to Amazon S3 is permanently redacted and we can’t recover it.

Now, let’s apply redaction to multiple columns, assuming the following columns must not be consumed by any downstream users like data analyst, BI engineer, and data scientist:
1. DRIVERS
2. PASSPORT
3. BIRTHPLACE
4. ADDRESS
5. LAT
6. LON

In special cases, when we need to recover our sensitive data, instead of masking, we can encrypt our column values and when needed, decrypt the data to bring it back to its original format. Let’s assume we require a column value to be decrypted by a downstream application; in that case, we can encrypt our sensitive data.

We have two encryption options: deterministic and probabilistic. For use cases when we want to join two datasets on the same encrypted column, we should apply deterministic encryption. It makes sure that the encrypted value of all the distinct values is the same across DataBrew projects as long as we use the same AWS secret key. Additionally, keep in mind that when you apply deterministic encryption on your PII columns, you can only use DataBrew to decrypt those columns.

For our use case, let’s assume we want to perform deterministic encryption on a few of our columns.

On the Sensitive menu, choose Deterministic encryption.
For Source columns, select BIRTHDATE, DEATHDATE, FIRST, and LAST.
For Encryption option, select Deterministic encryption.
For Select secret, choose the databrew!default AWS secret.
Choose Apply.
After you finish applying all your transformations, choose Publish.
Enter a description for the recipe version and choose Publish.

Create a DataBrew job

Now that our recipe is ready, we can create a job to apply the recipe steps to the Patients dataset.

On the DataBrew console, choose Jobs.
Choose Create a job.
For Job name, enter a name (for example, Patient PII Making and Encryption).
Select the Patients dataset and choose patients-pii-handling-recipe as your recipe.
Under Job output settings¸ for File type, choose your final storage format to be Parquet.
For S3 location, enter your S3 output as s3://databrew-clean-pii-data-<Account-ID>/cleaned_data_output/.
For Compression, choose None.
For File output storage, select Replace output files for each job run.
Under Permissions, for Role name¸ choose the same IAM role we used previously.
Choose Create and run job.

Create an Athena table

You can create tables by writing the DDL statement in the Athena query editor. If you’re not familiar with Apache Hive, you should review Creating Tables in Athena to learn how to create an Athena table that references the data residing in Amazon S3.

To create an Athena table, use the query editor and enter the following DDL statement:

CREATE EXTERNAL TABLE patient_masked_encrypted_data (
   `id` string, 
  `birthdate` string, 
  `deathdate` string, 
  `ssn` string, 
  `drivers` string, 
  `passport` string, 
  `prefix` string, 
  `first` string, 
  `last` string, 
  `suffix` string, 
  `maiden` string, 
  `marital` string, 
  `race` string, 
  `ethnicity` string, 
  `gender` string, 
  `birthplace` string, 
  `address` string, 
  `city` string, 
  `state` string, 
  `county` string, 
  `zip` int, 
  `lat` string, 
  `lon` string, 
  `healthcare_expenses` double, 
  `healthcare_coverage` double 
)
STORED AS PARQUET
LOCATION 's3://databrew-clean-pii-data-<Account-ID>/cleaned_data_output/'

Let’s validate the table output in Athena by running a simple SELECT query. The following screenshot shows the output.

We can clearly see the encrypted and redacted column values in our query output.

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

Conclusion

As demonstrated in this post, you can use DataBrew to help identify, redact, and encrypt PII data. With these new PII transformations, you can streamline and simplify customer data management across industries such as financial services, government, retail, and much more.

Now that you can protect your sensitive data workloads to meet regulatory and compliance best practices, you can use this solution to build de-identified data lakes in AWS. Sensitive data fields remain protected throughout their lifecycle, whereas non-sensitive data fields remain in the clear. This approach can allow analytics or other business functions to operate on data without exposing sensitive data.

About the Authors

Harsh Vardhan Singh Gaur is an AWS Solutions Architect, specializing in Analytics. He has over 5 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Attendee guide for the AWS Analytics track at AWS re:Invent 2021

2021-11-19 Imtiaz Sayed

Post Syndicated from Imtiaz Sayed original https://aws.amazon.com/blogs/big-data/attendee-guide-for-the-aws-analytics-track-at-aws-reinvent-2021/

AWS re:Invent is a learning conference hosted by Amazon Web Services (AWS) for the global cloud computing community. We’re super excited to join you at the 10th annual re:Invent to share the latest from AWS leaders and discover more ways to learn and build. Let’s celebrate this milestone, which will be offered in person in Las Vegas (November 29–December 3) and virtually (November 29–December 10). The health and safety of our customers and partners remains our top priority. You can find additional information on the health measures page. For details about the virtual format, check out the virtual section.

The AWS Analytics track at re:Invent offers sessions in various analytics disciplines delivered by AWS Analytics experts and AWS customers. The sessions vary from intermediate (200) through expert (400) levels, share new AWS innovations, discuss exciting customer experiences, and provide you opportunities to learn how to easily extract more out of your data in the most cost-effective and performant manner.

Keynotes

Adam Selipsky – CEO, Amazon Web Services – Keynote
Adam Selipsky, AWS CEO, takes the stage to share his insights and the latest news about AWS customers, products, and services including Analytics services announcements

Swami Sivasubramanian – Vice President, Amazon Machine Learning – Keynote
Join Swami Sivasubramanian, Vice President, Amazon Machine Learning, on an exploration of what it takes to put data in action with an end to end data strategy including the latest news on databases, analytics, and machine learning.

Leadership session

ANT214-L – Reinvent your business for the future with AWS Analytics
The next wave of digital transformation will be data-driven, and organizations will have to reinvent themselves using data to make decisions quickly and gain faster and deeper insights to serve their customers. In this session, Rahul Pathak, VP of AWS Analytics, addresses the current state of analytics on AWS, focusing on the latest service innovations. Learn how you can put your data to work with the best of both data lakes and purpose-built data stores. Also, discover how AWS can help you build new experiences and reimagine old processes with a modern data architecture on AWS.

Breakout sessions

re:Invent breakout sessions are lecture-style and 1 hour long. These sessions are delivered by AWS experts, customers, and partners, and typically include 10–15 minutes of Q&A at the end. For our virtual attendees, breakout sessions will be made available on-demand in the week after re:Invent.

ANT215 – Introduction to AWS Data Exchange for Amazon Redshift
AWS Data Exchange for Amazon Redshift allows you to combine third-party data found on AWS Data Exchange with your own data from your Amazon Redshift cloud data warehouse, requiring no ETL and accelerating time to value. AWS Data Exchange allows an organization’s line of business to immediately access and analyze a provider’s data once access has been granted, eliminating the need to depend on IT teams to provision the necessary data. Data providers can license access to their Amazon Redshift cloud data warehouses or allow subscribers to download files from Amazon S3 with no heavy lifting.

ANT203 – What’s new in Amazon OpenSearch Service
Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), is a fully managed service that makes it easy for you to deploy, secure, and run OpenSearch and Apache 2.0-licensed Elasticsearch clusters cost-effectively at scale. The OpenSearch project is a community-driven, open-source fork of Elasticsearch and Kibana. This session discusses customer use cases, best practices, and newly launched features. In addition, it discusses how AWS has made the move to OpenSearch seamless and what to expect going forward.

ANT201 – What’s new with Amazon Redshift
Join this session to hear about important new features of Amazon Redshift. Learn about the architectural evolution of Amazon Redshift and how it uses machine learning to create a self-optimizing data warehouse. Additionally, explore how Amazon Redshift integrates with other popular AWS services.

ANT202 – What’s new with Amazon EMR
Amazon EMR simplifies running open-source data processing applications such as Apache Spark, Apache Hive, and Presto on AWS, enabling users to run ETL, ML, real-time processing, data science, and low-latency SQL at petabyte scale. This session covers the latest on Amazon EMR and how Amazon EMR runtimes provide excellent performance to open-source versions of such engines without breaking API compatibility. Discover how Amazon EMR Studio and Amazon SageMaker Studio simplify building applications and pipelines for data scientists and engineers. Learn how to add support for transactions and real-time streams in data lakes with Apache Hudi and Apache Iceberg. See how to enforce fine-grained access control over data in Amazon S3.

ANT318 – Data lakes: Easily build, secure, and share data with AWS Lake Formation
Organizations are breaking down data silos and building petabyte-scale data lakes on AWS to democratize access to thousands of end-users. In this session, learn about recent innovations in AWS Lake Formation that make it easy to build, secure, and manage your data lakes. Hear how an AWS customer built their data mesh architecture using Lake Formation to share data across their lines of business and inform data-driven decisions.

ANT303 – Democratizing data for self-service analytics and ML
Access to all your data for fast analytics at scale is foundational for 360-degree projects involving data engineers, database developers, data analysts, data scientists, BI professionals, and the line of business. In this session, learn how easy-to-use ML can help your organization imagine new products or services, transform your customer experiences, streamline your business operations, and improve your decision-making. A secure, integrated platform that’s easy to use and supports nonproprietary data formats can improve collaboration through data sharing and can also improve customer responsiveness. Learn how AWS developer tools, including the Data API, and native support for semi-structured data using standard SQL commands can improve software time to market.

ANT316 – How Coinbase uses Amazon MSK as an event store for applications
In this session, learn how focusing on security, availability, and customer obsession has translated into operational excellence and product innovations with Amazon MSK, a managed service for Apache Kafka. This session features cryptocurrency exchange company Coinbase’s experience managing streaming events and analyzing billions of daily cryptocurrency transactions with Amazon MSK. Dive into Coinbase’s event streaming architecture to learn how it leverages Amazon MSK as an enterprise event bus to ingest and analyze a huge scale of events from users, applications, databases, and cryptocurrency sources across products.

ANT310 – How VMware uses Amazon Kinesis to keep customers safe from cyberattacks
Streaming data with Amazon Kinesis Data Streams is an easy and cost-effective way to capture data from hundreds of thousands of sources and make it available for analysis in milliseconds. VMware Carbon Black’s cloud-native intelligent threat detection system uses Kinesis Data Streams and other AWS services. Join this session to dive deep into how VMware Carbon Black, a leader in cybersecurity, processes trillions of events per day to uncover concerning behavioral patterns and detect and prevent cybersecurity risks. VMware Carbon Black shares lessons learned while scaling its multi-tenant streaming data infrastructure and best practices for cost-effective data processing in real time.

ANT317 – Serverless data integration with AWS Glue
The first step in an analytics or machine learning project is to prepare your data to obtain quality results. AWS Glue is a serverless data integration service that makes data preparation simpler, faster, and cheaper. In this session, learn about the latest innovations in AWS Glue and hear how an AWS customer uses AWS Glue to enable self-service data preparation across their organization.

ANT307 – What’s new with Amazon Athena
Amazon Athena is a highly scalable analytics service that makes it easy to analyze data in Amazon S3 and other data stores. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. This session offers a deep dive into the service, customer use cases, best practices, newly launched features, and what is next for Athena.

ANT401 – Deep dive: Accelerating Apache Spark with Amazon EMR
Running Apache Spark workloads on Amazon EMR is becoming faster and more cost-effective. In this session, explore the features that Amazon EMR offers to improve performance and reduce the cost of operating big data analytics workloads. In this session, dive deep into the architectures and design patterns that organizations have employed when migrating their open-source analytics applications to Amazon EMR, and explore features such as the performance-optimized Amazon EMR runtime for Apache Spark, Graviton2 instance support, and more.

Chalk talks

Chalk talks are highly interactive sessions with a small audience. Experts lead you through problems and solutions on a digital whiteboard as the discussion unfolds. Each begins with a short lecture (10–15 minutes) delivered by an AWS expert, followed by a Q&A session of 45–50 minutes with the audience.

ANT322 – Amazon EMR on EKS
Is your organization considering a move to Kubernetes and Amazon EKS and wondering how to run Apache Spark applications on Amazon EKS? In this chalk talk, learn how Amazon EMR on EKS simplifies running Spark applications on Amazon EKS. Learn about the benefits of moving to containerization and moving to Amazon EKS. Also, dive into architectures and best practices and learn from customers who are using Spark on Amazon EKS at 3,000 or more nodes.

ANT308 – Building analytics at scale with Amazon Athena
Organizations want analytics solutions that are easy to set up and maintain while delivering the powerful analytics required to succeed with a modern data strategy. This chalk talk covers how you can use Amazon Athena to build powerful capabilities, like real-time fraud detection, and enable data scientists to build and train ML models across all of your data. Learn how Athena offers this capability with no infrastructure for you to manage and offers simple centralized governance and security.

ANT320 – Building data lakes and sharing data with AWS Lake Formation
Building data lakes and sharing data across your organization can be challenging. In this chalk talk, learn how to use AWS Lake Formation to simplify building, securing, and managing your data lakes. Discover best practices for reliably building your data lakes and sharing this data across your lines of business and thousands of users.

ANT301 – Concurrency and scalability strategies with Amazon Redshift
Amazon Redshift provides multiple features to help you deliver consistent performance, even as workloads grow and vary. Learn how to use concurrency scaling, data sharing, and more on their own and together to manage your workloads. In this chalk talk, you have the opportunity to ask Amazon Redshift service team experts about your unique situation.

ANT319 – Data preparation: Building scalable ETL pipelines with AWS Glue
Do you have questions about how AWS Glue works? Join this chalk talk to learn more about the best practices for building data integration pipelines at scale. Learn how to use the different components of AWS Glue to discover, catalog, and prepare your data for machine learning and analytics. Also learn best practices for optimizing your Apache Spark scripts.

ANT306 – Modernize your log analytics solution with Amazon OpenSearch Service
Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed service that makes it easy for you to deploy, secure, and run OpenSearch and Apache 2.0-licensed Elasticsearch clusters cost-effectively at scale. In this chalk talk, learn how to ingest data into Amazon OpenSearch Service from Amazon ECS using FireLens for logging and AWS Distro for OpenTelemetry for distributed tracing. Discover how to leverage OpenSearch Dashboards to analyze your application health and performance.

ANT302 – New use cases for Amazon Redshift
Amazon Redshift continuous innovations provide cloud data warehousing capabilities that deliver price performance leadership and ease of use with scale. Learn how Amazon Redshift features, built on the reliability and performance this service is known for today, can help you empower developers with automated capabilities, reduce time to business insights, or integrate across data types, AWS, and third-party services. Join this chalk talk to explore new features and learn from the experts about ways that you can use them.

ANT314 – Process streaming data using Amazon MSK & Amazon Kinesis Data Analytics
As data streaming architectures evolve, it’s vital to continuously improve your streaming data pipelines and take advantage of new features and updates to streaming services. With fully managed Apache Kafka and Apache Flink services, AWS makes it easy for developers to run streaming applications without managing infrastructure. In this chalk talk, learn how to use Amazon MSK, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda to build serverless streaming data pipelines. Discover best practices for application operations and reliability, and see how AWS managed services can help you avoid potential challenges.

ANT321 – Set up capital markets analytics, integrated with your data, using FinSpace
Are you a financial services firm such as a hedge fund, sell side bank, or asset manager with quantitative financial analysts using Jupyter notebooks to perform financial analysis such as time series, portfolio, or risk analytics? Do your analysts require secure access to data across your enterprise? Do your analysts need scalable Apache Spark to process petabytes of data such as trade and quote data? In this chalk talk, learn how Amazon FinSpace provides a managed research notebook environment with the security controls you need and the ability to integrate with data from internal systems and third-party data feeds.

ANT309 – Simplifying Amazon S3 analytics with Amazon Kinesis Data Firehose
Join this chalk talk to learn how Amazon Kinesis Data Firehose enables you to reliably load your streaming data into data lakes, data warehouses, and analytics services built on AWS, with AWS Partners, and using open-source tools. This talk includes a demonstration showcasing how Kinesis Data Firehose easily captures, transforms, and delivers streaming data to a data lake built on Amazon S3. Dive deep into reducing the cost of Amazon S3 analytics queries and simplifying Amazon S3 analytics workflows using Kinesis Data Firehose, Apache Parquet, and dynamic partitioning.

ANT315 – Using Amazon Redshift to directly query third-party data on AWS
In this chalk talk, learn how companies spanning multiple industries are using AWS Data Exchange and Amazon Redshift to find, subscribe to, and immediately access and analyze third-party datasets without having to set up data ingestion pipelines.

ANT405 – Enforcing data access control on Amazon EMR
Organizations often want to enforce fine-grained data access controls across data lakes throughout a company. In this chalk talk, learn about what these controls are and how you can you enforce them when using Apache Spark, Presto, and Hive on Amazon EMR. Discover various ways of authenticating users and how each of these authentication mechanisms impact authorization policies. Lastly, review the use of IAM roles, AWS Lake Formation, and Apache Ranger as tools to enforce fine-grained data access controls, and learn when you should use which. This chalk talk covers the basic tools required to enforce fine-grained authorization and how to use them.

ANT402 – Sizing Amazon OpenSearch Service domains
Whether you’re searching your product catalog or storing your logs for infrastructure monitoring, application performance monitoring, or observability, Amazon OpenSearch Service is the ideal tool. Its distributed search engine scales to support high-volume ingest and query rates. How you scale affects the performance of your workload and your cost running that workload, so it’s important to get it right. How do you find your way through all of the configuration options to create an optimal cluster? Come to this chalk talk with your workload description—source data, velocity, query types, and quantity—and we’ll help you get sized right.

Builders’ sessions

Builders’ sessions are small group sessions led by an AWS expert who demonstrates and builds a solution on AWS. Each builders’ session is an interactive, hour-long engagement. It begins with a short explanation followed by a practical walkthrough of the demonstration. When the demonstration is complete, feel free to use the shared artifacts to build on your own.

ANT311 – Build a data mesh with AWS Lake Formation and AWS Glue
In this builders’ session, learn how to build a data mesh design pattern using AWS Glue and AWS Lake Formation that supports a proliferation of data producers and data consumers with consistent, centralized governance. The design approach facilitates best practices for building scalable data platforms, ubiquitous data sharing, and centralized governance, and enables self-service analytics on AWS.

ANT312 – Building a secure, modern data architecture with AWS analytics
In this builders’ session, learn how to build a secure modern data architecture to combine various disparate data sources using AWS Lake Formation, Amazon AppFlow, AWS Database Migration Service (AWS DMS), and AWS Glue. Gain an understanding of key architecture tenets for ingestion patterns, design factors for securely storing data, how to apply granular security policies, data cataloging, and transformation for consumption.

ANT313 – Security essentials with Amazon MSK
Organizations have unique security and compliance mandates. A well-informed understanding of authentication features is critical to making the right choice for an organization’s security posture. Amazon MSK provides several authentication options to control access to Apache Kafka clusters. In this builders’ session, explore the available Amazon MSK authentication mechanisms, industry best practices, and recommendations for running secure Amazon MSK clusters.

Workshops

Workshops are 2-hour interactive learning sessions where you work in small group teams to solve problems using AWS services. Each workshop starts with a short lecture (10–15 minutes) by the main speaker, and the rest of the time is spent working as a group. Come prepared with your laptop and a willingness to learn!

ANT205- Create and train ML models with ease using Amazon Redshift ML
Amazon Redshift is the most widely deployed data warehouse and is the cornerstone of AWS data lake strategy. Experience how quickly you can build your data warehouse with Amazon Redshift and gain insights using the integrated SQL query editor. In this workshop, data analysts and data scientists can easily train machine learning (ML) models using SQL with Amazon Redshift ML, with zero data movement required. Data engineers can learn how the data API simplifies access and allows you to easily integrate applications with Amazon Redshift and build event-driven applications systems.

ANT204 – Dive into Amazon OpenSearch Service
OpenSearch is an Apache 2.0-licensed tool that provides you with rich, relevant search results for your data. Paired with OpenSearch Dashboards, you can analyze and visualize your log data. In this workshop, discover how Amazon OpenSearch Service enables you to focus on your search or monitoring problem and not worry about managing your infrastructure. Explore the console and deploy an OpenSearch Service domain in Amazon VPC, use OpenSearch search APIs, and work with OpenSearch Dashboards to build out visualizations. Come see how Amazon OpenSearch Service can help you solve your search and analytics needs.

ANT305 – Data science and DataOps workflows with Amazon EMR Studio
Have you ever felt that building data science applications, data engineering pipelines, or machine learning models was hard with Apache Spark on Amazon EMR? Join this workshop to learn how Amazon EMR Studio makes it simple to do these things. The workshop includes a walkthrough of a couple of examples with sample data so you can see how collaboration works with Amazon EMR Studio.

ANT404 – Event detection using Amazon MSK and Amazon Kinesis Data Analytics
In this workshop, you take on the role of an acting technology manager for a Las Vegas casino. Your assignment is to create a stream processing application that identifies customers entering your casino who have gambled heavily in the past and then sends you a text message when big spenders sit down at a gambling table. To do this, use Amazon MSK to capture events, Amazon Kinesis Data Analytics Studio to detect events of interest, and AWS Lambda with Amazon SNS to send you an email for any events.

ANT403 – Powering observability with Amazon OpenSearch Service
Amazon OpenSearch Service’s Trace Analytics functionality allows you to go beyond simple monitoring to understand not just what events are happening, but why they are happening. In this workshop, learn how to instrument, collect, and analyze metrics, traces, and log data all the way from user front ends to service backends and everything in between. Put this together with Amazon OpenSearch Service, AWS Distro for OpenTelemetry, and Data Prepper.

AWS Analytics Kiosk

Join us at the AWS Analytics Kiosk in the AWS Village at the Expo. Dive deep into AWS Analytics with AWS subject matter experts, see the latest demos, ask questions, or just drop by to chat with your peers.

AWS Analytics Meet-and-Greet Cocktail Hour

Date: Tuesday, November 30, 8:00 PM – 9:00 PM PST

Location: Canaletto Ristorante Veneto (The Venetian), Las Vegas, NV

Socialize with the AWS Analytics technical community. Join us and network over hors d’oeuvres and drinks with AWS leaders and specialists.

Looking forward to seeing you there!

About the Authors

Taz Sayed is the world-wide Analytics Tech Leader at AWS. He enjoys engaging with the wider data analytics community, and designing well-architected solutions for AWS customers.

Navnit Shukla is an Analytics Specialist Solution Architect with AWS. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Query data in Amazon OpenSearch Service using SQL from Amazon Athena

2021-11-10 Behram Irani

Post Syndicated from Behram Irani original https://aws.amazon.com/blogs/big-data/query-data-in-amazon-opensearch-service-using-sql-from-amazon-athena/

Amazon Athena is an interactive serverless query service to query data from Amazon Simple Storage Service (Amazon S3) in standard SQL. Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed, open-source, distributed search and analytics suite derived from Elasticsearch, allowing you to run OpenSearch Service or Elasticsearch clusters at scale without having to manage hardware provisioning, software installation, patching, backups, and so on.

Although both services have their own use cases, there might be situations when you want to run queries that combine data in Amazon S3 with data in an Amazon OpenSearch Service Cluster. In such cases, federated queries in Athena are a good option—they provide the capability to combine data from multiple data sources and analyze them in a single query. The federated query feature works by using data source connectors that are built for different data sources. To allow Amazon OpenSearch Service to be one of the sources that Athena can query against, AWS has made available a data source connector for OpenSearch Service clusters to be queried from Athena.

This post demonstrates how to query data in Amazon OpenSearch Service and Amazon S3 in a single query. We use the data made available in the public COVID-19 data lake, documented as part of the post A public data lake for analysis of COVID-19 data.

In particular, we use the following two datasets:

alleninstitute_metadata : Metadata on papers pulled from the COVID-19 Open Research Dataset (CORD-19). The sha column indicates the paper ID, which is the file name of the paper in the data lake. This dataset is stored in Amazon OpenSearch Service because it contains the column abstract, which you can search on.
alleninstitute_comprehend_medical : Results containing annotations obtained by running the papers in the preceding dataset through Amazon Comprehend Medical. This is accessed from its public storage at s3://covid19-lake/alleninstitute/CORD19/comprehendmedical/comprehend_medical.json.

Data flow when combining data from Amazon OpenSearch Service and Amazon S3

The data source connectors are implemented as AWS Lambda functions. When a user issues a query that combines data from Amazon OpenSearch Service and Amazon S3, Athena refers to the AWS Glue Data Catalog metadata to look up the table definitions. For the table whose data is in Amazon S3, Athena fetches the data from Amazon S3. For the tables that are in Amazon OpenSearch Service, Athena invokes the Lambda function (part of the data source connector application) to read the data from Amazon OpenSearch Service. Depending on the amount of data, you can invoke this function multiple times in parallel for the same query to enable faster reads.

The following diagram illustrates this data flow.

Set up the two data sources using AWS CloudFormation

To prepare for querying both data sources, launch the AWS CloudFormation template using the “Launch Stack” button below. All you need to do is choose Create stack.

To run the CloudFormation stack, you need to be logged in to an AWS account with permissions to do the following:

Create a CloudFormation stack
Create an Identity and Access Management (IAM) role
Create a Lambda function, assign an IAM role to it, and invoke it
Launch an OpenSearch Service cluster
Create AWS Glue databases and table
Create an S3 bucket

For instructions on creating a CloudFormation stack, see Get started.

For more information about controlling permissions and access for these services, see the following resources:

CloudFormation – Controlling access with AWS Identity and Access Management
Lambda – AWS Lambda permissions
Amazon OpenSearch Service – Identity and Access Management in Amazon OpenSearch Service
AWS Glue – Managing Access Permissions for AWS Glue Resources
Amazon S3 – Identity and access management in Amazon S3

The CloudFormation template creates the following:

A table in the AWS Glue Data Catalog named alleninstitute_comprehend_medical that points to the S3 location s3://covid19-lake/alleninstitute/CORD19/comprehendmedical/comprehend_medical.json. This contains the results extracted from the CORD-19 data using the natural language processing service Amazon Comprehend Medical.
An S3 bucket with the name athena-es-connector-spill-bucket- followed by the first few characters from the stack ID to keep the bucket name unique.
An OpenSearch Service cluster with the name es-alleninstitute-data, which has two instances configured to allow a role to access the cluster.
An IAM role to access the OpenSearch Service cluster.
A Lambda function that contains a piece of Python code that reads all the metadata of the papers along with the abstract. This data is available as JSON at s3://covid19-lake/alleninstitute/CORD19/json/metadata/. For this post, we load just one of the four JSON files available.
A custom resource that invokes the Lambda function to load the data into the OpenSearch Service cluster.

The stack can take 15–30 minutes to complete.

When the stack is fully deployed, navigate to the Outputs tab of the stack and note the name of the S3 bucket created (the value for SpillBucket).

For the rest of the steps, you need permissions to do the following:

Deploy an application in the AWS Serverless Application Repository (for information about IAM access, see Identity and Access Management for the AWS Serverless Application Repository)
Create a new data source and run queries in Athena (for information about IAM access, see Identity and Access Management in Athena)

Deploy the Amazon Athena OpenSearch connector

When the OpenSearch Service domain with an index containing the metadata related to the COVID-19 research papers and the AWS Glue table pointing to the Amazon Comprehend Medical output data is ready, you can deploy the Amazon Athena OpenSearch connector using the AWS Serverless Application Repository.

On the AWS Serverless Application Repository console, choose Available applications.
Search for Athena Elasticsearch and select Show apps that create custom IAM roles or resource policies.
Choose AthenaElasticsearchConnector.

You’re redirected to the application screen.

Scroll down to the Application settings section.
For AthenaCatalogName, enter a name (for this post, we use es-connector).

This name is the name of the application and the Lambda function that connects to Amazon OpenSearch Service every time you run a query from Athena. For more details about all the parameters, refer to the connector’s GitHub page.

For SpillBucket, enter the name you noted in the previous section when we deployed the CloudFormation stack (it begins with athena-es-connector-spill-bucket).
Leave all other settings as default.
Select I acknowledge that this app creates custom IAM roles.
Choose Deploy.

In a few seconds, you’re redirected to the Applications page. You can see the status of your deployment on the Deployments tab. The deployment takes 1–2 minutes to complete.

Create a new data source in Athena

Now that the connector application has been deployed, it’s time to set up the OpenSearch Service domain to show as a catalog on Athena.

On the Athena console, navigate to the cord19 database.

The database contains the table alleninstitute_comprehend_medical, which was created as part of the CloudFormation template. This refers to the data sitting in Amazon S3 at s3://covid19-lake/alleninstitute/CORD19/comprehendmedical/.

Choose Data sources in the navigation pane.
Choose Connect data source.
Select Custom data source.
For Data source name, enter a name (for example, es-cord19-catalog).
Select Use an existing Lambda function and choose es-connector on the drop-down menu.
Choose Connect data source.
Choose Next.
For Lambda function, choose es-connector.
Choose Connect.

A new catalog es-cord19-catalog should now be available, as in the following screenshot.

On the Query editor tab, for Data source, choose es-cord19-catalog.

You can now query this data source from Athena.

Query OpenSearch Service domains from Athena

When you choose the es-cord19-catalog data source, the Lambda function (which was part of the connector application that we deployed) gets invoked and fetches the details about the domain and the index. The OpenSearch Service domain shows up as a database, and the index is shown as a table. You can also query the table with the following query:

select count(*) from "es-cord19-catalog"."es-alleninstitute-data".alleninstitute_metadata

Now you can join data from both Amazon OpenSearch Service and Amazon S3 with queries, such as the following:

select es.title, es.url from 
"es-cord19-catalog"."es-alleninstitute-data".alleninstitute_metadata es
    inner join
AwsDataCatalog.cord19.alleninstitute_comprehend_medical s3
    on es.sha = s3.paper_id
WHERE 
    array_join(s3.dx_name, ',') like '%infectious disease%'

The preceding query gets the title and the URL of all the research papers where the diagnosis was related to infectious diseases.

The following screenshot shows the query results.

Clean up

To clean up the resources created as part of this post, complete the following steps:

On the Amazon S3 console, locate and select your S3 bucket (the same bucket you noted from the CloudFormation stack).
Choose Empty.

You can also achieve this by running the following command from a command line:

aws s3 rm s3://athena-es-connector-spill-bucket-f7eb2cb0 –recursive

On the AWS CloudFormation console, delete the stack you created.
Delete the stack created for the Amazon Athena OpenSearch connector application. The default name is serverlessrepo-AthenaElasticsearchConnector.
On the Athena console, delete the es-cord19-catalog data source.

You can also delete the data source with the following command:

aws athena delete-data-catalog --name "es-cord19-catalog"

Conclusion

In this post, we saw how to combine data from OpenSearch Service clusters with other data sources like Amazon S3 to run federated queries. You can apply this solution to other use cases, such as combining AWS CloudTrail logs loaded into OpenSearch Service clusters with VPC flow logs data in Amazon S3 to analyze unusual network traffic, or combining product reviews data in Amazon OpenSearch Service with product data in Amazon S3 or other data sources. You can also pull data from Amazon OpenSearch Service and create an AWS Glue table out of it using a CTAS query in Athena.

To learn more about the Amazon Athena OpenSearch connector and its other configuration options, see the GitHub repo.

To learn more about query federation in Athena, refer to Using Amazon Athena Federated Query or Query any data source with Amazon Athena’s new federated query.

About the Authors

Behram Irani, Sr Analytics Solutions Architect

Madhav Vishnubhatta, Sr Technical Account Manager

Creating a costs analytics view to email campaign generated by Amazon Pinpoint

2021-11-08 rafaaws

Post Syndicated from rafaaws original https://aws.amazon.com/blogs/messaging-and-targeting/creating-a-costs-analytics-view-to-email-campaign-generated-by-amazon-pinpoint/

Introduction

Many companies have multiple departments using different campaigns in the same AWS account on Amazon Pinpoint and need to split costs at the end of each month between the owners of each campaign. To do this, companies need an easy way to find how much each campaign has generated of cost, since the Amazon Pinpoint console doesn’t have this information. To solution this, it is possible combine some AWS services to find out these costs.

In this post, I will demonstrate how AWS analytics and storage services such as Amazon Kinesis, AWS Glue, Amazon Athena, Amazon Quicksight and Amazon Simple Storage Service (Amazon S3) can help you create an analytical view of the costs generated by emails sent through each campaign on Amazon Pinpoint. Not include transactional and Amazon Journey e-mails on the example.

Amazon Pinpoint is an AWS service that helps companies to engage their customers across multiple channels. You can use the Amazon Pinpoint to send email, SMS, push notifications and voice messages to deliver one-off demands or across campaigns.

In this blog, you will learn how to create a dashboard with total cost of emails sent and MTA (Monthly Targeted Audience) by each campaign. With this information, you will be able to distribute costs internally to each department responsible for each campaign on Amazon Pinpoint.

Solution Overview

To create this dashboard, we will take advantage of the Digital User Engagement Events Database solution. We can use an AWS CloudFormation template that set up Amazon Pinpoint event flow. This solution uses the Amazon Kinesis to stream all events about campaigns to a bucket on Amazon S3. After that, a data processing task is performed by AWS Lambda and cataloged on AWS Glue. Some views will be created on Amazon Athena to organize all data and we will use them to calculating and analyzing the Pinpoint costs. For more information about the solution, architecture and how AWS CloudFormation template works to automate the deployment, please visit the Implementation Guide page.

During the deployment process, you will have the option to create a new Amazon Pinpoint project to manage your campaigns or use an existing one.

Prerequisites

1. Complete the Digital User Engagement Events Database implementation.

2. Have an e-mail campaign on Amazon Pinpoint created with some emails already sent.

Analytics View

All events created by Amazon Pinpoint through the campaigns after the Digital Use Engagement Events Database implementation is complete should be appearing in parquet file format in the Amazon S3. If your campaign did not generate any events after the AWS CloudFormation was completed, I recommend creating and executing a new email campaign just to test this solution. Any test email sent during the implementation of this solution will incur charges. The email costs will be explained during this blog.

After the entire Amazon Pinpoint event flow has worked correctly, some modifications on the views created in Amazon Athena must be made. These changes will help to access the information about the quantity of endpoints registered by each campaign of your project. For this, the following steps are required:

To create a new view

1. Open the Amazon Athena console.

2. Under Database, choose the database name created by AWS CloudFormation template.

You will notice a table called “all events” and some views, eg. campaign_send, email_open, email_send and others. These views are responsible to improve the organization of data sent by Amazon Pinpoint, eg. in the campaign_send view it is possible to see all informations about all events that Amazon Pinpoint sends across the campaign by multiple channels.

3. Choose Create view.

A tab will be added on the center page so you can add your command which will be responsible to created a new view.

4. Replace the existing text to the command below and choose Run query.

CREATE OR REPLACE VIEW endpoint_unit AS
SELECT DISTINCT
client.client_id endpoint_id
, "min"("from_unixtime"((event_timestamp / 1000))) event_timestamp
, "month"("from_unixtime"((event_timestamp / 1000))) month_data
, "year"("from_unixtime"((event_timestamp / 1000))) year_data
FROM
all_events
WHERE (event_type = '_campaign.send')
GROUP BY client.client_id, "month"("from_unixtime"((event_timestamp / 1000))), "year"("from_unixtime"((event_timestamp / 1000)))

In this command we are creating a new view, grouping all the endpoints already used by campaigns, along with the earliest date and time that the endpoint was registered. This command will help you identify the first campaign that used the endpoint in each month and year.

Once the new view is created, you will notice that it is listed in the views pane with the name endpoint_unit. You can run this view to check which values are returned.

5. Choose Preview in the endpoint_unit view to return the results.

Example:

The data displayed refer only to the information of the endpoints used in campaigns after the implementation of the Digital User Engagement Events Database.

Now is the time to create the analytical view in Amazon Quicksight.

To check Amazon Quicksight Region

1. Open the Amazon Quicksight console.

If this is your first time using Amazon Quicksight, a page will appear to subscribe to the services, feel free to choose the best option for your business.

Warning: Some costs might occur by using Amazon Quicksight. Check the Amazon Quicksight Pricing page for more information.

2. On top of the screen, choose <Role>/<Account-Name> and select the region where you have the Amazon Pinpoint project.

To check Amazon Quicksight Permissions

Check the permission of the Amazon Quicksight to enable the access to Amazon S3 bucket.

1. On top of the screen, choose <Role>/<Account-Name>, Manage Quicksight.

2. In navigation pane, choose Security & permissions.

3. Under QuickSight access to AWS services, choose Add or remove.

4. Inside the QuickSight access to AWS services table, under Amazon S3, choose Details.

5. Choose Select S3 buckets.

6. Check if the Amazon S3 bucket name checkbox containing all stream files is selected. If not, select the checkbox and choose Finish and Update.

Create a Dataset for Campaign cost

1. Back to the Amazon Quicksight main page.

2. In the navigation page, choose Datasets, New dataset.

3. Choose Athena.

Now, let’s add the table containing information regarding the number of emails sent per Amazon Pinpoint campaign.

4. In the New Athena data source dialog box, do the following:

a. For Data source name, type a name.

b. Choose Create data source.

5. In the Choose your table dialog box, do the following:

a. Choose Use custom SQL.

b. In first field, enter a name for the custom SQL.

c. In second field, paste the command below.

SELECT * FROM "due_eventdb"."campaign_send" where (message_tags['delivery_type'] = 'EMAIL')

This command is filtering only events of campaign based on emails.

d. Choose Confirm query.

6. In the Finish dataset creation dialog box, you will be asked to select between storing a copy of the data of this table in SPICE or performing a query directly from data source. Feel free to choose the best option for your business.

7. Choose Visualize.

After creating the dataset, you will be redirected to the Amazon Quicksight analytics creation page.

As the information sent by Amazon Pinpoint does not have the unit cost of each email sent, we will use three features called Parameters, Control and Calculated Field to calculate these amounts.

Parameters are named variables that can transfer a value for use by an action or an object. To make the parameters accessible to the dashboard viewer, you add a parameter control.

The calculated fields help you to transform your data by using one or more of the actions: Operators, Functions, Aggregate functions, Fields that contain data or other calculated fields.

Create an Amazon QuickSight parameter

1. In the navigation bar, choose + Add at the top of the screen.

2. Choose Add parameter.

3. In the Create new parameter dialog box, do the following:

a. For Name, type the name for the parameter, eg. Costemail.

b. For Data type, choose number.

c. For Values, select Single value.

d. For Static default value enter the unit cost of email. The cost of each Amazon Pinpoint email can be found on Amazon Pinpoint Pricing page.

e. Choose Create and Close.

Important: All costs in this blog will be calculated in USD.

After creating the first parameters, we need to create a manual control of these costs.

Create an Amazon QuickSight control

1. In the navigation pane, choose Parameters.

2. Under the name of parameters that you previously created, choose Add control.

3. In the Add control dialog box, do the following:

a. For Display name, enter a name.

b. For Style, choose Text field.

c. Choose Add.

This control will help you in the future if you need to change the unit price without changing the Parameters configuration.

We now need to create the Calculate Field. It multiplying the total sent emails value by the unit costs.

Create an Amazon QuickSight calculated field

1. In the navigation bar, choose + Add at the top of the screen again.

2. Choose add calculated field.

3. In add name field, type a name.

4. Paste the expression below.

count({pinpoint_campaign_id}) * ${name_parameters}

5. Replace the “name_parameters” in the expression with the name of the Parameter you created earlier.

6. Choose Save.

We now have all fields available to create the chart on Quicksight.

Create an Amazon QuickSight dashboard

1. In the navigation page, choose Visualize.

2. Under Visual type, choose Vertical bar chart.

Note: If you prefer, you can change it later to other visual type.

3. Choose the fields pinpoint_campaign_id, calculated field that you just created and event_timestamp. Drag each field to X axis, value and group/color to create the chart.

Example:

In this example, you can see the cost in USD x Amazon Pinpoint Campaign ID between April and May 2021.

If you prefer, you can customize your chart in Format Visual. You can also build another view to show the amount of emails sent per campaign.

4. In the navigation bar, choose Share.

5. Choose Publish dashboard.

6. In the Publish a dashboard dialog box, under Publish new dashboard as, type a name.

7. Choose Publish dashboard.

If you want, you can share your dashboard with other username, group, or email address.

Now that we have the total costs of emails per each campaign, let’s create the chart for the total endpoint cost for each campaign.

Create a Dataset for MTA cost

1. Back to the Amazon Quicksight main page.

2. In the navigation page, choose Datasets, New dataset.

3. Choose Athena.

4. In the New Athena data source dialog box, do the following:

a. For Data source name, type a name.

b. Choose Create data source.

5. In the Choose your table dialog box, do the following:

a. Choose Use custom SQL.

b. In first field, enter a name for the custom SQL.

c. In second field, paste the command below.

SELECT distinct c.endpoint_id, e.pinpoint_campaign_id, c.event_timestamp
FROM "due_eventdb"."campaign_send" e
INNER JOIN "due_eventdb"."endpoint_unit" c ON c.event_timestamp = e.event_timestamp

d. Choose Confirm query.

7. Choose Visualize.

This command is responsible to join the pinpoint_campaign view with the endpoint_unit view. This command will return the campaign ID responsible for contacting the endpoint for the first time on each month.

8. After the dataset is created, repeat the same steps to create a new Parameter, Control and Calculated Field on section Create an Amazon QuickSight parameter, Create an Amazon QuickSight control and Create an Amazon QuickSight calculated field, but when creating a new Parameter, you will use the unit cost of each endpoint, currently it is described on the Amazon Pinpoint Pricing page as Monthly Targeted Audience (MTA).

If you send messages from an Amazon Pinpoint campaign or journey, the unique endpoints you contact are known as a monthly targeted audience (MTA). You are charged on the number of MTAs targeted in a calendar month.

Important: In this calculation we are not considering the subtraction of the free tier values.

In the process of creating Calculated Field, use the following expression:

count({endpoint_id}) * ${name_parameters}

Also remember to replace the name_parameters field in the expression with the name of the parameter that you previously created in point 8 on Create a Dataset for MTA cost section to calculate the costs of endpoints.
In this expression you are calculating the MTA for the distinct endpoint contacted per month.

After this, you will also have all the required fields to create your total cost chart of endpoint per campaign. In this case, use the fields pinpoint_campaign_id, calculated field that you created earlier and event_timestamp.

Example:

In this example, you can see the MTA cost in USD x Amazon Pinpoint Campaign ID between April and May 2021.

Some customers usually have tens or hundreds of campaigns, on this case you can use the Filter option in the navigation pane to a specify a range of date.

Optional: If you prefer you can combine the email and MTA cost in the same Analyses and Dashboard.

Add more datasets in the same Analyses and Dashboard

1. Back to the Amazon Quicksight main page.

2. In the navigation page, choose Analyses.

3. Choose the email cost Analyses that you created on section Create an Amazon QuickSight dashboard.

4. Choose the pencil icon on Dataset option near to the Navigation page.

5. Choose add dataset.

6. In the Choose dataset to add dialog box, choose the dataset you created on section Create a Dataset for MTA cost.

7. Choose Select.

Now you can create each parameter, control and calculated field per dataset on the same Analyses and publish all charts on same dashboard.

Cleanup

To avoid incurring charges, navigate to AWS Cloudformation console and delete the stack that you used on Digital User Engagement Events Database solutions procedure.

After the stack is deleted, you also need to delete your dashboards, analyses and datasets on Amazon Quicksight. You can also delete the stream events data in your Amazon S3 bucket.

Conclusion

In this blog, we used the total cost of email sent and endpoints to create the charts, but it is possible to obtain several analyses in Quicksight using the views that became available in the Digital User Engagement Events Database solution, such as costs for push notifications and other types of channels.

Also try creating dashboards with other Amazon Pinpoint channels

To do this, use this same procedure, explore the view campaign_send to find all data about other channels and modify the SQL queries on Amazon Quicksight to create your dashboards.

About the Author

Rafael Rodrigues is an Enterprise Solution Architect for AWS based in Sao Paulo, Brazil. He helps customers innovate with modern IT architecture on cloud computing.

Accelerate self-service analytics with Amazon Redshift Query Editor V2

2021-11-08 Bhanu Pittampally

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/accelerate-self-service-analytics-with-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics platform. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API. Query Editor V2 lets users explore, analyze, and collaborate on data. You can use Query Editor V2 to create databases, schemas, tables, and load data from Amazon Simple Storage Service (S3) either using COPY command or using a wizard . You can browse multiple databases and run queries on your Amazon Redshift data warehouse, data lake, or federated query to operational databases such as Amazon Aurora.

From the smallest start-ups to worldwide conglomerates, customers across the spectrum tell us they want to promote self-service analytics by empowering their end-users, such as data analysts and business analysts, to load data into their analytics platform. Analysts at these organizations create tables and load data in their own workspace, and they join that with the curated data available from the data warehouse to gain insight. This post will discuss how Query Editor V2 accelerates self-service analytics by enabling users to create tables and load data with simple wizards.

The Goal to Accelerate and Empower Data Analysts

A common practice that we see across enterprises today is that more and more enterprises are letting data analysts or business analysts load data into their user or group workspaces that co-exist on data warehouse platforms. Enterprise calls these personal workspaces, departmental schemas, project-based schemas or labs, and so on. The idea of this approach is to empower data analysts to load data sets by themselves and join curated data sets on a data warehouse platform to accelerate the data analysis process.

Amazon Redshift Query Editor V2 makes it easy for administrators to create the workspaces, and it enables data analysts to create and load data into the tables. Query Editor V2 lets you easily create external schemas in Redshift Cluster to extend the data warehouse to a data lake, thereby accelerating analytics.

An example Use case

Let’s assume that an organization has a marketing department with some power users and regular users. In this example, let’s also consider that the organization already has an Enterprise Data Warehouse (EDW) powered by Amazon Redshift. The marketing department would like to have a workspace created for their team members.

A visual depiction of a Data Warehouse Environment may look like the following figure. Enterprises let user/group schemas be created along with an EDW, which contains curated data sets. Analysts can create and load exploratory data sets into user schemas, and then join curated data sets available in the EDW.

ScopeofSolution

Amazon Redshift provides several options to isolate your users’ data from the enterprise data warehouse data,. Amazon Redshift data sharing lets you share data from your EDW cluster with a separate consumer cluster. Your users can consume the EDW data and create their own workspace in the consumer cluster. Alternatively, you can create a separate database for your users’ group workspace in the same cluster, and then isolate each user group to have their own schema. Amazon Redshift supports queries of data joining across databases, and then users can join their tables with the curated data in the EDW. We recommend you use the data sharing option that lets you isolate both compute and data. Query Editor v2 supports both scenarios.

Once you have enabled your data analysts to have their own workspace and provided the relevant privileges, then they can easily create Schema, table, and load data.

Prerequisites

You have an Amazon Redshift cluster, and you have configured the Query Editor V2. You can view the Simplify Data Analysis with Amazon Redshift Query Editor V2 post for instructions on setting up Query Editor V2.
For loading your data from Amazon S3 into Amazon Redshift, you will start by creating an IAM role to provide permissions to access Amazon S3 and grant that role to the Redshift cluster. By default, Redshift users assume that the IAM role is attached to the Redshift cluster. You can find the instructions in the Redshift getting started guide.
For users who want to load data from Amazon S3, Query Editor V2 provides an option to browse S3 buckets. To use this feature, users should have List permission on the S3 bucket.

Create Schemas

The Query Editor V2 supports the schema creation actions. Likewise, admins can create both native and external schemas by creating Schema wizard.

CreateSchemas

As a user, you can easily create a “schema” by accessing Create Schema wizard available from the Create button, and then selecting “Schema” from the drop-down list, as shown in the following screenshot.

If you select the Schema from the drop-down list, then the Create Schema wizard similar to the following screenshot is displayed. You can choose a local schema and provide a schema name.

Optionally, you can authorize a user to authorize users to create objects in the Schema. When the Authorize user check box is selected, then Create and Usage access are granted to the user. Now, Janedoe can create objects in this Schema.

Let’s assume that the analyst user Janedoe logs in to Query Editor V2 and logs in to the database and wants to create table and load data into their personal workspace.

Creating Tables

The Query Editor V2 provides a Create table wizard for users to create a table quickly. It allows power users to auto-create the table as based on a data file. Users can upload the file from their local machine and let Query Editor V2 figure out the data types and column widths. Optionally, you can change the column definition, such as encoding and table properties.

Below is a sample CSV file with a row header and sample rows from the MarketingCampaign.csv file. We will demonstrate how to create a table based on this file in the following steps.

SampleData

The following screenshot shows the uploading of the MarketingCampaing.csv file into Query Editor V2.

Create Table Wizard has two sections:

Columns

The Columns tab lets users select a file from their local desktop and upload it to Query Editor V2. Users can choose Schema from the drop-down option and provide a table name.

Query Editor V2 automatically infers columns and some data types for each column. It has several options to choose from to set as column properties. For example, you have the option to change column names, data types, and encoding. If you do not choose any encoding option, then the encoding choice will be selected automatically. You can also add new fields, for example, an auto-increment ID column, and define properties for that particular identity column.

Table Details

You can use the Create Table wizard to create a temporary table or regular table with the option of including it in automatic backups. The temporary table is available until the end of the session and is used in queries. A temporary table is automatically dropped once a user session is terminated.

The “Table Details” is optional, as Amazon Redshift’s Automatic Table Optimization feature takes care of Distribution Key and Sort Key on behalf of users.

Viewing Create Table Statement

Once the column and table level detail is set, Query Editor V2 gives an option to view the Create table statement in Query Editor tab. This lets users save the definition for later use or share it with other users. Once the user reviews the create table definition, then the user can hit the “Run” button to run the query. Users can also directly create a table from the Create table wizard.

The following screenshot shows the Create table definition for the marketing campaign data set.

CreateTable3

Query Editor V2 lets users view table definitions in a table format. The following screenshot displays the table that we created earlier. Note that Redshift automatically inferred encoding type for each column. As the best practice, it skipped for “Dt_Customer“, as it was set as the sort key. When creating the table, we did not set the encodings for columns, as Redshift will automatically set the best compression methods for each column.

Query Editor V2 distinguishes columns by data types in a table by using distinct icons for them.

You can also view the table definition by right-clicking on the table and selecting the show definition option. You can also generate a template select command, and drop or truncate the table by right-clicking on a table.

Loading Data

Now that we have created a schema and a table, let’s learn how to upload the data to the table that we created earlier.

Query Editor V2 provides you with the ability to load data for S3 buckets to Redshift tables. The COPY command is recommended to load data in Amazon Redshift. The COPY command leverages the massively parallel processing capabilities of Redshift.

The Load Data wizard in the Query Editor V2 loads data into Redshift by generating the COPY command. As a data analyst, you don’t have to remember the intricacies of the COPY command.

You can quickly load data from CSV, JSON, ORC, or Parquet files to an existing table using the Load Data Wizard. It supports all of the options in the COPY command. The Load Data Wizard lets Data analysts build a COPY command with an easy-to-use GUI.

The following screenshot shows an S3 bucket that has our MarketingCampaign.csv file. This is a much larger file that we used to create the table using Create table wizard. We will use this file to walk you through the Load Data wizard.

LoadData1

The Load Data wizard lets you browse your available S3 bucket and select a file or folder from the S3 bucket. You can also use a manifest file. A manifest file lets you make sure that all of the files are loaded using the COPY command. You can find more information about manifest files here.

The Load Data Wizard lets you enter several properties, such as the Redshift Cluster IAM role and whether data is encrypted. You can also set file options. For example, in the case of CSV, you can set delimiter and quote parameters. If the file is compressed, then you can provide compression settings.

With the Data Conversion Parameters, you can select options like Escape Characters, time format, and if you want to ignore the header in your data file. The Load Operations option lets you set compression encodings and error handling options.

Query Editor V2 lets you browse S3 objects, thereby making it easier to navigate buckets, folders, and files. Below screens displays the flow

Query Editor V2 supports loading data of many open formats, such as JSON, Delimiter, FixedWidth, AVRO, Parquet, ORC, and Shapefile.

In our example, we are loading CSV files. As you can see, we have selected our MarketingCampaing.csv file and set the Region, and then selected the Resfhift cluster IAM Role.

For the CSV file, under additional File Options, Delimiter Character and Quote Character are set with “;” and an empty quote in the below screen.

Once the required parameters are set, continue to next step to load data. Load Data operation builds a copy command and automatically loads it into Query Editor Tab, and then invokes the query.

LoadData5

Data is loaded into the target table successfully, and now you can run a query to view that data. The following screen shows the result of the select query executed on our target table:

LoadData6

Viewing load errors

If your COPY command fails, then these are logged into STL_LOAD_ERRORS system table. Query Editor v2 simplifies the viewing of the common errors by showing the errors in-place as shown in the following screenshot:

LoadData7

Saving and reusing the queries

You can save the load queries for future usage by clicking on the saved query and providing a name in the saved query.

You would probably like to reuse the load query in the future to load data in from another S3 location. In that case, you can use the parameterized query by replacing the S3 URL of the as shown in the following screenshot:

SavingQ2

You can save the query, and then share the query with another user.

When you or other users run the query, a prompt for the parameter will appear as in the following screenshot:

SavingQ3

We discussed how data analysts could load data into their own or the group’s workspace.

We will now discuss using Query Editor V2 to create an external schema to extend your data warehouse to the data lake.

Extending the Data Warehouse to the Data Lake

Extending Data warehouses to Data lakes is part of modern data architecture practices. Amazon Redshift enables this with seamless integration through Data lake running on AWS. Redshift uses Spectrum to allow this extension. You can access data lakes from the Redshift Data warehouse by creating Redshift external schemas.

Query Editor V2 lets you create an external schema referencing an external database in AWS Glue Data Catalogue.

To extend your Data Warehouse to Data Lake, you should have an S3 data lake and AWS Glue Data Catalog database defined for the data lake. Grant permission on AWS Glue to Redshift Cluster Role. You can find more information about external Schema here.

You can navigate to the Create External Schema by using Create Schema wizard, and then selecting the External Schema as shown in the following screenshot:

The Query Editor V2 makes the schema creation experience very easy by hiding the intricacies of the create external schema syntax. You can use the simple interface and provide the required parameters, such as Glue data regions, external database name, and the IAM role. You can browse the Glue Catalog and view the database name.

After you use the create schema option, you can see the schemas in the tree-view. The Query Editor V2 uses distinct icons to distinguish between native Schema and external Schema.

Viewing External Table Definitions

The Query Editor V2 lets data analysts quickly view objects available in external databases and understand their metadata.

You can view tables and columns for a given table by clicking on external Schema and then on a table. When a particular table is selected, its metadata information is displayed in the bottom portion below the tree-view panel. This is a powerful feature, as an analyst can easily understand the data residing externally in the data lake.

You can now run queries against external tables in the external Schema.

In our fictitious enterprise, Marketing Department team members can load data in their own workspace and join the data from their own user/group workspace with the curated data in the enterprise data warehouse or data lake.

Conclusion

This post demonstrated how the Query Editor V2 enabled data analysts to create tables and load data from Amazon S3 easily with a simple wizard.

We also discussed how Query Editor V2 lets you extend the data warehouse to a data lake. The data analysts can easily browse tables in your local data warehouse, data shared from another cluster, or tables in the data lake. You can run queries that can join tables in your data warehouse and data lake. The Query Editor V2 also provides several features for the collaboration of query authoring. You can view the earlier blog to learn more about how the Query Editor V2 simplifies data analysis.

These features let organizations accelerate self-service analytics and end-users deliver the insights faster.

Happy querying!

About the Authors

Bhanu Pittampally is Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse – architecture, development and administration. He is in data and analytical field for over 13 years. His Linkedin profile is here.

Debu Panda is a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world.

Cansu Aksu is a Front End Engineer at AWS, has a several years of experience in developing user interfaces. She is detail oriented, eager to learn and passionate about delivering products and features that solve customer needs and problems

Chengyang Wang is a Frontend Engineer in Redshift Console Team. He worked on a number of new features delivered by redshift in the past 2 years. He thrives to deliver high quality products and aim to improve customer experience from UI

Now Available: Updated guidance on the Data Analytics Lens for AWS Well-Architected Framework

2021-10-29 Wallace Printz

Post Syndicated from Wallace Printz original https://aws.amazon.com/blogs/big-data/now-available-updated-guidance-on-the-data-analytics-lens-for-aws-well-architected-framework/

Nearly all businesses today require some form of data analytics processing, from auditing user access to generating sales reports. For all your analytics needs, the Data Analytics Lens for AWS Well-Architected Framework provides prescriptive guidance to help you assess your workloads and identify best practices aligned to the AWS Well-Architected Pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization. Today, we’re pleased to announce a completely revised and updated version of the Data Analytics Lens whitepaper.

Self-assess with Well-Architected design principles

The updated version of the Data Analytics Lens whitepaper has been revised to provide guidance to CxOs as well as all data personas. Within each of the five Well-Architected Pillars, we provide top-level design principles for CxOs to quickly identify areas for teams and fundamental rules that analytics workloads designers should follow. Each design principle is followed by a series of questions and best practices that architects and system designers can use to perform self-assessments. Additionally, the Data Analytics Lens includes suggestions that prescriptively explain steps to implement best practices useful for implementation teams.

For example, the Security Pillar design principle “Control data access” works with the best practice to build user identity solutions that uniquely identify people and systems. The associated suggestion for this best practice is to centralize workforce identities, which details how to use this principle and includes links to more documentation on the suggestion.

“Building Data Analytics platform or workloads is one of the complex architecture patterns. It involves multi-layered approach such as Data Ingestion, Data Landing, Transformation Layer, Analytical/Insight and Reporting. Choices of technology and service for each of these layers are wide. The AWS Well-Architected Analytics Lens helps us to design and validate with great confidence against each of the pillars. Now Cognizant Architects can perform assessments using the Data Analytics Lens to validate and help build secure, scalable and innovative data solutions for customers.”

– Supriyo Chakraborty, Principal Architect & Head of Data Engineering Guild, Cognizant Germany
– Somasundaram Janavikulam, Cloud Enterprise Architect & Well Architected Partner Program Lead, Cognizant

In addition to performing your own assessment, AWS can provide a guided experience through reviewing your workload with a Well-Architected Framework Review engagement. For customers building data analytics workloads with AWS Professional Services, our teams of Data Architects can perform assessments using the Data Analytics Lens during the project engagements. This provides you with an objective assessment of your workloads and guidance on future improvements. The integration is available now for customers of the AWS Data Lake launch offering, with additional Data Analytics offerings coming in 2022. Reach out to your AWS Account Team if you’d like to know more about these guided Reviews.

Updated architectural patterns and scenarios

In this version of the Data Analytics Lens, we have also revised the discussion of data analytics patterns and scenarios to keep up with the industry and modern data analytics practices. Each scenario includes sections on characteristics that help you plan when developing systems for that scenario, a reference architecture to visualize and explain how the components work together, and configuration notes to help you properly configure your solution.

This version covers the following topics:

Building a modern data architecture (formerly Lake House Architecture)
Organize around data domains by delivering data as a product using a data mesh
Efficiently and securely provide batch data processing
Use streaming ingest and stream processing for real-time workloads
Build operational analytics systems to improve business processes and performance
Provide data visualization securely and cost-effectively at scale

Changed from the first release, the machine learning and tenant analytics scenarios have been migrated to a separate Machine Learning Lens whitepaper and SaaS Lens whitepaper.

Conclusion

We expect this updated version will provide better guidance to validate your existing architectures, as well as provide recommendations for any gaps that identified.

For more information about building your own Well-Architected systems using the Data Analytics Lens, see the Data Analytics Lens whitepaper.

Special thanks to everyone across the AWS Solution Architecture and Data Analytics communities who contributed. These contributions encompassed diverse perspectives, expertise, and experiences in developing the new AWS Well-Architected Data Analytics Lens.

About the Authors

Wallace Printz is a Senior Solutions Architect based in Austin, Texas. He helps customers across Texas transform their businesses in the cloud. He has a background in semiconductors, R&D, and machine learning.

Indira Balakrishnan is a Senior Solutions Architect in the AWS Analytics Specialist SA Team. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems using data-driven decisions. Outside of work, she volunteers at her kids’ activities and spends time with her family.

Designing a High-volume Streaming Data Ingestion Platform Natively on AWS

2021-10-19 Soonam Jose

Post Syndicated from Soonam Jose original https://aws.amazon.com/blogs/architecture/designing-a-high-volume-streaming-data-ingestion-platform-natively-on-aws/

The total global data storage is projected to exceed 200 zettabytes by 2025. This exponential growth of data demands increased vigilance against cybercrimes. Emerging cybersecurity trends include increasing service attacks, ransomware, and critical infrastructure threats. Businesses are changing how they approach cybersecurity and are looking for new ways to tackle these threats. In the past, they have relied on internal IT or engaged a managed security services provider (MSSP) to monitor and prevent unauthorized access and attacks.

An end-to-end analytics solution should ingest and process log data streaming from various computing and IoT devices. It can then make processed data available to analytics systems users in near-real-time. However, the sheer volume of data in the future makes this difficult to address in a reliable and cost-effective manner.

In this blog post, we present three approaches for a high-volume log data ingestion and processing platform natively on Amazon Web Services (AWS). We also compare the pros and cons of each. We’ll discuss factors to consider when evaluating the different options and their associated flexibility, to take full advantage of AWS. We will showcase a fictional use case for a top MSSP who ingests high volumes of logs from security devices to cloud. This MSSP also performs downstream analytics and threat detection modeling.

The options we present here have a log collection platform (LCP) on-premises. It collects logs from security devices and sensors, performs necessary translations and tokenization, and pushes compressed log files to the processing tier on cloud. The collection platform can also be modernized to have the IoT-enabled devices send logs to AWS IoT services. This will push the data to Amazon Kinesis, a managed service for collecting and analyzing streaming data.

Approach 1: Amazon Kinesis for log ingestion and format conversion

Figure 1 illustrates a comprehensive solution that uses managed and serverless services on AWS.

Figure 1. Amazon Kinesis for log ingestion and format conversion

1. LCP will invoke a scalable producer application for Amazon Kinesis Data Streams running on AWS Fargate behind an Application Load Balancer. The producer application will use the Amazon Kinesis Producer Library (KPL). KPL aggregates and batches data records to make ingestion into the data stream more efficient. The application may provide compressed records to the KPL to have it manage object compression.

The application can be set up as an HTTP endpoint that receives log files and processes them using KPL. Customer ID sent as part of an HTTP request header can be used to maintain affinity. The application can run in a Docker container, which is orchestrated by Amazon ECS on AWS Fargate. A target tracking scaling policy can manage the number of parallel running data ingestion containers to manage scalability of the ingestion process.

2. Amazon Kinesis Scaling Utility can be used to scale data streams up or down by a count, or as a percentage of the total fleet. The scaling utility archive file can be imported as a library to AWS Lambda. It will automatically manage the number of shards in the stream based on the observed PUT or GET rate of the stream. The combination of customer ID and security device ID may be used to define the partition key.

3. Records uploaded to the stream by the producer application will be consumed by Lambda. It will perform gateway transformations (required by all downstream consumers) and the normalization of record format. Any additional consumer level transformations may be handled separately, associated with respective consumers.

A combination of batch window and batch size configurations can improve efficiency of function invocations. Batch windows are the maximum amount of time in seconds to gather records before invoking the function. Batch size is the number of records to send to the function in each batch. The Lambda function will throttle sending records to Amazon Kinesis Data Firehose. Error handling will be accomplished via retries with a smaller batch size, with number of retries limited as appropriate. It will discard records that are too old.

An Amazon Simple Queue Service (SQS) queue can be configured as a failed-event destination for further offline analysis. A Lambda function can read from the error SQS queue to do basic checks and determine appropriate follow-up actions. This can be an initiated email for additional investigation or a command to discard the message.

4. Output of transformations by Lambda will be saved to the short term (hot) storage Amazon S3 bucket via Kinesis Data Firehose. This can efficiently handle Parquet format conversion required by downstream analytics applications. Kinesis Data Firehose delivery streams will be created per customer and configured with associated AWS Glue Data Catalog table, to perform parquet format conversion.

5. AWS Glue jobs will be used to consolidate and write larger files to the long term (cold) storage bucket.

6. The data in the cold storage bucket will be accessed by internal SOC analysts for threat detection and mitigation.

7. The data in cold storage buckets will also be accessed by end customers via dashboards in Amazon QuickSight.

8. This architecture also provides additional options to modernize streaming analytics using Amazon Kinesis Data Analytics or AWS Glue streaming jobs as appropriate.

While this architecture proposes a fully managed, end-to-end solution, the sheer volume of log messages may drive up the total cost of the solution. This is especially true for Kinesis Data Streams and Kinesis Data Firehose costs.

Approach 2: Containerized application on AWS Fargate for ingestion and Amazon Kinesis for format conversion

An alternative approach shown in Figure 2 replaces the gateway Kinesis Data Streams and transformations, with a containerized application on Fargate. Conversion to Parquet format and writing to the S3 bucket is still handled by Kinesis Data Firehose.

Figure 2. Containerized application for ingestion and Amazon Kinesis for format conversion

1. LCP will upload log files to a raw storage bucket in Amazon S3.

2. A Lambda function will process Event Notifications from the raw data storage bucket. It can insert Amazon S3 object pointers to a Kinesis Data Stream partitioned by Customer ID and Device ID.

3. The producer application will retrieve the Event Notifications from the Data Stream and retrieve corresponding log files from S3. It will perform initial aggregations and transformations, and output to Kinesis Data Firehose. The application can run in a Docker container that is orchestrated by Amazon ECS on Fargate. A target tracking scaling policy can manage the number of parallel running data ingestion containers, to manage scalability of the ingestion process. ECS cluster capacity can be scaled up or down based on Amazon CloudWatch alarms.

4. Kinesis Data Firehose converts to Parquet format, zips the data, and persists to a short-term storage bucket in S3. This is backed by Glue Data Catalog.

Steps 5, 6 and 7 perform consolidation and availability of the processed data to downstream consumers, as in the previous approach.

This option uses the built-in capabilities of Kinesis Data Firehose to transform to Parquet format and deliver to S3. Note that higher costs associated with the service may still be cost prohibitive for larger data volumes.

Approach 3: Containerized application on AWS Fargate for ingestion and format conversion

Figure 3 uses a containerized application running on Fargate for both gateway transformations. This app also provides conversion to Parquet format before writing the files to a short term (hot) storage bucket. All the other steps are the same as in option 2.

Figure 3. Containerized application for ingestion and format conversion

This option offers the least expensive way to transform, aggregate, and enrich the incoming log records, as well as convert them to Parquet format. But it comes with additional overhead for custom development of format conversion, checkpointing, error handling, and application management. Evaluate based on your business needs and workflow.

Conclusion

In this post, we discussed multiple approaches to design a platform on AWS to ingest and process high-volume security log records. We compared the pros and cons for each option. Amazon Kinesis is a fully managed and scalable service that helps easily collect, process, and analyze video and data streams in real time. A solution primarily based on Kinesis may become cost prohibitive due to large data volumes. Consider alternate approaches that use containerized applications on AWS Fargate. The trade-off would be the ability for custom development versus application management overhead.

To improve your security log analysis solution, explore one of the approaches we illustrate and customize as appropriate to fit your unique needs.

How Amazon Transportation Service enabled near-real-time event analytics at petabyte scale using AWS Glue with Apache Hudi

2021-10-14 Madhavan Sriram

Post Syndicated from Madhavan Sriram original https://aws.amazon.com/blogs/big-data/how-amazon-transportation-service-enabled-near-real-time-event-analytics-at-petabyte-scale-using-aws-glue-with-apache-hudi/

This post is co-written with Madhavan Sriram and Diego Menin from Amazon Transportation Services (ATS).

The transportation and logistics industry covers a wide range of services, such as multi-modal transportation, warehousing, fulfillment, freight forwarding, and delivery. At Amazon Transportation Service (ATS), the lifecycle of the shipment is digitally tracked and appended to tens of tracking updates on average. Those tracking updates are vital to kick off events through the shipment operational and billing lifecycle, including delay identification and route optimization. They are also the base for the customer and consumer tracking experience through the different touchpoints.

In this post, we discuss how ATS enabled near-real-time event analytics at petabyte scale using Apache Hudi tables created by AWS Glue Spark jobs.

ATS was looking for ways to securely and cost-efficiently manage and derive analytical insights over petabyte-sized datasets, with data coming in from different sources at different paces, and stored over different storage solutions. You can gain deeper and richer insights when you bring together all your relevant data of all structures and types, from all sources, to analyze.

One of the main challenges that our data engineering team at ATS faced was bringing together all the data arriving in real time, and building a holistic view for our customers and partners. The majority of the orders placed through Amazon, one of the world’s largest online retailers, are operationalized by ATS for the transportation and logistics. ATS provides the business accurate and timely package delivery. ATS operations generate data at petabyte scale, so having the data available at their fingertips provides innumerable opportunities to improve operations through data-driven decision-making.

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. This framework more efficiently manages business requirements like data lifecycles and improves data quality. Hudi enables you to manage data at the record level in Amazon Simple Storage Service (Amazon S3) data lakes to simplify change data capture (CDC) and streaming data ingestion at petabyte scale, and helps handle data privacy use cases requiring record-level updates and deletes.

Solution overview

One of the biggest challenges ATS faced was handling data at petabyte scale with the need for constant inserts, updates, and deletes with minimal time delay, which reflects real business scenarios and package movement to downstream data consumers.

Their traditional data warehouses couldn’t scale to the size of the data nor the frequency of data ingestion. They needed to scale to hundreds of GBs of data across multiple data ingestion sources in order to derive near-real-time data for downstream consumers to use for data analytics that powered business-critical reports, dashboards, and visualizations. The data is also used for training machine learning models with overall service level agreements (SLAs) of 15 minutes for data ingestion and processing.

In this post, we show how we ingest data in real time in the order of hundreds of GBs per hour and run inserts, updates, and deletes on a petabyte-scale data lake using Apache Hudi tables loaded using AWS Glue Spark jobs and other AWS server-less services including AWS Lambda, Amazon Kinesis Data Firehose, and Amazon DynamoDB. AWS ProServe, working closely with ATS, built a data lake comprising of Apache Hudi tables on Amazon S3 created and populated using AWS Glue. A data pipeline was created that supports inserts, updates, and deletes at petabyte scale on the Apache Hudi tables using AWS Glue. To support real-time time ingestion, ATS also implemented a real-time data ingestion pipeline based on Kinesis Data Firehose, DynamoDB, and Amazon DynamoDB Streams.

To tackle the challenges we discussed, we decided to follow the “Divide et Impera” approach, and define two separate workstreams:

Stream-based – We ingested data from four different data sources and 11 datasets, and performed some initial data transformation and joins steps, honoring a time window that may vary from 3 hours to 2 weeks across all workloads. The event rate might go up to thousands of events per second, and events might have duplicates, arrive late, or not be in the correct order. Our objective was to understand in real time the transit status of a given package or truck, capture the current status of ATS operations in real time, and extend the current stream-based solution to offload and supplement the current extract, transform, and load (ETL) solution, based on Amazon Redshift.
Data lake – We wanted the ability to store petabytes of data and allow for merges between historical data (petabytes) with newly ingested data. The data retention policy extends to up to 5 years, which brings increased costs and reduces performance significantly. Our team requires access to near-real-time data (less than 15 minutes) from stream-based ingestion, with full GDPR compliance. Our objective was to merge stream-based ingested data files to derive a holistic view of the dataset at a certain point in time, with an SLA of under 15 minutes. Data lineage capabilities would also be nice to have.

Stream-based solution

The following diagram illustrates the architecture of our stream-based solution.

The flow of the solution is as follows:

Data is ingested from various sources in separate Firehose data streams, collected for up to 15 minutes and stored in S3 buckets.
Upon the arrival of every new file in Amazon S3, a Lambda function is triggered to insert data into a DynamoDB table associated with a specific data source or datasets.
With DynamoDB Streams, we trigger a second Lambda function that aggregates data in real time across the different DynamoDB tables by performing real-time DynamoDB table lookups. The ETL window is enforced using DynamoDB item TTL, so data is automatically deleted from the table after the TTL period expires.
After it’s transformed, data is collected in Amazon S3 passing through a Firehose delivery stream and is ready to be ingested into our data lake.

The solution allows us to do the following:

Ingest data in parallel, in real time, and at the desired scale from all the data sources
Scale on demand, and with minimal human operational overhead; this is achieved using an AWS Serverless technology stack
Implement our desired time window on a per-item base, reducing costs and the total amount of data stored
Implement ETL using Lambda functions in Python, thereby providing a tighter grasp over expressing the business logic
Access data on Amazon S3 before it’s ingested into our data lake, and allow customers and partners to consume data in raw format if needed

The data present in Amazon S3 represents the starting point for a seamless data lake integration.

Data lake ingestion

Moving into our data lake, the following diagram illustrates our architecture for data lake ingestion.

The core implementation in this architecture is the AWS Glue Spark ingestion job for the Hudi table; it represents the entry point for the incremental data processing pipeline.

AWS Glue Spark job runs with a concurrency of 1 and contains the logic for upsert and delete sequentially applied on the Hudi table. The sequencing of delete after upsert in the AWS Glue Spark job ensures, deletes are applied after upsert and the data consistency is maintained even in case of job reruns.

To use Apache Hudi v0.7 on AWS Glue jobs using PySpark, we imported the following libraries in the AWS Glue jobs, extracted locally from the master node of Amazon EMR:

hudi-spark-bundle_2.11-0.7.0-amzn-1.jar
spark-avro_2.11-2.4.7-amzn-1.jar

We recommend using Glue 3.0 with Hudi 0.9.0 connector rather than importing Hudi v0.7 jar files from EMR, for seamless integration and have more capabilities and features.

Before we insert data the Hudi table, we prepare it for push. To optimize for incremental merge, we take a fixed lookup window based on business use case considerations. We start by reading historical data in a given time window. See the following code:

# HUDI DATA READ
read_options = {
  'hoodie.datasource.query.type': 'snapshot'
}


# HUDI DATAFRAME  created  from target Hudi Table on S3
hudi_df = spark. \
  read. \
  format("hudi"). \
  options(*read_options). \
  load(config['target'] + "////*")

# Read Historical data set, load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery

# input_df is the INCREMENT DATAFRAME created from incrementally ingested data on S3
input_df = spark.read.format("csv"). options(header='true').load(config['incremental_path'])



window_year, window_month, window_day = year_month_day_window()
window_filter_clause = "year >= {} AND month >= {} and day >= {}".format(window_year, window_month, window_day)

# We merge it with the incoming newly available data: 

# Data from Hudi Table on S3, because our use case is global, id is unique else id + partitionPath = unique.
hudi_s3_df = hudi_df.select(col("node_id"),col("container_label"),col(config['sort_key'])).filter(window_filter_clause)

# Perform a left outer join between new data (input_df) and data present in S3 Hudi. (hudi_s3_df)
hudi_join_df = input_df.alias("incomingData").join(hudi_s3_df.alias("S3HudiData"), (input_df.node_id == hudi_s3_df.node_id) & (input_df.container_label == hudi_s3_df.container_label), "leftouter")

# As it's a Left Outer join, there might bew new records which aren't present on S3 Hudi. 

hudi_new_df = hudi_join_df.filter(col("S3HudiData.last_update_row_epoch_seconds").isNull()).drop(col("S3HudiData.node_id")).drop(col("S3HudiData.container_label")).drop(col("S3HudiData.last_update_row_epoch_seconds"))

# As it's a Left Outer join, Select the records where input_df.last_update_time > hudi_s3_df.last_update_time. 

hudi_updated_df = hudi_join_df.filter(col("S3HudiData.last_update_row_epoch_seconds").isNotNull() & (col("incomingData.last_update_row_epoch_seconds") > col("S3HudiData.last_update_row_epoch_seconds"))).drop(col("S3HudiData.node_id")).drop(col("S3HudiData.container_label")).drop(col("S3HudiData.last_update_row_epoch_seconds"))
hudi_final_df = hudi_new_df.union(hudi_updated_df)

#  After we prepare the data to be pushed in the Hudi table, we implement the Hudi table update using the following code:

(hudi_final_df.write.format(HUDI_FORMAT)
.option(TABLE_NAME, config['hudi_table_name'])
.option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
.option(PARTITIONPATH_FIELD_OPT_KEY,config["partition_keys"])
.option(KEYGENERATOR_CLASS_OPT_KEY, COMPLEX_KEYGENERATOR_CLASS_OPT_VAL)
.option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
.option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
.option(UPSERT_PARALLELISM, 1500)
.option('hoodie.payload.ordering.field',config["sort_key"])
.option(PAYLOAD_CLASS_OPT_KEY,'org.apache.hudi.common.model.DefaultHoodieRecordPayload')
.option(HIVE_PARTITION_FIELDS_OPT_KEY, config["partition_keys"])
.option(HIVE_DATABASE_OPT_KEY,config['hudi_database'])
.option(HIVE_TABLE_OPT_KEY,config['hudi_table_name'])
.option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
.option(HIVE_JDBC_SYNC,"false")
.option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, MULTIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL)
.option('hoodie.datasource.write.hive_style_partitioning', 'true')
# To switch to Global Bloom index, set the following configuration as GLOBAL_BLOOM.
.option('hoodie.index.type', 'GLOBAL_SIMPLE') 
.option('hoodie.simple.index.update.partition.path', 'true')
.option('hoodie.global.simple.index.parallelism', '500')
.mode("append")
.save(config['target']))

In the preceding code, config is a dictionary that includes all the Apache Hudi configurations. The AWS Glue Data Catalog is automatically synched after Hudi table creation, as part of the Glue job, reflecting the Amazon S3 partition structure. We can now query the data using Amazon Athena or Amazon Redshift Spectrum.

To comply with our strict internal ingestion SLA, we had to dedicate special attention to employing the right Hudi indexes and defining the right table type. For the latter, we analyzed the type of workload. Due to the analytic nature of the datasets and use case, we identified that the right configuration would be to use a COPY_ON_WRITE table, even if that was a compromise on the write performances but enhanced read performance.

For the former, we went through an experimentation phase. We started with a GLOBAL_BLOOM index, identifying an initial non-linear pattern for data writing performances.

Given the randomness and the time window specified for the input data, we have encountered a significant number of false positives, leading to reading the entire dataset back for comparison. Moreover, GLOBAL_BLOOM keeps increasing linearly corresponding to the data size, whereas GLOBAL_SIMPLE doesn’t bring this overhead (with a fixed lookup window) as can be observed in the diagram.

The graph represents the Total Time taken by Glue Hudi job(X-axis) over days (Y axis) as incoming data is merged with the historical data. leveraging GLOBAL_BLOOM. The graph in upper half, shows when same data was merged consecutively over days, non-linear time increase was observed. Lower half graph indicates Linear increase with a steep slope when new incoming data was merged with historical data.

GLOBAL_BLOOM wasn’t appropriate for our use case as the historical data spanned back to 5 years, and Glue job will not be able to meet the SLA demands. At this point, we investigated GLOBAL_SIMPLE indexes, reaching the expected performance patterns.

Our data lake solution allows us to do the following:

Ingest data files in a petabyte-scale data lake, with a 15-minute ingestion SLA from the moment we receive the data
Read the data at Peta Byte scale by leveraging Amazon S3 partitions (created by Glue jobs and mapped to Hudi partition logic) and faster lookups by using Hudi indexes
Use Hudi data lineage capabilities
Reduce costs for data storage, infrastructure maintenance, and development
Manage data governance using AWS Lake Formation, which allows partners and customers to query the data using their own tools, while allowing ATS to retain control over our data

Conclusion

In this post, we highlighted how ATS built a real-time fully serverless data ingestion platform, scaling up to thousands of events per second and merging with petabyte- sized historical data stored in a data lake in near-real time.

We built a petabyte-scale data lake solution based on Apache Hudi and AWS Glue that allows us to share our data within 15 minutes from ingestion with our partners and consumers, while retaining complete control over our data and automatically offloading costs for data consumption. This provides linear performance as data grows over time.

About Amazon Transportation Service

Amazon Transportation Service (ATS) is the middle mile of the transportation network of Amazon, connecting the fulfillment centers at one end and the delivery stations and post offices at the other end. We enable packages that are ordered and packaged from fulfillment centers that traverse across the European continent to be delivered in the final delivery station that does the house-to-house delivery.

About the Authors

Madhavan Sriram is a Manager, Data Science who comes with a wide experience across multiple enterprise organisations in the space of Big Data and Machine Learning technologies. He currently leads the Data Technology and Products team within Amazon Transportation Services (ATS) and builds data-intensive products for the transportation network within Amazon. In his free time, Madhavan enjoys photography and poetry.

Diego Menin is a Senior Data Engineer within the Data Technology and Products team. He comes with a wide experience across startups and enterprises with deep AWS expertise to develop scalable cloud-based data and analytics products. Within Amazon, he is the architect of Amazon’s transportation data lake and working heavily on streaming data and integration mechanisms with downstream applications through the data lake.

Gabriele Cacciola is a Senior Data Architect working for the Professional Service team with Amazon Web Services. Coming from a solid Startup experience, he currently helps enterprise customers across EMEA implement their ideas, innovate using the latest tech and build scalable data and analytics solutions to make critical business decisions. In his free time, Gabriele enjoys football and cooking.

Kunal Gautam is a Senior Big Data Architect at Amazon Web Services. Having experience in building his own Startup and working along with enterprises, he brings a unique perspective to get people, business and technology work in tandem for customers. He is passionate about helping customers in their digital transformation journey and enables them to build scalable data and advance analytics solutions to gain timely insights and make critical business decisions. In his spare time, Kunal enjoys Marathons, Tech Meetups and Meditation retreats.

Optimize performance and reduce costs for network analytics with VPC Flow Logs in Apache Parquet format

2021-10-13 Radhika Ravirala

Post Syndicated from Radhika Ravirala original https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/

VPC Flow Logs help you understand network traffic patterns, identify security issues, audit usage, and diagnose network connectivity on AWS. Customers often route their VPC flow logs directly to Amazon Simple Storage Service (Amazon S3) for long-term retention. You can then use a custom format conversion application to convert these text files into an Apache Parquet format to optimize the analytical processing of the log data and reduce the cost of log storage. This custom format conversion step added complexity, time to insight, and costs to the VPC flow log traffic analytics. Until today, VPC flow logs were delivered to Amazon S3 as raw text files in GZIP format.

Today, we’re excited to announce a new feature that delivers VPC flow logs in the Apache Parquet format, making it easier, faster, and more cost-efficient to analyze your VPC flow logs stored in Amazon S3. You can also deliver VPC flow logs to Amazon S3 with Hive-compatible S3 prefixes partitioned by the hour.

Apache Parquet is an open-source file format that stores data efficiently in columnar format, provides different encoding types, and supports predicate filtering. With good compression ratios and efficient encoding, VPC flow logs stored in Parquet reduce your Amazon S3 storage costs. When querying flow logs persisted in Parquet format with analytic frameworks, non-relevant data is skipped, requiring fewer reads on Amazon S3 and thereby improving query performance. To reduce query running time and cost with Amazon Athena and Amazon Redshift Spectrum, Apache Parquet is often the recommended file format.

In this post, we explore this new feature and how it can help you run performant queries on your flow logs with Athena.

Create flow logs with Parquet file format

To take advantage of this feature, simply create a new VPC flow log subscription with Amazon S3 as the destination using the AWS Management Console, AWS Command Line Interface (AWS CLI), or API. On the console, when creating new a VPC flow log subscription with Amazon S3, you can select one or more of the following options:

Log file format
Hive-compatible S3 prefixes
Partition logs by time

We now explore how each of these options can make processing and storage of flow logs more efficient

Apache Parquet formatted files

By default, your logs are delivered in text format. To change to Parquet, for Log file format, select Parquet. This delivers your VPC flow logs to Amazon S3 in the Apache Parquet format.

Note the following considerations:

You can’t change existing flow logs to deliver logs in Parquet format. You need to create a new VPC flow log subscription with Parquet as the log file format.
Consider using a higher maximum aggregation interval (10 minutes) when aggregating flow packets to ensure larger Parquet files on Amazon S3.
Refer to Amazon CloudWatch pricing for pricing of log delivery in Apache Parquet format for VPC flow logs

Hive-compatible partitions

Partitioning is a technique to organize your data to improve the efficiency of your query engine. Partitions aligned with the columns that are frequently used in the query filters can significantly lower your query response time. You can now specify that your flow logs be organized in Hive-compatible format. This allows you to run the MSCK REPAIR command in Athena to quickly and easily add new partitions as they get delivered into Amazon S3. Simply select Enable for Hive-compatible S3 prefix to set this up. This delivers the flow logs to Amazon S3 in the following path:

s3://my-flow-log-bucket/my-custom-flow-logs/AWSLogs/aws-account-id=123456789012/aws-service=vpcflowlogs/aws-region=us-east-1/year=2021/month=10/day=07/123456789012_vpcflowlogs_us-east-1_fl-06a0eeb1087d806aa_20211007T1930Z_d5ab7c14.log.parquet

Per-hour partitions

You can also organize your flow logs at a much more granular level by adding per-hour partitions. You should enable this feature if you constantly need to query large volumes of logs with a specific time frame as the predicate. Querying logs only during certain hours results in less data scanned, which translates to lower cost per query with engines such as Athena and Redshift Spectrum.

You can also set per-hour partitions via an API or the AWS CLI using the --destination-options parameter in create-flow-logs:

aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids vpc-001122333 \
--traffic-type ALL \
--log-destination-type s3 \
--log-destination arn:aws:s3:::my-flow-log-bucket/my-custom-flow-logs/ \
--destination-options FileFormat=parquet,HiveCompatiblePartitions=True, PerHourPartition=True

The following is a sample flow log file deposited into an hourly bucket. By default, the flow logs in Parquet are compressed using Gzip format, which has the highest compression ratio compared to other compression formats.

s3://my-flow-log-bucket/my-custom-flow-logs/AWSLogs/aws-account-id=123456789012/aws-service=vpcflowlogs/aws-region=us-east-1/year=2021/month=10/day=07/hour=19/123456789012_vpcflowlogs_us-east-1_fl-06a0eeb1087d806aa_20211007T1930Z_d5ab7c14.log.parquet

Query with Athena

You can use the Athena integration for VPC Flow Logs from the Amazon VPC console to automate the Athena setup and query VPC flow logs in Amazon S3. This integration has now been extended to support these new flow log delivery options to Amazon S3.

To demonstrate querying flow logs in Parquet and in plain text in this blog, let’s start from the Amazon Athena console. We begin by creating an external table pointing to flow logs in Parquet.

Note that this feature supports specifying flow logs fields in Parquet’s native data types. This eliminates the need for you to cast your fields when querying the traffic logs.

Then run MSCK REPAIR TABLE.

Let’s run a sample query on these Parquet-based flow logs.

Now, let’s create a table for flow logs delivered in plain text.

We add the partitions using the ALTER TABLE statement in Athena.

Run a simple flow logs query and note the time it took to run the query.

The Athena query run time with flow logs in Parquet (1.16 seconds) is much faster than the run time with flow logs in plain text (2.51 seconds).

For benchmarks that further describe the cost savings and performance improvements from persisting data in Parquet in granular partitions, see Top 10 Performance Tuning Tips for Amazon Athena.

Summary

You can now deliver your VPC flow logs to Amazon S3 with three new options:

In Apache Parquet formatted files
With Hive-compatible S3 prefixes
In hourly partitioned files

These delivery options make it faster, easier, and more cost-efficient to store and run analytics on your VPC flow logs. To learn more, visit VPC Flow Logs documentation. We hope you will give this feature a try and share your experience with us. Please send feedback to the AWS forum for Amazon VPC or through your usual AWS support contacts.

About the Authors

Radhika Ravirala is a Principal Streaming Architect at Amazon Web Services, where she helps customers craft distributed streaming applications using Amazon Kinesis and Amazon MSK. In her free time, she enjoys long walks with her dog, playing board games, and reading widely.

Vaibhav Katkade is a Senior Product Manager in the Amazon VPC team. He is interested in areas of network security and cloud networking operations. Outside of work, he enjoys cooking and the outdoors.

Offloading SQL for Amazon RDS using the Heimdall Proxy

2021-10-13 Antony Prasad Thevaraj

Post Syndicated from Antony Prasad Thevaraj original https://aws.amazon.com/blogs/architecture/offloading-sql-for-amazon-rds-using-the-heimdall-proxy/

Getting the maximum scale from your database often requires fine-tuning the application. This can increase time and incur cost – effort that could be used towards other strategic initiatives. The Heimdall Proxy was designed to intelligently manage SQL connections to help you get the most out of your database.

In this blog post, we demonstrate two SQL offload features offered by this proxy:

Automated query caching
Read/Write split for improved database scale

By leveraging the solution shown in Figure 1, you can save on development costs and accelerate the onboarding of applications into production.

Figure 1. Heimdall Proxy distributed, auto-scaling architecture

Why query caching?

For ecommerce websites with high read calls and infrequent data changes, query caching can drastically improve your Amazon Relational Database Sevice (RDS) scale. You can use Amazon ElastiCache to serve results. Retrieving data from cache has a shorter access time, which reduces latency and improves I/O operations.

It can take developers considerable effort to create, maintain, and adjust TTLs for cache subsystems. The proxy technology covered in this article has features that allow for automated results caching in grid-caching chosen by the user, without code changes. What makes this solution unique is the distributed, scalable architecture. As your traffic grows, scaling is supported by simply adding proxies. Multiple proxies work together as a cohesive unit for caching and invalidation.

View video: Heimdall Data: Query Caching Without Code Changes

Why Read/Write splitting?

It can be fairly straightforward to configure a primary and read replica instance on the AWS Management Console. But it may be challenging for the developer to implement such a scale-out architecture.

Some of the issues they might encounter include:

Replication lag. A query read-after-write may result in data inconsistency due to replication lag. Many applications require strong consistency.
DNS dependencies. Due to the DNS cache, many connections can be routed to a single replica, creating uneven load distribution across replicas.
Network latency. When deploying Amazon RDS globally using the Amazon Aurora Global Database, it’s difficult to determine how the application intelligently chooses the optimal reader.

The Heimdall Proxy streamlines the ability to elastically scale out read-heavy database workloads. The Read/Write splitting supports:

ACID compliance. Determines the replication lag and know when it is safe to access a database table, ensuring data consistency.
Database load balancing. Tracks the status of each DB instance for its health and evenly distribute connections without relying on DNS.
Intelligent routing. Chooses the optimal reader to access based on the lowest latency to create local-like response times. Check out our Aurora Global Database blog.

View video: Heimdall Data: Scale-Out Amazon RDS with Strong Consistency

Customer use case: Tornado

Hayden Cacace, Director of Engineering at Tornado

Tornado is a modern web and mobile brokerage that empowers anyone who aspires to become a better investor.

Our engineering team was tasked to upgrade our backend such that it could handle a massive surge in traffic. With a 3-month timeline, we decided to use read replicas to reduce the load on the main database instance.

First, we migrated from Amazon RDS for PostgreSQL to Aurora for Postgres since it provided better data replication speed. But we still faced a problem – the amount of time it would take to update server code to use the read replicas would be significant. We wanted the team to stay focused on user-facing enhancements rather than server refactoring.

Enter the Heimdall Proxy: We evaluated a handful of options for a database proxy that could automatically do Read/Write splits for us with no code changes, and it became clear that Heimdall was our best option. It had the Read/Write splitting “out of the box” with zero application changes required. And it also came with database query caching built-in (integrated with Amazon ElastiCache), which promised to take additional load off the database.

Before the Tornado launch date, our load testing showed the new system handling several times more load than we were able to previously. We were using a primary Aurora Postgres instance and read replicas behind the Heimdall proxy. When the Tornado launch date arrived, the system performed well, with some background jobs averaging around a 50% hit rate on the Heimdall cache. This has really helped reduce the database load and improve the runtime of those jobs.

Using this solution, we now have a data architecture with additional room to scale. This allows us to continue to focus on enhancing the product for all our customers.

Download a free trial from the AWS Marketplace.

Resources

Blog: Read/Write Split with ACID Compliance
Blog: Automated Query Caching
Blog: Advanced Connection Pooling
Heimdall Data technical documentation
Contact: [email protected]

Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced Tier ISV partner. They have Amazon Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL improving database scale. Deployment does not require code changes. For other proxy options, consider the Amazon RDS Proxy, PgBouncer, PgPool-II, or ProxySQL.

Use Amazon Athena and Amazon QuickSight in a cross-account environment

2021-09-30 Lotfi Mouhib

Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/use-amazon-athena-and-amazon-quicksight-in-a-cross-account-environment/

Many AWS customers use a multi-account strategy to host applications for different departments within the same company. However, you might deploy services like Amazon QuickSight using a single-account approach, which raises challenges when you need to use QuickSight in combination with Amazon Athena to build reports and dashboards. With the recently announced built-in support for cross-account Data Catalogs in Athena, you can now use AWS Glue Data Catalogs in different accounts to create datasets, and build reports and dashboards from a single AWS account using QuickSight and Athena – creating a serverless data visualization solution that lets you share insights from all your data to all your users.

In this post, I show you how to use this new feature to set up cross-account access to Athena for QuickSight.

Solution overview

To set up cross-account access, you complete the following steps:

Grant QuickSight cross-account access to an AWS Glue Data Catalog.
Register the Data Catalog in Athena.
Grant QuickSight cross-account access to an Amazon Simple Storage Service (Amazon S3) bucket.
Add the shared bucket to QuickSight.
Connect QuickSight to Athena.

The following architecture shows the deployment steps.

athena-quicksight-cross-account-architecture

Grant cross-account access for the Data Catalog

QuickSight uses a service role to interact with other AWS services. QuickSight creates this role for you under the name aws-quicksight-s3-consumers-role-v0. You need this role to allow access to the Data Catalog cross-account share. To allow the QuickSight service role (Account A, the borrower account) to access the Data Catalog (Account B, the owner account), you need to grant cross-account access by updating the AWS Glue resource policy.

In the AWS account of the Data Catalog, complete the following steps:

On the AWS Glue console, choose Settings in the navigation pane.

Under Permissions, enter the following resource policy:

{
  “Version” : “2012-10-17”,
  “Statement” : [
{
    “Effect” : “Allow”,
    “Principal” : {
      “AWS” : “arn:aws:iam::<Account A>:role/service-role/aws-quicksight-s3-consumers-role-v0”
    },
    “Action”: [
                “glue:SearchTables”,
                “glue:GetDatabase”,
                “glue:GetPartition”,
                “glue:GetTables”,
                “glue:GetDatabases”,
                “glue:GetTable”
            ],
    “Resource” : [ “arn:aws:glue:<region>:<Account B>:catalog”, “arn:aws:glue:<region>:<Account B>:database/*”, “arn:aws:glue:<region>:<Account B>:table/*” ]
  } ]
}

Choose Save.

The resource policy gives QuickSight access to all the databases and tables in the Data Catalog. You can further scope it down by adding the name of the tables and databases to the resource element.

The following screenshot shows the Settings page on the AWS Glue console and the catalog UI for updating the resource permission.

glue-catalog-screenshot-ui

Register the Data Catalog in Athena

Now you need to register the shared Data Catalog with Athena in the AWS account (borrower) that hosts QuickSight.

On the Athena console, choose Data sources in the navigation pane.
Choose Connect data source.
For Choose where your data is located, select Query data in Amazon S3.
For Choose a metadata catalog, select AWS Glue Data Catalog.
Choose AWS Glue Data Catalog in another account.
For Connection details, enter a Data Catalog name, optional description, and the Data Catalog owner’s AWS account ID.
Choose Register.

When you complete these steps, you can see the borrowed catalog on the Data sources page on the Athena console.

borrowed-catalog-athena-ui

Grant QuickSight cross-account access to an S3 bucket

Creating a resource policy on the Data Catalog to allow cross-account access for QuickSight is not sufficient. You also need to grant QuickSight access to the S3 bucket where the data is stored. You use the same QuickSight service role that we used for the Data Catalog to update the S3 bucket policy.

In the account of the Data Catalog, complete the following steps:

On the Amazon S3 console, choose Buckets.
Choose the bucket that you want to create a policy for, or whose policy you want to edit.
Choose Permissions.

Enter the following policy:

{
    "Version": "2012-10-17",
    "Id": "Policy1621366959711",
    "Statement": [
{
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account A>:role/service-role/aws-quicksight-s3-consumers-role-v0"
            },
            "Action": ["s3:ListBucket", "s3:GetObject", "s3:GetObjectVersion" ],
            "Resource": [
                "arn:aws:s3:::<name-of-bucket>",
                "arn:aws:s3:::<name-of-bucket>/*"
            ]
        }
    ]
}

Choose Save changes.

Add the shared S3 bucket to QuickSight

The last step before you can connect QuickSight to Athena is to add the S3 bucket (Account B) as a resource that the QuickSight service role (Account A) can access. To allow your QuickSight service role access to the S3 bucket in another account, perform the following steps:

On the QuickSight console, on the account drop-down menu, choose Manage QuickSight.
Choose Security & permissions.
Choose Add or remove.
Choose Details.
Choose Select S3 buckets.
Under Use a different bucket, add your bucket.
Choose Finish.

Connect QuickSight to Athena

After you set up the necessary permissions, you can follow the instruction in this section to add a dataset in Athena by using the remote (borrowed) Data Catalog.

On the Athena console, choose Datasets in the navigation pane.
Choose New dataset.
Create a new connection profile by providing a data source name and Athena workgroup.
Choose Validate connection.
Choose Create data source.
In the Choose your table section, for Catalog, choose the catalog you created in Athena.
Choose a database and table and click on Select.
Choose Edit/Preview data.
To create a dataset and analyse the data using the table, choose Visualize.

Conclusion

This post showed how to use the built-in support for a cross-account Data Catalog in Athena with Quicksight when the Data Catalog and the S3 bucket containing the data are in a different account. This feature greatly reduces operational overhead by having a single account managing the Data Catalog and its data.

After you set up your data sources, you can join your data across these various sources. You can also use these new data sources to gain further insights from your data by setting up ML Insights in QuickSight and set graphical representations of your data using QuickSight visuals.

author Lotfi is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Simplify your data analysis with Amazon Redshift Query Editor v2

2021-09-30 Srikanth Sopirala

Post Syndicated from Srikanth Sopirala original https://aws.amazon.com/blogs/big-data/simplify-your-data-analysis-with-amazon-redshift-query-editor-v2/

Amazon Redshift is a fast, fully managed cloud data warehouse that provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API. Tens of thousands of customers use Amazon Redshift as their analytics platform. Data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift Query Editor v2 is a web-based SQL client application that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts and collaborate by sharing queries with members of your team.

Query Editor v2 provides several capabilities, such as the ability to browse and explore multiple databases, external tables, views, stored procedures, and user-defined functions. It provides wizards to create schemas, tables, and user-defined functions. It simplifies the management and collaboration of saved queries. You can also gain faster insights by visualizing the results with a single click.

Query Editor v2 enhances and builds upon the functionality of the prior version of the query editor, such as increased size of queries, the ability to author and run multi-statement queries, support for session variables, and query parameters, to name a few.

You can provide Query Editor v2 to end-users such as data analysts, database developers, and data scientists without providing the privileges required to access the Amazon Redshift console.

In this post, we walk through how to create an AWS Identity and Access Management (IAM) role to provide access to Query Editor v2 for end-users, easily connect to your clusters, run SQL queries, load data in your clusters, create charts, and share queries directly from the console.

Configure Query Editor v2 for your AWS account

As an admin, you must first configure Query Editor v2 before providing access to your end-users.

You can access Query Editor v2 from the Amazon Redshift console.

When you choose Query Editor v2 from the Editor options, a new tab in your browser opens with the Query Editor v2 interface.

By default, an AWS-owned key is used to encrypt resources. Optionally, you can create a symmetric customer managed key to encrypt Query Editor v2 resources such as saved queries and query results using the AWS Key Management Service (AWS KMS) console or AWS KMS API operations.

Provide access to Query Editor v2 for your end-users

Enterprises want to democratize access to data in the data warehouse securely by providing a web-based query editor to their end-users. You can either use IAM users or integrate the AWS console with your single sign-on (SSO) provider to provide access to end-users. In a future post, we will document all necessary steps to integrate your SSO provider with the query editor.

To enable your users to access Query Editor v2 using IAM, as an administrator, you can attach one of the AWS-managed policies depicted in the following table to the IAM user or role to grant permission. These managed policies also give access to other required services. You can create your custom-managed policy if you want to customize permissions for your end-users.

Policy	Description
AmazonRedshiftQueryEditorV2FullAccess	Grants full access to Query Editor v2 operations and resources. This is primarily intended for administrators.
AmazonRedshiftQueryEditorV2NoSharing	Grants the ability to work with Query Editor v2 without sharing resources. Users can’t share their queries with their team members.
AmazonRedshiftQueryEditorV2ReadSharing	Grants the ability to work with Query Editor v2 with limited sharing of resources. The granted principal can read the saved queries shared with its team but can’t update them.
AmazonRedshiftQueryEditorV2ReadWriteSharing	Grants the ability to work with Query Editor v2 with sharing of resources. The granted principal can read and update the shared resources with its team.

For example, if you have a group of users as a part of marketing_group, and you want them to collaborate between themselves by sharing their queries, you can create an IAM role for them and assign the AmazonRedshiftQueryEditorV2ReadSharing policy. You can also tag the role with sqlworkbench-team as marketing_group.

You can use the IAM console to attach IAM policies to an IAM user or an IAM role. After you attach a policy to a role, you can attach the role to an IAM user.

To attach the IAM policies to an IAM role, complete the following steps:

On the IAM console, choose Roles.
Choose the role that needs access to Query Editor v2. Assume the name of the role as marketing_role.
Choose Attach policies.
For Policy names, choose the policies that we described previously based on your requirement.
Choose Attach policy.

Now you can add the marketing_group tag for an IAM role.

In the navigation pane, choose Roles and select the name of the role that you want to edit.
Choose the Tags tab and choose Add tags.
Add the tag key sqlworkbench-team and the value marketing_group.
Choose Save changes.

Now the end-users with marketing_role can access Query Editor v2 with limited sharing of resources.

Work with Query Editor v2

You can use Query Editor v2 to author and run queries, visualize results, and share your work with your team. With Query Editor v2, you can create databases, schemas, tables, and user-defined functions (UDFs) with visual wizards. In a tree-view panel, for each of your clusters, you can view its schemas. For each schema, you can view its tables, views, functions (UDFs), and stored procedures.

Open Query Editor v2

After you log in to the console and navigate to Query Editor v2, you see a page like the following screenshot.

Query Editor v2 now provides a more IDE-like experience to its users and offers both dark and light themes. You can switch between themes by choosing the moon icon at the lower left of the page.

The left navigation pane shows the list of clusters that you have access to. If you don’t have an Amazon Redshift cluster, use the Getting Started with Amazon Redshift cluster with sample data option. In this post, we use the sample data (Tickets database) as examples.

Connect to an Amazon Redshift database

You can connect to a cluster by choosing a cluster and entering your credentials.

You can connect using either a database user name and password or temporary credentials. Query Editor v2 creates a secret on your behalf stored in AWS Secrets Manager. This secret contains credentials to connect to your database. With temporary credentials, Query Editor v2 generates a temporary password to connect to the database.

Browse a database

You can browse one or more databases in the cluster that you’re connected to. Within a database, you can manage schemas, tables, views, functions, and stored procedures in the tree-view panel. If you have integrated your cluster with the AWS Glue Data Catalog, you see the Data Catalog schema and external tables. Similarly, you can browse the external tables if you create external schemas using Amazon Redshift data sharing, Amazon Redshift Spectrum, or federated queries.

You can perform an operation on an object choosing it (right-click) and choosing from the menu options.

Author and run queries

Query Editor v2 allows you to run your queries by selecting a specific database. If you have multiple databases, make sure that you choose the correct database.

You can enter a query in the editor or select a saved query from the Queries list and choose Run. The query editor provides several shortcuts for using with your query editor, and you can access that by choosing the content assist option.

By default, Limit 100 is set to limit the results to 100 rows. You can turn off this option to return a more extensive result set. If you turn off this option, you can include the LIMIT option in your SQL statement to avoid very large result sets.

Use multiple SQL statements in a query

The query editor supports multiple queries, session variables, and temporary tables. If you have multiple SQL statements and you run the query, the results are displayed on various tabs.

Run long queries

You don’t have to wait for long queries to complete to view results. The queries run even if the browser window is closed. You can view the results the next time you log in to Query Editor v2.

Run parameterized queries

You can use parameters with your query instead of hardcoding certain values, as in the following code:

SELECT sum(qtysold) 
FROM   sales, date 
WHERE  sales.dateid = date.dateid 
AND    sellerId >= ${sellerid};

When you run a query with a parameter, you’re prompted with a form.

Run the explain plan

You can optimize your queries by turning on the Explain option to display a query plan in the results area. You can choose Save to save the query to the Queries folder.

Export results

You can export the query results on the current page to a file in JSON or CSV format. To save the file in the format you want, open the context menu (right-click) in the results area, then choose Export current page and either JSON or CSV. You can also select rows and export the results for specific rows.

Visual analysis of your results

You can perform a visual analysis of your results for a query by turning on Chart to display a graphic visualization of the results. Choose Traces to display the results as a chart. For Type, choose the style of chart as Bar, Line, and so on. For Orientation, you can choose Vertical or Horizontal. For X, select the table column that you want to use for the horizontal axis. For Y, choose the table column that you want to use for the vertical axis.

Choose Refresh to update the chart display. Choose Fullscreen to expand the chart display.

To create a chart, complete the following steps:

Run a query and get results.
Turn on Chart.
Choose a chart style from the available options.

Choose Trace and start to visualize your data.
Choose Style to customize the appearance, including colors, axes, legend, and annotations.
Choose Annotations to add text, shapes, and images.

For certain chart types, you can add transforms to filter, spilt, aggregate, and sort the underlying data for the chart.

You can also save, export, and browse the charts you created.

Collaborate and share with your team members

You can share queries with others on your team. As we discussed earlier, an administrator sets up a team based on the IAM policy associated with an IAM user or IAM role. For example, if you’re a member of marketing_group, you can share your queries with your team members.

Save, organize and browse queries

Before you can share your query with your team, save your query. You can also view and delete saved queries.

To save your query, choose Save, enter a title, and choose Save again.

To browse for saved queries, choose Queries from the navigation pane. You can view queries that are My queries, Shared by me, or Shared to my team. These queries can appear as individual queries or within folders you created.

Organize your queries with folders

You can organize your queries by creating folders and dragging and dropping a saved query to a folder.

Share a query

You can share your queries with your team.

Choose Queries in the navigation pane.
Open the context menu (right-click) of the query that you want to share.
Choose Share with my team.

Manage query versions

You can also view the history of saved queries and manage query versions. Every time you save an SQL query, Query Editor v2 saves it as a new version. You can view or store 20 different versions of your query and browse earlier query versions, save a copy of a query, or restore a query.

Choose Queries in the navigation pane.
Open the context menu (right-click) for the query that you want to work with.
Choose Version history to open a list of versions of the query.
On the Version history page, choose one of the following options:
- Revert to selected – Revert to the selected version and continue your work with this version.
- Save selected as – Create a new query in the editor.

Conclusion

In this post, we introduced you to Amazon Redshift Query Editor v2, which has a rich set of features to manage and run your SQL statements securely that provide you with several capabilities, such as ability to browse and explore multiple databases, external tables, views, stored procedures, and user-defined functions. It provides wizards to create schemas, tables, and user-defined functions. Query Editor v2 simplifies management and collaboration of saved queries and improves the ability to analyze and visualize results with a single click.

If you have any questions or suggestions, please leave a comment.

Happy querying!

About the Author

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road cycling.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Eren Baydemir, a Technical Product Manager at AWS, has 15 years of experience in building customer facing products and is currently creating data analytics solutions in the Amazon Redshift team. He was the CEO and co-founder of DataRow which was acquired by Amazon in 2020.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems, in order to deliver solutions that exceed expectations.

Introducing Amazon Redshift Query Editor V2, a Free Web-based Query Authoring Tool for Data Analysts

2021-09-29 Alex Casalboni

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/amazon-redshift-query-editor-v2-web-query-authoring/

When it comes to manipulating and analyzing relational data, Structured Query Language (SQL) has been an international standard since 1986, a couple of years before I was born. And yet, it sometimes takes hours to get access to a new database or data warehouse, configure credentials or single sign-on, download and install multiple desktop libraries or drivers, and get familiar with the new schema—all this before you even run a query. Not to mention the challenge of sharing queries, results, and analyses securely between members of the same team or across teams.

Today, I’m glad to announce the general availability of Amazon Redshift Query Editor V2, a web-based tool that you can use to explore, analyze, and share data using SQL. It allows you to explore, analyze, share, and collaborate on data stored on Amazon Redshift. It supports data warehouses on Amazon Redshift and data lakes through Amazon Redshift Spectrum.

Amazon Redshift Query Editor V2 provides a free serverless web interface that reduces the operational costs of managing query tools and infrastructure. Because it’s a managed SQL editor in your browser and it’s integrated with your single sign-on provider, the Query Editor V2 reduces the number of steps to the first query so you gain insights faster. You also get in-place visual analysis of query results (no data download required), all in one place. As an additional team productivity boost, it improves collaboration with saved queries and the ability to share results and analyses between users.

From a security standpoint, analysts can access Query Editor V2 without requiring any admin privileges on the Amazon Redshift cluster, using an IAM role for READ, WRITE, or ADMIN access. Check out the documentation for more details.

Connection Setup for Amazon Redshift Query Editor V2
First, you’ll need to configure the connection to your Amazon Redshift cluster.

After you have configured the connection, you can reuse it for future sessions. And, of course, you can edit or delete a connection at any time.

Simply click on a cluster to connect with Query Editor V2.

Amazon Redshift Query Editor V2 in Action
The web interface allows you to browse schemas, tables, views, functions, and stored procedures. You can also preview a table’s columns with one click and create or delete schemas, tables, or functions.

The interface is intuitive for newcomers and expert users alike. You can resize panels, create tabs, and configure your editor preferences.

Running or explaining a query is quite straightforward: You simply write (or paste) the query and choose Run. You can visualize and interact with the result set in the bottom pane. For example, you might want to change the row ordering or search for a specific word. Even though Amazon Redshift Query Editor V2 is a browser-based tool, the data movement between your browser and the Amazon Redshift cluster is optimized, so your browser doesn’t need to download any raw data. A lot of the filtering and reordering happens directly in the browser, without any wait time.

To export a result set as a JSON or CSV file on your local machine, simply right-click it.

So far so good! Running queries is the minimum you’d expect from a Query Editor. Let’s have a look at some of the more interesting features.

Team Collaboration with Amazon Redshift Query Editor V2
Amazon Redshift Query Editor V2 allows you to manage the permissions of your team members based on their IAM roles, so that you can easily share queries and cluster access in a secure way.

For example, you can use IAM managed policies such as AmazonRedshiftQueryEditorV2FullAccess, AmazonRedshiftQueryEditorV2ReadSharing, or AmazonRedshiftQueryEditorV2ReadWriteSharing. Also, don’t forget to include the redshift:GetClusterCredentials permission.

After you’ve set up the IAM roles for your team, choose Save to save a query.

The Untitled tab will show the query name. From now on, you edit this saved query to make updates and then choose Save again.

Individual users with WRITE access can run, edit, and delete shared queries, while users with READ access can only run shared queries.

If you work on multiple projects and collaborate with many different teams, it might be difficult to remember query names or even find them in a long list. In Amazon Redshift Query Editor V2, saved and shared queries are available from the left navigation in Queries. You can keep your queries organized into folders. Even nested folders are supported.

Last but not least, each saved query is versioned and the version history is always available. That’s pretty useful when you need to restore an older version.

Plot Your Queries with Amazon Redshift Query Editor V2
Sharing queries with teammates is great, but wouldn’t it even better if you could visualize a result set, export it as PNG or JPEG, and save the chart for later? Amazon Redshift Query Editor V2 allows you to perform in-place visualizations of your results. When you’re happy with the look and feel of your chart, you can save it for later and organize all your saved charts into folders. This allows you to simply choose a saved chart, rerun the corresponding query, and export the new image. No need to configure the plot from scratch or remember the configuration of hundreds of charts and queries across different projects.

Available Today
Amazon Redshift Query Editor V2 is available today in all commercial AWS Regions, except AP-Northeast-3 regions. It requires no license and it’s free, except for the cost for your Amazon Redshift cluster.

You can interact with the service using the Amazon Redshift console. It doesn’t require any driver or software on your local machine.

For more information, see the Amazon Redshift Query Editor V2 technical documentation or take a look at this video:

We look forward to your feedback.

— Alex

Integral Ad Science secures self-service data lake using AWS Lake Formation

2021-09-23 Mat Sharpe

Post Syndicated from Mat Sharpe original https://aws.amazon.com/blogs/big-data/integral-ad-science-secures-self-service-data-lake-using-aws-lake-formation/

This post is co-written with Mat Sharpe, Technical Lead, AWS & Systems Engineering from Integral Ad Science.

Integral Ad Science (IAS) is a global leader in digital media quality. The company’s mission is to be the global benchmark for trust and transparency in digital media quality for the world’s leading brands, publishers, and platforms. IAS does this through data-driven technologies with actionable real-time signals and insight.

In this post, we discuss how IAS uses AWS Lake Formation and Amazon Athena to efficiently manage governance and security of data.

The challenge

IAS processes over 100 billion web transactions per day. With strong growth and changing seasonality, IAS needed a solution to reduce cost, eliminate idle capacity during low utilization periods, and maximize data processing speeds during peaks to ensure timely insights for customers.

In 2020, IAS deployed a data lake in AWS, storing data in Amazon Simple Storage Service (Amazon S3), cataloging its metadata in the AWS Glue Data Catalog, ingesting and processing using Amazon EMR, and using Athena to query and analyze the data. IAS wanted to create a unified data platform to meet its business requirements. Additionally, IAS wanted to enable self-service analytics for customers and users across multiple business units, while maintaining critical controls over data privacy and compliance with regulations such as GDPR and CCPA. To accomplish this, IAS needed to securely ingest and organize real-time and batch datasets, as well as secure and govern sensitive customer data.

To meet the dynamic nature of IAS’s data and use cases, the team needed a solution that could define access controls by attribute, such as classification of data and job function. IAS processes significant volumes of data and this continues to grow. To support the volume of data, IAS needed the governance solution to scale in order to create and secure many new daily datasets. This meant IAS could enable self-service access to data from different tools, such as development notebooks, the AWS Management Console, and business intelligence and query tools.

To address these needs, IAS evaluated several approaches, including a manual ticket-based onboarding process to define permissions on new datasets, many different AWS Identity and Access Management (IAM) policies, and an AWS Lambda based approach to automate defining Lake Formation table and column permissions triggered by changes in security requirements and the arrival of new datasets.

Although these approaches worked, they were complex and didn’t support the self-service experience that IAS data analysts required.

Solution overview

IAS selected Lake Formation, Athena, and Okta to solve this challenge. The following architectural diagram shows how the company chose to secure its data lake.

The solution needed to support data producers and consumers in multiple AWS accounts. For brevity, this diagram shows a central data lake producer that includes a set of S3 buckets for raw and processed data. Amazon EMR is used to ingest and process the data, and all metadata is cataloged in the data catalog. The data lake consumer account uses Lake Formation to define fine-grained permissions on datasets shared by the producer account; users logging in through Okta can run queries using Athena and be authorized by Lake Formation.

Lake Formation enables column-level control, and all Amazon S3 access is provisioned via a Lake Formation data access role in the query account, ensuring only that service can access the data. Each business unit with access to the data lake is provisioned with an IAM role that only allows limited access to:

That business unit’s Athena workgroup
That workgroup’s query output bucket
The lakeformation:GetDataAccess API

Because Lake Formation manages all the data access and permissions, the configuration of the user’s role policy in IAM becomes very straightforward. By defining an Athena workgroup per business unit, IAS also takes advantage of assigning per-department billing tags and query limits to help with cost management.

Define a tag strategy

IAS commonly deals with two types of data: data generated by the company and data from third parties. The latter usually includes contractual stipulations on privacy and use.

Some data sets require even tighter controls, and defining a tag strategy is one key way that IAS ensures compliance with data privacy standards. With the tag-based access controls in Lake Formation IAS can define a set of tags within an ontology that is assigned to tables and columns. This ensures users understand available data and whether or not they have access. It also helps IAS manage privacy permissions across numerous tables with new ones added every day.

At a simplistic level, we can define policy tags for class with private and non-private, and for owner with internal and partner.

As we progressed, our tagging ontology evolved to include individual data owners and data sources within our product portfolio.

Apply tags to data assets

After IAS defined the tag ontology, the team applied tags at the database, table, and column level to manage permissions. Tags are inherited, so they only need to be applied at the highest level. For example, IAS applied the owner and class tags at the database level and relied on inheritance to propagate the tags to all the underlying tables and columns. The following diagram shows how IAS activated a tagging strategy to distinguish between internal and partner datasets , while classifying sensitive information within these datasets.

Only a small number of columns contain sensitive information; IAS relied on inheritance to apply a non-private tag to the majority of the database objects and then overrode it with a private tag on a per-column basis.

The following screenshot shows the tags applied to a database on the Lake Formation console.

With its global scale, IAS needed a way to automate how tags are applied to datasets. The team experimented with various options including string matching on column names, but the results were unpredictable in situations where unexpected column names are used (ipaddress vs. ip_address, for example). Ultimately, IAS incorporated metadata tagging into its existing infrastructure as code (IaC) process, which gets applied as part of infrastructure updates.

Define fine-grained permissions

The final piece of the puzzle was to define permission rules to associate with tagged resources. The initial data lake deployment involved creating permission rules for every database and table, with column exclusions as necessary. Although these were generated programmatically, it added significant complexity when the team needed to troubleshoot access issues. With Lake Formation tag-based access controls, IAS reduced hundreds of permission rules down to precisely two rules, as shown in the following screenshot.

When using multiple tags, the expressions are logically ANDed together. The preceding statements permit access only to data tagged non-private and owned by internal.

Tags allowed IAS to simplify permission rules, making it easy to understand, troubleshoot, and audit access. The ability to easily audit which datasets include sensitive information and who within the organization has access to them made it easy to comply with data privacy regulations.

Benefits

This solution provides self-service analytics to IAS data engineers, analysts, and data scientists. Internal users can query the data lake with their choice of tools, such as Athena, while maintaining strong governance and auditing. The new approach using Lake Formation tag-based access controls reduces the integration code and manual controls required. The solution provides the following additional benefits:

Meets security requirements by providing column-level controls for data
Significantly reduces permission complexity
Reduces time to audit data security and troubleshoot permissions
Deploys data classification using existing IaC processes
Reduces the time it takes to onboard data users including engineers, analysts, and scientists

Conclusion

When IAS started this journey, the company was looking for a fully managed solution that would enable self-service analytics while meeting stringent data access policies. Lake Formation provided IAS with the capabilities needed to deliver on this promise for its employees. With tag-based access controls, IAS optimized the solution by reducing the number of permission rules from hundreds down to a few, making it even easier to manage and audit. IAS continues to analyze data using more tools governed by Lake Formation.

About the Authors

Mat Sharpe is the Technical Lead, AWS & Systems Engineering at IAS where he is responsible for the company’s AWS infrastructure and guiding the technical teams in their cloud journey. He is based in New York.

Brian Maguire is a Solution Architect at Amazon Web Services, where he is focused on helping customers build their ideas in the cloud. He is a technologist, writer, teacher, and student who loves learning. Brian is the co-author of the book Scalable Data Streaming with Amazon Kinesis.

Danny Gagne is a Solutions Architect at Amazon Web Services. He has extensive experience in the design and implementation of large-scale high-performance analysis systems, and is the co-author of the book Scalable Data Streaming with Amazon Kinesis. He lives in New York City.

Accelerate your data warehouse migration to Amazon Redshift – Part 4

2021-09-20 Michael Soo

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-4-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the fourth in a series of posts. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and potentially reduce your overall cost to migrate to Amazon Redshift.

Check out the previous posts in the series:

Accelerate your data warehouse migration to Amazon Redshift – Part 1 to learn more about macro conversion, case-insensitive string comparison, and other new features
Accelerate your data warehouse migration to Amazon Redshift – Part 2 to learn about automation for proprietary data types
Accelerate your data warehouse migration to Amazon Redshift – Part 3 to learn about automation for proprietary SQL statements

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating from self-managed data warehouse engines, like Teradata, to Amazon Redshift. In these cases, you typically have terabytes or petabytes of data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over a few years (or decades) of use.

Until now, migrating a data warehouse to AWS was complex and involved a significant amount of manual effort. You needed to manually remediate syntax differences, inject code to replace proprietary features, and manually tune the performance of queries and reports on the new platform.

For example, you may have a significant investment in BTEQ (Basic Teradata Query) scripting for database automation, ETL, or other tasks. Previously, you needed to manually recode these scripts as part of the conversion process to Amazon Redshift. Together with supporting infrastructure (job scheduling, job logging, error handling), this was a significant impediment to migration.

Today, we’re happy to share with you a new, purpose-built command line tool called Amazon Redshift RSQL. Some of the key features added in Amazon Redshift RSQL are enhanced flow control syntax and single sign-on support. You can also describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive Metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition, and tables shared using Amazon Redshift data sharing.

We have also enhanced the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts. The converted scripts run on Amazon Redshift with little to no changes.

In this post, we describe some of the features of Amazon Redshift RSQL, show example scripts, and demonstrate how to convert BTEQ scripts into Amazon Redshift RSQL scripts.

Amazon Redshift RSQL features

If you currently use Amazon Redshift, you may already be running scripts on Amazon Redshift using the PSQL command line client. These scripts operate on Amazon Redshift RSQL with no modification. You can think of Amazon Redshift RSQL as an Amazon Redshift-native version of PSQL.

In addition, we have designed Amazon Redshift RSQL to make it easy to transition BTEQ scripts to the tool. The following are some examples of Amazon Redshift RSQL commands that make this possible. (For full details, see Amazon Redshift RSQL.)

\EXIT – This command is an extension of the PSQL \quit command. Like \quit, \EXIT terminates the execution of Amazon Redshift RSQL. In addition, you can specify an optional exit code with \EXIT.
\LOGON – This command creates a new connection to a database. \LOGON is an alias for the PSQL \connect command. You can specify connection parameters using positional syntax or as a connection string.
\REMARK – This command prints the specified string to the output. \REMARK extends the PSQL \echo command by adding the ability to break the output over multiple lines, using // as a line break.
\RUN – This command runs the Amazon Redshift RSQL script contained in the specified file. \RUN extends the PSQL \i command by adding an option to skip any number of lines in the specified file.
\OS – This is an alias for the PSQL \! command. \OS runs the operating system command that is passed as a parameter. Control returns to Amazon Redshift RSQL after running the OS command.
\LABEL – This is a new command for Amazon Redshift RSQL. \LABEL establishes an entry point for execution, as the target for a \GOTO command.
\GOTO – This command is a new command for Amazon Redshift RSQL. It’s used in conjunction with the \LABEL command. \GOTO skips all intervening commands and resumes processing at the specified \LABEL. The \LABEL must be a forward reference. You can’t jump to a \LABEL that lexically precedes the \GOTO.
\IF (\ELSEIF, \ELSE, \ENDIF) – This command is an extension of the PSQL \if (\elif, \else, \endif) command. \IF and \ELSEIF support arbitrary Boolean expressions including AND, OR, and NOT conditions. You can use the \GOTO command within a \IF block to control conditional execution.
\EXPORT – This command specifies the name of an export file that Amazon Redshift RSQL uses to store database information returned by a subsequent SQL SELECT statement.

We’ve also added some variables to Amazon Redshift RSQL to support converting your BTEQ scripts.

:ACTIVITYCOUNT – This variable returns the number of rows affected by the last submitted request. For a data-returning request, this is the number of rows returned to Amazon Redshift RSQL from the database. ACTIVITYCOUNT is similar to the PSQL variable ROW_COUNT, however, ROW_COUNT does not report affected-row count for SELECT, COPY or UNLOAD.
:ERRORCODE – This variable contains the return code for the last submitted request to the database. A zero signifies the request completed without error. The ERRORCODE variable is an alias for the variable SQLSTATE.
:ERRORLEVEL – This variable assigns severity levels to errors. Use the severity levels to determine a course of action based on the severity of the errors that Amazon Redshift RSQL encounters.
:MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL terminates job processing.

An example Amazon Redshift RSQL script

Let’s look at an example. First, we log in to an Amazon Redshift database using Amazon Redshift RSQL. You specify the connection information on the command line as shown in the following code. The port and database are optional and default to 5439 and dev respectively if not provided.

$ rsql -h testcluster1.<example>.redshift.amazonaws.com -U testuser1 -d myredshift -p 5439
Password for user testuser1: 
DSN-less Connected
DBMS Name: Amazon Redshift
Driver Name: Amazon Redshift ODBC Driver
Driver Version: 1.4.34.1000
Rsql Version: 1.0.1
Redshift Version: 1.0.29551
Type "help" for help.

(testcluster1) testuser1@myredshift=#

If you choose to change the connection from within the client, you can use the \LOGON command:

(testcluster1) testuser1@myredshift=# \logon testschema testuser2 testcluster2.<example>.redshift.amazonaws.com
Password for user testuser2: 
DBMS Name: Amazon Redshift
Driver Name: Amazon Redshift ODBC Driver
Driver Version: 1.4.34.1000
Rsql Version: 1.0.1

Now, let’s run a simple script that runs a SELECT statement, checks for output, then branches depending on whether data was returned or not.

First, we inspect the script by using the \OS command to print the file to the screen:

(testcluster1) testuser1@myredshift=# \os cat activitycount.sql
select * from testschema.employees;

\if :ACTIVITYCOUNT = 0
  \remark '****No data found****'
  \goto LETSQUIT
\else
  \remark '****Data found****'
  \goto LETSDOSOMETHING
\endif

\label LETSQUIT
\remark '****We are quitting****'
\exit 0

\label LETSDOSOMETHING
\remark '****We are doing it****'
\exit 0

The script prints one of two messages depending on whether data is returned by the SELECT statement or not.

Now, let’s run the script using the \RUN command. The SELECT statement returns 11 rows of data. The script prints a “data found” message, and jumps to the LETSDOSOMETHING label.

(testcluster1) testuser1@myredshift=# \run file=activitycount.sql
  id  | name    | manager_id | last_promo_date
 -----+---------+------------+-----------------
 112  | Britney | 201        | 2041-03-30
 101  | Bob     | 100        |
 110  | Mark    | 201        |
 106  | Jeff    | 102        |
 201  | Ana     | 104        |
 104  | Chris   | 103        |
 111  | Phyllis | 103        |
 102  | Renee   | 101        | 2021-01-01
 100  | Caitlin |            | 2021-01-01
 105  | David   | 103        | 2021-01-01
 103  | John    | 101        |
 (11 rows)

****Data found****
\label LETSQUIT ignored
\label LETSDOSOMETHING processed
****We are doing it****

That’s Amazon Redshift RSQL in a nutshell. If you’re developing new scripts for Amazon Redshift, we encourage you to use Amazon Redshift RSQL and take advantage of its additional capabilities. If you have existing PSQL scripts, you can run those scripts using Amazon Redshift RSQL with no changes.

Use AWS SCT to automate your BTEQ conversions

If you’re a Teradata developer or DBA, you’ve probably built a library of BTEQ scripts that you use to perform administrative work, load or transform data, or to generate datasets and reports. If you’re contemplating a migration to Amazon Redshift, you’ll want to preserve the investment you made in creating those scripts.

AWS SCT has long had the ability to convert BTEQ to AWS Glue. Now, you can also use AWS SCT to automatically convert BTEQ scripts to Amazon Redshift RSQL. AWS SCT supports all the new Amazon Redshift RSQL features like conditional execution, escape to the shell, and branching.

Let’s see how it works. We create two Teradata tables, product_stg and product. Then we create a simple ETL script that uses a MERGE statement to update the product table using data from the product_stg table:

CREATE TABLE testschema.product_stg (
  prod_id INTEGER
, description VARCHAR(100) CHARACTER SET LATIN
,category_id INTEGER)
UNIQUE PRIMARY INDEX ( prod_id );

CREATE TABLE testschema.product (
  prod_id INTEGER
, description VARCHAR(100) CHARACTER SET LATIN
, category_id INTEGER)
UNIQUE PRIMARY INDEX ( prod_id );

We embed the MERGE statement inside a BTEQ script. The script tests error conditions and branches accordingly:

.SET WIDTH 100

SELECT COUNT(*) 
FROM testschema.product_stg 
HAVING COUNT(*) > 0;

.IF ACTIVITYCOUNT = 0 then .GOTO NODATA;

MERGE INTO testschema.product tgt 
USING testschema.product_stg stg 
   ON tgt.prod_id = stg.prod_id
WHEN MATCHED THEN UPDATE SET
      description = stg.description
    , category_id = stg.category_id
WHEN NOT MATCHED THEN INSERT VALUES (
  stg.prod_id
, stg.description
, stg.category_id
);

.GOTO ALLDONE;

.LABEL NODATA

.REMARK 'Staging table is empty. Stopping'

.LABEL ALLDONE 

.QUIT;

Now, let’s use AWS SCT to convert the script to Amazon Redshift RSQL. AWS SCT converts the BTEQ commands to their Amazon Redshift RSQL and Amazon Redshift equivalents. The converted script is as follows:

\rset width 100
SELECT
    COUNT(*)
    FROM testschema.product_stg
    HAVING COUNT(*) > 0;
\if :ACTIVITYCOUNT = 0
    \goto NODATA
\endif
UPDATE testschema.product
SET description = stg.description, category_id = stg.category_id
FROM testschema.product_stg AS stg
JOIN testschema.product AS tgt
    ON tgt.prod_id = stg.prod_id;
INSERT INTO testschema.product
SELECT
    stg.prod_id, stg.description, stg.category_id
    FROM testschema.product_stg AS stg
    WHERE NOT EXISTS (
        SELECT 1
        FROM testschema.product AS tgt
        WHERE tgt.prod_id = stg.prod_id);
\goto ALLDONE
\label NODATA
\remark 'Staging table is empty. Stopping'
\label ALLDONE
\quit :ERRORLEVEL

The following are the main points of interest in the conversion:

The BTEQ .SET WIDTH command is converted to the Amazon Redshift RSQL \RSET WIDTH command.
The BTEQ ACTIVITYCOUNT variable is converted to the Amazon Redshift RSQL ACTIVITYCOUNT variable.
The BTEQ MERGE statement is converted into an UPDATE followed by an INSERT statement. Currently, Amazon Redshift doesn’t support a native MERGE statement.
The BTEQ .LABEL and .GOTO statements are translated to their Amazon Redshift RSQL equivalents \LABEL and \GOTO.

Let’s look at the actual process of using AWS SCT to convert a BTEQ script.

After starting AWS SCT, you create a Teradata migration project and navigate to the BTEQ scripts node in the source tree window pane. Right-click and choose Load scripts.

Then select the folder that contains your BTEQ scripts. The folder appears in the source tree. Open it and navigate to the script you want to convert. In our case, the script is contained in the file merge.sql. Right-click on the file, choose Convert script, then choose Convert to RSQL.You can inspect the converted script in the bottom middle pane. When you’re ready to save the script to a file, do that from the target tree on the right side.

If you have many BTEQ scripts, you can convert an entire folder at once by selecting the folder instead of an individual file.

Convert shell scripts

Many applications run BTEQ commands from within shell scripts. For example, you may have a shell script that redirects log output and controls login credentials, as in the following:

bteq <<EOF >> ${LOG} 2>&1

.run file $LOGON;

SELECT COUNT(*) 
FROM testschema.product_stg 
HAVING COUNT(*) > 0;
…

EOF

If you use shell scripts to run BTEQ, we’re happy to share that AWS SCT can help you convert those scripts. AWS SCT supports bash scripts now, and we’ll add additional shell dialects in the future.

The process to convert shell scripts is very similar to BTEQ conversion. You select a folder that contains your scripts by navigating to the Shell node in the source tree and then choosing Load scripts.

After the folder is loaded, you can convert one (or more) scripts by selecting them and choosing Convert script.

As before, the converted script appears in the UI, and you can save it from the target tree on the right side of the page.

Conclusion

We’re happy to share Amazon Redshift RSQL and expect it to be a big hit with customers. If you’re contemplating a migration from Teradata to Amazon Redshift, Amazon Redshift RSQL and AWS SCT can simplify the conversion of your existing Teradata scripts and help preserve your investment in existing reports, applications, and ETL.

All of the features described in this post are available for you to use today. You can download Amazon Redshift RSQL and AWS SCT and give it a try.

We’ll be back soon with the next installment in this series. Check back for more information on automating your migrations from Teradata to Amazon Redshift. In the meantime, you can learn more about Amazon Redshift, Amazon Redshift RSQL, and AWS SCT. Happy migrating!

About the Authors

Michael Soo is a Senior Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Po Hong, PhD, is a Principal Data Architect of Lake House Global Specialty Practice,
AWS Professional Services. He is passionate about supporting customers to adopt innovative solutions to reduce time to insight. Po is specialized in migrating large scale MPP on-premises data warehouses to the AWS Lake House architecture.

Entong Shen is a Software Development Manager of Amazon Redshift. He has been working on MPP databases for over 9 years and has focused on query optimization, statistics and migration related SQL language features such as stored procedures and data types.

Adekunle Adedotun is a Sr. Database Engineer with Amazon Redshift service. He has been working on MPP databases for 6 years with a focus on performance tuning. He also provides guidance to the development team for new and existing service features.

Asia Khytun is a Software Development Manager for the AWS Schema Conversion Tool. She has 10+ years of software development experience in C, C++, and Java.

Illia Kratsov is a Database Developer with the AWS Project Delta Migration team. He has 10+ years experience in data warehouse development with Teradata and other MPP databases.

Accelerate your data warehouse migration to Amazon Redshift – Part 3

2021-09-20 Michael Soo

Post Syndicated from Michael Soo original https://aws.amazon.com/blogs/big-data/part-3-accelerate-your-data-warehouse-migration-to-amazon-redshift/

This is the third post in a multi-part series. We’re excited to share dozens of new features to automate your schema conversion; preserve your investment in existing scripts, reports, and applications; accelerate query performance; and reduce your overall cost to migrate to Amazon Redshift.

Check out the previous posts in the series:

Accelerate your data warehouse migration to Amazon Redshift – Part 1 to learn more about automated conversion of database macros, case-insensitive string comparison, and case-sensitive identifiers,
Accelerate your data warehouse migration to Amazon Redshift – Part 2 to learn about automatic conversion for proprietary data types.

Amazon Redshift is the leading cloud data warehouse. No other data warehouse makes it as easy to gain new insights from your data. With Amazon Redshift, you can query exabytes of data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate other services such as Amazon EMR, Amazon Athena, and Amazon SageMaker to use all the analytic capabilities in the AWS Cloud.

Many customers have asked for help migrating from self-managed data warehouse engines, like Teradata, to Amazon Redshift. In these cases, you may have terabytes (or petabytes) of historical data, a heavy reliance on proprietary features, and thousands of extract, transform, and load (ETL) processes and reports built over years (or decades) of use.

Until now, migrating a Teradata data warehouse to AWS was complex and involved a significant amount of manual effort.

Today, we’re happy to share recent enhancements to Amazon Redshift and the AWS Schema Conversion Tool (AWS SCT) that make it easier to automate your Teradata to Amazon Redshift migrations.

In this post, we introduce new automation for merge statements, a native function to support ASCII character conversion, enhanced error checking for string to date conversion, enhanced support for Teradata cursors and identity columns, automation for ANY and SOME predicates, automation for RESET WHEN clauses, automation for two proprietary Teradata functions (TD_NORMALIZE_OVERLAP and TD_UNPIVOT), and automation to support analytic functions (QUANTILE and QUALIFY).

Merge statement

Like its name implies, the merge statement takes an input set and merges it into a target table. If an input row already exists in the target table (a row in the target table has the same primary key value), then the target row is updated. If there is no matching target row, the input row is inserted into the table.

Until now, if you used merge statements in your workload, you were forced to manually rewrite the merge statement to run on Amazon Redshift. Now, we’re happy to share that AWS SCT automates this conversion for you. AWS SCT decomposes a merge statement into an update on existing records followed by an insert for new records.

Let’s look at an example. We create two tables in Teradata: a target table, employee, and a delta table, employee_delta, where we stage the input rows:

CREATE TABLE testschema.employee(
  id INTEGER
, name VARCHAR(20)
, manager INTEGER)
UNIQUE PRIMARY INDEX (id)
;

CREATE TABLE testschema.employee_delta (
  id INTEGER
, name VARCHAR(20)
, manager INTEGER)
UNIQUE PRIMARY INDEX(id)
;

Now we create a Teradata merge statement that updates a row if it exists in the target, otherwise it inserts the new row. We embed this merge statement into a macro so we can show you the conversion process later.

REPLACE MACRO testschema.merge_employees AS (
  MERGE INTO testschema.employee tgt
  USING testschema.employee_delta delta
    ON delta.id = tgt.id
  WHEN MATCHED THEN
    UPDATE SET name = delta.name, manager = delta.manager
  WHEN NOT MATCHED THEN
    INSERT (delta.id, delta.name, delta.manager);
);

Now we use AWS SCT to convert the macro. (See Accelerate your data warehouse migration to Amazon Redshift – Part 1 for details on macro conversion.) AWS SCT creates a stored procedure that contains an update (to implement the WHEN MATCHED condition) and an insert (to implement the WHEN NOT MATCHED condition).

CREATE OR REPLACE PROCEDURE testschema.merge_employees()
AS $BODY$
BEGIN
    UPDATE testschema.employee
    SET name = "delta".name, manager = "delta".manager
    FROM testschema.employee_delta AS delta JOIN testschema.employee AS tgt
        ON "delta".id = tgt.id;
      
    INSERT INTO testschema.employee
    SELECT
      "delta".id
    , "delta".name
    , "delta".manager
    FROM testschema.employee_delta AS delta
    WHERE NOT EXISTS (
      SELECT 1
      FROM testschema.employee AS tgt
      WHERE "delta".id = tgt.id
    );
END;
$BODY$
LANGUAGE plpgsql;

This example showed how to use merge automation for macros, but you can convert merge statements in any application context: stored procedures, BTEQ scripts, Java code, and more. Download the latest version of AWS SCT and try it out.

ASCII() function

The ASCII function takes as input a string and returns the ASCII code, or more precisely, the UNICODE code point, of the first character in the string. Previously, Amazon Redshift supported ASCII as a leader-node only function, which prevented its use with user-defined tables.

We’re happy to share that the ASCII function is now available on Amazon Redshift compute nodes and can be used with user-defined tables. In the following code, we create a table with some string data:

CREATE TABLE testschema.char_table (
  id INTEGER
, char_col  CHAR(10)
, varchar_col VARCHAR(10)
);

INSERT INTO testschema.char_table VALUES (1, 'Hello', 'world');

Now you can use the ASCII function on the string columns:

# SELECT id, char_col, ascii(char_col), varchar_col, ascii(varchar_col) FROM testschema.char_table;

 id |  char_col  | ascii | varchar_col | ascii 
  1 | Hello      |    72 | world       |   119

Lastly, if your application code uses the ASCII function, AWS SCT automatically converts any such function calls to Amazon Redshift.

The ASCII feature is available now—try it out in your own cluster.

TO_DATE() function

The TO_DATE function converts a character string into a DATE value. A quirk of this function is that it can accept a string value that isn’t a valid date and translate it into a valid date.

For example, consider the string 2021-06-31. This isn’t a valid date because the month of June has only 30 days. However, the TO_DATE function accepts this string and returns the “31st” day of June (July 1):

# SELECT to_date('2021-06-31', 'YYYY-MM-DD');
 to_date 
 2021-07-01
(1 row)

Customers have asked for strict input checking for TO_DATE, and we’re happy to share this new capability. Now, you can include a Boolean value in the function call that turns on strict checking:

# SELECT to_date('2021-06-31', 'YYYY-MM-DD', TRUE);
ERROR: date/time field date value out of range: 2021-6-31

You can turn off strict checking explicitly as well:

# SELECT to_date('2021-06-31', 'YYYY-MM-DD', FALSE);
 to_date 
 2021-07-01
(1 row)

Also, the Boolean value is optional. If you don’t include it, strict checking is turned off, and you see the same behavior as before the feature was launched.

You can learn more about the TO_DATE function and try out strict date checking in Amazon Redshift now.

CURSOR result sets

A cursor is a programming language construct that applications use to manipulate a result set one row at a time. Cursors are more relevant for OLTP applications, but some legacy applications built on data warehouses also use them.

Teradata provides a diverse set of cursor configurations. Amazon Redshift supports a more streamlined set of cursor features.

Based on customer feedback, we’ve added automation to support Teradata WITH RETURN cursors. These types of cursors are opened within stored procedures and returned to the caller for processing of the result set. AWS SCT will convert a WITH RETURN cursor to an Amazon Redshift REFCURSOR.

For example, consider the following procedure, which contains a WITH RETURN cursor. The procedure opens the cursor and returns the result to the caller as a DYNAMIC RESULT SET:

REPLACE PROCEDURE testschema.employee_cursor (IN p_mgrid INTEGER) DYNAMIC RESULT SETS 1
BEGIN
   DECLARE result_set CURSOR WITH RETURN ONLY FOR 
     SELECT id, name, manager 
     FROM testschema.employee
     WHERE manager = to_char(p_mgrid); 
   OPEN result_set;
END;

AWS SCT converts the procedure as follows. An additional parameter is added to the procedure signature to pass the REFCURSOR:

CREATE OR REPLACE PROCEDURE testschema.employee_cursor(par_p_mgrid IN INTEGER, dynamic_return_cursor INOUT refcursor)
AS $BODY$
DECLARE
BEGIN
    OPEN dynamic_return_cursor FOR
    SELECT
        id, name, manager
        FROM testschema.employee
        WHERE manager = to_char(par_p_mgrid, '99999');
END;
$BODY$
LANGUAGE plpgsql;

IDENTITY columns

Teradata supports several non-ANSI compliant features for IDENTITY columns. We have enhanced AWS SCT to automatically convert these features to Amazon Redshift, whenever possible.

Specifically, AWS SCT now converts the Teradata START WITH and INCREMENT BY clauses to the Amazon Redshift SEED and STEP clauses, respectively. For example, consider the following Teradata table:

CREATE TABLE testschema.identity_table (
  a2 BIGINT GENERATED ALWAYS AS IDENTITY (
    START WITH 1 
    INCREMENT BY 20
  )
);

The GENERATED ALWAYS clause indicates that the column is always populated automatically—a value can’t be explicitly inserted or updated into the column. The START WITH clause defines the first value to be inserted into the column, and the INCREMENT BY clause defines the next value to insert into the column.

When you convert this table using AWS SCT, the following Amazon Redshift DDL is produced. Notice that the START WITH and INCREMENT BY values are preserved in the target syntax:

CREATE TABLE IF NOT EXISTS testschema.identity_table (
  a2 BIGINT IDENTITY(1, 20)
)
DISTSTYLE KEY
DISTKEY
(a2)
SORTKEY
(a2);

Also, by default, an IDENTITY column in Amazon Redshift only contains auto-generated values, so that the GENERATED ALWAYS property in Teradata is preserved:

# INSERT INTO testschema.identity_table VALUES (100);
ERROR:  cannot set an identity column to a value

IDENTITY columns in Teradata can also be specified as GENERATED BY DEFAULT. In this case, a value can be explicitly defined in an INSERT statement. If no value is specified, the column is filled with an auto-generated value like normal. Before, AWS SCT didn’t support conversion for GENERATED BY DEFAULT columns. Now, we’re happy to share that AWS SCT automatically converts such columns for you.

For example, the following table contains an IDENTITY column that is GENERATED BY DEFAULT:

CREATE TABLE testschema.identity_by_default (
  a1 BIGINT GENERATED BY DEFAULT AS IDENTITY (
     START WITH 1 
     INCREMENT BY 20 
  )
PRIMARY INDEX (a1);

The IDENTITY column is converted by AWS SCT as follows. The converted column uses the Amazon Redshift GENERATED BY DEFAULT clause:

CREATE TABLE testschema.identity_by_default (
  a1 BIGINT GENERATED BY DEFAULT AS IDENTITY(1,20) DISTKEY
)                                                          
 DISTSTYLE KEY                                               
 SORTKEY (a1);

There is one additional syntax issue that requires attention. In Teradata, an auto-generated value is inserted when NULL is specified for the column value:

INSERT INTO identity_by_default VALUES (null);

Amazon Redshift uses a different syntax for the same purpose. Here, you include the keyword DEFAULT in the values list to indicate that the column should be auto-generated:

INSERT INTO testschema.identity_by_default VALUES (default);

We’re happy to share that AWS SCT automatically converts the Teradata syntax for INSERT statements like the preceding example. For example, consider the following Teradata macro:

REPLACE MACRO testschema.insert_identity_by_default AS (
  INSERT INTO testschema.identity_by_default VALUES (NULL);
);

AWS SCT removes the NULL and replaces it with DEFAULT:

CREATE OR REPLACE PROCEDURE testschema.insert_identity_by_default() LANGUAGE plpgsql
AS $$                                                              
BEGIN                                                              
  INSERT INTO testschema.identity_by_default VALUES (DEFAULT);
END;                                                               
$$

IDENTITY column automation is available now in AWS SCT. You can download the latest version and try it out.

ANY and SOME filters with inequality predicates

The ANY and SOME filters determine if a predicate applies to one or more values in a list. For example, in Teradata, you can use <> ANY to find all employees who don’t work for a certain manager:

REPLACE MACRO testschema.not_in_103 AS (
  SELECT *
  FROM testschema.employee 
  WHERE manager <> ANY (103)
;
);

Of course, you can rewrite this query using a simple not equal filter, but you often see queries from third-party SQL generators that follow this pattern.

Amazon Redshift doesn’t support this syntax natively. Before, any queries using this syntax had to be manually converted. Now, we’re happy to share that AWS SCT automatically converts ANY and SOME clauses with inequality predicates. The macro above is converted to a stored procedure as follows.

CREATE OR REPLACE PROCEDURE testschema.not_in_103(macro_out INOUT refcursor)
AS $BODY$
BEGIN
    OPEN macro_out FOR
    SELECT *
    FROM testschema.employee
    WHERE ((manager <> 103));
END;
$BODY$
LANGUAGE plpgsql;

If the values list following the ANY contains two more values, AWS SCT will convert this to a series of OR conditions, one for each element in the list.

ANY/SOME filter conversion is available now in AWS SCT. You can try it out in the latest version of the application.

Analytic functions with RESET WHEN

RESET WHEN is a Teradata feature used in SQL analytical window functions. It’s an extension to the ANSI SQL standard. RESET WHEN determines the partition over which a SQL window function operates based on a specified condition. If the condition evaluates to true, a new dynamic sub-partition is created inside the existing window partition.

For example, the following view uses RESET WHEN to compute a running total by store. The running total accumulates as long as sales increase month over month. If sales drop from one month to the next, the running total resets.

CREATE TABLE testschema.sales (
  store_id INTEGER
, month_no INTEGER
, sales_amount DECIMAL(9,2)
)
;

REPLACE VIEW testschema.running_total (
  store_id
, month_no
, sales_amount
, cume_sales_amount
)
AS
SELECT 
  store_id
, month_no
, sales_amount
, SUM(sales_amount) OVER (
     PARTITION BY store_id 
     ORDER BY month_no
     RESET WHEN sales_amount < SUM(sales_amount) OVER (
       PARTITION BY store_id
       ORDER BY month_no
       ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
     )
     ROWS UNBOUNDED PRECEDING 
  )
FROM testschema.sales;

To demonstrate, we insert some test data into the table:

INSERT INTO testschema.sales VALUES (1001, 1, 35000.00);
INSERT INTO testschema.sales VALUES (1001, 2, 40000.00);
INSERT INTO testschema.sales VALUES (1001, 3, 45000.00);
INSERT INTO testschema.sales VALUES (1001, 4, 25000.00);
INSERT INTO testschema.sales VALUES (1001, 5, 30000.00);
INSERT INTO testschema.sales VALUES (1001, 6, 30000.00);
INSERT INTO testschema.sales VALUES (1001, 7, 50000.00);
INSERT INTO testschema.sales VALUES (1001, 8, 35000.00);
INSERT INTO testschema.sales VALUES (1001, 9, 60000.00);
INSERT INTO testschema.sales VALUES (1001, 10, 80000.00);
INSERT INTO testschema.sales VALUES (1001, 11, 90000.00);
INSERT INTO testschema.sales VALUES (1001, 12, 100000.00);

The sales amounts drop after months 3 and 7. The running total is reset accordingly at months 4 and 8.

SELECT * FROM testschema.running_total;

   store_id     month_no  sales_amount  cume_sales_amount
-----------  -----------  ------------  -----------------
       1001            1      35000.00           35000.00
       1001            2      40000.00           75000.00
       1001            3      45000.00          120000.00
       1001            4      25000.00           25000.00
       1001            5      30000.00           55000.00
       1001            6      30000.00           85000.00
       1001            7      50000.00          135000.00
       1001            8      35000.00           35000.00
       1001            9      60000.00           95000.00
       1001           10      80000.00          175000.00
       1001           11      90000.00          265000.00
       1001           12     100000.00          365000.00

AWS SCT converts the view as follows. The converted code uses a subquery to emulate the RESET WHEN. Essentially, a marker attribute is added to the result that flags a month over month sales drop. The flag is then used to determine the longest preceding run of increasing sales to aggregate.

CREATE OR REPLACE VIEW testschema.running_total (
  store_id
, month_no
, sales_amount
, cume_sales_amount) AS
SELECT
  store_id
, month_no
, sales_amount
, sum(sales_amount) OVER 
    (PARTITION BY k1, store_id ORDER BY month_no NULLS 
     FIRST ROWS UNBOUNDED      PRECEDING)
FROM (
  SELECT
   store_id
 , month_no
 , sales_amount
 , SUM(CASE WHEN k = 1 THEN 0 ELSE 1 END) OVER 
     (PARTITION BY store_id ORDER BY month_no NULLS 
       FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS k1
 FROM (
   SELECT
     store_id
   , month_no
   , sales_amount
   , CASE WHEN sales_amount < SUM(sales_amount) OVER 
      (PARTITION BY store_id ORDER BY month_no 
        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) 
      OR sales_amount IS NULL THEN 0 ELSE 1 END AS k
   FROM testschema.sales
  )
);

We expect that RESET WHEN conversion will be a big hit with customers. You can try it now in AWS SCT.

TD_NORMALIZE_OVERLAP() function

The TD_NORMALIZE_OVERLAP function combines rows that have overlapping PERIOD values. The resulting normalized row contains the earliest starting bound and the latest ending bound from the PERIOD values of all the rows involved.

For example, we create a Teradata table that records employee salaries with the following code. Each row in the table is timestamped with the period that the employee was paid the given salary.

CREATE TABLE testschema.salaries (
  emp_id INTEGER
, salary DECIMAL(8,2)
, from_to PERIOD(DATE)
);

Now we add data for two employees. For emp_id = 1 and salary = 2000, there are two overlapping rows. Similarly, the two rows with emp_id = 2 and salary = 3000 are overlapping.

SELECT * FROM testschema.salaries ORDER BY emp_id, from_to;

     emp_id      salary  from_to
-----------  ----------  ------------------------
          1     1000.00  ('20/01/01', '20/05/31')
          1     2000.00  ('20/06/01', '21/02/28')
          1     2000.00  ('21/01/01', '21/06/30')
          2     3000.00  ('20/01/01', '20/03/31')
          2     3000.00  ('20/02/01', '20/04/30')

Now we create a view that uses the TD_NORMALIZE_OVERLAP function to normalize the overlapping data:

REPLACE VIEW testschema.normalize_salaries AS 
WITH sub_table(emp_id, salary, from_to) AS (
  SELECT 
    emp_id
  , salary
  , from_to
  FROM testschema.salaries
)
SELECT *
FROM 
  TABLE(TD_SYSFNLIB.TD_NORMALIZE_OVERLAP (NEW VARIANT_TYPE(sub_table.emp_id, sub_table.salary), sub_table.from_to)
    RETURNS (emp_id INTEGER, salary DECIMAL(8,2), from_to PERIOD(DATE))
    HASH BY emp_id
    LOCAL ORDER BY emp_id, salary, from_to
  ) AS DT(emp_id, salary, duration)
;

We can check that the view data is actually normalized:

select * from testschema.normalize_salaries order by emp_id, duration;

     emp_id      salary  duration
-----------  ----------  ------------------------
          1     1000.00  ('20/01/01', '20/05/31')
          1     2000.00  ('20/06/01', '21/06/30')
          2     3000.00  ('20/01/01', '20/04/30')

You can now use AWS SCT to convert any TD_NORMALIZE_OVERLAP statements. We first convert the salaries table to Amazon Redshift (see Accelerate your data warehouse migration to Amazon Redshift – Part 2 for details about period data type automation):

CREATE TABLE testschema.salaries (
  emp_id integer distkey
, salary numeric(8,2) ENCODE az64
, from_to_begin date ENCODE az64
, from_to_end date ENCODE az64    
)                                   
DISTSTYLE KEY                       
SORTKEY (emp_id);

# SELECT * FROM testschema.salaries ORDER BY emp_id, from_to_begin;
 emp_id | salary  | from_to_begin | from_to_end 
      1 | 1000.00 | 2020-01-01    | 2020-05-31
      1 | 2000.00 | 2020-06-01    | 2021-02-28
      1 | 2000.00 | 2021-01-01    | 2021-06-30
      2 | 3000.00 | 2020-01-01    | 2020-03-31
      2 | 3000.00 | 2020-02-01    | 2020-04-30

Now we use AWS SCT to convert the normalize_salaries view. AWS SCT adds a column that marks the start of a new group of rows. It then produces a single row for each group with a normalized timestamp.

CREATE VIEW testschema.normalize_salaries (emp_id, salary, from_to_begin, from_to_end) AS
WITH sub_table AS (
  SELECT
    emp_id
  , salary
  , from_to_begin AS start_date
  , from_to_end AS end_date
  , CASE
      WHEN start_date <= lag(end_date) OVER (PARTITION BY emp_id, salary ORDER BY start_date, end_date) THEN 0 
      ELSE 1
    END AS GroupStartFlag
    FROM testschema.salaries
  )
SELECT
  t2.emp_id
, t2.salary
, min(t2.start_date) AS from_to_begin
, max(t2.end_date) AS from_to_end
FROM (
  SELECT
    emp_id
  , salary
  , start_date
  , end_date
  , sum(GroupStartFlag) OVER (PARTITION BY emp_id, salary ORDER BY start_date ROWS UNBOUNDED PRECEDING) AS GroupID
  FROM 
    sub_table
) AS t2
GROUP BY 
  t2.emp_id
, t2.salary
, t2.GroupID;

We can check that the converted view returns the correctly normalized data:

# SELECT * FROM testschema.normalize_salaries ORDER BY emp_id;
 emp_id | salary  | from_to_begin | from_to_end 
      1 | 1000.00 | 2020-01-01    | 2020-05-31
      1 | 2000.00 | 2020-06-01    | 2021-06-30
      2 | 3000.00 | 2020-01-01    | 2020-04-30

You can try out TD_NORMALIZE_OVERLAP conversion in the latest release of AWS SCT. Download it now.

TD_UNPIVOT() function

The TD_UNPIVOT function transforms columns into rows. Essentially, we use it to take a row of similar metrics over different time periods and create a separate row for each metric.

For example, consider the following Teradata table. The table records customer visits by year and month for small kiosk stores:

CREATE TABLE TESTSCHEMA.kiosk_monthly_visits (
  kiosk_id INTEGER
, year_no INTEGER
, jan_visits INTEGER
, feb_visits INTEGER
, mar_visits INTEGER
, apr_visits INTEGER
, may_visits INTEGER
, jun_visits INTEGER
, jul_visits INTEGER
, aug_visits INTEGER
, sep_visits INTEGER
, oct_visits INTEGER
, nov_visits INTEGER
, dec_visits INTEGER)
PRIMARY INDEX (kiosk_id);

We insert some sample data into the table:

INSERT INTO testschema.kiosk_monthly_visits VALUES (100, 2020, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200);

Next, we create a view that unpivots the table so that the monthly visits appear on separate rows. The single row in the pivoted table creates 12 rows in the unpivoted table, one row per month.

REPLACE VIEW testschema.unpivot_kiosk_monthly_visits (
  kiosk_id
, year_no
, month_name
, month_visits
)
AS
SELECT 
  kiosk_id
, year_no
, month_name (FORMAT 'X(10)')
, month_visits
FROM TD_UNPIVOT (
 ON (SELECT * FROM testschema.kiosk_monthly_visits)
 USING
 VALUE_COLUMNS ('month_visits')
 UNPIVOT_COLUMN('month_name')
 COLUMN_LIST(
   'jan_visits'
 , 'feb_visits'
 , 'mar_visits'
 , 'apr_visits'
 , 'may_visits'
 , 'jun_visits'
 , 'jul_visits'
 , 'aug_visits'
 , 'sep_visits'
 , 'oct_visits'
 , 'nov_visits'
 , 'dec_visits'
 )
 COLUMN_ALIAS_LIST (
   'jan'
 , 'feb'
 , 'mar'
 , 'apr'
 , 'may'
 , 'jun'
 , 'jul'
 , 'aug'
 , 'sep'
 , 'oct'
 , 'nov'
 , 'dec'
 )
) a;

When you select from the view, the monthly sales are unpivoted into 12 separate rows:

SELECT * FROM testschema.unpivot_monthly_sales;

 id           yr           mon     mon_sales
----------- ----------- ---------- ----------
100         2021        jan           1100.00
100         2021        feb           1200.00
100         2021        mar           1300.00
100         2021        apr           1400.00
100         2021        may           1500.00
100         2021        jun           1600.00
100         2021        jul           1700.00
100         2021        aug           1800.00
100         2021        sep           1900.00
100         2021        oct           2000.00
100         2021        nov           2100.00
100         2021        dec           2200.00

Now we use AWS SCT to convert the view into ANSI SQL that can be run on Amazon Redshift. The conversion creates a common table expression (CTE) to place each month in a separate row. It then joins the CTE and the remaining attributes from the original pivoted table.

REPLACE VIEW testschema.unpivot_kiosk_monthly_visits (kiosk_id, year_no, month_name, month_visits) AS
WITH cols
AS (SELECT
    'jan' AS col
UNION ALL
SELECT
    'feb' AS col
UNION ALL
SELECT
    'mar' AS col
UNION ALL
SELECT
    'apr' AS col
UNION ALL
SELECT
    'may' AS col
UNION ALL
SELECT
    'jun' AS col
UNION ALL
SELECT
    'jul' AS col
UNION ALL
SELECT
    'aug' AS col
UNION ALL
SELECT
    'sep' AS col
UNION ALL
SELECT
    'oct' AS col
UNION ALL
SELECT
    'nov' AS col
UNION ALL
SELECT
    'dec' AS col)
SELECT
    t1.kiosk_id, t1.year_no, col AS "month_name",
    CASE col
        WHEN 'jan' THEN "jan_visits"
        WHEN 'feb' THEN "feb_visits"
        WHEN 'mar' THEN "mar_visits"
        WHEN 'apr' THEN "apr_visits"
        WHEN 'may' THEN "may_visits"
        WHEN 'jun' THEN "jun_visits"
        WHEN 'jul' THEN "jul_visits"
        WHEN 'aug' THEN "aug_visits"
        WHEN 'sep' THEN "sep_visits"
        WHEN 'oct' THEN "oct_visits"
        WHEN 'nov' THEN "nov_visits"
        WHEN 'dec' THEN "dec_visits"
        ELSE NULL
    END AS "month_visits"
    FROM testschema.kiosk_monthly_visits AS t1
    CROSS JOIN cols
    WHERE month_visits IS NOT NULL;

You can check that the converted view produces the same result as the Teradata version:

# SELECT * FROM testschema.unpivot_kiosk_monthly_visits;
 kiosk_id | year_no | month_name | month_visits 
      100 |    2020 | oct        |        2000
      100 |    2020 | nov        |        2100
      100 |    2020 | jul        |        1700
      100 |    2020 | feb        |        1200
      100 |    2020 | apr        |        1400
      100 |    2020 | aug        |        1800
      100 |    2020 | sep        |        1900
      100 |    2020 | jan        |        1100
      100 |    2020 | mar        |        1300
      100 |    2020 | may        |        1500
      100 |    2020 | jun        |        1600
      100 |    2020 | dec        |        2200

You can try out the conversion support for TD_UNPIVOT in the latest version of AWS SCT.

QUANTILE function

QUANTILE is a ranking function. It partitions the input set into a specified number of groups, each containing an equal portion of the total population. QUANTILE is a proprietary Teradata extension of the NTILE function found in ANSI SQL.

For example, we can compute the quartiles of the monthly visit data using the following Teradata view:

REPLACE VIEW testschema.monthly_visit_rank AS
SELECT
  kiosk_id
, year_no
, month_name
, month_visits
, QUANTILE(4, month_visits) qtile
FROM
 testschema.unpivot_kiosk_monthly_visits
;

When you select from the view, the QUANTILE function computes the quartile and applies it as an attribute on the output:

SELECT * FROM monthly_visit_rank;

   kiosk_id      year_no  month_name  month_visits        qtile
-----------  -----------  ----------  ------------  -----------
        100         2020  jan                 1100            0
        100         2020  feb                 1200            0
        100         2020  mar                 1300            0
        100         2020  apr                 1400            1
        100         2020  may                 1500            1
        100         2020  jun                 1600            1
        100         2020  jul                 1700            2
        100         2020  aug                 1800            2
        100         2020  sep                 1900            2
        100         2020  oct                 2000            3
        100         2020  nov                 2100            3
        100         2020  dec                 2200            3

Amazon Redshift supports a generalized NTILE function, which can implement QUANTILE, and is ANSI-compliant. We’ve enhanced AWS SCT to automatically convert QUANTILE function calls to equivalent NTILE function calls.

For example, when you convert the preceding Teradata view, AWS SCT produces the following Amazon Redshift code:

SELECT 
  unpivot_kiosk_monthly_visits.kiosk_id
, unpivot_kiosk_monthly_visits.year_no
, unpivot_kiosk_monthly_visits.month_name
, unpivot_kiosk_monthly_visits.month_visits
, ntile(4) OVER (ORDER BY unpivot_kiosk_monthly_visits.month_visits ASC  NULLS FIRST) - 1) AS qtile 
FROM 
  testschema.unpivot_kiosk_monthly_visits
;

QUANTILE conversion support is available now in AWS SCT.

QUALIFY filter

The QUALIFY clause in Teradata filters rows produced by an analytic function. Let’s look at an example. We use the following table, which contains store revenue by month. Our goal is to find the top five months by revenue:

CREATE TABLE testschema.sales (
  store_id INTEGER
, month_no INTEGER
, sales_amount DECIMAL(9,2))
PRIMARY INDEX (store_id);


SELECT * FROM sales;

   store_id     month_no  sales_amount
-----------  -----------  ------------
       1001            1      35000.00
       1001            2      40000.00
       1001            3      45000.00
       1001            4      25000.00
       1001            5      30000.00
       1001            6      30000.00
       1001            7      50000.00
       1001            8      35000.00
       1001            9      60000.00
       1001           10      80000.00
       1001           11      90000.00
       1001           12     100000.00

The data shows that July, September, October, November, and December were the top five sales months.

We create a view that uses the RANK function to rank each month by sales, then use the QUALIFY function to select the top five months:

REPLACE VIEW testschema.top_five_months(
  store_id
, month_no
, sales_amount
, month_rank
) as
SELECT
  store_id
, month_no
, sales_amount
, RANK() OVER (PARTITION BY store_id ORDER BY sales_amount DESC) month_rank
FROM
  testschema.sales
QUALIFY RANK() OVER (PARTITION by store_id ORDER BY sales_amount DESC) <= 5
;

Before, if you used the QUALIFY clause, you had to manually recode your SQL statements. Now, AWS SCT automatically converts QUALIFY into Amazon Redshift-compatible, ANSI-compliant SQL. For example, AWS SCT rewrites the preceding view as follows:

CREATE OR REPLACE VIEW testschema.top_five_months (
  store_id
, month_no
, sales_amount
, month_rank) AS
SELECT
  qualify_subquery.store_id
, qualify_subquery.month_no
, qualify_subquery.sales_amount
, month_rank
FROM (
  SELECT
    store_id
  , month_no
  , sales_amount
  , rank() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS month_rank
  , rank() OVER (PARTITION BY store_id ORDER BY sales_amount DESC NULLS FIRST) AS qualify_expression_1
  FROM testschema.sales) AS qualify_subquery
  WHERE 
    qualify_expression_1 <= 5;

AWS SCT converts the original query into a subquery, and applies the QUALIFY expression as a filter on the subquery. AWS SCT adds an additional column to the subquery for the purpose of filtering. This is not strictly needed, but simplifies the code when column aliases aren’t used.

You can try QUALIFY conversion in the latest version of AWS SCT.

Summary

We’re happy to share these new features with you. If you’re contemplating a migration to Amazon Redshift, these capabilities can help automate your schema conversion and preserve your investment in existing reports and applications. If you’re looking to get started on a data warehouse migration, you can learn more about Amazon Redshift and AWS SCT from our public documentation.

This post described a few of the dozens of new features we’re introducing to automate your Teradata migrations to Amazon Redshift. We’ll share more in upcoming posts about automation for proprietary Teradata features and other exciting new capabilities.

Check back soon for more information. Until then, you can learn more about Amazon Redshift and the AWS Schema Conversion Tool. Happy migrating!

About the Authors

Michael Soo is a Senior Database Engineer with the AWS Database Migration Service team. He builds products and services that help customers migrate their database workloads to the AWS cloud.

Raza Hafeez is a Data Architect within the Lake House Global Specialty Practice of AWS Professional Services. He has over 10 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Lake House Architecture.

Po Hong, PhD, is a Principal Data Architect of Lake House Global Specialty Practice, AWS Professional Services. He is passionate about supporting customers to adopt innovative solutions to reduce time to insight. Po is specialized in migrating large scale MPP on-premises data warehouses to the AWS Lake House architecture.

Sumit Singh is a database engineer with Database Migration Service team at Amazon Web Services. He works closely with customers and provide technical assistance to migrate their on-premises workload to AWS cloud. He also assists in continuously improving the quality and functionality of AWS Data migration products.

Nelly Susanto is a Senior Database Migration Specialist of AWS Database Migration Accelerator. She has over 10 years of technical background focusing on migrating and replicating databases along with data-warehouse workloads. She is passionate about helping customers in their cloud journey.

Introducing Amazon MSK Connect – Stream Data to and from Your Apache Kafka Clusters Using Managed Connectors

2021-09-17 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-msk-connect-stream-data-to-and-from-your-apache-kafka-clusters-using-managed-connectors/

Apache Kafka is an open-source platform for building real-time streaming data pipelines and applications. At re:Invent 2018, we announced Amazon Managed Streaming for Apache Kafka, a fully managed service that makes it easy to build and run applications that use Apache Kafka to process streaming data.

When you use Apache Kafka, you capture real-time data from sources such as IoT devices, database change events, and website clickstreams, and deliver it to destinations such as databases and persistent storage.

Kafka Connect is an open-source component of Apache Kafka that provides a framework for connecting with external systems such as databases, key-value stores, search indexes, and file systems. However, manually running Kafka Connect clusters requires you to plan and provision the required infrastructure, deal with cluster operations, and scale it in response to load changes.

Today, we’re announcing a new capability that makes it easier to manage Kafka Connect clusters. MSK Connect allows you to configure and deploy a connector using Kafka Connect with a just few clicks. MSK Connect provisions the required resources and sets up the cluster. It continuously monitors the health and delivery state of connectors, patches and manages the underlying hardware, and auto-scales connectors to match changes in throughput. As a result, you can focus your resources on building applications rather than managing infrastructure.

MSK Connect is fully compatible with Kafka Connect, which means you can migrate your existing connectors without code changes. You don’t need an MSK cluster to use MSK Connect. It supports Amazon MSK, Apache Kafka, and Apache Kafka compatible clusters as sources and sinks. These clusters can be self-managed or managed by AWS partners and 3rd parties as long as MSK Connect can privately connect to the clusters.

Using MSK Connect with Amazon Aurora and Debezium
To test MSK Connect, I want to use it to stream data change events from one of my databases. To do so, I use Debezium, an open-source distributed platform for change data capture built on top of Apache Kafka.

I use a MySQL-compatible Amazon Aurora database as the source and the Debezium MySQL connector with the setup described in this architectural diagram:

To use my Aurora database with Debezium, I need to turn on binary logging in the DB cluster parameter group. I follow the steps in the How do I turn on binary logging for my Amazon Aurora MySQL cluster article.

Next, I have to create a custom plugin for MSK Connect. A custom plugin is a set of JAR files that contain the implementation of one or more connectors, transforms, or converters. Amazon MSK will install the plugin on the workers of the connect cluster where the connector is running.

From the Debezium website, I download the MySQL connector plugin for the latest stable release. Because MSK Connect accepts custom plugins in ZIP or JAR format, I convert the downloaded archive to ZIP format and keep the JARs files in the main directory:

$ tar xzf debezium-connector-mysql-1.6.1.Final-plugin.tar.gz
$ cd debezium-connector-mysql
$ zip -9 ../debezium-connector-mysql-1.6.1.zip *
$ cd ..

Then, I use the AWS Command Line Interface (CLI) to upload the custom plugin to an Amazon Simple Storage Service (Amazon S3) bucket in the same AWS Region I am using for MSK Connect:

$ aws s3 cp debezium-connector-mysql-1.6.1.zip s3://my-bucket/path/

On the Amazon MSK console there is a new MSK Connect section. I look at the connectors and choose Create connector. Then, I create a custom plugin and browse my S3 buckets to select the custom plugin ZIP file I uploaded before.

I enter a name and a description for the plugin and then choose Next.

Now that the configuration of the custom plugin is complete, I start the creation of the connector. I enter a name and a description for the connector.

I have the option to use a self-managed Apache Kafka cluster or one that is managed by MSK. I select one of my MSK cluster that is configured to use IAM authentication. The MSK cluster I select is in the same virtual private cloud (VPC) as my Aurora database. To connect, the MSK cluster and Aurora database use the default security group for the VPC. For simplicity, I use a cluster configuration with auto.create.topics.enable set to true.

In Connector configuration, I use the following settings:

connector.class=io.debezium.connector.mysql.MySqlConnector
tasks.max=1
database.hostname=<aurora-database-writer-instance-endpoint>
database.port=3306
database.user=my-database-user
database.password=my-secret-password
database.server.id=123456
database.server.name=ecommerce-server
database.include.list=ecommerce
database.history.kafka.topic=dbhistory.ecommerce
database.history.kafka.bootstrap.servers=<bootstrap servers>
database.history.consumer.security.protocol=SASL_SSL
database.history.consumer.sasl.mechanism=AWS_MSK_IAM
database.history.consumer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
database.history.consumer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
database.history.producer.security.protocol=SASL_SSL
database.history.producer.sasl.mechanism=AWS_MSK_IAM
database.history.producer.sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
database.history.producer.sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
include.schema.changes=true

Some of these settings are generic and should be specified for any connector. For example:

connector.class is the Java class of the connector.
tasks.max is the maximum number of tasks that should be created for this connector.

Other settings are specific to the Debezium MySQL connector:

The database.hostname contains the writer instance endpoint of my Aurora database.
The database.server.name is a logical name of the database server. It is used for the names of the Kafka topics created by Debezium.
The database.include.list contains the list of databases hosted by the specified server.
The database.history.kafka.topic is a Kafka topic used internally by Debezium to track database schema changes.
The database.history.kafka.bootstrap.servers contains the bootstrap servers of the MSK cluster.
The final eight lines (database.history.consumer.* and database.history.producer.*) enable IAM authentication to access the database history topic.

In Connector capacity, I can choose between autoscaled or provisioned capacity. For this setup, I choose Autoscaled and leave all other settings at their defaults.

With autoscaled capacity, I can configure these parameters:

MSK Connect Unit (MCU) count per worker – Each MCU provides 1 vCPU of compute and 4 GB of memory.
The minimum and maximum number of workers.
Autoscaling utilization thresholds – The upper and lower target utilization thresholds on MCU consumption in percentage to trigger auto scaling.

There is a summary of the minimum and maximum MCUs, memory, and network bandwidth for the connector.

For Worker configuration, you can use the default one provided by Amazon MSK or provide your own configuration. In my setup, I use the default one.

In Access permissions, I create a IAM role. In the trusted entities, I add kafkaconnect.amazonaws.com to allow MSK Connect to assume the role.

The role is used by MSK Connect to interact with the MSK cluster and other AWS services. For my setup, I add:

Permissions to write logs to a Amazon CloudWatch log group I created earlier.
Permissions to authenticate to my MSK cluster through IAM.

The Debezium connector needs access to the cluster configuration to find the replication factor to use to create the history topic. For this reason, I add to the permissions policy the kafka-cluster:DescribeClusterDynamicConfiguration action (equivalent Apache Kafka’s DESCRIBE_CONFIGS cluster ACL).

Depending on your configuration, you might need to add more permissions to the role (for example, in case the connector needs access to other AWS resources such as an S3 bucket). If that is the case, you should add permissions before creating the connector.

In Security, the settings for authentication and encryption in transit are taken from the MSK cluster.

In Logs, I choose to deliver logs to CloudWatch Logs to have more information on the execution of the connector. By using CloudWatch Logs, I can easily manage retention and interactively search and analyze my log data with CloudWatch Logs Insights. I enter the log group ARN (it’s the same log group I used before in the IAM role) and then choose Next.

I review the settings and then choose Create connector. After a few minutes, the connector is running.

Testing MSK Connect with Amazon Aurora and Debezium
Now let’s test the architecture I just set up. I start an Amazon Elastic Compute Cloud (Amazon EC2) instance to update the database and start a couple of Kafka consumers to see Debezium in action. To be able to connect to both the MSK cluster and the Aurora database, I use the same VPC and assign the default security group. I also add another security group that gives me SSH access to the instance.

I download a binary distribution of Apache Kafka and extract the archive in the home directory:

$ tar xvf kafka_2.13-2.7.1.tgz

To use IAM to authenticate with the MSK cluster, I follow the instructions in the Amazon MSK Developer Guide to configure clients for IAM access control. I download the latest stable release of the Amazon MSK Library for IAM:

$ wget https://github.com/aws/aws-msk-iam-auth/releases/download/1.1.0/aws-msk-iam-auth-1.1.0-all.jar

In the ~/kafka_2.13-2.7.1/config/ directory I create a client-config.properties file to configure a Kafka client to use IAM authentication:

# Sets up TLS for encryption and SASL for authN.
security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required;

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler

I add a few lines to my Bash profile to:

Add Kafka binaries to the PATH.
Add the MSK Library for IAM to the CLASSPATH.
Create the BOOTSTRAP_SERVERS environment variable to store the bootstrap servers of my MSK cluster.

$ cat >> ~./bash_profile
export PATH=~/kafka_2.13-2.7.1/bin:$PATH
export CLASSPATH=/home/ec2-user/aws-msk-iam-auth-1.1.0-all.jar
export BOOTSTRAP_SERVERS=<bootstrap servers>

Then, I open three terminal connections to the instance.

In the first terminal connection, I start a Kafka consumer for a topic with the same name as the database server (ecommerce-server). This topic is used by Debezium to stream schema changes (for example, when a new table is created).

$ cd ~/kafka_2.13-2.7.1/
$ kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVERS \
                            --consumer.config config/client-config.properties \
                            --topic ecommerce-server --from-beginning

In the second terminal connection, I start another Kafka consumer for a topic with a name built by concatenating the database server (ecommerce-server), the database (ecommerce), and the table (orders). This topic is used by Debezium to stream data changes for the table (for example, when a new record is inserted).

$ cd ~/kafka_2.13-2.7.1/
$ kafka-console-consumer.sh --bootstrap-server $BOOTSTRAP_SERVERS \
                            --consumer.config config/client-config.properties \
                            --topic ecommerce-server.ecommerce.orders --from-beginning

In the third terminal connection, I install a MySQL client using the MariaDB package and connect to the Aurora database:

$ sudo yum install mariadb
$ mysql -h <aurora-database-writer-instance-endpoint> -u <database-user> -p

From this connection, I create the ecommerce database and a table for my orders:

CREATE DATABASE ecommerce;

USE ecommerce

CREATE TABLE orders (
       order_id VARCHAR(255),
       customer_id VARCHAR(255),
       item_description VARCHAR(255),
       price DECIMAL(6,2),
       order_date DATETIME DEFAULT CURRENT_TIMESTAMP
);

These database changes are captured by the Debezium connector managed by MSK Connect and are streamed to the MSK cluster. In the first terminal, consuming the topic with schema changes, I see the information on the creation of database and table:

Struct{source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202831473,db=ecommerce,server_id=1980402433,file=mysql-bin-changelog.000003,pos=9828,row=0},databaseName=ecommerce,ddl=CREATE DATABASE ecommerce,tableChanges=[]}
Struct{source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202878811,db=ecommerce,table=orders,server_id=1980402433,file=mysql-bin-changelog.000003,pos=10002,row=0},databaseName=ecommerce,ddl=CREATE TABLE orders ( order_id VARCHAR(255), customer_id VARCHAR(255), item_description VARCHAR(255), price DECIMAL(6,2), order_date DATETIME DEFAULT CURRENT_TIMESTAMP ),tableChanges=[Struct{type=CREATE,id="ecommerce"."orders",table=Struct{defaultCharsetName=latin1,primaryKeyColumnNames=[],columns=[Struct{name=order_id,jdbcType=12,typeName=VARCHAR,typeExpression=VARCHAR,charsetName=latin1,length=255,position=1,optional=true,autoIncremented=false,generated=false}, Struct{name=customer_id,jdbcType=12,typeName=VARCHAR,typeExpression=VARCHAR,charsetName=latin1,length=255,position=2,optional=true,autoIncremented=false,generated=false}, Struct{name=item_description,jdbcType=12,typeName=VARCHAR,typeExpression=VARCHAR,charsetName=latin1,length=255,position=3,optional=true,autoIncremented=false,generated=false}, Struct{name=price,jdbcType=3,typeName=DECIMAL,typeExpression=DECIMAL,length=6,scale=2,position=4,optional=true,autoIncremented=false,generated=false}, Struct{name=order_date,jdbcType=93,typeName=DATETIME,typeExpression=DATETIME,position=5,optional=true,autoIncremented=false,generated=false}]}}]}

Then, I go back to the database connection in the third terminal to insert a few records in the orders table:

INSERT INTO orders VALUES ("123456", "123", "A super noisy mechanical keyboard", "50.00", "2021-08-16 10:11:12");
INSERT INTO orders VALUES ("123457", "123", "An extremely wide monitor", "500.00", "2021-08-16 11:12:13");
INSERT INTO orders VALUES ("123458", "123", "A too sensible microphone", "150.00", "2021-08-16 12:13:14");

In the second terminal, I see the information on the records inserted into the orders table:

Struct{after=Struct{order_id=123456,customer_id=123,item_description=A super noisy mechanical keyboard,price=50.00,order_date=1629108672000},source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202993000,db=ecommerce,table=orders,server_id=1980402433,file=mysql-bin-changelog.000003,pos=10464,row=0},op=c,ts_ms=1629202993614}
Struct{after=Struct{order_id=123457,customer_id=123,item_description=An extremely wide monitor,price=500.00,order_date=1629112333000},source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202993000,db=ecommerce,table=orders,server_id=1980402433,file=mysql-bin-changelog.000003,pos=10793,row=0},op=c,ts_ms=1629202993621}
Struct{after=Struct{order_id=123458,customer_id=123,item_description=A too sensible microphone,price=150.00,order_date=1629115994000},source=Struct{version=1.6.1.Final,connector=mysql,name=ecommerce-server,ts_ms=1629202993000,db=ecommerce,table=orders,server_id=1980402433,file=mysql-bin-changelog.000003,pos=11114,row=0},op=c,ts_ms=1629202993630}

My change data capture architecture is up and running and the connector is fully managed by MSK Connect.

Availability and Pricing
MSK Connect is available in the following AWS Regions: Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), EU (Frankfurt), EU (Ireland), EU (London), EU (Paris), EU (Stockholm), South America (Sao Paulo), US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon). For more information, see the AWS Regional Services List.

With MSK Connect you pay for what you use. The resources used by your connectors can be scaled automatically based on your workload. For more information, see the Amazon MSK pricing page.

Simplify the management of your Apache Kafka connectors today with MSK Connect.

— Danilo

Discovering what’s slowing down your website with Web Analytics

2021-09-15 Joao Sousa Botto

Post Syndicated from Joao Sousa Botto original https://blog.cloudflare.com/web-analytics-vitals-explorer/

Discovering what’s slowing down your website with Web Analytics

Web Analytics is Cloudflare’s privacy-focused real user measurement solution. It leverages a lightweight JavaScript beacon and does not use any client-side state, such as cookies or localStorage, to collect usage metrics. Nor does it “fingerprint” individuals via their IP address, User Agent string, or any other data.

Cloudflare Web Analytics makes essential web analytics, such as the top-performing pages on your website and top referrers, available to everyone for free, and it’s becoming more powerful than ever.

Focusing on Performance

Earlier this year we merged Web Analytics with our Browser Insights product, which enabled customers proxying their websites through Cloudflare to evaluate visitors’ experience on their web properties through Core Web Vitals such as Largest Contentful Paint (LCP) and First Input Delay (FID).

It was important to bring the Core Web Vitals performance measurements into Web Analytics given the outsized impact that page load times have on bounce rates. A page load time increase from 1s to 3s increases bounce rates by 32% and from 1s to 6s increases it by 106% (source).

Now that you know the impact a slow-loading web page can have on your visitors, it’s time for us to make it a no-brainer to take action. Read on.

Becoming Action-Oriented

We believe that, to deliver the most value to our users, the product should facilitate the following process:

Measure the real user experience
Grade this experience — is it satisfactory or in need of improvement?
Provide actionable insights — what part of the web page should be tweaked to improve the user experience?
Repeat

And it all starts with Web Analytics Vitals Explorer, which started rolling out today.

Introducing Web Analytics Vitals Explorer

Vitals Explorer enables you to easily pinpoint which elements on your pages are affecting users the most, with accurate measurements from the visitors perspective and an easy-to-read impact grading.

To do that, we have automatically updated the Web Analytics JavaScript beacon so that it collects the relevant vital measurements from the browser. As always, we are not collecting any information that would invade your visitors’ privacy.

Usage

Once this new beacon is updated on your sites — and again the update will happen transparently to you — you can then navigate to the Core Web Vitals page on Web Analytics. When entering that page, you will see three graphs grading the user experience for Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS). Below each graph you can see the debug section with the top five elements with a negative impact on the metric. Lastly, when clicking on either of these elements shown in the data table, you will be presented with its impact and exact paths so that you can easily decide whether this is worth keeping on your website in its current format.

In addition to this new Core Web Vitals content, we have also added First Paint and First Contentful Paint to the Page Load Time page. When you navigate to this page you will now see the page load summary and a graph representing page load timing. These will allow you to quickly identify any regressions to these important performance metrics.

Measurement details

This additional debugging information for Core Web Vitals is measured during the lifespan of the page (until the user leaves the tab or closes the browser window, which updates visibilityState to a hidden state).

Here’s what we collect:

Common for all Core Web Vitals

Element is a CSS selector representing the DOM node. With this string, the developer can use `document.querySelector(<element_name>)` in their browser’s dev console to find out which DOM node has a negative impact on your scores/values.
Path is the URL path at the time the Core Web Vitals are captured.
Value is the metric value for each Core Web Vitals. This value is in milliseconds for LCP or FID and a score for CLS (Cumulative Layout Shift).

Largest Contentful Paint

URL is the source URL (such as image, text, web fonts).
Size is the source object’s size in bytes.

First Input Delay

Name is the type of event (such as mousedown, keydown, pointerdown).

Cumulative Layout Shift

Layout information is a JSON value that includes width, height, x axis position, y axis position, left, right, top, and bottom. You are able to observe layout shifts that happen on the page by observing these values.

CurrentRect is the largest source element’s layout information after the shift. This JSON value is shown as Current under Layout Shifts section in the Web Analytics UI.
PreviousRect is the largest source element’s layout information before the shift. This JSON value is shown as Previous under Layout Shifts section in the Web Analytics UI.

Paint Timings

Additionally, we have added two important paint timings

First Paint is the time between navigation and when the browser renders the first pixels to the screen.
First Contentful Paint is the time when the browser renders the first bit of content from the DOM.

A lot of this is based on standard browser measurements, which you can read about in detail on this blog post from Google.

Moving forward

And we are by no means done. Moving forward, we will bring this structured approach with grading and actionable insights into as Web Analytics measurements as possible, and keep guiding you through how to improve your visitors’ experience. So stay tuned.
And in the meantime, do let us know what you think about this feature and ask questions on the community forums.

How we built Instant Logs

2021-09-14 Ben Yule

Post Syndicated from Ben Yule original https://blog.cloudflare.com/how-we-built-instant-logs/

How we built Instant Logs

As a developer, you may be all too familiar with the stress of responding to a major service outage, becoming aware of an ongoing security breach, or simply dealing with the frustration of setting up a new service for the first time. When confronted with these situations, you want a real-time view into the events flowing through your network, so you can receive, process, and act on information as quickly as possible.

If you have a UNIX mindset, you’ll be familiar with tailing web service logs and searching for patterns using grep. With distributed systems like Cloudflare’s edge network, this task becomes much more complex because you’ll either need to log in to thousands of servers, or ship all the logs to a single place.

This is why we built Instant Logs. Instant Logs removes all barriers to accessing your Cloudflare logs, giving you a complete platform to view your HTTP logs in real time, with just a single click, right from within Cloudflare’s dashboard. Powerful filters then let you drill into specific events or search for patterns, and act on them instantly.

The Challenge

Today, Cloudflare’s Logpush product already gives customers the ability to ship their logs to a third-party analytics or storage provider of their choosing. While this system is already exceptionally fast, delivering logs in about 15s on average, it is optimized for completeness and the utmost certainty that your data is reliably making it to its destination. It is the ideal solution for after things have settled down, and you want to perform a forensic deep dive or retrospective.

We originally aimed to extend this system to provide our real-time logging capabilities, but we soon realized the objectives were inherently at odds with each other. In order to get all of your data, to a single place, all the time, the laws of the universe require that latencies be introduced into the system. We needed a complementary solution, with its own unique set of objectives.

This ultimately boiled down to the following

It has to be extremely fast, in human terms. This means average latencies between an event occurring at the edge and being received by the client should be under three seconds.
We wanted the system design to be simple, and communication to be as direct to the client as possible. This meant operating the dataplane entirely at the edge, eliminating unnecessary round trips to a core data center.
The pipeline needs to provide sensible results on properties of all sizes, ranging from a few requests per day to hundreds of thousands of requests per second.
The pipeline must support a broad set of user-definable filters that are applied before any sampling occurs, such that a user can target and receive exactly what they want.

Workers and Durable Objects

Our existing Logpush pipeline relies heavily on Kafka to provide sharding, buffering, and aggregation at a single, central location. While we’ve had excellent results using Kafka for these pipelines, the clusters are optimized to run only within our core data centers. Using Kafka would require extra hops to far away data centers, adding a latency penalty we were not willing to incur.

In order to keep the data plane running on the edge, we needed primitives that would allow us to perform some of the same key functions we needed out of Kafka. This is where Workers and the recently released Durable Objects come in. Workers provide an incredibly simple to use, highly elastic, edge-native, compute platform we can use to receive events, and perform transformations. Durable Objects, through their global uniqueness, allow us to coordinate messages streaming from thousands of servers and route them to a singular object. This is where aggregation and buffering are performed, before finally pushing to a client over a thin WebSocket. We get all of this, without ever having to leave the edge!

Let’s walk through what this looks like in practice.

A Simple Start

Imagine a simple scenario in which we have a single web server which produces log messages, and a single client which wants to consume them. This can be implemented by creating a Durable Object, which we will refer to as a Durable Session, that serves as the point of coordination between the server and client. In this case, the client initiates a WebSocket connection with the Durable Object, and the server sends messages to the Durable Object over HTTP, which are then forwarded directly to the client.

This model is quite quick and introduces very little additional latency other than what would be required to send a payload directly from the web server to the client. This is thanks to the fact that Durable Objects are generally located at or near the data center where they are first requested. At least in human terms, it’s instant. Adding more servers to our model is also trivial. As the additional servers produce events, they will all be routed to the same Durable Object, which merges them into a single stream, and sends them to the client over the same WebSocket.

Durable Objects are inherently single threaded. As the number of servers in our simple example increases, the Durable Object will eventually saturate its CPU time and will eventually start to reject incoming requests. And even if it didn’t, as data volumes increase, we risk overwhelming a client’s ability to download and render log lines. We’ll handle this in a few different ways.

Honing in on specific events

Filtering is the most simple and obvious way to reduce data volume before it reaches the client. If we can filter out the noise, and stream only the events of interest, we can substantially reduce volume. Performing this transformation in the Durable Object itself will provide no relief from CPU saturation concerns. Instead, we can push this filtering out to an invoking Worker, which will run many filter operations in parallel, as it elastically scales to process all the incoming requests to the Durable Object. At this point, our architecture starts to look a lot like the MapReduce pattern!

Scaling up with shards

Ok, so filtering may be great in some situations, but it’s not going to save us under all scenarios. We still need a solution to help us coordinate between potentially thousands of servers that are sending events every single second. Durable Objects will come to the rescue, yet again. We can implement a sharding layer consisting of Durable Objects, we will call them Durable Shards, that effectively allow us to reduce the number of requests being sent to our primary object.

But how do we implement this layer if Durable Objects are globally unique? We first need to decide on a shard key, which is used to determine which Durable Object a given message should first be routed to. When the Worker processes a message, the key will be added to the name of the downstream Durable Object. Assuming our keys are well-balanced, this should effectively reduce the load on the primary Durable Object by approximately 1/N.

Reaching the moon by sampling

But wait, there’s more to do. Going back to our original product requirements, “The pipeline needs to provide sensible results on properties of all sizes, ranging from a few requests per day to hundreds of thousands of requests per second.” With the system as designed so far, we have the technical headroom to process an almost arbitrary number of logs. However, we’ve done nothing to reduce the absolute volume of messages that need to be processed and sent to the client, and at high log volumes, clients would quickly be overwhelmed. To deliver the interactive, instant user experience customers expect, we need to roll up our sleeves one more time.

This is where our final trick, sampling, comes into play.

Up to this point, when our pipeline saturates, it still makes forward progress by dropping excess data as the Durable Object starts to refuse connections. However, this form of ‘uncontrolled shedding’ is dangerous because it causes us to lose information. When we drop data in this way, we can’t keep a record of the data we dropped, and we cannot infer things about the original shape of the traffic from the messages that we do receive. Instead, we implement a form of ‘controlled’ sampling, which still preserves the statistics, and information about the original traffic.

For Instant Logs, we implement a sampling technique called Reservoir Sampling. Reservoir sampling is a form of dynamic sampling that has this amazing property of letting us pick a specific k number of items from a stream of unknown length n, with a single pass through the data. By buffering data in the reservoir, and flushing it on a short (sub second) time interval, we can output random samples to the client at the maximum data rate of our choosing. Sampling is implemented in both layers of Durable Objects.

Information about the original traffic shape is preserved by assigning a sample interval to each line, which is equivalent to the number of samples that were dropped for this given sample to make it through, or 1/probability. The actual number of requests can then be calculated by taking the sum of all sample intervals within a time window. This technique adds a slight amount of latency to the pipeline to account for buffering, but enables us to point an event source of nearly any size at the pipeline, and we can expect it will be handled in a sensible, controlled way.

Putting it all together

What we are left with is a pipeline that sensibly handles wildly different volumes of traffic, from single digits to hundreds of thousands of requests a second. It allows the user to pinpoint an exact event in a sea of millions, or calculate summaries over every single one. It delivers insight within seconds, all without ever having to do more than click a button.

Best of all? Workers and Durable Objects handle this workload with aplomb and no tuning, and the available developer tooling allowed me to be productive from my first day writing code targeting the Workers ecosystem.

How to get involved?

We’ll be starting our Beta for Instant Logs in a couple of weeks. Join the waitlist to get notified about when you can get access!

If you want to be part of building the future of data at Cloudflare, we’re hiring engineers for our data team in Lisbon, London, Austin, and San Francisco!