Tag Archives: Analytics

Handle fast-changing reference data in an AWS Glue streaming ETL job

Post Syndicated from Jerome Rajan original https://aws.amazon.com/blogs/big-data/handle-fast-changing-reference-data-in-an-aws-glue-streaming-etl-job/

Streaming ETL jobs in AWS Glue can consume data from streaming sources such as Amazon Kinesis and Apache Kafka, clean and transform those data streams in flight, and continuously load the results into Amazon Simple Storage Service (Amazon S3) data lakes, data warehouses, or other data stores.

The always-on nature of streaming jobs poses a unique challenge when handling fast-changing reference data that is used to enrich data streams within the AWS Glue streaming ETL job. AWS Glue processes real-time data from Amazon Kinesis Data Streams in micro-batches. The foreachBatch method used to process each micro-batch handles only the incoming data stream, so the reference data has to be loaded and refreshed separately.

This post proposes a solution to enrich streaming data with frequently changing reference data in an AWS Glue streaming ETL job.

You can enrich data streams with changing reference data in the following ways:

  • Read the reference dataset with every micro-batch, which causes redundant reads and drives up the number of read requests. This approach is expensive and inefficient, and isn’t covered in this post.
  • Design a method to tell the AWS Glue streaming job that the reference data has changed and refresh it only when needed. This approach is cost-effective and highly available. We recommend using this approach.

Solution overview

This post uses DynamoDB Streams to capture changes to reference data, as illustrated in the following architecture diagram. For more information about DynamoDB Streams, see DynamoDB Streams Use Cases and Design Patterns.

The workflow contains the following steps:

  1. A user or application updates or creates a new item in the DynamoDB table.
  2. DynamoDB Streams is used to identify changes in the reference data.
  3. A Lambda function is invoked every time a change occurs in the reference data.
  4. The Lambda function captures the event containing the changed record, creates a “change file”, and places it in an Amazon S3 bucket (a sketch of such a function follows this list).
  5. The AWS Glue job is designed to check for this change file in every micro-batch. The moment it sees the change flag, AWS Glue initiates a refresh of the DynamoDB data before processing any further records in the stream.
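
The following is a minimal sketch of such a change-handler function, written in Python with boto3. It is not the demo-reference-data-change-handler function that the CloudFormation template deploys; the bucket name, flag key, and environment variable names are assumptions for illustration.

import json
import os

import boto3

s3 = boto3.client("s3")

# Assumed configuration; the deployed function may use different names.
BUCKET = os.environ.get("CHANGE_FLAG_BUCKET", "demo-bucket-123456789012")
FLAG_KEY = os.environ.get("CHANGE_FLAG_KEY", "reference-data/change-flag")


def lambda_handler(event, context):
    """Invoked by the DynamoDB stream whenever the reference table changes."""
    changed = [
        record["dynamodb"].get("NewImage", {})
        for record in event.get("Records", [])
        if record["eventName"] in ("INSERT", "MODIFY")
    ]
    if changed:
        # The presence of this small "change file" tells the Glue streaming
        # job to refresh its cached copy of the reference data.
        s3.put_object(
            Bucket=BUCKET,
            Key=FLAG_KEY,
            Body=json.dumps({"changedItems": len(changed)}).encode("utf-8"),
        )
    return {"changedItems": len(changed)}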

This post is accompanied by an AWS CloudFormation template that creates resources as described in the solution architecture:

  • A DynamoDB table named ProductPriority with a few items loaded
  • An S3 bucket named demo-bucket-<AWS AccountID>
  • Two Lambda functions:
    • demo-glue-script-creator-lambda
    • demo-reference-data-change-handler
  • A Kinesis data stream named SourceKinesisStream
  • An AWS Glue Data Catalog database called my-database
  • Two Data Catalog tables
  • An AWS Glue job called demo-glue-job-<AWS AccountID>. The code for the AWS Glue job can be found at this link.
  • Two AWS Identity and Access Management (IAM) roles:
    • A role for the Lambda functions to access Kinesis, Amazon S3, and DynamoDB Streams
    • A role for the AWS Glue job to access Kinesis, Amazon S3, and DynamoDB
  • An Amazon Kinesis Data Generator (KDG) account with a user created through Amazon Cognito to generate a sample data stream

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • The IAM user should have permissions to create the required roles
  • Permission to create a CloudFormation stack and the AWS resources detailed in the solution overview

Create resources with AWS CloudFormation

To deploy the solution, complete the following steps:

  1. Choose Launch Stack:
  2. Set up an Amazon Cognito user pool and test if you can access the KDG URL specified in the stack’s output tab. Furthermore, validate if you can log in to KDG using the credentials provided while creating the stack.

You should now have the required resources available in your AWS account.

  1. Verify this list with the resources in the output section of the CloudFormation stack.

Sample data

Sample reference data has already been loaded into the reference data store. The following screenshot shows an example.

The priority value may change frequently based on the time of the day, the day of the week, or other factors that drive demand and supply.

The objective is to accommodate these changes to the reference data seamlessly into the pipeline.

Generate a randomized stream of events into Kinesis

Next, we simulate a sample stream of data into Kinesis. For detailed instructions, see Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator. For this post, we define the structure of the simulated orders data using a parameterized template.

  1. On the KDG console, choose the Region where the source Kinesis stream is located.
  2. Choose your delivery stream.
  3. Enter the following template into the Record template field:
    {
    "dish": "{{random.arrayElement(["pizza","burger","salad","donut","ice-cream"])}}"
    ,"cost": {{random.number({"min":10,"max":150})}}
    ,"customer_id":{{random.number({"min":1,"max":10000})}}
    }

  4. Choose Test template, then choose Send data.

KDG should start sending a stream of randomly generated orders to the Kinesis data stream.

Run the AWS Glue streaming job

The CloudFormation stack created an AWS Glue job that reads from the Kinesis data stream through a Data Catalog table, joins with the reference data in DynamoDB, and writes the result to an S3 bucket. To run the job, complete the following steps:

  1. On the AWS Glue console, under ETL in the navigation pane, choose Jobs.
  2. Select the job demo-glue-job-<AWS AccountID>.
  3. On the Actions menu, choose Run job.

In addition to the enrichment, the job includes a check that monitors an Amazon S3 prefix for a “Change Flag” file. This file is created by the Lambda function, which is invoked by the DynamoDB stream whenever there is an update or a new reference item.
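
As a rough illustration of that check, the following PySpark-style sketch shows how the micro-batch handler might look. It is not the script generated by demo-glue-script-creator-lambda; the bucket, flag key, output path, and join columns are placeholders based on the dataset used in this post.

import boto3
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import DataFrame

glue_context = GlueContext(SparkContext.getOrCreate())
s3 = boto3.client("s3")

FLAG_BUCKET = "demo-bucket-123456789012"   # placeholder
FLAG_KEY = "reference-data/change-flag"    # placeholder
reference_df = None                        # cached ProductPriority data


def load_reference_data() -> DataFrame:
    """Read the ProductPriority table through the Glue DynamoDB connector."""
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={"dynamodb.input.tableName": "ProductPriority"},
    )
    return dyf.toDF()


def change_flag_exists() -> bool:
    resp = s3.list_objects_v2(Bucket=FLAG_BUCKET, Prefix=FLAG_KEY)
    return resp.get("KeyCount", 0) > 0


# Registered with the streaming query, for example through
# GlueContext.forEachBatch or DataFrame.writeStream.foreachBatch.
def process_batch(batch_df: DataFrame, batch_id: int) -> None:
    global reference_df
    if reference_df is None or change_flag_exists():
        reference_df = load_reference_data()
        # Remove the flag so the reference data isn't reloaded redundantly.
        s3.delete_object(Bucket=FLAG_BUCKET, Key=FLAG_KEY)
    enriched = batch_df.join(
        reference_df, batch_df.dish == reference_df.item, "left"
    )
    enriched.write.mode("append").partitionBy("item", "priority").parquet(
        "s3://demo-bucket-123456789012/output/"
    )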

Investigate the target data in Amazon S3

The following is a screenshot of the data being loaded in real time into the item=burger partition. The priority was set to medium in the reference data, and the orders go into the corresponding partition.

Update the reference data

Now we update the priority for burgers to high in the DynamoDB table through the console while the orders are streaming into the pipeline.

Use the following command to perform the update through Amazon CloudShell. Change the Region to the appropriate value.

aws dynamodb update-item --table-name "ProductPriority" --key '{"item":{"S":"burger"}, "price":{"N":"100"}}' --update-expression "SET priority = :s" --expression-attribute-values '{":s": {"S": "high"}}' --return-values ALL_NEW --region us-east-1

Verify that the data got updated.

Navigate to the target S3 folder to confirm the contents. The AWS Glue job should have started sending the orders for burgers into the high partition.

The Lambda function is invoked by the DynamoDB stream and places a “Change Flag” file in an Amazon S3 bucket. The AWS Glue job refreshes the reference data and deletes the file to avoid redundant refreshes.

Using this pattern for reference data in Amazon S3

If the reference data is stored in an S3 bucket, create an Amazon S3 event notification that identifies changes to the prefix where the reference data is stored. The event notification invokes a Lambda function that places the change flag file in the S3 prefix monitored by the AWS Glue job.
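
If you script that setup, the notification could be configured roughly as follows with boto3; the bucket name, prefix, and Lambda function ARN are placeholders. Note that the target Lambda function also needs a resource-based policy that allows Amazon S3 to invoke it.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-reference-data-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:reference-data-change-handler",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "reference-data/"}
                        ]
                    }
                },
            }
        ]
    },
)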

Cleaning up

To avoid incurring future charges, delete the resources. You can do this by deleting the CloudFormation stack.

Conclusion

In this post, we discussed approaches to handle fast-changing reference data stored in DynamoDB or Amazon S3. We demonstrated a simple use case that implements this pattern.

Note that DynamoDB Streams writes stream records in near-real time. When designing your solution, account for a minor delay between the actual update in DynamoDB and the write into the DynamoDB stream.


About the Authors

Jerome Rajan is a Lead Data Analytics Consultant at AWS. He helps customers design & build scalable analytics solutions and migrate data pipelines and data warehouses into the cloud. In an alternate universe, he is a World Chess Champion!

Dipankar Ghosal is a Principal Architect at Amazon Web Services and is based out of Minneapolis, MN. He has a focus in analytics and enjoys helping customers solve their unique use cases. When he’s not working, he loves going hiking with his wife and daughter.

Gain insights into your Amazon Kinesis Data Firehose delivery stream using Amazon CloudWatch

Post Syndicated from Alon Gendler original https://aws.amazon.com/blogs/big-data/gain-insights-into-your-amazon-kinesis-data-firehose-delivery-stream-using-amazon-cloudwatch/

The volume of data being generated globally is growing at an ever-increasing pace. Data is generated to support an increasing number of use cases, such as IoT, advertisement, gaming, security monitoring, machine learning (ML), and more. The growth of these use cases drives both the volume and velocity of streaming data and requires companies to capture, process, transform, analyze, and load the data into various data stores in near-real time.

Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. As the volume of the data you stream into Kinesis Data Firehose grows, you should gain insights and monitor the health of your data ingestion, transformation, and delivery.

In this post, we review the capabilities of using Firehose delivery stream metrics and the Amazon CloudWatch dashboard located on your Kinesis Data Firehose console. These capabilities allow you to create alerts when, for example, the destination you configured in Kinesis Data Firehose has missing privileges, misconfigurations, or other issues that Firehose detects and reports as delivery failures. Other errors can occur if you configured data transformation using Lambda and your Lambda function invocation fails, or if you reach the Kinesis Data Firehose quota limits associated with your AWS account. In these cases, the data delivery from Kinesis Data Firehose to its destination may be delayed or fail. The CloudWatch alerts described in this post help you identify such cases in a timely manner.

This post also covers the different proactive actions that you can take when alarms are being triggered, such as submitting a request to increase quota or adding exponential backoff to your data producers.

Monitoring the delivery streams and taking these actions makes sure that data is delivered to your destinations without interruptions, enabling your business to gain insights in near-real time.

Monitor data ingestion to Kinesis Data Firehose

You can deliver data from your data producers to Kinesis Data Firehose through Amazon Kinesis Data Streams (as described later in this post), using Kinesis Agent, or directly using the Kinesis Data Firehose API operations PutRecord and PutRecordBatch. When you use Kinesis Data Streams as a data source, Kinesis Data Firehose scales automatically as your Kinesis Data Stream scales. When using the API operations for direct ingestion, you need to check the quota limits associated with your AWS account to avoid API requests throttling. Depending on your data producer behavior, this throttling can cause your data producers to retry the operation, which results in a delay of the data delivery to your destination. This throttling can also result in data loss if your data producers don’t implement a retry mechanism.

To gain deeper insights into Firehose delivery stream usage, we provide additional CloudWatch metrics that help you monitor and proactively scale quota limits: ThrottledRecords, RecordsPerSecondLimit, BytesPerSecondLimit, and PutRequestsPerSecondLimit. You can use the CloudWatch metrics dashboard (on the Monitoring tab on your Kinesis Data Firehose console) to easily visualize current usage and the quota limits.

When ingesting data directly to your delivery stream using PutRecord or PutRecordBatch, you should monitor the ThrottledRecords metric. This metric represents the number of records that were actually throttled because data ingestion exceeded one of the delivery stream limits. Kinesis Data Firehose calculates the throttling rates during the ingestion at a 1-second granularity, but the data ingestion metrics we mentioned are aggregated and emitted to CloudWatch every 5 minutes. Because of that, you can get throttled within that 5-minute window even if the data ingestion metrics don’t show that you reached the limit.

To receive alerts before your data producers are actually throttled, you can use additional CloudWatch metrics to alert you when you’re about to reach one of the delivery stream limits. You can achieve this by using the CloudWatch metrics IncomingRecords, IncomingBytes, and IncomingPutRequests. To check the limits of these metrics, refer to Amazon Kinesis Data Firehose Quota.

You can use the following ingestion metrics and their corresponding limit metrics to create a CloudWatch alarm:

  • RecordsPerSecondLimit – The maximum number of records that can be ingested in a second (IncomingRecords)
  • BytesPerSecondLimit – The maximum volume of data that can be ingested in a second (IncomingBytes)
  • PutRequestsPerSecondLimit – The maximum number of successful PutRecord and PutRecordBatch API requests that can be performed in a second (IncomingPutRequests)

To set up an alarm that alerts you when your ingestion rates are close to a quota, you should look at the percentage relationship between the ingestion rate and its corresponding limit. Because Kinesis Data Firehose emits metrics to CloudWatch every 5 minutes, you need to divide your metric by the 5-minute aggregation period, expressed in seconds (300). For example, to generate an alert when the incoming records per second rate breaches 80% of your API operations quota, your CloudWatch alarm should be defined as follows:
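
The original post shows this alarm definition as a console screenshot. The following boto3 sketch creates an equivalent metric math alarm; the delivery stream name and SNS topic ARN are placeholders that you should replace with your own values.

import boto3

cloudwatch = boto3.client("cloudwatch")

STREAM = "my-delivery-stream"                                          # placeholder
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:firehose-alerts"   # placeholder

cloudwatch.put_metric_alarm(
    AlarmName=f"{STREAM}-records-per-second-quota-80pct",
    AlarmDescription="IncomingRecords rate is above 80% of RecordsPerSecondLimit",
    ComparisonOperator="GreaterThanThreshold",
    Threshold=80.0,
    EvaluationPeriods=1,
    AlarmActions=[SNS_TOPIC_ARN],
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Firehose",
                    "MetricName": "IncomingRecords",
                    "Dimensions": [{"Name": "DeliveryStreamName", "Value": STREAM}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "m2",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Firehose",
                    "MetricName": "RecordsPerSecondLimit",
                    "Dimensions": [{"Name": "DeliveryStreamName", "Value": STREAM}],
                },
                "Period": 300,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        },
        {
            # 100 * (records ingested per second) / (records-per-second quota)
            "Id": "e1",
            "Expression": "100*((m1/300)/m2)",
            "Label": "Percentage of records per second quota",
            "ReturnData": True,
        },
    ],
)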

This gives you a way to proactively understand how close your ingestion rates are to your delivery stream limits, and the flexibility to modify the percentage levels based on your use case. To prevent a throttling bottleneck, you should separately monitor the three delivery stream ingestion rate metrics we discussed.

Define alerts using CloudWatch alarms

You can define CloudWatch alarms manually through the AWS Management Console or by using AWS CloudFormation. In this post we cover both methods, starting with the CloudFormation template.

The following template creates your CloudWatch alarms, which you can review and customize to suit your needs.

During the stack creation process, you provide the Firehose delivery stream name that you want to monitor, and the quota percentage where you want to be notified when it’s being breached, such as 80%. After the stack creation is successful, you have four CloudWatch alarms ready.

To create your CloudWatch alarms manually through the console, complete the following steps:

  1. On the Kinesis Data Firehose console, find your delivery stream.
  2. On the Monitoring tab, choose the more options icon of the metric you want to monitor (for this example, we monitor incoming records per second).
  3. On the options menu, choose View in metrics.

On the CloudWatch console, you can see a graph that represents your current API operations (blue line) and the quota limit (red line).

  1. To create an alarm, choose Math expression.
  2. Select Common and choose Percentage.
  3. For the metric name, enter Percentage of records per second quota.
  4. We use the metric expression 100*(e1/m2), which represents the formula 100*(incoming records per second/RecordsPerSecondLimit) described earlier and reflects, as a percentage, how close you are to your maximum.
  5. Change the expression of the metric e1 from METRICS("m1")/300 to m1/300.

You can also change the Y axis label.

  1. On the Graph options tab, under Left Y Axis, for Label, enter Percentage.
  2. Now that you have the expression to use for the alarm, deselect every other expression and metric on the page.

The only expression selected should be the one you just created. You should now see the desired percentage, as in the following screenshot.

Create a CloudWatch alarm

You have now created an expression on your IncomingRecords metric and RecordsPerSecondLimit quota, which you can use as a base for the alarm. With this, you can configure the tolerance level that your business use case requires.

  1. Choose the alarm icon next to your expression.
  2. In the Specify metric and conditions section, choose to receive an alert when the alarms breach the 75% limit.
  3. In the Configure actions section, specify how to forward this alarm.

You can forward this alarm to your monitoring systems or to an email address through an Amazon Simple Notification Service (Amazon SNS) topic. For this post, we create a new SNS topic and subscribe an email address to it.

Actions you can take when approaching the limits

When you’re getting close to your limits, you can take several different actions, which we describe in this section.

Request a service quota increase

One action you can take when seeing an alert is to request an increase in quota using the Amazon Kinesis Data Firehose Limits form. The three quotas scale proportionally. For example, if you increase the throughput quota in US East (N. Virginia), US West (Oregon), or Europe (Ireland) from 5 MiB/second to 10 MiB/second, the other two quotas increase from 2,000 requests/second to 4,000 requests/second and from 500,000 records/second to 1 million records/second. For more information about the service quota limits by AWS Region, see Amazon Kinesis Data Firehose Quota.

Use the PutRecordBatch API

If you use the PutRecord API call to deliver events to a Firehose delivery stream and you’re reaching the requests-per-second quota, consider using the PutRecordBatch API operation. PutRecordBatch writes multiple data records into a delivery stream in a single call to achieve higher throughput per producer than writing single records, and reduces the number of requests per second made to your delivery stream.

Implement exponential backoff

As we mentioned before, even when you’re monitoring your delivery stream, you can still have bursts in your data stream. This could be caused by sudden spikes in usage of your system or external events like high trading activity in financial markets. To protect the producers from multiple throttled records, you should implement an exponential backoff. Exponential backoff is a commonly used algorithm that you can use to decrease the rate of submitting records to Kinesis Data Firehose when being throttled, so that the producer can slowly retry in order to successfully send the records.

The following are the Kinesis Data Firehose API responses when records are throttled:

  • If you’re using the API operation PutRecord, the returned error from the service is ServiceUnavailableException with HTTP status code 500.
  • If you’re using PutRecordBatch, you should iterate through the RequestResponses array and look for individual PutRecordBatchResponseEntry with ErrorCode 500 and ErrorMessage ServiceUnavailableException. Also make sure to check the value of FailedPutCount in the response even when the API call succeeds.

In both cases, you should use exponential backoff and retry the operation. For more information about implementing exponential backoff, see Error retries and exponential backoff in AWS.
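
The following is a minimal sketch of that retry pattern around PutRecordBatch using boto3, with exponential backoff and jitter. Batch sizes, delays, and the treatment of error codes are illustrative and should be tuned for your producers.

import random
import time

import boto3

firehose = boto3.client("firehose")


def put_with_backoff(stream_name, payloads, max_attempts=8, base_delay=0.1):
    """Send byte payloads with PutRecordBatch, retrying only the failed entries."""
    pending = [{"Data": p} for p in payloads]
    for attempt in range(max_attempts):
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=pending
        )
        # Even when the call itself succeeds, individual records can fail,
        # so always check FailedPutCount.
        if response["FailedPutCount"] == 0:
            return
        # Keep only the records whose response entry carries an error code
        # (for example, when the record was throttled).
        pending = [
            record
            for record, result in zip(pending, response["RequestResponses"])
            if "ErrorCode" in result
        ]
        # Exponential backoff with jitter before the next attempt.
        time.sleep(min(base_delay * (2 ** attempt) + random.uniform(0, 0.1), 5.0))
    raise RuntimeError(f"{len(pending)} records not delivered after {max_attempts} attempts")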

Use Kinesis Data Streams with Kinesis Data Firehose

Kinesis Data Streams is a massively scalable and durable real-time data streaming service. Your data producers can produce data directly to Kinesis Data Streams, and you can configure Kinesis Data Firehose to consume the data from Kinesis Data Streams and deliver it to your destination. When you use Kinesis Data Streams as the source for the Firehose delivery stream, the throughput limits mentioned before don’t apply. You don’t need to worry about throughput limits because Kinesis Data Firehose scales automatically to match the number of shards your Kinesis data stream has.

If you’re attaching a Firehose delivery stream as a consumer of your Kinesis data stream, and you have multiple consumer applications that read data from that Kinesis data stream, such as AWS Lambda (see Using AWS Lambda with Amazon Kinesis), make sure that the combined consumers aren’t breaching the shard’s 2 MB/second total read rate. Otherwise, the Kinesis data stream throttles your consumer applications’ read throughput, including that of Kinesis Data Firehose.

If more read capacity is required, some consumers, such as Lambda (see AWS Lambda supports Kinesis Data Streams Enhanced Fan-Out and HTTP/2 for faster streaming) or custom consumers developed with the Kinesis Client Library, can use enhanced fan-out to get dedicated throughput from Kinesis Data Streams; enhanced fan-out currently isn’t supported by Kinesis Data Firehose. This feature provides those consumer applications an isolated connection to the stream with 2 MB/second of outbound throughput per shard, so they don’t impact other consumer applications that are reading from the shards.

If you need more ingest capacity, you can easily scale up the number of shards in the stream using the console or the UpdateShardCount API.
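
For example, a brief boto3 sketch of that scaling call (the stream name and target shard count are placeholders):

import boto3

kinesis = boto3.client("kinesis")

# Doubles the capacity of a 4-shard stream; UNIFORM_SCALING evenly
# redistributes the hash key space across the new shards.
kinesis.update_shard_count(
    StreamName="my-data-stream",   # placeholder
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)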

Monitor data delivery of Kinesis Data Firehose

In case of network timeouts, missing privileges, or misconfigurations of your delivery stream such as incorrect destination configuration or AWS Key Management Service (AWS KMS) key ARN, the data delivery of your data from Kinesis Data Firehose to its destination may delay or fail. Errors might also occur if you configured data transformation using Lambda and your Lambda function invocation failed.

When Kinesis Data Firehose encounters delivery or processing errors, it retries until the configured retry duration expires. If the retry duration ends and the data hasn’t been delivered successfully, Kinesis Data Firehose retains the data internally for up to a maximum of 24 hours. If the issue continues beyond the 24-hour maximum retention period, Kinesis Data Firehose discards the data, resulting in data loss.

When such data delivery issues persist, the data freshness metric, which is the age of the oldest record in Kinesis Data Firehose that hasn’t been delivered yet, constantly increases. To be alerted in such cases, you should create a CloudWatch alarm for when the data freshness metric exceeds the threshold of 4 hours. We also recommend setting an alarm to observe the historical p90 of the data freshness metric value. For example, set a certain tolerance level (such as 50% above the observed value) as an alarm threshold to detect data freshness variations.

You should monitor the data freshness metric that is relevant to your Kinesis Data Firehose destination, such as DeliveryToS3.DataFreshness, DeliveryToAmazonOpenSearchService.DataFreshness, DeliveryToSplunk.DataFreshness, or DeliveryToHttpEndpoint.DataFreshness. For more information, see Monitoring Kinesis Data Firehose Using CloudWatch Metrics.
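
As a sketch, an alarm on the Amazon S3 destination’s freshness metric with the 4-hour threshold could be created as follows; the delivery stream name and SNS topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="my-delivery-stream-data-freshness",
    AlarmDescription="Oldest undelivered record is older than 4 hours",
    Namespace="AWS/Firehose",
    MetricName="DeliveryToS3.DataFreshness",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "my-delivery-stream"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=4 * 60 * 60,  # DataFreshness is reported in seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:firehose-alerts"],
)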

If this alarm is triggered, you should take action to understand the root cause of the data freshness variation. A reason for such a variation could be a change in your Lambda transformation logic or in the Lambda concurrency configuration when using Kinesis Data Firehose data transformation. It could also be the result of a change in the configuration parameters, format conversion schema, or ingested record type. For more information, see Data Freshness Metric Increasing or Not Emitted, or submit a technical support request if needed.

When data delivery fails because of data transformation or an issue at the destination, in some cases you can find detailed failure logs in CloudWatch Logs, which can help you troubleshoot the problem.

We also recommend monitoring the data delivery byte rate to your destination (for example, DeliveryToS3.Bytes), which must match or exceed your data ingestion byte rate (IncomingBytes) on a sustained average basis to avoid an increase in the data freshness metric and possible eventual data loss. If the observed delivery rates are lower than the ingestion rates, consider tuning bottlenecks such as Lambda concurrency levels or your Lambda transformation logic if used with Kinesis Data Firehose data transformation.

To gain additional insights on the delivery of your data to its destination, we provide CloudWatch metrics you can monitor. For example, you can monitor the number of records delivered to keep track of data ingested into your destinations from Kinesis Data Firehose. For more information and additional metrics per destination, see Monitoring Kinesis Data Firehose Using CloudWatch Metrics.

Conclusion

In this post, we discussed the capabilities of using the Firehose delivery stream metrics and the CloudWatch dashboard located on your Kinesis Data Firehose console. This allows you to gain operational insights into the data ingestion and data delivery of your Firehose delivery stream, and also create CloudWatch alerts to be notified when one of your thresholds is breached. We also covered the different actions that you can take when these alarms are triggered, such as submitting a request to increase your quota or adding exponential backoff to your data producers.

Monitor your delivery streams and take these actions to make sure that your business data is delivered to your destinations without interruptions, enabling your business to gain insights in near-real time.


About the Author

Alon Gendler is a Startup Solutions Architect Manager at Amazon Web Services. He works with AWS customers to help them solve complex problems and architect secure, resilient, scalable and high performance applications in the cloud. Alon is passionate about Data and helping customers get the most out of it.

Build event-driven data quality pipelines with AWS Glue DataBrew

Post Syndicated from Laith Al-Saadoon original https://aws.amazon.com/blogs/big-data/build-event-driven-data-quality-pipelines-with-aws-glue-databrew/

Businesses collect more and more data every day to drive processes like decision-making, reporting, and machine learning (ML). Before cleaning and transforming your data, you need to determine whether it’s fit for use. Incorrect, missing, or malformed data can have large impacts on downstream analytics and ML processes. Performing data quality checks helps identify issues earlier in your workflow so you can resolve them faster. Additionally, doing these checks using an event-based architecture helps you reduce manual touchpoints and scale with growing amounts of data.

AWS Glue DataBrew is a visual data preparation tool that makes it easy to find data quality statistics such as duplicate values, missing values, and outliers in your data. You can also set up data quality rules in DataBrew to perform conditional checks based on your unique business needs. For example, a manufacturer might need to ensure that there are no duplicate values specifically in a Part ID column, or a healthcare provider might check that values in an SSN column are a certain length. After you create and validate these rules with DataBrew, you can use Amazon EventBridge, AWS Step Functions, AWS Lambda, and Amazon Simple Notification Service (Amazon SNS) to create an automated workflow and send a notification when a rule fails a validation check.

In this post, we walk you through the end-to-end workflow and how to implement this solution. This post includes a step-by-step tutorial, an AWS Serverless Application Model (AWS SAM) template, and example code that you can use to deploy the application in your own AWS environment.

Solution overview

The solution in this post combines serverless AWS services to build a completely automated, end-to-end event-driven pipeline for data quality validation. The following diagram illustrates our solution architecture.

The solution workflow contains the following steps:

  1. When you upload new data to your Amazon Simple Storage Service (Amazon S3) bucket, events are sent to EventBridge.
  2. An EventBridge rule triggers a Step Functions state machine to run.
  3. The state machine starts a DataBrew profile job, configured with a data quality ruleset and rules. If you’re considering building a similar solution, the DataBrew profile job output location and the source data S3 buckets should be unique. This prevents recursive job runs. We deploy our resources with an AWS CloudFormation template, which creates unique S3 buckets.
  4. A Lambda function reads the data quality results from Amazon S3, and returns a Boolean response into the state machine. The function returns false if one or more rules in the ruleset fail, and returns true if all rules succeed (a sketch of such a function follows this list).
  5. If the Boolean response is false, the state machine sends an email notification with Amazon SNS and ends in a failed status. If the Boolean response is true, the state machine ends in a succeeded status. You can also extend the solution in this step to run other tasks on success or failure. For example, if all the rules succeed, you can send an EventBridge message to trigger another transformation job in DataBrew.
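
As an illustration of step 4, such a Lambda function might look roughly like the following. The report location and JSON field names depend on the validation report your DataBrew profile job writes, so treat the event keys and report schema used here as assumptions and check the sample code in the GitHub repository for the real implementation.

import json

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Return True only if every rule in the DataBrew validation report passed."""
    # Assumed event shape: the state machine passes the report's location.
    bucket = event["reportBucket"]
    key = event["reportKey"]

    report = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    # Assumed report shape: a list of per-rule results with a status string.
    rule_results = report.get("rules", [])
    return all(rule.get("status") == "SUCCEEDED" for rule in rule_results)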

In this post, you use AWS CloudFormation to deploy a fully functioning demo of the event-driven data quality validation solution. You test the solution by uploading a valid comma-separated values (CSV) file to Amazon S3, followed by an invalid CSV file.

The steps are as follows:

  1. Launch a CloudFormation stack to deploy the solution resources.
  2. Test the solution:
    1. Upload a valid CSV file to Amazon S3 and observe the data quality validation and Step Functions state machine succeed.
    2. Upload an invalid CSV file to Amazon S3 and observe the data quality validation and Step Functions state machine fail, and receive an email notification from Amazon SNS.

All the sample code can be found in the GitHub repository.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Deploy the solution resources using AWS CloudFormation

You use a CloudFormation stack to deploy the resources needed for the event-driven data quality validation solution. The stack includes an example dataset and ruleset in DataBrew.

  1. Sign in to your AWS account and then choose Launch Stack:
  2. On the Quick create stack page, for EmailAddress, enter a valid email address for Amazon SNS email notifications.
  3. Leave the remaining options set to the defaults.
  4. Select the acknowledgement check boxes.
  5. Choose Create stack.

The CloudFormation stack takes about 5 minutes to reach CREATE_COMPLETE status.

  1. Check the inbox of the email address you provided and accept the SNS subscription.

You need to review and accept the subscription confirmation in order to demonstrate the email notification feature at the end of the walkthrough.

On the Outputs tab of the stack, you can find the URLs to browse the DataBrew and Step Functions resources that the template created. Also note the completed AWS CLI commands you use in later steps.

If you choose the AWSGlueDataBrewRuleset value link, you should see the ruleset details page, as in the following screenshot. In this walkthrough, we create a data quality ruleset with three rules that check for missing values, outliers, and string length.

Test the solution

In the following steps, you use the AWS CLI to upload correct and incorrect versions of the CSV file to test the event-driven data quality validation solution.

  1. Open a terminal or command line prompt and use the AWS CLI to download sample data. Use the command from the CloudFormation stack output with the key name CommandToDownloadTestData:
    aws s3 cp s3://<your_bucket>/artifacts/BDB-1942/votes.csv .

  2. Use the AWS CLI again to upload the unchanged CSV file to your S3 bucket. Replace the string <your_bucket> with your bucket name, or copy and paste the command provided to you from the CloudFormation template output:
    aws s3 cp votes.csv s3://<your_bucket>/artifacts/BDB-1942/votes.csv

  3. On the Step Functions console, locate the state machine created by the CloudFormation template.

You can find a URL in the CloudFormation outputs noted earlier.

  1. On the Executions tab, you should see a new run of the state machine.
  2. Choose the run’s URL to view the state machine graph and monitor its progress.

The following image shows the workflow of our state machine.

To demonstrate a data quality rule’s failure, you make at least one edit to the votes.csv file.

  1. Open the file in your preferred text editor or spreadsheet tool, and delete just one cell.

In the following screenshots, I use the GNU nano editor on Linux. You can also use a spreadsheet editor to delete a cell. This causes the “Check All Columns For Missing Values” rule to fail.

The following screenshot shows the CSV file before modification.

The following screenshot shows the changed CSV file.

  1. Save the edited votes.csv file and return to your command prompt or terminal.
  2. Use the AWS CLI to upload the file to your S3 bucket one more time. You use the same command as before:
    aws s3 cp votes.csv s3://<your_bucket>/artifacts/BDB-1942/votes.csv

  3. On the Step Functions console, navigate to the latest state machine run to monitor it.

The data quality validation fails, triggering an SNS email notification and the failure of the overall state machine’s run.

The following image shows the workflow of the failed state machine.

The following screenshot shows an example of the SNS email.

  1. You can investigate the rule failure on the DataBrew console by choosing the AWSGlueDataBrewProfileResults value in the CloudFormation stack outputs.

Clean up

To avoid incurring future charges, delete the resources. On the AWS CloudFormation console, delete the stack named AWSBigDataBlogDataBrewDQSample.

Conclusion

In this post, you learned how to build automated, event-driven data quality validation pipelines. With DataBrew, you can define data quality rules, thresholds, and rulesets for your business and technical requirements. Step Functions, EventBridge, and Amazon SNS allow you to build complex pipelines with customizable error handling and alerting tailored to your needs.

You can learn more about this solution and the source code by visiting the GitHub repository. To learn more about DataBrew data quality rules, visit AWS Glue DataBrew now allows customers to create data quality rules to define and validate their business requirements or refer to Validating data quality in AWS Glue DataBrew.


About the Authors

Laith Al-Saadoon is a Principal Prototyping Architect on the Envision Engineering team. He builds prototypes and solutions using AI, machine learning, IoT & edge computing, streaming analytics, robotics, and spatial computing to solve real-world customer problems. In his free time, Laith enjoys outdoor activities such as photography, drone flights, hiking, and paintballing.

Gordon Burgess is a Senior Product Manager with AWS Glue DataBrew. He is passionate about helping customers discover insights from their data, and focuses on building user experiences and rich functionality for analytics products. Outside of work, Gordon enjoys reading, coffee, and building computers.

Using Amazon Aurora Global Database for Low Latency without Application Changes

Post Syndicated from Roneel Kumar original https://aws.amazon.com/blogs/architecture/using-amazon-aurora-global-database-for-low-latency-without-application-changes/

Deploying global applications has many challenges, especially when accessing a database to build custom pages for end users. One example is an application using AWS Lambda@Edge. Two main challenges include performance and availability.

This blog explains how you can optimally deploy a global application with fast response times and without application changes.

The Amazon Aurora Global Database enables a single database cluster to span multiple AWS Regions, asynchronously replicating your data with typically subsecond latency. This provides fast, low-latency local reads in each Region. It also enables disaster recovery from Region-wide outages by failing the writer over to another Region. These capabilities minimize downtime (your recovery time objective, or RTO) after a cluster failure and limit data loss, helping you meet your recovery point objective (RPO).

However, there are some implementation challenges. Most applications are designed to connect to a single hostname with atomic, consistent, isolated, and durable (ACID) consistency. But an Aurora global cluster provides reader hostname endpoints in each Region. In the primary Region, there are two endpoints: one for writes and one for reads. To achieve strong data consistency, a global application requires the ability to:

  • Choose the optimal reader endpoints
  • Change writer endpoints on a database failover
  • Intelligently select the reader with the most up-to-date, freshest data

These capabilities typically require additional development.

The Heimdall Proxy coupled with Amazon Route 53 allows edge-based applications to access the Aurora Global Database seamlessly, without application changes. Features include automated read/write split with ACID compliance and edge results caching.

Figure 1. Heimdall Proxy architecture

The architecture in Figure 1 shows the Aurora Global Database’s primary Region in AP-SOUTHEAST-2 and secondary Regions in AP-SOUTH-1 and US-WEST-2. The Heimdall Proxy uses latency-based routing to determine the closest reader instance for read traffic, and redirects all write traffic to the writer instance. The Heimdall configuration stores the Amazon Resource Name (ARN) of the global cluster. It automatically detects failovers and cross-Region changes on the cluster, and directs traffic accordingly.

With an Aurora Global Database, there are two approaches to failover:

  • Managed planned failover. To relocate your primary database cluster to one of the secondary Regions in your Aurora global database, see Managed planned failovers with Amazon Aurora Global Database. With this feature, RPO is 0 (no data loss) and it synchronizes secondary DB clusters with the primary before making any other changes. RTO for this automated process is typically less than that of the manual failover.
  • Manual unplanned failover. To recover from an unplanned outage, you can manually perform a cross-Region failover to one of the secondaries in your Aurora Global Database. The RTO for this manual process depends on how quickly you can manually recover an Aurora global database from an unplanned outage. The RPO is typically measured in seconds, but this is dependent on the Aurora storage replication lag across the network at the time of the failure.

The Heimdall Proxy automatically detects Amazon Relational Database Service (RDS) / Amazon Aurora configuration changes based on the ARN of the Aurora Global cluster. Therefore, both managed planned and manual unplanned failovers are supported.

Solution benefits for global applications

Implementing the Heimdall Proxy has many benefits for global applications:

  1. An Aurora Global Database has a primary DB cluster in one Region and up to five secondary DB clusters in different Regions. But the Heimdall Proxy deployment does not have this limitation. This allows for a larger number of endpoints to be globally deployed. Combined with Amazon Route 53 latency-based routing, new connections have a shorter establishment time. They can use connection pooling to connect to the database, which reduces overall connection latency.
  2. SQL results are cached to the application for faster response times.
  3. The proxy intelligently routes non-cached queries. When safe to do so, the closest (lowest latency) reader will be used. When not safe to access the reader, the query will be routed to the global writer. Proxy nodes globally synchronize their state to ensure that volatile tables are locked to provide ACID compliance.

For more information on configuring the Heimdall Proxy and Amazon Route 53 for a global database, read the Heimdall Proxy for Aurora Global Database Solution Guide.

Download a free trial from the AWS Marketplace.

Resources:

Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced ISV Partner. They have AWS Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL, improving database scale. Deployment does not require code changes.

Transform data and create dashboards using AWS Glue DataBrew and Tableau

Post Syndicated from Nipun Chagari original https://aws.amazon.com/blogs/big-data/transform-data-and-create-dashboards-using-aws-glue-databrew-and-tableau/

Before you can create visuals and dashboards that convey useful information, you need to transform and prepare the underlying data. With AWS Glue DataBrew, you can now easily transform and prepare datasets from Amazon Simple Storage Service (Amazon S3), an Amazon Redshift data warehouse, Amazon Aurora, and other Amazon Relational Database Service (Amazon RDS) databases and upload them into Amazon S3 to visualize the transformed data in a dashboard using Amazon QuickSight or other business intelligence (BI) tools like Tableau.

DataBrew now also supports writing prepared data into Tableau Hyper format, allowing you to easily take prepared datasets from Amazon S3 and upload them into Tableau for further visualization and analysis. Hyper is Tableau’s in-memory data engine technology optimized for fast data ingest and analytical query processing on large or complex datasets.

In this post, we use DataBrew to extract data from Amazon Redshift, cleanse and transform data using DataBrew to Tableau Hyper format without any coding, and store it in Amazon S3.

Overview of solution

The following diagram illustrates the architecture of the solution.

The solution workflow includes the following steps:

  1. You create a JDBC connection for Amazon Redshift and a DataBrew project on the DataBrew console.
  2. DataBrew queries data from Amazon Redshift by creating a recipe and performing transformations.
  3. The DataBrew job writes the final output to an S3 bucket in Tableau Hyper format.
  4. You can now upload the file into Tableau for further visualization and analysis.

Prerequisites

For this walkthrough, you should have the following prerequisites:

The following screenshots show the configuration for creating an Amazon Redshift cluster using the Amazon Redshift console with demo sales data. For more information about network security for the cluster, see Setting Up a VPC to Connect to JDBC Data Stores.

For this post, we use the sample data that comes with the Amazon Redshift cluster.

In this post, we only demonstrate how to transform your Amazon Redshift data to Hyper format; uploading the file for further analysis is out of scope.

Create an Amazon Redshift connection

In this step, you use the DataBrew console to create an Amazon Redshift connection.

  1. On the DataBrew console, choose Datasets.
  2. On the Connections tab, choose Create connection.
  3. For Connection name, enter a name (for example, ticket-db-connection).
  4. For Connection type, select Amazon Redshift.
  5. In the Connection access section, provide details like cluster name, database name, user name, and password.
  6. Choose Create connection.

Create your dataset

To create a new dataset, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. On the Datasets tab, choose Connect new dataset.
  3. For Dataset name, enter sales.
  4. For Connect to new dataset, select Amazon Redshift.
  5. Choose the connection you created (AwsGlueDataBrew-tickit-sales-db-connection).
  6. Select the public schema and the sales table.
  7. In the Additional configurations section, for Enter S3 destination, enter the S3 bucket you created as a prerequisite.

DataBrew uses this bucket to store the intermediate results.

  1. Choose Create dataset.
    If your query is taking too long, add a LIMIT clause to your SELECT statement.

Create a project using the dataset

To create a new project, complete the following steps:

  1. On the DataBrew console, choose Projects and choose Create project.
  2. For Project name, enter sales-project.
  3. For Attached recipe, choose Create new recipe.
  4. For Recipe name, enter sales-project-recipe.
  5. For Select a dataset, select My datasets.
  6. Select the sales dataset.
  7. Under Permissions, for Role name, choose an existing IAM role created during the prerequisites or create a new role.
  8. Choose Create project.

When the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

When we connect a dataset to an Amazon Redshift cluster in your VPC, DataBrew provisions an elastic network interface in your VPC without a public IPv4 address. Because of this, you need to provision a NAT gateway in your VPC, as well as an appropriate route table configured for the subnets associated with the AWS Glue network interfaces. To use DataBrew with a VPC subnet without a NAT, you must have a gateway VPC endpoint to Amazon S3 and a VPC endpoint for the AWS Glue interface in your VPC. For more information, see Create a gateway endpoint and Interface VPC endpoints (AWS PrivateLink).

Build a transformation recipe

In this step, we apply some transformations to prepare our dataset and drop the columns that aren’t required for this exercise.

  1. On the DataBrew console, choose Column.
  2. Choose Delete.
  3. For Source columns, select the pricepaid and commission columns.
  4. Choose Apply.

Add a logical condition

With DataBrew, you can now use IF, AND, OR, and CASE logical conditions to create transformations based on functions. With this feature, you have the flexibility to use custom values or reference other columns within the expressions, and you can create adaptable transformations for your specific use cases.

To add a logical condition to your transformation recipe, complete the following steps:

  1. On the DataBrew console, choose Conditions.
  2. Choose IF.
  3. For Matching conditions, select Match all conditions.
  4. For Source, choose the value qtysold.
  5. For Enter a value, select Enter a custom value and enter 2.
  6. For Destination column, enter opportunity.
  7. Choose Apply.

The following screenshot shows the full recipe that we applied to our dataset.

Create the DataBrew job

Now that we have built the recipe, we can create and run the DataBrew recipe job.

  1. On the project details page, choose Create job.
  2. For Job name, enter sales-opportunities.
  3. For the output format, choose TABLEAU HYPER.
  4. For S3 location, enter the previously created S3 bucket.
  5. For Role name, choose an existing role created during the prerequisites or create a new role.
  6. Choose Create and run job.

  7. Navigate to the Jobs page and wait for the sales-opportunities job to complete.
  8. Choose the output link to navigate to the Amazon S3 console to access the job output.
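
If you prefer to script the same job instead of using the console, a boto3 sketch might look like the following. The role ARN, output bucket, and key prefix are placeholders, and you should confirm the Tableau Hyper output format identifier (shown here as TABLEAUHYPER) against the current DataBrew API reference.

import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="sales-opportunities",
    ProjectName="sales-project",  # reuses the project's dataset and recipe
    RoleArn="arn:aws:iam::123456789012:role/databrew-job-role",  # placeholder
    Outputs=[
        {
            "Location": {"Bucket": "my-databrew-output-bucket", "Key": "sales/"},
            "Format": "TABLEAUHYPER",  # assumed enum value; verify in the API docs
        }
    ],
)

databrew.start_job_run(Name="sales-opportunities")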

Clean up

To avoid incurring future charges, delete the resources you created:

  • Amazon Redshift cluster
  • Recipe job
  • Job output stored in the S3 bucket
  • IAM roles created as part of this exercise
  • DataBrew project sales-project and its associated recipe sales-project-recipe
  • DataBrew datasets

Conclusion

In this post, we showed you how to connect to an Amazon Redshift cluster and create a DataBrew dataset.

We saw how easy it is to get data from Amazon Redshift into DataBrew and apply transformations without any coding. We then ran a recipe job to convert this dataset to Tableau Hyper format file and store it in Amazon S3 for visualization using Tableau. Learn more about all the products and service integrations that AWS Glue DataBrew supports.


About the Authors

Nipun Chagari is a Senior Solutions Architect at AWS, where he helps customers build highly available, scalable, and resilient applications on the AWS Cloud. He is currently focused on helping customers leverage serverless technology to meet their business objectives.

Mohit Malik is a Senior Solutions Architect at Amazon Web Services who specializes in compute, networking, and serverless technologies. He enjoys helping customers learn how to operate efficiently and effectively in the cloud. In his spare time, Mohit enjoys spending time with his family, reading books, and watching movies.

Detect Real-Time Anomalies and Failures in Industrial Processes Using Apache Flink

Post Syndicated from Hubert Asamer original https://aws.amazon.com/blogs/architecture/detect-real-time-anomalies-and-failures-in-industrial-processes-using-apache-flink/

For a long time, industrial control systems have been the heart of the manufacturing process, allowing manufacturers to collect, process, and act on data from the shop floor. Process manufacturers use a distributed control system (DCS) for the automated control and operation of an industrial process or plant.

With the convergence of operational technology and information technology (IT), customers such as Yara are integrating their DCS with additional intelligence from the IT side. This provides customers with a holistic view of the different data sources to make more complex decisions with advanced analytics.

In this blog post, we show how to start with advanced analytics on streaming data coming from the shop floor. The sensor data, such as pressure and temperature, is typically published by a DCS. It is then ingested with a local edge gateway and streamed to the cloud with streaming and industrial internet of things (IoT) technology. Analytics on the streaming data is typically done before all data points are stored in the data layer. Figure 1 shows how the data flow can be modeled and visualized with AWS services.

Figure 1: High-level ingestion and analytics architecture

In this example, we concentrate on the streaming analytics part in the cloud. We generate data from a simulated DCS and send it to Amazon Kinesis Data Streams; in a real deployment, a gateway such as AWS IoT Greengrass and possibly other IoT services would sit in between.

For the simulated process that the DCS is controlling, we use a well-documented industrial process for creating a chemical compound (acetic anhydride) called the Tennessee Eastman process (TEP). Several simulations of it are available as open source. We demonstrate how to use this data as a constant stream with more than 30 real-time measurement parameters, ingest it into Kinesis Data Streams, and run in-stream analytics using Apache Flink. Within Apache Flink, data is grouped and mapped to the respective stages and parts of the industrial process, and constantly analyzed by calculating anomalies for all process stages. All raw data, plus the derived anomalies and failure patterns, are then ingested from Apache Flink into Amazon Timestream for further use in near real-time dashboards.

Overview of solution

Note: Refer to steps 1 to 6 in Figure 2.

As a starting point for a realistic and data-intensive measurement source, we use an existing TEP simulation framework written in C++, originally created by the National Institute of Standards and Technology and published as open source. The blog's GitHub repository contains a small patch that adds AWS connectivity with the AWS software development kits (SDKs) and modifications to the command line arguments. The programs provided by this framework are (step 1) a simulation process starter with configurable starting conditions and timestep configuration, and (step 2) a real-time client that connects to the simulation and sends the simulation output data to the AWS Cloud.

Tennessee Eastman process (TEP) background

A paper by Downs & Vogel, A plant-wide industrial process control problem, from 1991 states:

“This chemical standard process consists of a reactor/separator/recycle arrangement involving two simultaneous gas-liquid exothermic reactions.”

“The process produces two liquid products from four reactants. Also present are an inert and a byproduct making a total of eight components. Two additional byproduct reactions also occur. The process has 12 valves available for manipulation and 41 measurements available for monitoring or control.”

The simulation framework used can control all of the 12 valve settings and produces 41 measurement variables with varying sampling frequency.

Data ingestion

The 41 measurement variables, named xmeas_1 to xmeas_41, are emitted by the real-time client (step 2) as key-value JSON messages. The client code is configured to produce 100 messages per second. A built-in C++ Kinesis SDK allows the real-time client to directly stream JSON messages to a Kinesis data stream (step 3).

Figure 2: Detailed system architecture

Stream processing with Apache Flink

Messages sent to the Amazon Kinesis data stream are processed in configurable batch sizes by an Apache Flink application deployed in Amazon Kinesis Data Analytics. Apache Flink is an open-source stream processing framework, written in and usable from Java or Scala. As described in Figure 3, it allows the definition of various data sources (for example, a Kinesis data stream) and data sinks for storing processing results. In between, data can be processed by a range of operators, typically mapping and reducing functions (step 4).

In our case, we use a mapping operator in which each batch of incoming messages is processed. In Code snippet 1, we apply a custom mapping function to the raw data stream. For rapid and iterative development, it's possible to run the complete stream processing pipeline locally with Maven and a Java or Scala IDE such as Eclipse or IntelliJ IDEA.

Figure 3: Flink execution plan (green: streaming data sources; yellow: data sinks)

public class StreamingJob extends AnomalyDetector {
---
  public static DataStream<String> createKinesisSource
    (StreamExecutionEnvironment env, 
     ParameterTool parameter)
    {
    // create Stream
    return kinesisStream;
  }
---
  public static void main(String[] args) {
    // set up the execution environment
    final StreamExecutionEnvironment env = 
      StreamExecutionEnvironment.getExecutionEnvironment();
---
    DataStream<List<TimestreamPoint>> mainStream =
      createKinesisSource(env, parameter)
      .map(new AnomalyJsonToTimestreamPayloadFn(parameter))
      .name("MaptoTimestreamPayload");
---
    env.execute("Amazon Timestream Flink Anomaly Detection Sink");
  }
}

Code snippet 1: Flink application main class

In-stream anomaly detection

Within the Flink mapping operator, statistical outlier detection (anomaly detection) is implemented. Flink allows the inclusion of custom libraries within its operators. The library used here is published by AWS: a Random Cut Forest implementation available from GitHub. Random Cut Forest is a well-understood statistical method that can operate on batches of measurements. It calculates an anomaly score for each new measurement by comparing the new value with a cached pool (the “forest”) of older values.

The algorithm allows the creation of grouped anomaly scores, where a set of variables is combined to calculate a single anomaly score. In the simulated chemical process (TEP), we can group the measurement variables into three process stages:

  1. reactor feed analysis
  2. purge gas analysis
  3. product analysis.

Each group consists of 5–10 measurement variables, and we get one anomaly score for each group. Code snippet 2 shows how an anomaly detector is created. The AnomalyDetector class is instantiated and extended three times (once for each of the three process stages) within the mapping function, as described in Code snippet 3.

Flink distributes this calculation across its worker nodes and handles data deduplication processes within its system.

---
public class AnomalyDetector {
    protected final ParameterTool parameter;
    protected final Function<RandomCutForest, LineTransformer> algorithmInitializer;
    protected LineTransformer algorithm;
    protected ShingleBuilder shingleBuilder;
    protected double[] pointBuffer;
    protected double[] shingleBuffer;

    public AnomalyDetector(
        ParameterTool parameter,
        Function<RandomCutForest, LineTransformer> algorithmInitializer) {
        this.parameter = parameter;
        this.algorithmInitializer = algorithmInitializer;
    }

    // process one batch of values; lazily initialize the forest on first use
    public List<String> run(Double[] values) {
        if (pointBuffer == null) {
            prepareAlgorithm(values.length);
        }
        return processLine(values);
    }

    protected void prepareAlgorithm(int dimensions) {
---
        // build the Random Cut Forest with parameters taken from the
        // application configuration (defaults shown as fallbacks)
        RandomCutForest forest = RandomCutForest.builder()
            .numberOfTrees(Integer.parseInt(
                parameter.get("RcfNumberOfTrees", "50")))
            .sampleSize(Integer.parseInt(
                parameter.get("RcfSampleSize", "8192")))
            .dimensions(shingleBuilder.getShingledPointSize())
            .lambda(Double.parseDouble(
                parameter.get("RcfLambda", "0.00001220703125")))
            .randomSeed(Integer.parseInt(
                parameter.get("RcfRandomSeed", "42")))
            .build();
---
        algorithm = algorithmInitializer.apply(forest);
    }
}

Code snippet 2: AnomalyDetector base class, which gets extended by the streaming application's main class

public class AnomalyJsonToTimestreamPayloadFn extends
    RichMapFunction<String, List<TimestreamPoint>> {
  protected final ParameterTool parameter;
  private final Logger logger =
      LoggerFactory.getLogger(AnomalyJsonToTimestreamPayloadFn.class);

  public AnomalyJsonToTimestreamPayloadFn(ParameterTool parameter) {
    this.parameter = parameter;
  }

  // create new instances of StreamingJob for running our forests
  // (one per process stage)
  StreamingJob overallAnomalyRunner1;
  StreamingJob overallAnomalyRunner2;
  StreamingJob overallAnomalyRunner3;
---

  // use the `open` method to initialize the three Random Cut Forests
  @Override
  public void open(Configuration parameters) throws Exception {
    overallAnomalyRunner1 = new StreamingJob(parameter);
    overallAnomalyRunner2 = new StreamingJob(parameter);
    overallAnomalyRunner3 = new StreamingJob(parameter);
    super.open(parameters);
  }
---

Code snippet 3: The mapping function uses the Flink RichMapFunction open() method to initialize three distinct Random Cut Forests

Data persistence – Flink data sinks

After all anomaly scores are calculated, we can decide where to send the data. Flink provides various ready-to-use data sinks. In this example, we fan out all raw and processed data to Amazon Kinesis Data Firehose for long-term storage in Amazon Simple Storage Service (Amazon S3) and to Amazon Timestream for short-term storage (step 5). Kinesis Data Firehose is configured with a small AWS Lambda function that reformats the data from JSON to CSV, and the data is stored in Amazon S3 with automated partitioning. A Timestream data sink does not come bundled with Flink, so custom Timestream ingestion code is used in this example. Flink provides extensible operator interfaces for creating custom map and sink functions.
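
The JSON-to-CSV conversion in Kinesis Data Firehose is a standard Lambda-based record transformation. The following is a minimal Python sketch of such a transformation function, assuming flat key-value JSON records like the xmeas messages described earlier; the exact field order and error handling in the repository's function may differ.

import base64
import csv
import io
import json

def lambda_handler(event, context):
    """Kinesis Data Firehose transformation: JSON records in, CSV lines out."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # write the values in a stable (sorted) key order as one CSV line
        buf = io.StringIO()
        csv.writer(buf).writerow([payload[k] for k in sorted(payload)])
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(buf.getvalue().encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}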

Timeseries handling

Timestream, in combination with Grafana, is used for near real-time monitoring. Grafana, with the Timestream data source plugin installed, can continuously query and visualize Timestream data (step 6).

Walkthrough

Our architecture is available as a deployable AWS CloudFormation template. The simulation framework comes packaged as a Docker image, with an option to install it locally on a Linux host.

Prerequisites

To implement this architecture, you will need:

  • An AWS account
  • Docker Engine (CE) v18 or later
  • Java JDK v11 or later
  • Apache Maven v3.6 or later

We recommend a recent local Linux environment. The following steps assume that you are using AWS Cloud9, deployed with CloudFormation, within your AWS account.

Steps

Follow these steps to deploy the solution and experiment with the simulation framework. At the end, the anomalies detected by Flink are stored next to all raw data in Timestream and presented in Grafana. We use AWS Cloud9 and its Linux terminal capabilities to fire up a Grafana instance, then manually run the simulation to ingest data into Kinesis, and optionally start the Flink app manually from the console using Maven.

Deploy stack

After you’re logged in to the AWS Management console you can deploy the CloudFormation stack. This stack creates a fully configured AWS Cloud9 environment with the related GitHub Repo already in place, a Kinesis data stream, Kinesis Data Firehose delivery stream, Kinesis Data Analytics with Flink app deployed, Timestream database, and an S3 bucket.

launch stack button

After successful deployment, record two important facts from the CloudFormation console: the chosen stack name and the attribute 03Cloud9EnvUrl displayed in the Outputs section of the stack. The attribute's URL takes you directly to the deployed AWS Cloud9 environment.

Run post install step within AWS Cloud9

The deployed stack created an AWS Cloud9 environment and an AWS Identity and Access Management (IAM) instance profile. We apply this instance profile to AWS Cloud9 so it can interact with Kinesis, Timestream, and Amazon S3 throughout the next steps. The script used also installs and configures other required tools.

1. Open a terminal window.

$ cd flinkAnomalySteps/deployment
$ source c9-postInstall.sh
---SETTING UP IAM INSTANCE PROFILE
Please enter cloudformation stack name (default: flink-rcf-app):
# enter your stack name

Start a Grafana development server

In this section, we start a Grafana server using Docker. AWS Cloud9 allows us to expose web applications (for demo and development purposes) on container port 8080.

1. Open a terminal window.

$ cd ../src/grafana-dashboard
$ docker volume create grafana-storage
# this creates a docker volume for persisting your Grafana settings
$ ./start-grafana.sh
# this starts a recent Grafana using docker with Timestream plugin and a pre-configured dashboard in place

2. Open the preview panel by selecting Preview, and then select Preview Running Application.

Cloud9 screenshot

3. Next, in the preview pane, select Pop out into new Window.

Cloud9 screenshot2

4. A new browser tab with Grafana opens.

5. Choose any username and password combination.

6. In Grafana, use the "Search dashboards" icon on the left and choose "TEP-SIM-DEV". This pre-configured dashboard displays data from Amazon Timestream (see the step "Open Grafana dashboard").

TEP simulation procedure

Within your local or AWS Cloud9 Linux environment, fetch the simulation Docker image from the public AWS container registry, or build the simulation binaries locally. For manual build instructions, check the GitHub repo.

Start simulation (in separate terminal)

# starts container and switch into the container-shell
$ docker run -it --rm \
  --network host \
  --name tesim-runner \
  tesim-runner:01 \
 /bin/bash
# then inside container
$ ./tesim --simtime 100  --external-ctrl
# simulation started…

Manipulate simulation (in separate terminal)

Follow these steps for a basic process disturbance task. Review the options for influencing the simulation in the GitHub repo. The rtclient program has a range of commands for introducing disturbances.

# first switch into running simulation container
$ docker exec -it tesim-runner /bin/bash
# now we can access the shared storage of the simulation process…
$ ./rtclient --setidv 6
# this enables one of the built-in process disturbances (1-20)
$ ./rtclient --setidv 7
$ ./rtclient --setidv 8
$ …

Stream simulation data to the Amazon Kinesis data stream (in separate terminal)

The client has a built-in record frequency of 50 messages per second. One message contains more than 50 measurements, so we have approximately 2,500 measurements per second.

$ ./rtclient -k

AWS libcrypto resolve: found static libcrypto 1.1.1 HMAC symbols
AWS libcrypto resolve: found static libcrypto 1.1.1 EVP_MD symbols
{"xmeas_1": 3649.739476,"xmeas_2": 4451.32071,"xmeas_3": 9.223142558,"xmeas_4": 32.39290913,"xmeas_5": 47.55975621,"xmeas_6": 2798.975688,"xmeas_7": 64.99582601,"xmeas_8": 122.8987929,"xmeas_9": 0.1978264656,…}
# Messages in JSON sent to Kinesis DataStream visible via stdout

Compile and start Flink Application (optional step)

If you want deeper insights into the Flink application, you can also start it from the AWS Cloud9 instance. Note that this is only appropriate for development.

$ cd flinkAnomalySteps/src
$ cd flink-rcf-app
$ mvn clean compile
# the Flink app gets compiled
$ mvn exec:java -Dexec.mainClass= \
    "com.amazonaws.services.kinesisanalytics.StreamingJob"
# Flink App is started with default settings in place…
…

Open Grafana dashboard (from the step Start a Grafana development server)

Process anomalies are visible instantly after you start the simulation. Use Grafana to drill down into the data as needed.

/** example: simplest possible Timestream query used for the visualization **/

SELECT CREATE_TIME_SERIES(time, measure_value::double) AS anomaly_stream6
FROM "kdaflink"."kinesisdata1"
WHERE measure_name = 'anomaly_score_stream6'
  AND time BETWEEN ago(15m) AND now()

Code snippet 4: Timestream SQL example; Timestream database is `kdaflink` – table is `kinesisdata1`

Figure 4 – Grafana dashboard showing near real-time simulation data; the three anomaly scores, mapped to the TEP process stages, are continuously calculated by Flink

S3 raw metrics bucket

For completeness, the Flink application also emits all raw data to Kinesis Data Firehose in an intermediate step. The service converts the JSON data to CSV format by using a small AWS Lambda function.

$ aws s3 ls flink-rcf-app-rawmetricsbucket-<CFN-UUID>/tep-raw-csv/2021/11/26/19/

Cleaning up

Delete the deployed CloudFormation stack. All resources (excluding S3 buckets) are permanently deleted.

Conclusion

In this blog post, we showed how in-stream anomaly detection and continuous measurement data insights can work together. The Apache Flink framework offers a ready-to-use stream processing platform suited for mission-critical adoption across manufacturing and other industries. Other applications of the presented Flink pattern can run on capable edge compute devices. Integration with AWS IoT Greengrass and its stream manager is part of the GitHub repository accompanying this blog.

Another extension is measurement data pattern detection, which can coexist with in-stream anomaly detection and detect specific failure patterns over time using the time-windowing features of the Flink framework. Refer to the GitHub repo that accompanies this blog post. Give it a try and let us know your feedback in the comments!

Set up cross-account audit logging for your Amazon Redshift cluster

Post Syndicated from Milind Oke original https://aws.amazon.com/blogs/big-data/set-up-cross-account-audit-logging-for-your-amazon-redshift-cluster/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the best practices of modern application design is to have centralized logging. Troubleshooting application problems is easy when you can correlate all your data together.

When you enable audit logging, Amazon Redshift logs information about connections and user activities in the database. These logs help you monitor the database for security and troubleshooting purposes, a process called database auditing. The logs are stored in Amazon Simple Storage Service (Amazon S3) buckets, which provide convenient access with data security features for users who are responsible for monitoring activities in the database.

If you want to establish a central audit logging account to capture audit logs generated by Amazon Redshift clusters located in separated AWS accounts, you can use the solution in this post to achieve cross-account audit logging for Amazon Redshift. As of this writing, the Amazon Redshift console only lists S3 buckets from the same account (in which the Amazon Redshift cluster is located) while enabling audit logging, so you can’t set up cross-account audit logging using the Amazon Redshift console. In this post, we demonstrate how to configure cross-account audit logging using the AWS Command Line Interface (AWS CLI).

Prerequisites

For this walkthrough, you must have the following prerequisites:

  • Two AWS accounts: one for analytics and one for centralized logging
  • A provisioned Amazon Redshift cluster in the analytics AWS account
  • An S3 bucket in the centralized logging AWS account
  • Access to the AWS CLI

Overview of solution

As a general security best practice, we recommend making sure that Amazon Redshift audit logs are sent to the correct S3 buckets. The Amazon Redshift service team has introduced additional security controls in the event that the destination S3 bucket resides in a different account from the Amazon Redshift cluster owner account. For more information, see Bucket permissions for Amazon Redshift audit logging.

This post uses the AWS CLI to establish cross-account audit logging for Amazon Redshift, as illustrated in the following architecture diagram.

For this post, we established an Amazon Redshift cluster named redshift-analytics-cluster-01 in the analytics account in Region us-east-2.

We also set up an S3 bucket named redshift-cluster-audit-logging-xxxxxxxxxxxx in the centralized logging account for capturing audit logs in Region us-east-1.

Now you’re ready to complete the following steps to set up the cross-account audit logging:

  1. Create AWS Identity and Access Management (IAM) policies in the analytics AWS account.
  2. Create an IAM user and attach the policies you created.
  3. Create an S3 bucket policy in the centralized logging account to allow Amazon Redshift to write audit logs to the S3 bucket, and allow the IAM user to enable audit logging for the S3 bucket.
  4. Configure the AWS CLI.
  5. Enable audit logging in the centralized logging account.

Create IAM policies in the analytics account

Create two IAM policies in the analytics account that has the Amazon Redshift cluster.

The first policy is the Amazon Redshift access policy (we named the policy redshift-audit-logging-redshift-policy). This policy allows the principal to whom it’s attached to enable, disable, or describe Amazon Redshift logs. It also allows the principal to describe the Amazon Redshift cluster. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "redshift:EnableLogging",
                "redshift:DisableLogging",
                "redshift:DescribeLoggingStatus"
            ],
            "Resource": "arn:aws:redshift:us-east-2:xxxxxxxxxxxx:cluster:redshift-analytics-cluster-01"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*"
        }
    ]
}

The second policy is the Amazon S3 access policy (we named the policy redshift-audit-logging-s3-policy). This policy allows the principal to whom it’s attached to write to the S3 bucket in the centralized logging account. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        }
    ]
}

Create an IAM user and attach the policies

Create an IAM user (we named it redshift-audit-logging-user) with programmatic access in the analytics account and attach the policies you created to it.

Save the generated access key ID and secret access key for this user securely. We use these credentials in a later step.

Create an S3 bucket policy for the S3 bucket in the centralized logging AWS account

Add the following bucket policy to the audit logging S3 bucket redshift-cluster-audit-logging-xxxxxxxxxxxx in the centralized logging account. This policy serves two purposes: it allows Amazon Redshift to write audit logs to the S3 bucket, and it allows the IAM user to enable audit logging for the S3 bucket. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Put bucket policy needed for audit logging",
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": [
                "s3:PutObject",
                "s3:GetBucketAcl"
            ],
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        },
        {
            "Sid": "Put IAM User bucket policy needed for audit logging",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::xxxxxxxxxxxx:user/redshift-audit-logging-user"
            },
            "Action": "s3:PutObject",
            "Resource": [
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx",
                "arn:aws:s3:::redshift-cluster-audit-logging-xxxxxxxxxxxx/*"
            ]
        }
    ]
}

Note that you must change the service name redshift.amazonaws.com to redshift.<region>.amazonaws.com if the cluster is in one of the opt-in Regions.

Configure the AWS CLI

As part of this step, you need to install and configure the AWS CLI. After you install the AWS CLI, configure it to use the credentials of the IAM user you created earlier. The next steps are performed with the permissions attached to that IAM user.

Enable audit logging in the centralized logging account

Run the following AWS CLI command to enable audit logging for the Amazon Redshift cluster, writing to the S3 bucket in the centralized logging AWS account. In the following code, provide the Amazon Redshift cluster ID, the S3 bucket name, and the prefix applied to the log file names:

aws redshift enable-logging --cluster-identifier <ClusterName> --bucket-name <BucketName> --s3-key-prefix <value>
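
If you prefer to script this step, the following is a minimal boto3 sketch of the same call, using the example names from this post as placeholder values; describe_logging_status is only used to verify the result.

import boto3

# run with the credentials of the IAM user created earlier
redshift = boto3.client("redshift", region_name="us-east-2")

# enable audit logging to the bucket in the centralized logging account
redshift.enable_logging(
    ClusterIdentifier="redshift-analytics-cluster-01",
    BucketName="redshift-cluster-audit-logging-xxxxxxxxxxxx",
    S3KeyPrefix="rsauditlog1",
)

# verify that logging is now enabled for the cluster
status = redshift.describe_logging_status(
    ClusterIdentifier="redshift-analytics-cluster-01"
)
print(status["LoggingEnabled"], status.get("BucketName"))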

The following screenshot shows that the cross-account Amazon Redshift audit logging is successfully set up.

A test file is also created by AWS to ensure that the log files can be successfully written into the S3 bucket. The following screenshot shows the test file was created successfully in the S3 bucket under the rsauditlog1 prefix.

After some time, we started seeing the audit logs created in the S3 bucket. By default, Amazon Redshift organizes the log files in the S3 bucket using the following bucket and object structure:

AWSLogs/AccountID/ServiceName/Region/Year/Month/Day/AccountID_ServiceName_Region_ClusterName_LogType_Timestamp.gz

Amazon Redshift logs information in the following log files:

  • Connection log – Logs authentication attempts, connections, and disconnections
  • User log – Logs information about changes to database user definitions
  • User activity log – Logs each query before it’s run on the database

The following screenshot shows that log files, such as connection logs and user activity logs, are now being created in the centralized logging account in us-east-1 from the Amazon Redshift cluster in the analytics account in us-east-2.

For more details on analyzing Amazon Redshift audit logs, refer to the following posts:

  1. Visualize Amazon Redshift audit logs using Amazon Athena and Amazon QuickSight
  2. How do I analyze my audit logs using Amazon Redshift Spectrum?

Clean up

To avoid incurring future charges, you can delete all the resources you created while following the steps in this post.

Conclusion

In this post, we demonstrated how to accomplish cross-account audit logging for an Amazon Redshift cluster in one account to an Amazon S3 bucket in another account. Using this solution, you can establish a central audit logging account to capture audit logs generated by Amazon Redshift clusters located in separated AWS accounts.

Try this solution to achieve cross-account audit logging for Amazon Redshift and leave a comment.


About the Authors

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Dipankar Kushari is a Sr. Analytics Solutions Architect with AWS.

Pankaj Pattewar is a Cloud Application Architect at Amazon Web Services. He specializes in architecting and building cloud-native applications and enables customers with best practices in their cloud journey.

Sudharshan Veerabatheran is a Cloud Support Engineer based out of Portland.

Enrich datasets for descriptive analytics with AWS Glue DataBrew

Post Syndicated from Daniel Rozo original https://aws.amazon.com/blogs/big-data/enrich-datasets-for-descriptive-analytics-with-aws-glue-databrew/

Data analytics remains a constantly hot topic. More and more businesses are beginning to understand the potential their data has to allow them to serve customers more effectively and give them a competitive advantage. However, for many small to medium businesses, gaining insight from their data can be challenging because they often lack in-house data engineering skills and knowledge.

Data enrichment is another challenge. Businesses that focus on analytics using only their internal datasets miss the opportunity to gain better insights by using reliable and credible public datasets. Small to medium businesses are no exception; obstacles such as insufficient data diminish their ability to make well-informed decisions based on accurate analytical insights.

In this post, we demonstrate how AWS Glue DataBrew enables businesses of all sizes to get started with data analytics with no prior coding knowledge. DataBrew is a visual data preparation tool that makes it easy for data analysts and scientists to clean and normalize data in preparation for analytics or machine learning. It includes more than 350 pre-built transformations for common data preparation use cases, enabling you to get started with cleaning, preparing, and combining your datasets without writing code.

For this post, we assume the role of a fictitious small Dutch solar panel distribution and installation company named OurCompany. We demonstrate how this company can prepare, combine, and enrich an internal dataset with publicly available data from the Dutch public entity, the Centraal Bureau voor de Statistiek (CBS), or in English, Statistics Netherlands. Ultimately, OurCompany desires to know how well they’re performing compared to the official reported values by the CBS across two important key performance indicators (KPIs): the amount of solar panel installations, and total energy capacity in kilowatt (kW) per region.

Solution overview

The architecture uses DataBrew for data preparation and transformation, Amazon Simple Storage Service (Amazon S3) as the storage layer of the entire data pipeline, and the AWS Glue Data Catalog for storing the dataset’s business and technical metadata. Following the modern data architecture best practices, this solution adheres to foundational logical layers of the Lake House Architecture.

The solution includes the following steps:

  1. We set up the storage layer using Amazon S3 by creating the following folders: raw-data, transformed-data, and curated-data. We use these folders to track the different stages of our data pipeline consumption readiness.
  2. Three CSV raw data files containing unprocessed data of solar panels as well as the external datasets from the CBS are ingested into the raw-data S3 folder.
  3. This part of the architecture incorporates both processing and cataloging capabilities:
    1. We use AWS Glue crawlers to populate the initial schema definition tables for the raw dataset automatically. For the remaining two stages of the data pipeline (transformed-data and curated-data), we use DataBrew to create the schema definition tables directly in the Data Catalog. Each table provides an up-to-date schema definition of the datasets we store on Amazon S3.
    2. We work with DataBrew projects as the centerpiece of our data analysis and transformation efforts. In here, we set up no-code data preparation and transformation steps, and visualize them through a highly interactive, intuitive user interface. Finally, we define DataBrew jobs to apply these steps and store transformation outputs on Amazon S3.
  4. To gain the benefits of granular access control and easily visualize data from Amazon S3, we take advantage of the seamless integration between Amazon Athena and Amazon QuickSight. This provides a SQL interface to query all the information we need from the curated dataset stored on Amazon S3 without the need to create and maintain manifest files.
  5. Finally, we construct an interactive dashboard with QuickSight to depict the final curated dataset alongside our two critical KPIs.

Prerequisites

Before beginning this tutorial, make sure you have the required Identity and Access Management (IAM) permissions to create the resources required as part of the solution. Your AWS account should also have an active subscription to QuickSight to create the visualization on processed data. If you don’t have a QuickSight account, you can sign up for an account.

The following sections provide a step-by-step guide to create and deploy the entire data pipeline for OurCompany without the use of code.

Data preparation steps

We work with the following files:

  • CBS Dutch municipalities and provinces (Gemeentelijke indeling op 1 januari 2021) – Holds all the municipalities and provinces names and codes of the Netherlands. Download the file gemeenten alfabetisch 2021. Open the file and save it as cbs_regions_nl.csv. Remember to change the format to CSV (comma-delimited).
  • CBS Solar power dataset (Zonnestroom; vermogen bedrijven en woningen, regio, 2012-2018) – This file contains the installed capacity in kilowatts and total number of installations for businesses and private homes across the Netherlands from 2012–2018. To download the file, go to the dataset page, choose the Onbewerkte dataset, and download the CSV file. Rename the file to cbs_sp_cap_nl.csv.
  • OurCompany’s solar panel historical data – Contains the reported energy capacity from all solar panel installations of OurCompany across the Netherlands from 2012 until 2018. Download the file.

As a result, the following are the expected input files we use to work with the data analytics pipeline:

  • cbs_regions_nl.csv
  • cbs_sp_cap_nl.csv
  • sp_data.csv

Set up the storage layer

We first need to create the storage layer for our solution to store all raw, transformed, and curated datasets. We use Amazon S3 as the storage layer of our entire data pipeline.

  1. Create an S3 bucket in the AWS Region where you want to build this solution. In our case, the bucket is named cbs-solar-panel-data. You can use the same name followed by a unique identifier.
  2. Create the following three prefixes (folders) in your S3 bucket by choosing Create folder:
    1. curated-data/
    2. raw-data/
    3. transformed-data/

  3. Upload the three raw files to the raw-data/ prefix.
  4. Create two prefixes within the transformed-data/ prefix named cbs_data/ and sp_data/.
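
If you prefer to script the bucket setup, the following is a minimal boto3 sketch of these steps; the bucket name is a placeholder, and zero-byte objects whose keys end in / stand in for the folders created in the console.

import boto3

s3 = boto3.client("s3")
bucket = "cbs-solar-panel-data-<unique-id>"  # replace with your bucket name

# 1. create the bucket (add a CreateBucketConfiguration outside us-east-1)
s3.create_bucket(Bucket=bucket)

# 2. create the prefixes used to track pipeline stages
for prefix in ["curated-data/", "raw-data/", "transformed-data/",
               "transformed-data/cbs_data/", "transformed-data/sp_data/"]:
    s3.put_object(Bucket=bucket, Key=prefix)

# 3. upload the three raw input files from the working directory
for name in ["cbs_regions_nl.csv", "cbs_sp_cap_nl.csv", "sp_data.csv"]:
    s3.upload_file(name, bucket, f"raw-data/{name}")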

Create a Data Catalog database

After we set up the storage layer of our data pipeline, we need to create the Data Catalog to store all the metadata of the datasets hosted in Amazon S3. To do so, follow these steps:

  1. Open the AWS Glue console in the same Region of your newly created S3 bucket.
  2. In the navigation pane, choose Databases.
  3. Choose Add database.
  4. Enter the name for the Data Catalog to store all the dataset’s metadata.
  5. Name the database sp_catalog_db.

Create AWS Glue data crawlers

Now that we have created the catalog database, it's time to crawl the raw-data prefix to automatically retrieve the metadata associated with each input file.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Add a crawler with the name crawler_raw and choose Next.
  3. For S3 path, select the raw-data folder of the cbs-solar-panel-data bucket.
  4. Create an IAM role and name it AWSGlueServiceRole-cbsdata.
  5. Leave the frequency as Run on demand.
  6. Choose the sp_catalog_db database created in the previous section, and enter the prefix raw_ to identify the tables that belong to the raw data folder.
  7. Review the parameters of the crawler and then choose Finish.
  8. After the crawler is created, select it and choose Run crawler.

After successful deployment of the crawler, your three tables are created in the sp_catalog_db database: raw_sp_data_csv, raw_cbs_regions_nl_csv, and raw_cbs_sp_cap_nl_csv.
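
For reference, the following boto3 sketch creates the catalog database (if you didn't already create it on the console) and the crawler, and then runs it. The role ARN and bucket name are placeholders; the role needs the same AWS Glue and S3 read permissions as the AWSGlueServiceRole-cbsdata role created in the console.

import boto3

glue = boto3.client("glue")

# catalog database that holds all dataset metadata
glue.create_database(DatabaseInput={"Name": "sp_catalog_db"})

# crawler over the raw-data prefix; discovered tables get the raw_ prefix
glue.create_crawler(
    Name="crawler_raw",
    Role="arn:aws:iam::<account-id>:role/service-role/AWSGlueServiceRole-cbsdata",
    DatabaseName="sp_catalog_db",
    TablePrefix="raw_",
    Targets={"S3Targets": [{"Path": "s3://cbs-solar-panel-data-<unique-id>/raw-data/"}]},
)

glue.start_crawler(Name="crawler_raw")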

Create DataBrew raw datasets

To utilize the power of DataBrew, we need to connect datasets that point to the Data Catalog S3 tables we just created. Follow these steps to connect the datasets:

  1. On the DataBrew console, choose Datasets in the navigation pane.
  2. Choose Connect new dataset.
  3. Name the dataset cbs-sp-cap-nl-dataset.
  4. For Connect to new dataset, choose Data Catalog S3 tables.
  5. Select the sp_catalog_db database and the raw_cbs_sp_cap_nl_csv table.
  6. Choose Create dataset.

We need to create two more datasets following the same process. The following table summarizes the names and tables of the catalog required for the new datasets.

Dataset name Data catalog table
sp-dataset raw_sp_data_csv
cbs-regions-nl-dataset raw_cbs_regions_nl_csv
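
A rough boto3 equivalent for connecting one of these datasets is the following sketch; create_dataset with a DataCatalogInputDefinition mirrors the Data Catalog S3 tables option in the console (repeat the call for the other two datasets).

import boto3

databrew = boto3.client("databrew")

# connect a DataBrew dataset to an existing Data Catalog table
databrew.create_dataset(
    Name="cbs-sp-cap-nl-dataset",
    Input={
        "DataCatalogInputDefinition": {
            "DatabaseName": "sp_catalog_db",
            "TableName": "raw_cbs_sp_cap_nl_csv",
        }
    },
)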

Import DataBrew recipes

A recipe is a set of data transformation steps. These transformations are applied to one or multiple datasets of your DataBrew project. For more information about recipes, see Creating and using AWS Glue DataBrew recipes.

We have prepared three DataBrew recipes, which contain the set of data transformation steps we need for this data pipeline. Some of these transformation steps include: renaming columns (from Dutch to English), removing null or missing values, aggregating rows based on specific attributes, and combining datasets in the transformation stage.

To import the recipes, follow these instructions:

  1. On the DataBrew console, choose Recipes in the navigation pane.
  2. Choose Upload recipe.
  3. Enter the name of the recipe: recipe-1-transform-cbs-data.
  4. Upload the following JSON recipe.
  5. Choose Create recipe.

Now we need to upload two more recipes that we use for transformation and aggregation projects in DataBrew.

  1. Follow the same procedure to import the following recipes:
Recipe name Recipe source file
recipe-2-transform-sp-data Download
recipe-3-curate-sp-cbs-data Download
  1. Make sure the recipes are listed in the Recipes section filtered by All recipes.
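
Recipes can also be uploaded programmatically. The following boto3 sketch assumes you saved one of the downloaded recipe files locally as recipe-1-transform-cbs-data.json and that its contents are a list of steps in the structure expected by the create_recipe API.

import json
import boto3

databrew = boto3.client("databrew")

# load the recipe steps from the downloaded recipe file (local path is an assumption)
with open("recipe-1-transform-cbs-data.json") as f:
    steps = json.load(f)

databrew.create_recipe(
    Name="recipe-1-transform-cbs-data",
    Description="Transforms the raw CBS solar power dataset",
    Steps=steps,
)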

Set up DataBrew projects and jobs

After we successfully create the Data Catalog database, crawlers, DataBrew datasets, and import the DataBrew recipes, we need to create the first transformation project.

CBS external data transformation project

The first project takes care of transforming, cleaning, and preparing cbs-sp-cap-nl-dataset. To create the project, follow these steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Create a new project with the name 1-transform-cbs-data.
  3. In the Recipe details section, choose Edit existing recipe and choose the recipe recipe-1-transform-cbs-data.
  4. Select the newly created cbs-sp-cap-nl-dataset under Select a dataset.
  5. In the Permissions section, choose Create a new IAM role.
  6. As suffix, enter sp-project.
  7. Choose Create project.

After you create the project, a preview dataset is displayed as a result of applying the selected recipe. When you choose 10 more recipe steps, the service shows the entire set of transformation steps.

You also need to grant put and delete S3 object permissions to the created role AWSGlueDataBrewServiceRole-sp-project in IAM. Add an inline policy using the following JSON, replacing the resource with your S3 bucket name:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<your-S3-bucket-name>/*"
        }
    ]
}

This role also needs permissions to access the Data Catalog. To grant these permissions, add the managed policy AWSGlueServiceRole to the role.
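
These permission changes can also be scripted. The following boto3 sketch attaches the inline S3 policy shown above and the AWSGlueServiceRole managed policy to the DataBrew role; the role name and bucket placeholder come from this walkthrough.

import json
import boto3

iam = boto3.client("iam")
role_name = "AWSGlueDataBrewServiceRole-sp-project"

# inline policy allowing DataBrew to write and overwrite job output objects
s3_write_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:DeleteObject"],
        "Resource": "arn:aws:s3:::<your-S3-bucket-name>/*",
    }],
}
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="databrew-s3-write",
    PolicyDocument=json.dumps(s3_write_policy),
)

# managed policy granting access to the AWS Glue Data Catalog
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)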

CBS external data transformation job

After we define the project, we need to configure and run a job to apply the transformation across the entire raw dataset stored in the raw-data folder of your S3 bucket. Complete the following steps:

  1. On the DataBrew project page, choose Create job.
  2. For Job name, enter 1-transform-cbs-data-job.
  3. For Output to, choose Data Catalog S3 tables.
  4. For File type, choose Parquet.
  5. For Database name, choose sp_catalog_db.
  6. For Table name, choose Create new table.
  7. For Catalog table name, enter transformed_cbs_data.
  8. For S3 location, enter s3://<your-S3-bucket-name>/transformed-data/cbs_data/.
  9. In the job output settings section, choose Settings.
  10. Select Replace output files for each job run and then choose Save.
  11. In the permissions section, choose the automatically created role with the sp-project suffix; for example, AWSGlueDataBrewServiceRole-sp-project.
  12. Review the job details once more and then choose Create and run job.
  13. Back in the main project view, choose Job details.

After a few minutes, the job status changes from Running to Successful. Choose the output to go to the S3 location where all the generated Parquet files are stored.
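
If you want to start and monitor job runs outside the console, a small boto3 polling sketch like the following can be used; the job name matches the one created above, and the polling interval is arbitrary.

import time
import boto3

databrew = boto3.client("databrew")
job_name = "1-transform-cbs-data-job"

# start a run and poll until it finishes
run_id = databrew.start_job_run(Name=job_name)["RunId"]
while True:
    state = databrew.describe_job_run(Name=job_name, RunId=run_id)["State"]
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)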

Solar panels data transformation stage

We now create the second phase of the data pipeline. We create a project and a job using the same procedure described in the previous section.

  1. Create a DataBrew project with the following parameters:
    1. Project name – 2-transform-sp-data
    2. Imported recipe – recipe-2-transform-sp-data
    3. Dataset – sp-dataset
    4. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  2. Create and run another DataBrew job with the following parameters:
    1. Job name – 2-transform-sp-data-job
    2. Output to – Data Catalog S3 tables
    3. File type – Parquet
    4. Database name – sp_catalog_db
    5. Create new table with table name – transformed_sp_data
    6. S3 location – s3://<your-S3-bucket-name>/transformed-data/sp_data/
    7. Settings – Replace output files for each job run
    8. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  3. After the job is complete, create the DataBrew datasets with the following parameters:
Dataset name Data catalog table
transformed-cbs-dataset awsgluedatabrew_transformed_cbs_data
transformed-sp-dataset awsgluedatabrew_transformed_sp_data

You should now see five items as part of your DataBrew dataset.

Data curation and aggregation stage

We now create the final DataBrew project and job.

  1. Create a DataBrew project with the following parameters:
    1. Project name – 3-curate-sp-cbs-data
    2. Imported recipe – recipe-3-curate-sp-cbs-data
    3. Dataset – transformed-sp-dataset
    4. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  2. Create a DataBrew job with the following parameters:
    1. Job name – 3-curate-sp-cbs-data-job
    2. Output to – Data Catalog S3 tables
    3. File type – Parquet
    4. Database name – sp_catalog_db
    5. Create new table with table name – curated_data
    6. S3 location – s3://<your-S3-bucket-name>/curated-data/
    7. Settings – Replace output files for each job run
    8. Permissions role – AWSGlueDataBrewServiceRole-sp-project

The last project defines a single transformation step: the join between the transformed-cbs-dataset and the transformed-sp-dataset, based on the municipality code and the year.

The DataBrew job should take a few minutes to complete.

Next, check your sp_catalog_db database. You should now have raw, transformed, and curated tables in your database. DataBrew automatically adds the prefix awsgluedatabrew_ to both the transformed and curated tables in the catalog.

Consume curated datasets for descriptive analytics

We’re now ready to build the consumption layer for descriptive analytics with QuickSight. In this section, we build a business intelligence dashboard that reflects OurCompany’s solar panel energy capacity and installations participation in contrast to the reported values by the CBS from 2012–2018.

To complete this section, you need to have the default primary workgroup already set up on Athena in the same Region where you implemented the data pipeline. If it’s your first time setting up workgroups on Athena, follow the instructions in Setting up Workgroups.

Also make sure that QuickSight has the right permissions to access Athena and your S3 bucket. Then complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose Create a new dataset.
  3. Select Athena as the data source.
  4. For Data source name, enter sp_data_source.
  5. Choose Create data source.
  6. Choose AWSDataCatalog as the catalog and sp_catalog_db as the database.
  7. Select the table curated_data.
  8. Choose Select.
  9. In the Finish dataset creation section, choose Directly query your data and choose Visualize.
  10. Choose the clustered bar combo chart from the Visual types list.
  11. Expand the field wells section and then drag and drop the following fields into each section as shown in the following screenshot.
  12. Rename the visualization as you like, and optionally filter the report by sp_year using the Filter option.

From this graph, we can already benchmark OurCompany against the regional values reported by the CBS across two dimensions: the total amount of installations and the total kW capacity generated by solar panels.

We went one step further and created two KPI visualizations to empower our descriptive analytics capabilities. The following is our final dashboard that we can use to enhance our decision-making process.

Clean up resources

To clean all the resources we created for the data pipeline, complete the following steps:

  1. Remove the QuickSight analyses you created.
  2. Delete the dataset curated_data.
  3. Delete all the DataBrew projects with their associated recipes.
  4. Delete all the DataBrew datasets.
  5. Delete all the AWS Glue crawlers you created.
  6. Delete the sp_catalog_db catalog database; this removes all the tables.
  7. Empty the contents of your S3 bucket and delete it.

Conclusion

In this post, we demonstrated how you can begin your data analytics journey. With DataBrew, you can prepare and combine the data you already have with publicly available datasets such as those from the Dutch CBS (Centraal Bureau voor de Statistiek) without needing to write a single line of code. Start using DataBrew today and enrich key datasets in AWS for enhanced descriptive analytics capabilities.


About the Authors

Daniel Rozo is a Solutions Architect with Amazon Web Services based out of Amsterdam, The Netherlands. He is devoted to working with customers and engineering simple data and analytics solutions on AWS. In his free time, he enjoys playing tennis and taking tours around the beautiful Dutch canals.

Maurits de Groot is an intern Solutions Architect at Amazon Web Services. He does research on startups with a focus on FinTech. Besides working, Maurits enjoys skiing and playing squash.


Terms of use: Gemeentelijke indeling op 1 januari 2021, Zonnestroom; vermogen bedrijven en woningen, regio (indeling 2018), 2012-2018, and copies of these datasets redistributed by AWS, are licensed under the Creative Commons 4.0 license (CC BY 4.0), sourced from Centraal Bureau voor de Statistiek (CBS). The datasets used in this solution are modified to rename columns from Dutch to English, remove null or missing values, aggregate rows based on specific attributes, and combine the datasets in the final transformation. Refer to the CC BY 4.0 use, adaptation, and attribution requirements for additional information.

Query cross-account AWS Glue Data Catalogs using Amazon Athena

Post Syndicated from Louis Hourcade original https://aws.amazon.com/blogs/big-data/query-cross-account-aws-glue-data-catalogs-using-amazon-athena/

Many AWS customers rely on a multi-account strategy to scale their organization and better manage their data lake across different projects or lines of business. The AWS Glue Data Catalog contains references to data used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. Using a centralized Data Catalog offers organizations a unified metadata repository and minimizes the administrative overhead related to sharing data across different accounts, thereby expanding access to the data lake.

Amazon Athena is one of the popular choices to run analytical queries in data lakes. This interactive query service makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you’re charged based on the amount of data scanned by your queries.

In May 2021, Athena introduced the ability to query Data Catalogs across multiple AWS accounts, enabling you to access your data lake without the complexity of replicating catalog metadata in individual AWS accounts. This blog post details the procedure for using the feature.

Solution overview

The following diagram shows the necessary components used in two different accounts (consumer account and producer account, hosting a central Data Catalog) and the flow between the two for cross-account Data Catalog access using Athena.

Our use case showcases Data Catalog sharing between two accounts:

  • Producer account – The account that administrates the central Data Catalog
  • Consumer account – The account querying data from the producer’s Data Catalog (the central Data Catalog)

In this walkthrough, we use the following two tables, extracted from an ecommerce dataset:

  • The orders table logs the website’s orders and contains the following key attributes:
    • Row ID – Unique entry identifier in the orders table
    • Order ID – Unique order identifier
    • Order date – Date the order was placed
    • Profit – Profit value of the order
  • The returns table logs the returned items and contains the following attributes:
    • Returned – If the order has been returned (Yes/No)
    • Order ID – Unique order identifier
    • Market – Region market

We walk you through the following high-level steps to use this solution:

  1. Set up the producer account.
  2. Set up the consumer account.
  3. Set up permissions.
  4. Register the producer account in the Data Catalog.
  5. Query your data.

You use Athena in the consumer account to perform different operations using the producer account’s Data Catalog.

First, you use the consumer account to query the orders table in the producer account’s Data Catalog.

Next, you use the consumer account to join the two tables and retrieve information about lost profit from returned items. The returns table is in the consumer’s Data Catalog, and the orders table is in the producer’s.

Prerequisites

The following are the prerequisites for this walkthrough:

This lists all your Athena workgroups. Make sure that the one you use runs on Athena engine version 2.

If all your workgroups are using Athena engine version 1, you need to update the engine version of an existing workgroup or create a new workgroup with the appropriate version.

Set up the producer account

In the producer account, complete the following steps:

  1. Create an S3 bucket for your producer’s data. For information about how to secure your S3 bucket, see Security Best Practices for Amazon S3.
  2. In this bucket, create a prefix named orders.
  3. Download the orders table in CSV format and upload it to the orders prefix.
  4. Run the following Athena query to create the producer’s database:
CREATE DATABASE producer_database
  COMMENT 'Producer data'
  1. Run the following Athena query to create the orders table in the producer’s database. Make sure to replace <your-producer-s3-bucket-name> with the name of the bucket you created.
CREATE EXTERNAL TABLE producer_database.orders(
  `row id` bigint, 
  `order id` string, 
  `order date` string, 
  `ship date` string, 
  `ship mode` string, 
  `customer id` string, 
  `customer name` string, 
  `segment` string, 
  `city` string, 
  `state` string, 
  `country` string, 
  `postal code` bigint, 
  `market` string, 
  `region` string, 
  `product id` string, 
  `category` string, 
  `sub-category` string, 
  `product name` string, 
  `sales` string, 
  `quantity` bigint, 
  `discount` string, 
  `profit` string, 
  `shipping cost` string, 
  `order priority` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\;'
LOCATION
  's3://<your-producer-s3-bucket-name>/orders/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
)

Set up the consumer account

In the consumer account, complete the following steps:

  1. Create an S3 bucket for your consumer’s data.
  2. In this bucket, create a prefix named returns.
  3. Download the returns table in CSV format and upload it to the returns prefix.
  4. Run the following Athena query to create the consumer’s database:
CREATE DATABASE consumer_database
COMMENT 'Consumer data'
  1. Run the following Athena query to create the returns table in the consumer’s database. Make sure to replace <your-consumer-s3-bucket-name> with the name of the bucket you created.
CREATE EXTERNAL TABLE consumer_database.returns(
  `returned` string, 
  `order id` string, 
  `market` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\;' 
LOCATION
  's3://<your-consumer-s3-bucket-name>/returns/'
TBLPROPERTIES (
  'skip.header.line.count'='1'
)

Set up permissions

For the consumer account to query data in the producer account, we need to set up permissions.

First, we give the consumer account permission to access the producer account’s AWS Glue resources.

  1. In the producer account’s Data Catalog settings, add the following AWS Glue resource policy, which grants the consumer account access to the Data Catalog:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Consumer-account-id>:role/<role-in-consumer-account>"
            },
            "Action": [
        "glue:GetDatabases",
        "glue:GetTables"
      ],
            "Resource": [
                "arn:aws:glue:<Region>:<Producer-account-id>:catalog",
                "arn:aws:glue:<Region>:<Producer-account-id>:database/producer-database",
                "arn:aws:glue:<Region>:<Producer-account-id>:table/producer-database/orders"
            ]
        }
    ]
}

Next, we give the consumer account permission to list and get data from the S3 bucket in the producer account.

  1. In the producer account, add the following S3 bucket policy to the bucket <Producer-bucket>, which stores the data:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Consumer-account-id>:role/<role-in-consumer-account>"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<Producer-bucket>",
                "arn:aws:s3:::<Producer-bucket>/orders/*"
            ]
        }
    ]
}
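
Both policies can also be applied programmatically from the producer account. The following boto3 sketch assumes the two policy documents above are saved locally as glue-policy.json and s3-policy.json, with the placeholders replaced by real values.

import boto3

# run with credentials for the producer account
glue = boto3.client("glue", region_name="<Region>")
s3 = boto3.client("s3")

# grant the consumer role access to the producer's Data Catalog
with open("glue-policy.json") as f:
    glue.put_resource_policy(PolicyInJson=f.read())

# allow the consumer role to list and read the orders data in S3
with open("s3-policy.json") as f:
    s3.put_bucket_policy(Bucket="<Producer-bucket>", Policy=f.read())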

Register the producer account’s Data Catalog

At this stage, you have set up the required permissions to access the central Data Catalog in the producer account from the consumer account. You now need to register the central Data Catalog as a data source in Athena.

  1. In the consumer account, go the Athena console and choose Connect data source.
  2. Select S3 – AWS Glue Data Catalog as the data source selection.
  3. Select AWS Glue Data Catalog in another account.

You then need to provide some information regarding the central Data Catalog you want to register.

  1. For Data source name, enter a name for the catalog (for example, Central_Data_Catalog). This serves as an alias in the consumer account, pointing to the central Data Catalog in the producer account.
  2. For Catalog ID, enter the producer account ID.
  3. Choose Register to complete the process.
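
The same registration can be done through the Athena API; the following is a minimal boto3 sketch of the equivalent call, with the Region and producer account ID as placeholders.

import boto3

athena = boto3.client("athena", region_name="<Region>")

# register the producer's Data Catalog under the alias Central_Data_Catalog
athena.create_data_catalog(
    Name="Central_Data_Catalog",
    Type="GLUE",
    Description="Central Data Catalog in the producer account",
    Parameters={"catalog-id": "<Producer-account-id>"},
)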

Query your data

You have now registered the central Data Catalog as a data source in the consumer account. In the Athena query editor, you can then choose Central_Data_Catalog as a data source. Under Database, you can see all the databases for which you were granted access in the producer account’s AWS Glue resource policy. The same applies for the tables. After completing the steps in the earlier sections, you should see the orders table from producer_database located in the producer account.

You can start querying the Data Catalog of the producer account directly from Athena in the consumer account. You can test this by running the following SQL query in Athena:

SELECT * FROM "Central_Data_Catalog"."producer_database"."orders" limit 10;

This SQL query extracts the first 10 rows of the orders table located in the producer account.

You just queried a Data Catalog located in another AWS account, which enables you to easily access your central Data Catalog and scale your data lake strategy.

Now, let’s see how we can join two tables that are in different AWS accounts. In our scenario, the returns table is in the consumer account and the orders table is in the producer account. Suppose you want to join the two tables and see the total amount of items returned in each market. The Athena built-in support for cross-account Data Catalogs makes this operation easy. In the Athena query editor, run the following SQL query:

SELECT
returns_tb.Market as Market,
sum(orders_tb.quantity) as Total_Quantity
FROM "Central_Data_Catalog"."producer_database"."orders" as orders_tb
JOIN "AwsDataCatalog"."consumer_database"."returns" as returns_tb
ON orders_tb."order id" = returns_tb."order id"
GROUP BY returns_tb.Market;

In this SQL query, you use both the consumer’s Data Catalog AwsDataCatalog and the producer’s Data Catalog Central_Data_Catalog to join tables and get insights from your data.

Limitations and considerations

The following are some limitations that you should take into consideration before using Athena built-in support for cross-account Data Catalogs:

  • This Athena feature is available only in Regions where Athena engine version 2 is supported. For a list of Regions that support Athena engine version 2, see Athena engine version 2. To upgrade a workgroup to engine version 2, see Changing Athena Engine Versions.
  • As of this writing, CREATE VIEW statements that include a cross-account Data Catalog are not supported.
  • Cross-Region Data Catalog queries are not supported.

Clean up

After you query and analyze the data, you should clean up the resources used in this tutorial to prevent any recurring AWS costs.

To clean up the resources, navigate to the Amazon S3 console in both the producer and consumer accounts, and empty the S3 buckets. Also, navigate to the AWS Glue console and delete the databases.

Conclusion

In this post, you learned how to query data from multiple accounts using Athena, which allows your organization to access a centralized Data Catalog. We hope that this post helps you build and explore your data lake across multiple accounts.

To learn more about AWS tools to manage access to your data, check out AWS Lake Formation. This service facilitates setting up a centralized data lake and allows you to grant users and ETL jobs cross-account access to Data Catalog metadata and underlying data.


About the Authors

Louis Hourcade is a Data Scientist in the AWS Professional Services team. He works with AWS customer across various industries to accelerate their business outcomes with innovative technologies. In his spare time he enjoys running, climbing big rocks, and surfing (not so big) waves.

Sara Kazdagli is a Professional Services consultant specialized in data analytics and machine learning. She helps customers across different industries build innovative solutions and make data-driven decisions. Sara holds a MSc in Software engineering and a MSc in data science. In her spare time, she likes to go on hikes and walks with her australian shepherd dog Kiba.

Jahed Zaïdi is an AI/ML & Big Data specialist at AWS Professional Services. He is a builder and a trusted advisor to companies across industries, helping them innovate faster and on a larger scale. As a lifelong explorer, Jahed enjoys discovering new places, cultures, and outdoor activities.

Stream Apache HBase edits for real-time analytics

Post Syndicated from Amir Shenavandeh original https://aws.amazon.com/blogs/big-data/stream-apache-hbase-edits-for-real-time-analytics/

Apache HBase is a non-relational database. To use the data, applications need to query the database to pull the data and changes from tables. In this post, we introduce a mechanism to stream Apache HBase edits into streaming services such as Apache Kafka or Amazon Kinesis Data Streams. In this approach, changes to data are pushed and queued into a streaming platform such as Kafka or Kinesis Data Streams for real-time processing, using a custom Apache HBase replication endpoint.

We start with a brief technical background on HBase replication and review a use case in which we store IoT sensor data into an HBase table and enrich rows using periodic batch jobs. We demonstrate how this solution enables you to enrich the records in real time and in a serverless way using AWS Lambda functions.

Common scenarios and use cases of this solution are as follows:

  • Auditing data, triggers, and anomaly detection using Lambda
  • Shipping WALEdits via Kinesis Data Streams to Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) to index records asynchronously
  • Triggers on Apache HBase bulk loads
  • Processing streamed data using Apache Spark Streaming, Apache Flink, or Amazon Kinesis Data Analytics
  • HBase edits or change data capture (CDC) replication into other storage platforms, such as Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), and Amazon DynamoDB
  • Incremental HBase migration by replaying the edits from any point in time, based on configured retention in Kafka or Kinesis Data Streams

This post progresses into some common use cases you might encounter, along with their design options and solutions. We review and expand on these scenarios in separate sections, in addition to the considerations and limits in our application designs.

Introduction to HBase replication

At a very high level, the principle of HBase replication is based on replaying transactions from a source cluster to the destination cluster. This is done by replaying WALEdits (Write Ahead Log entries) from the RegionServers of the source cluster on the destination cluster. In HBase, all data mutations such as PUT or DELETE are written to the MemStore of their specific region and appended to a WAL file as WALEdits, or entries. Each WALEdit represents a transaction and can carry multiple write operations on a row. Because the MemStore is an in-memory entity, if a region server fails, the lost data can be replayed and restored from the WAL files. Having a WAL is optional; some operations don't require WALs or can request to bypass them for quicker writes. For example, records in a bulk load aren't recorded in the WAL.

HBase replication is based on transferring WALEdits to the destination cluster and replaying them so any operation that bypasses WAL isn’t replicated.

When setting up a replication in HBase, a ReplicationEndpoint implementation needs to be selected in the replication configuration when creating a peer, and on every RegionServer, an instance of ReplicationEndpoint runs as a thread. In HBase, a replication endpoint is pluggable for more flexibility in replication and shipping WALEdits to different versions of HBase. You can also use this to build replication endpoints for sending edits to different platforms and environments. For more information about setting up replication, see Cluster Replication.

HBase bulk load replication HBASE-13153

In HBase, bulk loading is a method to directly import HFiles or Store files into RegionServers. This avoids the normal write path and WALEdits. As a result, far fewer CPU and network resources are used when importing big portions of data into HBase tables.

You can also use HBase bulk loads to recover data when an error or outage causes the cluster to lose track of regions and Store files.

Because bulk loads skip WAL creation, the new records aren’t replicated to the secondary cluster. HBASE-13153 is an enhancement that represents a bulk load as a bulk load event, carrying the location of the imported files. You can activate this by setting hbase.replication.bulkload.enabled to true and setting hbase.replication.cluster.id to a unique value as a prerequisite.

Custom streaming replication endpoint

We can use HBase’s pluggable endpoints to stream records into platforms such as Kinesis Data Streams or Kafka. Transferred records can be consumed by Lambda functions, processed by a Spark Streaming application or Apache Flink on Amazon EMR, Kinesis Data Analytics, or any other big data platform.

In this post, we demonstrate an implementation of a custom replication endpoint that replicates WALEdits into Kinesis Data Streams or Kafka topics.

In our example, we extended the BaseReplicationEndpoint abstract class, which implements the ReplicationEndpoint interface.

The main method to implement and override is the replicate method. This method replicates a set of items it receives every time it’s called and blocks until all those records are replicated to the destination.

For our implementation and configuration options, see our GitHub repository.
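To make the contract concrete, the following is a minimal Python sketch of the control flow a replicate implementation must follow. The actual endpoint in the repository is written in Java against the HBase ReplicationEndpoint API; the serialize_wal_entry helper and the stream name here are simplified assumptions for illustration only.

import boto3

kinesis = boto3.client("kinesis")

def serialize_wal_entry(entry):
    # Simplified stand-in for the real WALEdit serialization used by the Java endpoint
    return entry if isinstance(entry, bytes) else str(entry).encode("utf-8")

def replicate(entries, stream_name="hbase-wal-edits"):  # hypothetical stream name
    # Mirror of the ReplicationEndpoint.replicate contract: return True only when every
    # entry in the batch has been shipped, otherwise return False so HBase retries the batch
    try:
        for entry in entries:
            kinesis.put_record(
                StreamName=stream_name,
                Data=serialize_wal_entry(entry),
                PartitionKey="replication",
            )
        return True   # HBase moves on to the next batch of WAL entries
    except Exception:
        return False  # HBase retries the same batch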

Use case: Enrich records in real time

We now use the custom streaming replication endpoint implementation to stream HBase edits to Kinesis Data Streams.

The provided AWS CloudFormation template demonstrates how to set up an EMR cluster with replication to either Kinesis Data Streams or Apache Kafka and how to consume the replicated records, using a Lambda function to enrich data asynchronously, in real time. In our sample project, we launch an EMR cluster with an HBase database. A sample IoT traffic generator application runs as a step in the cluster and puts records, containing a registration number and an instantaneous speed, into a local HBase table. Records are replicated in real time into a Kinesis stream or Kafka topic, based on the option selected at launch, using our custom HBase replication endpoint. When the step starts putting records into the stream, a Lambda function is provisioned and starts consuming records from the beginning of the stream until it catches up. The function calculates a score per record, based on a formula over the variation from the minimum and maximum speed limits in the use case, and persists the result as a score qualifier by running HBase puts on the same RowKey. The score is written to a different column family, which is excluded from the replication scope in the source table.

The following diagram illustrates this architecture.

To launch our sample environment, you can use our template on GitHub.

The template creates a VPC, public and private subnets, Lambda functions to consume records and prepare the environment, an EMR cluster with HBase, and a Kinesis data stream or an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster depending on the selected parameters when launching the stack.
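Before moving on, here is a minimal sketch of the scoring Lambda consumer described above. It assumes each replicated record decodes to JSON with row_key and speed fields, that an HBase Thrift server is reachable from the function, and that the happybase client is packaged with it; the field names, speed limits, endpoint, and table name are illustrative, not the ones used in the sample project.

import base64
import json

import happybase  # assumes the Thrift client library is packaged with the function

SPEED_MIN, SPEED_MAX = 30.0, 120.0                   # illustrative speed limits
HBASE_THRIFT_HOST = "hbase-thrift.example.internal"  # hypothetical endpoint
TABLE_NAME = "iot_speed"                             # hypothetical table name

def score(speed):
    # Normalize the variation from the minimum and maximum speed limits into a 0-1 score
    clamped = max(SPEED_MIN, min(speed, SPEED_MAX))
    return (clamped - SPEED_MIN) / (SPEED_MAX - SPEED_MIN)

def handler(event, context):
    connection = happybase.Connection(HBASE_THRIFT_HOST)
    table = connection.table(TABLE_NAME)
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        s = score(float(payload["speed"]))
        # Write the score into a column family that is excluded from the replication scope
        table.put(payload["row_key"], {b"score:value": str(s).encode("utf-8")})
    connection.close()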

Architectural design patterns

Traditionally, Apache HBase tables are considered data stores, where consumers get or scan records from tables. It’s very common in modern databases to react to database logs or CDC for real-time use cases and triggers. With our streaming HBase replication endpoint, we can project table changes into message delivery systems like Kinesis Data Streams or Apache Kafka.

We can trigger Lambda functions to consume records from Apache Kafka or Kinesis Data Streams in a serverless design, or use the wider Amazon Kinesis ecosystem, such as Kinesis Data Analytics for processing or Amazon Kinesis Data Firehose for delivery into Amazon S3. You could also pipe the records into Amazon OpenSearch Service.

A wide range of consumer ecosystems, such as Apache Spark, AWS Glue, and Apache Flink, is available to consume from Kafka and Kinesis Data Streams.

Let’s review a few other common use cases.

Index HBase rows

Apache HBase rows are retrievable by RowKey. Writing a row into HBase with the same RowKey overwrites or creates a new version of the row. To retrieve a row, it needs to be fetched by the RowKey or a range of rows needs to be scanned if the RowKey is unknown.

In some use cases, scanning the table for a specific qualifier or value is expensive. Instead, we can index our rows asynchronously in another, parallel system like Elasticsearch, and applications can use the index to find the RowKey. Without this solution, a periodic job has to scan the table and write the rows into an indexing service like Elasticsearch to hydrate the index, or the producing application has to write to both HBase and Elasticsearch directly, which adds overhead to the producer.
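As a rough sketch of this pattern, a consumer could maintain the index with one HTTP call per replicated edit. The index name, document fields, and domain endpoint below are hypothetical, and a production consumer would typically use the OpenSearch bulk API with signed requests instead.

import json

import requests  # assumes the requests library is packaged with the consumer

OPENSEARCH_ENDPOINT = "https://search-example.us-east-1.es.amazonaws.com"  # hypothetical domain

def index_edit(row_key, column, value):
    # Store a small document that maps an indexed qualifier value back to its RowKey,
    # so applications can look up the RowKey without scanning the HBase table
    doc = {"row_key": row_key, "column": column, "value": value}
    requests.put(
        f"{OPENSEARCH_ENDPOINT}/hbase-row-index/_doc/{row_key}",
        data=json.dumps(doc),
        headers={"Content-Type": "application/json"},
        timeout=5,
    )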

Enrich and audit data

A very common use case for HBase streaming endpoints is enriching data and storing the enriched records in a data store, such as Amazon S3 or an RDS database. In this scenario, a custom HBase replication endpoint streams the records into a message distribution system such as Apache Kafka or Kinesis Data Streams. Records can be serialized using the AWS Glue Schema Registry for schema validation. A consumer on the other end of the stream reads the records, enriches them, and validates them against a machine learning model in Amazon SageMaker for anomaly detection. The consumer persists the records in Amazon S3 and potentially triggers an alert using Amazon Simple Notification Service (Amazon SNS). Data stored on Amazon S3 can be further processed on Amazon EMR, or we can create a dashboard in Amazon QuickSight, using Amazon Athena for queries.

The following diagram illustrates our architecture.

Store and archive data lineage

Apache HBase comes with the snapshot feature. You can freeze the state of tables into snapshots and export them to any distributed file system like HDFS or Amazon S3. Recovering snapshots restores the entire table to the snapshot point.

Apache HBase also supports versioning at the row level. You can configure column families to keep row versions, and the default versioning is based on timestamps.

In contrast, when using this approach to stream records into Kafka or Kinesis Data Streams, the records are retained inside the stream, and you can replay any portion of the retention period. Restoring a snapshot only recovers the table up to the snapshot point; records written after the snapshot aren’t present.

In Kinesis Data Streams, by default records of a stream are accessible for up to 24 hours from the time they are added to the stream. This limit can be increased to up to 7 days by enabling extended data retention, or up to 365 days by enabling long-term data retention. See Quotas and Limits for more information.
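For example, if you want to be able to replay more than the default 24 hours, you can raise the stream’s retention period. The following boto3 sketch assumes a hypothetical stream name.

import boto3

kinesis = boto3.client("kinesis")

# Extend retention to 7 days (168 hours); long-term retention supports up to 8,760 hours (365 days)
kinesis.increase_stream_retention_period(
    StreamName="hbase-replication-stream",  # hypothetical stream name
    RetentionPeriodHours=168,
)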

In Apache Kafka, record retention is limited only by the available resources and disk space configured on the Kafka cluster, and can be configured by setting the log.retention properties (for example, log.retention.hours or log.retention.bytes).

Trigger on HBase bulk load

The HBase bulk load feature uses a MapReduce job to output table data in HBase’s internal data format, and then directly loads the generated Store files into the running cluster. Bulk loading uses less CPU and network bandwidth than loading through the HBase API, because it bypasses WALs in the write path; as a result, the records aren’t seen by replication. However, since HBASE-13153, you can configure HBase to replicate a meta record as an indication of a bulk load event.

A Lambda function processing replicated WALEdits can listen to this event to trigger actions, such as automatically refreshing a read replica HBase cluster on Amazon S3 whenever a bulk load happens. The following diagram illustrates this workflow.

Considerations for replication into Kinesis Data Streams

Kinesis Data Streams is a massively scalable and durable real-time data streaming service. Kinesis Data Streams can continuously capture gigabytes of data per second from hundreds of thousands of sources with very low latency. Kinesis is fully managed and runs your streaming applications without requiring you to manage any infrastructure. It’s durable, because records are synchronously replicated across three Availability Zones, and you can increase data retention to 365 days.

When considering Kinesis Data Streams for any solution, it’s important to consider service limits. For instance, as of this writing, the maximum size of the data payload of a record before base64-encoding is up to 1 MB, so we must make sure the records or serialized WALEdits remain within the Kinesis record size limit. To be more efficient, you can enable the hbase.replication.compression-enabled attribute to GZIP compress the records before sending them to the configured stream sink.

Kinesis Data Streams retains the order of records within a shard as they arrive, and records can be read and processed in the same order. However, in this sample custom replication endpoint, a random partition key is used so that the records are evenly distributed between the shards. We could instead use a hash function to generate the partition key when putting records into the stream, for example based on the Region ID, so that all the WALEdits from the same Region land in the same shard and consumers can assume Region locality per shard.
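The following sketch shows the idea of deriving a deterministic partition key from the encoded Region name so that edits from the same Region always land on the same shard. It uses boto3 for brevity, whereas the sample endpoint ships records with the Java KPL; the stream name is a placeholder.

import hashlib

import boto3

kinesis = boto3.client("kinesis")

def region_partition_key(encoded_region_name):
    # The same Region always hashes to the same partition key, keeping its WALEdits ordered within one shard
    return hashlib.md5(encoded_region_name.encode("utf-8")).hexdigest()

def put_wal_edit(encoded_region_name, serialized_edit, stream_name="hbase-wal-edits"):
    kinesis.put_record(
        StreamName=stream_name,
        Data=serialized_edit,  # bytes, already serialized by the endpoint
        PartitionKey=region_partition_key(encoded_region_name),
    )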

For delivering records in KinesisSinkImplemetation, we use the Amazon Kinesis Producer Library (KPL) to put records into Kinesis data streams. The KPL simplifies producer application development and helps us achieve high write throughput to a Kinesis data stream. You can use the KPL in either synchronous or asynchronous mode; we suggest the higher-performance asynchronous interface unless there is a specific reason to use synchronous behavior. The KPL is very configurable and has retry logic built in. You can also perform record aggregation for maximum throughput. In KinesisSinkImplemetation, records are asynchronously replicated to the stream by default. You can switch to synchronous mode by setting hbase.replication.kinesis.syncputs to true, and enable record aggregation by setting hbase.replication.kinesis.aggregation-enabled to true.

The KPL can incur an additional processing delay because it buffers records before sending them to the stream, based on the user-configurable RecordMaxBufferedTime attribute. Larger values of RecordMaxBufferedTime result in higher packing efficiency and better performance. However, applications that can’t tolerate this additional delay may need to use the AWS SDK directly.

Kinesis Data Streams and the rest of the Kinesis family are fully managed and integrate easily with other AWS services, such as the AWS Glue Schema Registry and Lambda, with minimal development effort. We recommend considering Kinesis Data Streams for low-latency, real-time use cases on AWS.

Considerations for replication into Apache Kafka

Apache Kafka is a high-throughput, scalable, and highly available open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

AWS offers Amazon MSK as a fully managed Kafka service. Amazon MSK provides the control plane operations and runs open-source versions of Apache Kafka. Existing applications, tooling, and plugins from partners and the Apache Kafka community are supported without requiring changes to application code.

You can configure this sample project for Apache Kafka brokers directly or just point towards an Amazon MSK ARN for replication.

Although there is virtually no limit on the size of messages in Kafka, the maximum message size is set to 1 MB by default, so we must make sure the records, or serialized WALEdits, remain within the maximum message size configured for the topic.

The Kafka producer tries to batch records together whenever possible to limit the number of requests for more efficiency. This is configurable by setting batch.size, linger.ms, and delivery.timeout.ms.
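As a sketch of what such batching settings look like from a producer’s point of view, the following uses the kafka-python client rather than the Java producer used by the endpoint; the broker address, topic, and values are placeholders.

from kafka import KafkaProducer  # kafka-python client, used here for illustration only

producer = KafkaProducer(
    bootstrap_servers=["broker1.example.internal:9092"],  # placeholder broker
    batch_size=32768,  # bytes collected per partition before a send is triggered
    linger_ms=50,      # wait up to 50 ms to fill a batch before sending
    acks="all",        # wait for acknowledgment from all in-sync replicas
)

producer.send("hbase-wal-edits", key=b"region-key", value=b"serialized-waledit")
producer.flush()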

In Kafka, topics are partitioned, and partitions are distributed between different Kafka brokers. This distributed placement allows for load balancing of consumers and producers. When a new event is published to a topic, it’s appended to one of the topic’s partitions. Events with the same event key are written to the same partition, and Kafka guarantees that any consumer of a given topic partition always reads that partition’s events in exactly the same order as they were written. KafkaSinkImplementation uses a random partition key to distribute messages evenly between the partitions. This could instead be a deterministic function, for example based on the Region ID, if the order of the WALEdits or record locality is important to the consumers.

Semantic guarantees

Like any streaming application, it’s important to consider the semantic guarantees of the message producer, how the delivery of messages to the message queue is acknowledged or failed, and how checkpointing is done on the consumer’s side. Based on our use cases, we need to consider the following:

  • At most once delivery – Messages are never delivered more than once, and there is a chance of losing messages
  • At least once delivery – Messages can be delivered more than once, with no loss of messages
  • Exactly once delivery – Every message is delivered only once, and there is no loss of messages

After changes are persisted in the WAL and in the MemStore, the replicate method in ReplicationEndpoint is called to replicate a collection of WAL entries and returns a Boolean (true/false) value. If the returned value is true, the entries are considered successfully replicated by HBase and the replicate method is called for the next batch of WAL entries. Depending on configuration, both the KPL and Kafka producers might buffer records for longer if configured for asynchronous writes. Failures can cause loss of entries, retries, and duplicate delivery of records to the stream, which could be detrimental to either synchronous or asynchronous message delivery.

If your operations aren’t idempotent, you can checkpoint or check for unique sequence numbers on the consumer side. For simple HBase record replication, RowKey operations are idempotent because they carry a timestamp and sequence ID.
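One simple way to guard against duplicate delivery on the consumer side is a conditional write keyed on the record’s sequence number, sketched below with DynamoDB; the table and attribute names are hypothetical.

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def process_once(sequence_number, process_fn):
    # Record the sequence number first; the conditional write fails if we've already seen it
    try:
        dynamodb.put_item(
            TableName="processed-sequence-numbers",  # hypothetical table
            Item={"seq": {"S": sequence_number}},
            ConditionExpression="attribute_not_exists(seq)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery; skip processing
        raise
    process_fn()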

Summary

Replication of HBase WALEdits into streams is a powerful tool that you can use in multiple use cases and in combination with other AWS services. You can create practical solutions to further process records in real time, audit the data, detect anomalies, set triggers on ingested data, or archive data in streams to be replayed on other HBase databases or storage services from a point in time. This post outlined some common use cases and solutions, along with some best practices when implementing your custom HBase streaming replication endpoints.

Review, clone, and try our HBase replication endpoint implementation from our GitHub repository and launch our sample CloudFormation template.

We’d like to learn about your use cases. If you have questions or suggestions, please leave a comment.


About the Authors

Amir Shenavandeh is a Senior Hadoop systems engineer and Amazon EMR subject matter expert at Amazon Web Services. He helps customers with architectural guidance and optimization. He leverages his experience to help people bring their ideas to life, focusing on distributed processing and big data architectures.

Maryam Tavakoli is a Cloud Engineer and Amazon OpenSearch subject matter expert at Amazon Web Services. She helps customers with their Analytics and Streaming workload optimization and is passionate about solving complex problems with simplistic user experience that can empower customers to be more productive.

Use unsupervised training with K-means clustering in Amazon Redshift ML

Post Syndicated from Phil Bates original https://aws.amazon.com/blogs/big-data/use-unsupervised-training-with-k-means-clustering-in-amazon-redshift-ml/

Amazon Redshift is the fastest, most widely used, fully managed, and petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train machine learning (ML) models, which can then be used to generate insights for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. In previous posts, we covered how Amazon Redshift supports supervised learning that includes regression, binary classification, and multiclass classification, as well as training models using XGBoost and providing advanced options such as preprocessors, problem type, and hyperparameters.

In this post, we use Redshift ML to perform unsupervised learning on unlabeled training data using the K-means algorithm. This algorithm solves clustering problems where you want to discover groupings in the data. Unlabeled data is grouped and partitioned based on its similarities and differences. The K-means algorithm iteratively determines the best centroids and assigns each member to the closest centroid. Data points nearest the same centroid belong to the same group. Members of a group are as similar as possible to other members in the same group, and as different as possible from members of other groups. To learn more about K-means clustering, see K-means clustering with Amazon SageMaker.

Solution overview

The following are some use cases for K-means:

  • Ecommerce and retail – Segment your customers by purchase history, stores they visited, or clickstream activity.
  • Healthcare – Group similar images for image detection. For example, you can detect patterns for diseases or successful treatment scenarios.
  • Finance – Detect fraud by detecting anomalies in the dataset. For example, you can detect credit card fraud by abnormal purchase patterns.
  • Technology – Build a network intrusion detection system that aims to identify attacks or malicious activity.
  • Meteorology – Detect anomalies in sensor data collection such as storm forecasting.

In our example, we use K-means on the Global Database of Events, Language, and Tone (GDELT) dataset, which monitors news coverage from around the world, with data recorded for every second of every day. This information is freely available as part of the Registry of Open Data on AWS.

The data is stored as multiple files on Amazon Simple Storage Service (Amazon S3), with two different formats: historical, which covers the years 1979–2013, and daily updates, which cover the years 2013 and later. For this example, we use the historical format and bring in 1979 data.

For our use case, we use a subset of the data’s attributes:

  • EventCode – The raw CAMEO action code describing the action that Actor1 performed upon Actor2.
  • NumArticles – The total number of source documents containing one or more mentions of this event. You can use this to assess the importance of an event. The more discussion of that event, the more likely it is to be significant.
  • AvgTone – The average tone of all documents containing one or more mentions of this event. The score ranges from -100 (extremely negative) to +100 (extremely positive). Common values range between -10 and +10, with 0 indicating neutral.
  • Actor1Geo_Lat – The centroid latitude of the Actor1 landmark for mapping.
  • Actor1Geo_Long – The centroid longitude of the Actor1 landmark for mapping.
  • Actor2Geo_Lat – The centroid latitude of the Actor2 landmark for mapping.
  • Actor2Geo_Long – The centroid longitude of the Actor2 landmark for mapping.

Each row corresponds to an event at a specific location. For example, rows 53-57 in the file 1979.csv, which we use below, all seem to refer to interactions between FRA and AFR, dealing with consultation and diplomatic relations with a mostly positive tone. It is hard, if not impossible, for us to make sense of such data at scale. Clusters of events, whether with a similar tone, occurring in similar locations, or between similar actors, are useful in visualizing and interpreting the data. Clustering can also reveal non-obvious structures, such as potential common causes for different events, the propagation of a root event across the globe, or the change in tone toward a common event over time. However, we don’t know what makes two events similar: is it the location, the two actors, the tone, the time, or some combination of these? Clustering algorithms can learn from data and determine 1) what makes different data points similar, 2) which data points are related to which other data points, and 3) what the common characteristics of these related data points are.

Prerequisites

To get started, we need an Amazon Redshift cluster with version 1.0.33433 or higher and an AWS Identity and Access Management (IAM) role attached that provides access to Amazon SageMaker and permissions to an S3 bucket.

For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

To create a simple cluster, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose Create cluster.
  3. Provide the configuration parameters such as cluster name, user name, and password.
  4. For Associated IAM roles, on the menu Manage IAM roles, choose Create IAM role.

If you have an existing role with the required parameters, you can choose Associate IAM roles.

  5. Select Specific S3 buckets and choose a bucket for storing the artifacts generated by Redshift ML.
  6. Choose Create IAM role as default.

A default IAM role is created for you and automatically associated with the cluster.

  7. Choose Create cluster.

Prepare the data

Load the GDELT data into Amazon Redshift using the following SQL. You can use the Amazon Redshift Query Editor v2 or your favorite SQL tool to run these commands.

To create the table, use the following commands:

DROP TABLE IF EXISTS gdelt_data CASCADE;

CREATE TABLE gdelt_data (
GlobalEventId   bigint,
SqlDate  bigint,
MonthYear bigint,
Year   bigint,
FractionDate double precision,
Actor1Code varchar(256),
Actor1Name varchar(256),
Actor1CountryCode varchar(256),
Actor1KnownGroupCode varchar(256),
Actor1EthnicCode varchar(256),
Actor1Religion1Code varchar(256),
Actor1Religion2Code varchar(256),
Actor1Type1Code varchar(256),
Actor1Type2Code varchar(256),
Actor1Type3Code varchar(256),
Actor2Code varchar(256),
Actor2Name varchar(256),
Actor2CountryCode varchar(256),
Actor2KnownGroupCode varchar(256),
Actor2EthnicCode varchar(256),
Actor2Religion1Code  varchar(256),
Actor2Religion2Code varchar(256),
Actor2Type1Code varchar(256),
Actor2Type2Code varchar(256),
Actor2Type3Code varchar(256),
IsRootEvent bigint,
EventCode bigint,
EventBaseCode bigint,
EventRootCode bigint,
QuadClass bigint,
GoldsteinScale double precision,
NumMentions bigint,
NumSources bigint,
NumArticles bigint,
AvgTone double precision,
Actor1Geo_Type bigint,
Actor1Geo_FullName varchar(256),
Actor1Geo_CountryCode varchar(256),
Actor1Geo_ADM1Code varchar(256),
Actor1Geo_Lat double precision,
Actor1Geo_Long double precision,
Actor1Geo_FeatureID bigint,
Actor2Geo_Type bigint,
Actor2Geo_FullName varchar(256),
Actor2Geo_CountryCode varchar(256),
Actor2Geo_ADM1Code varchar(256),
Actor2Geo_Lat double precision,
Actor2Geo_Long double precision,
Actor2Geo_FeatureID bigint,
ActionGeo_Type bigint,
ActionGeo_FullName varchar(256),
ActionGeo_CountryCode varchar(256),
ActionGeo_ADM1Code varchar(256),
ActionGeo_Lat double precision,
ActionGeo_Long double precision,
ActionGeo_FeatureID bigint,
DATEADDED bigint 
) ;

To load data into the table, use the following command:

COPY gdelt_data FROM 's3://gdelt-open-data/events/1979.csv'
region 'us-east-1' iam_role default csv delimiter '\t';  

Create a model in Redshift ML

When using the K-means algorithm, you must specify an input K that specifies the number of clusters to find in the data. The output of this algorithm is a set of K centroids, one for each cluster. Each data point belongs to one of the K clusters that is closest to it. Each cluster is described by its centroid, which can be thought of as a multi-dimensional representation of the cluster. The K-means algorithm compares the distances between centroids and data points to learn how different the clusters are from each other. A larger distance generally indicates a greater difference between the clusters.

Before we create the model, let’s examine the training data by running the following SQL code in Amazon Redshift Query Editor v2:

select AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long
from gdelt_data

The following screenshot shows our results.

We create a model with seven clusters from this data (see the following code). You can experiment by changing the K value and creating different models. The SageMaker K-means algorithm can obtain a good clustering with only a single pass over the data with very fast runtimes.

CREATE MODEL news_data_clusters
FROM (select AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long
   from gdelt_data)
FUNCTION  news_monitoring_cluster
IAM_ROLE default
AUTO OFF
MODEL_TYPE KMEANS
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT (K '7')
SETTINGS (S3_BUCKET '<<your-amazon-s3-bucket-name>>');

For more information about model training, see Machine learning overview. For a list of other hyperparameters that K-means supports, see K-means Hyperparameters. For the full syntax of CREATE MODEL, see our documentation.

You can use the SHOW MODEL command to view the status of the model:

SHOW MODEL NEWS_DATA_CLUSTERS;

The results show that our model is in the READY state.

We can now run a query to identify the clusters. The following query shows the cluster associated with each GlobalEventId:

select globaleventid, news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as cluster 
from gdelt_data;

We get the following results.

Now let’s run a query to check the distribution of data across our clusters to see if seven is the appropriate cluster size for this dataset:

select events_cluster , count(*) as nbr_events  from   
(select globaleventid, news_monitoring_cluster( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster
from gdelt_data)
group by 1;

The results show that very few events are assigned to clusters 1 and 3.

Let’s try running the above query again after re-creating the model with nine clusters. To do this, drop the model and run the same CREATE MODEL statement with the K value changed to 9 in HYPERPARAMETERS DEFAULT EXCEPT (K '9').

Using nine clusters helps smooth out the cluster sizes. The smallest is now approximately 11,000 and the largest is approximately 117,000, compared to 188,000 when using seven clusters.

Now, let’s run the following query to determine the centers of the clusters based on number of articles by event code:

select news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster, eventcode ,sum(numArticles) as numArticles from 
gdelt_data
group by 1,2 ;


Let’s run the following query to get more insights into the datapoints assigned to one of the clusters:

select news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster, eventcode, actor1name, actor2name, sum(numarticles) as totalarticles
from gdelt_data
where events_cluster = 5
and actor1name <> ' ' and actor2name <> ' '
group by 1,2,3,4
order by 5 desc

Observing the data points assigned to the clusters, we see clusters of events corresponding to interactions between the US and China (probably due to the establishment of diplomatic relations), between the US and Russia (probably corresponding to the SALT II Treaty), and those involving Iran (probably corresponding to the Iranian Revolution). Thus, clustering can help us make sense of the data, and show us the way as we continue to explore and use it.

Conclusion

Redshift ML makes it easy for users of all skill levels to use ML technology. With no prior ML knowledge, you can use Redshift ML to gain business insights for your data. You can take advantage of ML approaches such as supervised and unsupervised learning to classify your labeled and unlabeled data, respectively. In this post, we walked you through how to perform unsupervised learning with Redshift ML by creating an ML model that uses the K-means algorithm to discover grouping in your data.

For more information about building different models, see Amazon Redshift ML.


About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Akash Gheewala is a Solutions Architect at AWS. He helps global enterprises across the high tech industry in their journey to the cloud. He does this through his passion for accelerating digital transformation for customers and building highly scalable and cost-effective solutions in the cloud. Akash also enjoys mental models, creating content and vagabonding about the world.

Murali Narayanaswamy is a principal machine learning scientist in AWS. He received his PhD from Carnegie Mellon University and works at the intersection of ML, AI, optimization, learning and inference to combat uncertainty in real-world applications including personalization, forecasting, supply chains and large scale systems.

Optimize your IoT Services for Scale with IoT Device Simulator

Post Syndicated from Ajay Swamy original https://aws.amazon.com/blogs/architecture/optimize-your-iot-services-for-scale-with-iot-device-simulator/

The IoT (Internet of Things) has accelerated digital transformation for many industries. Companies can now offer smarter home devices, remote patient monitoring, connected and autonomous vehicles, smart consumer devices, and many more products. The enormous volume of data emitted from IoT devices can be used to improve performance and efficiency, and to develop new services and business models. This can help you build better relationships with your end consumers. But you’ll need an efficient and affordable way to test your IoT backend services without incurring significant capex by deploying test devices to generate this data.

IoT Device Simulator (IDS) is an AWS Solution that manufacturing companies can use to simulate data, test device integration, and improve the performance of their IoT backend services. The solution enables you to create hundreds of IoT devices with unique attributes and properties. You can simulate data without configuring and managing physical devices.

An intuitive UI to create and manage devices and simulations

IoT Device Simulator comes with an intuitive user interface that enables you to create and manage device types for data simulation. The solution also provides you with a pre-built autonomous car device type to simulate a fleet of connected vehicles. Once you create devices, you can create simulations and generate data (see Figure 1.)

Figure 1. The landing page UI enables you to create devices and simulation

Create devices and simulate data

With IDS, you can create multiple device types with varying properties and data attributes (see Figure 2.) Each device type has a topic where simulation data is sent. The supported data types are object, array, sinusoidal, location, Boolean, integer, float, and more. Refer to this full list of data types. Additionally, you can import device types via a specific JSON format or use the existing automotive demo to pre-populate connected vehicles.

Figure 2. Create multiple device types and their data attributes

Create and manage simulations

With IDS, you can create simulations with one device or multiple device types (see Figure 3.) In addition, you can specify the number of devices to simulate for each device type and how often data is generated and sent.

Figure 3. Create simulations for multiple devices

You can then run multiple simulations (see Figure 4) and use the data generated to test your IoT backend services and infrastructure. In addition, you have the flexibility to stop and restart the simulation as needed.

Figure 4. Run and stop multiple simulations

You can view the simulation in real time and observe the data messages flowing through. This way you can ensure that the simulation is working as expected (see Figure 5.) You can stop the simulation or add a new simulation to the mix at any time.

Figure 5. Observe your simulation in real time

IoT Device Simulator architecture

Figure 6. IoT Device Simulator architecture

The AWS CloudFormation template for this solution deploys the following architecture, shown in Figure 6:

  1. Amazon CloudFront serves the web interface content from an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The Amazon S3 bucket hosts the web interface.
  3. Amazon Cognito user pool authenticates the API requests.
  4. An Amazon API Gateway API provides the solution’s API layer.
  5. AWS Lambda serves as the solution’s microservices and routes API requests.
  6. Amazon DynamoDB stores simulation and device type information.
  7. An AWS Step Functions workflow includes an AWS Lambda simulator function to simulate devices and send messages.
  8. An Amazon S3 bucket stores pre-defined routes that are used for the automotive demo (which is a pre-built example in the solution).
  9. AWS IoT Core serves as the endpoint to which messages are sent.
  10. Amazon Location Service provides the map display showing the location of automotive devices for the automotive demo.

The IoT Device Simulator console is hosted on an Amazon S3 bucket, which is accessed via Amazon CloudFront. It uses Amazon Cognito to manage access. API calls, such as retrieving or manipulating information from the databases or running simulations, are routed through API Gateway. API Gateway calls the microservices, which will call the relevant service.

For example, when creating a new device type, the request is sent to API Gateway, which then routes the request to the microservices Lambda function. Based on the request, the microservices Lambda function recognizes that it is a request to create a device type and saves the device type to DynamoDB.
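A heavily simplified sketch of that flow might look like the following. The route, table name, and item attributes are hypothetical and only illustrate the parse-request, branch-on-route, write-to-DynamoDB pattern, not the solution’s actual code.

import json
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
device_types_table = dynamodb.Table("device-types")  # hypothetical table name

def handler(event, context):
    # With an API Gateway proxy integration, the HTTP method, path, and JSON body arrive in the event
    if event.get("httpMethod") == "POST" and event.get("path") == "/devicetypes":
        body = json.loads(event["body"])
        item = {
            "typeId": str(uuid.uuid4()),
            "name": body["name"],
            "topic": body["topic"],
            "attributes": body.get("attributes", []),
        }
        device_types_table.put_item(Item=item)
        return {"statusCode": 201, "body": json.dumps(item)}
    return {"statusCode": 400, "body": json.dumps({"message": "unsupported route"})}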

Running a simulation

When running a simulation, the microservices Lambda function starts a Step Functions workflow. The request contains information about the simulation to be run, including the unique device type ID. Using the unique device type ID, Step Functions retrieves all the necessary information about each device type to run the simulation. Once all the information has been retrieved, the simulator Lambda function is run. The simulator Lambda function uses the device type information, including the message payload template, to build the message sent to the IoT topic specified for the device type.

When running a custom device type, the simulator generates random values based on the configuration provided for each attribute. When the automotive simulation is run, the simulation runs a series of calculations to simulate an automobile moving along a series of pre-defined routes. The pre-defined routes are created and stored in an S3 bucket when the solution is launched, and the simulation retrieves a route at random each time the Lambda function runs. Automotive demo simulations also show a map generated from Amazon Location Service and display the device locations as they move.

The simulator exits once the Lambda function has completed or has reached the fifteen-minute execution limit, and passes all the necessary information back to the Step Functions workflow. Step Functions then enters a choice state and, if the duration specified for the simulation has not yet been reached, restarts the Lambda function, passing back the pertinent information so that it can resume where it left off. The simulator Lambda function also checks DynamoDB every thirty seconds to see if the user has manually stopped the simulation; if so, it ends the simulation early. Once the simulation is complete, the Step Functions workflow updates the DynamoDB table.
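A stripped-down sketch of that stop-check and time-budget loop is shown below; the table name, status attribute, and time budget are hypothetical, and the real simulator also builds and publishes the simulated device payloads.

import time

import boto3

dynamodb = boto3.resource("dynamodb")
simulations_table = dynamodb.Table("simulations")  # hypothetical table name

TIME_BUDGET_SECONDS = 14 * 60     # stop safely before the fifteen-minute Lambda limit
STOP_CHECK_INTERVAL_SECONDS = 30  # how often to look for a manual stop

def run_simulation_slice(simulation_id, send_messages_fn):
    start = time.time()
    last_check = 0.0
    while time.time() - start < TIME_BUDGET_SECONDS:
        if time.time() - last_check >= STOP_CHECK_INTERVAL_SECONDS:
            item = simulations_table.get_item(Key={"simId": simulation_id}).get("Item", {})
            if item.get("status") == "stopping":  # the user manually stopped the simulation
                return {"finished": True}
            last_check = time.time()
        send_messages_fn()  # publish one round of simulated device messages
        time.sleep(1)
    # Time budget exhausted; the Step Functions choice state decides whether to invoke us again
    return {"finished": False}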

The solution enables you to launch hundreds of devices to test backend infrastructure in an IoT workflow. The solution contains an Import/Export feature to share device types. Exporting a device type generates a JSON file that represents the device type. The JSON file can then be imported to create the same device type automatically. The solution allows the viewing of up to 100 messages while the solution is running. You can also filter the messages by topic and device and see what data each device emits.

Conclusion

IoT Device Simulator is designed to help customers test device integration and IoT backend services more efficiently without incurring capex for physical devices. This solution provides an intuitive web-based graphic user interface (GUI) that enables customers to create and simulate hundreds of connected devices. It is not necessary to configure and manage physical devices or develop time-consuming scripts. Although we’ve illustrated an automotive application in this post, this simulator can be used for many different industries, such as consumer electronics, healthcare equipment, utilities, manufacturing, and more.

Get started with IoT Device Simulator today.

How to set up Amazon Quicksight dashboard for Amazon Pinpoint and Amazon SES engagement events

Post Syndicated from satyaso original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-set-up-amazon-quicksight-dashboard-for-amazon-pinpoint-and-amazon-ses-events/

In this post, we will walk through using Amazon Pinpoint and Amazon Quicksight to create customizable messaging campaign reports. Amazon Pinpoint is a flexible and scalable outbound and inbound marketing communications service that allows customers to connect with users over channels like email, SMS, push, or voice. Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. This solution allows event and user data from Amazon Pinpoint to flow into Amazon Quicksight. Once in Quicksight, customers can build their own reports that shows campaign performance on a more granular level.

Engagement Event Dashboard

Customers want to view the results of their messaging campaigns at ever-increasing levels of granularity and ensure their users see value from the email, SMS, or push notifications they receive. Customers also want to analyze how different user segments respond to different messages, and how to optimize subsequent user communication. Previously, customers could only view this data in Amazon Pinpoint analytics, which offers robust reporting on events, funnels, and campaigns. However, it does not allow analysis across these different parameters or the building of custom reports, for example, showing campaign revenue across different user segments, or showing what events were generated after a user viewed a campaign in a funnel analysis. Customers would need to extract this data themselves and do the analysis in Excel.

Prerequisites

  • The Digital User Engagement Events Database solution must be set up first.
  • Customers should be prepared to pay for Amazon QuickSight, which has its own costs that are not covered by Amazon Pinpoint pricing.

Solution Overview

This solution uses the Athena tables created by the Digital User Engagement Events Database solution. The AWS CloudFormation template given in this post automatically sets up the architecture components to capture detailed notifications about Amazon Pinpoint engagement events and expose them in Amazon Athena in the form of Athena views. You still need to manually configure the Amazon QuickSight dashboards that link to these newly generated Athena views. Follow the steps below for further information.

Use case(s)

The event dashboard solution supports the following use cases:

  • Deep dive into engagement insights (for example, SMS events, email events, campaign events, and journey events).
  • View engagement events at the individual user level.
  • Use data and process mining to turn raw event data into useful marketing insights.
  • Benchmark user engagement and funnel end-user events.
  • Compute campaign conversions (post-campaign user analysis to show campaign effectiveness).
  • Build funnels that show user progression.

Getting started with solution deployment

Prerequisite tasks to be completed before deploying the logging solution

Step 1 – Create an AWS account and an Amazon Pinpoint project, and implement the events database solution.
As part of this step, customers need to implement the Digital User Engagement (DUE) events database solution, because the current solution (the DUE event dashboard) is an extension of the DUE events database solution. The basic assumption here is that the customer has already configured an Amazon Pinpoint project or Amazon SES in the required AWS Region before implementing this step.

The steps required to implement an event dashboard solution are as follows.

a/ Follow the steps mentioned in the event database solution to implement the complete stack. Before installing the complete stack, copy and save the Athena events database name, as shown in the diagram. In my case it is due_eventdb. The database name is required as an input parameter for the current event dashboard solution.

b/ Once the solution is deployed, navigate to the Outputs page of the CloudFormation stack, then copy and save the following information, which is required as input parameters in step 2 of the current event dashboard solution.

Step 2 – Deploy the CloudFormation template for the event dashboard solution
This step generates a number of new Amazon Athena views that serve as a data source for Amazon QuickSight. Continue with the following actions.

  • Download the CloudFormation template (“Event-dashboard.yaml”) from AWS Samples.
  • Navigate to the CloudFormation page in the AWS console, choose “Create stack” at the top right, and select the option “With new resources (standard)”.
  • Leave “Prerequisite – Prepare template” set to “Template is ready” and, for the “Specify template” option, select “Upload a template file”. On the same page, choose “Choose file”, browse to the “Event-dashboard.yaml” file, and select it. Once the file is uploaded, choose “Next” and deploy the stack.

  • Enter the following information under the section “Specify stack details”:
    • EventAthenaDatabaseName – As mentioned in Step 1-a.
    • S3DataLogBucket – As mentioned in Step 1-b.
    • This solution creates five additional Athena views:
      • All_email_events
      • All_SMS_events
      • All_custom_events (Custom events can be Mobile app/WebApp/Push Events)
      • All_campaign_events
      • All_journey_events

Step 3 – Create the Amazon QuickSight engagement dashboard
This step walks you through the process of creating an Amazon QuickSight dashboard for Amazon Pinpoint engagement events using the Athena views you created in step 2.

  1. To set up Amazon QuickSight for the first time, follow this link (this process is not needed if you have already set up Amazon QuickSight). Make sure you are an Amazon QuickSight administrator.
  2. Go to or search for Amazon QuickSight in the AWS console.
  3. Create a new analysis and then select “New dataset”.
  4. Select Athena as the data source.
  5. Next, select the analyses you need for the respective events. This solution provides the option to create five different sets of analyses, as mentioned in Step 2: a/ All email events, b/ All SMS events, c/ All custom events (mobile app, web app, web push, and so on), d/ All campaign events, and e/ All journey events. Dashboards can be created from QuickSight analyses, and the same dashboards can be shared with stakeholders across the organization. The following are the steps to create analyses and dashboards for the different types of events.
  6. Email events –
    • For all email events, name the analysis “All-emails-events” (this can be any customer-preferred nomenclature), select the primary Athena workgroup, and then create a data source.
    • Once you create the data source, QuickSight lists all the views and tables available under the specified database (in our case, due_eventdb). Select the email_all_events view as the data source.
    • Select the event data location for the analysis. There are two main options: a/ Import to SPICE for quicker analysis, or b/ Directly query your data. Select the preferred option and then choose “Visualize the data”.
    • Import to SPICE for quicker analysis – SPICE is the Amazon QuickSight Super-fast, Parallel, In-memory Calculation Engine. It’s engineered to rapidly perform advanced calculations and serve data. In the Enterprise edition, data stored in SPICE is encrypted at rest. (1 GB of storage is available for free; for extra storage, customers need to pay extra. See the cost section in this document.)
    • Directly query your data – This option enables QuickSight to query the Athena or source database directly (in the current case, Athena), and QuickSight does not store any data.
    • Now that you have selected a data source, you are taken to a blank QuickSight canvas (a blank analysis page) as shown in the following image. Drag and drop the visualization type you need onto the auto-graph pane. Note that Amazon QuickSight is a business intelligence platform, so customers are free to choose the desired visualization types to observe the individual engagement events.
    • As part of this blog, we show how to create some simple analysis graphs to visualize the engagement events.
    • As an initial step, select the tabular visualization as shown in the image.
    • Select all the event dimensions that you want to include as columns of the table. An Amazon QuickSight table can be extended to show as many columns as needed; this depends entirely on how much data the marketers or business users want to visualize.
    • Further filtering on the table can be done using QuickSight filters; you can apply a filter on specific granular values to narrow the results. For example, to filter on the destination email ID: 1/ Select the filter from the left-hand menu, 2/ Add the destination field as the filtering criterion, 3/ Tick the destination value you are trying to filter, or search for the destination email ID, and 4/ All the results in the table are then filtered per the filter criterion.
    • As a next step, add another visual from the top left corner (“Add -> Add visual”), then select the donut chart from the Visual types pane. Donut charts are commonly used for displaying aggregations.
    • Then select “event_type” as the group to visualize the aggregated events. This helps marketers and business users figure out how many email events occurred and what the aggregated success ratio, click ratio, complaint ratio, or bounce ratio is for the emails or campaigns sent to end users.
    • To create a QuickSight dashboard from the QuickSight analysis, choose the Share menu option at the top right corner and then select “Publish dashboard”. Provide the required dashboard name while publishing the dashboard. The same dashboard can be shared with multiple audiences in the organization.
    • The following is the final version of the dashboard. As mentioned above, QuickSight dashboards can be shared with other stakeholders, and the complete dashboard can be exported as an Excel sheet.
  7. SMS events –
    • As shown above, SMS events can be analyzed using QuickSight and dashboards can be created from the analysis. Repeat all of the sub-steps listed in step 6. The following is a sample SMS dashboard.
  8. Custom events –
    • After you integrate your application (app) with Amazon Pinpoint, Amazon Pinpoint can stream event data about user activity, different types of custom events, and message deliveries for the app, for example _session.start, Product_page_view, and _session.stop. Repeat all of the sub-steps listed in step 6 to create a custom events dashboard.
  9. Campaign events –
    • As shown before, campaign events can also be included in the same dashboard, or you can create a new dashboard only for campaign events.

Cost of the event dashboard solution
You are responsible for the cost of the AWS services used while running this solution. As of the date of publication, the cost of running this solution with default settings in the US West (Oregon) Region is approximately $65 a month. The cost estimate includes the cost of AWS Lambda, Amazon Athena, and Amazon QuickSight. The estimate assumes querying 1 TB of data in a month, two authors managing Amazon QuickSight every month, four Amazon QuickSight readers viewing the events dashboard an unlimited number of times in a month, and 50 GB of QuickSight SPICE capacity per month. Prices are subject to change. For full details, see the pricing webpage for each AWS service you will be using in this solution.

Clean up

When you’re done with this exercise, complete the following steps to delete your resources and stop incurring costs:

  1. On the CloudFormation console, select your stack and choose Delete. This cleans up all the resources created by the stack.
  2. Delete the Amazon Quicksight Dashboards and data sets that you have created.

Conclusion

In this blog post, I have demonstrated how marketers, business users, and business analysts can use Amazon QuickSight dashboards to evaluate and exploit user engagement data from Amazon SES and Amazon Pinpoint event streams. Customers can also use this solution to understand how Amazon Pinpoint campaigns lead to business conversions, in addition to analyzing multi-channel communication metrics at the individual user level.

Next steps

The personas for this blog are both the tech team and the marketing analyst team, as it involves a code deployment to create very simple Athena views, as well as the steps to create an Amazon QuickSight dashboard to analyze Amazon SES and Amazon Pinpoint engagement events at the individual user level. Customers may then create their own Amazon QuickSight dashboards to illustrate conversion ratios and propensity trends in real time by integrating campaign events with app-level events such as purchase conversions, order placement, and so on.

Extending the solution

You can download the AWS CloudFormation templates and code for this solution from our public GitHub repository and modify them to fit your needs.


About the Author


Satyasovan Tripathy works at Amazon Web Services as a Senior Specialist Solution Architect. He is based in Bengaluru, India, and specialises on the AWS Digital User Engagement product portfolio. He likes reading and travelling outside of work.

New – Create and Manage EMR Clusters and Spark Jobs with Amazon SageMaker Studio

Post Syndicated from Sean M. Tracey original https://aws.amazon.com/blogs/aws/new-create-and-manage-emr-clusters-and-spark-jobs-with-amazon-sagemaker-studio/

Today, we’re very excited to offer three new enhancements to our Amazon SageMaker Studio service.

As of now, users of SageMaker Studio can create, terminate, manage, discover, and connect to Amazon EMR clusters running within a single AWS account and in shared accounts across an organization, all directly from SageMaker Studio. Furthermore, SageMaker Studio Notebook users can now use the Spark UI to monitor and debug Spark jobs running on an Amazon EMR cluster, directly from SageMaker Studio Notebooks!

The story so far…
Before today, SageMaker Studio users had some ability to find and connect with EMR clusters, provided that they were running in the same account as SageMaker Studio. While useful in many circumstances, if no existing cluster suited the requirements of the model or analysis being run, data scientists had to leave their development environment and manually configure a cluster that suited their needs. As well as being disruptive to the workflow of data scientists, there were no guarantees that the data scientists would have either the permissions or the depth of knowledge required to provision a suitable cluster that would enable them to continue with their work. Additionally, being restricted to creating and managing clusters in a single account could become prohibitive in organizations working across many AWS accounts.

What’s new?
Data scientists can:

  • Discover, manage, create, terminate, and connect to Amazon EMR clusters from within SageMaker Studio
  • Utilize “templates” – a new way to configure and provision clusters for your workload needs with support from seasoned DevOps practitioners
  • Connect to, debug, and monitor Spark jobs running on an Amazon EMR cluster from within a SageMaker Studio Notebook

Creating, Connecting to, and Managing EMR Clusters

Connecting to an EMR Cluster from a SageMaker Studio Notebook

With the ability to connect to and manage EMR clusters from within SageMaker Studio, data scientists no longer have to leave their familiar environment to create, configure and provision the EMR clusters where they run their workloads.

Introducing Templates
A template is a collection of off-the-shelf cluster configurations optimized for numerous workloads. Templates can be created and managed by DevOps administrators and made available through the AWS Service Catalog to data scientists within SageMaker Studio. This lets them quickly spin up a cluster to meet their needs, all while safe in the knowledge that a trusted DevOps admin has correctly configured a cluster per the project’s requirements. Furthermore, this lets data scientists get on with the work they do best, and it gives DevOps administrators within these teams greater ability to manage the types of provisioned infrastructure.

Managing EMR Clusters from within SageMaker Studio Notebooks

Directly Connect to and monitor Spark Jobs
Finally, to make the job of data scientists even simpler, we’ve built the ability to connect to, debug, and monitor Spark jobs running on an Amazon EMR cluster from within a SageMaker Studio Notebook. Before now, to access the monitoring UI of a Spark Job, one needed to configure secure tunnels and web proxies to gain direct access to currently executing jobs, adding friction to the workflow of a data scientist trying to observe and debug their workloads. Now, with these new features, users will have one-click access directly from the interface that they already know. This enables them to build and put their workloads to work, rather than spending time on configuring infrastructure and workloads.

Connecting to a Spark Job from within a SageMaker Studio Notebook

These new features let data scientists use a simple, consistent UI to provision and manage infrastructure as needed, without ever having to leave SageMaker Studio or dive into the minutiae of provisioning such hardware. Moreover, they won’t have to spend time configuring proxies and SSH tunnels to debug and monitor ongoing Spark jobs.

Find out more
These features are generally available in all AWS Regions where SageMaker Studio is available, and there are no additional charges to use this capability. For complete information on pricing and Regional availability, please refer to the SageMaker Studio pricing page.

To learn more, see our documentation.

Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-redshift-serverless-run-analytics-at-any-scale-without-having-to-manage-infrastructure/

We’re seeing the use of data analytics expanding among new audiences within organizations, for example with users like developers and line of business analysts who don’t have the expertise or the time to manage a traditional data warehouse. Also, some customers have variable workloads with unpredictable spikes, and it can be very difficult for them to constantly manage capacity.

With Amazon Redshift, you use SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. Today, I am happy to introduce the public preview of Amazon Redshift Serverless, a new capability that makes it super easy to run analytics in the cloud with high performance at any scale. Just load your data and start querying. There is no need to set up and manage clusters. You pay for the duration in seconds when your data warehouse is in use, for example, while you are querying or loading data. There is no charge when your data warehouse is idle.

Amazon Redshift Serverless automatically provisions the right compute resources for you to get started. As your demand evolves with more concurrent users and new workloads, your data warehouse scales seamlessly and automatically to adapt to the changes. You can optionally specify the base data warehouse size to have additional control on cost and application-specific SLAs.

With the new serverless option, you can continue to query data in other AWS data stores, such as Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Aurora and Amazon Relational Database Service (RDS) databases.

Amazon Redshift Serverless is ideal when it is difficult to predict compute needs, such as with variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. This approach is also a good fit for ad hoc analytics that you want to start quickly, and for test and development environments.

Let’s see how this works in practice.

Using Amazon Redshift Serverless
I go to the Amazon Redshift console and choose the new serverless option. The first time, I set up the serverless endpoint and configure networking and security.

I confirm the default settings that use all subnets in my default Amazon Virtual Private Cloud (VPC) and its default security group. Data is always encrypted, and I use the default AWS-owned key. Optionally, I can customize all settings. I can associate now or later the AWS Identity and Access Management (IAM) roles to give permissions to access other AWS resources, for example, to be able to load data from an S3 bucket. The configuration of the serverless endpoint will be shared by all my serverless data warehouses in the same AWS account and Region.

Console screenshot.

To query data, I use Amazon Redshift Query Editor V2, a new free web-based tool that we made available a few months back. The query editor provides quick access to a few sample datasets to make it easy to learn Amazon Redshift’s SQL capabilities: TPC-H, TPC-DS, and tickit, a dataset containing information on ticket sales for events.

For a quick test, I use the tickit sample dataset so I don’t need to load any data. I prepare a query to get the list of tickets sold per date, sorted to see the dates with more sales first:

SELECT caldate, sum(qtysold) as sumsold
FROM   tickit.sales, tickit.date
WHERE  sales.dateid = date.dateid 
GROUP BY caldate
ORDER BY sumsold DESC;

By using the web-based query editor, I don’t need to configure a SQL client or set up the network permissions to reach the serverless endpoint. Instead, I just write my SQL query and run it.

Console screenshot.

I am a visual person. I enable the Chart option on the right of the result table and select a bar chart.

Console screenshot.

Satisfied with the clarity of the chart, I export it as an image file. In this way, I can quickly share it or include it in a report.

Bar chart

Amazon Redshift Serverless supports all rich SQL functionality of Amazon Redshift such as semi-structured data support. I can use any JDBC/ODBC-compliant tool or the Amazon Redshift Data API to query my data. To migrate data, I can take a snapshot of an Amazon Redshift provisioned cluster and restore it as serverless. Then, I just need to update my SQL applications to use the new serverless endpoint.
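If you want to run the same query programmatically, the Amazon Redshift Data API also works against the serverless endpoint. The following is a minimal sketch using the boto3 redshift-data client; the workgroup and database names are placeholders, and the WorkgroupName parameter reflects the client surface available after general availability rather than the preview.

import time
import boto3

# Minimal sketch: run the tickit query through the Redshift Data API.
# "default" and "dev" are placeholder workgroup/database names.
client = boto3.client("redshift-data")

resp = client.execute_statement(
    WorkgroupName="default",   # serverless workgroup (placeholder)
    Database="dev",            # database holding the tickit sample (placeholder)
    Sql="""SELECT caldate, SUM(qtysold) AS sumsold
           FROM tickit.sales, tickit.date
           WHERE sales.dateid = date.dateid
           GROUP BY caldate
           ORDER BY sumsold DESC;""",
)

# Poll until the statement finishes, then fetch the result set.
while client.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

result = client.get_statement_result(Id=resp["Id"])
for record in result["Records"][:5]:
    print(record)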

Availability and Pricing
Amazon Redshift Serverless is available in public preview in the following AWS Regions: US East (N. Virginia), US West (N. California, Oregon), Europe (Frankfurt, Ireland), Asia Pacific (Tokyo).

With Amazon Redshift Serverless, you pay separately for the compute and storage you use. Compute capacity is measured in Redshift Processing Units (RPUs), and you pay for the workloads in RPU-hours with per-second billing. For storage, you pay for data stored in Amazon Redshift-managed storage and storage used for snapshots, similar to what you’d pay with a provisioned cluster using RA3 instances.

To control your costs, you can specify usage limits and define actions that Amazon Redshift automatically takes if those limits are reached. You can specify usage limits in RPU-hours and associate them with a daily, weekly, or monthly duration. Setting higher usage limits can improve the overall throughput of the system, especially for workloads that need to handle high concurrency while maintaining consistently high performance.
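Usage limits can also be defined programmatically. The sketch below assumes the boto3 redshift-serverless client that became available with the service (during the preview the console was the primary interface); the resource ARN and the limit amount are placeholders.

import boto3

# Hedged sketch: cap serverless compute at 100 RPU-hours per week and
# stop accepting new queries if the limit is breached. The ARN is a placeholder.
serverless = boto3.client("redshift-serverless")

serverless.create_usage_limit(
    resourceArn="arn:aws:redshift-serverless:us-east-1:123456789012:workgroup/example",
    usageType="serverless-compute",
    amount=100,                 # RPU-hours
    period="weekly",
    breachAction="deactivate",  # other options: "log", "emit-metric"
)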

Compute resources automatically shut down behind the scenes when there is no activity and resume when you load data or queries come in. When accessing your S3 data lake via the new serverless endpoint, you do not pay for Amazon Redshift Spectrum separately. You have a unified serverless experience and pay for data lake queries in RPU-seconds as well. For more information, see the Amazon Redshift pricing page.

The serverless endpoint is configured at the AWS account level. If you have multiple teams or projects and want to manage costs separately, you can use separate AWS accounts. You can share data between your provisioned clusters and serverless endpoint, and between serverless endpoints across accounts.

To help you get hands-on practice, we provide $500 in AWS credits upfront to try the Amazon Redshift Serverless public preview. You get the credits when you first create a database with Amazon Redshift Serverless. These credits cover your costs for compute, storage, and snapshot usage of Amazon Redshift Serverless only.

Start using Amazon Redshift Serverless today to run and scale analytics without having to provision and manage data warehouse clusters.

Danilo

Announcing Amazon EMR Serverless (Preview): Run big data applications without managing servers

Post Syndicated from Damon Cortesi original https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/

Today we’re happy to announce Amazon EMR Serverless, a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, you can run applications built using open-source frameworks such as Apache Spark, Hive, and Presto, without having to configure, manage, optimize, or secure clusters. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that your applications use.

In this post, we discuss the benefits of EMR Serverless, walk you through the core concepts of EMR Serverless and how you can use it, and show you a quick demo.

Overview of EMR Serverless

Tens of thousands of customers use Amazon EMR, a managed service for running open-source analytics frameworks such as Apache Spark, Hive, and Presto, for large-scale data analytics applications. With Amazon EMR, you can provision clusters of any size in minutes. Amazon EMR automatically installs and configures the frameworks you choose, and provides a performance-optimized runtime that is compatible with and over twice as fast as standard open-source.

Amazon EMR customers have full control over cluster configuration. The ability to customize clusters allows you to optimize for cost and performance based on workload requirements. For example, you can use Amazon Elastic Compute Cloud (Amazon EC2) memory optimized instances to run SQL workloads with low latency, or use the EC2 Graviton2-based instances to improve performance. You can also use EC2 Spot Instances, which are integrated in Amazon EMR so that you can take advantage of unused EC2 capacity in the AWS Cloud to obtain instances at up to a 90% discount compared to On-Demand prices. If you run your applications on Kubernetes, you can use Amazon EMR on Amazon EKS to run your Amazon EMR analytics applications on Amazon Elastic Kubernetes Service (Amazon EKS) clusters.

However, tuning clusters for optimal cost and performance requires engineers to have deep knowledge of the underlying analytics frameworks. Furthermore, the specific compute and memory resources needed to optimally run applications depend on various factors, such as the schedule and complexity of data processing jobs and the volume of data being processed. When these characteristics change over time, you need to reevaluate and reconfigure clusters. In addition, administrators have to secure and monitor the clusters to ensure that they’re compliant with corporate security policies, and adjust security settings each time the cluster is reconfigured. Many customers don’t need this level of customization and control, and want a simpler way to process data using open-source frameworks on the Amazon EMR performance-optimized runtime.

With this in mind, we built EMR Serverless. With EMR Serverless, you can get all the benefits of running Amazon EMR, but with a serverless environment. We had the following goals in mind when we built EMR Serverless:

  • Provide a simpler experience – EMR Serverless is simple to use because you don’t have to configure, optimize, operate, or secure clusters. You don’t have to worry about instance types or cluster sizes, or about applying OS patches. You simply specify the framework and version that you want to use for your application, and submit your data processing jobs. You still get all the benefits that you expect out of Amazon EMR—open-source compatibility, open-source version currency, and performance-optimized runtime—but without the need to manage clusters.
  • No need to guess cluster sizes – EMR Serverless eliminates the need to right-size clusters for varying jobs and data sizes. With EMR Serverless, you create an application using an open-source framework version, and submit jobs to the application. EMR Serverless automatically adds and removes workers at different stages of processing your job. As a result, you don’t have to reconfigure when data volumes change, and you only pay for what your jobs require. You can control costs by specifying the minimum and maximum number of concurrent workers, and the vCPU and memory per worker.
  • Retain Amazon EMR’s performance-optimized runtime and open-source currency – EMR Serverless includes the Amazon EMR performance-optimized runtime for Apache Spark, Hive, and Presto. The Amazon EMR runtime is API-compatible and over twice as fast as standard open-source, so your jobs run faster and incur less compute costs.
  • Seamless integration with EMR Studio – EMR Serverless includes EMR Studio, which provides fully managed serverless Jupyter Notebooks and familiar open-source tools such as Spark UI and Tez UI to help you develop, visualize, and debug your applications.
  • Automatic and fine-grained scaling – EMR Serverless automatically scales up workers at each stage of processing your job and scales them down when they’re not required. You’re charged for aggregate vCPU, memory, and storage resources used from the time a worker starts running until it stops, rounded up to the nearest second with a 1-minute minimum. For example, your job may require 10 workers for the first 10 minutes of processing the job, and 50 workers for the next 5 minutes. With fine-grained automatic scaling, you only incur cost for 10 workers for 10 minutes and 50 workers for 5 minutes. As a result, you don’t have to pay for underutilized resources.
  • Resilience to Availability Zone failures – EMR Serverless is a Regional service. When you submit jobs to an EMR Serverless application, it can run in any Availability Zone in the Region. A job is run in a single Availability Zone to avoid performance implications of network traffic across Availability Zones. In case an Availability Zone is impaired, a job submitted to your EMR Serverless application is automatically run in a different (healthy) Availability Zone. When using resources in a private VPC, we recommend that you specify the private VPC configuration for multiple Availability Zones so that EMR Serverless can automatically select a healthy Availability Zone.
  • Enable shared applications – When you submit jobs to an EMR Serverless application, you can specify the AWS Identity and Access Management (IAM) role that must be used by the job to access AWS resources such as Amazon Simple Storage Service (Amazon S3) objects. As a result, different IAM principals can run jobs on a single EMR Serverless application, and each job can only access the AWS resources that the IAM principal is allowed to access. This enables you to set up scenarios where a single application with a pre-initialized pool of workers is made available to multiple tenants wherein each tenant can submit jobs using a different IAM role but use the common pool of pre-initialized workers to immediately process requests.
  • Enable interactive applications – Interactive applications that allow data scientists and analysts to run interactive SQL queries for data exploration require a fast response time to user requests. For such interactive applications, EMR Serverless allows you to pre-initialize a pool of workers. You can start your EMR Serverless application and pre-initialize the pool of workers as soon as a user starts the application, and stop the application to stop workers when no interactive users are active. If processing user requests requires more workers than what have been pre-initialized, EMR Serverless automatically adds more workers up to the maximum concurrent limits that you specify. Therefore, by controlling the number of workers to pre-initialize and the maximum concurrent workers, you can optimize user experience and cost for your interactive applications.
  • Make it easy to switch from one deployment model to another – The same Amazon EMR releases are provided for applications using EMR clusters, Amazon EMR on EKS, and EMR Serverless. When you build an application using an Amazon EMR release (for example a Spark job using Amazon EMR release 6.4), you can choose to run it on an EMR cluster, Amazon EMR on EKS, or EMR Serverless without having to rewrite the application. This allows you to build applications for a given framework version, and retain the flexibility to change the deployment model based on future operational needs.

Core concepts

In this section, we discuss the core concepts in EMR Serverless: applications, jobs, workers, and pre-initialized workers.

Application

With EMR Serverless, you can create one or more applications that use open-source analytics frameworks. To create an application, you specify the open-source framework that you want to use (for example, Apache Spark or Apache Hive), the Amazon EMR release for the open-source framework version (for example, Amazon EMR release 6.4, which corresponds to Apache Spark 3.1.2), and a name for your application. After you create an application, you can submit data processing jobs or interactive requests to your application.

The following are a few examples where you may want to create multiple applications:

  • To use different open-source frameworks (for example, Hive or Spark)
  • To use different versions of open-source frameworks for different use cases (for example, use a newer version of Spark for a new application without having to upgrade older applications)
  • To perform A/B testing when upgrading from one version to another (for example, migrating from Spark 2.4 to Spark 3.1)
  • To maintain separate logical environments for test and production scenarios
  • To provide separate logical environments for different teams with independent cost controls and usage tracking
  • To logically separate different line-of-business applications (for example, finance vs. marketing)

Job

A job is a request submitted to an EMR Serverless application that is asynchronously run and tracked through completion. You can run multiple jobs concurrently in an application.
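If you prefer code to the console or CLI, the following minimal sketch shows what creating an application and submitting a Spark job can look like. The names follow the boto3 emr-serverless client that ships with the service; the application name, role ARN, and S3 paths are placeholders.

import boto3

# Hedged sketch: create a Spark application on an EMR release,
# then submit a job to it. Role ARN and S3 paths are placeholders.
emr = boto3.client("emr-serverless")

app = emr.create_application(
    name="my-spark-app",
    releaseLabel="emr-6.4.0",   # maps to Spark 3.1.2, as described above
    type="SPARK",
)

job = emr.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",  # placeholder
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/etl_job.py",               # placeholder
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)
print("Submitted job run:", job["jobRunId"])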

Workers

An EMR Serverless application internally uses workers to run your jobs. By default, each application uses workers with 4 vCPU, 30 GB of memory, and 20 GB of local storage per worker. You can customize this configuration.

Pre-initialized workers

EMR Serverless provides an optional feature to pre-initialize workers when your application starts up, so that the workers are ready to process requests immediately when a job is submitted to the application. Pre-initialized workers allow you to maintain a warm pool of workers for the application so that it can provide a sub-second response to start processing requests.
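You can pre-initialize capacity when you create the application, and start or stop the application around interactive sessions. The sketch below uses the same boto3 emr-serverless client; the worker counts and sizes are purely illustrative assumptions.

import boto3

emr = boto3.client("emr-serverless")

# Hedged sketch: keep two Spark drivers and ten executors warm so that
# interactive requests start immediately. All sizes are illustrative.
app = emr.create_application(
    name="interactive-spark-app",
    releaseLabel="emr-6.4.0",
    type="SPARK",
    initialCapacity={
        "DRIVER": {
            "workerCount": 2,
            "workerConfiguration": {"cpu": "4vCPU", "memory": "30GB", "disk": "20GB"},
        },
        "EXECUTOR": {
            "workerCount": 10,
            "workerConfiguration": {"cpu": "4vCPU", "memory": "30GB", "disk": "20GB"},
        },
    },
)

# Start the application when users begin working, stop it when they are done.
emr.start_application(applicationId=app["applicationId"])
# ... interactive session ...
emr.stop_application(applicationId=app["applicationId"])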

Common usage patterns applied to EMR Serverless

Now let’s examine some common usage scenarios and how EMR Serverless provides you a simple solution.

Pattern #1: Data pipelines

Data pipelines are the backbone of your analytics workloads. A common pattern with data pipelines is to start a cluster, run a job, and stop the cluster when the job is complete. Because data is separated from compute, the inputs and outputs for each job are persisted separately from the cluster (for example, in Amazon S3). These steps are frequently automated using workflow orchestration applications such as Apache Airflow. You can also use AWS services such as AWS Step Functions and Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to create such workflows.

Although automating these steps isn’t complex, data engineers have to spend time determining the appropriate EC2 instance and cluster size. They have to determine the Availability Zone where the cluster is run, and handle failover. They have to test their applications when adopting OS updates. When data sizes change over time, they have to resize clusters, or use features like Amazon EMR managed scaling that automatically resize clusters. EMR Serverless provides a simpler solution by eliminating the need for you to handle these scenarios. You simply choose the open-source framework and version for your application, and submit jobs. You don’t have to worry about instance selection, cluster sizes, cluster startup, cluster resize, stopping nodes, Availability Zone failover, or OS updates.

Pattern #2: Shared clusters

Another common pattern is for teams to use a shared long-running cluster to run multiple jobs. In this case, engineers implement queues in Apache YARN for different workloads on a common cluster, and set up rules to automatically scale the cluster up or down based on overall workload. With Amazon EMR on EC2 clusters, you can use Amazon EMR managed scaling, a feature that automatically scales clusters up or down depending on the workload. With EMR Serverless, workers are assigned to each job when required, so your jobs get the resources they need. Moreover, because you only pay for the workers that your jobs require, you don’t incur cost for over-provisioned resources. Finally, because each job can specify the IAM role that should be used to access AWS resources when running the job, you don’t have to set up complex configurations to manage queues and permissions.

Pattern #3: Interactive workloads

A third pattern of use is when teams keep a cluster of instances available to support interactive analysis. In this case, the cluster is set up and initialized with applications that wait for interactive user requests. Applications are pre-initialized so that they can immediately start processing user requests and provide an interactive user experience. EMR Serverless enables this scenario without requiring you to manage clusters. You can specify the number of workers that you want to pre-initialize when you start an EMR Serverless application. Subsequently, when users submit requests, the pre-initialized workers can be used to immediately process user requests. If processing the user requests requires more workers than what you have chosen to pre-initialize, EMR Serverless automatically adds more workers (up to the maximum concurrent limit that you specify). When the requests are processed, EMR Serverless automatically reverts back to maintaining the pre-initialized workers that you specified. You can control when the pre-initialized workers start by controlling when to start and stop your EMR Serverless application. For example, you can start your application when users begin interactive analysis and turn it off when there are no user requests and the application remains idle.

Demo

Conclusion

In this post, we discussed the core concepts and common usage patterns of EMR Serverless, and showed you a quick demo. EMR Serverless is in preview, and you can run workloads using Spark 3.1.2 and Hive 2.0 through the API, AWS Command Line Interface (AWS CLI), and SDKs. Sign up now, and for more information, see the EMR Serverless documentation.


About the Authors

Damon Cortesi is a Principal Developer Advocate with Amazon Web Services.

Mehul Y. Shah is the GM for Amazon EMR.

Abhishek Sinha is a Principal Product Manager at Amazon Web Services.

AWS Lake Formation – General Availability of Cell-Level Security and Governed Tables with Automatic Compaction

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-lake-formation-general-availability-of-cell-level-security-and-governed-tables-with-automatic-compaction/

A data lake can help you break down data silos and combine different types of analytics into a centralized repository. You can store all of your structured and unstructured data in this repository. However, setting up and managing data lakes involve a lot of manual, complicated, and time-consuming tasks. AWS Lake Formation makes it easy to set up a secure data lake in days instead of weeks or months.

Today, I am excited to share the general availability of some new features that simplify even further loading data, optimizing storage, and managing access to a data lake:

  • Governed Tables – A new type of Amazon Simple Storage Service (Amazon S3) tables that makes it simple and reliable to ingest and manage data at any scale. Governed tables support ACID transactions that let multiple users concurrently and reliably insert and delete data across multiple governed tables. ACID transactions also let you run queries that return consistent and up-to-date data. In case of errors in your extract, transform, and load (ETL) processes, or during an update, changes are not committed and will not be visible.
  • Storage Optimization with Automatic Compaction for governed tables – When this option is enabled, Lake Formation automatically compacts small S3 objects in your governed tables into larger objects to optimize access via analytics engines, such as Amazon Athena and Amazon Redshift Spectrum. By using automatic compaction, you don’t have to implement custom ETL jobs that read, merge, and compress data into new files, and then replace the original files.
  • Granular Access Control with Row and Cell-Level Security – You can control access to specific rows and columns in query results and within AWS Glue ETL jobs based on the identity of who is performing the action. In this way, you don’t have to create (and keep updated) subsets of your data for different roles and legislations. This works for both governed and traditional S3 tables.

Using Governed Tables, ACID Transactions, and Automatic Compaction
In the Lake Formation console, I can enable governed data access and management at table creation. Automatic compaction is enabled by default, and it can be disabled using the AWS Command Line Interface (CLI) or AWS SDKs.

Console screenshot.

Governed tables have a manifest that tracks the S3 objects that are part of the table’s data. I can use the UpdateTableObjects API to keep the manifest updated when adding new objects to the table, and I can call it using the AWS CLI and SDKs. This API is implicitly used by the AWS Glue ETL library.

Moreover, I have access to new Lake Formation APIs to start, commit, or cancel a transaction. I can use these APIs to wrap data loading, data transformation, and output consistent and up-to-date data.
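As a minimal sketch of that flow, assuming a governed table that already exists, the boto3 lakeformation client exposes these calls directly; the database, table, and S3 object details below are placeholders, and in a real ETL job the AWS Glue library would typically make these calls for you.

import boto3

lf = boto3.client("lakeformation")

# Hedged sketch: wrap the registration of a new S3 object in a transaction.
# Database, table, URI, ETag, and size are placeholders.
txn = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]
try:
    lf.update_table_objects(
        DatabaseName="my_database",
        TableName="my_governed_table",
        TransactionId=txn,
        WriteOperations=[{
            "AddObject": {
                "Uri": "s3://my-bucket/governed/part-0000.parquet",
                "ETag": "example-etag",
                "Size": 1048576,
            }
        }],
    )
    lf.commit_transaction(TransactionId=txn)   # readers now see the new object
except Exception:
    lf.cancel_transaction(TransactionId=txn)   # nothing becomes visible on failure
    raise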

Using Row and Cell-Level Security
There are many use cases where, for a table, you want to restrict access to specific columns, rows, or a combination that depends on the role of the user accessing the data. For example, a company with offices in the US, Germany, and France can create a filter for analysts based in the European Union (EU) to limit access to EU-based customers.

Console screenshot.

The filter can enforce that some columns, such as date of birth (dob) and phone, are not accessible to those analysts. Moreover, access to individual rows can be filtered by using filter expressions. You can configure row filter expressions with a SQL-compatible syntax based on the open-source PartiQL language. In this case, only rows with country equal to Germany or France (country='DE' OR country='FR') are visible.

Console screenshot.
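Outside the console, the same column and row restrictions can be expressed as a data cells filter through the lakeformation client’s CreateDataCellsFilter API. The following sketch assumes a hypothetical sales_db.customers table; all names and the account ID are placeholders.

import boto3

lf = boto3.client("lakeformation")

# Hedged sketch: a data cells filter that hides the dob and phone columns
# and only exposes EU-based customers. Names and IDs are placeholders.
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",        # AWS account ID (placeholder)
        "DatabaseName": "sales_db",
        "TableName": "customers",
        "Name": "eu-analysts-filter",
        "RowFilter": {"FilterExpression": "country='DE' OR country='FR'"},
        "ColumnWildcard": {"ExcludedColumnNames": ["dob", "phone"]},
    }
)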

Availability and Pricing
These new features are available today in the following AWS Regions: US East (N. Virginia), US West (Oregon), Europe (Ireland), US East (Ohio), and Asia Pacific (Tokyo).

When querying governed tables, or tables secured with row and cell-level security, you pay by the amount of data scanned (with a 10MB minimum). When using governed tables, transaction metadata is charged by the number of S3 objects tracked, and you pay for the number of transaction requests. Automatic compaction is charged based on the data processed. For more information, see the AWS Lake Formation pricing page.

While implementing these features, we introduced a new Lake Formation Storage API that is integrated with tools such as AWS Glue, Amazon Athena, Amazon Redshift Spectrum, and Amazon QuickSight. You can use this storage API directly in your applications to query tables with a SQL-like syntax (joins are not supported) and get the benefits of governed tables and cell-level security.

See the detailed blog series published during the preview to learn more:

Effective data lakes using AWS Lake Formation

Take advantage of these new features to simplify the creation and management of your data lake.

Danilo

Announcing AWS Data Exchange for APIs: Find, Subscribe to, and Use Third-party APIs with Consistent Authentication

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/data-exchange-for-apis-find-subscribe-use-third-party-apis-consistent-authentication/

Data is at the center of many processes and products, whether it’s a large-scale dataset used to train machine learning models, a relational database, or an API-based integration. AWS Data Exchange lets you discover, subscribe to, and use hundreds of file-based datasets via Amazon Simple Storage Service (Amazon S3) offered by third parties such as Reuters, Foursquare, Change Healthcare, Vortexa, IMDb, and many more. Additionally, AWS Data Exchange for Amazon Redshift makes it even easier to ingest third-party data in your Amazon Redshift data warehouse, without any manual processing or transformation.

However, in many cases your data projects require more than static datasets because you need frequent and synchronous retrieval of small amounts of information – for example, you might need to fetch a stock price every hour. Data APIs let you answer specific questions quickly, without having to build ad-hoc data pipelines to ingest, process, and analyze bulk datasets. But each API provider differs in ease of use, SDKs, documentation, and authentication mechanisms, which makes this harder than it needs to be.

Today, I’m happy to announce the general availability of AWS Data Exchange for APIs, a new capability that lets you find, subscribe to, and use third-party APIs with consistent access using AWS SDKs, as well as consistent AWS-native authentication and governance. This simplifies the lives of developers and IT administrators who have to integrate and secure access to multiple third-party APIs.

Now you can make RESTful or GraphQL API calls directly to AWS Data Exchange and receive synchronous responses that contain the information you need, using the AWS SDK in the programming language of your choice. We take care of integrating with the API provider, implementing proper authentication, managing the API subscription, and ensuring charges appear on your AWS bill. You can manage API access centrally with AWS Identity and Access Management (IAM).

As a data provider, you make your API discoverable by millions of AWS customers by listing it in the AWS Data Exchange catalog using an OpenAPI specification and fronting it with an Amazon API Gateway endpoint.

AWS Data Exchange for APIs in Action
First, I look for an API product in the AWS Data Exchange catalog, review its subscription terms, support information, and auto-renewal. Each API product might include multiple public or private subscription offers and periods.

I select Subscribe and a couple of minutes later I’m successfully subscribed.

Within the API product, I select an entitled data set and its latest revision.

Each API revision contains one or more API assets that correspond to a specific API endpoint and a unique Asset ARN.

AWS Data Exchange takes care of invoking API endpoints with the correct authentication.

All I need to do is check the Integration notes, which include instructions and code snippets based on the AWS Command Line Interface (CLI).

Of course, I could implement the very same API call with my favorite programming language using one of the AWS SDKs.

For example, here’s how I’d implement a simple wrapper function in Python:

import json
import boto3

adx = boto3.client('dataexchange')

def get_api_response(path, method="GET", querystring={}, headers={}, body={}):
    # SendApiAsset expects plain dicts for the query string parameters and
    # request headers, and a string for the request body.
    return adx.send_api_asset(
        DataSetId="4b3fbabc31171662851531b8576a3411",
        RevisionId="e8e78e921af12c76499edc40f92e3082",
        AssetId="557d858c317efdfb5b6c9a2860ec4a03",
        Method=method,
        Path=path,
        QueryStringParameters=querystring,
        RequestHeaders=headers,
        Body=json.dumps(body),
    )

Please note that there are no hard-coded credentials in the code above because all the authorization happens via AWS Identity and Access Management (IAM).
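A quick usage sketch follows; the path and query parameter are hypothetical, and the real values come from the provider’s Integration notes for the API you subscribed to.

# Hypothetical endpoint and parameter; substitute the values from the
# provider's Integration notes for the API you subscribed to.
response = get_api_response("/v1/quotes", querystring={"symbol": "AMZN"})
print(response["Body"])   # the provider's payload, returned as a string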

And that’s how you make your first API call via AWS Data Exchange for APIs.

Available Today
AWS Data Exchange for APIs is generally available in all AWS Regions where AWS Data Exchange is available. We’re looking forward to helping you simplify and centralize the management and governance of third-party APIs while we take care of the undifferentiated heavy lifting for you.

Today you can start integrating third-party APIs such as Infutor, Variety Business Intelligence, IMDb, PeopleDataLabs, Neustar, Experian, Foursquare, PredictHQ, WeatherTrends International, and many more.

If you’re a developer, check out the new AWS Data Exchange for APIs documentation to learn more about subscribing and using APIs. If you’re an API provider, check out the new publishing documentation to learn more about publishing new APIs on the AWS Data Exchange catalog.

Alex

Enforce customized data quality rules in AWS Glue DataBrew

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/enforce-customized-data-quality-rules-in-aws-glue-databrew/

GIGO (garbage in, garbage out) is a concept common to computer science and mathematics: the quality of the output is determined by the quality of the input. In modern data architecture, you bring data from different data sources, which creates challenges around volume, velocity, and veracity. You might write unit tests for applications, but it’s equally important to ensure the data veracity of these applications, because incoming data quality can make or break your application. Incorrect, missing, or malformed data can have a large impact on production systems. Examples of data quality issues include but are not limited to the following:

  • Missing or incorrect values can lead to failures in the production system that require non-null values
  • Changes in the distribution of data can lead to unexpected outputs of machine learning (ML) models
  • Aggregations of incorrect data can lead to wrong business decisions
  • Incorrect data types have a big impact on financial or scientific institutes

In this post, we introduce data quality rules in AWS Glue DataBrew. DataBrew is a visual data preparation tool that makes it easy to profile and prepare data for analytics and ML. We demonstrate how to use DataBrew to define a list of rules in a new entity called a ruleset. A ruleset is a set of rules that compare different data metrics against expected values.

The post describes the implementation process and provides a step-by-step guide to build data quality checks in DataBrew.

Solution overview

To illustrate our data quality use case, we use a human resources dataset. This dataset contains the following attributes:

Emp ID, Name Prefix, First Name, Middle Initial,Last Name,Gender,E Mail,Father's Name,Mother's Name,Mother's Maiden Name,Date of Birth,Time of Birth,Age in Yrs.,Weight in Kgs.,Date of Joining,Quarter of Joining,Half of Joining,Year of Joining,Month of Joining,Month Name of Joining,Short Month,Day of Joining,DOW of Joining,Short DOW,Age in Company (Years),Salary,Last % Hike,SSN,Phone No. ,Place Name,County,City,State,Zip,Region,User Name,Password

For this post, we downloaded data with 5 million records, but feel free to use a smaller dataset to follow along with this post.

The following diagram illustrates the architecture for our solution.

The steps in this solution are as follows:

  1. Create a sample dataset.
  2. Create a ruleset.
  3. Create data quality rules.
  4. Create a profile job.
  5. Inspect the data quality rules validation results.
  6. Clean the dataset.
  7. Create a DataBrew job.
  8. Validate the data quality check with the updated dataset.

Prerequisites

Before you get started, complete the following prerequisites:

  1. Have an AWS account.
  2. Download the sample dataset.
  3. Extract the CSV file.
  4. Create an Amazon Simple Storage Service (Amazon S3) bucket with three folders: input, output, and profile.
  5. Upload the sample data in input folder to your S3 bucket (for example, s3://<s3 bucket name>/input/).
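If you prefer to script the bucket setup, the following boto3 sketch creates the bucket and the three prefixes and uploads the extracted CSV; the bucket and file names are placeholders.

import boto3

s3 = boto3.client("s3")
bucket = "my-databrew-demo-bucket"   # placeholder; bucket names must be globally unique

# In us-east-1 no CreateBucketConfiguration is needed; other Regions require one.
s3.create_bucket(Bucket=bucket)

# Zero-byte objects act as the input/output/profile "folders".
for prefix in ("input/", "output/", "profile/"):
    s3.put_object(Bucket=bucket, Key=prefix)

s3.upload_file("human-resources.csv", bucket, "input/human-resources.csv")  # placeholder file name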

Create a sample dataset

To create your dataset, complete the following steps:

  1. On the DataBrew console, in the navigation pane, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for example, human-resource-dataset).
  4. Under Data lake/data store, choose Amazon S3 as your source.
  5. For Enter your source from Amazon S3, enter the S3 bucket location where you uploaded your sample files (for example, s3://<s3 bucket name>/input/).
  6. Under Additional configurations, keep the selected file type CSV and CSV delimiter comma (,).
  7. Scroll to the bottom of the page and choose Create dataset.

The dataset is now available on the Datasets page.

Create a ruleset

We now define data quality rulesets against the dataset created in the previous step.

  1. On the DataBrew console, in the navigation pane, choose DQ Rules.
  2. Choose Create data quality ruleset.
  3. For Ruleset name, enter a name (­for example, human-resource-dataquality-ruleset).
  4. Under Associated dataset, choose the dataset you created earlier.

Create data quality rules

You can add multiple data quality rules, and within each rule you can define multiple checks.

For this post, we create the following data quality rules and data quality checks within the rules:

  • Row count is correct
  • No duplicate rows
  • Employee ID, email address, and SSN are unique
  • Employee ID and phone number are not null
  • Employee ID and employee age in years have no negative values
  • SSN data format is correct (123-45-6789)
  • Phone number string length is correct
  • Region column only has the specified region
  • Employee ID is an integer

Row count is correct

To check the total row count, complete the following steps:

  1. Add a new rule.
  2. For Rule name, enter a name (for example, Check total record count).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Data quality checks, choose Number of rows.
  6. For Condition, choose Is equals.
  7. For Value, enter 5000000.

No duplicate rows

To check the dataset for duplicate rows, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset for duplicate rows).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check¸ choose Duplicate rows.
  6. For Condition, choose Is equals.
  7. For Value, enter 0 and choose rows on the drop-down menu.

Employee ID, email address, and SSN are unique

To check that the employee ID, email, and SSN are unique, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset for Unique Values).
  3. For Data quality check scope, choose Common checks for selected columns.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Selected columns, select Selected columns.
  6. Choose the columns Emp ID, e mail, and SSN.
  7. Under Check 1, for Data quality check, choose Unique values.
  8. For Condition, choose Is equals.
  9. For Value, enter 100 and choose %(percent) rows on the drop-down menu.

Employee ID and phone number are not null

To check that employee IDs and phone numbers aren’t null, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Dataset for NOT NULL).
  3. For Data quality check scope, choose Common checks for selected columns.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. For Selected columns, select Selected columns.
  6. Choose the columns Emp ID and Phone No.
  7. Under Check 1, for Data quality check, choose Value is not missing.
  8. For Condition, choose Greater than equals.
  9. For Threshold, enter 100 and choose %(percent) rows on the drop-down menu.

Employee ID and age in years have no negative values

To check the employee ID and age for positive values, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check emp ID and age for positive values).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Numeric values.
  6. Choose Emp ID on the drop-down menu.
  7. For Condition, choose Greater than equals.
  8. For Value, select Custom value and enter 0.
  9. Choose Add another quality check and repeat the same steps for age in years.

SSN data format is correct

To check the SSN data format, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check dataset format).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose String values.
  6. Choose SSN on the drop-down menu.
  7. For Condition, choose Matches (RegEx pattern).
  8. For RegEx value, enter ^[0-9]{3}-[0-9]{2}-[0-9]{4}$.
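As a quick sanity check of this pattern outside DataBrew, here is how the same regular expression behaves in plain Python:

import re

# The same pattern used in the DataBrew check above.
ssn_pattern = re.compile(r"^[0-9]{3}-[0-9]{2}-[0-9]{4}$")

print(bool(ssn_pattern.match("123-45-6789")))   # True  - matches the expected format
print(bool(ssn_pattern.match("123456789")))     # False - missing the dashes
print(bool(ssn_pattern.match("12-345-6789")))   # False - wrong group lengths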

Phone number string length is correct

To check the length of the phone number, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Dataset Phone no. for string length).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Value string length.
  6. Choose Phone No on the drop-down menu.
  7. For Condition, choose Greater than equals.
  8. For Value, select Custom value and enter 9.
  9. Under Check 2, for Data quality check, choose Value string length.
  10. Choose Phone No on the drop-down menu.
  11. For Condition, choose Less than equals.
  12. For Value¸ select Custom value and enter 12.

Region column only has the specified region

To check the Region column, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Check Region column only for specific region).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose Value is exactly.
  6. Choose Region on the drop-down menu.
  7. For Value, select Custom value.
  8. Choose the values Midwest, South, West, and Northeast.

Employee ID is an integer

To check that the employee ID is an integer, complete the following steps:

  1. Choose Add another rule.
  2. For Rule name, enter a name (for example, Validate Emp ID is an Integer).
  3. For Data quality check scope, choose Individual check for each column.
  4. For Rule success criteria, choose All data quality checks are met (AND).
  5. Under Check 1, for Data quality check, choose String values.
  6. Choose Emp ID on the drop-down menu.
  7. For Condition, choose Matches (RegEx pattern).
  8. For RegEx value, enter ^[0-9]+$.
  9. After you create all the rules, choose Create ruleset.

Your ruleset is now listed on the Data quality rulesets page.

Create a profile job

To create a profile job with your new ruleset, complete the following steps:

  1. On the Data quality rulesets page, select the ruleset you just created.
  2. Choose Create profile job with ruleset.
  3. For Job name, keep the prepopulated name or enter a new one.
  4. For Data sample, select Full dataset.

The data sample setting is important for data quality rules validation, because it determines whether you validate all the rows or only a limited sample.

  1. Under Job output settings, for S3 location, enter the path to the profile bucket.

If you enter a new bucket name, the folder is created automatically.

  1. Keep the default settings for the remaining optional sections: Data profile configurations, Data quality rules, Advanced job settings, Associated schedules, and Tags.

The next step is to choose or create the AWS Identity and Access Management (IAM) role that grants DataBrew access to read from the input S3 bucket and write to the job output bucket.

  1. For Role name, choose an existing role or choose Create a new IAM role and enter an IAM role suffix.
  2. Choose Create and run job.

For more information about configuring and running DataBrew jobs, see Creating, running, and scheduling AWS Glue DataBrew jobs.

Inspect data quality rules validation results

To inspect the data quality rules, we need to let the profile job complete.

  1. On the Jobs page of the DataBrew console, choose the Profile jobs tab.
  2. Wait until the profile job status changes to Succeeded.
  3. When the job is complete, choose View data profile.

You’re redirected to the Data profile overview tab on the Datasets page.

  1. Choose the Data quality rules tab.

Here you can review the status to your data quality rules. As shown in the following screenshot, eight of the nine data quality rules defined were successful, and one rule failed.

Our failed data quality rule indicates that we found duplicate values for employee ID, SSN, and email.

  1. To confirm that the data has duplicate values, on the Column statistics tab, choose the Emp ID column.
  2. Scroll down to the section Top distinct values.

Similarly, you can check the E Mail and SSN columns to find that those columns also have duplicate values.

Now we have confirmed that our data has duplicate values. The next step is to clean up the dataset and rerun the quality rules validation.

Clean the dataset

To clean the dataset, we first need to create a project.

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project name, enter a name (for this post, human-resource-project-demo).
  4. For Select a dataset, select My datasets.
  5. Select the human-resource-dataset dataset.
  6. Keep the sampling size at its default.
  7. Under Permissions, for Role name, choose the IAM role that we created previously for our DataBrew profile job.
  8. Choose Create project.

The project takes a few minutes to open. When it’s complete, you can see your data.

Next, we delete the duplicate value from the Emp ID column.

  1. Choose the Emp ID column.
  2. Choose the more options icon (three dots) to view all the transforms available for this column.
  3. Choose Remove duplicate values.
  4. Repeat these steps for the SSN and E Mail columns.

You can now see the three applied steps in the Recipe pane.

Create a DataBrew job

The next step is to create a DataBrew job to run these transforms against the full dataset.

  1. On the project details page, choose Create job.
  2. For Job name, enter a name (for example, human-resource-after-dq-check).
  3. Under Job output settings, for File type, choose CSV as your final storage format.
  4. For S3 location, enter your output S3 bucket location (for example, s3://<s3 bucket name>/output/).
  5. For Compression, choose None.
  6. Under Permissions, for Role name, choose the same IAM role we used previously.
  7. Choose Create and run job.
  8. Wait for job to complete; you can monitor the job on the Jobs page.

Validate the data quality check with the corrected dataset

To perform the data quality checks with the corrected dataset, complete the following steps:

  1. Follow the steps outlined earlier to create a new dataset, using the corrected data from the previous section.
  2. Choose the Amazon S3 location of the job output.
  3. Choose Create dataset.
  4. Choose DQ Rules and select the ruleset you created earlier.
  5. On the Actions menu, choose Duplicate.
  6. For Ruleset name, enter a name (for example, human-resource-dataquality-ruleset-on-corrected-dataset).
  7. Select the newly created dataset.
  8. Choose Create data quality ruleset.
  9. After the ruleset is created, select it and choose Create profile job with ruleset.
  10. Create a new profile job.
  11. Choose Create and run job.
  12. When the job is complete, repeat the steps from earlier to inspect the data quality rules validation results.

This time, under Data quality rules, all the rules pass except Check total record count, because removing the duplicate rows reduced the total row count below the expected 5 million.

On the Column statistics page, under Top distinct values for the Emp ID column, you can see the distinct values.

You can find similar results for the SSN and E Mail columns.

Clean up

To avoid incurring future charges, we recommend you delete the resources you created during this walkthrough.

Conclusion

As demonstrated in this post, you can use DataBrew to help create data quality rules, which can help you identify any discrepancies in your data. You can also use DataBrew to clean the data and validate it going forward. You can learn more about AWS Glue DataBrew here and review AWS Glue DataBrew pricing here.


About the Authors

Navnit Shukla is an AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Harsh Vardhan Singh Gaur is an AWS Solutions Architect, specializing in Analytics. He has over 5 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

Send custom branded email reports from Amazon QuickSight

Post Syndicated from Kareem Syed-Mohammed original https://aws.amazon.com/blogs/big-data/send-custom-branded-email-reports-from-amazon-quicksight/

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service that makes it easy to connect to your data, create interactive dashboards, and share these with tens of thousands of users, either directly within the QuickSight application or embedded in web apps and portals.

QuickSight Enterprise Edition now supports the ability to send custom branded email reports. You can customize the email sender domain for email reports sent from QuickSight, along with the logo and header color of the email, as well as footer text of the email. If you have your dashboard embedded in your own application, you can also customize the URL to open the dashboard from the email to the URL of your application. This lets you customize emails to reflect your corporate branding, whether you want to send these reports to 1000s of your internal users or external customers.

In this post, we will go through the following:

  1. Steps to implement the solution
    1. Create a customized email template
    2. Create an email schedule and subscribe email recipients
  2. End user experience
  3. Sample use case

Solution overview

Step 1: Create a customized email template

This new feature lets you customize your email with the following customization options:

  1. Custom sender email address
  2. Custom logo in the email header and custom header color
  3. Custom link to open the dashboard (if your dashboard is embedded in your own application)
  4. Custom footer

You can customize all or any of these options. To customize, create an email template in your QuickSight account, which will be used when sending email reports for any dashboard to any user. This email template is specific to the AWS region and account it is created in.

Log in to QuickSight as an admin, and select your name in the top right, then in the menu select “Manage QuickSight” as shown in the following screenshot:

In the next screen, select “Account Customization”, and you will see the available account customization options. Under the “Email report template” section, select “Update” as shown in the following screenshot. You must have the right IAM Identity-Based Policies assigned to you to create or edit the template.

In the next screen, you can set customizations that we will see one by one.

Customize sender email address

This option lets you set a custom email address or use QuickSight’s email address <[email protected]> to send email reports. To select sending via QuickSight email address, select the radio button for QuickSight.

To send from a custom email address, select the radio button for the custom email setting. At this time, only verified Amazon SES email addresses can be used as the custom email address, and SES and QuickSight must be in the same AWS account and Region. If you do not have an SES account, you can get started <HERE> with the SES free tier of XX. The steps to add a custom email address are as follows:

  1. Add a verified SES email address and click “Verify email”. If you get an error, then refer here for creating a verified SES email address.
  2. Once the email address is verified, you must authorize QuickSight to send emails on your behalf. To do this, copy the given “Authorization Policy”, and add it as a “Sending authorization policy” for your verified email address in SES. Refer here to learn about SES sending authorization policy.

    As we can see in the screenshot above, once the authorization policy is verified, QuickSight is authorized to send email using the SES email address.
  3. You can set a friendly name for the email address as shown in the following screenshot.

Customize logo

Email reports from QuickSight have a QuickSight logo in the header of the email body. You can choose to select a custom logo, use QuickSight logo, or have no logo by selecting the corresponding radio button.

When you select the “Custom logo” option, you can upload your own logo (in jpg, jpeg, or png format) with a maximum file size of 1 MB. Your logo will be scaled to a height of 32px, maintaining the aspect ratio. When you upload the logo image, you get an option to set the background color (as a HEX code) of the header in the email report.

Select where the dashboard opens

Email reports have an image of the first sheet of the QuickSight dashboard. In order for the recipient to interact with the dashboard, email reports also provide a link to open the dashboard. By default, this link opens the dashboard in the QuickSight application. Now you can select where the dashboard opens. If you have embedded the dashboard in your application, then you can provide the URL of your application. Moreover, you can hide this option entirely so that the dashboard cannot be opened from the email. Please see the following screenshot for reference.

If you want to add your custom link, then you will have to add the following query parameters – account-id, dashboard-id, and region – to your link. QuickSight will populate these parameters at runtime, and when your customers select the open dashboard link from the email, they will be taken to the link you have provided. With the account-id, dashboard-id, and region now available with the link, you can provide logic to take your customers to where you have embedded the dashboard in your application.
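For illustration only, a custom link configured as https://analytics.example.com/reports (a placeholder URL) would be invoked from the email as something like https://analytics.example.com/reports?account-id=111122223333&dashboard-id=abc123&region=us-east-1, with QuickSight filling in the actual account ID, dashboard ID, and Region at send time. Your application can then parse these parameters and route the signed-in user to the corresponding embedded dashboard.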

Custom footer

By default, the footer of a QuickSight email report contains content and a link related to QuickSight and the QuickSight application. You can customize the footer or hide it entirely. Please refer to the following screenshot for reference.

If you select the option to set a custom footer, then you can provide custom text and hyperlink content in the textbox. At this time, we only allow plain text.

Step 2: Create an email schedule and subscribe recipients

Once your QuickSight account has an email template saved, any email report sent in the same AWS region will use this template. To send an email report, the author of the dashboard should create an email schedule for the dashboard and assign recipients to that schedule.

To set a schedule, the dashboard author should open the dashboard in the QuickSight application, select “Share” in the top right, and select “Email report” in the menu. Please refer to the following screenshot for reference.

You will be taken to the “Edit email report” screen, where you can create a schedule for the email to be sent and add email recipients. Please refer to this documentation on sending reports by email and this post for sending personalized email reports.

If you are embedding dashboards in your application, then your readers cannot subscribe to the schedule from the embedded dashboard. Authors must add those readers to the recipient list through the steps stated above. Therefore, your readers must be provisioned in QuickSight.

End user experience

The end user gets the email as per the schedule set. If the email template has been set, then recipients get the look and feel of the email based on the customization done on the template. The following screenshot shows the email with a custom look and feel.

As you can see, this email has the following:

  1. From address customized to [email protected] with a friendly name, “data-insights-team”
  2. Logo customized to a brand logo, and header customized to the brand green shade
  3. Dashboard open link customized to take customers to your app if the dashboard is embedded in that app
  4. Footer customized with a custom message

Use case

ShipPronto is a logistics service provider for heavy machinery. It has many customers that store their heavy machinery at ShipPronto’s warehouse. When customers get purchase orders on these machineries, they have ShipPronto fulfill those orders on their behalf from its warehouse. ShipPronto has an application where each customer can login and see rich data on their order shipment and machinery quantity at the warehouse. ShipPronto uses QuickSight dashboard embedded in its application to provide the insights. Furthermore, it sends daily emails to its customers on this dashboard. It’s using the email customization feature of QuickSight to customize the look and feel of the email so that customers receiving the email get a seamless experience.

Below is the customized email that their customers receive daily with the sender email address, logo, header color, and footer customized.

When customers click on the “Open Dashboard” link in the email, they are taken to ShipPronto’s app, on which they must log in, as shown in the following screenshot.

Once the customers log in, ShipPronto can use the query string parameters passed along with the custom URL (the URL configured in the email template to open the dashboard) to take them to the page where the dashboard is embedded.

This experience means that ShipPronto’s end users see the ShipPronto branded email and get a seamless experience where they access the embedded dashboard, in the application, from the email.

Conclusion

Email customizations let you send branded email reports to your customers, thereby enabling a seamless experience when customers are accessing the email or the application where the dashboard is embedded. And all of this is done without any infrastructure setup or management, while scaling to millions of users. For more updates from QuickSight embedded analytics, see What’s New in the Amazon QuickSight User Guide.


About the Author

Kareem Syed-Mohammed is a Product Manager at Amazon QuickSight. He focuses on embedded analytics, APIs, and developer experience. Prior to QuickSight he has been with AWS Marketplace and Amazon retail as a PM. Kareem started his career as a developer and then PM for call center technologies, Local Expert and Ads for Expedia. He worked as a consultant with McKinsey and Company for a short while.

Kenz Shane is a UI Designer for Amazon QuickSight. As part of the product’s Business Intelligence User Experience (BIUX) team, she specializes in creating customer-focused visual interfaces. Previously, she worked with the Experience Innovation Group at Dell, serving as a subject matter expert in enterprise-grade user interface (UI) design, accessible data visualization, and design systems. Kenz has provided art direction and design for clients across multiple industries, including Nordstrom, Columbia Hospitality, AIGA, and Warner Bros.

Raji Sivasubramaniam is a Specialist Solutions Architect at AWS, focusing on Analytics. Raji has 20 years of experience in architecting end-to-end Enterprise Data Management, Business Intelligence and Analytics solutions for Fortune 500 and Fortune 100 companies across the globe. She has in-depth experience in integrated healthcare data and analytics with wide variety of healthcare datasets including managed market, physician targeting and patient analytics. In her spare time, Raji enjoys hiking, yoga and gardening.

Srikanth Baheti is a Specialized World Wide Sr. Solution Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high-traffic web applications and highly scalable and maintainable data pipelines for reporting platforms using AWS services and serverless computing.