How to delete user data in an AWS data lake

Post Syndicated from George Komninos original https://aws.amazon.com/blogs/big-data/how-to-delete-user-data-in-an-aws-data-lake/

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure,” or “right to be forgotten,” which may require you to implement a solution to delete specific users’ personal data.

In the context of the AWS big data and analytics ecosystem, every architecture, regardless of the problem it targets, uses Amazon Simple Storage Service (Amazon S3) as the core storage service. Despite its versatility and feature completeness, Amazon S3 doesn’t come with an out-of-the-box way to map a user identifier to the S3 keys of objects that contain that user’s data.

This post walks you through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.

Reference architecture

To address the challenge of implementing a data purge framework, we reduced the problem to the straightforward use case of deleting a user’s data from a platform that uses AWS for its data pipeline. The following diagram illustrates this use case.

We’re introducing the idea of building and maintaining an index metastore that keeps track of the location of each user’s records and allows us to locate them efficiently, reducing the search space.

You can use the following architecture diagram to delete a specific user’s data within your organization’s AWS data lake.

For this initial version, we created three user flows that map each task to a fitting AWS service:

Flow 1: Real-time metastore update

The S3 ObjectCreated or ObjectRemoved events trigger an AWS Lambda function that parses the object and performs an add/update/delete operation to keep the metadata index up to date. You can implement a simple workflow for any other storage layer, such as Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon Elasticsearch Service (ES). We use Amazon DynamoDB and Amazon RDS for PostgreSQL as the index metadata storage options, but our approach is flexible to any other technology.
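
The following is a minimal sketch of such a Lambda handler. It only shows the event dispatch; add_to_index and remove_from_index are placeholder helpers that you would implement against whichever index store you choose (DynamoDB, PostgreSQL, and so on):

import urllib.parse


def add_to_index(bucket, key, s3_uri):
    """Placeholder: parse the object and upsert index rows in your metastore."""
    print(f"would index {s3_uri}")


def remove_from_index(s3_uri):
    """Placeholder: drop index rows that point at the deleted object."""
    print(f"would de-index {s3_uri}")


def handler(event, context):
    """Dispatch S3 object events to index add/remove operations."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if record["eventName"].startswith("ObjectCreated"):
            add_to_index(bucket, key, f"s3://{bucket}/{key}")
        elif record["eventName"].startswith("ObjectRemoved"):
            remove_from_index(f"s3://{bucket}/{key}")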

Flow 2: Purge data

When a user asks for their data to be deleted, we trigger an AWS Step Functions state machine through Amazon CloudWatch to orchestrate the workflow. Its first step triggers a Lambda function that queries the metadata index to identify the storage layers that contain user records and generates a report that’s saved to an S3 report bucket. A Step Functions activity is created and picked up by a Node.js-based Lambda worker that sends an email to the approver through Amazon Simple Email Service (SES) with approve and reject links.

The following diagram shows a graphical representation of the Step Function state machine as seen on the AWS Management Console.

The approver selects one of the two links, which then calls an Amazon API Gateway endpoint that invokes Step Functions to resume the workflow. If you choose the approve link, Step Functions triggers a Lambda function that takes the report stored in the bucket as input, deletes the objects or records from the storage layer, and updates the index metastore. When the purging job is complete, Amazon Simple Notification Service (SNS) sends a success or fail email to the user.
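
The following is a minimal sketch of how the API Gateway-backed Lambda function could resume the workflow. It assumes the approve and reject links carry the activity task token and the chosen action as query-string parameters (the parameter names here are hypothetical):

import json

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    """Resume the purge state machine when an approve or reject link is clicked."""
    params = event.get("queryStringParameters") or {}
    task_token = params["taskToken"]          # hypothetical query-string parameter
    action = params.get("action", "reject")   # hypothetical query-string parameter

    if action == "approve":
        sfn.send_task_success(taskToken=task_token,
                              output=json.dumps({"approved": True}))
    else:
        sfn.send_task_failure(taskToken=task_token,
                              error="Rejected",
                              cause="Approver rejected the purge request")

    return {"statusCode": 200, "body": f"Purge request {action} recorded"}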

The following diagram represents the Step Functions flow on the console if the purge flow completed successfully.

For the complete code base, see step-function-definition.json in the GitHub repo.

Flow 3: Batch metastore update

This flow refers to the use case of an existing data lake for which the index metastore needs to be created. You can orchestrate the flow through AWS Step Functions, which takes historical data as input and updates the metastore through a batch job. Our current implementation doesn’t include a sample script for this user flow.

Our framework

We now walk you through the two use cases we followed for our implementation:

  • You have multiple user records stored in each Amazon S3 file
  • A user has records stored in heterogeneous AWS storage layers

Within these two approaches, we demonstrate alternatives that you can use to store your index metastore.

Indexing by S3 URI and row number

For this use case, we use a free tier RDS Postgres instance to store our index. We created a simple table with the following code:

CREATE UNLOGGED TABLE IF NOT EXISTS user_objects (
    userid TEXT,
    s3path TEXT,
    recordline INTEGER
);

You can add an index on userid to optimize query performance. On object upload, for each row, you need to insert into the user_objects table a row that indicates the user ID, the URI of the target Amazon S3 object, and the row number that corresponds to the record. For instance, when uploading the following JSON input, enter the following code:

{"user_id":"V34qejxNsCbcgD8C0HVk-Q","body":"…"}
{"user_id":"ofKDkJKXSKZXu5xJNGiiBQ","body":"…"}
{"user_id":"UgMW8bLE0QMJDCkQ1Ax5Mg","body":"…"}

For an object uploaded to the Amazon S3 location s3://gdpr-demo/year=2018/month=2/day=26/input.json, we insert the following tuples into user_objects:

("V34qejxNsCbcgD8C0HVk-Q", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 0)
("ofKDkJKXSKZXu5xJNGiiBQ", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 1)
("UgMW8bLE0QMJDCkQ1Ax5Mg", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 2)

You can implement the index update operation by using a Lambda function triggered on any Amazon S3 ObjectCreated event.
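
A sketch of such a Lambda function for the PostgreSQL-backed index follows. It assumes newline-delimited JSON input like the preceding example, that the psycopg2 driver is packaged with the function, and that connection details come from environment variables (the variable names here are hypothetical):

import json
import os
import urllib.parse

import boto3
import psycopg2  # assumed to be packaged with the Lambda deployment

s3 = boto3.client("s3")


def handler(event, context):
    conn = psycopg2.connect(host=os.environ["DB_HOST"],        # hypothetical env vars
                            dbname=os.environ["DB_NAME"],
                            user=os.environ["DB_USER"],
                            password=os.environ["DB_PASSWORD"])
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            s3_uri = f"s3://{bucket}/{key}"

            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            # One (userid, s3path, recordline) tuple per JSON line in the object.
            for line_no, line in enumerate(body.splitlines()):
                if not line.strip():
                    continue
                user_id = json.loads(line)["user_id"]
                cur.execute(
                    "INSERT INTO user_objects (userid, s3path, recordline) VALUES (%s, %s, %s)",
                    (user_id, s3_uri, line_no),
                )
    conn.close()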

When we get a delete request from a user, we need to query our index to get some information about where we have stored the data to delete. See the following code:

SELECT s3path,
       ARRAY_AGG(recordline)
FROM user_objects
WHERE userid = 'V34qejxNsCbcgD8C0HVk-Q'
GROUP BY s3path;

The preceding example SQL query returns rows like the following:

("s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json", {2102,529})

The output indicates that lines 529 and 2102 of S3 object s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json contain the requested user’s data and need to be purged. We then need to download the object, remove those rows, and overwrite the object. For a Python implementation of the Lambda function that implements this functionality, see deleteUserRecords.py in the GitHub repo.

Having the record line available allows you to perform the deletion efficiently in byte format. For implementation simplicity, we purge the rows by replacing the deleted rows with an empty JSON object. You pay a slight storage overhead, but you don’t need to update subsequent row metadata in your index, which would be costly. To eliminate empty JSON objects, we can implement an offline vacuum and index update process.
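
The following is a simplified sketch of that purge step, independent of the repo implementation. It assumes the report supplies the bucket, key, and 0-based row numbers to remove:

import boto3

s3 = boto3.client("s3")


def purge_rows(bucket, key, rows_to_purge):
    """Replace the given (0-based) row numbers with an empty JSON object."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    lines = obj["Body"].read().decode("utf-8").splitlines()

    for row in rows_to_purge:
        lines[row] = "{}"  # keep line numbering stable for the rest of the index

    s3.put_object(Bucket=bucket, Key=key, Body="\n".join(lines).encode("utf-8"))


# Example: purge lines 529 and 2102 from the object returned by the index query.
# purge_rows("gdpr-review", "year=2015/month=12/day=21/review-part-0.json", [529, 2102])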

Indexing by file name and grouping by index key

For this use case, we created a DynamoDB table to store our index. We chose DynamoDB because of its ease of use and scalability; you can use its on-demand pricing model so you don’t need to guess how many capacity units you might need. When files are uploaded to the data lake, a Lambda function parses the file name (for example, 1001-.csv) to identify the user identifier and populates the DynamoDB metadata table. Userid is the partition key, and each different storage layer has its own attribute. For example, if user 1001 had data in Amazon S3 and Amazon RDS, their records look like the following code:

{"userid": 1001, "s3": {"s3://path1", "s3://path2"}, "RDS": {"db1.table1.column1"}}

For a sample Python implementation of this functionality, see update-dynamo-metadata.py in the GitHub repo.
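
A minimal sketch of that Lambda function follows. The table name is hypothetical, and it assumes the user ID is the file-name prefix before the first hyphen, as in the example above:

import urllib.parse

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table name; userid is the partition key.
metadata_table = dynamodb.Table("user_metadata")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Assumes file names like 1001-<suffix>.csv, with the user ID as the prefix.
        user_id = int(key.split("/")[-1].split("-")[0])

        metadata_table.update_item(
            Key={"userid": user_id},
            # ADD on a string-set attribute appends the new S3 path for this user.
            UpdateExpression="ADD s3 :path",
            ExpressionAttributeValues={":path": {f"s3://{bucket}/{key}"}},
        )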

On delete request, we query the metastore table, which is DynamoDB, and generate a purge report that contains details on what storage layers contain user records, and storage layer specifics that can speed up locating the records. We store the purge report to Amazon S3. For a sample Lambda function that implements this logic, see generate-purge-report.py in the GitHub repo.

After the purging is approved, we use the report as input to delete the required resources. For a sample Lambda function implementation, see gdpr-purge-data.py in the GitHub repo.

Implementation and technology alternatives

We explored and evaluated multiple implementation options, all of which present tradeoffs, such as implementation simplicity, efficiency, critical data compliance, and feature completeness:

  • Scan every record of the data file to create an index – Whenever a file is uploaded, we iterate through its records and generate tuples (userid, s3Uri, row_number) that are then inserted into our metadata storage layer. On delete request, we fetch the metadata records for the requested user IDs, download the corresponding S3 objects, perform the delete in place, and re-upload the updated objects, overwriting the existing ones. This is the most flexible approach because it supports a single object storing multiple users’ data, which is a very common practice. The flexibility comes at a cost because it requires downloading and re-uploading the object, which introduces a network bottleneck in delete operations. User activity datasets such as customer product reviews are a good fit for this approach, because it’s unexpected to have multiple records for the same user within each partition (such as a date partition), and it’s preferable to combine multiple users’ activity in a single file. It’s similar to what was described in the section “Indexing by S3 URI and row number” and sample code is available in the GitHub repo.
  • Store metadata as file name prefix – Adding the user ID as the prefix of the uploaded object under the different partitions that are defined based on query pattern enables you to reduce the required search operations on delete request. The metadata handling utility finds the user ID from the file name and maintains the index accordingly. This approach is efficient in locating the resources to purge but assumes a single user per object, and requires you to store user IDs within the filename, which might require InfoSec considerations. Clickstream data, where you would expect to have multiple click events for a single customer on a single date partition during a session, is a good fit. We covered this approach in the section “Indexing by file name and grouping by index key” and you can download the codebase from the GitHub repo.
  • Use a metadata file – Along with uploading a new object, we also upload a metadata file that’s picked up by an indexing utility to create the index and keep it up to date. On delete request, we query the index, which points us to the records to purge. A good fit for this approach is a use case that already involves uploading a metadata file whenever a new object is uploaded, such as uploading multimedia data along with its metadata. Otherwise, uploading a metadata file on every object upload might introduce too much of an overhead.
  • Use the tagging feature of AWS services – Whenever a new file is uploaded to Amazon S3, we use the Amazon S3 PutObjectTagging operation to add a key-value pair for the user identifier. Whenever there is a user data delete request, we fetch objects with that tag and delete them (see the sketch following this list). This option is straightforward to implement using the existing Amazon S3 API and can therefore be a very initial version of your implementation. However, it involves significant limitations: it assumes a 1:1 cardinality between Amazon S3 objects and users (each object only contains data for a single user), searching objects based on a tag is limited and inefficient, and storing user identifiers as tags might not be compliant with your organization’s InfoSec policy.
  • Use Apache Hudi – Apache Hudi is becoming a very popular option to perform record-level data deletion on Amazon S3. Its current version is restricted to Amazon EMR, and you can use it if you start to build your data lake from scratch, because you need to store your data as Hudi datasets. Hudi is a very active project and additional features and integrations with more AWS services are expected.

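As a rough illustration of the tagging option (and of why tag-based search is limited and inefficient), the following sketch tags an object with its user ID on upload and purges by listing the bucket and checking every object’s tags. Bucket and tag names are illustrative:

import boto3

s3 = boto3.client("s3")


def tag_object_with_user(bucket, key, user_id):
    """Attach the user identifier to the object as an S3 tag."""
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "userid", "Value": str(user_id)}]},
    )


def delete_objects_for_user(bucket, user_id):
    """Naive purge: list the bucket and delete objects tagged with the user ID."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            tags = s3.get_object_tagging(Bucket=bucket, Key=obj["Key"])["TagSet"]
            if {"Key": "userid", "Value": str(user_id)} in tags:
                s3.delete_object(Bucket=bucket, Key=obj["Key"])
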
The key implementation decision of our approach is separating the storage layer we use for our data and the one we use for our metadata. As a result, our design is versatile and can be plugged in any existing data pipeline. Similar to deciding what storage layer to use for your data, there are many factors to consider when deciding how to store your index:

  • Concurrency of requests – If you don’t expect too many simultaneous inserts, even something as simple as Amazon S3 could be a starting point for your index. However, if you get multiple concurrent writes for multiple users, you need to look into a service that copes better with transactions.
  • Existing team knowledge and infrastructure – In this post, we demonstrated using DynamoDB and RDS Postgres for storing and querying the metadata index. If your team has no experience with either of those but is comfortable with Amazon ES, Amazon DocumentDB (with MongoDB compatibility), or any other storage layer, use those. Furthermore, if you’re already running (and paying for) a MySQL database that’s not used to capacity, you could use that for your index for no additional cost.
  • Size of index – The volume of your metadata is orders of magnitude lower than your actual data. However, if your dataset grows significantly, you might need to consider going for a scalable, distributed storage solution rather than, for instance, a relational database management system.

Conclusion

GDPR has transformed best practices and introduced several extra technical challenges in designing and implementing a data lake. The reference architecture and scripts in this post may help you delete data in a manner that’s compliant with GDPR.

Let us know your feedback in the comments and how you implemented this solution in your organization, so that others can learn from it.

 


About the Authors

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years in the Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Stream CDC into an Amazon S3 data lake in Parquet format with AWS DMS

Post Syndicated from Viral Shah original https://aws.amazon.com/blogs/big-data/stream-cdc-into-an-amazon-s3-data-lake-in-parquet-format-with-aws-dms/

Most organizations generate data in real time and ever-increasing volumes. Data is captured from a variety of sources, such as transactional and reporting databases, application logs, customer-facing websites, and external feeds. Companies want to capture, transform, and analyze this time-sensitive data to improve customer experiences, increase efficiency, and drive innovations. With increased data volume and velocity, it’s imperative to capture the data from source systems as soon as they are generated and store them on a secure, scalable, and cost-efficient platform.

AWS Database Migration Service (AWS DMS) performs continuous data replication using change data capture (CDC). Using CDC, you can determine and track data that has changed and provide it as a stream of changes that a downstream application can consume and act on. Most database management systems manage a transaction log that records changes made to the database contents and metadata. AWS DMS reads the transaction log by using engine-specific API operations and functions and captures the changes made to the database in a nonintrusive manner.

Amazon Simple Storage Service (Amazon S3) is the largest and most performant object storage service for structured and unstructured data and the storage service of choice to build a data lake. With Amazon S3, you can cost-effectively build and scale a data lake of any size in a secure environment where data is protected with 99.999999999% (11 9’s) of durability.

AWS DMS offers the following options to capture data changes from relational databases and store the data in columnar format (Apache Parquet) in Amazon S3:

  • Option 1 – Use AWS DMS to deliver the data directly to Amazon S3 in Parquet format
  • Option 2 – Use AWS DMS to publish data changes to Amazon Kinesis Data Streams, and use Amazon Kinesis Data Firehose to convert the records to Parquet format and deliver them to Amazon S3

The second option helps you build a flexible data pipeline to ingest data into an Amazon S3 data lake from several relational and non-relational data sources, compared to the first option, which supports only relational data sources. Kinesis Data Firehose provides pre-built AWS Lambda blueprints for converting common data sources such as Apache logs and system logs to JSON and CSV formats, or you can write your own custom functions. It can also convert the format of incoming data from JSON to Parquet or Apache ORC before storing the data in Amazon S3. Data stored in columnar format gives you faster and lower-cost queries with downstream analytics services like Amazon Athena.

In this post, we focus on the technical challenges outlined in the second option and how to address them.

As shown in the following reference architecture, data is ingested from a database into Parquet format in Amazon S3 via AWS DMS integrating with Kinesis Data Streams and Kinesis Data Firehose.

Our solution provides flexibility to ingest data from several sources using Kinesis Data Streams and Kinesis Data Firehose with built-in data format conversion and integrated data transformation capabilities before storing data in a data lake. For more information about data ingestion into Kinesis Data Streams, see Writing Data into Amazon Kinesis Data Streams. You can then query Parquet data in Amazon S3 efficiently with Athena.

Implementing the architecture

AWS DMS can migrate data to and from most widely used commercial and open-source databases. You can migrate and replicate data directly to Amazon S3 in CSV and Parquet formats, and store data in Amazon S3 in Parquet because it offers efficient compression and encoding schemes. Parquet format allows compression schemes on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

AWS DMS supports Kinesis Data Streams as a target. Kinesis Data Streams is a massively scalable and durable real-time data streaming service that can collect and process large streams of data records in real time. AWS DMS publishes records to the data stream in JSON format. For more information about configuration details, see Use the AWS Database Migration Service to Stream Change Data to Amazon Kinesis Data Streams.

Kinesis Data Firehose can pull data from Kinesis Data Streams. It’s a fully managed service that delivers real-time streaming data to destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), and Splunk. Kinesis Data Firehose can convert the format of input data from JSON to Parquet or ORC before sending it to Amazon S3. It needs a reference schema to interpret the AWS DMS streaming data in JSON and convert it into Parquet. In this post, we use AWS Glue, a fully managed ETL service, to create a schema in the AWS Glue Data Catalog for Kinesis Data Firehose to reference.

When AWS DMS migrates records, it creates additional fields (metadata) for each migrated record. The metadata provides additional information about the record being migrated, such as source table name, schema name, and type of operation. Most metadata field names contain a hyphen (-) (for example, record-type, schema-name, table-name, transaction-id). See the following code:

{
        "data": {
            "MEET_CODE": 5189459,
            "MEET_DATE": "2020-02-21T19:20:04Z",
            "RACE_CODE": 5189459,
            "LAST_MODIFIED_DATE": "2020-02-24T19:20:04Z",
            "RACE_ENTRY_CODE": 11671651,
            "HORSE_CODE": 5042811
        },
        "metadata": {
            "transaction-id": 917505,
            "schema-name": "SH",
            "operation": "insert",
            "table-name": "RACE_ENTRY",
            "record-type": "data",
            "timestamp": "2020-02-26T00:20:07.482592Z",
            "partition-key-type": "schema-table"
        }
    }

The additional metadata added by AWS DMS leads to an error during the data format conversion phase in Kinesis Data Firehose. Kinesis Data Firehose follows Hive-style formatting and therefore doesn’t recognize the hyphen (-) character in the metadata field names during data conversion from JSON into Parquet, and returns an error message: expected at the position 30 of ‘struct’ but ‘-’ is found. For example, see the following code:

{
	"deliveryStreamARN": "arn:aws:firehose:us-east-1:1234567890:deliverystream/abc-def-KDF",
	"destination": "arn:aws:s3:::abc-streaming-bucket",
	"deliveryStreamVersionId": 13,
	"message": "The schema is invalid. Error parsing the schema:
	 Error: : expected at the position 30 of 'struct<timestamp:string,record-type:string,operation:string,partition-key-type:string,schema-name:string,table-name:string,transaction-id:int>' but '-' is found.",
	"errorCode": "DataFormatConversion.InvalidSchema"
}

You can resolve the issue by making the following changes: specifying JSON key mappings and creating a reference table in AWS Glue before configuring Kinesis Data Firehose.

Specifying JSON key mappings

In your Kinesis Data Firehose configuration, specify JSON key mappings for fields with a hyphen (-) in their names. The mapping transforms these metadata field names to use an underscore (_) instead (for example, record-type changes to record_type).

Use the AWS Command Line Interface (AWS CLI) to create the Kinesis Data Firehose delivery stream with the JSON key mappings. Modify the parameters to meet your specific requirements.

Kinesis Data Firehose configuration mapping is only possible through the AWS CLI or API and not through the AWS Management Console.

The following code configures Kinesis Data Firehose so that the five columns with a hyphen (-) in their field names are mapped to new field names with an underscore (_):

"S3BackupMode": "Disabled",
"DataFormatConversionConfiguration": {
    "SchemaConfiguration": {
        "RoleARN": "arn:aws:iam::123456789012:role/sample-firehose-delivery-role",
        "DatabaseName": "sample-db",
        "TableName": "sample-table",
        "Region": "us-east-1",
        "VersionId": "LATEST"
    },
    "InputFormatConfiguration": {
        "Deserializer": {
            "OpenXJsonSerDe": {
                "ColumnToJsonKeyMappings": {
                    "record_type": "record-type",
                    "partition_key_type": "partition-key-type",
                    "schema_name": "schema-name",
                    "table_name": "table-name",
                    "transaction_id": "transaction-id"
                }
            }
        }
    }
}

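If you prefer the AWS SDK over the AWS CLI, the following is an equivalent boto3 sketch. All ARNs, stream names, and the AWS Glue database and table are placeholders, and the Parquet output serializer shown is one possible choice:

import boto3

firehose = boto3.client("firehose")

# All ARNs, names, and the AWS Glue database/table below are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="abc-def-KDF",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/dms-cdc-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/sample-firehose-delivery-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/sample-firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::abc-streaming-bucket",
        # Format conversion requires a buffer size of at least 64 MB.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/sample-firehose-delivery-role",
                "DatabaseName": "sample-db",
                "TableName": "sample-table",
                "Region": "us-east-1",
                "VersionId": "LATEST",
            },
            "InputFormatConfiguration": {
                "Deserializer": {
                    "OpenXJsonSerDe": {
                        # Map the hyphenated DMS metadata fields to underscore names.
                        "ColumnToJsonKeyMappings": {
                            "record_type": "record-type",
                            "partition_key_type": "partition-key-type",
                            "schema_name": "schema-name",
                            "table_name": "table-name",
                            "transaction_id": "transaction-id",
                        }
                    }
                }
            },
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
        },
    },
)
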
Creating a reference table in AWS Glue

Because Kinesis Data Firehose uses the Data Catalog to reference the schema for Parquet format conversion, you must first create a reference table in AWS Glue before configuring Kinesis Data Firehose. Use Athena to create a Data Catalog table. For instructions, see CREATE TABLE. In the table, make sure that the column names use an underscore (_), and manually modify them in advance through the Edit schema option for the referenced table in AWS Glue, if needed.

Use Athena to query the results of data ingested by Kinesis Data Firehose into Amazon S3.

This solution is only applicable in the following use cases:

  • Capturing data changes from your source with AWS DMS
  • Converting data into Parquet with Kinesis Data Firehose

If you want to store data in non-Parquet format (such as CSV or JSON) or ingest into Kinesis through other routes, then you don’t need to modify your Kinesis Data Firehose configuration.

Conclusion

This post demonstrated how to convert AWS DMS data into Parquet format and specific configurations to make sure metadata follows the expected format of Kinesis Data Streams and Kinesis Data Firehose. We encourage you to try this solution and take advantage of all the benefits of using AWS DMS with Kinesis Data Streams and Kinesis Data Firehose. For more information, see Getting started with AWS Database Migration Service and Setting up Amazon Kinesis Firehose.

If you have questions or suggestions, please leave a comment.

 


About the Author

Viral Shah is a Data Lab Architect with Amazon Web Services. Viral helps our customers architect and build data and analytics prototypes in just four days in the AWS Data Lab. He has over 20 years of experience working with enterprise customers and startups primarily in the Data and Database space.

Fast and predictable performance with serverless compilation using Amazon Redshift

Post Syndicated from Kiran Chinta original https://aws.amazon.com/blogs/big-data/fast-and-predictable-performance-with-serverless-compilation-using-amazon-redshift/

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Customers tell us that they want extremely fast query response times so they can make equally fast decisions.

This post presents the recently launched, massively scalable serverless compilation capability for Amazon Redshift, which can now concurrently compile query segments with additional compute resources at no extra cost. We also share how our customers have enjoyed faster performance (in several cases, twice as fast) because of this new capability.

Amazon Redshift query compilation

When a query is sent to Amazon Redshift, the query processing engine parses it into multiple segments and compiles these segments to produce optimized object files that are processed during query execution. When similar or same queries are sent to Amazon Redshift, the corresponding segments are present in the cluster code compilation cache. Query segments that use already compiled code in the cache run faster because there’s no overhead of query compilation.

You can also accelerate your workloads of one-time and first-time queries, which don’t have query segments compiled in the cache. Depending on the query’s complexity, Amazon Redshift usually compiles those queries within seconds. However, some mission-critical workloads require even faster response time. This is where the massively scalable serverless compilation capability in Amazon Redshift makes a big difference.

Amazon Redshift serverless query compilation

Amazon Redshift breaks down a query into a set of segments, and each segment is a set of operations, such as SCAN or BUILD HASH TABLE. With the launch of the massively scalable serverless compilation capability, Amazon Redshift can now compile the query segments faster and in parallel because the compilation isn’t limited by the specific cluster being used and its available CPU and memory resources.

The Amazon Redshift compilation capability is managed with an external resource that your Amazon Redshift cluster uses based on your workload. During query processing, Amazon Redshift generates query segments and sends the segments that aren’t present in the cluster’s local cache to the external compilation farm to be compiled with massive parallelism. At the time of running the query, the segments are quickly fetched from the compilation service and saved in the cluster’s local cache for future processing. This makes sure that one-time and first-time queries are processed with high performance in a transparent way, without any additional cost.

Design and usage

The massively scalable serverless compilation capabilities benefit you whenever you need query compilation, especially with complex and highly concurrent workloads. The following are some specific use cases where this capability helps:

  • Dashboard applications that require fast query performance experience lower query compilation time, leading to improved user experience.
  • Dynamic one-time queries with new query segments that aren’t present in the code cache can be processed faster.
  • Scheduled ETL or reporting jobs with a strict SLA benefit from lower query compilation times.
  • Highly complex and concurrent workloads run with high performance without impacting the overall cluster performance.
  • Clusters that are resized, upgraded, or paused and resumed use the external code cache. No warmup is needed.

The following diagram illustrates the architecture of the Amazon Redshift serverless compilation.

Compilation improvements

Although the serverless compilation has already been improving query performance significantly since its launch, the Amazon Redshift team is working to further improve its effectiveness and performance. More recently, we announced an unlimited cache size to store compiled objects and increase cache hits across the Amazon Redshift fleet from 99.60% to 99.95%.

The following graph shows how the percentage of cache hits beyond the local cache has improved over the releases.

Faster performance

During a standard maintenance window, an Amazon Redshift patch flushes the compilation cache. Before we launched the new compilation capabilities, your cluster’s performance was impacted after being patched during maintenance periods. Now, that performance impact is almost unnoticeable with this feature.

Many Amazon Redshift customers are benefiting from these performance improvements and saving time and cost for their Amazon Redshift environments. In this section, we share the stories of two organizations.

Aptos

Aptos is the largest provider of enterprise software focused exclusively on retail. They use Amazon Redshift to power the analytics solution for retail clients. Jonathan Strohl, a cloud engineer on the Aptos team, shared this anecdote with us:

“Prior to last week’s Redshift maintenance, we sent our clients the typical notification letting them know to expect performance delays the following morning due to the object cache being flushed during the maintenance. However, the morning after the maintenance, a couple of our clients emailed back asking whether the maintenance had actually occurred, because there had been no noticeable delay. The performance delays they had previously noticed were now eliminated due to the serverless compilation recently released by Amazon Redshift. This is the best result we could have hoped for—our clients were unable to tell that a cache-flushing maintenance had even occurred!”

Manthan

Manthan delivers BI, analytics, and artificial intelligence solutions to more than 200 leading retailers across 22 countries. Vijay Chidambaram, Head of Cloud Engineering at Manthan, shared the following with us:

“The normal ETL runtimes are around 90–100 minutes. The ETL runtime would go to around 290 minutes post an upgrade without the serverless compilation feature. That value has come down to about 150 minutes, which is a 2X improvement. Across the clusters, there is no increase in the ETL wall clock runtime compared to normal runtimes on day two and beyond.”

Intentwise

Intentwise is an Amazon Advertising optimization platform that empowers brands, sellers, and agencies with insights, automation, and expertise. They use Amazon Redshift to power the analytics for their SaaS offering. Raghavendra, a Software Architect at Intentwise, shared the following with us:

“The new serverless compilation feature improves the query compilation time by 3x. This makes Amazon Redshift an even more powerful data warehouse for our analytical platform because it continues to innovate to offer better performance and lower costs, all with no efforts on our end.”

Summary

This post explained how the massively scalable serverless compilation capability for Amazon Redshift works and gave examples of the benefits you can expect from the performance improvements. The capability is free and automatically enabled on all new and existing Amazon Redshift clusters.

For more information about Amazon Redshift query planning and workflow, see Query planning and execution workflow. For more information about improving query performance, see Factors affecting query performance.


About the Authors

Kiran Chinta is a Senior Software Development Engineer at Amazon Redshift. He has been working on distributed databases for over 13 years and has focused on high availability, disaster recovery, SQL language features and performance features for on-prem and cloud databases. In his spare time, he enjoys reading and playing various sports.

Naresh Chainani is a Senior Software Development Manager at Amazon Redshift. He leads Query Processing, Query Performance, Distributed Systems and Workload Management with a strong team. Naresh is passionate about building high-performance databases to enable customers to gain timely insights and make critical business decisions. In his spare time, Naresh enjoys reading and playing tennis.

Maor Kleider is a product and database engineering leader for Amazon Redshift. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.

Quan Li is a Senior Database Engineer at Amazon Redshift. His focus is enabling customers to deliver maximum business value. Quan is passionate about optimizing high performance analytical databases. During his spare time, he enjoys traveling and experiencing different types of cuisines with his family.

How Aruba Networks built a cost analysis solution using AWS Glue, Amazon Redshift, and Amazon QuickSight

Post Syndicated from Siddharth Thacker original https://aws.amazon.com/blogs/big-data/how-aruba-networks-built-a-cost-analysis-solution-using-aws-glue-amazon-redshift-and-amazon-quicksight/

This is a guest post co-written by Siddharth Thacker and Swatishree Sahu from Aruba Networks.

Aruba Networks is a Silicon Valley company based in Santa Clara that was founded in 2002 by Keerti Melkote and Pankaj Manglik. Aruba is the industry leader in wired, wireless, and network security solutions. Hewlett-Packard acquired Aruba in 2015, making it a wireless networking subsidiary with a wide range of next-generation network access solutions.

Aruba Networks provides a cloud-based platform called Aruba Central for network management and AI Ops. The Aruba cloud platform supports thousands of workloads for its customer-facing production environment, as well as a separate development platform for Aruba engineering.

The motivation to build the solution presented in this post was to understand the unit economics of the AWS resources used by multiple product lines across different organization pillars. Aruba wanted a faster, effective, and reliable way to analyze cost and usage data and visualize that into a dashboard. This solution has helped Aruba in multiple ways, including:

  • Visibility into costs – Multiple Aruba teams can now analyze the cost of their application via data surfaced with this solution
  • Cost optimization – The solution helps teams identify new cost-optimization opportunities by making them aware of the higher-cost resources with low utilization so they can optimize accordingly
  • Cost management – The Cloud DevOps organization, the group who built this solution, can effectively plan at the application level and have a direct positive impact on gross margins
  • Cost savings – With daily cost data available, engineers can see the monetary impact of right-sizing compute and other AWS resources almost immediately
  • Big picture as well as granular – Users can visualize cost data from the top down and track cost at a business level and a specific resource level

Overview of the solution

This post describes how Aruba Networks automated the solution, from generating the AWS Cost & Usage Report (AWS CUR) to its final visualization on Amazon QuickSight. In this solution, they start by configuring the CUR on their primary payer account, which publishes the billing reports to an Amazon Simple Storage Service (Amazon S3) bucket. Then they use an AWS Glue crawler to define and catalog the CUR data. As the new CUR data is delivered daily, the data catalog is updated, and the data is loaded into an Amazon Redshift database using Amazon Redshift Spectrum and SQL. The reporting and visualization layer is built using QuickSight. Finally, the entire pipeline is automated by using AWS Data Pipeline.

The following diagram illustrates this architecture.

Aruba prefers AWS CUR to AWS Cost Explorer because Cost Explorer provides usage information at a high level, without enough granularity for detailed operations such as data transfer cost. AWS CUR provides the most detailed information available about your AWS costs and usage at an hourly granularity. This allows the Aruba team to drill down the costs by the hour or day, product or product resource, or custom tags, enabling them to achieve their goals.

Aruba implemented the solution with the following steps:

  1. Set up the CUR delivery to a primary S3 bucket from the billing dashboard.
  2. Use Amazon S3 replication to copy the primary payer S3 bucket to the analytics bucket. Having a separate analytics account helps prevent direct access to the primary account.
  3. Create and schedule the crawler to crawl the CUR data. This is required to make the metadata available in the Data Catalog and update it quickly when new data arrives.
  4. Create respective Amazon Redshift schema and tables.
  5. Orchestrate an ETL flow to load data to Amazon Redshift using Data Pipeline.
  6. Create and publish dashboards using QuickSight for executives and stakeholders.

Insights generated

The Aruba DevOps team built various reports that provide cost classifications on AWS services, weekly cost by application, cost by product, infrastructure, and resource type, and much more using the detailed CUR data, as shown in the following screenshot.

For example, using the following screenshot, Aruba can conveniently figure out that compute cost is the biggest contributor compared to other costs. To reduce the cost, they can consider using various cost-optimization methods like buying reserved instances, savings plans, or Spot Instances wherever applicable.

Similarly, the following screenshot highlights that the cost doubled compared to the first week of April. This helps Aruba identify anomalies quickly and make informed decisions.

Setting up the CUR delivery

For instructions on setting up a CUR, see Creating Cost and Usage Reports.

To reduce complexity in the workflow, Aruba chose to create resources in the same region with hourly granularity, mainly to see metrics more frequently.

To lower the storage costs for data files and maximize the effectiveness of querying data with serverless technologies like Amazon Athena, Amazon Redshift Spectrum, and Amazon S3 data lake, save the CUR in Parquet format. The following screenshot shows the configuration for delivery options.

The following table shows some example CUR data.

bill_payer_account_id | line_item_usage_account_id | line_item_usage_start_date | line_item_usage_end_date | line_item_product_code | line_item_usage_type | line_item_operation
123456789 | 111222333444 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-EBS:VolumeP-IOPS.piops | CreateVolume-P-IOPS
123456789 | 111222333444 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-APN1-AWS-In-Bytes | LoadBalancing-PublicIP-In
123456789 | 111222333444 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-DataProcessing-Bytes | LoadBalancing
123456789 | 111222333444 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-EBS:SnapshotUsage | CreateSnapshot
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-EBS:SnapshotUsage | CreateSnapshot
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-EBS:SnapshotUsage | CreateSnapshot
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-DataTransfer-Regional-Bytes | InterZone-In
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonS3 | USW2-Requests-Tier2 | ReadLocation
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-DataTransfer-Regional-Bytes | InterZone-In

Replicating the CUR data to your analytics account

For security purposes, other teams aren’t allowed to access the primary (payer) account, and therefore can’t access CUR data generated from that account. Aruba replicated the data to their analytics account and built the cost analysis solution there. Other teams can access the cost data without getting access permission for the primary account. The data is replicated across accounts by adding an Amazon S3 replication rule in the bucket. For more information, see Adding a replication rule when the destination bucket is in a different AWS account.
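
The following boto3 sketch shows what such a cross-account replication rule could look like. Bucket names, the account ID, and the role ARN are placeholders, and versioning must already be enabled on both buckets:

import boto3

s3 = boto3.client("s3")

# Bucket names, account ID, and role ARN below are placeholders.
s3.put_bucket_replication(
    Bucket="primary-payer-cur-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/cur-replication-role",
        "Rules": [
            {
                "ID": "replicate-cur-to-analytics",
                "Priority": 1,
                "Filter": {"Prefix": ""},  # replicate the whole bucket
                "Status": "Enabled",
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::analytics-cur-bucket",
                    "Account": "222222222222",
                    # Hand object ownership to the analytics account.
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
            }
        ],
    },
)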

Cataloging the data with a crawler and scheduling it to run daily

Because AWS delivers all daily reports in a report date range report-prefix/report-name/yyyymmdd-yyyymmdd folder, Aruba uses AWS Glue crawlers to crawl through the data and update the catalog.

AWS Glue is a fully managed ETL service that makes it easy to prepare and load the data for analytics. Once AWS Glue is pointed to the data stored on AWS, it discovers the data and stores the associated metadata (such as table definition and schema) in the Data Catalog. After the data is cataloged, it is immediately searchable, queryable, and available for ETL. For more information, see Populating the AWS Glue Data Catalog.

The following screenshot shows the crawler created on the Amazon S3 location of the CUR data.
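
The following boto3 sketch shows one way such a crawler could be created and scheduled; the crawler name, role, and schedule are placeholders:

import boto3

glue = boto3.client("glue")

# Crawler name, role, and schedule below are placeholders.
glue.create_crawler(
    Name="cur-parquet-crawler",
    Role="arn:aws:iam::222222222222:role/glue-cur-crawler-role",
    DatabaseName="aruba_curr_db",
    Targets={"S3Targets": [{"Path": "s3://curS3bucket/Parquet/"}]},
    # Run once a day, shortly after the daily CUR delivery.
    Schedule="cron(0 6 * * ? *)",
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)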

The following code is an example table definition populated by the crawler:

CREATE EXTERNAL TABLE `cur_parquet`(
  `identity_line_item_id` string, 
  `identity_time_interval` string, 
  `bill_invoice_id` string, 
………
………
  `resource_tags_user_infra_role` string)

PARTITIONED BY ( 
  `year` string, 
  `month` string )

ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://curS3bucket/Parquet/'

Transforming and loading using Amazon Redshift

Next, for the analytics service, Aruba chose Amazon Redshift over Athena. Aruba needs to integrate the cost data with other tables already present in Amazon Redshift, and using the same service makes it easy to integrate with that existing data. To further filter and transform data at the same time, and to simplify the multi-step ETL, Aruba chose Amazon Redshift Spectrum, which helps to efficiently query and load CUR data from Amazon S3. For more information, see Getting started with Amazon Redshift Spectrum.

Use the following query to create an external schema and map it to the AWS Glue database created earlier in the Data Catalog:

--Choose a schema name of your choice; cur_redshift_spectrum_external_schema is just an example--
 create external schema cur_redshift_spectrum_external_schema from data catalog database 
 'aruba_curr_db' iam_role 'arn:aws:iam::xxxxxxxxxxxxx:role/redshiftclusterrole' 
 create external database if not exists;

The table created in the Data Catalog appears under the Amazon Redshift Spectrum schema. The schema, table, and records created can be verified with the following SQL code:

SELECT Count(*) 
FROM   cur_redshift_spectrum_external_schema.<TABLE>; 

--Query the right partition; year=2020 and month=2 is used as an example
SELECT Count(*) 
FROM   cur_redshift_spectrum_external_schema.<TABLE> 
WHERE  year=2020 
AND    month=2;

Next, transform and load the data into the Amazon Redshift table. Aruba started by creating an Amazon Redshift table to contain the data. The following SQL code can be used to create the production table with the desired columns:

CREATE TABLE redshift_schema.redshift_table 
  ( 
     usage_start_date TIMESTAMP, 
     usage_end_date   TIMESTAMP, 
     service_region   VARCHAR (256), 
     service_az       VARCHAR (256), 
     aws_resource_id  VARCHAR (256), 
     usage_amount     FLOAT (17), 
     charge_currency  VARCHAR (256), 
     aws_product_name VARCHAR (256), 
     instance_family  VARCHAR (256), 
     instance_type    VARCHAR (256), 
     unblended_cost   FLOAT (17), 
     usage_cost       FLOAT (17)
  ); 

CUR is dynamic in nature, which means that some columns may appear or disappear with each update. When creating the table, we take static columns only. For more information, see Line item details.

Next, insert and update to ingest the data from Amazon S3 to the Amazon Redshift table. Each CUR update is cumulative, which means that each version of the CUR includes all the line items and information from the previous version.

The reports generated throughout the month are estimated and subject to change during the rest of the month. AWS finalizes the report at the end of each month. Finalized reports have the calculations for the blended and unblended costs, and cover all the usage for the month. For this use case, Aruba updates the last 45 days of data to make sure the finalized cost is captured. The following sample query can be used to insert and update that data:

-- Insert/Update Statement
 INSERT INTO redshift_schema.redshift_table
            (usage_start_date, 
             usage_end_date, 
             service_region, 
             service_az, 
             aws_resource_id, 
             usage_amount, 
             charge_currency, 
             aws_product_name, 
             instance_family, 
             instance_type, 
             unblended_cost,
             Usage_Cost ) 
 SELECT line_item_usage_start_date, 
       line_item_usage_end_date, 
       line_item_operation, 
       line_item_availability_zone, 
       line_item_resource_id, 
       line_item_usage_amount, 
       line_item_currency_code, 
       product_product_name, 
       product_instance_family, 
       product_instance_type, 
       line_item_unblended_cost,
       case when line_item_type='Usage' then line_item_unblended_cost
            else 0
            end as usage_cost 
 FROM   cur_redshift_spectrum_external_schema.cur_parquet
 WHERE  line_item_usage_start_date >= date_add('day', -45, getdate()) 
       AND line_item_usage_start_date < date_add('day', 1, getdate()); 

Using Data Pipeline to orchestrate the ETL workflow

To automate this ETL workflow, Aruba chose Data Pipeline. Data Pipeline helps to reliably process and move data between different AWS compute and storage services, as well as on-premises data sources. With Data Pipeline, Aruba can regularly access their data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR. Although the detailed steps of setting up this pipeline are out of scope for this blog, there is a sample workflow definition JSON file, which can be imported after making the necessary changes.

Data Pipeline workflow

The following screenshot shows the multi-step ETL workflow using Data Pipeline. Data Pipeline is used to run the INSERT query daily, which inserts and updates the latest CUR data into our Amazon Redshift table from the external table.

To copy data to Amazon Redshift, RedshiftDataNode and RedshiftCopyActivity can be used, and then scheduled to run periodically.

Sharing metrics and creating visuals with QuickSight

To share the cost and usage with other teams, Aruba chose QuickSight, using Amazon Redshift as the data source. QuickSight is a native AWS service that seamlessly integrates with other AWS services such as Amazon Redshift, Athena, Amazon S3, and many other data sources.

As a fully managed service, QuickSight lets Aruba easily create and publish interactive dashboards that include ML Insights. In addition to building powerful visualizations, QuickSight provides data preparation tools that make it easy to filter and transform the data into the exact dataset needed. As a cloud-native service, dashboards can be accessed from any device and embedded into applications and portals, allowing other teams to monitor their resource usage easily. For more information about creating a dataset, see Creating a Dataset from a Database. QuickSight visuals can then be created from this dataset.

The following screenshot shows a visual comparison of device cost and count to help find the cost per device. This visual helped Aruba quickly identify the cost per device increase in April and take necessary actions.

Similarly, the following visualization helped Aruba identify an increase in data transfer cost and helped them decide to invest in rearchitecting their application.

The following visualization classifies the cost spend per resource.

Conclusion

In this post, we discussed how Aruba Networks was able to successfully achieve the following:

  • Generate CUR and use AWS Glue to define data, catalog the data, and update the metadata
  • Use Amazon Redshift Spectrum to transform and load the data to Amazon Redshift tables
  • Query, visualize, and share the data stored using QuickSight
  • Automate and orchestrate the entire solution using Data Pipeline

Aruba uses this solution to automatically generate a daily cost report and share it with their stakeholders, including executives and the cloud operations team.

 


About the Authors

Siddharth Thacker works in Business & Finance Strategy in the Cloud Software division at Aruba Networks. Siddharth has a Master’s in Finance, with experience in industries like banking, investment management, and cloud software, and focuses on business analytics, margin improvement, and strategic partnerships at Aruba. In his spare time, he likes exploring the outdoors and participating in team sports.

Swatishree Sahu is a Technical Data Analyst at Aruba Networks. She has lived and worked in India for 7 years as an SME for SOA-based integration tools before coming to US to pursue her master’s in Business Analytics from UT Dallas. Breaking down and analyzing data is her passion. She is a Star Wars geek, and in her free time, she loves gardening, painting, and traveling.

Ritesh Chaman is a Technical Account Manager at Amazon Web Services. With 10 years of experience in the IT industry, Ritesh has a strong background in Data Analytics, Data Management, and Big Data systems. In his spare time, he loves cooking (spicy Indian food), watching sci-fi movies, and playing sports.

Kunal Ghosh is a Solutions Architect at AWS. His passion is to build efficient and effective solutions on the cloud, especially involving Analytics, AI, Data Science, and Machine Learning. Besides family time, he likes reading and watching movies, and is a foodie.

Build a self-service environment for each line of business using Amazon EMR and AWS Service Catalog

Post Syndicated from Tanzir Musabbir original https://aws.amazon.com/blogs/big-data/build-a-self-service-environment-for-each-line-of-business-using-amazon-emr-and-aws-service-catalog/

Enterprises often want to centralize governance and compliance requirements, and provide a common set of policies on how Amazon EMR instances should be set up. You can use AWS Service Catalog to centrally manage commonly deployed Amazon EMR cluster configurations, and this helps you achieve consistent governance and meet your compliance requirements, while at the same time enabling your end users to quickly deploy only the approved EMR cluster configurations on a self-service basis.

In this post, we demonstrate how enterprise administrators can use AWS Service Catalog to create and manage catalogs that data engineers and data scientists use to quickly discover and deploy clusters in a self-service environment. With AWS Service Catalog, you can control which EMR release versions are available, the cluster configuration, and permission access by individual, group, department, or cost center.

The following are a few key AWS Service Catalog concepts:

  • An AWS Service Catalog product is a blueprint for building the AWS resources that you want available for deployment. You create your products by importing AWS CloudFormation templates.
  • A portfolio is a collection of products. With AWS Service Catalog, you can create a customized portfolio for each type of user in your organization and selectively grant access to the appropriate portfolio.
  • A provisioned product is a collection of resources that result from instantiating an AWS CloudFormation template. A sketch of creating a portfolio and product through the API follows this list.
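
As a rough sketch of how these concepts map to API calls, the following boto3 snippet creates a portfolio and a product from a CloudFormation template and grants a group access. All names, the template URL, and the principal ARN are placeholders:

import boto3

sc = boto3.client("servicecatalog")

# Names, the template URL, and the IAM principal below are placeholders.
portfolio = sc.create_portfolio(
    DisplayName="EMR Self-Service Portfolio",
    ProviderName="Cloud DevOps",
)["PortfolioDetail"]

product = sc.create_product(
    Name="EMR Cluster - Spark and Hive",
    Owner="Cloud DevOps",
    ProductType="CLOUD_FORMATION_TEMPLATE",
    ProvisioningArtifactParameters={
        "Name": "v1",
        "Type": "CLOUD_FORMATION_TEMPLATE",
        "Info": {"LoadTemplateFromURL": "https://s3.amazonaws.com/my-bucket/emr-cluster.template"},
    },
)["ProductViewDetail"]["ProductViewSummary"]

# Make the product available through the portfolio and grant a group access.
sc.associate_product_with_portfolio(
    ProductId=product["ProductId"], PortfolioId=portfolio["Id"]
)
sc.associate_principal_with_portfolio(
    PortfolioId=portfolio["Id"],
    PrincipalARN="arn:aws:iam::123456789012:group/data-engineers",
    PrincipalType="IAM",
)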

Use cases

You can use AWS Service Catalog to provide Amazon EMR as a self-serve Extract, Transform, Load (ETL) platform at scale while hiding all the security and network configurations from end users.

As an administrator in AWS Service Catalog, you can create one or more Service Catalog products that define different configurations to be used for EMR clusters. In those products, you can define the security and network configurations to be used for the EMR cluster; define auto-scaling rules, instance configurations, and different purchase options; or preconfigure EMR to run different EMR step jobs. On the other hand, as a user in Service Catalog, you can browse through different EMR templates exposed as Service Catalog products and provision the product based on your requirements. By following this approach, you can make your EMR usage self-serviceable, reduce the EMR learning curve for your users, and ensure adherence to security standards and best practices.

The following image illustrates the interactions between Amazon EMR administrators and end users when they use AWS Service Catalog to provision EMR clusters.

The use cases in this post have three AWS Identity and Access Management (IAM) users with different access permissions:

  • emr-admin: This user is the administrator and has access to all the resources. This user creates EMR clusters for their end-users based on their requirements.
  • emr-data-engineer: The data engineer uses Spark and Hive most of the time. They run different ETL scripts on Hive and Spark to process, transform, and enrich their datasets.
  • emr-data-analyst: This user is very familiar with SQL and mostly uses Hue to submit queries to Hive.

You can solve several Amazon EMR operational use cases using AWS Service Catalog. The following sections discuss three different use cases. Later in this post, you walk through each of the use cases with a solution.

Use case 1: Ensuring least privilege and appropriate access

The administrator wants to enforce a few organizational standards. The first one is no default EMR_EC2_DefaultRole for any EMR cluster. Instead, the administrator wants to have a role that has limited access to Amazon Simple Storage Service (Amazon S3) and assigns that role automatically every time an EMR cluster is launched. Second, end users sometimes forget to add appropriate tags to their resources. Because of that, it is often hard for the administrator to identify their resources and allocate cost appropriately. So the administrator wants a mechanism that assigns tags to EMR clusters automatically when they launch.

Use case 2: Providing Amazon EMR as a self-serve ETL platform with Spark and Hive

Data engineers use Spark and Hive applications, and they prefer to have a platform where they just submit their jobs without spending time creating the cluster. They also want to try out different Amazon EMR versions to see how their jobs run on different Spark or Hive versions. They don’t want to spend time learning AWS or Amazon EMR. Additionally, the administrator doesn’t want to give full Amazon EMR access to all users.

Use case 3: Automatically scaling the Hive cluster for analysts

Data analysts have strong SQL backgrounds, so they typically use Hue to submit their Hive queries. They run queries against a large dataset, so they want to have an EMR cluster that can scale when needed. They also don’t have access to the Amazon EMR console and don’t know how to configure automatic scaling for Amazon EMR.

Solution overview

This solution uses AWS Service Catalog to self-serve your Amazon EMR users, enforce best practices and compliance, and speed up the adoption process.

At a high level, the solution includes the following steps:

  1. Configuring the AWS environment to run this solution.
  2. Creating a CloudFormation template.
  3. Setting up AWS Service Catalog products and portfolios.
  4. Managing access to AWS Service Catalog and provisioning products.
  5. Demonstrating the self-service Amazon EMR platform for users.
  6. Enforcing best practices and compliance through AWS Service Catalog.
  7. Executing ETL workloads on Amazon EMR using AWS Service Catalog.
  8. Optionally, setting up AWS Service Catalog and launching Amazon EMR products through the AWS Command Line Interface (AWS CLI).

The following section looks at the CloudFormation template, which you use to set up the AWS environment to run this solution.

Setting up the AWS environment

To set up this solution, you need to create a few AWS resources. The CloudFormation template provided in this post creates all the required AWS resources. This template requires you to pass the following parameters during the launch:

  • A password for your test users.
  • An Amazon Elastic Compute Cloud (Amazon EC2) key pair.
  • The latest AMI ID for the EC2 helper instance. This instance configures the environment and sets up the required files and templates for this solution.

This template is designed only to show how you can use Amazon EMR with AWS Service Catalog. This setup isn’t intended for production use without modification.

To launch the CloudFormation stack, choose Launch Stack:

Launching this stack creates several AWS resources. The following resources shown in the AWS CloudFormation output are the ones you need in the next step:

  • ConsoleLoginURL – URL you use to switch between multiple users
  • EMRSCBlogBucket – Name of the S3 bucket to store blog-related files
  • UserPassword – Password to use for all the test users
  • DataAdminUsername – IAM user name for the administrator user
  • DataEngineerUsername – IAM user name for the data engineer user
  • DataAnalystUsername – IAM user name for the data analyst user
  • HiveScriptURL – Amazon S3 path for the Hive script
  • HiveETLInputParameter – Path for the Hive input parameter
  • HiveETLOutputParameter – Path for the Hive output parameter
  • SparkScriptURL – Amazon S3 path for the Spark script
  • SparkETLInputParameter – Path for the Spark input parameter
  • SparkETLOutputParameter – Path for the Spark output parameter

When the CloudFormation template is complete, record the outputs listed on the Outputs tab on the AWS CloudFormation console. See the following screenshot.

(Optional) Configuring the AWS CLI

The AWS CLI is a unified tool to manage your AWS services. In the optional steps later in this post, you use the AWS CLI to create AWS Service Catalog products and portfolios. Installing the AWS CLI isn’t required unless you complete that optional section. For instructions on configuring the AWS CLI in your environment, see Configuring the AWS CLI.

Provisioning EMR clusters through AWS Service Catalog

You can create AWS Service Catalog products from the existing CloudFormation template and use those products to provision a variety of EMR clusters. You can create an EMR cluster and consume the cluster’s services without having access to the cluster, which improves the Amazon EMR adoption process.

The following CloudFormation template creates an EMR cluster. This template takes two parameters:

  • Cluster size – You select how many core nodes you want to have in the EMR cluster
  • Compute type – Based on the compute type you choose, the template selects the corresponding EC2 instance type

As an account administrator, you can define the internal configuration for the EMR cluster. End users don’t need to know the security groups, subnet IDs, key pairs, and other details, and they don’t need access to the EMR cluster or time to set it up. As an administrator, you define a template for the cluster, enforce compliance, versions, applications, and automatic scaling rules through the CloudFormation template, and expose this template as a product through AWS Service Catalog.

The following section walks you through the solution for each use case.

Use cases walkthrough

The CloudFormation template already configured AWS Service Catalog portfolios and products. You can review these on the AWS Service Catalog console.

  1. Use the ConsoleLoginURL from the AWS CloudFormation console Outputs tab and sign in as an emr-admin user.
  2. On the AWS Service Catalog console, you can see two portfolios for engineers and analysts. In each of those portfolios, you can see two products.

The Data Analysts Stack contains products for the analyst and is assigned to the user emr-data-analyst. The Data Engineering Stack contains products for engineers and is assigned to the emr-data-engineer user. Upon logging in, they can see their respective products and portfolios.

Use case 1: Ensuring least privilege and appropriate access

The cluster administrator creates a least-privilege IAM role for their users and associates that role through the Service Catalog product. Similarly, the administrator also assigns appropriate tags for each product. When data engineers or analysts launch an EMR cluster using any of their assigned products, the cluster has least-privilege access and resources are tagged automatically. To confirm this access is in place, complete the following steps:

  1. Sign in to the AWS Management Console as either emr-data-engineer user or emr-data-analyst user.

Your console looks slightly different because end users don’t manage the products; they just use a product to launch clusters or run jobs on the cluster.

  1. Choose Default EMR and provision this product by choosing Launch Product.
  2. For the name of the provisioned product, enter SampleEMR.

The next screen shows a list of allowed parameters your administrator thinks you may need.

  1. Leave all parameters as default.
  2. For the cluster name, enter Sample EMR.
  3. Review all the information and launch the product.

It takes a few minutes to spin up the cluster. When the cluster is ready, the status changes to Succeeded. The provisioned product page also shows you a list of outputs your product owner wants you to see. For example, using output values, your product owner can share the master DNS address, resource manager URL, and Hue URL, as shown in the following figure.

To verify that the launched EMR cluster has the expected IAM role and tags, sign in as the emr-admin user and go to the Amazon EMR console to review the service role for EC2 instances and the tags.

Use case 2: Providing Amazon EMR as a self-serve ETL platform with Spark and Hive

For this use case, data engineers have two different ETL scripts:

  • A Spark script that reads Amazon reviews stored in Amazon S3 and converts them into Parquet before writing back to Amazon S3
  • A Hive script that reads Amazon reviews data from Amazon S3 and finds out the top toys based on customer ratings.
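For illustration, the Hive job might run a query similar to the following sketch. This isn’t the exact hive_converter.sql that the CloudFormation stack provides; the table name is hypothetical, and the columns follow the public Amazon Customer Reviews dataset.

-- Hypothetical Hive table over the Amazon Customer Reviews TSV data
SELECT product_title,
       AVG(star_rating) AS avg_rating,
       COUNT(*) AS review_count
FROM amazon_reviews_toys
GROUP BY product_title
HAVING COUNT(*) > 100
ORDER BY avg_rating DESC
LIMIT 20;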

The administrator creates a product to self-serve these users; the product defines the job type and the job parameters. End users select the job type and pass the script, input, and output locations.

  1. Sign in as emr-data-engineer.
  2. Select the EMR ETL Engine product.
  3. Choose Launch.

The next page lists the product versions if the product has more than one. Because the engineer wants to try out two different Amazon EMR versions, the administrator provided both options as product versions. You can launch the EMR cluster with the required version by selecting your preferred product version.

  1. Enter the name of the product.
  2. For this post, select EMR 5.29.0.
  3. Choose Next.

  1. For JobType, choose Spark.
  2. For JobArtifacts, enter the following value (you can get these values from the AWS CloudFormation output):
s3://blog-emr-sc-<account-id>/scripts/spark_converter.py s3://amazon-reviews-pds/tsv/amazon_reviews_us_Toys_v1_00.tsv.gz s3://blog-emr-sc-<account-id>/spark/
  1. Choose Next.

Based on your configuration, an EMR cluster launches. When the cluster is ready, the Spark job runs.

  1. In a different browser, sign in as emr-admin using the ConsoleLoginURL (from the AWS CloudFormation output).

You can see the cluster status, job status, and output path from the Amazon EMR console.

Now, go to Amazon S3 console to check the output path:

The Parquet files are written inside the Spark folder.

  1. To test the Hive job, go back to the first browser where you already signed in as emr-data-engineer.
  2. Choose Provisioned products list.
  3. Choose the product options menu (right-click) and choose Update provisioned product.

  1. On the next page, you can select a different version or the same version.
  2. In the Parameters section, choose Hive.
  3. In the JobArtifacts field, enter the following Hive parameters:
s3://blog-emr-sc-<account-id>/scripts/hive_converter.sql -d INPUT=s3://amazon-reviews-pds/tsv/ -d OUTPUT=s3://blog-emr-sc-<account-id>/hive/
  1. Choose Update.

If you select the same version, AWS Service Catalog compares the old provisioned product with the updated product and only runs the portion that you changed. For this post, I chose the same Amazon EMR version and only updated the job type and parameters. You can see that the same EMR cluster is still there, but on the Steps tab, a new step is executed for Hive.

  1. On the Amazon S3 console using the second browser, verify that a new folder hive is created with data that represents top toys based on Amazon reviews.

To recap, you saw how to use AWS Service Catalog to provide a product to run your ETL jobs. Your data engineers can focus on their ETL scripts and your platform can self-serve them to run their ETL jobs on the EMR cluster.

Use case 3: Automatically scaling the Hive cluster for data analysts

To automatically scale the Hive cluster for data analysts, complete the following steps:

  1. Use the ConsoleLoginURL from the AWS CloudFormation output to sign in as emr-data-analyst, and go to the AWS Service Catalog console.

You can see a different set of products for this user.

For this use case, your data analysts want to have an automatically scaling EMR cluster with Hive application. The administrator set up the Auto-scaling EMR product with preconfigured rules.

  1. Choose Auto-scaling EMR.
  2. Enter a provisioned product name.
  3. Select Hive Auto-scaling.
  4. Choose Next.
  5. In the Parameters section, leave the options at their default and enter a cluster name.
  6. Launch the product.

The product owner also provided a client URL (for example, Hue URL) through the product output so business analysts can connect to it.

  1. Sign in as emr-admin and validate that this new cluster is configured with the expected automatic scaling rules.
  2. On the Amazon EMR console, choose the cluster.

You can see the configuration on the Hardware tab.

In this use case, you learned how to use AWS Service Catalog to provide business analyst users a preconfigured, automatically scaled EMR cluster.

(Optional) Setting up AWS Service Catalog for Amazon EMR using AWS CLI

In the previous section, I demonstrated the solution using the AWS Service Catalog console. In this section, I show you how to use AWS Service Catalog through the AWS CLI to create products and portfolios, assign IAM principals, and launch products.

  1. Create a portfolio named CLI – Stack for the user emr-admin. See the following command:
aws --region us-east-1 servicecatalog create-portfolio --display-name "CLI - Stack" --provider-name "@emr-admin" --description "Sample stack for pre-defined EMR clusters"

You receive a JSON output.

  1. Record the portfolio ID port-xxxxxxxx from the output to use later.

The emr-admin user is the provider for this portfolio. The user is created with power user access, so the user can see the full AWS Service Catalog console and can manage products and portfolios.

You can associate this portfolio with multiple users. By assigning users to a portfolio, they can browse through its products and provision new products. For this use case, you associate the portfolio with emr-admin and the AWS CLI user name (the name of the user that you used to configure your AWS CLI). Make sure to update the portfolio ID and AWS account ID.

  1. Enter the following code:
aws --region us-east-1 servicecatalog associate-principal-with-portfolio --portfolio-id port-xxxxxxxxxx --principal-type IAM --principal-arn arn:aws:iam::xxxxx:user/emr-admin

aws --region us-east-1 servicecatalog associate-principal-with-portfolio --portfolio-id port-xxxxxxxxxx --principal-type IAM --principal-arn arn:aws:iam::xxxxx:user/<aws-cli-user-name>
  1. To verify the principal associations for the portfolio, enter the following command with the portfolio ID:
aws --region us-east-1 servicecatalog list-principals-for-portfolio --portfolio-id port-xxxxxxxxx

This lists the principals associated with the portfolio, as shown in the following figure:

The CloudFormation template already copied the Amazon EMR template into your S3 bucket at the path s3://blog-emr-sc-<account-id>/products.

  1. To create the product CLI - Sample EMR using that template from Amazon S3, enter the following command:
aws --region us-east-1 servicecatalog create-product --name "CLI - Sample EMR" --owner "@emr-admin" --description "Sample EMR cluster with default" --product-type CLOUD_FORMATION_TEMPLATE --provisioning-artifact-parameters '{"Name": "Initial revision", "Description": "", "Info":{"LoadTemplateFromURL":"https://s3.amazonaws.com/blog-emr-sc-<account-id>/products/sample-cluster.template"},"Type":"CLOUD_FORMATION_TEMPLATE"}'

  1. Record the product ID and provisioning artifact ID from the JSON output.

You now have a product and a portfolio. A portfolio can have one to many products, and each product can have multiple versions.

  1. To assign the CLI - Sample EMR product to the portfolio you created in Step 1, enter the following command:
aws --region us-east-1 servicecatalog associate-product-with-portfolio --product-id prod-xxxxxx --portfolio-id port-xxxxxx

A launch constraint specifies the IAM role that AWS Service Catalog assumes when an end-user launches a product. With a launch constraint, you can control end-user access to your AWS resources and limit usage.

The CloudFormation template already created the role Blog-SCLaunchRole; create a launch constraint using that IAM role. Use the portfolio and product IDs that you collected from the previous step and your AWS account ID.

  1. To create the launch constraint, enter the following command:
aws --region us-east-1 servicecatalog create-constraint --type LAUNCH --portfolio-id port-xxxxxx --product-id prod-xxxxxx --parameters '{"RoleArn" : "arn:aws:iam::<account-id>:role/Blog-SCLaunchRole"}'

  1. Record the launch constraint ID to use later.

You now have an AWS Service Catalog product that you can use to provision an EMR cluster. The CloudFormation template that you used to create the CLI - Sample EMR product takes three parameters (ClusterName, ComputeRequirements, ClusterSize).

  1. To pass those three parameters as key-value pairs, enter the following command (use the product ID and provisioning artifact ID that you recorded earlier):
aws --region us-east-1 servicecatalog provision-product --product-id prod-xxxxxx --provisioning-artifact-id pa-xxxxx --provisioned-product-name cli-emr --provisioning-parameters Key=ClusterName,Value=cli-emr-cluster Key=ComputeRequirements,Value=CPU Key=ClusterSize,Value=2

  1. Check the provisioned product’s status by using the provisioned product ID:
aws --region us-east-1 servicecatalog describe-provisioned-product --id pp-xxxxx

To recap, in this section you learned how to use AWS Service Catalog CLI to configure AWS Service Catalog products and portfolios, and how to provision an EMR cluster through AWS Service Catalog product.

Cleaning up

To clean up the resources you created, complete the following steps:

  1. Terminate the product that you provisioned in the previous step:
aws --region us-east-1 servicecatalog terminate-provisioned-product --provisioned-product-id pp-xxxxx
  1. Disassociate the product CLI – Sample EMR from the portfolio CLI – Stack:
aws --region us-east-1 servicecatalog disassociate-product-from-portfolio --product-id prod-xxxxx --portfolio-id port-xxxxx
  1. Disassociate IAM principals from the portfolio CLI – Stack:
aws --region us-east-1 servicecatalog disassociate-principal-from-portfolio --portfolio-id port-xxxxx --principal-arn arn:aws:iam::xxxxxx:user/emr-admin

aws --region us-east-1 servicecatalog disassociate-principal-from-portfolio --portfolio-id port-xxxxx --principal-arn arn:aws:iam::xxxxxx:user/<aws-cli-user-name> 
  1. Delete the launch constraint created in the previous step:
aws --region us-east-1 servicecatalog delete-constraint --id cons-xxxxx
  1. Delete the product CLI – Sample EMR:
aws --region us-east-1 servicecatalog delete-product --id prod-xxxxx
  1. Delete the portfolio CLI – Stack:
aws --region us-east-1 servicecatalog delete-portfolio --id port-xxxxx

Cleaning up additional resources

You must also clean up the resources you created with the CloudFormation template.

  1. On the AWS Service Catalog console, choose Provisioned products list.
  2. Terminate each product that you provisioned for these use cases.
  3. Check each of the users and their provisioned products to make sure they’re terminated.
  4. On the Amazon S3 console, empty the bucket blog-emr-sc-<account-id>.
  5. If you are using the AWS CLI, delete the objects in the blog-emr-sc-<account-id> bucket with the following command (make sure you’re running this command on the correct bucket):
aws s3 rm s3://blog-emr-sc-<account-id> --recursive
  1. If you ran the optional AWS CLI section, make sure you follow the cleanup process mentioned in that section.
  2. On the AWS CloudFormation console or AWS CLI, delete the stack named Blog-EMR-Service-Catalog.

Next steps

To enhance this solution, you can explore the following options:

  • In this post, I enforced resource tagging through AWS CloudFormation. You can also use the AWS Service Catalog TagOptions library to provide a consistent taxonomy and tagging of AWS Service Catalog resources. During a product launch (provisioning), AWS Service Catalog aggregates the associated portfolio and product TagOptions and applies them to the provisioned product.
  • This solution demonstrates the usage of launch constraints and how you can provide limited access to your AWS resources to your users. You can also use template constraints to manage parameters. Template constraints make sure that end-users only have options that you allow them when launching products. This can help you maintain your organization’s compliance requirements.
  • You can integrate AWS Budgets with AWS Service Catalog. By associating AWS Budgets with your products and portfolios, you can track your usage and service costs. You can set a custom budget for each of the portfolios and trigger alerts when your costs exceed your threshold.

Summary

In this post, I showed you how to simplify your Amazon EMR provisioning process using AWS Service Catalog, how to make Amazon EMR a self-service platform for your end-users, and how to enforce best practices and compliance on your EMR clusters. You also walked through three different use cases and implemented solutions with AWS Service Catalog. Give this solution a try and share your experience with us!

 


About the Author

Tanzir Musabbir is a Data & Analytics Architect with AWS. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.

Top 10 performance tuning techniques for Amazon Redshift

Post Syndicated from Matt Scaer original https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/

Customers use Amazon Redshift for everything from accelerating existing database environments, to ingesting weblogs for big data analytics. Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. Amazon Redshift provides an open standard JDBC/ODBC driver interface, which allows you to connect your existing business intelligence (BI) tools and reuse existing analytics queries.

Amazon Redshift can run any type of data model, from a production transaction system third-normal-form model to star and snowflake schemas, data vault, or simple flat tables.

This post takes you through the most common performance-related opportunities when adopting Amazon Redshift and gives you concrete guidance on how to optimize each one.

What’s new

This post refreshes the Top 10 post from early 2019. We’re pleased to share the advances we’ve made since then, and want to highlight a few key points.

Query throughput is more important than query concurrency.

Configuring concurrency, like memory management, can be relegated to Amazon Redshift’s internal ML models through Automatic WLM with Query Priorities. On production clusters across the fleet, we see the automated process assigning a much higher number of active statements for certain workloads and a much lower number for others. This is done to maximize throughput, a measure of how much work the Amazon Redshift cluster can do over a period of time, such as 300 queries a minute or 1,500 SQL statements an hour. It’s recommended to focus on increasing throughput over concurrency, because throughput is the metric with a much more direct impact on the cluster’s users.

In addition to the optimized Automatic WLM settings to maximize throughput, the concurrency scaling functionality in Amazon Redshift extends the throughput capability of the cluster to up to 10 times greater than what’s delivered with the original cluster. The tenfold increase is a current soft limit; you can reach out to your account team to increase it.

Investing in the Amazon Redshift driver.

AWS now recommends the Amazon Redshift JDBC or ODBC driver for improved performance. Each driver has optional configurations to further tune it for a higher or lower number of statements, with either fewer or greater row counts in the result set.

Ease of use by automating all the common DBA tasks.

In 2018, the SET DW “backronym” summarized the key considerations to drive performance (sort key, encoding, table maintenance, distribution, and workload management). Since then, Amazon Redshift has added automation to inform 100% of SET DW, absorbed table maintenance into the service’s (and no longer the user’s) responsibility, and enhanced out-of-the-box performance with smarter default settings. Amazon Redshift Advisor continuously monitors the cluster for additional optimization opportunities, even if the mission of a table changes over time. AWS publishes the benchmark used to quantify Amazon Redshift performance, so anyone can reproduce the results.

Scaling compute separately from storage with RA3 nodes and Amazon Redshift Spectrum.

Although the convenient cluster building blocks of the Dense Compute and Dense Storage nodes continue to be available, you now have a variety of tools to further scale compute and storage separately. Amazon Redshift Managed Storage (the RA3 node family) allows for focusing on using the right amount of compute, without worrying about sizing for storage. Concurrency scaling lets you specify entire additional clusters of compute to be applied dynamically as-needed. Amazon Redshift Spectrum uses the functionally-infinite capacity of Amazon Simple Storage Service (Amazon S3) to support an on-demand compute layer up to 10 times the power of the main cluster, and is now bolstered with materialized view support.

Pause and resume feature to optimize cost of environments

All Amazon Redshift clusters can use the pause and resume feature. For clusters created using On Demand, the per-second grain billing is stopped when the cluster is paused. Reserved Instance clusters can use the pause and resume feature to define access times or freeze a dataset at a point in time.

Tip #1: Precomputing results with Amazon Redshift materialized views

Materialized views can significantly boost query performance for repeated and predictable analytical workloads such as dashboarding, queries from BI tools, and extract, load, transform (ELT) data processing. Data engineers can easily create and maintain efficient data-processing pipelines with materialized views while seamlessly extending the performance benefits to data analysts and BI tools.

Materialized views are especially useful for queries that are predictable and repeated over and over. Instead of performing resource-intensive queries on large tables, applications can query the pre-computed data stored in the materialized view.

When the data in the base tables changes, you refresh the materialized view by issuing the Amazon Redshift SQL statement REFRESH MATERIALIZED VIEW. After the refresh, your materialized view returns the same data that the equivalent regular view would. Refreshes can be incremental or full (recompute). When possible, Amazon Redshift incrementally refreshes data that changed in the base tables since the materialized view was last refreshed.

To demonstrate how it works, we can create an example schema to store sales information: each sales transaction and details about the store where the sales took place.

To view the total amount of sales per city, we create a materialized view with the create materialized view SQL statement (city_sales) joining records from two tables and aggregating sales amount (sum(sales.amount)) per city (group by city):

CREATE MATERIALIZED VIEW city_sales AS 
  (
  SELECT st.city, SUM(sa.amount) as total_sales
  FROM sales sa, store st
  WHERE sa.store_id = st.id
  GROUP BY st.city
  );

Now we can query the materialized view just like a regular view or table and issue statements like “SELECT city, total_sales FROM city_sales” to get the following results. The join between the two tables and the aggregate (sum and group by) are already computed, resulting in significantly less data to scan.

When the data in the underlying base tables changes, the materialized view doesn’t automatically reflect those changes. You can refresh the data stored in the materialized view on demand with the latest changes from the base tables using the SQL refresh materialized view command. For example, see the following code:

-- let's add a row in the sales base table

INSERT INTO sales (id, item, store_id, customer_id, amount) 
VALUES(8, 'Gaming PC Super ProXXL', 1, 1, 3000);

SELECT city, total_sales FROM city_sales WHERE city = 'Paris'

|city |total_sales|
|-----|-----------|
|Paris|        690|

-- the new sale is not taken into account yet
-- let's refresh the materialized view
REFRESH MATERIALIZED VIEW city_sales;

SELECT city, total_sales FROM city_sales WHERE city = 'Paris'

|city |total_sales|
|-----|-----------|
|Paris|       3690|

-- now the view has the latest sales data

The full code for this use case is available as a simple demo in a GitHub gist.

You can also extend the benefits of materialized views to external data in your Amazon S3 data lake and federated data sources. With materialized views, you can easily store and manage the pre-computed results of a SELECT statement referencing both external tables and Amazon Redshift tables. Subsequent queries referencing the materialized views run much faster because they use the pre-computed results stored in Amazon Redshift, instead of accessing the external tables. This also helps you reduce the associated costs of repeatedly accessing the external data sources, because you can only access them when you explicitly refresh the materialized views.
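As a minimal sketch (the ext_spectrum schema and table here are assumptions, not objects created earlier in this post), a materialized view over an external table could look like the following:

-- Pre-aggregate data that lives in the Amazon S3 data lake into Amazon Redshift managed storage
CREATE MATERIALIZED VIEW daily_store_sales AS
SELECT ss_store_sk, ss_sold_date_sk, SUM(ss_net_paid) AS net_paid
FROM ext_spectrum.store_sales_historical
GROUP BY ss_store_sk, ss_sold_date_sk;

-- Recompute on demand after new files land in the data lake
REFRESH MATERIALIZED VIEW daily_store_sales;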

Tip #2: Handling bursts of workload with concurrency scaling and elastic resize

The legacy, on-premises model requires you to estimate what the system will need 3-4 years in the future to make sure you’re leasing enough horsepower at the time of purchase. But the ability to resize a cluster allows for right-sizing your resources as you go. Amazon Redshift extends this ability with elastic resize and concurrency scaling.

Elastic resize lets you quickly increase or decrease the number of compute nodes, doubling or halving the original cluster’s node count, or even change the node type. You can expand the cluster to provide additional processing power to accommodate an expected increase in workload, such as Black Friday for internet shopping, or a championship game for a team’s web business. Choose classic resize when you’re resizing to a configuration that isn’t available through elastic resize. Classic resize is slower but allows you to change the node type or expand beyond the doubling or halving size limitations of an elastic resize. 

Elastic resize completes in minutes and doesn’t require a cluster restart. For anticipated workload spikes that occur on a predictable schedule, you can automate the resize operation using the elastic resize scheduler feature on the Amazon Redshift console, the AWS Command Line Interface (AWS CLI), or API.

Concurrency scaling allows your Amazon Redshift cluster to add capacity dynamically in response to the workload arriving at the cluster.

By default, concurrency scaling is disabled, and you can enable it for any workload management (WLM) queue to scale to a virtually unlimited number of concurrent queries, with consistently fast query performance. You can control the maximum number of concurrency scaling clusters allowed by setting the max_concurrency_scaling_clusters parameter value from 1 (default) to 10 (contact support to raise this soft limit). The free billing credits provided for concurrency scaling are often enough, and the majority of customers using this feature don’t end up paying extra for it. For more information about the concurrency scaling billing model, see Concurrency Scaling pricing.

You can monitor and control the concurrency scaling usage and cost by creating daily, weekly, or monthly usage limits and instruct Amazon Redshift to automatically take action (such as logging, alerting or disabling further usage) if those limits are reached. For more information, see Managing usage limits in Amazon Redshift.

Together, these options open up new ways to right-size the platform to meet demand. Before these options, you needed to size your WLM queue, or even an entire Amazon Redshift cluster, beforehand in anticipation of upcoming peaks.

Tip #3: Using the Amazon Redshift Advisor to minimize administrative work

Amazon Redshift Advisor offers recommendations specific to your Amazon Redshift cluster to help you improve its performance and decrease operating costs.

Advisor bases its recommendations on observations regarding performance statistics or operations data. Advisor develops observations by running tests on your clusters to determine if a test value is within a specified range. If the test result is outside of that range, Advisor generates an observation for your cluster. At the same time, Advisor creates a recommendation about how to bring the observed value back into the best-practice range. Advisor only displays recommendations that can have a significant impact on performance and operations. When Advisor determines that a recommendation has been addressed, it removes it from your recommendation list. In this section, we share some examples of Advisor recommendations:

Distribution key recommendation

Advisor analyzes your cluster’s workload to identify the most appropriate distribution key for the tables that can significantly benefit from a KEY distribution style. Advisor provides ALTER TABLE statements that alter the DISTSTYLE and DISTKEY of a table based on its analysis. To realize a significant performance benefit, make sure to implement all SQL statements within a recommendation group.

The following screenshot shows recommendations regarding distribution keys.

If you don’t see a recommendation, that doesn’t necessarily mean that the current distribution styles are the most appropriate. Advisor doesn’t provide recommendations when there isn’t enough data or the expected benefit of redistribution is small.
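The ALTER TABLE statements that Advisor generates look similar to the following hedged sketch (the table and column names are hypothetical):

-- Switch the table to KEY distribution on the column Advisor identified
ALTER TABLE public.store_sales ALTER DISTSTYLE KEY DISTKEY ss_item_sk;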

Sort key recommendation

Sorting a table on an appropriate sort key can accelerate query performance, especially queries with range-restricted predicates, by requiring fewer table blocks to be read from disk.

Advisor analyzes your cluster’s workload over several days to identify a beneficial sort key for your tables. See the following screenshot.

If you don’t see a recommendation for a table, that doesn’t necessarily mean that the current configuration is the best. Advisor doesn’t provide recommendations when there isn’t enough data or the expected benefit of sorting is small.
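When a recommendation is present, the resulting statement looks similar to this hedged sketch (hypothetical table and column names):

-- Sort the fact table on the column most often used in range-restricted predicates
ALTER TABLE public.store_sales ALTER SORTKEY (ss_sold_date_sk);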

Table compression recommendation

Amazon Redshift is optimized to reduce your storage footprint and improve query performance by using compression encodings. When you don’t use compression, data consumes additional space and requires additional disk I/O. Applying compression to large uncompressed columns can have a big impact on your cluster.

The compression analysis in Advisor tracks uncompressed storage allocated to permanent user tables. It reviews storage metadata associated with large uncompressed columns that aren’t sort key columns.

The following screenshot shows an example of table compression recommendation.
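Before applying a recommendation, you can also estimate the potential savings yourself with ANALYZE COMPRESSION (the table name is a placeholder):

-- Sample rows from the table and report a suggested encoding for each column
ANALYZE COMPRESSION public.store_sales;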

Table statistics recommendation

Maintaining current statistics helps complex queries run in the shortest possible time. The Advisor analysis tracks tables whose statistics are out-of-date or missing. It reviews table access metadata associated with complex queries. If tables that are frequently accessed with complex patterns are missing statistics, Amazon Redshift Advisor creates a critical recommendation to run ANALYZE. If tables that are frequently accessed with complex patterns have out-of-date statistics, Advisor creates a suggested recommendation to run ANALYZE.

The following screenshot shows a table statistics recommendation.

Tip #4: Using Auto WLM with priorities to increase throughput

Auto WLM simplifies workload management and maximizes query throughput by using ML to dynamically manage memory and concurrency, which ensures optimal utilization of the cluster resources.

Amazon Redshift runs queries using the queuing system (WLM). You can define up to eight queues to separate workloads from each other.

Amazon Redshift Advisor automatically analyzes the current WLM usage and can make recommendations to get more throughput from your cluster. Periodically reviewing the suggestions from Advisor helps you get the best performance.

Query priorities is a feature of Auto WLM that lets you assign priority ranks to different user groups or query groups, to ensure that higher priority workloads get more resources for consistent query performance, even during busy times. It is a good practice to set up query monitoring rules (QMR) to monitor and manage resource intensive or runaway queries. QMR also enables you to dynamically change a query’s priority based on its runtime performance and metrics-based rules you define.

For more information on migrating from manual to automatic WLM with query priorities, see Modifying the WLM configuration.

It’s recommended to take advantage of Amazon Redshift’s short query acceleration (SQA). SQA uses ML to run short-running jobs in their own queue. This keeps small jobs processing, rather than waiting behind longer-running SQL statements. SQA is enabled by default in the default parameter group and for all new parameter groups. You can enable and disable SQA via a check box on the Amazon Redshift console, or by using the Amazon Redshift CLI.

If you enable concurrency scaling, Amazon Redshift can automatically and quickly provision additional clusters should your workload begin to back up. This is an important consideration when deciding the cluster’s WLM configuration.

A common pattern is to optimize the WLM configuration to run most SQL statements without the assistance of supplemental memory, reserving additional processing power for short jobs. Some queueing is acceptable because additional clusters spin up if your needs suddenly expand. To enable concurrency scaling on a WLM queue, set the concurrency scaling mode value to AUTO. You can best inform your decisions by reviewing the concurrency scaling billing model. You can also monitor and control the concurrency scaling usage and cost by using the Amazon Redshift usage limit feature.

In some cases, unless you enable concurrency scaling for the queue, the user or query’s assigned queue may be busy, and you must wait for a queue slot to open. During this time, the system isn’t running the query at all. If this becomes a frequent problem, you may have to increase concurrency.

First, determine if any queries are queuing, using the queuing_queries.sql admin script. Review the maximum concurrency that your cluster needed in the past with wlm_apex.sql, or get an hour-by-hour historical analysis with wlm_apex_hourly.sql. Keep in mind that increasing concurrency allows more queries to run, but each query gets a smaller share of the memory. You may find that by increasing concurrency, some queries must use temporary disk storage to complete, which is also sub-optimal.

Tip #5: Taking advantage of Amazon Redshift data lake integration

Amazon Redshift is tightly integrated with other AWS-native services such as Amazon S3, which lets the Amazon Redshift cluster interact with the data lake in several useful ways.

Amazon Redshift Spectrum lets you query data directly from files on Amazon S3 through an independent, elastically sized compute layer. Use these patterns independently or apply them together to offload work to the Amazon Redshift Spectrum compute layer, quickly create a transformed or aggregated dataset, or eliminate entire steps in a traditional ETL process.

  • Use the Amazon Redshift Spectrum compute layer to offload workloads from the main cluster, and apply more processing power to the specific SQL statement. Amazon Redshift Spectrum automatically assigns compute power up to approximately 10 times the processing power of the main cluster. This may be an effective way to quickly process large transform or aggregate jobs.
  • Skip the load in an ELT process and run the transform directly against data on Amazon S3. You can run transform logic against partitioned, columnar data on Amazon S3 with an INSERT … SELECT statement. It’s easier than going through the extra work of loading a staging dataset, joining it to other tables, and running a transform against it.
  • Use Amazon Redshift Spectrum to run queries as the data lands in Amazon S3, rather than adding a step to load the data onto the main cluster. This allows for real-time analytics.
  • Land the output of a staging or transformation cluster on Amazon S3 in a partitioned, columnar format. The main or reporting cluster can either query from that Amazon S3 dataset directly or load it via an INSERT … SELECT statement.

Within Amazon Redshift itself, you can export the data into the data lake with the UNLOAD command, or by writing to external tables. Both options export SQL statement output to Amazon S3 in a massively parallel fashion. You can do the following:

  • Using familiar CREATE EXTERNAL TABLE AS SELECT and INSERT INTO SQL commands, create and populate external tables on Amazon S3 for subsequent use by Amazon Redshift or other services participating in the data lake without the need to manually maintain partitions. Materialized views can also cover external tables, further enhancing the accessibility and utility of the data lake.
  • Using the UNLOAD command, Amazon Redshift can export SQL statement output to Amazon S3 in a massively parallel fashion. This technique greatly improves the export performance and lessens the impact of running the data through the leader node. You can compress the exported data on its way off the Amazon Redshift cluster. As the size of the output grows, so does the benefit of using this feature. For writing columnar data to the data lake, UNLOAD can write partition-aware Parquet data.
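For example, a hedged UNLOAD sketch that writes partition-aware Parquet to the data lake (the bucket, IAM role, and table names are placeholders):

UNLOAD ('SELECT ss_sold_date_sk, ss_store_sk, ss_net_paid FROM public.store_sales')
TO 's3://your-data-lake/store_sales/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARTITION BY (ss_sold_date_sk);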

Tip #6: Improving the efficiency of temporary tables

Amazon Redshift provides temporary tables, which act like normal tables but have a lifetime of a single SQL session. The proper use of temporary tables can significantly improve performance of some ETL operations. Unlike regular permanent tables, data changes made to temporary tables don’t trigger automatic incremental backups to Amazon S3, and they don’t require synchronous block mirroring to store a redundant copy of data on a different compute node. Due to these reasons, data ingestion on temporary tables involves reduced overhead and performs much faster. For transient storage needs like staging tables, temporary tables are ideal.

You can create temporary tables using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. The CREATE TABLE statement gives you complete control over the definition of the temporary table. The SELECT … INTO and C(T)TAS commands use the input data to determine column names, sizes and data types, and use default storage properties. Consider default storage properties carefully, because they may cause problems. By default, for temporary tables, Amazon Redshift applies EVEN table distribution with no column encoding (such as RAW compression) for all columns. This data structure is sub-optimal for many types of queries.

If you employ the SELECT…INTO syntax, you can’t set the column encoding, column distribution, or sort keys. The CREATE TABLE AS (CTAS) syntax instead lets you specify a distribution style and sort keys, and Amazon Redshift automatically applies LZO encoding for everything other than sort keys, Booleans, reals, and doubles. You can exert additional control by using the CREATE TABLE syntax rather than CTAS.

If you create temporary tables, remember to convert all SELECT…INTO syntax into the CREATE statement. This ensures that your temporary tables have column encodings and don’t cause distribution errors within your workflow. For example, you may want to convert a statement using this syntax:

SELECT column_a, column_b INTO #my_temp_table FROM my_table;

You need to analyze the temporary table for optimal column encoding:

Master=# analyze compression #my_temp_table;
Table | Column | Encoding
----------------+----------+---------
#my_temp_table | column_a | lzo
#my_temp_table | column_b | bytedict
(2 rows)

You can then convert the SELECT…INTO statement to the following:

BEGIN;

CREATE TEMPORARY TABLE my_temp_table(
column_a varchar(128) encode lzo,
column_b char(4) encode bytedict)
distkey (column_a) -- Assuming you intend to join this table on column_a
sortkey (column_b) -- Assuming you are sorting or grouping by column_b
;

INSERT INTO my_temp_table SELECT column_a, column_b FROM my_table;

COMMIT;

If you create a temporary staging table by using a CREATE TABLE LIKE statement, the staging table inherits the distribution key, sort keys, and column encodings from the parent target table. In this case, merge operations that join the staging and target tables on the same distribution key perform faster because the joining rows are collocated. To verify that the query uses a collocated join, run the query with EXPLAIN and check for DS_DIST_NONE on all the joins.
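The following is a minimal sketch of that pattern, assuming hypothetical table and key names:

-- The staging table inherits the distribution key, sort keys, and encodings from the target
CREATE TEMPORARY TABLE stage_store_sales (LIKE public.store_sales);

-- After loading the staging table, confirm the join is collocated:
-- the query plan should show DS_DIST_NONE on the join
EXPLAIN
SELECT COUNT(*)
FROM public.store_sales t
JOIN stage_store_sales s ON t.ss_ticket_number = s.ss_ticket_number;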

You may also want to analyze statistics on the temporary table, especially when you use it as a join table for subsequent queries. See the following code:

ANALYZE my_temp_table;

With this trick, you retain the functionality of temporary tables but control data placement on the cluster through distribution key assignment. You also take advantage of the columnar nature of Amazon Redshift by using column encoding.

Tip #7: Using QMR and Amazon CloudWatch metrics to drive additional performance improvements

In addition to the Amazon Redshift Advisor recommendations, you can get performance insights through other channels.

The Amazon Redshift cluster continuously and automatically collects query monitoring rules metrics, whether you institute any rules on the cluster or not. This convenient mechanism lets you view attributes like the following:

  • The CPU time for a SQL statement (query_cpu_time)
  • The amount of temporary space a job might ‘spill to disk’ (query_temp_blocks_to_disk)
  • The ratio of the highest number of blocks read over the average (io_skew)

It also makes Amazon Redshift Spectrum metrics available, such as the number of Amazon Redshift Spectrum rows and MBs scanned by a query (spectrum_scan_row_count and spectrum_scan_size_mb, respectively). The Amazon Redshift system view SVL_QUERY_METRICS_SUMMARY shows the maximum values of metrics for completed queries, and STL_QUERY_METRICS and STV_QUERY_METRICS carry the information at 1-second intervals for the completed and running queries respectively.
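For example, the following sketch surfaces the completed statements with the most temporary disk spill (adjust the columns and filters to your needs):

-- Completed queries ordered by temporary blocks spilled to disk
SELECT query, query_cpu_time, query_temp_blocks_to_disk, io_skew
FROM svl_query_metrics_summary
ORDER BY query_temp_blocks_to_disk DESC NULLS LAST
LIMIT 20;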

The Amazon Redshift CloudWatch metrics are data points for use with Amazon CloudWatch monitoring. These can be cluster-wide metrics, such as health status or read/write, IOPS, latency, or throughput. It also offers compute node–level data, such as network transmit/receive throughput and read/write latency. At the WLM queue grain, there are the number of queries completed per second, queue length, and others. CloudWatch facilitates monitoring concurrency scaling usage with the metrics ConcurrencyScalingSeconds and ConcurrencyScalingActiveClusters.

It’s recommended to consider the CloudWatch metrics (and the existing notification infrastructure built around them) before investing time in creating something new. Similarly, the QMR metrics cover most metric use cases and likely eliminate the need to write custom metrics.

Tip #8: Federated queries connect the OLAP, OLTP and data lake worlds

The new Federated Query feature in Amazon Redshift allows you to run analytics directly against live data residing on your OLTP source system databases and Amazon S3 data lake, without the overhead of performing ETL and ingesting source data into Amazon Redshift tables. This feature gives you a convenient and efficient option for providing real-time data visibility on operational reports, as an alternative to micro-ETL batch ingestion of real-time data into the data warehouse. By combining historical trend data from the data warehouse with live developing trends from the source systems, you can gather valuable insights to drive real-time business decision making.

For example, consider sales data residing in three different data stores:

  • Live sales order data stored on an Amazon RDS for PostgreSQL database (represented as “ext_postgres” in the following external schema)
  • Historical sales data warehoused in a local Amazon Redshift database (represented as “local_dwh”)
  • Archived, “cold” sales data older than 5 years stored on Amazon S3 (represented as “ext_spectrum”)

We can create a late binding view in Amazon Redshift that allows you to merge and query data from all three sources. See the following code:

CREATE VIEW store_sales_integrated AS 
SELECT * FROM ext_postgres.store_sales_live 
UNION ALL 
SELECT * FROM local_dwh.store_sales_current 
UNION ALL 
SELECT ss_sold_date_sk, ss_sold_time_sk, ss_item_sk, ss_customer_sk, ss_cdemo_sk, 
ss_hdemo_sk, ss_addr_sk, ss_store_sk, ss_promo_sk, ss_ticket_number, ss_quantity, 
ss_wholesale_cost, ss_list_price, ss_sales_price, ss_ext_discount_amt, 
ss_ext_sales_price, ss_ext_wholesale_cost, ss_ext_list_price, ss_ext_tax, 
ss_coupon_amt, ss_net_paid, ss_net_paid_inc_tax, ss_net_profit 
FROM ext_spectrum.store_sales_historical 
WITH NO SCHEMA BINDING
;

Currently, direct federated querying is supported for data stored in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL databases, with support for other major RDS engines coming soon. You can also use the federated query feature to simplify the ETL and data-ingestion process. Instead of staging data on Amazon S3, and performing a COPY operation, federated queries allow you to ingest data directly into an Amazon Redshift table in one step, as part of a federated CTAS/INSERT SQL query.

For example, the following code shows an upsert/merge operation in which the COPY operation from Amazon S3 to Amazon Redshift is replaced with a federated query sourced directly from PostgreSQL:

BEGIN;

CREATE TEMP TABLE staging (LIKE ods.store_sales);

-- replace the following COPY from S3: 
   /*COPY staging FROM 's3://yourETLbucket/daily_store_sales/' 
   IAM_ROLE 'arn:aws:iam::<account_id>:role/<s3_reader_role>' 
   DELIMITER '|' COMPUPDATE OFF; */
      
-- with this federated query to load staging data directly from PostgreSQL source
INSERT INTO staging SELECT * FROM pg.store_sales p
    WHERE p.last_updated_date > (SELECT MAX(last_updated_date) FROM ods.store_sales);

DELETE FROM ods.store_sales USING staging s WHERE ods.store_sales.id = s.id;

INSERT INTO ods.store_sales SELECT * FROM staging;

DROP TABLE staging;

COMMIT;

For more information about setting up the preceding federated queries, see Build a Simplified ETL and Live Data Query Solution using Redshift Federated Query. For additional tips and best practices on federated queries, see Best practices for Amazon Redshift Federated Query.

Tip #9: Maintaining efficient data loads

Amazon Redshift best practices suggest using the COPY command to perform data loads of file-based data. Single-row INSERTs are an anti-pattern. The COPY operation uses all the compute nodes in your cluster to load data in parallel, from sources such as Amazon S3, Amazon DynamoDB, Amazon EMR HDFS file systems, or any SSH connection.

When performing data loads, compress the data files whenever possible. For row-oriented (CSV) data, Amazon Redshift supports both GZIP and LZO compression. It’s more efficient to load a large number of small files than one large file, and the ideal file count is a multiple of the cluster’s total slice count. Columnar data, such as Parquet and ORC, is also supported. You can achieve the best performance when the compressed files are between 1 MB and 1 GB each.
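A hedged COPY sketch for compressed, delimited files (the bucket, IAM role, and table names are placeholders):

-- Load gzip-compressed, pipe-delimited files in parallel across all slices
COPY public.store_sales
FROM 's3://your-etl-bucket/store_sales/'
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
DELIMITER '|'
GZIP;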

The number of slices per node depends on the cluster’s node size (and potentially elastic resize history). By ensuring an equal number of files per slice, you know that the COPY command evenly uses cluster resources and completes as quickly as possible. Query for the cluster’s current slice count with SELECT COUNT(*) AS number_of_slices FROM stv_slices;.

Another script in the amazon-redshift-utils GitHub repo, CopyPerformance, calculates statistics for each load. Amazon Redshift Advisor also warns of missing compression or too few files based on the number of slices (see the following screenshot):

Conducting COPY operations efficiently reduces the time to results for downstream users, and minimizes the cluster resources utilized to perform the load.

Tip #10: Using the latest Amazon Redshift drivers from AWS

Because Amazon Redshift is based on PostgreSQL, we previously recommended using JDBC4 PostgreSQL driver version 8.4.703 and psql ODBC version 9.x drivers. If you’re currently using those drivers, we recommend moving to the new Amazon Redshift–specific drivers. For more information about drivers and configuring connections, see JDBC and ODBC drivers for Amazon Redshift in the Amazon Redshift Cluster Management Guide.

While rarely necessary, the Amazon Redshift drivers do permit some parameter tuning that may be useful in some circumstances. Downstream third-party applications often have their own best practices for driver tuning that may lead to additional performance gains.

For JDBC, consider the following:

  • To avoid client-side out-of-memory errors when retrieving large data sets using JDBC, you can enable your client to fetch data in batches by setting the JDBC fetch size parameter or BlockingRowsMode.
  • Amazon Redshift doesn’t recognize the JDBC maxRows parameter. Instead, specify a LIMIT clause to restrict the result set. You can also use an OFFSET clause to skip to a specific starting point in the result set.

For ODBC, consider the following:

  • A cursor is enabled on the cluster’s leader node when useDeclareFetch is enabled. The cursor fetches up to fetchsize/cursorsize rows and then waits to fetch more rows when the application requests them.
  • The CURSOR command is an explicit directive that the application uses to manipulate cursor behavior on the leader node. Unlike the JDBC driver, the ODBC driver doesn’t have a BlockingRowsMode mechanism.

It’s recommended that you do not undertake driver tuning unless you have a clear need. AWS Support is available to help on this topic as well.

Conclusion

Amazon Redshift is a powerful, fully managed data warehouse that can offer increased performance and lower cost in the cloud. As Amazon Redshift grows based on the feedback from its tens of thousands of active customers world-wide, it continues to become easier to use and extend its price-for-performance value proposition. Staying abreast of these improvements can help you get more value (with less effort) from this core AWS service.

We hope you learned a great deal about making the most of your Amazon Redshift account with the resources in this post.

If you have questions or suggestions, please leave a comment.

 


About the Authors

Matt Scaer is a Principal Data Warehousing Specialist Solution Architect, with over 20 years of data warehousing experience, with 11+ years at both AWS and Amazon.com.

Manish Vazirani is an Analytics Specialist Solutions Architect at Amazon Web Services.

Tarun Chaudhary is an Analytics Specialist Solutions Architect at AWS.

Configure and optimize performance of Amazon Athena federation with Amazon Redshift

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/configure-and-optimize-performance-of-amazon-athena-federation-with-amazon-redshift/

This post provides guidance on how to configure Amazon Athena federation with AWS Lambda and Amazon Redshift, while addressing performance considerations to ensure proper use.

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Amazon Redshift as your data warehouse, you may want to integrate the two for a lake house approach. A lake house integrates the data lake and the data warehouse seamlessly. When you need to query your data lake from your Amazon Redshift data warehouse, you can use Amazon Redshift Spectrum, which works great for unifying your data lake and data warehouse. However, when you use Athena in the data lake and need to access data in Amazon Redshift for the following two commonly seen scenarios, there is no easy approach:

  • Team A has a data lake in Amazon S3 and uses Athena. They need access to the data in an Amazon Redshift cluster owned by Team B.
  • Analysts using Athena to query their data lake for analytics need agility and flexibility to access data in an Amazon Redshift data warehouse without moving the data to the Amazon S3 data lake.

In these scenarios, Athena federation with Amazon Redshift allows you to seamlessly access the data in your Amazon Redshift data warehouse without having to wait to unload the data to the Amazon S3 data lake, which removes the overhead in managing such jobs.

In this post, you walk through a step-by-step configuration to set up Athena federation using Lambda to access data in Amazon Redshift. You also see a performance benchmark analysis of interactive and ad hoc TPC-DS queries, and learn some key performance considerations and best practices when using federation.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface. The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.

Lambda lets you run code without provisioning or managing servers. You can run code for virtually any type of application with zero administration, and you pay only for the compute time you consume.

Amazon Redshift is a petabyte-scale data warehouse designed from the ground up, natively for the cloud. Amazon Redshift is the most popular and fastest cloud data warehouse. It’s integrated with your data lake, offers performance up to three times faster than any other data warehouse, and costs up to 75% less than any other cloud data warehouse.

The following diagram depicts all the data source connectors available as of this writing in the AWS Serverless Application Repository.

The AWS Serverless Application Repository is a managed repository for serverless applications. It enables you to store and share reusable applications, and easily assemble and deploy serverless architectures in powerful new ways.

You can also create a custom connector for sources that aren’t in the AWS Serverless Application Repository.

Prerequisites

Before you get started, create a secret for the Amazon Redshift login ID and password using AWS Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Choose credentials for your Amazon Redshift cluster, and set your user name and password.
  4. Choose the cluster you want to use.
  5. For Secret name, enter a name for your secret. Use the prefix AthenaJDBCFederation so it’s easy to find.
  6. Leave the remaining fields at their defaults and choose Next.
  7. Complete your secret creation (or create the secret programmatically, as sketched after these steps).
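
If you prefer to script this prerequisite instead of using the console, the following is a minimal sketch using the AWS SDK for Python (Boto3). The secret name (keeping the AthenaJdbcFederation prefix) and the key names inside the secret are assumptions for illustration; adjust them to match what your connector configuration expects. See the following code:

import json
import boto3

secrets = boto3.client('secretsmanager')

# Assumed secret name and key names; replace the values with your own credentials.
secrets.create_secret(
    Name='AthenaJdbcFederation/redshift-cluster',
    SecretString=json.dumps({'username': 'awsuser', 'password': 'MySamplePassword1'})
)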

Setting up your S3 bucket

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, use the name myworkspace0009/athenafederation.

Configuring Athena federation with Amazon Redshift

To configure Athena federation with Amazon Redshift, complete the following steps:

  1. On the AWS Serverless Application Repository, choose Available applications.
  2. In the search field, enter athena federation.

  1. Choose the connector application from the search results.
  2. In the Application settings section, provide the following details:
  3. Application name – AthenaRedshiftConnector
  4. SecretNamePrefix – AthenaJdbcFederation
  5. SpillBucket – myworkspace0009/athenafederation
  6. JDBCConnectorConfig – Redshift://jdbc:Redshift://<YourAmazonRedshiftHostname>:5439/<DBName>?user=sample2&password=sample2
  7. DisableSpillEncryption – False
  8. LambdaFunctionName – rstpcds30
  9. SecurityGroupID – Security group ID where Amazon Redshift is deployed
  10. SpillPrefix – Leave default
  11. Subnetids – Subnets where Amazon Redshift is running, separated by commas
  12. Select the I acknowledge check box.
  13. Choose Deploy.

In the next steps, you configure an Amazon Virtual Private Cloud (Amazon VPC) endpoint for Amazon S3 to allow Lambda to write federated query results to Amazon S3.

  1. On the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. Choose the VPC for your endpoint.

  1. Make any necessary security changes as per your security requirements.

  1. Choose Create endpoint.

Running federated queries with Athena

To start running federated queries, complete the following steps:

  1. On the Athena console, choose Workgroups.
  2. If you don’t see a workgroup called AmazonAthenaPreviewFunctionality, create one.

When this feature becomes generally available, you won’t need to use this workgroup name.

  1. Run your queries, using lambda:rstpcds30 as the catalog to query tables in Amazon Redshift (see the following example).
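
The following is a minimal Python (Boto3) sketch of submitting such a federated query. The catalog name lambda:rstpcds30 matches the LambdaFunctionName deployed earlier; the schema, table names, and S3 output location are placeholders for illustration:

import boto3

athena = boto3.client('athena')

# Placeholder schema and tables: store_sales lives in Amazon Redshift (via the connector),
# while customer is assumed to be a data lake table in the Glue Data Catalog.
query = '''
SELECT c.c_customer_id, SUM(ss.ss_net_paid) AS total_net_paid
FROM "lambda:rstpcds30".public.store_sales ss
JOIN customer c ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_customer_id
LIMIT 10
'''

response = athena.start_query_execution(
    QueryString=query,
    WorkGroup='AmazonAthenaPreviewFunctionality',
    ResultConfiguration={'OutputLocation': 's3://myworkspace0009/athena-query-results/'}
)
print(response['QueryExecutionId'])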

Athena query performance comparison

Several customers have asked us for performance insights and prescriptive guidance on how queries in Athena compare against federated queries and how to use them. In this section, we use a TPC-DS 3 TB standard dataset and a select few queries that fall in the category of ad hoc and interactive. The comparison of their performance should give you an idea of what to expect when running federated queries against Amazon Redshift.

For the following tests, we used a 3 TB TPC-DS dataset in an Amazon S3 data lake, stored in compressed Parquet format, partitioned, and served by Athena, and the same 3 TB TPC-DS dataset on an Amazon Redshift cluster running four RA3.4XL nodes.

The following table summarizes the dataset sizes:

Dataset – Table Size (Records)
store_sales – 8.6 billion
customer – 30 million
customer_address – 15 million
customer_demographics – 1.92 million
item – 360,000
date_dim – 73,000
store – 1,350

We ran the following four tests:

  • T1 – Queries ran in Athena without federation. All table data is in Amazon S3.
  • T2 – Queries ran in Athena with federation to Amazon Redshift. All table data is in Amazon S3, except the store_sales fact table in Amazon Redshift.
  • T3 – Queries ran in Athena with federation to Amazon Redshift. All tables and data are in Redshift.
  • T4 – Queries ran in Amazon Redshift without federation. All tables and data are in Redshift.

The following graph represents the performance of some of the ad hoc and interactive TPC-DS queries.

In the preceding graph, all T3 queries timed out at 900 seconds, depicted by the pink reference line, due to the Lambda 900-second timeout limit. This is due to overhead from store_sales fact data that needed to be transferred back to Athena.

The following graph removes T3 from the visualization, which gives better visibility when comparing the other tests.

Notice that the T1 and T2 queries completed in almost the same time, while the T4 queries ran significantly faster.

Amazon Redshift outperforms Athena with extremely low latency and should be the tool of choice if you need very low SLAs for analytics queries that Athena can't achieve.

The following graph shows the data scanned in Amazon S3 for T1 and T2, which outlines why there isn’t much difference in query performance when compared to federated queries.

For the T2 federated queries, a small amount of dimension data is filtered in Amazon Redshift and brought back to Athena, instead of scanning the entire dimension tables. This pattern is typical for many ad hoc and interactive queries.

The performance of these TPC-DS queries between T1 and T2 is comparable because very little data is transferred back to Athena. You can see a similar behavior in several ad hoc and interactive query use cases because they use limited dimensions and scan a small subset of dimension data. Due to the 900-second timeout for the Lambda instances that connect to Amazon Redshift, it’s advised to minimize the amount of data the query brings back. Although Athena uses multiple Lambda instances in parallel to run your federated query, it’s also important to make sure the Amazon Redshift WLM queue has enough slots to process it, thereby not leading to queue wait time. For example, in some of the preceding queries, 20 Lambda executions were connecting to Amazon Redshift concurrently.

Key performance best practice considerations

When considering Athena federation with Amazon Redshift, take into account the following best practices:

  • Athena federation works great for queries with predicate filtering because the predicates are pushed down to Amazon Redshift. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from Amazon Redshift to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Amazon Redshift to your Amazon S3 data lake (a scripted sketch follows this list).
  • Star schema is a commonly used data model in Amazon Redshift. In the star schema model, unload your large fact tables into your data lake and leave the dimension tables in Amazon Redshift. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It's important to monitor the Amazon Redshift WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Amazon Redshift cluster so that spikes in concurrent connections are absorbed by scaling out rather than queuing.
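
If you want to script the unload recommended above, the following is a minimal Python (Boto3) sketch that uses the Amazon Redshift Data API (not otherwise covered in this post) to unload a large fact table as Parquet files into the data lake. The cluster identifier, database, secret, IAM role, and bucket names are placeholders:

import boto3

redshift_data = boto3.client('redshift-data')

# Placeholder UNLOAD statement: adjust the table, target prefix, and IAM role for your environment.
unload_sql = '''
UNLOAD ('SELECT * FROM store_sales')
TO 's3://my-datalake-bucket/tpcds/store_sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT AS PARQUET
'''

redshift_data.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    SecretArn='arn:aws:secretsmanager:us-east-1:123456789012:secret:my-redshift-secret',
    Sql=unload_sql
)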

Conclusion

In this post, you learned how to configure and use Athena federation with Amazon Redshift using Lambda. Now you don’t need to wait for all the data in your Amazon Redshift data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries. You can use the best practice considerations outlined in the post to minimize the data transferred from Amazon Redshift for better performance. When queries are well written for federation, the performance penalties are negligible, as observed in the TPC-DS benchmark queries in this post. Happy query federating!

 


About the Author

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

Automate dataset monitoring in Amazon QuickSight

Post Syndicated from Ginni Malik original https://aws.amazon.com/blogs/big-data/automate-dataset-monitoring-in-amazon-quicksight/

Amazon QuickSight is an analytics service that you can use to create datasets, perform one-time analyses, and build visualizations and dashboards. In an enterprise deployment of QuickSight, you can have multiple dashboards, and each dashboard can have multiple visualizations based on multiple datasets. This can quickly become a management overhead to view all the datasets’ status with their latest refresh timestamp.

This post demonstrates how to visualize datasets associated with all the dashboards in your account, with their latest refresh status and refresh time.

Solution overview

The following diagram illustrates the architecture of the solution.

The architecture includes the following steps:

  1. You create the datasets and tag them via an AWS Lambda function.
  2. A second function gets the refresh status from the tagged datasets.
  3. The function stores the refresh status in Amazon Simple Storage Service (Amazon S3).
  4. You query the refresh status in Amazon Athena.
  5. You visualize the refresh status in QuickSight.

A QuickSight deployment can have multiple dashboards and each dashboard can have multiple datasets associated with it. You can end up having hundreds of datasets. It’s difficult to know if all the underlying datasets are refreshing as required unless you check them manually. However, QuickSight sends email notifications to the dataset owner on its dataset refresh failure. This solution provides a holistic view of all datasets’ refreshes.

The aim is to create a dashboard that monitors the refresh of the existing datasets and provides refresh status for the datasets.

To implement the solution, you must create the following:

  • A Lambda execution role for QuickSight.
  • A scheduled Lambda function to tag the datasets.
  • A scheduled Lambda function to get the last refresh status of the datasets and store it in Amazon S3.
  • An external table in Athena on top of the S3 bucket.
  • A QuickSight dashboard using Athena as the data source, which provides the datasets’ last refresh status.

This post assumes that you have existing analyses and dashboards with numerous datasets.

Creating a Lambda execution role for QuickSight

Your first step is to create a Lambda execution role that allows you to perform tagging and create QuickSight analysis, datasets, and data sources. The role should be able to describe and update them. The following code is an example role policy (replace the bucket name with the bucket for storing the QuickSight ingestion results):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "quicksight:CreateDashboard",
                "quicksight:List*",
                "quicksight:Describe*",
                "quicksight:Tag*",
                "tag:GetResources"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": [
                "arn:aws:s3:::bucket-name-to-store-ingestion/*"
            ]
        }
    ]
}

Creating a scheduled Lambda function to tag the datasets

The next step is to identify all the datasets required for your dashboard and tag them. It’s easier to do this right after you create the dataset. Complete the following steps:

  1. On the QuickSight console, choose Manage data.

  1. Choose your dataset and choose Edit dataset.

  1. Record the dataset ID from the URL (data-sets/<dataset ID>/prepare).

Alternatively, you can use a Lambda function to find the dataset name and ID. See the following code (replace the AwsAccountID with your ID):

import boto3
client = boto3.client('quicksight')
def lambda_handler(event, context):
    response=client.list_data_sets(
        AwsAccountId='1234567890',
        MaxResults=100
    )
    for r in response['DataSetSummaries']:
        dataset_info={}
        dataset_info['id']=r['DataSetId']
        dataset_info['name']=r['Name']
        print (dataset_info['name']+':'+dataset_info['id'])

The function provides all the datasets in your account. Make sure to record the dataset IDs specific to your dashboard.

  1. Create your Lambda function.
  2. Tag the datasets per your individual dashboards. See the following code (use the target dashboard name and ID to create the tagging key, and replace the dataset_ids and account number with your own):
import boto3

client = boto3.client('quicksight')
acct_id = '123456678788'

def lambda_handler(event, context):
    # Dataset IDs that belong to the target dashboard
    dataset_ids = ['0e994f54-8d08-4b64-98ca-195cf7b46077', '16d5bf20-4415-42d1-b54c-9aba95b13d67', '5c5fd93a-0bb6-468f-a0c4-ff1c15597d20']
    for i in dataset_ids:
        # Tag each dataset with the dashboard it feeds
        response = client.tag_resource(
            ResourceArn='arn:aws:quicksight:us-east-1:{}:dataset/{}'.format(acct_id, i),
            Tags=[
                {
                    'Key': 'DashboardName',
                    'Value': 'QuickSight_refresh_status_demo'
                }
            ]
        )

You can do this for all your dashboards. The only limitation is that a dataset can be tagged with only one dashboard name key pair.

If you tag the datasets with a wrong key, you can remove them using an untag call and replace the ResourceArn with the specific dataset ARN. See the following code:

response = client.untag_resource(
    ResourceArn='arn:aws:quicksight:us-east-1:123456678788:dataset/794e28ae-2b89-49ef-b885-196c95bfd4f8',
    TagKeys=[
        'DashboardName'
    ]
)

Creating a Lambda function to get the last refresh status

The next step is to configure a Lambda function that gets the last refresh status of the tagged datasets and loads it into Amazon S3. You use resourcegroupstaggingapi to get back all the resources with a particular key. For this post, the key is the DashboardName. From the response of the ResourceTagMappingList, you filter out the dataset ID and dataset ARN. You also get the data source ARN and name for each dataset associated with the particular key value. Finally, you list the ingestions for all the datasets and classify them as one of the following:

  • Failed – The last refresh failed.
  • Did not run within last 24 hours – No ingestion ID in the last 24 hours (the time is configurable). You explicitly use this status even if the previous run before the last 24 hours succeeded or failed. This makes sure the datasets adhere to a certain refresh schedule. For this post, you want the datasets to refresh one time a day.
  • Error – No ingestion ID for more than 90 days.

See the following code (replace the placeholder text with your specific values):

import boto3
import jmespath
from botocore.exceptions import ClientError
from datetime import datetime

s3 = boto3.client('s3')
client = boto3.client('resourcegroupstaggingapi')
client1 = boto3.client('quicksight')
AwsAccountId='123456678788'
def lambda_handler(event, context):
    items=[]
    tagfilters=[
        {
            'Key': 'DashboardName',
            'Values': [
                'QuickSight_refresh_status_demo'
                            #add dashboard name
            ]
        },
    ]


    response = client.get_resources(
    TagFilters=tagfilters
    )
    
    # Get the response back for each of the above listed Key Values
    resources = response['ResourceTagMappingList']
    for resource in resources:
        data={}
        data['resource_ARN']=resource['ResourceARN']
        dataset_id_arn=data['resource_ARN'].split('/')
        data['dataset_id']=dataset_id_arn[1]
        # For each of the above dataset , describe the dataset to get the data source
        response = client1.describe_data_set(
                 AwsAccountId=AwsAccountId,
                 DataSetId=data['dataset_id']
        )

        datasourcearn = jmespath.search('DataSet.PhysicalTableMap.*.*.DataSourceArn',response)
        datasourcearnid= str(datasourcearn[0]).split('/')
        datasourcearnid=datasourcearnid[1]
        datasourcearnid=datasourcearnid.replace("']",'')
        data['DatasetName']=response.get('DataSet').get('Name')
        response = client1.describe_data_source(
            AwsAccountId=AwsAccountId,
            DataSourceId=datasourcearnid
        )
        datasourcename=response.get('DataSource').get('Name')
        data['DataSourceName']=datasourcename
        resource_tags = resource['Tags']
        for tag in resource_tags:

                if tag['Key'] == 'DashboardName':
                    data['dashboard_name'] = tag['Value']

        response1= client1.list_ingestions(
            DataSetId=data['dataset_id'],
            AwsAccountId=AwsAccountId,
            MaxResults=1  # To get the latest ingestion, if you want history you can change this number
        )

        if response1.get('Ingestions'):
            for i in response1['Ingestions']:
                data['IngestionId']=i['IngestionId']
                data['CreatedTime']=i['CreatedTime']
                try:
                    response = client1.describe_ingestion(DataSetId=data['dataset_id'],IngestionId=data['IngestionId'],AwsAccountId=AwsAccountId)

                    if  response:
                        data['Time']=str(response['Ingestion']['CreatedTime'])
                        if ((datetime.utcnow() - response['Ingestion']['CreatedTime'].replace(tzinfo=None)).total_seconds()) >= (24*60*60):  #Check the refresh status within last 24 hours, you can change this per your requirement
                            data['Status']='Did not run within last 24 hrs'
                        else:
                            data['Status']=response['Ingestion']['IngestionStatus']
                except ClientError as e:
                    data['Time']='Failed, Check if dataset is being used'
                    data['Status']=e.response['Error']['Message']

            items.append(data)





    row=['DashboardName,DatasetName,Status,Time,DataSourceName']
    csv_key='quicksight-dashboard-metada/report.csv'



    for data in items:

        row.append(data['dashboard_name'] +','+data['DatasetName']+','+data['Status']+','+data['Time']+','+data['DataSourceName'])
    values = '\n'.join(str(v) for v in row)

    response = s3.put_object(
        Body=values,
        Bucket='bucketname',
        Key=csv_key
    )

The last status run is now stored in a .csv file in the specific bucket mentioned in the Lambda function (see the following screenshot).

You can also schedule your function to run at a certain frequency, depending on when you want to check the status.
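
For example, the following Python (Boto3) sketch creates a CloudWatch Events (Amazon EventBridge) rule that invokes the function once a day. The function name and ARN are placeholders for your own status-checking function:

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

function_arn = 'arn:aws:lambda:us-east-1:123456678788:function:quicksight-refresh-status'  # placeholder

# Create (or update) a rule that fires once a day.
rule_arn = events.put_rule(
    Name='quicksight-refresh-status-daily',
    ScheduleExpression='rate(1 day)'
)['RuleArn']

# Allow the rule to invoke the function.
lambda_client.add_permission(
    FunctionName='quicksight-refresh-status',
    StatementId='AllowEventBridgeInvoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)

# Point the rule at the function.
events.put_targets(
    Rule='quicksight-refresh-status-daily',
    Targets=[{'Id': 'refresh-status-target', 'Arn': function_arn}]
)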

Creating an external table in Athena on top of Amazon S3

Now you can create an external Athena table on top of the .csv file you stored in Amazon S3 and query it. Use the following table definition for reference (replace the location with the location of your S3 bucket):

CREATE EXTERNAL TABLE IF NOT EXISTS qs_meta_table
( `DashboardName` string, 
 `DatasetName` string, 
 `Status` string, 
 `LastRefreshTime` string, 
 `DataSourceName` string ) 
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES 
 ( "separatorChar" = "," )
 LOCATION 's3://bucketname/quicksight-dashboard-metada/' 
 TBLPROPERTIES ( "skip.header.line.count"="1")

You can get the latest status of the dataset refreshes by querying the table with SQL in Athena.
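
If you want to run that check programmatically, the following Python (Boto3) sketch submits a query against the table and prints the result rows; the database name and S3 output location are placeholders:

import time
import boto3

athena = boto3.client('athena')

# Placeholder database and output location; qs_meta_table is the external table created above.
query_id = athena.start_query_execution(
    QueryString='SELECT * FROM qs_meta_table',
    QueryExecutionContext={'Database': 'default'},
    ResultConfiguration={'OutputLocation': 's3://bucketname/athena-query-results/'}
)['QueryExecutionId']

# Poll until the query finishes, then print the result rows (skipping the header row).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows'][1:]:
    print([col.get('VarCharValue', '') for col in row['Data']])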

Creating a QuickSight dashboard using Athena as the data source

To visualize this data and share it with others, build a dashboard on top of the data in QuickSight. The following screenshot shows the listed dashboards.

You first create a dataset for the Athena table.

  1. On the QuickSight console, choose Manage data.
  2. Choose Create dataset.

You use Athena as the source for your dataset. If you don’t have an existing Athena data source, you can create a new one. For instructions, see Creating a Data Source.

  1. Choose the table you just created.

  1. Select Import to SPICE for quicker analysis.

Depending on the size of your dataset and expected latency, you can choose Directly query your data instead. If you use SPICE, remember to add a refresh schedule for the dataset.

  1. Create an analysis from the dataset.

For this post, choose a table visual type and drag all the columns to the Value field well.

You can create the visualization as in the following screenshot, with conditional formatting to highlight failed and successful loads.

  1. To publish the dashboard, choose Share on the application bar of the analysis.
  2. Choose Publish dashboard.

  1. For Publish new dashboard as, enter a name for your dashboard.

You can now share the dashboard with end-users.

Conclusion

In this post, we described how to create a QuickSight dashboard that can track the last refresh status of all the datasets in your account. The dashboard provides a single pane view of the status of all the datasets and avoids the manual effort of opening and checking each individual dataset.

 


About the authors

Ginni Malik is an Associate Cloud Developer with AWS.

Rohan Jamadagni is a Solutions Architect with AWS.

Speed up data ingestion on Amazon Redshift with BryteFlow

Post Syndicated from Pradnya Bhandary original https://aws.amazon.com/blogs/big-data/speed-up-data-ingestion-on-amazon-redshift-with-bryteflow/

This is a guest post by Pradnya Bhandary, Co-Founder and CEO at Bryte Systems.

Data can be transformative for an organization. How and where you store your data for analysis and business intelligence is therefore an especially important decision that each organization needs to make. Should you choose an on-premises data warehouse solution or embrace the cloud?

On-premises data warehouses require servers in your data center, a team to manage them, and a large initial investment that tries to accommodate current and future data needs. Cloud data warehouses, on the other hand, are fully managed and can start with a small investment and grow as your business demands. You don't need to provision for future capacity, only for present needs, with the peace of mind that extra capacity can be added overnight if needed.

One such cloud data warehouse is Amazon Redshift, the most popular cloud data warehouse. It's a fully managed, petabyte-scale data warehouse designed for large-scale dataset storage and analysis, and it provides agility, flexibility, and cost-effectiveness.

To use Amazon Redshift effectively, getting your data efficiently from various data silos to the data warehouse is critical, because this determines how quickly you can access the data. Do you have to wait until the data is loaded and make do with stale data, or can you access your data across the organization in near-real time to derive fresh and rich data insights?

In this post, we explain how Origin Energy followed some of the best practices for data ingestion for Amazon Redshift and how they achieved faster ingestion rates into Amazon Redshift using the BryteFlow software.

Origin Energy

Origin Energy, a leading energy provider, was finding that its on-premises data warehouse was struggling to support growing data demands. They also needed to unlock siloed data from legacy databases like SAP, Oracle, SQL Server, MySQL, and more. Access to data was fragmented and time consuming.

Moving to a cloud enterprise analytics environment with centralized data access was the only viable option to support their data initiatives. For more information, see the Origin Energy case study.

To speed up data ingestion on Amazon Redshift, they followed data ingestion best practices.

Log-based CDC mechanism to get data to Amazon Redshift

When data is replicated from a source database to a target that could be another database, data warehouse, or cloud data storage object, changes to the data in the source need to be captured and replicated to the destination to keep data consistent and trustworthy. Change data capture (CDC) makes this possible. CDC captures the changes in the source data and updates only the data in the destination that has changed. This does away with the tedious task of bulk load updating and enables real-time data integration. It’s a continual, extremely reliable process and has no impact on source systems.

Parallel multi-threaded initial sync

When doing an initial ingest of data, especially of exceptionally large datasets to Amazon Redshift, parallel, multi-thread syncing to replicate data is extremely helpful in cutting down total data ingestion time. Data replication proceeds in parallel on multiple threads with optimized extraction and loading.

Parallel multi-threaded log-based capture and merge for Oracle

When source systems generate a large amount of incremental data, and transactional logs are written very often, the CDC mechanism or mining of logs for incremental data can lag behind if it can’t keep up with the source throughput. You can configure BryteFlow to have multiple parallel threads for mining. Furthermore, you can configure BryteFlow so that the logs can be mined on a completely different server and therefore there is zero load on the source and data replication is much faster with parallelism. You can make the transaction logs available on a shared mount on a remote server you spin up, and mine logs on this server. This puts zero load on the source systems and operational systems and they aren’t impacted even with mining huge volumes of data (or volumes of incremental data).

Best practices for loading data to Amazon Redshift

The following are the three best practices for loading data into Amazon Redshift.

Split large files into multiple files for high-performance loads

Amazon Redshift is a massively parallel processing (MPP) data warehouse, where several compute nodes work in parallel to ingest the data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. The following table shows the different node types and the number of slices.

Node Type – Default Slices per Node
Dense Compute DC2
  dc2.large – 2
  dc2.8xlarge – 16
Dense Storage DS2
  ds2.xlarge – 2
  ds2.8xlarge – 16
RA3 Nodes
  ra3.4xlarge – 4
  ra3.16xlarge – 16

After you extract data into files, you can compress the files and split a single file to multiple files according to the number of slices, so that files are loaded with the compute being distributed evenly across the slices on Amazon Redshift. The number of multiple files is a configurable parameter in BryteFlow that can be set depending on the Amazon Redshift node types. The COPY command that ingests data into Amazon Redshift is configured optimally for fast data loads.
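
BryteFlow handles this splitting and compression automatically. If you ever need to prepare files yourself, the following Python sketch illustrates the idea by splitting a delimited file into one gzip part per slice; the file names and slice count are assumptions for illustration:

import gzip
import os

def split_and_compress(input_path, output_dir, num_slices):
    """Split a large delimited file into num_slices gzip parts so that the
    Amazon Redshift COPY command can load one part per slice in parallel."""
    os.makedirs(output_dir, exist_ok=True)
    outputs = [
        gzip.open(os.path.join(output_dir, 'part_{:04d}.gz'.format(i)), 'wt')
        for i in range(num_slices)
    ]
    try:
        with open(input_path, 'rt') as src:
            for line_no, line in enumerate(src):
                # Round-robin rows across the output parts.
                outputs[line_no % num_slices].write(line)
    finally:
        for out in outputs:
            out.close()

# Example: a four-node ra3.4xlarge cluster has 4 slices per node, so 16 slices in total.
split_and_compress('store_sales.csv', 'parts/', num_slices=16)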

Automatic creation of tables, default distribution keys, and distribution style

Data is distributed among the nodes on the basis of distribution style and distribution key of a particular table in Amazon Redshift. An even distribution of data enables Amazon Redshift to assign the workload evenly to slices and maximizes the benefit of parallel processing. This also helps during data ingestion. When ingesting data, BryteFlow automatically creates tables with the right DDL on Amazon Redshift. It also creates default distribution keys and distribution style, so that table ingestion is highly performant.

Use optimum sort keys for efficient loads and columnar storage

Data in the Amazon Redshift data warehouse is stored in a columnar fashion, which drastically reduces the number of disk I/O requests and minimizes the amount of data loaded into the memory to run a query. Reduction in I/O speeds up queries, and loading less data means Amazon Redshift can perform more in-memory processing. Using the optimum table sort keys is the best practice for efficient loads.
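
To illustrate the table design behind these two practices, the following is a hypothetical DDL sketch submitted through the Amazon Redshift Data API. The table definition, cluster identifier, database, and user are placeholders; BryteFlow generates equivalent DDL automatically when it creates tables:

import boto3

redshift_data = boto3.client('redshift-data')

# Placeholder DDL with an explicit distribution key and sort key.
ddl = '''
CREATE TABLE store_sales (
    ss_sold_date_sk INTEGER,
    ss_item_sk      INTEGER NOT NULL,
    ss_customer_sk  INTEGER,
    ss_net_paid     DECIMAL(7,2)
)
DISTSTYLE KEY
DISTKEY (ss_item_sk)
SORTKEY (ss_sold_date_sk)
'''

redshift_data.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=ddl
)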

Automatic sync on Amazon Redshift

CDC is an important element of syncing data with Amazon Redshift. BryteFlow automatically merges changes on Amazon Redshift, maintaining type2 history (if configured), with high performance. This means that data is ready to use as soon as it is ingested, without running lengthy ETL processes on it. The following diagram illustrates the type2 history feature.

Metadata for every extract and load can be captured on Amazon Aurora

Details on each extract and load process (for example, table names, the number of records affected, and the start and end times) are critical operational metadata that's very useful in determining performance, tuning, and triggering other ETL processes. If this operational metadata is maintained on Amazon Redshift with every extract and load, constant single-row inserts and updates can hamper performance drastically, because Amazon Redshift is a columnar database and not an OLTP system.

At BryteFlow, we found that the best practice is to keep the operational metadata in an Amazon Aurora database, which is OLTP in nature and can store this metadata with constant updates and inserts and low latency.

Build your data lake on Amazon S3 and automatically query using Amazon Redshift Spectrum

BryteFlow enables you to build a continually refreshing data lake at scale on Amazon Simple Storage Service (Amazon S3) with continual replication and transformation of data. You can configure BryteFlow to use Amazon EMR on the incremental data with data on Amazon S3 and automatically merge and transform your data with an intuitive GUI. The EMR cluster can scale up or down depending on your data needs. You can then query data automatically on Amazon S3 using Amazon Redshift Spectrum.

Offload data ingestion and data preparation on Amazon S3 and load to Amazon Redshift

BryteFlow Blend helps with real-time data preparation of data ingested by BryteFlow, using Apache Spark on Amazon EMR with an intuitive GUI. You can load data prepared on Amazon S3 to Amazon Redshift via BryteFlow Blend or make it accessible to Amazon Redshift via Amazon Redshift Spectrum. This helps reserve the computational resources of Amazon Redshift for the actual querying (queries run much faster) while the Amazon S3 data lake handles data integration. The following diagram illustrates the distributed data integration architecture.

BryteFlow software supports all of the preceding ingestion best practices. Origin Energy used the BryteFlow software to build their analytics platform on Amazon Redshift. Origin’s data access has improved from several days to mere hours. BryteFlow software has helped achieve accurate data replication on the AWS Cloud with low latency, facilitating faster time-to-market for new and highly personalized customer offerings and a significant reduction in data costs.

Conclusion

Amazon Redshift delivers fast performance at scale for the most demanding workloads. Ingesting and preparing your data to Amazon Redshift using the BryteFlow software makes this an extremely attractive value proposition. You can liberate your data across data silos and quickly unlock the value on Amazon Redshift.

To see how this works for your project, you can get a free trial from our website. We offer complete support on your free trial including screen sharing, online support, and consultation.

Alternatively, you can go to AWS Marketplace for a free trial of BryteFlow Standard Edition or BryteFlow Enterprise Edition. Contact us to let us know so we can assist you through the trial.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Pradnya Bhandary is the Co-Founder and CEO at Bryte Systems.

Stream, transform, and analyze XML data in real time with Amazon Kinesis, AWS Lambda, and Amazon Redshift

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/stream-transform-and-analyze-xml-data-in-real-time-with-amazon-kinesis-aws-lambda-and-amazon-redshift/

When we look at enterprise data warehousing systems, we receive data in various formats, such as XML, JSON, or CSV. Most third-party system integrations happen through SOAP or REST web services, where the input and output data format is either XML or JSON. When applications deal with CSV or JSON, it becomes fairly simple to parse because most programming languages and APIs have direct support for CSV or JSON. But for XML files, we need to consider a custom parser, because the format is custom and can be very complex.

When systems interact with each other and process data through different pipelines, they expect real-time processing or availability of data, so that business decisions can be instant and quick. In this post, we discuss a use case where XMLs are streamed through a real-time processing system and can go through a custom XML parser to flatten data for easier business analysis.

To demonstrate the implementation approach, we use AWS cloud services like Amazon Kinesis Data Streams as the message bus, Amazon Kinesis Data Firehose as the delivery stream with Amazon Redshift data warehouse as the target storage solution, and AWS Lambda as the record transformer for Kinesis Data Firehose, which flattens the nested XML structure with a custom parser script in Python.

AWS services overview

This solution uses AWS services for the following purposes:

  • Kinesis Data Streams is a massively scalable and durable real-time data streaming service. It can continuously capture gigabytes of data per second from hundreds of thousands of sources, such as website click-streams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more. We use Kinesis Data Streams because it’s a serverless solution that can scale based on usage.
  • Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools. It can capture, transform, and load streaming data into Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), and Splunk, enabling near-real-time analytics with existing business intelligence (BI) tools and dashboards you’re already using today. It’s a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security. In our use case, our target storage layer is Amazon Redshift, so Kinesis Data Firehose fits great to simplify the solution.
  • Lambda is an event-driven, serverless computing platform provided by AWS. It’s a computing service that runs code in response to events and automatically manages the computing resources required by that code. Lambda supports multiple programming languages, and for our use case, we use Python 3.8. Other options include Amazon Kinesis Data Analytics with Flink, Amazon EMR with Spark streaming, Kinesis Data Firehose, or a custom application based on Kinesis consumer library. We use Kinesis Data Firehose as the consumer in this use case, with AWS Lambda as the record transformer, because our target storage is Amazon Redshift, which is supported by Kinesis Data Firehose.
  • Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. For our use case, we use Amazon S3 as an intermediate storage before loading to the data warehousing system, so that it’s fault tolerant and provides better performance while loading to Amazon Redshift. By default, Kinesis Data Firehose requests an intermediate S3 bucket path when Amazon Redshift is the target.
  • Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing BI tools. In our use case, we use Amazon Redshift so that BI tools like Amazon QuickSight can easily connect to Amazon Redshift to build real-time dashboards.

Architecture overview

The following diagram illustrates the simple architecture that you can use to implement the solution.

The architecture includes the following components:

  • The Amazon Kinesis Producer Library (KPL) represents the system that pushes data to Kinesis Data Streams. It can be a simple Amazon Elastic Compute Cloud (Amazon EC2) machine or your local Windows command line that executes the Kinesis Data Streams command line interface (CLI) to push messages. Alternatively, it can be a dynamic application that uses Kinesis Data Streams APIs or KPL to push messages dynamically. For our use case, we spin up an EC2 instance through AWS Cloud9 and use Kinesis Data Streams CLI commands to publish messages.
  • Kinesis Data Streams receives messages against a partition key from the publisher and waits for consumers to consume it. By default, the retention period of the messages in Kinesis Data Streams is 24 hours, but you can extend it to 7 days.
  • Kinesis Data Firehose takes a few actions:
    • Consumes data from Kinesis Data Streams and writes the same XML message into a backup S3 bucket.
    • Invokes a Lambda function that acts as a record transformer. Lambda receives input as XML, applies transformations to flatten it to be pipe-delimited content, and returns it to Kinesis Data Firehose.
    • Writes the pipe-delimited content to another S3 bucket, which acts as an intermediate storage bucket before writing into Amazon Redshift.
    • Invokes the Amazon Redshift COPY command, which takes pipe-delimited data from the intermediate S3 bucket and writes it into Amazon Redshift.
  • Data is inserted into the Amazon Redshift table, which you can query for data analysis and reporting.

Solution overview

To implement this solution, you complete the following steps:

  1. Set up the Kinesis data stream as the message bus.
  2. Set up KPL, which publishes sample XML message data to Kinesis Data Streams.
  3. Create an Amazon Redshift cluster, which acts as target storage for the Firehose delivery stream.
  4. Set up the delivery stream, which uses Lambda for record transformation and Amazon Redshift as target storage.
  5. Customize a Lambda function script that converts the nested XML string to a flat pipe-delimited stream.

Prerequisites

Before beginning this tutorial, make sure you have permissions to create Kinesis data streams and publish messages to the streams.

Setting up your Kinesis data stream

You can use the AWS Management Console to create a data stream as a one-time activity. You can configure the stream capacity (number of shards) per your requirement, but start with the minimum and apply auto scaling as the data volume increases. Auto scaling is based on Amazon CloudWatch metrics. For more information, see Scale Amazon Kinesis Data Streams with AWS Application Auto Scaling.
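
If you prefer to script the stream creation instead of using the console, the following is a minimal Python (Boto3) sketch; the stream name and shard count are assumptions:

import boto3

kinesis = boto3.client('kinesis')

# Placeholder stream name; start with a single shard and scale out as volume grows.
kinesis.create_stream(StreamName='xml-ingest-stream', ShardCount=1)

# Wait until the stream is ACTIVE before publishing records.
kinesis.get_waiter('stream_exists').wait(StreamName='xml-ingest-stream')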

Setting up KPL

For this use case, we use the AWS Cloud9 environment IDE, where through the Linux command line, we can execute Kinesis Data Streams CLI commands to publish sample XML messages. The following code shows an example XML of an employee record that has one-level nesting for the all_addresses attribute:

aws kinesis put-record --stream-name <Stream-Name> --data "<employees><employee><first_name>FName 1</first_name><last_name>LName 1</last_name><all_address><address><type>primary</type><street_address>Street Address 1</street_address><state>State 1</state><zip>11111</zip></address><address><type>secondary</type><street_address>Street Address 2</street_address><state>State 2</state><zip>11112</zip></address></all_address><phone>111-111-1111</phone></employee><employee><first_name>FName 2</first_name><last_name>LName 2</last_name><all_address><address><type>primary</type><street_address>Street Address 3</street_address><state>State 3</state><zip>11113</zip></address><address><type>secondary</type><street_address>Street Address 4</street_address><state>State 4</state><zip>11114</zip></address></all_address><phone>111-111-1112</phone></employee></employees>" --partition-key <partition-key-name>

You need to change the stream name, XML data, and partition key in the preceding code as per your use case. Also, instead of an AWS Cloud9 environment, you have additional ways to submit messages to the data stream:

  • Use an EC2 instance to execute the Kinesis Data Streams CLI command
  • Use KPL or Kinesis Data Streams APIs in any programming language to submit messages dynamically through your custom application

Creating an Amazon Redshift cluster

In this step, you create an Amazon Redshift cluster that has required permissions and ports open for Kinesis Data Firehose to write to it. For instructions, see Controlling Access with Amazon Kinesis Data Firehose.

Make sure the cluster has the required port and permissions so that Kinesis Data Firehose can push data into it. Also make sure the table schema you create matches the pipe-delimited format that Lambda creates as output, which Kinesis Data Firehose uses to write to Amazon Redshift.
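
The following is a hypothetical DDL sketch for such a table, submitted through the Amazon Redshift Data API. The column names match the query shown later in this post; the column sizes, cluster identifier, database, and user are placeholders:

import boto3

redshift_data = boto3.client('redshift-data')

# The column order must match the pipe-delimited output produced by the Lambda transformer.
ddl = '''
CREATE TABLE employees (
    first_name               VARCHAR(50),
    last_name                VARCHAR(50),
    phone                    VARCHAR(20),
    primary_address_street   VARCHAR(100),
    primary_address_state    VARCHAR(50),
    primary_address_zip      VARCHAR(10),
    secondary_address_street VARCHAR(100),
    secondary_address_state  VARCHAR(50),
    secondary_address_zip    VARCHAR(10)
)
'''

redshift_data.execute_statement(
    ClusterIdentifier='xml-streaming-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=ddl
)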

Setting up the delivery stream

When you create your Kinesis Data Firehose delivery stream on the console, define the source as Kinesis Data Streams, the target as the Amazon Redshift cluster, and enable record transformation with Lambda.

To complete this step, you need to create an AWS Identity and Access Management (IAM) role with the following permissions for the delivery stream:

  • Read permissions from the data stream
  • Write permissions to the intermediate S3 bucket
  • Write permissions to the defined Amazon Redshift cluster

Define the following configurations for the delivery stream:

  • Enable the source record transformation, where you selected your Lambda function.

  • As an optional step, you can enable source record backup, which saves the source XML to the S3 bucket path you define.

  • Define the intermediate S3 bucket, which you use to store transformed pipe-delimited records and later use for the Amazon Redshift copy.

  • In your Amazon Redshift configurations, for COPY options, make sure to specify DELIMITER ‘|’, because the Lambda function output is pipe delimited and Kinesis Data Firehose uses that in the Amazon Redshift copy operation.

Customizing the Lambda function

This function is invoked through Kinesis Data Firehose when the record arrives in Kinesis Data Streams.

Make sure you increase the Lambda execution timeout to more than 1 minute. See the following code:

from __future__ import print_function

import base64
import xml.etree.ElementTree as ET

print('Loading function')


def lambda_handler(event, context):
    output = []

    for record in event['records']:
        payload = base64.b64decode(record['data'])
        parsedRecords = parseXML(payload)
        
        # Do custom processing on the payload here
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(parsedRecords).decode('utf-8')  # decode so the response is JSON serializable
        }
        output.append(output_record)

    print('Successfully processed {} records.'.format(len(event['records'])))
    return {'records': output}
    
    
def parseXML(inputXML):
    xmlstring =  str(inputXML.decode('utf-8'))
    
    # create element tree object
    root = ET.fromstring(str(xmlstring))
    #print("Root Tag"+root.tag)
    
    # create empty list for items 
    xmlItems = ""
  
    # iterate over employee records
    for item in root.findall('employee'):
       #print("child tag name:"+item.tag+" - Child attribute")
       
       # Form pipe delimited string, by concatenating XML values
       record = item.find('first_name').text + "|" + item.find('last_name').text + "|" + item.find('phone').text
       
       primaryaddress = ""
       secondaryaddress = ""
       
       # Get primary address and secondary address separately to be concatenated to the original record in sequence
       for addressitem in item.find('all_address').findall('address'):
           if(addressitem.find('type').text == "primary"):
               primaryaddress = addressitem.find('street_address').text + "|" + addressitem.find('state').text + "|" + addressitem.find('zip').text
           elif(addressitem.find('type').text == "secondary"):
               secondaryaddress = addressitem.find('street_address').text + "|" + addressitem.find('state').text + "|" + addressitem.find('zip').text
               
       #print("Primary Address:"+primaryaddress)
       #print("Secondary Address:"+secondaryaddress)
       
       record += "|" + primaryaddress + "|" + secondaryaddress + "\n"
       xmlItems += record
       #print("Record"+record)
    
    #print("Final Transformed Output:"+xmlItems)
    return xmlItems.encode('utf-8')

You can customize this example code to embed your own XML parser logic. Keep in mind that, while using the function, the request and response (synchronous calls) body payload size can be up to 6 MB, so it’s important to make sure the return value isn’t increased over that limit.

Your Amazon Redshift table (employees) has respective fields to capture the flattened pipe-delimited data. Your query might look like the following code to fetch and read the data:

SELECT first_name, last_name, phone, primary_address_street, primary_address_state, primary_address_zip, secondary_address_street, secondary_address_state, secondary_address_zip
FROM employees

The following screenshot shows the result of the query in the Amazon Redshift query editor.

Debugging

While setting up this framework in your development environment, you can debug individual components of the architecture with the following guidelines:

  • Use the Kinesis Data Streams Monitoring tab to validate that it receives messages and read operations are happening through the consumer (Kinesis Data Firehose). You can also use Kinesis Data Streams CLI commands to read from the stream.
  • Use the Kinesis Data Firehose Monitoring tab to check if it receives messages from Kinesis Data Streams and can push them to Amazon Redshift. You can also check for errors on the Error logs tab or directly on the Amazon CloudWatch console.
  • Validate Lambda with a test execution to check that it can transform records to the pipe-delimited format and return them to Kinesis Data Firehose in the expected format (base64 encoded).
  • Confirm that the S3 intermediate storage bucket has the transformed record and doesn’t write into failed processing or error record paths. Also, check if the transformed records are pipe delimited and match the schema of the target Amazon Redshift table.
  • Validate if the backup S3 bucket has the original XML format records. If Lambda or the delivery stream fails, you have an approach to manually reprocess it.
  • Make sure Amazon Redshift has the new data records reflected through SQL SELECT queries, and check the cluster's health on the Monitoring tab.

Conclusion

This post showed you how to integrate real-time streaming of XML messages and flatten them to store in a data warehousing system for real-time dashboards.

Although you followed individual steps for each service in your development environment, for a production setup, consider the following automation methods:

  • AWS CloudFormation allows you to embed infrastructure as code that can spin up all required resources for the project, and you can easily migrate or set up your application in production or other AWS accounts.
  • A custom monitoring dashboard can take input from each AWS service you use through its APIs and show the health of each service with the number of records being processed.

Let us know in the comments any thoughts or questions you have about applying this solution to your use cases.

 


About the Author

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Scale your cloud data warehouse and reduce costs with the new Amazon Redshift RA3 nodes with managed storage

Post Syndicated from Corina Radovanovich original https://aws.amazon.com/blogs/big-data/scale-your-cloud-data-warehouse-and-reduce-costs-with-the-new-amazon-redshift-ra3-nodes-with-managed-storage/

One of our favorite things about working on Amazon Redshift, the cloud data warehouse service at AWS, is the inspiring stories from customers about how they’re using data to gain business insights. Many of our recent engagements have been with customers upgrading to the new instance type, Amazon Redshift RA3 with managed storage. In this post, we share experiences from customers using Amazon Redshift for the first time, and existing customers upgrading from DS2 (Dense Storage 2) and DC2 (Dense Compute 2) instances to gain improvements in performance and storage capacity for the same or lower costs.

From startups to global quick service restaurants and major financial institutions, Amazon Redshift customers span across all industries and sizes, including many Fortune 500 companies. We’re proud that Amazon Redshift breaks down cost and accessibility barriers of a data warehouse, so startups and non-profits can realize the same benefits as established enterprises from running analytics at scale with Amazon Redshift.

The diverse customer base also allows the Amazon Redshift team to continue to innovate with new features and capabilities that deliver the best price performance for any use case. Use cases range from analytics that help Britain’s railways run smoothly, to providing insight into the behavior of millions of people learning a new language, playing online games, learning to code, and much more. As the world around us responds to changes in every aspect of personal and business life, Amazon Redshift is helping tens of thousands of customers respond with fast and powerful analytics.

More and more customers have gravitated to Amazon Redshift because of continued innovation, including the new generation of Amazon Redshift nodes, RA3 with managed storage. This latest generation of Amazon Redshift is unique because it introduces the ability to independently scale compute and storage with Redshift managed storage (RMS). This enables you to scale cost-effectively because you can add more data without increasing compute costs, or add more compute without increasing storage costs. This makes RA3 a cost-effective option for both steady and diverse data warehouse workloads, gives you room to grow, and maximizes performance.

New customers like Poloniex and OpenVault benefit from the flexibility of Amazon Redshift RA3

Many customers are growing and looking for a cloud data warehouse that can scale with them, easily integrate with other AWS services, and deliver great value. For customers like Poloniex and OpenVault, who are just getting started with Amazon Redshift, we recommend using the new RA3 nodes with managed storage. New customers like RA3 because you can size your data warehouse for your core workload and easily scale for spikes in users and data to balance performance and costs. For example, you can use concurrency scaling to automatically scale out when the number of queries suddenly spike up, or use elastic resize to scale up and add nodes to make queries run faster. If you’re using clusters intermittently, you can pause and resume on a schedule or manually. You can further reduce costs on steady state clusters by investing in reserved instances with a 1- or 3-year commitment.

Poloniex, one of the longest standing cryptocurrency trading platforms in the world, distributing hundreds of billions of dollars in cryptoassets, uses AWS to gain insights into how users interact with their platform and how they can improve the customer experience in trading, lending, storing, and distribution. They evaluated multiple data warehousing options and chose to work with AWS to design a lake house approach by querying and joining data across their Amazon Redshift data warehouse and Amazon Simple Storage Service (Amazon S3) data lake with the Amazon Redshift Spectrum feature.

“When we were evaluating data warehouses, we chose Amazon Redshift over Snowflake because of the transparent and predictable pricing,” says Peter Jamieson, Director of Analytics and Data Science at Poloniex. “The scalability and flexibility have been enormously valuable as we scale our analytics capability with a lean team and infrastructure. We benefit from the separation of compute and storage in the Amazon Redshift RA3 nodes because we have workflows that create a significant spike in our compute needs, especially when aggregating historical transaction data.”

Organizations are often looking to share the data and insight gained through analytics with their end-users as part of their product or service. The Software as a Service (SaaS) model enables this, and we work closely with SaaS customers to understand the value their data provides so they can use Amazon Redshift to unlock additional business value. Amazon Redshift is well positioned to build a scalable, multi-tenant SaaS solution with features that deliver consistent performance with multiple tenants sharing the same Amazon Redshift cluster.

OpenVault, a full-service technology solutions and data analytics company, enables cable, fiber, and mobile operators around the world to unlock the power of the data in their network to optimize and monetize their businesses. They shared a similar story:

“Amazon Redshift powers analytics in our SaaS solutions to provide insight that can be used to anticipate residential and business broadband trends,” says Tony Costa, EVP and CTO at OpenVault. “This makes it possible to use fast-growing broadband data to make decisions that result in revenue growth, new revenue streams, reduced operational/capital expenses, and improved quality of service for broadband operators. We chose Amazon Redshift RA3 because it is a cost-effective analytics and managed storage solution. It empowers OpenVault’s data scientists and operator customers to perform near real-time analysis of billions of rows of records and seamlessly evolve with the growing analytics needs and ad-hoc inquiries of our customers.”

If you’re new to Amazon Redshift, many resources are available to help you ramp up, including AWS employees and partners. For more information, see Getting Started with Amazon Redshift and Request Support for your Amazon Redshift Proof-of-Concept.

Duolingo, Social Standards, Yelp, Codecademy, and Nielsen get better performance and double the storage capacity at the same price by moving from Amazon Redshift DS2 to RA3

For years, customers with large data storage needs chose Amazon Redshift DS2 (Dense Storage 2) for its price-performance value. Customers such as NTT Docomo and Amazon.com ran petabyte-scale workloads in a single cluster on DS2 node types. However, as data size kept increasing exponentially, the amount of data actively being queried continued to become a smaller fraction of the total data size. You had to either keep adding nodes to store more data in the data warehouse, or retire data to Amazon S3 in a data lake. This creates operational overhead. With Amazon Redshift RA3, after the data is ingested in the cluster, it’s automatically moved to managed storage. RA3 nodes keep track of the frequency of access for each data block and cache the hottest blocks. If the blocks aren’t cached, the large networking bandwidth and precise storing techniques return the data in sub-seconds.

For customers like Duolingo, Social Standards, Yelp, and Codecademy, who are among the tens of thousands of customers already using Amazon Redshift, it's easy to upgrade to RA3.

Duolingo is the most popular language-learning platform and the most downloaded education app in the world, with more than 300 million users. The company’s mission is to make education free, fun, and accessible to all. They upgraded from Amazon Redshift DS2 instances to the largest instance of RA3 to support their growing data.

“We use Amazon Redshift to analyze the events from our app to gain insight into how users learn with Duolingo,” says Jonathan Burket, a Senior Software Engineer at Duolingo. “We load billions of events each day into Amazon Redshift, have hundreds of terabytes of data, and that is expected to double every year. While we store and process all of our data, most of the analysis only uses a subset of that data. The new Amazon Redshift RA3 instances with managed storage deliver two times the performance for most of our queries compared to our previous DS2 instance-based Amazon Redshift clusters. The Amazon Redshift managed storage automatically adapts to our usage patterns. This means we don’t need to manually maintain hot and cold data tiers, and we can keep our costs flat when we process more data.”

For more information about how Duolingo uses Amazon Redshift, watch the session from AWS re:Invent 2019, How to scale data analytics with Amazon Redshift.

Amazon Redshift is designed to handle these high volumes of data that collectively uncover trends and opportunities. At Social Standards, a fast growing market analytics firm, Amazon Redshift powers the analytics that helps enterprises gain insights into collective social intelligence. The comparative analytics platform transforms billions of social data points into benchmarked insights about the brands, products, features, and trends that consumers are talking about.

“At Social Standards, we are creating the next generation of consumer analytics tools to discover and deliver actionable business insights with complete and authentic analysis of social data for strategic decision making, product innovation, financial analytics, and much more,” says Vladimir Bogdanov, CTO at Social Standards. “We use Amazon Redshift for near real-time analysis and storage of massive amounts of data. Each month we add around 600 million new social interactions and 1.2 TB of new data. As we look forward and continue to introduce new ways to analyze the growing data, the new Amazon Redshift RA3 instances proved to be a game changer. We moved from the Amazon Redshift DS2 instance type to RA3 with a quick and easy upgrade, and were able to increase our storage capacity by eight times, increase performance by two times, and keep costs the same.”

These performance and cost benefits also attracted the popular online reviews and marketplace company, Yelp, to upgrade from DS2 to RA3. Yelp’s mission is to connect people with great local businesses, and data mining and efficient data analysis are important in order to build the best user experience.

“We continue to adopt new Amazon Redshift features and are thrilled with the new RA3 instance type,” says Steven Moy, a Software Engineer at Yelp. “We have observed a 1.9 times performance improvement over DS2 while keeping the same costs and providing scalable managed storage. This allows us to keep pace with explosive data growth and have the necessary fuel to train our machine learning systems.”

For more information about how Yelp uses Amazon Redshift, watch the session from AWS re:Invent 2019, What’s new with Amazon Redshift, featuring Yelp.

As current health conditions shine a spotlight on online learning, many organizations are scaling and using data to guide decision-making. Codecademy uses Amazon Redshift to store all the growing data generated through customers’ use of their web application, including high-volume events such as page visits and button clicks. Their data science team uses this data to develop various statistical models, and by analyzing these models, improve the app based on how customers use it.

“Codecademy is an education company committed to teaching modern skills within technology and code, as well as a catalyst in the shift toward online learning,” says Doug Grove, Director of Infrastructure and Platform at Codecademy. “We were leveraging DS2.xls for our Amazon Redshift cluster and moved to RA3.4xls for performance gains. Moving to the RA3s resulted in a two times performance increase and cut data loading times in half. The separation of compute and storage allows us to scale independently, and allows for easier cluster maintenance.”

For many customers that started using a data warehouse on-premises and migrated to AWS, the scale and value of cloud continue to pay off. Nielsen, the global measurement and data analytics company, provides the most complete and trusted view of consumers and markets worldwide with operations in over 100 countries. A recent upgrade from DS2 to RA3 was the next step in their analytics journey, and helped them save costs, increase performance, and prepare for continued growth.

“We migrated from an on-premises data warehouse to Amazon Redshift in 2017 to optimize costs and to scale our solution to meet the growing demand,” says Sri Subramanian, Senior Manager of Technology at Nielsen. “Our data warehouse workloads run 24/7 at a scale of 1 billion rows per day. We recently migrated our Amazon Redshift cluster from DS2.8x to the new RA3.4x instance type. We have seen a performance gain of up to 40–50% on most of our workloads at a similar price point. Since the RA3 instance types separate compute and storage, disk utilization is no longer a concern. The upgrade was straightforward, and we went from proof of concept to solving complex business challenges quickly.”

These performance gains and productivity improvements are consistent themes from the feedback we’re getting from customers moving from DS2 to RA3. For more information about upgrading your workloads, see Overview of RA3 node types.

Rail Delivery Group, FiNC, and Playrix move from Amazon Redshift DC2 to RA3 to scale compute and storage independently for improved query performance and lower costs

Customers often chose DC2 (Dense Compute 2) for its superior query performance and low price. However, as the data sizes grew, clusters became bigger without the need for additional compute power. Many customers like Rail Delivery Group, FiNC, and Playrix are finding that by upgrading to RA3, they can get significantly more storage space and the same superior performance without increasing costs. For some use cases that need a large amount of raw computational power at the cheapest price and don’t require over 1 TB of data, DC2 provides industry-beating performance. However, if data is likely to grow to over 1 TB compressed, choosing RA3 node types and sizing for compute requirements is a much simpler and cheaper solution in the long run.

One company that found their storage needs growing faster than compute is Rail Delivery Group, a non-profit organization that brings together the companies that run Britain’s railway. They use Amazon Redshift to analyze rail industry data such as timetables, ticket sales, and smartcard usage.

“Since we started using Amazon Redshift for analytics in 2017, we have grown from 1 node to 10 nodes,” says Toby Ayre, Head of Data & Analytics at Rail Delivery Group. “Our data storage needs grew much faster than compute needs, and we had to keep unloading the data out of the data warehouse to Amazon S3. Now, with RA3.4xl nodes with managed storage, we can size for query performance and not worry about storage needs. Since we upgraded from a 10 node DC2.large cluster to a two node RA3.4xl cluster, our queries typically run 30% faster.”

Optimizing costs while also preparing for future growth are consistent requirements for our customers. For FiNC Technologies, the developer of the number one healthcare and fitness app in Japan, data drives a cycle of continuous improvement and enables them to deliver on their mission to provide personalized AI for everyone’s wellness. The personalized diet tutor, private gym, and wellness tracker app helps users make informed decisions about their health and well-being based on real-time metrics about their behavior.

“At FiNC Technologies, we rely on Amazon Redshift to manage KPIs to continuously improve our web services and apps,” says Komiyama Kohei, a Data Scientist at FiNC. “We upgraded to Amazon Redshift RA3 from DC2 because our storage needs were growing faster than our compute. We found it easy to upgrade, and like that our new data warehouse scales storage capacity automatically without any manual effort. Since upgrading, we’ve reduced operational costs by 70%, and feel prepared for future data growth.”

While FiNC optimized for growing storage, Playrix, one of the leading mobile game developers in the world, optimized for compute. With over $1 billion annual revenue and more than 2,000 global employees, Playrix builds popular games like Township, Fishdom, Gardenscapes, Homescapes, Wildscapes, and Manor Matters. They use data to better understand the customer journey.

“We rely on data from multiple internal and external sources to gain insight into user acquisition and make marketing decisions,” says Mikhail Artyugin, Technical Director at Playrix. “We moved our Amazon Redshift data warehouse from 20 nodes of DC2.xlarge to three nodes of RA3.4xl to future proof our system. We’re thrilled with the increase in computing power that makes it faster to deliver insight on the marketing data, and we have almost infinite storage space with managed storage, all for a reasonable price. The friendly and productive collaboration with AWS enterprise support and product team was an extra bonus.”

Conclusion

The Amazon Redshift RA3 nodes with managed storage deliver value to new customers like Poloniex and OpenVault, and to existing customers upgrading from DC2 and DS2 instances like Duolingo, Social Standards, Yelp, Codecademy, Nielsen, Rail Delivery Group, FiNC, and Playrix.

If you’re new to Amazon Redshift, check out our RA3 recommendation tool available on the AWS Management Console when you create a cluster. If you’re already an Amazon Redshift customer and you haven’t tried out RA3 yet, it’s easy to upgrade in minutes with a cross instance restore or elastic resize. If you have existing Amazon Redshift DC2 or DS2 Reserved Instances, you can contact us to get support with the upgrade. For more information about recommended RA3 node types and cluster sizes when upgrading from DC2 and DS2, see Overview of RA3 node types.
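
For example, an existing DS2 or DC2 cluster can be moved to RA3 with a single elastic resize call. The following boto3 sketch is illustrative only; the cluster identifier, target node type, and node count are placeholders that you should size for your own workload:

import boto3

redshift = boto3.client("redshift")

redshift.resize_cluster(
    ClusterIdentifier="my-cluster",  # placeholder cluster name
    ClusterType="multi-node",
    NodeType="ra3.4xlarge",          # target RA3 node type
    NumberOfNodes=3,                 # size for compute needs; managed storage scales separately
    Classic=False,                   # False requests an elastic resize instead of a classic resize
)

A cross-instance restore (restoring a DS2 or DC2 snapshot into a new RA3 cluster) achieves the same result if you prefer a side-by-side migration.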

New features and capabilities for Amazon Redshift are released rapidly, and RA3 is set up for the new scale of data because with AQUA (Advanced Query Accelerator) for Amazon Redshift, performance will continue to improve. You can sign up for the preview of this innovative new hardware-accelerated cache, and the clusters running on RA3 will automatically benefit from AQUA when it’s released. We continue to innovate based on what we hear from our customers, so keep an eye on What’s New in Amazon Redshift to learn about our new releases.

 


About the authors

Corina Radovanovich leads product marketing for cloud data warehousing at AWS. She’s worked in marketing and communications for the biggest tech companies worldwide and specializes in cloud data services.

 

 

 

 

Himanshu Raja is a Principal Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

 

Enhancing customer safety by leveraging the scalable, secure, and cost-optimized Toyota Connected Data Lake

Post Syndicated from Sandeep Kulkarni original https://aws.amazon.com/blogs/big-data/enhancing-customer-safety-by-leveraging-the-scalable-secure-and-cost-optimized-toyota-connected-data-lake/

Toyota Motor Corporation (TMC), a global automotive manufacturer, has made “connected cars” a core priority as part of its broader transformation from an auto company to a mobility company. In recent years, TMC and its affiliate technology and big data company, Toyota Connected, have developed an array of new technologies to provide connected services that enhance customer safety and the vehicle ownership experience. Today, Toyota’s connected cars come standard with an on-board Data Communication Module (DCM) that links to a Controller Area Network (CAN). By using this hardware, Toyota provides various connected services to its customers.

Some of the connected services help drivers to safely enjoy their cars. Telemetry data is available from the car 24×7, and Toyota makes the data available to its dealers (when their customers opt-in for data sharing). For instance, a vehicle’s auxiliary battery voltage declines over time. With this data, dealership staff can proactively contact customers to recommend a charge prior to experiencing any issues. This automotive telemetry can also help fleet management companies monitor vehicle diagnostics, perform preventive maintenance and help avoid breakdowns.

There are other services such as usage-based auto insurance that leverage driving behavior data that can help safe drivers receive discounts on their car insurance. Telemetry plays a vital role in understanding driver behavior. If drivers choose to opt-in, a safety score can be generated based on their driving data and drivers can use their smartphones to check their safe driving scores.

A vehicle generates data every second, which can be bundled into larger packets at one-minute intervals. With millions of connected cars that have data points available every second, the incredible scale required to capture and store that data is immense—there are billions of messages daily generating petabytes of data. To make this vision a reality, Toyota Connected’s Mobility Team embarked on building a real-time “Toyota Connected Data Lake.” Given the scale, we leveraged AWS to build this platform. In this post, we show how we built the data lake and how we provide significant value to our customers.

Overview

The guiding principles for architecture and design that we used are as follows:

  • Serverless: We want to use cloud native technologies and spend minimal time on infrastructure maintenance.
  • Rapid speed to market: We work backwards from customer requirements and iterate frequently to develop minimally viable products (MVPs).
  • Cost-efficient at scale.
  • Low latency: near real time processing.

Our data lake needed to be able to:

  • Capture and store new data (relational and non-relational) at petabyte scale in real time.
  • Provide analytics that go beyond batch reporting and incorporate real time and predictive capabilities.
  • Democratize access to data in a secure and governed way, allowing our team to unleash their creative energy and deliver innovative solutions.

The following diagram shows the high-level architecture.

Walkthrough

We built the serverless data lake with Amazon S3 as the primary data store, given the scalability and high availability of S3. The entire process is automated, which reduces the likelihood of human error, increases efficiency, and ensures consistent configurations over time, as well as reduces the cost of operations.

The key components of a data lake include Ingest, Decode, Transform, Analyze, and Consume:

  • Ingest: Connected vehicles send telemetry data once a minute, which includes speed, acceleration, turns, geolocation, fuel level, and diagnostic error codes. This data is ingested into Amazon Kinesis Data Streams, processed through AWS Lambda to make it readable, and the “raw copy” is saved through Amazon Kinesis Data Firehose into an S3 bucket.
  • Decode: Data arriving in the Kinesis data stream in the ‘Decode’ pillar is decoded by a serverless Lambda function, which does most of the heavy lifting. Based upon a proprietary specification, this Lambda function does the bit-by-bit decoding of the input message to capture the particular sensor values. The small input payload of 35 KB with data from over 180 sensors is decoded and converted to a JSON message of 3 MB. This is then compressed and written to the ‘Decoded’ S3 bucket (a minimal sketch of this step follows the list).
  • Transform: The aggregation jobs leverage the massively parallel capability of Amazon EMR, decrypt the decoded messages, and convert the data to Apache Parquet. Apache Parquet is a columnar storage file format designed for querying large amounts of data, regardless of the data processing framework or programming language. Parquet allows for better compression, which reduces the amount of storage required. It also reduces I/O, since we can efficiently scan the data. The datasets are now available for analytics purposes, partitioned by masked identification numbers as well as by automotive models and dispatch type. A separate set of jobs transforms the data and stores it in Amazon DynamoDB to be consumed in real time from APIs.
  • Consume: Applications needing to consume the data make API calls through Amazon API Gateway. Authentication for the API calls is based on temporary tokens issued by Amazon Cognito.
  • Analyze: Data analytics can be performed directly on Amazon S3 by leveraging serverless Amazon Athena. Data access is democratized and made available to data science groups, who build and test various models that provide value to our customers.
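
To make the Decode step more concrete, the following is a minimal Lambda sketch in Python. It is illustrative only: the decode_sensor_payload function is a placeholder for the proprietary bit-by-bit decoder, and the bucket name and key layout are assumptions.

import base64
import gzip
import json
import os

import boto3

s3 = boto3.client("s3")
DECODED_BUCKET = os.environ.get("DECODED_BUCKET", "example-decoded-bucket")  # placeholder bucket name


def decode_sensor_payload(raw_bytes):
    # Placeholder for the proprietary decoder described above; the real function
    # maps bit offsets in the raw payload to roughly 180 named sensor values.
    return {"payload_size_bytes": len(raw_bytes)}


def handler(event, context):
    # Triggered by the Kinesis data stream in the Decode pillar.
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        decoded = decode_sensor_payload(raw)
        body = gzip.compress(json.dumps(decoded).encode("utf-8"))
        key = f"decoded/{record['kinesis']['sequenceNumber']}.json.gz"
        s3.put_object(Bucket=DECODED_BUCKET, Key=key, Body=body)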

Additionally, comprehensive monitoring is set up by leveraging Amazon CloudWatch, Amazon ES, and AWS KMS for managing the keys securely.

Scalability

The scalability capabilities of the building blocks in our architecture that allow us to reach this massive scale are:

  • S3: S3 is a massively scalable key-based object store that is well-suited for storing and retrieving large datasets. S3 partitions the index based on key name. To maximize performance of high-concurrency operations on S3, we introduced randomness into each of the Parquet object keys to increase the likelihood that the keys are distributed across many partitions (a minimal sketch of this key-naming approach follows this list).
  • Lambda: We can run as many concurrent functions as needed and can raise limits as required with AWS support.
  • Kinesis Data Firehose: It scales elastically based on volume without requiring any human intervention. We batch requests up to 128 MiB or 15 minutes, whichever comes first, to avoid small files. Additional details are available in Srikanth Kodali’s blog post.
  • Kinesis Data Streams: We developed an automated program that adjusts the shards based on incoming volume. This is based on the Kinesis Scaling Utility from AWS Labs, which allows us to scale in a way similar to EC2 Auto Scaling groups.
  • API Gateway: automatically scales to billions of requests and seamlessly handles our API traffic.
  • EMR cluster: We can programmatically scale out to hundreds of nodes based on our volume and scale in after processing is completed.
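
The key randomization mentioned in the S3 bullet above can be as simple as prepending a short hash to each object key. The following sketch is illustrative only; the vehicle ID and timestamp key layout is an assumption, not the actual naming scheme:

import hashlib

def randomized_key(vehicle_id: str, timestamp: str) -> str:
    # Prepend a short hash so keys spread across many S3 index partitions
    # instead of sharing a single, monotonically increasing prefix.
    prefix = hashlib.md5(f"{vehicle_id}-{timestamp}".encode("utf-8")).hexdigest()[:4]
    return f"{prefix}/decoded/{vehicle_id}/{timestamp}.parquet"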

Our volumes have increased seven-fold since we migrated to AWS and we have only adjusted the number of shards in Kinesis Data Streams and the number of core nodes for EMR processing to scale with the volume.

Security in the AWS cloud

AWS provides a robust suite of security services, allowing us to have a higher level of security in the AWS cloud. Consistent with our security guidelines, data is encrypted both in transit and at rest. Additionally, we use VPC Endpoints, allowing us to keep traffic within the AWS network.

Data protection in transit:

Data protection at rest:

  • S3 server-side encryption handles all encryption, decryption and key management transparently. All user data stored in DynamoDB is fully encrypted at rest, for which we use an AWS-owned customer master key at no additional charge. Server-side encryption for Kinesis Data streams and Kinesis Data Firehose is also enabled to ensure that data is encrypted at rest.

Cost optimization

Given our very large data volumes, we were methodical about optimizing costs across all components of the infrastructure. The ultimate goal was to figure out the cost of the APIs we were exposing. We developed a robust cost model validated with performance testing at production volumes:

  • NAT gateway: When we started this project, one of the significant cost drivers was traffic flowing from Lambda to Kinesis Data Firehose that went over the NAT gateway, since Kinesis Data Firehose did not have a VPC endpoint. Traffic flowing through the NAT gateway costs $0.045/GB, whereas traffic flowing through the VPC endpoint costs $0.01/GB. Based on a product feature request from Toyota, AWS implemented this feature (VPC Endpoint for Firehose) early this year. We implemented this feature, which resulted in a four-and-a-half-fold reduction in our costs for data transfer.
  • Kinesis Data Firehose: Since Kinesis Data Firehose did not initially support encryption of data at rest, we had to use client-side encryption using KMS—this was the second significant cost driver. We requested a feature for native server-side encryption in Kinesis Data Firehose. This was released earlier this year and we enabled server-side encryption on the Kinesis Data Firehose stream. This removed the client-side AWS Key Management Service (KMS) usage, resulting in another 10% reduction in our total costs.

Since Kinesis Data Firehose charges based on the amount of data ingested ($0.029/GB), our Lambda function compresses the data before writing to Kinesis Data Firehose, which saves on the ingestion cost.

  • S3: We use lifecycle policies to move data from S3 (which costs $0.023/GB) to Amazon S3 Glacier (which costs $0.004/GB) after a specified duration. Glacier provides a six-fold cost reduction over S3. We further plan to move the data from Glacier to Amazon S3 Glacier Deep Archive (which costs $0.00099/GB), which will provide us a four-fold reduction over Glacier costs. Additionally, we have set up automated deletes of certain data sets at periodic intervals (a lifecycle configuration sketch follows this list).
  • EMR: We were planning to use AWS Glue and keep the architecture serverless, but made the decision to leverage EMR from a cost perspective. We leveraged Spot Instances for transformation jobs in EMR, which can provide up to 60% savings. The hourly jobs complete successfully with Spot Instances; however, the nightly aggregation jobs leveraging r5.4xlarge instances failed frequently because sufficient Spot capacity was not available. We decided to move to On-Demand Instances while we finalize our strategy for Reserved Instances to reduce costs.
  • DynamoDB: Time to Live (TTL) for DynamoDB lets us define when items in a table expire so that they can be automatically deleted from the database. We enabled TTL to expire objects that are not needed after a certain duration. We plan to use reserved capacity for read and write capacity units to reduce costs. We also use DynamoDB auto scaling, which helps us manage capacity efficiently and lowers the cost of our workloads because they have a predictable traffic pattern. In Q2 2019, DynamoDB removed the associated costs of DynamoDB Streams used in replicating data globally, which translated to extra cost savings in global tables.
  • Amazon DynamoDB Accelerator (DAX): Our DynamoDB tables are front-ended by DAX, which improves the response time of our application by dramatically reducing read latency compared to using DynamoDB directly. Using DAX, we also lower the cost of DynamoDB by reducing the amount of provisioned read throughput needed for read-heavy applications.
  • Lambda: We ran benchmarks to arrive at the optimal memory configuration for our Lambda functions. Memory allocation in Lambda determines CPU allocation, and for some of our Lambda functions we allocated higher memory, which results in faster execution and reduces the GB-seconds consumed per invocation, saving both time and cost. Combining Lambda with DAX provides an additional benefit: lower read latency results in shorter execution times, which means lower Lambda costs.
  • Kinesis Data Streams: We scale our streams through an automated job, since our traffic patterns are fairly predictable. During peak hours we add additional shards and delete them during off-peak hours, which allows us to reduce costs when shards are not in use.
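
The S3 lifecycle policy mentioned above can be expressed in a few lines of code. The following boto3 sketch is illustrative; the bucket name, prefix, and day counts are assumptions rather than the values used in production:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-decoded-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-and-expire-decoded-data",
                "Filter": {"Prefix": "decoded/"},
                "Status": "Enabled",
                # Move objects to Glacier after 90 days; a later transition to
                # DEEP_ARCHIVE would add the further reduction mentioned above.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # Automated delete of data sets that are no longer needed.
                "Expiration": {"Days": 730},
            }
        ]
    },
)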

Enhancing customer safety

The Data Lake presents multiple opportunities to enhance customer safety. Early detection of market defects and pinpointing of target vehicles affected by those defects is made possible through the telemetry data ingested from the vehicles. This early detection leads to early resolution way before the customer is affected. On-board software in the automobiles can be constantly updated over-the-air (OTA), thereby saving time and costs. The automobile can generate a Health Check Report based on the driving style of its drivers, which can create the ideal maintenance plan for drivers for worry-free driving.

The driving data for an individual driver based on speed, sharp turns, rapid acceleration, and sudden braking can be converted into a “driver score,” which ranges from 1 to 100 in value. The higher the driver score, the safer the driver. Drivers can view their scores on mobile devices and monitor the specific locations of harsh driving on the journey map. They can then use this input to self-correct and modify their driving habits to improve their scores, which not only results in a safer environment but can also earn drivers lower insurance rates from insurance companies. This also gives parents an opportunity to monitor the scores for their teenage drivers and coach them appropriately on safe driving habits. Additionally, notifications can be generated if the teenage driver exceeds an agreed-upon speed or leaves a specific area.

Summary

The automated serverless data lake is a robust scalable platform that allows us to analyze data as it becomes available in real time. From an operations perspective, our costs are down significantly. Several aggregation jobs that took 15+ hours to run, now finish in 1/40th of the time. We are impressed with the reliability of the platform that we built. The architectural decision to go serverless has reduced operational burden and will also allow us to have a good handle on our costs going forward. Additionally, we can deploy this pipeline in other geographies with smaller volumes and only pay for what we consume.

Our team accomplished this ambitious development in a short span of six months. They worked in an agile, iterative fashion and continued to deliver robust MVPs to our business partners. Working with the service teams at AWS on product feature requests and seeing them come to fruition in a very short time frame has been a rewarding experience and we look forward to the continued partnership on additional requests.

 


About the Authors


Sandeep Kulkarni is an enterprise architect at AWS. His passion is to accelerate digital transformation for customers and build highly scalable and cost-effective solutions in the cloud. In his spare time, he loves to do yoga and gardening.

 

 

 

 

Shravanthi Denthumdas is the director of mobility services at Toyota Connected. Her team is responsible for building the Data Lake and delivering services that allow drivers to safely enjoy their cars. In her spare time, she likes to spend time with her family and children.

 

 

 

 

Optimize Python ETL by extending Pandas with AWS Data Wrangler

Post Syndicated from Satoshi Kuramitsu original https://aws.amazon.com/blogs/big-data/optimize-python-etl-by-extending-pandas-with-aws-data-wrangler/

Developing extract, transform, and load (ETL) data pipelines is one of the most time-consuming steps to keep data lakes, data warehouses, and databases up to date and ready to provide business insights. You can categorize these pipelines into distributed and non-distributed, and the choice of one or the other depends on the amount of data you need to process.

Apache Spark is widely used to build distributed pipelines, whereas Pandas is preferred for lightweight, non-distributed pipelines. With the second use case in mind, the AWS Professional Services team created AWS Data Wrangler, aiming to fill the integration gap between Pandas and several AWS services, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, AWS Glue, Amazon Athena, Amazon Aurora, Amazon QuickSight, and Amazon CloudWatch Logs Insights.

AWS Data Wrangler is an open-source Python library that enables you to focus on the transformation step of ETL by using familiar Pandas transformation commands and relying on abstracted functions to handle the extraction and load steps.

You can use AWS Data Wrangler in different environments on AWS and on premises (for more information, see Install). This post focuses on data preparation for a data science project on Jupyter. By the end of this walkthrough, you will be able to set up AWS Data Wrangler on your Amazon SageMaker notebook.

Use case overview

In the following walkthrough, you use data stored in the NOAA public S3 bucket. For more information, see NOAA Global Historical Climatology Network Daily. The objective is to convert 10 CSV files (approximately 240 MB total) to a partitioned Parquet dataset, store its related metadata into the AWS Glue Data Catalog, and query the data using Athena to create a data analysis.

Configuring Amazon S3

Your first step is to create an S3 bucket to store the Parquet dataset.

  1. On the Amazon S3 console, choose Create bucket.
  2. For Bucket name, enter a name for your bucket.

  3. Choose Create.

Creating a new database in the Data Catalog

The Data Catalog is an Apache Hive-compatible managed metadata storage that lets you store, annotate, and share metadata on AWS.

For this use case, you use it to store the metadata associated with your Parquet dataset. The Data Catalog is integrated with many analytics services, including Athena, Amazon Redshift Spectrum, and Amazon EMR (Apache Spark, Apache Hive, and Presto).

  1. On the AWS Glue console, choose Databases.
  2. Choose Add database.
  3. For Database name, enter awswrangler_test.
  4. Choose Create.

Launching an Amazon SageMaker notebook

An Amazon SageMaker notebook is a managed instance running the Jupyter Notebook app. For this use case, you use it to write and run your code.

  1. On the Amazon SageMaker console, choose Notebook instance.
  2. Choose Create a notebook instance.
  3. For Notebook instance name, enter a name.
  4. For IAM role, choose an existing AWS Identity and Access Management (IAM) role or create a role that allows you to run Amazon SageMaker and grants access to Amazon S3, Athena, and AWS Glue for the related resources.

  5. Wait for the notebook status to show as InService.
  6. Choose Open Jupyter from the notebook instance you created.

Exploring the data

This section walks you through several notebook paragraphs to expose how to install and use AWS Data Wrangler.

  1. On the Jupyter console, under New, choose conda_python3.
  2. To install AWS Data Wrangler, enter the following code:
    !pip install awswrangler

  3. To avoid dependency conflicts, restart the notebook kernel by choosing Kernel -> Restart.
  4. Import the library using the usual alias wr:
    import awswrangler as wr

  5. List all files in the NOAA public bucket from the decade of 1880:
    wr.s3.list_objects("s3://noaa-ghcn-pds/csv/188")

The following screenshot shows the output.

  6. Load the whole decade (10 files) into a Pandas DataFrame using the Amazon S3 prefix s3://noaa-ghcn-pds/csv/188:
    col_names = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
    
    df = wr.s3.read_csv(
        path="s3://noaa-ghcn-pds/csv/188",
        names=col_names,
        parse_dates=["dt", "obs_time"]  # Hint to parse these columns as date instead of strings
    )

    The following screenshot shows the output.

  7. Create a new column extracting the year from the dt column (the new column is useful for creating partitions in the Parquet dataset):
    df["year"] = df["dt"].dt.year

The following screenshot shows the output.

  8. Store the Pandas DataFrame in the S3 bucket you created in the beginning of this post (replace the [BUCKET] placeholder in the code with your bucket name):
    wr.s3.to_parquet(
        df=df,
        path="s3://[BUCKET]/noaa/",
        dataset=True,
        database="awswrangler_test",
        table="noaa",
        partition_cols=["year"]
    );

The preceding code creates the table noaa in the awswrangler_test database in the Data Catalog.

  9. After processing this, you can confirm the Parquet files exist in Amazon S3 and that the table noaa is in the AWS Glue Data Catalog. See the following code:
    wr.s3.list_objects("s3://[BUCKET]/noaa/")
    wr.catalog.table(database="awswrangler_test", table="noaa")

The following screenshot shows the output.

  10. Run a SQL query from Athena that filters only the US maximum temperature measurements of the last 3 years (1887–1889) and receive the result as a Pandas DataFrame:
    sql = """
    SELECT
        dt,
        (value / 10.0) AS temperature  -- Converting tenths of degrees C to regular degrees C
    FROM noaa
    WHERE year BETWEEN 1887 AND 1889  -- Only last 3 years (PARTITION filter)
    AND substr(id, 1, 2)='US'  -- Only U.S. stations
    AND element='TMAX'  -- Only Maximum temperature elements
    AND q_flag is NULL  -- Only HIGH quality measurement
    """
    
    df = wr.athena.read_sql_query(sql, database="awswrangler_test")

The following screenshot shows the output.

The following two queries illustrate how you can visualize the data.

  1. To plot the average maximum temperature measured in the tracked station, enter the following code:
    %matplotlib inline
    df.groupby("dt").mean().plot();

The following screenshot shows the output.

  2. To plot a moving average of the previous metric with a 30-day window, enter the following code:
    %matplotlib inline
    df.groupby("dt").mean().rolling(window=30, center=True).mean().plot();

The following screenshot shows the output.

Cleaning up

To avoid incurring future charges, delete the resources from the following services:

  1. AWS Glue database
    • On the AWS Glue console, choose the database you created.
    • From the Actions drop-down menu, choose Delete database.
    • Choose Delete.
  2. Amazon SageMaker notebook
    • On the Amazon SageMaker console, choose the notebook instance you created.
    • From the Actions drop-down menu, choose Stop.
    • When the status shows as Stopped, choose Delete from the Actions drop-down menu.
    • Choose Delete.
  3. S3 bucket
    • On the Amazon S3 console, choose Buckets.
    • Choose the bucket you created.
    • Choose Empty and enter your bucket name.
    • Choose Confirm.
    • Choose Delete and enter your bucket name.
    • Choose Delete bucket.
  4. IAM Role
    • On the IAM console, choose Roles.
    • Choose the role you attached to Amazon SageMaker.
    • Choose Delete role.
    • Choose Yes.

Conclusion

Installing AWS Data Wrangler is a breeze. With a single command, you can connect ETL tasks to multiple data sources and different data services. The library is a work in progress, with new features and enhancements added regularly. For more tutorials, see the GitHub repo.

 


About the Authors

Satoshi Kuramitsu is a Solutions Architect in AWS. His favorite AWS services are AWS Glue, Amazon Kinesis, and Amazon S3.

 

 

 

 

 

Igor Tavares is a Data & Machine Learning Engineer in the AWS Professional Services team and the original creator of AWS Data Wrangler.

 

 

 

Integrating the MongoDB Cloud with Amazon Kinesis Data Firehose

Post Syndicated from Anusha Dharmalingam original https://aws.amazon.com/blogs/big-data/integrating-the-mongodb-cloud-with-amazon-kinesis-data-firehose/

Amazon Kinesis Data Firehose now supports the MongoDB Cloud platform as one of its delivery destinations. This native integration between Kinesis Data Firehose and MongoDB Cloud provides a managed, secure, scalable, and fault-tolerant delivery mechanism for customers into MongoDB Atlas, a global cloud solution for fully managed MongoDB database service for modern applications.

With the release of Kinesis Data Firehose HTTP endpoint delivery, you can now stream your data through Amazon Kinesis Data Streams or directly push data to Kinesis Data Firehose and configure it to deliver data to MongoDB Atlas. You can also configure Kinesis Data Firehose to transform the data before delivering it to its destination. You don’t have to write applications and manage resources to read data and push to MongoDB. It’s all managed by AWS, making it easier to estimate costs for your data based on your data volume.

In this post, we discuss how to integrate Kinesis Data Firehose and MongoDB Cloud and demonstrate how to stream data from your source to MongoDB Atlas.

The following diagram depicts the overall architecture of the solution. We configure Kinesis Data Firehose to push the data to a MongoDB Realm event-driven serverless JavaScript function. MongoDB Realm is an intuitive app development service to accelerate your frontend integration by simplifying your backend. We use a specific type of function called a webhook. The webhook parses the JSON message from Kinesis Data Firehose and inserts parsed records into the MongoDB Atlas database.

Integrating Kinesis Data Firehose and MongoDB Atlas

Kinesis Data Firehose is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt the data before loading it, which minimizes the amount of storage used at the destination and increases security.

As part of Kinesis Data Firehose, you can transform your records before delivering them to the destination. In addition, Kinesis Data Firehose enables you to buffer data (based on size or time) before delivering to the final destination. In case of delivery failures, Kinesis Data Firehose can store your failed records in an Amazon Simple Storage Service (Amazon S3) bucket to prevent data loss.

MongoDB Atlas is a platform that can be used across a range of Online Transactional Processing (OLTP) and data analytics applications. MongoDB Atlas allows developers to address popular use cases such as Internet of Things (IoT), Mobile Apps, Payments, Single View, Customer Data Management, and many more. In all of those cases, developers spend a significant amount of time delivering data to MongoDB Atlas from various data sources. This integration significantly reduces the development effort by leveraging the Kinesis Data Firehose HTTP endpoint integration to ingest data into MongoDB Atlas.

Creating a MongoDB Cloud Realm Application

  1. Log in to your MongoDB Cloud account. If you do not have an account, you can sign up for a free account.
  2. Create an HTTP endpoint on the MongoDB Atlas platform by choosing 3rd Party Services on the Realm tab.
  3. Choose Add a Service.

  4. Choose HTTP.
  5. For Service Name, enter a name. Your service will appear under this name on the UI.

  6. Choose Add Incoming Webhook.

  7. For Authentication, select System.

  8. Leave other options at their default.
  9. In the function editor, enter the following code:
exports = function(payload, response) {
  
    const decodeBase64 = (s) => {
        var e={},i,b=0,c,x,l=0,a,r='',w=String.fromCharCode,L=s.length
        var A="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
        for(i=0;i<64;i++){e[A.charAt(i)]=i}
        for(x=0;x<L;x++){
            c=e[s.charAt(x)];b=(b<<6)+c;l+=6
            while(l>=8){((a=(b>>>(l-=8))&0xff)||(x<(L-2)))&&(r+=w(a))}
        }
        return r
    }
    
    var fullDocument = JSON.parse(payload.body.text());
    
    const firehoseAccessKey = payload.headers["X-Amz-Firehose-Access-Key"]
    console.log('should be: ' + context.values.get("KDFH_SECRET_KEY"));
 
   // Check shared secret is the same to validate Request source
   if (firehoseAccessKey == context.values.get("KDFH_SECRET_KEY")) {
 

      var collection = context.services.get("Cluster0").db("kdf").collection("kdf-test");
      
      fullDocument.records.forEach((record) => {
            const document = JSON.parse(decodeBase64(record.data))
            const status = collection.insertOne(document);
            console.log("got status: "+ status)
      })

      response.setStatusCode(200)
            const s = JSON.stringify({
                requestId: payload.headers['X-Amz-Firehose-Request-Id'][0],
                timestamp: (new Date()).getTime()
            })
            response.addHeader(
                "Content-Type",
                "application/json"
            );
            response.setBody(s)
            console.log("response JSON:" + s)
      return
   } else {
    response.setStatusCode(500)
            response.setBody(JSON.stringify({
                requestId: payload.headers['X-Amz-Firehose-Request-Id'][0],
                timestamp: (new Date()).getTime(),
                errorMessage: "Error authenticating"
            }))
    return
   }
};

The preceding code is a simplified implementation of the webhook. The webhook inserts records one at a time and has abbreviated error handling for readability. For more information about the full implementation, see Using MongoDB Realm WebHooks with Amazon Kinesis Data Firehose.

This webhook uses the values and secrets of MongoDB Realm.

  1. On the Realm tab, choose Values & Secrets.

  2. On the Secrets tab, choose Create New Secret/Add a Secret.

  3. Enter the Secret Name and Secret Value, and choose Save. The Secret Name entered here is the name used in the webhook code.

  4. On the Values tab, choose Create New Value/Add a Value.

  5. Enter the Value Name.
  6. For Value Type, select Secret.
  7. For Secret Name, choose the secret you created.

  8. Choose Save.

You can now use the secret in your webhook function.

  9. Choose REVIEW & DEPLOY.

Creating a Kinesis Data Firehose delivery stream to MongoDB

  1. Log in to the AWS Management Console and search for Kinesis.
  2. On the Kinesis Data Firehose console, choose Create delivery stream.
  3. For Delivery stream name, enter a name.
  4. For Source, choose Direct PUT of other sources.
  5. Choose Next.

  6. On the Process records page, keep all settings at their defaults and choose Next.
  7. From the Third-party partner drop-down menu, choose MongoDB Cloud.

  8. For MongoDB Realm Webhooks HTTP Endpoint URL, enter the URL of the Realm webhook that you created on the MongoDB Cloud console.
  9. For API Key, enter the secret value stored in the MongoDB Secrets section.
  10. For Content encoding, leave it as Disabled.
  11. For S3 backup mode, select Failed data only.
  12. For S3 bucket, enter the S3 bucket for delivery of log events that exceeded the retry duration. Alternatively, you can create a new bucket by choosing Create new.
  13. Choose Next.
  14. For MongoDB buffer conditions, accept the default MongoDB and Amazon S3 buffer conditions for your stream. Note that the buffer size should be a value between 1 MiB and 16 MiB. Review the limits in the MongoDB Atlas documentation.

  15. In the IAM role section, configure permissions for your delivery stream by choosing Create or update IAM role.
  16. Choose Next.
  17. Review your settings and choose Create delivery stream.
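
If you prefer to script this setup, the same delivery stream can be created with the AWS SDK. The following boto3 sketch is illustrative only; the stream name, webhook URL, access key, role ARN, and bucket ARN are placeholders, and the buffering values simply fall within the 1–16 MiB range noted above:

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="kdf-to-mongodb",
    DeliveryStreamType="DirectPut",
    HttpEndpointDestinationConfiguration={
        "EndpointConfiguration": {
            "Url": "<your MongoDB Realm webhook URL>",
            "Name": "MongoDB Cloud",
            "AccessKey": "<secret value stored in MongoDB Realm>",
        },
        "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
        "S3BackupMode": "FailedDataOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-firehose-backup-bucket",
        },
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    },
)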

As part of HTTP endpoint integration, Kinesis Data Firehose only supports HTTPS endpoints. The server-side TLS/SSL certificate must be signed by a trusted Certificate Authority (CA) and is used for verification by Kinesis Data Firehose.

The body of the request that is delivered from Kinesis Data Firehose is a JSON document with the following schema:

"$schema": http://json-schema.org/draft-07/schema#

title: FirehoseCustomHttpsEndpointRequest
description: >
  The request body that the Firehose service sends to
  custom HTTPS endpoints.
type: object
properties:
  requestId:
    description: >
      Same as the value in the X-Amz-Firehose-Request-Id header,
      duplicated here for convenience.
    type: string
  timestamp:
    description: >
      The timestamp (milliseconds since epoch) at which the Firehose
      server generated this request.
    type: integer
  records:
    description: >
      The actual records of the Delivery Stream, carrying 
      the customer data.
    type: array
    minItems: 1
    maxItems: 10000
    items:
      type: object
      properties:
        data:
          description: >
            The data of this record, in Base64. Note that empty
            records are permitted in Firehose. The maximum allowed
            size of the data, before Base64 encoding, is 1024000
            bytes; the maximum length of this field is therefore
            1365336 chars.
          type: string
          minLength: 0
          maxLength: 1365336

required:
  - requestId
  - records

The records are delivered as a collection based on BufferingHints configured on the Firehose delivery stream. The delivery-side service webhook created on MongoDB Realm has to process these records one by one before inserting them into MongoDB collections or use the MongoDB Bulk APIs.

When Kinesis Data Firehose is set up with an HTTP endpoint destination to MongoDB Cloud, you can push data into Kinesis Data Firehose using Kinesis Agent or SDK from your application. Kinesis Data Firehose is also integrated with other AWS data sources such as Kinesis Data Streams, AWS IoT, Amazon CloudWatch Logs, and Amazon CloudWatch Events.
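
For a quick test from your own code, a minimal boto3 sketch might look like the following; the stream name and payload are placeholders:

import json

import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="kdf-to-mongodb",  # placeholder delivery stream name
    Record={"Data": json.dumps({"device_id": "sensor-1", "temperature": 21.5}).encode("utf-8")},
)

For higher throughput, put_record_batch accepts up to 500 records per call.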

To test the integration, use the testing option on the Kinesis Data Firehose console and test with sample data. After the time configured in BufferingHints, log in to your Atlas platform and navigate to your Database/Collection to see the ingested records.

Conclusion

In this post, we showed how easy it is to ingest data into the MongoDB Cloud platform using a Kinesis Data Firehose HTTP endpoint. This integration has many use cases. For example, you can stream Internet of Things (IoT) data directly into the MongoDB Atlas platform with minimal code using the Kinesis Data Firehose HTTP endpoint integration. Try MongoDB Atlas on AWS here.

 


About the Author

Anusha Dharmalingam is a Solutions Architect at Amazon Web Services, with a passion for Application Development and Big Data solutions. Anusha works with enterprise customers to help them architect, build, and scale applications to achieve their business goals.

 

 

 

 

Igor Alekseev is a Partner Solution Architect at AWS in Data and Analytics. Igor works with strategic partners helping them build complex, AWS-optimized architectures. Prior to joining AWS, as a Data/Solution Architect, he implemented many projects in Big Data, including several data lakes in the Hadoop ecosystem. As a Data Engineer, he was involved in applying AI/ML to fraud detection and office automation. Igor’s projects were in a variety of industries including communications, finance, public safety, manufacturing, and healthcare. Earlier, Igor worked as a full stack engineer/tech lead.

Organize and share your content with folders in Amazon QuickSight

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/organize-and-share-your-content-with-folders-in-amazon-quicksight/

Amazon QuickSight Enterprise Edition now supports folders for organizing and sharing content. Folders in QuickSight are of two types:

  • Personal folders – Allow individual authors and administrators to organize assets for their personal ease of navigation and manageability
  • Shared folders – Allow authors and administrators to define folder hierarchies that they can share across the organization and use to manage user permissions and access to dashboards, analyses, and datasets

You can access folders directly from shortcuts on the new QuickSight home page (see the following screenshot). In this post, we take a deeper look at folders and how you can implement this in your QuickSight account.

Asset permissions and folders

Before we dive into how the two types of folders work, let’s understand how asset permissions work in QuickSight. QuickSight assets (dashboards, analyses, and datasets) are created by authors or admins, reside in the cloud, and by default are permissioned to be visible from the UI to only the owner, which in this case is the creator of the asset. The owner can share the asset with other users (authors or admins, or in the case of dashboards, readers) or groups of users. When the asset needs to be shared, QuickSight allows the owner to share with specific users or groups of users, who can then be provided viewer or owner access.

Previously, these flows meant that admins and authors who have hundreds of assets had to manage permissions for users and groups individually. There was no hierarchical structure to easily navigate and discover key assets available. We built personal folders to solve the need for organization for authors and admins, while shared folders provide easier bulk permissions management for authors and discovery of assets for both authors and readers.

Personal folders are available to all authors and admins in QuickSight Enterprise Edition. You can create these folders within your user interface and add assets in them. Personal folders aren’t visible to other users within the account, and they don’t affect the permissions of any objects placed within. This means that if you create a personal folder called Published dashboards and add a dashboard to it, there are no changes to user permissions in the dashboard on account of its addition to this folder. An important difference here is that unlike traditional folders, QuickSight allows you to place the same asset in multiple folders, which avoids the need to replicate the same asset in different folders. This allows you to update one time and make sure all your stakeholders get the latest information.

The following screenshot shows the My folders page on the QuickSight console.

Shared folders in QuickSight are visible to permissioned users across author, admin, and reader roles in QuickSight Enterprise Edition. Top (root)-level shared folders can only be created by admins of the QuickSight account, who can share these with other users or groups. When sharing, folders offer two levels of permissions:

  • Owner access – Allows admins or authors with access to the folder to add and remove assets (including subfolders), modify the folder itself, and share as needed with users or groups.
  • Viewer access – Restricts users to only viewing the folder and contents within, including subfolders. Readers can only be assigned viewer access, and can see the Shared folders section when at least one folder is shared with them.

The following screenshot shows the Shared folders page.

The following screenshot shows the Share folder pop-up window, which you use to choose who to share folders with.

Permissions granted to a user or group at a parent folder level are propagated to subfolders within, which means that owners of a parent folder have access to subfolders. As a result, it’s best to model your permissions tree and folder structure before implementing and sharing folders in your account. Users who are to be restricted to specific folders are best granted access at the lowest level possible.

Folder permissions are currently also inherited by the assets within. For example, if a dashboard is placed in a shared folder, and Sally is granted access to the folder as an owner, Sally now has ownership over the folder and the dashboard. This model allows you to effectively use folders to manage shared permissions across thousands of users without having to implement this on a per-user or per-asset level.

For example, a team of 10 analysts could have owner permissions to a shared folder, which allows them to own both the folder and contents within, while thousands of other users (readers, authors, and admins) can be granted viewer permissions to the folder. This ensures that permissions management for these viewers can be done by the one-time action of granting them viewer permissions over the folder, instead of granting these permissions to users and groups within each dashboard. Permissions applied at the individual asset level continue to be enforced, and the final permissions of a user is the combination of the folder and individual asset permissions (whichever is higher).

Shared folders also enforce a uniqueness check over the folder path, which means that you can’t have two folders that have the same name at the same level in the folder tree. For example, if the admin creates /Oktank/ and shares with Tom and Sally as owners, and Tom creates /Oktank/Marketing/, Sally can no longer create a folder with the name Marketing. She should coordinate with Tom on permissions and get Tom to share this folder as an owner so that she can also contribute to the marketing assets. For personal folders (and for other asset types including dashboards, analyses, and datasets), QuickSight doesn’t require such uniqueness.

With QuickSight Enterprise Edition, dashboards, analyses, and datasets—whether owned by a user or shared with them—exist within the user’s QuickSight account and can be accessed via the asset-specific details page or search. All assets continue to be displayed via these pages, while those added to specific folders become visible via the folders view.

Use case: Oktank Analytics

Let’s put this all together and look from the lens of how a fictional customer, Oktank Analytics, can set up shared folders within their account. Let’s assume that Oktank has three departments: marketing, sales, and finance, with the sales team subdivided into US and EU orgs. Each of these departments and sub-teams has their own set of analysts that build and manage dashboards, and departmental users that expect to see data pertaining to their functional area. Oktank also has C-level executives that need access to dashboards from each department. Finally, QuickSight administrators oversee the overall business intelligence solution.

To implement this in QuickSight and provide a scalable model, the admin team first creates the top-level folder /Oktank/ and grants viewer access to the C-level executives. This grants the leadership team access to all subfolders underneath, making sure that there are no access issues. Access is also limited to viewer, so that the leadership has visibility but can’t accidentally make any changes.

Next, the admin team creates subfolders for marketing, sales, and finance. Both the admins and C-level executives have access to these folders (as owner and viewer, respectively) due to their permissions on the top-level folder.
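
The setup above uses the QuickSight console, but the same hierarchy can also be scripted where the QuickSight folder APIs are available in your SDK version. The following boto3 sketch is illustrative only; the account ID, Region, group name, folder IDs, and the permission action string are assumptions to verify against the QuickSight API reference:

import boto3

quicksight = boto3.client("quicksight")
account_id = "123456789012"  # placeholder AWS account ID

# Root shared folder, with C-level executives granted viewer-style access
# (the group ARN and action string below are assumptions).
quicksight.create_folder(
    AwsAccountId=account_id,
    FolderId="oktank",
    Name="Oktank",
    FolderType="SHARED",
    Permissions=[
        {
            "Principal": f"arn:aws:quicksight:us-east-1:{account_id}:group/default/c-level-executives",
            "Actions": ["quicksight:DescribeFolder"],
        }
    ],
)

# Marketing subfolder under the root folder.
quicksight.create_folder(
    AwsAccountId=account_id,
    FolderId="oktank-marketing",
    Name="Marketing",
    FolderType="SHARED",
    ParentFolderArn=f"arn:aws:quicksight:us-east-1:{account_id}:folder/oktank",
)

# Place an existing dashboard into the Marketing folder.
quicksight.create_folder_membership(
    AwsAccountId=account_id,
    FolderId="oktank-marketing",
    MemberId="marketing-dashboard-id",  # placeholder dashboard ID
    MemberType="DASHBOARD",
)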

The following diagram illustrates this folder hierarchy.

Oktank admins grant owner permissions to the Marketing folder to the marketing analyst team (via QuickSight groups). This allows the analyst team to create subfolders that match expectations of their users and leadership. To streamline access, the marketing analyst team creates two subfolders: Assets and Dashboards. The marketing analyst team uses Assets (/Oktank/Marketing/Assets/) to store datasets and analyses that they need to build and manage dashboards. Because all the marketing analysts have access to this folder, critical assets aren’t disrupted when an analyst is on vacation or leaves the company. Marketing analysts have owner permissions, the admin team has owner permissions, and C-level executives have viewer permissions.

The marketing analyst team uses the Dashboards folder to store dashboards that are shared to all marketing users (via QuickSight groups). All marketing users are granted viewer permissions to this folder (/Oktank/Marketing/Dashboards/); marketing analysts grant themselves owner permissions while the admin team and C-level executives have owner and viewer permissions propagated. For marketing users, access to this folder means that all the dashboards relevant to their roles can be explored in /Oktank/Marketing/Dashboards/, which is available through the Shared Folders link on the home page. The marketing analyst team also doesn’t have to share these assets individually or worry about permissions being missed out for specific users or dashboards.

The sales team needs further division because US and EU have different teams and data. The admin team creates the Sales subfolder, and then creates US and EU subfolders. They grant US sales analysts owner access to the US subfolder (/Oktank/Sales/US/), which gives the analysts the ability to create subfolders and share with users as appropriate. This allows the US sales analyst team to create /Oktank/Sales/US/Assets and /Oktank/Sales/US/Dashboards/. Similar to the marketing team, they can now store their critical datasets, analyses, and dashboards in the Assets folder, and open up the Dashboards folder to all US sales personnel, providing a one-stop shop for their users. The C-level executives have viewer access to these folders and can access these assets and anything added in the future.

Admins and C-level executives see the following hierarchy in their shared folder structure; admins have owner access to all, and C-level executives have viewer access:

Oktank

Oktank > Marketing

Oktank > Marketing > Assets

Oktank > Marketing > Dashboards

Oktank > Sales

Oktank > Sales > US

Oktank > Sales > US > Assets

Oktank > Sales > US > Dashboards

Oktank > Sales > EU

Oktank > Sales > EU > Assets

Oktank > Sales > EU > Dashboards

Oktank > Finance

Oktank > Finance > Assets

Oktank > Finance > Dashboards

A member of the marketing analyst team sees the following:

Oktank

Oktank > Marketing

Oktank > Marketing > Assets

Oktank > Marketing > Dashboards

A member of the Oktank marketing team (e.g., marketing manager) sees the following:

Oktank

Oktank > Marketing

Oktank > Marketing > Dashboards

A member of the US Sales analyst team sees the following:

Oktank

Oktank > Sales

Oktank > Sales > US

Oktank > Sales > US > Assets

Oktank > Sales > US > Dashboards

A member of the Oktank US Sales team (e.g., salesperson) sees the following:

Oktank

Oktank > Sales

Oktank > Sales > US

Oktank > Sales > US > Dashboards
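Oktank can build this entire structure in the QuickSight UI. For teams that prefer to script such a setup, recent AWS SDK versions also expose folder operations. The following is a minimal, hypothetical boto3 sketch; the account ID, group name, and permission actions are illustrative assumptions rather than values from this post:

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")
ACCOUNT_ID = "111122223333"  # hypothetical AWS account ID
EXEC_GROUP = f"arn:aws:quicksight:us-east-1:{ACCOUNT_ID}:group/default/c-level-executives"

# Create the top-level Oktank folder and grant the executive group viewer access.
top = quicksight.create_folder(
    AwsAccountId=ACCOUNT_ID,
    FolderId="oktank",
    Name="Oktank",
    FolderType="SHARED",
    Permissions=[{"Principal": EXEC_GROUP, "Actions": ["quicksight:DescribeFolder"]}],
)

# Create the Marketing subfolder; permissions granted on the parent propagate downward.
quicksight.create_folder(
    AwsAccountId=ACCOUNT_ID,
    FolderId="oktank-marketing",
    Name="Marketing",
    FolderType="SHARED",
    ParentFolderArn=top["Arn"],
)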

Conclusion

QuickSight folders provide a powerful way for admins and authors to organize, manage, and share content while being a powerful discovery mechanism for readers. Folders are now generally available in QuickSight Enterprise Edition in all supported QuickSight Regions.

 


About the Author

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

Creating customized Vega visualizations in Amazon Elasticsearch Service

Post Syndicated from Markus Bestehorn original https://aws.amazon.com/blogs/big-data/creating-customized-vega-visualizations-in-amazon-elasticsearch-service/

Computers can easily process vast amounts of data in their raw format, such as databases or binary files, but humans require visualizations to be able to derive facts from data. The plethora of tools and services such as Kibana (as part of Amazon ES) or Amazon QuickSight for designing visualizations from a data source is a testament to this need.

Such tools often provide out-of-the-box templates for designing simple graphs from appropriately pre-processed data, but applying these to production-grade, complex visualizations can be challenging for several reasons:

  • The raw data upon which a visualization is built may contain encoded attributes that aren’t understandable for the viewer. For example, the following layered bar chart could be raw data about people where the gender is encoded in numbers, but the visualization should still have human-readable attribute values.
  • Visualizations are often built on aggregated summaries of raw data. Storing and maintaining different aggregations for each visualization may be unfeasible. For instance, if you classify raw data items in multiple dimensions (for example, classifying cars by color or engine size) and build visualizations on these different dimensions, you need to store different aggregated materialized views of the same data.
  • Different data views used in a single visualization may require an ad hoc computation over the underlying data to generate an appropriate foundational data source.

This post shows how to implement Vega visualizations included in Kibana, which is part of Amazon Elasticsearch Service (Amazon ES), using a real-world clickstream data sample. Vega visualizations are an integrated scripting mechanism of Kibana to perform on-the-fly computations on raw data to generate D3.js visualizations. For this post, we use a fully automated setup using AWS CloudFormation to show how to build a customized histogram for a web analytics use case. This example implements an ad hoc map-reduce like aggregation of the underlying data for a histogram.

Use case

For this post, we use Online Shopping Store – Web Server Logs published by Harvard Dataverse. The 3.3 GB dataset contains 10,365,152 access logs of an online shopping store. For example, see the following data sample:

207.46.13.136 - - [22/Jan/2019:03:56:19 +0330] "GET /product/14926 HTTP/1.1" 404 33617 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"

Each entry contains a source IP address (for the preceding data sample, 207.46.13.136), the HTTP request and return code (“GET /product/14926 HTTP/1.1” 404), the size of the response in bytes (33617), and additional metadata, such as a timestamp. For this use case, we assume the role of a web server administrator who wants to visualize the response message size in bytes for the traffic over a specific time period. We want to generate a histogram of the response message sizes over a given period that looks like the following screenshot.

In the preceding screenshot, the majority of the response sizes are less than 1 MB. Based on this visualization, the administrator of the online shop could identify the following:

  • Requests that result in high bandwidth use
  • Distributed denial of service (DDoS) attacks caused by repeatedly requesting pages, which causes a high amount of response traffic
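As a side note, if you want to explore the raw logs yourself before they reach Amazon ES, a single access-log line like the sample shown earlier can be split into its fields with a small helper. The following Python sketch is purely illustrative and is simpler than the preprocessing performed by the Lambda function in this post’s CloudFormation template:

import re

# Matches the combined log format of the sample entry shown earlier.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return the source IP, timestamp, request, status code, and response size in bytes."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = ('207.46.13.136 - - [22/Jan/2019:03:56:19 +0330] "GET /product/14926 HTTP/1.1" '
          '404 33617 "-" "Mozilla/5.0 (compatible; bingbot/2.0; '
          '+http://www.bing.com/bingbot.htm)" "-"')
print(parse_line(sample)["size"])  # prints 33617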

The built-in Vertical / Horizontal Bar visualization options in Kibana can’t produce this histogram over this Elasticsearch index in its current form. Generating the preceding histogram requires complex transformations and aggregations over the data that these visualizations can’t express. Storing the data in a histogram-friendly way in Amazon ES and building a visualization with Vertical / Horizontal Bar components requires ETL (extract, transform, load) of the data to a different index. Such an ETL creates unnecessary storage and compute costs. To avoid these costs and complex workflows, Vega provides a flexible approach that can execute such transformations on the fly on the server log data.

We have built some of the necessary transformations outside of our Vega code to keep the code presented in this post concise. This was done to improve the readability of this post; the complete transformation could have also been done in Vega.

Overview of solution

To avoid unnecessary costs and focus on Vega visualization creation task in this post, we use an AWS Lambda function to stream the access logs out of an Amazon Simple Storage Service (Amazon S3) bucket instead of serving them from a web server (this Lambda function contains some of the transformations that have been removed to improve the readability of this post). In a production deployment, you replace the function and S3 bucket with a real web server and the Amazon Kinesis Agent, but the rest of the architecture remains unchanged. AWS CloudFormation deploys the following architecture into your AWS account.

The Lambda function reads and transforms the access logs out of an S3 bucket to stream them into Amazon Kinesis Data Firehose, which forwards the data to Amazon ES. Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics tools, and therefore it is suitable for this task. The available data will be automatically transferred to Amazon ES after the deployment.
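The CloudFormation template provisions this Lambda function for you, so no manual work is required. Purely as an illustration, a stripped-down sketch of such a function could look like the following; the bucket, key, and delivery stream names are hypothetical, and the real function also applies the pre-aggregation described later in this post:

import json

import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

# Hypothetical names; the template in this post creates and wires up its own resources.
BUCKET = "access-log-bucket"
KEY = "logs/access.log"
DELIVERY_STREAM = "vega-visu-blog-stream"

def handler(event, context):
    """Read raw access log lines from Amazon S3 and stream them to Kinesis Data Firehose."""
    body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read().decode("utf-8")
    records = []
    for line in body.splitlines():
        records.append({"Data": (json.dumps({"raw": line}) + "\n").encode("utf-8")})
        if len(records) == 500:  # PutRecordBatch accepts at most 500 records per call
            firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
            records = []
    if records:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)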

Prerequisites and deploying your CloudFormation template

To follow the content of this post, including a step-by-step walkthrough to build a customized histogram with Vega visualizations on Amazon ES, you need an AWS account and access rights to deploy a CloudFormation template. The Elasticsearch cluster and other resources the template creates result in charges to your AWS account. During a test-run in eu-west-1, we measured costs of $0.50 per hour. Complete the following steps:

  1. Depending on which Region you want to use, sign in to the AWS Management Console and choose one of the following Regions to deploy the necessary resources in your account:
  2. Keep all the default parameters as is.
  3. Select the check boxes, I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  4. Choose Create Stack to start the deployment.

The deployment takes approximately 20–25 minutes and is finished when the status switches to CREATE_COMPLETE.

Connecting to Kibana

After you deploy the stack, complete the following steps to log in to Kibana and build the visualization inside Kibana:

  1. On the Outputs tab of your CloudFormation stack, choose the URL for KibanaLoginURL.

This opens the Kibana login prompt.

  1. Unless you changed these at the start of the CloudFormation stack deployment process, the username and password are as follows:

Amazon Cognito requires you to change the password, which Kibana prompts you for.

  1. Enter a new password and a name for yourself.

Remember this password; there is no recovery option in this setup if you need to re-authenticate.

Upon completing these steps, you should arrive at the Kibana home screen (see the following screenshot).

You’re ready to build a Vega visualization in Kibana. We guide through this process in the following sections.

Creating an index pattern and exploring the data

Kibana visualizations are based on data stored in your Elasticsearch cluster, and the data is stored in an Elasticsearch index called vega-visu-blog-index. To create an index pattern, complete the following steps:

  1. On the Kibana home screen, choose Discover.
  2. For Index Pattern, enter “vega*”.
  3. Choose Next step.

  1. For Time Filter field name, choose timestamp.
  2. Choose Create index pattern.

After a few seconds, a summary page with details about the created index pattern appears.

The index pattern is now created and you can use it for queries and visualizations. When you choose Discover, you see a page that shows your index pattern. By default, Kibana displays data of the last 15 minutes, but given that the data is historical, you have to properly select the time range. For this use case, the data spans January 1, 2019 to February 1, 2019. After you configure the time range, choose Update.

In addition to a graphical representation of the tuple count over time, Kibana shows the raw data in the following format:

size_in_bytes: { "size": 13000, "freq": 11 }, …

The following screenshot shows both the visualization and raw data.

As noted before, we executed a pre-aggregation step with the data, which counted the number of requests in the log file with a given size. The message sizes were truncated into steps of 100 bytes. Therefore, the preceding screenshot indicates that there were 11 requests with 13,000–13,099 bytes in size during the minute starting at 13:59 on January 26, 2019. The next section shows how to create the aggregated histogram from this data using a Vega visualization involving a series of data transformations performed implicitly and on the fly, which Amazon ES runs without changing the underlying data stored in the index.
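The pre-aggregation itself is straightforward. The following is a minimal Python sketch of the bucketing logic described above; it is for illustration only, and the function and field names are assumptions that simply mirror the size_in_bytes documents shown earlier:

from collections import Counter, defaultdict

def aggregate_by_minute(log_entries):
    """Group (timestamp, response_size) pairs into per-minute, 100-byte histogram buckets."""
    per_minute = defaultdict(Counter)
    for ts, size in log_entries:
        minute = ts.replace(second=0, microsecond=0)
        bucket = (size // 100) * 100  # truncate sizes into 100-byte steps
        per_minute[minute][bucket] += 1
    # Emit one document per minute in the same shape as the size_in_bytes field above.
    return {
        minute: [{"size": s, "freq": f} for s, f in sorted(counts.items())]
        for minute, counts in per_minute.items()
    }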

Vega on-the-fly data transformation

To properly compute the desired histogram, you need to transform the data, because the format and temporal granularity of the data stored in the Elasticsearch index doesn’t match the format required by the visualization interface. For this use case, we use the scripted metric aggregation pattern, which calls stored scripts written in Painless to implement the following four steps:

  1. Variable initialization
  2. Mapping documents to a key-value structure for the histogram
  3. Grouping values into arrays by keys
  4. Aggregation of array content into the histogram

To implement the first step for initializing the variables used by the other steps, choose Dev Tools on the left side of the screen. The development console allows you to define scripts that are later called by the Vega visualization. The following code for the first step is very simple because it initializes an empty map variable state.test:

#init_script
POST _scripts/initialize_variables
{
  "script": {
    "lang" : "painless",
    "source" : "state.test = [:]"
  }
}

Enter the preceding code into the left-hand input section of the Kibana Dev Tools page, select the code, and choose the green triangular button. Kibana acknowledges the loading of the script in the output (see the following screenshot).

To improve the readability of the rest of this section, we will show the result of each step based on the following initial input JSON:

{
  "timestamp": "2019/01/26 12:58:00", 
  "size_in_bytes": [ 
    {"size": 100, "freq": 369},
    {"size": 200, "freq": 62},
    … 
]}
{
  "timestamp": "2019/01/26 12:57:00", 
  "size_in_bytes": [ 
    {"size": 100, "freq": 386},
    {"size": 200, "freq": 60},
    … 
]}

The next step of the transformation maps the request size in each document of the index to its corresponding count of occurrences as key-value pairs. For example, with the preceding input data, the first size_in_bytes entry is transformed into the key-value pair "100": 369. As in the previous step, implement this part of the transformation by entering the following code into the Kibana Dev Tools page and choosing the green button to load the script.

#map_script
POST _scripts/map_documents
{
  "script": {
    "lang": "painless",
    "source": """
      if (doc.containsKey(params.bucketField))
        for (int i=0; i<doc[params.bucketField].length; i++)
        {
          def key = doc[params.bucketField][i];
          def value = doc[params.countField][i];
          state.test[(key).toString()] = value;
        }
        """
  }
}

Applying this step to our example input data produces the following output:

{"100": 369, "200": 62, …}
{"100": 386, "200": 60, …}

As shown above, the script’s outputs are JSON documents. Because the desired histogram aggregates data over all documents, you need to combine or merge the data by key in the next step. Enter the following code:

#combine_script
POST _scripts/combine_documents
{
  "script": {
    "lang": "painless",
    "source": "return state.test"
  }
}

With the example intermediate output from the previous step, the following intermediate output is computed:

{
  "100": [369, 386],
  "200": [62, 60], …
}

Finally, you can aggregate the count value arrays returned by the combine_documents script into one scalar value per message size, as implemented by the aggregate_histogram_buckets script. See the following code:

#reduce_script
POST _scripts/aggregate_histogram_buckets
{
  "script": {
    "lang": "painless",
    "source": """
      Map result = [:];
      states.forEach(value ->
        value.forEach((k,v) ->
          {result.merge(k, v, (value1, value2) -> value1+value2)}
        )
      );
      return result.entrySet().stream().sorted(
        Comparator.comparingInt((entry) -> 
        Integer.parseInt(entry.key))
      ).map(
        entry -> [
          "messagesize": Integer.parseInt(entry.key),
          "messagecount": entry.value
        ]).collect(Collectors.toList())
      """
  }
}

The final output for our computation example has the following format:

{
  "messagesize": 100,
  "messagecount": 755
}
{
  "messagesize": 200,
  "messagecount": 122
}

This concludes the implementation of the on-the-fly data transformation used for the Vega visualization of this post. The documents are transformed into buckets with two fields, messagesize and messagecount, which contain the corresponding data for the histogram.

Creating a Vega-Lite visualization

The Vega visualization generates D3.js representations of the data using the on-the-fly transformation discussed earlier. You can create this histogram visualization using Vega or Vega-Lite visualization grammars, which are both supported by Kibana. Vega-Lite is sufficient to implement our use case. To make use of the transformation, create a new visualization in Kibana and choose Vega.

A Vega visualization is created by a JSON document that describes the content and transformations required to generate the visual output. In our use case, we use the vega-visu-blog-index index and the four transformation steps in a scripted metric aggregation operation to generate the data property which provides the main content suitable to visualize as a histogram. The second part of the JSON document specifies the graph type, axis labels, and binning to format the visualization as required for our use case. The full Vega-Lite visualization JSON is as follows:

{
  $schema: https://vega.github.io/schema/vega-lite/v2.json
  data: {
    name: table
    url: {
      index: vega-visu-blog-index
      %timefield%: timestamp
      %context%: true
      body: {
        aggs: {
          temporal_hist_agg: {
            scripted_metric: {
              init_script: {
                id: "initialize_variables"
              }
              map_script: {
                id: map_documents
                params: {
                  bucketField: size_in_bytes.size
                  countField: size_in_bytes.freq
                }
              }
              combine_script: {
                id: "combine_documents"
                }
              reduce_script: {
                id: "aggregate_histogram_buckets"
              }
            }
          }
        }
        size: 0
      }
    }
    format: {property: "aggregations.temporal_hist_agg.value"}
  }
  mark: bar
  title: {
    text: Temporal Message Size Histogram
    frame: bounds
  },
  encoding: {
    x: {
      field: messagesize
      type: ordinal
      axis: {
        title: "Message Size Bucket"
        format: "~s"
      }
      bin: {
        binned: true,
        step: 10000
      }
    }
    y: {
      field: messagecount
      type: quantitative
      axis: {
        title: "Count"
        }
    }
  }
}

Replace all text from the code pane of the Vega visualization designer with the preceding code and choose Apply changes. Kibana computes the visualization calling the four stored scripts (mentioned in the previous section) for on-the-fly data transformations and displays the desired histogram. Afterwards, you can use the visualization just like the other Kibana visualizations to create Kibana dashboards.

To use the visualization in dashboards, save it by choosing Save. Changing the parameters of the visualization automatically results in a re-computation, including the data transformation. You can test this by changing the time period.

Exploring and debugging scripted metric aggregation

The debugger included into most modern browsers allows you to view and test the transformation result of the scripted metric aggregation. To view the transformation result, open the developer tools in your browser, choose Console, and enter the following code:

VEGA_DEBUG.view.data('table')

The Vega Debugger shows the data as a tree, which you can easily explore.

The developer tools are useful when writing transformation scripts to test the functionality of the scripts and manually explore their output.

Cleaning up

To delete all resources and stop incurring costs to your AWS account, complete the following steps:

  1. On the AWS CloudFormation console, from the list of deployed stacks, choose vega-visu-blog.
  2. Choose Delete.

The process can take up to 15 minutes, and removes all the resources you deployed for following this post.

Conclusion

Although Kibana itself provides powerful built-in visualization methods, these approaches require the underlying data to have the right format. This creates issues when the same data is used for different visualizations or simply not available in an ideal format for your visualization. Because storing different aggregations or views of the same data isn’t a cost-effective approach, this post showed how to generate customized visualizations using Amazon ES, Kibana, and Vega visualizations with on-the-fly data transformations.

 


About the authors

Markus Bestehorn is a Principal Prototyping Engagement Manager at AWS. He is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C, as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.

 

 

Anil Sener is a Data Prototyping Architect at AWS. He builds prototypes on Big Data Analytics, Streaming, and Machine Learning, which accelerates the production journey on AWS for top EMEA customers. He has two Master Degrees in MIS and Data Science. He likes to read about history and philosophy in his free time.

 

Monitor and Optimize Analytic Workloads on Amazon EMR with Prometheus and Grafana

Post Syndicated from Derek Tan original https://aws.amazon.com/blogs/big-data/monitor-and-optimize-analytic-workloads-on-amazon-emr-with-prometheus-and-grafana/

Amazon EMR is a big data service offered by AWS to run Apache Spark and other open-source applications on AWS in a cost-effective manner. Monitoring Amazon EMR clusters is essential to help detect critical issues with the applications or infrastructure in real time and identify root causes quickly. Viewing how the clusters are being used over time helps operations and engineering teams find potential performance bottlenecks and optimization opportunities to scale out or scale in their clusters and plan for capacity. In this post, we show how to integrate Prometheus, an open-source systems monitoring and alerting tool, and Grafana, an open-source visualization and analytics tool, to provide an end-to-end monitoring system for EMR clusters. Furthermore, we share an example that demonstrates how you can use Prometheus and Grafana to easily identify opportunities to optimize your EMR jobs to improve performance and reduce cost.

This post discusses the following:

  • Installing and configuring Prometheus and Grafana on an Amazon Elastic Compute Cloud (Amazon EC2) instance.
  • Configuring an EMR cluster to emit metrics that Prometheus can scrape from the cluster.
  • Using the Grafana dashboards to analyze the metrics for a workload on the EMR cluster and optimize it.
  • How Prometheus can push alerts to the Alertmanager, which is a component in Prometheus. The Alertmanager sends notifications to the alert_sns_forwarder component, which forwards the notifications to Amazon Simple Notification Service (Amazon SNS).
  • Configuring Amazon SNS to send email notifications.
  • A few considerations when bringing this monitoring system to production.

The following diagram illustrates the solution architecture.

Exporters are agents that gather metrics from the systems being monitored and provide endpoints for Prometheus to poll the metrics. Node_exporter and jmx_exporters collect metrics from the operating system and applications such as YARN and HDFS, respectively, on each node in the EMR cluster.

The Prometheus server polls the endpoints exposed by these exporters on each node to gather the metrics. These metrics are then stored locally on the Prometheus server. When a user opens the Grafana dashboards in their browser, the Grafana server queries the Prometheus server to generate the dashboards that are displayed in the browser.

You can set up alerts in Prometheus; when the alert thresholds are breached, the Prometheus server pushes alerts to the Alertmanager. The Alertmanager takes care of deduplicating, grouping, and routing the notifications to the receivers. Receivers that send events to PagerDuty, Slack, Opsgenie, and others, are natively supported by the Alertmanager.

Because there is no native integration of the Alertmanager with Amazon SNS, we use an implementation of the generic webhook receiver, alert_sns_forwarder, to transform and route the notification message to a pre-configured topic in Amazon SNS. You can subscribe to the topic to receive the alerts via email, SMS, HTTP/S, Amazon Simple Queue Service (Amazon SQS), or AWS Lambda.
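alert_sns_forwarder is an existing component that the CloudFormation template in this post installs and configures for you. To illustrate what such a generic webhook receiver does, here is a minimal, hypothetical Python sketch using Flask and boto3; the port, route, and environment variable are assumptions, not the component’s actual configuration:

import json
import os

import boto3
from flask import Flask, request

app = Flask(__name__)
sns = boto3.client("sns")
TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]  # assumption: topic ARN passed via environment

@app.route("/webhook", methods=["POST"])
def forward_alerts():
    """Receive Alertmanager webhook notifications and publish each alert to Amazon SNS."""
    payload = request.get_json(force=True)
    for alert in payload.get("alerts", []):
        subject = "Prometheus Alert: " + alert.get("labels", {}).get("alertname", "unknown")
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject=subject[:100],  # SNS limits email subjects to 100 characters
            Message=json.dumps(alert, indent=2),
        )
    return "", 200

if __name__ == "__main__":
    app.run(port=9087)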

Prerequisites

Before getting started, you must have the following prerequisites:

Deploying the solution

We provide the CloudFormation template in this post as a general guide. Please review and customize it as needed. You should also be aware that some of the resources deployed by this stack incur costs when they remain in use.

For purposes of this post, the resources are installed in a VPC with a public subnet. We recommend installing the resources in a private subnet where possible for production. In addition, we recommend enabling TLS connections and password authentication from Prometheus to node_exporter and jmx_exporter and for Grafana. To make it easier to troubleshoot the setup, the CloudFormation template includes setting up network ingress access to port 9090 so you can access the Prometheus UI remotely. You can remove this access if not needed.

The CloudFormation template contains several nested templates. Together, they do the following:

  1. Choose the VPC and subnet to deploy this solution.
  2. Create an EC2 instance with the instance type of your choosing.
  3. Download, install, and configure Prometheus as a service with the right scrape configuration to connect to the EMR cluster being monitored.
  4. Download, install, and configure alert_sns_forwarder as a service that transforms alert notifications from the AlertManager to Amazon SNS messages and publishes those messages to Amazon SNS.
  5. Download, install, and configure the Alertmanager as a service that forwards alert notifications from Prometheus server to alert_sns_forwarder.
  6. Set up a sample alert to send notification messages to Amazon SNS when disk space usage is over 90% on any of the nodes in the EMR cluster being monitored.
  7. Download, install, and configure Grafana as a service that connects to the Prometheus data source. The following dashboards are pre-installed to visualize various metrics on the EMR cluster being monitored:
    • OS Level Metrics – Select CPU, memory, and disk metrics exposed by the Amazon Linux operating system.
    • HDFS – DataNode Metrics – Select storage and network metrics exposed by the HDFS data node process.
    • HDFS – NameNode Metrics – Select storage and replication metrics exposed by the HDFS name node process.
    • YARN – Resource Manager – Select resource, application, and container metrics exposed by the YARN resource manager process.
    • YARN – Node Manager – Select resource and container metrics exposed by the YARN node manager process.
    • YARN – Queues – Select resource, application, and container metrics filtered by YARN queues.
    • JVM Metrics – Select memory and garbage collection metrics exposed by JVM of HDFS and YARN processes.
    • Log Metrics – Log fatals, errors, and warnings by the logger of HDFS and YARN processes.
    • RPC Metrics – Select RPC metrics exposed by HDFS and YARN processes.
  8. Create an Amazon EC2 security group. You can configure network access to inbound TCP ports 22 (SSH), Grafana (3000), and Prometheus UI (9090) with parameters in the CloudFormation template. This allows you to lock down access to the Prometheus and Grafana EC2 instance launched to known CIDR scopes and ports.
  9. Create an IAM instance profile that is used to associate with the EC2 instance with Prometheus and Grafana installed.
  10. Create an SNS topic and subscribe the email address provided as a parameter in the template to receive the notifications from Prometheus Alertmanager.
  11. Launch an EMR cluster with a bootstrap action script that does the following:
    • Download and set up node_exporter, which exposes OS metrics to Prometheus, as a service, on all nodes.
    • Download jmx_exporter, which is used by HDFS Name Node, HDFS Data Node, YARN Resource Manager, and YARN Node Manager processes on all nodes to expose application metrics to Prometheus.
    • Configure HDFS Name Node, HDFS Data Node, YARN Resource Manager, and YARN Node Manager processes on the cluster to launch with jmx_exporter as a Java agent.
  12. Create additional master and slave security groups of the EMR cluster to allow network ingress to ports 7001, 7005, and 9100 from the Prometheus server.

Launching the CloudFormation stack

To launch your stack and provision your resources, complete the following steps:

  1. Choose the following Launch Stack link:

This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the template on the AWS CloudFormation console as required. Make sure that you create the stack in your intended Region.

The CloudFormation stack requires a few parameters, as shown in the following screenshot.

The following table describes the parameters:

Parameter | Description | Default Value
Stack name | A meaningful name for the stack; for example, emrPrometheusGrafana. | None
Network Configuration
VPC | The VPC where the EC2 instance and EMR cluster should be launched. | None
Subnet | The subnet where the EC2 instance and EMR cluster should be launched. | None
Amazon EMR Configuration
EMRClusterName | Name of the EMR cluster. | emrPrometheusBlog
MasterInstanceType | Instance type of the master node. | m5.xlarge
CoreInstanceType | Instance type of the core node. | m5.xlarge
CoreInstanceCount | Number of core instances. | 2
EMRSSHIPRange | The IP address range in CIDR notation (for example, <your ip address>/32) for SSHing to the master node of the EMR cluster. If you want to grant access to your local computer’s public IPv4 address, you can go to https://checkip.amazonaws.com/ or run curl https://checkip.amazonaws.com/ in the terminal. Rules with a source of 0.0.0.0/0 allow all IP addresses to access your instance. We recommend setting security group rules to allow access from known IP addresses only. | None
EMRKeyName | An existing EC2 key pair to enable SSH access to the master node of the EMR cluster. | None
EMRReleaseLabel | The Amazon EMR release version. | emr-6.0.0
Amazon EC2 Configuration
InstanceType | The EC2 instance type to install the Prometheus, Alertmanager, alert_sns_forwarder, and Grafana services. Because the instance needs to host the Prometheus server, which works as a time series database storing metrics data, an instance with at least 50 GB disk space would be advisable, depending on the usage. For this post, we choose t3.small. | t3.small
KeyName | An existing EC2 key pair to enable SSH access to the instance. | None
SSHIPRange | The IP address range in CIDR notation (for example, <your ip address>/32) for SSHing to the EC2 instances. To grant access to your local computer’s public IPv4 address, go to http://checkip.amazonaws.com/ or run curl http://checkip.amazonaws.com/ in the terminal. | None
User Interfaces Network Ingress Access
GrafanaIPRange | The IP address range in CIDR notation (for example, <your ip address>/32) for accessing the Grafana dashboards on port 3000. To grant access to your local computer’s public IPv4 address, go to http://checkip.amazonaws.com/ or run curl http://checkip.amazonaws.com/ in the terminal. | None
PrometheusUIIPRange | The IP address range in CIDR notation (for example, <your ip address>/32) for accessing the Prometheus UI on port 9090. To grant access to your local computer’s public IPv4 address, go to http://checkip.amazonaws.com/ or run curl http://checkip.amazonaws.com/ in the terminal. | None
Alerts from Prometheus
EmailAddress | The email address that you use to subscribe to alerts from Prometheus. You must confirm this subscription via an email message from Amazon SNS. | None

 

  1. Enter the parameter values from the preceding table.
  2. Choose Next.
  3. On the next screen, enter any required tags, an IAM role, or any advanced options.
  4. Choose Next.
  5. Review the details on the final screen and select the check boxes confirming AWS CloudFormation might create IAM resources with custom names or require CAPABILITY_AUTO_EXPAND.

  6. Choose Create.

Stack creation takes a few minutes. After the CloudFormation stack is created, on the Outputs tab, you can find the following three key-value pairs:

    • ClusterId – The ID of the EMR cluster created.
    • MasterPublicDnsName – The public DNS name of the master node of the EMR cluster.
    • WebsiteURL – The URL for the newly created Grafana dashboard. The default login and password are both admin. You are prompted to change the password the first time you log in.

You should also receive an email from no-reply@sns.amazonaws.com asking you to confirm the subscription from Amazon SNS.

  1. Choose the Confirm subscription link in your email.

You have now subscribed your email to the SNS topic that alerts from Prometheus are published to.

Workload example

The following use case demonstrates how you can use the insights of Amazon EMR metrics from the Grafana dashboards to tune the performance of a Hadoop job running on the cluster.

The job for this use case is a simple WordCount that counts the number of words in the input files. You can download the source code for the WordCount program from sourcecode.zip. It is a basic MapReduce program. You use the Yelp business review dataset from Yelp Open Dataset. The original data is in JSON format. For this post, convert the same dataset to GZIP (2.4 GB in size) and BZIP2 (1.8 GB in size) formats. You run this WordCount job on the EMR cluster you launched earlier.

Start the job by adding a step to the cluster with the following code (replace <j-*************> with the cluster ID specified on the Outputs tab of the CloudFormation stack you created and also <s3://bucket-name/outputs-folder> with the Amazon Simple Storage Service (Amazon S3) location for the job output):

$ aws emr add-steps --cluster-id <j-*************> \
--steps Type=CUSTOM_JAR,Name=WordCountJarGZIP,ActionOnFailure=CONTINUE,Jar=s3://aws-bigdata-blog/artifacts/aws-blog-emr-prometheus-grafana/demo/jars/Wordcount-1.0-SNAPSHOT-job.jar,Args=com.amazonaws.support.training.emr.Wordcount,s3://aws-bigdata-blog/artifacts/aws-blog-emr-prometheus-grafana/demo/datasets/YelpDataGzip/,<s3://bucket-name/outputs-folder>/outputGzip

You can cancel the step at any time by entering the following code (replace <j-*************> with the cluster ID and <s-*************> with the step ID):

$ aws emr cancel-steps --cluster-id <j-*************> \
--step-ids <s-*************>

When the step is in RUNNING status, go to the Grafana dashboards for metrics insights. The following screenshot shows the YARN – Resource Manager dashboard.

For a very long time during the execution of the job, the Container Stats metrics show that only two containers (one ApplicationMaster container and one mapper container) were allocated, while no containers were pending. This is expected, because the input format for this job is GZIP, which isn’t splittable. Therefore, no matter how big the input file is, it only starts one mapper container because there is only one InputSplit. Container allocation doesn’t increase until the mapper stage finishes and the reducer stage starts.

VCores Utilization shows the same insight: it stayed below 50% for a long time until multiple reducer containers kicked in after the completion of the single mapper container.

The OS Level Metrics dashboard in the following screenshot also shows the resource utilization for CPU and memory was quite low during the mapper stage of this job, which took most of the process time.

For a cluster with one m5.xlarge master and two m5.xlarge core nodes, the job took 30 minutes to finish. Out of the 30 minutes, 28 minutes were used by the mapper stage.

Based on these insights from the dashboards, you can do some performance tuning.

For the same dataset, instead of using GZIP, use BZIP2 format for the input file. Submit the step with the following code (replace <j-*************> with the cluster ID specified on the Outputs tab of the CloudFormation stack and <s3://bucket-name/outputs-folder> with Amazon S3 location for the job output):

$ aws emr add-steps --cluster-id <j-*************> \
--steps Type=CUSTOM_JAR,Name=WordCountJarGBZ2,ActionOnFailure=CONTINUE,Jar=s3://aws-bigdata-blog/artifacts/aws-blog-emr-prometheus-grafana/demo/jars/Wordcount-1.0-SNAPSHOT-job.jar,Args=com.amazonaws.support.training.emr.Wordcount,s3://aws-bigdata-blog/artifacts/aws-blog-emr-prometheus-grafana/demo/datasets/YelpDataBZ2/,<s3://bucket-name/outputs-folder>/outputBZ2_1

When the step is in RUNNING status, check the same dashboards.

The following screenshot shows the YARN – Resource Manager dashboard.

Container Stats shows that seven containers got allocated immediately when the job started. Additionally, the number of pending containers is significantly higher, at 46. You changed the input format to BZIP2 this time, which is a splittable compression format. As a result, multiple mapper containers were launched, each processing one InputSplit. This improved the parallelism at the mapper stage.

VCores Utilization also shows that 100% of the VCores were used during the peak of the job process.

The OS Level Metrics dashboard in the following screenshot shows the resource utilization also increased during this run of the job.

Overall, the job took 11 minutes to finish, which is a 63% performance improvement compared to the previous run.

Can the job run even faster? Based on what you have found from your dashboards, yes. At some point during the second run, 46 pending containers were waiting to be allocated, and VCores Utilization is 100%, which means the cluster allocated all its container resources at capacity. You can resize the cluster by adding a task instance group with 10 m5.xlarge task nodes with the following code (replace <j-*************> with the cluster ID):

$ aws emr add-instance-groups --cluster-id <j-*************> \
--instance-groups InstanceCount=10,InstanceGroupType=TASK,InstanceType=m5.xlarge

The add-instance-groups command returns an output similar to the following code:

{
    "ClusterId": "j-*************",
    "InstanceGroupIds": [
        "ig-************"
    ]
}

Record the value of InstanceGroupIds; you use this later to set the nodes in the instance group to 0.

The resizing is reflected on the YARN – Resource Manager dashboard in the following screenshot.

You run the same job with the same BZIP2 formatted input dataset, using the following code:

$ aws emr add-steps --cluster-id <j-*************> \
--steps Type=CUSTOM_JAR,Name=WordCountJarGBZ2,ActionOnFailure=CONTINUE,Jar=s3://aws-bigdata-blog/artifacts/aws-blog-emr-prometheus-grafana/demo/jars/Wordcount-1.0-SNAPSHOT-job.jar,Args=com.amazonaws.support.training.emr.Wordcount,s3://aws-bigdata-blog/artifacts/aws-blog-emr-prometheus-grafana/demo/datasets/YelpDataBZ2/,<s3://bucket-name/outputs-folder>/outputBZ2_2

This time, all the pending containers immediately got allocated when the job started, because the cluster has more container capacity after the addition of the task nodes.

The job took 4 minutes to complete. This is a 64% performance improvement compared to the second run and 87% performance improvement compared to the first run.

To save cost, after the job is complete, you can scale down the cluster by reducing the task node number in the task instance group that you resized earlier to 0 with the following code (replace <ig-************> with the instance group ID you recorded earlier):

$ aws emr modify-instance-groups \
--instance-groups InstanceGroupId=<ig-************>,InstanceCount=0

Alert example

The CloudFormation stack you created set up an alert to monitor any nodes in the EMR cluster such that if the disk partitions are more than 90% full, an email alert is sent to the email address that you specified.

Follow these steps to write a large file to a disk partition on the master node in the cluster and wait for the system to fire off the email alert:

  1. Connect to the master node in the EMR cluster via SSH with the key:

ssh -i <key-name>.pem hadoop@<ec2 instance public DNS name>

The public DNS name of the master node is specified on the Outputs tab of the CloudFormation stack you created.

  2. Show the current disk usage by entering the following code:
$ df -h /emr
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb1      5.0G   56M  5.0G   2% /emr
  3. Create a file sized 4.5 GB so that the disk usage goes above 90%:
$ fallocate -l 4.5G /emr/test.img
  4. Check the disk usage again to confirm that the disk usage is now above 90%. See the following code:
$ df -h /emr
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb1      5.0G  4.6G  459M  92% /emr

You can also view the disk usage information from the Grafana OS Level Metrics dashboard. In the following screenshot, the /emr partition shows 91.1% used on the Disk Space Used panel.

You should now expect an email with the subject “Prometheus Alert” from no-reply@sns.amazonaws.com in your inbox.

Considerations for production

Prometheus stores the metrics locally on a time series database that is included in the installation. Planning for disk capacity, disk availability, and snapshots for backup are recommended for improving durability.

By default, the metrics are stored for 15 days. You can configure this retention period with the --storage.tsdb.retention.time command line flag. For solutions that provide remote long-term storage and in some cases high availability, see Remote Endpoints and Storage. The Alertmanager supports creating a cluster for high availability. For more information, see High Availability on the GitHub repo.

Another consideration is how to get notified when the Prometheus monitoring system stops running. The Prometheus server and Alertmanager expose their own metrics by default through localhost:9090/metrics and localhost:9093/metrics endpoints, respectively. These metrics can be scraped and published to Amazon CloudWatch, and you can set up CloudWatch alarms to trigger on missing data points for metrics. For more information, see Using Amazon CloudWatch Alarms.
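As one example of this approach, the following hedged Python sketch probes the Prometheus /metrics endpoint and publishes a heartbeat metric to CloudWatch; the namespace and metric name are illustrative, and you would run this on a schedule (for example, via cron or a Lambda function) and alarm on missing data points:

import boto3
import requests

cloudwatch = boto3.client("cloudwatch")

def publish_heartbeat(endpoint="http://localhost:9090/metrics"):
    """Check whether Prometheus responds and publish a 0/1 heartbeat metric to CloudWatch."""
    try:
        response = requests.get(endpoint, timeout=5)
        up = 1 if response.status_code == 200 else 0
    except requests.RequestException:
        up = 0
    cloudwatch.put_metric_data(
        Namespace="MonitoringStack",  # assumption: a custom namespace for these metrics
        MetricData=[{"MetricName": "PrometheusUp", "Value": up, "Unit": "Count"}],
    )

if __name__ == "__main__":
    publish_heartbeat()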

Cleaning up

To avoid ongoing charges, delete the CloudFormation stack and output files in Amazon S3 that you created during the workload use case.

Conclusion

This post showed how you can set up a monitoring system based on Prometheus and Grafana to monitor an EMR cluster and use Grafana dashboards to view metrics to troubleshoot a performance issue. You can also set up alerts in Prometheus to notify you when critical issues arise, and you can view the dashboards to narrow down the root causes. You can extend this monitoring system to monitor multiple EMR clusters and other applications, which makes it a one-stop system for monitoring metrics across your infrastructure and applications.

 


About the authors

Derek Tan is a principal big data architect, covering Amazon EMR and Athena. Prior to working as a Big Data Architect, he was leading engineering teams in Amazon EMR and Amazon Redshift. During his free time, he enjoys traveling, reading, and spending time with his family.

 

 

 

 

Fei Lang is a senior big data architect at AWS. She is passionate about building the right big data solution for customers. In her spare time, she enjoys the scenery of the Pacific Northwest, going for a swim, and spending time with her family.

 

Vortexa delivers real-time insights on Amazon MSK with Lenses.io

Post Syndicated from Andrew Stevenson original https://aws.amazon.com/blogs/big-data/vortexa-delivers-real-time-insights-on-amazon-msk-with-lenses-io/

This is a guest post by Lenses.io. In their own words, “Lenses.io is a group of engineers, designers, developers, data analysts, tinkerers, writers, and open-source contributors. We specialize in the democratizing of data technologies. That’s why we developed Lenses.io (Lenses-eye-oh) to help enterprises simplify their operations by making data work for teams, not the other way around.”

This post discusses how Vortexa harnesses the power of Apache Kafka to improve real-time data accuracy and accelerate time-to-market by using a combination of Lenses.io for greater observability and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to create clusters on demand.

Vortexa provides a real-time analytics platform for seaborne oil supply, which enables energy market players to make smarter trading and shipping decisions. Their service needs to process tens of millions of data points per day, and instant insights on the slightest changes in ship movements are crucial to their customers.

That’s why Vortexa’s data analytics team relies on powerful data streaming technologies such as Apache Kafka to make their real-time applications work.

However, as they scaled their self-managed Apache Kafka environment, they saw several challenges emerge:

  • Debugging incidents took days. Investigation of incidents involved a series of in-house developed scripts and open-source tools.
  • The smallest mistake could bring down a whole cluster. The team ran the risk of downtime due to difficulties in manual configuration and a nebulous command line.
  • Slowdowns affected production. This meant more time spent on everyday tasks, such as assessing the data quality, and less time shipping new features to market.
  • Specialist skills were difficult to come by. Having a specialist in-house for every data platform permutation—like someone who understood how to both build and operate Apache Kafka—wasn’t wise or sustainable.

And so Fridays for Vortexa became “Kafka Fridays.” The fragilities and complexities in an all-important Apache Kafka infrastructure cost the team their Fridays and often their weekends, and also ate into their release schedules.

Self-healing and observability save the day

To address this, the Vortexa team evaluated several managed solutions, and found that Amazon MSK best met their requirements and budget. Amazon MSK continuously monitors cluster performance, is self-healing, and automatically replaces unhealthy nodes without downtime.

Already users of multiple other AWS services, Vortexa benefited from Amazon MSK integrating with other AWS services such as Amazon Virtual Private Cloud (Amazon VPC), AWS Key Management Service (AWS KMS), and AWS Identity and Access Management (IAM). Amazon MSK can also integrate with Amazon CloudWatch, AWS CloudTrail, and Lenses.io for more effective, integrated monitoring and logging of issues.

However, Apache Kafka was still a black box when it came to understanding their deployed flows. Vortexa uses Lenses.io as their data operations portal, which gives them Apache Kafka monitoring, data observability and governance control for their real-time applications that run across their Amazon MSK and Kubernetes platform. With a clear UI, they can troubleshoot a problem on a streaming application within a few minutes compared to hours taken before, which led to an improvement of at least 30-fold.

“Lenses.io gives us a complete vision and extra management capabilities for our streaming applications in Apache Kafka clusters,” says Nguyen Lam Phuc, Senior DevOps Engineer at Vortexa. “Testing as well as operations suddenly became painless and felt much more intuitive than ever before.”

Previously, the team also needed to manually run commands, add connectors involving trial and error, and implement complex deployment procedures—and cross their fingers that this would work.

“In our four-year history, we’ve had at least five different ways of operating, and four different ways of monitoring Apache Kafka,” says Jakub Korzeniowski, Head of Data Services at Vortexa. “The combination of Lenses.io and Amazon MSK has been the only solution that allows us to focus on business logic instead of concentrating the majority of our efforts on just meeting SLAs.”

Now, through a single dashboard, Vortexa can use Lenses.io to view their application topology and expose consumer lag metrics that they can visualize and alert in Lenses.io before forwarding to third-party tools such as Slack and PagerDuty.

Open monitoring with Amazon MSK and Lenses.io

You can monitor the health of the Amazon MSK infrastructure within external tools using the AWS open monitoring framework. Vortexa uses open monitoring with Prometheus to access JMX metrics that Amazon MSK brokers generate. The following diagram illustrates this architecture.

Value in data quality and release velocity for Vortexa

Maksym Schipka, Vortexa’s CTO, explains how Amazon MSK and Lenses.io saved 15% of working hours in Apache Kafka management: “At least twice a week, we would experience issues with Kafka. We were spending so long trying to understand the issue that it was more efficient for us to replace the cluster with a completely new one.

“Amazon MSK and Lenses.io have been pivotal technologies for Vortexa, enabling us to shift significant efforts from maintaining and stabilizing a complex and fragile Kafka infrastructure to focusing more on the quality of analytics and market insights that directly impact the value we deliver to our customers.”

“We can now confidently and frequently update our applications at scale, run complex Kafka streams topologies with ease, and debug applications instead of infrastructure. Our revenue and market share are directly impacted by the speed at which we deliver accurate data to the market, and Amazon MSK and Lenses.io are key enablers of that.”

Fast-tracking to production in minutes, not days

Vortexa can now build new data pipelines that integrate data across their different technologies, including Amazon Elasticsearch Service and Redis, and deploy into production in minutes.

Everyday tasks, such as inspecting messages on a stream with SQL, understanding the profile of a topic, or verifying the replication factor are now easy and performed in under 10 seconds, without the need to have specific in-house Apache Kafka expertise.

“Amazon MSK with Lenses SQL is one step towards making Kafka feel like any other database,” says Romain Guion, Head of Signal Processing and Enrichment at Vortexa. “It allows us to verify the outputs of our models well, which recently resulted in a 90% decrease in AIS data affected by signal noise and spoofing.”

The Vortexa team sees a real-time topological view of their entire data landscape, including their flows to Elasticsearch and Redis and inside their Apache Kafka streams applications.

Conclusion

Using Amazon MSK and Lenses.io helped Vortexa to:

  • Save 15% of working hours on Apache Kafka management
  • Deploy new Amazon MSK clusters into production in minutes, not days
  • Increase Apache Kafka users (both operators and application developers) by 200%
  • Reduce automatic identification system signal noise by 90%, aided by quality assurance done exclusively in Lenses.io

To learn how to get greater visibility into your data streams and streaming platform, visit Lenses.io. For information on how to build and run production applications on Apache Kafka without needing Apache Kafka infrastructure management expertise, visit Amazon Managed Streaming for Apache Kafka.

 


About the author

Andrew Stevenson is the Chief Technology Officer and co-founder of Lenses.io. He leads Lenses.io’s world-class engineering team and technical strategy.

 

 

Build a distributed big data reconciliation engine using Amazon EMR and Amazon Athena

Post Syndicated from Sara Miller original https://aws.amazon.com/blogs/big-data/build-a-distributed-big-data-reconciliation-engine-using-amazon-emr-and-amazon-athena/

This is a guest post by Sara Miller, Head of Data Management and Data Lake, Direct Energy; and Zhouyi Liu, Senior AWS Developer, Direct Energy.

Enterprise companies like Direct Energy migrate on-premises data warehouses and services to AWS to achieve fully manageable digital transformation of their organization. Freedom from traditional data warehouse constraints frees up time and resources to focus on business innovation, and for Direct Energy, building data models that allow us to focus on customer satisfaction. These projects typically start by replicating data from source relational database management systems (RDBMS) into a single data repository, also known as a data lake. Before you can use this data to drive business insights, you should test it for quality and integrity. The quality of your data has a ripple effect on the insights you can extract and the trust people have in your insights.

Our post tackles data quality and integrity as we introduce a reconciliation tool that uses Amazon EMR and Amazon Athena to provide you with a cloud-native engine to validate vast amounts of data effectively at scale. The engine provides customized summary reports so you can interactively analyze the quality of your data. We also discuss the clarity that the engine has brought to the performance and quality of source ingestion and extract, transform, and load (ETL) processes.

About Direct Energy

Direct Energy is a North American retail energy and energy services provider and a subsidiary of the UK-based utility company, Centrica. Serving over four million residential and business customers, we strive to give you choice, simplicity, and innovation with simple solutions to track, understand, and control the electricity and natural gas you use.

Teams across Direct Energy came together to ensure a successful transition from on-premises SQL Server data warehouses to AWS data lakes, to include the following:

  1. Our Management Information Systems (MIS) team manages core AWS infrastructure, including AWS Identity and Access Management (IAM), Amazon EMR configurations, and more. The team also manages the ingestion of raw data directly from source data warehouses and billing systems housed on SQL Server, Oracle, and MySQL.
  2. Our Data Engineering team, embedded in business teams, picks up the raw source data after it lands in Amazon Simple Storage Service (Amazon S3), casts data types, merges data for select tables, and transforms the data to build out our business data warehouse, data marts, and datasets using PySpark on Amazon EMR. This includes the orchestration of dependencies across these steps.
  3. Completed data marts and datasets are made available to the business through Amazon Redshift and Microsoft Power BI for reporting and analytical modeling processes, such as churn propensity, customer lifetime value, price elasticity, and segmentation.

Due to the various layers and intricacies required to build out our finalized business data marts and datasets, testing is required at each step.

Why Direct Energy needed a reconciliation engine

The Data Engineering team at Direct Energy needed an automated way to check data quality at the row and field levels for data stored in Amazon S3 and those produced by AWS Glue and viewed through Athena against on-premises source systems.

Initially, our MIS team ran daily row counts against source and Athena and was held to a +/-99% SLA, accounting for potential timing issues. We quickly found this process had issues: this only checks that the row count matches between systems and doesn’t adequately profile data or measure comprehensive data quality. Furthermore, the target data on Amazon S3 should match exactly with the source system, but it’s less efficient to run row- and field-level checks on large volumes of data.

Beyond this, through the decommissioning of one of our major customer data warehouses, we rewrote over 350 SQL Server stored procedures in PySpark. Another layer of testing is required here to debug our code and to ensure our ETL processes are producing the same results that they previously produced from our on-premises server. This includes the testing of both staging and final tables.

Because we can’t do manual testing efficiently at this scale, we needed a solution that would enable automated testing of data quality at the row and field levels. We architected and developed a reconciliation engine called Pythagoras to randomly select a sample of records to check cell by cell. The tool runs new samples daily to ensure better coverage. This check validates whether individual values match between tables on Amazon S3 and in our source systems.

Pythagoras architecture

We use a config.yaml file to define the source databases, source tables, how to connect to source systems, reconciliation tasks, and other parameters required to specify the task. A sample ratio for the on-premises server tables is also defined here.

We use table_cols_mapping.json to define the column mapping between our on-premises server tables and the tables we expect to see when querying through Athena.

The reconciliation engine outputs a report, which is saved directly to Amazon S3 and can be viewed with Athena. Here, we can check results in SQL or PySpark or can export to Microsoft Excel.

The following diagram provides a walkthrough of the process. We use Apache Spark running on Amazon EMR to execute the reconciliation engine steps: preprocess, comparison, and postprocess.

Preprocess

The preprocessing component in the preceding architecture prepares data for reconciliation, and the output is used as an input in the comparison component. The preprocess step is responsible for several important functions:

  • Parsing the config.yaml file, which defines parameters such as connection strings for data sources, the log path, performance-tuning parameters, the location of reconciliation reports, and more. In addition, we add the reconciliation group and reconciliation task to this file to instruct Pythagoras to run the data integrity testing. This information tells the PySpark job where to fetch the data and the random sample ratio percentage to use in the reconciliation run. See the following code example:
    data_rec:
    - rec_task_1:
      - rec_task_1:
          sample_ratio: 10  #percent
          sourceA: fruit1
          sourceB: fruitA
          sourceA_database: database_sourceA
          sourceB_database: database_sourceB

  • We fetch data from the two data sources in this step.
  • Parsing the tablecolmap.json file, which maps the columns to be reconciled between the two data sources. For example, in the following JSON code, the value of column fruitA from data source B needs to be compared with the value of column fruit1 from data source A. The same logic applies to columns animalB and animal2.
    "rec_task_1": {
      "cols_mapping_b_a": {
        "fruitA": "fruit1",
        "animalB": "animal2"
      }
    }

  • If we have a record from data source A, how do we find the corresponding record in data source B? The answer is a key column that exists in both sources and is unique. The component looks up the user-defined key in the joined_keys.py file, joins the two datasets into a single data frame on that key, and passes the result to the comparison component.
  • Normalizing data types for both data sources, if needed, also happens in this step. For example, if the value of column x from source A is 1 stored as an integer and the value of column x from source B is 1 stored as a string, the values are treated as different without data type normalization, even though they are the same if we ignore the formatting. A minimal PySpark sketch of this sample, join, and normalize flow follows this list.
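The following is a minimal, illustrative PySpark sketch of the preprocess step. The table names, key column, and loading details are assumptions; the column mapping and sample ratio follow the config.yaml and tablecolmap.json examples above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pythagoras-preprocess").getOrCreate()

# Both sources are assumed to be available as Spark tables (for example, an
# on-premises extract loaded over JDBC and data lake tables via the Glue Data Catalog).
source_a_df = spark.table("database_sourcea.rec_task_1_table")  # hypothetical table name
source_b_df = spark.table("database_sourceb.rec_task_1_table")  # hypothetical table name

# Randomly sample the on-premises table, per the sample_ratio defined in config.yaml.
source_a_df = source_a_df.sample(fraction=0.10)

# Column mapping from tablecolmap.json: data source B column -> data source A column.
cols_mapping_b_a = {"fruitA": "fruit1", "animalB": "animal2"}

# The user-defined key from joined_keys.py; it must exist and be unique in both sources.
key = "id"

# Prefix source A columns (except the key) so the joined frame is unambiguous, and
# cast compared columns to string so 1 (integer) and "1" (string) reconcile as equal.
source_a_df = source_a_df.select(
    [F.col(c) if c == key else F.col(c).cast("string").alias(f"sourceA_{c}") for c in source_a_df.columns]
)
for col_b in cols_mapping_b_a:
    source_b_df = source_b_df.withColumn(col_b, F.col(col_b).cast("string"))

# Join the two sources on the key and hand the result to the comparison component.
joined_df = source_a_df.join(source_b_df, on=key, how="inner")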

Comparison

After fetching the output data frame from the preprocess pipeline, the comparison component is ready to perform the reconciliation. The following table shows an example schema of an input data frame.

id | sourceA_animal | sourceA_fruit | sourceA_car | animal | fruit | car
1 | cat | orange | jeep | cat | orange | jeep
2 | dog | apple | toyota | dog | pear | toyota

Columns starting with the prefix sourceA_ contain data from data source A. Columns titled animal, fruit, and car come from data source B. The two sources are linked and joined on the key column id; the key should be uniquely defined in both data sources.

The engine reconciles values from columns in the preceding input data frame. The following table summarizes the output.

id | sourceA_animal | sourceA_fruit | sourceA_car | animal | fruit | car | matched_animal | matched_fruit | matched_car | total_matched
1 | cat | orange | jeep | cat | orange | jeep | TRUE | TRUE | TRUE | TRUE
2 | dog | apple | toyota | dog | pear | toyota | TRUE | FALSE | TRUE | FALSE

Columns with the prefix matched_ indicate the reconciled result of the corresponding cells. For instance, in the second record, column sourceA_fruit is apple and column fruit is pear, so the value of matched_fruit is FALSE.
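A minimal PySpark sketch of how the comparison component could derive the matched_ flags and total_matched follows; the example rows and column names mirror the input data frame shown above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pythagoras-comparison").getOrCreate()

# Example joined frame matching the input table above.
joined_df = spark.createDataFrame(
    [
        (1, "cat", "orange", "jeep", "cat", "orange", "jeep"),
        (2, "dog", "apple", "toyota", "dog", "pear", "toyota"),
    ],
    ["id", "sourceA_animal", "sourceA_fruit", "sourceA_car", "animal", "fruit", "car"],
)

compare_cols = ["animal", "fruit", "car"]

# One matched_* flag per compared column; eqNullSafe treats NULL == NULL as a match.
for col in compare_cols:
    joined_df = joined_df.withColumn(f"matched_{col}", F.col(f"sourceA_{col}").eqNullSafe(F.col(col)))

# total_matched is TRUE only when every per-column flag is TRUE.
total = F.lit(True)
for col in compare_cols:
    total = total & F.col(f"matched_{col}")
result_df = joined_df.withColumn("total_matched", total)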

Postprocess

Based on the detailed reconciled results, the engine generates various customized reports and saves those reports to a configured location on Amazon S3. The end-user can use SQL to query against Athena to quickly analyze the data, or can download the results to Microsoft Excel. In this section, we describe three of the reports we use daily.

The following table shows the reconciliation summary at the column level.

table_name | col_name | num_matched_true | num_matched_false
MyFavorite | matched_animal | 2 | 0
MyFavorite | matched_car | 2 | 0
MyFavorite | matched_fruit | 1 | 1
MyFavorite | total_matched | 1 | 1

This report is generated based on the information from the preceding output. Let’s assume the table is called MyFavorite. The engine counts the matched and unmatched cases over each column and provides an aggregated view of the number of records matching for each column.
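As an illustration, the following PySpark sketch aggregates the matched_ flags from the comparison sketch above into this column-level summary and writes the report to Amazon S3; the report bucket path is hypothetical.

from pyspark.sql import functions as F

# result_df is the output of the comparison sketch above.
matched_cols = ["matched_animal", "matched_fruit", "matched_car", "total_matched"]

summary_df = None
for col in matched_cols:
    col_summary = (
        result_df.agg(
            F.sum(F.col(col).cast("int")).alias("num_matched_true"),
            F.sum((~F.col(col)).cast("int")).alias("num_matched_false"),
        )
        .withColumn("table_name", F.lit("MyFavorite"))
        .withColumn("col_name", F.lit(col))
        .select("table_name", "col_name", "num_matched_true", "num_matched_false")
    )
    summary_df = col_summary if summary_df is None else summary_df.union(col_summary)

# Save the report to Amazon S3 so it can be queried with Athena or exported to Excel.
summary_df.write.mode("overwrite").parquet("s3://pythagoras-reports/myfavorite/")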

In a production environment, the Data Engineering team needs to reconcile hundreds of tables in one batch. Pythagoras naturally supports wrapping these into one reconciliation task. In this case, we can generate another report to show the reconciliation summary per reconciliation task.

For example, in addition to the table MyFavorite, the reconciliation task has another table called MySkills, which needs to be reconciled. The following table shows its reconciliation summary at the column level.

table_name | col_name | num_matched_true | num_matched_false
MySkills | matched_sport | 4 | 0
MySkills | matched_instrument | 3 | 1
MySkills | total_matched | 3 | 1

Based on the two preceding summaries, Pythagoras calculates high-level table statistics, as shown in the following table.

table_name | data_quality | total_match_rate | issue_desc
MyFavorite | Red | 50% | columns: fruit
MySkills | Amber | 75% | columns: instrument

In the config.yaml file, the user defines the parameter total_match_rate_threshold. For this use case, let’s say we define it as 50%. If the total_match_rate is greater than 50% and less than 100%, we label data_quality as Amber; if it is less than or equal to 50%, we label it as Red. The column issue_desc shows the list of unmatched columns (all columns with any unmatched values).
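A minimal PySpark sketch of this labeling step follows, assuming a per-table statistics frame with a total_match_rate column. The post only spells out the Amber and Red bands, so treating everything else as Green here is an assumption; in practice the daily report later in this post applies a stricter green cutoff.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pythagoras-postprocess").getOrCreate()

total_match_rate_threshold = 0.5  # total_match_rate_threshold from config.yaml

# Example per-table statistics, matching the summary table above.
table_stats_df = spark.createDataFrame(
    [("MyFavorite", 0.50), ("MySkills", 0.75)],
    ["table_name", "total_match_rate"],
)

table_stats_df = table_stats_df.withColumn(
    "data_quality",
    F.when(F.col("total_match_rate") <= total_match_rate_threshold, F.lit("Red"))
     .when(F.col("total_match_rate") < 1.0, F.lit("Amber"))
     .otherwise(F.lit("Green")),  # assumption: remaining tables are labeled Green
)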

These reports help us assess overall data quality for all tables in one reconciliation task and quickly locate problem tables and columns. We use Athena to query the Pythagoras results using SQL. For example, in the following SQL query, we can filter ingested tables with data quality flagged as Red by Pythagoras (in other words, the MyFavorite table is returned):

SELECT table_name FROM "recon_summary" WHERE data_quality = 'Red'

We also use Athena to query Pythagoras results for every step of our ETL process. We review a daily report that enables us to focus on which steps are having issues and the top columns that are experiencing a mismatch.

In the following output table, we would focus on steps 2, 3, and 9 first, because they have glaring issues and other steps may have dependencies on them, and then come back to step 5 for some minor cleanup. Otherwise, anything above 99.9% is left alone, accounting for timing issues in our billing systems, for example.

table_name | data_quality | total_match_rate | issue_desc
step1 | green | 0.9996608 | columns: col1: 200, col2: 93, col3: 12, col4: 10, col5: 10
step2 | red | 0.82664197 | columns: col5: 30, col3: 22, col4: 16, col1: 15, col2: 14
step3 | red | 0.95370907 | columns: col1: 50, col2: 43, col3: 12, col4: 10, col5: 1
step4 | green | 1 |
step5 | amber | 0.9987953 | columns: col1: 200, col2: 93, col3: 12, col4: 10, col5: 14
step6 | green | 0.99992985 | columns: col1: 25
step7 | green | 0.99950025 | columns: col1: 200, col2: 93
step8 | green | 0.99993417 | columns: col1: 50
step9 | red | 0.24940514 | columns: col1: 19000, col2: 4293, col3: 1400, col4: 1000, col5: 180
step10 | green | 0.99930817 | columns: col1: 573, col2: 420, col3: 120

We can also perform SQL queries in Athena to drill down further into each step to identify each column’s actual match rate for that particular sample. For example, the following table looks at step2:

table_name | col_name | num_match_true | num_match_false
step2 | col5 | 277 | 30
step2 | col3 | 285 | 22
step2 | col4 | 291 | 16
step2 | col1 | 292 | 15
step2 | col2 | 293 | 14

For this use case, we want to look at col5 first because 30 records in the sample are unmatched, whereas 277 records are matched, then work our way down the list to clean up each column.

Conclusion

In this post, we discussed how Direct Energy uses a data reconciliation tool called Pythagoras to automate and test data quality checks at scale, using Amazon EMR to verify the data quality and Athena to analyze and report the results. Pythagoras brought significant clarity regarding the performance and quality of both Direct Energy’s source data ingestion and ETL processes, while eliminating the need for manual testing and enabling automated, randomized testing on a much greater scale.

Thus far, the ETL processes for two billing systems have been thoroughly vetted, resulting in 15% and 48% improvements in accuracy. We found that value mismatches are the most common data integrity issue in our data ingestion pipeline. Thanks to Pythagoras, we can quickly and precisely determine these mismatches in large datasets. Data engineering and platform teams then use the data Pythagoras provides to debug our ETL pipelines. After we adjust our pipelines, we run Pythagoras again to ensure the issue is fixed and stays fixed.

The implementation of this tool empowers Direct Energy to decommission widely used data platforms with precision and efficiency, and builds trust in our company’s data quality and integrity across the business.

About the Authors

Sara Miller is the Head of Data Management and Data Lake at Direct Energy, a subsidiary of Centrica. She has been with the organization for more than five years. As a versatile leader proficient in data engineering, mathematics, and statistics, Sara has helped organizations transform their reporting and analytics capabilities and has been instrumental in establishing various data science and analytics teams. She currently manages the end-to-end ETL pipeline for the North America residential portfolio, including the transition from on-premises data warehousing to Amazon Web Services.

Zhouyi Liu is a Senior AWS Developer at Direct Energy, a subsidiary of Centrica. He focuses on big data, machine learning, and AI. He currently works on the end-to-end ETL pipeline for the North America residential portfolio, including the transition from on-premises data warehousing to Amazon Web Services. Outside of work, he enjoys being a father and husband and spending time with his family.

Embed multi-tenant analytics in applications with Amazon QuickSight

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/embed-multi-tenant-analytics-in-applications-with-amazon-quicksight/

Amazon QuickSight recently introduced four new features—embedded authoring, namespaces for multi-tenancy, custom user permissions, and account-level customizations—that, together with the existing dashboard embedding and API capabilities available in the Enterprise Edition, allow you to integrate advanced dashboarding and analytics capabilities within SaaS applications. Developers and independent software vendors (ISVs) who build these applications can now offer embedded, pre-configured (canned) dashboards to all end-users, while also providing sophisticated ad-hoc exploration and dashboard-building capabilities to power users.

In this post, we look at a use case for a fictional ISV and explore how QuickSight makes it easy to embed analytics into the app with no infrastructure to deploy or manage and scale to hundreds of thousands of users.

New features

  • Embedded authoring – You can embed the full dashboard-building experience within a portal or application, including the QuickSight home page, search, and data experiences. This allows ISVs to provide ad-hoc data exploration and authoring capabilities to an application’s power users, who might want to explore usage data, create specific views as dashboards, and share their creations with other users or groups in their organization. For the ISV, this means freedom from having to perform ad-hoc development for every customer request, while for customers, this empowers them to quickly and easily find insights that are relevant to them.
  • Namespaces – With namespaces, you can logically group and isolate sets of users in a QuickSight account. Before using namespaces, all users in a QuickSight account resided in a single (default) namespace, and as a result could see each other (for example, when trying to share content). You can now provision authors and readers from an organization within a unique namespace so they can see each other but are isolated from all other users in the QuickSight account. Namespaces aren’t required for reader-only scenarios, but are important for providing secure multi-tenancy when using embedded authoring so that authors are restricted to sharing and collaborating with their coworkers only. All users provisioned with a QuickSight account by default (via UI or SSO) exist in the default namespace; non-default namespaces are currently only accessible for federated single sign-on users.
  • Custom user permissions – This feature allows you to customize author permissions—for example, you can remove the ability to create a data source or dataset. This allows admins to provide a restricted set of capabilities to embedded or regular authors, and creates a curated experience for authors by sharing only specific datasets or data sources that are required for the use case.
  • QuickSight customizations – You can turn off the sample analyses and video content in QuickSight and also specify a default theme at a namespace level.

Multi-tenant embedded analytics architecture

The following diagram illustrates how authors reside within namespaces and how they tie into the overall AWS account.

Without namespaces, developers could set up their own embedded dashboards for hundreds of thousands of users with QuickSight. For example, see the following dashboard for our fictional company, Oktank Analytica.

With namespaces in place, you can extend this to provide ad-hoc authoring capabilities using curated datasets specific to each customer, created and shared by the developer or ISV. See the following screenshot.

Use case

With this end-result as a target, let’s explore how Oktank Analytica implements such multi-tenant analytics—with both canned dashboards and ad-hoc analysis and dashboard-building capabilities—in its existing SaaS application. For simplicity, we assume that Oktank has two customers, with two authors and two readers within each customer. They also have a development team that uses the default namespace to develop content.

If Oktank only provided embedded dashboards for all users, we could simplify this implementation using just a single namespace for all—dashboards are view-only, so the readers provisioned for each of the customers can’t discover or view each other. This could simply use the QuickSight row-level security feature to make sure appropriate data is displayed to the right users or groups. For more information, see Using Row-level Security (RLS) to Restrict Access to a Dataset.

The following table summarizes Oktank’s namespaces.

        | Default Namespace | FooCompany Namespace | Foo2Company Namespace
Readers | Bob, Sally | Peter, David | Julie, Tom
Authors | Tom, Susan | Jill, Emma | John, Nancy
Admins  | Mike, Andrew | |
Notes   | Development team | Foo Company users | Foo2 Company users

When provisioning these users, Oktank uses the default namespace (created as part of QuickSight account setup) for the development team, and provisions admin, authors, and readers as they do normally. They set up customer users after creating the namespaces. For this post, we walk you through the example of Foo Company.

Users in the default namespace are regular QuickSight authors and admins because these are users from Oktank Analytica. However, Oktank wants to restrict the users from FooCompany and Foo2 Company from discovering any assets within the account and limit them to the specific datasets shared with them.

Creating a namespace

To implement this solution, we first create the namespace. See the following code:

aws quicksight create-namespace --aws-account-id 111122223333 --region us-east-1 --namespace FooCompanyNamespace --identity-store QUICKSIGHT 
{
    "Status": 202,
    "Name": "FooCompanyNamespace",
    "CapacityRegion": "us-east-1",
    "CreationStatus": "CREATING",
    "IdentityStore": "QUICKSIGHT",
    "RequestId": "9576f625-39b4-47ee-b56a-bcb95ed1f040"
}

This operation is asynchronous, so we wait and verify that namespace creation finished:

aws quicksight describe-namespace --aws-account-id 111122223333 --region us-east-1 --namespace FooCompanyNamespace
{
    "Status": 200,
    "Namespace": {
        "Name": "FooCompanyNamespace",
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:namespace/FooCompanyNamespace",
        "CapacityRegion": "us-east-1",
        "CreationStatus": "CREATED",
        "IdentityStore": "QUICKSIGHT"
    },
    "RequestId": "9a00b6bd-02b1-471a-9eaf-30db68bedd99"
}

Customizing QuickSight

Now that the namespace is created, we can customize QuickSight to make sure it works well in the multi-tenant setup.

First, we turn off the instructional videos and samples in QuickSight, because Oktank has its own product welcome videos and wants to include QuickSight onboarding in those. QuickSight admins can access this on the Account customization page.

Customizing user permissions

Next, we use custom user permissions to restrict users from creating new data sources or datasets. This allows Oktank to create users that can explore data and create dashboards but not connect to data sources outside of what is shared in the application.

To do this, create a custom user permissions policy. On the Manage users page, choose Manage permissions.

In the policy, restrict the user to only using datasets that Oktank admins share. You can expect more features in the future than what is currently available in this screenshot.

Creating a user account and group

With the custom user permissions policy available, we can create the user account for Jill, who is an author in Foo Company. See the following code:

aws quicksight register-user --namespace FooCompanyNamespace --identity-type QUICKSIGHT --user-role AUTHOR --region us-east-1 --custom-permissions-name DataExploration --email [email protected] --user-name jill --aws-account-id 111122223333

{
    "Status": 201,
    "User": {
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/jill",
        "UserName": "jill",
        "Email": "[email protected]",
        "Role": "AUTHOR",
        "Active": false,
        "PrincipalId": "user/d-90677fdc8e/086f9e70-f140-4ac0-9d10-7a21fa718bb9",
        "CustomPermissionsName": "DataExploration"
    },
    "UserInvitationUrl": "https://signin.aws.amazon.com/inviteuser?token=11a1mJtOYlD9T-quoo8b7tJK7bd4_Sa0lPb8Wdr9AW5p29NA7m30lvLKvomMNPnKIaZr3lmLwFf3E0tij5fPj5R9XkT4dSf5b11xsO8MnfNOmerqtHdrgt2StqywbMP2PNUuii1Pz3Xz8pIOIS_4xzFhtPuNwSuiP7JALO5kK3So_HuNdhZn_WLNjiMj47u_dq-NvuBrnZB8Lc8w",
    "RequestId": "78d36175-5d67-4fa9-b45f-124bbf4806dc"
}

For convenience, we can also create groups for authors and readers within FooCompanyNamespace.

aws quicksight create-group --namespace FooCompanyNamespace --region us-east-1 --aws-account-id 111122223333 --group-name="all-authors"
{
    "Status": 201,
    "Group": {
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:group/FooCompanyNamespace/all-authors",
        "GroupName": "all-authors",
        "PrincipalId": "group/d-90677fdc8e/3ec30833-3a1c-4e67-ac65-92950a770c0e"
    },
    "RequestId": "b70a78c8-2453-4b47-8a49-f1f9058a0716"
}

aws quicksight create-group-membership --namespace FooCompanyNamespace --region us-east-1 --aws-account-id 111122223333 --group-name="all-authors" --member-name="jill"
{
    "Status": 200,
    "GroupMember": {
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/jill",
        "MemberName": "jill"
    },
    "RequestId": "593d1774-27fd-4b11-8fa0-78025b819a8f"
}

aws quicksight create-group-membership --namespace FooCompanyNamespace --region us-east-1 --aws-account-id 111122223333 --group-name="all-authors"  --member-name="emma"
{
    "Status": 200,
    "GroupMember": {
        "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/emma",
        "MemberName": "emma"
    },
    "RequestId": "32e65803-497b-475c-a959-d334403320ac"
}
aws quicksight list-group-memberships --namespace FooCompanyNamespace --region us-east-1 --aws-account-id 111122223333 --group-name="all-authors"
{
    "Status": 200,
    "GroupMemberList": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/emma",
            "MemberName": "emma"
        },
        {
            "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/jill",
            "MemberName": "jill"
        }
    ],
    "RequestId": "7085373a-53a0-4209-9d6d-cf685caa3184"
}

Sharing assets

Oktank can now use this group to share assets with the users. For example, if Oktank wants to share a dataset with these users, they can use the following code:

aws quicksight update-data-set-permissions --cli-input-json file://datapermissions.json

The datapermissions.json file contains the following:

{
    "AwsAccountId": "111122223333", 
    "DataSetId": "011cef30-cd53-425c-827b-045a171f90f3", 
    "GrantPermissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:111122223333:group/FooCompanyNamespace/all-authors",
            "Actions": [
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions"
            ]
        }
    ]
}

Similarly, Oktank can share a pre-built dashboard with this group:

aws quicksight update-dashboard-permissions --cli-input-json file://dbpermissions.json

The dbpermissions.json file contains the following:

{
    "AwsAccountId": ""111122223333", ", 
    "DashboardId": "Oktank-supply-chain-dashboard",
    "GrantPermissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:111122223333:group/FooCompanyNamespace/all-authors", 
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:QueryDashboard"
                
            ]
        }
    ]
}

If needed, you can also share dashboards with the namespace, which makes it easy to make sure access is always granted to new users in the namespace.
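For example, the following is a minimal sketch, assuming boto3, of granting the same dashboard actions to the namespace ARN so that every user in FooCompanyNamespace gets access; the dashboard ID and actions mirror the preceding example.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

# Granting permissions to the namespace ARN shares the dashboard with all users in that namespace.
quicksight.update_dashboard_permissions(
    AwsAccountId="111122223333",
    DashboardId="Oktank-supply-chain-dashboard",
    GrantPermissions=[
        {
            "Principal": "arn:aws:quicksight:us-east-1:111122223333:namespace/FooCompanyNamespace",
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:QueryDashboard",
            ],
        }
    ],
)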

Creating and sharing a default theme

Before we start with the embedded authoring flow, we can set up a default theme for authors in the Foo namespace. The theme editor in QuickSight allows you to change the colors, fonts, and layouts that are used in a dashboard.

After you create a custom theme, you can share it with the namespace so all users within the namespace have access to it. See the following code:

aws quicksight update-theme-permissions --region us-east-1 --aws-account-id 111122223333 --cli-input-json file://permissions.json

The permissions.json file contains the following:

{
    "AwsAccountId": "111122223333", 
    "ThemeId": "25515eb4-e7e3-4a68-a274-4a863bd79d81", 
    "GrantPermissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:111122223333:group/FooCompanyNamespace/all-authors",
            "Actions": [
    "quicksight:DescribeTheme",
    "quicksight:DescribeThemeAlias",
    "quicksight:ListThemeAliases",
    "quicksight:ListThemeVersions"
            ]
        }
    ]
}

aws quicksight create-account-customization --region us-east-1 --aws-account-id 111122223333 --namespace=FooCompanyNamespace --account-customization DefaultTheme="arn:aws:quicksight:us-east-1:111122223333:theme/25515eb4-e7e3-4a68-a274-4a863bd79d81"

{
    "Status": 201,
    "AwsAccountId": "111122223333",
    "Namespace": "FooCompanyNamespace",
    "AccountCustomization": {
        "DefaultTheme": "arn:aws:quicksight:us-east-1:111122223333:theme/25515eb4-e7e3-4a68-a274-4a863bd79d81"
    },
    "RequestId": "221ac63b-8c71-48c2-95e4-ffd9b5476e07"
}

Setting up the embedded authoring experience

Now that we have created the namespace and users, customized QuickSight, enabled a default theme, and shared the theme and dataset, we can set up the embedded authoring experience.

First, make sure that the domain where you embed QuickSight is allowed in your admin settings—access this on the Domains and embedding page.

Next, make sure that your application server has the permissions to invoke the get-dashboard-embed-url and get-session-embed-url commands, which you need for embedding dashboards and the authoring interface, respectively.

For user authentication, QuickSight supports both AWS Identity and Access Management (IAM) federated users and direct QuickSight federation. Both options mean that your end-users never see a QuickSight login page and are simply authenticated by your server. In this use case, we use QuickSight federation because we use a QuickSight identity type for Jill and Emma. See the following code:

aws quicksight get-session-embed-url --aws-account-id 111122223333 --region us-east-1 --user-arn arn:aws:quicksight:us-east-1:111122223333:user/FooCompanyNamespace/jill --entry-point /start/favorites 
{
    "Status": 200,
    "EmbedUrl": "https://us-east-1.quicksight.aws.amazon.com/embedding/eab15737343f4335867483528cd53d20/sn/data-sets?code=AYABeK_9AGgfIBA6_NSqDti_M1cAAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy1lYXN0LTE6MjU5NDgwNDYyMTMyOmtleS81NGYwMjdiYy03MDJhLTQxY2YtYmViNS0xNDViOTExNzFkYzMAuAECAQB4P-lAb3AsrVHPwO-wVCEmuFDTp0yz4wFfkrwutzPupewBHRHPzBvGEF0mwTMFwR7fSQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDAYnevNrQG_42UsUigIBEIA7_amsZyNiF2wOi-LEXZq8X3ToZ3LwlTCAGHhNli8208lv5zIfjEbSr9zOxeF4SsyEWryxNzVYiXd6kg4CAAAAAAwAABAAAAAAAAAAAAAAAAAACuwtWXlySImCI98W0x6jkv____8AAAABAAAAAAAAAAAAAAABAAAA5i3XsPdfGRy9rMCB-EM39rDeOgFDMZKIFqA3lQcLZI_nXith2wGH_1dDP8n5uL1BFLIxVURk3caapiNXyDMxDkHS9x22U1w2TH6wZlQ0_Nd4Eqzn05rPowTzAXU0yiG1nTXo6Rv-_p01tQ1g2IVzm4ykigGXwuxD92ekgKsgEE46OvTwjBRQMtsUaaJKKKcl0J-whFJ3G-p2ATZMcVKACwyUtHWScIJFpVOc4AIM0m4u5quf2m9_OyWpAJgzJn_TtyKxkl-jcOht7S9KsIuLEmvnlJTMdUB6FXk6M1-OxyQ_ogoXH2hvYnW7D7GA1yrUR2T2tj29NA%3D%3D&identityprovider=quicksight&isauthcode=true",
    "RequestId": "48d1c858-36fc-421e-a22b-e64ce6e58545"
}

By embedding this URL in the Oktank SaaS application with the QuickSight JavaScript SDK, we can send Jill directly to the home page, which shows the default analyses curated for her. From there, Jill can navigate to the datasets page, where she can access the curated dataset and start exploring from scratch. Because we disabled dataset and data source creation, the embedded authoring interface doesn’t show options to connect to any data sources inside or outside the account; it only displays the specific dataset shared with her.

When Jill tries to explore the data available, QuickSight applies the Oktank theme by default, and allows her to explore data and create dashboards, as needed.

As this dashboard is being built, Jill can share it with Emma, collaborate on it, and decide which users within Foo Company to publish to. With namespaces, Oktank knows that Jill doesn’t see any other users except Emma.

Overall, this set of Amazon QuickSight capabilities allows Oktank to provide a compelling analytics experience within their SaaS portal, while making sure that only the right users see the right data. QuickSight provides a highly scalable, multi-tenant analytics option that you can set up and productionize in days. For more information about this integration, see the tutorial Embed Amazon QuickSight.

Conclusion

The combination of embedded dashboards and authoring capabilities, together with namespaces for multi-tenancy and account customization options, allows developers and ISVs to quickly and easily set up sophisticated analytics for their application users—all without any infrastructure setup or management and scaling to millions of users. For more updates from QuickSight Embedded Analytics, see What’s New in the Amazon QuickSight User Guide!

Additional resources

For more resources, see the following:

About the Author

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.