All posts by Dylan Qu

Simplify your ETL and ML pipelines using the Amazon Athena UNLOAD feature

Post Syndicated from Dylan Qu original https://aws.amazon.com/blogs/big-data/simplify-your-etl-and-ml-pipelines-using-the-amazon-athena-unload-feature/

Many organizations prefer SQL for data preparation because their developers building extract, transform, and load (ETL) jobs and their analysts preparing data for machine learning (ML) already understand and write SQL queries. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

By default, Athena automatically writes SELECT query results in CSV format to Amazon S3. However, you might often need to write SELECT query results in non-CSV formats such as JSON, Parquet, and ORC for various use cases. In this post, we walk you through the UNLOAD statement in Athena and how it helps you implement several use cases, along with code snippets that you can use.

Athena UNLOAD overview

By default, the Athena SELECT query produces output only in CSV format, but you can use UNLOAD to write the output of a SELECT query in other supported formats and compression types. When you use UNLOAD in a SELECT query statement, it writes the results to Amazon S3 in your choice of Apache Parquet, ORC, Apache Avro, TEXTFILE, or JSON.

Although you can use the CTAS statement to output data in formats other than CSV, those statements also require the creation of a table in Athena. The UNLOAD statement is useful when you want to output the results of a SELECT query in a non-CSV format but don’t require the associated table. For example, a downstream application might require the results of a SELECT query to be in JSON format, and Parquet or ORC might provide a performance advantage over CSV if you intend to use the results of the SELECT query for additional analysis.

In this post, we walk you through the following use cases for the UNLOAD feature:

  • Compress Athena query results to reduce storage costs and speed up performance for downstream consumers
  • Store query results in JSON file format for specific downstream consumers
  • Feed downstream Amazon SageMaker ML models that require files as input
  • Simplify ETL pipelines with AWS Step Functions without creating a table

Use case 1: Compress Athena query results

When you’re using Athena to process and create large volumes of data, storage costs can increase significantly if you don’t compress the data. Furthermore, uncompressed formats like CSV and JSON require you to store and transfer a large number of files across the network, which can increase IOPS and network costs. To reduce costs and improve the performance of downstream big data processing applications such as Spark, a best practice is to store Athena output in compressed columnar file formats such as ORC and Parquet.

You can use the UNLOAD statement in your Athena SQL statement to create compressed ORC and Parquet file formats. In this example, we use a 3 TB TPC-DS dataset to find all items returned between a store and a website. The following query joins the four tables: item, store_returns, web_returns, and customer_address:

UNLOAD (
	select *
	from store_returns, item, web_returns, customer_address
	where sr_item_sk = i_item_sk and
	      wr_item_sk = i_item_sk and
	      wr_refunded_addr_sk = ca_address_sk
) to 's3://your-bucket/temp/athenaunload/usecase1/' with (
	format = 'PARQUET',
	compression = 'SNAPPY',
	partitioned_by = ARRAY['ca_location_type']
)

When compressed with Snappy and stored in Parquet format, the query output was a 62 GB dataset. The same output in uncompressed CSV format was 248 GB. The Snappy-compressed Parquet format yielded a 75% smaller storage footprint, thereby saving storage costs and resulting in faster downstream processing.

Use case 2: Store query results in JSON file format

Some systems downstream of Athena, such as web applications or third-party systems, require the data to be in JSON format. The JSON file format is a text-based, self-describing representation of structured data that is based on key-value pairs. It’s lightweight, and is widely used as a data transfer mechanism by different services, tools, and technologies. In these use cases, the UNLOAD statement with the format parameter set to JSON can unload files in JSON format to Amazon S3.

The following SQL extracts the returns data for a specific customer within a specific date range from the 3 TB catalog_returns table and stores it in Amazon S3 in JSON format:

UNLOAD (
		select cr_returned_date_sk, cr_returning_customer_sk, cr_catalog_page_sk, cr_net_loss
		from catalog_returns
		where cr_returned_date_sk = 2450821 and cr_returning_customer_sk = 11026691
	) to 's3://your-bucket/temp/athenaunload/usecase2/' with (
		format = 'JSON', compression = 'NONE'
	)

By default, Athena uses Gzip for JSON and TEXTFILE formats. You can set the compression to NONE to store the UNLOAD result without any compression. The query result is stored as the following JSON file:

{"cr_returned_date_sk":2450821,"cr_returning_customer_sk":11026691,"cr_catalog_page_sk":20.8,"cr_net_loss":53.31}

The query result can now be consumed by a downstream web application.
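
For illustration, a downstream application could read the unloaded JSON files directly from Amazon S3. The following is a minimal sketch of that pattern using boto3; the bucket name and prefix are the placeholder UNLOAD destination from the preceding query, and UNLOAD in JSON format writes one JSON object per line.

import json
import boto3

s3 = boto3.client("s3")

# Placeholders: point these at the UNLOAD destination used above
bucket = "your-bucket"
prefix = "temp/athenaunload/usecase2/"

# UNLOAD in JSON format writes one JSON object per line across one or more files
records = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        for line in body.decode("utf-8").splitlines():
            if line.strip():
                records.append(json.loads(line))

print(records[:1])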

Use case 3: Feed downstream ML models

Analysts and data scientists rely on Athena for ad hoc SQL queries, data discovery, and analysis. They often like to quickly create derived columns such as aggregates or other features. These need to be written as files in Amazon S3 so a downstream ML model can directly read the files without having to rely on a table.

You can also parameterize repetitive queries using Athena prepared statements. Using the UNLOAD statement in a prepared statement gives less technical users, analysts, and data scientists a self-service way to export the files needed for their downstream analysis without having to write queries.
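
The following is a minimal sketch of that pattern using boto3; the statement name, workgroup, and S3 locations are hypothetical, and the SQL mirrors the catalog_returns query from use case 2 with the date and customer keys turned into parameters. As discussed later in use case 4, UNLOAD doesn’t overwrite existing data, so each execution needs an empty destination prefix.

import boto3

athena = boto3.client("athena")
workgroup = "primary"  # hypothetical workgroup name

# One-time setup: a reusable statement that wraps UNLOAD, based on the use case 2 query
athena.create_prepared_statement(
    StatementName="export_customer_returns",  # hypothetical name
    WorkGroup=workgroup,
    QueryStatement=(
        "UNLOAD (SELECT * FROM catalog_returns "
        "WHERE cr_returned_date_sk = ? AND cr_returning_customer_sk = ?) "
        "TO 's3://your-bucket/temp/athenaunload/prepared/' "
        "WITH (format = 'JSON', compression = 'NONE')"
    ),
)

# Self-service execution: an analyst supplies only the parameter values
athena.start_query_execution(
    QueryString="EXECUTE export_customer_returns USING 2450821, 11026691",
    WorkGroup=workgroup,
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},  # query metadata location
)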

In the following example, we create derived columns and perform feature engineering for a downstream SageMaker ML model that predicts the best discount for catalog items in future promotions. We derive averages for quantity, list price, discount, and sales price for promotional items sold in stores where the promotion is not offered by mail or a special event. Then we restrict the results to a specific gender, marital status, and education status. We use the following query:

UNLOAD(
		Select i_item_id, 
	        avg(ss_quantity) avg_sales,
	        avg(ss_list_price) avg_list_price,
	        avg(ss_coupon_amt) avg_coupon_amt,
	        avg(ss_sales_price) avg_sales_price 
	 from store_sales, customer_demographics, date_dim, item, promotion
	 where cast(ss_sold_date_sk AS int) = d_date_sk and
	       ss_item_sk = i_item_sk and
	       ss_cdemo_sk = cd_demo_sk and
	       ss_promo_sk = p_promo_sk and
	       cd_gender = 'M' and 
	       cd_marital_status = 'M' and
	       cd_education_status = '4 yr Degree' and
	       (p_channel_email = 'N' or p_channel_event = 'N') and
	       d_year = 2001 
	 group by i_item_id
	 order by i_item_id
	) to 's3://your-bucket/temp/athenaunload/usecase3/' with (
		format = 'PARQUET',compression = 'SNAPPY'
	)

The output is written as Parquet files in Amazon S3 for a downstream SageMaker model training job to consume.

Use case 4: Simplify ETL pipelines with Step Functions

Step Functions is integrated with the Athena console to facilitate building workflows that include Athena queries and data processing operations. This helps you create repeatable and scalable data processing pipelines as part of a larger business application and visualize the workflows on the Athena console.

In this use case, we produce a query result in Parquet format for downstream consumption. The raw data is in TSV format and gets ingested on a daily basis. We use the Athena UNLOAD statement to convert the data into Parquet format. After that, we send the location of the Parquet files as an Amazon Simple Notification Service (Amazon SNS) notification. Downstream applications can be notified via SNS to take further action. One common example is to invoke an AWS Lambda function that loads the Athena transformation result into Amazon Redshift.

The following diagram illustrates the ETL workflow.

The workflow includes the following steps:

  1. Start an AWS Glue crawler pointing to the raw S3 bucket. The crawler updates the metadata of the raw table with new files and partitions.
  2. Invoke a Lambda function to clean up the previous UNLOAD result (a sketch of such a function follows this list). This step is required because UNLOAD doesn’t write data to the specified location if the location already contains data (UNLOAD doesn’t overwrite existing data). To reuse a bucket location as a destination for UNLOAD, delete the data in the bucket location, and then run the query again. Another common pattern is to UNLOAD data to a new partition with incremental data processing.
  3. Start an Athena UNLOAD query to convert the raw data into Parquet.
  4. Send a notification to downstream data consumers when the file is updated.
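
The cleanup function referenced in step 2 can be a few lines of code. The following is a minimal sketch of what it could look like; the CleanUpS3Folder function deployed by the CloudFormation template in the next section may differ in its details. It reads the bucket name and prefix from the event payload, which matches the payload we configure later in the workflow.

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Delete every object under a prefix so UNLOAD can reuse the location."""
    bucket = event["bucket_name"]
    prefix = event["prefix"]

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": objects})

    return {"status": "cleaned", "bucket": bucket, "prefix": prefix}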

Set up resources with AWS CloudFormation

To set up the resources for this use case, launch the provided AWS CloudFormation template:

Keep all the provided parameters and choose Create stack.

The CloudFormation template creates the following resources:

  • An Athena workgroup etl-workgroup, which holds the Athena UNLOAD queries.
  • A data lake S3 bucket that holds the raw table. We use the Amazon Customer Reviews Dataset in this post.
  • An Athena output S3 bucket that holds the UNLOAD result and query metadata.
  • An AWS Glue database.
  • An AWS Glue crawler pointing to the data lake S3 bucket.
  • A LoadDataBucket Lambda function to load the Amazon Customer Reviews raw data into the S3 bucket.
  • A CleanUpS3Folder Lambda function to clean up previous Athena UNLOAD results.
  • An SNS topic to notify downstream systems when the UNLOAD is complete.

When the stack is fully deployed, navigate to the Outputs tab of the stack on the AWS CloudFormation console and note the value of the following resources:

  • AthenaWorkgroup
  • AthenaOutputBucket
  • CleanUpS3FolderLambda
  • GlueCrawler
  • SNSTopic

Build a Step Functions workflow

We use the Athena Workflows feature to build the ETL pipeline.

  1. On the Athena console, under Jobs in the navigation pane, choose Workflows.
  2. Under Create Athena jobs with Step Functions workflows, for Query large datasets, choose Get started.
  3. Choose Create your own workflow.
  4. Choose Continue.

The following is a screenshot of the default workflow. Compare the default workflow against the earlier ETL workflow we described. The default workflow doesn’t contain a Lambda function invocation and has an additional Athena: GetQueryResults step.

Next, we add a Lambda Invoke step.

  1. Search for Lambda Invoke in the search bar.
  2. Choose the Lambda:Invoke step and drag it above the Athena: StartQueryExecution step.
  3. Choose the Athena:GetQueryResults step (right-click) and choose Delete state.

  4. Now the workflow aligns with the earlier design.
  5. Choose the step Glue: StartCrawler.
  6. In the Configuration section, under API Parameters, enter the following JSON (provide the AWS Glue crawler name from the CloudFormation stack output):
    {
      "Name": "GlueCrawler"
    }

  7. Choose the step Glue: GetCrawler.
  8. In the Configuration section, under API Parameters, enter the following JSON:
    {
      "Name": "GlueCrawler"
    }

  9. Choose the step Lambda: Invoke.
  10. In the Configuration section, under API Parameters, for Function name, choose the function -CleanUpS3FolderLambda-.
  11. In the Payload section, enter the following JSON (include the Athena output bucket from the stack output):
    {
      "bucket_name": “AthenaOutputBucket”,
      "prefix": "parquet/"
    }

  12. Choose the step Athena: StartQueryExecution.
  13. In the right Configuration section, under API Parameters, enter the following JSON (provide the Athena output bucket and workgroup name):
    {
      "QueryString": "UNLOAD (SELECT * FROM \"athena_unload_blog\".\"reviews\" )  TO 's3://AthenaOutputBucket/parquet' WITH (format = 'PARQUET',compression = 'SNAPPY')",
      "WorkGroup": “AthenaWorkgroup”
    }

Notice the Wait for task to complete check box is selected. This pauses the workflow while the Athena query is running.

  1. Choose the step SNS: Publish.
  2. In the Configuration section, under API Parameters, for Topic, pick the SNSTopic created by the CloudFormation template.
  3. In the Message section, enter the following JSON to pass the data manifest file location to the downstream consumer:
    {
      "Input.$": "$.QueryExecution.Statistics.DataManifestLocation"
    }

For more information, refer to the GetQueryExecution response syntax.

  1. Choose Next.
  2. Review the generated code and choose Next.
  3. In the Permissions section, choose Create new role.
  4. Review the auto-generated permissions and choose Create state machine.
  5. In the Add additional permissions to your new execution role section, choose Edit role in IAM.
  6. Add permissions and choose Attach policies.
  7. Search for the AWSGlueConsoleFullAccess managed policy and attach it.

This policy grants full access to AWS Glue resources when using the AWS Management Console. In production, generate a policy based on access activity, following the least privilege principle.

Test the workflow

Next, we test out the Step Functions workflow.

  1. On the Athena console, under Jobs in the navigation pane, choose Workflows.
  2. Under State machines, choose the workflow we just created.
  3. Choose Execute, then choose Start execution to start the workflow.
  4. Wait until the workflow completes, then verify there are UNLOAD Parquet files in the bucket AthenaOutputBucket.

Clean up

To help prevent unwanted charges to your AWS account, delete the AWS resources that you used in this post.

  1. On the Amazon S3 console, choose the -athena-unload-data-lake bucket.
  2. Select all files and folders and choose Delete.
  3. Enter permanently delete as directed and choose Delete objects.
  4. Repeat these steps to remove all files and folders in the -athena-unload-output bucket.
  5. On the AWS CloudFormation console, delete the stack you created.
  6. Wait for the stack status to change to DELETE_COMPLETE.

Conclusion

In this post, we introduced the UNLOAD statement in Athena with some common use cases. We demonstrated how to compress Athena query results to reduce storage costs and improve performance, store query results in JSON file format, feed downstream ML models, and create and visualize ETL pipelines with Step Functions without creating a table.

To learn more, refer to the Athena UNLOAD documentation and Visualizing AWS Step Functions workflows from the Amazon Athena console.


About the Authors

Dylan Qu is a Specialist Solutions Architect focused on Big Data & Analytics with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Harsha Tadiparthi is a Principal Solutions Architect focused on providing analytics and AI/ML strategies and solution designs to customers.

Best practices for consuming Amazon Kinesis Data Streams using AWS Lambda

Post Syndicated from Dylan Qu original https://aws.amazon.com/blogs/big-data/best-practices-for-consuming-amazon-kinesis-data-streams-using-aws-lambda/

Many organizations are processing and analyzing clickstream data in real time from customer-facing applications to look for new business opportunities and identify security incidents. A common practice is to consolidate and enrich logs from applications and servers in real time to proactively identify and resolve failure scenarios and significantly reduce application downtime. The Internet of Things (IoT) is also driving more adoption of real-time data processing. For example, a connected factory, connected cars, and smart spaces enable seamless sharing of information between people, machines, and sensors.

To help ingest real-time data or streaming data at large scales, you can use Amazon Kinesis Data Streams. Kinesis Data Streams can continuously capture gigabytes of data per second from hundreds of thousands of sources. The data collected is available in milliseconds, enabling real-time analytics. You can use an AWS Lambda function to process records in a Kinesis data stream.

This post discusses common use cases for Lambda stream processing and describes how to optimize the integration between Kinesis Data Streams and Lambda at high throughput with low system overhead and processing latencies.

Using Lambda to process a Kinesis data stream

Before diving into best practices, we discuss good use cases for Lambda stream processing and anti-patterns.

When to use Lambda for Kinesis data stream processing

Lambda integrates natively with Kinesis Data Streams. The polling, checkpointing, and error handling complexities are abstracted when you use this native integration. This allows the Lambda function code to focus on business logic processing. For example, one application can take in IP addresses from the streaming records and enrich them with geographic fields. Another application can take in all system logs from the stream and filter out non-critical ones. Another common use case is to take in text-based system logs and transform them into JSON format.

One key pattern the previous examples share is that the transformation works on a per-record basis. You can still receive batches of records, but the transformation of the records happens individually.
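
As a concrete illustration of per-record processing, the following is a minimal sketch of a Lambda handler that decodes each Kinesis record and transforms a text-based log line into JSON; the log format and field names are hypothetical.

import base64
import json

def lambda_handler(event, context):
    transformed = []
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")

        # Hypothetical text log format: "<timestamp> <level> <message>"
        timestamp, level, message = payload.split(" ", 2)
        transformed.append(json.dumps({
            "timestamp": timestamp,
            "level": level,
            "message": message,
        }))

    # A real function would write the transformed records to a downstream sink
    print(f"Transformed {len(transformed)} records")
    return {"records_processed": len(transformed)}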

When not to use Lambda for Kinesis data stream processing

By default, Lambda invokes one instance per Kinesis shard. Lambda invokes your function as soon as it has gathered a full batch, or when the batch window expires, as shown in the following diagram.

This means each Lambda invocation only holds records from one shard, each invocation is ephemeral, and any invocation can receive an arbitrarily small batch. Therefore, the following use cases are challenging for Lambda stream processing:

  • Correlation of events of different shards
  • Stateful stream processing, such as windowed aggregations
  • Buffering large volumes of streaming data before writing elsewhere

For the first two use cases, consider using Amazon Kinesis Data Analytics. Kinesis Data Analytics allows you to transform and analyze streaming data in real time. You can build sophisticated streaming applications with Apache Flink. Apache Flink is an open-source framework and engine for processing data streams. Kinesis Data Analytics takes care of everything required to run streaming applications continuously, and scales automatically to match the volume and throughput of your incoming data.

For the third use case, consider using Amazon Kinesis Data Firehose. Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. It can capture, transform, and deliver streaming data to Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), generic HTTP endpoints, and service providers like Datadog, New Relic, MongoDB, and Splunk. Kinesis Data Firehose enables you to transform your data with Lambda before it’s loaded to data stores.

Developing a Lambda consumer with shared throughput or dedicated throughput

You can use Lambda in two different ways to consume data stream records: you can map a Lambda function to a shared-throughput consumer (standard iterator), or to a dedicated-throughput consumer with enhanced fan-out (EFO).

For standard iterators, the Lambda service polls each shard in your stream for records once per second over HTTP. By default, Lambda invokes your function as soon as records are available in the stream. The invoked instances share read throughput with the other consumers of the shard. Each shard in a data stream provides 2 MB/second of read throughput. You can increase stream throughput by adding more shards. When it comes to latency, the Kinesis Data Streams GetRecords API has a limit of five reads per second per shard. This means you can achieve 200-millisecond data retrieval latency for one consumer. With more consumer applications, propagation delay increases. For example, with five consumer applications, each can only retrieve records once per second and each can retrieve less than 400 KB/second.

To minimize latency and maximize read throughput, you can create a data stream consumer with enhanced fan-out. An EFO consumer gets an isolated connection to the stream that provides a 2 MB/second outbound throughput. It doesn’t impact other applications reading from the stream. Stream consumers use HTTP/2 to push records to Lambda over a long-lived connection. Records can be delivered from producers to consumers in 70 milliseconds or better (a 65% improvement) in typical scenarios.
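
The following sketch shows one way you might register an EFO consumer and point a Lambda event source mapping at it using boto3; the stream ARN, consumer name, and function name are placeholders.

import boto3

kinesis = boto3.client("kinesis")
lambda_client = boto3.client("lambda")

stream_arn = "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"  # placeholder

# Register a dedicated-throughput (enhanced fan-out) consumer on the stream
consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn,
    ConsumerName="my-efo-consumer",  # placeholder
)["Consumer"]

# Point the Lambda event source mapping at the consumer ARN instead of the stream ARN
lambda_client.create_event_source_mapping(
    EventSourceArn=consumer["ConsumerARN"],
    FunctionName="my-stream-processor",  # placeholder
    StartingPosition="LATEST",
    BatchSize=100,
)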

When to use shared throughput vs. dedicated throughput (EFO)

It’s advisable to use standard consumers when there are few (less than three) consuming applications and your use cases aren’t sensitive to latency. EFO is better for use cases that require low latency (70 milliseconds or better) for message delivery to the consumer; this is achieved by automatically provisioning an EFO pipe per consumer, which guarantees low latency irrespective of the number of consumers linked to the shard. EFO has cost dimensions associated with it: there is an additional hourly charge per EFO consumer and a per-GB charge for EFO data retrievals.

Monitoring ongoing stream processing

Kinesis Data Streams and Amazon CloudWatch are integrated so you can collect, view, and analyze CloudWatch metrics for your streaming application. It’s a best practice to make monitoring a priority to head off small problems before they become big ones. In this section, we discuss some key metrics to monitor.

Enhanced shard-level metrics

It’s a best practice to enable shard-level metrics with Kinesis Data Streams. As the name suggests, Kinesis Data Streams sends additional shard-level metrics to CloudWatch every minute. This can help you pinpoint failing consumers for a specific record or shard and identify hot shards. Enhanced shard-level metrics come at an additional cost. For information about pricing, see Amazon CloudWatch pricing.

IteratorAge

Make sure you keep a close eye on the IteratorAge (GetRecords.IteratorAgeMilliseconds) metric. Age is the difference between the current time and when the last record of the GetRecords call was written to the stream. If this value spikes, data processing from the stream is delayed. If the iterator age gets beyond your retention period, the expired records are permanently lost. Use CloudWatch alarms on the Maximum statistic to alert you before this loss is a risk.
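
For example, a minimal sketch of such an alarm using boto3 follows; the alarm name, stream name, threshold, and SNS topic are placeholders, and you would pick a threshold comfortably below your stream’s retention period.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="kinesis-iterator-age-high",            # placeholder
    Namespace="AWS/Kinesis",
    MetricName="GetRecords.IteratorAgeMilliseconds",
    Dimensions=[{"Name": "StreamName", "Value": "my-stream"}],  # placeholder stream
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=6 * 60 * 60 * 1000,  # 6 hours in milliseconds, well below a 24-hour retention
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder topic
)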

The following screenshot shows a visualization of GetRecords.IteratorAgeMilliseconds.

In a single-source, multiple-consumer use case, each Lambda consumer reports its own IteratorAge metric. This helps identify the problematic consumer for further analysis.

You can find common causes and resolutions later in this post.

ReadProvisionedThroughputExceeded

The ReadProvisionedThroughputExceeded metric shows the count of GetRecords calls that have been throttled during a given time period. Use this metric to determine if your reads are being throttled due to exceeding your read throughput limits. If the Average statistic has a value other than 0, some of your consumers are throttled. You can add shards to the stream to increase throughput or use an EFO consumer to trigger your Lambda function.

Being aware of poison messages

A Lambda function is invoked for a batch of records from a shard, and it checkpoints upon the success of each batch. Either a batch is processed successfully, or the entire batch is retried until processing succeeds or the records fall off the stream based on the retention period. A poison message causes the failure of a batch, which can create two possible scenarios: duplicates in the results, or delayed data processing and loss of data.

The following diagram illustrates when a poison message causes duplicates in the results. If there are 300 records in the data stream and the batch size is 200, a Lambda instance is invoked to process the first 200 records. If processing fails at the eighty-third record, the entire batch is retried, which can cause duplicates in the target for the first 82 records, depending on the target application.

The following diagram illustrates the problem of delayed data processing and data loss. If there are 300 records in the data stream and the batch size is 200, a Lambda instance is invoked to process the first 200 records and retries until these records expire. The expired records are lost, and processing of the records behind them is delayed significantly.


Addressing poison messages

There are two ways to handle failures gracefully. The first option is to implement logic in the Lambda function code to catch exceptions, log them for offline analysis, and return success so that processing moves on to the next batch. Exceptions can be logged to Amazon Simple Queue Service (Amazon SQS), CloudWatch Logs, Amazon S3, or other services.
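
A minimal sketch of this first option follows; the SQS queue URL is a placeholder and process_record stands in for your business logic.

import base64
import json
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/poison-records"  # placeholder queue

def process_record(payload):
    """Stand-in for your business logic; raises an exception on a poison message."""
    return json.loads(payload)

def lambda_handler(event, context):
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        try:
            process_record(payload)
        except Exception as exc:
            # Capture the poison message for offline analysis instead of failing the batch
            sqs.send_message(
                QueueUrl=DLQ_URL,
                MessageBody=json.dumps({
                    "sequenceNumber": record["kinesis"]["sequenceNumber"],
                    "error": str(exc),
                    "payload": payload,
                }),
            )
    # Returning normally checkpoints the batch so processing moves on
    return {"status": "ok"}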


The second (and recommended) option is to configure the following retry and failure behavior settings with Lambda as the consumer for Kinesis Data Streams (a configuration sketch follows this list):

  • On-failure destination – Automatically send records to an SQS queue or Amazon Simple Notification Service (Amazon SNS) topic
  • Retry attempts – Control the maximum retries per batch
  • Maximum age of record – Control the maximum age of records to process
  • Split batch on error – Split each retried batch into two narrower batches so that Lambda automatically homes in on poison messages
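
The following sketch shows how these settings might be applied when creating the event source mapping with boto3; the ARNs, function name, and specific values are placeholders.

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",  # placeholder
    FunctionName="my-stream-processor",  # placeholder
    StartingPosition="LATEST",
    BatchSize=200,
    MaximumBatchingWindowInSeconds=5,
    MaximumRetryAttempts=3,           # retry attempts per batch
    MaximumRecordAgeInSeconds=3600,   # skip records older than one hour
    BisectBatchOnFunctionError=True,  # split the batch on error to isolate poison messages
    DestinationConfig={               # on-failure destination for failed batch metadata
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:stream-failures"}
    },
)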

Optimizing for performance

In this section, we discuss common causes for Lambda not being able to keep up with Kinesis Data Streams and how to fix them.

Lambda is hitting concurrency limit

Lambda has reached the maximum number of concurrent executions in the account, which means that Lambda can’t instantiate additional instances of the function. To identify this, set up CloudWatch alarms on the Throttles metric exposed by the function. To resolve this issue, consider assigning reserved concurrency to the function.

Lambda is throttled on egress throughput of a data stream

This can happen if there are more consumers for a data stream and not enough read provisioned throughput available. To identify this, monitor the ReadProvisionedThroughputExceeded metric and set up a CloudWatch alarm. One or more of the following options can help resolve this issue:

  • Add more shards and scale the data stream
  • Reduce the batch window to process messages more frequently
  • Use a consumer with enhanced fan-out 

Business logic in Lambda is taking too long

To address this issue, consider increasing the memory assigned to the function or adding shards to the data stream to increase parallelism.

Another approach is to enable concurrent Lambda invocations by configuring Parallelization Factor, a feature that allows more than one simultaneous Lambda invocation per shard. Lambda can process up to 10 batches in each shard simultaneously. Each parallelized batch contains messages with the same partition key. This means the record processing order is still maintained at the partition-key level. The following diagram illustrates this architecture.

For more information, see New AWS Lambda scaling controls for Kinesis and DynamoDB event sources.
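
A minimal sketch of enabling this on an existing event source mapping with boto3 follows; the mapping UUID is a placeholder.

import boto3

lambda_client = boto3.client("lambda")

# Allow up to 10 concurrent batches per shard; ordering is still preserved per partition key
lambda_client.update_event_source_mapping(
    UUID="12345678-90ab-cdef-1234-567890abcdef",  # placeholder event source mapping ID
    ParallelizationFactor=10,
)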

Optimizing for cost

Kinesis Data Streams has the following cost components:

  • Shard hours
  • PUT payload units (charged for 25 KB per PUT into a data stream)
  • Extended data retention
  • Enhanced fan-out

One of the key components you can optimize is PUT payload units. As mentioned earlier, you’re charged for each event you put into a data stream in 25 KB increments, so if you’re sending small messages, it’s advisable to aggregate messages to optimize cost. One way to aggregate multiple small records into a larger record is to use Kinesis Producer Library (KPL) aggregation.

The following is an example of a use case with and without record aggregation (the arithmetic behind these numbers is sketched after the list):

  • Without aggregation:
    • 1,000 records per second, with record size of 512 bytes each
    • Cost is $47.74 per month in us-east-1 Region (with $36.79 PUT payload units)
  • With aggregation:
    • 10 records per second, with a record size of 50 KB each
    • Cost is $11.69 per month in us-east-1 Region (with $0.74 PUT payload units)
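
The following sketch shows how the PUT payload unit portions of these numbers are derived, assuming the us-east-1 price of $0.014 per million PUT payload units at the time of writing; shard-hour costs make up the remainder of each monthly total.

import math

PRICE_PER_MILLION_UNITS = 0.014   # assumed us-east-1 PUT payload unit price at the time of writing
SECONDS_PER_MONTH = 730 * 3600    # 730 hours per month

def monthly_put_cost(records_per_second, record_size_bytes):
    # Each PUT is charged in 25 KB payload units, rounded up per record
    units_per_record = math.ceil(record_size_bytes / (25 * 1024))
    units_per_month = records_per_second * units_per_record * SECONDS_PER_MONTH
    return units_per_month * PRICE_PER_MILLION_UNITS / 1_000_000

print(monthly_put_cost(1000, 512))       # ~$36.79 without aggregation
print(monthly_put_cost(10, 50 * 1024))   # ~$0.74 with aggregation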

Another way to optimize cost is to increase the batch window, which reduces the number of Lambda invocations.

Conclusion

In this post, we covered the following aspects of Kinesis Data Streams processing with Lambda:

  • Suitable use cases for Lambda stream processing
  • Shared throughput consumers vs. dedicated-throughput consumers (enhanced fan-out)
  • Monitoring
  • Error handling
  • Performance tuning
  • Cost-optimization

To learn more about Amazon Kinesis, see Getting Started with Amazon Kinesis. If you have questions or suggestions, please leave a comment.


About the Authors

Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.


Vishwa Gupta is a Data and ML Engineer with AWS Professional Services Intelligence Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and playing badminton.

Event-driven refresh of SPICE datasets in Amazon QuickSight

Post Syndicated from Dylan Qu original https://aws.amazon.com/blogs/big-data/event-driven-refresh-of-spice-datasets-in-amazon-quicksight/

Businesses are increasingly harnessing data to improve their business outcomes. To enable this transformation to a data-driven business, customers are bringing together data from structured and unstructured sources into a data lake. Then they use business intelligence (BI) tools, such as Amazon QuickSight, to unlock insights from this data.

To provide fast access to datasets, QuickSight provides a fully managed calculation engine called SPICE—the Super-fast, Parallel, In-Memory Calculation Engine. At the time of writing, SPICE enables you to cache up to 250 million rows or 500 GB of data per dataset.

To extract value from the data quickly, you need access to new data as soon as it’s available. In this post, we describe how to achieve this by refreshing SPICE datasets as part of your extract, transform, and load (ETL) pipelines.

Solution architecture

In this post, you automate the refresh of SPICE datasets by implementing the following architecture.

This architecture consists of two parts: an example ETL job and a decoupled event-driven process to refresh SPICE.

For the ETL job, you use Amazon Simple Storage Service (Amazon S3) as your primary data store. Data lands in an S3 bucket, which we refer to as the raw zone. An Amazon S3 trigger configured on this bucket triggers an AWS Lambda function, which starts an AWS Glue ETL job. This job processes the raw data and outputs processed data into another S3 bucket, which we refer to as the processed zone.

This sample ETL job converts the data to Apache Parquet format and stores it in the processed S3 bucket. You can modify the ETL job to achieve other objectives, like more granular partitioning, compression, or enrichment of the data. The AWS Glue Data Catalog stores the metadata, and QuickSight datasets are created using Amazon Athena data sources.
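
As an illustration of the first stage of this pipeline, the Lambda function that starts the AWS Glue job when new objects land in the raw zone could look like the following minimal sketch; the job name is a placeholder (the deployed job name includes a unique suffix), and the actual function created by the CloudFormation template may differ.

import boto3

glue = boto3.client("glue")
GLUE_JOB_NAME = "TaxiTransformationGlueJob"  # placeholder; the deployed job name has a unique suffix

def lambda_handler(event, context):
    # Invoked by the S3 event notification on the raw zone bucket
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        run = glue.start_job_run(JobName=GLUE_JOB_NAME)
        print(f"Started {GLUE_JOB_NAME} run {run['JobRunId']} for new object {key}")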

To trigger the SPICE dataset refresh, after the ETL job finishes, an Amazon EventBridge rule triggers a Lambda function that initiates the refresh.

In summary, this pipeline transforms your data and updates QuickSight SPICE datasets upon completion.

Deploying the automated data pipeline using AWS CloudFormation

Before deploying the AWS CloudFormation template, make sure you have signed up for QuickSight in one of the 11 supported Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Seoul)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • EU (Frankfurt)
  • EU (Ireland)
  • EU (London)

This post works with both Standard and Enterprise editions of QuickSight. Enterprise Edition provides richer features and higher limits compared to Standard Edition.

  1. After you sign up for QuickSight, you can use CloudFormation templates to create all the necessary resources by choosing Launch stack:
  2. Enter a stack name; for example, SpiceRefreshBlog.
  3. Acknowledge the AWS Identity and Access Management (IAM) resource creation.
  4. Choose Create stack.

The CloudFormation template creates the following resources in your AWS account:

  • Three S3 buckets to store the following:
    • AWS Glue ETL job script
    • Raw data
    • Processed data
  • Three Lambda functions to do the following:
    • Create the ETL job
    • Initiate the ETL job upon upload of new data in the raw zone
    • Initiate the SPICE dataset refresh when the ETL job is complete
  • An AWS Glue database
  • Two AWS Glue tables to store the following:
    • Raw data
    • Processed data
  • An ETL job to convert the raw data from CSV into Apache Parquet format
  • Four IAM roles: One each for the Lambda functions and one for the ETL job
  • An EventBridge rule that triggers on an AWS Glue job state change event with a state of Succeeded and invokes a Lambda function that performs the SPICE dataset refresh

Importing the dataset

For this post, you use the publicly available NYC Taxi & Limousine Commission (TLC) Trip Record Data dataset. You upload monthly data in CSV format to the raw zone S3 bucket.

This data is available in Amazon S3 through Open Data on AWS, a service designed to let you spend more time on data analysis rather than data acquisition.

You start by copying the For Hire Vehicle (FHV) data for March 2020. Because the data is already available in Amazon S3 through Open Data, run the following command to copy the data into the raw zone. Make sure you replace <raw bucket name> with the name of the raw bucket created by the CloudFormation template:

aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_2020-03.csv" s3://<raw bucket name>

After you copy the data into the raw zone, the Amazon S3 event trigger invokes the Lambda function that triggers the ETL job. You can see the job status on the AWS Glue console by choosing Jobs in the navigation pane. The process takes about 2 minutes.

When the job is complete, check that you can see the Parquet files in the processed zone S3 bucket.

Creating a QuickSight analysis of the data

To visualize the taxi data, we create a QuickSight analysis.

First, you need to give QuickSight the necessary permissions to access the processed zone S3 bucket. For instructions, see I Can’t Connect to Amazon S3.

Then complete the following steps to create an analysis of the taxi data:

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset.
  3. Choose Athena and provide a name for the data source (such as Athena).
  4. Choose Create data source.
  5. For Database, choose the name of the taxi AWS Glue database (starting with taxigluedatabase).
  6. For Tables, select processed_taxi_data as the table to visualize.
  7. Choose Select.
  8. Ensure Import to SPICE for quicker analytics is selected and choose Visualize.

After the data is imported into SPICE, you can create visuals to display the data. For example, the following screenshot shows a key performance indicator (KPI) of the number of taxi journeys aggregated at the month level and the number of journeys over time.

We use this dashboard to visualize the dataset again after we refresh SPICE with more data.

Automating the SPICE refresh

To refresh the SPICE dataset when the ETL job is complete, the CloudFormation template we deployed created an EventBridge rule that triggers a Lambda function each time an AWS Glue ETL job successfully completes. The following screenshot shows the code for the event pattern.
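
The event pattern matches AWS Glue job state change events with a state of SUCCEEDED. A sketch of an equivalent rule created with boto3 might look like the following; the rule name and Lambda function ARN are placeholders.

import json
import boto3

events = boto3.client("events")

# Match successful completions of AWS Glue ETL jobs
events.put_rule(
    Name="glue-job-succeeded",  # placeholder rule name
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["SUCCEEDED"]},
    }),
)

# Route matching events to the SPICE refresh Lambda function
events.put_targets(
    Rule="glue-job-succeeded",
    Targets=[{
        "Id": "spice-refresh-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:QuicksightUpdateLambda",  # placeholder
    }],
)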

We need to configure the Lambda function with the ETL job name and the ID of the SPICE dataset we created in QuickSight.

  1. Locate the ETL job name on the AWS Glue console, named TaxiTransformationGlueJob-<unique id>.
  2. To find the SPICE dataset ID, run the following command using the AWS Command Line Interface (AWS CLI):
    aws quicksight list-data-sets --aws-account-id <your AWS account id> 

    The following screenshot shows the output with the dataset ID.

  3. On the Lambda console, open the Lambda function named SpiceRefreshBlog-QuicksightUpdateLambda-<unique id>.
  4. Update line 9 of the code to replace ReplaceWithGlueJobName with the AWS Glue job name and ReplaceWithYourDatasetID with the dataset ID.

Once a Glue job succeeds, this Lambda function is triggered. The EventBridge event that triggers the Lambda contains the name of the job. You can access this from the event as follows, as seen on line 25 of the function:

succeededJob = event['detail']['jobName']

The Lambda function looks up the job name in the data_set_map dictionary. If the dictionary contains the job name, the dataset ID is accessed and the function calls the QuickSight Create Ingestion API to refresh the SPICE datasets.

You can extend the data_set_map dictionary to include additional job names and associated SPICE dataset IDs to be refreshed. If using this approach at scale, you might choose to move this configuration information to an Amazon DynamoDB table.
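
A minimal sketch of what such a function could look like follows; the deployed QuicksightUpdateLambda function may differ in its details, and the dictionary below uses the same placeholder job name and dataset ID that you replace in the step above.

import time
import boto3

quicksight = boto3.client("quicksight")

# Map each AWS Glue job name to the SPICE dataset ID it should refresh (placeholders)
data_set_map = {
    "ReplaceWithGlueJobName": "ReplaceWithYourDatasetID",
}

def lambda_handler(event, context):
    account_id = context.invoked_function_arn.split(":")[4]
    succeeded_job = event["detail"]["jobName"]

    dataset_id = data_set_map.get(succeeded_job)
    if dataset_id is None:
        return {"status": "ignored", "job": succeeded_job}

    # Start a SPICE refresh (ingestion) for the mapped dataset
    response = quicksight.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=f"refresh-{int(time.time())}",
    )
    return {"status": response["IngestionStatus"], "job": succeeded_job}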

  1. Save the Lambda function by choosing Deploy.

Testing the automated refresh

Now that you have configured the Lambda function, we can test the ETL end-to-end process and make the next month’s data available for analysis.

To add the FHV data for April, run the following AWS CLI command:

aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_2020-04.csv" s3://<raw bucket name>

As before, this upload to the raw zone triggers the Lambda function that starts the ETL job. You can see the progress of the job on the AWS Glue console.

When the job is complete, navigate to QuickSight and open the taxi analysis (or, if you still have it open, refresh the window).

You can now see that both months’ data is available for analysis. This step might take 1–2 minutes to load.

To see the status of each SPICE refresh, navigate back to the dataset on the QuickSight console and choose View History.

The following screenshot shows the status of previous refreshes and the number of rows that have been ingested into SPICE.

Now that you have tested the end-to-end process, you can try copying more FHV data to the raw zone and see the data within your QuickSight analysis.

Cleaning up

To clean up the resources you created by following along with this post, complete the following steps:

  1. Delete the QuickSight analysis you created.
  2. Delete the QuickSight dataset that you created.
  3. Delete the QuickSight data source:
    1. Choose New dataset.
    2. Select the data source and choose Delete data source.
  4. On the Amazon S3 console, delete the contents of the raw and processed S3 buckets.
  5. On the AWS CloudFormation console, select the stack SpiceRefreshBlog and choose Delete.

Conclusion

Using an event-based architecture to automate the refresh of your SPICE datasets makes sure that your business analysts are always viewing the latest available data. This reduction in time to analysis can help your business unlock insights quicker without having to wait for a manual or scheduled process. Additionally, by only refreshing SPICE when new data is available, the underlying data storage resources are used efficiently, so you only pay for what you need!

Get started with QuickSight today!


About the Authors

Rob Craig is a Senior Solutions Architect with AWS. He supports customers in the UK with their cloud journey, providing them with architectural advice and guidance to help them achieve their business outcomes.


Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.