How Cookpad scaled its Amazon Redshift cluster while controlling costs with usage limits


This is a guest post by Shimpei Kodama, data engineer at Cookpad Inc.

Cookpad is a tech company that builds a community platform where people share recipe ideas and cooking tips. The company’s mission is to “make everyday cooking fun.” It’s one of the largest recipe-sharing platforms in Japan with over 50 million users per month, and is growing fast in other countries as well.

Cookpad has been using Amazon Redshift as its data analytics platform since 2015 to make all data accessible for employees to analyze and gain insights to delight end-users and grow its business. As of this writing, Cookpad’s Amazon Redshift cluster processes thousands of daily queries submitted by its globally distributed teams, along with over 500 batch jobs that run on the same cluster every day.

This post shares how Cookpad shortened load intervals for log data from several hours to minutes and made full logs queryable with Amazon Redshift Spectrum. We also discuss how concurrency scaling has reduced query queue wait time by 15%. Finally, we present how we easily control the costs of these pay-as-you-go features with usage limits.

Analyzing growing log data with Amazon Redshift Spectrum

As our service quickly grew, one of the challenges we had in late 2018 was ever-growing log data. At that time, we were loading data into about 250 log tables. The total volume of log data per month was about 3 TB after compression, and the percentage of disk usage was greater than 80%.

Because the number of log tables and the volume of data kept growing, we eventually reached a point where we were running out of disk space and couldn't load data into Amazon Redshift at the specified intervals or keep the full logs queryable through Amazon Redshift without adding nodes. If RA3 instances had been available, we could have solved the disk capacity problem by using them, but we didn't have that option in 2018.

We were also loading most of the log tables at 6-hour intervals. Although some log tables required shorter intervals so we could check logs immediately after releasing a feature, we couldn't shorten them because doing so would have affected query performance.

With about 3 TB of new logs (after compression) added every month and disk usage above 80%, we needed to either add more nodes or offload unused and older data from the cluster. However, neither option was acceptable to us. Although keeping infrequently accessed data like old logs in Amazon Redshift storage might look unreasonable, we wanted to keep the old logs and query them for long-term analysis.

Our solution

To tackle those challenges, we decided to move our log data to Amazon Simple Storage Service (Amazon S3) and query it with Amazon Redshift Spectrum.

We built a new data pipeline called Prism, which puts log data (in Parquet format) into an S3 bucket instead of using Amazon Kinesis Data Firehose. One reason for this was that we wanted the log files in Amazon S3 to be partitioned by log generation time, rather than by log receiving time, so that delayed logs land in the right partitions (Kinesis Data Firehose custom prefixes, which make this possible, weren't released until February 2019). Another reason was that we wanted to merge many small files in Amazon S3 into fewer, reasonably sized files to improve user query performance. The following diagram shows an overview of our data pipeline.

The pipeline includes the following steps:

  1. Fluentd puts the log files into an S3 bucket.
  2. Amazon S3 event notifications are sent to an Amazon Simple Notification Service (Amazon SNS) topic, and an Amazon Simple Queue Service (Amazon SQS) queue subscribed to that topic receives the messages.
  3. The preprocessor receives the messages from the SQS queue. It processes the log files to cleanse and transform them, and saves the processed files in an S3 bucket.
  4. Amazon S3 event notifications are sent to another SQS queue.
  5. The Prism Stream receives the messages from the SQS queue. It converts the files from JSON to Parquet and saves the converted files in another S3 bucket, partitioned by log generation date (see the sketch after this list).
  6. The Parquet log files in the S3 bucket are now available for queries through Amazon Redshift Spectrum.
  7. The Prism Merge periodically merges many small log files into reasonably sized files.
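
To make the conversion step in point 5 concrete, here is a minimal sketch of how a consumer like Prism Stream could work, assuming boto3 and pyarrow. The queue URL, bucket names, key layout, and the `time` field used for partitioning are illustrative assumptions, not Cookpad's actual implementation.

```python
import json
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.ap-northeast-1.amazonaws.com/123456789012/prism-stream"  # placeholder
OUTPUT_BUCKET = "example-prism-parquet-logs"                                      # placeholder


def handle_message(message):
    """Convert one JSON-lines log object referenced by an S3 event notification into Parquet."""
    body = json.loads(message["Body"])
    for record in body.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # assumed layout: <log_name>/.../<file>.json

        obj = s3.get_object(Bucket=bucket, Key=key)
        rows = [json.loads(line) for line in obj["Body"].read().splitlines() if line]

        # Partition by log *generation* time (a field inside each log line, here assumed to be
        # a Unix timestamp named "time"), not by arrival time, so late logs land in the right partition.
        by_date = {}
        for row in rows:
            dt = datetime.fromtimestamp(row["time"], tz=timezone.utc).strftime("%Y-%m-%d")
            by_date.setdefault(dt, []).append(row)

        for dt, partition_rows in by_date.items():
            table = pa.Table.from_pylist(partition_rows)
            out_key = f"logs/{key.split('/')[0]}/dt={dt}/{key.split('/')[-1]}.parquet"
            buf = pa.BufferOutputStream()
            pq.write_table(table, buf, compression="snappy")
            s3.put_object(Bucket=OUTPUT_BUCKET, Key=out_key, Body=buf.getvalue().to_pybytes())


def main():
    # Long-poll the SQS queue and delete messages only after successful conversion.
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for message in resp.get("Messages", []):
            handle_message(message)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```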

Results

The intervals for loading logs were dramatically shortened from several hours to about 10 minutes. And full logs are now available for queries through Amazon Redshift Spectrum without adding nodes.
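
To give a feel for how the Parquet logs become queryable, the following is a hedged sketch of the kind of external schema, external table, and query involved, issued here through the Amazon Redshift Data API with boto3. The schema, table, column, role, and bucket names are illustrative assumptions rather than our actual definitions.

```python
import boto3

# Placeholders: cluster, database, user, IAM role, and bucket are illustrative.
client = boto3.client("redshift-data")
common = dict(ClusterIdentifier="example-cluster", Database="analytics", DbUser="admin")

# External schema backed by the AWS Glue Data Catalog (one common setup for Spectrum).
create_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_logs
FROM DATA CATALOG DATABASE 'prism_logs'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# External table over the Parquet files, partitioned by log generation date.
# New dt partitions still need to be registered, for example with
# ALTER TABLE ... ADD PARTITION or through AWS Glue.
create_table = """
CREATE EXTERNAL TABLE spectrum_logs.search_logs (
    user_id  bigint,
    keyword  varchar(256),
    time     timestamp
)
PARTITIONED BY (dt date)
STORED AS PARQUET
LOCATION 's3://example-prism-parquet-logs/logs/search_logs/';
"""

# Query full logs through Spectrum without loading them into the cluster.
query = """
SELECT dt, count(*)
FROM spectrum_logs.search_logs
WHERE dt BETWEEN '2020-01-01' AND '2020-01-31'
GROUP BY dt
ORDER BY dt;
"""

for sql in (create_schema, create_table, query):
    client.execute_statement(Sql=sql, **common)
```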

As a bonus, the schema-on-read nature of Amazon Redshift Spectrum eliminates the need to reload data to fix character-length overflows in varchar columns, which was occasionally required for Amazon Redshift local tables.

The performance has met our SLA and lets us analyze data in Amazon S3 right away without having to load it first. Based on our experiments, the performance of Amazon Redshift Spectrum was reasonable: our tests showed that Amazon Redshift Spectrum queries took about 20% longer than the same queries against Amazon Redshift local tables.

Improving query response time with concurrency scaling

Another challenge we experienced was increasing query queue time, caused by more jobs being initiated by our globally distributed teams. This resulted in periods that were overloaded in terms of query concurrency. For example, batch jobs for the UK team start at 03:00 UTC and finish at 08:00 UTC, which is 12:00–17:00 JST. As a result, the Japanese team would often experience degraded performance during that period.

To resolve this issue, we decided to enable Amazon Redshift concurrency scaling, which supports processing thousands of concurrent queries by automatically adding more clusters in the background whenever they're needed.

We enabled it soon after AWS added the usage limits feature to Amazon Redshift, which allowed us to use concurrency scaling while controlling its costs. We set a usage limit of 1 hour per day with the Disable feature action.
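
Usage limits can be configured from the console or the API. As a minimal sketch with boto3, a daily 1-hour concurrency scaling limit with the disable action might be created like this (the cluster identifier is a placeholder):

```python
import boto3

redshift = boto3.client("redshift")

# Daily 1-hour (60-minute) limit on concurrency scaling; when breached,
# the feature is disabled until the next period begins.
redshift.create_usage_limit(
    ClusterIdentifier="example-cluster",   # placeholder
    FeatureType="concurrency-scaling",
    LimitType="time",
    Amount=60,                             # minutes for time-based limits
    Period="daily",
    BreachAction="disable",
)
```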

After we enabled concurrency scaling, we started running over 100 queries on the concurrency scaling cluster every day. And the daily average queue wait time on the main cluster went down by 15%.
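
One way to observe this kind of effect is through Amazon Redshift's system tables. The following is an illustrative sketch, not our exact monitoring queries, that checks daily concurrency scaling usage and average queue wait time via the Data API:

```python
import boto3

client = boto3.client("redshift-data")
common = dict(ClusterIdentifier="example-cluster", Database="analytics", DbUser="admin")  # placeholders

# Daily concurrency scaling usage: queries run on scaling clusters and seconds consumed.
usage_sql = """
SELECT date_trunc('day', start_time) AS day,
       sum(queries)                  AS queries,
       sum(usage_in_seconds)         AS seconds
FROM svcs_concurrency_scaling_usage
GROUP BY 1
ORDER BY 1;
"""

# Average queue wait per day on the main cluster (total_queue_time is in microseconds;
# service classes above 5 are user-defined WLM queues).
queue_sql = """
SELECT date_trunc('day', queue_start_time) AS day,
       avg(total_queue_time) / 1000000.0   AS avg_queue_seconds
FROM stl_wlm_query
WHERE service_class > 5
GROUP BY 1
ORDER BY 1;
"""

for sql in (usage_sql, queue_sql):
    client.execute_statement(Sql=sql, **common)
```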

Optimizing costs for Amazon Redshift Spectrum and concurrency scaling

We purchased a Reserved Instance for our Amazon Redshift cluster, which provided a significant pricing discount. But the pricing model for both Amazon Redshift Spectrum and concurrency scaling is pay-as-you-go, so we decided to use workload management (WLM) and usage limits to control and monitor the cost to meet our budget.

For Amazon Redshift Spectrum, we configured WLM and a usage limit with the Alert action. The WLM query monitoring rule stops queries that scan more than 1 TB, to prevent accidental large scans. Additionally, a weekly usage limit sends an alert to our communication platform when we exceed the weekly budget for Amazon Redshift Spectrum.
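
As a rough sketch of how such a setup can be expressed with boto3, the following combines a WLM query monitoring rule that aborts queries scanning more than about 1 TB from Amazon S3 with a weekly data-scanned usage limit. The parameter group, cluster identifier, and the 10 TB weekly amount are placeholders, and the alert to our communication platform comes from alarming on the emitted metric, not from this call itself.

```python
import json

import boto3

redshift = boto3.client("redshift")

# Manual WLM queue with a query monitoring rule that aborts queries scanning
# more than ~1 TB (1,000,000 MB) from Amazon S3 via Redshift Spectrum.
wlm = [
    {
        "query_group": [],
        "user_group": [],
        "query_concurrency": 5,
        "rules": [
            {
                "rule_name": "abort_large_spectrum_scan",
                "predicate": [
                    {"metric_name": "spectrum_scan_size_mb", "operator": ">", "value": 1000000}
                ],
                "action": "abort",
            }
        ],
    }
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="example-parameter-group",  # placeholder
    Parameters=[
        {"ParameterName": "wlm_json_configuration", "ParameterValue": json.dumps(wlm)}
    ],
)

# Weekly Redshift Spectrum usage limit (amount is in TB for data-scanned limits).
# 'emit-metric' surfaces a CloudWatch metric; an alarm on that metric can then
# notify a chat or on-call channel, which is where the alert comes from.
redshift.create_usage_limit(
    ClusterIdentifier="example-cluster",           # placeholder
    FeatureType="spectrum",
    LimitType="data-scanned",
    Amount=10,                                     # placeholder weekly budget in TB
    Period="weekly",
    BreachAction="emit-metric",
)
```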

The usage limits feature monitors and controls your usage and associated costs for Amazon Redshift Spectrum and concurrency scaling. You can create daily, weekly, and monthly usage limits, and define actions to take if those limits are reached to maintain predictable spending. Actions include logging usage stats as an event to a system table, generating Amazon SNS alerts, and disabling Amazon Redshift Spectrum or concurrency scaling based on your defined thresholds. This allows you to keep reaping the benefits of both features with the peace of mind that you can stay within budget simply by configuring the appropriate thresholds.

For more information, see Managing usage limits in Amazon Redshift.

For concurrency scaling, we set the usage limit to 1 hour per day, considering the tradeoff between cost and queue wait time. So the concurrency scaling cluster is up and running for 1 hour (and a little longer, to be exact) every day in our environment. Thankfully, for every 24 hours that our main cluster is in use, we accrue 1 hour of concurrency scaling credit, so we're using concurrency scaling at just a small additional cost.

Conclusion

Amazon Redshift is vital for Cookpad to enable its employees to conduct self-service analytics. As I’ve described in this post, we’ve successfully expanded our cluster’s capabilities by using new features provided by AWS, without adding nodes.


About the Authors

Shimpei Kodama is a data engineer at Cookpad Inc. Shimpei is in charge of the data infrastructure for analysis at Cookpad. He delivers data, and the ability to process it, to his colleagues to help them improve the value of their work.

Junpei Ozono is a Senior Solutions Architect at AWS in Japan. Junpei supports customers’ journeys on the AWS Cloud and guides them to design and develop lake house architectures powered by Amazon Redshift, Amazon S3, and other AWS services.