How Cimpress Built a Self-service, API-driven Data Platform Ingestion Service

Post Syndicated from Ethan Fahy original https://aws.amazon.com/blogs/architecture/how-cimpress-built-a-self-service-api-driven-data-platform-ingestion-service/

Cimpress is a global company that specializes in mass customization, empowering individuals and businesses to design, personalize and customize their own products – such as packaging, signage, masks, and clothing – and to buy those products affordably.

Cimpress is composed of multiple businesses that have the option to use the Cimpress data platform. To provide business autonomy and keep the central data platform team lean and nimble, Cimpress designed their data platform to be scalable, self-service, and API-driven. During a design update to River, the data platform’s data ingestion service, Cimpress engaged AWS to run an AWS Data Lab. In this post, we’ll show how River provides this data ingestion interface across all Cimpress businesses. If your company has a decentralized organizational structure and challenges with a centralized data platform, read on and consider how a self-service, API-driven approach might help.

Evaluating the legacy data ingestion platform and identifying requirements for its replacement

The collaboration process between AWS and Cimpress started with a discussion of pain points with the existing Cimpress data platform’s ingestion service:

  1. Changes were time consuming. The existing data ingestion service was built on Apache Kafka and Apache Spark. When a data stream owner needed new fields or changes to the stream payload schema, it could take weeks working with a data engineer: the Spark configuration had to be modified and tested manually, and the Spark job then redeployed to the operational platform.
  2. Scaling was not automated. The existing Spark clusters required large Amazon EC2 instances that were manually scaled. This led to an inefficient use of resources that was driving unnecessary cost.

Assessing these pain points led to the design of an improved data ingestion service that was:

  1. Fully automated and self-service via an API. This would reduce the dependency on engineering support from the central data platform team and decrease the time to production for new data streams.
  2. Serverless with automatic scaling. This would improve performance efficiency and cost optimization while freeing up engineering time.

The rearchitected Cimpress data platform ingestion service

Figure 1. River architecture diagram, depicting the flow of data from data producers through the River data ingestion service into Snowflake.

The River data ingestion service provides data producers with a self-service mechanism to push their data into Snowflake, the Cimpress data platform’s data warehouse service. As seen in Figure 1, here are the steps involved in that process:

1. Data producers use the River API to create a new River data “stream.” Each Cimpress business is given its own Snowflake account in which it can create logically separated data ingestion “streams.” A River data stream can be configured with multiple data layers, fine-tuned permissions, entity tables, entry de-duplication, data delivery monitoring, and Snowflake data clustering.
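
As a sketch of what that self-service call might look like, the snippet below creates a stream through a hypothetical River endpoint. The URL, the bearer-token authentication, and every configuration field name are illustrative assumptions, since the real API is internal to Cimpress; they simply mirror the options named above.

```python
import requests  # third-party HTTP client (pip install requests)

# Hypothetical River API endpoint -- the real service is internal to Cimpress.
RIVER_API = "https://river.example.internal/v1/streams"

# Every field name below is an illustrative assumption, loosely mirroring the
# stream options described in step 1 (de-duplication, monitoring, clustering).
stream_config = {
    "name": "order-events",
    "snowflake_account": "example_business",  # each business has its own account
    "deduplicate_entries": True,              # entry de-duplication
    "entity_tables": ["orders"],              # entity table definitions
    "clustering_keys": ["order_date"],        # Snowflake data clustering
    "delivery_monitoring": True,              # data delivery monitoring
}

response = requests.post(
    RIVER_API,
    json=stream_config,
    headers={"Authorization": "Bearer <token>"},  # hypothetical auth scheme
    timeout=10,
)
response.raise_for_status()
print("Created stream:", response.json())
```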

2. Data producers get access to a River WebUI. In addition to the River API, end users are also able to access a web-based user interface for simplified configuration, monitoring, and management.

3. Data producers push data to the River API. The River RESTful API uses an Application Load Balancer (ALB) backed by AWS Lambda.
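
For illustration, a producer's push might look like the following. The endpoint path, the {"records": [...]} batch envelope, and the auth header are assumptions rather than the documented River contract; only the ALB-fronted REST endpoint itself is described in the post.

```python
import requests  # third-party HTTP client (pip install requests)

# Hypothetical ingest endpoint for the stream created earlier.
INGEST_URL = "https://river.example.internal/v1/streams/order-events/records"

# A small batch of producer records; the envelope format is an assumption.
records = [
    {"order_id": "1001", "status": "created"},
    {"order_id": "1002", "status": "shipped"},
]

resp = requests.post(
    INGEST_URL,
    json={"records": records},
    headers={"Authorization": "Bearer <token>"},  # hypothetical auth scheme
    timeout=10,
)
resp.raise_for_status()
```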

4. Data is processed by Amazon Kinesis Data Firehose. Kinesis Data Firehose buffers and batches the incoming records and delivers them as objects to Amazon S3.
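
A minimal sketch of the ALB-fronted Lambda handler that connects steps 3 and 4 might look like this. The delivery stream name, request envelope, and response shape are assumptions for illustration, not Cimpress's actual implementation; the ALB event format and the Firehose PutRecordBatch call are standard AWS APIs.

```python
import base64
import json
import os

import boto3

firehose = boto3.client("firehose")
# Hypothetical delivery stream name, one per River stream.
DELIVERY_STREAM = os.environ.get("DELIVERY_STREAM_NAME", "river-order-events")


def handler(event, context):
    # ALB passes the HTTP body through, optionally base64-encoded.
    body = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    records = json.loads(body).get("records", [])

    # PutRecordBatch accepts up to 500 records per call; production code would
    # chunk large batches and retry any failed records it reports.
    firehose.put_record_batch(
        DeliveryStreamName=DELIVERY_STREAM,
        Records=[{"Data": (json.dumps(r) + "\n").encode("utf-8")} for r in records],
    )

    # Response shape expected by an ALB target of type Lambda.
    return {
        "statusCode": 202,
        "isBase64Encoded": False,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"accepted": len(records)}),
    }
```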

5. Data is uploaded to Snowflake. Data objects that land on Amazon S3 initiate an event notification to Snowflake’s Snowpipe continuous data ingestion service. This loads the data from S3 into a Snowflake account.
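
One standard way to wire this step is an S3 event notification pointed at the SQS queue Snowflake exposes for an auto-ingest Snowpipe. The boto3 sketch below assumes that approach; the bucket, prefix, and queue ARN are placeholders, and the pipe itself would be created on the Snowflake side with AUTO_INGEST = TRUE.

```python
import boto3

s3 = boto3.client("s3")

# Placeholders: the landing bucket Firehose writes to, the prefix for one
# stream, and the SQS queue ARN that Snowflake exposes for an auto-ingest
# pipe (visible as notification_channel in SHOW PIPES).
s3.put_bucket_notification_configuration(
    Bucket="river-ingest-landing",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "Id": "snowpipe-order-events",
                "QueueArn": "arn:aws:sqs:us-east-1:123456789012:sf-snowpipe-example",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "streams/order-events/"}
                        ]
                    }
                },
            }
        ]
    },
)
```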

All the AWS services used in the River architecture are serverless with automatic scaling. This means that the Cimpress data platform team can remain lean and agile: engineering time spent on infrastructure management tasks such as capacity provisioning and patching is minimized. These AWS services also have pay-as-you-go pricing, so the cost scales predictably with the amount of data ingested.

Conclusions

The Cimpress data platform team redesigned their data ingestion service to be self-service, API-driven, and highly scalable. As usage of the Cimpress data platform has grown, Cimpress businesses have been able to operate more autonomously, with greater speed and agility. As of today, the River data ingestion service reliably processes 2–3 billion messages per month across more than 1,000 different data streams. It drives data insights for more than 7,000 active users of the data platform across all of the Cimpress businesses.

Read more on the topics in this post and get started with building a data platform on AWS: