
Migrating a self-managed message broker to Amazon SQS

Post Syndicated from Vikas Panghal original https://aws.amazon.com/blogs/architecture/migrating-a-self-managed-message-broker-to-amazon-sqs/

Amazon Payment Services is a payment service provider that operates across the Middle East and North Africa (MENA) region. Our mission is to provide online businesses with an affordable and trusted payment experience. We provide a secure online payment gateway that is straightforward and safe to use.

Amazon Payment Services has regional experts in payment processing technology in eight countries throughout the Gulf Cooperation Council (GCC) and Levant regions. We offer solutions tailored to businesses, in both international and local currencies. We are continuously improving and scaling our systems to deliver near-real-time processing capabilities. Everything we do is aimed at creating safe, reliable, and rewarding payment networks that connect the Middle East to the rest of the world.

Our use case of message queues

Our business built a high-throughput, resilient queueing system to send messages to our customers. Our implementation relied on a self-managed RabbitMQ cluster and consumers. A consumer is software that subscribes to a topic name in the queue; once subscribed, it receives for processing any message published to the queue that is tagged with that topic name. The cluster and consumers were both deployed on Amazon Elastic Compute Cloud (Amazon EC2) instances. As our business scaled, we faced challenges with our existing architecture.

Challenges with our message queues architecture

Managing a RabbitMQ cluster with its nodes deployed on Amazon EC2 instances came with operational burdens. Dealing with payments at scale, we faced significant challenges managing the queues and the performance and availability of our RabbitMQ cluster:

  • Managing durability with RabbitMQ queues. Messages placed in the queue persist and survive server restarts, but with our self-managed setup they could still be lost during a maintenance window.
  • Back-pressure mechanism. Our setup lacked a back-pressure mechanism, which flooded our customers with a huge number of messages at peak times: everything published into the queue was sent at once.
  • Customer business requirements. Many customers have business requirements to delay message delivery for a defined time to suit their processing flow. Our architecture did not support this delay.
  • Retries. We needed to implement a back-off strategy to space out multiple retries for temporarily failed messages.

Figure 1. Amazon Payment Services’ previous messaging architecture

The previous architecture, shown in Figure 1, was able to process a large load of messages within a reasonable delivery time. However, when the message queue built up due to network failures on the customer side, the latency of the overall flow suffered. This required manually scaling the queues, which added significant human effort, time, and overhead. As our business continued to grow, we needed to maintain a strict delivery-time service level agreement (SLA).

Using Amazon SQS as the messaging backbone

The Amazon Payment Services core team designed a solution that uses Amazon Simple Queue Service (SQS) with AWS Fargate (see Figure 2). Amazon SQS is a fully managed message queuing service that enables customers to decouple and scale microservices, distributed systems, and serverless applications. It is a highly scalable, reliable, and durable message queueing service that decreases the complexity and overhead associated with managing and operating message-oriented middleware.

Amazon SQS offers two types of message queues. SQS standard queues offer maximum throughput, best-effort ordering, and at-least-once delivery. SQS FIFO queues guarantee that messages are processed exactly once, in the exact order in which they are sent. For our application, we used SQS FIFO queues.

In SQS FIFO queues, messages are stored in partitions (a partition is an allocation of storage replicated across multiple Availability Zones within an AWS Region). By distributing messages across message group IDs, we achieved better partition utilization for the Amazon SQS queues, which let us offer higher availability, scalability, and throughput to process messages through consumers.
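
For illustration, publishing to a FIFO queue with a message group ID might look like the following minimal Scala sketch, using the AWS SDK for Java v1; the queue URL and names here are hypothetical, not our production values.

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.SendMessageRequest

object FifoPublisher {
  // Hypothetical FIFO queue URL; FIFO queue names must end in ".fifo".
  val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/payments.fifo"
  val sqsClient = AmazonSQSClientBuilder.defaultClient()

  def publish(groupId: String, payload: String): Unit = {
    val request = new SendMessageRequest(queueUrl, payload)
      // Messages that share a group ID are delivered one at a time, in order.
      .withMessageGroupId(groupId)
      // A deduplication ID lets SQS discard duplicate sends within a 5-minute window.
      .withMessageDeduplicationId(java.util.UUID.randomUUID().toString)
    sqsClient.sendMessage(request)
  }
}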

Figure 2. Amazon Payment Services’ new architecture using Amazon SQS, Amazon ECS, and Amazon SNS

This serverless architecture provided better scaling options for our payment processing services. It helps us handle peak events across the MENA region without the need for capacity provisioning. Serverless architecture also helps us reduce our operational costs, as we only pay when using the services. Our goals in developing this initial architecture were to achieve consistency, scalability, affordability, security, and high performance.

How Amazon SQS addressed our needs

Migrating to Amazon SQS helped us address many of our requirements and led to a more robust service. Some of our main issues included:

Losing messages during maintenance windows

While doing manual upgrades of RabbitMQ and the host operating system, we sometimes faced downtime. By using Amazon SQS, the messaging infrastructure is fully managed, reducing the need for maintenance operations.

Handling concurrency

Different customers handle messages differently, so we needed a way to customize concurrency per customer. With the message group ID feature of SQS FIFO queues, we were able to use a tag that groups messages together. Messages that belong to the same message group are always processed one by one, in strict order relative to the message group. Using this feature together with a consistent hashing algorithm, we were able to limit the number of simultaneous messages being sent to each customer.
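
Our hashing code isn't shown in this post, but a minimal sketch of the idea, assuming we want at most a fixed number of concurrent deliveries per customer, could look like the following (all names are illustrative):

import scala.util.hashing.MurmurHash3

// Derive a message group ID that spreads a customer's messages over a fixed
// number of "lanes". Messages within one lane are processed strictly in order,
// so at most maxConcurrency messages per customer are in flight at once.
def messageGroupId(customerId: String, messageKey: String, maxConcurrency: Int): String = {
  val lane = Math.floorMod(MurmurHash3.stringHash(messageKey), maxConcurrency)
  s"$customerId-$lane"
}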

Message delay and handling retries

When messages are sent to the queue, they are immediately available to be pulled and received by customers. However, many customers ask us to delay their messages to allow for preprocessing work, so we introduced a message delay timer. Some messages encounter errors and can be resubmitted, but the window between retries must be extended until we receive delivery confirmation from our customer, or until the retry limit is exceeded. Using SQS, we were able to use the ChangeMessageVisibility operation to adjust these delay times.
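
As an example, an exponential back-off between retries can be built on ChangeMessageVisibility along these lines; this is a sketch with an illustrative base delay, not our production code (SQS caps the visibility timeout at 12 hours):

import com.amazonaws.services.sqs.AmazonSQS

// Postpone redelivery of a temporarily failed message by extending its
// visibility timeout exponentially with each attempt.
def scheduleRetry(sqs: AmazonSQS, queueUrl: String,
                  receiptHandle: String, attempt: Int): Unit = {
  val baseDelaySeconds = 30L // illustrative base; attempt is bounded by the retry limit
  val delaySeconds = math.min(baseDelaySeconds << attempt, 43200L).toInt // 12-hour cap
  sqs.changeMessageVisibility(queueUrl, receiptHandle, delaySeconds)
}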

Scalability and affordability

To save costs, Amazon SQS FIFO queues and Amazon ECS Fargate tasks run only when needed. These services process data in smaller units and run them in parallel, and they can scale up efficiently to handle peak traffic loads. This satisfies most architectures that handle non-uniform traffic without needing additional application logic.

Secure delivery

Our service delivers messages to customers via host-to-host secure channels. To secure this data outside our private network, we use Amazon Simple Notification Service (SNS) as our delivery mechanism. Amazon SNS provides HTTPS endpoint delivery of messages coming to topics and subscriptions. AWS enables at-rest and/or in-transit encryption for all architectural components. Amazon SQS also provides encryption at rest, based either on AWS Key Management Service (KMS) keys or on service-managed encryption keys.
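
For illustration, KMS-based at-rest encryption can be enabled when a queue is created by setting a single queue attribute. A minimal sketch, with a hypothetical key alias and queue name:

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.CreateQueueRequest

val sqs = AmazonSQSClientBuilder.defaultClient()

// Create a FIFO queue whose messages are encrypted at rest with a KMS key.
val createRequest = new CreateQueueRequest("payments.fifo")
  .addAttributesEntry("FifoQueue", "true")
  .addAttributesEntry("KmsMasterKeyId", "alias/payments-sqs-key") // hypothetical alias
val queueUrl = sqs.createQueue(createRequest).getQueueUrl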

Performance

To quantify our product's performance, we monitor the message delivery delay. This metric measures the time between sending the message and when the customer receives it from Amazon Payment Services. Our goal is to have the message sent to the customer in near-real time once the transaction is processed. The new Amazon SQS/ECS architecture enabled us to achieve a p99 latency of 200 ms.

Summary

In this blog post, we have shown how using Amazon SQS helped transform and scale our service. We were able to offer a secure, reliable, and highly available solution for businesses. We use AWS services and technologies to run Amazon Payment Services payment gateway, and infrastructure automation to deliver excellent customer service. By using Amazon SQS and Amazon ECS Fargate, Amazon Payment Services can offer secure message delivery at scale to our customers.

Increasing McGraw-Hill’s Application Throughput with Amazon SQS

Post Syndicated from Vikas Panghal original https://aws.amazon.com/blogs/architecture/increasing-mcgraw-hills-application-throughput-with-amazon-sqs/

This post was co-authored by Vikas Panghal, Principal Product Manager – Tech, AWS, and Nick Afshartous, Principal Data Engineer at McGraw-Hill

McGraw-Hill’s Open Learning Solutions (OL) allow instructors to create online courses using content from various sources, including digital textbooks, instructor material, open educational resources (OER), national media, YouTube videos, and interactive simulations. The integrated assessment component provides instructors and school administrators with insights into student understanding and performance.

McGraw-Hill measures OL's performance by observing throughput, which is the amount of work done by an application in a given period. McGraw-Hill worked with AWS to ensure OL continues to run smoothly and to allow it to scale with the organization's growth. This blog post shows how we reviewed and refined the original architecture by incorporating Amazon Simple Queue Service (Amazon SQS) to achieve better throughput and stability.

Reviewing the original Open Learning Solutions architecture

Figure 1 shows the OL original architecture, which works as follows:

  1. The application makes a REST call to DMAPI, an API layer over the Datamart. The call results in a row being inserted into a job requests table in Postgres.
  2. A monitoring process called Watchdog periodically checks the database for pending requests.
  3. Watchdog spins up an Apache Spark on Databricks (Spark) cluster and passes up to 10 requests to it.
  4. The report is processed and output to Amazon Simple Storage Service (Amazon S3).
  5. Report status is set to completed.
  6. The user can view the report.
  7. The Databricks clusters shut down.

Figure 1. Original OL architecture

To help isolate longer running reports, we separated requests that have up to five schools (P1) from those having more than five (P2) by allocating a different pool of clusters. Each of the two groups can have up to 70 clusters running concurrently.

Challenges with original architecture

There are several challenges inherent in this original architecture, and we concluded that it would fail under heavy load.

It takes 5 minutes to spin up a Spark cluster. After processing up to 10 requests, each cluster shuts down. Pending requests are processed by new clusters. This results in many clusters continuously being cycled.

We also identified a database resource contention problem. In testing, we couldn't process 142 out of 2,030 simulated reports within the allotted 4 hours. Furthermore, the architecture cannot be scaled out beyond 70 clusters for the P1 and P2 pools, because adding more clusters would increase the number of database connections, and other production workloads on Postgres would be affected.

Refining the architecture with Amazon SQS

To address the challenges with the existing architecture, we rearchitected the pipeline using Amazon SQS. Figure 2 shows the revised architecture. In addition to inserting a row to the requests table, the API call now inserts the job request Id into one of the SQS queues. The corresponding SQS consumers are embedded in the Spark clusters.

Figure 2. New OL architecture with Amazon SQS

The revised flow is as follows:

  1. An API request results in a job request Id being inserted into one of the queues and a row being inserted into the requests table (see the producer sketch after this list).
  2. Watchdog monitors SQS queues.
  3. Pending requests prompt Watchdog to spin up a Spark cluster.
  4. SQS consumer consumes the messages.
  5. Report data is processed.
  6. Report files are output to Amazon S3.
  7. Job status is updated in the requests table.
  8. Report can be viewed in the application.
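
The producer side of step 1 might look like the following hypothetical sketch, using the AWS SDK for Java v1 from Scala; the queue URLs and the P1/P2 routing rule shown are illustrative.

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.SendMessageRequest

val sqs = AmazonSQSClientBuilder.defaultClient()

// After inserting the row in Postgres, DMAPI enqueues the job request Id.
def enqueueJobRequest(jobRequestId: Long, schoolCount: Int): Unit = {
  // Route to the P1 or P2 queue, mirroring the original pool split.
  val queueUrl =
    if (schoolCount <= 5) "https://sqs.us-east-1.amazonaws.com/123456789012/ol-reports-p1"
    else "https://sqs.us-east-1.amazonaws.com/123456789012/ol-reports-p2"
  sqs.sendMessage(new SendMessageRequest(queueUrl, jobRequestId.toString))
}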

After deploying the Amazon SQS architecture, we reran the previous load of 2,030 reports with a configuration ceiling of up to five Spark clusters. This time all reports were completed within the 4-hour time limit, including the 142 reports that timed out previously. Not only did we achieve better throughput and stability, but we did so by running far fewer clusters.

Reducing the number of clusters reduced the number of concurrent database connections that access Postgres. Unlike the original architecture, we also now have room to scale by adding more clusters and consumers. Another benefit of using Amazon SQS is a more loosely coupled architecture: the Watchdog process now only prompts Spark clusters to spin up, whereas previously it had to extract and pass job request Ids to the Spark job.

Consumer code and multi-threading

The following code snippet shows how we consumed the messages via Amazon SQS and performed concurrent processing. Messages are consumed and submitted to a thread pool that utilizes Java’s ThreadPoolExecutor for concurrent processing. The full source is located on GitHub.

import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import scala.collection.JavaConverters._

/**
  * Main consumer run loop performs the following steps:
  *   1. Consume messages
  *   2. Convert messages to Task objects
  *   3. Submit tasks to the thread pool
  *   4. Sleep for the configured poll interval
  */
def run(): Unit = {
  while (!this.shutdownFlag) {
    // Request up to one message per worker thread; note that SQS caps a
    // single receive call at 10 messages.
    val receiveMessageResult = sqsClient.receiveMessage(
      new ReceiveMessageRequest(queueURL)
        .withMaxNumberOfMessages(threadPoolSize))
    val messages = receiveMessageResult.getMessages
    val tasks = getTasks(messages.asScala.toList)

    threadPool.submitTasks(tasks, sqsConfig.requestTimeoutMinutes)
    Thread.sleep(sqsConfig.pollIntervalSeconds * 1000)
  }

  threadPool.shutdown()
}

Kafka versus Amazon SQS

We also considered routing the report requests via Kafka, because Kafka is part of our analytics platform. However, Kafka is not a queue; it is a publish-subscribe streaming system with different operational semantics. Unlike queue messages, Kafka messages are not removed by the consumer. Publish-subscribe semantics are useful for data processing scenarios where the data must be reprocessed or transformed in different ways by multiple independent consumers.

In contrast, for performing tasks, the intent is to process each message exactly once. There can be multiple consumers, and with queue semantics the consumers work together to pull messages off the queue. Because report processing is a type of task execution, we decided that SQS queue semantics better fit the use case.

Conclusion and future work

In this blog post, we described how we reviewed and revised a report processing pipeline by incorporating Amazon SQS as a messaging layer. Embedding SQS consumers in the Spark clusters resulted in fewer clusters and more efficient cluster utilization. This, in turn, reduced the number of concurrent database connections accessing Postgres.

There are still some improvements that can be made. The DMAPI call currently inserts the report request into both a queue and the database; in case of an error, the two can fall out of sync. In the next iteration, we can have the consumer insert the request into the database, so that the DMAPI call would only send the SQS message.

Also, the Java ThreadPoolExecutor API used in the source code exhibits the "slow poke" problem: because the call that submits the tasks is synchronous, it does not return until all tasks have completed, so idle threads are not utilized until the slowest task finishes. There's an opportunity to improve throughput by using a thread pool that allows idle threads to pick up new tasks, as sketched below.
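
One possible approach (a sketch, not McGraw-Hill's implementation) is to submit tasks individually and bound the number in flight with a semaphore, so a free thread can take new work without waiting for the whole batch:

import java.util.concurrent.{Executors, Semaphore, TimeUnit}

// Each task is submitted on its own; the semaphore blocks the producer only
// when every worker is busy, so idle threads pick up new tasks immediately.
class BoundedExecutor(poolSize: Int) {
  private val pool = Executors.newFixedThreadPool(poolSize)
  private val permits = new Semaphore(poolSize)

  def submit(task: Runnable): Unit = {
    permits.acquire() // wait only when all workers are occupied
    pool.execute(() => {
      try task.run()
      finally permits.release()
    })
  }

  def shutdown(): Unit = {
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}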

Ready to get started? Explore the source code illustrating how to build a multi-threaded Amazon SQS consumer.
