All posts by Maximilian Schellhorn

Building resilient multi-tenant systems with Amazon SQS fair queues

Post Syndicated from Maximilian Schellhorn original https://aws.amazon.com/blogs/compute/building-resilient-multi-tenant-systems-with-amazon-sqs-fair-queues/

Today, AWS introduced Amazon Simple Queue Service (Amazon SQS) fair queues, a new feature that mitigates noisy neighbor impact in multi-tenant systems. With fair queues, your applications become more resilient and easier to operate, reducing operational overhead while improving quality of service for your customers.

In distributed architectures, message queues have become the backbone of resilient system design. They act as buffers between components, allowing services to process work asynchronously and at their own pace. When a sudden traffic spike hits your application, queues prevent cascading failures by buffering work and ensuring that downstream services aren’t overwhelmed. Amazon SQS has long been a go-to solution for developers building scalable applications because it’s a fully managed serverless solution that can seamlessly scale to ingest millions of messages per second.

In this post, you learn how to use Amazon SQS fair queues and understand their inner workings through a practical example.

Overview

Many modern applications follow a multi-tenant architecture, where a single application instance serves multiple tenants. A tenant is any entity that shares resources with others. It could be a customer, client application, or request type. This approach reduces operational costs and simplifies maintenance through efficient resource utilization. One example of such shared resources are queues and their associated consumer capacity.

However, multi-tenant systems face challenges when one tenant becomes a noisy neighbor. This tenant impacts others by overutilizing your system’s resources. With queues, this tenant causes a backlog by sending a large volume of messages or by requiring longer processing time. Regular queues deliver older messages first, which increases message dwell time for all tenants in such scenarios. This makes it difficult to maintain quality of service and forces teams to over-provision resources or build complex custom solutions.

Amazon SQS fair queues help maintain low dwell time for other tenants when there is a noisy neighbor. This happens transparently without requiring changes to your existing message processing logic. You define what constitutes a tenant in your system, and Amazon SQS handles the complex orchestration of mitigating noisy neighbor impact.

How it works

Amazon SQS continually monitors the distribution of messages received but not yet deleted (in-flight) by consumers across all tenants. When the system detects an imbalance:

  1. It identifies the noisy tenant, the one causing the queue to build a backlog.
  2. It automatically adjusts message delivery order to prioritize messages belonging to quiet (non-noisy) tenants.
  3. It maintains overall queue throughput.

Consider the following example that consists of a multi-tenant queue and four different tenants (A, B, C, and D).

In the steady state condition, the queue has no backlog, and in-flight messages are evenly distributed among tenants. All messages are consumed immediately when they land in the queue. The dwell time of messages is low for all tenants. Notice that not all consumer capacity is fully utilized in this steady state. The steady state condition is illustrated in the following diagram.

Figure 1: A multi-tenant queue in steady state condition

Figure 1: A multi-tenant queue in steady state condition

Now consider a noisy tenant scenario in which the number of messages of tenant A increases significantly and creates a backlog in the queue. Consumers are busy processing the messages mostly from tenant A, and messages from other tenants are waiting in the backlog, leading to a higher dwell time for all tenants. This noisy tenant scenario is illustrated in the following screenshot.

Figure 2: A multi-tenant queue with a noisy tenant

Figure 2: A multi-tenant queue with a noisy tenant

When a single tenant starts to occupy a significant portion of consumer resources, Amazon SQS fair queues considers this tenant as a noisy neighbor and prioritizes returning messages belonging to other tenants. This prioritization helps maintain low dwell times for quiet tenants (B, C, D), while the dwell time for tenant A’s messages will be elevated until the queue backlog is consumed—but without impacting other tenants. Fair queues are illustrated in the following diagram.

Figure 3: A multi-tenant queue with fair queues

Figure 3: A multi-tenant queue with fair queues

Amazon SQS doesn’t limit the consumption rate per tenant. Consumers can receive messages from noisy neighbor tenants when there is consumer capacity and the queue has no other messages to return. Like Amazon SQS standard queues, fair queues allow virtually unlimited throughput, and there are no limits on the number of tenants you can have in your queue.

How to use

The following is a quick overview of how to get started with Amazon SQS fair queues in your applications. See the feature documentation for a detailed walkthrough. These are the high-level steps the walkthrough follows:

  1. Enable Amazon SQS fair queues by adding a tenant identifier (MessageGroupId) to your messages
  2. Configure Amazon CloudWatch metrics to monitor Amazon SQS fair queues behavior
  3. You can use the example application to observe the Amazon SQS fair queues behavior with varying message volumes

Enable Amazon SQS fair queues by adding a tenant identifier (MessageGroupId) to your messages

Your message producers can add a tenant identifier by setting a MessageGroupId on an outgoing message:

// Send message with tenant identifier
SendMessageRequest request = new SendMessageRequest()
    .withQueueUrl(queueUrl)
    .withMessageBody(messageBody)
    .withMessageGroupId("tenant-123");  // Tenant identifier
sqs.sendMessage(request);

The new fairness capability will be applied automatically in all Amazon SQS standard queues for messages with the MessageGroupId property. It’s important to mention that it doesn’t require any change in the consumer code. It has no impact on API latency and doesn’t come with any throughput limitations.

Configure Amazon CloudWatch metrics to monitor Amazon SQS fair queues behavior

You can monitor Amazon SQS fair queues with Amazon CloudWatch metrics. The following terms are important in this context:

  • Noisy groups – A noisy message group represents a noisy neighbor tenant of a multi-tenant queue.
  • Quiet groups – Message groups excluding noisy groups.

When you use fair queues, Amazon SQS now emits the following additional metrics:

  • ApproximateNumberOfNoisyGroups
  • ApproximateNumberOfMessagesVisibleInQuietGroups
  • ApproximateNumberOfMessagesNotVisibleInQuietGroups
  • ApproximateNumberOfMessagesDelayedInQuietGroups
  • ApproximateAgeOfOldestMessageInQuietGroups

The new ApproximateNumberOfNoisyGroups metric gives the number of message groups (tenants) that are considered noisy in a fair queue. This metric helps identify the number of potential noisy neighbors in multi-tenant environments by tracking message groups consuming disproportionate resources. Use this metric to set alarms that trigger when the number of noisy groups exceeds your acceptable threshold, indicating potential queue fairness issues.

Amazon SQS already provides several standard queue-level metrics that offer approximate insights into the queue’s state, message processing, and potential bottlenecks. These metrics look at all messages in a queue. With fair queues, there’s a new set of four equivalent metrics, shown in the preceding list, that allow the exclusion of messages from noisy neighbor groups and target only quiet groups (non-noisy tenants). Hence, they all have the InQuietGroups suffix.

To monitor the effect of Amazon SQS fair queues you can compare metrics that have the InQuietGroups suffix with standard queue-level metrics. During traffic surges for a specific tenant, the general queue-level metrics might reveal increasing backlogs or older message ages. However, looking at the quiet groups in isolation, you can identify that most non-noisy message groups or tenants aren’t impacted, and you can estimate the total number of impacted message groups.

The following graph shows how the standard queue backlog metric (ApproximateNumberOfMessagesVisible) increases due to a noisy tenant while the backlog for non-noisy tenants (ApproximateNumberOfMessagesVisibleInQuietGroups) remains low.

Figure 4: Queue backlog for noisy and quiet groups

Figure 4: Queue backlog for noisy and quiet groups

While these new metrics provide a good overview of Amazon SQS fair queues behavior, it can be beneficial to understand which specific tenant is causing the load. Use Amazon CloudWatch Contributor Insights to see metrics about the top-N contributors, the total number of unique contributors, and their usage. This is especially helpful in scenarios where you’re dealing with thousands of tenants that would otherwise lead to high-cardinality data (and cost) when emitting traditional metrics. The following screenshot shows an example of a Contributor Insights dashboard on the AWS console that visualizes the top 10 contributors based on MessageGroupId.

Figure 5: Container Insights ReceivedMessagesPerMessageGroupId dashboard

Figure 5: Container Insights ReceivedMessagesPerMessageGroupId dashboard

Contributor Insights creates these metrics based on data from your application log output. Let your code log the number of messages being processed, and the corresponding MessageGroupId within your application. You can find a full example in the sample application in the next section.

Example application

To make it even more straightforward to get started, we’ve prepared an example application that you can use to observe the Amazon SQS fair queues behavior with varying message volumes. You can find the source code repository, infrastructure as code (IaC), and the instructions to run the sample on the sqs-fair-queues repository on GitHub.

The example application includes a load generator to simulate multi-tenant traffic and provides an Amazon CloudWatch dashboard that displays the most important metrics to visualize fair queue behavior. The following screenshot shows an example of the dashboard.


Figure 6: CloudWatch FairQueuesDashboard

Conclusion

Amazon SQS fair queues automatically mitigates the noisy neighbor impact in multi-tenant queues. Even when one tenant generates high message volumes or requires longer processing times (that is, becomes a noisy neighbor), the feature maintains consistent message dwell times for other tenants. When you add a tenant identifier to your messages, Amazon SQS fair queues will automatically detect and mitigate noisy neighbor impact, providing fair access to the queue for other tenants.

We recommend reviewing the Amazon SQS Developer Guide to get started and exploring the sample applications to test the behavior with varying message volumes.

How Shiji Group created a global guest profile store on AWS

Post Syndicated from Maximilian Schellhorn original https://aws.amazon.com/blogs/architecture/how-shiji-group-created-a-global-guest-profile-store-on-aws/

Shiji Group provides global software solutions for the hospitality industry. The Shiji Enterprise Platform enables customers to manage large hotel property portfolios using software as a service (SaaS). Among functionalities such as reservations, housekeeping, finance, and integrations with external systems, the guest profile is a key aspect of the system. Besides personal information (such as name and address) and billing details, the guest profile can include room preferences and entertainment options.

A property portfolio can span multiple hotels across the globe, and each hotel location can offer better customer service by consolidating data. Once the guest gives their cross-border data processing consent (CBDPC), profile information can be shared between properties. This provides a centralized and seamless experience for the hotel guest no matter which hotel in the portfolio was chosen.

In the following blog post, you will explore the architecture of the guest profile store that replicates the profile across multiple geographic areas. We will review the single Region design first and its infrastructure components and architectural patterns. We will then show the evolution to a multi-Region architecture.

Single Region architecture with CQRS

The ability to find relevant guest profile data fast is essential in the day-to-day hospitality business. Therefore, the following architecture uses the command query responsibility segregation (CQRS) pattern to provide high scalability and rich full-text search capabilities without sacrificing performance. With CQRS, write requests (commands) are targeting a different service than read requests (queries). This allows systems to store an item (such as a profile) in a search-optimized format for serving reads, while providing a simple schema for writes.

The microservices for the guest profile architecture are operated as containers on Amazon Elastic Kubernetes Service (Amazon EKS). The write model of the guest profile is stored in an Amazon Relational Database Service (Amazon RDS) PostgreSQL database. A separate read model uses Amazon OpenSearch Service. For interservice communication, Shiji runs a self-managed Apache Kafka cluster on Amazon Elastic Compute Cloud (Amazon EC2).

The following diagram provides a walk through the single Region architecture:

Single Region architecture with CQRS

Figure 1. Single Region architecture with CQRS

  1. The front desk employee creates the Guest Profile upon first interaction with the hotel guest (name, address, billing, and room preferences).
  2. The request is routed to the Kong API Management Solution that is running in an Amazon EKS Kubernetes cluster. It acts as the single entry-point to the system. It identifies the type of request by parsing the URL and forwarding write requests to the profile-write-model-service.
  3. The service validates the request. It stores the data and ProfileCreated event in the PostgreSQL database, Amazon RDS.
  4. A change data capture (CDC) mechanism publishes the ProfileCreated event to an Apache Kafka Local Profiles topic.
  5. The profile-read-model-service subscribes to the Local Profiles topic and stores the profile in an optimized read format in Amazon OpenSearch. Whenever the hotel performs a guest profile search, results will now be provided via the profile-read-model-service.

Multi-Region networking setup

Shiji operates in multiple AWS Regions to provide low latency, regulatory requirements, and resilience across the globe. The previously presented single Region architecture can be replicated to multiple AWS Regions (eu-central-1 and ap-southeast-1, for example). Hotels with a given property portfolio that operate in the same Region can reuse the profile store of the Shiji Enterprise Platform. However, hotels that are being operated in a different AWS Region can be interconnected as well.

This is achieved by providing an AWS Transit Gateway in a separate networking account that connects the different Regions with a VPC attachment:

Multi-Region networking setup

Figure 2. Multi-Region networking setup

The account segregation provides an additional layer of flexibility to add further Regions in the future.

Multi-Region event replication

Upon first arrival, guests can choose to sign a cross-border-data processing consent (CBDPC). This permits the hotel to share the profile information globally. If accepted, the profile-write-model-service creates an additional ProfileCreated event that gets published to a GlobalProfilesEU Apache Kafka topic. This topic is accessible for subscribers in the target Region, which replicates relevant profiles into the local database as follows.

A replicator-service in the target Region (ap-souteast-1) is now able to subscribe to the GlobalProfileEU topic in (eu-central-1), via the established network connection from the previous section. It republishes the event to a local ReplicatedProfiles topic that the profile-write-model-service subscribes to and saves to the local database:

Event replication

Figure 3. Event replication

Putting it all together: The multi-Region guest profile store

The following diagram combines all the components from the previous sections. It provides an end-to-end look at the multi-Region guest profile architecture. Due to the event driven nature of the system, the architecture can be extended without changing the initial flow outlined in the single Region design.

Multi-Region guest profile architecture

Figure 4. Multi-Region guest profile architecture

  1. If the hotel guest signed a cross-border data processing consent (CBDPC), the ProfileCreated event will also be published to a Global Profiles topic.
  2. The replicator-service in the target Region (for example, ap-southeast-1) subscribes to the Global Profiles topic of the source Region (for example, eu-central-1). It then publishes the event to its local Replicated Profiles topic.
  3. The profile-write-model-service in the target Region subscribes to the Replicated Profiles topic and records the item in the Amazon RDS PostgreSQL database with information about the source Region. This will initiate the local replication similar to the single Region design, and therefore creates a consistent experience between both Regions.

Conclusion and outlook

In this blog post, we showed how Shiji built a modern multi-Region microservice architecture on AWS. You have learned about patterns such as CQRS, which provide a scalable solution for both read and write traffic. We’ve also shown what is needed to interconnect two physically separated Regions. With cross-border data processing consent (CBDPC), you have seen how the ownership of guest data can be secured and utilized. The single Region architecture already provided a solid baseline for this solution architecture. The event-driven nature of the system permitted us to add additional functionality for the final multi-Region architecture.

The ability to manage a global guest profile within the main system as well as at the property itself is a huge advantage for enterprise hotel companies. It permits hotels to deliver a unified experience to their guests no matter where the guest is within the hotel or on their journey. Food preferences, spa, room, and more, can all be managed from a single guest profile. This centralized information hasn’t been possible within the hotel’s property management system (PMS) until recently.

Visit Shiji Enterprise Platform for more information.