All posts by Seth Eliot

Verify the resilience of your workloads using Chaos Engineering

Post Syndicated from Seth Eliot original https://aws.amazon.com/blogs/architecture/verify-the-resilience-of-your-workloads-using-chaos-engineering/

The following is an early preview of new guidance to be published as part of updates to the AWS Well-Architected content:

Chaos Engineering enables us to find shortcomings before our customers find them, and therefore provides us with the opportunity to create a better customer experience. Chaos Engineering does not introduce chaos into your systems; instead, it finds the chaos that is already there. By definition, chaos experiments should be fail-safe and tolerated by the system. It is therefore key that you use tools that allow for controlled experiments. A controlled experiment has a clear scope of impact, built-in rollback mechanisms, and tight integration with monitoring that provides deep insight into the impact of the experiment in real time. Chaos Engineering allows you to inject real-world cloud provider faults that give you insight into what you need to improve in terms of observability, incident response, and architecture to be resilient against faults that you cannot predict. To help you with this journey, we have adjusted our guidance in the Well-Architected Reliability Pillar, enabling you to build more robust and resilient workloads on AWS.


Well-Architected Reliability best practice: verify the resilience of your workloads using Chaos Engineering

Chaos Engineering provides your teams with capabilities to continuously inject real-world disruptions (simulations) in a controlled way at the service provider, infrastructure, workload, and component levels, with minimal to no impact to your customers. It allows your teams to learn from faults and observe, measure, and improve the resilience of your workloads, as well as validate that alerts fire and that teams get notified in the case of an event. When run continuously, Chaos Engineering can highlight deficiencies in your workloads that, if left unaddressed, could negatively affect availability and operation.

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. – Principles of Chaos Engineering

If a system is able to withstand these disruptions, the chaos experiment should be maintained as an automated regression test. In this way, chaos experiments should be run as part of your software development lifecycle (SDLC) and as part of your CI/CD pipeline.

To ensure that your workload can survive component failure, inject real-world events as part of your experiments. For example, experiment with the loss of EC2 instances or failover of the primary Amazon RDS database instance, and verify that your workload is not impacted (or only minimally impacted). Use a combination of component faults to simulate events that may be caused by a disruption in an Availability Zone.
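One simple way to inject the RDS failover fault outside of a managed tool is to force a failover on a Multi-AZ instance. The following is a minimal boto3 sketch; the instance identifier is a hypothetical placeholder.

```python
import boto3

rds = boto3.client("rds")

# Hypothetical instance identifier; rebooting a Multi-AZ instance with ForceFailover
# promotes the standby in another AZ, simulating loss of the primary RDS instance.
rds.reboot_db_instance(
    DBInstanceIdentifier="orders-mysql",
    ForceFailover=True,
)
```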

For application-level faults (such as crashes), you can start with stressors such as memory and CPU exhaustion.

To validate fallback or failover mechanisms for external dependencies, simulate intermittent network disruptions by blocking access to the third-party providers for a specified duration, which might last from seconds to hours.

Other modes of degradation might cause reduced functionality and slow responses, often resulting in a disruption of your services. Common sources of this type of degradation are increased latency on critical services and unreliable network communication (dropped packets). Experiments with these faults include networking effects such as added latency, dropped messages, and DNS failures: the inability to resolve a name, reach the DNS service, or establish connections to dependent services.

Chaos Engineering tools

AWS Fault Injection Simulator (AWS FIS) is a fully managed service for running fault injection experiments that can be used as part of your CD pipeline or outside of the pipeline. AWS FIS is a good choice to use during Chaos Engineering game days. It supports simultaneously introducing faults across different types of resources, including Amazon EC2, Amazon ECS, Amazon EKS, and Amazon RDS. These faults include termination of resources, forcing failovers, stressing CPU or memory, throttling, latency, and packet loss. Because it is integrated with Amazon CloudWatch alarms, you can set up stop conditions as guardrails to roll back an experiment if it causes unexpected impact (Figure 1).


Figure 1. AWS Fault Injection Simulator integrates with AWS resources to enable you to run fault injection experiments for your workloads
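As a rough illustration of how such an experiment can be defined, the following boto3 sketch creates an AWS FIS experiment template that terminates 20% of instances carrying a given tag and attaches a CloudWatch alarm as a stop condition. The role ARN, alarm ARN, tag values, and client token are hypothetical placeholders; check the exact parameters against the current AWS FIS documentation for your use case.

```python
import boto3

fis = boto3.client("fis")

# Hypothetical ARNs and tag values: replace with your own role, alarm, and workload tags.
response = fis.create_experiment_template(
    clientToken="chaos-ec2-termination-001",
    description="Terminate 20% of tagged EC2 instances with a CloudWatch alarm guardrail",
    roleArn="arn:aws:iam::123456789012:role/FisExperimentRole",
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighServer5xxErrors",
        }
    ],
    targets={
        "workload-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Workload": "payments"},
            "selectionMode": "PERCENT(20)",
        }
    },
    actions={
        "terminate-instances": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "workload-instances"},
        }
    },
)
print(response["experimentTemplate"]["id"])
```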

To expand the scope of faults that can be injected on AWS, AWS FIS integrates with Chaos Mesh and Litmus Chaos, enabling you to coordinate fault injection workflows among multiple tools. For example, you can run a stress test on a pod’s CPU using Chaos Mesh or Litmus faults while terminating a randomly selected percentage of cluster nodes using AWS FIS fault actions.

Implementation steps

1. Determine which faults to use for experiments

Assess the design of your workload for resiliency. Such designs (created using the best practices of the Well-Architected Framework) consider risks based on critical dependencies, past events, known issues, and compliance requirements. List each element of the design intended to maintain resilience and the faults it is designed to mitigate. For more information about creating such lists, see the Operational Readiness Review whitepaper, which guides you on how to create a process to prevent reoccurrence of previous incidents. The Failure Modes & Effects Analysis (FMEA) process provides a framework for performing a component-level analysis of failures and how they impact your workload. FMEA is outlined in more detail in Failure Modes and Continuous Resilience by Adrian Cockcroft.

2. Assign a priority to each fault

To assess priority, consider the frequency of the fault and the impact of failure to the overall workload. It is fine to start with a coarse categorization, such as high, medium, or low, and refine it.

When considering frequency of a given fault, analyze past data for this workload when available. If not available, use data from other workloads running in a similar environment.

When considering the impact of a given fault, generally the larger the scope of the fault, the larger the impact. Also consider the workload design and purpose. For example, the ability to access the source data stores is critical for a workload doing data transformation and analysis. In this case, you would prioritize experiments for access faults, as well as throttled access and latency insertion.

Post-incident analyses are a good source of data to understand both frequency and impact of failure modes.

Use the assigned priority to determine which faults to experiment with first and the order with which to develop new fault injection experiments.
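A simple way to make this prioritization concrete is to score each fault and sort by the product of frequency and impact. The following sketch uses hypothetical faults and coarse 1–3 ratings (low/medium/high) drawn from past incident data and FMEA output; the numbers are illustrative only.

```python
# Hypothetical scoring sketch: frequency and impact are coarse 1-3 ratings
# (low/medium/high) taken from past incident data and FMEA output.
faults = [
    {"fault": "Single EC2 instance failure", "frequency": 3, "impact": 1},
    {"fault": "Primary RDS instance failover", "frequency": 2, "impact": 2},
    {"fault": "Availability Zone disruption", "frequency": 1, "impact": 3},
]

# Rank faults by a simple risk score so the highest-risk experiments are developed first.
for entry in sorted(faults, key=lambda f: f["frequency"] * f["impact"], reverse=True):
    print(entry["fault"], "score:", entry["frequency"] * entry["impact"])
```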

3. For each experiment that you will execute, follow the Chaos Engineering/continuous resilience flywheel (Figure 2)


Figure 2. Chaos Engineering/continuous resilience flywheel, using the scientific method by Adrian Hornsby

3A. Define steady state as some measurable output of a workload that indicates normal behavior

Your workload exhibits steady state if it is operating reliably and as expected. Therefore, validate that your workload is healthy before defining steady state. Steady state does not necessarily mean that there is no impact to the workload when a fault occurs, as a certain percentage of faults could be within acceptable limits. The steady state is your baseline: it is what you observe during the experiment, and it will highlight anomalies if the hypothesis you define in the next step does not turn out as expected.

For example, a steady state of a payments system can be defined as the processing of 300 transactions per second (TPS) with a 99% success rate and round-trip time of 500 ms.
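Expressing the steady state as monitored metrics makes it usable both as an experiment baseline and as a guardrail. As a minimal sketch, the following boto3 call creates a CloudWatch alarm on 99th-percentile response time for the example payments workload; the alarm name and load balancer dimension value are hypothetical placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical load balancer dimension and threshold, based on the example steady state
# (round-trip time of 500 ms).
cloudwatch.put_metric_alarm(
    AlarmName="payments-p99-latency-above-steady-state",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/payments-alb/0123456789abcdef"}],
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0.5,  # seconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",
)
```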

3B. Form a hypothesis about how the workload will react to the fault

A good hypothesis is based on how the workload is expected to mitigate the fault to maintain the steady state. The hypothesis states that given the fault of a specific type, the system or workload will continue steady state, because the workload was designed with specific mitigations. The specific type of fault and mitigations should be specified in the hypothesis.

The following template can be used for the hypothesis (but other wording is also acceptable):

If [specific fault] occurs, the [workload name] workload will [describe mitigating controls] to maintain [business or technical metric].

For example:

  • If 20% of the nodes in the EKS node-group are taken down, the Transaction Create API continues to serve the 99th percentile of requests in under 100 ms (steady state). The EKS nodes will recover within five minutes, and pods will get scheduled and process traffic within eight minutes after the initiation of the experiment. Alerts will fire within three minutes.
  • If a single EC2 instance failure occurs, the order system’s Elastic Load Balancer (ELB) health check will cause the ELB to only send requests to the remaining healthy instances while EC2 Auto Scaling replaces the failed instance, maintaining a less than 0.01% increase in server-side (5xx) errors (steady state).
  • If the primary RDS database instance fails, the supply chain data collection workload will failover and connect to the standby RDS database instance to maintain less than one minute of database read/write errors (steady state).

3C. Run the experiment by injecting the fault

An experiment should, by default, be fail-safe and tolerated by the workload. If you know that the workload will fail, do not run the experiment. Chaos Engineering should be used to find known-unknowns or unknown-unknowns. Known-unknowns are things you are aware of but don’t fully understand, and unknown-unknowns are things you are neither aware of nor fully understand. Experimenting against a workload that you know is broken won’t provide you with new insights. Your experiment should be carefully planned, have a clear scope of impact, and provide a rollback mechanism that can be run in case of unexpected turbulence. If your due diligence shows that your workload should survive the experiment, move forward with running the experiment. There are several options for injecting the faults. For workloads on AWS, AWS FIS provides many pre-defined fault simulations called actions. You can also define custom actions that run in AWS FIS using AWS Systems Manager documents.

We discourage the use of custom scripts for chaos experiments unless they can understand the current state of the workload, emit logs, and provide mechanisms for rollbacks and stop conditions where possible.

An effective framework or toolset that supports Chaos Engineering should track the current state of an experiment, emit logs, and provide rollback mechanisms to support the controlled running of an experiment. Start with an established service like AWS FIS that allows you to run experiments with a clearly defined scope and safety mechanisms that roll back the experiment if it introduces unexpected turbulence. To learn about a wider variety of experiments using AWS FIS, see the Resilient and Well-Architected Apps with Chaos Engineering lab. Also, AWS Resilience Hub will analyze your workload and create experiments that you can choose to implement and run in AWS FIS.
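The following boto3 sketch starts an experiment from a template and polls its state until it finishes; AWS FIS stops the experiment automatically if one of its stop conditions fires. The template ID is a hypothetical placeholder.

```python
import time
import boto3

fis = boto3.client("fis")

# Hypothetical template ID from an earlier create_experiment_template call.
experiment = fis.start_experiment(experimentTemplateId="EXT1a2b3c4d5e6f7")
experiment_id = experiment["experiment"]["id"]

# Poll the experiment state; AWS FIS moves the experiment to "stopped" if a
# stop condition (CloudWatch alarm) fires during the run.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(state["status"], state.get("reason", ""))
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(30)
```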

For every experiment, clearly understand its scope and its impact. We recommend that faults should be simulated first on a non-production environment before being run in production.

Where feasible, it is ideal to ultimately run in production under real-world load via canary deployments that spin up both a control and an experimental system deployment. Running experiments during off-peak times is a good practice to mitigate potential impact when first experimenting in production. Also, if using actual customer traffic poses too much risk, you can run experiments using synthetic traffic on production infrastructure against the control and experimental deployments. When using production is not possible, run experiments in pre-production environments that are as close to production as possible.

You must establish and monitor guardrails to ensure that the experiment does not impact production traffic or other systems beyond acceptable limits. Establish stop conditions to stop an experiment if it reaches a threshold on a guardrail metric that you define. This should include the steady-state metrics for the workload, as well as metrics on the components into which you’re injecting the fault. A synthetic monitor (also known as a “user canary”) is one metric you should usually include as a user proxy. Stop conditions for AWS FIS are supported as part of the experiment template, which allows up to five stop conditions per template.

One of the Principles of Chaos Engineering is to minimize the scope of the experiment and its impact, specifically “While there must be an allowance for some short-term negative impact, it is the responsibility and obligation of the Chaos Engineer to ensure the fallout from experiments are minimized and contained”. A method to verify the scope and potential impact is to run the experiment in a non-production environment first, verifying that thresholds for stop conditions occur as expected during an experiment and observability is in place to catch an exception, instead of directly experimenting in production.

When running fault injection experiments, verify that all responsible parties are well informed. Communicate with appropriate teams, such as the operations teams, service reliability teams, and customer support, to let them know when experiments will be run and what to expect. Give these teams communication tools to inform those running the experiment if they see any adverse effects.

You must restore the workload and its underlying systems back to the original known-good state. Often, the resilient design of the workload will self-heal. But some fault designs or failed experiments can leave your workload in an unexpected failed state. By the end of the experiment, you must be aware of this and restore the workload and systems. With AWS FIS, you can set a rollback configuration (also called a post action) within the action parameters. A post action returns the target to the state that it was in before the action was run. Whether automated (such as using AWS FIS) or manual, these post actions should be part of a playbook that describes how to detect and handle failures.
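If you need to end a run early yourself (for example, because a guardrail you monitor manually is breached), you can stop the experiment, which lets AWS FIS apply any post actions you configured. A minimal sketch with a hypothetical experiment ID:

```python
import boto3

fis = boto3.client("fis")

# Hypothetical experiment ID; stopping an in-progress experiment halts its actions
# so that configured post actions (rollbacks) can return targets to their prior state.
fis.stop_experiment(id="EXP123abc456def789")
```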

3D. Verify the hypothesis

The Principles of Chaos Engineering gives this guidance on how to verify steady state of your workload: “Focus on the measurable output of a system, rather than internal attributes of the system. Measurements of that output over a short period of time constitute a proxy for the system’s steady state. The overall system’s throughput, error rates, latency percentiles, etc. could all be metrics of interest representing steady state behavior. By focusing on systemic behavior patterns during experiments, Chaos verifies that the system does work, rather than trying to validate how it works.”

In the second and third examples from Step 3B, we included the steady-state metrics:

  • Less than 0.01% increase in server-side (5xx) errors
  • Less than 1 minute of database read/write errors

The 5xx errors are a good metric because they are a consequence of the failure mode that a client of the workload will experience directly. The database errors measurement is good as a direct consequence of the fault, but should also be supplemented with a client impact measurement such as failed customer requests or errors surfaced to the client. Additionally, include a synthetic monitor (also known as a “user canary”) on any APIs or URIs directly accessed by the client of your workload.
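One way to verify the hypothesis against this measurable output is to pull the steady-state metric for the experiment window and compare it with the baseline. The following boto3 sketch sums target 5xx errors from an Application Load Balancer over the last 30 minutes; the load balancer dimension value is a hypothetical placeholder.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical load balancer dimension; compare 5xx counts during the experiment
# window against the steady-state baseline.
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=30)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/orders-alb/0123456789abcdef"}],
    StartTime=start,
    EndTime=end,
    Period=60,
    Statistics=["Sum"],
)
total_5xx = sum(point["Sum"] for point in stats["Datapoints"])
print(f"5xx errors during experiment window: {total_5xx}")
```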

3E. Improve the workload design for resilience

If steady state was not maintained, then investigate how the workload design can be improved to mitigate the fault, applying the best practices of the AWS Well-Architected Reliability Pillar. Additional guidance and resources can be found in the AWS Builder’s Library, which hosts articles about how to improve your health checks and employ retries with backoff in your application code, among others.

After these changes have been implemented, run the experiment again (shown by the dotted line in Figure 2) to determine their effectiveness. If the verify step indicates the hypothesis holds true, then the workload will be in steady state, and the cycle in Figure 2 continues.

4. Run experiments regularly

A chaos experiment is a cycle, and experiments should be run regularly as part of Chaos Engineering. After a workload meets the experiment’s hypothesis, the experiment should be automated to run continuously as a regression test in your CI/CD pipeline. To learn how to do this, explore this blog on how to run AWS FIS experiments using AWS CodePipeline. This lab on recurrent AWS FIS experiments in a CI/CD pipeline enables you to work hands-on with this.

Fault injection experiments are also a part of game days. Game days simulate a failure or event to verify systems, processes, and team responses. The purpose of game days is to actually perform the actions that the team would perform as if an exceptional event happened.

5. Capture and store experiment results

Results for fault injection experiments must be captured and persisted. Include all data necessary (such as time, workload, and conditions) to be able to later analyze experiment results and trends. Examples of results might include screenshots of dashboards, CSV dumps from your metrics database, or a manually recorded log of events and observations from the experiment. Experiment logging with AWS FIS can be part of this data capture.


This blog post gives early access to the updated implementation guidance on Chaos Engineering we are publishing as part of updates to the AWS Well-Architected content. Using the implementation steps described in this post, you can begin using Chaos Engineering to verify the resilience of your workloads.

Building Resilient Well-Architected Workloads Using AWS Resilience Hub

Post Syndicated from Seth Eliot original https://aws.amazon.com/blogs/architecture/building-resilient-well-architected-workloads-using-aws-resilience-hub/

AWS Resilience Hub is a new service that helps you understand and improve the resiliency of your workloads using AWS Well-Architected best practices. As the lead for the Reliability Pillar of AWS Well-Architected, I am eager to share with you how you can use Resilience Hub to ensure your workload architecture is as reliable as you need.

In this blog post, I’ll show you how to use Resilience Hub to assess and improve the resiliency of your architecture based on its recommendations. I’ll start with a single Availability Zone (AZ) architecture, and evolve the architecture using the resiliency recommendations.

Single AZ architecture

Figure 1 shows the single AZ architecture I’m going to start with and assess using Resilience Hub. This simple web server runs on Amazon Elastic Compute Cloud (Amazon EC2). It serves a static web page stored in an Amazon Simple Storage Service (Amazon S3) bucket, and then records web site statistics in a MySQL Amazon Relational Database Service (Amazon RDS) database. A NAT gateway is also deployed so the EC2 servers can make calls out to the internet. When I add my application to Resilience Hub, it will discover my application structure. Then I can use it to assess my application’s resiliency per the instructions in the Measure and Improve Your Application Resilience with AWS Resilience Hub blog post.


Figure 1. Single AZ architecture

Even with only a single Amazon EC2 instance, it is still useful to include an Elastic Load Balancer. This lets you configure health checks performed against the EC2 instance. It also makes it easier to add more EC2 instances later. The Amazon EC2 Auto Scaling group helps improve resiliency—if the EC2 instance fails its health check, the Amazon EC2 Auto Scaling group will replace it.
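As a rough sketch of that configuration, the following boto3 calls tighten the target group health check and tell the Auto Scaling group to use ELB health checks when deciding whether to replace an instance. The target group ARN, health check path, and Auto Scaling group name are hypothetical, and in practice these settings would live in your infrastructure-as-code templates.

```python
import boto3

elbv2 = boto3.client("elbv2")
autoscaling = boto3.client("autoscaling")

# Hypothetical target group ARN and health check path.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123",
    HealthCheckPath="/healthz",
    HealthCheckIntervalSeconds=15,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
)

# Use ELB health checks (not just EC2 status checks) so the Auto Scaling group
# replaces instances that fail the application-level check.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)
```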

Resilience Hub assessment results for the single AZ architecture

Figure 2 shows the results from Resilience Hub; it’s showing a lot of red flags! This architecture does not meet my required RTO (Recovery Time Objective) and RPO (Recovery Point Objective) goals for resiliency, and it is unrecoverable for all failure types. Figure 2 shows that Resilience Hub assesses for several failure types, including failures in the workload application (bad code or data), infrastructure (component failure), or individual AZ availability. AZs within a Region are independent of each other; even if one AZ experiences issues, the other AZs will remain available. The single AZ architecture does not take advantage of that.


Figure 2. Resilience Hub assessment of the single AZ architecture

Figure 3 shows the component-level assessment of resiliency where each component corresponds to a part of the single AZ architecture. The results show that the S3 bucket does well. S3 is resilient to AZ failures and stores data across multiple AZs, which results in high data durability.

However, having a single RDS instance means that if the instance fails (infrastructure), or the AZ containing the instance fails (AZ), then it cannot operate (unrecoverable RTO), and the data will be lost (unrecoverable RPO). Similarly, deploying only one NAT gateway leaves the architecture vulnerable if the AZ experiences issues, so it shows as unrecoverable for AZ disruptions. But, because the NAT gateway is a fully managed service, there is no hardware to manage, and therefore it shows as resilient (0s RTO) for infrastructure issues.


Figure 3. Resilience Hub component-level assessment of the single AZ architecture against cloud infrastructure failures

To improve the single AZ architecture’s resiliency, Resilience Hub recommends enabling multi-AZ for the RDS instance. This will set up a standby database instance in another AZ. For the NAT gateway, it suggests “Add NAT Gateways in multiple AZs. (i.e., every AZ you have resources in).” I’ll implement these suggestions in the next section.
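A minimal sketch of those two changes with boto3 follows; the DB instance identifier, subnet ID, and Elastic IP allocation ID are hypothetical, and in practice you would make these changes through your infrastructure-as-code templates.

```python
import boto3

rds = boto3.client("rds")
ec2 = boto3.client("ec2")

# Hypothetical identifier; enabling Multi-AZ provisions a synchronous standby
# in another AZ, per the Resilience Hub recommendation.
rds.modify_db_instance(
    DBInstanceIdentifier="webstats-mysql",
    MultiAZ=True,
    ApplyImmediately=True,
)

# One NAT gateway per AZ: create an additional gateway in the second AZ's public subnet.
ec2.create_nat_gateway(
    SubnetId="subnet-0bbbbbbbbbbbbbbbb",      # public subnet in the second AZ
    AllocationId="eipalloc-0aaaaaaaaaaaaaaaa",  # Elastic IP for the new gateway
)
```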

Multi-AZ architecture

Figure 4 shows the multi-AZ architecture that I built based on Resilience Hub’s recommendations, which use the following Well-Architected Reliability best practices:

  • Deploy the workload to multiple locations: I set up RDS instances, NAT gateways, and EC2 instances distributed across AZs.
  • Fail over to healthy resources: If one EC2 instance fails, the Elastic Load Balancer will fail over and send traffic to the remaining healthy ones. If the primary RDS instance fails, the standby will be promoted to be the new primary.
  • Automate healing on all layers: The Amazon EC2 Auto Scaling group will replace faulty EC2 instances, and the RDS failover is automatic.

Figure 4. Multi-AZ architecture

In the next section, I’ll send this new architecture through Resilience Hub to check how much it has improved.

Resilience Hub assessment results for the multi-AZ architecture

As shown in Figure 5, the architecture still has some problems, but it’s looking much better! Resilience Hub has highlighted application failure as the possible source of resilience issues, so let’s dive into application RTO and RPO.


Figure 5. Resilience Hub assessment of the multi-AZ architecture


Figure 6. Resilience Hub component-level assessment of the multi-AZ architecture against customer application failures

When you make RDS multi-AZ, data is replicated to a standby instance. If your infrastructure or AZ fails, the system will fail over to the standby instance.

Application failures are different. If the data is corrupted or deleted due to a bug, accident, or unauthorized action, then that deletion or corruption will be replicated to the standby. The standby does not protect you in this case. Similarly, with S3, I need to protect against unwanted deletion or corruption by the application, so let’s see what Resilience Hub recommends.

As shown in Figure 7, for RDS, I can enable instance backup, with a suggested retention period of 7 days. By doing this, I can achieve an RPO of 5 minutes because using automatic backup in RDS allows me to restore a DB instance to any specific time during the backup retention period. The latest restorable time for a DB instance is typically within 5 minutes of the current time. The RTO is longer because it takes time to restore a new RDS instance from backup.


Figure 7. Resilience Hub suggestions to improve RDS resiliency in the architecture

As shown in Figure 8, for S3, it recommends I enable versioning. This feature allows me to roll back any change to an object (including deletion) to a last known good state. This means zero data loss and a 0s RPO.


Figure 8. Resilience Hub suggestions to improve S3 resiliency in the architecture

Let’s implement these suggestions!
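A minimal sketch of both changes with boto3, assuming hypothetical resource names (in practice these settings would go into your infrastructure-as-code templates):

```python
import boto3

rds = boto3.client("rds")
s3 = boto3.client("s3")

# Hypothetical identifier; a 7-day retention period enables automated backups
# and point-in-time restore for the RDS instance.
rds.modify_db_instance(
    DBInstanceIdentifier="webstats-mysql",
    BackupRetentionPeriod=7,
    ApplyImmediately=True,
)

# Versioning lets any object overwrite or deletion be rolled back to a prior version.
s3.put_bucket_versioning(
    Bucket="my-static-site-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)
```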

Multi-AZ architecture with data backup

Figure 9 shows the multi-AZ architecture that incorporates data backup features.


Figure 9. Multi-AZ architecture incorporating data backup features

Resilience Hub has recommended more revisions to this architecture based on AWS Well-Architected Reliability best practices:

  • Identify and back up all data that needs to be backed up: The data in my RDS database and S3 bucket is backed up.
  • Perform data backup automatically: RDS backups are automatic. S3 object versioning is also automatic.

Resilience Hub final assessment results

And…that’s it! You’ll see in Figure 10 that I’ve achieved the goals I set for RTO and RPO and my architecture is more resilient and reliable.


Figure 10. Resilience Hub assessment of the multi-AZ architecture after incorporating data backup features

Resilience Hub has even more recommendations for things I could improve. For example, if I switch from MySQL RDS to Amazon Aurora I can use backtracking to reduce the 1 hour 40 minute RTO that it takes to restore an RDS database backup. Backtracking “rewinds” the DB cluster to the time you specify, so you can restore a last known good state without needing to recreate the entire database from backup, which saves time and reduces RTO.
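For example, the following boto3 sketch rewinds a hypothetical Aurora MySQL cluster by ten minutes; backtracking only works if the cluster was created with a non-zero backtrack window.

```python
from datetime import datetime, timedelta, timezone

import boto3

rds = boto3.client("rds")

# Hypothetical cluster name; backtracking requires an Aurora MySQL cluster created
# with backtracking enabled (a backtrack window greater than zero).
rds.backtrack_db_cluster(
    DBClusterIdentifier="webstats-aurora",
    BacktrackTo=datetime.now(timezone.utc) - timedelta(minutes=10),
)
```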

Conclusion

To improve the resiliency of your workloads, you need to apply the right best practices. This blog post shows you how Resilience Hub can help you assess and improve a workload with poor resiliency, identify areas for improvement, implement best practices, and evaluate how those practices meet your resiliency goals.

Disaster Recovery (DR) Architecture on AWS, Part IV: Multi-site Active/Active

Post Syndicated from Seth Eliot original https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/

In my first blog post of this series, I introduced you to four strategies for disaster recovery (DR). My subsequent posts shared details on the backup and restore, pilot light, and warm standby active/passive strategies.

In this post, you’ll learn how to implement an active/active strategy to run your workload and serve requests in two or more distinct sites. Like other DR strategies, this enables your workload to remain available despite disaster events such as natural disasters, technical failures, or human actions.

DR strategies: Multi-site active/active

As we know from our now familiar DR strategies diagram (Figure 1), the multi-site active/active strategy will give you the lowest RTO (recovery time objective) and RPO (recovery point objective). However, this must be weighed against the potential cost and complexity of operating active stacks in multiple sites.


Figure 1. DR strategies

Implementing multi-site active/active

The architecture in Figure 2 shows you how to use AWS Regions as your active sites, creating a multi-Region active/active architecture. Only two Regions are shown, which is common, but more may be used. Each Region hosts a highly available, multi-Availability Zone (AZ) workload stack. In each Region, data is replicated live between the data stores and also backed up. This protects against disasters that include data deletion or corruption, since the data backup can be restored to the last known good state.


Figure 2. Multi-site active/active DR strategy

Traffic routing

Each regional stack serves production traffic. How you implement traffic routing determines which Region will receive a given request. Figure 2 shows Amazon Route 53, a highly available and scalable cloud Domain Name System (DNS), used for routing. Route 53 offers multiple routing policies. For example, the geolocation or latency routing policies are good choices for active/active deployments. For geolocation routing, you configure which Region a request goes to based on the origin location of the request. For latency routing, AWS automatically sends requests to the Region that provides the shortest round-trip time.

Your data governance strategy helps inform which routing policy to use. Geolocation routing lets you distribute requests in a deterministic way. This allows you to keep data for certain users within a specific Region, or you can control where write operations are routed to prevent contention. If optimizing for performance is your top priority, then latency routing is a good choice.
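As a minimal sketch, the following boto3 call creates latency-based records for two Regions, one per regional load balancer; the hosted zone ID, domain name, and target DNS names are hypothetical placeholders. Switching to another policy (such as geolocation) is a matter of changing the record set attributes.

```python
import boto3

route53 = boto3.client("route53")

# Hypothetical hosted zone, domain, and regional ALB DNS names; one latency record per
# Region lets Route 53 send each request to the Region with the lowest round-trip time.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "us-east-1",
                    "Region": "us-east-1",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "alb-use1.example.com"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "eu-west-1",
                    "Region": "eu-west-1",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "alb-euw1.example.com"}],
                },
            },
        ]
    },
)
```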

Read/write patterns

Read local/write local pattern

The Region to which a request is routed is called the “local Region” for that request. To maintain low latencies and reduce the potential for network error, serve all read and write requests from the local Region of your multi-Region active/active architecture.

I use Amazon DynamoDB for the example architecture in Figure 2. DynamoDB global tables replicate a table to multiple Regions. Writes to the table in any Region are replicated to other Regions within a second. This makes it a good choice when using the read local/write local pattern. However, there is the possibility of write contention if updates are made to the same item in different Regions at about the same time. To help ensure eventual consistency, DynamoDB global tables use a last writer wins reconciliation between concurrent updates. In this case, the data written by the first writer is lost. If your application cannot handle this and you require strong consistency, use another write pattern to avoid write contention.
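As a sketch, adding a replica Region to an existing table turns it into a global table that accepts writes in both Regions; the table name and Regions below are hypothetical.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Hypothetical table name; adding a replica creates a global table (version 2019.11.21)
# so writes in either Region are replicated to the other.
dynamodb.update_table(
    TableName="orders",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)
```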

Read local/write global pattern

With a write global pattern, you choose a Region to be the global write Region and only accept writes in that Region. DynamoDB global tables are still an excellent choice for replicating data globally; however, you must ensure that locally received write requests are re-directed to the global write Region.

Amazon Aurora is another good choice. When deployed as an Aurora global database, a primary cluster is deployed to your global write Region, and read-only instances (Aurora Replicas) are deployed to other AWS Regions. Data is replicated to these read-only instances with typical latency of under a second. Aurora global database write forwarding (available using Aurora MySQL-Compatible Edition) allows Aurora Replicas in the secondary cluster to forward write operations to the primary cluster in the global write Region. This way, you can treat the read-only replicas in all your Regions as if they were read/write capable. Using write forwarding, the request travels over the AWS network and not the public internet, reducing latency.

Amazon ElastiCache for Redis also can replicate data across Regions. For example, to store session data, you write to your global write Region and use Global Datastore to ensure that this data is available to be read from other Regions.

Read local/write partitioned pattern

For write-heavy workloads with users located around the world, your application may not be suited to incur the round trip to the global write Region with every write. Consider using a write partitioned pattern to mitigate this. With this pattern, each item or record is assigned a home Region. This can be done based on the Region it was first written to, or it can be based on a partition key in the record (such as user ID) by pre-assigning a home Region for each value of this key. As shown in Figure 3, records for this user are assigned to the AWS Region on the left as their home Region. The goal is to map records to a home Region close to where most write requests will originate.


Figure 3. Read local/write partitioned pattern for multi-site active/active DR strategy

When the user in Figure 3 travels away from home, they will read local, but writes will be routed back to their home Region. Writes will not usually incur long round trips, because they typically originate near the home Region. Since writes are accepted in all Regions (for records homed to that respective Region), DynamoDB global tables are a good choice here also.
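A simple (non-geographic) way to pre-assign a home Region from a partition key is to hash the key onto the list of active Regions, as in the following sketch; a real implementation would more likely use a lookup table so records can be mapped close to where their writes originate and re-homed during failover.

```python
import hashlib

# Hypothetical sketch of pre-assigning a home Region from a partition key (user ID):
# hash the key and map it onto the list of active Regions.
ACTIVE_REGIONS = ["us-east-1", "eu-west-1"]


def home_region(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return ACTIVE_REGIONS[int(digest, 16) % len(ACTIVE_REGIONS)]


# Reads go to the local Region; writes are routed to the record's home Region.
print(home_region("user-12345"))
```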

Failover

With a multi-Region active/active strategy, if your workload cannot operate in a Region, failover will route traffic away from the impacted Region to healthy Region(s). You can accomplish this with Route 53 by updating the DNS records. Make sure you set TTL (time to live) on these records low enough so that DNS resolvers will reflect your changes quickly enough to meet your RTO targets. Alternatively, you can use AWS Global Accelerator for routing and failover. It does not rely on DNS. Global Accelerator gives you two static IP addresses. You then configure which Regions user traffic goes to based on traffic dials and weights you set.
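For example, with Global Accelerator you can shift traffic away from an impacted Region by setting that Region's endpoint group traffic dial to zero; the endpoint group ARN below is a hypothetical placeholder.

```python
import boto3

# Global Accelerator is a global service; its API is served from us-west-2.
ga = boto3.client("globalaccelerator", region_name="us-west-2")

# Hypothetical endpoint group ARN for the impacted Region; dialing traffic to 0%
# shifts requests to the remaining healthy Region(s) without waiting on DNS TTLs.
ga.update_endpoint_group(
    EndpointGroupArn=(
        "arn:aws:globalaccelerator::123456789012:accelerator/abcd1234"
        "/listener/6789/endpoint-group/ab88888example"
    ),
    TrafficDialPercentage=0.0,
)
```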

If you’re using a write global pattern and the impacted Region is the global write Region, then a new Region needs to be promoted to be the new global write Region. If you’re using a write partitioned pattern, your workload must repartition so that the records homed in the impacted Region are assigned to one of the remaining Regions. Using write local, all Regions can accept writes. With no changes needed to the data storage layer, this pattern can have the fastest (near zero) RTO.

Conclusion

Consider the multi-site active/active strategy for your workload if you need DR with the quickest recovery time (lowest RTO) and least data loss (lowest RPO). Implementing it across Regions (multi-Region) is a good option if you are looking for the most separation and complete independence of your sites, or if you need to provide low latency access to the workload from users in globally diverse locations.

Also consider the trade-offs. Implementing and operating this strategy, particularly using multi-Region, can be more complicated and more expensive than other DR strategies. When implementing multi-Region active/active in AWS, you have access to resources to choose the routing policy and the read/write pattern that is right for your workload.
