All posts by Michael Haken

Fault-isolated, zonal deployments with AWS CodeDeploy

Post Syndicated from Michael Haken original https://aws.amazon.com/blogs/devops/fault-isolated-zonal-deployments-with-aws-codedeploy/

In this blog post you’ll learn how to use a new feature in AWS CodeDeploy to deploy your application one Availability Zone (AZ) at a time to help increase the operational resilience or your services through improved fault isolation.

Introducing change to a system can be a time of risk. Even the most advanced CI/CD systems with comprehensive testing and phased deployments can still promote a bad change into production. A common approach to reduce this risk is using fractional deployments and closely monitoring critical metrics like availability and latency to gauge a deployment’s success. If the deployment shows signs of failure, the CI/CD system initiates an

This blog post will show you how to leverage CodeDeploy zonal deployments as part of a holistic AZ independent (AZI) architecture strategy, both patterns that many AWS service teams follow. With this feature, you no longer need to distinguish between infrastructure or deployment failures in order to respond to the event. You can use the same observability tools and recovery techniques for both, which allows you to contain the scope of impact to a single AZ and mitigate the impact more quickly and with less complexity. First, let’s define what an AZI architecture is so we can understand how this feature in CodeDeploy supports it.

Availability Zone independence

Fault isolation is an architectural pattern that limits the scope of impact of failures by creating independent fault containers that don’t share fate. It also allows you to quickly recover from failures by shifting traffic or resources away from the impaired fault container. AWS provides a number of different fault isolation boundaries, but the ones most people are familiar with are AZs and Regions. When you build multi-AZ architectures, you can design your application to implement AZI that uses the fault boundary provided by AZs to keep the interaction of resources isolated to their respective AZ (to the greatest extent possible).

A three tier Availability Zone indepdendent architecture deployed across three AZa

An Availability Zone independent (AZI) architecture implemented by disabling cross-zone load balancing and using an independent database read replica per AZ. Only database writes have to cross an AZ boundary.

The result is that the impacts from an impairment in one AZ don’t cascade to resources in other AZs, making the operation of your application in one AZ independent from events in the others. You should also monitor the health of each AZ independently, for example by looking at per-AZ load balancer HTTPCode_Target_5XX_Count metrics, or by sending synthetic requests to the load balancer nodes in each AZ and recording availability and latency metrics for each. When an event occurs that impacts your availability or latency in a single AZ, you can use capabilities like Amazon Route 53 Application Recovery Controller zonal shift to shift traffic away from that AZ to quickly mitigate impact, often in single-digit minutes.

Using zonal shift to shift traffic away from an single AZ

Using zonal shift to move traffic away from the AZ experiencing a service impairment

Traditional deployment strategy challenges

During an event, SRE, engineering, or operations teams can spend a lot of time trying to figure out if the source of impact is an infrastructure problem or related to a failed deployment. Then, based on the identified cause, they may take different mitigation actions. Thus, precious time is spent investigating the source of impact and deciding on the appropriate mitigation plan.

When the cause is due to a failed deployment, traditionally rollbacks are used to mitigate the problem. But rollbacks, even when automated, take time to complete. For example, let’s say your deployment batches take 5 minutes to complete, you deploy in 10% batches, and you’re halfway through a deployment to 100 instances when the rollback is initiated. This means it’s going to take at least 25 minutes to finish the rollback (5 batches, each taking 5 minutes to re-deploy to). And it’s entirely possible during that time that instances where the new software was deployed continue to pass health checks, but result in errors being returned to your customers. In the worst case, if all instances had been deployed to, this event could last for almost an hour with customers being impacted during the entire rollback process. In some cases, deployments can’t be rolled back and have to be rolled forward, meaning a new, updated version needs to be deployed to fix the previous deployment. Writing the code for the new deployment and testing it adds to the recovery time of your system and can be error prone.

Additionally, if your unit of deployment includes multiple AZs, then your potential scope of impact from a failed deployment isn’t predictably bounded. For example, if your CodeDeploy deployment groups target Amazon Elastic Compute Cloud (Amazon EC2) instances based on tags or an Amazon EC2 Auto Scaling group that spans multiple AZs, then you could see impact across the whole Region, even if you’re using fractional deployments. There’s not a smaller fault container that helps consistently limit the scope of impact to a predictable size.

Let’s look at how we can overcome these two challenges by using zonal deployments with CodeDeploy.

Zonal deployments with AWS CodeDeploy

One of the best practices we follow at AWS, described in My CI/CD pipeline is my release captain, is performing fractional deployments aligned to intended fault isolation boundaries, like individual hosts, cells, AZs, and Regions. When we release a change, the deployment is separated into waves, which represent fault containers (like Regions) that are deployed to in parallel, and those are further separated into stages. Within a single Region, the deployment starts with a one-box environment, representing a single host, then moves on to fractional batches (like 10% at a time) inside a single AZ, waits for a period of bake time, moves on to the next AZ, and so on until we’ve completed rolling out the change.

Four stages in a deployment pipeline showing per AZ deployments with bake time.

Deployment stages aligned to intended fault isolation boundaries within a single deployment wave for one Region

By aligning each stage to an expected fault isolation boundary, we create well-defined fault containers that provide an understood and bounded scope of impact in the case that something goes wrong with a deployment. You can take advantage of this same deployment strategy in your own applications by using zonal deployments in CodeDeploy. To utilize this capability, you need to define a custom deployment configuration shown below.

The configuration options for a CodeDeploy deployment configuration using a zonal configuration

Creating a zonal deployment configuration that deploys to 10% of the EC2 instances in each AZ at a time, one AZ at a time

This configuration defines a few important properties. First, it enables the zonal configuration, which ensures deployments will be phased one AZ at a time. In this case, updates will be deployed to batches of 10% of the instances in each AZ (see the minimum number of healthy instances per Availability Zone for more details on configuring this setting). Second, it defines a monitor duration, which is the bake time where the effects of the changes are observed before moving on to the next AZ. This ensures sufficient use of the new software to discover any potential bugs or problems before moving on. The value in this example is defined as 900 seconds, or 15 minutes. You should ensure this value is longer than the time it takes for your alarms to trigger. For example, if you are using an M of N alarm for availability and/or latency, that is using 3 data points out of 5 with 1-minute intervals, you need to make sure your bake time is set to greater than 600 seconds, otherwise, you might move on to the next AZ before your alarm has a chance to mark the deployment as unsuccessful. Finally, I’ve also defined a first zone monitor duration. This overrides the “monitor duration” for the first AZ being deployed to. This is useful since the first AZ is acting as our canary or one-box environment and we may want to wait additional time to be really confident the deployment is successful before moving on to the second AZ.

If your service is deployed behind a load balancer with cross-zone load balancing disabled (which is important to achieve AZI), carefully consider your batch size. The load balancer evenly splits traffic across AZs regardless of how many healthy hosts are in each AZ. Ensure your batch size is small enough that the temporary reduction in capacity during each batch doesn’t overwhelm the remaining instances in the same AZ. You can use the CodeDeploy minimum healthy hosts per AZ option to ensure there are enough healthy hosts in the AZ during a deployment batch or Elastic Load Balancing (ELB) target group minimum healthy target count with DNS failover to shift traffic away from the AZ if too few targets are present.

Recovering from a failed zonal deployment.

When a failure occurs, the highest priority is mitigating the impact, not fixing the root cause. While an automated rollback can help achieve both for a failed deployment, using a zonal shift can improve your recovery time. Let’s take a simple dashboard like the following figure. The top graph shows your availability as perceived by customers through using the regional endpoint of your load balancer like https://load-balancer-name-and-id.elb.us-east-1.amazonaws.com. The graphs below it show the measured availability from Amazon CloudWatch Synthetics canaries that test the load balancer endpoints in each AZ using endpoints like https://us-east-1a.load-balancer-name-and-id.elb.us-east-1.amazonaws.com.

Dashboards showing a drop in availability in one AZ that also impacts the regional customer experience

Dashboard showing impact in one AZ that affects the availability of the service

We can see that something starts impacting resources in AZ1 at 10:38 causing an availability drop. As we would expect, this impact is also seen by customers, shown in the top graph, but it’s unclear what the underlying cause of the availability drop is. Using the approach described in this post, it turns out that it doesn’t matter. Within a few minutes, at 10:41 the CloudWatch composite alarm monitoring the health of AZ1 transitions to the ALARM state and invokes a Lambda function that reads the alarm’s definition to get the AZ ID and ALB ARN involved, and initiates the zonal shift. It’s important that the alarm logic only reacts when a single AZ is impacted, if there was impact in more than one AZ, we would need to treat this as a Regional issue.

The process of identifying a failed deployment in a single AZ and responding with a zonal shift.

After a failed deployment to AZ1, an automatically initiated zonal shift quickly mitigates the customer impact

Then, after a few more minutes, at 10:44, we can see availability from the customer perspective has gone back up to 100% by shifting traffic away from AZ1.

Dashboards showing the regional customer experience has recovered while the AZ is still impacted by the failed deployment

The impact of the failed deployment is mitigated by shifting traffic away from AZ1

It turns out the cause of impact in this case was a failed deployment, and we can see that our synthetic canaries still see the failure while the deployment is rolling back, but we’ve achieved our primary goal of quickly removing the impact to the customer experience. From the start of impact to mitigation, 6 minutes elapsed, which was significantly faster than waiting for the deployment to completely rollback. After the rollback is complete, at 10:58, 20 minutes after the start of the event, we can see the alarm transition back to the OK state and availability return to normal in AZ1, meaning we can end the zonal shift and return to normal operation.

Dashboards showing that after the rollback is complete, the impact to the single AZ has also dissipated

After the deployment rollback is complete, the availability in AZ1 recovers and the zonal shift can be ended

Conclusion

Performing zonal deployments helps improve the effectiveness of AZI architectures. Aligning your deployments to your intended fault isolation boundaries, in this case AZs, creates a predictable scope of impact and helps prevents cascading failures. This in turn allows you to use a common set of observability and mitigation tools for both single-AZ infrastructure events and failed deployments, which can mitigate the impact faster than automated rollbacks. Additionally, by removing the ambiguity on selecting a recovery strategy for operators, it further reduces recovery time and complexity. Learn more about zonal deployments in AWS CodeDeploy here.

Michael Haken

Michael Haken

Michael is a Senior Principal Solutions Architect on the AWS Strategic Accounts team where he helps customers innovate, differentiate their business, and transform their customer experiences. He has 15 years’ experience supporting financial services, public sector, and digital native customers. Michael has his B.A. from UVA and M.S. in Computer Science from Johns Hopkins. Outside of work you’ll find him playing with his family and dogs on his farm.

Creating an organizational multi-Region failover strategy

Post Syndicated from Michael Haken original https://aws.amazon.com/blogs/architecture/creating-an-organizational-multi-region-failover-strategy/

AWS Regions provide fault isolation boundaries that prevent correlated failure and contain the impact from AWS service impairments to a single Region when they occur. You can use these fault boundaries to build multi-Region applications that consist of independent, fault-isolated replicas in each Region that limit shared fate scenarios. This allows you to build multi-Region applications and leverage a spectrum of approaches from backup and restore to pilot light to active/active to implement your multi-Region architecture. However, applications typically don’t operate in isolation; consider both the components you will use and their dependencies as part of your failover strategy. Generally, multiple applications make up what we refer to as a user story, a specific capability offered to an end user, like “posting a picture and caption on a social media app” or “checking out on an e-commerce site”. Because of this, you should develop an organizational multi-Region failover strategy that provides the necessary coordination and consistency to make your approach successful.

Overview

There are four high-level strategies that organizations can pick from to guide a multi-Region approach:

  • Component-level failover
  • Individual application failover
  • Dependency graph failover
  • Entire application portfolio failover

These strategies move from the most granular to the coarsest approach. Each strategy has tradeoffs and addresses different challenges, including flexibility of failover decision making, testability of the failover combinations, presence of modal behavior, and organizational investment in planning and implementation. By the end of this post, you will be able to identify the pros and cons of each strategy so you can make intentional choices about which you select for your multi-Region failover solution.

Component-level failover

Applications are made up of multiple components, including their infrastructure, code and config, data stores, and dependencies. The component-level failover strategy helps you recover from individual component impairments. This means that when a single component is impaired, the application will fail over to a component hosted in a different Region. Consider the application in Figure 1. When the Amazon Simple Storage Service (Amazon S3) resources used by the application experience elevated error rates or higher latency, the application fails over to use data from an S3 bucket in its secondary Region.

When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

Figure 1. When the application experiences an impairment using S3 resources in the primary Region, it fails over to use an S3 bucket in the secondary Region.

This strategy gives the most autonomy and flexibility to individual applications, but has four main tradeoffs:

  • It adds latency by using resources in a second Region because they are physically further away. This gives the application multiple modes of behavior, lower latency when all components are in one Region, and higher latency when the components are split between Regions. Modal behavior can produce unexpected and undesirable results.
  • It introduces the possibility for inconsistent data if asynchronous replication is used in the data store.
  • It typically requires a runtime update of the application’s configuration to switch a component to a different Region, which can be unreliable during a failure scenario.
  • There are 2N-1 possible configurations (where N is the number of components in the application) of the application, which can make every possible combination in an application difficult to test.

Individual application failover

The next strategy allows individual applications to make an autonomous decision to fail over all of its components together, shown in Figure 2. This removes the latency tradeoff from the previous strategy by keeping all of the application components in the same Region. It also significantly reduces the complexity by only having two possible configurations per application. Additionally, applications can be failed over to another Region without updating their configuration by using approaches like Amazon Route 53 DNS failover, removing the unreliability of runtime configuration updates.

Application 3 experiences an impairment and fails over to the secondary Region.

Figure 2. Application 3 experiences an impairment and fails over to the secondary Region

However, allowing individual applications to make their own failover decision can introduce the same modal behavior we saw with component-level failover, just in a different dimension. In the worst case, 50% of the applications in a user story could fail over while 50% don’t, meaning every application interaction could be a cross-Region request, shown in Figure 3.

The worst-case scenario of allowing applications to make failover decisions independently.

Figure 3. The worst-case scenario of allowing applications to make failover decisions independently

Additionally, while this approach removes the complexity of the component failover approach, it still exhibits a level of similar complexity, albeit smaller, by having 2N-1 combinations of application locations across Regions, also making this approach difficult to test and coordinate.

Dependency graph failover

To solve the complexity of the previous strategy, you might decide to coordinate failover of all applications that support a user story as a single unit. We call this a dependency graph and it ensures that all applications that interact with each other will always be in the same Region, as shown in Figure 4.

A dependency graph of applications that all support user story "A".

Figure 4. A dependency graph of applications that all support user story “A”

While this solves the previous latency, modal behavior, and complexity tradeoffs, it comes with its own challenges. In a portfolio with multiple user stories and applications, this graph can be very large and discovering each dependency, especially infrequently used ones, can be difficult. In fact, seemingly unrelated dependency graphs can be connected by a single vertex that is shared between them, as shown in Figure 5.

Two unrelated user stories share a dependency on Application 4, requiring both dependency graphs to failover if either experience an impairment.

Figure 5. Two unrelated user stories share a dependency on Application 4, requiring both dependency graphs to failover if either experience an impairment

For example, if every user story you provide depends on a single authentication and authorization system, when one graph of applications needs to failover, then so does the entire authorization system. In turn, every other user story that depends on that authorization system needs to fail over as well. To mitigate this, you might implement independent replicas of these types of applications in each Region, if possible, to remove edges from the dependency graph.

Entire portfolio failover

The final strategy is failing over an entire application portfolio, whether or not applications are impacted or have any interaction with those that are, as shown in Figure 6. This strategy helps remove the operational burden of creating and maintaining dependency graphs for every user story your business supports.

Every user story fails over together regardless of observed impact from a failure.

Figure 6. Every user story fails over together regardless of observed impact from a failure

The major tradeoff is the organizational investment to create multi-Region capabilities for every application – you might not have made that broad investment in the other strategies. You can make this strategy slightly more granular by implementing it for specific application tiers, for example, failing over all tier-1 applications together, as long as you know there aren’t dependencies across applications of different criticality.

You can also combine this approach with the second strategy. Let individual applications make failover decisions until you see broad enough impact, or impact from the modal behavior, that you decide to make all applications failover to your secondary Region to mitigate the effects.

Conclusion

This blog post has looked at four different high-level approaches for creating an organizational multi-Region failover strategy.

Each strategy optimizes for different outcomes. Component-level failover gives you the highest degree of flexibility without organizational capabilities or coordination, but introduces the most complexity and bimodal behavior. Individual application failover optimizes for less complexity in failover combinations than component-level while still maintaining decentralized flexibility in failover decision making. Dependency graph failover optimizes for only needing to failover the minimum set of applications to support a capability, which removes the presence of modal behavior while requiring more organizational investment to do so. Finally, portfolio failover optimizes for not needing to maintain dependency graphs, but requires significant additional investment to build a multi-Region capability for every application.

Creating the strategy can be an iterative journey. You might start with allowing individual applications to make failover decisions while you build toward a future state of managing failover of independent dependency graphs. For more information on creating multi-Region architectures, see AWS Multi-Region Fundamentals and Disaster Recovery of Workloads on AWS.

Improving Performance and Reducing Cost Using Availability Zone Affinity

Post Syndicated from Michael Haken original https://aws.amazon.com/blogs/architecture/improving-performance-and-reducing-cost-using-availability-zone-affinity/

One of the best practices for building resilient systems in Amazon Virtual Private Cloud (VPC) networks is using multiple Availability Zones (AZ). An AZ is one or more discrete data centers with redundant power, networking, and connectivity. Using multiple AZs allows you to operate workloads that are more highly available, fault tolerant, and scalable than would be possible from a single data center. However, transferring data across AZs adds latency and cost.

This blog post demonstrates an architectural pattern called “Availability Zone Affinity” that improves performance and reduces costs while still maintaining the benefits of Multi-AZ architectures.

Cross Availability Zone effects

AZs are physically separated by a meaningful distance from other AZs in the same AWS Region. Although they all are within 60 miles (100 kilometers) of each other. This produces roundtrip latencies usually under 1-2 milliseconds (ms) between AZs in the same Region. Roundtrip latency between two instances in the same AZ is closer to 100-300 microseconds (µs) when using enhanced networking.1 This can be even lower when the instances use cluster placement groups. Additionally, when data is transferred between two AZs, data transfer charges apply in both directions.

To better understand these effects, we’ll analyze a fictitious workload, the “foo service,” shown in Figure 1. The foo service provides a storage platform for other workloads in AWS to redundantly store data. Requests are first processed by an Application Load Balancer (ALB). ALBs always use cross-zone load balancing to evenly distribute requests to all targets. Next, the request is sent from the load balancer to a request router. The request router performs a few operations, like authorization checks and input validation, before sending it to the storage tier. The storage tier replicates the data sequentially from the lead node, to the middle node, and finally the tail node. Once the data has been written to all three nodes, it is considered committed. The response is sent from the tail node back to the request router, back through the load balancer, and finally returned to the client.

An example system that transfers data across AZs

Figure 1. An example system that transfers data across AZs

We can see in Figure 1 that, in the worst case, the request traversed an AZ boundary eight times. Let’s calculate the fastest possible, zeroth percentile (p0), latency. We’ll assume the best time for non-network processing of the request in the load balancer, request router, and storage tier is 4 ms. If we consider 1 ms as the minimum network latency added for each AZ traversal, in the worst-case scenario of eight AZ traversals, the total processing time can be no faster than 12 ms. At the 50th percentile (p50), meaning the median, let’s assume the cross-AZ latency is 1.5 ms and non-network processing is 8 ms, resulting in a total of 20 ms for overall processing. Additionally, if this system is processing millions of requests, the data transfer charges could become substantial over time. Now, let’s imagine that a workload using the foo service must operate with p50 latency under 20 ms. How can the foo service change their system design to meet this goal?

Availability Zone affinity

The AZ Affinity architectural pattern reduces the number of times an AZ boundary is crossed. In the example system we looked at in Figure 1, AZ Affinity can be implemented with two changes.

  1. First, the ALB is replaced with a Network Load Balancer (NLB). NLBs provide an elastic network interface per AZ that is configured with a static IP. NLBs also have cross-zone load balancing disabled by default. This ensures that requests are only sent to targets that are in the same AZ as the elastic network interface that receives the request.
  2. Second, DNS entries are created for each elastic network interface to provide an AZ-specific record using the AZ ID, which is consistent across accounts. Clients use that DNS record to communicate with a load balancer in the AZ they select. So instead of interacting with a Region-wide service using a DNS name like foo.com, they would instead use use1-az1.foo.com.

Figure 2 shows the system with AZ Affinity. We can see that each request, in the worst case, only traverses an AZ boundary four times. Data transfer costs are reduced by approximately 40 percent compared to the previous implementation. If we use 300 μs as the p50 latency for intra-AZ communication, we now get (4×300μs)+(4×1.5ms)=7.2ms. Using the median 8 ms processing time, this brings the overall median latency to 15.2 ms. This represents a 40 percent reduction in median network latency. When thinking about p90, p99, or even p99.9 latencies, this reduction could be even more significant.

The system now implements AZ Affinity

Figure 2. The system now implements AZ Affinity

Figure 3 shows how you could take this approach one step farther using service discovery. Instead of requiring the client to remember AZ-specific DNS names for load balancers, we can use AWS Cloud Map for service discovery. AWS Cloud Map is a fully managed service that allows clients to look up IP address and port combinations of service instances using DNS and dynamically retrieve abstract endpoints, like URLs, over the HTTP-based service Discovery API. Service discovery can reduce the need for load balancers, removing their cost and added latency.

The client first retrieves details about the service instances in their AZ from the AWS Cloud Map registry. The results are filtered to the client’s AZ by specifying an optional parameter in the request. Then they use that information to send requests to the discovered request routers.

AZ Affinity implemented using AWS Cloud Map for service discovery

Figure 3. AZ Affinity implemented using AWS Cloud Map for service discovery

Workload resiliency

In the new architecture using AZ Affinity, the client has to select which AZ they communicate with. Since they are “pinned” to a single AZ and not load balanced across multiple AZs, they may see impact during an event affecting the AWS infrastructure or foo service in that AZ.

During this kind of event, clients can choose to use retries with exponential backoff or send requests to the other AZs that aren’t impacted. Alternatively, they could implement a circuit breaker to stop making requests from the client in the affected AZ and only use clients in the others. Both approaches allow them to use the resiliency of Multi-AZ systems while taking advantage of AZ Affinity during normal operation.

Client libraries

The easiest way to achieve the process of service discovery, retries with exponential backoff, circuit breakers, and failover is to provide a client library/SDK. The library handles all of this logic for users and makes the process transparent, like what the AWS SDK or CLI does. Users then get two options, the low-level API and the high-level library.

Conclusion

This blog demonstrated how the AZ Affinity pattern helps reduce latency and data transfer costs for Multi-AZ systems while providing high availability. If you want to investigate your data transfer costs, check out the Using AWS Cost Explorer to analyze data transfer costs blog for an approach using AWS Cost Explorer.

For investigating latency in your workload, consider using AWS X-Ray and Amazon CloudWatch for tracing and observability in your system. AZ Affinity isn’t the right solution for every workload, but if you need to reduce inter-AZ data transfer costs or improve latency, it’s definitely an approach to consider.


  1. This estimate was made using t4g.small instances sending ping requests across AZs. The tests were conducted in the us-east-1, us-west-2, and eu-west-1 Regions. These results represent the p0 (fastest) and p50 (median) intra-AZ latency in those Regions at the time they were gathered, but are not a guarantee of the latency between two instances in any location. You should perform your own tests to calculate the performance enhancements AZ Affinity offers.

Using VPC Endpoints in Multi-Region Architectures with Route 53 Resolver

Post Syndicated from Michael Haken original https://aws.amazon.com/blogs/architecture/using-vpc-endpoints-in-multi-region-architectures-with-route-53-resolver/

Many customers are building multi-Region architectures on AWS. They might want to bring their systems closer to their end users, support disaster recovery (DR), or comply with data sovereignty requirements. Often, these architectures use Amazon Virtual Private Cloud (VPC) to host resources like Amazon EC2 instances, Amazon Relational Database Service (RDS) databases, and AWS Lambda functions. Typically, these VPCs are also connected using VPC peering or AWS Transit Gateway.

Within these VPC networks, customers also use AWS PrivateLink to deploy VPC endpoints. These endpoints provide private connectivity between VPCs and AWS services. They also support endpoint policies that allow customers to implement guardrails. As an example, customers frequently use endpoint policies to ensure that only IAM principals in their AWS Organization are accessing resources from their networks.

The challenge some customers have faced is that VPC endpoints can only be used to access resources in the same Region as the endpoint. For example, an Amazon Simple Storage Service (S3) VPC endpoint deployed in us-east-1 can only be used to access S3 buckets also located in us-east-1. To access a bucket in us-east-2, that traffic has to traverse the public internet. Ideally, customers want to keep this traffic within their private network and apply VPC endpoint policies, regardless of the Region where the resource is located.

Amazon Route 53 Resolver to the rescue

One of the ways we can solve this problem is with Amazon Route 53 Resolver. Route 53 Resolver provides inbound and outbound DNS services in a VPC. It allows you to resolve domain names for AWS resources in the Region where the resolver endpoint is deployed. It also allows you to forward DNS requests to other DNS servers based on rules you define. To consistently apply VPC endpoint policies to all traffic, we use Route 53 Resolver to steer traffic to VPC endpoints in each Region.

Figure 1. A multi-Region architecture with Route 53 Resolver and S3 endpoints

Figure 1. A multi-Region architecture with Route 53 Resolver and S3 endpoints

In this example shown in Figure 1, we have a workload that operates in us-east-1. It must access Amazon S3 buckets in us-east-2 and us-west-2. There is a VPC in each Region that is connected via VPC peering to the one in us-east-1. We’ve also deployed an inbound and outbound Route 53 Resolver endpoint in each VPC.

Finally, we also have Amazon S3 interface VPC endpoints in each VPC. These provide their own unique DNS names. They can be resolved to private IP addresses using VPC provided DNS (using the .2 address or 169.254.169.253 address) or the inbound resolver IP addresses.

When the EC2 instance accesses a bucket in us-east-1, the Route 53 Resolver endpoint resolves the DNS name to the private IP address of the VPC endpoint. However, without an outbound rule, a DNS query for a bucket in another Region like us-east-2 would resolve to the public IP address of the S3 service. To solve this, we’re going to add four outbound rules to the resolver in us-east-1.

  • us-west-2.amazonaws.com
  • us-west-2.vpce.amazonaws.com
  • us-east-2.amazonaws.com
  • us-east-2.vpce.amazonaws.com

These rules will forward the DNS request to the appropriate inbound Route 53 Resolver in the peered VPC. When there isn’t a VPC endpoint deployed for a service, the resolver will use its automatically created recursive rule to return the public IP address. Let’s look at how this works in Figure 2.

Figure 2. The workflow of resolving an out-of-Region S3 DNS name

Figure 2. The workflow of resolving an out-of-Region S3 DNS name

  1. The EC2 instance runs a command to list a bucket in us-east-2. The DNS request first goes to the local Route 53 Resolver endpoint in us-east-1.
  2. The Route 53 Resolver in us-east-1 has an outbound rule matching the bucket’s domain name. This forwards all DNS queries for the domain us-east-2.vpce.amazonaws.com to the inbound Route 53 Resolver in us-east-2.
  3. The Route 53 Resolver in us-east-2 responds with the private IP address of the S3 interface VPC endpoint in its VPC. This is then returned to the EC2 instance.
  4. The EC2 instance sends the request to the S3 interface VPC endpoint in us-east-2.

This pattern can be easily extended to support any Region that your organization uses. Add additional VPCs in those Regions to host the Route 53 Resolver endpoints and VPC endpoints. Then, add additional outbound resolver rules for those Regions. You can also support additional AWS services by deploying VPC endpoints for them in each peered VPC that hosts the inbound Route 53 Resolver endpoint.

This architecture can be extended to provide a centralized capability to your entire business instead of supporting a single workload in a VPC. We’ll look at that next.

Scaling cross-Region VPC endpoints with Route 53 Resolver

In Figure 3, each Region has a centralized HTTP proxy fleet. This is located in a dedicated VPC with AWS service VPC endpoints and a Route 53 Resolver endpoint. Each workload VPC in the same Region connects to this VPC over Transit Gateway. All instances send their HTTP traffic to the proxies. The proxies manage resolving domain names and forwarding the traffic to the correct Region. Here, each Route 53 Resolver supports inbound DNS requests from other VPCs. It also has outbound rules to forward requests to the appropriate Region. Let’s walk through how this solution works.

Figure 3. Using Route 53 Resolver endpoints with central HTTP proxies

Figure 3. Using Route 53 Resolver endpoints with central HTTP proxies

  1. The EC2 instance in us-east-1 runs a command to list a bucket in us-east-2. The HTTP request is sent to the proxy fleet in the same Region.
  2. The proxy fleet attempts to resolve the domain name of the bucket in us-east-2. The Route 53 Resolver in us-east-1 has an outbound rule for the domain us-east-2.vpce.amazonaws.com. This rule forwards the DNS query to the inbound Route 53 Resolver in us-east-2. The Route 53 Resolver in us-east-2 responds with the private IP address of the S3 interface endpoint in its VPC.
  3. The proxy server sends the request to the S3 interface endpoint in us-east-2 over the Transit Gateway connection. VPC endpoint policies are consistently applied to the request.

This solution (Figure 3) scales the previous implementation (Figure 2) to support multiple workloads across all of the in-use Regions. And it does this without duplicating VPC endpoints in every VPC.

If your environment doesn’t use HTTP proxies, you could alternatively deploy Route 53 Resolver outbound endpoints in each workload VPC. In this case, you have two options. The outbound rules can forward the DNS requests directly to the cross-Region inbound resolver, like in the Figure 2. Or, there can be a single outbound rule to forward the DNS requests to a central inbound resolver in the same Region (see Figure 3). The first option reduces dependencies on a centralized service. The second option provides reduced management overhead of the creation and updates to outbound rules.

Conclusion

Customers want a straightforward way to use VPC endpoints and endpoint policies for all Regions uniformly and consistently. Route 53 Resolver provides a solution using DNS. This ensures that requests to AWS services that support VPC endpoints stay within the VPC network, regardless of their Region.

Check out the documentation for Route 53 Resolver to learn more about how you can use DNS to simplify using VPC endpoints in multi-Region architectures.