Tag Archives: Routing

Introducing Cloud Native Networking for Amazon ECS Containers

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/introducing-cloud-native-networking-for-ecs-containers/

This post courtesy of ECS Sr. Software Dev Engineer Anirudh Aithal.

Today, AWS announced Task Networking for Amazon ECS. This feature brings Amazon EC2 networking capabilities to tasks using elastic network interfaces.

An elastic network interface is a virtual network interface that you can attach to an instance in a VPC. When you launch an EC2 virtual machine, an elastic network interface is automatically provisioned to provide networking capabilities for the instance.

A task is a logical group of running containers. Previously, tasks running on Amazon ECS shared the elastic network interface of their EC2 host. Now, the new awsvpc networking mode lets you attach an elastic network interface directly to a task.

This simplifies network configuration, allowing you to treat each container just like an EC2 instance with full networking features, segmentation, and security controls in the VPC.

In this post, I cover how awsvpc mode works and show you how you can start using elastic network interfaces with your tasks running on ECS.

Background:  Elastic network interfaces in EC2

When you launch EC2 instances within a VPC, you don’t have to configure an additional overlay network for those instances to communicate with each other. By default, routing tables in the VPC enable seamless communication between instances and other endpoints. This is made possible by virtual network interfaces in VPCs called elastic network interfaces. Every EC2 instance that launches is automatically assigned an elastic network interface (the primary network interface). All networking parameters—such as subnets, security groups, and so on—are handled as properties of this primary network interface.

Furthermore, an IPv4 address is allocated to every elastic network interface by the VPC at creation (the primary IPv4 address). This primary address is unique and routable within the VPC. This effectively makes your VPC a flat network, resulting in a simple networking topology.

Elastic network interfaces can be treated as fundamental building blocks for connecting various endpoints in a VPC, upon which you can build higher-level abstractions. This allows elastic network interfaces to be leveraged for:

  • VPC-native IPv4 addressing and routing (between instances and other endpoints in the VPC)
  • Network traffic isolation
  • Network policy enforcement using ACLs and firewall rules (security groups)
  • IPv4 address range enforcement (via subnet CIDRs)

Why use awsvpc?

Previously, ECS relied on the networking capability provided by Docker’s default networking behavior to set up the network stack for containers. With the default bridge network mode, containers on an instance are connected to each other using the docker0 bridge. Containers use this bridge to communicate with endpoints outside of the instance, using the primary elastic network interface of the instance on which they are running. Containers share and rely on the networking properties of the primary elastic network interface, including the firewall rules (security group subscription) and IP addressing.

This means you cannot address these containers with the IP address allocated by Docker (it’s allocated from a pool of locally scoped addresses), nor can you enforce finely grained network ACLs and firewall rules. Instead, containers are addressable in your VPC by the combination of the IP address of the primary elastic network interface of the instance, and the host port to which they are mapped (either via static or dynamic port mapping). Also, because a single elastic network interface is shared by multiple containers, it can be difficult to create easily understandable network policies for each container.

The awsvpc networking mode addresses these issues by provisioning elastic network interfaces on a per-task basis. Hence, containers no longer share or contend use these resources. This enables you to:

  • Run multiple copies of the container on the same instance using the same container port without needing to do any port mapping or translation, simplifying the application architecture.
  • Extract higher network performance from your applications as they no longer contend for bandwidth on a shared bridge.
  • Enforce finer-grained access controls for your containerized applications by associating security group rules for each Amazon ECS task, thus improving the security for your applications.

Associating security group rules with a container or containers in a task allows you to restrict the ports and IP addresses from which your application accepts network traffic. For example, you can enforce a policy allowing SSH access to your instance, but blocking the same for containers. Alternatively, you could also enforce a policy where you allow HTTP traffic on port 80 for your containers, but block the same for your instances. Enforcing such security group rules greatly reduces the surface area of attack for your instances and containers.

ECS manages the lifecycle and provisioning of elastic network interfaces for your tasks, creating them on-demand and cleaning them up after your tasks stop. You can specify the same properties for the task as you would when launching an EC2 instance. This means that containers in such tasks are:

  • Addressable by IP addresses and the DNS name of the elastic network interface
  • Attachable as ‘IP’ targets to Application Load Balancers and Network Load Balancers
  • Observable from VPC flow logs
  • Access controlled by security groups

­This also enables you to run multiple copies of the same task definition on the same instance, without needing to worry about port conflicts. You benefit from higher performance because you don’t need to perform any port translations or contend for bandwidth on the shared docker0 bridge, as you do with the bridge networking mode.

Getting started

If you don’t already have an ECS cluster, you can create one using the create cluster wizard. In this post, I use “awsvpc-demo” as the cluster name. Also, if you are following along with the command line instructions, make sure that you have the latest version of the AWS CLI or SDK.

Registering the task definition

The only change to make in your task definition for task networking is to set the networkMode parameter to awsvpc. In the ECS console, enter this value for Network Mode.

 

If you plan on registering a container in this task definition with an ECS service, also specify a container port in the task definition. This example specifies an NGINX container exposing port 80:

This creates a task definition named “nginx-awsvpc" with networking mode set to awsvpc. The following commands illustrate registering the task definition from the command line:

$ cat nginx-awsvpc.json
{
        "family": "nginx-awsvpc",
        "networkMode": "awsvpc",
        "containerDefinitions": [
            {
                "name": "nginx",
                "image": "nginx:latest",
                "cpu": 100,
                "memory": 512,
                "essential": true,
                "portMappings": [
                  {
                    "containerPort": 80,
                    "protocol": "tcp"
                  }
                ]
            }
        ]
}

$ aws ecs register-task-definition --cli-input-json file://./nginx-awsvpc.json

Running the task

To run a task with this task definition, navigate to the cluster in the Amazon ECS console and choose Run new task. Specify the task definition as “nginx-awsvpc“. Next, specify the set of subnets in which to run this task. You must have instances registered with ECS in at least one of these subnets. Otherwise, ECS can’t find a candidate instance to attach the elastic network interface.

You can use the console to narrow down the subnets by selecting a value for Cluster VPC:

 

Next, select a security group for the task. For the purposes of this example, create a new security group that allows ingress only on port 80. Alternatively, you can also select security groups that you’ve already created.

Next, run the task by choosing Run Task.

You should have a running task now. If you look at the details of the task, you see that it has an elastic network interface allocated to it, along with the IP address of the elastic network interface:

You can also use the command line to do this:

$ aws ecs run-task --cluster awsvpc-ecs-demo --network-configuration "awsvpcConfiguration={subnets=["subnet-c070009b"],securityGroups=["sg-9effe8e4"]}" nginx-awsvpc $ aws ecs describe-tasks --cluster awsvpc-ecs-demo --task $ECS_TASK_ARN --query tasks[0]
{
    "taskArn": "arn:aws:ecs:us-west-2:xx..x:task/f5xx-...",
    "group": "family:nginx-awsvpc",
    "attachments": [
        {
            "status": "ATTACHED",
            "type": "ElasticNetworkInterface",
            "id": "xx..",
            "details": [
                {
                    "name": "subnetId",
                    "value": "subnet-c070009b"
                },
                {
                    "name": "networkInterfaceId",
                    "value": "eni-b0aaa4b2"
                },
                {
                    "name": "macAddress",
                    "value": "0a:47:e4:7a:2b:02"
                },
                {
                    "name": "privateIPv4Address",
                    "value": "10.0.0.35"
                }
            ]
        }
    ],
    ...
    "desiredStatus": "RUNNING",
    "taskDefinitionArn": "arn:aws:ecs:us-west-2:xx..x:task-definition/nginx-awsvpc:2",
    "containers": [
        {
            "containerArn": "arn:aws:ecs:us-west-2:xx..x:container/62xx-...",
            "taskArn": "arn:aws:ecs:us-west-2:xx..x:task/f5x-...",
            "name": "nginx",
            "networkBindings": [],
            "lastStatus": "RUNNING",
            "networkInterfaces": [
                {
                    "privateIpv4Address": "10.0.0.35",
                    "attachmentId": "xx.."
                }
            ]
        }
    ]
}

When you describe an “awsvpc” task, details of the elastic network interface are returned via the “attachments” object. You can also get this information from the “containers” object. For example:

$ aws ecs describe-tasks --cluster awsvpc-ecs-demo --task $ECS_TASK_ARN --query tasks[0].containers[0].networkInterfaces[0].privateIpv4Address
"10.0.0.35"

Conclusion

The nginx container is now addressable in your VPC via the 10.0.0.35 IPv4 address. You did not have to modify the security group on the instance to allow requests on port 80, thus improving instance security. Also, you ensured that all ports apart from port 80 were blocked for this application without modifying the application itself, which makes it easier to manage your task on the network. You did not have to interact with any of the elastic network interface API operations, as ECS handled all of that for you.

You can read more about the task networking feature in the ECS documentation. For a detailed look at how this new networking mode is implemented on an instance, see Under the Hood: Task Networking for Amazon ECS.

Please use the comments section below to send your feedback.

Building a Multi-region Serverless Application with Amazon API Gateway and AWS Lambda

Post Syndicated from Stefano Buliani original https://aws.amazon.com/blogs/compute/building-a-multi-region-serverless-application-with-amazon-api-gateway-and-aws-lambda/

This post written by: Magnus Bjorkman – Solutions Architect

Many customers are looking to run their services at global scale, deploying their backend to multiple regions. In this post, we describe how to deploy a Serverless API into multiple regions and how to leverage Amazon Route 53 to route the traffic between regions. We use latency-based routing and health checks to achieve an active-active setup that can fail over between regions in case of an issue. We leverage the new regional API endpoint feature in Amazon API Gateway to make this a seamless process for the API client making the requests. This post does not cover the replication of your data, which is another aspect to consider when deploying applications across regions.

Solution overview

Currently, the default API endpoint type in API Gateway is the edge-optimized API endpoint, which enables clients to access an API through an Amazon CloudFront distribution. This typically improves connection time for geographically diverse clients. By default, a custom domain name is globally unique and the edge-optimized API endpoint would invoke a Lambda function in a single region in the case of Lambda integration. You can’t use this type of endpoint with a Route 53 active-active setup and fail-over.

The new regional API endpoint in API Gateway moves the API endpoint into the region and the custom domain name is unique per region. This makes it possible to run a full copy of an API in each region and then use Route 53 to use an active-active setup and failover. The following diagram shows how you do this:

Active/active multi region architecture

  • Deploy your Rest API stack, consisting of API Gateway and Lambda, in two regions, such as us-east-1 and us-west-2.
  • Choose the regional API endpoint type for your API.
  • Create a custom domain name and choose the regional API endpoint type for that one as well. In both regions, you are configuring the custom domain name to be the same, for example, helloworldapi.replacewithyourcompanyname.com
  • Use the host name of the custom domain names from each region, for example, xxxxxx.execute-api.us-east-1.amazonaws.com and xxxxxx.execute-api.us-west-2.amazonaws.com, to configure record sets in Route 53 for your client-facing domain name, for example, helloworldapi.replacewithyourcompanyname.com

The above solution provides an active-active setup for your API across the two regions, but you are not doing failover yet. For that to work, set up a health check in Route 53:

Route 53 Health Check

A Route 53 health check must have an endpoint to call to check the health of a service. You could do a simple ping of your actual Rest API methods, but instead provide a specific method on your Rest API that does a deep ping. That is, it is a Lambda function that checks the status of all the dependencies.

In the case of the Hello World API, you don’t have any other dependencies. In a real-world scenario, you could check on dependencies as databases, other APIs, and external dependencies. Route 53 health checks themselves cannot use your custom domain name endpoint’s DNS address, so you are going to directly call the API endpoints via their region unique endpoint’s DNS address.

Walkthrough

The following sections describe how to set up this solution. You can find the complete solution at the blog-multi-region-serverless-service GitHub repo. Clone or download the repository locally to be able to do the setup as described.

Prerequisites

You need the following resources to set up the solution described in this post:

  • AWS CLI
  • An S3 bucket in each region in which to deploy the solution, which can be used by the AWS Serverless Application Model (SAM). You can use the following CloudFormation templates to create buckets in us-east-1 and us-west-2:
    • us-east-1:
    • us-west-2:
  • A hosted zone registered in Amazon Route 53. This is used for defining the domain name of your API endpoint, for example, helloworldapi.replacewithyourcompanyname.com. You can use a third-party domain name registrar and then configure the DNS in Amazon Route 53, or you can purchase a domain directly from Amazon Route 53.

Deploy API with health checks in two regions

Start by creating a small “Hello World” Lambda function that sends back a message in the region in which it has been deployed.


"""Return message."""
import logging

logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Lambda handler for getting the hello world message."""

    region = context.invoked_function_arn.split(':')[3]

    logger.info("message: " + "Hello from " + region)
    
    return {
		"message": "Hello from " + region
    }

Also create a Lambda function for doing a health check that returns a value based on another environment variable (either “ok” or “fail”) to allow for ease of testing:


"""Return health."""
import logging
import os

logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Lambda handler for getting the health."""

    logger.info("status: " + os.environ['STATUS'])
    
    return {
		"status": os.environ['STATUS']
    }

Deploy both of these using an AWS Serverless Application Model (SAM) template. SAM is a CloudFormation extension that is optimized for serverless, and provides a standard way to create a complete serverless application. You can find the full helloworld-sam.yaml template in the blog-multi-region-serverless-service GitHub repo.

A few things to highlight:

  • You are using inline Swagger to define your API so you can substitute the current region in the x-amazon-apigateway-integration section.
  • Most of the Swagger template covers CORS to allow you to test this from a browser.
  • You are also using substitution to populate the environment variable used by the “Hello World” method with the region into which it is being deployed.

The Swagger allows you to use the same SAM template in both regions.

You can only use SAM from the AWS CLI, so do the following from the command prompt. First, deploy the SAM template in us-east-1 with the following commands, replacing “<your bucket in us-east-1>” with a bucket in your account:


> cd helloworld-api
> aws cloudformation package --template-file helloworld-sam.yaml --output-template-file /tmp/cf-helloworld-sam.yaml --s3-bucket <your bucket in us-east-1> --region us-east-1
> aws cloudformation deploy --template-file /tmp/cf-helloworld-sam.yaml --stack-name multiregionhelloworld --capabilities CAPABILITY_IAM --region us-east-1

Second, do the same in us-west-2:


> aws cloudformation package --template-file helloworld-sam.yaml --output-template-file /tmp/cf-helloworld-sam.yaml --s3-bucket <your bucket in us-west-2> --region us-west-2
> aws cloudformation deploy --template-file /tmp/cf-helloworld-sam.yaml --stack-name multiregionhelloworld --capabilities CAPABILITY_IAM --region us-west-2

The API was created with the default endpoint type of Edge Optimized. Switch it to Regional. In the Amazon API Gateway console, select the API that you just created and choose the wheel-icon to edit it.

API Gateway edit API settings

In the edit screen, select the Regional endpoint type and save the API. Do the same in both regions.

Grab the URL for the API in the console by navigating to the method in the prod stage.

API Gateway endpoint link

You can now test this with curl:


> curl https://2wkt1cxxxx.execute-api.us-west-2.amazonaws.com/prod/helloworld
{"message": "Hello from us-west-2"}

Write down the domain name for the URL in each region (for example, 2wkt1cxxxx.execute-api.us-west-2.amazonaws.com), as you need that later when you deploy the Route 53 setup.

Create the custom domain name

Next, create an Amazon API Gateway custom domain name endpoint. As part of using this feature, you must have a hosted zone and domain available to use in Route 53 as well as an SSL certificate that you use with your specific domain name.

You can create the SSL certificate by using AWS Certificate Manager. In the ACM console, choose Get started (if you have no existing certificates) or Request a certificate. Fill out the form with the domain name to use for the custom domain name endpoint, which is the same across the two regions:

Amazon Certificate Manager request new certificate

Go through the remaining steps and validate the certificate for each region before moving on.

You are now ready to create the endpoints. In the Amazon API Gateway console, choose Custom Domain Names, Create Custom Domain Name.

API Gateway create custom domain name

A few things to highlight:

  • The domain name is the same as what you requested earlier through ACM.
  • The endpoint configuration should be regional.
  • Select the ACM Certificate that you created earlier.
  • You need to create a base path mapping that connects back to your earlier API Gateway endpoint. Set the base path to v1 so you can version your API, and then select the API and the prod stage.

Choose Save. You should see your newly created custom domain name:

API Gateway custom domain setup

Note the value for Target Domain Name as you need that for the next step. Do this for both regions.

Deploy Route 53 setup

Use the global Route 53 service to provide DNS lookup for the Rest API, distributing the traffic in an active-active setup based on latency. You can find the full CloudFormation template in the blog-multi-region-serverless-service GitHub repo.

The template sets up health checks, for example, for us-east-1:


HealthcheckRegion1:
  Type: "AWS::Route53::HealthCheck"
  Properties:
    HealthCheckConfig:
      Port: "443"
      Type: "HTTPS_STR_MATCH"
      SearchString: "ok"
      ResourcePath: "/prod/healthcheck"
      FullyQualifiedDomainName: !Ref Region1HealthEndpoint
      RequestInterval: "30"
      FailureThreshold: "2"

Use the health check when you set up the record set and the latency routing, for example, for us-east-1:


Region1EndpointRecord:
  Type: AWS::Route53::RecordSet
  Properties:
    Region: us-east-1
    HealthCheckId: !Ref HealthcheckRegion1
    SetIdentifier: "endpoint-region1"
    HostedZoneId: !Ref HostedZoneId
    Name: !Ref MultiregionEndpoint
    Type: CNAME
    TTL: 60
    ResourceRecords:
      - !Ref Region1Endpoint

You can create the stack by using the following link, copying in the domain names from the previous section, your existing hosted zone name, and the main domain name that is created (for example, hellowordapi.replacewithyourcompanyname.com):

The following screenshot shows what the parameters might look like:
Serverless multi region Route 53 health check

Specifically, the domain names that you collected earlier would map according to following:

  • The domain names from the API Gateway “prod”-stage go into Region1HealthEndpoint and Region2HealthEndpoint.
  • The domain names from the custom domain name’s target domain name goes into Region1Endpoint and Region2Endpoint.

Using the Rest API from server-side applications

You are now ready to use your setup. First, demonstrate the use of the API from server-side clients. You can demonstrate this by using curl from the command line:


> curl https://hellowordapi.replacewithyourcompanyname.com/v1/helloworld/
{"message": "Hello from us-east-1"}

Testing failover of Rest API in browser

Here’s how you can use this from the browser and test the failover. Find all of the files for this test in the browser-client folder of the blog-multi-region-serverless-service GitHub repo.

Use this html file:


<!DOCTYPE HTML>
<html>
<head>
    <meta charset="utf-8"/>
    <meta http-equiv="X-UA-Compatible" content="IE=edge"/>
    <meta name="viewport" content="width=device-width, initial-scale=1"/>
    <title>Multi-Region Client</title>
</head>
<body>
<div>
   <h1>Test Client</h1>

    <p id="client_result">

    </p>

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
    <script src="settings.js"></script>
    <script src="client.js"></script>
</body>
</html>

The html file uses this JavaScript file to repeatedly call the API and print the history of messages:


var messageHistory = "";

(function call_service() {

   $.ajax({
      url: helloworldMultiregionendpoint+'v1/helloworld/',
      dataType: "json",
      cache: false,
      success: function(data) {
         messageHistory+="<p>"+data['message']+"</p>";
         $('#client_result').html(messageHistory);
      },
      complete: function() {
         // Schedule the next request when the current one's complete
         setTimeout(call_service, 10000);
      },
      error: function(xhr, status, error) {
         $('#client_result').html('ERROR: '+status);
      }
   });

})();

Also, make sure to update the settings in settings.js to match with the API Gateway endpoints for the DNS-proxy and the multi-regional endpoint for the Hello World API: var helloworldMultiregionendpoint = "https://hellowordapi.replacewithyourcompanyname.com/";

You can now open the HTML file in the browser (you can do this directly from the file system) and you should see something like the following screenshot:

Serverless multi region browser test

You can test failover by changing the environment variable in your health check Lambda function. In the Lambda console, select your health check function and scroll down to the Environment variables section. For the STATUS key, modify the value to fail.

Lambda update environment variable

You should see the region switch in the test client:

Serverless multi region broker test switchover

During an emulated failure like this, the browser might take some additional time to switch over due to connection keep-alive functionality. If you are using a browser like Chrome, you can kill all the connections to see a more immediate fail-over: chrome://net-internals/#sockets

Summary

You have implemented a simple way to do multi-regional serverless applications that fail over seamlessly between regions, either being accessed from the browser or from other applications/services. You achieved this by using the capabilities of Amazon Route 53 to do latency based routing and health checks for fail-over. You unlocked the use of these features in a serverless application by leveraging the new regional endpoint feature of Amazon API Gateway.

The setup was fully scripted using CloudFormation, the AWS Serverless Application Model (SAM), and the AWS CLI, and it can be integrated into deployment tools to push the code across the regions to make sure it is available in all the needed regions. For more information about cross-region deployments, see Building a Cross-Region/Cross-Account Code Deployment Solution on AWS on the AWS DevOps blog.

New – AWS PrivateLink for AWS Services: Kinesis, Service Catalog, EC2 Systems Manager, Amazon EC2 APIs, and ELB APIs in your VPC

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/new-aws-privatelink-endpoints-kinesis-ec2-systems-manager-and-elb-apis-in-your-vpc/

This guest post is by Colm MacCárthaigh, Senior Engineer for Amazon Virtual Private Cloud.


Since VPC Endpoints launched in 2015, creating Endpoints has been a popular way to securely access S3 and DynamoDB from an Amazon Virtual Private Cloud (VPC) without the need for an Internet gateway, a NAT gateway, or firewall proxies. With VPC Endpoints, the routing between the VPC and the AWS service is handled by the AWS network, and IAM policies can be used to control access to service resources.

Today we are announcing AWS PrivateLink, the newest generation of VPC Endpoints which is designed for customers to access AWS services in a highly available and scalable manner, while keeping all the traffic within the AWS network. Kinesis, Service Catalog, Amazon EC2, EC2 Systems Manager (SSM), and Elastic Load Balancing (ELB) APIs are now available to use inside your VPC, with support for more services coming soon such as Key Management Service (KMS) and Amazon Cloudwatch.

With traditional endpoints, it’s very much like connecting a virtual cable between your VPC and the AWS service. Connectivity to the AWS service does not require an Internet or NAT gateway, but the endpoint remains outside of your VPC. With PrivateLink, endpoints are instead created directly inside of your VPC, using Elastic Network Interfaces (ENIs) and IP addresses in your VPC’s subnets. The service is now in your VPC, enabling connectivity to AWS services via private IP addresses. That means that VPC Security Groups can be used to manage access to the endpoints and that PrivateLink endpoints can also be accessed from your premises via AWS Direct Connect.

Using the services powered by PrivateLink, customers can now manage fleets of instances, create and manage catalogs of IT services as well as store and process data, without requiring the traffic to traverse the Internet.

Creating a PrivateLink Endpoint
To create a PrivateLink endpoint, I navigate to the VPC Console, select Endpoints, and choose Create Endpoint.

I then choose which service I’d like to access. New PrivateLink endpoints have an “interface” type. In this case I’d like to use the Kinesis service directly from my VPC and I choose the kinesis-streams service.

At this point I can choose which of my VPCs I’d like to launch my new endpoint in, and select the subnets that the ENIs and IP addresses will be placed in. I can also associate the endpoint with a new or existing Security Group, allowing me to control which of my instances can access the Endpoint.

Because PrivateLink endpoints will use IP addresses from my VPC, I have the option to over-ride DNS for the AWS service DNS name by using VPC Private DNS. By leaving Enable Private DNS Name checked, lookups from within my VPC for “kinesis.us-east-1.amazonaws.com” will resolve to the IP addresses for the endpoint that I’m creating. This makes the transition to the endpoint seamless without requiring any changes to my applications. If I’d prefer to test or configure the endpoint before handling traffic by default, I can leave this disabled and then change it at any time by editing the endpoint.

Once I’m ready and happy with the VPC, subnets and DNS settings, I click Create Endpoint to complete the process.

Using a PrivateLink Endpoint

By default, with the Private DNS Name enabled, using a PrivateLink endpoint is as straight-forward as using the SDK, AWS CLI or other software that accesses the service API from within your VPC. There’s no need to change any code or configurations.

To support testing and advanced configurations, every endpoint also gets a set of DNS names that are unique and dedicated to your endpoint. There’s a primary name for the endpoint and zonal names.

The primary name is particularly useful for accessing your endpoint via Direct Connect, without having to use any DNS over-rides on-premises. Naturally, the primary name can also be used inside of your VPC.
The primary name, and the main service name – since I chose to over-ride it – include zonal fault-tolerance and will balance traffic between the Availability Zones. If I had an architecture that uses zonal isolation techniques, either for fault containment and compartmentalization, low latency, or for minimizing regional data transfer I could also use the zonal names to explicitly control whether my traffic flows between or stays within zones.

Pricing & Availability
AWS PrivateLink is available today in all AWS commercial regions except China (Beijing). For the region availability of individual services, please check our documentation.

Pricing starts at $0.01 / hour plus a data processing charge at $0.01 / GB. Data transferred between availability zones, or between your Endpoint and your premises via Direct Connect will also incur the usual EC2 Regional and Direct Connect data transfer charges. For more information, see VPC Pricing.

Colm MacCárthaigh

 

Internet Routing and Traffic Engineering

Post Syndicated from Tom Scholl original https://aws.amazon.com/blogs/architecture/internet-routing-and-traffic-engineering/

Internet Routing

Internet routing today is handled through the use of a routing protocol known as BGP (Border Gateway Protocol). Individual networks on the Internet are represented as an autonomous system (AS). An autonomous system has a globally unique autonomous system number (ASN) which is allocated by a Regional Internet Registry (RIR), who also handle allocation of IP addresses to networks. Each individual autonomous system establishes BGP peering sessions to other autonomous systems to exchange routing information. A BGP peering session is a TCP session established between two routers, each one in a particular autonomous system. This BGP peering session rides across a link, such as a 10Gigabit Ethernet interface between those routers. The routing information contains an IP address prefix and subnet mask. This translates which IP addresses are associated with an autonomous system number (AS origin). Routing information propagates across these autonomous systems based upon policies that individual networks define.

This is where things get a bit interesting because various factors influence how routing is handled on the Internet. There are two main types of relationships between autonomous systems today: Transit and Peering.

Transit is where an autonomous system will pay an upstream network (known as a transit provider) for the ability to forward traffic towards them who will forward that traffic further. It also provides for the autonomous system purchasing (who is the customer in this relationship) to have their routing information propagated to their adjacencies. Transit involves obtaining direct connectivity from a customer network to an upstream transit provider network. These sorts of connections can be multiple 10Gigabit Ethernet links between each other’s routers. Transit pricing is based upon network utilization in a particular dominant direction with 95th percentile billing. A transit provider will look at a months worth of utilization and in the traffic dominant direction they will bill on the 95th percentile of utilization. The unit used in billing is measured in bits-per-second (bps) and is communicated in a price per Mbps (for example – $2 per Mbps).

Peering is where an autonomous system will connect to another autonomous system and agree to exchange traffic with each other (and routing information) of their own networks and any customers (transit customers) they have. With peering, there are two methods that connectivity is formed on. The first is where direct connectivity is established between individual networks routers with multiple 10Gigabit Ethernet or 100Gigabit Ethernet links. This sort of connectivity is known as “private peering” or PNI (Private Network Interconnect). This sort of connection provides both parties with clear visibility into the interface utilization of traffic in both directions (inbound and outbound). Another form of peering that is established is via Internet Exchange switches, or IX’s. With an Internet Exchange, multiple networks will obtain direct connectivity into a set of Ethernet switches. Individual networks can establish BGP sessions across this exchange with other participants. The benefit of the Internet Exchange is that it allows multiple networks to connect to a common location and use it for one-to-many connectivity. A downside is that any given network does not have visibility into the network utilization of other participants.

Most networks will deploy their network equipment (routers, Dense Wave Division Multiplexing (DWDM) transport equipment) into colocation facilities where networks will establish direct connectivity to each other. This can be via Internet Exchange switches (which are also found in these colocation facilities) or direct connections which are fiber optics cables ran between individual suites/racks where the network gear is located.

Routing Policy

Networks will define their routing policy to prefer routing to other networks based upon a variety of items. The BGP best path decision process in a routers operating system dictates how a router will prefer one BGP path over another. Network operators will write their policy to influence that BGP best-path decision process based upon factors such as the cost to deliver traffic to a destination network in addition to performance.

A typical routing policy within most networks will dictate that internal (their own) and routes learned from their own customers are to be preferred over all other paths. After that, most networks will then prefer peering routes since peering is typically free and often times can provide a shorter/optimal path to reach a destination. Finally the least preferred route to a destination is over paid transit links. When it comes to transit paths, both cost and performance are typically factors in determining how to reach a destination network.

Routing policies themselves are defined on routers in a simple text-based policy language that is specific to the router operating system. They contain two types of functions: matching on one or multiple routes and an action for that match. The matching can include a list of actual IP prefixes and subnet lengths, ASN origins, AS-Paths or other types of BGP attributes (communities, next-hop, etc). The actions can include resetting BGP attributes such as local-preference, Multi-Exit-Discriminators (MED) and various other values (communities, Origin, etc). Below is a simplified example of a routing policy on routes learned from a transit provider. It has multiple terms to permit an operator to match on specific Internet routes to set a different local-preference value to control what traffic should be forwarded through that provider. There are additional actions to set other BGP attributes related to classifying the routes so they can be easily identified and acted upon by other routers in the network.

Network operators will tune their routing policy to determine how to send traffic and how to receive traffic through adjacent autonomous systems. This practice is generally known as BGP traffic-engineering. Making outbound traffic changes is by far the easiest to implement because it involves identifying the particular routes you are interested in directing and increasing the routing preference to egress through a particular adjacency. Operators must take care to examine certain things before and after any policy change to understand the impact of their actions.

Inbound traffic-engineering is a bit more difficult as it requires a network operator to alter routing information announcements leaving your network to influence how other autonomous systems on the Internet prefer to route to you. While influencing the directly adjacent networks to you is somewhat trivial, influencing networks further beyond those directly connected can be tricky. This technique requires the use of features that a transit provider can grant via BGP. In the BGP protocol, there is a certain type of attribute known as Communities. Communities are strings you can pass in a routing update across BGP sessions. Most networks use communities to classify routes as transit vs. peer vs. customer. The transit-customer relationship usually gives certain capabilities to a customer to control the further propagation of routes to their adjacencies. This grants a network with the ability to traffic-engineer further upstream to networks it is not directly connected to.

Traffic-engineering is used for several reasons today on the Internet. The first reason might be to reduce bandwidth costs by preferring particular paths (different transit providers). The other is for performance reasons, where a particular transit provider may have less-congested/lower-latency path to a destination network. Network operators will view a variety of metrics to determine if there is a problem and start to make policy changes to examine the outcome. Of course on the Internet, the scale of the traffic being moved around counts. Moving a few Gbps of traffic from one path to another may improve performance, but if you move tens of Gbps over you may encounter congestion on this newly selected path. The links between various networks on the Internet today operate where they scale capacity based upon observed utilization. Even though you may be paying a transit provider for connectivity, this doesn’t mean every link to external networks is scaled for the amount of traffic you wish to push. As traffic grows, links will be added between individual networks. So causing a massive change in utilization on the Internet can result in congestion as these new paths are handling an increased amount of traffic than they never had before. The result is that network operators must pay attention when moving traffic over in increments as well as communication with other networks to gauge the impact of any traffic moves.

Complicating the above traffic engineering operations is that you are not the only person on the Internet trying to push traffic to certain destinations. Other networks are also in a similar position where they’re trying to deliver traffic and will perform their own traffic-engineering. There are also many networks that will refuse to peer with other networks for several reasons. For example, some networks may cite an imbalance in in vs. outbound (traffic ratios) or feel that traffic is being dumped on their network. In these cases, the only way to reach these destinations is via a transit provider. In some cases, these networks may offer a “paid peering” product to provide direct connectivity. That paid peering product may be priced at a value that is lower the price of what you would pay for transit or could offer an uncongested path that you’d normally observe over transit. Just because you have a path via transit doesn’t mean the path is uncongested at all hours of the day (such as during peak hours).

One way to eliminate the hops between networks is to do just that – eliminate them via direct connections. AWS provides a service to do this known as AWS Direct Connect. With Direct Connect, customers can connect their network directly into the AWS network infrastructure. This will enable bypassing the Internet via direct physical connectivity and remove any potential Internet routing or capacity issues.

Traceroute

In order to determine the paths traffic is taking, tools such as traceroute are very useful. Traceroute operates by sending sending packets to a given destination network and it sets the initial IP TTL value to one. The upstream device will generate an ICMP TTL Exceeded message back to you (the source) which will reveal the first hop in your path to the destination. Subsequent packets will be sent from the source and increment the IP TTL value to show each hop along the way towards the destination. It is important to remember that Internet routing typically involves asymmetric paths – the traffic going towards a destination will take a separate set of hops on the return path. When performing traceroutes to diagnose routing issues it is very useful to obtain the reverse path to help isolate a particular direction of traffic being an issue. With an understanding of both directions traffic is taking, it is then easier to understand what sort of traffic-engineering changes can be made. When dealing with Network Operation Centers (NOCs) or support groups, it is important to provide the Public IP of the source and destination addresses involved in the communication. This provides individuals with the information they can use to help reproduce the issue that is being encountered. It is also useful to include any specific details surrounding the communication, such as if it was HTTP (TCP/80) or HTTPS (TCP/443). Some traceroute applications provide the user with the ability to generate its probes using a variety of protocols such as ICMP Echo Request (ping), UDP or TCP packets to a particular port. Several traceroute programs by default will use ICMP Echo Request or UDP packets (destined to a particular port range). While these work most of the time, various networks on the Internet may filter these sorts of packets and it is recommended to use a traceroute probe that replicates the type of traffic you intend to use to the destination network. For example, using traceroute with TCP/80 or TCP/443 can yield better results when dealing with firewalls or other packet filtering.

An example of a UDP based traceroute (using well-defined traceroute port ranges), where multiple routes will permit generating TTL Exceeded for packets bound for those destination ports:

Note that the last hop does not respond, since it most likely denies UDP packets destined to high ports.

With the same traceroute using TCP/443 (HTTPS), we find multiple routers do not respond but the destination does respond since it is listening on TCP/443:

TCP Traceroute to port 443 (HTTPS):

The hops revealed within traceroute provide some insight into the sort of network devices your packets are traversing. Many network operators will add descriptive information in the DNS reverse PTR records, though each network is going to be different. Typically the DNS entries will indicate the router name, some sort of geographical code and the physical or logical router interface the traffic has traversed. Each individual network names their own routers different so the information here is going to usually indicate if a device is a “core” router (no external or customer interfaces) or an “edge” router (with external network connectivity). Of course, this is not a hard rule and it is common to find multi-function devices within a network. The geographical identifier can vary between IATA airport codes, telecom CLLI codes (or a variation upon them) or internally generated identifiers that are unique to that particular network. Occasionally shortened versions of a physical address or city names will appear in here as well. The actual interface can indicate the interface type and speed, though these are only as accurate as you believe an operator is to publicly reveal this and keep their DNS entries up to date.

One important part of traceroute is that the data should be taken with some skepticism. Traceroute will display round-trip-time (RTT) of each individual hop as the packets traverse through the network to their destination. While this value can provide some insight into the latency to these hops, the actual value can be influenced from a variety of factors. For instance, many modern routers today treat packets that TTL expire on them as a low priority when compared to other functions the router is doing (forwarding packets, routing protocols). As a result, the handling of the TTL expired packets and subsequent ICMP TTL Exceeded message generated can take some period of time. This is why it is very common to occasionally see high RTT on intermediate hops within a traceroute (up to hundreds of milliseconds). This does not always indicate that there is a network issue and individuals should always measure the end-to-end latency (via ping or some application tests). In situations where the RTT does increase at a particular hop and continue to increase, this can be an indicator of an overall increase in latency at a particular point in the network. Another item frequently observed in traceroutes are hops that do not respond to traceroute which will be displayed as *’s. This means that the router(s) at this particular hop have either dropped the TTL expired packet or has not generated the ICMP TTL Exceeded message. This is usually the result of two possible things. The first is that many modern routers today implement Control-Plane Policing (CoPP) which are packet filters on the router to control how certain types of packets are handled. In many modern routers today, the use of ASICs (Application-Specific Integrated Circuit) have improved packet lookup & forwarding functions. When a router ASIC receives a packet with the TTL value of one, they will punt the packet to an additional location within the router to handle the ICMP TTL Exceeded generation. On most routers, the ICMP TTL Exceeded generation is done on a CPU integrated on a linecard or the main brain of the router itself (known as a route processor, routing engine or supervisor). Since the CPU of a linecard or routing engine is busy performing things such as forwarding table programming and routing protocols, routers will allow protections to be put in place to restrict the rate of how many TTL exceeded packets can be sent to these components. CoPP allows an operator to set functions such as limiting TTL Exceeded messages to a value such as 100 packets per second. Additionally the router itself may have an additional rate-limiter to address how many ICMP TTL Exceeded messages can be generated as well. In this situation, you’ll find that hops in your traceroute may sometimes not reply at all because of the use of CoPP. This is also why when performing pings to individual hops (routers) on a traceroute you will see packet loss because CoPP is dropping the packets. The other area where CoPP can be applied is where the router may simply deny all TTL exceeded packets. Within traceroute, these hops will always respond with *’s no matter how many times you execute traceroute.

A good presentation that explains using traceroute on the Internet and interpreting its results is found here: https://www.nanog.org/meetings/nanog45/presentations/Sunday/RAS_traceroute_N45.pdf

Troubleshooting issues on the Internet is no easy task and it requires examining multiple sets of information (traceroute, BGP routing tables) to come to a conclusion as to what can be occurring. The use of Internet looking glasses or route servers is useful to providing a different vantage point on the Internet when troubleshooting. The Looking Glass Wikipedia page has several links to sites which you can use to perform pings, traceroutes and examining a BGP routing table from different spots around the world in various networks.

When reaching out to networks or posting in forums looking for support for Internet routing issues it is important to provide useful information for troubleshooting. This includes the source IP address (the Public IP, not a Private/NAT translated one), the destination IP (once again, the Public IP), what protocol and ports being used (TCP/80 for example) and the specific time/date of when you observed the issue. Traceroutes in both directions are incredibly useful since paths on the Internet can be asymmetric.

Selecting Service Endpoints for Reliability and Performance

Post Syndicated from Nate Dye original https://aws.amazon.com/blogs/architecture/selecting-service-endpoints-for-reliability-and-performance/

Choose Your Route Wisely

Much like a roadway, the Internet is subject to congestion and blockage that cause slowdowns and at worst prevent packets from arriving at their destination. Like too many cars jamming themselves onto a highway, too much data over a route on the Internet results in slowdowns. Transatlantic cable breaks have much the same effect as road construction, resulting in detours and further congestion and may prevent access to certain sites altogether. Services with multiple points of presence paired with either smart clients or service-side routing logic improve performance and reliability by ensuring your viewers have access to alternative routes when roadways are blocked. This blog post provides you with an overview of service side and client side designs that can improve the reliability and performance of your Internet-facing services.

In order to understand how smart routing can increase performance, we must first understand why Internet downloads take time. To extend our roadway analogy, imagine we need to drive to the hardware to obtain supplies. If we can get our supplies in one trip and it takes 10 minutes to drive one way, it takes 20 minutes to get the supplies and return to the house project. (For sake of simplicity we’ll exclude the time involved with finding a parking spot, wandering aimlessly around the store in search of the right parts, and loading up materials.) Much of the time is taken up by driving the 5 km across multiples types of roadways to the hardware store.

Packets on the Internet must also transit physical distance and many links to reach their destination. A traceroute from my home network to my website hosted in eu-west-1 traverse 20 different links. It takes time for a packet to traverse the physical distance between each of these nodes. Traveling at a speed of 200,000 km/sec my packets travel from Seattle to Dublin and back in 170 ms.

Unlike roadways, when congestion occurs on Internet paths routers discard packets. Discarded packets result in retransmits by the sender adding additional round trips to the overall download. The additional round trips result in a slower download. Just as multiple trips to the hardware store results in more time spent driving and less time building.

To reduce the time of a round trip, a packet needs to travel a shorter physical distance. To avoid packet loss and additional round trips, a packet needs to avoid broken or congested routes. Performance and reliability of an Internet-facing service are improved by using the shortest physical path between the service and the client and by minimizing the number of round trips for data transfer.

Adding multiple points of presence to your architecture is like opening up a chain of hardware stores around the city. With multiple points of presence, drivers can choose a destination hardware store that allows them to drive the shortest physical distant. If the route to any one hardware store is congested or blocked entirely, driver can instead drive to the next nearest store.

Measuring the Internet

Finding the fastest and most reliable paths on across the Internet requires real world measurement. The closest endpoint as the crow flies often isn’t the same as fastest via Internet paths. The diagrams below depict an all too common example when driving in the real world. Given the starting point (the white dot), hardware store B (the red marker in the second diagram) is actually closer as the crow flies, but requires a lot more driving distance. A driver that knows the roadways will know that it will take less overall time to drive to store A.

Hardware store A is 10 minutes away by car

Hardware store B would be much closer by way of the water, but is 16 minutes by car

The same is true on the Internet, here’s a real-world example from CloudFront just after it launched in 2008. Below is a traceroute from an ISP network in Singapore to the SIN2 edge location in Singapore (internally AWS uses airport codes to identify edge locations). The trace route below shows that a packet sourced from a Singapore ISP to the SIN2 edge location was routed by way of Hong Kong! Based on the connectivity of this viewer network, Singapore customers are better served from the HKG1 edge location rather than SIN2. If routing logic assumed that geographically closer edge locations resulted in lower latency, the router engine would choose SIN2. By measuring latency the system we see that the system should select HKG50 as the preferred edge location, but going doing the routing system can provide the viewer with 32 ms round trip times versus 39 ms round trip times.

The topology of the Internet is always changing. The SIN to HKG example is old, now SIN2 is directly peered with the network above. Continuous or just in time measurements allow a system to adapt when Internet topology changes. Once the AWS networking team built a direct connection between our SIN edge location and the Singapore ISP, the CloudFront measurement system automatically observed lower latencies. SIN2 became preferred lowest latency pop. The direct path results in 2.2 ms round trip times versus 39 ms round trip times.

Leveraging Real World Measurements for Routing Decisions

Service-Side Routing

CloudFront and Route 53 both leverage similar mechanisms to route clients to the closest available end points. CloudFront uses routing logic in its DNS layer. When clients resolve DNS for a CloudFront distribution they receive a set of IPs that map to the closest available edge location. Route 53’s Latency-Based Routing combined with DNS Failover and health checks allows customers to build the same routing logic into any service hosted across multiple AWS regions.

Availability failures are detected via continuous health checks. Each CloudFront edge location checks the availability of every other edge location and reports the health-state to CloudFront DNS servers (which also happen to be Route 53 DNS servers). If an edge location starts failing health checks, CloudFront DNS servers respond to client queries with IPs that map to the client’s the next-closest healthy edge location. AWS customers can implement similar health check fail over capabilities within their services by use the Route 53 DNS failover and health checks feature.

Lowest-latency routing is achieved by continuously sampling latency measurements from viewer networks across the Internet to each AWS region and edge location. Latency measurement data is compiled into a list of AWS sites for each viewer network, sorted by latency. CloudFront and Route 53 DNS servers use this data to direct traffic. When network topology shifts occur, the latency measurement system picks up the changes and reorders the list of least-latent AWS endpoints for that given network.

Client-Side Routing

But service-side technology isn’t the only way to perform latency-based routing. Client software also has access to all of the data it needs to select the best endpoint. A smart client can intelligently rotate through a list of endpoints whenever a connection failure occurs. Smart clients perform their own latency measurements and choose the lowest latency endpoint as its connection destination. Most DNS resolvers do exactly that. Lee’s prior blog post made mention of DNS resolvers and provided a reference to this presentation.

The DNS protocol is well suited to smart-client endpoint selection. When a DNS server resolves a DNS name, say www.internetkitties.com, the server must first determine which DNS servers are authoritative to answer queries for the domain. In my case, my domain is hosted by Route 53, which provides me with 4 assigned name servers. Smart DNS resolvers track the query response times to each name server and preferentially query the fastest name server. If the name server fails to respond, the DNS resolver falls back to other name servers on the list.

There are examples in HTTP as well. The Kindle Fire web browser, Amazon Silk, makes use of a backend-resources hosted in multiple AWS regions to speed up the web browsing experience for its customers. The Amazon Silk client knows about and collects latency measurements against each endpoint of the Amazon Silk backend service. The client connects to the lowest-latency end point. As the device moves around from one network to another, the lowest-latency end point might change.

Service architecture with multiple points of presence on the Internet combined with latency-based service-side or client-side endpoint selection can improve the performance and reliability of your service. Let us know if you’d like to a follow blog on best practices for measuring Internet latencies or lessons learned from the Amazon Silk client-side endpoint selector.

– Nate Dye

Doing Constant Work to Avoid Failures

Post Syndicated from Vadim Meleshuk original https://aws.amazon.com/blogs/architecture/doing-constant-work-to-avoid-failures/

Amazon Route 53’s DNS Failover feature allows fast, automatic rerouting at the DNS layer based on the health of some endpoints. Endpoints are actively monitored from multiple locations and both application or connectivity issues can trigger failover.

Trust No One

One of the goals in designing the DNS Failover feature was making it resilient to large-scale outages. Should a large subset of targets, such as an entire availability zone, fall off the map, the last thing we’d want to do is accumulate a backlog of processing work, delaying the failover. We avoid incurring delays in such cases by making sure that the amount of work we do is bounded regardless of the runtime conditions. This means bounded workload during all stages of processing: data collection, handling, transmission, aggregation and use. Each of these stages requires some thought for it to not cause delays under any circumstances. On top of its own performance, each stage of the pipeline needs to have its output cardinality bounded in a way to avoid causing trouble for the next stage. For extra protection, that next stage should not trust its input to be bounded and implement safety throttling, preferably, isolating input by type and tenant or by whatever dimension of the input may be variable due to a bug or misconfiguration, such as configuring extra previous stages, each producing output, unexpectedly.

Let’s dig into how it works out for our health checking service “data plane,” the part that does all the work once health checks have been configured through the “control plane.” As an aside, we deploy our health checking data plane and control planes similar to the methods described previously.

Health Checkers Example

The frontline component of the system is a fleet of health checker clusters, running from EC2 regions. Each health check is performed by multiple checkers from multiple regions, repeated at the configured frequency. On failure, we do not retry; retrying immediately would increase the number of performed checks if a large proportion of the targets assigned to a given checker fail. Instead, we just perform the next check as scheduled normally (30 seconds later, by default).

The checkers communicate the results to the aggregators using a steady stream, asynchronously from the checks themselves. It wouldn’t matter if, for some reason, we’d be making more checks – the stream is configured separately and does not inflate dynamically.

Transmitting incremental changes, as opposed to the complete current state, will result in an increase in the transmission size in the event there is a spike in the number of status changes. So to comply with the goal, we can either make sure that the transmission pipe does not choke when absolutely everything changes, or we can just transmit the full snapshot of the current state all the time. It turns out that, in our case, the latter is feasible by optimizing the transmission formats for batching (we use a few bits per state and we pack densely), so we happily do that.

Aggregators on the DNS hosts process large streams of incoming states – we are talking about millions each second – and are very sensitive to processing time, so we better be sure that neither the content nor the count of the messages impact that processing time. Here, again, we protect ourselves from spikes in the rate of incoming messages by only performing aggregation once a second per batch of relevant messages. For each batch, only the last result sent by a checker is used and those received previously are discarded. This contrasts with running aggregation for every received message and depending on the constancy of that incoming stream. It is also in contrast to queuing up the messages to smoothen out the bursts of incoming requests. Queues are helpful for this purpose, but only help when bursts are not sustained. Ensuring constancy of work regardless of bursts works better.

In addition to coping with potential incoming message bursts, aggregators need to be able to process messages in a well-bounded time. One of the mistakes we made here originally was not testing the performance of one of the code paths handling some specific unexpected message content. Due to another unexpected condition, at one point, one of the checkers sent a large number of those unexpected states to the aggregators. On their own, these states would be harmless, only triggering some diagnostics. However, the side effect was significant processing slowdown in the aggregators because of the poorly performing code path. Fortunately, because of the constant work design, this simply resulted in skipping some messages during processing and only extending the failure detection latency by a couple seconds. Of course, we fixed the poorly performing path after this; it would have been better to catch this in testing.

All together, for each pipeline stage, the cardinality or rate of the input is bounded and the cost of processing of all types of input is uniformly bounded as well. As a result, each stage is doing a constant amount of work and will not fall over or cause delays based on external events – which is exactly the goal.

– Vadim Meleshuk

A Case Study in Global Fault Isolation

Post Syndicated from Lee-Ming Zen original https://aws.amazon.com/blogs/architecture/a-case-study-in-global-fault-isolation/

In a previous blog post, we talked about using shuffle sharding to get magical fault isolation. Today, we’ll examine a specific use case that Route 53 employs and one of the interesting tradeoffs we decided to make as part of our sharding. Then, we’ll discuss how you can employ some of these concepts in your own applications.

Overview of Anycast DNS

One of our goals at Amazon Route 53 is to provide low-latency DNS resolution to customers. We do this, in part, by announcing our IP addresses using “anycast” from over 50 edge locations around the globe. Anycast works by routing packets to the closest (network-wise) location that is “advertising” a particular address. In the image below, we can see that there are three locations, all of which can receive traffic for the 205.251.194.72 address.

(Blue circles represent edge locations; orange circles represent AWS regions)

For example, if a customer has ns-584.awsdns-09.net assigned as a nameserver, issuing a query to that nameserver could result in that query landing at any one of multiple locations responsible for advertising the underlying IP address. Where the query lands depends on the anycast routing of the Internet, but it is generally going to be the closest network-wise (and hence, low latency) location to the end user.

Behind the scenes, we have thousands of nameserver names (e.g. ns-584.awsdns-09.net) hosted across four top-level domains (.com, .net, .co.uk, and .org). We refer to all the nameservers in one top-level domain as a ‘stripe;’ thus, we have a .com stripe, a .net stripe, a .co.uk stripe, and a .org stripe. This is where shuffle sharding comes in: each Route 53 domain (hosted zone) receives four nameserver names one from each of stripe. As a result, it is unlikely that two zones will overlap completely across all four nameservers. In fact, we enforce a rule during nameserver assignment that no hosted zone can overlap by more than two nameservers with any previously created hosted zone.

DNS Resolution

Before continuing, it’s worth quickly explaining how DNS resolution works. Typically, a client, such as your laptop or desktop has a “stub resolver.” The stub resolver simply contacts a recursive nameserver (resolver), which in turn queries authoritative nameservers, on the Internet to find the answers to a DNS query. Typically, resolvers are provided by your ISP or corporate network infrastructure, or you may rely on an open resolver such as Google DNS. Route 53 is an authoritative nameserver, responsible for replying to resolvers on behalf of customers. For example, when a client program attempts to look up amazonaws.com, the stub resolver on the machine will query the resolver. If the resolver has the data in cache and the value hasn’t expired, it will use the cached value. Otherwise, the resolver will query authoritative nameservers to find the answer.

(Every location advertises one or more stripes, but we only show Sydney, Singapore, and Hong Kong in the above diagram for clarity.)

Each Route 53 edge location is responsible for serving the traffic for one or more stripes. For example, our edge location in Sydney, Australia could serve both the .com and .net, while Singapore could serve just the .org stripe. Any given location can serve the same stripe as other locations. Hong Kong could serve the .net stripe, too. This means that if a resolver in Australia attempts to resolve a query against a nameserver in the .org stripe, which isn’t provided in Australia, the query will go to the closest location that provides the .org stripe (which is likely Singapore). A resolver in Singapore attempting to query against a nameserver in the .net stripe may go to Hong Kong or Sydney depending on the potential Internet routes from that resolver’s particular network. This is shown in the diagram above.

For any given domain, in general, resolvers learn the lowest latency nameserver based upon the round trip time of the query (this technique is often called SRTT or smooth round-trip time). Over a few queries, a resolver in Australia would gravitate toward using the nameservers on the .net and .com stripes for Route 53 customers’ domains.

Not all resolvers do this. Some choose randomly amongst the nameservers. Others may end up choosing the slowest one, but our experiments show that about 80% of resolvers use the lowest RTT nameserver. For additional information, this presentation presents information on how various resolvers choose which nameserver they utilize. Additionally, many other resolvers (such as Google Public DNS) use pre-fetching, or have very short timeouts if a resolver fails to resolve against a particular nameserver.

The Latency-Availability Decision

Given the above resolver behavior, one option, for a DNS provider like Route 53, might be to advertise all four stripes from every edge location. This would mean that no matter which nameserver a resolver choses, it will always go to the closest network location. However, we believe this provides a poor availability model.

Why? Because edge locations can sometimes fail to provide resolution for a variety of reasons that are very hard to control: the edge location may lose power or Internet connectivity, the resolver may lose connectivity to the edge location, or an intermediary transit provider may lose connectivity. Our experiments have shown that these types of events can cause about 5 minutes of disruption as the Internet updates routing tables. In recent years another serious risk has arisen: large-scale transit network congestion due to DDOS attacks. Our colleague, Nathan Dye, has a talk from AWS re:Invent that provides more details: www.youtube.com/watch?v=V7vTPlV8P3U.

In all of these failure scenarios, advertising every nameserver from every location may result in resolvers having no fallback location. All nameservers would route to the same location and resolvers would fail to resolve DNS queries, resulting in an outage for customers.

In the diagram below, we show the difference for a resolver querying domain X, whose nameservers (NX1, NX2, NX3, NX4) are advertised from all locations and domain Y, whose nameservers (NY1, NY2, NY3, NY4) are advertised in a subset of the locations.

When the path from the resolver to location A is impaired, all queries to the nameservers for domain X will fail. In comparison, even if the path from the resolver to location A is impaired, there are other transit paths to reach nameservers at locations B, C, and D in order to resolve the DNS for domain Y.

Route 53 typically advertises only one stripe per edge location. As a result, if anything goes wrong with a resolver being able to reach an edge location, that resolver has three other nameservers in three other locations to which it can fall back. For example, if we deploy bad software that causes the edge location to stop responding, the resolver can still retry elsewhere. This is why we organize our deployments in “stripe order”; Nick Trebon provides a great overview of our deployment strategies in the previous blog post. It also means that queries to Route 53 gain a lot of Internet path diversity, which helps resolvers route around congestion and other intermediary problems on their path to reaching Route 53.

Route 53’s foremost goal is to always meet our promise of a 100% SLA for DNS queries – that all of our customers’ DNS names should resolve all the time. Our customers also tell us that latency is next most important feature of a DNS service provider. Maximizing Internet path and edge location diversity for availibility necessarily means that some nameservers will respond from farther-away edge locations. For most resolvers, our method has no impact on the minimum RTT, or fastest nameserver, and how quickly it can respond. As resolvers generally use the fastest nameserver, we’re confident that any compromise in resolution times is small and that this is a good balance between the goals of low latency and high availability.

On top of our striping across locations, you may have noticed that the four stripes use different top-level domains. We use multiple top-levels domains in case one of the three TLD providers (.com and .net are both operated by Verisign) has any sort of DNS outage. While this rarely happens, it means that as a customer, you’ll have increased protection during a TLD’s DNS outage because at least two of your four nameservers will continue to work.

Applications

You, too, can apply the same techniques in your own systems and applications. If your system isn’t end-user facing, you could also consider utilizing multiple TLDs for resilience as well. Especially in the case where you control your own API and clients calling the API, there’s no reason to place all your eggs in one TLD basket.

Another application of what we’ve discussed is minimizing downtime during failovers. For high availability applications, we recommend customers utilize Route 53 DNS Failover. With failover configured, Route 53 will only return answers for healthy endpoints. In order to determine endpoint health, Route 53 issues health checks against your endpoint. As a result, there is a minimum of 10 seconds (assuming you configured fast health checks with a single failover interval) where the application could be down, but failover has not triggered yet. On top of that, there is the additional time incurred for resolvers to expire the DNS entry from their cache based upon the record’s TTL. To minimize this failover time, you could write your clients to behave similar to the resolver behavior described earlier. And, while you may not employ an anycast system, you can host your endpoints in multiple locations (e.g. different availability zones and perhaps even different regions). Your clients would learn the SRTT of the multiple endpoints over time and only issue queries to the fastest endpoint, but fallback to the other endpoints if the fastest is unavailable. And, of course, you could shuffle shard your endpoints to achieve increased fault isolation while doing all of the above.

– Lee-Ming Zen