Tag Archives: DNS resolution

Simplify DNS management in a multi-account environment with Route 53 Resolver

Post Syndicated from Mahmoud Matouk original https://aws.amazon.com/blogs/security/simplify-dns-management-in-a-multiaccount-environment-with-route-53-resolver/

In a previous post, I showed you a solution to implement central DNS in a multi-account environment that simplified DNS management by reducing the number of servers and forwarders you needed when implementing cross-account and AWS-to-on-premises domain resolution. With the release of the Amazon Route 53 Resolver service, you now have access to a native conditional forwarder that will simplify hybrid DNS resolution even more.

In this post, I’ll show you a modernized solution to centralize DNS management in a multi-account environment by using Route 53 Resolver. This solution allows you to resolve domains across multiple accounts and between workloads running on AWS and on-premises without the need to run a domain controller in AWS.

Solution overview

My solution will show you how to solve three primary use-cases for domain resolution:

  • Resolving on-premises domains from workloads running in your VPCs.
  • Resolving private domains in your AWS environment from workloads running on-premises.
  • Resolving private domains between workloads running in different AWS accounts.

The following diagram explains the high-level full architecture.
 

Figure 1: Solution architecture diagram

Figure 1: Solution architecture diagram

In this architecture:

  1. This is the Amazon-provided default DNS server for the central DNS VPC, which we’ll refer to as the DNS-VPC. This is the second IP address in the VPC CIDR range (as illustrated, this is 172.27.0.2). This default DNS server will be the primary domain resolver for all workloads running in participating AWS accounts.
  2. This shows the Route 53 Resolver endpoints. The inbound endpoint will receive queries forwarded from on-premises DNS servers and from workloads running in participating AWS accounts. The outbound endpoint will be used to forward domain queries from AWS to on-premises DNS.
  3. This shows conditional forwarding rules. For this architecture, we need two rules, one to forward domain queries for onprem.private zone to the on-premises DNS server through the outbound gateway, and a second rule to forward domain queries for awscloud.private to the resolver inbound endpoint in DNS-VPC.
  4. This indicates that these two forwarding rules are shared with all other AWS accounts through AWS Resource Access Manager and are associated with all VPCs in these accounts.
  5. This shows the private hosted zone created in each account with a unique subdomain of awscloud.private.
  6. This shows the on-premises DNS server with conditional forwarders configured to forward queries to the awscloud.private zone to the IP addresses of the Resolver inbound endpoint.

Note: This solution doesn’t require VPC-peering or connectivity between the source/destination VPCs and the DNS-VPC.

How it works

Now, I’m going to show how the domain resolution flow of this architecture works according to the three use-cases I’m focusing on.

First use case

 

 Figure 2:  Use case for resolving on-premises domains from workloads running in AWS

Figure 2: Use case for resolving on-premises domains from workloads running in AWS

First, I’ll look at resolving on-premises domains from workloads running in AWS. If the server with private domain host1.acc1.awscloud.private attempts to resolve the address host1.onprem.private, here’s what happens:

  1. The DNS query will route to the default DNS server of the VPC that hosts host1.acc1.awscloud.private
  2. Because the VPC is associated with the forwarding rules shared from the central DNS account, these rules will be evaluated by the default Amazon-provided DNS in the VPC.
  3. In this example, one of the rules indicates that queries for onprem.private should be forwarded to an on-premises DNS server. Following this rule, the query will be forwarded to an on-premises DNS server.
  4. The forwarding rule is associated with the Resolver outbound endpoint, so the query will be forwarded through this endpoint to an on-premises DNS server.

In this flow, the DNS query that was initiated in one of the participating accounts has been forwarded to the centralized DNS server which, in turn, forwarded this to the on-premises DNS.

Second use case

Next, here’s how on-premises workloads will be able to resolve private domains in your AWS environment:
 

Figure 3: Use case for how on-premises workloads will be able to resolve private domains in your AWS environment

Figure 3: Use case for how on-premises workloads will be able to resolve private domains in your AWS environment

In this case, the query for host1.acc1.awscloud.private is initiated from an on-premises host. Here’s what happens next:

  1. The domain query is forwarded to on-premises DNS server.
  2. The query is then forwarded to the Resolver inbound endpoint via a conditional forwarder rule on the on-premises DNS server.
  3. The query reaches the default DNS server for DNS-VPC.
  4. Because DNS-VPC is associated with the private hosted zone acc1.awscloud.private, the default DNS server will be able to resolve this domain.

In this case, the DNS query has been initiated on-premises and forwarded to centralized DNS on the AWS side through the inbound endpoint.

Third use case

Finally, you might need to resolve domains across multiple AWS accounts. Here’s how you could achieve this:
 

Figure 4: Use case for how to resolve domains across multiple AWS accounts

Figure 4: Use case for how to resolve domains across multiple AWS accounts

Let’s say that host1 in host1.acc1.awscloud.private attempts to resolve the domain host2.acc2.awscloud.private. Here’s what happens:

  1. The domain query is sent to the default DNS server for the VPC hosting source machine (host1).
  2. Because the VPC is associated with the shared forwarding rules, these rules will be evaluated.
  3. A rule indicates that queries for awscloud.private zone should be forwarded to the resolver endpoint in DNS-VPC (for inbound endpoint IP addresses), which will then use the Amazon-provided default DNS to resolve the query.
  4. Because DNS-VPC is associated with the acc2.awscloud.private hosted zone, the default DNS will use auto-defined rules to resolve this domain.

This use case explains the AWS-to-AWS case where the DNS query has been initiated on one participating account and forwarded to central DNS for resolution of domains in another AWS account. Now, I’ll look at what it takes to build this solution in your environment.

How to deploy the solution

I’ll show you how to configure this solution in four steps:

  1. Set up a centralized DNS account.
  2. Set up each participating account.
  3. Create private hosted zones and Route 53 associations.
  4. Configure on-premises DNS forwarders.

Step 1: Set up a centralized DNS account

In this step, you’ll set up resources in the centralized DNS account. Primarily, this includes the DNS-VPC, Resolver endpoints, and forwarding rules.

  1. Create a VPC to act as DNS-VPC according to your business scenario, either using the web console or from an AWS Quick Start. You can review common scenarios in the Amazon VPC user guide; one very common scenarios is a VPC with public and private subnets.
  2. Create resolver endpoints. You need to create an outbound endpoint to forward DNS queries to on-premises DNS and an inbound endpoint to receive DNS queries forwarded from on-premises workloads and other AWS accounts.
  3. Create two forwarding rules. The first rule is to forward DNS queries for zone onprem.private to your on-premises DNS server IP addresses, and the second rule is to forward DNS queries for zone awscloud.private to the IP addresses of the resolver inbound endpoint.
  4. After creating the rules, associate them with DNS-VPC that was created in step #1. This will allow the Route 53 Resolver to start forwarding domain queries accordingly.
  5. Finally, you need to share the two forwarding rules with all participating accounts. To do that, you’ll use AWS Resource Access Manager and you can share the rules with your entire AWS Organization or with specific accounts.

Note: To be able to forward domain queries to your on-premises DNS server, you need connectivity between your data center and DNS-VPC, which could be established either using site-to-site VPN or AWS Direct Connect.

Step 2: Set up participating accounts

For each participating account, you need to configure your VPCs to use the shared forwarding rules, and you need to create a private hosted zone for each account.

  • Accept the shared rules from AWS Resource Access Manager. This step is not required if the rules were shared to your AWS Organization. Then, associate the forwarding rules with the VPCs that host your workloads in each account. Once associated, the resolver will start forwarding DNS queries according to the rules.

At this point, you should be able to resolve on-premises domains from workloads running in any VPC associated with the shared forwarding rules. To create private domains in AWS, you need to create Private Hosted Zones.

Step 3: Create private hosted zones

In this step, you need to create a private hosted zone in each account with a subdomain of awscloud.private. Use unique names for each private hosted zone to avoid domain conflicts in your environment (for example, acc1.awscloud.private or dev.awscloud.private).

  1. Create a private hosted zone in each participating account with a subdomain of awscloud.private and associate it with VPCs running in that account.
  2. Associate the private hosted zone with DNS-VPC. This allows the centralized DNS-VPC to resolve domains in the private hosted zone and act as a DNS resolver between AWS accounts.

Because the private hosted zone and DNS-VPC are in different accounts, you need to associate the private hosted zone with DNS-VPC. To do that, you need to create authorization from the account that owns the private hosted zone and accept this authorization from the account that owns DNS-VPC. You can do that using AWS CLI:

  1. In each participating account, create the authorization using the private hosted zone ID, the region, and the VPC ID that you want to associate (DNS-VPC).
    
        aws route53 create-vpc-association-authorization --hosted-zone-id <hosted-zone-id>  --vpc VPCRegion=<region> ,VPCId=<vpc-id>    
    

  2. In the centralized DNS account, associate the DNS-VPC with the hosted zone in each participating account.
    
        aws route53 associate-vpc-with-hosted-zone --hosted-zone-id <hosted-zone-id> --vpc VPCRegion=<region>,VPCId=<vpc-id>    
    

Step 4: Configure on-premises DNS forwarders

To be able to resolve subdomains within the awscloud.private domain from workloads running on-premises, you need to configure conditional forwarding rules to forward domain queries to the two IP addresses of resolver inbound endpoints that were created in the central DNS account. Note that this requires connectivity between your data center and DNS-VPC, which could be established either using site-to-site VPN or
AWS Direct Connect.

Additional considerations and limitations

Thanks to the flexibility of Route 53 Resolver and conditional forwarding rules, you can control which queries to send to central DNS and which ones to resolve locally in the same account. This is particularly important when you plan to use some AWS services, such as AWS PrivateLink or Amazon Elastic File System (EFS) because domain names associated with these services need to be resolved local to the account that owns them. In this section, I will name two use-cases that require additional considerations.

  1. Interface VPC Endpoints (AWS PrivateLink)

    When you create an AWS PrivateLink interface endpoint, AWS generates endpoint-specific DNS hostnames that you can use to communicate with the service. For AWS services and AWS Marketplace partner services, you can optionally enable private DNS for the endpoint. This option associates a private hosted zone with your VPC. The hosted zone contains a record set for the default DNS name for the service (for example, ec2.us-east-1.amazonaws.com) that resolves to the private IP addresses of the endpoint network interfaces in your VPC. This enables you to make requests to the service using its default DNS hostname instead of the endpoint-specific DNS hostnames.

    If you use private DNS for your endpoint, you have to resolve DNS queries to the endpoint local to the account and use the default DNS provided by AWS. So, in this case, I recommend that you resolve domain queries in amazonaws.com locally and not forward these queries to central DNS.

  2. Mounting EFS with a DNS name

    You can mount an Amazon EFS file system on an Amazon EC2 instance using DNS names. The file system DNS name automatically resolves to the mount target’s IP address in the Availability Zone of the connecting Amazon EC2 instance. To be able to do that, the VPC must use the default DNS provided by Amazon to resolve EFS DNS names.

    If you plan to use EFS in your environment, I recommend that you resolve EFS DNS names locally and avoid sending these queries to central DNS because clients in that case would not receive answers optimized for their availability zone, which might result in higher operation latencies and less durability.

Summary

In this post, I introduced a simplified solution to implement central DNS resolution in a multi-account and hybrid environment. This solution uses AWS Route 53 Resolver, AWS Resource Access Manager, and native Route 53 capabilities and it reduces complexity and operations effort by removing the need for custom DNS servers or forwarders in AWS environment.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on in the AWS forums.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Mahmoud Matouk

Mahmoud is part of our world-wide public sector Solutions Architects, helping higher education customers build innovative, secured, and highly available solutions using various AWS services.

How to centralize DNS management in a multi-account environment

Post Syndicated from Mahmoud Matouk original https://aws.amazon.com/blogs/security/how-to-centralize-dns-management-in-a-multi-account-environment/

In a multi-account environment where you require connectivity between accounts, and perhaps connectivity between cloud and on-premises workloads, the demand for a robust Domain Name Service (DNS) that’s capable of name resolution across all connected environments will be high.

The most common solution is to implement local DNS in each account and use conditional forwarders for DNS resolutions outside of this account. While this solution might be efficient for a single-account environment, it becomes complex in a multi-account environment.

In this post, I will provide a solution to implement central DNS for multiple accounts. This solution reduces the number of DNS servers and forwarders needed to implement cross-account domain resolution. I will show you how to configure this solution in four steps:

  1. Set up your Central DNS account.
  2. Set up each participating account.
  3. Create Route53 associations.
  4. Configure on-premises DNS (if applicable).

Solution overview

In this solution, you use AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD) as a DNS service in a dedicated account in a Virtual Private Cloud (DNS-VPC).

The DNS service included in AWS Managed Microsoft AD uses conditional forwarders to forward domain resolution to either Amazon Route 53 (for domains in the awscloud.com zone) or to on-premises DNS servers (for domains in the example.com zone). You’ll use AWS Managed Microsoft AD as the primary DNS server for other application accounts in the multi-account environment (participating accounts).

A participating account is any application account that hosts a VPC and uses the centralized AWS Managed Microsoft AD as the primary DNS server for that VPC. Each participating account has a private, hosted zone with a unique zone name to represent this account (for example, business_unit.awscloud.com).

You associate the DNS-VPC with the unique hosted zone in each of the participating accounts, this allows AWS Managed Microsoft AD to use Route 53 to resolve all registered domains in private, hosted zones in participating accounts.

The following diagram shows how the various services work together:
 

Diagram showing the relationship between all the various services

Figure 1: Diagram showing the relationship between all the various services

 

In this diagram, all VPCs in participating accounts use Dynamic Host Configuration Protocol (DHCP) option sets. The option sets configure EC2 instances to use the centralized AWS Managed Microsoft AD in DNS-VPC as their default DNS Server. You also configure AWS Managed Microsoft AD to use conditional forwarders to send domain queries to Route53 or on-premises DNS servers based on query zone. For domain resolution across accounts to work, we associate DNS-VPC with each hosted zone in participating accounts.

If, for example, server.pa1.awscloud.com needs to resolve addresses in the pa3.awscloud.com domain, the sequence shown in the following diagram happens:
 

How domain resolution across accounts works

Figure 2: How domain resolution across accounts works

 

  • 1.1: server.pa1.awscloud.com sends domain name lookup to default DNS server for the name server.pa3.awscloud.com. The request is forwarded to the DNS server defined in the DHCP option set (AWS Managed Microsoft AD in DNS-VPC).
  • 1.2: AWS Managed Microsoft AD forwards name resolution to Route53 because it’s in the awscloud.com zone.
  • 1.3: Route53 resolves the name to the IP address of server.pa3.awscloud.com because DNS-VPC is associated with the private hosted zone pa3.awscloud.com.

Similarly, if server.example.com needs to resolve server.pa3.awscloud.com, the following happens:

  • 2.1: server.example.com sends domain name lookup to on-premise DNS server for the name server.pa3.awscloud.com.
  • 2.2: on-premise DNS server using conditional forwarder forwards domain lookup to AWS Managed Microsoft AD in DNS-VPC.
  • 1.2: AWS Managed Microsoft AD forwards name resolution to Route53 because it’s in the awscloud.com zone.
  • 1.3: Route53 resolves the name to the IP address of server.pa3.awscloud.com because DNS-VPC is associated with the private hosted zone pa3.awscloud.com.

Step 1: Set up a centralized DNS account

In previous AWS Security Blog posts, Drew Dennis covered a couple of options for establishing DNS resolution between on-premises networks and Amazon VPC. In this post, he showed how you can use AWS Managed Microsoft AD (provisioned with AWS Directory Service) to provide DNS resolution with forwarding capabilities.

To set up a centralized DNS account, you can follow the same steps in Drew’s post to create AWS Managed Microsoft AD and configure the forwarders to send DNS queries for awscloud.com to default, VPC-provided DNS and to forward example.com queries to the on-premise DNS server.

Here are a few considerations while setting up central DNS:

  • The VPC that hosts AWS Managed Microsoft AD (DNS-VPC) will be associated with all private hosted zones in participating accounts.
  • To be able to resolve domain names across AWS and on-premises, connectivity through Direct Connect or VPN must be in place.

Step 2: Set up participating accounts

The steps I suggest in this section should be applied individually in each application account that’s participating in central DNS resolution.

  1. Create the VPC(s) that will host your resources in participating account.
  2. Create VPC Peering between local VPC(s) in each participating account and DNS-VPC.
  3. Create a private hosted zone in Route 53. Hosted zone domain names must be unique across all accounts. In the diagram above, we used pa1.awscloud.com / pa2.awscloud.com / pa3.awscloud.com. You could also use a combination of environment and business unit: for example, you could use pa1.dev.awscloud.com to achieve uniqueness.
  4. Associate VPC(s) in each participating account with the local private hosted zone.

The next step is to change the default DNS servers on each VPC using DHCP option set:

  1. Follow these steps to create a new DHCP option set. Make sure in the DNS Servers to put the private IP addresses of the two AWS Managed Microsoft AD servers that were created in DNS-VPC:
     
    The "Create DHCP options set" dialog box

    Figure 3: The “Create DHCP options set” dialog box

     

  2. Follow these steps to assign the DHCP option set to your VPC(s) in participating account.

Step 3: Associate DNS-VPC with private hosted zones in each participating account

The next steps will associate DNS-VPC with the private, hosted zone in each participating account. This allows instances in DNS-VPC to resolve domain records created in these hosted zones. If you need them, here are more details on associating a private, hosted zone with VPC on a different account.

  1. In each participating account, create the authorization using the private hosted zone ID from the previous step, the region, and the VPC ID that you want to associate (DNS-VPC).
     
    aws route53 create-vpc-association-authorization –hosted-zone-id <hosted-zone-id> –vpc VPCRegion=<region>,VPCId=<vpc-id>
     
  2. In the centralized DNS account, associate DNS-VPC with the hosted zone in each participating account.
     
    aws route53 associate-vpc-with-hosted-zone –hosted-zone-id <hosted-zone-id> –vpc VPCRegion=<region>,VPCId=<vpc-id>
     

After completing these steps, AWS Managed Microsoft AD in the centralized DNS account should be able to resolve domain records in the private, hosted zone in each participating account.

Step 4: Setting up on-premises DNS servers

This step is necessary if you would like to resolve AWS private domains from on-premises servers and this task comes down to configuring forwarders on-premise to forward DNS queries to AWS Managed Microsoft AD in DNS-VPC for all domains in the awscloud.com zone.

The steps to implement conditional forwarders vary by DNS product. Follow your product’s documentation to complete this configuration.

Summary

I introduced a simplified solution to implement central DNS resolution in a multi-account environment that could be also extended to support DNS resolution between on-premise resources and AWS. This can help reduce operations effort and the number of resources needed to implement cross-account domain resolution.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Directory Service forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Manage Kubernetes Clusters on AWS Using CoreOS Tectonic

Post Syndicated from Arun Gupta original https://aws.amazon.com/blogs/compute/kubernetes-clusters-aws-coreos-tectonic/

There are multiple ways to run a Kubernetes cluster on Amazon Web Services (AWS). The first post in this series explained how to manage a Kubernetes cluster on AWS using kops. This second post explains how to manage a Kubernetes cluster on AWS using CoreOS Tectonic.

Tectonic overview

Tectonic delivers the most current upstream version of Kubernetes with additional features. It is a commercial offering from CoreOS and adds the following features over the upstream:

  • Installer
    Comes with a graphical installer that installs a highly available Kubernetes cluster. Alternatively, the cluster can be installed using AWS CloudFormation templates or Terraform scripts.
  • Operators
    An operator is an application-specific controller that extends the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of a Kubernetes user. This release includes an etcd operator for rolling upgrades and a Prometheus operator for monitoring capabilities.
  • Console
    A web console provides a full view of applications running in the cluster. It also allows you to deploy applications to the cluster and start the rolling upgrade of the cluster.
  • Monitoring
    Node CPU and memory metrics are powered by the Prometheus operator. The graphs are available in the console. A large set of preconfigured Prometheus alerts are also available.
  • Security
    Tectonic ensures that cluster is always up to date with the most recent patches/fixes. Tectonic clusters also enable role-based access control (RBAC). Different roles can be mapped to an LDAP service.
  • Support
    CoreOS provides commercial support for clusters created using Tectonic.

Tectonic can be installed on AWS using a GUI installer or Terraform scripts. The installer prompts you for the information needed to boot the Kubernetes cluster, such as AWS access and secret key, number of master and worker nodes, and instance size for the master and worker nodes. The cluster can be created after all the options are specified. Alternatively, Terraform assets can be downloaded and the cluster can be created later. This post shows using the installer.

CoreOS License and Pull Secret

Even though Tectonic is a commercial offering, a cluster for up to 10 nodes can be created by creating a free account at Get Tectonic for Kubernetes. After signup, a CoreOS License and Pull Secret files are provided on your CoreOS account page. Download these files as they are needed by the installer to boot the cluster.

IAM user permission

The IAM user to create the Kubernetes cluster must have access to the following services and features:

  • Amazon Route 53
  • Amazon EC2
  • Elastic Load Balancing
  • Amazon S3
  • Amazon VPC
  • Security groups

Use the aws-policy policy to grant the required permissions for the IAM user.

DNS configuration

A subdomain is required to create the cluster, and it must be registered as a public Route 53 hosted zone. The zone is used to host and expose the console web application. It is also used as the static namespace for the Kubernetes API server. This allows kubectl to be able to talk directly with the master.

The domain may be registered using Route 53. Alternatively, a domain may be registered at a third-party registrar. This post uses a kubernetes-aws.io domain registered at a third-party registrar and a tectonic subdomain within it.

Generate a Route 53 hosted zone using the AWS CLI. Download jq to run this command:

ID=$(uuidgen) && \
aws route53 create-hosted-zone \
--name tectonic.kubernetes-aws.io \
--caller-reference $ID \
| jq .DelegationSet.NameServers

The command shows an output such as the following:

[
  "ns-1924.awsdns-48.co.uk",
  "ns-501.awsdns-62.com",
  "ns-1259.awsdns-29.org",
  "ns-749.awsdns-29.net"
]

Create NS records for the domain with your registrar. Make sure that the NS records can be resolved using a utility like dig web interface. A sample output would look like the following:

The bottom of the screenshot shows NS records configured for the subdomain.

Download and run the Tectonic installer

Download the Tectonic installer (version 1.7.1) and extract it. The latest installer can always be found at coreos.com/tectonic. Start the installer:

./tectonic/tectonic-installer/$PLATFORM/installer

Replace $PLATFORM with either darwin or linux. The installer opens your default browser and prompts you to select the cloud provider. Choose Amazon Web Services as the platform. Choose Next Step.

Specify the Access Key ID and Secret Access Key for the IAM role that you created earlier. This allows the installer to create resources required for the Kubernetes cluster. This also gives the installer full access to your AWS account. Alternatively, to protect the integrity of your main AWS credentials, use a temporary session token to generate temporary credentials.

You also need to choose a region in which to install the cluster. For the purpose of this post, I chose a region close to where I live, Northern California. Choose Next Step.

Give your cluster a name. This name is part of the static namespace for the master and the address of the console.

To enable in-place update to the Kubernetes cluster, select the checkbox next to Automated Updates. It also enables update to the etcd and Prometheus operators. This feature may become a default in future releases.

Choose Upload “tectonic-license.txt” and upload the previously downloaded license file.

Choose Upload “config.json” and upload the previously downloaded pull secret file. Choose Next Step.

Let the installer generate a CA certificate and key. In this case, the browser may not recognize this certificate, which I discuss later in the post. Alternatively, you can provide a CA certificate and a key in PEM format issued by an authorized certificate authority. Choose Next Step.

Use the SSH key for the region specified earlier. You also have an option to generate a new key. This allows you to later connect using SSH into the Amazon EC2 instances provisioned by the cluster. Here is the command that can be used to log in:

ssh –i <key> [email protected]<ec2-instance-ip>

Choose Next Step.

Define the number and instance type of master and worker nodes. In this case, create a 6 nodes cluster. Make sure that the worker nodes have enough processing power and memory to run the containers.

An etcd cluster is used as persistent storage for all of Kubernetes API objects. This cluster is required for the Kubernetes cluster to operate. There are three ways to use the etcd cluster as part of the Tectonic installer:

  • (Default) Provision the cluster using EC2 instances. Additional EC2 instances are used in this case.
  • Use an alpha support for cluster provisioning using the etcd operator. The etcd operator is used for automated operations of the etcd master nodes for the cluster itself, in addition to for etcd instances that are created for application usage. The etcd cluster is provisioned within the Tectonic installer.
  • Bring your own pre-provisioned etcd cluster.

Use the first option in this case.

For more information about choosing the appropriate instance type, see the etcd hardware recommendation. Choose Next Step.

Specify the networking options. The installer can create a new public VPC or use a pre-existing public or private VPC. Make sure that the VPC requirements are met for an existing VPC.

Give a DNS name for the cluster. Choose the domain for which the Route 53 hosted zone was configured earlier, such as tectonic.kubernetes-aws.io. Multiple clusters may be created under a single domain. The cluster name and the DNS name would typically match each other.

To select the CIDR range, choose Show Advanced Settings. You can also choose the Availability Zones for the master and worker nodes. By default, the master and worker nodes are spread across multiple Availability Zones in the chosen region. This makes the cluster highly available.

Leave the other values as default. Choose Next Step.

Specify an email address and password to be used as credentials to log in to the console. Choose Next Step.

At any point during the installation, you can choose Save progress. This allows you to save configurations specified in the installer. This configuration file can then be used to restore progress in the installer at a later point.

To start the cluster installation, choose Submit. At another time, you can download the Terraform assets by choosing Manually boot. This allows you to boot the cluster later.

The logs from the Terraform scripts are shown in the installer. When the installation is complete, the console shows that the Terraform scripts were successfully applied, the domain name was resolved successfully, and that the console has started. The domain works successfully if the DNS resolution worked earlier, and it’s the address where the console is accessible.

Choose Download assets to download assets related to your cluster. It contains your generated CA, kubectl configuration file, and the Terraform state. This download is an important step as it allows you to delete the cluster later.

Choose Next Step for the final installation screen. It allows you to access the Tectonic console, gives you instructions about how to configure kubectl to manage this cluster, and finally deploys an application using kubectl.

Choose Go to my Tectonic Console. In our case, it is also accessible at http://cluster.tectonic.kubernetes-aws.io/.

As I mentioned earlier, the browser does not recognize the self-generated CA certificate. Choose Advanced and connect to the console. Enter the login credentials specified earlier in the installer and choose Login.

The Kubernetes upstream and console version are shown under Software Details. Cluster health shows All systems go and it means that the API server and the backend API can be reached.

To view different Kubernetes resources in the cluster choose, the resource in the left navigation bar. For example, all deployments can be seen by choosing Deployments.

By default, resources in the all namespace are shown. Other namespaces may be chosen by clicking on a menu item on the top of the screen. Different administration tasks such as managing the namespaces, getting list of the nodes and RBAC can be configured as well.

Download and run Kubectl

Kubectl is required to manage the Kubernetes cluster. The latest version of kubectl can be downloaded using the following command:

curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/darwin/amd64/kubectl

It can also be conveniently installed using the Homebrew package manager. To find and access a cluster, Kubectl needs a kubeconfig file. By default, this configuration file is at ~/.kube/config. This file is created when a Kubernetes cluster is created from your machine. However, in this case, download this file from the console.

In the console, choose admin, My Account, Download Configuration and follow the steps to download the kubectl configuration file. Move this file to ~/.kube/config. If kubectl has already been used on your machine before, then this file already exists. Make sure to take a backup of that file first.

Now you can run the commands to view the list of deployments:

~ $ kubectl get deployments --all-namespaces
NAMESPACE         NAME                                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system       etcd-operator                           1         1         1            1           43m
kube-system       heapster                                1         1         1            1           40m
kube-system       kube-controller-manager                 3         3         3            3           43m
kube-system       kube-dns                                1         1         1            1           43m
kube-system       kube-scheduler                          3         3         3            3           43m
tectonic-system   container-linux-update-operator         1         1         1            1           40m
tectonic-system   default-http-backend                    1         1         1            1           40m
tectonic-system   kube-state-metrics                      1         1         1            1           40m
tectonic-system   kube-version-operator                   1         1         1            1           40m
tectonic-system   prometheus-operator                     1         1         1            1           40m
tectonic-system   tectonic-channel-operator               1         1         1            1           40m
tectonic-system   tectonic-console                        2         2         2            2           40m
tectonic-system   tectonic-identity                       2         2         2            2           40m
tectonic-system   tectonic-ingress-controller             1         1         1            1           40m
tectonic-system   tectonic-monitoring-auth-alertmanager   1         1         1            1           40m
tectonic-system   tectonic-monitoring-auth-prometheus     1         1         1            1           40m
tectonic-system   tectonic-prometheus-operator            1         1         1            1           40m
tectonic-system   tectonic-stats-emitter                  1         1         1            1           40m

This output is similar to the one shown in the console earlier. Now, this kubectl can be used to manage your resources.

Upgrade the Kubernetes cluster

Tectonic allows the in-place upgrade of the cluster. This is an experimental feature as of this release. The clusters can be updated either automatically, or with manual approval.

To perform the update, choose Administration, Cluster Settings. If an earlier Tectonic installer, version 1.6.2 in this case, is used to install the cluster, then this screen would look like the following:

Choose Check for Updates. If any updates are available, choose Start Upgrade. After the upgrade is completed, the screen is refreshed.

This is an experimental feature in this release and so should only be used on clusters that can be easily replaced. This feature may become a fully supported in a future release. For more information about the upgrade process, see Upgrading Tectonic & Kubernetes.

Delete the Kubernetes cluster

Typically, the Kubernetes cluster is a long-running cluster to serve your applications. After its purpose is served, you may delete it. It is important to delete the cluster as this ensures that all resources created by the cluster are appropriately cleaned up.

The easiest way to delete the cluster is using the assets downloaded in the last step of the installer. Extract the downloaded zip file. This creates a directory like <cluster-name>_TIMESTAMP. In that directory, give the following command to delete the cluster:

TERRAFORM_CONFIG=$(pwd)/.terraformrc terraform destroy --force

This destroys the cluster and all associated resources.

You may have forgotten to download the assets. There is a copy of the assets in the directory tectonic/tectonic-installer/darwin/clusters. In this directory, another directory with the name <cluster-name>_TIMESTAMP contains your assets.

Conclusion

This post explained how to manage Kubernetes clusters using the CoreOS Tectonic graphical installer.  For more details, see Graphical Installer with AWS. If the installation does not succeed, see the helpful Troubleshooting tips. After the cluster is created, see the Tectonic tutorials to learn how to deploy, scale, version, and delete an application.

Future posts in this series will explain other ways of creating and running a Kubernetes cluster on AWS.

Arun

How to Monitor Host-Based Intrusion Detection System Alerts on Amazon EC2 Instances

Post Syndicated from Cameron Worrell original https://aws.amazon.com/blogs/security/how-to-monitor-host-based-intrusion-detection-system-alerts-on-amazon-ec2-instances/

To help you secure your AWS resources, we recommend that you adopt a layered approach that includes the use of preventative and detective controls. For example, incorporating host-based controls for your Amazon EC2 instances can restrict access and provide appropriate levels of visibility into system behaviors and access patterns. These controls often include a host-based intrusion detection system (HIDS) that monitors and analyzes network traffic, log files, and file access on a host. A HIDS typically integrates with alerting and automated remediation solutions to detect and address attacks, unauthorized or suspicious activities, and general errors in your environment.

In this blog post, I show how you can use Amazon CloudWatch Logs to collect and aggregate alerts from an open-source security (OSSEC) HIDS. I use a CloudWatch Logs subscription to deliver the alerts to Amazon Elasticsearch Service (Amazon ES) for analysis and visualization with Kibana – a popular open-source visualization tool. To make it easier for you to see this solution in action, I provide a CloudFormation template to handle most of the deployment work. You can use this solution to gain improved visibility and insights across your EC2 fleet and help drive security remediation activities. For example, if specific hosts are scanning your EC2 instances and triggering OSSEC alerts, you can implement a VPC network access control list (ACL) or AWS WAF rule to block those source IP addresses or CIDR blocks.

Solution overview

The following diagram depicts a high-level overview of this post’s solution.

Diagram showing a high-level overview of this post's solution

Here is how the solution works:

  1. On the target EC2 instances, the OSSEC HIDS generates alerts that the CloudWatch Logs agent captures. The HIDS performs log analysis, integrity checking, Windows registry monitoring, rootkit detection, real-time alerting, and active response. For more information, see Getting started with OSSEC.
  2. The CloudWatch Logs group receives the alerts as events.
  3. A CloudWatch Logs subscription is applied to the target log group to forward the events through AWS Lambda to Amazon ES.
  4. Amazon ES loads the logged alert data.
  5. Kibana visualizes the alerts in near-real time. Amazon ES provides a default installation of Kibana with every Amazon ES domain.

Deployment considerations

For the purposes of this post, the primary OSSEC HIDS deployment consists of a Linux-based installation for which the alerts are generated locally within each system. Note that this solution depends on Amazon ES and Lambda in the target region for deployment. You can find the latest information about AWS service availability in the Region table. You also must identify an Amazon Virtual Private Cloud (VPC) subnet that has Internet access and DNS resolution for your EC2 instances to provision the required components properly.

To simplify the deployment process, I created a test environment AWS CloudFormation template. You can use this template to provision a test environment stack automatically into an existing Amazon VPC subnet. You will use CloudFormation to provision the core components of this solution and then configure Kibana for alert analysis. The source code for this solution is available on GitHub.

This post’s template performs the following high-level steps in the region you choose:

  1. Creates two EC2 instances running Amazon Linux with an AWS Identity and Access Management (IAM) role for CloudWatch Logs access. Note: To provide sample HIDS alert data, the two EC2 instances are configured automatically to generate simulated HIDS alerts locally.
  2. Installs and configures OSSEC, the CloudWatch Logs agent, and additional packages used for the test environment.
  3. Creates the target HIDS Amazon ES domain.
  4. Creates the target HIDS CloudWatch Logs group.
  5. Creates the Lambda function and CloudWatch Logs subscription to send HIDS alerts to Amazon ES.

After the CloudFormation stack has been deployed, you can access the Kibana instance on the Amazon ES domain to complete the final steps of the setup for the test environment, which I show later in the post.

Although out of scope for this blog post, when deploying OSSEC into your existing EC2 environment, you should determine the desired configuration, including target log files for monitoring, directories for integrity checking, and active response. This typically also requires time for testing and tuning of the system to optimize it for your environment. The OSSEC documentation is a good place to start to familiarize yourself with this process. You could take another approach to OSSEC deployment, which involves an agent installation and a separate OSSEC manager to process events centrally before exporting them to CloudWatch Logs. This deployment requires an additional server component and network communication between the agent and the manager. Note that although Windows Server is supported by OSSEC, it requires an agent-based installation and therefore requires an OSSEC manager to be present. Review OSSEC Architecture for additional information about OSSEC architecture and deployment options.

Deploy the solution

This solution’s high-level steps are:

  1. Launch the CloudFormation stack.
  2. Configure a Kibana index pattern and begin exploring alerts.
  3. Configure a Kibana HIDS dashboard and visualize alerts.

1. Launch the CloudFormation stack

You will launch your test environment by using a CloudFormation template that automates the provisioning process. For the following input parameters, you must identify a target VPC and subnet (which requires Internet access) for deployment. If the target subnet uses an Internet gateway, set the AssignPublicIP parameter to true. If the target subnet uses a NAT gateway, you can leave the default setting of AssignPublicIP as false.

First, you will need to stage the Lambda function deployment package in an S3 bucket located in the region into which you are deploying. To do this, download the zipped deployment package and upload it to your in-region bucket. For additional information about uploading objects to S3, see Uploading Object into Amazon S3.

You also must provide a trusted source IP address or CIDR block for access to the environment following the creation of the stack and an EC2 key pair to associate with the instances. For information about creating an EC2 key pair, see Creating a Key Pair Using Amazon EC2. Note that the trusted IP address or CIDR block also is used to create the Amazon ES access policy automatically for Kibana access. We recommend that you use a specific IP address or CIDR range rather than using 0.0.0.0/0, which would allow all IPv4 addresses to access your instances. For more information about authorizing inbound traffic to your instances, see Authorizing Inbound Traffic for Your Linux Instances.

After you have confirmed the input parameters (see the following screenshot and table for more details), create the CloudFormation stack.

Numbered screenshot showing input parameters

Input parameterInput parameter description
1. HIDSInstanceSizeEC2 instance size for test server
2. ESInstanceSizeAmazon ES instance size
3. MyKeyPairA public/private key pair that allows you to connect securely to your instance after it launches
4. MyS3BucketIn-region S3 bucket with the zipped deployment package
5. MyS3KeyIn-region S3 key for the zipped deployment package
6. VPCIdAn Amazon VPC into which to deploy the solution
7. SubnetIdA SubnetId with outbound connectivity within the VPC you selected (requires Internet access)
8. AssignPublicIPSet to true if your subnet is configured to connect through an Internet gateway; set to false if your subnet is configured to connect through a NAT gateway
9. MyTrustedNetworkYour trusted source IP or CIDR block that is used to whitelist access to the EC2 instances and the Amazon ES endpoint

To finish creating the CloudFormation stack:

  1. Enter the input parameters and choose Next.
  2. On the Options page, accept the defaults and choose Next.
  3. On the Review page, confirm the details, select the I acknowledge that AWS CloudFormation might create IAM resources check box, and then choose Create. (The stack will be created in approximately 10 minutes.)

After the stack has been created, note the HIDSESKibanaURL on the CloudFormation Outputs tab. Then, proceed to the Kibana configuration instructions in the next section.

2. Configure a Kibana index pattern and begin exploring alerts

In this section, you perform the initial setup of Kibana. To access Kibana, find the HIDSESKibanaURL in the CloudFormation stack outputs (see the previous section) and choose it. This will bring you to the Kibana instance, which is automatically provisioned to your Amazon ES instance. The source IP you provided in the CloudFormation input parameters is used to automatically populate the Amazon ES access policy. If you receive an error similar to the following error, you must confirm that your Amazon ES access policy is correct.

{"Message":"User: anonymous is not authorized to perform: es:ESHttpGet on resource: hids-alerts"}

For additional information about securing access to your Amazon ES domain, see How to Control Access to Your Amazon Elasticsearch Service Domain.

The OSSEC HIDS alerts now are being processed into Amazon ES. To use Kibana to analyze the alert data interactively, you must configure an index pattern that identifies the data you wish to analyze in Amazon ES. You can read additional information about index patterns in the Kibana documentation.

In the Index name or pattern box, type cwl-2017.*. The index pattern is generated within the Lambda function as cwl-YYYY.MM.DD, so you can use a wildcard character for the month and day to match data from 2017. From the Time-field name drop-down list, choose @timestamp, and then choose Create.

Screenshot of the "Configure an index pattern" screen

In Kibana, you should now be able to choose the Discover pane and see alerts being populated. To set the refresh rate for the display of near-real-time alerts, choose your desired time range in the top right (such as Last 15 minutes).

Screenshot of setting the refresh rate of near-real-time alerts

Choose Auto-refresh, and then choose an interval, such as 5 seconds.

Screenshot of auto-refresh of 5 seconds

Kibana should now be configured to auto-refresh at a 5-second interval within the timeframe you configured. You should now see your alerts updating along with a count graph, as shown in the following screenshot.

Screenshot of the alerts updating with a count graph

The EC2 instances are automatically configured by CloudFormation to simulate activity to display several types of alerts, including:

  • Successful sudo to ROOT executed – The Linux sudo command was successfully executed.
  • Web server 400 error code – The server cannot process the request due to an apparent client error (such as malformed request syntax, too large size, invalid request message framing, or deceptive request routing).
  • SSH insecure connection attempt (scan) – Invalid connection attempt to the SSH listener.
  • Login session opened – Opened login session on the system.
  • Login session closed – Closed login session on the system.
  • New Yum package installed – Package installed on the system.
  • Yum package deleted – Package deleted from the system.

Let’s take a closer look at some of the alert fields, as shown in the following screenshot.

Screenshot highlighting some of the alert fields

The numbered alert fields in the preceding screenshot are defined as follows:

  1. @log_group – The source CloudWatch Logs group
  2. @log_stream – The CloudWatch Logs stream name (InstanceID)
  3. @message – The JSON payload from the source alerts.json OSSEC log
  4. @owner – The AWS account ID where the alert originated
  5. @timestamp – The time stamp applied by the consumer Lambda function
  6. full_log – The log event from the source file
  7. location – The source log file path and file name
  8. rule.comment – A brief description of the OSSEC rule that was matched
  9. rule.level – The OSSEC rule classification from 0 to 16 (see Rules Classification for more information)
  10. rule.sidid – The rule ID of the OSSEC rule that was matched
  11. srcip – The source IP address that triggered the alert; in this case, the simulated alerts contain the local IP of the server

You can enter search criteria in the Kibana query bar to explore HIDS alert data interactively. For example, you can run the following query to see all the rule.level 6 alerts for the EC2 InstanceID i-0e427a8594852eca2 where the source IP is 10.10.10.10.

“rule.level: 6 AND @log_stream: "i-0e427a8594852eca2" AND srcip: 10.10.10.10”

You can perform searches including simple text, Lucene query syntax, or use the full JSON-based Elasticsearch Query DSL. You can find additional information on searching your data in the Elasticsearch documentation.

3. Configure a Kibana HIDS dashboard and visualize alerts

To analyze alert trends and patterns over time, it can be helpful to use charts and graphs to represent the alert data. I have configured a basic dashboard template that you can import into your Kibana instance.

To add the template of a sample HIDS dashboard to your Kibana instance:

  1. Save the template locally and then choose Management in the Kibana navigation pane.
  2. Choose Saved Objects, Import, and the HIDS dashboard template.
  3. Choose the eye icon to the right of the HIDS Alerts dashboard entry. This will take you to the imported dashboard.
    Screenshot of the "Edit Saved Objects" screen

After importing the Kibana dashboard template and selecting it, you will see the HIDS dashboard, as shown in the following screenshot. This sample HIDS dashboard includes Alerts Over Time, Top 20 Alert Types, Rule Level Breakdown, Top 10 Rule Source ID, and Top 10 Source IPs.

Screenshot of the HIDS dashboard

To explore the alert data in more detail, you can choose an alert type on which to filter, as shown in the following two screenshots.

Alert showing SSH insecure connection attempts

Alert showing @timestamp per 30 seconds

You can see more details about the alerts based on criteria such as source IP address or time range. For more information about using Kibana to visualize alert data, see the Kibana User Guide.

Summary

In this blog post, I showed how to use CloudWatch Logs to collect alerts in near-real time from an OSSEC HIDS and use a CloudWatch Logs subscription to pass the alerts into Amazon ES for analysis and visualization with Kibana. The dashboard deployed by this solution can help you improve the security monitoring of your EC2 fleet as part of a defense-in-depth security strategy in your AWS environment.

You can use this solution to help detect attacks, anomalous activities, and error trends across your EC2 fleet. You can also use it to help prioritize remediation efforts for your systems or help determine where to introduce additional security controls such as VPC security group rules, VPC network ACLs, or AWS WAF rules.

If you have comments about this post, add them to the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the CloudWatch or Amazon ES forum. The source code for this solution is available on GitHub. If you need OSSEC-specific support, see OSSEC Support Options.

– Cameron

The command-line, for cybersec

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/01/the-command-line-for-cybersec.html

On Twitter I made the mistake of asking people about command-line basics for cybersec professionals. A got a lot of useful responses, which I summarize in this long (5k words) post. It’s mostly driven by the tools I use, with a bit of input from the tweets I got in response to my query.

bash

By command-line this document really means bash.

There are many types of command-line shells. Windows has two, ‘cmd.exe’ and ‘PowerShell’. Unix started with the Bourne shell ‘sh’, and there have been many variations of this over the years, ‘csh’, ‘ksh’, ‘zsh’, ‘tcsh’, etc. When GNU rewrote Unix user-mode software independently, they called their shell “Bourne Again Shell” or “bash” (queue “JSON Bourne” shell jokes here).

Bash is the default shell for Linux and macOS. It’s also available on Windows, as part of their special “Windows Subsystem for Linux”. The windows version of ‘bash’ has become my most used shell.

For Linux IoT devices, BusyBox is the most popular shell. It’s easy to clear, as it includes feature-reduced versions of popular commands.

man

‘Man’ is the command you should not run if you want help for a command.

Man pages are designed to drive away newbies. They are only useful if you already mostly an expert with the command you desire help on. Man pages list all possible features of a program, but do not highlight examples of the most common features, or the most common way to use the commands.

Take ‘sed’ as an example. It’s used most commonly to do a search-and-replace in files, like so:

$ sed ‘s/rob/dave/’ foo.txt

This usage is so common that many non-geeks know of it. Yet, if you type ‘man sed’ to figure out how to do a search and replace, you’ll get nearly incomprehensible gibberish, and no example of this most common usage.

I point this out because most guides on using the shell recommend ‘man’ pages to get help. This is wrong, it’ll just endlessly frustrate you. Instead, google the commands you need help on, or better yet, search StackExchange for answers.

You might try asking questions, like on Twitter or forum sites, but this requires a strategy. If you ask a basic question, self-important dickholes will respond by telling you to “rtfm” or “read the fucking manual”. A better strategy is to exploit their dickhole nature, such as saying “too bad command xxx cannot do yyy”. Helpful people will gladly explain why you are wrong, carefully explaining how xxx does yyy.

If you must use ‘man’, use the ‘apropos’ command to find the right man page. Sometimes multiple things in the system have the same or similar names, leading you to the wrong page.

apt-get install yum

Using the command-line means accessing that huge open-source ecosystem. Most of the things in this guide do no already exist on the system. You have to either compile them from source, or install via a package-manager. Linux distros ship with a small footprint, but have a massive database of precompiled software “packages” in the cloud somewhere. Use the “package manager” to install the software from the cloud.

On Debian-derived systems (like Ubuntu, Kali, Raspbian), type “apt-get install masscan” to install “masscan” (as an example). Use “apt-cache search scan” to find a bunch of scanners you might want to install.

On RedHat systems, use “yum” instead. On BSD, use the “ports” system, which you can also get working for macOS.

If no pre-compiled package exists for a program, then you’ll have to download the source code and compile it. There’s about an 80% chance this will work easy, following the instructions. There is a 20% chance you’ll experience “dependency hell”, for example, needing to install two mutually incompatible versions of Python.

Bash is a scripting language

Don’t forget that shells are really scripting languages. The bit that executes a single command is just a degenerate use of the scripting language. For example, you can do a traditional for loop like:

$ for i in $(seq 1 9); do echo $i; done

In this way, ‘bash’ is no different than any other scripting language, like Perl, Python, NodeJS, PHP CLI, etc. That’s why a lot of stuff on the system actually exists as short ‘bash’ programs, aka. shell scripts.

Few want to write bash scripts, but you are expected to be able to read them, either to tweek existing scripts on the system, or to read StackExchange help.

File system commands

The macOS “Finder” or Windows “File Explorer” are just graphical shells that help you find files, open, and save them. The first commands you learn are for the same functionality on the command-line: pwd, cd, ls, touch, rm, rmdir, mkdir, chmod, chown, find, ln, mount.

The command “rm –rf /” removes everything starting from the root directory. This will also follow mounted server directories, deleting files on the server. I point this out to give an appreciation of the raw power you have over the system from the command-line, and how easy you can disrupt things.

Of particular interest is the “mount” command. Desktop versions of Linux typically mount USB flash drives automatically, but on servers, you need to do it manually, e.g.:

$ mkdir ~/foobar
$ mount /dev/sdb ~/foobar

You’ll also use the ‘mount’ command to connect to file servers, using the “cifs” package if they are Windows file servers:

# apt-get install cifs-utils
# mkdir /mnt/vids
# mount -t cifs -o username=robert,password=foobar123  //192.168.1.11/videos /mnt/vids

Linux system commands

The next commands you’ll learn are about syadmin the Linux system: ps, top, who, history, last, df, du, kill, killall, lsof, lsmod, uname, id, shutdown, and so on.

The first thing hackers do when hacking into a system is run “uname” (to figure out what version of the OS is running) and “id” (to figure out which account they’ve acquired, like “root” or some other user).

The Linux system command I use most is “dmesg” (or ‘tail –f /var/log/dmesg’) which shows you the raw system messages. For example, when I plug in USB drives to a server, I look in ‘dmesg’ to find out which device was added so that I can mount it. I don’t know if this is the best way, it’s just the way I do it (servers don’t automount USB drives like desktops do).

Networking commands

The permanent state of the network (what gets configured on the next bootup) is configured in text files somewhere. But there are a wealth of commands you’ll use to view the current state of networking, make temporary changes, and diagnose problems.

The ‘ifconfig’ command has long been used to view the current TCP/IP configuration and make temporary changes. Learning how TCP/IP works means playing a lot with ‘ifconfig’. Use “ifconfig –a” for even more verbose information.

Use the “route” command to see if you are sending packets to the right router.

Use ‘arp’ command to make sure you can reach the local router.

Use ‘traceroute’ to make sure packets are following the correct route to their destination. You should learn the nifty trick it’s based on (TTLs). You should also play with the TCP, UDP, and ICMP options.

Use ‘ping’ to see if you can reach the target across the Internet. Usefully measures the latency in milliseconds, and congestion (via packet loss). For example, ping NetFlix throughout the day, and notice how the ping latency increases substantially during “prime time” viewing hours.

Use ‘dig’ to make sure DNS resolution is working right. (Some use ‘nslookup’ instead). Dig is useful because it’s the raw universal DNS tool – every time they add some new standard feature to DNS, they add that feature into ‘dig’ as well.

The ‘netstat –tualn’ command views the current TCP/IP connections and which ports are listening. I forget what the various options “tualn” mean, only it’s the output I always want to see, rather than the raw “netstat” command by itself.

You’ll want to use ‘ethtool –k’ to turn off checksum and segmentation offloading. These are features that break packet-captures sometimes.

There is this new fangled ‘ip’ system for Linux networking, replacing many of the above commands, but as an old timer, I haven’t looked into that.

Some other tools for diagnosing local network issues are ‘tcpdump’, ‘nmap’, and ‘netcat’. These are described in more detail below.

ssh

In general, you’ll remotely log into a system in order to use the command-line. We use ‘ssh’ for that. It uses a protocol similar to SSL in order to encrypt the connection. There are two ways to use ‘ssh’ to login, with a password or with a client-side certificate.

When using SSH with a password, you type “ssh [email protected]”. The remote system will then prompt you for a password for that account.

When using client-side certificates, use “ssh-keygen” to generate a key, then either copy the public-key of the client to the server manually, or use “ssh-copy-id” to copy it using the password method above.

How this works is basic application of public-key cryptography. When logging in with a password, you get a copy of the server’s public-key the first time you login, and if it ever changes, you get a nasty warning that somebody may be attempting a man in the middle attack.

$ ssh [email protected]
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!

When using client-side certificates, the server trusts your public-key. This is similar to how client-side certificates work in SSL VPNs.

You can use SSH for things other than loging into a remote shell. You can script ‘ssh’ to run commands remotely on a system in a local shell script. You can use ‘scp’ (SSH copy) to transfer files to and from a remote system. You can do tricks with SSH to create tunnels, which is popular way to bypass the restrictive rules of your local firewall nazi.

openssl

This is your general cryptography toolkit, doing everything from simple encryption, to public-key certificate signing, to establishing SSL connections.

It is extraordinarily user hostile, with terrible inconsistency among options. You can only figure out how to do things by looking up examples on the net, such as on StackExchange. There are competing SSL libraries with their own command-line tools, like GnuTLS and Mozilla NSS that you might find easier to use.

The fundamental use of the ‘openssl’ tool is to create public-keys, “certificate requests”, and creating self-signed certificates. All the web-site certificates I’ve ever obtained has been using the openssl command-line tool to create CSRs.

You should practice using the ‘openssl’ tool to encrypt files, sign files, and to check signatures.

You can use openssl just like PGP for encrypted emails/messages, but following the “S/MIME” standard rather than PGP standard. You might consider learning the ‘pgp’ command-line tools, or the open-source ‘gpg’ or ‘gpg2’ tools as well.

You should learn how to use the “openssl s_client” feature to establish SSL connections, as well as the “openssl s_server” feature to create an SSL proxy for a server that doesn’t otherwise support SSL.

Learning all the ways of using the ‘openssl’ tool to do useful things will go a long way in teaching somebody about crypto and cybersecurity. I can imagine an entire class consisting of nothing but learning ‘openssl’.

netcat (nc, socat, cyptocat, ncat)

A lot of Internet protocols are based on text. That means you can create a raw TCP connection to the service and interact with them using your keyboard. The classic tool for doing this is known as “netcat”, abbreviated “nc”. For example, connect to Google’s web server at port and type the HTTP HEAD command followed by a blank line (hit [return] twice):

$ nc www.google.com 80
HEAD / HTTP/1.0

HTTP/1.0 200 OK
Date: Tue, 17 Jan 2017 01:53:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
P3P: CP=”This is not a P3P policy! See https://www.google.com/support/accounts/answer/151657?hl=en for more info.”
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Set-Cookie: NID=95=o7GT1uJCWTPhaPAefs4CcqF7h7Yd7HEqPdAJncZfWfDSnNfliWuSj3XfS5GJXGt67-QJ9nc8xFsydZKufBHLj-K242C3_Vak9Uz1TmtZwT-1zVVBhP8limZI55uXHuPrejAxyTxSCgR6MQ; expires=Wed, 19-Jul-2017 01:53:28 GMT; path=/; domain=.google.com; HttpOnly
Accept-Ranges: none
Vary: Accept-Encoding

Another classic example is to connect to port 25 on a mail server to send email, spoofing the “MAIL FROM” address.

There are several versions of ‘netcat’ that work over SSL as well. My favorite is ‘ncat’, which comes with ‘nmap’, as it’s actively maintained. In theory, “openssl s_client” should also work this way.

nmap

At some point, you’ll need to port scan. The standard program for this is ‘nmap’, and it’s the best. The classic way of using it is something like:

# nmap –A scanme.nmap.org

The ‘-A’ option means to enable all the interesting features like OS detection, version detection, and basic scripts on the most common ports that a server might have open. It takes awhile to run. The “scanme.nmap.org” is a good site to practice on.

Nmap is more than just a port scanner. It has a rich scripting system for probing more deeply into a system than just a port, and to gather more information useful for attacks. The scripting system essentially contains some attacks, such as password guessing.

Scanning the Internet, finding services identified by ‘nmap’ scripts, and interacting with them with tools like ‘ncat’ will teach you a lot about how the Internet works.

BTW, if ‘nmap’ is too slow, using ‘masscan’ instead. It’s a lot faster, though has much more limited functionality.

Packet sniffing with tcpdump and tshark

All Internet traffic consists of packets going between IP addresses. You can capture those packets and view them using “packet sniffers”. The most important packet-sniffer is “Wireshark”, a GUI. For the command-line, there is ‘tcpdump’ and ‘tshark’.

You can run tcpdump on the command-line to watch packets go in/out of the local computer. This performs a quick “decode” of packets as they are captured. It’ll reverse-lookup IP addresses into DNS names, which means its buffers can overflow, dropping new packets while it’s waiting for DNS name responses for previous packets (which can be disabled with -n):

# tcpdump –p –i eth0

A common task is to create a round-robin set of files, saving the last 100 files of 1-gig each. Older files are overwritten. Thus, when an attack happens, you can stop capture, and go backward in times and view the contents of the network traffic using something like Wireshark:

# tcpdump –p -i eth0 -s65535 –C 1000 –W 100 –w cap

Instead of capturing everything, you’ll often set “BPF” filters to narrow down to traffic from a specific target, or a specific port.

The above examples use the –p option to capture traffic destined to the local computer. Sometimes you may want to look at all traffic going to other machines on the local network. You’ll need to figure out how to tap into wires, or setup “monitor” ports on switches for this to work.

A more advanced command-line program is ‘tshark’. It can apply much more complex filters. It can also be used to extract the values of specific fields and dump them to a text files.

Base64/hexdump/xxd/od

These are some rather trivial commands, but you should know them.

The ‘base64’ command encodes binary data in text. The text can then be passed around, such as in email messages. Base64 encoding is often automatic in the output from programs like openssl and PGP.

In many cases, you’ll need to view a hex dump of some binary data. There are many programs to do this, such as hexdump, xxd, od, and more.

grep

Grep searches for a pattern within a file. More important, it searches for a regular expression (regex) in a file. The fu of Unix is that a lot of stuff is stored in text files, and use grep for regex patterns in order to extra stuff stored in those files.

The power of this tool really depends on your mastery of regexes. You should master enough that you can understand StackExhange posts that explain almost what you want to do, and then tweek them to make them work.

Grep, by default, shows only the matching lines. In many cases, you only want the part that matches. To do that, use the –o option. (This is not available on all versions of grep).

You’ll probably want the better, “extended” regular expressions, so use the –E option.

You’ll often want “case-insensitive” options (matching both upper and lower case), so use the –i option.

For example, to extract all MAC address from a text file, you might do something like the following. This extracts all strings that are twelve hex digits.

$ grep –Eio ‘[0-9A-F]{12}’ foo.txt

Text processing

Grep is just the first of the various “text processing filters”. Other useful ones include ‘sed’, ‘cut’, ‘sort’, and ‘uniq’.

You’ll be an expert as piping output of one to the input of the next. You’ll use “sort | uniq” as god (Dennis Ritchie) intended and not the heresy of “sort –u”.

You might want to master ‘awk’. It’s a new programming language, but once you master it, it’ll be easier than other mechanisms.

You’ll end up using ‘wc’ (word-count) a lot. All it does is count the number of lines, words, characters in a file, but you’ll find yourself wanting to do this a lot.

csvkit and jq

You get data in CSV format and JSON format a lot. The tools ‘csvkit’ and ‘jq’ respectively help you deal with those tools, to convert these files into other formats, sticking the data in databases, and so forth.

It’ll be easier using these tools that understand these text formats to extract data than trying to write ‘awk’ command or ‘grep’ regexes.

strings

Most files are binary with a few readable ASCII strings. You use the program ‘strings’ to extract those strings.

This one simple trick sounds stupid, but it’s more powerful than you’d think. For example, I knew that a program probably contained a hard-coded password. I then blindly grabbed all the strings in the program’s binary file and sent them to a password cracker to see if they could decrypt something. And indeed, one of the 100,000 strings in the file worked, thus finding the hard-coded password.

tail -f

So ‘tail’ is just a standard Linux tool for looking at the end of files. If you want to keep checking the end of a live file that’s constantly growing, then use “tail –f”. It’ll sit there waiting for something new to be added to the end of the file, then print it out. I do this a lot, so I thought it’d be worth mentioning.

tar –xvfz, gzip, xz, 7z

In prehistorical times (like the 1980s), Unix was backed up to tape drives. The tar command could be used to combine a bunch of files into a single “archive” to be sent to the tape drive, hence “tape archive” or “tar”.

These days, a lot of stuff you download will be in tar format (ending in .tar). You’ll need to learn how to extract it:

$ tar –xvf something.tar

Nobody knows what the “xvf” options mean anymore, but these letters most be specified in that order. I’m joking here, but only a little: somebody did a survey once and found that virtually nobody know how to use ‘tar’ other than the canned formulas such as this.

Along with combining files into an archive you also need to compress them. In prehistoric Unix, the “compress” command would be used, which would replace a file with a compressed version ending in ‘.z’. This would found to be encumbered with patents, so everyone switched to ‘gzip’ instead, which replaces a file with a new one ending with ‘.gz’.

$ ls foo.txt*
foo.txt
$ gzip foo.txt
$ ls foo.txt*
foo.txt.gz

Combined with tar, you get files with either the “.tar.gz” extension, or simply “.tgz”. You can untar and uncompress at the same time:

$ tar –xvfz something .tar.gz

Gzip is always good enough, but nerds gonna nerd and want to compress with slightly better compression programs. They’ll have extensions like “.bz2”, “.7z”, “.xz”, and so on. There are a ton of them. Some of them are supported directly by the ‘tar’ program:

$ tar –xvfj something.tar.bz2

Then there is the “zip/unzip” program, which supports Windows .zip file format. To create compressed archives these days, I don’t bother with tar, but just use the ZIP format. For example, this will recursively descend a directory, adding all files to a ZIP file that can easily be extracted under Windows:

$ zip –r test.zip ./test/

dd

I should include this under the system tools at the top, but it’s interesting for a number of purposes. The usage is simply to copy one file to another, the in-file to the out-file.

$ dd if=foo.txt of=foo2.txt

But that’s not interesting. What interesting is using it to write to “devices”. The disk drives in your system also exist as raw devices under the /dev directory.

For example, if you want to create a boot USB drive for your Raspberry Pi:

# dd if=rpi-ubuntu.img of=/dev/sdb

Or, you might want to hard erase an entire hard drive by overwriting random data:

# dd if=/dev/urandom of=/dev/sdc

Or, you might want to image a drive on the system, for later forensics, without stumbling on things like open files.

# dd if=/dev/sda of=/media/Lexar/infected.img

The ‘dd’ program has some additional options, like block size and so forth, that you’ll want to pay attention to.

screen and tmux

You log in remotely and start some long running tool. Unfortunately, if you log out, all the processes you started will be killed. If you want it to keep running, then you need a tool to do this.

I use ‘screen’. Before I start a long running port scan, I run the “screen” command. Then, I type [ctrl-a][ctrl-d] to disconnect from that screen, leaving it running in the background.

Then later, I type “screen –r” to reconnect to it. If there are more than one screen sessions, using ‘-r’ by itself will list them all. Use “-r pid” to reattach to the proper one. If you can’t, then use “-D pid” or “-D –RR pid” to forced the other session to detached from whoever is using it.

Tmux is an alternative to screen that many use. It’s cool for also having lots of terminal screens open at once.

curl and wget

Sometimes you want to download files from websites without opening a browser. The ‘curl’ and ‘wget’ programs do that easily. Wget is the traditional way of doing this, but curl is a bit more flexible. I use curl for everything these days, except mirroring a website, in which case I just do “wget –m website”.

The thing that makes ‘curl’ so powerful is that it’s really designed as a tool for poking and prodding all the various features of HTTP. That it’s also useful for downloading files is a happy coincidence. When playing with a target website, curl will allow you do lots of complex things, which you can then script via bash. For example, hackers often write their cross-site scripting/forgeries in bash scripts using curl.

node/php/python/perl/ruby/lua

As mentioned above, bash is its own programming language. But it’s weird, and annoying. So sometimes you want a real programming language. Here are some useful ones.

Yes, PHP is a language that runs in a web server for creating web pages. But if you know the language well, it’s also a fine command-line language for doing stuff.

Yes, JavaScript is a language that runs in the web browser. But if you know it well, it’s also a great language for doing stuff, especially with the “nodejs” version.

Then there are other good command line languages, like the Python, Ruby, Lua, and the venerable Perl.

What makes all these great is the large library support. Somebody has already written a library that nearly does what you want that can be made to work with a little bit of extra code of your own.

My general impression is that Python and NodeJS have the largest libraries likely to have what you want, but you should pick whichever language you like best, whichever makes you most productive. For me, that’s NodeJS, because of the great Visual Code IDE/debugger.

iptables, iptables-save

I shouldn’t include this in the list. Iptables isn’t a command-line tool as such. The tool is the built-in firewalling/NAT features within the Linux kernel. Iptables is just the command to configure it.

Firewalling is an important part of cybersecurity. Everyone should have some experience playing with a Linux system doing basic firewalling tasks: basic rules, NATting, and transparent proxying for mitm attacks.

Use ‘iptables-save’ in order to persistently save your changes.

MySQL

Similar to ‘iptables’, ‘mysql’ isn’t a tool in its own right, but a way of accessing a database maintained by another process on the system.

Filters acting on text files only goes so far. Sometimes you need to dump it into a database, and make queries on that database.

There is also the offensive skill needed to learn how targets store things in a database, and how attackers get the data.

Hackers often publish raw SQL data they’ve stolen in their hacks (like the Ashley-Madisan dump). Being able to stick those dumps into your own database is quite useful. Hint: disable transaction logging while importing mass data.

If you don’t like SQL, you might consider NoSQL tools like Elasticsearch, MongoDB, and Redis that can similarly be useful for arranging and searching data. You’ll probably have to learn some JSON tools for formatting the data.

Reverse engineering tools

A cybersecurity specialty is “reverse engineering”. Some want to reverse engineer the target software being hacked, to understand vulnerabilities. This is needed for commercial software and device firmware where the source code is hidden. Others use these tools to analyze viruses/malware.

The ‘file’ command uses heuristics to discover the type of a file.

There’s a whole skillset for analyzing PDF and Microsoft Office documents. I play with pdf-parser. There’s a long list at this website:
https://zeltser.com/analyzing-malicious-documents/

There’s a whole skillset for analyzing executables. Binwalk is especially useful for analyzing firmware images.

Qemu is useful is a useful virtual-machine. It can emulate full systems, such as an IoT device based on the MIPS processor. Like some other tools mentioned here, it’s more a full subsystem than a simple command-line tool.

On a live system, you can use ‘strace’ to view what system calls a process is making. Use ‘lsof’ to view which files and network connections a process is making.

Password crackers

A common cybersecurity specialty is “password cracking”. There’s two kinds: online and offline password crackers.

Typical online password crackers are ‘hydra’ and ‘medusa’. They can take files containing common passwords and attempt to log on to various protocols remotely, like HTTP, SMB, FTP, Telnet, and so on. I used ‘hydra’ recently in order to find the default/backdoor passwords to many IoT devices I’ve bought recently in my test lab.

Online password crackers must open TCP connections to the target, and try to logon. This limits their speed. They also may be stymied by systems that lock accounts, or introduce delays, after too many bad password attempts.

Typical offline password crackers are ‘hashcat’ and ‘jtr’ (John the Ripper). They work off of stolen encrypted passwords. They can attempt billions of passwords-per-second, because there’s no network interaction, nothing slowing them down.

Understanding offline password crackers means getting an appreciation for the exponential difficulty of the problem. A sufficiently long and complex encrypted password is uncrackable. Instead of brute-force attempts at all possible combinations, we must use tricks, like mutating the top million most common passwords.

I use hashcat because of the great GPU support, but John is also a great program.

WiFi hacking

A common specialty in cybersecurity is WiFi hacking. The difficulty in WiFi hacking is getting the right WiFi hardware that supports the features (monitor mode, packet injection), then the right drivers installed in your operating system. That’s why I use Kali rather than some generic Linux distribution, because it’s got the right drivers installed.

The ‘aircrack-ng’ suite is the best for doing basic hacking, such as packet injection. When the parents are letting the iPad babysit their kid with a loud movie at the otherwise quite coffeeshop, use ‘aircrack-ng’ to deauth the kid.

The ‘reaver’ tool is useful for hacking into sites that leave WPS wide open and misconfigured.

Remote exploitation

A common specialty in cybersecurity is pentesting.

Nmap, curl, and netcat (described above) above are useful tools for this.

Some useful DNS tools are ‘dig’ (described above), dnsrecon/dnsenum/fierce that try to enumerate and guess as many names as possible within a domain. These tools all have unique features, but also have a lot of overlap.

Nikto is a basic tool for probing for common vulnerabilities, out-of-date software, and so on. It’s not really a vulnerability scanner like Nessus used by defenders, but more of a tool for attack.

SQLmap is a popular tool for probing for SQL injection weaknesses.

Then there is ‘msfconsole’. It has some attack features. This is humor – it has all the attack features. Metasploit is the most popular tool for running remote attacks against targets, exploiting vulnerabilities.

Text editor

Finally, there is the decision of text editor. I use ‘vi’ variants. Others like ‘nano’ and variants. There’s no wrong answer as to which editor to use, unless that answer is ‘emacs’.

Conclusion

Obviously, not every cybersecurity professional will be familiar with every tool in this list. If you don’t do reverse-engineering, then you won’t use reverse-engineering tools.

On the other hand, regardless of your specialty, you need to know basic crypto concepts, so you should know something like the ‘openssl’ tool. You need to know basic networking, so things like ‘nmap’ and ‘tcpdump’. You need to be comfortable processing large dumps of data, manipulating it with any tool available. You shouldn’t be frightened by a little sysadmin work.

The above list is therefore a useful starting point for cybersecurity professionals. Of course, those new to the industry won’t have much familiarity with them. But it’s fair to say that I’ve used everything listed above at least once in the last year, and the year before that, and the year before that. I spend a lot of time on StackExchange and Google searching the exact options I need, so I’m not an expert, but I am familiar with the basic use of all these things.

The Most Viewed AWS Security Blog Posts in 2016

Post Syndicated from Craig Liebendorfer original https://aws.amazon.com/blogs/security/the-most-viewed-aws-security-blog-posts-in-2016/

The following 10 posts were the most viewed AWS Security Blog posts that we published during 2016. You can use this list as a guide to catch up on your blog reading or even read a post again that you found particularly useful.

  1. How to Set Up DNS Resolution Between On-Premises Networks and AWS Using AWS Directory Service and Amazon Route 53
  2. How to Control Access to Your Amazon Elasticsearch Service Domain
  3. How to Restrict Amazon S3 Bucket Access to a Specific IAM Role
  4. Announcing AWS Organizations: Centrally Manage Multiple AWS Accounts
  5. How to Configure Rate-Based Blacklisting with AWS WAF and AWS Lambda
  6. How to Use AWS WAF to Block IP Addresses That Generate Bad Requests
  7. How to Record SSH Sessions Established Through a Bastion Host
  8. How to Manage Secrets for Amazon EC2 Container Service–Based Applications by Using Amazon S3 and Docker
  9. Announcing Industry Best Practices for Securing AWS Resources
  10. How to Set Up DNS Resolution Between On-Premises Networks and AWS Using AWS Directory Service and Microsoft Active Directory

The following 10 posts published since the blog’s inception in April 2013 were the most viewed AWS Security Blog posts in 2016.

  1. Writing IAM Policies: How to Grant Access to an Amazon S3 Bucket
  2. Securely Connect to Linux Instances Running in a Private Amazon VPC
  3. A New and Standardized Way to Manage Credentials in the AWS SDKs
  4. Where’s My Secret Access Key?
  5. Enabling Federation to AWS Using Windows Active Directory, ADFS, and SAML 2.0
  6. IAM Policies and Bucket Policies and ACLs! Oh, My! (Controlling Access to S3 Resources)
  7. How to Connect Your On-Premises Active Directory to AWS Using AD Connector
  8. Writing IAM Policies: Grant Access to User-Specific Folders in an Amazon S3 Bucket
  9. How to Help Prepare for DDoS Attacks by Reducing Your Attack Surface
  10. How to Set Up DNS Resolution Between On-Premises Networks and AWS Using AWS Directory Service and Amazon Route 53

Let us know in the comments section below if there is a specific security or compliance topic you would like us to cover on the Security Blog in 2017.

– Craig

A Case Study in Global Fault Isolation

Post Syndicated from Lee-Ming Zen original https://aws.amazon.com/blogs/architecture/a-case-study-in-global-fault-isolation/

In a previous blog post, we talked about using shuffle sharding to get magical fault isolation. Today, we’ll examine a specific use case that Route 53 employs and one of the interesting tradeoffs we decided to make as part of our sharding. Then, we’ll discuss how you can employ some of these concepts in your own applications.

Overview of Anycast DNS

One of our goals at Amazon Route 53 is to provide low-latency DNS resolution to customers. We do this, in part, by announcing our IP addresses using “anycast” from over 50 edge locations around the globe. Anycast works by routing packets to the closest (network-wise) location that is “advertising” a particular address. In the image below, we can see that there are three locations, all of which can receive traffic for the 205.251.194.72 address.

(Blue circles represent edge locations; orange circles represent AWS regions)

For example, if a customer has ns-584.awsdns-09.net assigned as a nameserver, issuing a query to that nameserver could result in that query landing at any one of multiple locations responsible for advertising the underlying IP address. Where the query lands depends on the anycast routing of the Internet, but it is generally going to be the closest network-wise (and hence, low latency) location to the end user.

Behind the scenes, we have thousands of nameserver names (e.g. ns-584.awsdns-09.net) hosted across four top-level domains (.com, .net, .co.uk, and .org). We refer to all the nameservers in one top-level domain as a ‘stripe;’ thus, we have a .com stripe, a .net stripe, a .co.uk stripe, and a .org stripe. This is where shuffle sharding comes in: each Route 53 domain (hosted zone) receives four nameserver names one from each of stripe. As a result, it is unlikely that two zones will overlap completely across all four nameservers. In fact, we enforce a rule during nameserver assignment that no hosted zone can overlap by more than two nameservers with any previously created hosted zone.

DNS Resolution

Before continuing, it’s worth quickly explaining how DNS resolution works. Typically, a client, such as your laptop or desktop has a “stub resolver.” The stub resolver simply contacts a recursive nameserver (resolver), which in turn queries authoritative nameservers, on the Internet to find the answers to a DNS query. Typically, resolvers are provided by your ISP or corporate network infrastructure, or you may rely on an open resolver such as Google DNS. Route 53 is an authoritative nameserver, responsible for replying to resolvers on behalf of customers. For example, when a client program attempts to look up amazonaws.com, the stub resolver on the machine will query the resolver. If the resolver has the data in cache and the value hasn’t expired, it will use the cached value. Otherwise, the resolver will query authoritative nameservers to find the answer.

(Every location advertises one or more stripes, but we only show Sydney, Singapore, and Hong Kong in the above diagram for clarity.)

Each Route 53 edge location is responsible for serving the traffic for one or more stripes. For example, our edge location in Sydney, Australia could serve both the .com and .net, while Singapore could serve just the .org stripe. Any given location can serve the same stripe as other locations. Hong Kong could serve the .net stripe, too. This means that if a resolver in Australia attempts to resolve a query against a nameserver in the .org stripe, which isn’t provided in Australia, the query will go to the closest location that provides the .org stripe (which is likely Singapore). A resolver in Singapore attempting to query against a nameserver in the .net stripe may go to Hong Kong or Sydney depending on the potential Internet routes from that resolver’s particular network. This is shown in the diagram above.

For any given domain, in general, resolvers learn the lowest latency nameserver based upon the round trip time of the query (this technique is often called SRTT or smooth round-trip time). Over a few queries, a resolver in Australia would gravitate toward using the nameservers on the .net and .com stripes for Route 53 customers’ domains.

Not all resolvers do this. Some choose randomly amongst the nameservers. Others may end up choosing the slowest one, but our experiments show that about 80% of resolvers use the lowest RTT nameserver. For additional information, this presentation presents information on how various resolvers choose which nameserver they utilize. Additionally, many other resolvers (such as Google Public DNS) use pre-fetching, or have very short timeouts if a resolver fails to resolve against a particular nameserver.

The Latency-Availability Decision

Given the above resolver behavior, one option, for a DNS provider like Route 53, might be to advertise all four stripes from every edge location. This would mean that no matter which nameserver a resolver choses, it will always go to the closest network location. However, we believe this provides a poor availability model.

Why? Because edge locations can sometimes fail to provide resolution for a variety of reasons that are very hard to control: the edge location may lose power or Internet connectivity, the resolver may lose connectivity to the edge location, or an intermediary transit provider may lose connectivity. Our experiments have shown that these types of events can cause about 5 minutes of disruption as the Internet updates routing tables. In recent years another serious risk has arisen: large-scale transit network congestion due to DDOS attacks. Our colleague, Nathan Dye, has a talk from AWS re:Invent that provides more details: www.youtube.com/watch?v=V7vTPlV8P3U.

In all of these failure scenarios, advertising every nameserver from every location may result in resolvers having no fallback location. All nameservers would route to the same location and resolvers would fail to resolve DNS queries, resulting in an outage for customers.

In the diagram below, we show the difference for a resolver querying domain X, whose nameservers (NX1, NX2, NX3, NX4) are advertised from all locations and domain Y, whose nameservers (NY1, NY2, NY3, NY4) are advertised in a subset of the locations.

When the path from the resolver to location A is impaired, all queries to the nameservers for domain X will fail. In comparison, even if the path from the resolver to location A is impaired, there are other transit paths to reach nameservers at locations B, C, and D in order to resolve the DNS for domain Y.

Route 53 typically advertises only one stripe per edge location. As a result, if anything goes wrong with a resolver being able to reach an edge location, that resolver has three other nameservers in three other locations to which it can fall back. For example, if we deploy bad software that causes the edge location to stop responding, the resolver can still retry elsewhere. This is why we organize our deployments in “stripe order”; Nick Trebon provides a great overview of our deployment strategies in the previous blog post. It also means that queries to Route 53 gain a lot of Internet path diversity, which helps resolvers route around congestion and other intermediary problems on their path to reaching Route 53.

Route 53’s foremost goal is to always meet our promise of a 100% SLA for DNS queries – that all of our customers’ DNS names should resolve all the time. Our customers also tell us that latency is next most important feature of a DNS service provider. Maximizing Internet path and edge location diversity for availibility necessarily means that some nameservers will respond from farther-away edge locations. For most resolvers, our method has no impact on the minimum RTT, or fastest nameserver, and how quickly it can respond. As resolvers generally use the fastest nameserver, we’re confident that any compromise in resolution times is small and that this is a good balance between the goals of low latency and high availability.

On top of our striping across locations, you may have noticed that the four stripes use different top-level domains. We use multiple top-levels domains in case one of the three TLD providers (.com and .net are both operated by Verisign) has any sort of DNS outage. While this rarely happens, it means that as a customer, you’ll have increased protection during a TLD’s DNS outage because at least two of your four nameservers will continue to work.

Applications

You, too, can apply the same techniques in your own systems and applications. If your system isn’t end-user facing, you could also consider utilizing multiple TLDs for resilience as well. Especially in the case where you control your own API and clients calling the API, there’s no reason to place all your eggs in one TLD basket.

Another application of what we’ve discussed is minimizing downtime during failovers. For high availability applications, we recommend customers utilize Route 53 DNS Failover. With failover configured, Route 53 will only return answers for healthy endpoints. In order to determine endpoint health, Route 53 issues health checks against your endpoint. As a result, there is a minimum of 10 seconds (assuming you configured fast health checks with a single failover interval) where the application could be down, but failover has not triggered yet. On top of that, there is the additional time incurred for resolvers to expire the DNS entry from their cache based upon the record’s TTL. To minimize this failover time, you could write your clients to behave similar to the resolver behavior described earlier. And, while you may not employ an anycast system, you can host your endpoints in multiple locations (e.g. different availability zones and perhaps even different regions). Your clients would learn the SRTT of the multiple endpoints over time and only issue queries to the fastest endpoint, but fallback to the other endpoints if the fastest is unavailable. And, of course, you could shuffle shard your endpoints to achieve increased fault isolation while doing all of the above.

– Lee-Ming Zen