Tag Archives: Disaster Recovery

Using Route 53 Private Hosted Zones for Cross-account Multi-region Architectures

Post Syndicated from Anandprasanna Gaitonde original https://aws.amazon.com/blogs/architecture/using-route-53-private-hosted-zones-for-cross-account-multi-region-architectures/

This post was co-written by Anandprasanna Gaitonde, AWS Solutions Architect and John Bickle, Senior Technical Account Manager, AWS Enterprise Support

Introduction

Many AWS customers have internal business applications spread over multiple AWS accounts and on-premises to support different business units. In such environments, you may find a consistent view of DNS records and domain names between on-premises and different AWS accounts useful. Route 53 Private Hosted Zones (PHZs) and Resolver endpoints on AWS create an architecture best practice for centralized DNS in hybrid cloud environment. Your business units can use flexibility and autonomy to manage the hosted zones for their applications and support multi-region application environments for disaster recovery (DR) purposes.

This blog presents an architecture that provides a unified view of the DNS while allowing different AWS accounts to manage subdomains. It utilizes PHZs with overlapping namespaces and cross-account multi-region VPC association for PHZs to create an efficient, scalable, and highly available architecture for DNS.

Architecture Overview

You can set up a multi-account environment using services such as AWS Control Tower to host applications and workloads from different business units in separate AWS accounts. However, these applications have to conform to a naming scheme based on organization policies and simpler management of DNS hierarchy. As a best practice, the integration with on-premises DNS is done by configuring Amazon Route 53 Resolver endpoints in a shared networking account. Following is an example of this architecture.

Route 53 PHZs and Resolver Endpoints

Figure 1 – Architecture Diagram

The customer in this example has on-premises applications under the customer.local domain. Applications hosted in AWS use subdomain delegation to aws.customer.local. The example here shows three applications that belong to three different teams, and those environments are located in their separate AWS accounts to allow for autonomy and flexibility. This architecture pattern follows the option of the “Multi-Account Decentralized” model as described in the whitepaper Hybrid Cloud DNS options for Amazon VPC.

This architecture involves three key components:

1. PHZ configuration: PHZ for the subdomain aws.customer.local is created in the shared Networking account. This is to support centralized management of PHZ for ancillary applications where teams don’t want individual control (Item 1a in Figure). However, for the key business applications, each of the teams or business units creates its own PHZ. For example, app1.aws.customer.local – Application1 in Account A, app2.aws.customer.local – Application2 in Account B, app3.aws.customer.local – Application3 in Account C (Items 1b in Figure). Application1 is a critical business application and has stringent DR requirements. A DR environment of this application is also created in us-west-2.

For a consistent view of DNS and efficient DNS query routing between the AWS accounts and on-premises, best practice is to associate all the PHZs to the Networking Account. PHZs created in Account A, B and C are associated with VPC in Networking Account by using cross-account association of Private Hosted Zones with VPCs. This creates overlapping domains from multiple PHZs for the VPCs of the networking account. It also overlaps with the parent sub-domain PHZ (aws.customer.local) in the Networking account. In such cases where there is two or more PHZ with overlapping namespaces, Route 53 resolver routes traffic based on most specific match as described in the Developer Guide.

2. Route 53 Resolver endpoints for on-premises integration (Item 2 in Figure): The networking account is used to set up the integration with on-premises DNS using Route 53 Resolver endpoints as shown in Resolving DNS queries between VPC and your network. Inbound and Outbound Route 53 Resolver endpoints are created in the VPC in us-east-1 to serve as the integration between on-premises DNS and AWS. The DNS traffic between on-premises to AWS requires an AWS Site2Site VPN connection or AWS Direct Connect connection to carry DNS and application traffic. For each Resolver endpoint, two or more IP addresses can be specified to map to different Availability Zones (AZs). This helps create a highly available architecture.

3. Route 53 Resolver rules (Item 3 in Figure): Forwarding rules are created only in the networking account to route DNS queries for on-premises domains (customer.local) to the on-premises DNS server. AWS Resource Access Manager (RAM) is used to share the rules to accounts A, B and C as mentioned in the section “Sharing forwarding rules with other AWS accounts and using shared rules” in the documentation. Account owners can now associate these shared rules with their VPCs the same way that they associate rules created in their own AWS accounts. If you share the rule with another AWS account, you also indirectly share the outbound endpoint that you specify in the rule as described in the section “Considerations when creating inbound and outbound endpoints” in the documentation. This implies that you use one outbound endpoint in a region to forward DNS queries to your on-premises network from multiple VPCs, even if the VPCs were created in different AWS accounts. Resolver starts to forward DNS queries for the domain name that’s specified in the rule to the outbound endpoint and forward to the on-premises DNS servers. The rules are created in both regions in this architecture.

This architecture provides the following benefits:

  1. Resilient and scalable
  2. Uses the VPC+2 endpoint, local caching and Availability Zone (AZ) isolation
  3. Minimal forwarding hops
  4. Lower cost: optimal use of Resolver endpoints and forwarding rules

In order to handle the DR, here are some other considerations:

  • For app1.aws.customer.local, the same PHZ is associated with VPC in us-west-2 region. While VPCs are regional, the PHZ is a global construct. The same PHZ is accessible from VPCs in different regions.
  • Failover routing policy is set up in the PHZ and failover records are created. However, Route 53 health checkers (being outside of the VPC) require a public IP for your applications. As these business applications are internal to the organization, a metric-based health check with Amazon CloudWatch can be configured as mentioned in Configuring failover in a private hosted zone.
  • Resolver endpoints are created in VPC in another region (us-west-2) in the networking account. This allows on-premises servers to failover to these secondary Resolver inbound endpoints in case the region goes down.
  • A second set of forwarding rules is created in the networking account, which uses the outbound endpoint in us-west-2. These are shared with Account A and then associated with VPC in us-west-2.
  • In addition, to have DR across multiple on-premises locations, the on-premises servers should have a secondary backup DNS on-premises as well (not shown in the diagram).
    This ensures a simple DNS architecture for the DR setup, and seamless failover for applications in case of a region failure.

Considerations

  • If Application 1 needs to communicate to Application 2, then the PHZ from Account A must be shared with Account B. DNS queries can then be routed efficiently for those VPCs in different accounts.
  • Create additional IP addresses in a single AZ/subnet for the resolver endpoints, to handle large volumes of DNS traffic.
  • Look at Considerations while using Private Hosted Zones before implementing such architectures in your AWS environment.

Summary

Hybrid cloud environments can utilize the features of Route 53 Private Hosted Zones such as overlapping namespaces and the ability to perform cross-account and multi-region VPC association. This creates a unified DNS view for your application environments. The architecture allows for scalability and high availability for business applications.

Integrating CloudEndure Disaster Recovery into your security incident response plan

Post Syndicated from Gonen Stein original https://aws.amazon.com/blogs/security/integrating-cloudendure-disaster-recovery-into-your-security-incident-response-plan/

An incident response plan (also known as procedure) contains the detailed actions an organization takes to prepare for a security incident in its IT environment. It also includes the mechanisms to detect, analyze, contain, eradicate, and recover from a security incident. Every incident response plan should contain a section on recovery, which outlines scenarios ranging from single component to full environment recovery. This recovery section should include disaster recovery (DR), with procedures to recover your environment from complete failure. Effective recovery from an IT disaster requires tools that can automate preparation, testing, and recovery processes. In this post, I explain how to integrate CloudEndure Disaster Recovery into the recovery section of your incident response plan. CloudEndure Disaster Recovery is an Amazon Web Services (AWS) DR solution that enables fast, reliable recovery of physical, virtual, and cloud-based servers on AWS. This post also discusses how you can use CloudEndure Disaster Recovery to reduce downtime and data loss when responding to a security incident, and best practices for maintaining your incident response plan.

How disaster recovery fits into a security incident response plan

The AWS Well-Architected Framework security pillar provides guidance to help you apply best practices and current recommendations in the design, delivery, and maintenance of secure AWS workloads. It includes a recommendation to integrate tools to secure and protect your data. A secure data replication and recovery tool helps you protect your data if there’s a security incident and quickly return to normal business operation as you resolve the incident. The recovery section of your incident response plan should define recovery point objectives (RPOs) and recovery time objectives (RTOs) for your DR-protected workloads. RPO is the window of time that data loss can be tolerated due to a disruption. RTO is the amount of time permitted to recover workloads after a disruption.

Your DR response to a security incident can vary based on the type of incident you encounter. For example, your DR plan for responding to a security incident such as ransomware—which involves data corruption—should describe how to recover workloads on your secondary DR site using a recovery point prior to the data corruption. This use case will be discussed further in the next section.

In addition to tools and processes, your security incident response plan should define the roles and responsibilities necessary during an incident. This includes the people and roles in your organization who perform incident mitigation steps, in addition to those who need to be informed and consulted. This can include technology partners, application owners, or subject matter experts (SMEs) outside of your organization who can offer additional expertise. DR-related roles for your incident response plan include:

  • A person who analyzes the situation and provides visibility to decision-makers.
  • A person who decides whether or not to trigger a DR response.
  • A person who actively triggers the DR response.

Be sure to include all of the stakeholders you identify in your documented security incident response procedures and runbooks. Test your plan to verify that the people in these roles have the pre-provisioned access they need to perform their defined role.

How to use CloudEndure Disaster Recovery during a security incident

CloudEndure Disaster Recovery continuously replicates your servers—including OS, system state configuration, databases, applications, and files—to a staging area in your target AWS Region. The staging area contains low-cost resources automatically provisioned and managed by CloudEndure Disaster Recovery. This reduces the cost of provisioning duplicate resources during normal operation. Your fully provisioned recovery environment is launched only during an incident or drill.

If your organization experiences a security incident that can be remediated using DR, you can use CloudEndure Disaster Recovery to perform failover to your target AWS Region from your source environment. When you perform failover, CloudEndure Disaster Recovery orchestrates the recovery of your environment in your target AWS Region. This enables quick recovery, with RPOs of seconds and RTOs of minutes.

To deploy CloudEndure Disaster Recovery, you must first install the CloudEndure agent on the servers in your environment that you want to replicate for DR, and then initiate data replication to your target AWS Region. Once data replication is complete and your data is in sync, you can launch machines in your target AWS Region from the CloudEndure User Console. CloudEndure Disaster Recovery enables you to launch target machines in either Test Mode or Recovery Mode. Your launched machines behave the same way in either mode; the only difference is how the machine lifecycle is displayed in the CloudEndure User Console. Launch machines by opening the Machines page, shown in the following figure, and selecting the machines you want to launch. Then select either Test Mode or Recovery Mode from the Launch Target Machines menu.
 

Figure 1: Machines page on the CloudEndure User Console

Figure 1: Machines page on the CloudEndure User Console

You can launch your entire environment, a group of servers comprising one or more applications, or a single server in your target AWS Region. When you launch machines from the CloudEndure User Console, you’re prompted to choose a recovery point from the Choose Recovery Point dialog box (shown in the following figure).

Use point-in-time recovery to respond to security incidents that involve data corruption, such as ransomware. Your incident response plan should include a mechanism to determine when data corruption occurred. Knowing how to determine which recovery point to choose in the CloudEndure User Console helps you minimize response time during a security incident. Each recovery point is a point-in-time snapshot of your servers that you can use to launch recovery machines in your target AWS Region. Select the latest recovery point before the data corruption to restore your workloads on AWS, and then choose Continue With Launch.
 

Figure 2: Selection of an earlier recovery point from the Choose Recovery Point dialog box

Figure 2: Selection of an earlier recovery point from the Choose Recovery Point dialog box

Run your recovered workloads in your target AWS Region until you’ve resolved the security incident. When the incident is resolved, you can perform failback to your primary environment using CloudEndure Disaster Recovery. You can learn more about CloudEndure Disaster Recovery setup, operation, and recovery by taking this online CloudEndure Disaster Recovery Technical Training.

Test and maintain the recovery section of your incident response plan

Your entire incident response plan must be kept accurate and up to date in order to effectively remediate security incidents if they occur. A best practice for achieving this is through frequently testing all sections of your plan, including your tools. When you first deploy CloudEndure Disaster Recovery, begin running tests as soon as all of your replicated servers are in sync on your target AWS Region. DR solution implementation is generally considered complete when all initial testing has succeeded.

By correctly configuring the networking and security groups in your target AWS Region, you can use CloudEndure Disaster Recovery to launch a test workload in an isolated environment without impacting your source environment. You can run tests as often as you want. Tests don’t incur additional fees beyond payment for the fully provisioned resources generated during tests.

Testing involves two main components: launching the machines you wish to test on AWS, and performing user acceptance testing (UAT) on the launched machines.

  1. Launch machines to test.
     
    Select the machines to test from the Machines page of the CloudEndure User Console by selecting the check box next to the machine. Then choose Test Mode from the Launch Target Machines menu, as shown in the following figure. You can select the latest recovery point or an earlier recovery point.
     
    Figure 3: Select Test Mode to launch selected machines

    Figure 3: Select Test Mode to launch selected machines

     

    The following figure shows the CloudEndure User Console. The Disaster Recovery Lifecycle column shows that the machines have been Tested Recently.

    Figure 4: Machines launched in Test Mode display purple icons in the Status column and Tested Recently in the Disaster Recovery Lifecycle column

    Figure 4: Machines launched in Test Mode display purple icons in the Status column and Tested Recently in the Disaster Recovery Lifecycle column

  2. Perform UAT testing.
     
    Begin UAT testing when the machine launch job is successfully completed and your target machines have booted.

After you’ve successfully deployed, configured, and tested CloudEndure Disaster Recovery on your source environment, add it to your ongoing change management processes so that your incident response plan remains accurate and up-to-date. This includes deploying and testing CloudEndure Disaster Recovery every time you add new servers to your environment. In addition, monitor for changes to your existing resources and make corresponding changes to your CloudEndure Disaster Recovery configuration if necessary.

How CloudEndure Disaster Recovery keeps your data secure

CloudEndure Disaster Recovery has multiple mechanisms to keep your data secure and not introduce new security risks. Data replication is performed using AES 256-bit encryption in transit. Data at rest can be encrypted by using Amazon Elastic Block Store (Amazon EBS) encryption with an AWS managed key or a customer key. Amazon EBS encryption is supported by all volume types, and includes built-in key management infrastructure that has no performance impact. Replication traffic is transmitted directly from your source servers to your target AWS Region, and can be restricted to private connectivity such as AWS Direct Connect or a VPN. CloudEndure Disaster Recovery is ISO 27001 and GDPR compliant and HIPAA eligible.

Summary

Each organization tailors its incident response plan to meet its unique security requirements. As described in this post, you can use CloudEndure Disaster Recovery to improve your organization’s incident response plan. I also explained how to recover from an earlier point in time when you respond to security incidents involving data corruption, and how to test your servers as part of maintaining the DR section of your incident response plan. By following the guidance in this post, you can improve your IT resilience and recover more quickly from security incidents. You can also reduce your DR operational costs by avoiding duplicate provisioning of your DR infrastructure.

Visit the CloudEndure Disaster Recovery product page if you would like to learn more. You can also view the AWS Raise the Bar on Data Protection and Security webinar series for additional information on how to protect your data and improve IT resilience on AWS.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Gonen Stein

Gonen is the Head of Product Strategy for CloudEndure, an AWS company. He combines his expertise in business, cloud infrastructure, storage, and information security to assist enterprise organizations with developing and deploying IT resilience and business continuity strategies in the cloud.