All posts by Ryan Jaeger

Automated Disaster Recovery using CloudEndure

Post Syndicated from Ryan Jaeger original https://aws.amazon.com/blogs/architecture/automated-disaster-recovery-using-cloudendure/

There are any number of events that cause IT outages and impact business continuity. These could include the unexpected infrastructure or application outages caused by flooding, earthquakes, fires, hardware failures, or even malicious attacks. Cloud computing opens a new door to support disaster recovery strategies, with benefits such as elasticity, agility, speed to innovate, and cost savings—all which aid new disaster recovery solutions.

With AWS, organizations can acquire IT resources on-demand, and pay only for the resources they use. Automating disaster recovery (DR) has always been challenging. This blog post shows how you can use automation to allow the orchestration of recovery to eliminate manual processes. CloudEndure Disaster Recovery, an AWS Company, Amazon Route 53, and AWS Lambda are the building blocks to deliver a cost-effective automated DR solution. The example in this post demonstrates how you can recover a production web application with sub-second Recovery Point Objects (RPOs) and Recovery Time Objectives (RTOs) in minutes.

As part of a DR strategy, knowing RPOs and RTOs will determine what kind of solution architecture you need. The RPO represents the point in time of the last recoverable data point (for example, the “last backup”). Any disaster after that point would result in data loss.

The time from the outage to restoration is the RTO. Minimizing RTO and RPO is a cost tradeoff. Restoring from backups and recreating infrastructure after the event is the lowest cost but highest RTO. Conversely, the highest cost and lowest RTO is a solution running a duplicate auto-failover environment.

Solution Overview

CloudEndure is an automated IT resilience solution that lets you recover your environment from unexpected infrastructure or application outages, data corruption, ransomware, or other malicious attacks. It utilizes block-level Continuous Data Replication (CDP), which ensures that target machines are spun up in their most current state during a disaster or drill, so that you can achieve sub-second RPOs. In the event of a disaster, CloudEndure triggers a highly automated machine conversion process and a scalable orchestration engine that can spin up machines in the target AWS Region within minutes. This process enables you to achieve RTOs in minutes. The CloudEndure solution uses a software agent that installs on physical or virtual servers. It connects to a self-service, web-based use console, which then issues an API call to the selected AWS target Region to create a Staging Area in the customer’s AWS account designated to receive the source machine’s replicated data.

Architecture

In the above example, a webserver and database server have the CloudEndure Agent installed, and the disk volumes on each server replicated to a staging environment in the customer’s AWS account. The CloudEndure Replication Server receives the encrypted data replication traffic and writes to the appropriate corresponding EBS volumes. It’s also possible to configure data replication traffic to use VPN or AWS Direct Connect.

With this current setup, if an infrastructure or application outage occurs, a failover to AWS is executed by manually starting the process from the CloudEndure Console. When this happens, CloudEndure creates EC2 instances from the synchronized target EBS volumes. After the failover completes, additional manual steps are needed to change the website’s DNS entry to point to the IP address of the failed over webserver.

Could the CloudEndure failover and DNS update be automated? Yes.

Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service with three main functions: domain registration, DNS routing, and health checking. A configured Route 53 health check monitors the endpoint of a webserver. If the health check fails over a specified period, an alarm is raised to execute an AWS Lambda function to start the CloudEndure failover process. In addition to health checks, Route 53 DNS Failover allows the DNS record for the webserver to be automatically update based on a healthy endpoint. Now the previously manual process of updating the DNS record to point to the restored web server is automated. You can also build Route 53 DNS Failover configurations to support decision trees to handle complex configurations.

To illustrate this, the following builds on the example by having a primary, secondary, and tertiary DNS Failover choice for the web application:

How Health Checks Work in Complex Amazon Route 53 Configurations

When the CloudEndure failover action executes, it takes several minutes until the target EC2 is launched and configured by CloudEndure. An S3 static web page can be returned to the end-user to improve communication while the failover is happening.

To support this example, Amazon Route 53 DNS failover decision tree can be configured to have a primary, secondary, and tertiary failover. The decision tree logic to support the scenario is the following:

  1. If the primary health check passes, return the primary webserver.
  2. Else, if the secondary health check passes, return the failover webserver.
  3. Else, return the S3 static site.

When the Route 53 health check fails when monitoring the primary endpoint for the webserver, a CloudWatch alarm is configured to ALARM after a set time. This CloudWatch alarm then executes a Lambda function that calls the CloudEndure API to begin the failover.

In the screenshot below, both health checks are reporting “Unhealthy” while the primary health check is in a state of ALARM. At the point, the DNS failover logic should be returning the path to the static S3 site, and the Lambda function executed to start the CloudEndure failover.

The following architecture illustrates the completed scenario:

Conclusion

Having a disaster recovery strategy is critical for business continuity. The benefits of AWS combined with CloudEndure Disaster Recovery creates a non-disruptive DR solution that provides minimal RTO and RPO while reducing total cost of ownership for customers. Leveraging CloudWatch Alarms combined with AWS Lambda for serverless computing are building blocks for a variety of automation scenarios.