Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/architecture/aws-and-compartmentalization/
Practically every experienced driver has suffered a flat tire. It’s a real nuisance: you pull over, empty the trunk to get out your spare wheel, jack up the car and replace the punctured wheel before driving yourself to a nearby repair shop. For a car that’s OK; we can tolerate the occasional nuisance, and as drivers we’re never that far from a safe place to pull over or a friendly repair shop.
Using availability terminology, a spare tire is a kind of standby, a component or system that is idly waiting to be deployed when needed. These are common in computer systems too. Many databases rely on standby failover, for example, and some of them even rely on personal intervention, with a human running a script as they might wind a car-jack (though we’d recommend using Amazon RDS instead, which includes automated failover).
But when the stakes are higher, things are done a little differently. Take the systems in a modern passenger jet for example, which, despite recent tragic events, have a stellar safety record. A flight can’t pull over, and in the event of a problem an airliner may have to fly for several hours before being within range of a runway. For passenger jets it’s common for critical systems to use active redundancy. A twin-engine jet can fly with just one working engine, for example, so if one fails, the other can still easily keep the jet in the air.
This kind of model is also common in large web systems. There are many EC2 instances handling amazon.com for example, and when one occasionally fails there’s a buffer of capacity spread across the other servers ensuring that customers don’t even notice.
Jet engines don’t simply fail on their own though. Any one of dozens of components (digital engine controllers, fuel lines and pumps, gears and shafts, and so on) can cause the engine to stop working. For every one of these components, the aircraft designers could try to include some redundancy at the component level (and for some, such as avionics, they do), but there are so many that it’s easier to re-frame the design in terms of fault isolation or compartmentalization: as long as each engine depends on separate instances of each component, no one component can take out both engines. A fuel line may break, but it can only stop one engine from functioning, and the plane has already been designed to work with one engine out.
This kind of compartmentalization is particularly useful for complex computer systems. A large website or web service may depend on tens or even hundreds of sub-services. Only so many can themselves include robust active redundancy. By aligning instances of sub-services so that inter-dependencies never go across compartments we can make sure that a problem can be contained to the compartment it started in. It also means that we can try to resolve problems by quarantining whole compartments, without needing to find the root of the problem within the compartment.
AWS and Compartmentalization
Amazon Web Services includes some features and offerings that enable effective compartmentalization. Firstly, many Amazon Web Services—for example, Amazon S3 and Amazon RDS—are themselves internally compartmentalized and make use of active redundancy designs so that when failures occur they are hidden.
Secondly, we offer web services and resources in a range of sizes, along with automation in the form of Auto Scaling, CloudFormation templates, and OpsWorks recipes that make it easy to manage a larger number of instances.
There is a subtle but important distinction between running a small number of large instances, and a large number of small instances. Four m3.xlarge instances cost as much as two m3.2xlarge instances and provide the same amount of CPU and storage; but for high availability configurations, using four instances requires only a 33% failover capacity buffer and any host-level problem may impact one quarter of your load, whereas using two instances means a 100% buffer and any problem may impact half of your load.
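The percentages above can be checked with a short sketch. It assumes load is balanced evenly across instances and that the fleet must survive the loss of one instance at a time; the function names are illustrative and not part of any AWS tooling.

```python
def failover_buffer(instances: int, failures: int = 1) -> float:
    """Extra capacity each survivor needs, as a fraction of its normal load.

    Assumes evenly balanced load and `failures` instances lost at once.
    """
    survivors = instances - failures
    if survivors <= 0:
        raise ValueError("at least one instance must survive")
    # Each survivor goes from 1/instances of the load to 1/survivors of it.
    return instances / survivors - 1


def impact_of_one_failure(instances: int) -> float:
    """Fraction of total load touched when a single instance has a problem."""
    return 1 / instances


for n in (2, 4):
    print(f"{n} instances: buffer {failover_buffer(n):.0%}, "
          f"impact {impact_of_one_failure(n):.0%}")
# 2 instances: buffer 100%, impact 50%
# 4 instances: buffer 33%, impact 25%
```

The buffer shrinks as 1/(N-1), so each additional compartment buys less than the one before it, but the impact of any single failure keeps falling as 1/N.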
Thirdly, Amazon Web Services has pre-made compartments: up to four availability zones per region. These availability zones are deeply compartmentalized down to the datacenter, network and power level.
Suppose that we create a web site or web service that utilizes four availability zones. This means we need a 33% failover capacity buffer per zone, since three surviving zones must absorb the load of four (which compares well to a 100% failover capacity buffer in a standard two data center model). Our service consists of a front end, two dependent backend services (“Foo” and “Bar”) and a data-store (for this example, we’ll use S3).
By constraining any sub-service calls to stay “within” the availability zone we make it easier to isolate faults. If backend service “Bar” fails (for example, through a software crash) in us-east-1b, this impacts one quarter of our overall capacity.
Initially this may not seem much better than if we had spread calls to the Bar service from all zones across all instances of the Bar service; after all, the failure rate would also be one quarter. But the difference is profound.
Firstly, experience has shown that small problems can often become amplified in complex systems. For example if it takes the “Foo” service longer to handle a failed call to the “Bar” service, then the initial problem with the “Bar” service begins to impact the behavior of “Foo” and in turn the frontends.
Secondly, by having a simple all-purpose mechanism to fail away from the infected availability zone, the problem can be reliably, simply, and quickly neutralized, just as a plane can be designed to fly on one engine and many types of failure handled with one procedure—if the engine is malfunctioning and a short checklist’s worth of actions don’t restore it to health, just shut it down and land at the next airport.
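The zone-local calling pattern described above could look something like the following sketch. The registry, host names and `pick_endpoint` function are all hypothetical, and only two zones are shown for brevity; the point is simply that a caller never reaches across compartments, even when its local dependency is unavailable.

```python
import random

# Hypothetical registry: service name -> availability zone -> endpoints in that zone.
REGISTRY = {
    "foo": {
        "us-east-1a": ["foo-1a.internal.example"],
        "us-east-1b": ["foo-1b.internal.example"],
    },
    "bar": {
        "us-east-1a": ["bar-1a.internal.example"],
        "us-east-1b": ["bar-1b.internal.example"],
    },
}


def pick_endpoint(service: str, local_zone: str, registry=REGISTRY) -> str:
    """Choose an endpoint for `service` from the caller's own zone only."""
    candidates = registry[service].get(local_zone, [])
    if not candidates:
        # No cross-zone fallback on purpose: the failure stays inside this
        # compartment, and the zone as a whole can be failed away from.
        raise LookupError(f"no {service} endpoint available in {local_zone}")
    return random.choice(candidates)


# A front end running in us-east-1b only ever calls Bar in us-east-1b.
print(pick_endpoint("bar", "us-east-1b"))
```

Raising an error instead of quietly calling another zone is the design choice that matters here: it keeps the fault visible to the zone's health check rather than spreading load and latency into healthy compartments.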
Route 53 Infima
Our suggested mechanism for handling this kind of failure is Amazon Route 53 DNS Failover. As DNS is the service that turns service/website names into the list of particular front-end IP addresses to connect to, it sits at the start of every request and is an ideal layer to neutralize problems.
With Route 53 health checks and DNS failover, each front-end is constantly health checked and automatically removed from DNS if there is a problem. Route 53 Health Check URLs are fully customizable and can point to a script that checks every dependency in the availability zone (“Is Foo working, Is Bar working, is S3 reachable, etc …”).
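As a rough illustration of such a health check script, the sketch below exposes a URL that reports healthy only when every dependency in the zone passes. The probe functions are placeholders (assumptions for this example); a real deployment would replace them with calls to the zone-local Foo and Bar services and to S3. It relies on an HTTP health check treating a 2xx response as healthy and a 503 as unhealthy.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_dependencies(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every probe; return (HTTP status, per-dependency results)."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Placeholder probes: stand-ins for real zone-local checks.
CHECKS = {
    "foo": lambda: True,
    "bar": lambda: True,
    "s3": lambda: True,
}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, results = check_dependencies(CHECKS)
        body = json.dumps(results).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve the health check URL until interrupted (port is an arbitrary choice)."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling `serve()` on each front end and pointing the health check at that port would take the front end out of DNS whenever any dependency in its zone fails.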
This brings us to Route 53 Infima. Infima is a library designed to model compartmentalization systematically and to help represent those kinds of configurations in DNS. With Infima, you assign endpoints to specific compartments, such as an availability zone. For advanced configurations you may also layer in additional compartmentalization dimensions; for example, you may want to run two different software implementations of the same service (perhaps for blue/green deployments, or for application-level redundancy) in each availability zone.
Once the Infima library has been taught the layout of endpoints within the compartments, failures can be simulated in software and any gaps in capacity identified. But the real power of Infima comes in expressing these configurations in DNS. Our example service had 4 endpoints, in 4 availability zones. One option for expressing this in DNS is to return each endpoint one time in every four. Each answer could also depend on a health check, and when the health check fails, it could be removed from DNS. Infima supports this configuration.
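The idea of simulating failures to find capacity gaps can be sketched in a few lines. This is not the Infima API; it is a minimal stand-in that assumes each endpoint is described by a (zone, capacity) pair and that only whole-zone failures are considered.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def capacity_gaps(endpoints: Iterable[Tuple[str, float]], peak_load: float) -> Dict[str, float]:
    """Return {zone: shortfall} for every single-zone failure that leaves
    less capacity than `peak_load`. An empty dict means no gaps were found."""
    per_zone = defaultdict(float)
    for zone, capacity in endpoints:
        per_zone[zone] += capacity
    total = sum(per_zone.values())

    gaps = {}
    for zone, lost in per_zone.items():
        remaining = total - lost  # capacity left if this zone is quarantined
        if remaining < peak_load:
            gaps[zone] = peak_load - remaining
    return gaps


# Four endpoints, one per zone, each able to serve 100 units of load.
fleet = [("us-east-1a", 100), ("us-east-1b", 100),
         ("us-east-1c", 100), ("us-east-1d", 100)]

print(capacity_gaps(fleet, peak_load=300))  # {} : any one zone can be lost
print(capacity_gaps(fleet, peak_load=320))  # every zone shows a shortfall of 20.0
```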
However, there is a better option. DNS (and naturally Route 53) allows several endpoints to be represented in a single answer, for example by returning the addresses of all four endpoints together.
When clients (such as browsers or web services clients) receive these answers they generally try several endpoints until they find one that successfully connects. So by including all of the endpoints we gain some fault tolerance. When an endpoint is failing though, as we’ve seen before, the problem can spread and clients can incur retry timers and some delay, so it’s still desirable to remove IPs from DNS answers in a timely manner.
Infima can use the list of compartments, endpoints and their health checks to build what we call a RubberTree, a pre-computed decision tree of DNS answers that has answers pre-baked, ready and waiting for potential failures: a single node failing, a whole compartment failing, combinations of each and so on. This decision tree is then stored as a Route 53 configuration and can automatically handle any failures. So if the 192.0.2.3 endpoint were to fail, an answer containing only the remaining healthy endpoints will be returned. By having these decision trees pre-baked and always ready and waiting, Route 53 is able to react quickly to endpoint failures, which with compartmentalization means we are also ready to handle failures of any sub-service serving that endpoint.
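To make the idea of pre-baked answers concrete, here is a toy sketch that computes an answer for every combination of health check results ahead of time. It is a flat lookup table, not Infima's RubberTree structure or a Route 53 record format. Apart from 192.0.2.3, the addresses are assumed for illustration, as is the policy of returning every endpoint when none is healthy.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def precompute_answers(endpoints: Sequence[str]) -> Dict[Tuple[bool, ...], List[str]]:
    """Map every combination of health check results to a ready-made answer."""
    table = {}
    for states in product((True, False), repeat=len(endpoints)):
        healthy = [ip for ip, ok in zip(endpoints, states) if ok]
        # Assumed policy: if nothing is healthy, return everything rather
        # than an empty answer.
        table[states] = healthy or list(endpoints)
    return table


# Assumed addresses; only 192.0.2.3 appears in the text above.
ENDPOINTS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
ANSWERS = precompute_answers(ENDPOINTS)

# When the health check for 192.0.2.3 fails, the answer is a plain lookup.
print(ANSWERS[(True, True, False, True)])
# ['192.0.2.1', '192.0.2.2', '192.0.2.4']
```

Because every answer exists before any failure happens, reacting to a failed health check needs no computation at failure time, only a lookup.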
The compartmentalization we’ve seen so far is most useful for certain kinds of errors: host-level problems, occasional crashes, application lock-ups. But if the problem originates with front-end level requests themselves, for example a denial of service attack, or a “poison pill” request that triggers a calamitous bug, then it can quickly infect all of your compartments. Infima also includes some neat functionality to assist in isolating even these kinds of faults, and that will be the topic of our next post.
Bonus Content: Busting Caches
I wrote that removing failing endpoints from DNS in a timely manner is important, even when there are multiple endpoints in an answer. One problem we respond to in this area is broken application-level DNS caching. Certain platforms, including many versions of Java, do not respect DNS cache lifetimes (the DNS time-to-live or TTL value), and once a name has been resolved the response is used indefinitely.
One way to mitigate this problem is to use cache “busting”. Route 53 supports wildcard records (and wildcard ALIASes, CNAMEs and more). Instead of using a service name such as “api.example.com”, it is possible to use a wildcard name such as “*.api.example.com”, which will match requests for any name ending in “.api.example.com”.
An application may then be written in such a way as to resolve a partially random name, e.g. “sdsHdsk3.api.example.com”. This name, since it ends in “.api.example.com”, will still receive the right answer, but since it is a unique random name every time, it will defeat (or “bust”) any broken platform or OS DNS caching.
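A minimal sketch of that technique, assuming a wildcard record already exists for “*.api.example.com”: the helper names, the label length and the port are choices made for this example only.

```python
import secrets
import socket
import string

ALPHABET = string.ascii_lowercase + string.digits


def busting_name(base: str, length: int = 8) -> str:
    """Prefix `base` with a fresh random label, e.g. 'k3x9q0ab.api.example.com'."""
    label = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{label}.{base}"


def resolve_fresh(base: str, port: int = 443) -> list:
    """Resolve a never-before-seen name under `base`, so no cached entry can match.

    Only works if a wildcard record covers '*.<base>'.
    """
    infos = socket.getaddrinfo(busting_name(base), port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


print(busting_name("api.example.com"))
```

Every call to `resolve_fresh("api.example.com")` asks for a different name, so a cache that ignores TTLs has nothing stale to hand back.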
– Colm MacCárthaigh