All posts by Colm MacCarthaigh

Enhanced Domain Protections for Amazon CloudFront Requests

Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/security/enhanced-domain-protections-for-amazon-cloudfront-requests/

Over the coming weeks, we’ll be adding enhanced domain protections to Amazon CloudFront. The short version is this: the new measures are designed to ensure that requests handled by CloudFront are handled on behalf of legitimate domain owners.

Using CloudFront to receive traffic for a domain you aren’t authorized to use is already a violation of our AWS Terms of Service. When we become aware of this type of activity, we deal with it behind the scenes by disabling abusive accounts. Now we’re integrating checks directly into the CloudFront API and Content Distribution service, as well.

Enhanced Protection against Dangling DNS entries
To use CloudFront with your domain, you must configure your domain to point at CloudFront. You may use a traditional CNAME, or an Amazon Route 53 “ALIAS” record.

A problem can arise if you delete your CloudFront distribution, but leave your DNS still pointing at CloudFront, popularly known as a “dangling” DNS entry. Thankfully, this is very rare, as the domain will no longer work, but we occasionally see customers who leave their old domains dormant. This can also happen if you leave this kind of “dangling” DNS entry pointing at other infrastructure you no longer control. For example, if you leave a domain pointing at an IP address that you don’t control, then there is a risk that someone may come along and “claim” traffic destined for your domain.

In an even rarer set of circumstances, an abuser can exploit a subdomain of a domain that you are actively using. For example, if a customer left “images.example.com” dangling and pointing to a deleted CloudFront distribution, but still actively used the parent domain “example.com”, then an abuser could come along and register “images.example.com” as an alternate domain name on their own distribution and claim traffic that they aren’t entitled to. This also means that cookies may be set and intercepted for HTTP traffic, potentially including the parent domain. HTTPS traffic remains protected if you’ve removed the certificate associated with the original CloudFront distribution.

Of course, the best fix for this kind of risk is not to leave dangling DNS entries in the first place. Earlier in February, 2018, we added a new warning to our systems. With this warning, if you remove an alternate domain name from a distribution, you are reminded to delete any DNS entries that may still be pointing at CloudFront.

We also have long-standing checks in the CloudFront API that ensure this kind of domain claiming can’t occur when you are using wildcard domains. If you attempt to add *.example.com to your CloudFront distribution, but another account has already registered www.example.com, then the attempt will fail.
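
To illustrate the wildcard rule, here is a minimal sketch in Python of the kind of overlap check described above; it is not CloudFront’s implementation, and the claimed names are made up for the example.

# Sketch only: illustrates the wildcard-overlap rule described above,
# not CloudFront's actual validation logic.

def wildcard_covers(wildcard: str, name: str) -> bool:
    """Return True if a wildcard like '*.example.com' matches 'www.example.com'."""
    if not wildcard.startswith("*."):
        return wildcard.lower() == name.lower()
    suffix = wildcard[1:].lower()          # ".example.com"
    host = name.lower()
    # The wildcard matches exactly one additional label: 'www.example.com'
    # but not 'a.b.example.com'.
    return host.endswith(suffix) and "." not in host[: -len(suffix)]

# Names already registered as alternate domain names by *other* accounts
# (hypothetical data for the sketch).
claimed_by_others = {"www.example.com"}

def can_add_alternate_name(candidate: str) -> bool:
    return not any(wildcard_covers(candidate, claimed) for claimed in claimed_by_others)

print(can_add_alternate_name("*.example.com"))   # False: www.example.com is taken
print(can_add_alternate_name("*.example.org"))   # True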

With the new enhanced domain protection, CloudFront will now also check your DNS whenever you remove an alternate domain. If we determine that the domain is still pointing at your CloudFront distribution, the API call will fail and no other accounts will be able to claim this traffic in the future.
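
If you want to apply the same idea yourself before removing an alternate domain name, a rough self-check is easy to script. The sketch below uses the dnspython package and a hypothetical domain name; note that Route 53 ALIAS records surface as A records, so this only covers the plain CNAME case.

# Rough self-check before removing an alternate domain name from a distribution:
# does the name still resolve to CloudFront? Requires the dnspython package.
import dns.resolver

def still_points_at_cloudfront(name: str) -> bool:
    try:
        answers = dns.resolver.resolve(name, "CNAME")
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return False
    return any(str(rdata.target).rstrip(".").endswith(".cloudfront.net")
               for rdata in answers)

if still_points_at_cloudfront("images.example.com"):
    print("DNS still points at CloudFront; clean up the record before removing the alias.")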

Enhanced Protection against Domain Fronting
CloudFront will also soon be implementing enhanced protections against so-called “Domain Fronting”. Domain Fronting is when a non-standard client makes a TLS/SSL connection to a certain name, but then makes an HTTPS request for an unrelated name. For example, the TLS connection may connect to “www.example.com” but then issue a request for “www.example.org”.

In certain circumstances this is normal and expected. For example, browsers can re-use persistent connections for any domain that is listed in the same SSL Certificate, and these are considered related domains. But in other cases, tools including malware can use this technique between completely unrelated domains to evade restrictions and blocks that can be imposed at the TLS/SSL layer.

To be clear, this technique can’t be used to impersonate domains. The clients are non-standard and are working around the usual TLS/SSL checks that ordinary clients impose. But clearly, no customer ever wants to find that someone else is masquerading as their innocent, ordinary domain. Although these cases are also already handled as a breach of our AWS Terms of Service, in the coming weeks we will be checking that the account that owns the certificate we serve for a particular connection always matches the account that owns the request we handle on that connection. As ever, the security of our customers is our top priority, and we will continue to provide enhanced protection against misconfigurations and abuse from unrelated parties.
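
Conceptually, the per-connection check compares two things the front end already knows: who owns the certificate selected for the TLS/SSL connection’s server name, and who owns the distribution that the request’s Host header routes to. The sketch below illustrates that comparison with hypothetical account lookups; it is not CloudFront’s internal code.

# Conceptual sketch of the domain-fronting check: the account that owns the
# certificate served on the connection must match the account that owns the
# distribution the request is routed to. Not CloudFront's real code.

def account_for_certificate(sni_name: str) -> str:
    # Hypothetical lookup table for the sketch.
    return {"www.example.com": "111122223333"}.get(sni_name, "unknown")

def account_for_distribution(host_header: str) -> str:
    return {"www.example.com": "111122223333",
            "www.example.org": "444455556666"}.get(host_header, "unknown")

def allow_request(sni_name: str, host_header: str) -> bool:
    return account_for_certificate(sni_name) == account_for_distribution(host_header)

print(allow_request("www.example.com", "www.example.com"))  # True: same owner
print(allow_request("www.example.com", "www.example.org"))  # False: fronting attempt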

Interested in additional AWS Security news? Follow the AWS Security Blog on Twitter.

Automated Reasoning and Amazon s2n

Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/security/automated-reasoning-and-amazon-s2n/

In June 2015, AWS Chief Information Security Officer Stephen Schmidt introduced AWS’s new Open Source implementation of the SSL/TLS network encryption protocols, Amazon s2n. s2n is a library that has been designed to be small and fast, with the goal of providing you with network encryption that is more easily understood and fully auditable.

In the 14 months since that announcement, development on s2n has continued, and we have merged more than 100 pull requests from 15 contributors on GitHub. Those active contributors include members of the Amazon S3, Amazon CloudFront, Elastic Load Balancing, AWS Cryptography Engineering, Kernel and OS, and Automated Reasoning teams, as well as 8 external, non-Amazon Open Source contributors.

At the time of the initial s2n announcement, three external security evaluations and penetration tests on s2n had been completed. Those evaluations were code reviews and testing completed by security-focused experts, and came in addition to the code reviews and testing that are applied to every code change at Amazon as standard practice. We have continued to perform such evaluations, and we are pleased to have s2n be the focus of additional analysis from external academic and professional security researchers.

Adding automated reasoning to s2n

Because of s2n’s role as security-critical software, one of our goals is to use s2n as a proving ground for new automated reasoning testing and assurance techniques that we can refine for broader adoption within Amazon and beyond. Increasingly, the availability of compute resources on demand such as Amazon EC2 makes it possible to perform extensive security analysis, even on every code change.

As a first step toward automating security and validation checks, we wanted to move away from traditional static analysis steps involving a manual, developer-triggered process to steps that are integrated into the build and triggered for every GitHub pull request and commit. Our public s2n repository includes automated static analysis of the s2n code using the Open Source LLVM-based scan-build and Cppcheck tools, in addition to other commercial tools AWS runs internally.
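
As a rough sketch of what a per-commit static analysis step can look like, a small driver script can run the tools and fail the build on any finding. The flags shown are the tools’ documented ones, but the exact invocations used in the s2n build differ.

# Sketch of a CI step that fails the build if either analyzer reports an issue.
# The s2n repository's actual scripts differ; this only shows the shape.
import subprocess
import sys

checks = [
    # cppcheck: return a non-zero exit code if any warning is found.
    ["cppcheck", "--error-exitcode=1", "--enable=warning", "."],
    # clang's scan-build: --status-bugs makes the exit code non-zero on findings.
    ["scan-build", "--status-bugs", "make", "-j8"],
]

for cmd in checks:
    print("running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit("static analysis failed: " + cmd[0])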

Through our participation in the Linux Foundation’s Core Infrastructure Initiative, we have also worked with TrustInSoft to use s2n as a proving ground for a new static analysis tool, tis-interpreter, which detects uses of undefined programming language behavior. By using s2n’s test suite as entry points for runtime analysis, TrustInSoft was able to identify and eliminate a series of minor dependencies on implicit compiler behavior.

We have also extended our run-time analysis of s2n to include integrated fuzz testing. Every change to s2n repositories is run automatically through a libFuzzer-based test, and American fuzzy lop is used for offline testing. Based on analyzing the running code, these tests dynamically generate data inputs that are designed to target and exercise all the modes and options that s2n might encounter. The tests will also find corner cases that could lead to logical errors, memory leaks, or crashes.
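
Conceptually, a fuzz test is a loop that generates adversarial inputs and checks that the code under test never fails in an undefined way. The real s2n harnesses are coverage guided (libFuzzer and American fuzzy lop); the toy sketch below, with a stand-in parser, only shows the shape of the idea.

# Toy illustration of fuzzing: hammer a parser with random byte strings and
# treat any unexpected exception as a finding. The real harnesses are
# coverage-guided (libFuzzer/AFL) rather than purely random.
import os
import random

def parse_tls_record(data: bytes) -> None:
    # Stand-in for the code under test; a real harness would call into s2n.
    if len(data) < 5:
        raise ValueError("record too short")
    length = int.from_bytes(data[3:5], "big")
    if length != len(data) - 5:
        raise ValueError("length mismatch")

random.seed(0)
for i in range(100_000):
    blob = os.urandom(random.randint(0, 64))
    try:
        parse_tls_record(blob)
    except ValueError:
        pass                      # expected, well-defined rejection
    except Exception as exc:      # anything else would be a bug worth reporting
        print("finding on input", blob.hex(), ":", exc)
        break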

Adding automated formal verification of s2n

These kinds of tests are designed to provide assurance for the security and safety characteristics of the s2n code, but cryptographic code also benefits from formal verification, where the outputs of the cryptographic operations are proven correct for all potential inputs.

Typically, formal verification can be tedious and is performed as research by skilled specialists using mathematical toolsets. As a part of our commitment to automated reasoning, we have contracted with Galois to simplify this process and make it more developer friendly. Combining a domain-specific language called Cryptol and a software analysis tool called SAW, Galois has produced a tool chain that allows us to formally verify important aspects of s2n.

This verification also is built into our public GitHub builds. The verification occurs on every change, and includes negative test cases that “verify the verifier” by deliberately introducing an error into a test-only build and confirming that the tools reject it. The formal specifications required to create the proof are extremely small and readable, and no annotations or changes to the s2n code were necessary to support the proof.

In addition, we have already formally verified s2n’s implementation of HMAC, an important algorithm that is used extensively within the TLS/SSL protocols and elsewhere, and will continue work to verify more of the foundational algorithms used within s2n.
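
HMAC is compact enough to state directly, which is part of what makes it a good verification target: HMAC(K, m) = H((K xor opad) || H((K xor ipad) || m)). The short reference implementation below is checked against Python’s standard library rather than against s2n, purely to make the algorithm concrete.

# RFC 2104 HMAC, written out longhand and compared with the standard library.
import hashlib
import hmac as stdlib_hmac

def hmac_sha256(key: bytes, message: bytes) -> bytes:
    block_size = 64                                  # SHA-256 block size in bytes
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()
    key = key.ljust(block_size, b"\x00")
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + message).digest()
    return hashlib.sha256(opad + inner).digest()

assert hmac_sha256(b"key", b"message") == stdlib_hmac.new(b"key", b"message", hashlib.sha256).digest()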

Copy and clone our work

If you want to learn more about s2n integrated testing, would like to use it on your own projects, or are interested in using or contributing to s2n, see our public s2n repository on awslabs.

If you have comments or questions about this blog post, submit them in the “Comments” section below.

-Colm

s2n and Lucky 13

Post Syndicated from Colm MacCarthaigh original https://blogs.aws.amazon.com/security/post/TxLZP6HNAYWBQ6/s2n-and-Lucky-13

Great security research combines extremely high levels of creativity, paranoia, and attention to detail. All of these qualities are in evidence in two new research papers about how s2n, our Open Source implementation of the SSL/TLS protocols, handles the Lucky 13 attack from 2013. 

The research found issues with how s2n mitigates Lucky 13 and improvements that could be made. These issues did not impact Amazon, AWS, or our customers, and are not the kind that could be exploited in the real world, but the research shows that the early versions of s2n’s mitigations made Lucky 13 attacks tens of millions of times more difficult for attackers, rather than the trillions of times more difficult that we had intended. In any event, the versions of s2n concerned have never been used in production, and improvements that completely prevent the issue were included in the versions of s2n available on GitHub from July 11th, 2015.

The two papers are from:

Martin Albrecht  and Kenny Paterson of the Information Security Group at Royal Holloway, University of London, who have a paper concerning s2n’s initial implementation of Lucky 13 CBC verification, and s2n’s secondary mitigation mechanism using timing delays.  Kenny Paterson is also one of the original authors of the Lucky 13 paper.

Manuel Barbosa (HASLab – INESC TEC and FCUP) has discovered a bug introduced in the process of making improvements related to the first paper, and has upcoming publications in collaboration with other researchers (see below) in which they cover the limits of human review and other interesting implications of this programming error, and explore formal verification solutions to prevent such errors in the first place.

We would like to thank Albrecht, Paterson and Barbosa for reporting their research through our vulnerability reporting process and for discussing and reviewing the changes we made to s2n in response to that research. Both papers are valuable contributions towards improving the protections in s2n and cryptography generally, and we’re grateful for the work involved.

Our summary: s2n includes two different mitigations against the Lucky 13 attack and although the original implementations in s2n were effective against real-world attacks, each could be improved to be more robust against theoretical attacks.

Lucky 13: a quick recap

In February 2013, Nadhem J. AlFardan and Kenny Paterson of the Information Security Group at Royal Holloway, University of London, published the Lucky 13 attack against TLS. Adam Langley’s blog post on the topic is a great detailed summary of the attack, how it operates, and how it was mitigated in OpenSSL.

A brief synopsis is that Lucky 13 is an “Active Person in the Middle” attack against block ciphers where an attacker who is already capable of intercepting and modifying your traffic may tamper with that traffic in ways that allow the attacker to determine which portions of your traffic are encrypted data, and which portions of your traffic are padding (bytes included to round up to a certain block size).

The attacker’s attempt to tamper with the traffic is detected by the design of the TLS protocol, and triggers an error message. The Lucky 13 research showed that receiving this error message can take a different amount of time depending on whether real data or padding was modified. That information can be combined with other cryptographic attacks to recover the original plaintext.

How the Lucky 13 attack interacts with TLS

Fortunately, there are a number of factors that make the Lucky 13 attack very difficult in real world settings against SSL/TLS.

Firstly, the timing differences involved are extremely small, commonly smaller than a microsecond. The attacker must intercept, modify, and pass on traffic, and intercept the errors, on a network that is stable to the point that differences of less than a microsecond are measurable. Unsecured Wi-Fi networks are easiest to intercept traffic on, but are neither fast enough nor stable enough for the attack to work. Wide-area networks are also too inconsistent, so an attacker must be close to their target. Even within a single high-performance data center network, it is very difficult to obtain the conditions required for this attack to succeed. With the security features of Amazon Virtual Private Cloud (VPC), it is impossible to surreptitiously intercept TLS traffic within AWS data centers.

Making things harder still for the attacker, attempts to accelerate the attack by trying multiple connections in parallel will contend for network queues and can obscure the very measurements needed for the attack to succeed. Our own experiments with real-world networks show an exponential distribution of packet delays in the microsecond to tens of microseconds range.

Secondly, the Lucky 13 attack requires a client to attempt to send or receive the same information over and over again; hundreds to millions of times depending on what is known about the data already along with the exact details of the implementation being targeted. Web browsers typically limit the number of attempts they make to three, as do clients such as the AWS SDKs. Still, as the Lucky 13 paper points out: more advanced and involved attacks may use JavaScript or other techniques to generate that kind of request pattern.

Thirdly, the errors generated by the Lucky 13 attack are fatal in SSL/TLS and cause the connection to be closed. An attack would depend on neither the client nor the server noticing incredibly high connection error rates. At AWS, monitoring and alarming on far lower error rates than would be required to mount a Lucky 13 attack is a standard part of our operational readiness.

Lastly, in response to the original Lucky 13 research, the CBC cipher suites impacted by the vulnerability are no longer as common as they were. Today’s modern web browsers and servers no longer prefer these cipher suites. At AWS, we now see less than 10% of connections use CBC cipher suites.

Considering the limitations on the attack, it is difficult to conceive of circumstances where it is exploitable in the real-world against TLS. Indeed, some implementations of the TLS protocol have chosen not to mitigate the Lucky 13 attack.

With s2n, we decided to be cautious, as attacks only improve over time, and to follow the statement of the risk from Matthew Green’s blog post on the topic: “The attack is borderline practical if you’re using the Datagram version of TLS (DTLS). It’s more on the theoretical side if you’re using standard TLS. However, with some clever engineering, that could change in the future. You should probably patch!”

Lucky 13, s2n, and balancing trade-offs

Despite the significant difficulty of any practical attack, as described above, s2n has included two different forms of mitigation against the Lucky 13 attack since release: first by minimizing the timing difference mentioned above, and second, by masking any difference by including a delay of up to 10 seconds whenever any error is triggered.

Martin Albrecht and Kenny Paterson, again of the Information Security Group at Royal Holloway, University of London, got in touch to make us aware that our first mitigation could be made more effective on its own, and then after learning about the additional mitigation strategy also performed some impressive research on the effectiveness of the delay technique.

One obvious question was raised: why doesn’t s2n mitigate Lucky 13 in the same way that OpenSSL does? OpenSSL’s change introduced significant complexity, but s2n’s primary goal is to implement the TLS/SSL protocols in a secure way, which includes risk reduction via reduced complexity. Besides hard-to-attack timing issues, our observation is that other software coding errors can lead to much more practical forms of attacks including remote execution, such as the ShellShock vulnerability, or memory disclosure such as the HeartBleed vulnerability in OpenSSL. In short: these latter kinds of vulnerabilities, due to small coding errors, can be catastrophic and rank as higher impact threats.

We must carefully balance the risk of additional complexity and code against the benefit. This leads us to be extremely conservative in adding code to s2n and to prioritize readability, auditability and simplicity in our implementation, which we emphasize in s2n’s development guide.  Simply put, we keep things simple for better security.

Some modern cryptographic implementations, most notably NaCl, strive to make all operations take the same amount of time to execute regardless of the input, using techniques such as branch-free programming and bitwise operations. Where appropriate s2n employs these techniques, too.

For example, s2n verifies CBC padding in constant time regardless of the amount of padding, an important aspect of mitigating Lucky 13 on its own.  Wherever s2n uses these techniques, we include explanatory code comments, in addition to analyzing and auditing the machine code emitted by multiple common compilers.
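
To make the technique concrete, here is a sketch of constant-time-style CBC padding verification: the loop count depends only on the public record length, and membership in the padding region is computed with bitwise arithmetic instead of a branch. Python itself offers no constant-time guarantees, so this only illustrates the structure of the technique, not s2n’s C code.

# Illustration of constant-time-style CBC padding verification: the loop count
# depends only on the public record length, and whether a byte belongs to the
# claimed padding is computed with bitwise arithmetic rather than a branch.
def cbc_padding_ok(record: bytes) -> bool:
    pad_len = record[-1]
    n = len(record)
    failed = 0
    for i in range(n):                       # public length sets the loop count
        from_end = n - 1 - i                 # 0 for the final (length) byte
        # mask is 0xFF when this byte is inside the claimed padding region
        # (from_end <= pad_len), 0x00 otherwise, with no secret-dependent branch.
        mask = ((from_end - pad_len - 1) >> 31) & 0xFF
        failed |= mask & (record[i] ^ pad_len)
    # Reject padding lengths longer than the record, again without branching.
    failed |= ((n - 1 - pad_len) >> 31) & 0xFF
    return failed == 0

# TLS-style padding: pad_len bytes of value pad_len, then the length byte itself.
assert cbc_padding_ok(b"data-and-mac" + bytes([3, 3, 3, 3]))
assert not cbc_padding_ok(b"data-and-mac" + bytes([3, 7, 3, 3]))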

Unfortunately, the design of the CBC algorithm that was impacted by Lucky 13 pre-dates these techniques. While it is feasible to do constant-time padding verification, it is not possible to apply the technique to HMAC verification without changing the interface to the HMAC algorithm in quite a complex way, and the interaction between CBC and HMAC verification is exactly where the timing issue shows up.

As the research from Albrecht and Paterson found, retrofitting constant-time practices into OpenSSL’s CBC handling required hundreds of lines of code; inter-weaving the TLS handling code and the HMAC code, and changing the interface to HMAC in a non-standard way. Although this was done with great care and in a manner appropriate to OpenSSL, we do not believe a similar approach is suitable for s2n, where it would lead to an almost 10% increase in code and, in our view, would have a significant impact on clarity, for a very marginal benefit. Additionally, we are also formally verifying our implementation of the HMAC algorithm against the published HMAC standard, so we try hard to use HMAC via its standard interface as closely as possible.

In other words: changing the HMAC interface to be similar to OpenSSL would be adding hard-to-follow code and would defeat much of the benefit of formally verifying our implementation of the HMAC algorithm. With these trade-offs in mind, s2n first mitigated Lucky 13 by counting the number of bytes handled by HMAC and making this equal in all cases, a “close to constant time” technique that completely preserves the standard HMAC interface. This is an improvement over several alternative TLS implementations, which do not equalize the input to HMAC, and we could not observe any timing differences in our network testing.
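
The byte-equalizing idea can be sketched as follows: for a record of a given public length, the verifier always pushes the same total number of bytes through HMAC, however much of the record turns out to be padding. This is a conceptual outline only; it omits details such as the TLS header and sequence number that the real MAC also covers, and assumes the padding has already been validated.

# Sketch of the "equalized HMAC input" mitigation: for a record of public
# length n, HMAC always processes n bytes in total, no matter how much of the
# record turns out to be (secret) padding.
import hashlib
import hmac

def verify_record(key: bytes, record: bytes, mac_len: int = 32) -> bool:
    pad_len = record[-1]
    data_end = len(record) - mac_len - pad_len - 1   # end of the MACed data
    claimed_mac = record[data_end:data_end + mac_len]
    expected = hmac.new(key, record[:data_end], hashlib.sha256).digest()
    # Run the leftover bytes through a second, discarded HMAC so that the total
    # number of bytes processed equals len(record) for every padding length.
    hmac.new(key, record[data_end:], hashlib.sha256).digest()
    return hmac.compare_digest(expected, claimed_mac)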

Albrecht and Paterson pointed out to us that this technique does not account for some internal differences in how the HMAC algorithm behaves depending on the size of the data processed.  We were aware of this limitation, but our validation experiments – attacking s2n the same way an attacker would, and with our secondary mitigation turned off – had not been able to measure a time difference using this approach.

Separate targeted measurements, using a synthetic benchmarking environment and direct instrumentation of the extracted code, illustrated the timing difference better, and their latest paper includes some analysis of how this timing difference may be used in an attack.

Albrecht and Paterson were also able to help us avoid making complex changes by suggesting a small and clever change to the HMAC interface: add a single new call that always performs two internal rounds of compression, even if one may not be necessary. With this change in place, the timing differences would be unobservable even in the synthetic environment. Considering the low risk and the experimental evidence of our previous testing we made this change in s2n as part of a commit on July 13th, just less than two weeks after s2n’s public launch on GitHub.

Underscoring the principle that “with more code comes more risk,” this change had a small bug itself; instead of expecting it to take 9 bytes of space to finalize a checksum, the original change specified 8. Manuel Barbosa, José Almeida (HASLab – INESC TEC and Univ. Minho), Gilles Barthe, and François Dupressoir (IMDEA Software Institute) have put together a comprehensive paper detailing the issue and the value of more testing in this area and how properties such as constant time of execution can be verified more systematically.

Perhaps most interesting about this is that the code author, reviewers, and researchers involved are all extremely familiar with the internals of HMAC and hash algorithms, and this small detail still escaped review (though no formal code audit had yet begun on the changes). The same authors are also working on a formal verification tool specific for constant time implementations in collaboration with Michael Emmi (IMDEA Software Institute). We are working with vendors  to apply these techniques more generally.

s2n’s timing blinding mitigation

Lucky 13 is not the only attack against the TLS protocol to use timing differences, and so, as part of developing s2n, we decided to include a general mitigation against timing attacks. Whenever a TLS error is encountered, including Lucky 13 attack attempts, s2n pauses for a random duration of between 1 millisecond and 10 seconds. The goal here is to produce a general high-level mitigation which would have rendered any known timing attack impractical, even if the underlying issue were unpatched or, as is the case here, if another intended mitigation is not fully effective.
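
A sketch of that blinding behavior, using the Python standard library rather than s2n’s internal mechanism:

# Sketch of error-path timing blinding: whenever a TLS-level error occurs,
# delay for a random duration between 1 ms and 10 s before surfacing it.
import secrets
import time

MIN_DELAY_NS = 1_000_000            # 1 millisecond
MAX_DELAY_NS = 10_000_000_000       # 10 seconds

def blinded_failure(error: Exception):
    # Pick the delay from an unpredictable source, then sleep before raising.
    # time.sleep() takes float seconds; the granularity actually achieved is
    # platform dependent, which is the usleep/nanosleep concern discussed below.
    delay_ns = MIN_DELAY_NS + secrets.randbelow(MAX_DELAY_NS - MIN_DELAY_NS)
    time.sleep(delay_ns / 1_000_000_000)
    raise error

Drawing the delay from an unpredictable source matters: a delay an attacker could predict could simply be subtracted out of their measurements.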

The effect is to hide a signal (the timing difference) in a very large sea of noise. This doesn’t make it impossible to find the original signal, but it takes more measurements. The math involved is straightforward and is included in the original Lucky 13 paper. If the timing issue is 1 microsecond, then a delay of between 1 millisecond and 10 seconds increases the number of attempts required by about 8.3 trillion; far outside the realm of the practical. However this is only the case if the delay is uniform.
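
The arithmetic behind the “about 8.3 trillion” figure can be reproduced from the numbers above: the variance of a uniform delay spread over a roughly 10-second range is range²/12, or about 8.3 s², and the number of measurements needed grows roughly with the ratio of that variance to the square of the timing difference being hunted.

# Rough back-of-the-envelope reproduction of the factor quoted above.
signal = 1e-6                      # timing difference to detect: 1 microsecond
delay_range = 10.0 - 0.001         # uniform blinding delay between 1 ms and 10 s
noise_variance = delay_range ** 2 / 12.0
extra_samples_factor = noise_variance / signal ** 2
print(f"variance of the blinding delay: {noise_variance:.2f} s^2")
print(f"increase in required attempts:  {extra_samples_factor:.2e}")   # ~8.3e12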

Initially our random delay duration was implemented using the usleep routine (or by the calling process if it is handling multiple connections), which is granular to microseconds.  One of our concerns with the usleep approach is that it may be too coarse: a timing issue smaller than a microsecond may still “slip through” if the implementation of usleep is consistent enough.

This issue is not fatal; merely by adding 5 seconds of delay on average to each attempt the Lucky 13 attack is slowed down from a roughly 64-hour per byte attack to an attack which would take at least a year of attack time per byte to succeed (again, assuming idealized circumstances for the attacker).

Based on the renewed concerns, and with the encouragement of Albrecht and Paterson, we changed s2n to use nanosleep() for our delaying mechanism on July 11th. Indeed this change proved beneficial, as further research by Albrecht and Paterson over the coming months showed that observable timing issues could indeed slip through the old microsecond-granular sleep mechanism, and that the plaintext could be recovered in an idealized synthetic environment by making tens of millions of attempts per byte, which is less than the intended trillions.

So for an over-simplified assessment of how impractical an attack of this nature would be, assuming an attacker was in the right place, at the right time, and could convince their victim to send their password to a server, over and over again as fast as the network would allow, it would take in the region of 20 years to recover an 8-character password via the Lucky 13 attack vs s2n. The real-world wall-clock time required may be shortened here by trying attempts in parallel, but the contention of network queues and cumulative delay of packets also interfere with measurements, and it is also an even harder attack to mount against clients in the first place.

The change to nanoseconds means that this is now prevented. We are also examining other changes, including a mode in which s2n emulates hardware by handling errors only at set “tick” intervals, as well as a change to enforce a higher minimum timing delay in all cases.

Addendum

As mentioned earlier, some implementations of TLS do not include mitigations against Lucky 13. In some cases this is because they are implemented in higher level languages, where constant time coding techniques are not possible.

These languages are generally safer than low-level ones from the kinds of remote-execution and memory disclosure risk presented by coding errors. As technologists, it would be extremely unfortunate if we were forced to choose between safer programming languages and safer cryptography. Timing blinding may be a useful way to bridge this gap, and we look forward to further research in this area. 

– Colm

Shuffle Sharding: Massive and Magical Fault Isolation

Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/architecture/shuffle-sharding-massive-and-magical-fault-isolation/

A standard deck of cards has 52 different playing cards and 2 jokers. If we shuffle a deck thoroughly and deal out a four card hand, there are over 300,000 different hands. Another way to say the same thing is that if we put the cards back, shuffle and deal again, the odds are worse than 1 in 300,000 that we’ll deal the same hand again. It’s very unlikely.

It’s also unlikely, less than a 1/4 chance, that even just one of the cards will match between the two hands. And to complete the picture, there’s less than a 1/40 chance that two cards will match, and much less than a 1/1000 chance that three cards will be the same.
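
These figures are easy to check with a few lines of arithmetic, counting second hands that share exactly one, two, or three cards with the first (the deck here is the full 54 cards, jokers included):

# Checking the card figures above with hypergeometric counts
# (54-card deck including jokers, 4-card hands, exact number of shared cards).
from math import comb

deck, hand = 54, 4
total = comb(deck, hand)
print(f"distinct 4-card hands: {total:,}")                 # 316,251 (over 300,000)
for k in range(1, hand):
    # second hands sharing exactly k cards with the first hand
    p = comb(hand, k) * comb(deck - hand, hand - k) / total
    print(f"chance exactly {k} card(s) match: {p:.4f}")     # ~0.25, ~0.023, ~0.0006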

In our last post I promised to cover more about how Route 53 Infima can be used to isolate faults that are request-related, such as user or customer specific problems. Route 53 Infima’s Shuffle Sharding takes this pattern of rapidly diminishing likelihood for an increasing number of matches – a pattern which underlies many card games, lotteries and even bingo – and combines it with traditional horizontal scaling to produce a kind of fault isolation that can seem almost magical.

Traditional Horizontal Scaling

All but the smallest services usually run on more than one instance (though there are some impressive exceptions). Using multiple instances means that we can have active redundancy: when one instance has a problem the others can take over. Typically traffic and requests are spread across all of the healthy instances.

Though this pattern is great for balancing traffic and for handling occasional instance-level failure, it’s terrible if there’s something harmful about the requests themselves: every instance will be impacted. If the service is serving many customers for example, then one busy customer may swamp everyone else. Throttling requests on a per-customer basis can help, but even throttling mechanisms can themselves be overwhelmed.

Worse still, throttling won’t help if the problem is some kind of “poison request”. If a particular request happens to trigger a bug that causes the system to fall over, then the caller may trigger a cascading failure by repeatedly trying the same request against instance after instance until they have all fallen over.

Sharding and Bulkheads

One fault isolating improvement we can make upon traditional horizontal scaling is to use sharding. Sharding is a technique traditionally used with data storage and indexing systems. Instead of spreading traffic from all customers across every instance, we can divide the instances into shards. For example if we have eight instances for our service, we might create four shards of two instances each (two instances for some redundancy within each shard).

Next we face the decision of how to shard. One common way is by customer id, assigning customers to particular shards, but other sharding choices are viable such as by operation type or by resource identifier. You can even choose to do multidimensional sharding, and have customer-resource pairs select a shard, or customer-resource-operation-type.

Whatever makes the most sense for a given service depends on its innards and its particular mix of risks, but it’s usually possible to find some combination of id or operation type that will make a big difference if it can be isolated.
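
As a minimal sketch of the simplest choice, sharding by customer id, a hash of the identifier can pick the shard; the instance names below are made up for the example.

# Simple sharding: hash the customer id onto one of a fixed set of shards.
# Eight instances arranged as four shards of two, as in the example above.
import hashlib

SHARDS = [
    ["instance-1", "instance-2"],
    ["instance-3", "instance-4"],
    ["instance-5", "instance-6"],
    ["instance-7", "instance-8"],
]

def shard_for(customer_id: str) -> list:
    digest = hashlib.sha256(customer_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

print(shard_for("customer-1234"))   # every request for this customer lands on the same pair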

With this kind of sharding in place, if there is a problem caused by the requests, then we get the same kind of bulkhead effect we have seen before; the problem can be contained to one shard. So in our example above, with four shards then around one quarter of customers (or whichever other dimension has been chosen) may be impacted by a problem triggered by one customer. That’s considerably better than all customers being impacted.

If customers (or objects) are given specific DNS names to use (just as customers are given unique DNS names with many AWS services) then DNS can be used to keep per-customer traffic cleanly separated across shards.

Shuffle Sharding

With sharding, we are able to reduce customer impact in direct proportion to the number of instances we have. Even if we had 100 shards, 1% of customers would still experience impact in the event of a problem. One sensible solution for this is to build monitoring systems that can detect these problems and once detected re-shard the problem requests to their own fully isolated shard. This is great, but it’s a reactive measure and can usually only mitigate impact, rather than avoid it in the first place.

With Shuffle Sharding, we can do better. The basic idea of Shuffle Sharding is to generate shards as we might deal hands from a deck of cards. Take the eight instances example. Previously we divided it into four shards of two instances. With Shuffle Sharding the shards contain two random instances, and the shards, just like our hands of cards, may have some overlap.

By choosing two instances from eight there are 56 potential shuffle shards, much more than the four simple shards we had before.

At first, it may seem as if these Shuffle Shards are less suited to isolating faults; in the example above, two shuffle shards might share instance 5, and so a problem affecting that instance may impact both shards. The key to resolving this is to make the client fault tolerant. By having simple retry logic in the client that causes it to try every endpoint in a Shuffle Shard, until one succeeds, we get a dramatic bulkhead effect.

With the client trying each instance in the shard, a customer who is causing a problem to Shuffle Shard 1 may impact both instance 3 and instance 5 and so become impacted, but the customers using Shuffle Shard 2, which shares only instance 5, should experience only negligible (if any) impact, provided the client retries have been carefully tested and implemented to handle this kind of partial degradation correctly. Thus the real impact is constrained to 1/56th of the overall shuffle shards.
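
The client-side behavior that makes this work is simple: try each endpoint of your shuffle shard until one succeeds. A sketch, with a placeholder transport function:

# Client retry across a shuffle shard: any single bad endpoint only adds a
# retry for customers who share it; customers on other shards are unaffected.
import random

def call_with_retries(endpoints, send_request):
    """Try each endpoint of the customer's shuffle shard until one succeeds."""
    last_error = None
    for endpoint in random.sample(endpoints, len(endpoints)):   # spread load
        try:
            return send_request(endpoint)
        except Exception as exc:           # placeholder for real error handling
            last_error = exc
    raise last_error

# Usage sketch with a fake transport:
def fake_send(endpoint):
    if endpoint == "10.0.0.5":             # pretend this instance is unhealthy
        raise ConnectionError(endpoint + " unavailable")
    return "ok from " + endpoint

print(call_with_retries(["10.0.0.3", "10.0.0.5"], fake_send))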

1/56 impact is a great improvement on 1/4, but we can do even better still. Before, with simple sharding we needed to put two instances in each shard to have some redundancy. With Shuffle Sharding – as in traditional N+1 horizontal scaling – we have more instances available. We can shard to as many instances as we are willing to retry. With 3 retries – a common retry value – we can use four instances in total per shuffle shard.

With four instances per shuffle shard, we can reduce the impact to 1/1680 of our total customer base and we’ve made the “noisy neighbor” problem much more manageable.

Infima and Shuffle Sharding

The Route 53 Infima library includes two kinds of shuffle sharding. The first kind is simple stateless shuffle sharding. Stateless shuffle sharding uses hashing, much like a bloom filter does, to take a customer, object, or other identifier and turn it into a shuffle shard pattern. This technique results in some probability of overlap between customers, just as when we deal hands from a deck of cards. But by being stateless, this kind of shuffle sharding can be easily used, even directly in calling clients.
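
A sketch of the stateless flavor: hash the identifier, use the hash to seed a deterministic selection of a few endpoints out of the fleet, and any caller can compute the same shard with no coordination. This illustrates the idea only; it is not Infima’s Java implementation.

# Stateless shuffle sharding: derive a customer's shard deterministically from
# a hash of their identifier, so no shared state or lookup table is needed.
import hashlib
import random

def shuffle_shard(customer_id: str, endpoints: list, shard_size: int) -> list:
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)                       # deterministic per customer
    return rng.sample(sorted(endpoints), shard_size)

endpoints = [f"instance-{i}" for i in range(1, 9)]
print(shuffle_shard("customer-a", endpoints, 2))
print(shuffle_shard("customer-b", endpoints, 2))    # usually different, may overlap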

The second kind of Shuffle Sharding included is Stateful Searching Shuffle Sharding. These shuffle shards are generated using random assignment, again like hands from a deck of cards, but there is built in support for checking each newly generated shard against every previously assigned shard in order to make guarantees about overlap. For example we might choose to give every shuffle shard 4 out of 20 endpoints, but require that no two shuffle shards ever share more than two particular endpoints.
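
The searching flavor can be sketched as a generate-and-check loop: deal a random hand, compare it with every shard handed out so far, and keep it only if the overlap guarantee holds. The 4-of-20 and at-most-two-shared figures below are the ones from the example above.

# Stateful searching shuffle sharding: random shards, with a hard limit on how
# many endpoints any two shards may share.
import random

def new_shard(endpoints, shard_size, existing, max_overlap, attempts=10_000):
    for _ in range(attempts):
        candidate = set(random.sample(endpoints, shard_size))
        if all(len(candidate & shard) <= max_overlap for shard in existing):
            existing.append(candidate)
            return candidate
    raise RuntimeError("could not satisfy the overlap guarantee; add endpoints")

endpoints = [f"instance-{i}" for i in range(1, 21)]   # 20 endpoints
shards = []
for customer in range(5):
    print(sorted(new_shard(endpoints, 4, shards, max_overlap=2)))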

Both kinds of shuffle sharding in Infima are compartmentalization aware. For example, we can ensure that shuffle shards also make use of every availability zone. So our instances could be in 2 availability zones, 4 in each one. Infima can make sure to choose 2 endpoints from each zone, rather than simply 2 at random (which might choose both from one availability zone).

Lastly, Infima also makes it easy to use Shuffle Shards along with RubberTrees, so that endpoints can be easily expressed in DNS using Route 53. For example, every customer can be supplied their own DNS name, which maps to a shuffle shard which is handled by a RubberTree.

Post-script

The two general principles at work are that it can often be better to use many smaller things as it lowers the cost of capacity buffers and makes the impact of any contention small, and that it can be beneficial to allow shards to partially overlap in their membership, in return for an exponential increase in the number of shards the system can support.

Those principles mean that Shuffle Sharding is a general-purpose technique, and you can also choose to Shuffle Shard across many kinds of resources, including pure in-memory data-structures such as queues, rate-limiters, locks and other contended resources.

As it happens, Amazon Route 53, CloudFront and other AWS services use compartmentalization, per-customer Shuffle Sharding and more to provide fault isolation, and we will be sharing some more details about how some of that works in a future blog post.

Update from the author: an earlier version of this blog post used an incorrect figure for the number of 4-card hands from a 52-card deck (I wrote 7 million, based on permutations, instead of 300,000 based on combinations).

– Colm MacCárthaigh

AWS and Compartmentalization

Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/architecture/aws-and-compartmentalization/

Practically every experienced driver has suffered a flat tire. It’s a real nuisance: you pull over, empty the trunk to get out your spare wheel, jack up the car, and replace the punctured wheel before driving yourself to a nearby repair shop. For a car, that’s OK; we can tolerate the occasional nuisance, and as drivers we’re never that far from a safe place to pull over or a friendly repair shop.

Using availability terminology, a spare tire is a kind of standby: a component or system that is idly waiting to be deployed when needed. These are common in computer systems too. Many databases rely on standby failover for example, and some of them even rely on personal intervention, with a human running a script as they might wind a car-jack (though we’d recommend using an Amazon Relational Database Service (Amazon RDS) instance instead, which includes automated failover).

But when the stakes are higher, things are done a little differently. Take the systems in a modern passenger jet for example, which, despite recent tragic events, have a stellar safety record. A flight can’t pull over, and in the event of a problem an airliner may have to make it several hours before being within range of a runway. For passenger jets it’s common for critical systems to use active redundancy. A twin-engine jet can fly with just one working engine, for example – so if one fails, the other can still easily keep the jet in the air.

This kind of model is also common in large web systems. There are many EC2 instances handling amazon.com for example, and when one occasionally fails there’s a buffer of capacity spread across the other servers ensuring that customers don’t even notice.

Jet engines don’t simply fail on their own though. Any one of dozens of components (digital engine controllers, fuel lines and pumps, gears and shafts, and so on) can cause the engine to stop working. For every one of these components, the aircraft designers could try to include some redundancy at the component level (and some do, such as avionics), but there are so many that it’s easier to re-frame the design in terms of fault isolation or compartmentalization: as long as each engine depends on separate instances of each component, then no one component can take out both engines. A fuel line may break, but it can only stop one engine from functioning, and the plane has already been designed to work with one engine out.

This kind of compartmentalization is particularly useful for complex computer systems. A large website or web service may depend on tens or even hundreds of sub-services. Only so many can themselves include robust active redundancy. By aligning instances of sub-services so that inter-dependencies never go across compartments we can make sure that a problem can be contained to the compartment it started in. It also means that we can try to resolve problems by quarantining whole compartments, without needing to find the root of the problem within the compartment.

AWS and Compartmentalization

Amazon Web Services includes some features and offerings that enable effective compartmentalization. Firstly, many Amazon Web Services—for example, Amazon S3 and Amazon RDS—are themselves internally compartmentalized and make use of active redundancy designs so that when failures occur they are hidden.

Secondly, we offer web services and resources in a range of sizes, along with automation in the form of Auto Scaling, CloudFormation templates, and OpsWorks recipes that make it easy to manage a higher number of instances.

There is a subtle but important distinction between running a small number of large instances, and a large number of small instances. Four m3.xlarge instances cost as much as two m3.2xlarge instances and provide the same amount of CPU and storage; but for high availability configurations, using four instances requires only a 33% failover capacity buffer and any host-level problem may impact one quarter of your load, whereas using two instances means a 100% buffer and any problem may impact half of your load.
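
The buffer arithmetic generalizes: to keep carrying the full load after losing one unit out of N, each surviving unit must absorb 1/(N-1) extra, so the required buffer shrinks as the fleet is cut into more, smaller pieces.

# Failover capacity buffer needed to survive the loss of one unit, as a
# fraction of the steady-state load, for different fleet sizes.
for units in (2, 4, 8, 16):
    buffer = 1 / (units - 1)
    print(f"{units:>2} units -> {buffer:.0%} failover buffer")   # 100%, 33%, 14%, 7%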

Thirdly, Amazon Web Services has pre-made compartments: up to four availability zones per region. These availability zones are deeply compartmentalized down to the datacenter, network and power level.

Suppose that we create a web site or web service that utilizes four availability zones. This means we need a 33% failover capacity buffer per zone (which compares well to a 100% failover capacity buffer in a standard two data center model). Our service consists of a front end, two dependent backend services (“Foo” and “Bar”) and a data-store (for this example, we’ll use S3).

By constraining any sub-service calls to stay “within” the availability zone we make it easier to isolate faults. If backend service “Bar” fails (for example a software crash) in us-east-1b, this impacts 1/4th of our overall capacity.

Initially this may not seem much better than if we had spread calls to the Bar service from all zones across all instances of the Bar service; after all, the failure rate would also be one quarter. But the difference is profound.

Firstly, experience has shown that small problems can often become amplified in complex systems. For example if it takes the “Foo” service longer to handle a failed call to the “Bar” service, then the initial problem with the “Bar” service begins to impact the behavior of “Foo” and in turn the frontends.

Secondly, by having a simple all-purpose mechanism to fail away from the infected availability zone, the problem can be reliably, simply, and quickly neutralized, just as a plane can be designed to fly on one engine and many types of failure handled with one procedure—if the engine is malfunctioning and a short checklist’s worth of actions don’t restore it to health, just shut it down and land at the next airport.

Route 53 Infima

Our suggested mechanism for handling this kind of failure is Amazon Route 53 DNS Failover. As DNS is the service that turns service/website names into the list of particular front-end IP addresses to connect to, it sits at the start of every request and is an ideal layer to neutralize problems.

With Route 53 health checks and DNS failover, each front-end is constantly health checked and automatically removed from DNS if there is a problem. Route 53 Health Check URLs are fully customizable and can point to a script that checks every dependency in the availability zone (“Is Foo working, Is Bar working, is S3 reachable, etc …”).
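
Such a health check target can be as small as an HTTP handler that returns success only when every in-zone dependency looks healthy. In the sketch below the dependency probes are placeholders; a real check would call Foo, Bar, and S3 from within the zone.

# Sketch of a zone-local deep health check endpoint for Route 53 to poll.
# Each check_* function is a placeholder for a real in-zone dependency probe.
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_foo() -> bool: return True      # e.g. call Foo's own health URL
def check_bar() -> bool: return True      # e.g. call Bar's own health URL
def check_s3() -> bool: return True       # e.g. HEAD a known object

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        healthy = all(check() for check in (check_foo, check_bar, check_s3))
        self.send_response(200 if healthy else 503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()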

This brings us to Route 53 Infima. Infima is a library designed to model compartmentalization systematically and to help represent those kinds of configurations in DNS. With Infima, you assign endpoints to specific compartments such as availability zone. For advanced configurations you may also layer in additional compartmentalization dimensions; for example you may want to run two different software implementations of the same service (perhaps for blue/green deployments, for application-level redundancy) in each availability zone.

Once the Infima library has been taught the layout of endpoints within the compartments, failures can be simulated in software and any gaps in capacity identified. But the real power of Infima comes in expressing these configurations in DNS. Our example service had 4 endpoints, in 4 availability zones. One option for expressing this in DNS is to return each endpoint one time in every four. Each answer could also depend on a health check, and when the health check fails, it could be removed from DNS. Infima supports this configuration.

However, there is a better option. DNS (and naturally Route 53) allows several endpoints to be represented in a single answer, for example an answer along these lines (the addresses here are illustrative):

    api.example.com.    60    IN    A    192.0.2.1
    api.example.com.    60    IN    A    192.0.2.2
    api.example.com.    60    IN    A    192.0.2.3
    api.example.com.    60    IN    A    192.0.2.4

When clients (such as browsers or web services clients) receive these answers they generally try several endpoints until they find one that successfully connects. So by including all of the endpoints we gain some fault tolerance. When an endpoint is failing though, as we’ve seen before, the problem can spread and clients can incur retry timers and some delay, so it’s still desirable to remove IPs from DNS answers in a timely manner.

Infima can use the list of compartments, endpoints and their healthchecks to build what we call a RubberTree, a pre-computed decision tree of DNS answers that has answers pre-baked, ready and waiting for potential failures: a single node failing, a whole compartment failing, combinations of each and so on. This decision tree is then stored as a Route 53 configuration and can automatically handle any failures. So if the 192.0.2.3 endpoint were to fail, then an answer along these lines:

    api.example.com.    60    IN    A    192.0.2.1
    api.example.com.    60    IN    A    192.0.2.2
    api.example.com.    60    IN    A    192.0.2.4

will be returned. By having these decision trees pre-baked and always ready and waiting, Route 53 is able to react quickly to endpoint failures, which with compartmentalization means we are also ready to handle failures of any sub-service serving that endpoint.
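
The pre-computation itself is easy to picture: for every combination of failed endpoints you care to plan for, store the answer that leaves them out, so that failover becomes a lookup rather than a calculation. The toy sketch below is not the RubberTree code; it reuses the illustrative addresses from above.

# Toy version of pre-baking DNS answers for failure combinations: every subset
# of up to two failed endpoints maps to the answer that leaves them out.
from itertools import combinations

endpoints = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]

answers = {}
for k in range(0, 3):                                  # 0, 1 or 2 failures
    for failed in combinations(endpoints, k):
        remaining = tuple(ip for ip in endpoints if ip not in failed)
        answers[frozenset(failed)] = remaining

# Health checks report that 192.0.2.3 is down; the pre-baked answer is ready.
print(answers[frozenset({"192.0.2.3"})])   # ('192.0.2.1', '192.0.2.2', '192.0.2.4')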

The compartmentalization we’ve seen so far is most useful for certain kinds of errors; host-level problems, occasional crashes, application-lockups. But if the problem originates with front-end level requests themselves, for example a denial of service attack, or a “poison pill” request that triggers a calamitous bug then it can quickly infect all of your compartments. Infima also includes some neat functionality to assist in isolating even these kinds of faults, and that will be the topic of our next post.

Bonus Content: Busting Caches

I wrote that removing failing endpoints from DNS in a timely manner is important, even when there are multiple endpoints in an answer. One problem we respond to in this area is broken application-level DNS caching. Certain platforms, including many versions of Java, do not respect DNS cache lifetimes (the DNS time-to-live or TTL value), and once a DNS response has been resolved it will be used indefinitely.

One way to mitigate this problem is to use cache “busting”. Route 53 supports wildcard records (and wildcard ALIASes, CNAMEs and more). Instead of using a service name such as “api.example.com”, it is possible to use a wildcard name such as “*.api.example.com”, which will match requests for any name ending in “.api.example.com”.

An application may then be written in such a way as to resolve a partially random name, e.g. “sdsHdsk3.api.example.com”. This name, since it ends in “.api.example.com”, will still receive the right answer, but since it is a unique random name every time, it will defeat (or “bust”) any broken platform or OS DNS caching.
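
A client-side sketch of the cache-busting trick, reusing the example wildcard domain from above:

# Cache busting against broken resolver caches: prepend a random label so each
# lookup is for a name that cannot already be cached, while the wildcard
# record still directs it to the same service.
import secrets

def cache_busting_hostname(service_suffix: str = "api.example.com") -> str:
    return secrets.token_hex(4) + "." + service_suffix

print(cache_busting_hostname())   # e.g. '3f9a1c2b.api.example.com'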

– Colm MacCárthaigh