# Sizing Up Post-Quantum Signatures

Post Syndicated from Bas Westerbaan original https://blog.cloudflare.com/sizing-up-post-quantum-signatures/

Quantum computers are a boon and a bane. Originally conceived by Manin and Feyman to simulate nature efficiently, large-scale quantum computers will speed-up innovation in material sciences by orders of magnitude. Consider the technical advances enabled by the discovery of new materials (with bronze, iron, steel and silicon each ascribed their own age!); quantum computers could help to unlock the next age of innovation. Unfortunately, they will also break the majority of the cryptography that’s currently used in TLS to protect our web browsing. They fall in two categories:

1. Digital signatures, such as RSA, which ensure you’re talking to the right server.
2. Key exchanges, such as Diffie–Hellman, which are used to agree on encryption keys.

A moderately-sized stable quantum computer will easily break the signatures and key exchanges currently used in TLS using Shor’s algorithm. Luckily this can be fixed: over the last two decades, there has been great progress in so-called post-quantum cryptography. “Post quantum”, abbreviated PQ, means secure against quantum computers. Five years ago, the standards institute NIST started a public process to standardise post-quantum signature schemes and key exchanges. The outcome is expected to be announced early 2022.

At Cloudflare, we’re not just following this process closely, but are also testing the real-world performance of PQ cryptography. In our 2019 experiment with Google, we saw that we can switch to a PQ key exchange with little performance impact. Among the NIST finalists, there are many with even better performance. This is good news, as we would like to switch to PQ key exchanges as soon as possible — indeed, an attacker could intercept sensitive data today, then keep and decrypt it years into the future using a quantum computer.

## Why worry about PQ signatures today

One would think we can take it easy with signatures for TLS: we only need to have them replaced before a large quantum computer is built. The situation, however, is more complicated.

• The lead time to change signatures is higher. Not only do we need to change the browsers and servers, we also need to change certificate authorities (CAs) and everyone’s certificate management.
• TLS is addicted to small and fast signatures. For this page that you’re viewing we sent six signatures: two in the certificate chain; one handshake signature; one OCSP staple and finally two SCTs used for certificate transparency.
• PQ signature schemes have wildly varying performance trade-offs and quirks (as we’ll see below) which stack up quickly with six signatures, which all have slightly different requirements.

One might ask: can’t we be clever and get rid of some of these signatures? We think so! For instance, we can replace the handshake signature with a smaller key exchange or suppress intermediate certificates. Such fundamental changes take years to be adopted. That is why we are also investigating the performance of plain TLS with drop-in PQ signatures.

So, what are our options?

## The zoo of PQ signatures

The three finalists of the NIST competition are Dilithium, Falcon and Rainbow. In the table below we compare them against RSA and ECDSA, both of which are in common use today, and a selection of other PQ schemes that might see standardisation in the future.

None of these PQ signatures are a clear-cut drop-in replacement. To start, all have (much) larger signatures, except for Rainbow, GeMMS and SQISign. Rainbow and GeMMS have huge public keys and SQISign is very slow.

### TLS signatures

To confuse matters even more, the signatures within TLS are not all the same:

• Online. Only the handshake signature is created with every incoming TLS connection, and so signing needs to be fast. Dilithium fits this role well.
• Offline. All other signatures are made months/years in advance, and so signing time is not that important. This group splits in two:
• With a public key. The certificate chain includes signatures and their public keys. Here Falcon seems most suited.
• Without a public key. The remaining three (SCTs and OCSP staple) are just signatures. For these, Rainbow seems optimal, as its large public keys are not transmitted.

Using Dilithium, Falcon, and Rainbow, together, allows optimization for both speed and size simultaneously, which seems like a great idea. However, combining different signatures at the same time has disadvantages:

• A security issue in the design or implementation of one of the signatures compromises the whole.
• Clients need to implement multiple cryptographic algorithms, in this case three of them, which is troublesome for smaller devices — especially if separate hardware support is needed for each of them.

So do we really need to eke out every byte and every cycle of performance? Or can we stick to a single signature scheme for simplicity and security?

### Can we pick just one?

If we stick to one signature scheme, looking just at the numbers, Falcon512 seems like a reasonable option. It needs 5KB of extra space (compared to a classical handshake), about the same as the Dilithium–Falcon–Rainbow chimera of before. Unfortunately Falcon comes with a caveat: creating signatures efficiently requires constant-time 64-bit floating point arithmetic. Without it, signing is 20x slower. But speed alone is not enough; it has to run in constant time. Without that, one can slowly learn the secret key by measuring the time it takes to create a signature.

Although PCs typically have a sufficiently constant-time floating-point unit, many smaller devices do not. Thus, Falcon seems ill-suited for general purpose online signatures.

What about Dilithium2? It needs 17KB extra — let’s find out if that makes a big difference.

## Evidence by Experiment

All the different variables and constraints clearly complicate an already challenging puzzle. The best thing is to just try the options. Over the last few years several interesting papers have appeared studying the various options, such as SKD20, PST20, SKD21 and PKNLN22. These are great starts, but don’t provide a complete picture:

• SCTs and OCSP staples have yet to be considered. Leaving half (three) of the signatures out changes the results significantly.
• The networks tested or emulated offer insights, but are far from representative of real-world conditions. All tests were conducted between two datacenters (which does not include real-world last-mile conditions such as Wi-Fi or spotty mobile connections); or a network was simulated with unrealistic packet loss rates.

Here, Cloudflare can contribute. One of the things we like to do is to put new ideas in the community to the test on a global scale.

In this case we’re just taking a first step. Setting up a real-world experiment with a modified browser is quite involved, especially when we consider the many possible variations. Instead, as a first step, we decided first to investigate the most striking variable, the size, and try to answer the question:

How do larger signatures affect the TLS handshake?

There are two parts to this: how fast are they, and, more importantly, do they work at all?

### Experimental setup

We need some way to emulate bigger signatures without having to modify the clients. We considered several options. The first idea we had was to pad a valid certificate with a dummy extension. That would require a custom certificate for each size to test, which is cumbersome. Then we considered responding with a dummy ServerHello extension. This is, however, not allowed by TLS 1.2 without a corresponding ClientHello extension. In the end, we went for adding dummy certificates.

### Dummy certificates

These dummy certificates are 1kB self-signed invalid certificates that have nothing to do with the certificate chain. To vary the size to test, we simply add more copies. Adding unrelated certificates to the certificate chain is a common misconfiguration and clients have learnt to ignore them. In fact, TLS 1.3 stipulates that these (in rfc-speak) SHOULD be ignored by the client. Testing out hundreds of browsers, we saw no issues.

Standards and reality don’t always agree: when inserting dummy certificates on actual traffic, we saw issues with a small, but not insignificant number of clients. We don’t want to ruin anyone’s connection, and so we decided to use separate connections for this purpose.

### Using challenge pages to launch probes

So what did we actually do? On a small percentage of the challenge pages (those with the CAPTCHA), we pick a number n and a random key and send this key in two separate background requests to:

• 0.tls-size-experiment-c.cloudflareresearch.com
• [n].tls-size-experiment-1.cloudflareresearch.com

The first, the control, is a normal webpage that stores the TLS handshake time under the key that’s been sent. The real action happens at the second, the live, which adds the n dummy certificates to its chain. The live also stores handshake time under the given key. We could call it “experimental” instead of “live”, but the benign control connection is also an important part of the experiment. Indeed, it allows us to see if live connections are missing. These endpoints were a breeze to write using Cloudflare Workers and KV.

### How much dummy data to test?

Before launching the experiment, we tested several libraries and browsers on the live endpoint to see whether they would error due to the dummy certificates. None rejected a single certificate, but how far can we go? TLS 1.3 theoretically allows a certificate chain of 16MB, but in practice many clients reject a much shorter chain. OpenSSL, for instance, rejects one of 102kB. The most stingy we found is Go’s TLS client, which rejects a handshake larger than 64kB. Because of this, we tested with between 1 and 59 dummy certificates.

### Intermezzo: TCP’s congestion window

So, what did we find? The graphs are in the next section, have a peek! Before diving right in, we would like to explain a crucial concept, the TCP congestion window, that helps us read the results.

Data sent over the Internet is broken down in packets of around 1.4kB that traverse many routers to reach their destination. Sometimes a router has more incoming packets than it can handle and it has to drop them — this is called congestion. To avoid causing congestion, TCP initially sends just a few packets (typically ten, so ~14kB). Then, with every acknowledgement received in return, the TCP sender will very quickly ramp up the number of packets that it keeps in flight. This number is called the congestion window (cwnd). When it gets too high, congestion occurs, packets are dropped and in response the sender backs off by dialing down the congestion window. Any dropped packet is seen as a sign of congestion by TCP. For this reason, Wi-Fi has its own retransmission mechanism transparent to TCP.

Considering all this, we would expect to see two effects with larger signatures:

• Gentle slope. Every single packet needs some extra time to transmit, due to limited bandwidth and possible physical-layer retransmissions. This slope isn’t so gentle if your internet connection is slow or spotty.
• cwnd wall. Once we fill the congestion window, we have to wait for a whole roundtrip before we can continue. This effect is stronger if the roundtrip time (RTT) is higher.

The strength of the two effects can differ. With a fast connection and high RTT we expect to see the graph below on the left. With a slow connection and low RTT, we expect the one on the right.

There might be other unknown effects. The best thing is to have a look.

In PQ research, the second effect has gained the most attention. The larger signatures simply do not fit in the initial congestion windows used today. A common suggestion in response has been to simply increase the initial congestion window to accommodate the larger signatures. This is far from a simple change to make globally, and we have to understand if this solves the problem to begin with.

## Results

Over 24 days we’ve received 964,499 live connections from 454,218 different truncated IPs (to 24 bits, “/24”, for IPv4 and 48 bits for IPv6) and 11,239 different ASNs. First, let’s check how many clients had trouble with the bigger handshakes.

### Can clients handle the larger handshakes?

The control connection was missing for 2.4% of the live connections. This is not alarming: we expect some connections to be missing for harmless reasons, such as the user browsing away from the challenge page. There are, however, significantly more live connections without control connection at 3.6%.

In the graph below on the left we break the number of received live connections down by the number of dummy certificates added. Because we pick the number of certificates randomly, the graph is noisy. To get a clearer picture, we started storing the number of certificates added in the corresponding control request, which gives us the graph on the right. The bumps at 10kB and 30kB suggest that there are clients or middleboxes that cannot handle these handshake sizes.

### Handshake times with larger signatures

What is the effect on the handshake time? The graph on the left shows the weighted median and 75th percentile TLS handshake times for different amounts of dummy data added. We use the weight so that every truncated IP contributes equally. On the right we show the slowdowns for each size, relative to the handshake time of the control connection.

We can see the not-so-gentle slope until 40kB, where we hit a little wall that corresponds to Cloudflare’s default initial congestion window of 30 packets.

Adding 35kB fits within our initial congestion window. Nonetheless, the median handshake with 35kB extra is 40% slower. The slowest 10% are even worse off, taking 60% as much time. Thus even though we stay within the congestion window, the added data is not for free at all.

We can now translate these insights back to concrete PQ signatures. For example, using Dilithium2 as a drop-in replacement, we need around 17kB extra. That also fits within our initial congestion window with a median slowdown of 20%, which gets worse for the tail-end of users. For the normal initial congestion window of ten, we expect the slowdown to be much worse — around 60–80%.

There are several caveats to point out:

• These experiments used an initial congestion window of 30 packets instead of ten. With a smaller initial congestion window of ten, which is the default for most systems, we would expect the wall to move from 40kB to around 10kB.
• Because of our presence all across the world, our RTTs are fairly low. Thus the effect of the cwnd wall is smaller for us.
• Challenge pages are served, by design, to those clients that we expect to be bots. This adds a significant bias because bots are generally hosted at well-connected providers, and so are closer than users.
• HTTP/3 was not supported by the server we used for the endpoint. Support for IPv6 was only added ten days into the experiment and accounts for 10.9% of the measurements.
• Actual TLS handshakes differ in size much more than tested in this setup due to differences in certificate sizes and extensions and other factors.

## What have we learned?

The TLS handshake is just one step (~5–20%) in a long chain required to show you a webpage. Casually browsing, it would be hard to notice a TLS handshake that’s 60% slower. But such differences add up. To make a website really fast, you need many seemingly insignificant speedups. Browser developers take this seriously: only in exceptional cases does Chrome allow a change that slows down any microbenchmark by even a percent.

Because of the many parties and complexities involved, we should avoid waiting too long to adopt post-quantum signatures in TLS. That’s a hard sell if it comes at the price of a double-digit slowdown, not least to content servers but also to browser vendors and clients.

A timely adoption of PQ signatures on the web would be great. Our evidence so far suggests that this will be easiest, if six signatures and two public keys would fit in 9kB.

We will continue our efforts to help build a post-quantum secure Internet. To follow along, keep an eye on this blog or have a look at research.cloudflare.com.

Bas Westerbaan is co-submitter of the SPHINCS+ signature scheme.

## References

SKD20: Sikeridis, Kampanakis, Devetsikiotis. Assessing the overhead of post-quantum cryptography in TLS 1.3 and SSH. CoNEXT’20.
PST20: Paquin, Stebila, Tamvada. Benchmarking Post-Quantum Cryptography in TLS. PQCrypto 2020.
SKD21: Sikeridis, Kampanakis, Devetsikiotis. Post-Quantum Authentication in TLS 1.3: A Performance Study. NDSS2020.
PKNLN22: Paul, Kuzovkova, Lahr, Niederhagen. Mixed Certificate Chains for the Transition to Post-Quantum Authentication in TLS 1.3. To appear in AsiaCCS 2022.

# Using HPKE to Encrypt Request Payloads

Post Syndicated from Miguel de Moura original https://blog.cloudflare.com/using-hpke-to-encrypt-request-payloads/

The Managed Rules team was recently given the task of allowing Enterprise users to debug Firewall Rules by viewing the part of a request that matched the rule. This makes it easier to determine what specific attacks a rule is stopping or why a request was a false positive, and what possible refinements of a rule could improve it.

The fundamental problem, though, was how to securely store this debugging data as it may contain sensitive data such as personally identifiable information from submissions, cookies, and other parts of the request. We needed to store this data in such a way that only the user who is allowed to access it can do so. Even Cloudflare shouldn’t be able to see the data, following our philosophy that any personally identifiable information that passes through our network is a toxic asset.

This means we needed to encrypt the data in such a way that we can allow the user to decrypt it, but not Cloudflare. This means public key encryption.

Now we needed to decide on which encryption algorithm to use. We came up with some questions to help us evaluate which one to use:

• What requirements do we have for the algorithm?
• What language do we implement it in?
• How do we make this as secure as possible for users?

Here’s how we made those decisions.

### Algorithm Requirements

While we knew we needed to use public key encryption, we also needed to keep an eye on performance. This led us to select Hybrid Public Key Encryption (HPKE) early on as it has a best-of-both-worlds approach to using symmetric as well as public-key cryptography to increase performance. While these best-of-both-worlds schemes aren’t new [1][2][3], HPKE aims to provide a single, future-proof, robust, interoperable combination of a general key encapsulation mechanism and a symmetric encryption algorithm.

HPKE is an emerging standard developed by the Crypto Forum Research Group (CFRG), the research body that supports the development of Internet standards at the IETF. The CFRG produces specifications called RFCs (such as RFC 7748 for elliptic curves) that are then used in higher level protocols including two we talked about previously: ODoH and ECH. Cloudflare has long been a supporter of Internet standards, so HPKE was a natural choice to use for this feature. Additionally, HPKE was co-authored by one of our colleagues at Cloudflare.

### How HPKE Works

HPKE combines an asymmetric algorithm such as elliptic curve Diffie-Hellman and a symmetric cipher such as AES. One of the upsides of HPKE is that the algorithms aren’t dictated to the implementer, but making a combination that’s provably secure and meets the developer’s intuitive notions of security is important. All too often developers reach for a scheme without carefully understanding what it does, resulting in security vulnerabilities.

HPKE solves these problems by providing a high level of security in a generic manner and providing necessary hooks to tie messages to the context in which they are generated. This is the application of decades of research into the correct security notions and schemes.

HPKE is built in stages. First it turns a Diffie-Hellman key agreement into a Key Encapsulation Mechanism. A key encapsulation mechanism has two algorithms: Encap and Decap. The Encap algorithm creates a symmetric secret and wraps it in a public key, so that only the holder of the private key can unwrap it. An attacker with the encapsulation cannot recover the random key. Decap takes the encapsulation and the private key associated to the public key, and computes the same random key. This translation gives HPKE the flexibility to work almost unchanged with any kind of public key encryption or key agreement algorithm.

HPKE mixes this key with an optional info argument, as well as information relating to the cryptographic parameters used by each side. This ensures that attackers cannot modify messages’ meaning by taking them out of context. A postcard marked “So happy to see you again soon” is ominous from the dentist and endearing from one’s grandmother.

The specification for HPKE is open and available on the IETF website. It is on its way to becoming an RFC after passing multiple rounds of review and analysis by cryptography experts at the CFRG. HPKE is already gaining adoption in IETF protocols like ODoH, ECH, and the new Messaging Layer Security (MLS) protocol. HPKE is also designed with the post-quantum future since it is built to work with any KEM, including all the NIST finalists for post-quantum public-key encryption.

### Implementation Language

Once we had an encryption scheme selected, we needed to settle on an implementation. HPKE is still fairly new, so the libraries aren’t quite mature yet. There is a reference implementation, and we’re in the process of developing an implementation in Go as part of CIRCL. However, in the absence of a clear “go to” that is widely known to be the best, we decided to go with an implementation leveraging the same language already powering much of the Firewall code running at the Cloudflare edge – Rust.

Aside from this, the language benefits from features like native primitives, and crucially the ability to easily compile to WebAssembly (WASM).

As we mentioned in a previous blog post, customers are able to generate a key pair and decrypt payloads either from the dashboard UI or from a CLI. Instead of writing and maintaining two different codebases for these, we opted to reuse the same implementation across the edge component that encrypts the payloads and the UI and CLI that decrypt them. To achieve this we compile our library to target WASM so it can be used in the dashboard UI code that runs in the browser. While this approach may yield a slightly larger JavaScript bundle size and relatively small computational overhead, we found it preferable to spending a significant amount of time securely re-implementing HPKE using JavaScript WebCrypto primitives.

The HPKE implementation we decided on comes with the caveat of not yet being formally audited, so we performed our own internal security review. We analyzed the cryptography primitives being used and the corresponding libraries. Between the composition of said primitives and secure programming practices like correctly zeroing memory and safe usage of random number generators, we found no security issues.

### Making It Secure For Users

To encrypt on behalf of users, we need them to provide us with a public key. To make this as easy as possible, we built a CLI tool along with the ability to do it right in the browser. Either option allows the user to generate a public/private key pair without needing to talk to Cloudflare servers at all.

In our API, we specifically do not accept the private key of the key pair — we don’t want it! We don’t need and don’t want to be able to decrypt the data we’re storing.

For the dashboard, once the user provides the private key for decryption, the key is held in a temporary JavaScript variable and used for the in-browser decryption. This allows the user to not constantly have to provide the key while browsing the Firewall event logs. The private key is also not persisted in any way in the browser, so any action that refreshes the page such as refreshing or navigating away will require the user to provide the key again. We believe this is an acceptable usability compromise for better security.

After deciding how to encrypt the data, we just had to figure out the rest of the feature: what data to encrypt, how to store and transmit it, and how to allow users to decrypt it.

When an HTTP request reaches the L7 Firewall, it is evaluated against a set of rulesets. Each of these rulesets contain several rules written in the wirefilter syntax.

An example of one such rule would be:

http.request.version eq "HTTP/1.1"
and
(
http.request.uri.path matches "\n+."
or
http.request.uri.query matches "\x00+."
)


This expression evaluates to a boolean “true” for HTTP/1.1 requests that either contain one or more newlines followed by a character in the request path or one or more NULL bytes followed by a character in the query string.

Say we had the following request that would match the rule above:

GET /cms/%0Aadmin?action=%00post HTTP/1.1
Host: example.com


If matched data logging is enabled, the rules that match would be executed again in a special context that tags all fields that are accessed during execution. We do this second execution because this tagging adds a noticeable computational overhead, and since the vast majority of requests don’t trigger a rule at all we would be unnecessarily adding overhead to each request. Requests that do match any rules will only match a few rules as well, so we don’t need to re-execute a large portion of the ruleset.

You may notice that although http.request.uri.query matches "\x00+." evaluates to true for this request, it won’t be executed, because the expression short-circuits with the first or condition that also matches. This results in only http.request.version and http.request.uri.path being tagged as accessed:

http.request.version -> HTTP/1.1


Having gathered the fields that were accessed, the Firewall engine does some post-processing; removing fields that are a subset of others (e.g., the query string and the full URI), or truncating fields that are beyond a certain character length.

Finally, these get serialized as JSON, encrypted with the customer’s public key, serialized again as a set of bytes, and prefixed with a version number should we need to change/update it in the future. To simplify consumption of these blobs, our APIs display a base64 encoded version of the bytes:

Now that we have encrypted the data at the edge and persisted it in ClickHouse, we need to allow users to decrypt it. As part of the setup of turning this feature on, users generated a key-pair: the public key which was used to encrypt the payloads and a private key which is used to decrypt them. Decryption is done completely offline via either the command line using cloudflare/matched-data-cli:

$MATCHED_DATA=AkjQDktMX4FudxeQhwa0UPNezhkgLAUbkglNQ8XVCHYqPgAAAAAAAACox6cEwqWQpFVE2gCFyOFsSdm2hCoE0/oWKXZJGa5UPd5mWSRxNctuXNtU32hcYNR/azLjsGO668Jwk+qCdFvmKjEqEMJgI+fvhwLQmm4=$ matched-data-cli decrypt -d $MATCHED_DATA -k$PRIVATE_KEY


Or the dashboard UI:

Since our CLI tool is open-source and HPKE is interoperable, it can also be used in other tooling as part of a user’s logging pipeline, for example in security information and event management (SIEM) software.

### Conclusion

This was a team effort with help from our Research and Security teams throughout the process. We relied on them for recommendations on how best to evaluate the algorithms as well as vetting the libraries we wanted to use.

We’re very pleased with how HPKE has worked out for us from an ease-of-implementation and performance standpoint. It was also an easy choice for us to make due to its impending standardization and best-of-both-worlds approach to security.

# Round 2 post-quantum TLS is now supported in AWS KMS

Post Syndicated from Alex Weibel original https://aws.amazon.com/blogs/security/round-2-post-quantum-tls-is-now-supported-in-aws-kms/

AWS Key Management Service (AWS KMS) now supports three new hybrid post-quantum key exchange algorithms for the Transport Layer Security (TLS) 1.2 encryption protocol that’s used when connecting to AWS KMS API endpoints. These new hybrid post-quantum algorithms combine the proven security of a classical key exchange with the potential quantum-safe properties of new post-quantum key exchanges undergoing evaluation for standardization. The fastest of these algorithms adds approximately 0.3 milliseconds of overheard compared to a classical TLS handshake. The new post-quantum key exchange algorithms added are Round 2 versions of Kyber, Bit Flipping Key Encapsulation (BIKE), and Supersingular Isogeny Key Encapsulation (SIKE). Each organization has submitted their algorithms to the National Institute of Standards and Technology (NIST) as part of NIST’s post-quantum cryptography standardization process. This process spans several rounds of evaluation over multiple years, and is likely to continue beyond 2021.

In our previous hybrid post-quantum TLS blog post, we announced that AWS KMS had launched hybrid post-quantum TLS 1.2 with Round 1 versions of BIKE and SIKE. The Round 1 post-quantum algorithms are still supported by AWS KMS, but at a lower priority than the Round 2 algorithms. You can choose to upgrade your client to enable negotiation of Round 2 algorithms.

## Why post-quantum TLS is important

A large-scale quantum computer would be able to break the current public-key cryptography that’s used for key exchange in classical TLS connections. While a large-scale quantum computer isn’t available today, it’s still important to think about and plan for your long-term security needs. TLS traffic using classical algorithms recorded today could be decrypted by a large-scale quantum computer in the future. If you’re developing applications that rely on the long-term confidentiality of data passed over a TLS connection, you should consider a plan to migrate to post-quantum cryptography before the lifespan of the sensitivity of your data would be susceptible to an unauthorized user with a large-scale quantum computer. As an example, this means that if you believe that a large-scale quantum computer is 25 years away, and your data must be secure for 20 years, you should migrate to post-quantum schemes within the next 5 years. AWS is working to prepare for this future, and we want you to be prepared too.

We’re offering this feature now instead of waiting for standardization efforts to be complete so you have a way to measure the potential performance impact to your applications. Offering this feature now also gives you the protection afforded by the proposed post-quantum schemes today. While we believe that the use of this feature raises the already high security bar for connecting to AWS KMS endpoints, these new cipher suites will impact bandwidth utilization and latency. However, using these new algorithms could also create connection failures for intermediate systems that proxy TLS connections. We’d like to get feedback from you on the effectiveness of our implementation or any issues found so we can improve it over time.

## Hybrid post-quantum TLS 1.2

Hybrid post-quantum TLS is a feature that provides the security protections of both the classical and post-quantum key exchange algorithms in a single TLS handshake. Figure 1 shows the differences in the connection secret derivation process between classical and hybrid post-quantum TLS 1.2. Hybrid post-quantum TLS 1.2 has three major differences from classical TLS 1.2:

• The negotiated post-quantum key is appended to the ECDHE key before being used as the hash-based message authentication code (HMAC) key.
• The text hybrid in its ASCII representation is prepended to the beginning of the HMAC message.
• The entire client key exchange message from the TLS handshake is appended to the end of the HMAC message.

Figure 1: Differences in the connection secret derivation process between classical and hybrid post-quantum TLS 1.2

## Some background on post-quantum TLS

Today, all requests to AWS KMS use TLS with key exchange algorithms that provide perfect forward secrecy and use one of the following classical schemes:

While existing FFDHE and ECDHE schemes use perfect forward secrecy to protect against the compromise of the server’s long-term secret key, these schemes don’t protect against large-scale quantum computers. In the future, a sufficiently capable large-scale quantum computer could run Shor’s Algorithm to recover the TLS session key of a recorded classical session, and thereby gain access to the data inside. Using a post-quantum key exchange algorithm during the TLS handshake protects against attacks from a large-scale quantum computer.

The possibility of large-scale quantum computing has spurred the development of new quantum-resistant cryptographic algorithms. NIST has started the process of standardizing post-quantum key encapsulation mechanisms (KEMs). A KEM is a type of key exchange that’s used to establish a shared symmetric key. AWS has chosen three NIST KEM submissions to adopt in our post-quantum efforts:

Hybrid mode ensures that the negotiated key is as strong as the weakest key agreement scheme. If one of the schemes is broken, the communications remain confidential. The Internet Engineering Task Force (IETF) Hybrid Post-Quantum Key Encapsulation Methods for Transport Layer Security 1.2 draft describes how to combine post-quantum KEMs with ECDHE to create new cipher suites for TLS 1.2.

These cipher suites use a hybrid key exchange that performs two independent key exchanges during the TLS handshake. The key exchange then cryptographically combines the keys from each into a single TLS session key. This strategy combines the proven security of a classical key exchange with the potential quantum-safe properties of new post-quantum key exchanges being analyzed by NIST.

## The effect of hybrid post-quantum TLS on performance

Post-quantum cipher suites have a different performance profile and bandwidth usage from traditional cipher suites. AWS has measured bandwidth and latency across 2,000 TLS handshakes between an Amazon Elastic Compute Cloud (Amazon EC2) C5n.4xlarge client and the public AWS KMS endpoint, which were both in the us-west-2 Region. Your own performance characteristics might differ, and will depend on your environment, including your:

• Hardware–CPU speed and number of cores.
• Existing workloads–how often you call AWS KMS and what other work your application performs.
• Network–location and capacity.

The following graphs and table show latency measurements performed by AWS for all newly supported Round 2 post-quantum algorithms, in addition to the classical ECDHE key exchange algorithm currently used by most customers.

Figure 2 shows the latency differences of all hybrid post-quantum algorithms compared with classical ECDHE alone, and shows that compared to ECDHE alone, SIKE adds approximately 101 milliseconds of overhead, BIKE adds approximately 9.5 milliseconds of overhead, and Kyber adds approximately 0.3 milliseconds of overhead.

Figure 2: TLS handshake latency at varying percentiles for four key exchange algorithms

Figure 3 shows the latency differences between ECDHE with Kyber, and ECDHE alone. The addition of Kyber adds approximately 0.3 milliseconds of overhead.

Figure 3: TLS handshake latency at varying percentiles, with only top two performing key exchange algorithms

The following table shows the total amount of data (in bytes) needed to complete the TLS handshake for each cipher suite, the average latency, and latency at varying percentiles. All measurements were gathered from 2,000 TLS handshakes. The time was measured on the client from the start of the handshake until the handshake was completed, and includes all network transfer time. All connections used RSA authentication with a 2048-bit key, and ECDHE used the secp256r1 curve. All hybrid post-quantum tests used the NIST Round 2 versions. The Kyber test used the Kyber-512 parameter, the BIKE test used the BIKE-1 Level 1 parameter, and the SIKE test used the SIKEp434 parameter.

 Item Bandwidth(bytes) Totalhandshakes Average(ms) p0(ms) p50(ms) p90(ms) p99(ms) ECDHE (classic) 3,574 2,000 3.08 2.07 3.02 3.95 4.71 ECDHE + Kyber R2 5,898 2,000 3.36 2.38 3.17 4.28 5.35 ECDHE + BIKE R2 12,456 2,000 14.91 11.59 14.16 18.27 23.58 ECDHE + SIKE R2 4,628 2,000 112.40 103.22 108.87 126.80 146.56

By default, the AWS SDK client performs a TLS handshake once to set up a new TLS connection, and then reuses that TLS connection for multiple requests. This means that the increased cost of a hybrid post-quantum TLS handshake is amortized over multiple requests sent over the TLS connection. You should take the amortization into account when evaluating the overall additional cost of using post-quantum algorithms; otherwise performance data could be skewed.

AWS KMS has chosen Kyber Round 2 to be KMS’s highest prioritized post-quantum algorithm, with BIKE Round 2, and SIKE Round 2 next in priority order for post-quantum algorithms. This is because Kyber’s performance is closest to the classical ECDHE performance that most AWS KMS customers are using today and are accustomed to.

## How to use hybrid post-quantum cipher suites

To use the post-quantum cipher suites with AWS KMS, you need the preview release of the AWS Common Runtime (CRT) HTTP client for the AWS SDK for Java 2.x. Also, you will need to configure the AWS CRT HTTP client to use the s2n post-quantum hybrid cipher suites. Post-quantum TLS for AWS KMS is available in all AWS Regions except for AWS GovCloud (US-East), AWS GovCloud (US-West), AWS China (Beijing) Region operated by Beijing Sinnet Technology Co. Ltd (“Sinnet”), and AWS China (Ningxia) Region operated by Ningxia Western Cloud Data Technology Co. Ltd. (“NWCD”). Since NIST has not yet standardized post-quantum cryptography, connections that require Federal Information Processing Standards (FIPS) compliance cannot use the hybrid key exchange. For example, kms.<region>.amazonaws.com supports the use of post-quantum cipher suites, while kms-fips.<region>.amazonaws.com does not.

1. If you’re using the AWS SDK for Java 2.x, you must add the preview release of the AWS Common Runtime client to your Maven dependencies.
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>aws-crt-client</artifactId>
<version>2.14.13-PREVIEW</version>
</dependency>


2. You then must configure the new SDK and cipher suite in the existing initialization code of your application:
if(!TLS_CIPHER_PREF_KMS_PQ_TLSv1_0_2020_07.isSupported()){
throw new RuntimeException("Post Quantum Ciphers not supported on this Platform");
}

SdkAsyncHttpClient awsCrtHttpClient = AwsCrtAsyncHttpClient.builder()
.tlsCipherPreference(TLS_CIPHER_PREF_KMS_PQ_TLSv1_0_2020_07)
.build();

KmsAsyncClient kms = KmsAsyncClient.builder()
.httpClient(awsCrtHttpClient)
.build();

ListKeysResponse response = kms.listKeys().get();


Now, all connections made to AWS KMS in supported Regions will use the new hybrid post-quantum cipher suites! To see a complete example of everything set up, check out the example application here.

## Things to try

Here are some ideas about how to use this post-quantum-enabled client:

• Run load tests and benchmarks. These new cipher suites perform differently than traditional key exchange algorithms. You might need to adjust your connection timeouts to allow for the longer handshake times or, if you’re running inside an AWS Lambda function, extend the execution timeout setting.
• Try connecting from different locations. Depending on the network path your request takes, you might discover that intermediate hosts, proxies, or firewalls with deep packet inspection (DPI) block the request. This could be due to the new cipher suites in the ClientHello or the larger key exchange messages. If this is the case, you might need to work with your security team or IT administrators to update the relevant configuration to unblock the new TLS cipher suites. We’d like to hear from you about how your infrastructure interacts with this new variant of TLS traffic. If you have questions or feedback, please start a new thread on the AWS KMS discussion forum.

## Conclusion

In this blog post, I announced support for Round 2 hybrid post-quantum algorithms in AWS KMS, and showed you how to begin experimenting with hybrid post-quantum key exchange algorithms for TLS when connecting to AWS KMS endpoints.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

### Alex Weibel

Alex is a Senior Software Engineer on the AWS Crypto Algorithms team. He’s one of the maintainers for Amazon’s TLS Library s2n. Previously, Alex worked on TLS termination and request proxying for S3 and the Elastic Load Balancing Service developing new features for customers. Alex holds a Bachelor of Science degree in Computer Science from the University of Texas at Austin.

# Fall 2020 RPKI Update

Post Syndicated from Louis Poinsignon original https://blog.cloudflare.com/rpki-2020-fall-update/

The Internet is a network of networks. In order to find the path between two points and exchange data, the network devices rely on the information from their peers. This information consists of IP addresses and Autonomous Systems (AS) which announce the addresses using Border Gateway Protocol (BGP).

One problem arises from this design: what protects against a malevolent peer who decides to announce incorrect information? The damage caused by route hijacks can be major.

Routing Public Key Infrastructure (RPKI) is a framework created in 2008. Its goal is to provide a source of truth for Internet Resources (IP addresses) and ASes in signed cryptographically signed records called Route Origin Objects (ROA).

Recently, we’ve seen the significant threshold of two hundred thousands of ROAs being passed. This represents a big step in making the Internet more secure against accidental and deliberate BGP tampering.

We have talked about RPKI in the past but we thought it would be a good time for an update.

In a more technical context, the RPKI framework consists of two parts:

• IP addresses need to be cryptographically signed by their owners in a database managed by a Trust Anchor: Afrinic, APNIC, ARIN, LACNIC and RIPE. Those five organizations are in charge of allocating Internet resources. The ROA indicates which Network Operator is allowed to announce the addresses using BGP.
• Network operators download the list of ROAs, perform the cryptographic checks and then apply filters on the prefixes they receive: this is called BGP Origin Validation.

### The “Is BGP Safe Yet” website

The launch of the website isbgpsafeyet.com to test if your ISP correctly performs BGP Origin Validation was a success. Since launch, it has been visited more than five million times from over 223 countries and 13,000 unique networks (20% of the entire Internet), generating half a million BGP Origin Validation tests.

Many providers subsequently indicated on social media (for example, here or here) that they had an RPKI deployment in the works. This increase in Origin Validation by networks is increasing the security of the Internet globally.

The site’s test for Origin Validation consists of queries toward two addresses, one of which is behind an RPKI invalid prefix and the other behind an RPKI valid prefix. If the query towards the invalid succeeds, the test fails as the ISP does not implement Origin Validation. We counted the number of queries that failed to reach invalid.cloudflare.com. This also included a few thousand RIPE Atlas tests that were started by Cloudflare and various contributors, providing coverage for smaller networks.

Every month since launch we’ve seen that around 10 to 20 networks are deploying RPKI Origin Validation. Among the major providers we can build the following table:

Month Networks
August Swisscom (Switzerland), Salt (Switzerland)
June Colocrossing (USA), Get Norway (Norway), Vocus (Australia), Hurricane Electric (Worldwide), Cogent (Worldwide)
May Sengked Fiber (Indonesia), Online.net (France), WebAfrica Networks (South Africa), CableNet (Cyprus), IDnet (Indonesia), Worldstream (Netherlands), GTT (Worldwide)

With the help of many contributors, we have compiled a list of network operators and public statements at the top of the isbgpsafeyet.com page.

We excluded providers that manually blocked the traffic towards the prefix instead of using RPKI. Among the techniques we see are firewall filtering and manual prefix rejection. The filtering is often propagated to other customer ISPs. In a unique case, an ISP generated a “more-specific” blackhole route that leaked to multiple peers over the Internet.

The deployment of RPKI by major transit providers, also known as Tier 1, such as Cogent, GTT, Hurricane Electric, NTT and Telia made many downstream networks more secure without them having them deploying validation software.

Overall, we looked at the evolution of the successful tests per ASN and we noticed a steady increase over the recent months of 8%.

Furthermore, when we probed the entire IPv4 space this month, using a similar technique to the isbgpsafeyet.com test, many more networks were not able to reach an RPKI invalid prefix than compared to the same period last year. This confirms an increase of RPKI Origin Validation deployment across all network operators. The picture below shows the IPv4 space behind a network with RPKI Origin Validation enabled in yellow and the active space in blue. It uses a Hilbert Curve to efficiently plot IP addresses: for example one /20 prefix (4096 IPs) is a pixel, a /16 prefix (65536 IPs) will form a 4×4 pixels square.

The more the yellow spreads, the safer the Internet becomes.

What does it mean exactly? If you were hijacking a prefix, the users behind the yellow space would likely not be affected. This also applies if you miss-sign your prefixes: you would not be able to reach the services or users behind the yellow space. Once RPKI is enabled everywhere, there will only be yellow squares.

### Progression of signed prefixes

Owners of IP addresses indicate the networks allowed to announce them. They do this by signing prefixes: they create Route Origin Objects (ROA). As of today, there are more than 200,000 ROAs. The distribution shows that the RIPE region is still leading in ROA count, then followed by the APNIC region.

2020 started with 172,000 records and the count is getting close to 200,000 at the beginning of November, approximately a quarter of all the Internet routes. Since last year, the database of ROAs grew by more than 70 percent, from 100,000 records, an average pace of 5% every month.

On the following graph of unique ROAs count per day, we can see two points that were followed by a change in ROA creation rate: 140/day, then 231/day, and since August, 351 new ROAs per day.

It is not yet clear what caused the increase in August.

### Free services and software

In 2018 and 2019, Cloudflare was impacted by BGP route hijacks. Both could have been avoided with RPKI. Not long after the first incident, we started signing prefixes and developing RPKI software. It was necessary to make BGP safer and we wanted to do more than talk about it. But we also needed enough networks to be deploying RPKI as well. By making deployment easier for everyone, we hoped to increase adoption.

The following is a reminder of what we built over the years around RPKI and how it grew.

OctoRPKI is Cloudflare’s open source RPKI Validation software. It periodically generates a JSON document of validated prefixes that we pass onto our routers using GoRTR. It generates most of the data behind the graphs here.

The latest version, 1.2.0, of OctoRPKI was released at the end of October. It implements important security fixes, better memory management and extended logging. This is the first validator to provide detailed information around cryptographically invalid records into Sentry and performance data in distributed tracing tools.
GoRTR remains heavily used in production, including by transit providers. It can natively connect to other validators like rpki-client.

When we released our public rpki.json endpoint in early 2019, the idea was to enable anyone to see what Cloudflare was filtering.

The file is also used as a bootstrap by GoRTR, so that users can test a deployment. The file is cached on more than 200 data centers, ensuring quick and secure delivery of a list of valid prefixes, making RPKI more accessible for smaller networks and developers.

Between March 2019 and November 2020, the number of queries more than doubled and there are five times more networks querying this file.

The growth of queries follows approximately the rate of ROA creation (~5% per month).

A public RTR server is also available on rtr.rpki.cloudflare.com. It includes a plaintext endpoint on port 8282 and an SSH endpoint on port 8283. This allows us to test new versions of GoRTR before release.

Later in 2019, we also built a public dashboard where you can see in-depth RPKI validation. With a GraphQL API, you can now explore the validation data, test a list of prefixes, or see the status of the current routing table.

Currently, the API is used by BGPalerter, an open-source tool that detects routing issues (including hijacks!) from a stream of BGP updates.

Additionally, starting in November, you can access the historical data from May 2019. Data is computed daily and contains the unique records. The team behind the dashboard worked hard to provide a fast and accurate visualization of the daily ROA changes and the volumes of files changed over the day.

### The future

We believe RPKI is going to continue growing, and we would like to thank the hundreds of network engineers around the world who are making the Internet routing more secure by deploying RPKI.

25% of routes are signed and 20% of the Internet is doing origin validation and those numbers grow everyday. We believe BGP will be safer before reaching 100% of deployment; for instance, once the remaining transit providers enable Origin Validation, it is unlikely a BGP hijack will make it to the front page of world news outlets.

While difficult to quantify, we believe that critical mass of protected resources will be reached in late 2021.

We will keep improving the tooling; OctoRPKI and GoRTR are open-source and we welcome contributions. In the near future, we plan on releasing a packaged version of GoRTR that can be directly installed on certain routers. Stay tuned!

# NTS is now an RFC

Post Syndicated from Watson Ladd original https://blog.cloudflare.com/nts-is-now-rfc/

Earlier today the document describing Network Time Security for NTP officially became RFC 8915. This means that Network Time Security (NTS) is officially part of the collection of protocols that makes the Internet work. We’ve changed our time service to use the officially assigned port of 4460 for NTS key exchange, so you can use our service with ease. This is big progress towards securing a ubiquitous Internet protocol.

Over the past months we’ve seen many users of our time service, but very few using Network Time Security. This leaves computers vulnerable to attacks that imitate the server they use to obtain NTP. Part of the problem was the lack of available NTP daemons that supported NTS. That problem is now solved: chrony and ntpsec both support NTS.

Time underlies the security of many of the protocols such as TLS that we rely on to secure our online lives. Without accurate time, there is no way to determine whether or not credentials have expired. The absence of an easily deployed secure time protocol has been a problem for Internet security.

Without NTS or symmetric key authentication there is no guarantee that your computer is actually talking NTP with the computer you think it is. Symmetric key authentication is difficult and painful to set up, but until recently has been the only secure and standardized mechanism for authenticating NTP.  NTS uses the work that goes into the Web Public Key Infrastructure to authenticate NTP servers and ensure that when you set up your computer to talk to time.cloudflare.com, that’s the server your computer gets the time from.

Our involvement in developing and promoting NTS included making a specialized server and releasing the source code, participation in the standardization process, and much working with implementers to hunt down bugs. We also set up our time service with support for NTS from the beginning, and it was a useful resource for implementers to test interoperability.

When Cloudflare supported TLS 1.3 browsers were actively updating, and so deployment quickly took hold. However, the long tail of legacy installs and extended support releases slowed adoption. Similarly until Let’s Encrypt made encryption easy for webservers most web traffic was not encrypted.

By contrast ssh quickly displaced telnet as the way to access remote systems: the security benefits were substantial, and the experience was better. Adoption of protocols is slow, but when there is a real security need it can be much faster. NTS is a real security improvement that is vital to adopt. We’re proud to continue making the Internet a better place by supporting secure protocols.

We hope that operating systems will incorporate NTS support and TLS 1.3 in their supplied NTP daemons. We also urge administrators to deploy NTS as quickly as possible, and NTP server operators to adopt NTS. With Let’s Encrypt provided certificates this is simpler than it has been in the past

We’re continuing our work in this area with the continued development of the Roughtime protocol for even better security as well as engagement with the standardization process to help develop the future of Internet time.

Cloudflare is willing to allow any device to point to time.cloudflare.com and supports NTS. Just as our Universal SSL made it easy for any website to get the security benefits of TLS, our time service makes it easy for any computer to get the benefits of secure time.

# Internship Experience: Cryptography Engineer

Post Syndicated from Watson Ladd original https://blog.cloudflare.com/internship-experience-cryptography-engineer/

Back in the summer of 2017 I was an intern at Cloudflare. During the scholastic year I was a graduate student working on automorphic forms and computational Langlands at Berkeley: a part of number theory with deep connections to representation theory, aimed at uncovering some of the deepest facts about number fields. I had also gotten involved in Internet standardization and security research, but much more on the applied side.

While I had published papers in computer security and had coded for my dissertation, building and deploying new protocols to production systems was going to be new. Going from the academic environment of little day to day supervision to the industrial one of more direction; from greenfield code that would only ever be run by one person to large projects that had to be understandable by a team; from goals measured in years or even decades, to goals measured in days, weeks, or quarters; these transitions would present some challenges.

Cloudflare at that stage was a very different company from what it is now. Entire products and offices simply did not exist. Argo, now a mainstay of our offering for sophisticated companies, was slowly emerging. Access, which has been helping safeguard employees working from home these past weeks, was then experiencing teething issues. Workers was being extensively developed for launch that autumn. Quicksilver was still in the slow stages of replacing KyotoTycoon. Lisbon wasn’t on the map, and Austin was very new.

### Day 1

My first job was to get my laptop working. Quickly I discovered that despite the promise of using either Mac or Linux, only Mac was supported as a local development environment. Most Linux users would take a good part of a month to tweak all the settings and get the local development environment up. I didn’t have months. After three days, I broke down and got a Mac.

Needless to say I asked for some help. Like a drowning man in quicksand, I managed to attract three engineers to this near insoluble problem of the edge dev stack, and after days of hacking on it, fixing problems that had long been ignored, we got it working well enough to test a few things. That development environment is now gone and replaced with one built Kubernetes VMs, and works much better that way. When things work on your machine, you can now send everyone your machine.

### Speeding up

With setup complete enough, it was on to the problem we needed to solve. Our goal was to implement a set of three interrelated Internet drafts, one defining secondary certificates, one defining external authentication with TLS certificates, and a third permitting servers to advertise the websites they could serve.

External authentication is a TLS feature that permits a server or a client on an already opened connection to prove its possession of the private key of another certificate. This proof of possession is tied to the TLS connection, avoiding attacks on bearer tokens caused by the lack of this binding.

Secondary certificates is an HTTP/2 feature enabling clients and servers to send certificates together with proof that they actually know the private key. This feature has many applications such as certificate-based authentication, but also enables us to prove that we are permitted to serve the websites we claim to serve.

The last draft was the HTTP/2 ORIGIN frame. The ORIGIN frame enables a website to advertise other sites that it could serve, permitting more connection reuse than allowed under the traditional rules. Connection reuse is an important part of browser performance as it avoids much of the setup of a connection.

These drafts solved an important problem for Cloudflare. Many resources such as JavaScript, CSS, and images hosted by one website can be used by others. Because Cloudflare proxies so many different websites, our servers have often cached these resources as well. Browsers though, do not know that these different websites are made faster by Cloudflare, and as a result they repeat all the steps to request the subresources again. This takes unnecessary time since there is an established and usable perfectly good connection already. If the browser could know this, it could use the connection again.

We could only solve this problem by getting browsers and the broader community of TLS implementers on board. Some of these drafts such as external authentication and secondary certificates had a broader set of motivations, such as getting certificate based authentication to work with HTTP/2 and TLS 1.3. All of these needs had to be addressed in the drafts, even if we were only implementing a subset of the uses.

Successful standards cover the use cases that are needed while being simple enough to implement and achieve interoperability. Implementation experience is essential to achieving this success: a standard with no implementations fails to incorporate hard won lessons. Computers are hard.

### Prototype

My first goal was to set up a simple prototype to test the much more complex production implementation, as well as to share outside of Cloudflare so that others could have confidence in their implementations. But these drafts that had to be implemented in the prototype were incremental improvements to an already massive stack of TLS and HTTP standards.

I decided it would be easiest to build on top of an already existing implementation of TLS and HTTP. I picked the Go standard library as my base: it’s simple, readable, and in a language I was already familiar with. There was already a basic demo showcasing support in Firefox for the ORIGIN frame, and it would be up to me to extend it.

Using that as my starting point I was able in 3 weeks to set up a demonstration server and a client. This showed good progress, and that nothing in the specification was blocking implementation. But without integrating it into our servers for further experimentation so that we might  discover rare issues that could be showstoppers. This was a bitter lesson learned from TLS 1.3, where it took months to track down a single brand of printer that was incompatible with the standard, and forced a change.

### From Prototype to Production

We also wanted to understand the benefits with some real world data, to convince others that this approach was worthwhile. Our position as a provider to many websites globally gives us diverse, real world data on performance that we use to make our products better, and perhaps more important, to learn lessons that help everyone make the Internet better. As a result we had to implement this in production: the experimental framework for TLS 1.3 development had been removed and we didn’t have an environment for experimentation.

At the time everything at Cloudflare was based on variants of NGINX. We had extended it with modules to implement features like Keyless and customized certificate handling to meet our needs, but much of the business logic was and is carried out in Lua via OpenResty.

Lua has many virtues, but at the time both the TLS termination and the core business logic lived in the same repo despite being different processes at runtime. This made it very difficult to understand what code was running when, and changes to basic libraries could create problems for both. The build system for this creation had the significant disadvantage of building the same targets with different settings. Lua also is a very dynamic language, but unlike the dynamic languages I was used to, there was no way to interact with the system as it was running on requests.

The first step was implementing the ORIGIN frame. In implementing this, we had to figure out which sites hosted the subresources used by the page we were serving. Luckily, we already had this logic to enable server push support driven by Link headers. Building on this let me quickly get ORIGIN working.

This work wasn’t the only thing I was up to as an intern. I was also participating in weekly team meetings, attending our engineering presentations, and getting a sense of what life was like at Cloudflare. We had an excursion for interns to the Computer History Museum in Mountain View and Moffett Field, where we saw the base museum.

The next challenge was getting the CERTIFICATE frame to work. This was a much deeper problem. NGINX processes a request in phases, and some of the phases, like the header processing phase, do not permit network I/O without locking up the event loop. Since we are parsing the headers to determine what to send, the frame is created in the header processing phase. But finding a certificate and telling Keyless to sign it required network I/O.

The standard solution to this problem is to have Lua execute a timer callback, in which network I/O is possible. But this context doesn’t have any data from the request: some serious refactoring was needed to create a way to get the keyless module to function outside the context of a request.

Once the signature was created, the battle was half over. Formatting the CERTIFICATE frame was simple, but it had to be stuck into the connection associated with the request that had demanded it be created. And there was no reason to expect the request was still alive, and no way to know what state it was in when the request was handled by the Keyless module.

To handle this issue I made a shared btree indexed by a number containing space for the data to be passed back and forth. This enabled the request to record that it was ready to send the CERTIFICATE frame and Keyless to record that it was ready with a frame to send. Whichever of these happened second would do the work to enqueue the frame to send out.

This was not an easy solution: the Keyless module had been written years before and largely unmodified. It fundamentally assumed it could access data from the request, and changing this assumption opened the door to difficult to diagnose bugs. It integrates into BoringSSL callbacks through some pretty tricky mechanisms.

However, I was able to test it using the client from the prototype and it worked. Unfortunately when I pushed the commit in which it worked upstream, the CI system could not find the git repo where the client prototype was due to a setting I forgot to change. The CI system unfortunately didn’t associate this failure with the branch, but attempted to check it out whenever it checked out any other branch people were working on. Murphy ensured my accomplishment had happened on a Friday afternoon Pacific time, and the team that manages the SSL server was then exclusively in London…

Monday morning the issue was quickly fixed, and whatever tempers had frayed were smoothed over when we discovered the deficiency in the CI system that had enabled a single branch to break every build. It’s always tricky to work in a global team. Later Alessandro flew to San Francisco for a number of projects with the team here and we worked side by side trying to get a demonstration working on a test site. Unfortunately there was some difficulty tracking down a bug that prevented it working in production. We had run out of time, and my internship was over.  Alessandro flew back to London, and I flew to Idaho to see the eclipse.

### The End

Ultimately we weren’t able to integrate this feature into the software at our edge: the risks of such intrusive changes for a very experimental feature outweighed the benefits. With not much prospect of support by clients, it would be difficult to get the real savings in performance promised. There also were nontechnical issues in standardization that have made this approach more difficult to implement: any form of traffic direction that doesn’t obey DNS creates issues for network debugging, and there were concerns about the impact of certificate misissuance.

While the project was less successful than I hoped it would be, I learned a lot of important skills: collaborating on large software projects, working with git, and communicating with other implementers about issues we found.  I also got a taste of what it would be like to be on the Research team at Cloudflare and turning research from idea into practical reality and this ultimately confirmed my choice to go into industrial research.

I’ve now returned to Cloudflare full-time, working on extensions for TLS as well as time synchronization. These drafts have continued to progress through the standardization process, and we’ve contributed some of the code I wrote as a starting point for other implementers to use.  If we knew all our projects would work out, they wouldn’t be ambitious enough to be research worth doing.

If this sort of research experience appeals to you, we’re hiring.

# Speeding up Linux disk encryption

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/speeding-up-linux-disk-encryption/

Data encryption at rest is a must-have for any modern Internet company. Many companies, however, don’t encrypt their disks, because they fear the potential performance penalty caused by encryption overhead.

Encrypting data at rest is vital for Cloudflare with more than 200 data centres across the world. In this post, we will investigate the performance of disk encryption on Linux and explain how we made it at least two times faster for ourselves and our customers!

### Encrypting data at rest

When it comes to encrypting data at rest there are several ways it can be implemented on a modern operating system (OS). Available techniques are tightly coupled with a typical OS storage stack. A simplified version of the storage stack and encryption solutions can be found on the diagram below:

On the top of the stack are applications, which read and write data in files (or streams). The file system in the OS kernel keeps track of which blocks of the underlying block device belong to which files and translates these file reads and writes into block reads and writes, however the hardware specifics of the underlying storage device is abstracted away from the filesystem. Finally, the block subsystem actually passes the block reads and writes to the underlying hardware using appropriate device drivers.

The concept of the storage stack is actually similar to the well-known network OSI model, where each layer has a more high-level view of the information and the implementation details of the lower layers are abstracted away from the upper layers. And, similar to the OSI model, one can apply encryption at different layers (think about TLS vs IPsec or a VPN).

For data at rest we can apply encryption either at the block layers (either in hardware or in software) or at the file level (either directly in applications or in the filesystem).

#### Block vs file encryption

Generally, the higher in the stack we apply encryption, the more flexibility we have. With application level encryption the application maintainers can apply any encryption code they please to any particular data they need. The downside of this approach is they actually have to implement it themselves and encryption in general is not very developer-friendly: one has to know the ins and outs of a specific cryptographic algorithm, properly generate keys, nonces, IVs etc. Additionally, application level encryption does not leverage OS-level caching and Linux page cache in particular: each time the application needs to use the data, it has to either decrypt it again, wasting CPU cycles, or implement its own decrypted “cache”, which introduces more complexity to the code.

File system level encryption makes data encryption transparent to applications, because the file system itself encrypts the data before passing it to the block subsystem, so files are encrypted regardless if the application has crypto support or not. Also, file systems can be configured to encrypt only a particular directory or have different keys for different files. This flexibility, however, comes at a cost of a more complex configuration. File system encryption is also considered less secure than block device encryption as only the contents of the files are encrypted. Files also have associated metadata, like file size, the number of files, the directory tree layout etc., which are still visible to a potential adversary.

Encryption down at the block layer (often referred to as disk encryption or full disk encryption) also makes data encryption transparent to applications and even whole file systems. Unlike file system level encryption it encrypts all data on the disk including file metadata and even free space. It is less flexible though – one can only encrypt the whole disk with a single key, so there is no per-directory, per-file or per-user configuration. From the crypto perspective, not all cryptographic algorithms can be used as the block layer doesn’t have a high-level overview of the data anymore, so it needs to process each block independently. Most common algorithms require some sort of block chaining to be secure, so are not applicable to disk encryption. Instead, special modes were developed just for this specific use-case.

So which layer to choose? As always, it depends… Application and file system level encryption are usually the preferred choice for client systems because of the flexibility. For example, each user on a multi-user desktop may want to encrypt their home directory with a key they own and leave some shared directories unencrypted. On the contrary, on server systems, managed by SaaS/PaaS/IaaS companies (including Cloudflare) the preferred choice is configuration simplicity and security – with full disk encryption enabled any data from any application is automatically encrypted with no exceptions or overrides. We believe that all data needs to be protected without sorting it into "important" vs "not important" buckets, so the selective flexibility the upper layers provide is not needed.

#### Hardware vs software disk encryption

When encrypting data at the block layer it is possible to do it directly in the storage hardware, if the hardware supports it. Doing so usually gives better read/write performance and consumes less resources from the host. However, since most hardware firmware is proprietary, it does not receive as much attention and review from the security community. In the past this led to flaws in some implementations of hardware disk encryption, which render the whole security model useless. Microsoft, for example, started to prefer software-based disk encryption since then.

We didn’t want to put our data and our customers’ data to the risk of using potentially insecure solutions and we strongly believe in open-source. That’s why we rely only on software disk encryption in the Linux kernel, which is open and has been audited by many security professionals across the world.

### Linux disk encryption performance

We aim not only to save bandwidth costs for our customers, but to deliver content to Internet users as fast as possible.

At one point we noticed that our disks were not as fast as we would like them to be. Some profiling as well as a quick A/B test pointed to Linux disk encryption. Because not encrypting the data (even if it is supposed-to-be a public Internet cache) is not a sustainable option, we decided to take a closer look into Linux disk encryption performance.

#### Device mapper and dm-crypt

Linux implements transparent disk encryption via a dm-crypt module and dm-crypt itself is part of device mapper kernel framework. In a nutshell, the device mapper allows pre/post-process IO requests as they travel between the file system and the underlying block device.

dm-crypt in particular encrypts "write" IO requests before sending them further down the stack to the actual block device and decrypts "read" IO requests before sending them up to the file system driver. Simple and easy! Or is it?

#### Benchmarking setup

For the record, the numbers in this post were obtained by running specified commands on an idle Cloudflare G9 server out of production. However, the setup should be easily reproducible on any modern x86 laptop.

Generally, benchmarking anything around a storage stack is hard because of the noise introduced by the storage hardware itself. Not all disks are created equal, so for the purpose of this post we will use the fastest disks available out there – that is no disks.

Instead Linux has an option to emulate a disk directly in RAM. Since RAM is much faster than any persistent storage, it should introduce little bias in our results.

The following command creates a 4GB ramdisk:

$sudo modprobe brd rd_nr=1 rd_size=4194304$ ls /dev/ram0


Now we can set up a dm-crypt instance on top of it thus enabling encryption for the disk. First, we need to generate the disk encryption key, "format" the disk and specify a password to unlock the newly generated key.

$fallocate -l 2M crypthdr.img$ sudo cryptsetup luksFormat /dev/ram0 --header crypthdr.img

WARNING!
========
This will overwrite data on crypthdr.img irrevocably.

Are you sure? (Type uppercase yes): YES
Enter passphrase:
Verify passphrase:


Those who are familiar with LUKS/dm-crypt might have noticed we used a LUKS detached header here. Normally, LUKS stores the password-encrypted disk encryption key on the same disk as the data, but since we want to compare read/write performance between encrypted and unencrypted devices, we might accidentally overwrite the encrypted key during our benchmarking later. Keeping the encrypted key in a separate file avoids this problem for the purposes of this post.

Now, we can actually "unlock" the encrypted device for our testing:

$sudo cryptsetup open --header crypthdr.img /dev/ram0 encrypted-ram0 Enter passphrase for /dev/ram0:$ ls /dev/mapper/encrypted-ram0
/dev/mapper/encrypted-ram0


At this point we can now compare the performance of encrypted vs unencrypted ramdisk: if we read/write data to /dev/ram0, it will be stored in plaintext. Likewise, if we read/write data to /dev/mapper/encrypted-ram0, it will be decrypted/encrypted on the way by dm-crypt and stored in ciphertext.

It’s worth noting that we’re not creating any file system on top of our block devices to avoid biasing results with a file system overhead.

#### Measuring throughput

When it comes to storage testing/benchmarking Flexible I/O tester is the usual go-to solution. Let’s simulate simple sequential read/write load with 4K block size on the ramdisk without encryption:

$sudo fio --filename=/dev/ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=plain plain: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1 fio-2.16 Starting 1 process ... Run status group 0 (all jobs): READ: io=21013MB, aggrb=1126.5MB/s, minb=1126.5MB/s, maxb=1126.5MB/s, mint=18655msec, maxt=18655msec WRITE: io=21023MB, aggrb=1126.1MB/s, minb=1126.1MB/s, maxb=1126.1MB/s, mint=18655msec, maxt=18655msec Disk stats (read/write): ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%  The above command will run for a long time, so we just stop it after a while. As we can see from the stats, we’re able to read and write roughly with the same throughput around 1126 MB/s. Let’s repeat the test with the encrypted ramdisk: $ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
READ: io=1693.7MB, aggrb=150874KB/s, minb=150874KB/s, maxb=150874KB/s, mint=11491msec, maxt=11491msec
WRITE: io=1696.4MB, aggrb=151170KB/s, minb=151170KB/s, maxb=151170KB/s, mint=11491msec, maxt=11491msec


Whoa, that’s a drop! We only get ~147 MB/s now, which is more than 7 times slower! And this is on a totally idle machine!

#### Maybe, crypto is just slow

The first thing we considered is to ensure we use the fastest crypto. cryptsetup allows us to benchmark all the available crypto implementations on the system to select the best one:

$sudo cryptsetup benchmark # Tests are approximate using memory only (no storage IO). PBKDF2-sha1 1340890 iterations per second for 256-bit key PBKDF2-sha256 1539759 iterations per second for 256-bit key PBKDF2-sha512 1205259 iterations per second for 256-bit key PBKDF2-ripemd160 967321 iterations per second for 256-bit key PBKDF2-whirlpool 720175 iterations per second for 256-bit key # Algorithm | Key | Encryption | Decryption aes-cbc 128b 969.7 MiB/s 3110.0 MiB/s serpent-cbc 128b N/A N/A twofish-cbc 128b N/A N/A aes-cbc 256b 756.1 MiB/s 2474.7 MiB/s serpent-cbc 256b N/A N/A twofish-cbc 256b N/A N/A aes-xts 256b 1823.1 MiB/s 1900.3 MiB/s serpent-xts 256b N/A N/A twofish-xts 256b N/A N/A aes-xts 512b 1724.4 MiB/s 1765.8 MiB/s serpent-xts 512b N/A N/A twofish-xts 512b N/A N/A  It seems aes-xts with a 256-bit data encryption key is the fastest here. But which one are we actually using for our encrypted ramdisk? $ sudo dmsetup table /dev/mapper/encrypted-ram0
0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0


We do use aes-xts with a 256-bit data encryption key (count all the zeroes conveniently masked by dmsetup tool – if you want to see the actual bytes, add the --showkeys option to the above command). The numbers do not add up however: cryptsetup benchmark tells us above not to rely on the results, as "Tests are approximate using memory only (no storage IO)", but that is exactly how we’ve set up our experiment using the ramdisk. In a somewhat worse case (assuming we’re reading all the data and then encrypting/decrypting it sequentially with no parallelism) doing back-of-the-envelope calculation we should be getting around (1126 * 1823) / (1126 + 1823) =~696 MB/s, which is still quite far from the actual 147 * 2 = 294 MB/s (total for reads and writes).

#### dm-crypt performance flags

While reading the cryptsetup man page we noticed that it has two options prefixed with --perf-, which are probably related to performance tuning. The first one is --perf-same_cpu_crypt with a rather cryptic description:

Perform encryption using the same cpu that IO was submitted on.  The default is to use an unbound workqueue so that encryption work is automatically balanced between available CPUs.  This option is only relevant for open action.


So we enable the option

$sudo cryptsetup close encrypted-ram0$ sudo cryptsetup open --header crypthdr.img --perf-same_cpu_crypt /dev/ram0 encrypted-ram0


Note: according to the latest man page there is also a cryptsetup refresh command, which can be used to enable these options live without having to "close" and "re-open" the encrypted device. Our cryptsetup however didn’t support it yet.

Verifying if the option has been really enabled:

$sudo dmsetup table encrypted-ram0 0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 same_cpu_crypt  Yes, we can now see same_cpu_crypt in the output, which is what we wanted. Let’s rerun the benchmark: $ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
READ: io=1596.6MB, aggrb=139811KB/s, minb=139811KB/s, maxb=139811KB/s, mint=11693msec, maxt=11693msec
WRITE: io=1600.9MB, aggrb=140192KB/s, minb=140192KB/s, maxb=140192KB/s, mint=11693msec, maxt=11693msec


Hmm, now it is ~136 MB/s which is slightly worse than before, so no good. What about the second option --perf-submit_from_crypt_cpus:

Disable offloading writes to a separate thread after encryption.  There are some situations where offloading write bios from the encryption threads to a single thread degrades performance significantly.  The default is to offload write bios to the same thread.  This option is only relevant for open action.


Maybe, we are in the "some situation" here, so let’s try it out:

$sudo cryptsetup close encrypted-ram0$ sudo cryptsetup open --header crypthdr.img --perf-submit_from_crypt_cpus /dev/ram0 encrypted-ram0
Enter passphrase for /dev/ram0:
$sudo dmsetup table encrypted-ram0 0 8388608 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 1:0 0 1 submit_from_crypt_cpus  And now the benchmark: $ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...
Run status group 0 (all jobs):
READ: io=2066.6MB, aggrb=169835KB/s, minb=169835KB/s, maxb=169835KB/s, mint=12457msec, maxt=12457msec
WRITE: io=2067.7MB, aggrb=169965KB/s, minb=169965KB/s, maxb=169965KB/s, mint=12457msec, maxt=12457msec


~166 MB/s, which is a bit better, but still not good…

Being desperate we decided to seek support from the Internet and posted our findings to the dm-crypt mailing list, but the response we got was not very encouraging:

If the numbers disturb you, then this is from lack of understanding on your side. You are probably unaware that encryption is a heavy-weight operation…

We decided to make a scientific research on this topic by typing "is encryption expensive" into Google Search and one of the top results, which actually contains meaningful measurements, is… our own post about cost of encryption, but in the context of TLS! This is a fascinating read on its own, but the gist is: modern crypto on modern hardware is very cheap even at Cloudflare scale (doing millions of encrypted HTTP requests per second). In fact, it is so cheap that Cloudflare was the first provider to offer free SSL/TLS for everyone.

#### Digging into the source code

When trying to use the custom dm-crypt options described above we were curious why they exist in the first place and what is that "offloading" all about. Originally we expected dm-crypt to be a simple "proxy", which just encrypts/decrypts data as it flows through the stack. Turns out dm-crypt does more than just encrypting memory buffers and a (simplified) IO traverse path diagram is presented below:

When the file system issues a write request, dm-crypt does not process it immediately – instead it puts it into a workqueue named "kcryptd". In a nutshell, a kernel workqueue just schedules some work (encryption in this case) to be performed at some later time, when it is more convenient. When "the time" comes, dm-crypt sends the request to Linux Crypto API for actual encryption. However, modern Linux Crypto API is asynchronous as well, so depending on which particular implementation your system will use, most likely it will not be processed immediately, but queued again for "later time". When Linux Crypto API will finally do the encryption, dm-crypt may try to sort pending write requests by putting each request into a red-black tree. Then a separate kernel thread again at "some time later" actually takes all IO requests in the tree and sends them down the stack.

Now for read requests: this time we need to get the encrypted data first from the hardware, but dm-crypt does not just ask for the driver for the data, but queues the request into a different workqueue named "kcryptd_io". At some point later, when we actually have the encrypted data, we schedule it for decryption using the now familiar "kcryptd" workqueue. "kcryptd" will send the request to Linux Crypto API, which may decrypt the data asynchronously as well.

To be fair the request does not always traverse all these queues, but the important part here is that write requests may be queued up to 4 times in dm-crypt and read requests up to 3 times. At this point we were wondering if all this extra queueing can cause any performance issues. For example, there is a nice presentation from Google about the relationship between queueing and tail latency. One key takeaway from the presentation is:

A significant amount of tail latency is due to queueing effects

So, why are all these queues there and can we remove them?

#### Git archeology

No-one writes more complex code just for fun, especially for the OS kernel. So all these queues must have been put there for a reason. Luckily, the Linux kernel source is managed by git, so we can try to retrace the changes and the decisions around them.

The "kcryptd" workqueue was in the source since the beginning of the available history with the following comment:

Needed because it would be very unwise to do decryption in an interrupt context, so bios returning from read requests get queued here.

So it was for reads only, but even then – why do we care if it is interrupt context or not, if Linux Crypto API will likely use a dedicated thread/queue for encryption anyway? Well, back in 2005 Crypto API was not asynchronous, so this made perfect sense.

In 2006 dm-crypt started to use the "kcryptd" workqueue not only for encryption, but for submitting IO requests:

This patch is designed to help dm-crypt comply with the new constraints imposed by the following patch in -mm: md-dm-reduce-stack-usage-with-stacked-block-devices.patch

It seems the goal here was not to add more concurrency, but rather reduce kernel stack usage, which makes sense again as the kernel has a common stack across all the code, so it is a quite limited resource. It is worth noting, however, that the Linux kernel stack has been expanded in 2014 for x86 platforms, so this might not be a problem anymore.

A first version of "kcryptd_io" workqueue was added in 2007 with the intent to avoid:

starvation caused by many requests waiting for memory allocation…

The request processing was bottlenecking on a single workqueue here, so the solution was to add another one. Makes sense.

We are definitely not the first ones experiencing performance degradation because of extensive queueing: in 2011 a change was introduced to conditionally revert some of the queueing for read requests:

If there is enough memory, code can directly submit bio instead queuing this operation in a separate thread.

Unfortunately, at that time Linux kernel commit messages were not as verbose as today, so there is no performance data available.

In 2015 dm-crypt started to sort writes in a separate "dmcrypt_write" thread before sending them down the stack:

On a multiprocessor machine, encryption requests finish in a different order than they were submitted. Consequently, write requests would be submitted in a different order and it could cause severe performance degradation.

It does make sense as sequential disk access used to be much faster than the random one and dm-crypt was breaking the pattern. But this mostly applies to spinning disks, which were still dominant in 2015. It may not be as important with modern fast SSDs (including NVME SSDs).

Another part of the commit message is worth mentioning:

…in particular it enables IO schedulers like CFQ to sort more effectively…

It mentions the performance benefits for the CFQ IO scheduler, but Linux schedulers have improved since then to the point that CFQ scheduler has been removed from the kernel in 2018.

The same patchset replaces the sorting list with a red-black tree:

In theory the sorting should be performed by the underlying disk scheduler, however, in practice the disk scheduler only accepts and sorts a finite number of requests. To allow the sorting of all requests, dm-crypt needs to implement its own sorting.

The overhead associated with rbtree-based sorting is considered negligible so it is not used conditionally.

All that make sense, but it would be nice to have some backing data.

Interestingly, in the same patchset we see the introduction of our familiar "submit_from_crypt_cpus" option:

Overall, we can see that every change was reasonable and needed, however things have changed since then:

• hardware became faster and smarter
• Linux resource allocation was revisited
• coupled Linux subsystems were rearchitected

And many of the design choices above may not be applicable to modern Linux.

### The "clean-up"

Based on the research above we decided to try to remove all the extra queueing and asynchronous behaviour and revert dm-crypt to its original purpose: simply encrypt/decrypt IO requests as they pass through. But for the sake of stability and further benchmarking we ended up not removing the actual code, but rather adding yet another dm-crypt option, which bypasses all the queues/threads, if enabled. The flag allows us to switch between the current and new behaviour at runtime under full production load, so we can easily revert our changes should we see any side-effects. The resulting patch can be found on the Cloudflare GitHub Linux repository.

#### Synchronous Linux Crypto API

From the diagram above we remember that not all queueing is implemented in dm-crypt. Modern Linux Crypto API may also be asynchronous and for the sake of this experiment we want to eliminate queues there as well. What does "may be" mean, though? The OS may contain different implementations of the same algorithm (for example, hardware-accelerated AES-NI on x86 platforms and generic C-code AES implementations). By default the system chooses the "best" one based on the configured algorithm priority. dm-crypt allows overriding this behaviour and request a particular cipher implementation using the capi: prefix. However, there is one problem. Let us actually check the available AES-XTS (this is our disk encryption cipher, remember?) implementations on our system:

$grep -A 11 'xts(aes)' /proc/crypto name : xts(aes) driver : xts(ecb(aes-generic)) module : kernel priority : 100 refcnt : 7 selftest : passed internal : no type : skcipher async : no blocksize : 16 min keysize : 32 max keysize : 64 -- name : __xts(aes) driver : cryptd(__xts-aes-aesni) module : cryptd priority : 451 refcnt : 1 selftest : passed internal : yes type : skcipher async : yes blocksize : 16 min keysize : 32 max keysize : 64 -- name : xts(aes) driver : xts-aes-aesni module : aesni_intel priority : 401 refcnt : 1 selftest : passed internal : no type : skcipher async : yes blocksize : 16 min keysize : 32 max keysize : 64 -- name : __xts(aes) driver : __xts-aes-aesni module : aesni_intel priority : 401 refcnt : 7 selftest : passed internal : yes type : skcipher async : no blocksize : 16 min keysize : 32 max keysize : 64  We want to explicitly select a synchronous cipher from the above list to avoid queueing effects in threads, but the only two supported are xts(ecb(aes-generic)) (the generic C implementation) and __xts-aes-aesni (the x86 hardware-accelerated implementation). We definitely want the latter as it is much faster (we’re aiming for performance here), but it is suspiciously marked as internal (see internal: yes). If we check the source code: Mark a cipher as a service implementation only usable by another cipher and never by a normal user of the kernel crypto API So this cipher is meant to be used only by other wrapper code in the Crypto API and not outside it. In practice this means, that the caller of the Crypto API needs to explicitly specify this flag, when requesting a particular cipher implementation, but dm-crypt does not do it, because by design it is not part of the Linux Crypto API, rather an "external" user. We already patch the dm-crypt module, so we could as well just add the relevant flag. However, there is another problem with AES-NI in particular: x86 FPU. "Floating point" you say? Why do we need floating point math to do symmetric encryption which should only be about bit shifts and XOR operations? We don’t need the math, but AES-NI instructions use some of the CPU registers, which are dedicated to the FPU. Unfortunately the Linux kernel does not always preserve these registers in interrupt context for performance reasons (saving/restoring FPU is expensive). But dm-crypt may execute code in interrupt context, so we risk corrupting some other process data and we go back to "it would be very unwise to do decryption in an interrupt context" statement in the original code. Our solution to address the above was to create another somewhat "smart" Crypto API module. This module is synchronous and does not roll its own crypto, but is just a "router" of encryption requests: • if we can use the FPU (and thus AES-NI) in the current execution context, we just forward the encryption request to the faster, "internal" __xts-aes-aesni implementation (and we can use it here, because now we are part of the Crypto API) • otherwise, we just forward the encryption request to the slower, generic C-based xts(ecb(aes-generic)) implementation #### Using the whole lot Let’s walk through the process of using it all together. The first step is to grab the patches and recompile the kernel (or just compile dm-crypt and our xtsproxy modules). Next, let’s restart our IO workload in a separate terminal, so we can make sure we can reconfigure the kernel at runtime under load: $ sudo fio --filename=/dev/mapper/encrypted-ram0 --readwrite=readwrite --bs=4k --direct=1 --loops=1000000 --name=crypt
crypt: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
fio-2.16
Starting 1 process
...


In the main terminal make sure our new Crypto API module is loaded and available:

$sudo modprobe xtsproxy$ grep -A 11 'xtsproxy' /proc/crypto
driver       : xts-aes-xtsproxy
module       : xtsproxy
priority     : 0
refcnt       : 0
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 16
min keysize  : 32
max keysize  : 64
ivsize       : 16
chunksize    : 16


Reconfigure the encrypted disk to use our newly loaded module and enable our patched dm-crypt flag (we have to use low-level dmsetup tool and cryptsetup obviously is not aware of our modifications):

$sudo dmsetup table encrypted-ram0 --showkeys | sed 's/aes-xts-plain64/capi:xts-aes-xtsproxy-plain64/' | sed 's/$/ 1 force_inline/' | sudo dmsetup reload encrypted-ram0


We just "loaded" the new configuration, but for it to take effect, we need to suspend/resume the encrypted device:

\$ sudo dmsetup suspend encrypted-ram0 && sudo dmsetup resume encrypted-ram0


And now observe the result. We may go back to the other terminal running the fio job and look at the output, but to make things nicer, here’s a snapshot of the observed read/write throughput in Grafana:

Wow, we have more than doubled the throughput! With the total throughput of ~640 MB/s we’re now much closer to the expected ~696 MB/s from above. What about the IO latency? (The await statistic from the iostat reporting tool):

The latency has been cut in half as well!

#### To production

So far we have been using a synthetic setup with some parts of the full production stack missing, like file systems, real hardware and most importantly, production workload. To ensure we’re not optimising imaginary things, here is a snapshot of the production impact these changes bring to the caching part of our stack:

This graph represents a three-way comparison of the worst-case response times (99th percentile) for a cache hit in one of our servers. The green line is from a server with unencrypted disks, which we will use as baseline. The red line is from a server with encrypted disks with the default Linux disk encryption implementation and the blue line is from a server with encrypted disks and our optimisations enabled. As we can see the default Linux disk encryption implementation has a significant impact on our cache latency in worst case scenarios, whereas the patched implementation is indistinguishable from not using encryption at all. In other words the improved encryption implementation does not have any impact at all on our cache response speed, so we basically get it for free! That’s a win!

### We’re just getting started

This post shows how an architecture review can double the performance of a system. Also we reconfirmed that modern cryptography is not expensive and there is usually no excuse not to protect your data.

We are going to submit this work for inclusion in the main kernel source tree, but most likely not in its current form. Although the results look encouraging we have to remember that Linux is a highly portable operating system: it runs on powerful servers as well as small resource constrained IoT devices and on many other CPU architectures as well. The current version of the patches just optimises disk encryption for a particular workload on a particular architecture, but Linux needs a solution which runs smoothly everywhere.

That said, if you think your case is similar and you want to take advantage of the performance improvements now, you may grab the patches and hopefully provide feedback. The runtime flag makes it easy to toggle the functionality on the fly and a simple A/B test may be performed to see if it benefits any particular case or setup. These patches have been running across our wide network of more than 200 data centres on five generations of hardware, so can be reasonably considered stable. Enjoy both performance and security from Cloudflare for all!

# Going Keyless Everywhere

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/going-keyless-everywhere/

Time flies. The Heartbleed vulnerability was discovered just over five and a half years ago. Heartbleed became a household name not only because it was one of the first bugs with its own web page and logo, but because of what it revealed about the fragility of the Internet as a whole. With Heartbleed, one tiny bug in a cryptography library exposed the personal data of the users of almost every website online.

Heartbleed is an example of an underappreciated class of bugs: remote memory disclosure vulnerabilities. High profile examples other than Heartbleed include Cloudbleed and most recently NetSpectre. These vulnerabilities allow attackers to extract secrets from servers by simply sending them specially-crafted packets. Cloudflare recently completed a multi-year project to make our platform more resilient against this category of bug.

For the last five years, the industry has been dealing with the consequences of the design that led to Heartbleed being so impactful. In this blog post we’ll dig into memory safety, and how we re-designed Cloudflare’s main product to protect private keys from the next Heartbleed.

## Memory Disclosure

Perfect security is not possible for businesses with an online component. History has shown us that no matter how robust their security program, an unexpected exploit can leave a company exposed. One of the more famous recent incidents of this sort is Heartbleed, a vulnerability in a commonly used cryptography library called OpenSSL that exposed the inner details of millions of web servers to anyone with a connection to the Internet. Heartbleed made international news, caused millions of dollars of damage, and still hasn’t been fully resolved.

Typical web services only return data via well-defined public-facing interfaces called APIs. Clients don’t typically get to see what’s going on under the hood inside the server, that would be a huge privacy and security risk. Heartbleed broke that paradigm: it enabled anyone on the Internet to get access to take a peek at the operating memory used by web servers, revealing privileged data usually not exposed via the API. Heartbleed could be used to extract the result of previous data sent to the server, including passwords and credit cards. It could also reveal the inner workings and cryptographic secrets used inside the server, including TLS certificate private keys.

Heartbleed let attackers peek behind the curtain, but not too far. Sensitive data could be extracted, but not everything on the server was at risk. For example, Heartbleed did not enable attackers to steal the content of databases held on the server. You may ask: why was some data at risk but not others? The reason has to do with how modern operating systems are built.

## A simplified view of process isolation

Most modern operating systems are split into multiple layers. These layers are analogous to security clearance levels. So-called user-space applications (like your browser) typically live in a low-security layer called user space. They only have access to computing resources (memory, CPU, networking) if the lower, more credentialed layers let them.

User-space applications need resources to function. For example, they need memory to store their code and working memory to do computations. However, it would be risky to give an application direct access to the physical RAM of the computer they’re running on. Instead, the raw computing elements are restricted to a lower layer called the operating system kernel. The kernel only runs specially-designed applications designed to safely manage these resources and mediate access to them for user-space applications.

When a new user space application process is launched, the kernel gives it a virtual memory space. This virtual memory space acts like real memory to the application but is actually a safely guarded translation layer the kernel uses to protect the real memory. Each application’s virtual memory space is like a parallel universe dedicated to that application. This makes it impossible for one process to view or modify another’s, the other applications are simply not addressable.

## Heartbleed, Cloudbleed and the process boundary

Heartbleed was a vulnerability in the OpenSSL library, which was part of many web server applications. These web servers run in user space, like any common applications. This vulnerability caused the web server to return up to 2 kilobytes of its memory in response to a specially-crafted inbound request.

Cloudbleed was also a memory disclosure bug, albeit one specific to Cloudflare, that got its name because it was so similar to Heartbleed. With Cloudbleed, the vulnerability was not in OpenSSL, but instead in a secondary web server application used for HTML parsing. When this code parsed a certain sequence of HTML, it ended up inserting some process memory into the web page it was serving.

It’s important to note that both of these bugs occurred in applications running in user space, not kernel space. This means that the memory exposed by the bug was necessarily part of the virtual memory of the application. Even if the bug were to expose megabytes of data, it would only expose data specific to that application, not other applications on the system.

In order for a web server to serve traffic over the encrypted HTTPS protocol, it needs access to the certificate’s private key, which is typically kept in the application’s memory. These keys were exposed to the Internet by Heartbleed. The Cloudbleed vulnerability affected a different process, the HTML parser, which doesn’t do HTTPS and therefore doesn’t keep the private key in memory. This meant that HTTPS keys were safe, even if other data in the HTML parser’s memory space wasn’t.

The fact that the HTML parser and the web server were different applications saved us from having to revoke and re-issue our customers’ TLS certificates. However, if another memory disclosure vulnerability is discovered in the web server, these keys are again at risk.

## Moving keys out of Internet-facing processes

Not all web servers keep private keys in memory. In some deployments, private keys are held in a separate machine called a Hardware Security Module (HSM). HSMs are built to withstand physical intrusion and tampering and are often built to comply with stringent compliance requirements. They can often be bulky and expensive. Web servers designed to take advantage of keys in an HSM connect to them over a physical cable and communicate with a specialized protocol called PKCS#11. This allows the web server to serve encrypted content while being physically separated from the private key.

At Cloudflare, we built our own way to separate a web server from a private key: Keyless SSL. Rather than keeping the keys in a separate physical machine connected to the server with a cable, the keys are kept in a key server operated by the customer in their own infrastructure (this can also be backed by an HSM).

More recently, we launched Geo Key Manager, a service that allows users to store private keys in only select Cloudflare locations. Connections to locations that do not have access to the private key use Keyless SSL with a key server hosted in a datacenter that does have access.

In both Keyless SSL and Geo Key Manager, private keys are not only not part of the web server’s memory space, they’re often not even in the same country! This extreme degree of separation is not necessary to protect against the next Heartbleed. All that is needed is for the web server and the key server to not be part of the same application. So that’s what we did. We call this Keyless Everywhere.

## Keyless SSL is coming from inside the house

Repurposing Keyless SSL for Cloudflare-held private keys was easy to conceptualize, but the path from ideation to live in production wasn’t so straightforward. The core functionality of Keyless SSL comes from the open source gokeyless which customers run on their infrastructure, but internally we use it as a library and have replaced the main package with an implementation suited to our requirements (we’ve creatively dubbed it gokeyless-internal).

As with all major architecture changes, it’s prudent to start with testing out the model with something new and low risk. In our case, the test bed was our experimental TLS 1.3 implementation. In order to quickly iterate through draft versions of the TLS specification and push releases without affecting the majority of Cloudflare customers, we re-wrote our custom nginx web server in Go and deployed it in parallel to our existing infrastructure. This server was designed to never hold private keys from the start and only leverage gokeyless-internal. At this time there was only a small amount of TLS 1.3 traffic and it was all coming from the beta versions of browsers, which allowed us to work through the initial kinks of gokeyless-internal without exposing the majority of visitors to security risks or outages due to gokeyless-internal.

The first step towards making TLS 1.3 fully keyless was identifying and implementing the new functionality we needed to add to gokeyless-internal. Keyless SSL was designed to run on customer infrastructure, with the expectation of supporting only a handful of private keys. But our edge must simultaneously support millions of private keys, so we implemented the same lazy loading logic we use in our web server, nginx. Furthermore, a typical customer deployment would put key servers behind a network load balancer, so they could be taken out of service for upgrades or other maintenance. Contrast this with our edge, where it’s important to maximize our resources by serving traffic during software upgrades. This problem is solved by the excellent tableflip package we use elsewhere at Cloudflare.

The next project to go Keyless was Spectrum, which launched with default support for gokeyless-internal. With these small victories in hand, we had the confidence necessary to attempt the big challenge, which was porting our existing nginx infrastructure to a fully keyless model. After implementing the new functionality, and being satisfied with our integration tests, all that’s left is to turn this on in production and call it a day, right? Anyone with experience with large distributed systems knows how far “working in dev” is from “done,” and this story is no different. Thankfully we were anticipating problems, and built a fallback into nginx to complete the handshake itself if any problems were encountered with the gokeyless-internal path. This allowed us to expose gokeyless-internal to production traffic without risking downtime in the event that our reimplementation of the nginx logic was not 100% bug-free.

## When rolling back the code doesn’t roll back the problem

Our deployment plan was to enable Keyless Everywhere, find the most common causes of fallbacks, and then fix them. We could then repeat this process until all sources of fallbacks had been eliminated, after which we could remove access to private keys (and therefore the fallback) from nginx. One of the early causes of fallbacks was gokeyless-internal returning ErrKeyNotFound, indicating that it couldn’t find the requested private key in storage. This should not have been possible, since nginx only makes a request to gokeyless-internal after first finding the certificate and key pair in storage, and we always write the private key and certificate together. It turned out that in addition to returning the error for the intended case of the key truly not found, we were also returning it when transient errors like timeouts were encountered. To resolve this, we updated those transient error conditions to return ErrInternal, and deployed to our canary datacenters. Strangely, we found that a handful of instances in a single datacenter started encountering high rates of fallbacks, and the logs from nginx indicated it was due to a timeout between nginx and gokeyless-internal. The timeouts didn’t occur right away, but once a system started logging some timeouts it never stopped. Even after we rolled back the release, the fallbacks continued with the old version of the software! Furthermore, while nginx was complaining about timeouts, gokeyless-internal seemed perfectly healthy and was reporting reasonable performance metrics (sub-millisecond median request latency).

To debug the issue, we added detailed logging to both nginx and gokeyless, and followed the chain of events backwards once timeouts were encountered.

➜ ~ grep 'timed out' nginx.log | grep Keyless | head -5
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015157 Keyless SSL request/response timed out while reading Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015231 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015271 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015280 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:50.000 29m41 2018/07/25 05:30:50 [error] 4525#0: *1015289 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1


You can see the first request to log a timeout had id 1015157. Also interesting that the first log line was “timed out while reading,” but all the others are “timed out while waiting,” and this latter message is the one that continues forever. Here is the matching request in the gokeyless log:

➜ ~ grep 'id=1015157 ' gokeyless.log | head -1


Aha! That SNI value is clearly invalid (SNIs are like Host headers, i.e. they are domains, not URL paths), and it’s also quite long. Our storage system indexes certificates based on two indices: which SNI they correspond to, and which IP addresses they correspond to (for older clients that don’t support SNI). Our storage interface uses the memcached protocol, and the client library that gokeyless-internal uses rejects requests for keys longer than 250 characters (memcached’s maximum key length), whereas the nginx logic is to simply ignore the invalid SNI and treat the request as if only had an IP. The change in our new release had shifted this condition from ErrKeyNotFound to ErrInternal, which triggered cascading problems in nginx. The “timeouts” it encountered were actually a result of throwing away all in-flight requests multiplexed on a connection which happened to return ErrInternalfor a single request. These requests were retried, but once this condition triggered, nginx became overloaded by the number of retried requests plus the continuous stream of new requests coming in with bad SNI, and was unable to recover. This explains why rolling back gokeyless-internal didn’t fix the problem.

This discovery finally brought our attention to nginx, which thus far had escaped blame since it had been working reliably with customer key servers for years. However, communicating over localhost to a multitenant key server is fundamentally different than reaching out over the public Internet to communicate with a customer’s key server, and we had to make the following changes:

• Instead of a long connection timeout and a relatively short response timeout for customer key servers, extremely short connection timeouts and longer request timeouts are appropriate for a localhost key server.
• Similarly, it’s reasonable to retry (with backoff) if we timeout waiting on a customer key server response, since we can’t trust the network. But over localhost, a timeout would only occur if gokeyless-internal were overloaded and the request were still queued for processing. In this case a retry would only lead to more total work being requested of gokeyless-internal, making the situation worse.
• Most significantly, nginx must not throw away all requests multiplexed on a connection if any single one of them encounters an error, since a single connection no longer represents a single customer.

## Implementations matter

CPU at the edge is one of our most precious assets, and it’s closely guarded by our performance team (aka CPU police). Soon after turning on Keyless Everywhere in one of our canary datacenters, they noticed gokeyless using ~50% of a core per instance. We were shifting the sign operations from nginx to gokeyless, so of course it would be using more CPU now. But nginx should have seen a commensurate reduction in CPU usage, right?

Wrong. Elliptic curve operations are very fast in Go, but it’s known that RSA operations are much slower than their BoringSSL counterparts.

Although Go 1.11 includes optimizations for RSA math operations, we needed more speed. Well-tuned assembly code is required to match the performance of BoringSSL, so Armando Faz from our Crypto team helped claw back some of the lost CPU by reimplementing parts of the math/big package with platform-dependent assembly in an internal fork of Go. The recent assembly policy of Go prefers the use of Go portable code instead of assembly, so these optimizations were not upstreamed. There is still room for more optimizations, and for that reason we’re still evaluating moving to cgo + BoringSSL for sign operations, despite cgo’s many downsides.

## Changing our tooling

Process isolation is a powerful tool for protecting secrets in memory. Our move to Keyless Everywhere demonstrates that this is not a simple tool to leverage. Re-architecting an existing system such as nginx to use process isolation to protect secrets was time-consuming and difficult. Another approach to memory safety is to use a memory-safe language such as Rust.

Rust was originally developed by Mozilla but is starting to be used much more widely. The main advantage that Rust has over C/C++ is that it has memory safety features without a garbage collector.

Re-writing an existing application in a new language such as Rust is a daunting task. That said, many new Cloudflare features, from the powerful Firewall Rules feature to our 1.1.1.1 with WARP app, have been written in Rust to take advantage of its powerful memory-safety properties. We’re really happy with Rust so far and plan on using it even more in the future.

## Conclusion

The harrowing aftermath of Heartbleed taught the industry a lesson that should have been obvious in retrospect: keeping important secrets in applications that can be accessed remotely via the Internet is a risky security practice. In the following years, with a lot of work, we leveraged process separation and Keyless SSL to ensure that the next Heartbleed wouldn’t put customer keys at risk.

However, this is not the end of the road. Recently memory disclosure vulnerabilities such as NetSpectre have been discovered which are able to bypass application process boundaries, so we continue to actively explore new ways to keep keys secure.

# Delegated Credentials for TLS

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/keyless-delegation/

Today we’re happy to announce support for a new cryptographic protocol that helps make it possible to deploy encrypted services in a global network while still maintaining fast performance and tight control of private keys: Delegated Credentials for TLS. We have been working with partners from Facebook, Mozilla, and the broader IETF community to define this emerging standard. We’re excited to share the gory details today in this blog post.

## Deploying TLS globally

Many of the technical problems we face at Cloudflare are widely shared problems across the Internet industry. As gratifying as it can be to solve a problem for ourselves and our customers, it can be even more gratifying to solve a problem for the entire Internet. For the past three years, we have been working with peers in the industry to solve a specific shared problem in the TLS infrastructure space: How do you terminate TLS connections while storing keys remotely and maintaining performance and availability? Today we’re announcing that Cloudflare now supports Delegated Credentials, the result of this work.

Cloudflare’s TLS/SSL features are among the top reasons customers use our service. Configuring TLS is hard to do without internal expertise. By automating TLS, web site and web service operators gain the latest TLS features and the most secure configurations by default. It also reduces the risk of outages or bad press due to misconfigured or insecure encryption settings. Customers also gain early access to unique features like TLS 1.3, post-quantum cryptography, and OCSP stapling as they become available.

Unfortunately, for web services to authorize a service to terminate TLS for them, they have to trust the service with their private keys, which demands a high level of trust. For services with a global footprint, there is an additional level of nuance. They may operate multiple data centers located in places with varying levels of physical security, and each of these needs to be trusted to terminate TLS.

To tackle these problems of trust, Cloudflare has invested in two technologies: Keyless SSL, which allows customers to use Cloudflare without sharing their private key with Cloudflare; and Geo Key Manager, which allows customers to choose the datacenters in which Cloudflare should keep their keys. Both of these technologies are able to be deployed without any changes to browsers or other clients. They also come with some downsides in the form of availability and performance degradation.

Keyless SSL introduces extra latency at the start of a connection. In order for a server without access to a private key to establish a connection with a client, that servers needs to reach out to a key server, or a remote point of presence, and ask them to do a private key operation. This not only adds additional latency to the connection, causing the content to load slower, but it also introduces some troublesome operational constraints on the customer. Specifically, the server with access to the key needs to be highly available or the connection can fail. Sites often use Cloudflare to improve their site’s availability, so having to run a high-availability key server is an unwelcome requirement.

## Turning a pull into a push

The reason services like Keyless SSL that rely on remote keys are so brittle is their architecture: they are pull-based rather than push-based. Every time a client attempts a handshake with a server that doesn’t have the key, it needs to pull the authorization from the key server. An alternative way to build this sort of system is to periodically push a short-lived authorization key to the server and use that for handshakes. Switching from a pull-based model to a push-based model eliminates the additional latency, but it comes with additional requirements, including the need to change the client.

Enter the new TLS feature of Delegated Credentials (DCs). A delegated credential is a short-lasting key that the certificate’s owner has delegated for use in TLS. They work like a power of attorney: your server authorizes our server to terminate TLS for a limited time. When a browser that supports this protocol connects to our edge servers we can show it this “power of attorney”, instead of needing to reach back to a customer’s server to get it to authorize the TLS connection. This reduces latency and improves performance and reliability.

A fresh delegated credential can be created and pushed out to TLS servers long before the previous credential expires. Momentary blips in availability will not lead to broken handshakes for clients that support delegated credentials. Furthermore, a Delegated Credentials-enabled TLS connection is just as fast as a standard TLS connection: there’s no need to connect to the key server for every handshake. This removes the main drawback of Keyless SSL for DC-enabled clients.

Delegated credentials are intended to be an Internet Standard RFC that anyone can implement and use, not a replacement for Keyless SSL. Since browsers will need to be updated to support the standard, proprietary mechanisms like Keyless SSL and Geo Key Manager will continue to be useful. Delegated credentials aren’t just useful in our context, which is why we’ve developed it openly and with contributions from across industry and academia. Facebook has integrated them into their own TLS implementation, and you can read more about how they view the security benefits here.  When it comes to improving the security of the Internet, we’re all on the same team.

"We believe delegated credentials provide an effective way to boost security by reducing certificate lifetimes without sacrificing reliability. This will soon become an Internet standard and we hope others in the industry adopt delegated credentials to help make the Internet ecosystem more secure."

Subodh Iyengar, software engineer at Facebook

## Extensibility beyond the PKI

At Cloudflare, we’re interested in pushing the state of the art forward by experimenting with new algorithms. In TLS, there are three main areas of experimentation: ciphers, key exchange algorithms, and authentication algorithms. Ciphers and key exchange algorithms are only dependent on two parties: the client and the server. This freedom allows us to deploy exciting new choices like ChaCha20-Poly1305 or post-quantum key agreement in lockstep with browsers. On the other hand, the authentication algorithms used in TLS are dependent on certificates, which introduces certificate authorities and the entire public key infrastructure into the mix.

Unfortunately, the public key infrastructure is very conservative in its choice of algorithms, making it harder to adopt newer cryptography for authentication algorithms in TLS. For instance, EdDSA, a highly-regarded signature scheme, is not supported by certificate authorities, and root programs limit the certificates that will be signed. With the emergence of quantum computing, experimenting with new algorithms is essential to determine which solutions are deployable and functional on the Internet.

Since delegated credentials introduce the ability to use new authentication key types without requiring changes to certificates themselves, this opens up a new area of experimentation. Delegated credentials can be used to provide a level of flexibility in the transition to post-quantum cryptography, by enabling new algorithms and modes of operation to coexist with the existing PKI infrastructure. It also enables tiny victories, like the ability to use smaller, faster Ed25519 signatures in TLS.

## Inside DCs

A delegated credential contains a public key and an expiry time. This bundle is then signed by a certificate along with the certificate itself, binding the delegated credential to the certificate for which it is acting as “power of attorney”. A supporting client indicates its support for delegated credentials by including an extension in its Client Hello.

A server that supports delegated credentials composes the TLS Certificate Verify and Certificate messages as usual, but instead of signing with the certificate’s private key, it includes the certificate along with the DC, and signs with the DC’s private key. Therefore, the private key of the certificate only needs to be used for the signing of the DC.

Certificates used for signing delegated credentials require a special X.509 certificate extension. This requirement exists to avoid breaking assumptions people may have about the impact of temporary access to their keys on security, particularly in cases involving HSMs and the still unfixed Bleichbacher oracles in older TLS versions.  Temporary access to a key can enable signing lots of delegated credentials which start far in the future, and as a result support was made opt-in. Early versions of QUIC had similar issues, and ended up adopting TLS to fix them. Protocol evolution on the Internet requires working well with already existing protocols and their flaws.

## Delegated Credentials at Cloudflare and Beyond

Currently we use delegated credentials as a performance optimization for Geo Key Manager and Keyless SSL. Customers can update their certificates to include the special extension for delegated credentials, and we will automatically create delegated credentials and distribute them to the edge through the Keyless SSL or Geo Key Manager. For more information, see the documentation. It also enables us to be more conservative about where we keep keys for customers, improving our security posture.

Delegated Credentials would be useless if it wasn’t also supported by browsers and other HTTP clients. Christopher Patton, a former intern at Cloudflare, implemented support in Firefox and its underlying NSS security library. This feature is now in the Nightly versions of Firefox. You can turn it on by activating the configuration option security.tls.enable_delegated_credentials at about:config. Studies are ongoing on how effective this will be in a wider deployment. There also is support for Delegated Credentials in BoringSSL.

"At Mozilla we welcome ideas that help to make the Web PKI more robust. The Delegated Credentials feature can help to provide secure and performant TLS connections for our users, and we’re happy to work with Cloudflare to help validate this feature."

Thyla van der Merwe, Cryptography Engineering Manager at Mozilla

One open issue is the question of client clock accuracy. Until we have a wide-scale study we won’t know how many connections using delegated credentials will break because of the 24 hour time limit that is imposed.  Some clients, in particular mobile clients, may have inaccurately set clocks, the root cause of one third of all certificate errors in Chrome. Part of the way that we’re aiming to solve this problem is through standardizing and improving Roughtime, so web browsers and other services that need to validate certificates can do so independent of the client clock.

Cloudflare’s global scale means that we see connections from every corner of the world, and from many different kinds of connection and device. That reach enables us to find rare problems with the deployability of protocols. For example, our early deployment helped inform the development of the TLS 1.3 standard. As we enable developing protocols like delegated credentials, we learn about obstacles that inform and affect their future development.

## Conclusion

As new protocols emerge, we’ll continue to play a role in their development and bring their benefits to our customers. Today’s announcement of a technology that overcomes some limitations of Keyless SSL is just one example of how Cloudflare takes part in improving the Internet not just for our customers, but for everyone. During the standardization process of turning the draft into an RFC, we’ll continue to maintain our implementation and come up with new ways to apply delegated credentials.

# Announcing cfnts: Cloudflare’s implementation of NTS in Rust

Post Syndicated from Watson Ladd original https://blog.cloudflare.com/announcing-cfnts/

Several months ago we announced that we were providing a new public time service. Part of what we were providing was the first major deployment of the new Network Time Security (NTS) protocol, with a newly written implementation of NTS in Rust. In the process, we received helpful advice from the NTP community, especially from the NTPSec and Chrony projects. We’ve also participated in several interoperability events. Now we are returning something to the community: Our implementation, cfnts, is now open source and we welcome your pull requests and issues.

The journey from a blank source file to a working, deployed service was a lengthy one, and it involved many people across multiple teams.

"Correct time is a necessity for most security protocols in use on the Internet. Despite this, secure time transfer over the Internet has previously required complicated configuration on a case by case basis. With the introduction of NTS, secure time synchronization will finally be available for everyone. It is a small, but important, step towards increasing security in all systems that depend on accurate time. I am happy that Cloudflare are sharing their NTS implementation. A diversity of software with NTS support is important for quick adoption of the new protocol."

Marcus Dansarie, coauthor of the NTS specification

## How NTS works

NTS is structured as a suite of two sub-protocols as shown in the figure below. The first is the Network Time Security Key Exchange (NTS-KE), which is always conducted over Transport Layer Security (TLS) and handles the creation of key material and parameter negotiation for the second protocol. The second is NTPv4, the current version of the NTP protocol, which allows the client to synchronize their time from the remote server.

In order to maintain the scalability of NTPv4, it was important that the server not maintain per-client state. A very small server can serve millions of NTP clients. Maintaining this property while providing security is achieved with cookies that the server provides to the client that contain the server state.

In the first stage, the client sends a request to the NTS-KE server and gets a response via TLS. This exchange carries out a number of functions:

• Negotiates the AEAD algorithm to be used in the second stage.
• Negotiates the second protocol. Currently, the standard only defines how NTS works with NTPv4.
• Negotiates the NTP server IP address and port.
• Creates cookies for use in the second stage.
• Creates two symmetric keys (C2S and S2C) from the TLS session via exporters.

In the second stage, the client securely synchronizes the clock with the negotiated NTP server. To synchronize securely, the client sends NTPv4 packets with four special extensions:

• Unique Identifier Extension contains a random nonce used to prevent replay attacks.
• NTS Cookie Extension contains one of the cookies that the client stores. Since currently only the client remembers the two AEAD keys (C2S and S2C), the server needs to use the cookie from this extension to extract the keys. Each cookie contains the keys encrypted under a secret key the server has.
• NTS Cookie Placeholder Extension is a signal from the client to request additional cookies from the server. This extension is needed to make sure that the response is not much longer than the request to prevent amplification attacks.
• NTS Authenticator and Encrypted Extension Fields Extension contains a ciphertext from the AEAD algorithm with C2S as a key and with the NTP header, timestamps, and all the previously mentioned extensions as associated data. Other possible extensions can be included as encrypted data within this field. Without this extension, the timestamp can be spoofed.

After getting a request, the server sends a response back to the client echoing the Unique Identifier Extension to prevent replay attacks, the NTS Cookie Extension to provide the client with more cookies, and the NTS Authenticator and Encrypted Extension Fields Extension with an AEAD ciphertext with S2C as a key. But in the server response, instead of sending the NTS Cookie Extension in plaintext, it needs to be encrypted with the AEAD to provide unlinkability of the NTP requests.

The second handshake can be repeated many times without going back to the first stage since each request and response gives the client a new cookie. The expensive public key operations in TLS are thus amortized over a large number of requests. Furthermore, specialized timekeeping devices like FPGA implementations only need to implement a few symmetric cryptographic functions and can delegate the complex TLS stack to a different device.

## Why Rust?

While many of our services are written in Go, and we have considerable experience on the Crypto team with Go, a garbage collection pause in the middle of responding to an NTP packet would negatively impact accuracy. We picked Rust because of its zero-overhead and useful features.

• Memory safety After Heartbleed, Cloudbleed, and the steady drip of vulnerabilities caused by C’s lack of memory safety, it’s clear that C is not a good choice for new software dealing with untrusted inputs. The obvious solution for memory safety is to use garbage collection, but garbage collection has a substantial runtime overhead, while Rust has less runtime overhead.
• Non-nullability Null pointers are an edge case that is frequently not handled properly. Rust explicitly marks optionality, so all references in Rust can be safely dereferenced. The type system ensures that option types are properly handled.
• Thread safety  Data-race prevention is another key feature of Rust. Rust’s ownership model ensures that all cross-thread accesses are synchronized by default. While not a panacea, this eliminates a major class of bugs.
• Immutability Separating types into mutable and immutable is very important for reducing bugs. For example, in Java, when you pass an object into a function as a parameter, after the function is finished, you will never know whether the object has been mutated or not. Rust allows you to pass the object reference into the function and still be assured that the object is not mutated.
• Error handling  Rust result types help with ensuring that operations that can produce errors are identified and a choice made about the error, even if that choice is passing it on.

While Rust provides safety with zero overhead, coding in Rust involves understanding linear types and for us a new language. In this case the importance of security and performance meant we chose Rust over a potentially easier task in Go.

## Dependencies we use

Because of our scale and for DDoS protection we needed a highly scalable server. For UDP protocols without the concept of a connection, the server can respond to one packet at a time easily, but for TCP this is more complex. Originally we thought about using Tokio. However, at the time Tokio suffered from scheduler problems that had caused other teams some issues. As a result we decided to use Mio directly, basing our work on the examples in Rustls.

We decided to use Rustls over OpenSSL or BoringSSL because of the crate’s consistent error codes and default support for authentication that is difficult to disable accidentally. While there are some features that are not yet supported, it got the job done for our service.

## Other engineering choices

More important than our choice of programming language was our implementation strategy. A working, fully featured NTP implementation is a complicated program involving a phase-locked loop. These have a difficult reputation due to their nonlinear nature, beyond the usual complexities of closed loop control. The response of a phase lock loop to a disturbance can be estimated if the loop is locked and the disturbance small. However, lock acquisition, large disturbances, and the necessary filtering in NTP are all hard to analyze mathematically since they are not captured in the linear models applied for small scale analysis. While NTP works with the total phase, unlike the phase-locked loops of electrical engineering, there are still nonlinear elements. For NTP testing, changes to this loop requires weeks of operation to determine the performance as the loop responds very slowly.

Computer clocks are generally accurate over short periods, while networks are plagued with inconsistent delays. This demands a slow response. Changes we make to our service have taken hours to have an effect, as the clients slowly adapt to the new conditions. While RFC 5905 provides lots of details on an algorithm to adjust the clock, later implementations such as chrony have improved upon the algorithm through much more sophisticated nonlinear filters.

Rather than implement these more sophisticated algorithms, we let chrony adjust the clock of our servers, and copy the state variables in the header from chrony and adjust the dispersion and root delay according to the formulas given in the RFC. This strategy let us focus on the new protocols.

## Prague

Part of what the Internet Engineering Task Force (IETF) does is organize events like hackathons where implementers of a new standard can get together and try to make their stuff work with one another. This exposes bugs and infelicities of language in the standard and the implementations. We attended the IETF 104 hackathon to develop our server and make it work with other implementations. The NTP working group members were extremely generous with their time, and during the process we uncovered a few issues relating to the exact way one has to handle ALPN with older OpenSSL versions.

At the IETF 104 in Prague we had a working client and server for NTS-KE by the end of the hackathon. This was a good amount of progress considering we started with nothing. However, without implementing NTP we didn’t actually know that our server and client were computing the right thing. That would have to wait for later rounds of testing.

## Crypto Week

As Crypto Week 2019 approached we were busily writing code. All of the NTP protocol had to be implemented, together with the connection between the NTP and NTS-KE parts of the server. We also had to deploy processes to synchronize the ticket encrypting keys around the world and work on reconfiguring our own timing infrastructure to support this new service.

With a few weeks to go we had a working implementation, but we needed servers and clients out there to test with. But because we only support TLS 1.3 on the server, which had only just entered into OpenSSL, there were some compatibility problems.

We ended up compiling a chrony branch with NTS support and NTPsec ourselves and testing against time.cloudflare.com. We also tested our client against test servers set up by the chrony and NTPsec projects, in the hopes that this would expose bugs and have our implementations work nicely together. After a few lengthy days of debugging, we found out that our nonce length wasn’t exactly in accordance with the spec, which was quickly fixed. The NTPsec project was extremely helpful in this effort. Of course, this was the day that our office had a blackout, so the testing happened outside in Yerba Buena Gardens.

During the deployment of time.cloudflare.com, we had to open up our firewall to incoming NTP packets. Since the start of Cloudflare’s network existence and because of NTP reflection attacks we had previously closed UDP port 123 on the router. Since source port 123 is also used by clients sometimes to send NTP packets, it’s impossible for NTP servers to filter reflection attacks without parsing the contents of NTP packet, which routers have difficulty doing.  In order to protect Cloudflare infrastructure we got an entire subnet just for the time service, so it could be aggressively throttled and rerouted in case of massive DDoS attacks. This is an exceptional case: most edge services at Cloudflare run on every available IP.

## Bug fixes

Shortly after the public launch, we discovered that older Windows versions shipped with NTP version 3, and our server only spoke version 4. This was easy to fix since the timestamps have not moved in NTP versions: we echo the version back and most still existing NTP version 3 clients will understand what we meant.

Also tricky was the failure of Network Time Foundation ntpd clients to expand the polling interval. It turns out that one has to echo back the client’s polling interval to have the polling interval expand. Chrony does not use the polling interval from the server, and so was not affected by this incompatibility.

Both of these issues were fixed in ways suggested by other NTP implementers who had run into these problems themselves. We thank Miroslav Lichter tremendously for telling us exactly what the problem was, and the members of the Cloudflare community who posted packet captures demonstrating these issues.

## Continued improvement

The original production version of cfnts was not particularly object oriented and several contributors were just learning Rust. As a result there was quite a bit of unwrap and unnecessary mutability flying around. Much of the code was in functions even when it could profitably be attached to structures. All of this had to be restructured. Keep in mind that some of the best code running in the real-world have been written, rewritten, and sometimes rewritten again! This is actually a good thing.

As an internal project we relied on Cloudflare’s internal tooling for building, testing, and deploying code. These were replaced with tools available to everyone like Docker to ensure anyone can contribute. Our repository is integrated with Circle CI, ensuring that all contributions are automatically tested. In addition to unit tests we test the entire end to end functionality of getting a measurement of the time from a server.

## The Future

NTPsec has already released support for NTS but we see very little usage. Please try turning on NTS if you use NTPsec and see how it works with time.cloudflare.com.  As the draft advances through the standards process the protocol will undergo an incompatible change when the identifiers are updated and assigned out of the IANA registry instead of being experimental ones, so this is very much an experiment. Note that your daemon will need TLS 1.3 support and so could require manually compiling OpenSSL and then linking against it.

We’ve also added our time service to the public NTP pool. The NTP pool is a widely used volunteer-maintained service that provides NTP servers geographically spread across the world. Unfortunately, NTS doesn’t currently work well with the pool model, so for the best security, we recommend enabling NTS and using time.cloudflare.com and other NTS supporting servers.

In the future, we’re hoping that more clients support NTS, and have licensed our code liberally to enable this. We would love to hear if you incorporate it into a product and welcome contributions to make it more useful.

We’re also encouraged to see that Netnod has a production NTS service at nts.ntp.se. The more time services and clients that adopt NTS, the more secure the Internet will be.

## Acknowledgements

Tanya Verma and Gabbi Fisher were major contributors to the code, especially the configuration system and the client code. We’d also like to thank Gary Miller, Miroslav Lichter, and all the people at Cloudflare who set up their laptops and home machines to point to time.cloudflare.com for early feedback.

# The TLS Post-Quantum Experiment

Post Syndicated from Kris Kwiatkowski original https://blog.cloudflare.com/the-tls-post-quantum-experiment/

In June, we announced a wide-scale post-quantum experiment with Google. We implemented two post-quantum (i.e., not yet known to be broken by quantum computers) key exchanges, integrated them into our TLS stack and deployed the implementation on our edge servers and in Chrome Canary clients. The goal of the experiment was to evaluate the performance and feasibility of deployment in TLS of two post-quantum key agreement ciphers.

In our previous blog post on post-quantum cryptography, we described differences between those two ciphers in detail. In case you didn’t have a chance to read it, we include a quick recap here. One characteristic of post-quantum key exchange algorithms is that the public keys are much larger than those used by “classical” algorithms. This will have an impact on the duration of the TLS handshake. For our experiment, we chose two algorithms: isogeny-based SIKE and lattice-based HRSS. The former has short key sizes (~330 bytes) but has a high computational cost; the latter has larger key sizes (~1100 bytes), but is a few orders of magnitude faster.

During NIST’s Second PQC Standardization Conference, Nick Sullivan presented our approach to this experiment and some initial results. Quite accurately, he compared NTRU-HRSS to an ostrich and SIKE to a turkey—one is big and fast and the other is small and slow.

## Setup & Execution

We based our experiment on TLS 1.3. Cloudflare operated the server-side TLS connections and Google Chrome (Canary and Dev builds) represented the client side of the experiment. We enabled both CECPQ2 (HRSS + X25519) and CECPQ2b (SIKE/p434 + X25519) key-agreement algorithms on all TLS-terminating edge servers. Since the post-quantum algorithms are considered experimental, the X25519 key exchange serves as a fallback to ensure the classical security of the connection.

Clients participating in the experiment were split into 3 groups—those who initiated TLS handshake with post-quantum CECPQ2, CECPQ2b or non post-quantum X25519 public keys. Each group represented approximately one third of the Chrome Canary population participating in the experiment.

In order to distinguish between clients participating in or excluded from the experiment, we added a custom extension to the TLS handshake. It worked as a simple flag sent by clients and echoed back by Cloudflare edge servers. This allowed us to measure the duration of TLS handshakes only for clients participating in the experiment.

For each connection, we collected telemetry metrics. The most important metric was a TLS server-side handshake duration defined as the time between receiving the Client Hello and Client Finished messages. The diagram below shows details of what was measured and how post-quantum key exchange was integrated with TLS 1.3.

The experiment ran for 53 days in total, between August and October. During this time we collected millions of data samples, representing 5% of (anonymized) TLS connections that contained the extension signaling that the client was part of the experiment. We carried out the experiment in two phases.

In the first phase of the experiment, each client was assigned to use one of the three key exchange groups, and each client offered the same key exchange group for every connection. We collected over 10 million records over 40 days.

In the second phase of the experiment, client behavior was modified so that each client randomly chose which key exchange group to offer for each new connection, allowing us to directly compare the performance of each algorithm on a per-client basis. Data collection for this phase lasted 13 days and we collected 270 thousand records.

## Results

We now describe our server-side measurement results. Client-side results are described at https://www.imperialviolet.org/2019/10/30/pqsivssl.html.

### What did we find?

The primary metric we collected for each connection was the server-side handshake duration. The below histograms show handshake duration timings for all client measurements gathered in the first phase of the experiment, as well as breakdowns into the top five operating systems. The operating system breakdowns shown are restricted to only desktop/laptop devices except for Android, which consists of only mobile devices.

It’s clear from the above plots that for most clients, CECPQ2b performs worse than CECPQ2 and CONTROL. Thus, the small key size of CECPQ2b does not make up for its large computational cost—the ostrich outpaces the turkey.

### Digging a little deeper

This means we’re done, right? Not quite. We are interested in determining if there are any populations of TLS clients for which CECPQ2b consistency outperforms CECPQ2. This requires taking a closer look at the long tail of handshake durations. The below plots show cumulative distribution functions (CDFs) of handshake timings zoomed in on the 80th percentile (e.g., showing the top 20% of slowest handshakes).

Here, we start to see something interesting. For Android, Linux, and Windows devices, there is a crossover point where CECPQ2b actually starts to outperform CECPQ2 (Android: ~94th percentile, Linux: ~92nd percentile, Windows: ~95th percentile). macOS and ChromeOS do not appear to have these crossover points.

These effects are small but statistically significant in some cases. The below table shows approximate 95% confidence intervals for the 50th (median), 95th, and 99th percentiles of handshake durations for each key exchange group and device type, calculated using Maritz-Jarrett estimators. The numbers within square brackets give the lower and upper bounds on our estimates for each percentile of the “true” distribution of handshake durations based on the samples collected in the experiment. For example, with a 95% confidence level we can say that the 99th percentile of handshake durations for CECPQ2 on Android devices lies between 4057ms and 4478ms, while the 99th percentile for CECPQ2b lies between 3276ms and 3646ms. Since the intervals do not overlap, we say that with statistical significance, the experiment indicates that CECPQ2b performs better than CECPQ2 for the slowest 1% of Android connections. Configurations where CECPQ2 or CECPQ2b outperforms the other with statistical significance are marked with green in the table.

### Per-client comparison

A second phase of the experiment directly examined the performance of each key exchange algorithm for individual clients, where a client is defined to be a unique (anonymized) IP address and user agent pair. Instead of choosing a single key exchange algorithm for the duration of the experiment, clients randomly selected one of the experiment configurations for each new connection. Although the duration and sample size were limited for this phase of the experiment, we collected at least three handshake measurements for each group configuration from 3900 unique clients.

The plot below shows for each of these clients the difference in latency between CECPQ2 and CECPQ2b, taking the minimum latency sample for each key exchange group as the representative value. The CDF plot shows that for 80% of clients, CECPQ2 outperformed or matched CECPQ2b, and for 99% of clients, the latency gap remained within 70ms. At a high level, this indicates that very few clients performed significantly worse with CECPQ2 over CECPQ2b.

### Do other factors impact the latency gap?

We looked at a number of other factors—including session resumption, IP version, and network location—to see if they impacted the latency gap between CECPQ2 and CECPQ2b. These factors impacted the overall handshake latency, but we did not find that any made a significant impact on the latency gap between post-quantum ciphers. We share some interesting observations from this analysis below.

#### Session resumption

Approximately 53% of all connections in the experiment were completed with TLS handshake resumption. However, the percentage of resumed connections varied significantly based on the device configuration. Connections from mobile devices were only resumed ~25% of the time, while between 40% and 70% of connections from laptop/desktop devices were resumed. Additionally, resumption provided between a 30% and 50% speedup for all device types.

#### IP version

We also examined the impact of IP version on handshake latency. Only 12.5% of the connections in the experiment used IPv6. These connections were 20-40% faster than IPv4 connections for desktop/laptop devices, but ~15% slower for mobile devices. This could be an artifact of IPv6 being generally deployed on newer devices with faster processors. For Android, the experiment was only run on devices with more modern processors, which perhaps eliminated the bias.

#### Network location

The slow connections making up the long tail of handshake durations were not isolated to a few countries, Autonomous Systems (ASes), or subnets, but originated from a globally diverse set of clients. We did not find a correlation between the relative performance of the two post-quantum key exchange algorithms based on these factors.

## Discussion

We found that CECPQ2 (the ostrich) outperformed CECPQ2 (the turkey) for the majority of connections in the experiment, indicating that fast algorithms with large keys may be more suitable for TLS than slow algorithms with small keys. However, we observed the opposite—that CECPQ2b outperformed CECPQ2—for the slowest connections on some devices, including Windows computers and Android mobile devices. One possible explanation for this is packet fragmentation and packet loss. The maximum size of TCP packets that can be sent across a network is limited by the maximum transmission unit (MTU) of the network path, which is often ~1400 bytes. During the TLS handshake the server responds to the client with its public key and ciphertext, the combined size of which exceeds the MTU, so it is likely that handshake messages must be split across multiple TCP packets. This increases the risk of lost packets and delays due to retransmission. A repeat of this experiment that includes collection of fine-grained TCP telemetry could confirm this hypothesis.

A somewhat surprising result of this experiment is just how fast HRSS performs for the majority of connections. Recall that the CECPQ2 cipher performs key exchange operations for both X25519 and HRSS, but the additional overhead of HRSS is barely noticeable. Comparing benchmark results, we can see that HRSS will be faster than X25519 on the server side and slower on the client side.

In our design, the client side performs two operations—key generation and KEM decapsulation. Looking at those two operations we can see that the key generation is a bottleneck here.

Key generation: 	3553.5 [ops/sec]
KEM decapsulation: 	17186.7 [ops/sec]


In algorithms with quotient-style keys (like NTRU), the key generation algorithm performs an inversion in the quotient ring—an operation that is quite computationally expensive. Alternatively, a TLS implementation could generate ephemeral keys ahead of time in order to speed up key exchange. There are several other lattice-based key exchange candidates that may be worth experimenting with in the context of TLS key exchange, which are based on different underlying principles than the HRSS construction. These candidates have similar key sizes and faster key generation algorithms, but have their own drawbacks. For now, HRSS looks like the more promising algorithm for use in TLS.

In the case of SIKE, we implemented the most recent version of the algorithm, and instantiated it with the most performance-efficient parameter set for our experiment. The algorithm is computationally expensive, so we were required to use assembly to optimize it. In order to ensure best performance on Intel, most performance-critical operations have two different implementations; the library detects CPU capabilities and uses faster instructions if available, but otherwise falls back to a slightly slower generic implementation. We developed our own optimizations for 64-bit ARM CPUs. Nevertheless, our results show that SIKE incurred a significant overhead for every connection, especially on devices with weaker processors. It must be noted that high-performance isogeny-based public key cryptography is arguably much less developed than its lattice-based counterparts. Some ideas to develop this are floating around, and we hope to see performance improvements in the future.

# DNS Encryption Explained

Post Syndicated from Peter Wu original https://blog.cloudflare.com/dns-encryption-explained/

The Domain Name System (DNS) is the address book of the Internet. When you visit cloudflare.com or any other site, your browser will ask a DNS resolver for the IP address where the website can be found. Unfortunately, these DNS queries and answers are typically unprotected. Encrypting DNS would improve user privacy and security. In this post, we will look at two mechanisms for encrypting DNS, known as DNS over TLS (DoT) and DNS over HTTPS (DoH), and explain how they work.

Applications that want to resolve a domain name to an IP address typically use DNS. This is usually not done explicitly by the programmer who wrote the application. Instead, the programmer writes something such as fetch("https://example.com/news") and expects a software library to handle the translation of “example.com” to an IP address.

Behind the scenes, the software library is responsible for discovering and connecting to the external recursive DNS resolver and speaking the DNS protocol (see the figure below) in order to resolve the name requested by the application. The choice of the external DNS resolver and whether any privacy and security is provided at all is outside the control of the application. It depends on the software library in use, and the policies provided by the operating system of the device that runs the software.

## The external DNS resolver

The operating system usually learns the resolver address from the local network using Dynamic Host Configuration Protocol (DHCP). In home and mobile networks, it typically ends up using the resolver from the Internet Service Provider (ISP). In corporate networks, the selected resolver is typically controlled by the network administrator. If desired, users with control over their devices can override the resolver with a specific address, such as the address of a public resolver like Google’s 8.8.8.8 or Cloudflare’s 1.1.1.1, but most users will likely not bother changing it when connecting to a public Wi-Fi hotspot at a coffee shop or airport.

The choice of external resolver has a direct impact on the end-user experience. Most users do not change their resolver settings and will likely end up using the DNS resolver from their network provider. The most obvious observable property is the speed and accuracy of name resolution. Features that improve privacy or security might not be immediately visible, but will help to prevent others from profiling or interfering with your browsing activity. This is especially important on public Wi-Fi networks where anyone in physical proximity can capture and decrypt wireless network traffic.

## Unencrypted DNS

Ever since DNS was created in 1987, it has been largely unencrypted. Everyone between your device and the resolver is able to snoop on or even modify your DNS queries and responses. This includes anyone in your local Wi-Fi network, your Internet Service Provider (ISP), and transit providers. This may affect your privacy by revealing the domain names that are you are visiting.

What can they see? Well, consider this network packet capture taken from a laptop connected to a home network:

The following observations can be made:

• The UDP source port is 53 which is the standard port number for unencrypted DNS. The UDP payload is therefore likely to be a DNS answer.
• That suggests that the source IP address 192.168.2.254 is a DNS resolver while the destination IP 192.168.2.14 is the DNS client.
• The UDP payload could indeed be parsed as a DNS answer, and reveals that the user was trying to visit twitter.com.
• If there are any future connections to 104.244.42.129 or 104.244.42.1, then it is most likely traffic that is directed at “twitter.com”.
• If there is some further encrypted HTTPS traffic to this IP, succeeded by more DNS queries, it could indicate that a web browser loaded additional resources from that page. That could potentially reveal the pages that a user was looking at while visiting twitter.com.

Since the DNS messages are unprotected, other attacks are possible:

• Queries could be directed to a resolver that performs DNS hijacking. For example, in the UK, Virgin Media and BT return a fake response for domains that do not exist, redirecting users to a search page. This redirection is possible because the computer/phone blindly trusts the DNS resolver that was advertised using DHCP by the ISP-provided gateway router.
• Firewalls can easily intercept, block or modify any unencrypted DNS traffic based on the port number alone. It is worth noting that plaintext inspection is not a silver bullet for achieving visibility goals, because the DNS resolver can be bypassed.

## Encrypting DNS

Encrypting DNS makes it much harder for snoopers to look into your DNS messages, or to corrupt them in transit. Just as the web moved from unencrypted HTTP to encrypted HTTPS there are now upgrades to the DNS protocol that encrypt DNS itself. Encrypting the web has made it possible for private and secure communications and commerce to flourish. Encrypting DNS will further enhance user privacy.

Two standardized mechanisms exist to secure the DNS transport between you and the resolver, DNS over TLS (2016) and DNS Queries over HTTPS (2018). Both are based on Transport Layer Security (TLS) which is also used to secure communication between you and a website using HTTPS. In TLS, the server (be it a web server or DNS resolver) authenticates itself to the client (your device) using a certificate. This ensures that no other party can impersonate the server (the resolver).

With DNS over TLS (DoT), the original DNS message is directly embedded into the secure TLS channel. From the outside, one can neither learn the name that was being queried nor modify it. The intended client application will be able to decrypt TLS, it looks like this:

In the packet trace for unencrypted DNS, it was clear that a DNS request can be sent directly by the client, followed by a DNS answer from the resolver. In the encrypted DoT case however, some TLS handshake messages are exchanged prior to sending encrypted DNS messages:

• The client sends a Client Hello, advertising its supported TLS capabilities.
• The server responds with a Server Hello, agreeing on TLS parameters that will be used to secure the connection. The Certificate message contains the identity of the server while the Certificate Verify message will contain a digital signature which can be verified by the client using the server Certificate. The client typically checks this certificate against its local list of trusted Certificate Authorities, but the DoT specification mentions alternative trust mechanisms such as public key pinning.
• Once the TLS handshake is Finished by both the client and server, they can finally start exchanging encrypted messages.
• While the above picture contains one DNS query and answer, in practice the secure TLS connection will remain open and will be reused for future DNS queries.

Securing unencrypted protocols by slapping TLS on top of a new port has been done before:

• Web traffic: HTTP (tcp/80) -> HTTPS (tcp/443)
• Sending email: SMTP (tcp/25) -> SMTPS (tcp/465)
• Receiving email: IMAP (tcp/143) -> IMAPS (tcp/993)
• Now: DNS (tcp/53 or udp/53) -> DoT (tcp/853)

A problem with introducing a new port is that existing firewalls may block it. Either because they employ a whitelist approach where new services have to be explicitly enabled, or a blocklist approach where a network administrator explicitly blocks a service. If the secure option (DoT) is less likely to be available than its insecure option, then users and applications might be tempted to try to fall back to unencrypted DNS. This subsequently could allow attackers to force users to an insecure version.

Such fallback attacks are not theoretical. SSL stripping has previously been used to downgrade HTTPS websites to HTTP, allowing attackers to steal passwords or hijack accounts.

Another approach, DNS Queries over HTTPS (DoH), was designed to support two primary use cases:

• Prevent the above problem where on-path devices interfere with DNS. This includes the port blocking problem above.
• Enable web applications to access DNS through existing browser APIs.
DoH is essentially HTTPS, the same encrypted standard the web uses, and reuses the same port number (tcp/443). Web browsers have already deprecated non-secure HTTP in favor of HTTPS. That makes HTTPS a great choice for securely transporting DNS messages. An example of such a DoH request can be found here.

Some users have been concerned that the use of HTTPS could weaken privacy due to the potential use of cookies for tracking purposes. The DoH protocol designers considered various privacy aspects and explicitly discouraged use of HTTP cookies to prevent tracking, a recommendation that is widely respected. TLS session resumption improves TLS 1.2 handshake performance, but can potentially be used to correlate TLS connections. Luckily, use of TLS 1.3 obviates the need for TLS session resumption by reducing the number of round trips by default, effectively addressing its associated privacy concern.

Using HTTPS means that HTTP protocol improvements can also benefit DoH. For example, the in-development HTTP/3 protocol, built on top of QUIC, could offer additional performance improvements in the presence of packet loss due to lack of head-of-line blocking. This means that multiple DNS queries could be sent simultaneously over the secure channel without blocking each other when one packet is lost.

A draft for DNS over QUIC (DNS/QUIC) also exists and is similar to DoT, but without the head-of-line blocking problem due to the use of QUIC. Both HTTP/3 and DNS/QUIC, however, require a UDP port to be accessible. In theory, both could fall back to DoH over HTTP/2 and DoT respectively.

## Deployment of DoT and DoH

As both DoT and DoH are relatively new, they are not universally deployed yet. On the server side, major public resolvers including Cloudflare’s 1.1.1.1 and Google DNS support it. Many ISP resolvers however still lack support for it. A small list of public resolvers supporting DoH can be found at DNS server sources, another list of public resolvers supporting DoT and DoH can be found on DNS Privacy Public Resolvers.

There are two methods to enable DoT or DoH on end-user devices:

• Add support to applications, bypassing the resolver service from the operating system.
• Add support to the operating system, transparently providing support to applications.

There are generally three configuration modes for DoT or DoH on the client side:

• Off: DNS will not be encrypted.
• Opportunistic mode: try to use a secure transport for DNS, but fallback to unencrypted DNS if the former is unavailable. This mode is vulnerable to downgrade attacks where an attacker can force a device to use unencrypted DNS. It aims to offer privacy when there are no on-path active attackers.
• Strict mode: try to use DNS over a secure transport. If unavailable, fail hard and show an error to the user.

The current state for system-wide configuration of DNS over a secure transport:

• Android 9: supports DoT through its “Private DNS” feature. Modes:
• Opportunistic mode (“Automatic”) is used by default. The resolver from network settings (typically DHCP) will be used.
• Strict mode can be configured by setting an explicit hostname. No IP address is allowed, the hostname is resolved using the default resolver and is also used for validating the certificate. (Relevant source code)
• iOS and Android users can also install the 1.1.1.1 app to enable either DoH or DoT support in strict mode. Internally it uses the VPN programming interfaces to enable interception of unencrypted DNS traffic before it is forwarded over a secure channel.
• Linux with systemd-resolved from systemd 239: DoT through the DNSOverTLS option.

• Off is the default.
• Opportunistic mode can be configured, but no certificate validation is performed.
• Strict mode is available since systemd 243. Any certificate signed by a trusted certificate authority is accepted. However, there is no hostname validation with the GnuTLS backend while the OpenSSL backend expects an IP address.
• In any case, no Server Name Indication (SNI) is sent. The certificate name is not validated, making a man-in-the-middle rather trivial.
• Linux, macOS, and Windows can use a DoH client in strict mode. The cloudflared proxy-dns command uses the Cloudflare DNS resolver by default, but users can override it through the proxy-dns-upstream option.

Web browsers support DoH instead of DoT:

• Firefox 62 supports DoH and provides several Trusted Recursive Resolver (TRR) settings. By default DoH is disabled, but Mozilla is running an experiment to enable DoH for some users in the USA. This experiment currently uses Cloudflare’s 1.1.1.1 resolver, since we are the only provider that currently satisfies the strict resolver policy required by Mozilla. Since many DNS resolvers still do not support an encrypted DNS transport, Mozilla’s approach will ensure that more users are protected using DoH.
• When enabled through the experiment, or through the “Enable DNS over HTTPS” option at Network Settings, Firefox will use opportunistic mode (network.trr.mode=2 at about:config).
• Strict mode can be enabled with network.trr.mode=3, but requires an explicit resolver IP to be specified (for example, network.trr.bootstrapAddress=1.1.1.1).
• While Firefox ignores the default resolver from the system, it can be configured with alternative resolvers. Additionally, enterprise deployments who use a resolver that does not support DoH have the option to disable DoH.
• Chrome 78 enables opportunistic DoH if the system resolver address matches one of the hard-coded DoH providers (source code change). This experiment is enabled for all platforms except Linux and iOS, and excludes enterprise deployments by default.
• Opera 65 adds an option to enable DoH through Cloudflare’s 1.1.1.1 resolver. This feature is off by default. Once enabled, it appears to use opportunistic mode: if 1.1.1.1:443 (without SNI) is reachable, it will be used. Otherwise it falls back to the default resolver, unencrypted.

The DNS over HTTPS page from the curl project has a comprehensive list of DoH providers and additional implementations.

As an alternative to encrypting the full network path between the device and the external DNS resolver, one can take a middle ground: use unencrypted DNS between devices and the gateway of the local network, but encrypt all DNS traffic between the gateway router and the external DNS resolver. Assuming a secure wired or wireless network, this would protect all devices in the local network against a snooping ISP, or other adversaries on the Internet. As public Wi-Fi hotspots are not considered secure, this approach would not be safe on open Wi-Fi networks. Even if it is password-protected with WPA2-PSK, others will still be able to snoop and modify unencrypted DNS.

## Other security considerations

The previous sections described secure DNS transports, DoH and DoT. These will only ensure that your client receives the untampered answer from the DNS resolver. It does not, however, protect the client against the resolver returning the wrong answer (through DNS hijacking or DNS cache poisoning attacks). The “true” answer is determined by the owner of a domain or zone as reported by the authoritative name server. DNSSEC allows clients to verify the integrity of the returned DNS answer and catch any unauthorized tampering along the path between the client and authoritative name server.

However deployment of DNSSEC is hindered by middleboxes that incorrectly forward DNS messages, and even if the information is available, stub resolvers used by applications might not even validate the results. A report from 2016 found that only 26% of users use DNSSEC-validating resolvers.

DoH and DoT protect the transport between the client and the public resolver. The public resolver may have to reach out to additional authoritative name servers in order to resolve a name. Traditionally, the path between any resolver and the authoritative name server uses unencrypted DNS. To protect these DNS messages as well, we did an experiment with Facebook, using DoT between 1.1.1.1 and Facebook’s authoritative name servers. While setting up a secure channel using TLS increases latency, it can be amortized over many queries.

Transport encryption ensures that resolver results and metadata are protected. For example, the EDNS Client Subnet (ECS) information included with DNS queries could reveal the original client address that started the DNS query. Hiding that information along the path improves privacy. It will also prevent broken middle-boxes from breaking DNSSEC due to issues in forwarding DNS.

## Operational issues with DNS encryption

DNS encryption may bring challenges to individuals or organizations that rely on monitoring or modifying DNS traffic. Security appliances that rely on passive monitoring watch all incoming and outgoing network traffic on a machine or on the edge of a network. Based on unencrypted DNS queries, they could potentially identify machines which are infected with malware for example. If the DNS query is encrypted, then passive monitoring solutions will not be able to monitor domain names.

Some parties expect DNS resolvers to apply content filtering for purposes such as:

• Blocking domains used for malware distribution.
• Perform parental control filtering, blocking domains associated with adult content.
• Block access to domains serving illegal content according to local regulations.
• Offer a split-horizon DNS to provide different answers depending on the source network.

An advantage of blocking access to domains via the DNS resolver is that it can be centrally done, without reimplementing it in every single application. Unfortunately, it is also quite coarse. Suppose that a website hosts content for multiple users at example.com/videos/for-kids/ and example.com/videos/for-adults/. The DNS resolver will only be able to see “example.com” and can either choose to block it or not. In this case, application-specific controls such as browser extensions would be more effective since they can actually look into the URLs and selectively prevent content from being accessible.

DNS monitoring is not comprehensive. Malware could skip DNS and hardcode IP addresses, or use alternative methods to query an IP address. However, not all malware is that complicated, so DNS monitoring can still serve as a defence-in-depth tool.

All of these non-passive monitoring or DNS blocking use cases require support from the DNS resolver. Deployments that rely on opportunistic DoH/DoT upgrades of the current resolver will maintain the same feature set as usually provided over unencrypted DNS. Unfortunately this is vulnerable to downgrades, as mentioned before. To solve this, system administrators can point endpoints to a DoH/DoT resolver in strict mode. Ideally this is done through secure device management solutions (MDM, group policy on Windows, etc.).

## Conclusion

One of the cornerstones of the Internet is mapping names to an address using DNS. DNS has traditionally used insecure, unencrypted transports. This has been abused by ISPs in the past for injecting advertisements, but also causes a privacy leak. Nosey visitors in the coffee shop can use unencrypted DNS to follow your activity. All of these issues can be solved by using DNS over TLS (DoT) or DNS over HTTPS (DoH). These techniques to protect the user are relatively new and are seeing increasing adoption.

From a technical perspective, DoH is very similar to HTTPS and follows the general industry trend to deprecate non-secure options. DoT is a simpler transport mode than DoH as the HTTP layer is removed, but that also makes it easier to be blocked, either deliberately or by accident.

Secondary to enabling a secure transport is the choice of a DNS resolver. Some vendors will use the locally configured DNS resolver, but try to opportunistically upgrade the unencrypted transport to a more secure transport (either DoT or DoH). Unfortunately, the DNS resolver usually defaults to one provided by the ISP which may not support secure transports.

Mozilla has adopted a different approach. Rather than relying on local resolvers that may not even support DoH, they allow the user to explicitly select a resolver. Resolvers recommended by Mozilla have to satisfy high standards to protect user privacy. To ensure that parental control features based on DNS remain functional, and to support the split-horizon use case, Mozilla has added a mechanism that allows private resolvers to disable DoH.

The DoT and DoH transport protocols are ready for us to move to a more secure Internet. As can be seen in previous packet traces, these protocols are similar to existing mechanisms to secure application traffic. Once this security and privacy hole is closed, there will be many more to tackle.

Post Syndicated from Alex Davidson original https://blog.cloudflare.com/supporting-the-latest-version-of-the-privacy-pass-protocol/

At Cloudflare, we are committed to supporting and developing new privacy-preserving technologies that benefit all Internet users. In November 2017, we announced server-side support for the Privacy Pass protocol, a piece of work developed in collaboration with the academic community. Privacy Pass, in a nutshell, allows clients to provide proof of trust without revealing where and when the trust was provided. The aim of the protocol is then to allow anyone to prove they are trusted by a server, without that server being able to track the user via the trust that was assigned.

On a technical level, Privacy Pass clients receive attestation tokens from a server, that can then be redeemed in the future. These tokens are provided when a server deems the client to be trusted; for example, after they have logged into a service or if they prove certain characteristics. The redeemed tokens are cryptographically unlinkable to the attestation originally provided by the server, and so they do not reveal anything about the client.

To use Privacy Pass, clients can install an open-source browser extension available in Chrome & Firefox. There have been over 150,000 individual downloads of Privacy Pass worldwide; approximately 130,000 in Chrome and more than 20,000 in Firefox. The extension is supported by Cloudflare to make websites more accessible for users. This complements previous work, including the launch of Cloudflare onion services to help improve accessibility for users of the Tor Browser.

The initial release was almost two years ago, and it was followed up with a research publication that was presented at the Privacy Enhancing Technologies Symposium 2018 (winning a Best Student Paper award). Since then, Cloudflare has been working with the wider community to build on the initial design and improve Privacy Pass. We’ll be talking about the work that we have done to develop the existing implementations, alongside the protocol itself.

# What’s new?

Support for Privacy Pass v2.0 browser extension:

• Easier configuration of workflow.
• Integration with new service provider (hCaptcha).
• Compliance with hash-to-curve draft.
• Possible to rotate keys outside of extension release.
• Available in Chrome and Firefox (works best with up-to-date browser versions).

Rolling out a new server backend using Cloudflare Workers platform:

• Cryptographic operations performed using internal V8 engine.
• Provides public redemption API for Cloudflare Privacy Pass v2.0 tokens.
• Available by making POST requests to https://privacypass.cloudflare.com/api/redeem. See the documentation for example usage.
• Only compatible with extension v2.0 (check that you have updated!).

Standardization:

• Continued development of oblivious pseudorandom functions (OPRFs) draft in prime-order groups with [email protected]
• New draft specifying Privacy Pass protocol.

# Extension v2.0

In the time since the release, we’ve been working on a number of new features. Today we’re excited to announce support for version 2.0 of the extension, the first update since the original release. The extension continues to be available for Chrome and Firefox. You may need to download v2.0 manually from the store if you have auto-updates disabled in your browser.

The extension remains under active development and we still regard our support as in the beta phase. This will continue to be the case as the draft specification of the protocol continues to be written in collaboration with the wider community.

### New Integrations

The client implementation uses the WebRequest API to look for certain types of HTTP requests. When these requests are spotted, they are rewritten to include some cryptographic data required for the Privacy Pass protocol. This allows Privacy Pass providers receiving this data to authorize access for the user.

For example, a user may receive Privacy Pass tokens for completing some server security checks. These tokens are stored by the browser extension, and any future request that needs similar security clearance can be modified to add a stored token as an extra HTTP header. The server can then check the client token and verify that the client has the correct authorization to proceed.

While Cloudflare supports a particular type of request flow, it would be impossible to expect different service providers to all abide by the same exact interaction characteristics. One of the major changes in the v2.0 extension has been a technical rewrite to instead use a central configuration file. The config is specified in the source code of the extension and allows easier modification of the browsing characteristics that initiate Privacy Pass actions. This makes adding new, completely different request flows possible by simply cloning and adapting the configuration for new providers.

To demonstrate that such integrations are now possible with other services beyond Cloudflare, a new version of the extension will soon be rolling out that is supported by the CAPTCHA provider hCaptcha. Users that solve ephemeral challenges provided by hCaptcha will receive privacy-preserving tokens that will be redeemable at other hCaptcha customer sites.

“hCaptcha is focused on user privacy, and supporting Privacy Pass is a natural extension of our work in this area. We look forward to working with Cloudflare and others to make this a common and widely adopted standard, and are currently exploring other applications. Implementing Privacy Pass into our globally distributed service was relatively straightforward, and we have enjoyed working with the Cloudflare team to improve the open source Chrome browser extension in order to deliver the best experience for our users.” – Eli-Shaoul Khedouri, founder of hCaptcha

This hCaptcha integration with the Privacy Pass browser extension acts as a proof-of-concept in establishing support for new services. Any new providers that would like to integrate with the Privacy Pass browser extension can do so simply by making a PR to the open-source repository.

## Improved cryptographic functionality

After the release of v1.0 of the extension, there were features that were still unimplemented. These included proper zero-knowledge proof validation for checking that the server was always using the same committed key. In v2.0 this functionality has been completed, verifiably preventing a malicious server from attempting to deanonymize users by using a different key for each request.

The cryptographic operations required for Privacy Pass are performed using elliptic curve cryptography (ECC). The extension currently uses the NIST P-256 curve, for which we have included some optimisations. Firstly, this makes it possible to store elliptic curve points in compressed and uncompressed data formats. This means that browser storage can be reduced by 50%, and that server responses can be made smaller too.

Secondly, support has been added for hashing to the P-256 curve using the “Simplified Shallue-van de Woestijne-Ulas” (SSWU) method specified in an ongoing draft (https://tools.ietf.org/html/draft-irtf-cfrg-hash-to-curve-03) for standardizing encodings for hashing to elliptic curves. The implementation is compliant with the specification of the “P256-SHA256-SSWU-” ciphersuite in this draft.

These changes have a dual advantage, firstly ensuring that the P-256 hash-to-curve specification is compliant with the draft specification. Secondly this ciphersuite removes the necessity for using probabilistic methods, such as hash-and-increment. The hash-and-increment method has a non-negligible chance of failure, and the running time is highly dependent on the hidden client input. While it is not clear how to abuse timing attack vectors currently, using the SSWU method should reduce the potential for attacking the implementation, and learning the client input.

## Key rotation

As we mentioned above, verifying that the server is always using the same key is an important part of ensuring the client’s privacy. This ensures that the server cannot segregate the user base and reduce client privacy by using different secret keys for each client that it interacts with. The server guarantees that it’s always using the same key by publishing a commitment to its public key somewhere that the client can access.

Every time the server issues Privacy Pass tokens to the client, it also produces a zero-knowledge proof that it has produced these tokens using the correct key.

Before the extension stores any tokens, it first verifies the proof against the commitments it knows. Previously, these commitments were stored directly in the source code of the extension. This meant that if the server wanted to rotate its key, then it required releasing a new version of the extension, which was unnecessarily difficult. The extension has been modified so that the commitments are stored in a trusted location that the client can access when it needs to verify the server response. Currently this location is a separate Privacy Pass GitHub repository. For those that are interested, we have provided a more detailed description of the new commitment format in Appendix A at the end of this post.

# Implementing server-side support in Workers

So far we have focused on client-side updates. As part of supporting v2.0 of the extension, we are rolling out some major changes to the server-side support that Cloudflare uses. For version 1.0, we used a Go implementation of the server. In v2.0 we are introducing a new server implementation that runs in the Cloudflare Workers platform. This server implementation is only compatible with v2.0 releases of Privacy Pass, so you may need to update your extension if you have auto-updates turned off in your browser.

Our server will run at https://privacypass.cloudflare.com, and all Privacy Pass requests sent to the Cloudflare edge are handled by Worker scripts that run on this domain. Our implementation has been rewritten using Javascript, with cryptographic operations running in the V8 engine that powers Cloudflare Workers. This means that we are able to run highly efficient and constant-time cryptographic operations. On top of this, we benefit from the enhanced performance provided by running our code in the Workers Platform, as close to the user as possible.

## WebCrypto support

Firstly, you may be asking, how do we manage to implement cryptographic operations in Cloudflare Workers? Currently, support for performing cryptographic operations is provided in the Workers platform via the WebCrypto API. This API allows users to compute functionality such as cryptographic hashing, alongside more complicated operations like ECDSA signatures.

In the Privacy Pass protocol, as we’ll discuss a bit later, the main cryptographic operations are performed by a protocol known as a verifiable oblivious pseudorandom function (VOPRF). Such a protocol allows a client to learn function outputs computed by a server, without revealing to the server what their actual input was. The verifiable aspect means that the server must also prove (in a publicly verifiable way) that the evaluation they pass to the user is correct. Such a function is pseudorandom because the server output is indistinguishable from a random sequence of bytes.

The VOPRF functionality requires a server to perform low-level ECC operations that are not currently exposed in the WebCrypto API. We balanced out the possible ways of getting around this requirement. First we trialled trying to use the WebCrypto API in a non-standard manner, using EC Diffie-Hellman key exchange as a method for performing the scalar multiplication that we needed. We also tried to implement all operations using pure JavaScript. Unfortunately both methods were unsatisfactory in the sense that they would either mean integrating with large external cryptographic libraries, or they would be far too slow to be used in a performant Internet setting.

In the end, we settled on a solution that adds functions necessary for Privacy Pass to the internal WebCrypto interface in the Cloudflare V8 Javascript engine. This algorithm mimics the sign/verify interface provided by signature algorithms like ECDSA. In short, we use the sign() function to issue Privacy Pass tokens to the client. While verify() can be used by the server to verify data that is redeemed by the client. These functions are implemented directly in the V8 layer and so they are much more performant and secure (running in constant-time, for example) than pure JS alternatives.

The Privacy Pass WebCrypto interface is not currently available for public usage. If it turns out there is enough interest in using this additional algorithm in the Workers platform, then we will consider making it public.

### Applications

In recent times, VOPRFs have been shown to be a highly useful primitive in establishing many cryptographic tools. Aside from Privacy Pass, they are also essential for constructing password-authenticated key exchange protocols such as OPAQUE. They have also been used in designs of private set intersection, password-protected secret-sharing protocols, and privacy-preserving access-control for private data storage.

## Public redemption API

Writing the server in Cloudflare Workers means that we will be providing server-side support for Privacy Pass on a public domain! While we only issue tokens to clients after we are sure that we can trust them, anyone will be able to redeem the tokens using our public redemption API at https://privacypass.cloudflare.com/api/redeem. As we roll-out the server-side component worldwide, you will be able to interact with this API and verify Cloudflare Privacy Pass tokens independently of the browser extension.

This means that any service can accept Privacy Pass tokens from a client that were issued by Cloudflare, and then verify them with the Cloudflare redemption API. Using the result provided by the API, external services can check whether Cloudflare has authorized the user in the past.

We think that this will benefit other service providers because they can use the attestation of authorization from Cloudflare in their own decision-making processes, without sacrificing the privacy of the client at any stage. We hope that this ecosystem can grow further, with potentially more services providing public redemption APIs of their own. With a more diverse set of issuers, these attestations will become more useful.

By running our server on a public domain, we are effectively a customer of the Cloudflare Workers product. This means that we are also able to make use of Workers KV for protecting against malicious clients. In particular, servers must check that clients are not re-using tokens during the redemption phase. The performance of Workers KV in analyzing reads makes this an obvious choice for providing double-spend protection globally.

If you would like to use the public redemption API, we provide documentation for using it at https://privacypass.github.io/api-redeem. We also provide some example requests and responses in Appendix B at the end of the post.

# Standardization & new applications

In tandem with the recent engineering work that we have been doing on supporting Privacy Pass, we have been collaborating with the wider community in an attempt to standardize both the underlying VOPRF functionality, and the protocol itself. While the process of standardization for oblivious pseudorandom functions (OPRFs) has been running for over a year, the recent efforts to standardize the Privacy Pass protocol have been driven by very recent applications that have come about in the last few months.

Standardizing protocols and functionality is an important way of providing interoperable, secure, and performant interfaces for running protocols on the Internet. This makes it easier for developers to write their own implementations of this complex functionality. The process also provides helpful peer reviews from experts in the community, which can lead to better surfacing of potential security risks that should be mitigated in any implementation. Other benefits include coming to a consensus on the most reliable, scalable and performant protocol designs for all possible applications.

## Oblivious pseudorandom functions

Oblivious pseudorandom functions (OPRFs) are a generalization of VOPRFs that do not require the server to prove that they have evaluated the functionality properly. Since July 2019, we have been collaborating on a draft with the Crypto Forum Research Group (CFRG) at the Internet Research Task Force (IRTF) to standardize an OPRF protocol that operates in prime-order groups. This is a generalisation of the setting that is provided by elliptic curves. This is the same VOPRF construction that was originally specified by the Privacy Pass protocol and is based heavily on the original protocol design from the paper of Jarecki, Kiayias and Krawczyk.

One of the recent changes that we’ve made in the draft is to increase the size of the key that we consider for performing OPRF operations on the server-side. Existing research suggests that it is possible to create specific queries that can lead to small amounts of the key being leaked. For keys that provide only 128 bits of security this can be a problem as leaking too many bits would reduce security beyond currently accepted levels. To counter this, we have effectively increased the minimum key size to 196 bits. This prevents this leakage becoming an attack vector using any practical methods. We discuss these attacks in more detail later on when discussing our future plans for VOPRF development.

## Recent applications and standardizing the protocol

The application that we demonstrated when originally supporting Privacy Pass was always intended as a proof-of-concept for the protocol. Over the past few months, a number of new possibilities have arisen in areas that go far beyond what was previously envisaged.

For example, the trust token API, developed by the Web Incubator Community Group, has been proposed as an interface for using Privacy Pass. This applications allows third-party vendors to check that a user has received a trust attestation from a set of central issuers. This allows the vendor to make decisions about the honesty of a client without having to associate a behaviour profile with the identity of the user. The objective is to prevent against fraudulent activity from users who are not trusted by the central issuer set. Checking trust attestations with central issuers would be possible using similar redemption APIs to the one that we have introduced.

A separate piece of work from Facebook details a similar application for preventing fraudulent behavior that may also be compatible with the Privacy Pass protocol. Finally, other applications have arisen in the areas of providing access to private storage and establishing security and privacy models in advertisement confirmations.

### A new draft

With the applications above in mind, we have recently started collaborative work on a new IETF draft that specifically lays out the required functionality provided by the Privacy Pass protocol as a whole. Our aim is to develop, alongside wider industrial partners and the academic community, a functioning specification of the Privacy Pass protocol. We hope that by doing this we will be able to design a base-layer protocol, that can then be used as a cryptographic primitive in wider applications that require some form of lightweight authorization. Our plan is to present the first version of this draft at the upcoming IETF 106 meeting in Singapore next month.

The draft is still in the early stages of development and we are actively looking for people who are interested in helping to shape the protocol specification. We would be grateful for any help that contributes to this process. See the GitHub repository for the current version of the document.

# Future avenues

Finally, while we are actively working on a number of different pathways in the present, the future directions for the project are still open. We believe that there are many applications out there that we have not considered yet and we are excited to see where the protocol is used in the future. Here are some other ideas we have for novel applications and security properties that we think might be worth pursuing in future.

## Publicly verifiable tokens

One of the disadvantages of using a VOPRF is that redemption tokens are only verifiable by the original issuing server. If we used an underlying primitive that allowed public verification of redemption tokens, then anyone could verify that the issuing server had issued the particular token. Such a protocol could be constructed on top of so-called blind signature schemes, such as Blind RSA. Unfortunately, there are performance and security concerns arising from the usage of blind signature schemes in a browser environment. Existing schemes (especially RSA-based variants) require cryptographic computations that are much heavier than the construction used in our VOPRF protocol.

## Post-quantum VOPRF alternatives

The only constructions of VOPRFs exist in pre-quantum settings, usually based on the hardness of well-known problems in group settings such as the discrete-log assumption. No constructions of VOPRFs are known to provide security against adversaries that can run quantum computational algorithms. This means that the Privacy Pass protocol is only believed to be secure against adversaries running  on classical hardware.

Recent developments suggest that quantum computing may arrive sooner than previously thought. As such, we believe that investigating the possibility of constructing practical post-quantum alternatives for our current cryptographic toolkit is a task of great importance for ourselves and the wider community. In this case, devising performant post-quantum alternatives for VOPRF constructions would be an important theoretical advancement. Eventually this would lead to a Privacy Pass protocol that still provides privacy-preserving authorization in a post-quantum world.

## VOPRF security and larger ciphersuites

We mentioned previously that VOPRFs (or simply OPRFs) are susceptible to small amounts of possible leakage in the key. Here we will give a brief description of the actual attacks themselves, along with further details on our plans for implementing higher security ciphersuites to mitigate the leakage.

Specifically, malicious clients can interact with a VOPRF for creating something known as a q-Strong-Diffie-Hellman (q-sDH) sample. Such samples are created in mathematical groups (usually in the elliptic curve setting). For any group there is a public element g that is central to all Diffie-Hellman type operations, along with the server key K, which is usually just interpreted as a randomly generated number from this group. A q-sDH sample takes the form:

( g, g^K, g^(K^2), … , g^(K^q) )


and asks the malicious adversary to create a pair of elements satisfying (g^(1/(s+K)),s). It is possible for a client in the VOPRF protocol to create a q-SDH sample by just submitting the result of the previous VOPRF evaluation back to the server.

While this problem is believed to be hard to break, there are a number of past works that show that the problem is somewhat easier than the size of the group suggests (for example, see here and here). Concretely speaking, the bit security implied by the group can be reduced by up to log2(q) bits. While this is not immediately fatal, even to groups that should provide 128 bits of security, it can lead to a loss of security that means that the setting is no longer future-proof. As a result, any group providing VOPRF functionality that is instantiated using an elliptic curve such as P-256 or Curve25519 provides weaker than advised security guarantees.

With this in mind, we have taken the recent decision to upgrade the ciphersuites that we recommend for OPRF usage to only those that provide > 128 bits of security, as standard. For example, Curve448 provides 196 bits of security. To launch an attack that reduced security to an amount lower than 128 bits would require making 2^(68) client OPRF queries. This is a significant barrier to entry for any attacker, and so we regard these ciphersuites as safe for instantiating the OPRF functionality.

In the near future, it will be necessary to upgrade the ciphersuites that are used in our support of the Privacy Pass browser extension to the recommendations made in the current VOPRF draft. In general, with a more iterative release process, we hope that the Privacy Pass implementation will be able to follow the current draft standard more closely as it evolves during the standardization process.

## Get in touch!

You can now install v2.0 of the Privacy Pass extension in Chrome or Firefox.

If you would like to help contribute to the development of this extension then you can do so on GitHub. Are you a service provider that would like to integrate server-side support for the extension? Then we would be very interested in hearing from you!

We will continue to work with the wider community in developing the standardization of the protocol; taking our motivation from the available applications that have been developed. We are always looking for new applications that can help to expand the Privacy Pass ecosystem beyond its current boundaries.

# Appendix

Here are some extra details related to the topics that we covered above.

## A. Commitment format for key rotations

Key commitments are necessary for the server to prove that they’re acting honestly during the Privacy Pass protocol. The commitments that Privacy Pass uses for the v2.0 release have a slightly different format from the previous release.

"2.00": {
"H": "BPivZ+bqrAZzBHZtROY72/E4UGVKAanNoHL1Oteg25oTPRUkrYeVcYGfkOr425NzWOTLRfmB8cgnlUfAeN2Ikmg=",
"expiry": "2020-01-11T10:29:10.658286752Z",
"sig": "MEUCIQDu9xeF1q89bQuIMtGm0g8KS2srOPv+4hHjMWNVzJ92kAIgYrDKNkg3GRs9Jq5bkE/4mM7/QZInAVvwmIyg6lQZGE0="
}


First, the version of the server key is 2.00, the server must inform the client which version it intends to use in the response to a client containing issued tokens. This is so that the client can always use the correct commitments when verifying the zero-knowledge proof that the server sends.

The value of the member H is the public key commitment to the secret key used by the server. This is base64-encoded elliptic curve point of the form H=kG where G is the fixed generator of the curve, and k is the secret key of the server. Since the discrete-log problem is believed to be hard to solve, deriving k from H is believed to be difficult. The value of the member expiry is an expiry date for the commitment that is used. The value of the member sig is an ECDSA signature evaluated using a long-term signing key associated with the server, and over the values of H and expiry.

When a client retrieves the commitment, it checks that it hasn’t expired and that the signature verifies using the corresponding verification key that is embedded into the configuration of the extension. If these checks pass, it retrieves H and verifies the issuance response sent by the server. Previous versions of these commitments did not include signatures, but these signatures will be validated from v2.0 onwards.

When a server wants to rotate the key, it simply generates a new key k2 and appends a new commitment to k2 with a new identifier such as 2.01. It can then use k2 as the secret for the VOPRF operations that it needs to compute.

## B. Example Redemption API request

The redemption API at is available over HTTPS by sending POST requests to https://privacypass.cloudflare.com/api/redeem. Requests to this endpoint must specify Privacy Pass data using JSON-RPC 2.0 syntax in the body of the request. Let’s look at an example request:

{
"jsonrpc": "2.0",
"method": "redeem",
"params": {
"data": [
"lB2ZEtHOK/2auhOySKoxqiHWXYaFlAIbuoHQnlFz57A=",
"EoSetsN0eVt6ztbLcqp4Gt634aV73SDPzezpku6ky5w=",
"eyJjdXJ2ZSI6InAyNTYiLCJoYXNoIjoic2hhMjU2IiwibWV0aG9kIjoic3d1In0="
],
"bindings": [
"string1",
"string2"
],
"compressed":"false"
},
"id": 1
}


In the above: params.data[0] is the client input data used to generate a token in the issuance phase; params.data[1] is the HMAC tag that the server uses to verify a redemption; and params.data[2] is a stringified, base64-encoded JSON object that specifies the hash-to-curve parameters used by the client. For example, the last element in the array corresponds to the object:

{
curve: "p256",
hash: "sha256",
method: "swu",
}


Which specifies that the client has used the curve P-256, with hash function SHA-256, and the SSWU method for hashing to curve. This allows the server to verify the transaction with the correct ciphersuite. The client must bind the redemption request to some fixed information, which it stores as multiple strings in the array params.bindings. For example, it could send the Host header of the HTTP request, and the HTTP path that was used (this is what is used in the Privacy Pass browser extension). Finally, params.compressed is an optional boolean value (defaulting to false) that indicates whether the HMAC tag was computed over compressed or uncompressed point encodings.

Currently the only supported ciphersuites are the example above, or the same except with method equal to increment for the hash-and-increment method of hashing to a curve. This is the original method used in v1.0 of Privacy Pass, and is supported for backwards-compatibility only. See the provided documentation for more details.

### Example response

If a request is sent to the redemption API and it is successfully verified, then the following response will be returned.

{
"jsonrpc": "2.0",
"result": "success",
"id": 1
}


When an error occurs something similar to the following will be returned.

{
"jsonrpc": "2.0",
"error": {
"message": <error-message>,
"code": <error-code>,
},
"id": 1
}


The error codes that we provide are specified as JSON-RPC 2.0 codes, we document the types of errors that we provide in the API documentation.

# Tales from the Crypt(o team)

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/tales-from-the-crypt-o-team/

Halloween season is upon us. This week we’re sharing a series of blog posts about work being done at Cloudflare involving cryptography, one of the spookiest technologies around. So bookmark this page and come back every day for tricks, treats, and deep technical content.

## A long-term mission

Cryptography is one of the most powerful technological tools we have, and Cloudflare has been at the forefront of using cryptography to help build a better Internet. Of course, we haven’t been alone on this journey. Making meaningful changes to the way the Internet works requires time, effort, experimentation, momentum, and willing partners. Cloudflare has been involved with several multi-year efforts to leverage cryptography to help make the Internet better.

Here are some highlights to expect this week:

• We’re renewing Cloudflare’s commitment to privacy-enhancing technologies by sharing some of the recent work being done on Privacy Pass
• We’re helping forge a path to a quantum-safe Internet by sharing some of the results of the Post-quantum Cryptography experiment
• We’re sharing the rust-based software we use to power time.cloudflare.com
• We’re doing a deep dive into the technical details of Encrypted DNS
• We’re announcing support for a new technique we developed with industry partners to help keep TLS private keys more secure

The milestones we’re sharing this week would not be possible without partnerships with companies, universities, and individuals working in good faith to help build a better Internet together. Hopefully, this week provides a fun peek into the future of the Internet.

# Introducing time.cloudflare.com

Post Syndicated from Guest Author original https://blog.cloudflare.com/secure-time/

This is a guest post by Aanchal Malhotra, a Graduate Research Assistant at Boston University and former Cloudflare intern on the Cryptography team.

Cloudflare has always been a leader in deploying secure versions of insecure Internet protocols and making them available for free for anyone to use. In 2014, we launched one of the world’s first free, secure HTTPS service (Universal SSL) to go along with our existing free HTTP plan. When we launched the 1.1.1.1 DNS resolver, we also supported the new secure versions of DNS (DNS over HTTPS and DNS over TLS). Today, we are doing the same thing for the Network Time Protocol (NTP), the dominant protocol for obtaining time over the Internet.

This announcement is personal for me. I’ve spent the last four years identifying and fixing vulnerabilities in time protocols. Today I’m proud to help introduce a service that would have made my life from 2015 through 2019 a whole lot harder: time.cloudflare.com, a free time service that supports both NTP and the emerging Network Time Security (NTS) protocol for securing NTP. Now, anyone can get time securely from all our datacenters in 180 cities around the world.

You can use time.cloudflare.com as the source of time for all your devices today with NTP, while NTS clients are still under development. NTPsec includes experimental support for NTS. If you’d like to get updates about NTS client development, email us asking to join at [email protected]. To use NTS to secure time synchronization, reach out to your vendors and inquire about NTS support.

### A small tale of “time” first

Back in 2015, as a fresh graduate student interested in Internet security, I came across this mostly esoteric Internet protocol called the Network Time Protocol (NTP). NTP was designed to synchronize time between computer systems communicating over unreliable and variable-latency network paths. I was actually studying Internet routing security, in particular attacks against the Resource Public Key Infrastructure (RPKI), and kept hitting a dead end because of a cache-flushing issue. As a last-ditch effort I decided to roll back the time on my computer manually, and the attack worked.

I had discovered the importance of time to computer security. Most cryptography uses timestamps to limit certificate and signature validity periods. When connecting to a website, knowledge of the correct time ensures that the certificate you see is current and is not compromised by an attacker. When looking at logs, time synchronization makes sure that events on different machines can be correlated accurately. Certificates and logging infrastructure can break with minutes, hours or months of time difference. Other applications like caching and Bitcoin are sensitive to even very small differences in time on the order of seconds.

Two factor authentication using rolling numbers also rely on accurate clocks. This then creates the need for computer clocks to have access to reasonably accurate time that is securely delivered. NTP is the most commonly used protocol for time synchronization on the Internet. If an attacker can leverage vulnerabilities in NTP to manipulate time on computer clocks, they can undermine the security guarantees provided by these systems.

Motivated by the severity of the issue, I decided to look deeper into NTP and its security. Since the need for synchronizing time across networks was visible early on, NTP is a very old protocol. The first standardized version of NTP dates back to 1985, while the latest NTP version 4 was completed in 2010 (see RFC5905).

In its most common mode, NTP works by having a client send a query packet out to an NTP server that then responds with its clock time. The client then computes an estimate of the difference between its clock and the remote clock and attempts to compensate for network delay in this. NTP client queries multiple servers and implements algorithms to select the best estimate, and rejects clearly wrong answers.

Surprisingly enough, research on NTP and its security was not very active at the time. Before this, in late 2013 and early 2014, high-profile Distributed Denial of Service (DDoS) attacks were carried out by amplifying traffic from NTP servers; attackers able to spoof a victim’s IP address were able to funnel copious amounts of traffic overwhelming the targeted domains. This caught the attention of some researchers. However, these attacks did not exploit flaws in the fundamental protocol design. The attackers simply used NTP as a boring bandwidth multiplier. Cloudflare wrote extensively about these attacks and you can read about it here, here, and here.

I found several flaws in the core NTP protocol design and its implementation that can be exploited by network attackers to launch much more devastating attacks by shifting time or denying service to NTP clients. What is even more concerning was that these attackers do not need to be a Monster-In-The-Middle (MITM), where an attacker can modify traffic between the client and the server, to mount these attacks. A set of recent papers authored by one of us showed that an off-path attacker present anywhere on the network can shift time or deny service to NTP clients. One of the ways this is done is by abusing IP fragmentation.

Fragmentation is a feature of the IP layer where a large packet is chopped into several smaller fragments so that they can pass through the networks that do not support large packets. Basically, any random network element on the path between the client and the server can send a special “ICMP fragmentation needed” packet to the server telling them to fragment the packet to say X bytes. Since the server is not expected to know the IP addresses of all the network elements on its path, this packet can be sent from any source IP.

In our attack, the attacker exploits this feature to make the NTP server fragment its NTP response packet for the victim NTP client. The attacker then spoofs carefully crafted overlapping response fragments from off-path that contain the attacker’s timestamp values. By further exploiting the reassembly policies for overlapping fragments the attacker fools the client into assembling a packet with legitimate fragments and the attacker’s insertions. This evades the authenticity checks that rely on values in the original parts of the packet.

### NTP’s past and future

At the time of NTP’s creation back in 1985, there were two main design goals for the service provided by NTP. First, they wanted it to be robust enough to handle networking errors and equipment failures. So it was designed as a service where client can gather timing samples from multiple peers over multiple communication paths and then average them to get more accurate measurement.

The second goal was load distribution. While every client would like to talk to time servers which are directly attached to high precision time-keeping devices like atomic clocks, GPS, etc, and thus have more accurate time, the capacity of those devices is only so much. So, to reduce protocol load on the network, the service was designed in a hierarchical manner. At the top of the hierarchy are servers connected to non-NTP time sources, that distribute time to other servers, that further distribute time to even more servers. Most computers connect to either these second or third level servers.

The original specification (RFC 958) also states the “non-goals” of the protocol, namely peer authentication and data integrity. Security wasn’t considered critical in the relatively small and trusting early Internet, and the protocols and applications that rely on time for security didn’t exist then. Securing NTP came second to improving the protocol and implementation.

As the Internet has grown, more and more core Internet protocols have been secured through cryptography to protect against abuse: TLS, DNSSEC, RPKI are all steps toward ensuring the security of all communications on the Internet. These protocols use “time” to provide security guarantees. Since security of Internet hinges on the security of NTP, it becomes even more important to secure NTP.

This research perspicuously showed the need for securing NTP. As a result, there was more work in the standards body for Internet Protocols, the Internet Engineering Task Force (IETF) towards cryptographically authenticating NTP. At the time, even though NTPv4 supported both symmetric and asymmetric cryptographic authentication, it was rarely used in practice due to limitations of both approaches.

NTPv4’s symmetric approach to securing synchronization doesn’t scale as the symmetric key must be pre-shared and configured manually: imagine if every client on earth needed a special secret key with the servers they wanted to get time from, the organizations that run those servers would have to do a great deal of work managing keys. This makes this solution quite cumbersome for public servers that must accept queries from arbitrary clients. For context, NIST operates important public time servers and distributes symmetric keys only to users that register, once per year, via US mail or facsimile; the US Naval Office does something similar.

The first attempt to solve the problem of key distribution was the Autokey protocol, described in RFC 5906. Many public NTP servers do not support Autokey (e.g., the NIST and USNO time servers, and many servers in pool.ntp.org). The protocol is badly broken as any network attacker can trivially retrieve the secret key shared between the client and server. The authentication mechanisms are non-standard and quite idiosyncratic.

The future of the Internet is a secure Internet, which means an authenticated and encrypted Internet. But until now NTP remains mostly insecure, despite continuing protocol development. In the meantime more and more services depended on it.

### Fixing the problem

Following the release of our paper, there was a lot more enthusiasm in the NTP community at standards body for Internet Protocols, the Internet Engineering Task Force (IETF) and outside for improving the state of NTP security. As a short-term fix, the ntpd reference implementation software was patched for several vulnerabilities that we found. And for a long-term solution, the community realized the dire need for a secure, authenticated time synchronization protocol based on public-key cryptography, which enables encryption and authentication without requiring the sharing of key material beforehand. Today we have a Network Time Security (NTS) draft at the IETF, thanks to the work of dozens of dedicated individuals at the NTP working group.

In a nutshell, the NTS protocol is divided into two-phases. The first phase is the NTS key exchange that establishes the necessary key material between the NTP client and the server. This phase uses the Transport Layer Security (TLS) handshake and relies on the same public key infrastructure as the web. Once the keys are exchanged, the TLS channel is closed and the protocol enters the second phase. In this phase the results of that TLS handshake are used to authenticate NTP time synchronization packets via extension fields. The interested reader can find more information in the Internet draft.

### Cloudflare’s new service

Today, Cloudflare announces its free time service to anyone on the Internet. We intend to solve the limitations with the existing public time services, in particular by increasing availability, robustness and security.

We use our global network to provide an advantage in latency and accuracy. Our 180 locations around the world all use anycast to automatically route your packets to our closest server. All of our servers are synchronized with stratum 1 time service providers, and then offer NTP to the general public, similar to how other public NTP providers function. The biggest source of inaccuracy for time synchronization protocols is the network asymmetry, leading to a difference in travel times between the client and server and back from the server to the client. However, our servers’ proximity to users means there will be less jitter — a measurement of variance in latency on the network — and possible asymmetry in packet paths. We also hope that in regions with a dearth of NTP servers our service significantly improves the capacity and quality of the NTP ecosystem.

Cloudflare servers obtain authenticated time by using a shared symmetric key with our stratum 1 upstream servers. These upstream servers are geographically spread and ensure that our servers have accurate time in our datacenters. But this approach to securing time doesn’t scale. We had to exchange emails individually with the organizations that run stratum 1 servers, as well as negotiate permission to use them. While this is a solution for us, it isn’t a solution for everyone on the Internet.

As a secure time service provider Cloudflare is proud to announce that we are among the first to offer a free and secure public time service based on Network Time Security. We have implemented the latest NTS IETF draft. As this draft progresses through the Internet standards process we are committed to keeping our service current.

Most NTP implementations are currently working on NTS support, and we expect that the next few months will see broader introduction as well as advancement of the current draft protocol to an RFC. Currently we have interoperability with NTPsec who have implemented draft 18 of NTS. We hope that our service will spur faster adoption of this important improvement to Internet security. Because this is a new service with no backwards compatibility requirements, we are requiring the use of TLS v1.3 with it to promote adoption of the most secure version of TLS.

### Use it

If you have an NTS client, point it at time.cloudflare.com:1234. Otherwise point your NTP client at time.cloudflare.com. More details on configuration are available in the developer docs.

### Conclusion

From our Roughtime service to Universal SSL Cloudflare has played a role in expanding the availability and use of secure protocols. Now with our free public time service we provide a trustworthy, widely available alternative to another insecure legacy protocol. It’s all a part of our mission to help make a faster, reliable, and more secure Internet for everyone.

Thanks to the many other engineers who worked on this project, including Watson Ladd, Gabbi Fisher, and Dina Kozlov

# Introducing time.cloudflare.com

Post Syndicated from Guest Author original https://blog.cloudflare.com/secure-time/

This is a guest post by Aanchal Malhotra, a Graduate Research Assistant at Boston University and former Cloudflare intern on the Cryptography team.

Cloudflare has always been a leader in deploying secure versions of insecure Internet protocols and making them available for free for anyone to use. In 2014, we launched one of the world’s first free, secure HTTPS service (Universal SSL) to go along with our existing free HTTP plan. When we launched the 1.1.1.1 DNS resolver, we also supported the new secure versions of DNS (DNS over HTTPS and DNS over TLS). Today, as part of Crypto Week 2019, we are doing the same thing for the Network Time Protocol (NTP), the dominant protocol for obtaining time over the Internet.

This announcement is personal for me. I’ve spent the last four years identifying and fixing vulnerabilities in time protocols. Today I’m proud to help introduce a service that would have made my life from 2015 through 2019 a whole lot harder: time.cloudflare.com, a free time service that supports both NTP and the emerging Network Time Security (NTS) protocol for securing NTP. Now, anyone can get time securely from all our datacenters in 180 cities around the world.

You can use time.cloudflare.com as the source of time for all your devices today with NTP, while NTS clients are still under development. NTPsec includes experimental support for NTS. If you’d like to get updates about NTS client development, email us asking to join at [email protected]. To use NTS to secure time synchronization, reach out to your vendors and inquire about NTS support.

### A small tale of “time” first

Back in 2015, as a fresh graduate student interested in Internet security, I came across this mostly esoteric Internet protocol called the Network Time Protocol (NTP). NTP was designed to synchronize time between computer systems communicating over unreliable and variable-latency network paths. I was actually studying Internet routing security, in particular attacks against the Resource Public Key Infrastructure (RPKI), and kept hitting a dead end because of a cache-flushing issue. As a last-ditch effort I decided to roll back the time on my computer manually, and the attack worked.

I had discovered the importance of time to computer security. Most cryptography uses timestamps to limit certificate and signature validity periods. When connecting to a website, knowledge of the correct time ensures that the certificate you see is current and is not compromised by an attacker. When looking at logs, time synchronization makes sure that events on different machines can be correlated accurately. Certificates and logging infrastructure can break with minutes, hours or months of time difference. Other applications like caching and Bitcoin are sensitive to even very small differences in time on the order of seconds.

Two factor authentication using rolling numbers also rely on accurate clocks. This then creates the need for computer clocks to have access to reasonably accurate time that is securely delivered. NTP is the most commonly used protocol for time synchronization on the Internet. If an attacker can leverage vulnerabilities in NTP to manipulate time on computer clocks, they can undermine the security guarantees provided by these systems.

Motivated by the severity of the issue, I decided to look deeper into NTP and its security. Since the need for synchronizing time across networks was visible early on, NTP is a very old protocol. The first standardized version of NTP dates back to 1985, while the latest NTP version 4 was completed in 2010 (see RFC5905).

In its most common mode, NTP works by having a client send a query packet out to an NTP server that then responds with its clock time. The client then computes an estimate of the difference between its clock and the remote clock and attempts to compensate for network delay in this. NTP client queries multiple servers and implements algorithms to select the best estimate, and rejects clearly wrong answers.

Surprisingly enough, research on NTP and its security was not very active at the time. Before this, in late 2013 and early 2014, high-profile Distributed Denial of Service (DDoS) attacks were carried out by amplifying traffic from NTP servers; attackers able to spoof a victim’s IP address were able to funnel copious amounts of traffic overwhelming the targeted domains. This caught the attention of some researchers. However, these attacks did not exploit flaws in the fundamental protocol design. The attackers simply used NTP as a boring bandwidth multiplier. Cloudflare wrote extensively about these attacks and you can read about it here, here, and here.

I found several flaws in the core NTP protocol design and its implementation that can be exploited by network attackers to launch much more devastating attacks by shifting time or denying service to NTP clients. What is even more concerning was that these attackers do not need to be a Monster-In-The-Middle (MITM), where an attacker can modify traffic between the client and the server, to mount these attacks. A set of recent papers authored by one of us showed that an off-path attacker present anywhere on the network can shift time or deny service to NTP clients. One of the ways this is done is by abusing IP fragmentation.

Fragmentation is a feature of the IP layer where a large packet is chopped into several smaller fragments so that they can pass through the networks that do not support large packets. Basically, any random network element on the path between the client and the server can send a special “ICMP fragmentation needed” packet to the server telling them to fragment the packet to say X bytes. Since the server is not expected to know the IP addresses of all the network elements on its path, this packet can be sent from any source IP.

In our attack, the attacker exploits this feature to make the NTP server fragment its NTP response packet for the victim NTP client. The attacker then spoofs carefully crafted overlapping response fragments from off-path that contain the attacker’s timestamp values. By further exploiting the reassembly policies for overlapping fragments the attacker fools the client into assembling a packet with legitimate fragments and the attacker’s insertions. This evades the authenticity checks that rely on values in the original parts of the packet.

### NTP’s past and future

At the time of NTP’s creation back in 1985, there were two main design goals for the service provided by NTP. First, they wanted it to be robust enough to handle networking errors and equipment failures. So it was designed as a service where client can gather timing samples from multiple peers over multiple communication paths and then average them to get more accurate measurement.

The second goal was load distribution. While every client would like to talk to time servers which are directly attached to high precision time-keeping devices like atomic clocks, GPS, etc, and thus have more accurate time, the capacity of those devices is only so much. So, to reduce protocol load on the network, the service was designed in a hierarchical manner. At the top of the hierarchy are servers connected to non-NTP time sources, that distribute time to other servers, that further distribute time to even more servers. Most computers connect to either these second or third level servers.

The original specification (RFC 958) also states the “non-goals” of the protocol, namely peer authentication and data integrity. Security wasn’t considered critical in the relatively small and trusting early Internet, and the protocols and applications that rely on time for security didn’t exist then. Securing NTP came second to improving the protocol and implementation.

As the Internet has grown, more and more core Internet protocols have been secured through cryptography to protect against abuse: TLS, DNSSEC, RPKI are all steps toward ensuring the security of all communications on the Internet. These protocols use “time” to provide security guarantees. Since security of Internet hinges on the security of NTP, it becomes even more important to secure NTP.

This research perspicuously showed the need for securing NTP. As a result, there was more work in the standards body for Internet Protocols, the Internet Engineering Task Force (IETF) towards cryptographically authenticating NTP. At the time, even though NTPv4 supported both symmetric and asymmetric cryptographic authentication, it was rarely used in practice due to limitations of both approaches.

NTPv4’s symmetric approach to securing synchronization doesn’t scale as the symmetric key must be pre-shared and configured manually: imagine if every client on earth needed a special secret key with the servers they wanted to get time from, the organizations that run those servers would have to do a great deal of work managing keys. This makes this solution quite cumbersome for public servers that must accept queries from arbitrary clients. For context, NIST operates important public time servers and distributes symmetric keys only to users that register, once per year, via US mail or facsimile; the US Naval Office does something similar.

The first attempt to solve the problem of key distribution was the Autokey protocol, described in RFC 5906. Many public NTP servers do not support Autokey (e.g., the NIST and USNO time servers, and many servers in pool.ntp.org). The protocol is badly broken as any network attacker can trivially retrieve the secret key shared between the client and server. The authentication mechanisms are non-standard and quite idiosyncratic.

The future of the Internet is a secure Internet, which means an authenticated and encrypted Internet. But until now NTP remains mostly insecure, despite continuing protocol development. In the meantime more and more services depended on it.

### Fixing the problem

Following the release of our paper, there was a lot more enthusiasm in the NTP community at standards body for Internet Protocols, the Internet Engineering Task Force (IETF) and outside for improving the state of NTP security. As a short-term fix, the ntpd reference implementation software was patched for several vulnerabilities that we found. And for a long-term solution, the community realized the dire need for a secure, authenticated time synchronization protocol based on public-key cryptography, which enables encryption and authentication without requiring the sharing of key material beforehand. Today we have a Network Time Security (NTS) draft at the IETF, thanks to the work of dozens of dedicated individuals at the NTP working group.

In a nutshell, the NTS protocol is divided into two-phases. The first phase is the NTS key exchange that establishes the necessary key material between the NTP client and the server. This phase uses the Transport Layer Security (TLS) handshake and relies on the same public key infrastructure as the web. Once the keys are exchanged, the TLS channel is closed and the protocol enters the second phase. In this phase the results of that TLS handshake are used to authenticate NTP time synchronization packets via extension fields. The interested reader can find more information in the Internet draft.

### Cloudflare’s new service

Today, Cloudflare announces its free time service to anyone on the Internet. We intend to solve the limitations with the existing public time services, in particular by increasing availability, robustness and security.

We use our global network to provide an advantage in latency and accuracy. Our 180 locations around the world all use anycast to automatically route your packets to our closest server. All of our servers are synchronized with stratum 1 time service providers, and then offer NTP to the general public, similar to how other public NTP providers function. The biggest source of inaccuracy for time synchronization protocols is the network asymmetry, leading to a difference in travel times between the client and server and back from the server to the client. However, our servers’ proximity to users means there will be less jitter — a measurement of variance in latency on the network — and possible asymmetry in packet paths. We also hope that in regions with a dearth of NTP servers our service significantly improves the capacity and quality of the NTP ecosystem.

Cloudflare servers obtain authenticated time by using a shared symmetric key with our stratum 1 upstream servers. These upstream servers are geographically spread and ensure that our servers have accurate time in our datacenters. But this approach to securing time doesn’t scale. We had to exchange emails individually with the organizations that run stratum 1 servers, as well as negotiate permission to use them. While this is a solution for us, it isn’t a solution for everyone on the Internet.

As a secure time service provider Cloudflare is proud to announce that we are among the first to offer a free and secure public time service based on Network Time Security. We have implemented the latest NTS IETF draft. As this draft progresses through the Internet standards process we are committed to keeping our service current.

Most NTP implementations are currently working on NTS support, and we expect that the next few months will see broader introduction as well as advancement of the current draft protocol to an RFC. Currently we have interoperability with NTPsec who have implemented draft 18 of NTS. We hope that our service will spur faster adoption of this important improvement to Internet security. Because this is a new service with no backwards compatibility requirements, we are requiring the use of TLS v1.3 with it to promote adoption of the most secure version of TLS.

### Use it

If you have an NTS client, point it at time.cloudflare.com:1234. Otherwise point your NTP client at time.cloudflare.com. More details on configuration are available in the developer docs.

### Conclusion

From our Roughtime service to Universal SSL Cloudflare has played a role in expanding the availability and use of secure protocols. Now with our free public time service we provide a trustworthy, widely available alternative to another insecure legacy protocol. It’s all a part of our mission to help make a faster, reliable, and more secure Internet for everyone.

Thanks to the many other engineers who worked on this project, including Watson Ladd, Gabbi Fisher, and Dina Kozlov

# The Quantum Menace

Post Syndicated from Armando Faz-Hernández original https://blog.cloudflare.com/the-quantum-menace/

Over the last few decades, the word ‘quantum’ has become increasingly popular. It is common to find articles, reports, and many people interested in quantum mechanics and the new capabilities and improvements it brings to the scientific community. This topic not only concerns physics, since the development of quantum mechanics impacts on several other fields such as chemistry, economics, artificial intelligence, operations research, and undoubtedly, cryptography.

This post begins a trio of blogs describing the impact of quantum computing on cryptography, and how to use stronger algorithms resistant to the power of quantum computing.

• This post introduces quantum computing and describes the main aspects of this new computing model and its devastating impact on security standards; it summarizes some approaches to securing information using quantum-resistant algorithms.
• Due to the relevance of this matter, we present our experiments on a large-scale deployment of quantum-resistant algorithms.
• Our third post introduces CIRCL, open-source Go library featuring optimized implementations of quantum-resistant algorithms and elliptic curve-based primitives.

All of this is part of Cloudflare’s Crypto Week 2019, now fasten your seatbelt and get ready to make a quantum leap.

### What is Quantum Computing?

Back in 1981, Richard Feynman raised the question about what kind of computers can be used to simulate physics. However, some physical phenomena, such as quantum mechanics, cannot be simulated using a classical computer. Then, he conjectured the existence of a computer model that behaves under quantum mechanics rules, which opened a field of research now called quantum computing. To understand the basics of quantum computing, it is necessary to recall how classical computers work, and from that shine a spotlight on the differences between these computational models.

In 1936, Alan Turing and Emil Post independently described models that gave rise to the foundation of the computing model known as the Post-Turing machine, which describes how computers work and allowed further determination of limits for solving problems.

In this model, the units of information are bits, which store one of two possible values, usually denoted by 0 and 1. A computing machine contains a set of bits and performs operations that modify the values of the bits, also known as the machine’s state. Thus, a machine with N bits can be in one of 2ᴺ possible states. With this in mind, the Post-Turing computing model can be abstractly described as a machine of states, in which running a program is translated as machine transitions along the set of states.

A paper David Deutsch published in 1985 describes a computing model that extends the capabilities of a Turing machine based on the theory of quantum mechanics. This computing model introduces several advantages over the Turing model for processing large volumes of information. It also presents unique properties that deviate from the way we understand classical computing. Most of these properties come from the nature of quantum mechanics. We’re going to dive into these details before approaching the concept of quantum computing.

### Superposition

One of the most exciting properties of quantum computing that provides an advantage over the classical computing model is superposition. In physics, superposition is the ability to produce valid states from the addition or superposition of several other states that are part of a system.

Applying these concepts to computing information, it means that there is a system in which it is possible to generate a machine state that represents a (weighted) sum of the states 0 and 1; in this case, the term weighted means that the state can keep track of “the quantity of” 0 and 1 present in the state. In the classical computation model, one bit can only store either the state of 0 or 1, not both; even using two bits, they cannot represent the weighted sum of these states. Hence, to make a distinction from the basic states, quantum computing uses the concept of a quantum bit (qubit) — a unit of information to denote the superposition of two states. This is a cornerstone concept of quantum computing as it provides a way of tracking more than a single state per unit of information, making it a powerful tool for processing information.

So, a qubit represents the sum of two parts: the 0 or 1 state plus the amount each 0/1 state contributes to produce the state of the qubit.

In mathematical notation, qubit $$| \Psi \rangle$$ is an explicit sum indicating that a qubit represents the superposition of the states 0 and 1. This is the Dirac notation used to describe the value of a qubit $$| \Psi \rangle = A | 0 \rangle +B | 1 \rangle$$, where, A and B are complex numbers known as the amplitude of the states 0 and 1, respectively. The value of the basic states is represented by qubits as $$| 0 \rangle = 1 | 0 \rangle + 0 | 1 \rangle$$  and $$| 1 \rangle = 0 | 0 \rangle + 1 | 1 \rangle$$, respectively. The right side of the term contains the abbreviated notation for these special states.

### Measurement

In a classical computer, the values 0 and 1 are implemented as digital signals. Measuring the current of the signal automatically reveals the status of a bit. This means that at any moment the value of the bit can be observed or measured.

The state of a qubit is maintained in a physically closed system, meaning that the properties of the system, such as superposition, require no interaction with the environment; otherwise any interaction, like performing a measurement, can cause interference on the state of a qubit.

Measuring a qubit is a probabilistic experiment. The result is a bit of information that depends on the state of the qubit. The bit, obtained by measuring $$| \Psi \rangle = A | 0 \rangle +B | 1 \rangle$$, will be equal to 0 with probability $$|A|^2$$,  and equal to 1 with probability $$|B|^2$$, where $$|x|$$ represents the absolute value of $$x$$.

From Statistics, we know that the sum of probabilities of all possible events is always equal to 1, so it must hold that $$|A|^2 +|B|^2 =1$$. This last equation motivates to represent qubits as the points of a circle of radius one, and more generally, as the points on the surface of a sphere of radius one, which is known as the Bloch Sphere.

Let’s break it down: If you measure a qubit you also destroy the superposition of the qubit, resulting in a superposition state collapse, where it assumes one of the basics states, providing your final result.

Another way to think about superposition and measurement is through the coin tossing experiment.

Toss a coin in the air and you give people a random choice between two options: heads or tails. Now, don’t focus on the randomness of the experiment, instead note that while the coin is rotating in the air, participants are uncertain which side will face up when the coin lands. Conversely, once the coin stops with a random side facing up, participants are 100% certain of the status.

How does it relate? Qubits are similar to the participants. When a qubit is in a superposition of states, it is tracking the probability of heads or tails, which is the participants’ uncertainty quotient while the coin is in the air. However, once you start to measure the qubit to retrieve its value, the superposition vanishes, and a classical bit value sticks: heads or tails. Measurement is that moment when the coin is static with only one side facing up.

A fair coin is a coin that is not biased. Each side (assume 0=heads and 1=tails) of a fair coin has the same probability of sticking after a measurement is performed. The qubit $$\tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}|1\rangle$$ describes the probabilities of tossing a fair coin. Note that squaring either of the amplitudes results in ½, indicating that there is a 50% chance either heads or tails sticks.

It would be interesting to be able to charge a fair coin at will while it is in the air. Although this is the magic of a professional illusionist, this task, in fact, can be achieved by performing operations over qubits. So, get ready to become the next quantum magician!

### Quantum Gates

A logic gate represents a Boolean function operating over a set of inputs (on the left) and producing an output (on the right). A logic circuit is a set of connected logic gates, a convenient way to represent bit operations.

Other gates are AND, OR, XOR, and NAND, and more. A set of gates is universal if it can generate other gates. For example, NOR and NAND gates are universal since any circuit can be constructed using only these gates.

Quantum computing also admits a description using circuits. Quantum gates operate over qubits, modifying the superposition of the states. For example, there is a quantum gate analogous to the NOT gate, the X gate.

The X quantum gate interchanges the amplitudes of the states of the input qubit.

The Z quantum gate flips the sign’s amplitude of state 1:

Another quantum gate is the Hadamard gate, which generates an equiprobable superposition of the basic states.

Using our coin tossing analogy, the Hadamard gate has the action of tossing a fair coin to the air. In quantum circuits, a triangle represents measuring a qubit, and the resulting bit is indicated by a double-wire.

Other gates, such as the CNOT gate, Pauli’s gates, Toffoli gate, Deutsch gate, are slightly more advanced. Quirk, the open-source playground, is a fun sandbox where you can construct quantum circuits using all of these gates.

### Reversibility

An operation is reversible if there exists another operation that rolls back the output state to the initial state. For instance, a NOT gate is reversible since applying a second NOT gate recovers the initial input.

In contrast, AND, OR, NAND gates are not reversible. This means that some classical computations cannot be reversed by a classic circuit that uses only the output bits. However, if you insert additional bits of information, the operation can be reversed.

Quantum computing mainly focuses on reversible computations, because there’s always a way to construct a reversible circuit to perform an irreversible computation. The reversible version of a circuit could require the use of ancillary qubits as auxiliary (but not temporary) variables.

Due to the nature of composed systems, it could be possible that these ancillas (extra qubits) correlate to qubits of the main computation. This correlation makes it infeasible to reuse ancillas since any modification could have the side-effect on the operation of a reversible circuit. This is like memory assigned to a process by the operating system: the process cannot use memory from other processes or it could cause memory corruption, and processes cannot release their assigned memory to other processes. You could use garbage collection mechanisms for ancillas, but performing reversible computations increases your qubit budget.

### Composed Systems

In quantum mechanics, a single qubit can be described as a single closed system: a system that has no interaction with the environment nor other qubits. Letting qubits interact with others leads to a composed system where more states are represented. The state of a 2-qubit composite system is denoted as $$A_0|00\rangle+A_1|01\rangle+A_2|10\rangle+A_3|11\rangle$$, where, $$A_i$$ values correspond to the amplitudes of the four basic states 00, 01, 10, and 11. This qubit $$\tfrac{1}{2}|00\rangle+\tfrac{1}{2}|01\rangle+\tfrac{1}{2}|10\rangle+\tfrac{1}{2}|11\rangle$$ represents the superposition of these basic states, both having the same probability obtained after measuring the two qubits.

In the classical case, the state of N bits represents only one of 2ᴺ possible states, whereas a composed state of N qubits represents all the 2ᴺ states but in superposition. This is one big difference between these computing models as it carries two important properties: entanglement and quantum parallelism.

### Entanglement

According to the theory behind quantum mechanics, some composed states can be described through the description of its constituents. However, there are composed states where no description is possible, known as entangled states.

The entanglement phenomenon was pointed out by Einstein, Podolsky, and Rosen in the so-called EPR paradox. Suppose there is a composed system of two entangled qubits, in which by performing a measurement in one qubit causes interference in the measurement of the second. This interference occurs even when qubits are separated by a long distance, which means that some information transfer happens faster than the speed of light. This is how quantum entanglement conflicts with the theory of relativity, where information cannot travel faster than the speed of light. The EPR paradox motivated further investigation for deriving new interpretations about quantum mechanics and aiming to resolve the paradox.

Quantum entanglement can help to transfer information at a distance by following a communication protocol. The following protocol examples rely on the fact that Alice and Bob separately possess one of two entangled qubits:

• The superdense coding protocol allows Alice to communicate a 2-bit message $$m_0,m_1$$ to Bob using a quantum communication channel, for example, using fiber optics to transmit photons. All Alice has to do is operate on her qubit according to the value of the message and send the resulting qubit to Bob. Once Bob receives the qubit, he measures both qubits, noting that the collapsed 2-bit state corresponds to Alice’s message.

• The quantum teleportation protocol allows Alice to transmit a qubit to Bob without using a quantum communication channel. Alice measures the qubit to send Bob and her entangled qubit resulting in two bits. Alice sends these bits to Bob, who operates on his entangled qubit according to the bits received and notes that the result state matches the original state of Alice’s qubit.

### Quantum Parallelism

Composed systems of qubits allow representation of more information per composed state. Note that operating on a composed state of N qubits is equivalent to operating over a set of 2ᴺ states in superposition. This procedure is quantum parallelism. In this setting, operating over a large volume of information gives the intuition of performing operations in parallel, like in the parallel computing paradigm; one big caveat is that superposition is not equivalent to parallelism.

Remember that a composed state is a superposition of several states so, a computation that takes a composed state of inputs will result in a composed state of outputs. The main divergence between classical and quantum parallelism is that quantum parallelism can obtain only one of the processed outputs. Observe that a measurement in the output of a composed state causes that the qubits collapse to only one of the outputs, making it unattainable to calculate all computed values.

Although quantum parallelism does not match precisely with the traditional notion of parallel computing, you can still leverage this computational power to get related information.

Deutsch-Jozsa Problem: Assume $$F$$ is a function that takes as input N bits, outputs one bit, and is either constant (always outputs the same value for all inputs) or balanced (outputs 0 for half of the inputs and 1 for the other half). The problem is to determine if $$F$$ is constant or balanced.

The quantum algorithm that solves the Deutsch-Jozsa problem uses quantum parallelism. First, N qubits are initialized in a superposition of 2ᴺ states. Then, in a single shot, it evaluates $$F$$ for all of these states.

The result of applying $$F$$ appears (in the exponent) of the amplitude of the all-zero state. Note that only when $$F$$ is constant is this amplitude, either +1 or -1. If the result of measuring the N qubit is an all-zeros bitstring, then there is a 100% certainty that $$F$$ is constant. Any other result indicates that $$F$$ is balanced.

A deterministic classical algorithm solves this problem using $$2^{N-1}+1$$ evaluations of $$F$$ in the worst case. Meanwhile, the quantum algorithm requires only one evaluation. The Deutsch-Jozsa problem exemplifies the exponential advantage of a quantum algorithm over classical algorithms.

## Quantum Computers

The theory of quantum computing is supported by investigations in the field of quantum mechanics. However, constructing a quantum machine requires a physical system that allows representing qubits and manipulation of states in a reliable and precise way.

The DiVincenzo Criteria require that a physical implementation of a quantum computer must:

1. Be scalable and have well-defined qubits.
2. Be able to initialize qubits to a state.
3. Have long decoherence times to apply quantum error-correcting codes. Decoherence of a qubit happens when the qubit interacts with the environment, for example, when a measurement is performed.
4. Use a universal set of quantum gates.
5. Be able to measure single qubits without modifying others.

Quantum computer physical implementations face huge engineering obstacles to satisfy these requirements. The most important challenge is to guarantee low error rates during computation and measurement. Lowering these rates require techniques for error correction, which add a significant number of qubits specialized on this task. For this reason, the number of qubits of a quantum computer should not be regarded as for classical systems. In a classical computer, the bits of a computer are all effective for performing a calculation, whereas the number of qubits is the sum of the effective qubits (those used to make calculations) plus the ancillas (used for reversible computations) plus the error correction qubits.

Current implementations of quantum computers partially satisfy the DiVincenzo criteria. Quantum adiabatic computers fit in this category since they do not operate using quantum gates. For this reason, they are not considered to be universal quantum computers.

A recurrent problem in optimization is to find the global minimum of an objective function. For example, a route-traffic control system can be modeled as a function that reduces the cost of routing to a minimum. Simulated annealing is a heuristic procedure that provides a good solution to these types of problems. Simulated annealing finds the solution state by slowly introducing changes (the adiabatic process) on the variables that govern the system.

Quantum annealing is the analogous quantum version of simulated annealing. A qubit is initialized into a superposition of states representing all possible solutions to the problem. Here is used the Hamiltonian operator, which is the sum of vectors of potential and kinetic energies of the system. Hence, the objective function is encoded using this operator describing the evolution of the system in correspondence with time. Then, if the system is allowed to evolve very slowly, it will eventually land on a final state representing the optimal value of the objective function.

Currently, there exist adiabatic computers in the market, such as the D-Wave and IBM Q systems, featuring hundreds of qubits; however, their capabilities are somewhat limited to some problems that can be modeled as optimization problems. The limits of adiabatic computers were studied by van Dam et al, showing that despite solving local searching problems and even some instances of the max-SAT problem, there exists harder searching problems this computing model cannot efficiently solve.

### Nuclear Magnetic Resonance

Nuclear Magnetic Resonance (NMR) is a physical phenomena that can be used to represent qubits. The spin of atomic nucleus of molecules is perturbed by an oscillating magnetic field. A 2001 report describes successful implementation of Shor’s algorithm in a 7-qubit NMR quantum computer. An iconic result since this computer was able to factor the number 15.

### Superconducting Quantum Computers

One way to physically construct qubits is based on superconductors, materials that conduct electric current with zero resistance when exposed to temperatures close to absolute zero.

The Josephson effect, in which current flows across the junction of two superconductors separated by a non-superconducting material, is used to physically implement a superposition of states.

When a magnetic flux is applied to this junction, the current flows continuously in one direction. But, depending on the quantity of magnetic flux applied, the current can also flow in the opposite direction. There exists a quantum superposition of currents going both clockwise and counterclockwise leading to a physical implementation of a qubit called flux qubit. The complete device is known as the Superconducting Quantum Interference Device (SQUID) and can be easily coupled scaling the number of qubits. Thus, SQUIDs are like the transistors of a quantum computer.

Examples of superconducting computers are:

• D-wave’s adiabatic computers process quantum annealing for solving diverse optimization problems.
• Google’s 72-qubit computer was recently announced and also several engineering issues such as achieving lower temperatures.
• IBM’s IBM-Q Tokyo, a 20-qubit adiabatic computer, and IBM Q Experience, a cloud-based system for exploring quantum circuits.

IBM Q System

## The Imminent Threat of Quantum Algorithms

The quantum zoo website tracks problems that can be solved using quantum algorithms. As of mid-2018, more than 60 problems appear on this list, targeting diverse applications in the area of number theory, approximation, simulation, and searching. As terrific as it sounds, some easily-solvable problems by quantum computing are surrounding the security of information.

### Grover’s Algorithm

Tales of a quantum detective (fragment). A couple of detectives have the mission of finding one culprit in a group of suspects that always respond to this question honestly: “are you guilty?”.
The detective C follows a classic interrogative method and interviews every person one at a time, until finding the first one that confesses.
The detective Q proceeds in a different way, First gather all suspects in a completely dark room, and after that, the detective Q asks them — are you guilty? — A steady sound comes from the room saying “No!” while at the same time, a single voice mixed in the air responds “Yes!.” Since everybody is submerged in darkness, the detective cannot see the culprit. However, detective Q knows that, as long as the interrogation advances, the culprit will feel desperate and start to speak louder and louder, and so, he continues asking the same question. Suddenly, detective Q turns on the lights, enters into the room, and captures the culprit. How did he do it?

The task of the detective can be modeled as a searching problem. Given a Boolean function $$f$$ that takes N bits and produces one bit, to find the unique input $$x$$ such that $$f(x)=1$$.

A classical algorithm (detective C) finds $$x$$ using $$2^N-1$$ function evaluations in the worst case. However, the quantum algorithm devised by Grover, corresponding to detective Q, searches quadratically faster using around $$2^{N/2}$$ function evaluations.

The key intuition of Grover’s algorithm is increasing the amplitude of the state that represents the solution while maintaining the other states in a lower amplitude. In this way, a system of N qubits, which is a superposition of 2ᴺ possible inputs, can be continuously updated using this intuition until the solution state has an amplitude closer to 1. Hence, after updating the qubits many times, there will be a high probability to measure the solution state.

Initially, a superposition of 2ᴺ states (horizontal axis) is set, each state has an amplitude (vertical axis) close to 0. The qubits are updated so that the amplitude of the solution state increases more than the amplitude of other states. By repeating the update step, the amplitude of the solution state gets closer to 1, which boosts the probability of collapsing to the solution state after measuring.

Grover’s Algorithm (pseudo-code):

1. Prepare an N qubit $$|x\rangle$$ as a uniform superposition of 2ᴺ states.
2. Update the qubits by performing this core operation. $$|x\rangle \mapsto (-1)^{f(x)} |x\rangle$$ The result of $$f(x)$$ only flips the amplitude of the searched state.
3. Negate the N qubit over the average of the amplitudes.
4. Repeat Step 2 and 3 for $$(\tfrac{\pi}{4}) 2^{ N/2}$$ times.
5. Measure the qubit and return the bits obtained.

Alternatively, the second step can be better understood as a conditional statement:

IF f(x) = 1 THEN
Negate the amplitude of the solution state.
ELSE
/* nothing */
ENDIF


Grover’s algorithm considers function $$f$$ a black box, so with slight modifications, the algorithm can also be used to find collisions on the function. This implies that Grover’s algorithm can find a collision using an asymptotically less number of operations than using a brute-force algorithm.

The power of Grover’s algorithm can be turned against cryptographic hash functions. For instance, a quantum computer running Grover’s algorithm could find a collision on SHA256 performing only 2¹²⁸ evaluations of a reversible circuit of SHA256. The natural protection for hash functions is to increase the output size to double. More generally, most of symmetric key encryption algorithms will survive to the power of Grover’s algorithm by doubling the size of keys.

The scenario for public-key algorithms is devastating in face of Peter Shor’s algorithm.

### Shor’s Algorithm

Multiplying integers is an easy task to accomplish, however, finding the factors that compose an integer is difficult. The integer factorization problem is to decompose a given integer number into its prime factors. For example, 42 has three factors 2, 3, and 7 since $$2\times 3\times 7 = 42$$. As the numbers get bigger, integer factorization becomes more difficult to solve, and the hardest instances of integer factorization are when the factors are only two different large primes. Thus, given an integer number $$N$$, to find primes $$p$$ and $$q$$ such that $$N = p \times q$$, is known as integer splitting.

Factoring integers is like cutting wood, and the specific task of splitting integers is analogous to using an axe for splitting the log in two parts. There exist many different tools (algorithms) for accomplishing each task.

For integer factorization, trial division, the Rho method, the elliptic curve method are common algorithms. Fermat’s method, the quadratic- and rational-sieve, leads to the (general) number field sieve (NFS) algorithm for integer splitting. The latter relies on finding a congruence of squares, that is, splitting $$N$$ as a product of squares such that $$N = x^2 – y^2 = (x+y)\times(x-y)$$ The complexity of NFS is mainly attached to the number of pairs $$(x, y)$$ that must be examined before getting a pair that factors $$N$$. The NFS algorithm has subexponential complexity on the size of $$N$$, meaning that the time required for splitting an integer increases significantly as the size of $$N$$ grows. For large integers, the problem becomes intractable for classical computers.

### The Axe of Thor Shor

The many different guesses of the NFS algorithm are analogous to hitting the log using a dulled axe; after subexponential many tries, the log is cut by half. However, using a sharper axe allows you to split the log faster. This sharpened axe is the quantum algorithm proposed by Shor in 1994.

Let $$x$$ be an integer less than $$N$$ and of the order $$k$$. Then, if $$k$$ is even, there exists an integer $$q$$ so $$qN$$ can be factored as follows.

This approach has some issues. For example, the factorization could correspond to $$q$$ not $$N$$ and the order of $$x$$ is unknown, and here is where Shor’s algorithm enters the picture, finding the order of $$x$$.

The internals of Shor’s algorithm rely on encoding the order $$k$$ into a periodic function, so that its period can be obtained using the quantum version of the Fourier transform (QFT). The order of $$x$$ can be found using a polynomial number quantum evaluations of Shor’s algorithm. Therefore, splitting integers using this quantum approach has polynomial complexity on the size of $$N$$.

Shor’s algorithm carries strong implications on the security of the RSA encryption scheme because its security relies on integer factorization. A large-enough quantum computer can efficiently break RSA for current instances.

Alternatively, one may recur to elliptic curves, used in cryptographic protocols like ECDSA or ECDH. Moreover, all TLS ciphersuites use a combination of elliptic curve groups, large prime groups, and RSA and DSA signatures. Unfortunately, these algorithms all succumb to Shor’s algorithm. It only takes a few modifications for Shor’s algorithm to solve the discrete logarithm problem on finite groups. This sounds like a catastrophic story where all of our encrypted data and privacy are no longer secure with the advent of a quantum computer, and in some sense this is true.

On one hand, it is a fact that the quantum computers constructed as of 2019 are not large enough to run, for instance, Shor’s algorithm for the RSA key sizes used in standard protocols. For example, a 2018 report shows experiments on the factorization of a 19-bit number using 94 qubits, they also estimate that 147456 qubits are needed for factoring a 768-bit number. Hence, there numbers indicates that we are still far from breaking RSA.

What if we increment RSA key sizes to be resistant to quantum algorithms, just like for symmetric algorithms?

Bernstein et al. estimated that RSA public keys should be as large as 1 terabyte to maintain secure RSA even in the presence of quantum factoring algorithms. So, for public-key algorithms, increasing the size of keys does not help.

A recent investigation by Gidney and Ekerá shows improvements that accelerate the evaluation of quantum factorization. In their report, the cost of factoring 2048-bit integers is estimated to take a few hours using a quantum machine of 20 million qubits, which is far from any current development. Something worth noting is that the number of qubits needed is two orders of magnitude smaller than the estimated numbers given in previous works developed in this decade. Under these estimates, current encryption algorithms will remain secure several more years; however, consider the following not-so-unrealistic situation.

Information currently encrypted with for example, RSA, can be easily decrypted with a quantum computer in the future. Now, suppose that someone records encrypted information and stores them until a quantum computer is able to decrypt ciphertexts. Although this could be as far as 20 years from now, the forward-secrecy principle is violated. A 20-year gap to the future is sometimes difficult to imagine. So, let’s think backwards, what would happen if all you did on the Internet at the end of the 1990s can be revealed 20 years later — today. How does this impact the security of your personal information? What if the ciphertexts were company secrets or business deals? In 1999, most of us were concerned about the effects of the Y2K problem, now we’re facing Y2Q (years to quantum): the advent of quantum computers.

### Post-Quantum Cryptography

Although the current capacity of the physical implementation of quantum computers is far from a real threat to secure communications, a transition to use stronger problems to protect information has already started. This wave emerged as post-quantum cryptography (PQC). The core idea of PQC is finding algorithms difficult enough that no quantum (and classical) algorithm can solve them.

A recurrent question is: How does it look like a problem that even a quantum computer can not solve?

These so-called quantum-resistant algorithms rely on different hard mathematical assumptions; some of them as old as RSA, others more recently proposed. For example, McEliece cryptosystem, formulated in the late 70s, relies on the hardness of decoding a linear code (in the sense of coding theory). The practical use of this cryptosystem didn’t become widespread, since with the passing of time, other cryptosystems superseded in efficiency. Fortunately, McEliece cryptosystem remains immune to Shor’s algorithm, gaining it relevance in the post-quantum era.

Post-quantum cryptography presents alternatives:

As of 2017, the NIST started an evaluation process that tracks possible alternatives for next-generation secure algorithms. From a practical perspective, all candidates present different trade-offs in implementation and usage. The time and space requirements are diverse; at this moment, it’s too early to define which will succeed RSA and elliptic curves. An initial round collected 70 algorithms for deploying key encapsulation mechanisms and digital signatures. As of early 2019, 28 of these survive and are currently in the analysis, investigation, and experimentation phase.

Cloudflare’s mission is to help build a better Internet. As a proactive action, our cryptography team is preparing experiments on the deployment of post-quantum algorithms at Cloudflare scale. Watch our blog post for more details.

# The Quantum Menace

Post Syndicated from Armando Faz-Hernández original https://blog.cloudflare.com/the-quantum-menace/

Over the last few decades, the word ‘quantum’ has become increasingly popular. It is common to find articles, reports, and many people interested in quantum mechanics and the new capabilities and improvements it brings to the scientific community. This topic not only concerns physics, since the development of quantum mechanics impacts on several other fields such as chemistry, economics, artificial intelligence, operations research, and undoubtedly, cryptography.

This post begins a trio of blogs describing the impact of quantum computing on cryptography, and how to use stronger algorithms resistant to the power of quantum computing.

• This post introduces quantum computing and describes the main aspects of this new computing model and its devastating impact on security standards; it summarizes some approaches to securing information using quantum-resistant algorithms.
• Due to the relevance of this matter, we present our experiments on a large-scale deployment of quantum-resistant algorithms.
• Our third post introduces CIRCL, open-source Go library featuring optimized implementations of quantum-resistant algorithms and elliptic curve-based primitives.

All of this is part of Cloudflare’s Crypto Week 2019, now fasten your seatbelt and get ready to make a quantum leap.

### What is Quantum Computing?

Back in 1981, Richard Feynman raised the question about what kind of computers can be used to simulate physics. Although some physical systems can be simulated in a classical computer, the amount of resources used by such a computer can grow exponentially. Then, he conjectured the existence of a computer model that behaves under quantum mechanics rules, which opened a field of research now called quantum computing. To understand the basics of quantum computing, it is necessary to recall how classical computers work, and from that shine a spotlight on the differences between these computational models.

In 1936, Alan Turing and Emil Post independently described models that gave rise to the foundation of the computing model known as the Post-Turing machine, which describes how computers work and allowed further determination of limits for solving problems.

In this model, the units of information are bits, which store one of two possible values, usually denoted by 0 and 1. A computing machine contains a set of bits and performs operations that modify the values of the bits, also known as the machine’s state. Thus, a machine with N bits can be in one of 2ᴺ possible states. With this in mind, the Post-Turing computing model can be abstractly described as a machine of states, in which running a program is translated as machine transitions along the set of states.

A paper David Deutsch published in 1985 describes a computing model that extends the capabilities of a Turing machine based on the theory of quantum mechanics. This computing model introduces several advantages over the Turing model for processing large volumes of information. It also presents unique properties that deviate from the way we understand classical computing. Most of these properties come from the nature of quantum mechanics. We’re going to dive into these details before approaching the concept of quantum computing.

### Superposition

One of the most exciting properties of quantum computing that provides an advantage over the classical computing model is superposition. In physics, superposition is the ability to produce valid states from the addition or superposition of several other states that are part of a system.

Applying these concepts to computing information, it means that there is a system in which it is possible to generate a machine state that represents a (weighted) sum of the states 0 and 1; in this case, the term weighted means that the state can keep track of “the quantity of” 0 and 1 present in the state. In the classical computation model, one bit can only store either the state of 0 or 1, not both; even using two bits, they cannot represent the weighted sum of these states. Hence, to make a distinction from the basic states, quantum computing uses the concept of a quantum bit (qubit) — a unit of information to denote the superposition of two states. This is a cornerstone concept of quantum computing as it provides a way of tracking more than a single state per unit of information, making it a powerful tool for processing information.

So, a qubit represents the sum of two parts: the 0 or 1 state plus the amount each 0/1 state contributes to produce the state of the qubit.

In mathematical notation, qubit $$| \Psi \rangle$$ is an explicit sum indicating that a qubit represents the superposition of the states 0 and 1. This is the Dirac notation used to describe the value of a qubit $$| \Psi \rangle = A | 0 \rangle +B | 1 \rangle$$, where, A and B are complex numbers known as the amplitude of the states 0 and 1, respectively. The value of the basic states is represented by qubits as $$| 0 \rangle = 1 | 0 \rangle + 0 | 1 \rangle$$  and $$| 1 \rangle = 0 | 0 \rangle + 1 | 1 \rangle$$, respectively. The right side of the term contains the abbreviated notation for these special states.

### Measurement

In a classical computer, the values 0 and 1 are implemented as digital signals. Measuring the current of the signal automatically reveals the status of a bit. This means that at any moment the value of the bit can be observed or measured.

The state of a qubit is maintained in a physically closed system, meaning that the properties of the system, such as superposition, require no interaction with the environment; otherwise any interaction, like performing a measurement, can cause interference on the state of a qubit.

Measuring a qubit is a probabilistic experiment. The result is a bit of information that depends on the state of the qubit. The bit, obtained by measuring $$| \Psi \rangle = A | 0 \rangle +B | 1 \rangle$$, will be equal to 0 with probability $$|A|^2$$,  and equal to 1 with probability $$|B|^2$$, where $$|x|$$ represents the absolute value of $$x$$.

From Statistics, we know that the sum of probabilities of all possible events is always equal to 1, so it must hold that $$|A|^2 +|B|^2 =1$$. This last equation motivates to represent qubits as the points of a circle of radius one, and more generally, as the points on the surface of a sphere of radius one, which is known as the Bloch Sphere.

Let’s break it down: If you measure a qubit you also destroy the superposition of the qubit, resulting in a superposition state collapse, where it assumes one of the basics states, providing your final result.

Another way to think about superposition and measurement is through the coin tossing experiment.

Toss a coin in the air and you give people a random choice between two options: heads or tails. Now, don’t focus on the randomness of the experiment, instead note that while the coin is rotating in the air, participants are uncertain which side will face up when the coin lands. Conversely, once the coin stops with a random side facing up, participants are 100% certain of the status.

How does it relate? Qubits are similar to the participants. When a qubit is in a superposition of states, it is tracking the probability of heads or tails, which is the participants’ uncertainty quotient while the coin is in the air. However, once you start to measure the qubit to retrieve its value, the superposition vanishes, and a classical bit value sticks: heads or tails. Measurement is that moment when the coin is static with only one side facing up.

A fair coin is a coin that is not biased. Each side (assume 0=heads and 1=tails) of a fair coin has the same probability of sticking after a measurement is performed. The qubit $$\tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}|1\rangle$$ describes the probabilities of tossing a fair coin. Note that squaring either of the amplitudes results in ½, indicating that there is a 50% chance either heads or tails sticks.

It would be interesting to be able to charge a fair coin at will while it is in the air. Although this is the magic of a professional illusionist, this task, in fact, can be achieved by performing operations over qubits. So, get ready to become the next quantum magician!

### Quantum Gates

A logic gate represents a Boolean function operating over a set of inputs (on the left) and producing an output (on the right). A logic circuit is a set of connected logic gates, a convenient way to represent bit operations.

Other gates are AND, OR, XOR, and NAND, and more. A set of gates is universal if it can generate other gates. For example, NOR and NAND gates are universal since any circuit can be constructed using only these gates.

Quantum computing also admits a description using circuits. Quantum gates operate over qubits, modifying the superposition of the states. For example, there is a quantum gate analogous to the NOT gate, the X gate.

The X quantum gate interchanges the amplitudes of the states of the input qubit.

The Z quantum gate flips the sign’s amplitude of state 1:

Another quantum gate is the Hadamard gate, which generates an equiprobable superposition of the basic states.

Using our coin tossing analogy, the Hadamard gate has the action of tossing a fair coin to the air. In quantum circuits, a triangle represents measuring a qubit, and the resulting bit is indicated by a double-wire.

Other gates, such as the CNOT gate, Pauli’s gates, Toffoli gate, Deutsch gate, are slightly more advanced. Quirk, the open-source playground, is a fun sandbox where you can construct quantum circuits using all of these gates.

### Reversibility

An operation is reversible if there exists another operation that rolls back the output state to the initial state. For instance, a NOT gate is reversible since applying a second NOT gate recovers the initial input.

In contrast, AND, OR, NAND gates are not reversible. This means that some classical computations cannot be reversed by a classic circuit that uses only the output bits. However, if you insert additional bits of information, the operation can be reversed.

Quantum computing mainly focuses on reversible computations, because there’s always a way to construct a reversible circuit to perform an irreversible computation. The reversible version of a circuit could require the use of ancillary qubits as auxiliary (but not temporary) variables.

Due to the nature of composed systems, it could be possible that these ancillas (extra qubits) correlate to qubits of the main computation. This correlation makes it infeasible to reuse ancillas since any modification could have the side-effect on the operation of a reversible circuit. This is like memory assigned to a process by the operating system: the process cannot use memory from other processes or it could cause memory corruption, and processes cannot release their assigned memory to other processes. You could use garbage collection mechanisms for ancillas, but performing reversible computations increases your qubit budget.

### Composed Systems

In quantum mechanics, a single qubit can be described as a single closed system: a system that has no interaction with the environment nor other qubits. Letting qubits interact with others leads to a composed system where more states are represented. The state of a 2-qubit composite system is denoted as $$A_0|00\rangle+A_1|01\rangle+A_2|10\rangle+A_3|11\rangle$$, where, $$A_i$$ values correspond to the amplitudes of the four basic states 00, 01, 10, and 11. This qubit $$\tfrac{1}{2}|00\rangle+\tfrac{1}{2}|01\rangle+\tfrac{1}{2}|10\rangle+\tfrac{1}{2}|11\rangle$$ represents the superposition of these basic states, both having the same probability obtained after measuring the two qubits.

In the classical case, the state of N bits represents only one of 2ᴺ possible states, whereas a composed state of N qubits represents all the 2ᴺ states but in superposition. This is one big difference between these computing models as it carries two important properties: entanglement and quantum parallelism.

### Entanglement

According to the theory behind quantum mechanics, some composed states can be described through the description of its constituents. However, there are composed states where no description is possible, known as entangled states.

The entanglement phenomenon was pointed out by Einstein, Podolsky, and Rosen in the so-called EPR paradox. Suppose there is a composed system of two entangled qubits, in which by performing a measurement in one qubit causes interference in the measurement of the second. This interference occurs even when qubits are separated by a long distance, which means that some information transfer happens faster than the speed of light. This is how quantum entanglement conflicts with the theory of relativity, where information cannot travel faster than the speed of light. The EPR paradox motivated further investigation for deriving new interpretations about quantum mechanics and aiming to resolve the paradox.

Quantum entanglement can help to transfer information at a distance by following a communication protocol. The following protocol examples rely on the fact that Alice and Bob separately possess one of two entangled qubits:

• The superdense coding protocol allows Alice to communicate a 2-bit message $$m_0,m_1$$ to Bob using a quantum communication channel, for example, using fiber optics to transmit photons. All Alice has to do is operate on her qubit according to the value of the message and send the resulting qubit to Bob. Once Bob receives the qubit, he measures both qubits, noting that the collapsed 2-bit state corresponds to Alice’s message.

• The quantum teleportation protocol allows Alice to transmit a qubit to Bob without using a quantum communication channel. Alice measures the qubit to send Bob and her entangled qubit resulting in two bits. Alice sends these bits to Bob, who operates on his entangled qubit according to the bits received and notes that the result state matches the original state of Alice’s qubit.

### Quantum Parallelism

Composed systems of qubits allow representation of more information per composed state. Note that operating on a composed state of N qubits is equivalent to operating over a set of 2ᴺ states in superposition. This procedure is quantum parallelism. In this setting, operating over a large volume of information gives the intuition of performing operations in parallel, like in the parallel computing paradigm; one big caveat is that superposition is not equivalent to parallelism.

Remember that a composed state is a superposition of several states so, a computation that takes a composed state of inputs will result in a composed state of outputs. The main divergence between classical and quantum parallelism is that quantum parallelism can obtain only one of the processed outputs. Observe that a measurement in the output of a composed state causes that the qubits collapse to only one of the outputs, making it unattainable to calculate all computed values.

Although quantum parallelism does not match precisely with the traditional notion of parallel computing, you can still leverage this computational power to get related information.

Deutsch-Jozsa Problem: Assume $$F$$ is a function that takes as input N bits, outputs one bit, and is either constant (always outputs the same value for all inputs) or balanced (outputs 0 for half of the inputs and 1 for the other half). The problem is to determine if $$F$$ is constant or balanced.

The quantum algorithm that solves the Deutsch-Jozsa problem uses quantum parallelism. First, N qubits are initialized in a superposition of 2ᴺ states. Then, in a single shot, it evaluates $$F$$ for all of these states.

The result of applying $$F$$ appears (in the exponent) of the amplitude of the all-zero state. Note that only when $$F$$ is constant is this amplitude, either +1 or -1. If the result of measuring the N qubit is an all-zeros bitstring, then there is a 100% certainty that $$F$$ is constant. Any other result indicates that $$F$$ is balanced.

A deterministic classical algorithm solves this problem using $$2^{N-1}+1$$ evaluations of $$F$$ in the worst case. Meanwhile, the quantum algorithm requires only one evaluation. The Deutsch-Jozsa problem exemplifies the exponential advantage of a quantum algorithm over classical algorithms.

## Quantum Computers

The theory of quantum computing is supported by investigations in the field of quantum mechanics. However, constructing a quantum machine requires a physical system that allows representing qubits and manipulation of states in a reliable and precise way.

The DiVincenzo Criteria require that a physical implementation of a quantum computer must:

1. Be scalable and have well-defined qubits.
2. Be able to initialize qubits to a state.
3. Have long decoherence times to apply quantum error-correcting codes. Decoherence of a qubit happens when the qubit interacts with the environment, for example, when a measurement is performed.
4. Use a universal set of quantum gates.
5. Be able to measure single qubits without modifying others.

Quantum computer physical implementations face huge engineering obstacles to satisfy these requirements. The most important challenge is to guarantee low error rates during computation and measurement. Lowering these rates require techniques for error correction, which add a significant number of qubits specialized on this task. For this reason, the number of qubits of a quantum computer should not be regarded as for classical systems. In a classical computer, the bits of a computer are all effective for performing a calculation, whereas the number of qubits is the sum of the effective qubits (those used to make calculations) plus the ancillas (used for reversible computations) plus the error correction qubits.

Current implementations of quantum computers partially satisfy the DiVincenzo criteria. Quantum adiabatic computers fit in this category since they do not operate using quantum gates. For this reason, they are not considered to be universal quantum computers.

A recurrent problem in optimization is to find the global minimum of an objective function. For example, a route-traffic control system can be modeled as a function that reduces the cost of routing to a minimum. Simulated annealing is a heuristic procedure that provides a good solution to these types of problems. Simulated annealing finds the solution state by slowly introducing changes (the adiabatic process) on the variables that govern the system.

Quantum annealing is the analogous quantum version of simulated annealing. A qubit is initialized into a superposition of states representing all possible solutions to the problem. Here is used the Hamiltonian operator, which is the sum of vectors of potential and kinetic energies of the system. Hence, the objective function is encoded using this operator describing the evolution of the system in correspondence with time. Then, if the system is allowed to evolve very slowly, it will eventually land on a final state representing the optimal value of the objective function.

Currently, there exist adiabatic computers in the market, such as the D-Wave and IBM Q systems, featuring hundreds of qubits; however, their capabilities are somewhat limited to some problems that can be modeled as optimization problems. The limits of adiabatic computers were studied by van Dam et al, showing that despite solving local searching problems and even some instances of the max-SAT problem, there exists harder searching problems this computing model cannot efficiently solve.

### Nuclear Magnetic Resonance

Nuclear Magnetic Resonance (NMR) is a physical phenomena that can be used to represent qubits. The spin of atomic nucleus of molecules is perturbed by an oscillating magnetic field. A 2001 report describes successful implementation of Shor’s algorithm in a 7-qubit NMR quantum computer. An iconic result since this computer was able to factor the number 15.

### Superconducting Quantum Computers

One way to physically construct qubits is based on superconductors, materials that conduct electric current with zero resistance when exposed to temperatures close to absolute zero.

The Josephson effect, in which current flows across the junction of two superconductors separated by a non-superconducting material, is used to physically implement a superposition of states.

When a magnetic flux is applied to this junction, the current flows continuously in one direction. But, depending on the quantity of magnetic flux applied, the current can also flow in the opposite direction. There exists a quantum superposition of currents going both clockwise and counterclockwise leading to a physical implementation of a qubit called flux qubit. The complete device is known as the Superconducting Quantum Interference Device (SQUID) and can be easily coupled scaling the number of qubits. Thus, SQUIDs are like the transistors of a quantum computer.

Examples of superconducting computers are:

• D-wave’s adiabatic computers process quantum annealing for solving diverse optimization problems.
• Google’s 72-qubit computer was recently announced and also several engineering issues such as achieving lower temperatures.
• IBM’s IBM-Q Tokyo, a 20-qubit adiabatic computer, and IBM Q Experience, a cloud-based system for exploring quantum circuits.

IBM Q System

## The Imminent Threat of Quantum Algorithms

The quantum zoo website tracks problems that can be solved using quantum algorithms. As of mid-2018, more than 60 problems appear on this list, targeting diverse applications in the area of number theory, approximation, simulation, and searching. As terrific as it sounds, some easily-solvable problems by quantum computing are surrounding the security of information.

### Grover’s Algorithm

Tales of a quantum detective (fragment). A couple of detectives have the mission of finding one culprit in a group of suspects that always respond to this question honestly: “are you guilty?”.
The detective C follows a classic interrogative method and interviews every person one at a time, until finding the first one that confesses.
The detective Q proceeds in a different way, First gather all suspects in a completely dark room, and after that, the detective Q asks them — are you guilty? — A steady sound comes from the room saying “No!” while at the same time, a single voice mixed in the air responds “Yes!.” Since everybody is submerged in darkness, the detective cannot see the culprit. However, detective Q knows that, as long as the interrogation advances, the culprit will feel desperate and start to speak louder and louder, and so, he continues asking the same question. Suddenly, detective Q turns on the lights, enters into the room, and captures the culprit. How did he do it?

The task of the detective can be modeled as a searching problem. Given a Boolean function $$f$$ that takes N bits and produces one bit, to find the unique input $$x$$ such that $$f(x)=1$$.

A classical algorithm (detective C) finds $$x$$ using $$2^N-1$$ function evaluations in the worst case. However, the quantum algorithm devised by Grover, corresponding to detective Q, searches quadratically faster using around $$2^{N/2}$$ function evaluations.

The key intuition of Grover’s algorithm is increasing the amplitude of the state that represents the solution while maintaining the other states in a lower amplitude. In this way, a system of N qubits, which is a superposition of 2ᴺ possible inputs, can be continuously updated using this intuition until the solution state has an amplitude closer to 1. Hence, after updating the qubits many times, there will be a high probability to measure the solution state.

Initially, a superposition of 2ᴺ states (horizontal axis) is set, each state has an amplitude (vertical axis) close to 0. The qubits are updated so that the amplitude of the solution state increases more than the amplitude of other states. By repeating the update step, the amplitude of the solution state gets closer to 1, which boosts the probability of collapsing to the solution state after measuring.

Grover’s Algorithm (pseudo-code):

1. Prepare an N qubit $$|x\rangle$$ as a uniform superposition of 2ᴺ states.
2. Update the qubits by performing this core operation. $$|x\rangle \mapsto (-1)^{f(x)} |x\rangle$$ The result of $$f(x)$$ only flips the amplitude of the searched state.
3. Negate the N qubit over the average of the amplitudes.
4. Repeat Step 2 and 3 for $$(\tfrac{\pi}{4}) 2^{ N/2}$$ times.
5. Measure the qubit and return the bits obtained.

Alternatively, the second step can be better understood as a conditional statement:

IF f(x) = 1 THEN
Negate the amplitude of the solution state.
ELSE
/* nothing */
ENDIF


Grover’s algorithm considers function $$f$$ a black box, so with slight modifications, the algorithm can also be used to find collisions on the function. This implies that Grover’s algorithm can find a collision using an asymptotically less number of operations than using a brute-force algorithm.

The power of Grover’s algorithm can be turned against cryptographic hash functions. For instance, a quantum computer running Grover’s algorithm could find a collision on SHA256 performing only 2¹²⁸ evaluations of a reversible circuit of SHA256. The natural protection for hash functions is to increase the output size to double. More generally, most of symmetric key encryption algorithms will survive to the power of Grover’s algorithm by doubling the size of keys.

The scenario for public-key algorithms is devastating in face of Peter Shor’s algorithm.

### Shor’s Algorithm

Multiplying integers is an easy task to accomplish, however, finding the factors that compose an integer is difficult. The integer factorization problem is to decompose a given integer number into its prime factors. For example, 42 has three factors 2, 3, and 7 since $$2\times 3\times 7 = 42$$. As the numbers get bigger, integer factorization becomes more difficult to solve, and the hardest instances of integer factorization are when the factors are only two different large primes. Thus, given an integer number $$N$$, to find primes $$p$$ and $$q$$ such that $$N = p \times q$$, is known as integer splitting.

Factoring integers is like cutting wood, and the specific task of splitting integers is analogous to using an axe for splitting the log in two parts. There exist many different tools (algorithms) for accomplishing each task.

For integer factorization, trial division, the Rho method, the elliptic curve method are common algorithms. Fermat’s method, the quadratic- and rational-sieve, leads to the (general) number field sieve (NFS) algorithm for integer splitting. The latter relies on finding a congruence of squares, that is, splitting $$N$$ as a product of squares such that $$N = x^2 – y^2 = (x+y)\times(x-y)$$ The complexity of NFS is mainly attached to the number of pairs $$(x, y)$$ that must be examined before getting a pair that factors $$N$$. The NFS algorithm has subexponential complexity on the size of $$N$$, meaning that the time required for splitting an integer increases significantly as the size of $$N$$ grows. For large integers, the problem becomes intractable for classical computers.

### The Axe of Thor Shor

The many different guesses of the NFS algorithm are analogous to hitting the log using a dulled axe; after subexponential many tries, the log is cut by half. However, using a sharper axe allows you to split the log faster. This sharpened axe is the quantum algorithm proposed by Shor in 1994.

Let $$x$$ be an integer less than $$N$$ and of the order $$k$$. Then, if $$k$$ is even, there exists an integer $$q$$ so $$qN$$ can be factored as follows.

This approach has some issues. For example, the factorization could correspond to $$q$$ not $$N$$ and the order of $$x$$ is unknown, and here is where Shor’s algorithm enters the picture, finding the order of $$x$$.

The internals of Shor’s algorithm rely on encoding the order $$k$$ into a periodic function, so that its period can be obtained using the quantum version of the Fourier transform (QFT). The order of $$x$$ can be found using a polynomial number quantum evaluations of Shor’s algorithm. Therefore, splitting integers using this quantum approach has polynomial complexity on the size of $$N$$.

Shor’s algorithm carries strong implications on the security of the RSA encryption scheme because its security relies on integer factorization. A large-enough quantum computer can efficiently break RSA for current instances.

Alternatively, one may recur to elliptic curves, used in cryptographic protocols like ECDSA or ECDH. Moreover, all TLS ciphersuites use a combination of elliptic curve groups, large prime groups, and RSA and DSA signatures. Unfortunately, these algorithms all succumb to Shor’s algorithm. It only takes a few modifications for Shor’s algorithm to solve the discrete logarithm problem on finite groups. This sounds like a catastrophic story where all of our encrypted data and privacy are no longer secure with the advent of a quantum computer, and in some sense this is true.

On one hand, it is a fact that the quantum computers constructed as of 2019 are not large enough to run, for instance, Shor’s algorithm for the RSA key sizes used in standard protocols. For example, a 2018 report shows experiments on the factorization of a 19-bit number using 94 qubits, they also estimate that 147456 qubits are needed for factoring a 768-bit number. Hence, there numbers indicates that we are still far from breaking RSA.

What if we increment RSA key sizes to be resistant to quantum algorithms, just like for symmetric algorithms?

Bernstein et al. estimated that RSA public keys should be as large as 1 terabyte to maintain secure RSA even in the presence of quantum factoring algorithms. So, for public-key algorithms, increasing the size of keys does not help.

A recent investigation by Gidney and Ekerá shows improvements that accelerate the evaluation of quantum factorization. In their report, the cost of factoring 2048-bit integers is estimated to take a few hours using a quantum machine of 20 million qubits, which is far from any current development. Something worth noting is that the number of qubits needed is two orders of magnitude smaller than the estimated numbers given in previous works developed in this decade. Under these estimates, current encryption algorithms will remain secure several more years; however, consider the following not-so-unrealistic situation.

Information currently encrypted with for example, RSA, can be easily decrypted with a quantum computer in the future. Now, suppose that someone records encrypted information and stores them until a quantum computer is able to decrypt ciphertexts. Although this could be as far as 20 years from now, the forward-secrecy principle is violated. A 20-year gap to the future is sometimes difficult to imagine. So, let’s think backwards, what would happen if all you did on the Internet at the end of the 1990s can be revealed 20 years later — today. How does this impact the security of your personal information? What if the ciphertexts were company secrets or business deals? In 1999, most of us were concerned about the effects of the Y2K problem, now we’re facing Y2Q (years to quantum): the advent of quantum computers.

### Post-Quantum Cryptography

Although the current capacity of the physical implementation of quantum computers is far from a real threat to secure communications, a transition to use stronger problems to protect information has already started. This wave emerged as post-quantum cryptography (PQC). The core idea of PQC is finding algorithms difficult enough that no quantum (and classical) algorithm can solve them.

A recurrent question is: How does it look like a problem that even a quantum computer can not solve?

These so-called quantum-resistant algorithms rely on different hard mathematical assumptions; some of them as old as RSA, others more recently proposed. For example, McEliece cryptosystem, formulated in the late 70s, relies on the hardness of decoding a linear code (in the sense of coding theory). The practical use of this cryptosystem didn’t become widespread, since with the passing of time, other cryptosystems superseded in efficiency. Fortunately, McEliece cryptosystem remains immune to Shor’s algorithm, gaining it relevance in the post-quantum era.

Post-quantum cryptography presents alternatives:

As of 2017, the NIST started an evaluation process that tracks possible alternatives for next-generation secure algorithms. From a practical perspective, all candidates present different trade-offs in implementation and usage. The time and space requirements are diverse; at this moment, it’s too early to define which will succeed RSA and elliptic curves. An initial round collected 70 algorithms for deploying key encapsulation mechanisms and digital signatures. As of early 2019, 28 of these survive and are currently in the analysis, investigation, and experimentation phase.

Cloudflare’s mission is to help build a better Internet. As a proactive action, our cryptography team is preparing experiments on the deployment of post-quantum algorithms at Cloudflare scale. Watch our blog post for more details.

# Towards Post-Quantum Cryptography in TLS

Post Syndicated from Kris Kwiatkowski original https://blog.cloudflare.com/towards-post-quantum-cryptography-in-tls/

We live in a completely connected society. A society connected by a variety of devices: laptops, mobile phones, wearables, self-driving or self-flying things. We have standards for a common language that allows these devices to communicate with each other. This is critical for wide-scale deployment – especially in cryptography where the smallest detail has great importance.

One of the most important standards-setting organizations is the National Institute of Standards and Technology (NIST), which is hugely influential in determining which standardized cryptographic systems see worldwide adoption. At the end of 2016, NIST announced it would hold a multi-year open project with the goal of standardizing new post-quantum (PQ) cryptographic algorithms secure against both quantum and classical computers.

Many of our devices have very different requirements and capabilities, so it may not be possible to select a “one-size-fits-all” algorithm during the process. NIST mathematician, Dustin Moody, indicated that institute will likely select more than one algorithm:

“There are several systems in use that could be broken by a quantum computer – public-key encryption and digital signatures, to take two examples – and we will need different solutions for each of those systems.”

Initially, NIST selected 82 candidates for further consideration from all submitted algorithms. At the beginning of 2019, this process entered its second stage. Today, there are 26 algorithms still in contention.

### Post-quantum cryptography: what is it really and why do I need it?

In 1994, Peter Shor made a significant discovery in quantum computation. He found an algorithm for integer factorization and computing discrete logarithms, both believed to be hard to solve in classical settings. Since then it has become clear that the ‘hard problems’ on which cryptosystems like RSA and elliptic curve cryptography (ECC) rely – integer factoring and computing discrete logarithms, respectively – are efficiently solvable with quantum computing.

A quantum computer can help to solve some of the problems that are intractable on a classical computer. In theory, they could efficiently solve some fundamental problems in mathematics. This amazing computing power would be highly beneficial, which is why companies are actually trying to build quantum computers. At first, Shor’s algorithm was merely a theoretical result – quantum computers powerful enough to execute it did not exist – but this is quickly changing. In March 2018, Google announced a 72-qubit universal quantum computer. While this is not enough to break say RSA-2048 (still more is needed), many fundamental problems have already been solved.

In anticipation of wide-spread quantum computing, we must start the transition from classical public-key cryptography primitives to post-quantum (PQ) alternatives. It may be that consumers will never get to hold a quantum computer, but a few powerful attackers who will get one can still pose a serious threat. Moreover, under the assumption that current TLS handshakes and ciphertexts are being captured and stored, a future attacker could crack these stored individual session keys and use those results to decrypt the corresponding individual ciphertexts. Even strong security guarantees, like forward secrecy, do not help out much there.

In 2006, the academic research community launched a conference series dedicated to finding alternatives to RSA and ECC. This so-called post-quantum cryptography should run efficiently on a classical computer, but it should also be secure against attacks performed by a quantum computer. As a research field, it has grown substantially in popularity.

Several companies, including Google, Microsoft, Digicert and Thales, are already testing the impact of deploying PQ cryptography. Cloudflare is involved in some of this, but we want to be a company that leads in this direction. The first thing we need to do is understand the real costs of deploying PQ cryptography, and that’s not obvious at all.

### What options do we have?

Many submissions to the NIST project are still under study. Some are very new and little understood; others are more mature and already standardized as RFCs. Some have been broken or withdrawn from the process; others are more conservative or illustrate how far classical cryptography would need to be pushed so that a quantum computer could not crack it within a reasonable cost. Some are very slow and big; others are not. But most cryptographic schemes can be categorized into these families: lattice-based, multivariate, hash-based (signatures only), code-based and isogeny-based.

For some algorithms, nevertheless, there is a fear they may be too inconvenient to use with today’s Internet. We must also be able to integrate new cryptographic schemes with existing protocols, such as SSH or TLS. To do that, designers of PQ cryptosystems must consider these characteristics:

• Latency caused by encryption and decryption on both ends of the communication channel, assuming a variety of devices from big and fast servers to slow and memory constrained IoT (Internet of Things) devices
• Small public keys and signatures to minimize bandwidth
• Clear design that allows cryptanalysis and determining weaknesses that could be exploited
• Use of existing hardware for fast implementation

The work on post-quantum public key cryptosystems must be done in a full view of organizations, governments, cryptographers, and the public. Emerging ideas must be properly vetted by this community to ensure widespread support.

### Helping Build a Better Internet

To better understand the post-quantum world, Cloudflare began experimenting with these algorithms and used them to provide confidentiality in TLS connections.

With Google, we are proposing a wide-scale experiment that combines client- and server-side data collection to evaluate the performance of key-exchange algorithms on actual users’ devices. We hope that this experiment helps choose an algorithm with the best characteristics for the future of the Internet. With Cloudflare’s highly distributed network of access points and Google’s Chrome browser, both companies are in a very good position to perform this experiment.

Our goal is to understand how these algorithms act when used by real clients over real networks, particularly candidate algorithms with significant differences in public-key or ciphertext sizes. Our focus is on how different key sizes affect handshake time in the context of Transport Layer Security (TLS) as used on the web over HTTPS.

Our primary candidates are an NTRU-based construction called HRSS-SXY (by Hülsing – Rijneveld – Schanck – Schwabe, and Tsunekazu Saito – Keita Xagawa – Takashi Yamakawa) and an isogeny-based Supersingular Isogeny Key Encapsulation (SIKE). Details of both algorithms are described in more detail below in section “Dive into post-quantum cryptography”. This table shows a few characteristics for both algorithms. Performance timings were obtained by running the BoringSSL speed test on an Intel Skylake CPU.

KEM Public Key size (bytes) Ciphertext (bytes) Secret size (bytes) KeyGen (op/sec) Encaps (op/sec) Decaps (op/sec) NIST level
HRSS-SXY 1138 1138 32 3952.3 76034.7 21905.8 1
SIKE/p434 330 346 16 367.1 228.0 209.3 1

Currently the most commonly used key exchange algorithm (according to Cloudflare’s data) is the non-quantum X25519. Its public keys are 32 bytes and BoringSSL can generate 49301.2 key pairs, and is able to perform 19628.6 key agreements every second on my Skylake CPU.

Note that HRSS-SXY shows a significant speed advantage, while SIKE has a size advantage. In our experiment, we will deploy these two algorithms on both the server side using Cloudflare’s infrastructure, and the client side using Chrome Canary; both sides will collect telemetry information about TLS handshakes using these two PQ algorithms to see how they perform in practice.

### What do we expect to find?

In 2018, Adam Langley conducted an experiment with the goal of evaluating the likely latency impact of a post-quantum key exchange in TLS. Chrome was augmented with the ability to include a dummy, arbitrarily-sized extension in the TLS ClientHello (fixed number of bytes of random noise). After taking into account the performance and key size offered by different types key-exchange schemes, he concluded that constructs based on structured lattices may be most suitable for future use in TLS.

However, Langley also observed a peculiar phenomenon; client connections measured at 95th percentile had much higher latency than the median. It means that in those cases, isogeny-based systems may be a better choice. In the Dive into post-quantum cryptography, we describe the difference between isogeny-based SIKE and lattice-based NTRU cryptosystems.

In our experiment, we want to more thoroughly evaluate and ascribe root causes to these unexpected latency increases. We would particularly like to learn more about the characteristics of those networks: what causes increased latency? how does the performance cost of isogeny-based algorithms impact the TLS handshake? We want to answer key questions, like:

• What is a good ratio for speed-to-key size (or how much faster could SIKE get to achieve the client-perceived performance of HRSS)?
• How do network middleboxes behave when clients use new PQ algorithms, and which networks have problematic middleboxes?
• How do the different properties of client networks affect TLS performance with different PQ key exchanges? Can we identify specific autonomous systems, device configurations, or network configurations that favor one algorithm over another? How is performance affected in the long tail?

### Experiment Design

Our experiment will involve both server- and client-side performance statistics collection from real users around the world (all the data is anonymized). Cloudflare is operating the server-side TLS connections. We will enable the CECPQ2 (HRSS + X25519) and CECPQ2b (SIKE + X25519) key-agreement algorithms on all TLS-terminating edge servers.

In this experiment, the ClientHello will contain a CECPQ2 or CECPQ2b public key (but never both). Additionally, Chrome will always include X25519 for servers that do not support post-quantum key exchange. The post-quantum key exchange will only be negotiated in TLS version 1.3 when both sides support it.

Since Cloudflare only measures the server side of the connection, it is impossible to determine the time it takes for a ClientHello sent from Chrome to reach Cloudflare’s edge servers; however, we can measure the time it takes for the TLS ServerHello message containing post-quantum key exchange, to reach the client and for the client to respond.

On the client side, Chrome Canary will operate the TLS connection. Google will enable either CECPQ2 or CECPQ2b in Chrome for the following mix of architecture and OSes:

• x86-64: Windows, Linux, macOS, ChromeOS
• aarch64: Android

Our high-level expectation is to get similar results as Langley’s original experiment in 2018 — slightly increased latency for the 50th percentile and higher latency for the 95th. Unfortunately, data collected purely from real users’ connections may not suffice for diagnosing the root causes of why some clients experience excessive slowdown. To this end, we will perform follow-up experiments based on per-client information we collect server-side.

Our primary hypothesis is that excessive slowdowns, like those Langley observed, are largely due to in-network events, such as middleboxes or bloated/lossy links. As a first-pass analysis, we will investigate whether the slowed-down clients share common network features, like common ASes, common transit networks, common link types, and so on. To determine this, we will run a traceroute from vantage points close to our servers back toward the clients (not overloading any particular links or hosts) and study whether some client locations are subject to slowdowns for all destinations or just for some.

### Dive into post-quantum cryptography

Be warned: the details of PQ cryptography may be quite complicated. In some cases it builds on classical cryptography, and in other cases it is completely different math. It would be rather hard to describe details in a single blog post. Instead, we are giving you an intuition of post-quantum cryptography, rather than provide deep academic-level descriptions. We’re skipping a lot of details for the sake of brevity. Nevertheless, settle in for a bit of an epic journey because we have a lot to cover.

### Key encapsulation mechanism

NIST requires that all key-agreement algorithms have a form of key-encapsulation mechanism (KEM). The KEM is a simplified form of public key encryption (PKE). As PKE, it also allows agreement on a secret, but in a slightly different way. The idea is that the session key is an output of the encryption algorithm, conversely to public key encryption schemes where session key is an input to the algorithm. In a KEM, Alice generates a random key and uses the pre-generated public key from Bob to encrypt (encapsulate) it. This results in a ciphertext sent to Bob. Bob uses his private key to decrypt (decapsulate) the ciphertext and retrieve the random key. The idea was initially introduced by Cramer and Shoup. Experience shows that such constructs are easier to design, analyze, and implement as the scheme is limited to communicating a fixed-size session key. Leonardo Da Vinci said, “Simplicity is the ultimate sophistication,” which is very true in cryptography.

The key exchange (KEX) protocol, like Diffie-Hellman, is yet a different construct: it allows two parties to agree on a shared secret that can be used as a symmetric encryption key. For example, Alice generates a key pair and sends a public key to Bob. Bob does the same and uses his own key pair with Alice’s public key to generate the shared secret. He then sends his public key to Alice who can now generate the same shared secret. What’s worth noticing is that both Alice and Bob perform exactly the same operations.

KEM construction can be converted to KEX. Alice performs key generation and sends the public key to Bob. Bob uses it to encapsulate a symmetric session key and sends it back to Alice. Alice decapsulates the ciphertext received from Bob and gets the symmetric key. This is actually what we do in our experiment to make integration with the TLS protocol less complicated.

### NTRU Lattice-based Encryption

We will enable the CECPQ2 implemented by Adam Langley from Google on our servers. He described this implementation in detail here. This key exchange uses the HRSS algorithm, which is based on the NTRU (N-Th Degree TRUncated Polynomial Ring) algorithm. Foregoing too much detail, I am going to explain how NTRU works and give simplified examples, and finally, compare it to HRSS.

NTRU is a cryptosystem based on a polynomial ring. This means that we do not operate on numbers modulo a prime (like in RSA), but on polynomials of degree $$N$$ , where the degree of a polynomial is the highest exponent of its variable. For example, $$x^7 + 6x^3 + 11x^2$$ has degree of 7.

One can add polynomials in the ring in the usual way, by simply adding theirs coefficients modulo some integer. In NTRU this integer is called $$q$$. Polynomials can also be multiplied, but remember, you are operating in the ring, therefore the result of a multiplication is always a polynomial of degree less than $$N$$. It basically means that exponents of the resulting polynomial are added to modulo $$N$$.

In other words, polynomial ring arithmetic is very similar to modular arithmetic, but instead of working with a set of numbers less than N, you are working with a set of polynomials with a degree less than N.

To instantiate the NTRU cryptosystem, three domain parameters must be chosen:

• $$N$$ – degree of the polynomial ring, in NTRU the principal objects are polynomials of degree $$N-1$$.
• $$p$$ – small modulus used during key generation and decryption for reducing message coefficients.
• $$q$$ – large modulus used during algorithm execution for reducing coefficients of the polynomials.

First, we generate a pair of public and private keys. To do that, two polynomials $$f$$ and $$g$$ are chosen from the ring in a way that their randomly generated coefficients are much smaller than $$q$$. Then key generation computes two inverses of the polynomial: $$f_p= f^{-1} \bmod{p} \\ f_q= f^{-1} \bmod{q}$$

The last step is to compute $$pk = p\cdot f_q\cdot g \bmod q$$, which we will use as public key pk. The private key consists of $$f$$ and $$f_p$$. The $$f_q$$ is not part of any key, however it must remain secret.

It might be the case that after choosing $$f$$, the inverses modulo $$p$$ and $$q$$ do not exist. In this case, the algorithm has to start from the beginning and generate another $$f$$. That’s unfortunate because calculating the inverse of a polynomial is a costly operation. HRSS brings an improvement to this issue since it ensures that those inverses always exist, making key generation faster than as proposed initially in NTRU.

The encryption of a message $$m$$ proceeds as follows. First, the message $$m$$ is converted to a ring element $$pt$$ (there exists an algorithm for performing this conversion in both directions). During encryption, NTRU randomly chooses one polynomial $$b$$ called blinder. The goal of the blinder is to generate different ciphertexts per encyption. Thus, the ciphetext $$ct$$ is obtained as $$ct = (b\cdot pk + pt ) \bmod q$$ Decryption looks a bit more complicated but it can also be easily understood. Decryption uses both the secret value $$f$$ and to recover the plaintext as $$v = f \cdot ct \bmod q \\ pt = v \cdot f_p \bmod p$$

This diagram demonstrates why and how decryption works.

After obtaining $$pt$$, the message $$m$$ is recovered by inverting the conversion function.

The underlying hard assumption is that given two polynomials: $$f$$ and $$g$$ whose coefficients are short compared to the modulus $$q$$, it is difficult to distinguish $$pk = \frac{f}{g}$$ from a random element in the ring. It means that it’s hard to find $$f$$ and $$g$$ given only public key pk.

### Lattices

NTRU cryptosystem is a grandfather of lattice-based encryption schemes. The idea of using  difficult problems for cryptographic purposes was due to Ajtai. His work evolved into a whole area of research with the goal of creating more practical, lattice-based cryptosystems.

### What is a lattice and why it can be used for post-quantum crypto?

The picture below visualizes lattice as points in a two-dimensional space. A lattice is defined by the origin $$O$$ and base vectors $$\{ b_1 , b_2\}$$. Every point on the lattice is represented as a linear combination of the base vectors, for example  $$V = -2b_1+b_2$$.

There are two classical NP-hard problems in lattice-based cryptography:

1. Shortest Vector Problem (SVP): Given a lattice, to find the shortest non-zero vector in the lattice. In the graph, the vector $$s$$ is the shortest one. The SVP problem is NP-hard only under some assumptions.
2. Closest Vector Problem (CVP). Given a lattice and a vector $$V$$ (not necessarily in the lattice), to find the closest vector to $$V$$. For example, the closest vector to $$t$$ is $$z$$.

In the graph above, it is easy for us to solve SVP and CVP by simple inspection. However, the lattices used in cryptography have higher dimensions, say above 1000, as well as highly non-orthogonal basis vectors. On these instances, the problems get extremely hard to solve. It’s even believed future quantum computers will have it tough.

HRSS, which we use in our experiment, is based on NTRU, but a slightly better instantiation. The main improvements are:

• Faster key generation algorithm.
• NTRU encryption can produce ciphertexts that are impossible to decrypt (true for many lattice-based schemes). But HRSS fixes this problem.
• HRSS is a key encapsulation mechanism.

### CECPQ2b – Isogeny-based Post-Quantum TLS

Following CECPQ2, we have integrated into BoringSSL another hybrid key exchange mechanism relying on SIKE. It is called CECPQ2b and we will use it in our experimentation in TLS 1.3. SIKE is a key encapsulation method based on Supersingular Isogeny Diffie-Hellman (SIDH). Read more about SIDH in our previous post. The math behind SIDH is related to elliptic curves. A comparison between SIDH and the classical Elliptic Curve Diffie-Hellman (ECDH) is given.

An elliptic curve is a set of points that satisfy a specific mathematical equation. The equation of an elliptic curve may have multiple forms, the standard form is called the Weierstrass equation $$y^2 = x^3 +ax +b$$ and its shape can look like the red curve.

An interesting fact about elliptic curves is have a group structure. That is, the set of points on the curve have associated a binary operation called point addition. The set of points on the elliptic curve is closed under addition. Thus, adding two points results in another point that is also on the elliptic curve.

If we can add two different points on a curve, then we can also add one point to itself. And if we do it multiple times, then the resulting operations is known as a scalar multiplication and denoted as  $$Q = k\cdot P = P+P+\dots+P$$ for an integer $$k$$.

Multiplication of scalars is commutative. It means that two scalar multiplications can be evaluated in any order $$\color{darkred}{k_a}\cdot\color{darkgreen}{k_b} = \color{darkgreen}{k_b}\cdot\color{darkred}{k_a}$$; this an important property that makes ECDH possible.

It turns out that carefully if choosing an elliptic curve “correctly”, scalar multiplication is easy to compute but extremely hard to reverse. Meaning, given two points $$Q$$ and $$P$$ such that $$Q=k\cdot P$$, finding the integer k is a difficult task known as the Elliptic Curve Discrete Logarithm problem (ECDLP). This problem is suitable for cryptographic purposes.

Alice and Bob agree on a secret key as follows. Alice generates a private key $$k_a$$. Then, she uses some publicly known point $$P$$ and calculates her public key as $$Q_a = k_a\cdot P$$. Bob proceeds in similar fashion and gets $$k_b$$ and $$Q_b = k_b\cdot P$$. To agree on a shared secret, each party multiplies their private key with the public key of the other party. The result of this is the shared secret. Key agreement as described above, works thanks to the fact that scalars can commute:
$$\color{darkgreen}{k_a} \cdot Q_b = \color{darkgreen}{k_a} \cdot \color{darkred}{k_b} \cdot P \iff \color{darkred}{k_b} \cdot \color{darkgreen}{k_a} \cdot P = \color{darkred}{k_b} \cdot Q_a$$

There is a vast theory behind elliptic curves. An introduction to elliptic curve cryptography was posted before and more details can be found in this book. Now, lets describe SIDH and compare with ECDH.

### Isogenies on Elliptic Curves

Before explaining the details of SIDH key exchange, I’ll explain the 3 most important concepts, namely: j-invariant, isogeny and its kernel.

Each curve has a number that can be associated to it. Let’s call this number a j-invariant. This number is not unique per curve, meaning many curves have the same value of j-invariant, but it can be viewed as a way to group multiple elliptic curves into disjoint sets. We say that two curves are isomorphic if they are in the same set, called the isomorphism class. The j-invariant is a simple criterion to determine whether two curves are isomorphic. The j-invariant of a curve $$E$$ in Weierstrass form $$y^2 = x^3 + ax + b$$ is given as $$j(E) = 1728\frac{4a^3}{4^3 +27b^2}$$

When it comes to isogeny, think about it as a map between two curves. Each point on some curve $$E$$ is mapped by isogeny to the point on isogenous curve $$E’$$. We denote mapping from curve $$E$$ to $$E’$$ by isogeny $$\phi$$ as:

$$\phi: E \rightarrow E’$$

It depends on the map if those two curves are isomorphic or not. Isogeny can be visualised as:

There may exist many of those mappings, each curve used in SIDH has small number of isogenies to other curves. Natural question is how do we compute such isogeny. Here is where the kernel of an isogeny comes. The kernel uniquely determines isogeny (up to isomorphism class). Formulas for calculating isogeny from its kernel were initially given by J. Vélu and the idea of calculating them efficiently was extended .

To finish, I will summarize what was said above with a picture.

There are two isomorphism classes on the picture above. Both curves $$E_1$$ and $$E_2$$ are isomorphic and have  j-invariant = 6. As curves $$E_3$$ and $$E_4$$ have j-invariant=13, they are in a different isomorphism class. There exists an isogeny $$\phi_2$$ between curve $$E_3$$ and $$E_2$$, so they both are isogeneous. Curves $$\phi_1$$ and $$E_2$$ are isomorphic and there is isogeny $$\phi_1$$ between them. Curves $$E_1$$ and $$E_4$$ are neither isomorphic nor isogeneus.

For brevity I’m skipping many important details, like details of the finite field, the fact that isogenies must be separable and that the kernel is finite. But curious readers can find a number of academic research papers available on the Internet.

### Big picture: similarities with ECDH

Let’s generalize the ECDH algorithm described above, so that we can swap some elements and try to use Supersingular Isogeny Diffie-Hellman.

Note that what actually happens during an ECDH key exchange is:

• We have a set of points on elliptic curve, set S
• We have another group of integers used for point multiplication, G
• We use an element from Z to act on an element from S to get another element from S:

$$G \cdot S \rightarrow S$$

Now the question is: what is our G and S in an SIDH setting? For SIDH to work, we need a big set of elements and something secret that will act on the elements from that set. This “group action” must also be resistant to attacks performed by quantum computers.

In the SIDH setting, those two sets are defined as:

• Set S is a set (graph) of j-invariants, such that all the curves are supersingular: $$S = [j(E_1), j(E_2), j(E_3), …. , j(E_n)]$$
• Set G is a set of isogenies acting on elliptic curves and transforming, for example, the elliptic curve $$E_1$$ into $$E_n$$:

### Random walk on supersingular graph

When we talk about Isogeny Based Cryptography, as a topic distinct from Elliptic Curve Cryptography, we usually mean algorithms and protocols that rely fundamentally on the structure of isogeny graphs. An example of such a (small) graph is pictured below.

Each vertex of the graph represents a different j-invariant of a set of supersingular curves. The edges between vertices represent isogenies converting one elliptic curve to another. As you can notice, the graph is strongly connected, meaning every vertex can be reached from every other vertex. In the context of isogeny-based crypto, we call such a graph a supersingular isogeny graph. I’ll skip some technical details about the construction of this graph (look for those here or here), but instead describe ideas about how it can be used.

As the graph is strongly connected, it is possible to walk a whole graph by starting from any vertex, randomly choosing an edge, following it to the next vertex and then start the process again on a new vertex. Such a way of visiting edges of this graph is called a random walk.

The random walk is a key concept that makes isogeny based crypto feasible. When you look closely at the graph, you can notice that each vertex has a small number of edges incident to it, this is why we can compute the isogenies efficiently. But also for any vertex there is only a limited number of isogenies to choose from, which doesn’t look like good base for a cryptographic scheme. The key question is – where does the security of the scheme come from exactly? In order to get it, it is necessary to visit a couple hundred vertices. What it means in practice is that secret isogeny (of large degree) is constructed as a composition of multiple isogenies (of small, prime degree).  Which means, the secret isogeny is:

This property and properties of the isogeny graph are what makes some of us believe that scheme has a good chance to be secure. More specifically, there is no efficient way of finding a path that connects $$E_0$$ with $$E_n$$, even with quantum computer at hand. The security level of a system depends on value n – the number of steps taken during the walk.

The random walk is a core process used when both generating public keys and computing shared secrets. It starts with party generating random value m (see more below), starting curve $$E_0$$ and points P and Q on this curve. Those values are used to compute the kernel of an isogeny $$R_1$$ in the following way:

$$R_1 = P + m \cdot Q$$

Thanks to formulas given by Vélu we can now use the point $$R_1$$ to compute the isogeny, the party will choose to move from a vertex to another one. After the isogeny $$\phi_{R_1}$$ is calculated it is applied to $$E_0$$  which results in a new curve $$E_1$$:

$$\phi_{R_1}: E_0 \rightarrow E_1$$

Isogeny is also applied to points P and Q. Once on $$E_1$$ the process is repeated. This process is applied n times, and at the end a party ends up on some curve $$E_n$$ which defines isomorphism class, so also j-invariant.

### Supersingular Isogeny Diffie-Hellman

The core idea in SIDH is to compose two random walks on an isogeny graph of elliptic curves in such a way that the end node of both ways of composing is the same.

In order to do it, scheme sets public parameters – starting curve $$E_0$$ and 2 pairs of base points on this curve $$(PA,QA)$$ , $$(PB,QB)$$. Alice generates her random secret keys m, and calculates a secret isogeny $$\phi_q$$ by performing a random walk as described above. The walk finishes with 3 values: elliptic curve $$E_a$$ she has ended up with and pair of points $$\phi_a(PB)$$ and $$\phi_a(QB)$$ after pushing through Alice’s secret isogeny. Bob proceeds analogously which results in the triple $${E_b, \phi_b(PA), \phi_b(QA)}$$. The triple forms a public key which is exchanged between parties.

The picture below visualizes the operation. The black dots represent curves grouped in the same isomorphism classes represented by light blue circles. Alice takes the orange path ending up on a curve $$E_a$$ in a separate isomorphism class than Bob after taking his dark blue path ending on $$E_b$$. SIDH is parametrized in a way that Alice and Bob will always end up in different isomorphism classes.

Upon receipt of triple $${ E_a, \phi_a(PB), \phi_a(QB) }$$  from Alice, Bob will use his secret value m to calculate a new kernel – but instead of using point $$PA$$ and $$QA$$ to calculate an isogeny kernel, he will now use images $$\phi_a(PB)$$ and $$\phi_a(QB)$$ received from Alice:

$$R’_1 = \phi_a(PB) + m \cdot \phi_a(QB)$$

Afterwards, he uses $$R’_1$$ to start the walk again resulting in the isogeny $$\phi’_b: E_a \rightarrow E_{ab}$$. Allice proceeds analogously resulting in the isogeny $$\phi’_a: E_b \rightarrow E_{ba}$$. With isogenies calculated this way, both Alice and Bob will converge in the same isomorphism class. The math math may seem complicated, hopefully the picture below makes it easier to understand.

Bob computes a new isogeny and starts his random walk from $$E_a$$ received from Alice. He ends up on some curve $$E_{ba}$$. Similarly, Alice calculates a new isogeny, applies it on $$E_b$$ received from Bob and her random walk ends on some curve $$E_{ab}$$. Curves $$E_{ab}$$ and $$E_{ba}$$ are not likely to be the same, but construction guarantees that they are isomorphic. As mentioned earlier, isomorphic curves have the same value of j-invariant,  hence the shared secret is a value of j-invariant $$j(E_{ab})$$.

Coming back to differences between SIDH and ECDH – we can split them into four categories: the elements of the group we are operating on, the cornerstone computation required to agree on a shared secret, the elements representing secret values, and the difficult problem on which the security relies.

In ECDH there is a secret key which is an integer scalar, in case of SIDH it is a secret isogeny, which also is generated from an integer scalar. In the case of ECDH one multiplies a point on a curve by a scalar, in the case of SIDH it is a random walk in an isogeny graph. In the case of ECDH, the public key is a point on a curve, in the case of SIDH, the public part is a curve itself and the image of some points after applying isogeny. The shared secret in the case of ECDH is a point on a curve, in the case of SIDH it is a j-invariant.

### SIKE: Supersingular Isogeny Key Encapsulation

SIDH could potentially be used as a drop-in replacement of the ECDH protocol. We have actually implemented a proof-of-concept and added it to our implementation of TLS 1.3 in the tls-tris library and described (together with Mozilla) implementation details in this draft. Nevertheless, there is a problem with SIDH – the keys can be used only once. In 2016, a few researchers came up with an active attack on SIDH which works only when public keys are reused. In the context of TLS, it is not a big problem, because for each session a fresh key pair is generated (ephemeral keys), but it may not be true for other applications.

SIKE is an isogeny key encapsulation which solves this problem. Bob can generate SIKE keys, upload the public part somewhere in the Internet and then anybody can use it whenever he wants to communicate with Bob securely. SIKE reuses SIDH – internally both sides of the connection always perform SIDH key generation, SIDH key agreement and apply some other cryptographic primitives in order to convert SIDH to KEM. SIKE is implemented in a few variants – each variant corresponds to the security levels using 128-, 192- and 256-bit secret keys. Higher security level means longer running time. More details about SIKE can be found here.

SIKE is also one of the candidates in NIST post-quantum “competition“.

I’ve skipped many important details to give a brief description of how isogeny based crypto works. If you’re curious and hungry for details, look at either of these Cloudflare meetups, where Deirdre Connolly talked about isogeny-based cryptography or this talk by Chloe Martindale during PQ Crypto School 2017. And if you would like to know more about quantum attacks on this scheme, I highly recommend this work.

## Conclusion

Quantum computers that can break meaningful cryptographic parameter settings do not exist, yet. They won’t be built for at least the next few years. Nevertheless, they have already changed the way we look at current cryptographic deployments. There are at least two reasons it’s worth investing in PQ cryptography:

• It takes a lot of time to build secure cryptography and we don’t actually know when today’s classical cryptography will be broken. There is a need for a good mathematical base: an initial idea of what may be secure against something that doesn’t exist yet. If you have an idea, you also need good implementation, constant time, resistance to things like time and cache side-channels, DFA, DPA, EM, and a bunch of other abbreviations indicating side-channel resistance. There is also deployment of, for example, algorithms based on elliptic curves were introduced in ’85, but started to really be used in production only during the last decade, 20 or so years later. Obviously, the implementation must be blazingly fast! Last, but not least, integration: we need time to develop standards to allow integration of PQ cryptography with protocols like TLS.
• Even though efficient quantum computers probably won’t exist for another few years, the threat is real. Data encrypted with current cryptographic algorithms can be recorded now with hopes of being broken in the future.

Cloudflare is motivated to help build the Internet of tomorrow with the tools at hand today. Our interest is in cryptographic techniques that can be integrated into existing protocols and widely deployed on the Internet as seamlessly as possible. PQ cryptography, like the rest of cryptography, includes many cryptosystems that can be used for communications in today’s Internet; Alice and Bob need to perform some computation, but they do not need to buy new hardware to do that.

Cloudflare sees great potential in those algorithms and believes that some of them can be used as a safe replacement for classical public-key cryptosystems. Time will tell if we’re justified in this belief!

# Towards Post-Quantum Cryptography in TLS

Post Syndicated from Kris Kwiatkowski original https://blog.cloudflare.com/towards-post-quantum-cryptography-in-tls/

We live in a completely connected society. A society connected by a variety of devices: laptops, mobile phones, wearables, self-driving or self-flying things. We have standards for a common language that allows these devices to communicate with each other. This is critical for wide-scale deployment – especially in cryptography where the smallest detail has great importance.

One of the most important standards-setting organizations is the National Institute of Standards and Technology (NIST), which is hugely influential in determining which standardized cryptographic systems see worldwide adoption. At the end of 2016, NIST announced it would hold a multi-year open project with the goal of standardizing new post-quantum (PQ) cryptographic algorithms secure against both quantum and classical computers.

Many of our devices have very different requirements and capabilities, so it may not be possible to select a “one-size-fits-all” algorithm during the process. NIST mathematician, Dustin Moody, indicated that institute will likely select more than one algorithm:

“There are several systems in use that could be broken by a quantum computer – public-key encryption and digital signatures, to take two examples – and we will need different solutions for each of those systems.”

Initially, NIST selected 82 candidates for further consideration from all submitted algorithms. At the beginning of 2019, this process entered its second stage. Today, there are 26 algorithms still in contention.

### Post-quantum cryptography: what is it really and why do I need it?

In 1994, Peter Shor made a significant discovery in quantum computation. He found an algorithm for integer factorization and computing discrete logarithms, both believed to be hard to solve in classical settings. Since then it has become clear that the ‘hard problems’ on which cryptosystems like RSA and elliptic curve cryptography (ECC) rely – integer factoring and computing discrete logarithms, respectively – are efficiently solvable with quantum computing.

A quantum computer can help to solve some of the problems that are intractable on a classical computer. In theory, they could efficiently solve some fundamental problems in mathematics. This amazing computing power would be highly beneficial, which is why companies are actually trying to build quantum computers. At first, Shor’s algorithm was merely a theoretical result – quantum computers powerful enough to execute it did not exist – but this is quickly changing. In March 2018, Google announced a 72-qubit universal quantum computer. While this is not enough to break say RSA-2048 (still more is needed), many fundamental problems have already been solved.

In anticipation of wide-spread quantum computing, we must start the transition from classical public-key cryptography primitives to post-quantum (PQ) alternatives. It may be that consumers will never get to hold a quantum computer, but a few powerful attackers who will get one can still pose a serious threat. Moreover, under the assumption that current TLS handshakes and ciphertexts are being captured and stored, a future attacker could crack these stored individual session keys and use those results to decrypt the corresponding individual ciphertexts. Even strong security guarantees, like forward secrecy, do not help out much there.

In 2006, the academic research community launched a conference series dedicated to finding alternatives to RSA and ECC. This so-called post-quantum cryptography should run efficiently on a classical computer, but it should also be secure against attacks performed by a quantum computer. As a research field, it has grown substantially in popularity.

Several companies, including Google, Microsoft, Digicert and Thales, are already testing the impact of deploying PQ cryptography. Cloudflare is involved in some of this, but we want to be a company that leads in this direction. The first thing we need to do is understand the real costs of deploying PQ cryptography, and that’s not obvious at all.

### What options do we have?

Many submissions to the NIST project are still under study. Some are very new and little understood; others are more mature and already standardized as RFCs. Some have been broken or withdrawn from the process; others are more conservative or illustrate how far classical cryptography would need to be pushed so that a quantum computer could not crack it within a reasonable cost. Some are very slow and big; others are not. But most cryptographic schemes can be categorized into these families: lattice-based, multivariate, hash-based (signatures only), code-based and isogeny-based.

For some algorithms, nevertheless, there is a fear they may be too inconvenient to use with today’s Internet. We must also be able to integrate new cryptographic schemes with existing protocols, such as SSH or TLS. To do that, designers of PQ cryptosystems must consider these characteristics:

• Latency caused by encryption and decryption on both ends of the communication channel, assuming a variety of devices from big and fast servers to slow and memory constrained IoT (Internet of Things) devices
• Small public keys and signatures to minimize bandwidth
• Clear design that allows cryptanalysis and determining weaknesses that could be exploited
• Use of existing hardware for fast implementation

The work on post-quantum public key cryptosystems must be done in a full view of organizations, governments, cryptographers, and the public. Emerging ideas must be properly vetted by this community to ensure widespread support.

### Helping Build a Better Internet

To better understand the post-quantum world, Cloudflare began experimenting with these algorithms and used them to provide confidentiality in TLS connections.

With Google, we are proposing a wide-scale experiment that combines client- and server-side data collection to evaluate the performance of key-exchange algorithms on actual users’ devices. We hope that this experiment helps choose an algorithm with the best characteristics for the future of the Internet. With Cloudflare’s highly distributed network of access points and Google’s Chrome browser, both companies are in a very good position to perform this experiment.

Our goal is to understand how these algorithms act when used by real clients over real networks, particularly candidate algorithms with significant differences in public-key or ciphertext sizes. Our focus is on how different key sizes affect handshake time in the context of Transport Layer Security (TLS) as used on the web over HTTPS.

Our primary candidates are an NTRU-based construction called HRSS-SXY (by Hülsing – Rijneveld – Schanck – Schwabe, and Tsunekazu Saito – Keita Xagawa – Takashi Yamakawa) and an isogeny-based Supersingular Isogeny Key Encapsulation (SIKE). Details of both algorithms are described in more detail below in section “Dive into post-quantum cryptography”. This table shows a few characteristics for both algorithms. Performance timings were obtained by running the BoringSSL speed test on an Intel Skylake CPU.

KEM Public Key size (bytes) Ciphertext (bytes) Secret size (bytes) KeyGen (op/sec) Encaps (op/sec) Decaps (op/sec) NIST level
HRSS-SXY 1138 1138 32 3952.3 76034.7 21905.8 1
SIKE/p434 330 346 16 367.1 228.0 209.3 1

Currently the most commonly used key exchange algorithm (according to Cloudflare’s data) is the non-quantum X25519. Its public keys are 32 bytes and BoringSSL can generate 49301.2 key pairs, and is able to perform 19628.6 key agreements every second on my Skylake CPU.

Note that HRSS-SXY shows a significant speed advantage, while SIKE has a size advantage. In our experiment, we will deploy these two algorithms on both the server side using Cloudflare’s infrastructure, and the client side using Chrome Canary; both sides will collect telemetry information about TLS handshakes using these two PQ algorithms to see how they perform in practice.

### What do we expect to find?

In 2018, Adam Langley conducted an experiment with the goal of evaluating the likely latency impact of a post-quantum key exchange in TLS. Chrome was augmented with the ability to include a dummy, arbitrarily-sized extension in the TLS ClientHello (fixed number of bytes of random noise). After taking into account the performance and key size offered by different types key-exchange schemes, he concluded that constructs based on structured lattices may be most suitable for future use in TLS.

However, Langley also observed a peculiar phenomenon; client connections measured at 95th percentile had much higher latency than the median. It means that in those cases, isogeny-based systems may be a better choice. In the “Dive into post-quantum cryptography”, we describe the difference between isogeny-based SIKE and lattice-based NTRU cryptosystems.

In our experiment, we want to more thoroughly evaluate and ascribe root causes to these unexpected latency increases. We would particularly like to learn more about the characteristics of those networks: what causes increased latency? how does the performance cost of isogeny-based algorithms impact the TLS handshake? We want to answer key questions, like:

• What is a good ratio for speed-to-key size (or how much faster could SIKE get to achieve the client-perceived performance of HRSS)?
• How do network middleboxes behave when clients use new PQ algorithms, and which networks have problematic middleboxes?
• How do the different properties of client networks affect TLS performance with different PQ key exchanges? Can we identify specific autonomous systems, device configurations, or network configurations that favor one algorithm over another? How is performance affected in the long tail?

### Experiment Design

Our experiment will involve both server- and client-side performance statistics collection from real users around the world (all the data is anonymized). Cloudflare is operating the server-side TLS connections. We will enable the CECPQ2 (HRSS + X25519) and CECPQ2b (SIKE + X25519) key-agreement algorithms on all TLS-terminating edge servers.

In this experiment, the ClientHello will contain a CECPQ2 or CECPQ2b public key (but never both). Additionally, Chrome will always include X25519 for servers that do not support post-quantum key exchange. The post-quantum key exchange will only be negotiated in TLS version 1.3 when both sides support it.

Since Cloudflare only measures the server side of the connection, it is impossible to determine the time it takes for a ClientHello sent from Chrome to reach Cloudflare’s edge servers; however, we can measure the time it takes for the TLS ServerHello message containing post-quantum key exchange, to reach the client and for the client to respond.

On the client side, Chrome Canary will operate the TLS connection. Google will enable either CECPQ2 or CECPQ2b in Chrome for the following mix of architecture and OSes:

• x86-64: Windows, Linux, macOS, ChromeOS
• aarch64: Android

Our high-level expectation is to get similar results as Langley’s original experiment in 2018 — slightly increased latency for the 50th percentile and higher latency for the 95th. Unfortunately, data collected purely from real users’ connections may not suffice for diagnosing the root causes of why some clients experience excessive slowdown. To this end, we will perform follow-up experiments based on per-client information we collect server-side.

Our primary hypothesis is that excessive slowdowns, like those Langley observed, are largely due to in-network events, such as middleboxes or bloated/lossy links. As a first-pass analysis, we will investigate whether the slowed-down clients share common network features, like common ASes, common transit networks, common link types, and so on. To determine this, we will run a traceroute from vantage points close to our servers back toward the clients (not overloading any particular links or hosts) and study whether some client locations are subject to slowdowns for all destinations or just for some.

### Dive into post-quantum cryptography

Be warned: the details of PQ cryptography may be quite complicated. In some cases it builds on classical cryptography, and in other cases it is completely different math. It would be rather hard to describe details in a single blog post. Instead, we are giving you an intuition of post-quantum cryptography, rather than provide deep academic-level descriptions. We’re skipping a lot of details for the sake of brevity. Nevertheless, settle in for a bit of an epic journey because we have a lot to cover.

### Key encapsulation mechanism

NIST requires that all key-agreement algorithms have a form of key-encapsulation mechanism (KEM). The KEM is a simplified form of public key encryption (PKE). As PKE, it also allows agreement on a secret, but in a slightly different way. The idea is that the session key is an output of the encryption algorithm, conversely to public key encryption schemes where session key is an input to the algorithm. In a KEM, Alice generates a random key and uses the pre-generated public key from Bob to encrypt (encapsulate) it. This results in a ciphertext sent to Bob. Bob uses his private key to decrypt (decapsulate) the ciphertext and retrieve the random key. The idea was initially introduced by Cramer and Shoup. Experience shows that such constructs are easier to design, analyze, and implement as the scheme is limited to communicating a fixed-size session key. Leonardo Da Vinci said, “Simplicity is the ultimate sophistication,” which is very true in cryptography.

The key exchange (KEX) protocol, like Diffie-Hellman, is yet a different construct: it allows two parties to agree on a shared secret that can be used as a symmetric encryption key. For example, Alice generates a key pair and sends a public key to Bob. Bob does the same and uses his own key pair with Alice’s public key to generate the shared secret. He then sends his public key to Alice who can now generate the same shared secret. What’s worth noticing is that both Alice and Bob perform exactly the same operations.

KEM construction can be converted to KEX. Alice performs key generation and sends the public key to Bob. Bob uses it to encapsulate a symmetric session key and sends it back to Alice. Alice decapsulates the ciphertext received from Bob and gets the symmetric key. This is actually what we do in our experiment to make integration with the TLS protocol less complicated.

### NTRU Lattice-based Encryption

We will enable the CECPQ2 implemented by Adam Langley from Google on our servers. He described this implementation in detail here. This key exchange uses the HRSS algorithm, which is based on the NTRU (N-Th Degree TRUncated Polynomial Ring) algorithm. Foregoing too much detail, I am going to explain how NTRU works and give simplified examples, and finally, compare it to HRSS.

NTRU is a cryptosystem based on a polynomial ring. This means that we do not operate on numbers modulo a prime (like in RSA), but on polynomials of degree $$N$$ , where the degree of a polynomial is the highest exponent of its variable. For example, $$x^7 + 6x^3 + 11x^2$$ has degree of 7.

One can add polynomials in the ring in the usual way, by simply adding theirs coefficients modulo some integer. In NTRU this integer is called $$q$$. Polynomials can also be multiplied, but remember, you are operating in the ring, therefore the result of a multiplication is always a polynomial of degree less than $$N$$. It basically means that exponents of the resulting polynomial are added to modulo $$N$$.

In other words, polynomial ring arithmetic is very similar to modular arithmetic, but instead of working with a set of numbers less than N, you are working with a set of polynomials with a degree less than N.

To instantiate the NTRU cryptosystem, three domain parameters must be chosen:

• $$N$$ – degree of the polynomial ring, in NTRU the principal objects are polynomials of degree $$N-1$$.
• $$p$$ – small modulus used during key generation and decryption for reducing message coefficients.
• $$q$$ – large modulus used during algorithm execution for reducing coefficients of the polynomials.

First, we generate a pair of public and private keys. To do that, two polynomials $$f$$ and $$g$$ are chosen from the ring in a way that their randomly generated coefficients are much smaller than $$q$$. Then key generation computes two inverses of the polynomial: $$f_p= f^{-1} \bmod{p} \\ f_q= f^{-1} \bmod{q}$$

The last step is to compute $$pk = p\cdot f_q\cdot g \bmod q$$, which we will use as public key pk. The private key consists of $$f$$ and $$f_p$$. The $$f_q$$ is not part of any key, however it must remain secret.

It might be the case that after choosing $$f$$, the inverses modulo $$p$$ and $$q$$ do not exist. In this case, the algorithm has to start from the beginning and generate another $$f$$. That’s unfortunate because calculating the inverse of a polynomial is a costly operation. HRSS brings an improvement to this issue since it ensures that those inverses always exist, making key generation faster than as proposed initially in NTRU.

The encryption of a message $$m$$ proceeds as follows. First, the message $$m$$ is converted to a ring element $$pt$$ (there exists an algorithm for performing this conversion in both directions). During encryption, NTRU randomly chooses one polynomial $$b$$ called blinder. The goal of the blinder is to generate different ciphertexts per encyption. Thus, the ciphetext $$ct$$ is obtained as $$ct = (b\cdot pk + pt ) \bmod q$$ Decryption looks a bit more complicated but it can also be easily understood. Decryption uses both the secret value $$f$$ and to recover the plaintext as $$v = f \cdot ct \bmod q \\ pt = v \cdot f_p \bmod p$$

This diagram demonstrates why and how decryption works.

After obtaining $$pt$$, the message $$m$$ is recovered by inverting the conversion function.

The underlying hard assumption is that given two polynomials: $$f$$ and $$g$$ whose coefficients are short compared to the modulus $$q$$, it is difficult to distinguish $$pk = \frac{f}{g}$$ from a random element in the ring. It means that it’s hard to find $$f$$ and $$g$$ given only public key pk.

### Lattices

NTRU cryptosystem is a grandfather of lattice-based encryption schemes. The idea of using  difficult problems for cryptographic purposes was due to Ajtai. His work evolved into a whole area of research with the goal of creating more practical, lattice-based cryptosystems.

### What is a lattice and why it can be used for post-quantum crypto?

The picture below visualizes lattice as points in a two-dimensional space. A lattice is defined by the origin $$O$$ and base vectors $$\{ b_1 , b_2\}$$. Every point on the lattice is represented as a linear combination of the base vectors, for example  $$V = -2b_1+b_2$$.

There are two classical NP-hard problems in lattice-based cryptography:

1. Shortest Vector Problem (SVP): Given a lattice, to find the shortest non-zero vector in the lattice. In the graph, the vector $$s$$ is the shortest one. The SVP problem is NP-hard only under some assumptions.
2. Closest Vector Problem (CVP). Given a lattice and a vector $$V$$ (not necessarily in the lattice), to find the closest vector to $$V$$. For example, the closest vector to $$t$$ is $$z$$.

In the graph above, it is easy for us to solve SVP and CVP by simple inspection. However, the lattices used in cryptography have higher dimensions, say above 1000, as well as highly non-orthogonal basis vectors. On these instances, the problems get extremely hard to solve. It’s even believed future quantum computers will have it tough.

HRSS, which we use in our experiment, is based on NTRU, but a slightly better instantiation. The main improvements are:

• Faster key generation algorithm.
• NTRU encryption can produce ciphertexts that are impossible to decrypt (true for many lattice-based schemes). But HRSS fixes this problem.
• HRSS is a key encapsulation mechanism.

### CECPQ2b – Isogeny-based Post-Quantum TLS

Following CECPQ2, we have integrated into BoringSSL another hybrid key exchange mechanism relying on SIKE. It is called CECPQ2b and we will use it in our experimentation in TLS 1.3. SIKE is a key encapsulation method based on Supersingular Isogeny Diffie-Hellman (SIDH). Read more about SIDH in our previous post. The math behind SIDH is related to elliptic curves. A comparison between SIDH and the classical Elliptic Curve Diffie-Hellman (ECDH) is given.

An elliptic curve is a set of points that satisfy a specific mathematical equation. The equation of an elliptic curve may have multiple forms, the standard form is called the Weierstrass equation $$y^2 = x^3 +ax +b$$ and its shape can look like the red curve.

An interesting fact about elliptic curves is have a group structure. That is, the set of points on the curve have associated a binary operation called point addition. The set of points on the elliptic curve is closed under addition. Thus, adding two points results in another point that is also on the elliptic curve.

If we can add two different points on a curve, then we can also add one point to itself. And if we do it multiple times, then the resulting operations is known as a scalar multiplication and denoted as  $$Q = k\cdot P = P+P+\dots+P$$ for an integer $$k$$.

Multiplication of scalars is commutative. It means that two scalar multiplications can be evaluated in any order $$\color{darkred}{k_a}\cdot\color{darkgreen}{k_b} = \color{darkgreen}{k_b}\cdot\color{darkred}{k_a}$$; this an important property that makes ECDH possible.

It turns out that carefully if choosing an elliptic curve “correctly”, scalar multiplication is easy to compute but extremely hard to reverse. Meaning, given two points $$Q$$ and $$P$$ such that $$Q=k\cdot P$$, finding the integer k is a difficult task known as the Elliptic Curve Discrete Logarithm problem (ECDLP). This problem is suitable for cryptographic purposes.

Alice and Bob agree on a secret key as follows. Alice generates a private key $$k_a$$. Then, she uses some publicly known point $$P$$ and calculates her public key as $$Q_a = k_a\cdot P$$. Bob proceeds in similar fashion and gets $$k_b$$ and $$Q_b = k_b\cdot P$$. To agree on a shared secret, each party multiplies their private key with the public key of the other party. The result of this is the shared secret. Key agreement as described above, works thanks to the fact that scalars can commute:
$$\color{darkgreen}{k_a} \cdot Q_b = \color{darkgreen}{k_a} \cdot \color{darkred}{k_b} \cdot P \iff \color{darkred}{k_b} \cdot \color{darkgreen}{k_a} \cdot P = \color{darkred}{k_b} \cdot Q_a$$

There is a vast theory behind elliptic curves. An introduction to elliptic curve cryptography was posted before and more details can be found in this book. Now, lets describe SIDH and compare with ECDH.

### Isogenies on Elliptic Curves

Before explaining the details of SIDH key exchange, I’ll explain the 3 most important concepts, namely: j-invariant, isogeny and its kernel.

Each curve has a number that can be associated to it. Let’s call this number a j-invariant. This number is not unique per curve, meaning many curves have the same value of j-invariant, but it can be viewed as a way to group multiple elliptic curves into disjoint sets. We say that two curves are isomorphic if they are in the same set, called the isomorphism class. The j-invariant is a simple criterion to determine whether two curves are isomorphic. The j-invariant of a curve $$E$$ in Weierstrass form $$y^2 = x^3 + ax + b$$ is given as $$j(E) = 1728\frac{4a^3}{4a^3 +27b^2}$$

When it comes to isogeny, think about it as a map between two curves. Each point on some curve $$E$$ is mapped by isogeny to the point on isogenous curve $$E’$$. We denote mapping from curve $$E$$ to $$E’$$ by isogeny $$\phi$$ as:

$$\phi: E \rightarrow E’$$

It depends on the map if those two curves are isomorphic or not. Isogeny can be visualised as:

There may exist many of those mappings, each curve used in SIDH has small number of isogenies to other curves. Natural question is how do we compute such isogeny. Here is where the kernel of an isogeny comes. The kernel uniquely determines isogeny (up to isomorphism class). Formulas for calculating isogeny from its kernel were initially given by J. Vélu and the idea of calculating them efficiently was extended.

To finish, I will summarize what was said above with a picture.

There are two isomorphism classes on the picture above. Both curves $$E_1$$ and $$E_2$$ are isomorphic and have  j-invariant = 6. As curves $$E_3$$ and $$E_4$$ have j-invariant=13, they are in a different isomorphism class. There exists an isogeny $$\phi_2$$ between curve $$E_3$$ and $$E_2$$, so they both are isogeneous. Curves $$\phi_1$$ and $$E_2$$ are isomorphic and there is isogeny $$\phi_1$$ between them. Curves $$E_1$$ and $$E_4$$ are not isomorphic.

For brevity I’m skipping many important details, like details of the finite field, the fact that isogenies must be separable and that the kernel is finite. But curious readers can find a number of academic research papers available on the Internet.

### Big picture: similarities with ECDH

Let’s generalize the ECDH algorithm described above, so that we can swap some elements and try to use Supersingular Isogeny Diffie-Hellman.

Note that what actually happens during an ECDH key exchange is:

• We have a set of points on elliptic curve, set S
• We have another group of integers used for point multiplication, G
• We use an element from Z to act on an element from S to get another element from S:

$$G \cdot S \rightarrow S$$

Now the question is: what is our G and S in an SIDH setting? For SIDH to work, we need a big set of elements and something secret that will act on the elements from that set. This “group action” must also be resistant to attacks performed by quantum computers.

In the SIDH setting, those two sets are defined as:

• Set S is a set (graph) of j-invariants, such that all the curves are supersingular: $$S = [j(E_1), j(E_2), j(E_3), …. , j(E_n)]$$
• Set G is a set of isogenies acting on elliptic curves and transforming, for example, the elliptic curve $$E_1$$ into $$E_n$$:

### Random walk on supersingular graph

When we talk about Isogeny Based Cryptography, as a topic distinct from Elliptic Curve Cryptography, we usually mean algorithms and protocols that rely fundamentally on the structure of isogeny graphs. An example of such a (small) graph is pictured below.

Each vertex of the graph represents a different j-invariant of a set of supersingular curves. The edges between vertices represent isogenies converting one elliptic curve to another. As you can notice, the graph is strongly connected, meaning every vertex can be reached from every other vertex. In the context of isogeny-based crypto, we call such a graph a supersingular isogeny graph. I’ll skip some technical details about the construction of this graph (look for those here or here), but instead describe ideas about how it can be used.

As the graph is strongly connected, it is possible to walk a whole graph by starting from any vertex, randomly choosing an edge, following it to the next vertex and then start the process again on a new vertex. Such a way of visiting edges of this graph is called a random walk.

The random walk is a key concept that makes isogeny based crypto feasible. When you look closely at the graph, you can notice that each vertex has a small number of edges incident to it, this is why we can compute the isogenies efficiently. But also for any vertex there is only a limited number of isogenies to choose from, which doesn’t look like good base for a cryptographic scheme. The key question is – where does the security of the scheme come from exactly? In order to get it, it is necessary to visit a couple hundred vertices. What it means in practice is that secret isogeny (of large degree) is constructed as a composition of multiple isogenies (of small, prime degree).  Which means, the secret isogeny is:

This property and properties of the isogeny graph are what makes some of us believe that scheme has a good chance to be secure. More specifically, there is no efficient way of finding a path that connects $$E_0$$ with $$E_n$$, even with quantum computer at hand. The security level of a system depends on value n – the number of steps taken during the walk.

The random walk is a core process used when both generating public keys and computing shared secrets. It starts with party generating random value m (see more below), starting curve $$E_0$$ and points P and Q on this curve. Those values are used to compute the kernel of an isogeny $$R_1$$ in the following way:

$$R_1 = P + m \cdot Q$$

Thanks to formulas given by Vélu we can now use the point $$R_1$$ to compute the isogeny, the party will choose to move from a vertex to another one. After the isogeny $$\phi_{R_1}$$ is calculated it is applied to $$E_0$$  which results in a new curve $$E_1$$:

$$\phi_{R_1}: E_0 \rightarrow E_1$$

Isogeny is also applied to points P and Q. Once on $$E_1$$ the process is repeated. This process is applied n times, and at the end a party ends up on some curve $$E_n$$ which defines isomorphism class, so also j-invariant.

### Supersingular Isogeny Diffie-Hellman

The core idea in SIDH is to compose two random walks on an isogeny graph of elliptic curves in such a way that the end node of both ways of composing is the same.

In order to do it, scheme sets public parameters – starting curve $$E_0$$ and 2 pairs of base points on this curve $$(PA,QA)$$ , $$(PB,QB)$$. Alice generates her random secret keys m, and calculates a secret isogeny $$\phi_q$$ by performing a random walk as described above. The walk finishes with 3 values: elliptic curve $$E_a$$ she has ended up with and pair of points $$\phi_a(PB)$$ and $$\phi_a(QB)$$ after pushing through Alice’s secret isogeny. Bob proceeds analogously which results in the triple $${E_b, \phi_b(PA), \phi_b(QA)}$$. The triple forms a public key which is exchanged between parties.

The picture below visualizes the operation. The black dots represent curves grouped in the same isomorphism classes represented by light blue circles. Alice takes the orange path ending up on a curve $$E_a$$ in a separate isomorphism class than Bob after taking his dark blue path ending on $$E_b$$. SIDH is parametrized in a way that Alice and Bob will always end up in different isomorphism classes.

Upon receipt of triple $${ E_a, \phi_a(PB), \phi_a(QB) }$$  from Alice, Bob will use his secret value m to calculate a new kernel – but instead of using point $$PA$$ and $$QA$$ to calculate an isogeny kernel, he will now use images $$\phi_a(PB)$$ and $$\phi_a(QB)$$ received from Alice:

$$R’_1 = \phi_a(PB) + m \cdot \phi_a(QB)$$

Afterwards, he uses $$R’_1$$ to start the walk again resulting in the isogeny $$\phi’_b: E_a \rightarrow E_{ab}$$. Allice proceeds analogously resulting in the isogeny $$\phi’_a: E_b \rightarrow E_{ba}$$. With isogenies calculated this way, both Alice and Bob will converge in the same isomorphism class. The math math may seem complicated, hopefully the picture below makes it easier to understand.

Bob computes a new isogeny and starts his random walk from $$E_a$$ received from Alice. He ends up on some curve $$E_{ba}$$. Similarly, Alice calculates a new isogeny, applies it on $$E_b$$ received from Bob and her random walk ends on some curve $$E_{ab}$$. Curves $$E_{ab}$$ and $$E_{ba}$$ are not likely to be the same, but construction guarantees that they are isomorphic. As mentioned earlier, isomorphic curves have the same value of j-invariant,  hence the shared secret is a value of j-invariant $$j(E_{ab})$$.

Coming back to differences between SIDH and ECDH – we can split them into four categories: the elements of the group we are operating on, the cornerstone computation required to agree on a shared secret, the elements representing secret values, and the difficult problem on which the security relies.

In ECDH there is a secret key which is an integer scalar, in case of SIDH it is a secret isogeny, which also is generated from an integer scalar. In the case of ECDH one multiplies a point on a curve by a scalar, in the case of SIDH it is a random walk in an isogeny graph. In the case of ECDH, the public key is a point on a curve, in the case of SIDH, the public part is a curve itself and the image of some points after applying isogeny. The shared secret in the case of ECDH is a point on a curve, in the case of SIDH it is a j-invariant.

### SIKE: Supersingular Isogeny Key Encapsulation

SIDH could potentially be used as a drop-in replacement of the ECDH protocol. We have actually implemented a proof-of-concept and added it to our implementation of TLS 1.3 in the tls-tris library and described (together with Mozilla) implementation details in this draft. Nevertheless, there is a problem with SIDH – the keys can be used only once. In 2016, a few researchers came up with an active attack on SIDH which works only when public keys are reused. In the context of TLS, it is not a big problem, because for each session a fresh key pair is generated (ephemeral keys), but it may not be true for other applications.

SIKE is an isogeny key encapsulation which solves this problem. Bob can generate SIKE keys, upload the public part somewhere in the Internet and then anybody can use it whenever he wants to communicate with Bob securely. SIKE reuses SIDH – internally both sides of the connection always perform SIDH key generation, SIDH key agreement and apply some other cryptographic primitives in order to convert SIDH to KEM. SIKE is implemented in a few variants – each variant corresponds to the security levels using 128-, 192- and 256-bit secret keys. Higher security level means longer running time. More details about SIKE can be found here.

SIKE is also one of the candidates in NIST post-quantum “competition“.

I’ve skipped many important details to give a brief description of how isogeny based crypto works. If you’re curious and hungry for details, look at either of these Cloudflare meetups, where Deirdre Connolly talked about isogeny-based cryptography or this talk by Chloe Martindale during PQ Crypto School 2017. And if you would like to know more about quantum attacks on this scheme, I highly recommend this work.

## Conclusion

Quantum computers that can break meaningful cryptographic parameter settings do not exist, yet. They won’t be built for at least the next few years. Nevertheless, they have already changed the way we look at current cryptographic deployments. There are at least two reasons it’s worth investing in PQ cryptography:

• It takes a lot of time to build secure cryptography and we don’t actually know when today’s classical cryptography will be broken. There is a need for a good mathematical base: an initial idea of what may be secure against something that doesn’t exist yet. If you have an idea, you also need good implementation, constant time, resistance to things like time and cache side-channels, DFA, DPA, EM, and a bunch of other abbreviations indicating side-channel resistance. There is also deployment of, for example, algorithms based on elliptic curves were introduced in ’85, but started to really be used in production only during the last decade, 20 or so years later. Obviously, the implementation must be blazingly fast! Last, but not least, integration: we need time to develop standards to allow integration of PQ cryptography with protocols like TLS.
• Even though efficient quantum computers probably won’t exist for another few years, the threat is real. Data encrypted with current cryptographic algorithms can be recorded now with hopes of being broken in the future.

Cloudflare is motivated to help build the Internet of tomorrow with the tools at hand today. Our interest is in cryptographic techniques that can be integrated into existing protocols and widely deployed on the Internet as seamlessly as possible. PQ cryptography, like the rest of cryptography, includes many cryptosystems that can be used for communications in today’s Internet; Alice and Bob need to perform some computation, but they do not need to buy new hardware to do that.

Cloudflare sees great potential in those algorithms and believes that some of them can be used as a safe replacement for classical public-key cryptosystems. Time will tell if we’re justified in this belief!