All posts by Tanya Verma

Inside Geo Key Manager v2: re-imagining access control for distributed systems

Post Syndicated from Tanya Verma original https://blog.cloudflare.com/inside-geo-key-manager-v2/

In December 2022 we announced the closed beta of the new version of Geo Key Manager. Geo Key Manager v2 (GeoV2) is the next step in our journey to provide customers with a secure and flexible way to control the distribution of their private keys by geographic location. Our original system, Geo Key Manager v1, was launched as a research project in 2017, but as customer needs evolved and our scale increased, we realized that we needed to make significant improvements to provide a better user experience.

One of the principal challenges we faced with Geo Key Manager v1 (GeoV1) was the inflexibility of our access control policies. Customers required richer data localization, often spurred by regulatory concerns. Internally, events such as the conflict in Ukraine reinforced the need to be able to quickly restrict access to sensitive key material. Geo Key Manager v1’s underlying cryptography was a combination of identity-based broadcast encryption and identity-based revocation that simulated a subset of the functionality offered by Attribute-Based Encryption (ABE). Replacing this with an established ABE scheme addressed the inflexibility of our access control policies and provided a more secure foundation for our system.

Unlike our previous scheme, which limited future flexibility by freezing the set of participating data centers and policies at the outset, using ABE made the system easily adaptable for future needs. It allowed us to take advantage of performance gains from additional data centers added after instantiation and drastically simplified the process for handling changes to attributes and policies. Furthermore, GeoV1 struggled with some perplexing performance issues that contributed to high tail latency and a painfully manual key rotation process. GeoV2 is our answer to these challenges and limitations of GeoV1.

While this blog focuses on our solution for geographical key management, the lessons here can also be applied to other access control needs. Access control solutions are traditionally implemented using a highly available central authority that polices access to resources. As we will see, ABE allows us to avoid this single point of failure. Since we are not aware of any other large-scale ABE-based access control system, we hope our discussion helps engineers consider ABE as a way to build access control with minimal reliance on a centralized authority. To facilitate this, we’ve included our implementation of ABE in CIRCL, our open source cryptographic library.

Unsatisfactory attempts at a solution

Before coming back to GeoV2, let’s take a little detour and examine the problem we’re trying to solve.

Consider this example: a large European bank wants to store their TLS private keys only within the EU. This bank is a customer of Cloudflare, which means we perform TLS handshakes on their behalf. The reason we need to terminate TLS for them is so that we can provide the best protection against DDoS attacks, improve performance by caching, support web application firewalls, etc.

In order to terminate TLS, we need to have access to their TLS private keys[1]. The control plane, which handles API traffic, encrypts the customer’s uploaded private key with a master public key shared amongst all machines globally. It then puts the key into a globally distributed KV store, Quicksilver. This means every machine in every data center around the world has a local copy of this customer’s TLS private key. Consequently, every machine in each data center has a copy of every customer’s private key.

[Figure: Customer uploading their TLS certificate and private key to be stored in all data centers]

This bank, however, wants its key to be stored only in EU data centers. To make that possible, we have three options.

The first option is to ensure that only EU data centers can receive this key and terminate the handshake. All other machines proxy TLS requests to an EU server for processing. This would require giving each machine only a subset of the entire keyset stored in Quicksilver, which challenges core design decisions Cloudflare has made over the years that assume the entire dataset is replicated on every machine.

[Figure: Restricting customer keys to EU data centers]

Another option is to store the keys in the core data center instead of Quicksilver. This would allow us to enforce the proper access control policy every time, ensuring that only certain machines can access certain keys. However, this would defeat the purpose of having a global network in the first place: to reduce latency and avoid a single point of failure at the core.

[Figure: Storing keys in core data center where complicated business logic runs to enforce policies]

A third option is to use public key cryptography. Instead of having a single master key pair, every data center is issued its own key pair, and the core encrypts the customer’s private key with the key of every data center allowed to use it; in this example, only machines in the EU would be able to access the key. Let’s assume there are 500 data centers with 50 machines each, and that 200 of these data centers are in the EU. Where 100 keys of 1 kB each consumed a total of 100 x 500 x 50 x 1 kB globally, they would now consume 200 times that, and in the worst case up to 500 times. This increases the space needed to store the keys on each machine by a whole new factor: before, storage was purely a function of how many customer keys are registered; now it is that same function multiplied by the number of data centers each key is wrapped for.

[Figure: Assigning unique keys to each data center and wrapping customer key with EU data center keys]
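To make the storage blow-up concrete, here is a back-of-the-envelope calculation using the illustrative numbers above (100 customer keys, 500 data centers, 50 machines per data center, 1 kB per wrapped key; these are round figures for the example, not actual Cloudflare numbers):

// Rough storage estimate for the per-data-center-keypair approach.
package main

import "fmt"

func main() {
	const (
		customerKeys  = 100 // customer private keys
		dataCenters   = 500 // total data centers
		machinesPerDC = 50  // machines in each data center
		keySizeKB     = 1   // size of one wrapped key, in kB
	)

	// With a single master keypair, each machine stores one wrapped copy per customer key.
	baseline := customerKeys * dataCenters * machinesPerDC * keySizeKB
	fmt.Printf("baseline: %d kB globally\n", baseline) // 2,500,000 kB

	// With one keypair per data center, an EU-only key is wrapped 200 times,
	// and every machine still stores every wrapped copy.
	fmt.Printf("EU-only policy: %d kB, worst case: %d kB\n", 200*baseline, dataCenters*baseline)
}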

Unfortunately, all three of these options are undesirable in their own ways. They would either require changing fundamental assumptions we made about the architecture of Cloudflare, abandoning the advantages of using a highly distributed network, or multiplying the storage this feature uses by the number of data centers.

A deeper look at the third option suggests a refinement: why not create two key pairs instead of a unique one for each data center? One pair would be shared by all EU data centers, and the other by all non-EU data centers. This way, the core only needs to encrypt the customer’s key twice instead of once per EU data center. This works well for the EU bank, but it doesn’t scale once we start adding more policies. Consider a data center in New York City: it could need a key for the policy “country: US”, another one for “country: US or region: EU”, another one for “not country: RU”, and so on. You can already see this getting rather unwieldy. And every time a new data center is provisioned, all policies must be re-evaluated and the appropriate keys assigned.

[Figure: A key for each policy and its negation]

Geo Key Manager v1: identity-based encryption and broadcast encryption

The invention of RSA in 1978 kicked off the era of modern public key cryptography, but anyone who has used GPG or been involved with certificate authorities can attest to the difficulty of managing the public key infrastructure that connects keys to user identities. In 1984, Shamir asked whether it was possible to create a public-key encryption system in which the public key could be any string. His motivation was to simplify email management: instead of encrypting an email to Bob using Bob’s public key, Alice could encrypt it to Bob’s identity, namely his email address. Finally, in 2001, Boneh and Franklin figured out how to make it work.

Broadcast encryption was first proposed in 1993 by Fiat and Naor. It lets you send the same encrypted message to everyone, but only people with the right key can decrypt it. Looking back at our third option, instead of wrapping the customer’s key with the key of every EU data center, we could use broadcast encryption to create a single encryption of the customer’s key that only EU-based data centers could decrypt. This would solve the storage problem.

Geo Key Manager v1 used a combination of identity-based broadcast encryption and identity-based revocation to implement access control. Briefly, a set of identities is designated for each region and each data center location. Then, each machine is issued an identity-based private key for its region and location. With this in place, access to the customer’s key can be controlled using three sets: the set of regions to encrypt to, the set of locations inside the region to exclude, and the set of locations outside the region to include. For example, the customer’s key could be encrypted so that it is available in all regions except for a few specific locations, and also available in a few locations outside those regions. This blog post has all the nitty-gritty details of this approach.

Unfortunately this scheme was insufficiently responsive to customer needs; the parameters used during initial cryptographic setup, such as the list of regions, data centers, and their attributes, were baked into the system and could not be easily changed. Tough luck excluding the UK from the EU region post Brexit, or supporting a new region based on a recent compliance standard that customers need. Using a predetermined static list of locations also made it difficult to quickly revoke machine access. Additionally, decryption keys could not be assigned to new data centers provisioned after setup, preventing them from speeding up requests. These limitations provided the impetus for integrating Attribute-Based Encryption (ABE) into Geo Key Manager.

Attribute-Based Encryption

In 2004, Amit Sahai and Brent Waters proposed a new cryptosystem based on access policies, known as attribute-based encryption (ABE). Essentially, a message is encrypted under an access policy rather than an identity. Users are issued a private key based on their attributes, and they can only decrypt the message if their attributes satisfy the policy. This allows for more flexible and fine-grained access control than traditional methods of encryption.

[Figure: Brief timeline of Public Key Encryption]

The policy can be attached either to the key or to the ciphertext, leading to two variants of ABE: key-policy attribute-based encryption (KP-ABE) and ciphertext-policy attribute-based encryption (CP-ABE). There exist trade-offs between them, but they are functionally equivalent as they are duals of each other. Let’s focus on CP-ABE, as it aligns more closely with real-world access control. Imagine a hospital where a doctor has the attributes “role: doctor” and “region: US”, while a nurse has the attributes “role: nurse” and “region: EU”. A document encrypted under the policy “role: doctor or region: EU” can be decrypted by both the doctor and the nurse. In other words, ABE is like a magical lock that only opens for people who have the right attributes.

Policy | Semantics
country: US or region: EU | Decryption is possible either in the US or in the European Union
not (country: RU or country: US) | Decryption is not possible in Russia and the US
country: US and security: high | Decryption is possible only in data centers within the US that have a high level of security (for some security definition established previously)

There are many different ABE schemes out there, with varying properties. The scheme we choose must satisfy a few requirements:

  1. Negation We want to be able to support boolean formulas consisting of AND, OR and NOT, aka non-monotonic boolean formulas. While practically every scheme handles AND and OR, NOT is rarer to find. Negation makes blocklisting certain countries or machines easier.
  2. Repeated Attributes Consider the policy “organization: executive or (organization: weapons and clearance: top-secret)”. The attribute “organization” has been repeated twice in the policy. Schemes with support for repetition add significant expressibility and flexibility when composing policies.
  3. Security against Chosen Ciphertext Attacks Most schemes are presented in a form that is only secure against chosen-plaintext attacks (CPA), where the attacker cannot obtain decryptions of ciphertexts of its choosing. There are standard ways to convert such a scheme into one that remains secure even when the attacker manipulates ciphertexts (CCA), but the conversion isn’t automatic. We apply the well-known Boneh-Katz transform to our chosen scheme to make it secure against this class of attacks. We will present a proof of security for the end-to-end scheme in our forthcoming paper.

Negation in particular deserves further comment. For an attribute to be satisfied when negated, the name must stay the same, but the value must differ. It’s like the data center is saying, “I have a country, but it’s definitely not Japan”, instead of “I don’t have a country”. This might seem counterintuitive, but it enables decryption without needing to examine every attribute value. It also makes it safe to roll out attributes incrementally. Based on these criteria, we ended up choosing the scheme by Tomida et al. (2021).
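To illustrate these negation semantics, here is a toy evaluator (a simplified sketch with hypothetical types, not the CIRCL API) showing that a negated clause only matches when the attribute is present with a different value:

// A negated clause such as "not country: JP" is satisfied only if the
// attributes contain "country" with a value other than "JP"; a machine
// with no "country" attribute at all does not satisfy it.
func satisfiesNegation(attrs map[string]string, name, excludedValue string) bool {
	value, present := attrs[name]
	return present && value != excludedValue
}

// satisfiesNegation(map[string]string{"country": "FR"}, "country", "JP") == true
// satisfiesNegation(map[string]string{"region": "EU"}, "country", "JP") == false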

Implementing a complex cryptographic scheme such as this can be quite challenging. The discrete log assumption that underlies traditional public key cryptography is not sufficient to meet the security requirements of ABE. ABE schemes must secure both ciphertexts and the attribute-based secret keys, whereas traditional public key cryptography only imposes security constraints on the ciphertexts, while the secret key is merely an integer. To achieve this, most ABE schemes are constructed using a mathematical operation known as bilinear pairings.

The speed at which we can perform pairing operations determines the baseline performance of our implementation. Their efficiency is particularly desirable during decryption, where they are used to combine the attribute-based secret key with the ciphertext in order to recover the plaintext. To this end, we rely on our highly optimized pairing implementations in our open source library of cryptographic suites, CIRCL, which we discuss at length in a previous blog. Additionally, the various keys, attributes, and the ciphertext that embeds the access structure are expressed as matrices and vectors. We wrote linear algebra routines to handle matrix operations such as multiplication, transposition, and inversion, which are necessary to manipulate these structures. We also added serialization, extensive testing, and benchmarking. Finally, we implemented our conversion to a CCA2-secure scheme.

In addition to the core cryptography, we had to decide how to express and represent policies. Ultimately we decided on strings for our API. While perhaps less convenient for programs than structured types, a string interface means users of our scheme don’t have to implement their own parsers, which gives us a more stable interface. The frontend of our policy language is therefore a boolean expression as a string, such as “country: JP or (not region: EU)”, while the backend is a monotonic boolean circuit consisting of wires and gates. Monotonic boolean circuits only include AND and OR gates. To handle NOT gates, we assign a positive or negative value to each wire: thanks to De Morgan’s Law, which converts a formula like “not (X and Y)” into “not X or not Y” (and similarly for disjunction), every NOT gate can be pushed down and placed directly on a wire.
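As a sketch of that conversion (illustrative pseudologic over a hypothetical expression type, not our actual parser), repeatedly applying De Morgan's Law pushes every negation down to a leaf, leaving a circuit of only AND and OR gates with a sign on each wire:

// Node is a hypothetical policy expression: a leaf attribute test or a gate.
type Node struct {
	Op       string // "attr", "and", "or", "not"
	Name     string // attribute name, for leaves
	Value    string // attribute value, for leaves
	Negated  bool   // sign carried on the leaf wire
	Children []*Node
}

// pushNegations rewrites the expression so NOT appears only on leaves,
// using De Morgan's Law: not(A and B) == (not A) or (not B), and vice versa.
func pushNegations(n *Node, negate bool) *Node {
	switch n.Op {
	case "attr":
		return &Node{Op: "attr", Name: n.Name, Value: n.Value, Negated: negate != n.Negated}
	case "not":
		return pushNegations(n.Children[0], !negate)
	default: // "and", "or"
		op := n.Op
		if negate { // the gate flips and the negation moves to the children
			if op == "and" {
				op = "or"
			} else {
				op = "and"
			}
		}
		out := &Node{Op: op}
		for _, c := range n.Children {
			out.Children = append(out.Children, pushNegations(c, negate))
		}
		return out
	}
}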

The following is a demonstration of the API. The central authority runs Setup to generate the master public key and master secret key. The master public key can be used by anyone to encrypt a message over an access policy. The master secret key, held by the central authority, is used to generate secret keys for users based on their attributes. Attributes themselves can be supplied out-of-band. In our case, we rely on the machine provisioning database to provide and validate attributes. These attribute-based secret keys are securely distributed to users, such as over TLS, and are used to decrypt ciphertexts. The API also includes helper functions to check decryption capabilities and extract policies from ciphertexts for improved usability.

// The central authority generates the master key pair.
publicKey, masterSecretKey := cpabe.Setup()

// Anyone holding the master public key can encrypt under an access policy.
policy := cpabe.Policy{}
policy.FromString("country: US or region: EU")

ciphertext := publicKey.Encrypt(policy, []byte("secret message"))

// A machine in the Paris data center is issued a key for its attributes.
attrsParisDC := cpabe.Attributes{}
attrsParisDC.FromMap(map[string]string{"country": "FR", "region": "EU"})

secretKeyParisDC := masterSecretKey.KeyGen(attrsParisDC)

// The attributes satisfy the policy, so decryption succeeds.
plaintext := secretKeyParisDC.Decrypt(ciphertext)

assertEquals(plaintext, "secret message")

We now come back to our original example. This time, the central authority holds the master secret key. Each machine in every data center presents its set of attributes to the central authority, which, after some validation, generates a unique attribute-based secret key for that particular machine. Key issuance happens when a machine is first brought up, if keys must be rotated, or if an attribute has changed, but never in the critical path of a TLS handshake. This solution is also collusion resistant, which means two machines without the appropriate attributes cannot combine their keys to decrypt a secret that they individually could not decrypt. For example, a machine with the attribute “country: US” and another with “security: high” cannot collude to decrypt a resource with the policy “country: US and security: high”.

Crucially, this solution can seamlessly scale and respond to changes to machines. If a new machine is added, the central authority can simply issue it a secret key since the participants of the scheme don’t have to be predetermined at setup, unlike our previous identity-broadcast scheme.

[Figure: Key Distribution]

When a customer uploads their TLS certificate, they can specify a policy, and the central authority will encrypt their private key with the master public key under the specified policy. The encrypted customer key then gets written to Quicksilver, to be distributed to all data centers. In practice, there is a layer of indirection here that we will discuss in a later section.

[Figure: Encryption using Master Public Key]

When a user visits the customer’s website, the TLS termination service at the data center that first receives the request fetches the customer’s encrypted private key from Quicksilver. If the service’s attributes do not satisfy the policy, decryption fails and the request is proxied to the closest data center that satisfies the policy. Whichever data center successfully decrypts the key performs the signature to complete the TLS handshake.

[Figure: Decryption using Attribute-based Secret Key (Simplified)]
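A simplified sketch of the termination-side logic follows (the types and helpers sign, closestSatisfyingDataCenters, and forwardSigningRequest are hypothetical stand-ins, not our internal service code):

// handleSignature tries to decrypt the customer's key locally; if this
// machine's attributes don't satisfy the policy, it forwards the signing
// request to the closest data center that does.
func handleSignature(machineKey ABESecretKey, encryptedCustomerKey, digest []byte) (Signature, error) {
	customerKey, err := machineKey.Decrypt(encryptedCustomerKey)
	if err == nil {
		// This machine satisfies the policy: sign locally.
		return sign(customerKey, digest)
	}
	// Policy not satisfied here: proxy to data centers that do satisfy it,
	// closest first.
	for _, dc := range closestSatisfyingDataCenters(encryptedCustomerKey) {
		if sig, err := forwardSigningRequest(dc, digest); err == nil {
			return sig, nil
		}
	}
	return Signature{}, errors.New("no data center satisfying the policy was reachable")
}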

The following summarizes the solutions we discussed and the criteria we weighed them against: flexible policies, fault tolerance, space efficiency, low latency, collusion resistance, and the ability to handle changes to machines. Only the ABE-based design satisfies all of them. The candidates were:

Different copies of Quicksilver in data centers
Complicated business logic in the core
Encrypting customer keys with each data center’s unique keypair
Encrypting customer keys with a policy-based keypair, where each data center has multiple policy-based keypairs
Identity-Based Broadcast Encryption + Identity-Based Negative Broadcast Encryption (Geo Key Manager v1)
Attribute-Based Encryption (Geo Key Manager v2)

Performance characteristics

We characterize our scheme’s performance on measures inspired by ECRYPT. We set the attribute size to 50, which is significantly higher than necessary for most applications, but serves as a worst case for benchmarking purposes. We conduct our measurements on a laptop with an Intel Core i7-10610U CPU @ 1.80GHz and compare the results against RSA-2048, X25519, and our previous scheme.

Scheme | Secret key (bytes) | Public key (bytes) | Overhead of encrypting 23 bytes (ciphertext length – message length) | Overhead of encrypting 10k bytes (ciphertext length – message length)
RSA-2048 | 1190 (PKCS#1) | 256 | 233 | 3568
X25519 | 32 | 32 | 48 | 48
GeoV1 scheme | 4838 | 4742 | 169 | 169
GeoV2 ABE scheme | 33416 | 3282 | 19419 | 19419

Different attribute based encryption schemes optimize for different performance profiles. Some may have fast key generation, while others may prioritize fast decryption. In our case, we only care about fast decryption because it is the only part of the process that lies in the critical path of a request. Everything else happens out-of-band where the extra overhead is acceptable.

Scheme | Generating keypair | Encrypting 23 bytes | Decrypting 23 bytes
RSA-2048 | 117 ms | 0.043 ms | 1.26 ms
X25519 | 0.045 ms | 0.093 ms | 0.046 ms
GeoV1 scheme | 75 ms | 10.7 ms | 13.9 ms
GeoV2 ABE scheme | 1796 ms | 704 ms | 62.4 ms

A Brief Note on Attribute-Based Access Control (ABAC)

We have used Attribute-Based Encryption to implement what is commonly known as Attribute-Based Access Control (ABAC).

ABAC is an extension of the more familiar Role-Based Access Control (RBAC). To understand why ABAC is relevant, let’s briefly discuss its origins. In 1970, the United States Department of Defense introduced Discretionary Access Control (DAC); Unix file permissions are a familiar implementation of DAC. But DAC isn’t enough if you want to restrict resharing, because the owner of a resource can grant other users access in ways the central administrator does not agree with. To address this, the Department of Defense introduced Mandatory Access Control (MAC). DRM is a good example of MAC: even though you have the file, you don’t have the right to share it with others.

RBAC is an implementation of certain aspects of MAC. ABAC is an extension of RBAC, defined by NIST in 2017, that accounts for characteristics of users beyond their roles, such as time of day, user agent, and so on.

However, RBAC and ABAC are simply specifications. While they are traditionally implemented using a central authority that polices access to resources, they don’t have to be. Attribute-based encryption is an excellent mechanism for implementing ABAC in distributed systems.

Key rotation

While it may be tempting to attribute all failures to DNS, changing keys is another strong contender in this race. Suffering through the rather manual and error-prone key rotation process of Geo Key Manager v1 taught us to make robust, simple key rotation, with no impact on availability, an explicit design goal for Geo Key Manager v2.

To facilitate key rotation and improve performance, we introduce a layer of indirection to the customer key wrapping (encryption) process. When a customer uploads their TLS private key, instead of encrypting it with the Master Public Key, we generate an X25519 keypair, called the policy key. The central authority then adds the public part of this newly minted policy keypair and its associated policy label to a database. It then encrypts the private half of the policy keypair with the Master Public Key, over the associated access policy. Finally, the customer’s private key is encrypted with the public policy key and saved into Quicksilver.
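A sketch of this wrapping flow (the helpers generateX25519Keypair, storePolicyKey, and encryptToX25519 are hypothetical names for illustration; the real encodings and APIs differ, and reuse of existing policy keys is covered below):

// wrapCustomerKey mints a policy keypair for the requested policy, protects
// its private half with ABE, and wraps the customer key with its public half.
func wrapCustomerKey(masterPub cpabe.PublicKey, policy cpabe.Policy, customerKey []byte) []byte {
	policyPub, policyPriv := generateX25519Keypair()

	// Only machines whose attributes satisfy the policy can recover the
	// private policy key.
	encryptedPolicyPriv := masterPub.Encrypt(policy, policyPriv)
	storePolicyKey(policy, policyPub, encryptedPolicyPriv)

	// The customer's TLS private key is wrapped with the public policy key
	// and written to Quicksilver alongside the policy label.
	return encryptToX25519(policyPub, customerKey)
}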

When a user accesses the customer’s website, the TLS termination service at the data center that receives the request fetches the encrypted policy key associated with the customer’s access policy. If the machine’s attributes don’t satisfy the policy, decryption fails and the request is forwarded to the closest satisfying data center. If decryption succeeds, the policy key is used to decrypt the customer’s private key and complete the handshake.

Key | Purpose | CA in core | Core | Network
Master Public Key | Encrypts private policy keys over an access policy | Generate | Read |
Master Secret Key | Generates secret keys for machines based on their attributes | Generate, Read | |
Machine Secret Key / Attribute-Based Secret Key | Decrypts private policy keys stored in the global KV store, Quicksilver | Generate | | Read
Customer TLS Private Key | Performs the digital signature necessary to complete the TLS handshake to the customer’s website | | Read (transiently on upload) | Read
Public Policy Key | Encrypts customers’ TLS private keys | | Generate, Read |
Private Policy Key | Decrypts customers’ TLS private keys | Read (transiently during key rotation) | Generate | Read

However, policy keys are not generated for every customer’s certificate upload. As shown in the figure below, if a customer requests a policy that already exists in the system and thus has an associated policy key, the policy key will get re-used. Since most customers use the same few policies, such as restricting to one country, or restricting to the EU, the number of policy keys is orders of magnitude smaller compared to the number of customer keys.

[Figure: Policy Keys]

This sharing of policy keys is tremendously useful for key rotation. When the master keys are rotated (and consequently the machine secret keys), only the handful of policy keys used to control access to customers’ keys needs to be re-encrypted, rather than every customer key. This reduces compute and bandwidth requirements. Additionally, caching policy keys at the TLS termination service improves performance by reducing the need for frequent decryptions in the critical path.
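In outline, rotation looks like the following sketch (hypothetical helpers; in particular, how the central authority transiently recovers each private policy key is elided here):

// rotateMasterKey re-wraps only the policy keys under the new master public
// key. The far larger set of wrapped customer keys is untouched, because
// those are encrypted to the policy keys rather than to the master key.
func rotateMasterKey(newMasterPub cpabe.PublicKey) error {
	for _, pk := range listPolicyKeys() {
		policyPriv, err := recoverPolicyPrivateKey(pk) // transient read by the CA
		if err != nil {
			return err
		}
		pk.EncryptedPrivate = newMasterPub.Encrypt(pk.Policy, policyPriv)
		storePolicyKey(pk.Policy, pk.Public, pk.EncryptedPrivate)
	}
	return nil
}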

This is similar to hybrid encryption, where public key cryptography is used to establish a shared symmetric key, which is then used to encrypt data. The difference here is that the policy keys are not symmetric keys but X25519 keypairs, an asymmetric scheme based on elliptic curves. While not as fast as symmetric schemes like AES, traditional elliptic curve cryptography is significantly faster than attribute-based encryption, and the advantage is that the central service doesn’t need access to secret key material to encrypt customer keys.

The other component of robust key rotation involves maintaining multiple key versions. The latest key generation is used for encryption, but both the latest and previous versions can be used for decryption. We use a system of states to manage key transitions and the safe deletion of older keys. We also have extensive monitoring in place to alert us if any machines are not using the appropriate key generations.
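The versioning rule itself is simple; a sketch (hypothetical types, and the real state machine has more states and safeguards than shown here):

// A keyRing holds the key generations currently in play, ordered oldest to
// newest. Encryption always uses the newest generation; decryption tries the
// newest first and falls back to older ones, so rotation never breaks reads.
type keyRing struct {
	generations []Generation
}

func (r *keyRing) encryptionKey() Generation {
	return r.generations[len(r.generations)-1]
}

func (r *keyRing) decrypt(ciphertext []byte) ([]byte, error) {
	for i := len(r.generations) - 1; i >= 0; i-- {
		if plaintext, err := r.generations[i].Decrypt(ciphertext); err == nil {
			return plaintext, nil
		}
	}
	return nil, errors.New("no available key generation can decrypt this ciphertext")
}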

The Tail At Scale

Geo Key Manager suffered from high tail latency, which occasionally impacted availability. Jeff Dean’s paper, The Tail at Scale, is an enlightening read on how even elevated p99 latency at Cloudflare scale can be damaging. Despite revamping the server and client components of our service, the p99 latency didn’t budge. These revamps, such as switching from worker pools to one goroutine per request, did simplify the service, as they removed thousands of lines of code. Distributed tracing was able to pin down the delays: they took place between the client sending a request and the server receiving it. But we could not dig in further. We even wrote a blog last year describing our debugging endeavors, but without a concrete solution.

Finally, we realized that there is a level of indirection between the client and the server. Our data centers around the world are very different sizes. To avoid swamping smaller data centers with connections, larger data centers would task individual, intermediary machines with proxying requests to other data centers using the Go net/rpc library.

Once we included the forwarding function on the intermediary server in the trace, the problem became clear. There was a long delay between issuing the request and processing it. Yet the code was merely a call to a built-in library function. Why was it delaying the request?

Ultimately we found that there was a lock held while the request was serialized. The net/rpc package does not support streams, but our packet-oriented custom application protocol, which we wrote before the advent of gRPC, does support streaming. To bridge this gap, we executed a request and waited for the response in the serialization function. While an expedient way to get the code written, it created a performance bottleneck as only one request could be forwarded at a time.

Our solution was to use channels for coordination, letting multiple requests execute while we waited for the responses to arrive. When we rolled it out we saw dramatic decreases in tail latency.
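In spirit, the fix looked something like the sketch below (hypothetical types, not the actual forwarding code): instead of blocking inside the serialization function until the response arrived, each forwarded request registers a channel keyed by its request ID, so many requests can be in flight on the same connection at once while a single reader dispatches responses as they arrive.

// pending maps request IDs to the channel waiting for that response.
var (
	mu      sync.Mutex
	pending = make(map[uint64]chan Response)
)

// forward sends a request without holding any lock across the round trip,
// then waits on its own channel for the matching response.
func forward(conn Conn, req Request) (Response, error) {
	ch := make(chan Response, 1)
	mu.Lock()
	pending[req.ID] = ch
	mu.Unlock()

	if err := conn.Send(req); err != nil {
		return Response{}, err
	}
	return <-ch, nil
}

// readLoop runs once per connection, dispatching responses in whatever order
// the remote side produces them.
func readLoop(conn Conn) {
	for {
		resp, err := conn.Receive()
		if err != nil {
			return
		}
		mu.Lock()
		ch, ok := pending[resp.ID]
		delete(pending, resp.ID)
		mu.Unlock()
		if ok {
			ch <- resp
		}
	}
}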

[Figure: The results of fixing RPC failures in remote colo in Australia]

Unfortunately we cannot make the speed of light any faster (yet). Customers who want their keys kept only in the US while their website users are in the land down under will have to endure some delays as we make the trans-pacific voyage. But thanks to session tickets, those delays only affect new connections.

Uptime also improved significantly. Data centers provisioned after the initial cryptographic setup could now participate in the system, which also meant that data centers that did not satisfy a certain policy had a broader range of satisfying neighbors to which they could forward signing requests. This increased redundancy in the system, and particularly benefited data centers in regions without the best Internet connectivity. The graph below represents successful probes spanning every machine globally over a two-day period. For GeoV1, we see websites with policies for US and EU regions falling to under 98% at one point, while for GeoV2, uptime rarely drops below four nines (99.99%) of availability.

[Figure: Uptime by Key Profile across US and EU for GeoV1 and GeoV2, and IN for GeoV2]

Conclusion

Congratulations, dear reader, for making it this far. Just like you, applied cryptography has come a long way, but only limited slivers manage to penetrate the barrier between research and real-world adoption. Bridging this gap can help enable novel capabilities for protecting sensitive data. Attribute-based encryption itself has become much more efficient and featureful over the past few years. We hope that this post encourages you to consider ABE for your own access control needs, particularly if you deal with distributed systems and don’t want to depend on a highly available central authority. We have open-sourced our implementation of CP-ABE in CIRCL, and plan on publishing a paper with additional details.

We look forward to the numerous product improvements to Geo Key Manager made possible by this new cryptographic foundation. We plan to use this ABE-based mechanism for storing not just private keys, but also other types of data. We are working on making it more user-friendly and generalizable for internal services to use.

Acknowledgements

We’d like to thank Watson Ladd for his contributions to this project during his tenure at Cloudflare.

……
[1] While this is true for most customers, we also offer Keyless SSL, which gives customers who can run their own keyservers the ability to store their private keys on-premises.

Geo Key Manager: Setting up a service for scale

Post Syndicated from Tanya Verma original https://blog.cloudflare.com/scaling-geo-key-manager/

In 2017, we launched Geo Key Manager, a service that allows Cloudflare customers to choose where they store their TLS certificate private keys. For example, if a US customer only wants its private keys stored in US data centers, we can make that happen. When a user from Tokyo makes a request to this website or API, it first hits the Tokyo data center. As the Tokyo data center lacks access to the private key, it contacts a data center in the US to terminate the TLS request. Once the TLS session is established, the Tokyo data center can serve future requests. For a detailed description of how this works, refer to this post on Geo Key Manager.

This is a story about the evolution of systems in response to increase in scale and scope. Geo Key Manager started off as a small research project and, as it got used more and more, wasn’t scaling as well as we wanted it to. This post describes the challenges Geo Key Manager is facing today, particularly from a networking standpoint, and some of the steps along its way to a truly scalable service.

Geo Key Manager started out as a research project that leveraged two key innovations: Keyless SSL, an early Cloudflare innovation; and identity-based encryption and broadcast encryption, relatively new cryptographic techniques that can be used to construct identity-based access management schemes — in this case, with identities based on geography. Keyless SSL was originally designed as a keyserver that customers would host on their own infrastructure, allowing them to retain complete ownership of their own private keys while still reaping the benefits of Cloudflare.

Eventually we started using Keyless SSL for Geo Key Manager requests, and later, all TLS terminations at Cloudflare were switched to an internal version of Keyless SSL. We made several tweaks to make the transition go more smoothly, but this meant that we were using Keyless SSL in ways we hadn’t originally intended.

With the increasing risk of balkanization of the Internet in response to geography-specific regulations like GDPR, demand for products like Geo Key Manager, which enable users to retain geographical control of their information, has surged. A lot of the work we do on the Research team is exciting because we get to apply cutting-edge advancements in the field to Cloudflare-scale systems. It’s also fascinating to see projects used in new and unexpected ways. But inevitably, many of our systems use technology that has never before been used at this scale, which can trigger failures, as we shall see.

A trans-pacific voyage

In late March, we started seeing an increase in TLS handshake failures in Melbourne, Australia. When we start observing failures in a specific data center, depending on the affected service, we have several options to mitigate impact. For critical systems like TLS termination, one approach is to reroute all traffic using anycast to neighboring data centers. But when we rerouted traffic away from the Melbourne data center, the same failures moved to the nearby data centers of Adelaide and Perth. As we were investigating the issue, we noticed a large spike in timeouts.

The first service in line in Cloudflare’s TLS termination stack performs all the parts of TLS termination that do not require access to customers’ private keys. Since it is an Internet-facing process, we try to keep sensitive information out of it as much as possible, in case of memory disclosure bugs. So it forwards the key signing request to keynotto, a service written in Rust that performs RSA and ECDSA key signatures.

We continued our investigation: this first service was timing out on the requests it sent to keynotto. Next, we looked into the requests themselves. We observed that there was an increase in total requests, but not by a very large factor. You can see a large drop in traffic in the Melbourne data center below; it marks the point where we dropped traffic from Melbourne and rerouted it to other nearby data centers.

We decided to track one of these timed-out requests using our distributed tracing infrastructure. The Jaeger UI allows you to chart traces that take the longest. Thanks to this, we quickly figured out that most of these failures were being caused by a single, new zone that was getting a large amount of API traffic. And interestingly, this zone also had Geo Key Manager enabled, with a policy set to US-only data centers.

Geo Key Manager routes requests to the closest data center that satisfies the policy. This happened to be a data center in San Francisco. That meant adding ~175ms to the actual time spent performing the key signing (median is 3 ms), to account for the trans-pacific voyage 🚢. So, it made sense that the remote key signatures were relatively slow, but what was causing TLS timeouts and degradation for unrelated zones with nothing to do with Geo Key Manager?

Below are graphs depicting the increase, by quantile, in RSA key signature latency in Melbourne. Plotting them all on the same graph didn’t work well because of the difference in scale, so the p50 is shown separately from the p70 and p90.

To answer why unrelated zones had timeouts and performance degradation, we have to understand the architecture of the three services involved in terminating TLS.

Life of a TLS request

TLS requests arrive at a data center routed by anycast. Each server runs the same stack of services, so each has its own instance of the initial service, keynotto, and gokeyless (discussed shortly). The first service has a worker pool of half the number of CPU cores. There are 96 cores on each server, so 48 workers. Each of these workers creates its own connection to keynotto, which it uses to send key-signing requests and receive their responses. keynotto, at that point in time, could multiplex between all 48 of these connections because it spawned a new thread to handle each connection. However, it processed all requests on the same connection sequentially. Given that there could be dozens of requests per second on the same connection, if even a single one was slow, it would cause head-of-line blocking of all the requests enqueued after it. So long as most requests were short lived, this bug went unnoticed. But when a lot of traffic needed to be processed via Geo Key Manager, the head-of-line blocking created problems. This type of flaw is usually only exposed under heavy load or when load testing, and it will make more sense after I introduce gokeyless and explain the history of keynotto.

gokeyless-internal, very imaginatively named, is the internal instance of our Keyless SSL keyserver written in Go. I’ll abbreviate it to gokeyless for the sake of simplicity. Before we introduced keynotto as the designated key signing process, the first service sent all key signing requests directly to gokeyless. gokeyless created worker pools based on the type of operation, consisting of goroutines, lightweight threads managed by the Go runtime. The pools of interest are the RSA, ECDSA, and remote worker pools. RSA and ECDSA are fairly obvious: these goroutines performed RSA and ECDSA key signatures. Any requests involving Geo Key Manager were placed in the remote pool. This prevented the network-dependent remote requests from affecting local key signatures. Worker pools were an artifact of a previous generation of Keyless which didn’t need to account for remote requests. Using benchmarks, we had noticed that spinning up worker pools provided some marginal latency benefits. For local operations only, performance was optimal as the computation was very fast and CPU bound. However, when we started adding remote operations that could block the workers needed for performing local operations, we decided to create a new worker pool just for remote ops.

When gokeyless was designed in 2014, Rust was not a mature language. That changed recently, and we decided to experiment with a minimal Rust proxy placed between the first service and gokeyless. It would handle RSA and ECDSA signatures, which made up about 99% of all key signing operations, while handing off more esoteric operations like Geo Key Manager to gokeyless running locally on the same server. The hope was that we could eke out some performance gains from the tight runtime control afforded by Rust and the use of Rust’s cryptographic libraries. Performance is incredibly important to Cloudflare, since CPU is one of the main limiting factors for edge servers. Go’s RSA is notorious for being slow and using a lot of CPU, and given that one in three handshakes uses RSA, it is important to optimize. CGo seemed to create unnecessary overhead, and there was no assembly-only RSA that we could use. We tried to speed up RSA without CGo, using only assembly, and made some strides, but it was still somewhat suboptimal. So keynotto was built to take advantage of the fast RSA implementation in BoringSSL.

The next section is a quick diversion into keynotto that isn’t strictly related to the story, but who doesn’t enjoy a hot take on Go vs Rust?

Go vs Rust: The battle continues

While keynotto was initially deployed because it objectively lowered CPU, we weren’t sure what fraction of its benefits over gokeyless were related to different cryptography libraries used vs the choice of language. keynotto used BoringCrypto, the TLS library that is part of BoringSSL, whereas gokeyless used Go’s standard crypto library crypto/tls. And keynotto was implemented in Rust, while gokeyless in Go. However, recently during the process of FedRAMP certification, we had to switch the TLS library in gokeyless to use the same cryptography library as keynotto. This was because while the Go standard library doesn’t use a FIPS validated cryptography module, this Google-maintained branch of Go that uses BoringCrypto does. We then turned off keynotto in a few very large data centers for a couple days and compared this to the control group of having keynotto turned on. We found that moving gokeyless to BoringCrypto provided a very marginal benefit, forcing us to revisit our stance on CGo. This result meant we could attribute the difference to using Rust over Go and implementation differences between keynotto and gokeyless.

Turning off keynotto resulted in an average 26% increase in maximum memory consumed and 71% increase in maximum CPU consumed. In the case of all quantiles, we observed a significant increase in latency for both RSA and ECDSA. ECDSA as measured across four different large data centers (each is a different color) is shown:

Remote ops don’t play fair

The first service, keynotto, and gokeyless communicate over TCP via the custom Keyless protocol. This protocol was developed in 2014 and originally intended for communicating with external key servers, aka gokeyless. After the protocol started being used internally, by an increasing number of clients in several different languages, challenges started appearing. In particular, each client had to implement the serialization code by hand and make sense of all the features the protocol provided. The custom protocol also makes tracing more challenging to do right. So while older clients like the first service have tuned their implementations, newer clients like keynotto can find it difficult to get every property right, such as the correct way to propagate traces.

One such property that the Keyless protocol offers is multiplexing of requests within a connection. It does so by provisioning unique IDs for each request, which allows for out-of-order delivery of the responses similar to protocols like HTTP/2. While the first service and gokeyless leveraged this property to handle situations where the order of responses is different from the order of requests, keynotto, a much newer service, didn’t. This is why it had to process requests on the same connection sequentially. This led to local key signing requests — which took roughly 3ms — being blocked on the remote requests that took 60x that duration!
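The property in question boils down to an ID carried on every frame (the struct below is illustrative, not the actual Keyless wire format):

// Each frame carries a unique ID chosen by the sender and echoed back in the
// response, so responses can be matched to requests even when they arrive in
// a different order. A client that ignores the ID and assumes in-order
// responses is forced to handle one request per connection at a time, which
// is exactly the head-of-line blocking described above.
type frameHeader struct {
	Version uint8
	Length  uint16
	ID      uint32
}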

We now know why most TLS requests were degraded or dropped. But it’s also worth examining the remote requests themselves. The gokeyless remote worker pool had 200 workers on each server. One of the mitigations we applied was bumping that number to 2,000. Increasing the concurrency in a system when faced with problems that resemble resource constraints is an understandable thing to do; the main argument against it is higher resource utilization. Let’s get some context on why bumping up remote workers 10x wasn’t necessarily a problem. When gokeyless was created, it had separate pools with a configurable number of workers for RSA, ECDSA, and remote ops. RSA and ECDSA had a large fraction of workers, and remote was relatively smaller. Post keynotto, the need for the RSA and ECDSA pools was obviated since keynotto handled all local key signing operations.

Let’s do some napkin math now to show that bumping it by 10x could not have possibly helped. We were receiving five thousand requests per second in addition to the usual traffic in Melbourne, of which 200 per second were remote requests. Melbourne has 60 servers. Looking at a breakdown of remote requests by server, the maximum any one server handled was ten requests per second (rps). The graph below shows the percentage of workers in gokeyless actually performing operations; “other” represents the remote workers. We can see that while Melbourne worker utilization peaked at 47%, that is still not 100%. And we see that the utilization in San Francisco was much lower, so it couldn’t be SF that was slowing things down.

Given that we had 200 workers per server and at most ten requests per second, even if they took 200 ms each, gokeyless was not the bottleneck. We had no reason to increase the number of remote workers. The culprit here was keynotto’s serial queue.

The solution here was to prevent remote signing ops from blocking local ops in keynotto. We did this by changing the internal task handling model of keynotto to not just handle each connection concurrently, but also to process each request on a connection concurrently. This was done by using a multi-producer, single-consumer queue. When a request from a connection was read, it was handed off to a new thread for processing while the connection thread went back to reading requests. When the request was done processing, the response was written to a channel. A write thread for each connection was also created, which polled the read side of the channel and wrote the response back to the connection.

The next thing we added was a set of carefully chosen timeout values in keynotto. Timeouts allow for faster failure and faster cleanup of the resources associated with a timed-out request. Choosing appropriate timeouts can be tricky, especially when composed across multiple services. Since keynotto is downstream of the first service but upstream of gokeyless, its timeout should be smaller than the former but larger than the latter, so that upstream services can diagnose which process triggered the timeout. Timeouts are further complicated when network latency is involved, because trans-pacific p99 latency is very different from p99 between two US states; using the worst case network latency can be a fair compromise. We chose keynotto’s timeouts to be an order of magnitude larger than the p99 request latency, since the purpose was to prevent continued stalls and resource consumption.
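As a rule of thumb, the budget shrinks as you move downstream. The constants below are made-up values purely for illustration; the real configuration differs, and each service sets its own timeout independently:

// Illustrative only: the ordering is what matters. keynotto's timeout sits
// well above its own p99 request latency, below the first service's timeout,
// and above gokeyless's, so the layer that timed out is unambiguous.
const (
	firstServiceTimeout = 3 * time.Second
	keynottoTimeout     = 2 * time.Second
	gokeylessTimeout    = 1 * time.Second
)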

A more general solution to avoid the two issues outlined above would be to use gRPC. gRPC was released in 2015, which was one year too late to be used in Keyless at that time. It has several advantages over custom protocol implementations such as multi-language client libraries, improved tracing, easy timeouts, load balancing, protobuf for serialization and so on, but directly relevant here is multiplexing. gRPC supports multiplexing out of the box which makes it unnecessary to handle request/response multiplexing manually.

Tracking requests across the Atlantic

Then in early June, we started seeing TLS termination timeouts in Chicago. This time, there were timeouts across the first service, keynotto, and gokeyless. We had on our hands a zone with Geo Key Manager policy set to the EU that had suddenly started receiving around 80 thousand remote rps, which was significantly more than the 200 remote rps in Australia. This time, keynotto was not responsible since its head of line blocking issues had been fixed. It was timing out waiting for gokeyless to perform the remote key signatures. A quick glance at the maximum worker utilization of gokeyless for the entire datacenter revealed that it was maxed out. To counteract the issue, we increased the remote workers from 200 to 20,000, and then to 200,000. It wasn’t enough. Requests were still timing out and worker utilization was squarely at 100% until we rate limited the traffic to that zone.

Why didn’t increasing the number of workers by a factor of 1,000 times help? We didn’t even have 250,000 rps for the entire datacenter, let alone every server.

File descriptors don’t scale

gokeyless maintains a map of ping latencies to each data center, as measured from the data center that the instance is running in. It updates this map frequently and uses it to decide which data center to route remote requests to. A long time ago, every server in every data center maintained its own connections to every other data center. This quickly grew out of hand, and we started running out of file descriptors as the number of data centers grew. So we switched to a new system: servers are grouped into pods of 32, and the connections to other data centers are divided among the members of each pod. This was great for reducing the number of file descriptors used by each server.

An example will illustrate this better. In a data center with 100 servers, there would be four pods: three with 32 servers each and one with four servers. If there were a total of 65 data centers in the world, then each server in a pod of 32 would be responsible for maintaining connections to two data centers, and each server in the pod of four would handle 16 data centers. Remote requests are routed to the data center with the shortest latency, so in a pod of 32 servers, the server that maintains the connection to the closest satisfying data center is responsible for sending remote requests from every member of the pod to that target. In case of failure or timeout, the data center which is second closest (and then the third closest) is used instead. If the three closest data centers all return failures, we give up and return a failure. This means that if a data center with 130 servers is receiving 80k remote rps, which is 615 rps per server, there will be approximately four servers actually responsible for routing all remote traffic at any given time, each handling ~20k rps. This was the case in Chicago. These requests were being routed to a data center in Hamburg, Germany, and there could be a maximum of four connections at the same time between Chicago and Hamburg.
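A sketch of the pod bookkeeping, just to make the arithmetic concrete (the assignment function is hypothetical; the real scheme accounts for latency and topology):

// assignDataCenters spreads the list of remote data centers round-robin
// across the members of a pod, so each server only keeps connections to its
// own slice of the world.
func assignDataCenters(serverIndex, podSize int, remoteDCs []string) []string {
	var mine []string
	for i, dc := range remoteDCs {
		if i%podSize == serverIndex {
			mine = append(mine, dc)
		}
	}
	return mine
}

// With 65 remote data centers, a server in a pod of 32 ends up with two of
// them (a few servers get three), and a server in a pod of four with roughly 16.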

Little’s Law

At that time, gokeyless used a worker pool architecture with a queue for buffering additional tasks. This queue blocked after 1,024 requests were queued, to create backpressure on the client. This is a common technique to prevent overloading: not accepting more tasks than the server knows it can handle. The p50 RSA latency was 4 ms, and the latency from Chicago to Hamburg was ~100 ms. After we bumped up the remote workers on each server to 200,000 in Chicago, we were surely not constrained on the sending end, since we only had 80k rps for the whole data center and each request shouldn’t take much more than 100 ms in the vast majority of cases. We know this because of Little’s Law, which states that the average number of requests concurrently inside a system equals the arrival rate multiplied by the average time each request spends in the system. Let’s see how it applies here, and how it allowed us to prove why increasing concurrency or queue size did not help.
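Stated as a formula and applied to the sending side in Chicago:

L = \lambda \times W \approx 80{,}000\ \text{requests/s} \times 0.1\ \text{s} = 8{,}000\ \text{requests in flight}

Eight thousand in-flight requests across the whole data center is nowhere near the 200,000 remote workers configured on every single server, so the sending side had concurrency to spare.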

Consider the queue in the German data center, Hamburg in this case. We hadn’t bumped the number of remote workers there, so only 200 remote workers were available. Assuming 20k rps arrived at one server in Hamburg and each took ~5 ms to process, the necessary concurrency in the system should be 100 workers. We had 200 workers. Even without a queue, Hamburg should’ve been easily able to handle the 20k rps thrown at it. We investigated how many remote rps were actually processed per server in Hamburg:

These numbers didn’t make any sense! The maximum number at any time was a little over 700. We should’ve been able to process orders of magnitude more requests. By the time we investigated this, the incident had ended. We had no visibility into the size of the service level queue — or the size of the TCP queue — to understand what could have caused the timeouts. We speculated that while p50 for the population of all RSA key signatures might be 4 ms, perhaps for the specific population of Geo Key Manager RSA signatures, things took a whole lot longer. With 200 workers, we processed ~500 rps, which meant each operation would have to take 400 ms. p99.9 for signatures can be around this much, so it is possible that was how long things took.

We recently ran a load test to establish an upper bound on remote rps, and discovered that one server in the US could process 7,000 remote rps to the EU — much lower than Little’s Law suggests. We identified several slow remote requests through tracing and noticed that they usually corresponded with multiple tries to different remote hosts. The explanation for why these retries are necessary to get through to an otherwise very functional data center is that gokeyless establishes connections using a hardcoded list of data centers and corresponding server hostnames. Hostnames can change when servers are added or removed. If we attempt to connect to an invalid host and wait on a one-second timeout, we will certainly need to retry with another host, which can cause large increases in average processing time.

After the Hamburg outage, we decided to remove gokeyless worker pools and switch to a model of one goroutine per request. Goroutines are extremely lightweight, with benchmarks suggesting that a 4 GB machine can handle around one million of them. This was something that had made sense to do ever since we extended gokeyless to support remote requests in 2017, but because of the complexity involved in changing the entire architecture of processing requests, we held off on the change. Furthermore, when we launched Geo Key Manager, we didn’t have enough traffic to prompt us to rethink this architecture urgently. The new model removes complexity from the code, because tuning worker pool sizes and queue sizes can be complicated. We have also observed, on average, a 7x drop in memory consumed after switching to one goroutine per request, because we no longer have idle goroutines sitting around as was the case with worker pools. It also makes it easier to trace the life of an individual request, because you can follow all the steps in its execution, which helps when chasing down per-request slowdowns.
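The shape of the change is roughly the following (a simplified sketch with hypothetical types, not the actual gokeyless code):

// Before: a fixed worker pool drained a shared queue, and both the pool size
// and the queue length had to be tuned.
//
// After: the connection handler spawns one goroutine per request and lets the
// Go scheduler do the rest. conn.Send is assumed to be safe for concurrent use.
func serveConn(conn Conn) {
	for {
		req, err := conn.Receive()
		if err != nil {
			return
		}
		go func(req Request) {
			resp := handle(req) // local signature or remote forward
			conn.Send(resp)
		}(req)
	}
}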

Conclusion

The complexity of distributed systems can quickly scale up to unmanageable levels, making it important to have deep visibility into them. An extremely important tool to have is distributed tracing, which tracks a request through multiple services and provides information about the time spent in each part of the journey. We didn’t propagate gokeyless trace spans through parts of the remote operation, which prevented us from identifying why Hamburg only processed 700 rps. Having examples on hand for various error cases can also make diagnosis easier, especially in the midst of an outage. While we load tested our TLS termination system in the common case, where key signatures happen locally, we hadn’t load tested the less common and lower volume use case of remote operations. Using sources of truth that update dynamically to reflect the current state of the world, instead of static sources, is also important. In our case, the list of data centers that gokeyless connected to for performing remote operations was hardcoded a long time ago and never updated. So while we added more data centers, gokeyless was unable to make use of them, and in some cases may have been connecting to servers with invalid hostnames.

We’re now working on overhauling several pieces of Geo Key Manager to make it significantly more flexible and scalable, so think of this as setting the stage for a future blog post where we finally solve some of the issues outlined here, and stay tuned for updates!

Improving DNS Privacy with Oblivious DoH in 1.1.1.1

Post Syndicated from Tanya Verma original https://blog.cloudflare.com/oblivious-dns/


Today we are announcing support for a new proposed DNS standard — co-authored by engineers from Cloudflare, Apple, and Fastly — that separates IP addresses from queries, so that no single entity can see both at the same time. Even better, we’ve made source code available, so anyone can try out ODoH, or run their own ODoH service!

But first, a bit of context. The Domain Name System (DNS) is the foundation of a human-usable Internet. It maps usable domain names, such as cloudflare.com, to IP addresses and other information needed to connect to that domain. A quick primer about the importance and issues with DNS can be read in a previous blog post. For this post, it’s enough to know that, in the initial design and still dominant usage of DNS, queries are sent in cleartext. This means anyone on the network path between your device and the DNS resolver can see both the query that contains the hostname (or website) you want, as well as the IP address that identifies your device.

To safeguard DNS from onlookers and third parties, the IETF standardized DNS encryption with DNS over HTTPS (DoH) and DNS over TLS (DoT). Both protocols prevent queries from being intercepted, redirected, or modified between the client and resolver. Client support for DoT and DoH is growing, having been implemented in recent versions of Firefox, iOS, and more. Even so, until there is wider deployment among Internet service providers, Cloudflare is one of only a few providers to offer a public DoH/DoT service. This has raised two main concerns. One concern is that the centralization of DNS introduces single points of failure (although, with data centers in more than 100 countries, Cloudflare is designed to always be reachable). The other concern is that the resolver can still link all queries to client IP addresses.

Cloudflare is committed to end-user privacy. Users of our public DNS resolver service are protected by a strong, audited privacy policy. However, for some, trusting Cloudflare with sensitive query information is a barrier to adoption, even with such a strong privacy policy. Instead of relying on privacy policies and audits, what if we could give users an option to remove that barrier with technical guarantees?

Today, Cloudflare and partners are launching support for a protocol that does exactly that: Oblivious DNS over HTTPS, or ODoH for short.

ODoH Partners:

We’re excited to launch ODoH with several leading launch partners who are equally committed to privacy.

A key component of ODoH is a proxy that is disjoint from the target resolver. Today, we’re launching ODoH with several leading proxy partners, including: PCCW, SURF, and Equinix.


“ODoH is a revolutionary new concept designed to keep users’ privacy at the center of everything. Our ODoH partnership with Cloudflare positions us well in the privacy and “Infrastructure of the Internet” space. As well as the enhanced security and performance of the underlying PCCW Global network, which can be accessed on-demand via Console Connect, the performance of the proxies on our network are now improved by Cloudflare’s 1.1.1.1 resolvers. This model for the first time completely decouples client proxy from the resolvers. This partnership strengthens our existing focus on privacy as the world moves to a more remote model and privacy becomes an even more critical feature.” — Michael Glynn, Vice President, Digital Automated Innovation, PCCW Global


“We are partnering with Cloudflare to implement better user privacy via ODoH. The move to ODoH is a true paradigm shift, where the users’ privacy or the IP address is not exposed to any provider, resulting in true privacy. With the launch of ODoH-pilot, we’re joining the power of Cloudflare’s network to meet the challenges of any users around the globe. The move to ODoH is not only a paradigm shift but it emphasizes how privacy is important to any users than ever, especially during 2020. It resonates with our core focus and belief around Privacy.” — Joost van Dijk, Technical Product Manager, SURF


How does Oblivious DNS over HTTPS (ODoH) work?

ODoH works by adding a layer of public key encryption, as well as a network proxy between clients and DoH servers such as 1.1.1.1. The combination of these two added elements guarantees that only the user has access to both the DNS messages and their own IP address at the same time.

[Figure: the ODoH message path between client, proxy, and target]

There are three players in the ODoH path. Looking at the figure above, let’s begin with the target. The target decrypts queries encrypted by the client, via a proxy. Similarly, the target encrypts responses and returns them to the proxy. The standard says that the target may or may not be the resolver (we’ll touch on this later). The proxy does as a proxy is supposed to do, in that it forwards messages between client and target. The client behaves as it does in DNS and DoH, but differs by encrypting queries for the target, and decrypting the target’s responses. Any client that chooses to do so can specify a proxy and target of choice.

Together, the added encryption and proxying provide the following guarantees:

  1. The target sees only the query and the proxy’s IP address.
  2. The proxy has no visibility into the DNS messages, with no ability to identify, read, or modify either the query being sent by the client or the answer being returned by the target.
  3. Only the intended target can read the content of the query and produce a response.

These three guarantees improve client privacy while maintaining the security and integrity of DNS queries. However, each of these guarantees relies on one fundamental property — that the proxy and the target servers do not collude. So long as there is no collusion, an attacker succeeds only if both the proxy and target are compromised.

One aspect of this system worth highlighting is that the target is separate from the upstream recursive resolver that performs DNS resolution. In practice, for performance, we expect the target and the resolver to be the same. In fact, 1.1.1.1 is now both a recursive resolver and a target! There is no reason that a target needs to exist separately from any resolver. If they are separated, then the target is free to choose resolvers and simply acts as a go-between. The only real requirement, remember, is that the proxy and target never collude.

Also, importantly, clients are in complete control of proxy and target selection. Without any need for TRR-like programs, clients can have privacy for their queries, in addition to security. Since the target only knows about the proxy, the target and any upstream resolver are oblivious to the existence of any client IP addresses. Importantly, this puts clients in greater control over their queries and the ways they might be used. For example, clients could select and alter their proxies and targets any time, for any reason!

ODoH Message Flow

In ODoH, the ‘O’ stands for oblivious, and this property comes from the level of encryption of the DNS messages themselves. This added encryption is end-to-end between client and target, and independent from the connection-level encryption provided by TLS/HTTPS. One might ask why this additional encryption is required at all in the presence of a proxy. This is because two separate TLS connections are required to support proxy functionality. Specifically, the proxy terminates a TLS connection from the client, and initiates another TLS connection to the target. Between those two connections, the DNS message contents would otherwise appear in plaintext! For this reason, ODoH additionally encrypts messages between client and target so the proxy has no access to the message contents.
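
Concretely, the only thing the proxy ever forwards is an opaque, encrypted message. The Go sketch below shows roughly what that message looks like on the wire, based on the ODoH draft; field names are illustrative, and odoh-go contains the real definitions:

```go
package odohsketch

// ObliviousDoHMessage is a rough sketch of the ODoH wire format: a
// message type, an identifier for the target's HPKE key, and an opaque
// ciphertext. The DNS query or answer lives only inside
// EncryptedMessage, which the proxy cannot decrypt.
type ObliviousDoHMessage struct {
	MessageType      uint8  // query or response
	KeyID            []byte // identifies the target's public key used for encryption
	EncryptedMessage []byte // HPKE-sealed DNS message
}
```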

The whole process begins with clients encrypting their query for the target using HPKE. Clients obtain the target’s public key via DNS, where it is bundled into an HTTPS resource record and protected by DNSSEC. When the TTL for this key expires, clients request a new copy of the key as needed (just as they would for an A/AAAA record when that record’s TTL expires). The use of a target’s DNSSEC-validated public key guarantees that only the intended target can decrypt the query and encrypt a response (answer).

Clients transmit these encrypted queries to a proxy over an HTTPS connection. Upon receipt, the proxy forwards the query to the designated target. The target then decrypts the query, produces a response by sending the query to a recursive resolver such as 1.1.1.1, and then encrypts the response to the client. The encrypted query from the client contains encapsulated keying material from which targets derive the response encryption symmetric key.

This response is then sent back to the proxy, and then subsequently forwarded to the client. All communication is authenticated and confidential since these DNS messages are end-to-end encrypted, despite being transmitted over two separate HTTPS connections (client-proxy and proxy-target). The message that otherwise appears to the proxy as plaintext is actually an encrypted garble.
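
Putting the whole flow together, here is a hedged Go sketch of the client’s side of an ODoH exchange. The helper functions are hypothetical placeholders, not the real odoh-go API, and the query parameter naming is illustrative:

```go
package odohclient

import (
	"bytes"
	"io"
	"net/http"
)

// Hypothetical helpers standing in for a real HPKE/ODoH library.
func fetchTargetConfig(target string) ([]byte, error)                     { return nil, nil } // HTTPS RR lookup, DNSSEC-validated
func hpkeSeal(config, dnsQuery []byte) (sealed, secret []byte, err error) { return nil, nil, nil }
func hpkeOpen(secret, sealed []byte) ([]byte, error)                      { return nil, nil }

func resolveViaODoH(proxyURL, target string, dnsQuery []byte) ([]byte, error) {
	// 1. Fetch the target's public key (published in an HTTPS record and
	//    protected by DNSSEC) and encrypt the DNS query to the target.
	config, err := fetchTargetConfig(target)
	if err != nil {
		return nil, err
	}
	sealed, secret, err := hpkeSeal(config, dnsQuery)
	if err != nil {
		return nil, err
	}

	// 2. POST the ciphertext to the proxy, naming the target. The proxy
	//    sees the client's IP address and an opaque blob, nothing more.
	resp, err := http.Post(proxyURL+"?targethost="+target,
		"application/oblivious-dns-message", bytes.NewReader(sealed))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// 3. Decrypt the target's response using keying material derived from
	//    the secret encapsulated alongside the query.
	encrypted, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return hpkeOpen(secret, encrypted)
}
```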

What about Performance? Do I have to trade performance to get privacy?

We’ve been doing lots of measurements to find out, and will be doing more as ODoH deploys more widely. Our initial set of measurement configurations spanned cities in the USA, Canada, and Brazil. Importantly, our measurements include not just 1.1.1.1, but also 8.8.8.8 and 9.9.9.9. The full set of measurements, so far, is documented for open access.

In those measurements, it was important to isolate the cost of proxying and additional encryption from the cost of TCP and TLS connection setup. This is because the TLS and TCP costs are incurred by DoH, anyway. So, in our setup, we ‘primed’ measurements by establishing connections once and reusing that connection for all measurements. We did this for both DoH and for ODoH, since the same strategy could be used in either case.
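
A sketch of that priming approach, assuming a plain Go HTTP client (illustrative only; our actual measurement harness is different):

```go
package measure

import (
	"io"
	"net/http"
	"time"
)

// timedQuery pays the TCP and TLS setup cost up front with a warm-up
// request, then times only the query/response exchange over the reused
// connection; the same strategy is used for both DoH and ODoH runs.
func timedQuery(client *http.Client, url string) (time.Duration, error) {
	prime, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	io.Copy(io.Discard, prime.Body) // drain so the connection is returned to the pool
	prime.Body.Close()

	start := time.Now()
	resp, err := client.Get(url) // measured request reuses the warm connection
	if err != nil {
		return 0, err
	}
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
	return time.Since(start), nil
}
```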

The first thing that we can say with confidence is that the cost of the additional encryption is marginal. We know this because we randomly selected 10,000 domains from the Tranco million dataset and measured both encryption of the A record with a different public key and its decryption. The additional cost between a proxied DoH query/response and its ODoH counterpart is consistently less than 1ms at the 99th percentile.

The ODoH request-response pipeline, however, is much more than just encryption. A very useful way to look at the measurements is with a cumulative distribution chart — if you’re familiar with these kinds of charts, skip to the next paragraph. In contrast to most charts, where we start along the x-axis, with cumulative distributions we often start with the y-axis.

The chart below shows the cumulative distributions of query/response times for DoH, ODoH, and DoH transmitted over the Tor network. The dashed horizontal line that starts on the left at 0.5 is the 50% mark. For any plotted curve, the part of the curve below this dashed line covers 50% of the data points. Now look at the x-axis, which is a measure of time: curves that appear further to the left are faster than those to the right. One last important detail is that the x-axis is plotted on a logarithmic scale. What does this mean? The distance between the labeled markers is visually equal, but each marker represents a 10x increase, i.e., an order of magnitude. So, while the time difference between the first two markers is 9ms, the difference between the 3rd and 4th markers is 900ms.

[Chart: cumulative distributions of query/response times for DoH, ODoH, and DoH over the Tor network]

In this chart, the middle curve represents ODoH measurements. We also measured the performance of privacy-preserving alternatives, for example, DoH queries transmitted over the Tor network as represented by the right curve in the chart. (Additional privacy-preserving alternatives are captured in the open access technical report.) Compared to other privacy-oriented DNS variants, ODoH cuts query time in half, or better. This point is important since privacy and performance rarely play nicely together, so seeing this kind of improvement is encouraging!

The chart above also tells us that 50% of the time, ODoH queries are resolved in fewer than 228ms. Now compare the middle line to the left line, which represents ‘straight-line’ (or normal) DoH without any modification. That left plotline says that 50% of the time, DoH queries are resolved in fewer than 146ms. Looking below the 50% mark, the curves also tell us that half of the time, the difference between the two is never greater than 100ms. Above the 50% mark, the curves tell us that half of ODoH queries are competitive with DoH.

Those curves also hide a lot of information, so it is important to delve further into the measurements. The chart below has three different cumulative distribution curves that describe ODoH performance if we select proxies and targets by their latency. This is also an example of the insights that measurements can reveal, some of which are counterintuitive. For example, looking above 0.5, these curves say that ½ of ODoH query/response times are virtually indistinguishable, no matter the choice of proxy and target. Now shift attention below 0.5 and compare the two solid curves against the dashed curve that represents overall average. This region suggests that selecting the lowest-latency proxy and target offers minimal improvement over the average but, most importantly, it shows that selecting the lowest-latency proxy leads to worse performance!

[Chart: cumulative distributions of ODoH query/response times when selecting proxies and targets by latency, compared with the overall average]

Open questions remain, of course. This first set of measurements were executed largely in North America. Does performance change at a global level? How does this affect client performance, in practice? We’re working on finding out, and this release will help us to do that.

Interesting! Can I experiment with ODoH? Is there an open ODoH service?

Yes, and yes! We have open sourced our interoperable ODoH implementations in Rust, odoh-rs and Go, odoh-go, as well as integrated the target into the Cloudflare DNS Resolver. That’s right, 1.1.1.1 is ready to receive queries via ODoH.

We have also open sourced test clients in Rust, odoh-client-rs, and Go, odoh-client-go, to demo ODoH queries. You can also check out the HPKE configuration used by ODoH for message encryption to 1.1.1.1 by querying the service directly:

$ dig -t type65 +dnssec @ns1.cloudflare.com odoh.cloudflare-dns.com 

; <<>> DiG 9.10.6 <<>> -t type65 +dnssec @ns1.cloudflare.com odoh.cloudflare-dns.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19923
;; flags: qr aa rd; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
;; QUESTION SECTION:
;odoh.cloudflare-dns.com.	IN	TYPE65

;; ANSWER SECTION:
odoh.cloudflare-dns.com. 300	IN	TYPE65	\# 108 00010000010003026832000400086810F8F96810F9F9000600202606 470000000000000000006810F8F92606470000000000000000006810 F9F98001002E002CFF0200280020000100010020ED82DBE32CCDE189 BC6C643A80B5FAFF82548D21601C613408BACAAE6467B30A
odoh.cloudflare-dns.com. 300	IN	RRSIG	TYPE65 13 3 300 20201119163629 20201117143629 34505 odoh.cloudflare-dns.com. yny5+ApxPSO6Q4aegv09ZnBmPiXxDEnX5Xv21TAchxbxt1VhqlHpb5Oc 8yQPNGXb0fb+NyibmHlvTXjphYjcPA==

;; Query time: 21 msec
;; SERVER: 173.245.58.100#53(173.245.58.100)
;; WHEN: Wed Nov 18 07:36:29 PST 2020
;; MSG SIZE  rcvd: 291

We are working to add ODoH to existing stub resolvers such as cloudflared. If you’re interested in adding support to a client, or if you encounter bugs with the implementations, please drop us a line at [email protected]! Announcements about the ODoH specification and server will be sent to the IETF DPRIVE mailing list. You can subscribe and follow announcements and discussion about the specification here.

We are committed to moving it forward in the IETF and are already seeing interest from client vendors. Eric Rescorla, CTO of Firefox, says, “Oblivious DoH is a great addition to the secure DNS ecosystem. We’re excited to see it starting to take off and are looking forward to experimenting with it in Firefox.” We hope that more operators join us along the way and provide support for the protocol, by running either proxies or targets, and we hope client support will increase as the available infrastructure increases, too.

The ODoH protocol is a practical approach for improving privacy of users, and aims to improve the overall adoption of encrypted DNS protocols without compromising performance and user experience on the Internet.

Acknowledgements

Marek Vavruša and Anbang Wen were instrumental in getting the 1.1.1.1 resolver to support ODoH. Chris Wood and Peter Wu helped get the ODoH libraries ready and tested.