Tag Archives: Geo Key Manager

Inside Geo Key Manager v2: re-imagining access control for distributed systems

Post Syndicated from Tanya Verma original https://blog.cloudflare.com/inside-geo-key-manager-v2/

Inside Geo Key Manager v2: re-imagining access control for distributed systems

Inside Geo Key Manager v2: re-imagining access control for distributed systems

In December 2022 we announced the closed beta of the new version of Geo Key Manager. Geo Key Manager v2 (GeoV2) is the next step in our journey to provide customers with a secure and flexible way to control the distribution of their private keys by geographic location. Our original system, Geo Key Manager v1, was launched as a research project in 2017, but as customer needs evolved and our scale increased, we realized that we needed to make significant improvements to provide a better user experience.

One of the principal challenges we faced with Geo Key Manager v1 (GeoV1) was the inflexibility of our access control policies. Customers required richer data localization, often spurred by regulatory concerns. Internally, events such as the conflict in Ukraine reinforced the need to be able to quickly restrict access to sensitive key material. Geo Key Manager v1’s underlying cryptography was a combination of identity-based broadcast encryption and identity-based revocation that simulated a subset of the functionality offered by Attribute-Based Encryption (ABE). Replacing this with an established ABE scheme addressed the inflexibility of our access control policies and provided a more secure foundation for our system.

Unlike our previous scheme, which limited future flexibility by freezing the set of participating data centers and policies at the outset, using ABE made the system easily adaptable for future needs. It allowed us to take advantage of performance gains from additional data centers added after instantiation and drastically simplified the process for handling changes to attributes and policies. Furthermore, GeoV1 struggled with some perplexing performance issues that contributed to high tail latency and a painfully manual key rotation process. GeoV2 is our answer to these challenges and limitations of GeoV1.

While this blog focuses on our solution for geographical key management, the lessons here can also be applied to other access control needs. Access control solutions are traditionally implemented using a highly-available central authority to police access to resources. As we will see, ABE allows us to avoid this single point of failure. As there are no large scale ABE-based access control systems we are aware of, we hope our discussion can help engineers consider using ABE as an alternative to access control with minimal reliance on a centralized authority. To facilitate this, we’ve included our implementation of ABE in CIRCL, our open source cryptographic library.

Unsatisfactory attempts at a solution

Before coming back to GeoV2, let’s take a little detour and examine the problem we’re trying to solve.

Consider this example: a large European bank wants to store their TLS private keys only within the EU. This bank is a customer of Cloudflare, which means we perform TLS handshakes on their behalf. The reason we need to terminate TLS for them is so that we can provide the best protection against DDoS attacks, improve performance by caching, support web application firewalls, etc.

In order to terminate TLS, we need to have access to their TLS private keys1. The control plane, which handles API traffic, encrypts the customer’s uploaded private key with a master public key shared amongst all machines globally. It then puts the key into a globally distributed KV store, Quicksilver. This means every machine in every data center around the world has a local copy of this customer’s TLS private key. Consequently, every machine in each data center has a copy of every customer’s private key.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Customer uploading their TLS certificate and private key to be stored in all data centers

This bank however, wants its key to be stored only in EU data centers. In order to allow this to happen, we have three options.

The first option is to ensure that only EU data centers can receive this key and terminate the handshake. All other machines proxy TLS requests to an EU server for processing. This would require giving each machine only a subset of the entire keyset stored in Quicksilver, which challenges core design decisions Cloudflare has made over the years that assume the entire dataset is replicated on every machine.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Restricting customer keys to EU data centers

Another option is to store the keys in the core data center instead of Quicksilver. This would allow us to enforce the proper access control policy every time, ensuring that only certain machines can access certain keys. However, this would defeat the purpose of having a global network in the first place: to reduce latency and avoid a single point of failure at the core.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Storing keys in core data center where complicated business logic runs to enforce policies

A third option is to use public key cryptography. Instead of having a master key pair, every data center is issued its own key pair. The core encrypts the customer’s private key with the keys of every data center allowed to use it. Only machines in the EU will be able to access the key in this example. Let’s assume there are 500 data centers, with 50 machines each. Of these 500 data centers, let’s say 200 are in the EU. Where 100 keys of 1kB consumed a total of 100 x 500 x 50 x 1 kB (globally), now they will consume 200 times that, and in the worst case, up to 500 times. This increases the space it takes to store the keys on each machine by a whole new factor – before, the storage space was purely a function of how many customer keys are registered; now, the storage space is still a function of the number of customer keys, but also multiplied by the number of data centers.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Assigning unique keys to each data center and wrapping customer key with EU data center keys

Unfortunately, all three of these options are undesirable in their own ways. They would either require changing fundamental assumptions we made about the architecture of Cloudflare, abandoning the advantages of using a highly distributed network, or quadratically increasing the storage this feature uses.

A deeper look at the third option reveals – why not create two key pairs instead of a unique one for each data center? One pair would be common among all EU data centers, and one for all non-EU data centers. This way, the core only needs to encrypt the customer’s key twice instead of for each EU data center. This is a good solution for the EU bank, but it doesn’t scale once we start adding additional policies. Consider the example: a data center in New York City could have a key for the policy “country: US”, another one for “country: US or region: EU”, another one for “not country: RU”, and so on… You can already see this getting rather unwieldy. And every time a new data center is provisioned, all policies must be re-evaluated and the appropriate keys assigned.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
A key for each policy and its negation

Geo Key Manager v1: identity-based encryption and broadcast encryption

The invention of RSA in 1978 kicked off the era of modern public key cryptography, but anyone who has used GPG or is involved with certificate authorities can attest to the difficulty of managing public key infrastructure that connects keys to user identities. In 1984, Shamir asked if it was possible to create a public-key encryption system where the public key could be any string. His motivation for this question was to simplify email management. Instead of encrypting an email to Bob using Bob’s public key, Alice could encrypt it to Bob’s identity [email protected]. Finally, in 2001, Boneh and Franklin figured out how to make it work.

Broadcast encryption was first proposed in 1993 by Fiat and Naor. It lets you send the same encrypted message to everyone, but only people with the right key can decrypt it. Looking back to our third option, instead of wrapping the customer’s key with the key of every EU data center, we could use broadcast encryption to create a singular encryption of the customer’s key that only EU-based data centers could decrypt. This would solve the storage problem.

Geo Key Manager v1 used a combination of identity-based broadcast encryption and identity-based revocation to implement access control. Briefly, a set of identities is designated for each region and each data center location. Then, each machine is issued an identity-based private key for its region and location. With this in place, access to the customer’s key can be controlled using three sets: the set of regions to encrypt to, the set of locations inside the region to exclude, and the set of locations outside the region to include. For example, the customer’s key could be encrypted so that it is available in all regions except for a few specific locations, and also available in a few locations outside those regions. This blog post has all the nitty-gritty details of this approach.

Unfortunately this scheme was insufficiently responsive to customer needs; the parameters used during initial cryptographic setup, such as the list of regions, data centers, and their attributes, were baked into the system and could not be easily changed. Tough luck excluding the UK from the EU region post Brexit, or supporting a new region based on a recent compliance standard that customers need. Using a predetermined static list of locations also made it difficult to quickly revoke machine access. Additionally, decryption keys could not be assigned to new data centers provisioned after setup, preventing them from speeding up requests. These limitations provided the impetus for integrating Attribute-Based Encryption (ABE) into Geo Key Manager.

Attribute-Based Encryption

In 2004, Amit Sahai and Brent Waters proposed a new cryptosystem based on access policies, known as attribute-based encryption (ABE). Essentially, a message is encrypted under an access policy rather than an identity. Users are issued a private key based on their attributes, and they can only decrypt the message if their attributes satisfy the policy. This allows for more flexible and fine-grained access control than traditional methods of encryption.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Brief timeline of Public Key Encryption

The policy can be attached either to the key or to the ciphertext, leading to two variants of ABE: key-policy attribute-based encryption (KP-ABE) and ciphertext-policy attribute-based encryption (CP-ABE). There exist trade-offs between them, but they are functionally equivalent as they are duals of each other. Let’s focus on CP-ABE it aligns more closely with real-world access control. Imagine a hospital where a doctor has the attributes “role: doctor” and “region: US”, while a nurse has the attributes “role: nurse” and “region: EU”. A document encrypted under the policy “role: doctor or region: EU” can be decrypted by both the doctor and nurse. In other words, ABE is like a magical lock that only opens for people who have the right attributes.

Policy Semantics
country: US or region: EU Decryption is possible either in the US or in the European Union
not (country: RU or country: US) Decryption is not possible in Russia and US
country: US and security: high Decryption is possible only in data centers within the US that have a high level of security (for some security definition established previously)

There are many different ABE schemes out there, with varying properties. The scheme we choose must satisfy a few requirements:

  1. Negation We want to be able to support boolean formulas consisting of AND, OR and NOT, aka non-monotonic boolean formulas. While practically every scheme handles AND and OR, NOT is rarer to find. Negation makes blocklisting certain countries or machines easier.
  2. Repeated Attributes Consider the policy “organization: executive or (organization: weapons and clearance: top-secret)”. The attribute “organization” has been repeated twice in the policy. Schemes with support for repetition add significant expressibility and flexibility when composing policies.
  3. Security against Chosen Ciphertext Attacks Most schemes are presented in a form that is only secure if the attacker doesn’t choose the messages to decrypt (CPA). There are standard ways to convert such a scheme into one that is secure even if the attacker manipulates ciphertexts (CCA), but it isn’t automatic. We apply the well-known Boneh-Katz transform to our chosen scheme to make it secure against this class of attacks. We will present a proof of security for the end to end scheme in our forthcoming paper.

Negation in particular deserves further comment. For an attribute to be satisfied when negated, the name must stay the same, but the value must differ. It’s like the data center is saying, “I have a country, but it’s definitely not Japan”, instead of “I don’t have a country”. This might seem counterintuitive, but it enables decryption without needing to examine every attribute value. It also makes it safe to roll out attributes incrementally. Based on these criteria, we ended up choosing the scheme by Tomida et al (2021).

Implementing a complex cryptographic scheme such as this can be quite challenging. The discrete log assumption that underlies traditional public key cryptography is not sufficient to meet the security requirements of ABE. ABE schemes must secure both ciphertexts and the attribute-based secret keys, whereas traditional public key cryptography only imposes security constraints on the ciphertexts, while the secret key is merely an integer. To achieve this, most ABE schemes are constructed using a mathematical operation known as bilinear pairings.

The speed at which we can perform pairing operations determines the baseline performance of our implementation. Their efficiency is particularly desirable during decryption, where they are used to combine the attribute-based secret key with the ciphertext in order to recover the plaintext. To this end, we rely on our highly optimized pairing implementations in our open source library of cryptographic suites, CIRCL, which we discuss at length in a previous blog. Additionally, the various keys, attributes and the ciphertext that embeds the access structure are expressed as matrices and vectors. We wrote linear algebra routines to handle matrix operations such as multiplication, transpose, inverse that are necessary to manipulate the structures as needed. We also added serialization, extensive testing and benchmarking. Finally, we implemented our conversion to a CCA2 secure scheme.

In addition to the core cryptography, we had to decide how to express and represent policies. Ultimately we decided on using strings for our API. While perhaps less convenient for programs than structures would be, users of our scheme would have to implement a parser anyway. Having us do it for them seemed like a way to have a more stable interface. This means the frontend of our policy language was composed of boolean expressions as strings, such as “country: JP or (not region: EU)”, while the backend is a monotonic boolean circuit consisting of wires and gates. Monotonic boolean circuits only include AND and OR gates. In order to handle NOT gates, we assigned positive or negative values to the wires. Every NOT gate can be placed directly on a wire because of De Morgan’s Law, which allows the conversion of a formula like “not (X and Y)” into “not X or not Y”, and similarly for disjunction.

The following is a demonstration of the API. The central authority runs Setup to generate the master public key and master secret key. The master public key can be used by anyone to encrypt a message over an access policy. The master secret key, held by the central authority, is used to generate secret keys for users based on their attributes. Attributes themselves can be supplied out-of-band. In our case, we rely on the machine provisioning database to provide and validate attributes. These attribute-based secret keys are securely distributed to users, such as over TLS, and are used to decrypt ciphertexts. The API also includes helper functions to check decryption capabilities and extract policies from ciphertexts for improved usability.

publicKey, masterSecretKey := cpabe.Setup()

policy := cpabe.Policy{}
policy.FromString("country: US or region: EU")

ciphertext := publicKey.Encrypt(policy, []byte("secret message"))

attrsParisDC := cpabe.Attributes{}
attrsParisDC.FromMap(map[string]string{"country": "FR", "region": "EU"}

secretKeyParisDC := masterSecretKey.KeyGen(attrsParisDC)

plaintext := secretKeyParisDC.Decrypt(ciphertext)

assertEquals(plaintext, "secret message")

We now come back to our original example. This time, the central authority holds the master secret key. Each machine in every data center presents its set of attributes to the central authority, which, after some validation, generates a unique attribute-based secret key for that particular machine. Key issuance happens when a machine is first brought up, if keys must be rotated, or if an attribute has changed, but never in the critical path of a TLS handshake. This solution is also collusion resistant, which means two machines without the appropriate attributes cannot combine their keys to decrypt a secret that they individually could not decrypt. For example, a machine with the attribute  “country: US” and another with “security: high”. These machines cannot collude together to decrypt a resource with the policy “country: US and security: high”.

Crucially, this solution can seamlessly scale and respond to changes to machines. If a new machine is added, the central authority can simply issue it a secret key since the participants of the scheme don’t have to be predetermined at setup, unlike our previous identity-broadcast scheme.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Key Distribution

When a customer uploads their TLS certificate, they can specify a policy, and the central authority will encrypt their private key with the master public key under the specified policy. The encrypted customer key then gets written to Quicksilver, to be distributed to all data centers. In practice, there is a layer of indirection here that we will discuss in a later section.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Encryption using Master Public Key

When a user visits the customer’s website, the TLS termination service at the data center that first receives the request, fetches the customer’s encrypted private key from Quicksilver. If the service’s attributes do not satisfy the policy, decryption fails and the request is proxied to the closest data center that satisfies the policy. Whichever data center can successfully decrypt the key performs the signature to complete the TLS handshake.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Decryption using Attribute-based Secret Key (Simplified)

The following table summarizes the pros and cons of the various solutions we discussed:

Solution Flexible policies Fault Tolerant Efficient Space Low Latency Collusion-resistant Changes to machines
Different copies of Quicksilver in data centers
Complicated Business Logic in Core
Encrypt customer keys with each data center’s unique keypair
Encrypt customer keys with a policy-based keypair, where each data center has multiple policy-based keypairs
Identity-Based Broadcast Encryption + Identity-Based Negative Broadcast Encryption(Geo Key Manager v1)
Attribute-Based Encryption(Geo Key Manager v2)

Performance characteristics

We characterize our scheme’s performance on measures inspired by ECRYPT. We set the attribute size to 50, which is significantly higher than necessary for most applications, but serves as a worst case scenario for benchmarking purposes. We conduct our measurements on a laptop with Intel Core i7-10610U CPU @ 1.80GHz and compare the results against RSA with 2048-bit security, X25519 and our previous scheme.

Scheme Secret key(bytes) Public key(bytes) Overhead of encrypting 23 bytes
(ciphertext length – message length)
Overhead of encrypting 10k bytes
(ciphertext length – message length)
RSA-2048 1190 (PKCS#1) 256 233 3568
X25519 32 32 48 48
GeoV1 scheme 4838 4742 169 169
GeoV2 ABE scheme 33416 3282 19419 19419

Different attribute based encryption schemes optimize for different performance profiles. Some may have fast key generation, while others may prioritize fast decryption. In our case, we only care about fast decryption because it is the only part of the process that lies in the critical path of a request. Everything else happens out-of-band where the extra overhead is acceptable.

Scheme Generating keypair Encrypting 23 bytes Decrypting 23 bytes
RSA-2048 117 ms 0.043 ms 1.26 ms
X25519 0.045 ms 0.093 ms 0.046 ms
GeoV1 scheme 75 ms 10.7 ms 13.9 ms
GeoV2 ABE scheme 1796 ms 704 ms 62.4 ms

A Brief Note on Attribute-Based Access Control (ABAC)

We have used Attribute-Based Encryption to implement what is commonly known as Attribute-Based Access Control (ABAC).

ABAC is an extension of the more familiar Role-Based Access Control (RBAC). To understand why ABAC is relevant, let’s briefly discuss its origins. In 1970, the United States Department of Defense introduced Discretionary Access Control (DAC). DAC is how Unix file systems are implemented. But DAC isn’t enough if you want to restrict resharing, because the owner of the resource can grant other users permission to access it in ways that the central administrator does not agree with. To address this, the Department of Defense introduced Mandatory Access Control (MAC). DRM is a good example of MAC. Even though you have the file, you don’t have a right to share it to others.

RBAC is an implementation of certain aspects of MAC. ABAC is an extension of RBAC that was defined by NIST in 2017 to address the increasing characteristics of users that are not restricted to their roles, such as time of day, user agent, and so on.

However, RBAC/ABAC is simply a specification. While they are traditionally implemented using a central authority to police access to some resource, it doesn’t have to be so. Attribute-based encryption is an excellent mechanism to implement ABAC in distributed systems.

Key rotation

While it may be tempting to attribute all failures to DNS, changing keys is another strong contender in this race. Suffering through the rather manual and error-prone key rotation process of Geo Key Manager v1 taught us to make robust and simple key rotation without impact on availability, an explicit design goal for Geo Key Manager v2.

To facilitate key rotation and improve performance, we introduce a layer of indirection to the customer key wrapping (encryption) process. When a customer uploads their TLS private key, instead of encrypting with the Master Public Key, we generate a X25519 keypair, called the policy key. The central authority then adds the public part of this newly minted policy keypair and its associated policy label to a database. It then encrypts the private half of the policy keypair with the Master Public Key, over the associated access policy. The customer’s private key is encrypted with the public policy key, and saved into Quicksilver.

When a user accesses the customer’s website, the TLS termination service at the data center that receives the request fetches the encrypted policy key associated with the customer’s access policy. If the machine’s attributes don’t satisfy the policy, decryption fails and the request is forwarded to the closest satisfying data center. If decryption succeeds, the policy key is used to decrypt the customer’s private key and complete the handshake.

Key Purpose CA in core Core Network
Master Public Key Encrypts private policy keys over an access policy Generate Read
Master Secret Key Generates secret keys for machines based on their attributes Generate,Read
Machine Secret Key / Attribute-Based Secret Key Decrypts private policy keys stored in global KV store, Quicksilver Generate Read
Customer TLS Private Key Performs digital signature necessary to complete TLS handshake to the customer’s website Read (transiently on upload) Read
Public Policy Key Encrypts customers’ TLS private keys Generate,
Read
Private Policy Key Decrypts customer’s TLS private keys Read (transiently during key rotation) Generate Read

However, policy keys are not generated for every customer’s certificate upload. As shown in the figure below, if a customer requests a policy that already exists in the system and thus has an associated policy key, the policy key will get re-used. Since most customers use the same few policies, such as restricting to one country, or restricting to the EU, the number of policy keys is orders of magnitude smaller compared to the number of customer keys.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Policy Keys

This sharing of policy keys is tremendously useful for key rotation. When master keys are rotated (and consequently the machine secret keys), only the handful of policy keys used to control access to the customers’ keys need to be re-encrypted, rather than every customer’s key encryption. This reduces compute and bandwidth requirements. Additionally, caching policy keys at the TLS termination service improves performance by reducing the need for frequent decryptions in the critical path.

This is similar to hybrid encryption, where public key cryptography is used to establish a shared symmetric key, which then gets used to encrypt data. The difference here is that the policy keys are not symmetric, but rather X25519 keypairs, which is an asymmetric scheme based on elliptic curves. While not as fast as symmetric schemes like AES, traditional elliptic curve cryptography is significantly faster than attribute-based encryption. The advantage here is that the central service doesn’t need access to secret key material to encrypt customer keys.

The other component of robust key rotation involves maintaining multiple key versions.The latest key generation is used for encryption, but the latest and previous versions can be used for decryption. We use a system of states to manage key transitions and safe deletion of older keys. We also have extensive monitoring in place to alert us if any machines are not using the appropriate key generations.

The Tail At Scale

Geo Key Manager suffered from high tail latency, which occasionally impacted availability. Jeff Dean’s paper, The Tail at Scale, is an enlightening read on how even elevated p99 latency at Cloudflare scale can be damaging. Despite revamping the server and client components of our service, the p99 latency didn’t budge. These revamps, such as switching from worker pools to one goroutine per request, did simplify the service, as they removed thousands of lines of code. Distributed tracing was able to pin down the delays: they took place between the client sending a request and the server receiving it. But we could not dig in further. We even wrote a blog last year describing our debugging endeavors, but without a concrete solution.

Finally, we realized that there is a level of indirection between the client and the server. Our data centers around the world are very different sizes. To avoid swamping smaller data centers with connections, larger data centers would task individual, intermediary machines with proxying requests to other data centers using the Go net/rpc library.

Once we included the forwarding function on the intermediary server in the trace, the problem became clear. There was a long delay between issuing the request and processing it. Yet the code was merely a call to a built-in library function. Why was it delaying the request?

Ultimately we found that there was a lock held while the request was serialized. The net/rpc package does not support streams, but our packet-oriented custom application protocol, which we wrote before the advent of gRPC, does support streaming. To bridge this gap, we executed a request and waited for the response in the serialization function. While an expedient way to get the code written, it created a performance bottleneck as only one request could be forwarded at a time.

Our solution was to use channels for coordination, letting multiple requests execute while we waited for the responses to arrive. When we rolled it out we saw dramatic decreases in tail latency.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
The results of fixing RPC failures in remote colo in Australia

Unfortunately we cannot make the speed of light any faster (yet). Customers who want their keys kept only in the US while their website users are in the land down under will have to endure some delays as we make the trans-pacific voyage. But thanks to session tickets, those delays only affect new connections.

Uptime was also significantly improved. Data centers provisioned after cryptographic initiation could now participate in the system, which also implies that data centers that did not satisfy a certain policy had a broader range of satisfying neighbors to which they could forward the signing request to. This increased redundancy in the system, and particularly benefited data centers in regions without the best internet connectivity. The graph below represents successful probes spanning every machine globally over a two-day period. For GeoV1, we see websites with policies for US and EU regions falling to under 98% at one point, while for GeoV2, uptime rarely drops below 4 9s of availability.

Inside Geo Key Manager v2: re-imagining access control for distributed systems
Uptime by Key Profile across US and EU for GeoV1 and GeoV2, and IN for GeoV2

Conclusion

Congratulations dear reader for making it this far. Just like you, applied cryptography has come a long way, but only limited slivers manage to penetrate the barrier between research and real-world adoption. Bridging this gap can help enable novel capabilities for protecting sensitive data. Attribute-based encryption itself has become much more efficient and featureful over the past few years. We hope that this post encourages you to consider ABE for your own access control needs, particularly if you deal with distributed systems and don’t want to depend on a highly available central authority. We have open-sourced our implementation of CP-ABE in CIRCL, and plan on publishing a paper with additional details.

We look forward to the numerous product improvements to Geo Key Manager made possible by this new cryptographic foundation. We plan to use this ABE-based mechanism for storing not just private keys, but also other types of data. We are working on making it more user-friendly and generalizable for internal services to use.

Acknowledgements

We’d like to thank Watson Ladd for his contributions to this project during his tenure at Cloudflare.

……
1While true for most customers, we do offer Keyless SSL that allows customers who can run their own keyservers, the ability to store their private keys on-prem

A new, configurable and scalable version of Geo Key Manager, now available in Closed Beta

Post Syndicated from Dina Kozlov original https://blog.cloudflare.com/configurable-and-scalable-geo-key-manager-closed-beta/

A new, configurable and scalable version of Geo Key Manager, now available in Closed Beta

A new, configurable and scalable version of Geo Key Manager, now available in Closed Beta

Today, traffic on the Internet stays encrypted through the use of public and private keys that encrypt the data as it’s being transmitted. Cloudflare helps secure millions of websites by managing the encryption keys that keep this data protected. To provide lightning fast services, Cloudflare stores these keys on our fleet of data centers that spans more than 150 countries. However, some compliance regulations require that private keys are only stored in specific geographic locations.

In 2017, we introduced Geo Key Manager, a product that allows customers to store and manage the encryption keys for their domains in different geographic locations so that compliance regulations are met and that data remains secure. We launched the product a few months before General Data Protection Regulation (GDPR) went into effect and built it to support three regions: the US, the European Union (EU), and a set of our top tier data centers that employ the highest security measures. Since then, GDPR-like laws have quickly expanded and now, more than 15 countries have comparable data protection laws or regulations that include restrictions on data transfer across and/or data localization within a certain boundary.

At Cloudflare, we like to be prepared for the future. We want to give our customers tools that allow them to maintain compliance in this ever-changing environment. That’s why we’re excited to announce a new version of Geo Key Manager — one that allows customers to define boundaries by country, ”only store my private keys in India”, by a region ”only store my private keys in the European Union”, or by a standard, such as “only store my private keys in FIPS compliant data centers” — now available in Closed Beta, sign up here!

Learnings from Geo Key Manager v1

Geo Key Manager has been around for a few years now, and we’ve used this time to gather feedback from our customers. As the demand for a more flexible system grew, we decided to go back to the drawing board and create a new version of Geo Key Manager that would better meet our customers’ needs.

We initially launched Geo Key Manager with support for US, EU, and Highest Security Data centers. Those regions were sufficient at the time, but customers wrestling with data localization obligations in other jurisdictions need more flexibility when it comes to selecting countries and regions. Some customers want to be able to set restrictions to maintain their private keys in one country, some want the keys stored everywhere except in certain countries, and some may want to mix and match rules and say “store them in X and Y, but not in Z”. What we learned from our customers is that they need flexibility, something that will allow them to keep up with the ever-changing rules and policies — and that’s what we set out to build out.

The next issue we faced was scalability.  When we built the initial regions, we included a hard-coded list of data centers that met our criteria for the US, EU, “high security” data center regions.  However, this list was static because the underlying cryptography did not support dynamic changes to our list of data centers. In order to distribute private keys to new data centers that met our criteria, we would have had to completely overhaul the system. In addition to that, our network significantly expands every year, with more than 100 new data centers since the initial launch. That means that any new potential locations that could be used to store private keys are currently not in use, degrading the performance and reliability of customers using this feature.

With our current scale, automation and expansion is a must-have. Our new system needs to dynamically scale every time we onboard or remove a data center from our Network, without any human intervention or large overhaul.

Finally, one of our biggest learnings was that customers make mistakes, such as defining a region that’s so small that availability becomes a concern. Our job is to prevent our customers from making changes that we know will negatively impact them.

Define your own geo-restrictions with the new version of Geo Key Manager

A new, configurable and scalable version of Geo Key Manager, now available in Closed Beta

Cloudflare has significantly grown in the last few years and so has our international customer base. Customers need to keep their traffic regionalized. This region can be as broad as a continent — Asia, for example. Or, it can be a specific country, like Japan.

From our conversations with our customers, we’ve heard that they want to be able to define these regions themselves. This is why today we’re excited to announce that customers will be able to use Geo Key Manager to create what we call “policies”.

A policy can be a single country, defined by two-letter (ISO 3166) country code. It can be a region, such as “EU” for the European Union or Oceania. It can be a mix and match of the two, “country:US or region: EU”.

Our new policy based Geo Key Manager allows you to create allowlist or blocklists of countries and supported regions, giving you control over the boundary in which your private key will be stored. If you’d like to store your private keys globally and omit a few countries, you can do that.

If you would like to store your private keys in the EU and US, you would make the following API call:

curl -X POST "https://api.cloudflare.com/client/v4/zones/zone_id/custom_certificates" \
     -H "X-Auth-Email: [email protected]" \
     -H "X-Auth-Key: auth-key" \
     -H "Content-Type: application/json" \
     --data '{"certificate":"certificate","private_key":"private_key","policy":"(country: US) or (region: EU)", "type": "sni_custom"}'

If you would like to store your private keys in the EU, but not in France, here is how you can define that:

curl -X POST "https://api.cloudflare.com/client/v4/zones/zone_id/custom_certificates" \
     -H "X-Auth-Email: [email protected]" \
     -H "X-Auth-Key: auth-key" \
     -H "Content-Type: application/json" \
     --data '{"certificate":"certificate","private_key":"private_key","policy": "region: EU and (not country: FR)", "type": "sni_custom"}'

Geo Key Manager can now support more than 30 countries and regions. But that’s not all! The superpower of our Geo Key Manager technology is that it doesn’t actually have to be “geo” based, but instead, it’s attribute based. In the future, we’ll have a policy that will allow our customers to define where their private keys are stored based on a compliance standard like FedRAMP or ISO 27001.

Reliability, resiliency, and redundancy

By giving our customers the remote control for Geo Key Manager, we want to make sure that customers understand the impact of their changes on both redundancy and latency.

On the redundancy side, one of our biggest concerns is allowing customers to choose a region small enough that if a data center is removed for maintenance, for example, then availability is drastically impacted. To protect our customers, we’ve added redundancy restrictions. These prevent our customers from setting regions with too few data centers, ensuring that all the data centers within a policy can offer high availability and redundancy.

Not just that, but in the last few years, we’ve significantly improved the underlying networking that powers Geo Key Manager. For more information on how we did that, keep an eye out for a technical deep dive inside Geo Key Manager.

Performance matters

A new, configurable and scalable version of Geo Key Manager, now available in Closed Beta

With the original regions (US, EU, and Highest Security Data Centers), we learned customers may overlook possible latency impacts that occur when defining the key manager to a certain region. Imagine your keys are stored in the US. For your Asia-based customers, there’s going to be some latency impact for the requests that go around the world. Now, with customers being able to define more granular regions, we want to make sure that before customers make that change, they see the impact of it.

If you’re an E-Commerce platform then performance is always top-of-mind. One thing that we’re working on right now is performance metrics for Geo Key Manager policies both from a regional point of view — “what’s the latency impact for Asia based customers?” and from a global point of view — “for anyone in the world, what is the average impact of this policy?”.

By seeing the latency impact, if you see that the impact is unacceptable, you may want to create a separate domain for your service that’s specific to the region that it’s serving.

Closed Beta, now available!

Interested in trying out the latest version of Geo Key Manager? Fill out this form.

Coming soon!

Geo Key Manager is only available via API at the moment. But, we are working on creating an easy-to-use UI for it, so that customers can easily manage their policies and regions. In addition, we’ll surface performance measurements and warnings when we see any degraded impact in terms of performance or redundancy to ensure that customers are mindful when setting policies.

We’re also excited to extend our Geo Key Manager product beyond custom uploaded certificates. In the future, certificates issued through Advanced Certificate Manager or SSL for SaaS will be allowed to add policy based restrictions for the key storage.

Finally, we’re looking to add more default regions to make the selection process simple for our customers. If you have any regions that you’d like us to support, or just general feedback or feature requests related to Geo Key Manager, make a note of it on the form. We love hearing from our customers!

Geo Key Manager: Setting up a service for scale

Post Syndicated from Tanya Verma original https://blog.cloudflare.com/scaling-geo-key-manager/

Geo Key Manager: Setting up a service for scale

In 2017, we launched Geo Key Manager, a service that allows Cloudflare customers to choose where they store their TLS certificate private keys. For example, if a US customer only wants its private keys stored in US data centers, we can make that happen. When a user from Tokyo makes a request to this website or API, it first hits the Tokyo data center. As the Tokyo data center lacks access to the private key, it contacts a datacenter in the US to terminate the TLS request. Once the TLS session is established, the Tokyo datacenter can serve future requests. For a detailed description of how this works, refer to this post on Geo Key Manager.

This is a story about the evolution of systems in response to increase in scale and scope. Geo Key Manager started off as a small research project and, as it got used more and more, wasn’t scaling as well as we wanted it to. This post describes the challenges Geo Key Manager is facing today, particularly from a networking standpoint, and some of the steps along its way to a truly scalable service.

Geo Key Manager started out as a research project that leveraged two key innovations: Keyless SSL, an early Cloudflare innovation; and identity-based encryption and broadcast encryption, a relatively new cryptographic technique which can be used to construct identity-based access management schemes — in this case, identities based on geography. Keyless SSL was originally designed as a keyserver that customers would host on their own infrastructure, allowing them to retain complete ownership of their own private keys while still reaping the benefits of Cloudflare.

Eventually we started using Keyless SSL for Geo Key Manager requests, and later, all TLS terminations at Cloudflare were switched to an internal version of Keyless SSL. We made several tweaks to make the transition go more smoothly, but this meant that we were using Keyless SSL in ways we hadn’t originally intended.

With the increasing risks of balkanization of the Internet in response to geography-specific regulations like GDPR, demand for products like Geo Key Manager which enable users to retain geographical control of their information has surged. A lot of the work we do on the Research team is exciting because we get to apply cutting-edge advancements in the field to Cloudflare scale systems. It’s also fascinating to see projects be used in new and unexpected ways. But inevitably, many of our systems use technology that has never before been used at this scale which can trigger failures as we shall see.

A trans-pacific voyage

In late March, we started seeing an increase in TLS handshake failures in Melbourne, Australia. When we start observing failures in a specific data center, depending on the affected service, we have several options to mitigate impact. For critical systems like TLS termination, one approach is to reroute all traffic using anycast to neighboring data centers. But when we rerouted traffic away from the Melbourne data center, the same failures moved to the nearby data centers of Adelaide and Perth. As we were investigating the issue, we noticed a large spike in timeouts.

Geo Key Manager: Setting up a service for scale

The service that is first in line in the system that makes up TLS termination at Cloudflare performs all the parts of TLS termination that do not require access to the customers’ private keys. Since it is an Internet-facing process, we try to keep sensitive information out of this process as much as possible, in case of memory disclosure bugs. So it forwards the key signing request to keynotto, a service written in Rust that performs RSA and ECDSA key signatures.

We continued our investigation. This service was timing out on the requests it sent to keynotto. Next, we looked into the requests itself. We observed that there was an increase in total requests, but not by a very large factor. You can see a large drop in traffic in the Melbourne data center below. That indicates the time where we dropped traffic from Melbourne and rerouted it to other nearby data centers.

Geo Key Manager: Setting up a service for scale

We decided to track one of these timed-out requests using our distributed tracing infrastructure. The Jaeger UI allows you to chart traces that take the longest. Thanks to this, we quickly figured out that most of these failures were being caused by a single, new zone that was getting a large amount of API traffic. And interestingly, this zone also had Geo Key Manager enabled, with a policy set to US-only data centers.

Geo Key Manager routes requests to the closest data center that satisfies the policy. This happened to be a data center in San Francisco. That meant adding ~175ms to the actual time spent performing the key signing (median is 3 ms), to account for the trans-pacific voyage 🚢. So, it made sense that the remote key signatures were relatively slow, but what was causing TLS timeouts and degradation for unrelated zones with nothing to do with Geo Key Manager?

Below are graphs depicting the increase by quantile in RSA key signatures latencies in Melbourne. Plotting them all on the same graph didn’t work right with the scale, so the p50 is shown separately from p70 and p90.

Geo Key Manager: Setting up a service for scale
Geo Key Manager: Setting up a service for scale

To answer why unrelated zones had timeouts and performance degradation, we have to understand the architecture of the three services involved in terminating TLS.

Geo Key Manager: Setting up a service for scale

Life of a TLS request

TLS requests arrive at a data center routed by anycast. Each server runs the same stack of services, so each has its own instance of the initial service, keynotto, and gokeyless (discussed shortly). The first service has a worker pool of half the number of CPU cores. There’s 96 cores on each server, so 48 workers. Each of these workers creates their own connection to keynotto, which they use to send and receive the responses of key-signing requests. keynotto, at that point in time, could multiplex between all 48 of these connections because it spawned a new thread to handle each connection. However, it processed all requests on the same connection sequentially. Given that there could be dozens of requests per second on the same connection, if even a single one was slow, it would cause head of line blocking of all other requests enqueued after it. So long as most requests were short lived, this bug went unnoticed. But when a lot of traffic needed to be processed via Geo Key Manager, the head of line blocking created problems. This type of flaw is usually only exposed under heavy load or when load testing, and will make more sense after I introduce gokeyless and explain the history of keynotto.

gokeyless-internal, very imaginatively named, is the internal instance of our Keyless SSL keyserver written in Go. I’ll abbreviate it to gokeyless for the sake of simplicity. Before we introduced keynotto as the designated key signing process, the first service sent all key signing requests directly to gokeyless. gokeyless created worker pools based on the type of operation consisting of goroutines, a very lightweight thread managed by the Go runtime. The ones of interest are the RSA, ECDSA, and the remote worker pool. RSA and ECDSA are fairly obvious, these goroutines performed RSA key signatures and ECDSA key signatures. Any requests involving Geo Key Manager were placed in the remote pool. This prevented the network-dependent remote requests from affecting local key signatures. Worker pools were an artifact of a previous generation of Keyless which didn’t need to account for remote requests. Using benchmarks we had noticed that spinning up worker pools provided some marginal latency benefits. For local operations only, performance was optimal as the computation was very fast and CPU bound. However, when we started adding remote operations that could block the workers needed for performing local operations, we decided to create a new worker pool only for remote ops.

When gokeyless was designed in 2014, Rust was not a mature language. But that changed recently, and we decided to experiment with a minimal Rust proxy placed between the first service and gokeyless. This would handle RSA and ECDSA signatures, which were about 99% of all key signing operations, while handing off the more esoteric operations like Geo Key Manager over to gokeyless running locally on the same server. The hope was that we could eke out some performance gains from the tight runtime control afforded by Rust and the use of Rust’s cryptographic libraries. Performance is incredibly important to Cloudflare since CPU is one of the main limiting factors for edge servers. Go’s RSA is notorious for being slow and using a lot of CPU. Given that one in three handshakes use RSA, it is important to optimize it. CGo seemed to create unnecessary overhead, and there was no assembly-only RSA that we could use. We tried to speed up RSA without CGo and only using assembly and made some strides, but it was still a little non-optimal. So keynotto was built to take advantage of the fast RSA implementation in BoringSSL.

The next section is a quick diversion into keynotto that isn’t strictly related to the story, but who doesn’t enjoy a hot take on Go vs Rust?

Go vs Rust: The battle continues

While keynotto was initially deployed because it objectively lowered CPU, we weren’t sure what fraction of its benefits over gokeyless were related to different cryptography libraries used vs the choice of language. keynotto used BoringCrypto, the TLS library that is part of BoringSSL, whereas gokeyless used Go’s standard crypto library crypto/tls. And keynotto was implemented in Rust, while gokeyless in Go. However, recently during the process of FedRAMP certification, we had to switch the TLS library in gokeyless to use the same cryptography library as keynotto. This was because while the Go standard library doesn’t use a FIPS validated cryptography module, this Google-maintained branch of Go that uses BoringCrypto does. We then turned off keynotto in a few very large data centers for a couple days and compared this to the control group of having keynotto turned on. We found that moving gokeyless to BoringCrypto provided a very marginal benefit, forcing us to revisit our stance on CGo. This result meant we could attribute the difference to using Rust over Go and implementation differences between keynotto and gokeyless.

Turning off keynotto resulted in an average 26% increase in maximum memory consumed and 71% increase in maximum CPU consumed. In the case of all quantiles, we observed a significant increase in latency for both RSA and ECDSA. ECDSA as measured across four different large data centers (each is a different color) is shown:

Geo Key Manager: Setting up a service for scale

Remote ops don’t play fair

The first service, keynotto and gokeyless communicate over TCP via the custom Keyless protocol. This protocol was developed in 2014, and originally intended for communicating with external key servers aka gokeyless. After this protocol started being used internally, by increasing numbers of clients in several different languages, challenges started appearing. In particular, each client had to implement by hand the serialization code and make sense of all the features the protocol provided. The custom protocol also makes tracing more challenging to do right. So while older clients like the first service have tuned their implementations, for newer clients like keynotto, it can be difficult to consider every property such as the right way to propagate traces and such.

One such property that the Keyless protocol offers is multiplexing of requests within a connection. It does so by provisioning unique IDs for each request, which allows for out-of-order delivery of the responses similar to protocols like HTTP/2. While the first service and gokeyless leveraged this property to handle situations where the order of responses is different from the order of requests, keynotto, a much newer service, didn’t. This is why it had to process requests on the same connection sequentially. This led to local key signing requests — which took roughly 3ms — being blocked on the remote requests that took 60x that duration!

We now know why most TLS requests were degraded/dropped. But it’s also worth examining the remote requests themselves. The gokeyless remote worker pool had 200 workers in each server. One of the mitigations we applied was bumping that number to 2,000. Increasing the concurrency in a system when faced with problems that resemble resource constraints is an understandable thing to do. The reason not to do so is high resource utilization. Let’s get some context on why bumping up remote workers 10x wasn’t necessarily a problem. When gokeyless was created, it had separate pools with a configurable number of workers for RSA, ECDSA, and remote ops. RSA and ECDSA had a large fraction of workers, and remote was relatively smaller. Post keynotto, the need for the RSA and ECDSA pool was obviated since keynotto handled all local key signing operations.

Let’s do some napkin math now to prove that bumping it by 10x could not have possibly helped. We were receiving five thousand requests per second in addition to the usual traffic in Melbourne. 200 were remote requests. Melbourne has 60 servers. Looking at a breakdown of remote requests by server, the maximum remote requests one server had was ten requests per second (rps). The graph below shows the percentage of workers in gokeyless actually performing operations, “other” is the remote workers. We can see that while Melbourne worker utilization peaked at 47%, that is still not 100%. And we see that the utilization in San Francisco was much lower, so it couldn’t be SF that was slowing things down.

Geo Key Manager: Setting up a service for scale

Given that we had 200 workers per server and at most ten requests per second, even if they took 200 ms each, gokeyless was not the bottleneck. We had no reason to increase the number of remote workers. The culprit here was keynotto’s serial queue.

The solution here was to prevent remote signing ops from blocking local ops in keynotto. We did this by changing the internal task handling model of keynotto to not just handle each connection concurrently, but also to process each request on a connection concurrently. This was done by using a multi-producer, single-consumer queue. When a request from a connection was read, it was handed off to a new thread for processing while the connection thread went back to reading requests. When the request was done processing, the response was written to a channel. A write thread for each connection was also created, which polled the read side of the channel and wrote the response back to the connection.

The next thing we added were carefully chosen timeout values to keynotto. Timeouts allow for faster failure and faster cleanup of resources associated with the timed out request. Choosing appropriate timeouts can be tricky, especially when composed across multiple services. Since keynotto is downstream of the first service but upstream of gokeyless, its timeout should be smaller than the former but larger than the latter to allow upstream services to diagnose which process triggered the timeout. Timeouts can further be complicated when network latency is involved, because trans-pacific p99 latency is very different from p99 between two US states. Using the worst case network latency can be a fair compromise. We chose keynotto timeouts to be an order of magnitude larger than the p99 request latency, since the purpose was to prevent continued stall and resource consumption.

A more general solution to avoid the two issues outlined above would be to use gRPC. gRPC was released in 2015, which was one year too late to be used in Keyless at that time. It has several advantages over custom protocol implementations such as multi-language client libraries, improved tracing, easy timeouts, load balancing, protobuf for serialization and so on, but directly relevant here is multiplexing. gRPC supports multiplexing out of the box which makes it unnecessary to handle request/response multiplexing manually.

Tracking requests across the Atlantic

Then in early June, we started seeing TLS termination timeouts in Chicago. This time, there were timeouts across the first service, keynotto, and gokeyless. We had on our hands a zone with Geo Key Manager policy set to the EU that had suddenly started receiving around 80 thousand remote rps, which was significantly more than the 200 remote rps in Australia. This time, keynotto was not responsible since its head of line blocking issues had been fixed. It was timing out waiting for gokeyless to perform the remote key signatures. A quick glance at the maximum worker utilization of gokeyless for the entire datacenter revealed that it was maxed out. To counteract the issue, we increased the remote workers from 200 to 20,000, and then to 200,000. It wasn’t enough. Requests were still timing out and worker utilization was squarely at 100% until we rate limited the traffic to that zone.

Geo Key Manager: Setting up a service for scale

Why didn’t increasing the number of workers by a factor of 1,000 times help? We didn’t even have 250,000 rps for the entire datacenter, let alone every server.

File descriptors don’t scale

gokeyless maintains a map of ping latencies to each data center as measured from the data center that the instance is running in. It updates this map frequently and uses it to make decisions about which data center to route remote requests to. A long time ago, every server in every data center maintained its own connections to other data centers. This quickly grew out of hand and we started running out of file descriptors as the number of data centers grew. So, we switched to a new system of creating pods of 32 servers and shared the connections that each server needed to maintain with other data centers by the size of the pod. This was great for reducing the number of file descriptors being used by each server.

An example will illustrate this better. In a data center with 100 servers, there would be four pods: three with 32 servers each and one with four servers. If there were a total of 65 data centers in the world, then each server in a pod of 32 will be responsible for maintaining connections to two data centers, and each server in the pod of four will handle 16 servers. Remote requests are routed to the data center with the shortest latency. So in a pod of 32 servers, the server that maintains the connection with the closest data center will be responsible for sending remote requests from every member of the pod to the target data center. In case of failure or timeout, the data center which is second closest (and then the third closest) will be responsible. If all three closest data centers return failures, we give up and return a failure. So this means if a data center with 130 servers is receiving 80k remote rps, which is 615 rps/server, then there will be approximately four servers that are actually responsible for routing all remote traffic at any given time, and will each be handling ~20k rps. This was the case in Chicago. These requests were being routed to a data center in Hamburg, Germany and there could be a maximum of four connections at the same time between Chicago and Hamburg.

Geo Key Manager: Setting up a service for scale

Little’s Law

At that time, gokeyless used a worker pool architecture with a queue for buffering additional tasks. This queue blocked after 1,024 requests were queued, to create backpressure on the client. This is a common technique to prevent overloading — not accepting more tasks than the server knows it can handle. p50 RSA latency was 4 ms. The latency from Chicago to Hamburg was ~100 ms. After we bumped up the remote workers for each server to 200,000 in Chicago, we were surely not constrained on the sending end since we only had 80k rps for the whole data center and each request shouldn’t take more than 100 ms in the vast majority of cases. We know this because of Little’s Law. Little’s Law states that the capacity of a system is the arrival rate multiplied by the service time for each request. Let’s see how it applies here, and how it allowed us to prove why increasing concurrency or queue size did not help.

Consider the queue in the German data center, Hamburg in this case. We hadn’t bumped the number of remote workers there, so only 200 remote workers were available. Assuming 20k rps arrived to one server in Hamburg and each took ~5 ms to process, the necessary concurrency in the system should be 100 workers. We had 200 workers. Even without a queue, Hamburg should’ve been easily able to handle the 20k rps thrown at it. We investigated how many remote rps were actually processed per server in Hamburg:

Geo Key Manager: Setting up a service for scale

These numbers didn’t make any sense! The maximum number at any time was a little over 700. We should’ve been able to process orders of magnitude more requests. By the time we investigated this, the incident had ended. We had no visibility into the size of the service level queue — or the size of the TCP queue — to understand what could have caused the timeouts. We speculated that while p50 for the population of all RSA key signatures might be 4 ms, perhaps for the specific population of Geo Key Manager RSA signatures, things took a whole lot longer. With 200 workers, we processed ~500 rps, which meant each operation would have to take 400 ms. p99.9 for signatures can be around this much, so it is possible that was how long things took.

We recently ran a load test to establish an upper bound on remote rps, and discovered that one server in the US could process 7,000 remote rps to the EU — much lower than Little’s Law suggests. We identified several slow remote requests through tracing and noticed these usually corresponded with multiple tries to different remote hosts. The explanation for why these retries are necessary to get through to an otherwise very functional data center is that gokeyless uses a hardcoded list of data centers and corresponding server hostnames to establish connections to. Hostnames can change when new servers are added or removed. If we are attempting to connect to an invalid host and waiting on a one second timeout, we will certainly need to retry to another host, which can cause large increases in average processing time.

After the Hamburg outage, we decided to remove gokeyless worker pools and switch to a model of one goroutine per request. Goroutines are extremely lightweight, with benchmarks that suggest that a 4 GB machine can handle around one million goroutines. This was something that made sense to do after we had extended gokeyless to support remote requests in 2017, but because of the complexity involved in changing the entire architecture of processing requests, we held off on the change. Furthermore, when we launched Geo Key Manager, we didn’t have enough traffic that might have prompted us to rethink this architecture urgently. This removes complexity from the code, because tuning worker pool sizes and queue sizes can be complicated. We have also observed, on average, a 7x drop in the memory consumed on switching to this one goroutine per request model, because we no longer have idle goroutines active as was in the case of worker pools. It also makes it easier to trace the life of an individual request because you get to follow along all the steps in its execution, which can help when trying to chase down per request slowdowns.

Conclusion

The complexity of distributed systems can quickly scale up to unmanageable levels, making it important to have deep visibility into it. An extremely important tool to have is distributed tracing, which tracks a request through multiple services and provides information about time spent in each part of the journey. We didn’t propagate gokeyless trace spans through parts of the remote operation, which prevented us from identifying why Hamburg only processed 700 rps. Having examples on hand for various error cases can also make diagnosis easier, especially in the midst of an outage. While we load tested our TLS termination system in the common case, where key signatures happened locally, we hadn’t load tested the less common and lower volume use case of remote operations. Using sources of truth that update dynamically to reflect the current state of the world instead of static sources is also important. In our case, the list of data centers that gokeyless connected to for performing remote operation was hardcoded a long time ago and never updated. So while we added more data centers, gokeyless was unable to make use of them, and in some cases, may have been connecting to servers with invalid.

We’re now working on overhauling several pieces of Geo Key Manager to make it significantly more flexible and scalable, so think of this as setting up the stage for a future blog post where we finally solve some of the issues outlined here, and stay tuned for updates!