Preparing to Issue 200 Million Certificates in 24 Hours

Post Syndicated from Let's Encrypt - Free SSL/TLS Certificates original https://letsencrypt.org/2021/02/10/200m-certs-24hrs.html

On a normal day Let’s Encrypt issues nearly two million certificates. When we think about what essential infrastructure for the Internet needs to be prepared for though, we’re not thinking about normal days. We want to be prepared to respond as best we can to the most difficult situations that might arise. In some of the worst scenarios, we might want to re-issue all of our certificates in a 24 hour period in order to avoid widespread disruptions. That means being prepared to issue 200 million certificates in a day, something no publicly trusted CA has ever done.

We recently completed most of the work and investments needed to issue 200 million certificates in a day and we thought we’d let people know what was involved. All of this was made possible by our sponsors and funders, including major hardware contributions from Cisco, Thales, and Fortinet.

Scenarios

Security and compliance events are a certainty in our industry. We obviously try to minimize them, but since we anticipate they will happen, we spend a lot of time preparing to respond in the best ways possible.

In February of 2020 a bug affecting our compliance caused us to need to revoke and replace three million active certificates. That was approximately 2.6% of all active certificates.

What if that bug had affected all of our certificates? That’s more than 150 million certificates covering more than 240 million domains. What if it had also been a more serious bug, requiring us to revoke and replace all certificates within 24 hours? That’s the kind of worst case scenario we need to be prepared for.

To make matters worse, during an incident it takes some time to evaluate the problem and make decisions, so we would be starting revocation and reissuance somewhat after the beginning of the 24-hour clock. That means we would actually have less time once the momentous decision has been made to revoke and replace 150 million or more certificates.

Infrastructure Improvements

After reviewing our systems, we determined that there were four primary bottlenecks that would prevent us from replacing 200 million certificates in a day. Database performance, internal networking speed, cryptographic signing module (HSM) performance, and bandwidth.

Database Performance

Let’s Encrypt has a primary certificate authority database at the heart of the service we offer. This tracks the status of certificate issuance and the associated accounts. It is write heavy, with plenty of reads as well. At any given time a single database server is the writer, with some reads directed at identical machines with replicas. The single writer non-clustered strategy helps with consistency and reduces complexity, but it means that writes and some reads must operate within the performance constraints of a single machine.

Our previous generation of database servers had dual Intel Xeon E5-2650 v4 CPUs, 24 physical cores in total. They had 1TB memory with 24 3.8TB SSDs connected via SATA in a RAID 10 configuration. These worked fine for daily issuance but would not be able to handle replacing all of our certificates in a single day.

We have replaced them with a new generation of database server from Dell featuring dual AMD EPYC 7542 CPUs, 64 physical cores in total. These machines have 2TB of faster RAM. Much faster CPUs and double the memory is great, but the really interesting thing about these machines is that the EPYC CPUs provide 128 PCIe4 lanes each. This means we could pack in 24 6.4TB NVME drives for massive I/O performance. There is no viable hardware RAID for NVME, so we’ve switched to ZFS to provide the data protection we need.

Internal Networking

Let’s Encrypt infrastructure was originally built with 1G copper networking. Between bonding multiple connections for 2G performance, and use of the very limited number of 10G ports available on our switches, it served us pretty well until 2020.

By 2020 the volume of data we were moving around internally was too much to handle efficiently with 1G copper. Some normal operations were significantly slower than we’d like (e.g. backups, replicas), and during incidents the 1G network could cause significant delays in our response.

We originally looked into upgrading to 10G, but learned that upgrading to 25G fiber wasn’t much more expensive. Cisco ended up generously donating most of the switches and equipment we needed for this upgrade, and after replacing a lot of server NICs Let’s Encrypt is now running on a 25G fiber network!

Funny story – back in 2014 Cisco had donated really nice 10G fiber switches to use in building the original Let’s Encrypt infrastructure. Our cabinets at the time had an unusually short depth though, and the 10G switches didn’t physically fit. We had to return them for physically smaller 1G switches. 1G seemed like plenty at the time. We have since moved to new cabinets with standard depth!

HSM Performance

Each Let’s Encrypt data center has a pair of Luna HSMs that sign all certificates and their OCSP responses. If we want to revoke and reissue 200 million certificates, we need the Luna HSMs to perform the following cryptographic signing operations:

200 million OCSP response signatures for revocation
200 million certificate signatures for replacement certificates
200 million OCSP response signatures for the new certificates

That means we need to perform 600 million cryptographic signatures in 24 hours or less, with some performance overhead to account for request clustering.

We need to assume that during incidents we might only have one data center online, so when we’re thinking about HSM performance we’re thinking about what we can do with just one pair. Our previous HSMs could perform approximately 1,100 signing operations each per second, 2,200 between the pair. That’s 190,080,000 signatures in 24 hours, working at full capacity. It’s under half of what we would need, and take more time than we might have.

In order to get to where we need to be Thales generously donated new HSMs with about 10x the performance – approximately 10,000 signing operations per second, 20,000 between the pair. That means we can now perform 864,000,000 signing operations in 24 hours from a single data center.

Bandwidth

Issuing a certificate is not particularly bandwidth intensive, but during an incident we typically use a lot more bandwidth for systems recovery and analysis. We move large volumes of logs and other files between data centers and the cloud for analysis and forensic purposes. We may need to synchronize large databases. We create copies and additional back-ups. If the connections in our data centers are slow, it slows down our response.

After determining that data center connection speed could add significantly to our response time, we increased bandwidth accordingly. Fortinet helped provide hardware that helps us protect and manage these higher capacity connections.

API Extension

In order to get all those certificates replaced, we need an efficient and automated way to notify ACME clients that they should perform early renewal. Normally ACME clients renew their certificates when one third of their lifetime is remaining, and don’t contact our servers otherwise. We published a draft extension to ACME last year that describes a way for clients to regularly poll ACME servers to find out about early-renewal events. We plan to polish up that draft, implement, and collaborate with clients and large integrators to get it implemented on the client side.

Supporting Let’s Encrypt

We depend on contributions from our community of users and supporters in order to provide our services. If your company or organization would like to sponsor Let’s Encrypt please email us at [email protected]. We ask that you make an individual contribution if it is within your means.

Noise