Tag Archives: Kyoto-Tycoon

Moving Quicksilver into production

Post Syndicated from Geoffrey Plouviez original https://blog.cloudflare.com/moving-quicksilver-into-production/

Moving Quicksilver into production

One of the great arts of software engineering is making updates and improvements to working systems without taking them offline. For some systems this can be rather easy, spin up a new web server or load balancer, redirect traffic and you’re done. For other systems, such as the core distributed data store which keeps millions of websites online, it’s a bit more of a challenge.

Quicksilver is the data store responsible for storing and distributing the billions of KV pairs used to configure the millions of sites and Internet services which use Cloudflare. In a previous post, we discussed why it was built and what it was replacing. Building it, however, was only a small part of the challenge. We needed to deploy it to production into a network which was designed to be fault tolerant and in which downtime was unacceptable.

We needed a way to deploy our new service seamlessly, and to roll back that deploy should something go wrong. Ultimately many, many, things did go wrong, and every bit of failure tolerance put into the system proved to be worth its weight in gold because none of this was visible to customers.

The Bridge

Our goal in deploying Quicksilver was to run it in parallel with our then existing KV distribution tool, Kyoto Tycoon. Once we had evaluated its performance and scalability, we would gradually phase out reads and writes to Kyoto Tycoon, until only Quicksilver was handling production traffic. To make this possible we decided to build a bridge – QSKTBridge.

Moving Quicksilver into production
Image Credit: https://en.wikipedia.org/wiki/Bodie_Creek_Suspension_Bridge
Moving Quicksilver into production

Our bridge replicated from Kyoto Tycoon, sending write and delete commands to our Quicksilver root node. This bridge service was deployed on Kubernetes within our infrastructure, consumed the Kyoto Tycoon update feed, and wrote the batched changes every 500ms.

In the event of node failure or misconfiguration it was possible for multiple bridge nodes to be live simultaneously. To prevent duplicate entries we included a timestamp with each change. When writing into Quicksilver we used the Compare And Swap command to ensure that only one batch with a given timestamp would ever be written.

First Baby Steps In To Production

Gradually we began to point reads over a loopback interface from the Kyoto Tycoon port to the Quicksilver port. Fortunately we implemented the memcached and the KT HTTP protocol to make it transparent to the many client instances that connect to Kyoto Tycoon and Quicksilver.

We began with DOG, DOG is a virtual data center which serves traffic for Cloudflare employees only. It’s called DOG because we use it for dog-fooding. Testing releases on the dog-fooding data center helps prevent serious issues from ever reaching the rest of the world. Our DOG testing went suspiciously well, so we began to move forward with more of our data centers around the world. Little did we know this rollout would take more than a year!

Replication Saturation

To make life easier for SREs provisioning a new data center, we implemented a bootstrap mechanism that pulls an entire database from a remote server. This was an easy win because our datastore library – LMDB – provides the ability to copy an entire LMDB environment to a file descriptor. Quicksilver simply wrapped this code and sent a specific version of the DB to a network socket using its file descriptor.

When bootstrapping our Quicksilver DBs to more servers, Kyoto Tycoon was still serving production traffic in parallel. As we mentioned in the first blogpost, Kyoto Tycoon has an exclusive write lock that is sensitive to I/O. We were used to the problems this can cause such as write bursts slowing down reads. This time around we found the separate bootstrap process caused issues. It filled the page cache quickly, causing I/O saturation, which impacted all production services reading from Kyoto Tycoon.

There was no easy fix to that, to bootstrap Quicksilver DBs in a datacenter we’d have to wait until it was moved offline for maintenance, which pushed the schedule beyond our initial estimates. Once we had a data center deployment we could finally see metrics of Quicksilver running in the wild. It was important to validate that the performance was healthy before moving on to the next data center. This required coordination with the engineering teams that consume from Quicksilver.

The FL (front line) test candidate

Many requests reaching Cloudflare’s edge are forwarded to a component called FL, which acts as the “brain”. It uses multiple configuration values to decide what to do with the request. It might apply the WAF, pass the request to Workers and so on. FL requires a lot of QS access, so it was the perfect candidate to help validate Quicksilver performance in a real environment.

Here are some screenshots showing FL metrics. The vertical red line marks the switchover point, on the left of the line is Kyoto Tycoon, on the right is Quicksilver. The drop immediately before the line is due to the upgrade procedure itself. We saw an immediate and significant improvement in error rates and response time by migrating to Quicksilver.

Moving Quicksilver into production

The middle graph shows the error count. Green dot errors are non-critical errors: the first request to Kyoto Tycoon/Quicksilver failed so FL will try again. The red dot errors are the critical ones, they mean the second request also failed. In that case FL will return an error code to an end user. Both counts decreased substantially.

The bottom graph shows the read latency average from Kyoto Tycoon/Quicksilver. The average drops a lot and is usually around 500 microseconds and no longer reaches 1ms which it used to do quite often with Kyoto Tycoon.

Some readers may notice a few things. First of all, it was said earlier that the consumers were reading through the loopback interface. How come we experience errors there? How come we need 500us to read from Quicksilver?

At that time Cloudflare had some old SSDs and these could sometimes take hundreds of milliseconds to respond and this was affecting the response time from Quicksilver hence triggering read timeout from FL.

That was also obviously increasing the average response time above.

From there, some readers would ask why using averages and not percentiles? This screenshot dates back to 2017 and at that time the tool used for gathering FL metrics did not offer percentiles, only min, max and average.

Also, back then, all consumers were reading through TCP over loopback. Later on we switched to Unix socket.

Finally, we enabled memcached and KT HTTP malformed request detection. It appears that some consumers were sending us bogus requests and not monitoring the value returned properly, so they could not detect any error. We are now alerting on all bogus requests as this should never happen.

Spotting the wood for the trees

Removing Kyoto Tycoon on the edge took around a year. We’d done a lot of work already but rolling Quicksilver out to production services meant the project was just getting started. Thanks to the new metrics we put in place, we learned a lot about how things actually happen in the wild. And this let us come up with some ideas for improvement.

Our biggest surprise was that most of our replication delays were caused by saturated I/O and not network issues. This is something we could not easily see with Kyoto Tycoon. A server experiencing I/O contention could take seconds to write a batch to disk. It slowly became out of sync and then failed to propagate updates fast enough to its own clients, causing them to fall out of sync. The workaround was quite simple, if the latest received transaction log is over 30 seconds old, Quicksilver would disconnect and try the next server in the list of sources.

This was again an issue caused by aging SSDs. When a disk reaches that state it is very likely to be queued for a hardware replacement.

We quickly had to address another weakness in our codebase. A Quicksilver instance which was not currently replicating from a remote server would stop its own replication server, forcing clients to disconnect and reconnect somewhere else. The problem is this behavior has a cascading effect, if an intermediate server disconnects from its server to move somewhere else, it will disconnect all its clients, etc. The problem was compounded by an exponential backoff approach. In some parts of the world our replication topology has six layers, triggering the backoff many times caused significant and unnecessary replication lag. Ultimately the feature was fully removed.

Reads, writes, and sad SSDs

We discovered that we had been underestimating the frequency of I/O errors and their effect on performance. LMDB maps the database file into memory and when the kernel experiences an I/O error while accessing the SSD, it will terminate Quicksilver with a SIGBUS. We could see these in kernel logs. I/O errors are a sign of unhealthy disks and cause us to see spikes in Quicksilver’s read and write latency metrics. They can also cause spikes in the system load average, which affects all running services. When I/O errors occur, a check usually finds such disks are close to their maximum allowed write cycles, and further writes will cause a permanent failure.

Something we didn’t anticipate was LMDB’s interaction with our filesystem, causing high write amplification that increased I/O load and accelerated the wear and tear of our SSDs. For example, when Quicksilver wrote 1 megabyte, 30 megabytes were flushed to disk. This amplification is due to LMDB copy-on-write, writing a 2 byte key-value (KV) pair ends up copying the whole page. That was the price to pay for crashproof storage. We’ve seen amplification factors between 1.5X and 80X. All of this happens before any internal amplification that might happen in an SSD.

We also spotted that some producers kept writing the same KV pair over and over. Although we saw low rates of 50 writes per second, considering the amplification problem above, repetition increased pressure on SSDs. To fix this we just drop identical KV writes on the root node.

Where does a server replicate from?

Each small data center replicates from multiple bigger sites and these replicate from our two core data centers. To decide which data centers to use, we talk to the Network team, get an answer and hardcode the topology in Salt, the same place we configure all our infrastructure.

Each replication node is identified using IP addresses in our Salt configuration. This is quite a rigid way of doing things, wouldn’t DNS be easier? The interesting thing is that Quicksilver is a source of DNS for customers and Cloudflare infrastructure. We do not want Quicksilver to become dependent on itself, that could lead to problems.

While this somewhat rigid way of configuring topology works fine, as Cloudflare continues to scale there is room for improvement. Configuring Salt can be tricky and changes need to be thoroughly tested to avoid breaking things. It would be nicer for us to have a way to build the topology more dynamically, allowing us to more easily respond to any changes in network details and provision new data centers. That is something we are working on.

Removing Kyoto Tycoon from the Edge

Migrating services away from Kyoto Tycoon requires identifying them. There were some obvious candidates like FL and DNS which play a big part in day-to-day Cloudflare activity and do a lot of reads. But the tricky part to completely removing Kyoto Tycoon is in finding all the services that use it.

We informed engineering teams of our migration work but that was not enough. Some consumers get less attention because they were done during a research activity, or they run as part of a team’s side project. We added some logging code into Kyoto Tycoon so we could see who exactly was connecting and reached out directly to them. That’s a good way to catch the consumers that are permanently connected or connect often. However, some consumers only connect rarely and briefly, for example at startup time to load some configuration. So this takes time.

We also spotted some special cases that used exotic setups like using stunnel to read from the memcached interface remotely. That might have been a “quick-win” back in the day but its technical debt that has to be paid off; we now strictly enforce the methods that consumers use to access Quicksilver.

The migration activity was steady and methodical and it took a few months. We deployed Quicksilver per group of data centers and instance by instance, removing Kyoto Tycoon only when we deemed it ready. We needed to support two sets of configuration and alerting in parallel. That’s not ideal but it gave us confidence and ensured that customers were not disrupted. The only thing they might have noticed is the performance improvement!

Quicksilver, Core side, welcome to the mouth of technical debt madness

Once the edge migration was complete, it was time to take care of the core data centers. We just dealt with 200+ data centers so doing two should be easy right? Well no, the edge side of Quicksilver uses the reading interface, the core side is the writing interface so it is a totally different kind of job. Removing Kyoto Tycoon from the core was harder than the edge.

Many Kyoto Tycoon components were all running on a single physical server – the Kyoto Tycoon root node. This single point of failure had become a growing risk over the years. It had been responsible for incidents, for example it once completely disappeared due to faulty hardware. In that emergency we had to start all services on a backup server. The move to Quicksilver was a great opportunity to remove the Kyoto Tycoon root node server. This was a daunting task and almost all the engineering teams across the company got involved.

Note: Cloudflare’s migration from Marathon to Kubernetes was happening in parallel to this task so each time we refer to Kubernetes it is likely that the job was started on Marathon before being migrated later on.

Moving Quicksilver into production
The diagram above shows how all components below interact with each other.

From KTrest to QSrest

KTrest is a stateless service providing a REST interface for writing and deleting KV pairs from Kyoto Tycoon. Its only purpose is to enforce access control (ACL) so engineering teams do not experience accidental keyspace overlap. Of course, sometimes keyspaces are shared on purpose but it is not the norm.

The Quicksilver team took ownership of KTrest and migrated it from the root server to Kubernetes. We asked all teams to move their KTrest producers to Kubernetes as well. This went smoothly and did not require code changes in the services.

Our goal was to move people to Quicksilver, so we built QSrest, a KTrest compatible service for Quicksilver. Remember the problem about disk load? The QSKTBridge was in charge of batching: aggregating 500ms of updates before flushing to Quicksilver, to help our SSDs cope with our sync frequency. If we moved people to QSrest, it needed to support batching too. So it does. We used QSrest as an opportunity to implement write quotas. Once a quota is reached, all further writes from the same producer would be slowed down until the quota goes back to normal.

We found that not all teams actually used KTrest. Many teams were still writing directly to Kyoto Tycoon. One reason for this is that they were sending large range read requests in order to read thousands of keys in one shot – a feature that KTrest did not provide. People might ask why we would allow reads on a write interface. This is an unfortunate historical artefact and we wanted to put things right. Large range requests would hurt Quicksilver (we learned this from our experience with the bootstrap mechanism) so we added a cluster of servers which would serve these range requests: the Quicksilver consumer nodes. We asked the teams to switch their writes to KTrest and their reads to Quicksilver consumers.

Once the migration of the producers out of the root node was complete, we asked teams to start switching from KTrest to QSrest. Once the move was started, moving back would be really hard: our Kyoto Tycoon and Quicksilver DB were now inconsistent. We had to move over 50 producers, owned by teams in different time zones. So we did it one after another while carefully babysitting our servers to spot any early warning which could end up in an incident.

Once we were all done, we then started to look at Kyoto Tycoon transaction logs: is there something still writing to it? We found an obsolete heartbeat mechanism which was shut down.

Finally, Kyoto Tycoon root processes were turned off to never be turned on again.

Removing Kyoto Tycoon from all over Cloudflare ended up taking four years.

No hardware single point of failure

We have made great progress, Kyoto Tycoon is no longer among us, we smoothly moved all Cloudflare services to Quicksilver, our new KV store. And things now look more like this.

Moving Quicksilver into production

The job is still far from done though. The Quicksilver root server was still a single point of failure, so we decided to build Quicksilver Raft – a Raft-enabled root cluster using the etcd Raft package. It was harder than I originally thought. Adding Raft into Quicksilver required proper synchronization between the Raft snapshot and the current Quicksilver DB. Sometimes there are situations where Quicksilver needs to fall back to asynchronous replication to catch up before doing proper Raft synchronous writes. Getting all of this to work required deep changes in our code base, which also proved to be hard to build proper unit and integration tests. But removing single points of failure is worth it.

Introducing QSQSBridge

We like to test our components in an environment very close to production before releasing it. How do we test the core side/writer interface of Quicksilver then? Well the days of Kyoto Tycoon, we used to have a second Quicksilver root node feeding from Kyoto Tycoon through the KTQSbridge. In turn, some very small data centers were being fed from that root node. When Kyoto Tycoon was removed we lost that capability, limiting our testing. So we built QSQSbridge, which replicates from a Quicksilver instance and writes to a Quicksilver root node.

Removing the Quicksilver legacy top-main

With Quicksilver Raft and QSQSbridge in place, we tested three small data centers replicating from the test Raft cluster over the course of weeks. We then wanted to promote the high availability (HA) cluster and make the whole world replicate from it. In the same way we did for Kyoto Tycoon removal, we wanted to be able to roll back at any time if things go wrong.

Ultimately we reworked the QSQSbridge so that it builds compatible transaction logs with the feeding root node. That way, we started to move groups of data centers underneath the Raft cluster and we could, at any time, move them back to the legacy lop-main.

We started with this:

Moving Quicksilver into production

We moved our 10 QSrest instances one after another from the legacy root node to write against the HA cluster. We did this slowly without any significant issue and we were finally able to turn off the legacy top-main.This is what we ended up with:

Moving Quicksilver into production

And that was it, Quicksilver was running in the wild without a hardware single point of failure.

First signs of obsolescence

The Quicksilver bootstrap mechanism we built to make SRE life better worked by pulling a full copy of a database but it had pains. When bootstrapping begins, a long lived read transaction is opened on the server, it must keep the DB version requested intact while also applying new log entries. This causes two problems. First, if the read is interrupted by something like a network glitch the read transaction is closed and the version of the DB is not available anymore. The entire process has to start from scratch. Secondly, it causes fragmentation in the source DB, which requires somebody to go and compact it.

For the first six months, the databases were fairly small and everything worked fine. However, as our databases continued to grow, these two problems came to fruition. Longer transfers are at risk of network glitches and the problem compounds, especially for more remote data centers. Longer transfers mean more fragmentation and more time compacting it. Bootstrapping was becoming a pain.

One day an SRE reported than manually bootstrapping instances one after another was much faster than bootstrapping all together in parallel. We investigated and saw that when all Quicksilver instances were bootstrapping at the same time the kernel and SSD were overwhelmed swapping pages. There was page cache contention caused by the ratio of the combined size of the DBs vs. available physical memory.

The workaround was to serialize bootstrapping across instances. We did this by implementing a lock between Quicksilver processes, a client grabs the lock to bootstrap and gives it back once done, ready for the next client. The lock moves around in round robin fashion forever.

Moving Quicksilver into production

Unfortunately, we hit another page cache thrashing issue. We know that LMDB mostly does random I/O but we turned on readahead to load as much DB as possible into memory. But again as our DBs grew, the readahead started evicting useful pages. So we improved the performance by… disabling it.

Our biggest issue by far though is more fundamental: disk space and wear and tear! Each and every server in Cloudflare stores the same data. So a piece of data one megabyte in size consumes at least 10 gigabytes globally. To accommodate continued growth you can add more disks, replace dying disks, and use bigger disks. That’s not a very sustainable solution. It became clear that the Quicksilver design we came up with five years ago was becoming obsolete and needed a rethink. We approach that problem both in the short term and long term.

The long and short of it

In the long term we are now working on a sharded version of Quicksilver. This will avoid storing all data on all machines but guarantee the entire dataset is in each data center.

In the shorter term we have addressed our storage size problems with a few approaches.

Many engineering teams did not even know how many bytes of Quicksilver they used. Some might have had an impression that storage is free. We built a service named “QSusage” which uses the QSrest ACL to match KV with their owner and computes the total disk space used per producer, including Quicksilver metadata. This service allows teams to better understand their usage and pace of growth.

We also built a service name “QSanalytics” to help answer the question “How do I know which of my KV pairs are never used?”. It gathers all keys being accessed all over Cloudflare, aggregates these and pushes these into a ClickHouse cluster and we store these information over a 30 days rolling window. There’s no sampling here, we keep track of all read access. We can easily report back to engineering teams responsible for unused keys, they can consider if these can be deleted or must be kept.

Some of our issues are caused by the storage engine LMDB. So we started to look around and got quickly interested in RocksDB. It provides built in compression, online defragmentation and other features like prefix deduplication.

We ran some tests using representative Quicksilver data and they seemed to show that RocksDB could store the same data in 40% of the space compared to LMDB. Read latency was a little bit higher in some cases but nothing too serious. CPU usage was also higher, using around 150% of CPU time compared to LMDBs 70%. Write amplification appeared to be much lower, giving some relief to our SSD and the opportunity to ingest more data.

We decided to switch to RocksDB to help sustain our growing product and customer bases. In a follow-up blog we’ll do a deeper dive into performance comparisons and explain how we managed to switch the entire business without impacting customers at all.


Migrating from Kyoto Tycoon to Quicksilver was no mean feat. It required company-wide collaboration, time and patience to smoothly roll out a new system without impacting customers. We’ve removed single points of failure and improved the performance of database access, which helps to deliver a better experience.

In the process of moving to Quicksilver we learned a lot. Issues that were unknown at design time were surfaced by running software and services at scale. Furthermore, as the customer and product base grows and their requirements change, we’ve got new insight into what our KV store needs to do in the future. Quicksilver gives us a solid technical platform to do this.

Introducing Quicksilver: Configuration Distribution at Internet Scale

Post Syndicated from Geoffrey Plouviez original https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale/

Introducing Quicksilver: Configuration Distribution at Internet Scale

Cloudflare’s network processes more than fourteen million HTTP requests per second at peak for Internet users around the world. We spend a lot of time thinking about the tools we use to make those requests faster and more secure, but a secret-sauce which makes all of this possible is how we distribute configuration globally. Every time a user makes a change to their DNS, adds a Worker, or makes any of hundreds of other changes to their configuration, we distribute that change to 200 cities in 90 countries where we operate hardware. And we do that within seconds. The system that does this needs to not only be fast, but also impeccably reliable: more than 26 million Internet properties are depending on it. It also has had to scale dramatically as Cloudflare has grown over the past decade.

Historically, we built this system on top of the Kyoto Tycoon (KT) datastore. In the early days, it served us incredibly well. We contributed support for encrypted replication and wrote a foreign data wrapper for PostgreSQL. However, what worked for the first 25 cities was starting to show its age as we passed 100. In the summer of 2015 we decided to write a replacement from scratch. This is the story of how and why we outgrew KT, learned we needed something new, and built what was needed.

How KT Worked at Cloudflare

Where should traffic to example.com be directed to?

What is the current load balancing weight of the second origin of this website?

Which pages of this site should be stored in the cache?

These are all questions which can only be answered with configuration, provided by users and delivered to our machines around the world which serve Internet traffic. We are massively dependent on our ability to get configuration from our API to every machine around the world.

It was not acceptable for us to make Internet requests on demand to load this data however, the data had to live in every edge location. The architecture of the edge which serves requests is designed to be highly failure tolerant. Each data center must be able to successfully serve requests even if cut off from any source of central configuration or control.

Our first large-scale attempt to solve this problem relied on deploying Kyoto-Tycoon (KT) to thousands of machines. Our centralized web services would write values to a set of root nodes which would distribute the values to management nodes living in every data center. Each server would eventually get its own copy of the data from a management node in the data center in which it was located:

Data flow from the API to data centres and individual machines

Introducing Quicksilver: Configuration Distribution at Internet Scale

Doing at least one read from KT was on the critical path of virtually every Cloudflare service. Every DNS or HTTP request would send multiple requests to a KT store and each TLS handshake would load certificates from it. If KT was down, many of our services were down, if KT was slow, many of our services were slow. Having a service like KT was something of a superpower for us, making it possible for us to deploy new services and trust that configuration would be fast and reliable. But when it wasn’t, we had very big problems.

As we began to scale one of our first fixes was to shard KT into different instances. For example, we put Page Rules in a different KT than DNS records. In 2015, we used to operate eight such instances storing a total of 100 million key-value (KV) pairs, with about 200 KV values changed per second. Across our infrastructure, we were running tens of thousands of these KT processes. Kyoto Tycoon is, to say the least, very difficult to operate at this scale. It’s clear we pushed it past its limits and for what it was designed for.

To put that in context it’s valuable to look back to the description of KT provided by its creators:

[It] is a lightweight datastore server with auto expiration mechanism, which is useful to handle cache data and persistent data of various applications. (http://fallabs.com/kyototycoon/)

It seemed likely that we were not uncovering design failures of KT, but rather we were simply trying to get it to solve problems it was not designed for. Let’s take a deeper look at the issues we uncovered.

Exclusive write lock… or not?

To talk about the write lock it’s useful to start with another description provided by the KT documentation:

Functions of API are reentrant and available in multi-thread environment. Different database objects can be operated in parallel entirely. For simultaneous operations against the same database object, rwlock (reader-writer lock) is used for exclusion control. That is, while a writing thread is operating an object, other reading threads and writing threads are blocked. However, while a reading thread is operating an object, reading threads are not blocked. Locking granularity depends on data structures. The hash database uses record locking. The B+ tree database uses page locking.

On first glance this sounds great! There is no exclusive write lock over the entire DB. But there are no free lunches; as we scaled we started to detect poor performance when writing and reading from KT at the same time.

In the world of Cloudflare, each KT process replicated from a management node and was receiving from a few writes per second to a thousand writes per second. The same process would serve thousands of read requests per second as well. When heavy write bursts were happening we would notice an increase in the read latency from KT. This was affecting production traffic, resulting in slower responses than we expect of our edge.

Here are the percentiles for read latency of a script reading from KT the same 20 key/value pairs in an infinite loop. Each key and each value is 2 bytes.  We never update these key/value pairs so we always get the same value.

Without doing any writes, the read performance is somewhat acceptable even at high percentiles:

  • P99: 9ms
  • P99.9: 15ms

When we add a writer sequentially adding a 40kB value, however, things get worse. After running the same read performance test, our latency values have skyrocketed:

  • P99: 154ms
  • P99.9: 250ms

Adding 250ms of latency to a request through Cloudflare would never be acceptable. It gets even worse when we add a second writer, suddenly our latency at the 99.9th percentile (the 0.1% slowest reads) is over a second!

  • P99: 701ms
  • P99.9: 1215ms
Introducing Quicksilver: Configuration Distribution at Internet Scale

These numbers are concerning: writing more increases the read latency significantly. Given how many sites change their Cloudflare configuration every second, it was impossible for us to imagine a world where write loads would not be high. We had to track down the source of this poor performance. There must be either resource contention or some form of write locking, but where?

The Lock Hunt

After looking into the code the issue seems to be that when reading from KT, the function accept in the file kcplandb.h of Kyoto Cabinet acquires a lock:

Introducing Quicksilver: Configuration Distribution at Internet Scale

This lock is also acquired in the synchronize function of kcfile.cc in charge of flushing data to disk:

Introducing Quicksilver: Configuration Distribution at Internet Scale

This is where we have a problem. Flushing to disk blocks all reads and flushing is slow.

So in theory the storage engine can handle parallel requests but in reality we found at least one place where this is not true. Based on this and other experiments and code review we came to the conclusion that KT was simply not designed for concurrent access. Due to the exclusive write lock implementation of KT, I/O writes degraded read latency to unacceptable levels.

In the beginning, the occurrence of that issue was rare and not a top priority. But as our customer base grew at rocket speed, all related datasets grew at the same pace. The number of writes per day was increasing constantly and this contention started to have an unacceptable impact on performance.

As you can imagine our immediate fix was to do less writes. We were able to make small-scale changes to writing services to reduce their load, but this was quickly eclipsed by the growth of the company and the launch of new products. Before we knew it, our write levels were right back to where they began!

As a final step we disabled the fsync which KT was doing on each write. This meant KT would only flush to disk on shutdown, introducing potential data corruption which required its own tooling to detect and repair.

Unsynchronized Swimming

Continuing our theme of beginning with the KT documentation, it’s worth looking at how they discuss non-durable writes:

If an application process which opened a database terminated without closing the database, it involves risks that some records may be missing and the database may be broken. By default, durability is settled when the database is closed properly and it is not settled for each updating operation.

At Cloudflare scale, kernel panics or even processor bugs happen and unexpectedly kill services. By turning off syncing to improve performance we began to experience database corruption. KT comes with a mechanism to repair a broken DB which we used successfully at first. Sadly, on our largest databases, the rebuilding process took a very long time and in many cases would not complete at all. This created a massive operational problem for our SRE team.

Ultimately we turned off the auto-repair mechanism so KT would not start if the DB was broken and each time we lost a database we copied it from a healthy node. This syncing was being done manually by our SRE team. That team’s time is much better spent building systems and investigating problems; the manual work couldn’t continue.

Not syncing to disk caused another issue: KT had to flush the entire DB when it was being shut down. Again, this worked fine at the beginning, but with the DB getting bigger and bigger, the shut down time started to sometimes hit the systemd grace period and KT was terminated with a SIGKILL. This led to even more database corruption.

Because all of the KT DB instances were growing at the same pace, this issue went from minor to critical seemingly overnight. SREs wasted hours syncing DBs from healthy instances before we understood the problem and greatly increased the grace period provided by systemd.

We also experienced numerous random instances of database corruption. Too often KT was shut down cleanly without any error but when restarted the DB was corrupted and had to be restored. In the beginning, with 25 data centers, it happened rarely. Over the years we added thousands of new servers to Cloudflare infrastructure and it was occurring multiple times a day.

Writing More

Most of our writes are adding new KV pairs, not overwriting or deleting. We can see this in our key count growth:

  1. In 2015, we had around 100 million KV pairs
  2. In 2017, we passed 200 million
  3. In 2018, we passed 500 million
  4. In 2019, we exceeded 1 billion
Introducing Quicksilver: Configuration Distribution at Internet Scale

Unfortunately in a world where the quantity of data is always growing, it’s not realistic to think you will never flush to disk. As we write new keys the page cache quickly fills. When it’s full, it is flushed to disk. I/O saturation was leading to the very same contention problems we experienced previously.

Each time KT received a heavy write burst, we could see the read latency from KT increasing in our DC. At that point it was obvious to us that the KT DB locking implementation could no longer do the job for us with or without syncing. Storage wasn’t our only problem however, the other key function of KT and our configuration system is replication.

The Best Effort Replication Protocol

KT replication protocol is based solely on timestamp. If a transaction fails to replicate for any reason but it is not detected, the timestamp will continue to advance forever missing that entry.

How can we have missing log entries? KT replicates data by sending an ordered list of transaction logs. Each log entry details what change is being made to our configuration database. These logs are kept for a period of time, but are eventually ‘garbage collected’, with old entries removed.

Let’s think about a KT instance being down for days, it then restarts and asks for the transaction log from the last one it got. The management node receiving the request will send the nearest entries to this timestamp, but there could be missing transaction logs due to garbage collection. The client would not get all the updates it should and this is how we quietly end up with an inconsistent DB.

Another weakness we noticed happens when the timestamp file is being written. Here is a snippet of the file ktserver.cc where the client replication code is implemented:

Introducing Quicksilver: Configuration Distribution at Internet Scale

This code snippet runs the loop as long as replication is working fine. The timestamp file is only going to be written when the loop terminates. The call to write_rts (the function writing to disk the last applied transaction log) can be seen at the bottom of the screenshot.

If KT terminates unexpectedly in the middle of that loop, the timestamp file won’t be updated. When this KT restarts and if it successfully repairs the database it will replicate from the last value written to the rts file. KT could end up replaying days of transaction logs which were already applied to the DB and values written days ago could be made visible again to our services for some time before everything gets back up to date!

We also regularly experienced databases getting out of sync without any reason. Sometimes these caught up by themselves, sometimes they didn’t. We have never been able to properly identify the root cause of that issue. Distributed systems are hard, and distributed databases are brutal. They require extensive observability tooling to deploy properly which didn’t exist for KT.

Upgrading Kyoto Tycoon in Production

Multiple processes cannot access one database file at the same time. A database file is locked by reader-writer lock while a process is connected to it.

We release hundreds of software updates a day across our many engineering teams. However, we only very rarely deploy large-scale updates and upgrades to the underlying infrastructure which runs our code. This frequency has increased over time, but in 2015 we would do a “CDN Release” once per quarter.

To perform a CDN Release we reroute traffic from specific servers and take the time to fully upgrade the software running on those machines all the way down to the kernel. As this was only done once per quarter, it could take several months for an upgrade to a service like KT to make it to every machine.

Most Cloudflare services now implement a zero downtime upgrade mechanism where we can upgrade the service without dropping production traffic. With this we can release a new version of our web or DNS servers outside of a CDN release window, which allows our engineering teams to move much faster. Many of our services are now actually implemented with Cloudflare Workers which can be deployed even faster (using KT’s replacement!).

Unfortunately this was not the case in 2015 when KT was being scaled. Problematically, KT does not allow multiple processes to concurrently access the same database file so starting a new process while the previous one was still running was impossible. One idea was that we could stop KT, hold all incoming requests and start a new one. Unfortunately stopping KT would usually take over 15 minutes with no guarantee regarding the DB status.

Because stopping KT is very slow and only one KT process can access the DB, it was not possible to upgrade KT outside of a CDN release, locking us into that aging process.

High-ish Availability

One final quote from the KT documentation:

Kyoto Tycoon supports “dual master” replication topology which realizes higher availability. It means that two servers replicate each other so that you don’t have to restart the survivor when one of them crashed.


Note that updating both of the servers at the same time might cause inconsistency of their databases. That is, you should use one master as a “active master” and the other as a “standby master”.

Said in other words: When dual master is enabled all writes should always go to the same root node and a switch should be performed manually to promote the standby master when the root node dies.

Unfortunately that violates our principles of high availability. With no capability for automatic zero-downtime failover it wasn’t possible to handle the failure of the KT top root node without some amount of configuration propagation delay.

Building Quicksilver

Addressing these issues in Kyoto Tycoon wasn’t deemed feasible. The project had no maintainer, the last official update being from April 2012, and was composed of a code base of 100k lines of C++. We looked at alternative open source systems at the time, none of which fit our use case well.

Our KT implementation suffered from some fundamental limitations:

  1. No high availability
  2. Weak replication protocol
  3. Exclusive write lock
  4. Not zero downtime upgrade friendly

It was also unreliable, critical replication and database functionality would break quite often. At some point, keeping KT up in running at Cloudflare was consuming 48 hours of SRE time per week.

We decided to build our own replicated key value store tailored for our needs and we called it Quicksilver. As of today Quicksilver powers an average of 2.5 trillion reads each day with an average latency in microseconds.

Fun fact: the name Quicksilver was picked by John Graham-Cumming, Cloudflare’s CTO. The terrible secret that only very few members of the humankind know is that he originally named it “Velocireplicator”. It is a secret though. Don’t tell anyone. Thank you.

Storage Engine

One major complication with our legacy system, KT, was the difficulty of bootstrapping new machines. Replication is a slow way to populate an empty database, it’s much more efficient to be able to instantiate a new machine from a snapshot containing most of the data, and then only use replication to keep it up to date. Unfortunately KT required nodes to be shut down before they could be snapshotted, making this challenging. One requirement for Quicksilver then was to use a storage engine which could provide running snapshots. Even further, as Quicksilver is performance critical, a snapshot must also not have a negative impact on other services that read from Quicksilver. With this requirement in mind we settled on a datastore library called LMDB after extensive analysis of different options.LMDB’s design makes taking consistent snapshots easy. LMDB is also optimized for low read latency rather than write throughput. This is important since we serve tens of millions of reads per second across thousands of machines, but only change values relatively infrequently. In fact, systems that switched from KT to Quicksilver saw drastically reduced read response times, especially on heavily loaded machines. For example, for our DNS service, the 99th percentile of reads dropped by two orders of magnitude!

LMDB also allows multiple processes to concurrently access the same datastore. This is very useful for implementing zero downtime upgrades for Quicksilver: we can start the new version while still serving current requests with the old version. Many data stores implement an exclusive write lock which requires only a single user to write at a time, or even worse, restricts reads while a write is conducted. LMDB does not implement any such lock.

LMDB is also append-only, meaning it only writes new data, it doesn’t overwrite existing data. Beyond that, nothing is ever written to disk in a state which could be considered corrupted. This makes it crash-proof, after any termination it can immediately be restarted without issue. This means it does not require any type of crash recovery tooling.

Transaction Logs

LMDB does a great job of allowing us to query Quicksilver from each of our edge servers, but it alone doesn’t give us a distributed database. We also needed to develop a way to distribute the changes made to customer configurations into the thousands of instances of LMDB we now have around the world. We quickly settled on a fan-out type distribution where nodes would query master-nodes, who would in turn query top-masters, for the latest updates.

Introducing Quicksilver: Configuration Distribution at Internet Scale

Unfortunately there is no such thing as a perfectly reliable network or system. It is easy for a network to become disconnected or a machine to go down just long enough to miss critical replication updates. Conversely though, when users make changes to their Cloudflare configuration it is critical that they propagate accurately whatever the condition of the network. To ensure this, we used one of the oldest tricks in the book and included a monotonically increasing sequence number in our Quicksilver protocol:

< 0023 SET “hello” “world”
< 0024 SET “lorem” “ipsum”
< 0025 DEL “42”…

It is now easily possible to detect whether an update was lost, by comparing the sequence number and making sure it is exactly one higher than the last message we have seen. The astute reader will notice that this is simply a log. This process is pleasantly simple because our system does not need to support global writes, only reads. As writes are relatively infrequent and it is easy for us to elect a single data center to aggregate writes and ensure the monotonicity of our counter.

One of the most common failure modes of a distributed system are configuration errors. An analysis of how things could go wrong led us to realize that we had missed a simple failure case: since we are running separate Quicksilver instances for different kinds of data, we could corrupt a database by misconfiguring it. For example, nothing would prevent the DNS database from being updated with changes for the Page Rules database. The solution was to add unique IDs for each database and to require these IDs when initiating the replication protocol.

Through our experience with our legacy system, KT, we knew that replication does not always scale as easily as we would like. With KT it was common for us to saturate IO on our machines, slowing down reads as we tried to replay the replication log. To solve this with Quicksilver we decided to engineer a batch mode where many updates can be combined into a single write, allowing them all to be committed to disk at once. This significantly improved replication performance by reducing the number of disk writes we have to make. Today, we are batching all updates which occur in a 500ms window, and this has made highly-durable writes manageable.

Where should we store the transaction logs? For design simplicity we decided to store these within a different bucket of our LMDB database. By doing this, we can commit the transaction log and the update to the database in one shot. Originally the log was kept in a separate file, but storing it with the database simplifies the code.

Unfortunately this came at a cost: fragmentation. LMDB does not naturally fragment values, it needs to store every value in a sequential region of the disk. Eventually the disk begins to fill and the large regions which offer enough space to fit a particularly big value start to become hard to find. As our disks begin to fill up, it can take minutes for LMDB to find enough space to store a large value. Unfortunately we exacerbated this problem by storing the transaction log within the database.

This fragmentation issue was not only causing high write latency, it was also making the databases grow very quickly. When the reasonably-sized free spaces between values start to become filled, less and less of the disk becomes usable. Eventually the only free space on disk is too small to store any of the actual values we can store. If all of this space were compacted into a single region, however, there would be plenty of space available.

The compaction process requires rewriting an entire DB from scratch. This is something we do after bringing data centers offline and its benefits last for around 2 months, but it is far from a perfect solution. To do better we began fragmenting the transaction log into page-sized chunks in our code to improve the write performance. Eventually we will also split large values into chunks that are small enough for LMDB to happily manage, and we will handle assembling these chunks in our code into the actual values to be returned.

We also implemented a key-value level CRC. The checksum is written when the transaction log is applied to the DB and checks the KV pair is read. This checksum makes it possible to quickly identify and alert on any bugs in the fragmentation code. Within the QS team we are usually against this kind of defensive measure, we prefer focusing on code quality instead of defending against the consequences of bugs, but database consistency is so critical to the company that even we couldn’t argue against playing defense.

LMDB stability has been exceptional. It has been running in production for over three years. We have experienced only a single bug and zero data corruption. Considering we serve over 2.5 trillion read requests and 30 million write requests a day on over 90,000 database instances across thousands of servers, this is very impressive.


Transaction logs are a critical part of our replication system, but each log entry ends up being significantly larger than the size of the values it represents. To prevent our disk space from being overwhelmed we use Snappy to compress entries. We also periodically garbage collect entries, only keeping the most recent required for replication.

For safety purposes we also added an incremental hash within our transaction logs. The hash helps us to ensure that messages have not been lost or incorrectly ordered in the log.

One other potential misconfiguration which scares us is the possibility of a Quicksilver node connecting to, and attempting to replicate from, itself. To prevent this we added a randomly generated process ID which is also exchanged in the handshake:

type ClientHandshake struct {
	// Unique ID of the client. Verifies that client and master have distinct IDs. Defaults to a per-process unique ID.
	ID uuid.ID
	// ID of the database to connect to. Verifies that client and master have the same ID. Disabled by default.
	DBID uuid.ID
	// At which index to start replication. First log entry will have Index = LastIndex + 1.
	LastIndex logentry.Index
	// At which index to stop replication.
	StopIndex logentry.Index
	// Ignore LastIndex and retrieve only new log entries.
	NewestOnly bool
	// Called by Update when a new entry is applied to the db.
	Metrics func(*logentry.LogEntry)

Each Quicksilver instance has a list of primary servers and secondary servers. It will always try to replicate from a primary node which is often another QS node near it. There are a variety of reasons why this replication may not work, however. For example, if the target machine’s database is too old it will need a larger changeset than exists on the source machine. To handle this, our secondary masters store a significantly longer history to allow machines to be offline for a full week and still be correctly resynchronized on startup.


Building a system is often much easier than maintaining it. One challenge was being able to do weekly releases without stopping the service.

Fortunately the LMDB datastore supports multiple process reading and writing to the DB file simultaneously. We use Systemd to listen on incoming connections, and immediately hand the sockets over to our Quicksilver instance. When it’s time to upgrade we start our new instance, and pass the listening socket over to the new instance seamlessly.

We also control the clients used to query Quicksilver. By adding an automatic retry to requests we are able to ensure that momentary blips in availability don’t result in user-facing failures.

After years of experience maintaining this system we came to a surprising conclusion: handing off sockets is really neat, but it might involve more complexity than is warranted. A Quicksilver restart happens in single-digit milliseconds, making it more acceptable than we would have thought to allow connections to momentarily fail without any downstream effects. We are currently evaluating the wisdom of simplifying the update system, as in our experience simplicity is often the best true proxy for reliability.


It’s easy to overlook monitoring when designing a new system. We have learned however that a system is only as good as our ability to both know how well it is working, and our ability to debug issues as they arise. Quicksilver is as mission-critical as anything possibly can be at Cloudflare, it is worth the effort to ensure we can keep it running.

We use Prometheus for collecting our metrics and we use Grafana to monitor Quicksilver. Our SRE team uses a global dashboard, one dashboard for each datacenter, and one dashboard per server, to monitor its performance and availability. Our primary alerting is also driven by Prometheus and distributed using PagerDuty.

We have learned that detecting availability is rather easy, if Quicksilver isn’t available countless alerts will fire in many systems throughout Cloudflare. Detecting replication lag is more tricky, as systems will appear to continue working until it is discovered that changes aren’t taking effect. We monitor our replication lag by writing a heartbeat at the top of the replication tree and computing the time difference on each server.


Quicksilver is, on one level, an infrastructure tool. Ideally no one, not even most of the engineers who work here at Cloudflare, should have to think twice about it. On another level, the ability to distribute configuration changes in seconds is one of our greatest strengths as a company. It makes using Cloudflare enjoyable and powerful for our users, and it becomes a key advantage for every product we build. This is the beauty and art of infrastructure: building something which is simple enough to make everything built on top of it more powerful, more predictable, and more reliable.

We are planning on open sourcing Quicksilver in the near future and hope it serves you as well as it has served us. If you’re interested in working with us on this and other projects, please take a look at our jobs page.