Tag Archives: postgres

Migrating billions of records: moving our active DNS database while it’s in use

Post Syndicated from Alex Fattouche original https://blog.cloudflare.com/migrating-billions-of-records-moving-our-active-dns-database-while-in-use

According to a survey done by W3Techs, as of October 2024, Cloudflare is used as an authoritative DNS provider by 14.5% of all websites. As an authoritative DNS provider, we are responsible for managing and serving all the DNS records for our clients’ domains. This means we have an enormous responsibility to provide the best service possible, starting at the data plane. As such, we are constantly investing in our infrastructure to ensure the reliability and performance of our systems.

DNS is often referred to as the phone book of the Internet, and is a key component of the Internet. If you have ever used a phone book, you know that they can become extremely large depending on the size of the physical area it covers. A zone file in DNS is no different from a phone book. It has a list of records that provide details about a domain, usually including critical information like what IP address(es) each hostname is associated with. For example:

example.com      59 IN A 198.51.100.0
blog.example.com 59 IN A 198.51.100.1
ask.example.com  59 IN A 198.51.100.2

It is not unusual for these zone files to reach millions of records in size, just for a single domain. The biggest single zone on Cloudflare holds roughly 4 million DNS records, but the vast majority of zones hold fewer than 100 DNS records. Given our scale according to W3Techs, you can imagine how much DNS data alone Cloudflare is responsible for. Given this volume of data, and all the complexities that come at that scale, there needs to be a very good reason to move it from one database cluster to another. 

Why migrate 

When initially measured in 2022, DNS data took up approximately 40% of the storage capacity in Cloudflare’s main database cluster (cfdb). This database cluster, consisting of a primary system and multiple replicas, is responsible for storing DNS zones, propagated to our data centers in over 330 cities via our distributed KV store Quicksilver. cfdb is accessed by most of Cloudflare’s APIs, including the DNS Records API. Today, the DNS Records API is the API most used by our customers, with each request resulting in a query to the database. As such, it’s always been important to optimize the DNS Records API and its surrounding infrastructure to ensure we can successfully serve every request that comes in.

As Cloudflare scaled, cfdb was becoming increasingly strained under the pressures of several services, many unrelated to DNS. During spikes of requests to our DNS systems, other Cloudflare services experienced degradation in the database performance. It was understood that in order to properly scale, we needed to optimize our database access and improve the systems that interact with it. However, it was evident that system level improvements could only be just so useful, and the growing pains were becoming unbearable. In late 2022, the DNS team decided, along with the help of 25 other teams, to detach itself from cfdb and move our DNS records data to another database cluster.

Pre-migration

From a DNS perspective, this migration to an improved database cluster was in the works for several years. Cloudflare initially relied on a single Postgres database cluster, cfdb. At Cloudflare’s inception, cfdb was responsible for storing information about zones and accounts and the majority of services on the Cloudflare control plane depended on it. Since around 2017, as Cloudflare grew, many services moved their data out of cfdb to be served by a microservice. Unfortunately, the difficulty of these migrations are directly proportional to the amount of services that depend on the data being migrated, and in this case, most services require knowledge of both zones and DNS records.

Although the term “zone” was born from the DNS point of view, it has since evolved into something more. Today, zones on Cloudflare store many different types of non-DNS related settings and help link several non-DNS related products to customers’ websites. Therefore, it didn’t make sense to move both zone data and DNS record data together. This separation of two historically tightly coupled DNS concepts proved to be an incredibly challenging problem, involving many engineers and systems. In addition, it was clear that if we were going to dedicate the resources to solving this problem, we should also remove some of the legacy issues that came along with the original solution. 

One of the main issues with the legacy database was that the DNS team had little control over which systems accessed exactly what data and at what rate. Moving to a new database gave us the opportunity to create a more tightly controlled interface to the DNS data. This was manifested as an internal DNS Records gRPC API which allows us to make sweeping changes to our data while only requiring a single change to the API, rather than coordinating with other systems.  For example, the DNS team can alter access logic and auditing procedures under the hood. In addition, it allows us to appropriately rate-limit and cache data depending on our needs. The move to this new API itself was no small feat, and with the help of several teams, we managed to migrate over 20 services, using 5 different programming languages, from direct database access to using our managed gRPC API. Many of these services touch very important areas such as DNSSEC, TLS, Email, Tunnels, Workers, Spectrum, and R2 storage. Therefore, it was important to get it right. 

One of the last issues to tackle was the logical decoupling of common DNS database functions from zone data. Many of these functions expect to be able to access both DNS record data and DNS zone data at the same time. For example, at record creation time, our API needs to check that the zone is not over its maximum record allowance. Originally this check occurred at the SQL level by verifying that the record count was lower than the record limit for the zone. However, once you remove access to the zone itself, you are no longer able to confirm this. Our DNS Records API also made use of SQL functions to audit record changes, which requires access to both DNS record and zone data. Luckily, over the past several years, we have migrated this functionality out of our monolithic API and into separate microservices. This allowed us to move the auditing and zone setting logic to the application level rather than the database level. Ultimately, we are still taking advantage of SQL functions in the new database cluster, but they are fully independent of any other legacy systems, and are able to take advantage of the latest Postgres version.

Now that Cloudflare DNS was mostly decoupled from the zones database, it was time to proceed with the data migration. For this, we built what would become our Change Data Capture and Transfer Service (CDCTS).

Requirements for the Change Data Capture and Transfer Service

The Database team is responsible for all Postgres clusters within Cloudflare, and were tasked with executing the data migration of two tables that store DNS data: cf_rec and cf_archived_rec, from the original cfdb cluster to a new cluster we called dnsdb.  We had several key requirements that drove our design:

  • Don’t lose data. This is the number one priority when handling any sort of data. Losing data means losing trust, and it is incredibly difficult to regain that trust once it’s lost.  Important in this is the ability to prove no data had been lost.  The migration process would, ideally, be easily auditable.

  • Minimize downtime.  We wanted a solution with less than a minute of downtime during the migration, and ideally with just a few seconds of delay.

These two requirements meant that we had to be able to migrate data changes in near real-time, meaning we either needed to implement logical replication, or some custom method to capture changes, migrate them, and apply them in a table in a separate Postgres cluster.

We first looked at using Postgres logical replication using pgLogical, but had concerns about its performance and our ability to audit its correctness.  Then some additional requirements emerged that made a pgLogical implementation of logical replication impossible:

  • The ability to move data must be bidirectional. We had to have the ability to switch back to cfdb without significant downtime in case of unforeseen problems with the new implementation. 

  • Partition the cf_rec table in the new database. This was a long-desired improvement and since most access to cf_rec is by zone_id, it was decided that mod(zone_id, num_partitions) would be the partition key.

  • Transferred data accessible from original database.  In case we had functionality that still needed access to data, a foreign table pointing to dnsdb would be available in cfdb. This could be used as emergency access to avoid needing to roll back the entire migration for a single missed process.

  • Only allow writes in one database.  Applications should know where the primary database is, and should be blocked from writing to both databases at the same time.

Details about the tables being migrated

The primary table, cf_rec, stores DNS record information, and its rows are regularly inserted, updated, and deleted. At the time of the migration, this table had 1.7 billion records, and with several indexes took up 1.5 TB of disk. Typical daily usage would observe 3-5 million inserts, 1 million updates, and 3-5 million deletes.

The second table, cf_archived_rec, stores copies of cf_rec that are obsolete — this table generally only has records inserted and is never updated or deleted.  As such, it would see roughly 3-5 million inserts per day, corresponding to the records deleted from cf_rec. At the time of the migration, this table had roughly 4.3 billion records.

Fortunately, neither table made use of database triggers or foreign keys, which meant that we could insert/update/delete records in this table without triggering changes or worrying about dependencies on other tables.

Ultimately, both of these tables are highly active and are the source of truth for many highly critical systems at Cloudflare.

Designing the Change Data Capture and Transfer Service

There were two main parts to this database migration:

  1. Initial copy: Take all the data from cfdb and put it in dnsdb.

  2. Change copy: Take all the changes in cfdb since the initial copy and update dnsdb to reflect them. This is the more involved part of the process.

Normally, logical replication replays every insert, update, and delete on a copy of the data in the same transaction order, making a single-threaded pipeline.  We considered using a queue-based system but again, speed and auditability were both concerns as any queue would typically replay one change at a time.  We wanted to be able to apply large sets of changes, so that after an initial dump and restore, we could quickly catch up with the changed data. For the rest of the blog, we will only speak about cf_rec for simplicity, but the process for cf_archived_rec is the same.

What we decided on was a simple change capture table. Rows from this capture table would be loaded in real-time by a database trigger, with a transfer service that could migrate and apply thousands of changed records to dnsdb in each batch. Lastly, we added some auditing logic on top to ensure that we could easily verify that all data was safely transferred without downtime.

Basic model of change data capture 

For cf_rec to be migrated, we would create a change logging table, along with a trigger function and a  table trigger to capture the new state of the record after any insert/update/delete.  

The change logging table named log_cf_rec had the same columns as cf_rec, as well as four new columns:

  • change_id:  a sequence generated unique identifier of the record

  • action: a single character indicating whether this record represents an [i]nsert, [u]pdate, or [d]elete

  • change_timestamp: the date/time when the change record was created

  • change_user: the database user that made the change.  

A trigger was placed on the cf_rec table so that each insert/update would copy the new values of the record into the change table, and for deletes, create a ‘D’ record with the primary key value. 

Here is an example of the change logging where we delete, re-insert, update, and finally select from the log_cf_rec table. Note that the actual cf_rec and log_cf_rec tables have many more columns, but have been edited for simplicity.

dns_records=# DELETE FROM  cf_rec WHERE rec_id = 13;

dns_records=# SELECT * from log_cf_rec;
Change_id | action | rec_id | zone_id | name
----------------------------------------------
1         | D      | 13     |         |   

dns_records=# INSERT INTO cf_rec VALUES(13,299,'cloudflare.example.com');  

dns_records=# UPDATE cf_rec SET name = 'test.example.com' WHERE rec_id = 13;

dns_records=# SELECT * from log_cf_rec;
Change_id | action | rec_id | zone_id | name
----------------------------------------------
1         | D      | 13     |         |  
2         | I      | 13     | 299     | cloudflare.example.com
3         | U      | 13     | 299     | test.example.com 

In addition to log_cf_rec, we also introduced 2 more tables in cfdb and 3 more tables in dnsdb:

cfdb

  1. transferred_log_cf_rec: Responsible for auditing the batches transferred to dnsdb.

  2. log_change_action: Responsible for summarizing the transfer size in order to compare with the log_change_action in dnsdb.

dnsdb

  1. migrate_log_cf_rec: Responsible for collecting batch changes in dnsdb, which would later be applied to cf_rec in dnsdb.

  2. applied_migrate_log_cf_rec: Responsible for auditing the batches that had been successfully applied to cf_rec in dnsdb.

  3. log_change_action: Responsible for summarizing the transfer size in order to compare with the log_change_action in cfdb.

Initial copy

With change logging in place, we were now ready to do the initial copy of the tables from cfdb to dnsdb. Because we were changing the structure of the tables in the destination database and because of network timeouts, we wanted to bring the data over in small pieces and validate that it was brought over accurately, rather than doing a single multi-hour copy or pg_dump.  We also wanted to ensure a long-running read could not impact production and that the process could be paused and resumed at any time.  The basic model to transfer data was done with a simple psql copy statement piped into another psql copy statement.  No intermediate files were used.

psql_cfdb -c "COPY (SELECT * FROM cf_rec WHERE id BETWEEN n and n+1000000 TO STDOUT)" | 

psql_dnsdb -c "COPY cf_rec FROM STDIN"

Prior to a batch being moved, the count of records to be moved was recorded in cfdb, and after each batch was moved, a count was recorded in dnsdb and compared to the count in cfdb to ensure that a network interruption or other unforeseen error did not cause data to be lost. The bash script to copy data looked like this, where we included files that could be touched to pause or end the copy (if they cause load on production or there was an incident).  Once again, this code below has been heavily simplified.

#!/bin/bash
for i in "$@"; do
   # Allow user to control whether this is paused or not via pause_copy file
   while [ -f pause_copy ]; do
      sleep 1
   done
   # Allow user to end migration by creating end_copy file
   if [ ! -f end_copy ]; then
      # Copy a batch of records from cfdb to dnsdb
      # Get count of records from cfdb 
	# Get count of records from dnsdb
 	# Compare cfdb count with dnsdb count and alert if different 
   fi
done

Bash copy script

Change copy

Once the initial copy was completed, we needed to update dnsdb with any changes that had occurred in cfdb since the start of the initial copy. To implement this change copy, we created a function fn_log_change_transfer_log_cf_rec that could be passed a batch_id and batch_size, and did 5 things, all of which were executed in a single database transaction:

  1. Select a batch_size of records from log_cf_rec in cfdb.

  2. Copy the batch to transferred_log_cf_rec in cfdb to mark it as transferred.

  3. Delete the batch from log_cf_rec.

  4. Write a summary of the action to log_change_action table. This will later be used to compare transferred records with cfdb.

  5. Return the batch of records.

We then took the returned batch of records and copied them to migrate_log_cf_rec in dnsdb. We used the same bash script as above, except this time, the copy command looked like this:

psql_cfdb -c "COPY (SELECT * FROM fn_log_change_transfer_log_cf_rec(<batch_id>,<batch_size>) TO STDOUT" | 

psql_dnsdb -c "COPY migrate_log_cf_rec FROM STDIN"

Applying changes in the destination database

Now, with a batch of data in the migrate_log_cf_rec table, we called a newly created function log_change_apply to apply and audit the changes. Once again, this was all executed within a single database transaction. The function did the following:

  1. Move a batch from the migrate_log_cf_rec table to a new temporary table.

  2. Write the counts for the batch_id to the log_change_action table.

  3. Delete from the temporary table all but the latest record for a unique id (last action). For example, an insert followed by 30 updates would have a single record left, the final update. There is no need to apply all the intermediate updates.

  4. Delete any record from cf_rec that has any corresponding changes.

  5. Insert any [i]nsert or [u]pdate records in cf_rec.

  6. Copy the batch to applied_migrate_log_cf_rec for a full audit trail.

Putting it all together

There were 4 distinct phases, each of which was part of a different database transaction:

  1. Call fn_log_change_transfer_log_cf_rec in cfdb to get a batch of records.

  2. Copy the batch of records to dnsdb.

  3. Call log_change_apply in dnsdb to apply the batch of records.

  4. Compare the log_change_action table in each respective database to ensure counts match.


This process was run every 3 seconds for several weeks before the migration to ensure that we could keep dnsdb in sync with cfdb.

Managing which database is live

The last major pre-migration task was the construction of the request locking system that would be used throughout the actual migration. The aim was to create a system that would allow the database to communicate with the DNS Records API, to allow the DNS Records API to handle HTTP connections more gracefully. If done correctly, this could reduce downtime for DNS Record API users to nearly zero.

In order to facilitate this, a new table called cf_migration_manager was created. The table would be periodically polled by the DNS Records API, communicating two critical pieces of information:

  1. Which database was active. Here we just used a simple A or B naming convention.

  2. If the database was locked for writing. In the event the database was locked for writing, the DNS Records API would hold HTTP requests until the lock was released by the database.

Both pieces of information would be controlled within a migration manager script.

The benefit of migrating the 20+ internal services from direct database access to using our internal DNS Records gRPC API is that we were able to control access to the database to ensure that no one else would be writing without going through the cf_migration_manager.

During the migration 

Although we aimed to complete this migration in a matter of seconds, we announced a DNS maintenance window that could last a couple of hours just to be safe. Now that everything was set up, and both cfdb and dnsdb were roughly in sync, it was time to proceed with the migration. The steps were as follows:

  1. Lower the time between copies from 3s to 0.5s.

  2. Lock cfdb for writes via cf_migration_manager. This would tell the DNS Records API to hold write connections.

  3. Make cfdb read-only and migrate the last logged changes to dnsdb

  4. Enable writes to dnsdb

  5. Tell DNS Records API that dnsdb is the new primary database and that write connections can proceed via the cf_migration_manager.

Since we needed to ensure that the last changes were copied to dnsdb before enabling writing, this entire process took no more than 2 seconds. During the migration we saw a spike of API latency as a result of the migration manager locking writes, and then dealing with a backlog of queries. However, we recovered back to normal latencies after several minutes. 


DNS Records API Latency and Requests during migration

Unfortunately, due to the far-reaching impact that DNS has at Cloudflare, this was not the end of the migration. There were 3 lesser-used services that had slipped by in our scan of services accessing DNS records via cfdb. Fortunately, the setup of the foreign table meant that we could very quickly fix any residual issues by simply changing the table name. 

Post-migration

Almost immediately, as expected, we saw a steep drop in usage across cfdb. This freed up a lot of resources for other services to take advantage of.


cfdb usage dropped significantly after the migration period.

Since the migration, the average requests per second to the DNS Records API has more than doubled. At the same time, our CPU usage across both cfdb and dnsdb has settled at below 10% as seen below, giving us room for spikes and future growth. 



cfdb and dnsdb CPU usage now

As a result of this improved capacity, our database-related incident rate dropped dramatically.

As for query latencies, our latency post-migration is slightly lower on average, with fewer sustained spikes above 500ms. However, the performance improvement is largely noticed during high load periods, when our database handles spikes without significant issues. Many of these spikes come as a result of clients making calls to collect a large amount of DNS records or making several changes to their zone in short bursts. Both of these actions are common use cases for large customers onboarding zones.

In addition to these improvements, the DNS team also has more granular control over dnsdb cluster-specific settings that can be tweaked for our needs rather than catering to all the other services. For example, we were able to make custom changes to replication lag limits to ensure that services using replicas were able to read with some amount of certainty that the data would exist in a consistent form. Measures like this reduce overall load on the primary because almost all read queries can now go to the replicas.

Although this migration was a resounding success, we are always working to improve our systems. As we grow, so do our customers, which means the need to scale never really ends. We have more exciting improvements on the roadmap, and we are looking forward to sharing more details in the future.

The DNS team at Cloudflare isn’t the only team solving challenging problems like the one above. If this sounds interesting to you, we have many more tech deep dives on our blog, and we are always looking for curious engineers to join our team — see open opportunities here.

Elephants in tunnels: how Hyperdrive connects to databases inside your VPC networks

Post Syndicated from Andrew Repp original https://blog.cloudflare.com/elephants-in-tunnels-how-hyperdrive-connects-to-databases-inside-your-vpc-networks

With September’s announcement of Hyperdrive’s ability to send database traffic from Workers over Cloudflare Tunnels, we wanted to dive into the details of what it took to make this happen.

Hyper-who?

Accessing your data from anywhere in Region Earth can be hard. Traditional databases are powerful, familiar, and feature-rich, but your users can be thousands of miles away from your database. This can cause slower connection startup times, slower queries, and connection exhaustion as everything takes longer to accomplish.

Cloudflare Workers is an incredibly lightweight runtime, which enables our customers to deploy their applications globally by default and renders the cold start problem almost irrelevant. The trade-off for these light, ephemeral execution contexts is the lack of persistence for things like database connections. Database connections are also notoriously expensive to spin up, with many round trips required between client and server before any query or result bytes can be exchanged.

Hyperdrive is designed to make the centralized databases you already have feel like they’re global while keeping connections to those databases hot. We use our global network to get faster routes to your database, keep connection pools primed, and cache your most frequently run queries as close to users as possible.

Why a Tunnel?

For something as sensitive as your database, exposing access to the public Internet can be uncomfortable. It is common to instead host your database on a private network, and allowlist known-safe IP addresses or configure GRE tunnels to permit traffic to it. This is complex, toilsome, and error-prone. 

On Cloudflare’s Developer Platform, we strive for simplicity and ease-of-use. We cannot expect all of our customers to be experts in configuring networking solutions, and so we went in search of a simpler solution. Being your own customer is rarely a bad choice, and it so happens that Cloudflare offers an excellent option for this scenario: Tunnels.

Cloudflare Tunnel is a Zero Trust product that creates a secure connection between your private network and Cloudflare. Exposing services within your private network can be as simple as running a cloudflared binary, or deploying a Docker container running the cloudflared image we distribute.


A custom handler and generic streams

Integrating with Tunnels to support sending Postgres directly through them was a bit of a new challenge for us. Most of the time, when we use Tunnels internally (more on that later!), we rely on the excellent job cloudflared does of handling all of the mechanics, and we just treat them as pipes. That wouldn’t work for Hyperdrive, though, so we had to dig into how Tunnels actually ingress traffic to build a solution.

Hyperdrive handles Postgres traffic using an entirely custom implementation of the Postgres message protocol. This is necessary, because we sometimes have to alter the specific type or content of messages sent from client to server, or vice versa. Handling individual bytes gives us the flexibility to implement whatever logic any new feature might need.

An additional, perhaps less obvious, benefit of handling Postgres message traffic as just bytes is that we are not bound to the transport layer choices of some ORM or library. One of the nuances of running services in Cloudflare is that we may want to egress traffic over different services or protocols, for a variety of different reasons. In this case, being able to egress traffic via a Tunnel would be pretty challenging if we were stuck with whatever raw TCP socket a library had established for us.

The way we accomplish this relies on a mainstay of Rust: traits (which are how Rust lets developers apply logic across generic functions and types). In the Rust ecosystem, there are two traits that define the behavior Hyperdrive wants out of its transport layers: AsyncRead and AsyncWrite. There are a couple of others we also need, but we’re going to focus on just these two. These traits enable us to code our entire custom handler against a generic stream of data, without the handler needing to know anything about the underlying protocol used to implement the stream. So, we can pass around a WebSocket connection as a generic I/O stream, wherever it might be needed.

As an example, the code to create a generic TCP stream and send a Postgres startup message across it might look like this:

/// Send a startup message to a Postgres server, in the role of a PG client.
/// https://www.postgresql.org/docs/current/protocol-message-formats.html#PROTOCOL-MESSAGE-FORMATS-STARTUPMESSAGE
pub async fn send_startup<S>(stream: &mut S, user_name: &str, db_name: &str, app_name: &str) -> Result<(), ConnectionError>
where
    S: AsyncWrite + Unpin,
{
    let protocol_number = 196608 as i32;
    let user_str = &b"user\0"[..];
    let user_bytes = user_name.as_bytes();
    let db_str = &b"database\0"[..];
    let db_bytes = db_name.as_bytes();
    let app_str = &b"application_name\0"[..];
    let app_bytes = app_name.as_bytes();
    let len = 4 + 4
        + user_str.len() + user_bytes.len() + 1
        + db_str.len() + db_bytes.len() + 1
        + app_str.len() + app_bytes.len() + 1 + 1;

    // Construct a BytesMut of our startup message, then send it
    let mut startup_message = BytesMut::with_capacity(len as usize);
    startup_message.put_i32(len as i32);
    startup_message.put_i32(protocol_number);
    startup_message.put(user_str);
    startup_message.put_slice(user_bytes);
    startup_message.put_u8(0);
    startup_message.put(db_str);
    startup_message.put_slice(db_bytes);
    startup_message.put_u8(0);
    startup_message.put(app_str);
    startup_message.put_slice(app_bytes);
    startup_message.put_u8(0);
    startup_message.put_u8(0);

    match stream.write_all(&startup_message).await {
        Ok(_) => Ok(()),
        Err(err) => {
            error!("Error writing startup to server: {}", err.to_string());
            ConnectionError::InternalError
        }
    }
}

/// Connect to a TCP socket
let stream = match TcpStream::connect(("localhost", 5432)).await {
    Ok(s) => s,
    Err(err) => {
        error!("Error connecting to address: {}", err.to_string());
        return ConnectionError::InternalError;
    }
};
let _ = send_startup(&mut stream, "db_user", "my_db").await;

With this approach, if we wanted to encrypt the stream using TLS before we write to it (upgrading our existing TcpStream connection in-place, to an SslStream), we would only have to change the code we use to create the stream, while generating and sending the traffic would remain unchanged. This is because SslStream also implements AsyncWrite!

/// We're handwaving the SSL setup here. You're welcome.
let conn_config = new_tls_client_config()?;

/// Encrypt the TcpStream, returning an SslStream
let ssl_stream = match tokio_boring::connect(conn_config, domain, stream).await {
    Ok(s) => s,
    Err(err) => {
        error!("Error during websocket TLS handshake: {}", err.to_string());
        return ConnectionError::InternalError;
    }
};
let _ = send_startup(&mut ssl_stream, "db_user", "my_db").await;

Whence WebSocket

WebSocket is an application layer protocol that enables bidirectional communication between a client and server. Typically, to establish a WebSocket connection, a client initiates an HTTP request and indicates they wish to upgrade the connection to WebSocket via the “Upgrade” header. Then, once the client and server complete the handshake, both parties can send messages over the connection until one of them terminates it.

Now, it turns out that the way Cloudflare Tunnels work under the hood is that both ends of the tunnel want to speak WebSocket, and rely on a translation layer to convert all traffic to or from WebSocket. The cloudflared daemon you spin up within your private network handles this for us! For Hyperdrive, however, we did not have a suitable translation layer to send Postgres messages across WebSocket, and had to write one.

One of the (many) fantastic things about Rust traits is that the contract they present is very clear. To be AsyncRead, you just need to implement poll_read. To be AsyncWrite, you need to implement only three functions (poll_write, poll_flush, and poll_shutdown). Further, there is excellent support for WebSocket in Rust built on top of the tungstenite-rs library.

Thus, building our custom WebSocket stream such that it can share the same machinery as all our other generic streams just means translating the existing WebSocket support into these poll functions. There are some existing OSS projects that do this, but for multiple reasons we could not use the existing options. The primary reason is that Hyperdrive operates across multiple threads (thanks to the tokio runtime), and so we rely on our connections to also handle Send, Sync, and Unpin. None of the available solutions had all five traits handled. It turns out that most of them went with the paradigm of Sink and Stream, which provide a solid base from which to translate to AsyncRead and AsyncWrite. In fact some of the functions overlap, and can be passed through almost unchanged. For example, poll_flush and poll_shutdown have 1-to-1 analogs, and require almost no engineering effort to convert from Sink to AsyncWrite.

/// We use this struct to implement the traits we need on top of a WebSocketStream
pub struct HyperSocket<S>
where
    S: AsyncRead + AsyncWrite + Send + Sync + Unpin,
{
    inner: WebSocketStream<S>,
    read_state: Option<ReadState>,
    write_err: Option<Error>,
}

impl<S> AsyncWrite for HyperSocket<S>
where
    S: AsyncRead + AsyncWrite + Send + Sync + Unpin,
{
    fn poll_flush(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        match ready!(Pin::new(&mut self.inner).poll_flush(cx)) {
            Ok(_) => Poll::Ready(Ok(())),
            Err(err) => Poll::Ready(Err(Error::new(ErrorKind::Other, err))),
        }
    }

    fn poll_shutdown(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        match ready!(Pin::new(&mut self.inner).poll_close(cx)) {
            Ok(_) => Poll::Ready(Ok(())),
            Err(err) => Poll::Ready(Err(Error::new(ErrorKind::Other, err))),
        }
    }
}

With that translation done, we can use an existing WebSocket library to upgrade our SslStream connection to a Cloudflare Tunnel, and wrap the result in our AsyncRead/AsyncWrite implementation. The result can then be used anywhere that our other transport streams would work, without any changes needed to the rest of our codebase! 

That would look something like this:

let websocket = match tokio_tungstenite::client_async(request, ssl_stream).await {
    Ok(ws) => Ok(ws),
    Err(err) => {
        error!("Error during websocket conn setup: {}", err.to_string());
        return ConnectionError::InternalError;
    }
};
let websocket_stream = HyperSocket::new(websocket));
let _ = send_startup(&mut websocket_stream, "db_user", "my_db").await;

Access granted

An observant reader might have noticed that in the code example above we snuck in a variable named request that we passed in when upgrading from an SslStream to a WebSocketStream. This is for multiple reasons. The first reason is that Tunnels are assigned a hostname and use this hostname for routing. The second and more interesting reason is that (as mentioned above) when negotiating an upgrade from HTTP to WebSocket, a request must be sent to the server hosting the ingress side of the Tunnel to perform the upgrade. This is pretty universal, but we also add in an extra piece here.

At Cloudflare, we believe that secure defaults and defense in depth are the correct ways to build a better Internet. This is why traffic across Tunnels is encrypted, for example. However, that does not necessarily prevent unwanted traffic from being sent into your Tunnel, and therefore egressing out to your database. While Postgres offers a robust set of access control options for protecting your database, wouldn’t it be best if unwanted traffic never got into your private network in the first place? 

To that end, all Tunnels set up for use with Hyperdrive should have a Zero Trust Access Application configured to protect them. These applications should use a Service Token to authorize connections. When setting up a new Hyperdrive, you have the option to provide the token’s ID and Secret, which will be encrypted and stored alongside the rest of your configuration. These will be presented as part of the WebSocket upgrade request to authorize the connection, allowing your database traffic through while preventing unwanted access.

This can be done within the request’s headers, and might look something like this:

let ws_url = format!("wss://{}", host);
let mut request = match ws_url.into_client_request() {
    Ok(req) => req,
    Err(err) => {
        error!(
            "Hostname {} could not be parsed into a valid request URL: {}", 
            host,
            err.to_string()
        );
        return ConnectionError::InternalError;
    }
};
request.headers_mut().insert(
    "CF-Access-Client-Id",
    http::header::HeaderValue::from_str(&client_id).unwrap(),
);
request.headers_mut().insert(
    "CF-Access-Client-Secret",
    http::header::HeaderValue::from_str(&client_secret).unwrap(),
);

Building for customer zero

If you’ve been reading the blog for a long time, some of this might sound a bit familiar.  This isn’t the first time that we’ve sent Postgres traffic across a tunnel, it’s something most of us do from our laptops regularly.  This works very well for interactive use cases with low traffic volume and a high tolerance for latency, but historically most of our products have not been able to employ the same approach.

Cloudflare operates many data centers around the world, and most services run in every one of those data centers. There are some tasks, however, that make the most sense to run in a more centralized fashion. These include tasks such as managing control plane operations, or storing configuration state.  Nearly every Cloudflare product houses its control plane information in Postgres clusters run centrally in a handful of our data centers, and we use a variety of approaches for accessing that centralized data from elsewhere in our network. For example, many services currently use a push-based model to publish updates to Quicksilver, and work through the complexities implied by such a model. This has been a recurring challenge for any team looking to build a new product.

Hyperdrive’s entire reason for being is to make it easy to access such central databases from our global network. When we began exploring Tunnel integrations as a feature, many internal teams spoke up immediately and strongly suggested they’d be interested in using it themselves. This was an excellent opportunity for Cloudflare to scratch its own itch, while also getting a lot of traffic on a new feature before releasing it directly to the public. As always, being “customer zero” means that we get fast feedback, more reliability over time, stronger connections between teams, and an overall better suite of products. We jumped at the chance.

As we rolled out early versions of Tunnel integration, we worked closely with internal teams to get them access to it, and fixed any rough spots they encountered. We’re pleased to share that this first batch of teams have found great success building new or refactored products on Hyperdrive over Tunnels. For example: if you’ve already tried out Workers Builds, or recently submitted an abuse report, you’re among our first users!  At the time of this writing, we have several more internal teams working to onboard, and we on the Hyperdrive team are very excited to see all the different ways in which fast and simple connections from Workers to a centralized database can help Cloudflare just as much as they’ve been helping our external customers.

Outro

Cloudflare is on a mission to make the Internet faster, safer, and more reliable. Hyperdrive was built to make connecting to centralized databases from the Workers runtime as quick and consistent as possible, and this latest development is designed to help all those who want to use Hyperdrive without directly exposing resources within their virtual private clouds (VPCs) on the public web.

To this end, we chose to build a solution around our suite of industry-leading Zero Trust tools, and were delighted to find how simple it was to implement in our runtime given the power and extensibility of the Rust trait system. 

Without waiting for the ink to dry, multiple teams within Cloudflare have adopted this new feature to quickly and easily solve what have historically been complex challenges, and are happily operating it in production today.

And now, if you haven’t already, try setting up Hyperdrive across a Tunnel, and let us know what you think in the Hyperdrive Discord channel!

Supporting Postgres Named Prepared Statements in Hyperdrive

Post Syndicated from Andrew Repp original https://blog.cloudflare.com/postgres-named-prepared-statements-supported-hyperdrive


Hyperdrive (Cloudflare’s globally distributed SQL connection pooler and cache) recently added support for Postgres protocol-level named prepared statements across pooled connections. Named prepared statements allow Postgres to cache query execution plans, providing potentially substantial performance improvements. Further, many popular drivers in the ecosystem use these by default, meaning that not having them is a bit of a footgun for developers. We are very excited that Hyperdrive’s users will now have access to better performance and a more seamless development experience, without needing to make any significant changes to their applications!

While we’re not the first connection pooler to add this support (PgBouncer got to it in October 2023 in version 1.21, for example), there were some unique challenges in how we implemented it. To that end, we wanted to do a deep dive on what it took for us to deliver this.

Hyper-what?

One of the classic problems of building on the web is that your users are everywhere, but your database tends to be in one spot.  Combine that with pesky limitations like network routing, or the speed of light, and you can often run into situations where your users feel the pain of having your database so far away. This can look like slower queries, slower startup times, and connection exhaustion as everything takes longer to accomplish.

Hyperdrive is designed to make the centralized databases you already have feel like they’re global. We use our global network to get faster routes to your database, keep connection pools primed, and cache your most frequently run queries as close to users as possible.

Postgres Message Protocol

To understand exactly what the challenge with prepared statements is, it’s first necessary to dig in a bit to the Postgres Message Protocol. Specifically, we are going to take a look at the protocol for an “extended” query, which uses different message types and is a bit more complex than a “simple” query, but which is more powerful and thus more widely used.

A query using Hyperdrive might be coded something like this, but a lot goes on under the hood in order for Postgres to reliably return your response.

import postgres from "postgres";

// with Hyperdrive, we don't have to disable prepared statements anymore!
// const sql = postgres(env.HYPERDRIVE.connectionString, {prepare: false});

// make a connection, with the default postgres.js settings (prepare is set to true)
const sql = postgres(env.HYPERDRIVE.connectionString);

// This sends the query, and while it looks like a single action it contains several 
// messages implied within it
let [{ a, b, c, id }] = await sql`SELECT a, b, c, id FROM hyper_test WHERE id = ${target_id}`;

To prepare a statement, a Postgres client begins by sending a Parse message. This includes the query string, the number of parameters to be interpolated, and the statement’s name. The name is a key piece of this puzzle. If it is empty, then Postgres uses a special “unnamed” prepared statement slot that gets overwritten on each new Parse. These are relatively easy to support, as most drivers will keep the entirety of a message sequence for unnamed statements together, and will not try to get too aggressive about reusing the prepared statement because it is overwritten so often.

If the statement has a name, however, then it is kept prepared for the remainder of the Postgres session (unless it is explicitly removed with DEALLOCATE). This is convenient because parsing a query string and preparing the statement costs bytes sent on the wire and CPU cycles to process, so reusing a statement is quite a nice optimization.

Once done with Parse, there are a few remaining steps to (the simplest form of) an extended query:

  • A Bind message, which provides the specific values to be passed for the parameters in the statement (if any).
  • An Execute message, which tells the Postgres server to actually perform the data retrieval and processing.
  • And finally a Sync message, which causes the server to close the implicit transaction, return results, and provides a synchronization point for error handling.

While that is the core pattern for accomplishing an extended protocol query, there are many more complexities possible (named Portal, ErrorResponse, etc.).

We will briefly mention one other complexity we often encounter in this protocol, which is Describe messages. Many drivers leverage Postgres’ built-in types to help with deserialization of the results into structs or classes. This is accomplished by sending a Parse-Describe-Flush/Sync sequence, which will send a statement to be prepared, and will expect back information about the types and data the query will return. This complicates bookkeeping around named prepared statements, as now there are two separate queries, with two separate kinds of responses, that must be kept track of. We won’t go into much depth on the tradeoffs of an additional round-trip in exchange for advanced information about the results’ format, but suffice it to say that it must be handled explicitly in order for the overall system to gracefully support prepared statements.

So the basic query from our code above looks like this from a message perspective:

A more complete description and the full structure of each message type are well described in the Postgres documentation.

So, what’s so hard about that?

Buffering Messages

The first challenge that Hyperdrive must solve (that many other connection poolers don’t have) is that it’s also a cache.

The happiest path for a query on Hyperdrive never travels far, and we are quite proud of the low latency of our cache hits. However, this presents a particular challenge in the case of an extended protocol query. A Parse by itself is insufficient as a cache key, both because the parameter values in the Bind messages can alter the expected results, and because it might be followed up with either a Describe or an Execute message which will invoke drastically different responses.

So Hyperdrive cannot simply pass each message to the origin database, as we must buffer them in a message log until we have enough information to reliably distinguish between cache keys. It turns out that receiving a Sync is quite a natural point at which to check whether you have enough information to serve a response. For most scenarios, we buffer until we receive a Sync, and then (assuming the scenario is cacheable) we determine whether we can serve the response from cache or we need to take a connection to the origin database.

Taking a Connection From the Pool

Assuming we aren’t serving a response from cache, for whatever reason, we’ll need to take an origin connection from our pool. One of the key advantages any connection pooler offers is in allowing many client connections to share few database connections, so minimizing how often and for how long these connections are held is crucial to making Hyperdrive performant.

To this end, Hyperdrive operates in what is traditionally called “transaction mode”. This means that a connection taken from the pool for any given transaction is returned once that transaction concludes. This is in contrast to what is often called “session mode”, where once a connection is taken from the pool it is held by the client until the client disconnects.

For Hyperdrive, allowing any client to take any database connection is vital. This is because if we “pin” a client to a given database connection then we have one fewer available for every other possible client. You can run yourself out of database connections very quickly once you start down that path, especially when your clients are many small Workers spread around the world.

The challenge prepared statements present to this scenario is that they exist at the “session” scope, which is to say, at the scope of one connection. If a client prepares a statement on connection A, but tries to reuse it and gets assigned connection B, Postgres will naturally throw an error claiming the statement doesn’t exist in the given session. No results will be returned, the client is unhappy, and all that’s left is to retry with a Parse message included. This causes extra round-trips between client and server, defeating the whole purpose of what is meant to be an optimization.

One of the goals of a connection pooler is to be as transparent to the client and server as possible. There are limitations, as Postgres will let you do some powerful things to session state that cannot be reasonably shared across arbitrary client connections, but to the extent possible the endpoints should not have to know or care about any multiplexing happening between them.

This means that when a client sends a Parse message on its connection, it should expect that the statement will be available for reuse when it wants to send a Bind-Execute-Sync sequence later on. It also means that the server should not get Bind messages for statements that only exist on some other session. Maintaining this illusion is the crux of providing support for this feature.

Putting it all together

So, what does the solution look like? If a client sends Parse-Bind-Execute-Sync with a named prepared statement, then later sends Bind-Execute-Sync to reuse it, how can we make sure that everything happens as expected? The solution, it turns out, needs just a few built-in Rust data structures for efficiently capturing what we need (a HashMap, some LruCaches and a VecDeque), and some straightforward business logic to keep track of when to intervene in the messages being passed back and forth.

Whenever a named Parse comes in, we store it in an in-memory HashMap on the server that handles message processing for that client’s connection. This persists until the client is disconnected. This means that whenever we see anything referencing the statement, we can go retrieve the complete message defining it. We’ll come back to this in a moment.

Once we’ve buffered all the messages we can and gotten to the point where it’s time to return results (let’s say because the client sent a Sync), we need to start applying some logic. For the sake of brevity we’re going to omit talking through error handling here, as it does add some significant complexity but is somewhat out of scope for this discussion.

There are two main questions that determine how we should proceed:

  1. Does our message sequence include a Parse, or are we trying to reuse a pre-existing statement?
  2. Do we have a cache hit or are we serving from the origin database?

This gives us four scenarios to consider:

  1. Parse with cache hit
  2. Parse with cache miss
  3. Reuse with cache hit
  4. Reuse with cache miss

A Parse with a cache hit is the easiest path to address, as we don’t need to do anything special. We use the messages sent as a cache key, and serve the results back to the client. We will still keep the Parse in our HashMap in case we want it later (#2 below), but otherwise we’re good to go.

A Parse with a cache miss is a bit more complicated, as now we need to send these messages to the origin server. We take a connection at random from our pool and do so, passing the results back to the client. With that, we’ve begun to make changes to session state such that all our database connections are no longer identical to each other. To keep track of what we’ve done to muddy up our state, we keep a LruCache on each connection of which statements it already has prepared. In the case where we need to evict from such a cache, we will also DEALLOCATE the statement on the connection to keep things tracked correctly.

Reuse with a cache hit is yet more tricky, but still straightforward enough. In the example below, we are sent a Bind with the same parameters twice (#1 and #9). We must identify that we received a Bind without a preceding Parse, we must go retrieve that Parse (#10), and we must use the information from it to build our cache key. Once all that is accomplished, we can serve our results from cache, needing only to trim out the ParseComplete within the cached results before returning them to the client.

Reuse with a cache miss is the hardest scenario, as it may require us to lie in both directions. In the example below, we cache results for one set of parameters (#8), but are sent a Bind with different parameters (#9). As in the cache hit scenario, we must identify that we were not sent a Parse as part of the current message sequence, retrieve it from our HashMap (#10), and build our cache key to GET from cache and confirm the miss (#11). Once we take a connection from the pool, though, we then need to check if it already has the statement we want prepared. If not, we must take our saved Parse and prepend it to our message log to be sent along to the origin database (#13). Thus, what the server receives looks like a perfectly valid Parse-Bind-Execute-Sync sequence. This is where our VecDeque (mentioned above) comes in, as converting our message log to that structure allowed us to very ergonomically make such changes without needing to rebuild the whole byte sequence. Once we receive the response from the server, all that’s needed is to trim out the initial ParseComplete response from the server, as a well-made client would likely be very confused receiving such a response to a Parse it didn’t send. With that message trimmed out, however, the client is in the position of getting exactly what it asked for, and both sides of the conversation are happy.

Dénouement

Now that we’ve got a working solution, where all parties are functioning well, let’s review! Our solution lets us share database connections across arbitrary clients with no “pinning”, no custom handling on either client or server, and supports reuse of prepared statements to reduce CPU load on re-parsing queries and reduce network traffic on re-sending Parse messages. Engineering always involves tradeoffs, so the cost of this is that we will sometimes still need to sneak in a Parse because a client got assigned a different connection on reuse, and in those scenarios there is a small amount of additional memory overhead because the same statement is prepared on multiple connections.

And now, if you haven’t already, go give Hyperdrive a spin, and let us know what you think in the Hyperdrive Discord channel!

Introducing Relational Database Connectors

Post Syndicated from Kabir Sikand original https://blog.cloudflare.com/relational-database-connectors/

Introducing Relational Database Connectors

Introducing Relational Database Connectors

At Cloudflare, we’re building the best compute platform in the world. We want to make it easy, seamless, and obvious to build your applications with us. But simply making the best compute platform is not enough — at the heart of your applications are the data they interact with.

Cloudflare has multiple data storage solutions available today: Workers KV, R2, and Durable Objects. All three follow Cloudflare’s design goals for Workers: global by default, infinitely scalable, and delightful for developers to use. We’ve partnered with third-party storage solutions like Fauna, MongoDB and Prisma, who have built data platforms that align beautifully with our design goals and written tutorials for databases that already support HTTP connections.

The one area that’s been sorely missed: relational databases. Cloudflare itself runs on relational databases, and we’re not alone. In April, we asked which Node libraries you wanted us to support, and four of the top five requests were related to databases. For this Full Stack Week, we asked ourselves: how could we support relational databases in a way that aligned with our design goals?

Today, we’re taking a first step towards that world by announcing support for relational databases, including Postgres and MySQL from Workers.

Connecting to a database is no simple task — if it were as easy as passing a connection string to a database driver, we would have already done it. We’ve had to overcome several hurdles to reach this point, and have several more still to conquer.  

Our goal with this announcement is to work with you, our developers, to solve the unique pain points that come from accessing databases inside Workers. If you’d like to work with us, fill out this form or join us on Discord — this is just the beginning. If you’d just like to grab the code and play around, use this example to get started connecting to your own database, or check out our demo.

Why are Database Connectors so hard to build?

Serverless database connections are challenging to support for several reasons.

Databases are needy — they often require TCP connections, since they assume long-lived connections between an application server and the database. The Workers runtime doesn’t currently support TCP connections, so we’ve only been able to support HTTP-based databases or proxies.

Like a relationship, establishing a connection isn’t quite enough. Developers use client libraries for databases to make submitting queries and managing the responses easy. Since the Workers runtime is not entirely Node.js compatible, we need to either roll our own database library or find one that does not use unsupported built-in libraries.

Finally, databases are sensitive. It often takes external libraries to manage shared connections between an application server and a database, since these connections tend to be expensive to establish.

Moving past these challenges

Our approach today gives us the foundation to address each of these challenges in creative ways going forward.

First, we’re leveraging cloudflared to create a secure tunnel between Cloudflare and a private network within your existing infrastructure. Cloudflared already supports proxying HTTP to TCP over WebSockets — Our challenge is providing interfaces that look like the socket interfaces existing libraries expect, while rewiring the implementations to redirect reads and writes to our websocket. This method is fast, safe, and secure; but limiting in that we lack control of where to direct the final connections. This is a problem we will solve soon, but until then our approach is essential to gathering latency and performance data to see where else we need to improve.

Introducing Relational Database Connectors

Next, we’ve created a shim-layer that adapts the socket API from a popular runtime to connect directly to databases using a WebSocket. This allows us to bundle code as-is, without forking or otherwise making significant changes to the database library. As part of this announcement, we’ve published a tutorial on how to connect to and query a Postgres database from your Workers, using existing Cloudflare technology and a driver from the growing community at Deno. We’re excited to work with the upstream maintainers, on expanding support.

Finally, we’re most excited for how this approach will let us begin to manage connection pooling and connection establishment overhead. While our current tech demo requires setting up the Cloudflare Tunnel on your own infrastructure, we’re looking for customers who’d like to pilot a model where Cloudflare hosts the tunnel for you.

Where we’re going

We’re just getting started. Our goal with today’s announcement is to find customers who are looking to build new applications or migrate existing applications to Workers while working with data that’s stored in a relational database.

Just as Cloudflare started by providing security, performance, and reliability for customer’s websites, we’re excited about a future where Cloudflare manages database connections, handles replication of data across cloud providers and provides low-latency access to data globally.

First, we’re looking to add support for TCP into the runtime natively. With native support for TCP we’ll not only have better support for databases, but expand the Workers runtime to work with data infrastructure more broadly.

Our position in the network layer of the stack makes providing performance, security benefits and extremely reduced egress costs to global databases all possible realities. To do so, we’ll repurpose the HTTP to TCP proxy service that we’ve currently built and run it for developers as a connection pooling service, managing connections to their databases on their behalf.

Finally, our network makes caching data and making it accessible globally at low latency possible. Once we have connections back to your data, making it globally accessible in Cloudflare’s network will unlock fundamentally new architectures for distributed data.

Take our connectors for a spin

Want to check things out? There are three main steps to getting up-and-running:

  1. Deploying cloudflared within your infrastructure.
  2. Deploying a database that connects to cloudflared.
  3. Deploying a Worker with the database driver that submits queries.

The Postgres tutorial is available here.

When you’re all done, it’ll look a little something like this:

import { Client } from './driver/postgres/postgres'

export default {
  async fetch(request: Request, env, ctx: ExecutionContext) {
    try {
      const client = new Client({
        user: 'postgres',
        database: 'postgres',
        hostname: 'https://db.example.com',
        password: '',
        port: 5432,
      })
      await client.connect()
      const result = await client.queryArray('SELECT * FROM users WHERE uuid=1;')
      ctx.waitUntil(client.end())
      return new Response(JSON.stringify(result.rows[0]))
    } catch (e) {
      return new Response((e as Error).message)
    }
  },
}

Hit any snags? Fill out this form, join our Discord or shoot us an email and let’s chat!

Modernizing a familiar approach to REST APIs, with PostgreSQL and Cloudflare Workers

Post Syndicated from Kristian Freeman original https://blog.cloudflare.com/modernizing-a-familiar-approach-to-rest-apis-with-postgresql-and-cloudflare-workers/

Modernizing a familiar approach to REST APIs, with PostgreSQL and Cloudflare Workers

Postgres is a ubiquitous open-source database technology. It contains a vast number of features and offers rock-solid reliability. It’s also one of the most popular SQL database tools in the industry. As the industry builds “modern” developer experience tools—real-time and highly interactive—Postgres has also served as a great foundation. Projects like Hasura, which offers a real-time GraphQL engine, and Supabase, an open-source Firebase alternative, use Postgres under the hood. This makes Postgres a technology that every developer should know, and consider using in their applications.

For many developers, REST APIs serve as the primary way we interact with our data. Language-specific libraries like pg allow developers to connect with Postgres in their code, and directly interact with their databases. Yet in almost every case, developers reinvent the wheel, building the same connection logic on an app-by-app basis.

Many developers building applications with Cloudflare Workers, our serverless functions platform, have asked how they can use Postgres in Workers functions. Today, we’re releasing a new tutorial for Workers that shows how to connect to Postgres inside Workers functions. Built on PostgREST, you’ll write a REST API that communicates directly with your database, on the edge.

This means that you can entirely build applications on Cloudflare’s edge — using Workers as a performant and globally-distributed API, and Cloudflare Pages, our Jamstack deployment platform, as the host for your frontend user interface. With Workers, you can add new API endpoints and handle authentication in front of your database without needing to alter your Postgres configuration. With features like Workers KV and Durable Objects, Workers can provide globally-distributed caching in front of your Postgres database. Features like WebSockets can be used to build real-time interactions for your applications, without having to migrate from Postgres to a new database-as-a-service platform.

PostgREST is an open-source tool that generates a standards-compliant REST API for your Postgres databases. Many growing database-as-a-service startups like Retool and Supabase use PostgREST under the hood. PostgREST is fast and has great defaults, allowing you to access your Postgres data using standard REST conventions.

It’s great to be able to access your database directly from Workers, but do you really want to expose your database directly to the public Internet? Luckily, Cloudflare has a solution for this, and it works great with PostgREST: Cloudflare Tunnel. Cloudflare Tunnel is one of my personal favorite products at Cloudflare. It creates a secure tunnel between your local server and the Cloudflare network. We want to expose our PostgREST endpoint, without making our entire database available on the public internet. Cloudflare Tunnel allows us to do that securely.

Modernizing a familiar approach to REST APIs, with PostgreSQL and Cloudflare Workers

By using PostgREST with Postgres, we can build REST API-based applications. In particular, it’s an excellent fit for Cloudflare Workers, our serverless function platform. Workers is a great place to build REST APIs. With the open-source JavaScript library postgrest-js, we can interact with a PostgREST endpoint from inside our Workers function, using simple JS-based primitives.

By the way — if you haven’t built a REST API with Workers yet, check out our free video course with Egghead: “Building a Serverless API with Cloudflare Workers”.

Scaling applications built on Postgres is an incredibly common problem that developers face. Often, this means duplicating your Postgres database and distributing reads between your primary database, and a fleet of “read replicas”. With PostgREST and Workers, we can begin to explore a different approach to solving the scaling problem. Workers’ unique architecture allows us to deploy hyper-performant functions in front of Postgres databases. With tools like Workers KV and Durable Objects, exposed in Workers as basic JavaScript APIs, we can build intelligent caches for our databases, without sacrificing performance or developer experience.

If you’d like to learn more about building REST APIs in Cloudflare Workers using PostgREST, check out our new tutorial! We’ve also provided two open-source libraries to help you get started. cloudflare/postgres-postgrest-cloudflared-example helps you set up a Cloudflare Tunnel-backed Postgres + PostgREST endpoint. postgrest-worker-example is an example of using postgrest-js inside of Cloudflare Workers, to build REST APIs with your Postgres databases.

Modernizing a familiar approach to REST APIs, with PostgreSQL and Cloudflare Workers

With postgrest-js, you can build dynamic queries and request data from your database using the JS primitives you know and love:

// Get all users with at least 100 followers
const { data: users, error } = await client
.from('users')
.select(‘*’)
.gte('followers', 100)

You can also join our Cloudflare Developers Discord community! Learn more about what you can build with Cloudflare Workers, and meet our wonderful community of developers from around the world. Get your invite link here.

A Byzantine failure in the real world

Post Syndicated from Tom Lianza original https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

A Byzantine failure in the real world

An analysis of the Cloudflare API availability incident on 2020-11-02

When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). Eliminating these is a necessary step in architecting a system you can be confident in. Ironically, when you’re designing a system with built-in redundancy, you spend most of your time thinking about how well it functions when that redundancy is lost.

On November 2, 2020, Cloudflare had an incident that impacted the availability of the API and dashboard for six hours and 33 minutes. During this incident, the success rate for queries to our API periodically dipped as low as 75%, and the dashboard experience was as much as 80 times slower than normal. While Cloudflare’s edge is massively distributed across the world (and kept working without a hitch), Cloudflare’s control plane (API & dashboard) is made up of a large number of microservices that are redundant across two regions. For most services, the databases backing those microservices are only writable in one region at a time.

Each of Cloudflare’s control plane data centers has multiple racks of servers. Each of those racks has two switches that operate as a pair—both are normally active, but either can handle the load if the other fails. Cloudflare survives rack-level failures by spreading the most critical services across racks. Every piece of hardware has two or more power supplies with different power feeds. Every server that stores critical data uses RAID 10 redundant disks or storage systems that replicate data across at least three machines in different racks, or both. Redundancy at each layer is something we review and require. So—how could things go wrong?

In this post we present a timeline of what happened, and how a difficult failure mode known as a Byzantine fault played a role in a cascading series of events.

2020-11-02 14:43 UTC: Partial Switch Failure

At 14:43, a network switch started misbehaving. Alerts began firing about the switch being unreachable to pings. The device was in a partially operating state: network control plane protocols such as LACP and BGP remained operational, while others, such as vPC, were not. The vPC link is used to synchronize ports across multiple switches, so that they appear as one large, aggregated switch to servers connected to them. At the same time, the data plane (or forwarding plane) was not processing and forwarding all the packets received from connected devices.

This failure scenario is completely invisible to the connected nodes, as each server only sees an issue for some of its traffic due to the load-balancing nature of LACP. Had the switch failed fully, all traffic would have failed over to the peer switch, as the connected links would’ve simply gone down, and the ports would’ve dropped out of the forwarding LACP bundles.

Six minutes later, the switch recovered without human intervention. But this odd failure mode led to further problems that lasted long after the switch had returned to normal operation.

2020-11-02 14:44 UTC: etcd Errors begin

The rack with the misbehaving switch included one server in our etcd cluster. We use etcd heavily in our core data centers whenever we need strongly consistent data storage that’s reliable across multiple nodes.

In the event that the cluster leader fails, etcd uses the RAFT protocol to maintain consistency and establish consensus to promote a new leader. In the RAFT protocol, cluster members are assumed to be either available or unavailable, and to provide accurate information or none at all. This works fine when a machine crashes, but is not always able to handle situations where different members of the cluster have conflicting information.

In this particular situation:

  • Network traffic between node 1 (in the affected rack) and node 3 (the leader) was being sent through the switch in the degraded state,
  • Network traffic between node 1 and node 2 were going through its working peer, and
  • Network traffic between node 2 and node 3 was unaffected.

This caused cluster members to have conflicting views of reality, known in distributed systems theory as a Byzantine fault. As a consequence of this conflicting information, node 1 repeatedly initiated leader elections, voting for itself, while node 2 repeatedly voted for node 3, which it could still connect to. This resulted in ties that did not promote a leader node 1 could reach. RAFT leader elections are disruptive, blocking all writes until they’re resolved, so this made the cluster read-only until the faulty switch recovered and node 1 could once again reach node 3.

A Byzantine failure in the real world

2020-11-02 14:45 UTC: Database system promotes a new primary database

Cloudflare’s control plane services use relational databases hosted across multiple clusters within a data center. Each cluster is configured for high availability. The cluster setup includes a primary database, a synchronous replica, and one or more asynchronous replicas. This setup allows redundancy within a data center. For cross-datacenter redundancy, a similar high availability secondary cluster is set up and replicated in a geographically dispersed data center for disaster recovery. The cluster management system leverages etcd for cluster member discovery and coordination.

When etcd became read-only, two clusters were unable to communicate that they had a healthy primary database. This triggered the automatic promotion of a synchronous database replica to become the new primary. This process happened automatically and without error or data loss.

There was a defect in our cluster management system that requires a rebuild of all database replicas when a new primary database is promoted. So, although the new primary database was available instantly, the replicas would take considerable time to become available, depending on the size of the database. For one of the clusters, service was restored quickly. Synchronous and asynchronous database replicas were rebuilt and started replicating successfully from primary, and the impact was minimal.

For the other cluster, however, performant operation of that database required a replica to be online. Because this database handles authentication for API calls and dashboard activities, it takes a lot of reads, and one replica was heavily utilized to spare the primary the load. When this failover happened and no replicas were available, the primary was overloaded, as it had to take all of the load. This is when the main impact started.

Reduce Load, Leverage Redundancy

At this point we saw that our primary authentication database was overwhelmed and began shedding load from it. We dialed back the rate at which we push SSL certificates to the edge, send emails, and other features, to give it space to handle the additional load. Unfortunately, because of its size, we knew it would take several hours for a replica to be fully rebuilt.

A silver lining here is that every database cluster in our primary data center also has online replicas in our secondary data center. Those replicas are not part of the local failover process, and were online and available throughout the incident. The process of steering read-queries to those replicas was not yet automated, so we manually diverted API traffic that could leverage those read replicas to the secondary data center. This substantially improved our API availability.

The Dashboard

The Cloudflare dashboard, like most web applications, has the notion of a user session. When user sessions are created (each time a user logs in) we perform some database operations and keep data in a Redis cluster for the duration of that user’s session. Unlike our API calls, our user sessions cannot currently be moved across the ocean without disruption. As we took actions to improve the availability of our API calls, we were unfortunately making the user experience on the dashboard worse.

This is an area of the system that is currently designed to be able to fail over across data centers in the event of a disaster, but has not yet been designed to work in both data centers at the same time. After a first period in which users on the dashboard became increasingly frustrated, we failed the authentication calls fully back to our primary data center, and kept working on our primary database to ensure we could provide the best service levels possible in that degraded state.

2020-11-02 21:20 UTC Database Replica Rebuilt

The instant the first database replica rebuilt, it put itself back into service, and performance resumed to normal levels. We re-ramped all of the services that had been turned down, so all asynchronous processing could catch up, and after a period of monitoring marked the end of the incident.

Redundant Points of Failure

The cascade of failures in this incident was interesting because each system, on its face, had redundancy. Moreover, no system fully failed—each entered a degraded state. That combination meant the chain of events that transpired was considerably harder to model and anticipate. It was frustrating yet reassuring that some of the possible failure modes were already being addressed.

A team was already working on fixing the limitation that requires a database replica rebuild upon promotion. Our user sessions system was inflexible in scenarios where we’d like to steer traffic around, and redesigning that was already in progress.

This incident also led us to revisit the configuration parameters we put in place for things that auto-remediate. In previous years, promoting a database replica to primary took far longer than we liked, so getting that process automated and able to trigger on a minute’s notice was a point of pride. At the same time, for at least one of our databases, the cure may be worse than the disease, and in fact we may not want to invoke the promotion process so quickly. Immediately after this incident we adjusted that configuration accordingly.

Byzantine Fault Tolerance (BFT) is a hot research topic. Solutions have been known since 1982, but have had to choose between a variety of engineering tradeoffs, including security, performance, and algorithmic simplicity. Most general-purpose cluster management systems choose to forgo BFT entirely and use protocols based on PAXOS, or simplifications of PAXOS such as RAFT, that perform better and are easier to understand than BFT consensus protocols. In many cases, a simple protocol that is known to be vulnerable to a rare failure mode is safer than a complex protocol that is difficult to implement correctly or debug.

The first uses of BFT consensus were in safety-critical systems such as aircraft and spacecraft controls. These systems typically have hard real time latency constraints that require tightly coupling consensus with application logic in ways that make these implementations unsuitable for general-purpose services like etcd. Contemporary research on BFT consensus is mostly focused on applications that cross trust boundaries, which need to protect against malicious cluster members as well as malfunctioning cluster members. These designs are more suitable for implementing general-purpose services such as etcd, and we look forward to collaborating with researchers and the open source community to make them suitable for production cluster management.

We are very sorry for the difficulty the outage caused, and are continuing to improve as our systems grow. We’ve since fixed the bug in our cluster management system, and are continuing to tune each of the systems involved in this incident to be more resilient to failures of their dependencies.  If you’re interested in helping solve these problems at scale, please visit cloudflare.com/careers.