Cloudflare Integrations Marketplace introduces three new partners: Sentry, Momento and Turso

Post Syndicated from Tanushree Sharma original http://blog.cloudflare.com/cloudflare-integrations-marketplace-new-partners-sentry-momento-turso/

Building modern full-stack applications requires connecting to many hosted third party services, from observability platforms to databases and more. All too often, this means spending time doing busywork, managing credentials and writing glue code just to get started. This is why we’re building out the Cloudflare Integrations Marketplace to allow developers to easily discover, configure and deploy products to use with Workers.

Earlier this year, we introduced integrations with Supabase, PlanetScale, Neon and Upstash. Today, we are thrilled to introduce our newest additions to Cloudflare’s Integrations Marketplace – Sentry, Turso and Momento.

Let's take a closer look at some of the exciting integration providers that are now part of the Workers Integration Marketplace.

Improve performance and reliability by connecting Workers to Sentry

When your Worker encounters an error, you want to know what happened and exactly which line of code triggered it. Sentry is an application monitoring platform that helps developers identify and resolve issues in real time.

The Workers and Sentry integration automatically sends errors, exceptions and console.log() messages from your Worker to Sentry with no code changes required. Here’s how it works:

  1. You enable the integration from the Cloudflare Dashboard.
  2. The credentials from the Sentry project of your choice are automatically added to your Worker.
  3. You can configure sampling to control the volume of events you want sent to Sentry. This includes selecting the sample rate for different status codes and exceptions.
  4. Cloudflare deploys a Tail Worker behind the scenes that contains all the logic needed to capture and send data to Sentry (a simplified sketch of what such a Tail Worker might look like follows this list).
  5. Like magic, errors, exceptions, and log messages are automatically sent to your Sentry project.
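
To make step 4 concrete, here is a minimal, hypothetical sketch of a Tail Worker that forwards exceptions and console output to an error-tracking endpoint. The integration deploys and manages the real Tail Worker for you; the simplified event type, the SENTRY_INGEST_URL binding and the payload shape below are illustrative assumptions rather than the integration's actual implementation.

```typescript
// Simplified stand-in for the trace events a Tail Worker receives.
interface TailEvent {
  scriptName: string | null;
  outcome: string;
  exceptions: { name: string; message: string; timestamp: number }[];
  logs: { level: string; message: unknown[]; timestamp: number }[];
}

export default {
  async tail(events: TailEvent[], env: { SENTRY_INGEST_URL: string }): Promise<void> {
    for (const event of events) {
      // Skip events with nothing worth reporting.
      if (event.exceptions.length === 0 && event.logs.length === 0) continue;

      // Forward the captured data to the (hypothetical) ingestion endpoint.
      await fetch(env.SENTRY_INGEST_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          script: event.scriptName,
          outcome: event.outcome,
          exceptions: event.exceptions, // uncaught exceptions from the producer Worker
          logs: event.logs,             // console.log() output
        }),
      });
    }
  },
};
```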

In the future, we’ll improve this integration by adding support for uploading source maps and stack traces so that you can pinpoint exactly which line of your code caused an issue. We’ll also tie Workers deployments to Sentry releases, correlating new versions of your Worker with Sentry events to help pinpoint problematic deployments. Check out our developer documentation for more information.

Develop at the Data Edge with Turso + Workers

Turso is an edge-hosted, distributed database based on libSQL, an open-source fork of SQLite. Turso focuses on providing a global service that minimizes query latency (and thus, application latency!). It’s perfect for use with Cloudflare Workers – both compute and data are served close to users.

Turso follows the model of having one primary database with replicas located globally, close to users. Turso automatically routes requests to the replica closest to where the Worker was invoked. This model works very efficiently for read-heavy applications, since read requests can be served by replicas around the globe. If you’re running an application with heavy write workloads, or want to cut down on replication costs, you can run Turso with just the primary instance and use Smart Placement to speed up queries.

The Turso and Workers integration automatically pulls in Turso API credentials and adds them as secrets to your Worker, so that you can start using Turso by simply establishing a connection using the libsql SDK. Get started with the Turso and Workers Integration today by heading to our developer documentation.
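
As a rough sketch, and assuming the secrets are surfaced to the Worker as TURSO_URL and TURSO_AUTH_TOKEN (the integration adds the actual secret names for you; the posts table is just an example), querying Turso from a Worker with the libsql TypeScript client might look like this:

```typescript
import { createClient } from "@libsql/client/web";

export interface Env {
  TURSO_URL: string;        // Turso database URL added by the integration
  TURSO_AUTH_TOKEN: string; // Turso auth token added by the integration
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const client = createClient({
      url: env.TURSO_URL,
      authToken: env.TURSO_AUTH_TOKEN,
    });

    // Reads are routed to the replica closest to where this Worker ran.
    const result = await client.execute("SELECT id, title FROM posts LIMIT 10");
    return Response.json(result.rows);
  },
};
```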

Cache responses from data stores with Momento

Momento Cache is a low-latency, serverless caching solution that can be layered on top of relational databases, key-value databases or object stores to get faster load times and better performance. Momento abstracts away details like scaling, warming and replication so that users can deploy a cache in a matter of minutes.

The Momento and Workers integration automatically pulls in your Momento API key using an OAuth2 flow. The Momento API key is added as a secret in Workers and, from there, you can start using the Momento SDK in Workers. Head to our developer documentation to learn more and use the Momento and Workers integration!
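
The usual pattern this enables is cache-aside in front of your data store: check the cache, fall back to the database on a miss, then populate the cache. The sketch below is illustrative only; MomentoLikeCache is a stand-in interface rather than the actual Momento SDK, and the cache name, TTL and database helper are assumptions.

```typescript
// Stand-in interface for a Momento-style cache client (not the real SDK).
interface MomentoLikeCache {
  get(cacheName: string, key: string): Promise<string | null>;
  set(cacheName: string, key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Cache-aside read: serve from cache when possible, otherwise query the
// database and populate the cache with a short TTL.
async function getUser(
  cache: MomentoLikeCache,
  db: { queryUser(id: string): Promise<string> },
  id: string
): Promise<string> {
  const cached = await cache.get("users", id);
  if (cached !== null) return cached; // cache hit: skip the data store

  const user = await db.queryUser(id); // cache miss: hit the data store
  await cache.set("users", id, user, 60);
  return user;
}
```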

Try integrations out today

We want to give you back time, so that you can focus less on configuring and connecting third party tools to Workers and spend more time building. We’re excited to see what you build with integrations. Share your projects with us on Twitter (@CloudflareDev) and stay tuned for more exciting updates as we continue to grow our Integrations Marketplace!

If you would like to build an integration with Cloudflare Workers, fill out the integration request form and we’ll be in touch.

Performance isolation in a multi-tenant database environment

Post Syndicated from Justin Kwan original https://blog.cloudflare.com/performance-isolation-in-a-multi-tenant-database-environment/

Operating at Cloudflare scale means that across the technology stack we spend a great deal of time handling different load conditions. In this blog post we talk about how we solved performance difficulties with our Postgres clusters. These clusters support a large number of tenants and highly variable load conditions, which creates the need to isolate tenant activity so that no single tenant takes an outsized share of resources from the others. Welcome to real-world, large database cluster management!

As an intern at Cloudflare, I got to work on improving how our database clusters behave under load and on open sourcing the resulting code.

Cloudflare operates production Postgres clusters across multiple regions in our data centers. Some of our earliest service offerings, such as our DNS Resolver, Firewall, and DDoS Protection, depend on our Postgres clusters’ high availability for OLTP workloads. The high availability cluster manager, Stolon, is employed across all clusters to independently control and replicate data across Postgres instances, elect Postgres leaders, and fail over under high-load scenarios.

PgBouncer and HAProxy act as the gateway layer in each cluster. Each tenant acquires client-side connections from PgBouncer instead of from Postgres directly. PgBouncer holds a bounded pool of server-side connections to Postgres, allocating them across multiple tenants to prevent Postgres connection starvation. From here, PgBouncer forwards queries to HAProxy, which load balances across the Postgres primary and read replicas.

Problem

Our multi-tenant Postgres instances operate on bare metal servers in non-containerized environments. Each backend application service is considered a single tenant, and each may use one of multiple Postgres roles. Because each cluster serves multiple tenants, all tenants share and contend for the available system resources on each cluster machine, such as CPU time, memory, and disk IO, as well as finite database resources such as server-side Postgres connections and table locks. Each tenant has a unique workload that varies in system-level resource consumption, making it impossible to enforce throttling using a single global value.

This has become problematic in production, affecting neighboring tenants:

  • Throughput: A tenant may issue a burst of transactions, starving other tenants of shared resources and degrading their performance.
  • Latency: A single tenant may issue very long or expensive queries, often concurrently, such as large table scans for ETL extraction or queries that hold lengthy table locks.

Both of these scenarios can result in degraded query execution for neighboring tenants. Their transactions may hang or take significantly longer to execute (higher latency), due to either reduced CPU share time or slower disk IO caused by the many seeks from the misbehaving tenant(s). Moreover, other tenants may be blocked from acquiring database connections at the database proxy level (PgBouncer) because existing connections are held during long and expensive queries.

Previous solution

When database cluster load significantly increases, finding which tenants are responsible is the first challenge. Techniques include searching through all tenants’ previous queries under typical system load and determining whether any new expensive queries have been introduced, using Postgres’ pg_stat_activity view.

Database concurrency throttling

Once the misbehaving tenants are identified, Postgres server-side connection limits are manually enforced using a Postgres statement like the following:

ALTER USER "some_bad-user" WITH CONNECTION LIMIT 123;

This essentially restricts or “squeezes” the concurrent throughput for a single user, so that each tenant can only exhaust its own share of connections.

Manual concurrency (connection) throttling has shown improvements in shedding load in Postgres during high production workloads.

While we have seen success with this approach, it is not perfect and is horribly manual. It also suffers from the following:

  • Postgres does not immediately kill existing tenant connections when a new user limit is set; the user may continue to issue bursty or expensive queries.
  • Tenants may still issue very expensive, resource-intensive queries (affecting neighboring tenants) even if their concurrency (connection pool size) is reduced.
  • Manually applying connection limits against a misbehaving tenant is toil; an SRE could be paged to apply the new limit at any time of the day.
  • Manually analyzing and detecting misbehaving tenants based on queries can be time-consuming and stressful, especially during an incident, and requires production SQL analysis experience.
  • Additionally, applying new throttling limits per user/pool, such as the allocated connection count, can be arbitrary and experimental while requiring extensive understanding of tenant workloads.
  • Oftentimes, Postgres may be under so much load that it begins to hang (CPU starvation). SREs may be unable to manually throttle tenants through native interfaces once a high-load situation occurs.

New solution

Gateway concurrency throttling

Typically, the system level resource consumption of a query is difficult to control and isolate once submitted to the server or database system for execution. However, a common approach is to intercept and throttle connections or queries at the gateway layer, controlling per user/pool traffic characteristics based on system resource consumption.

We have implemented connection throttling at our database proxy server/connection pooler, PgBouncer. Previously, PgBouncer’s user-level connection limits would not kill existing connections, but only prevent new connections beyond the limit. We now support the ability to throttle and kill existing connections owned by each user or each user’s connection pool, either statically via configuration or at runtime via new administrative commands.

PgBouncer Configuration

[users]
dns_service_user = max_user_connections=60
firewall_service_user = max_user_connections=80

[pools]
user1.database1 = pool_size=90

PgBouncer Runtime Commands

SET USER dns_service_user = 'max_user_connections=40';
SET POOL dns_service_user.dns_db = 'pool_size=30';

This required major bug fixes, refactoring and implementation work in our fork of PgBouncer. We’ve also raised multiple pull requests to contribute all of our features to upstream PgBouncer. To learn about all of our work on PgBouncer, read this blog.

These new features now allow for faster and more granular “load shedding” against a misbehaving tenant’s concurrency (connection pool, user and database pair), while enabling stricter performance isolation.

Future solutions

We are continuing to build infrastructure components that monitor per-tenant resource consumption and detect which tenants are misbehaving based on system resource indicators against historical baselines. We aim to automate connection and query throttling against tenants using these new administrative commands.

We are also experimenting with various automated approaches to enforce strict tenant performance isolation.

Congestion avoidance

An adaptation of the TCP Vegas congestion avoidance algorithm can be employed to adaptively estimate and enforce each tenant’s optimal concurrency while still maintaining low latency and high throughput for neighboring tenants. This approach does not require resource consumption profiling, manual threshold tuning, knowledge of underlying system hardware, or expensive computation.

Traditionally, TCP Vegas converges to the initially unknown, optimal congestion window (the maximum number of packets that can be sent concurrently). In the same spirit, we can treat the unknown congestion window as the optimal concurrency or connection pool size for database queries. At the gateway layer (PgBouncer), each tenant begins with a small connection pool size, while we dynamically sample each tenant’s transaction round trip times (RTT) against Postgres. We gradually increase the connection pool size (congestion window) of a tenant so long as their transaction RTTs do not deteriorate.

new_pool_size = current_pool_size × (min_RTT / sampled_RTT) + sqrt(current_pool_size)

When a tenant’s sampled transaction latency increases, the formula’s ratio of minimum to sampled request latency decreases, naturally reducing the tenant’s available concurrency and, with it, database load.

Essentially, this algorithm “backs off” when it observes high query latencies, treating them as the indicator of high database load, regardless of whether the latency is due to CPU time or to blocking on disk or network IO. The formula converges to the optimal concurrency limit (connection pool size), since the latency ratio approaches 0 for sufficiently large sampled request latencies. The square root of the current tenant pool size is chosen as the request “burst” headroom because it grows quickly and is relatively large for small pool sizes (when latencies are low), yet shrinks as the pool size is reduced (when latencies are high).
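
A minimal sketch of this update rule, assuming hypothetical names and leaving out the clamping and smoothing a production implementation would need:

```typescript
// Vegas-style pool size update: scale the current pool size by the ratio of
// the lowest observed transaction RTT to the most recently sampled RTT, then
// add sqrt(pool size) as burst headroom.
function nextPoolSize(
  currentPoolSize: number,
  minRttMs: number,     // lowest transaction RTT observed for this tenant
  sampledRttMs: number  // most recently sampled transaction RTT
): number {
  const gradient = minRttMs / sampledRttMs;         // ~1 when healthy, -> 0 under load
  const burstHeadroom = Math.sqrt(currentPoolSize); // allows short bursts
  const next = currentPoolSize * gradient + burstHeadroom;
  return Math.max(1, Math.floor(next));             // never drop below one connection
}

// For example, with a pool of 16 and a 5 ms baseline RTT:
//   nextPoolSize(16, 5, 5)  === 20  (healthy: grow)
//   nextPoolSize(16, 5, 10) === 12  (latency doubled: shrink)
```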

Rather than reactively shedding load, congestion avoidance preventatively or “smoothly” throttles traffic before load-induced performance degradation becomes an issue. This algorithm aims to prevent database server resource starvation, which causes other queries to hang.

Theoretically, if one tenant misbehaves and causes load-induced latency for others, this congestion algorithm may blindly and incorrectly throttle all tenants. This is why it may be necessary to apply this adaptive throttling only against tenants with a high CPU-to-latency correlation when system performance is degrading.

Tenant resource quotas

Configurable resource quotas can be introduced for each tenant. Upstream application service tenants are restricted to their allocated share of resources, expressed as CPU % utilized per second and maximum memory. If a tenant overuses its share, the database gateway (PgBouncer) should throttle its concurrency, queries per second and ingress bytes to force consumption back within its allocated slice.

Resource throttling a tenant must not “spill over” to or affect other tenants accessing the same cluster. This could otherwise reduce the availability of other customer-facing applications and violate SLOs (service-level objectives). Resource restriction must be isolated to each tenant.

If traffic against the Postgres instances is low, tenants should be permitted to exceed their allocated limit. However, when load against the cluster degrades the performance of the entire system (latency), the tenant’s limit must be re-enforced at the gateway layer, PgBouncer. We can make deductions about the health of the entire database server based on indicators such as the rate of change of average query latency against a predefined threshold. All tenants should agree that a surplus in resource consumption may result in query throttling of any pattern.
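
As a small illustration (the threshold and sampling interval are assumptions), the re-enforcement decision could be driven by how quickly average query latency is rising:

```typescript
// Decide whether to re-enforce tenant quotas based on the rate of change of
// average query latency against a predefined threshold.
function shouldEnforceQuotas(
  prevAvgLatencyMs: number,
  currAvgLatencyMs: number,
  intervalSeconds: number,
  maxLatencyGrowthMsPerSec: number
): boolean {
  const growthRate = (currAvgLatencyMs - prevAvgLatencyMs) / intervalSeconds;
  return growthRate > maxLatencyGrowthMsPerSec;
}
```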

Each tenant has a unique and variable workload, which may degrade multi-tenant performance at any time. Quick detection requires profiling the baseline resource consumption of each tenant’s (or each tenant connection pool’s) workload against each local Postgres server (backend pids) in near real time. From here, we can correlate the “baseline” traffic characteristics with system-level resource consumption per database instance.

Taking an average or generalizing statistical measures across distributed nodes (each tenant’s resource consumption on Postgres instances in this case) can be inaccurate due to high variance in traffic against leader vs replica instances. This would lead to faulty throttling decisions applied against users. For instance, we should not throttle a user’s concurrency on an idle read replica even if the user consumes excessive resources on the primary database instance. It is preferable to capture tenant consumption on a per Postgres instance level, and enforce throttling per instance rather than across the entire cluster.

Multivariable regression can be employed to model the relationship between the independent variables (concurrency, queries per second, ingress bytes) and the dependent variables (system-level resource consumption). We can then calculate and enforce the optimal independent variables per tenant under high-load scenarios. To account for workload changes, the trade-off between regression adaptability and accuracy will need to be tuned by adjusting the sliding window size (the amount of time to retain profiled data points) when capturing workload consumption.
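
To illustrate how such a model could drive throttling (the linear form, coefficient names and positive-coefficient assumption are illustrative, not a description of a system we run), the fitted coefficients can be inverted to find the largest concurrency that keeps a tenant inside its CPU quota given its observed query rate and ingress:

```typescript
// Fitted model: cpu ~ b0 + b1*concurrency + b2*qps + b3*ingressBytes,
// with coefficients produced by a sliding-window regression over profiled
// data points (b1 assumed positive).
interface TenantModel {
  b0: number;
  b1: number;
  b2: number;
  b3: number;
}

// Solve for the concurrency that keeps the tenant within its CPU quota,
// holding its observed queries per second and ingress bytes constant.
function maxConcurrencyWithinQuota(
  model: TenantModel,
  cpuQuota: number, // tenant's allocated CPU share
  observedQps: number,
  observedIngressBytes: number
): number {
  const fixedCost = model.b0 + model.b2 * observedQps + model.b3 * observedIngressBytes;
  return Math.max(1, Math.floor((cpuQuota - fixedCost) / model.b1));
}
```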

Gateway query queuing

User queries can be prioritized for submission to Postgres at the gateway layer (PgBouncer). Within one or multiple global priority queues, query submissions by all tenants are ordered based on the current resource consumption of the tenant’s connection pool or of the tenant itself. Alternatively, ordering can be based on each query’s historical resource consumption, where each query is independently profiled. Based on changes in tenant resource consumption captured from each Postgres instance’s server, all queued queries can be reordered every time the scheduler forwards a query to be submitted.

To prevent priority queue starvation (where one tenant’s query sits at the end of the queue and is never executed), gateway-level query queuing can be configured to activate only when there is peak load/traffic against the Postgres instance. Alternatively, the time at which a query was enqueued can be factored into the priority ordering.

This approach would isolate tenant performance by allowing non-offending tenants to continue reserving connections and executing queries (such as critical health monitoring queries). Higher latency would only be observed for the tenants that are utilizing more resources (from many or expensive transactions). This approach is straightforward to understand, generic in application (it can queue transactions based on other input metrics), and non-destructive: it does not kill client/server connections and should only drop queries when the in-memory priority queue reaches capacity.
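
A minimal sketch of such an ordering, assuming a hypothetical per-tenant CPU share metric and a simple linear aging factor to prevent starvation:

```typescript
interface QueuedQuery {
  tenant: string;
  enqueuedAtMs: number;
  tenantCpuShare: number; // 0..1, profiled consumption on this Postgres instance
}

// Lower score = dequeued sooner. Heavier tenants sort later, but waiting time
// gradually lowers a query's score so it can never starve indefinitely.
function priorityScore(q: QueuedQuery, nowMs: number, agingFactor = 0.001): number {
  const waitedMs = nowMs - q.enqueuedAtMs;
  return q.tenantCpuShare - agingFactor * waitedMs;
}

// Re-evaluate the ordering each time the scheduler forwards a query.
function nextQuery(queue: QueuedQuery[], nowMs: number): QueuedQuery | undefined {
  return [...queue].sort((a, b) => priorityScore(a, nowMs) - priorityScore(b, nowMs))[0];
}
```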

Conclusion

Performance isolation in our multi-tenant storage environment continues to be a very interesting challenge that touches areas including OS resource management, database internals, queueing theory, congestion algorithms and even statistics. We’d love to hear how the community has tackled the “noisy neighbor” problem by isolating tenant performance at scale!

Open sourcing our fork of PgBouncer

Post Syndicated from Justin Kwan original https://blog.cloudflare.com/open-sourcing-our-fork-of-pgbouncer/

Cloudflare operates highly available Postgres production clusters across multiple data centers, supporting the transactional workloads of our core service offerings such as our DNS Resolver, Firewall, and DDoS Protection.

Multiple PgBouncer instances sit at the front of the gateway layer in each cluster, acting as a TCP proxy that provides Postgres connection pooling. PgBouncer’s pooling enables upstream applications to connect to Postgres without having to constantly open and close connections (which is expensive) at the database level, while also reducing the number of Postgres connections used. Each tenant acquires client-side connections from PgBouncer instead of from Postgres directly.

PgBouncer holds a bounded pool of server-side connections to Postgres, allocating them across multiple tenants to prevent Postgres connection starvation. From here, PgBouncer forwards backend queries to HAProxy, which load balances across the Postgres primary and read replicas.

As an intern at Cloudflare, I got to work on improving how our database clusters behave under load and on open sourcing the resulting code.

We run our Postgres infrastructure in non-containerized, bare metal environments, which consequently leads to multi-tenant resource contention between Postgres users. To enforce stricter tenant performance isolation at the database level (CPU time utilized, memory consumption, disk IO operations), we’d like to configure and enforce connection limits per user and per connection pool at PgBouncer.

To do that we had to add features and fix bugs in PgBouncer. Rather than continue to maintain a private fork we are open sourcing our code for others to use.

Authentication Rejection

The PgBouncer connection pooler offers options to enforce server connection pool size limits (effective concurrency) per user via static configuration. However, an upstream authentication bug prevented these features from working correctly when Postgres was set to use HBA authentication. Administrators who sensibly use server-side authentication could not take advantage of these user-level features.

This ongoing issue has also been experienced by others in the open-source community:

https://github.com/pgbouncer/pgbouncer/issues/484
https://github.com/pgbouncer/pgbouncer/issues/596

Root Cause

PgBouncer needs a Postgres user’s password when proxying submitted queries from a client connection to a Postgres server connection. PgBouncer fetches a user’s Postgres password defined in userlist.txt (auth_file) when the user first logs in, to compare against the provided password. However, if the user is not defined in userlist.txt, PgBouncer fetches their password from the Postgres pg_shadow system view for comparison. This password is then used when PgBouncer subsequently forwards queries from this user to Postgres. The same applies when Postgres is configured to use HBA authentication.

Following serious debugging efforts and time spent in GDB, we found that multiple user objects are typically created for a single real user: one via configuration loading from the [users] section and one upon the user’s first login. In PgBouncer, any user requiring a shadow auth query would be stored under their respective database struct instance, whereas any user with a password defined in userlist.txt would be stored globally. Because the non-authenticated user already existed in memory after being parsed from the [users] section, PgBouncer assumed that the user was defined in userlist.txt and that the shadow authentication query could be skipped. It would not bother to fetch and set the user’s password upon first login, resulting in an empty user password. This is why subsequent queries submitted by the user would be rejected with an authentication failure at Postgres.

To solve this, we simplified the code to store all users globally in one place, rather than storing different types of users (requiring different methods of authentication) in a disaggregated fashion, per database or globally. Also, rather than assuming a user is authenticated merely because they exist in memory, we keep track of whether the user requires authentication via an auth query or by fetching their password from userlist.txt, depending on how they were created.

We saw the value in troubleshooting and fixing these issues; it would unlock an entire class of features in PgBouncer for our use cases, while benefiting many in the open-source community.

New Features

We’ve also done work to implement and support additional features in PgBouncer to enforce stricter tenant performance isolation.

Previously, PgBouncer would only prevent tenants from exceeding preconfigured limits, which is not particularly helpful when it’s too late and a user is already misbehaving or holding too many connections. PgBouncer now supports enforcing or shrinking per-user connection pool limits at runtime, where it is most critically needed: to throttle tenants who are issuing a burst of expensive queries or hogging connections from other tenants. We’ve also implemented new administrative commands to throttle the maximum connections per user or per pool at runtime.

PgBouncer also now supports statically configuring and dynamically enforcing connection limits per connection pool. This feature is extremely important for granularly throttling a tenant’s misbehaving connection pool without throttling, and reducing the availability of, its other well-behaved pools.

PgBouncer Configuration

[users]
dns_service_user = max_user_connections=60
firewall_service_user = max_user_connections=80

[pools]
user1.database1 = pool_size=90

PgBouncer Runtime Commands

SET USER dns_service_user = 'max_user_connections=40';
SET POOL dns_service_user.dns_db = 'pool_size=30';

These new features required major refactoring around how PgBouncer stores users and databases, how it weakly references and stores passwords for different types of users, and how we enforce killing server-side connections while they are still in use.

Conclusion

We are committed to improving PgBouncer in open source and contributing all of our features to benefit the wider community. If you are interested, please consider contributing to our open source PgBouncer fork. After all, it is the community that makes PgBouncer possible!