All posts by Brian Batraski

Cloudflare Waiting Room

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/cloudflare-waiting-room/

Cloudflare Waiting Room

Cloudflare Waiting Room

Today, we are excited to announce Cloudflare Waiting Room! It will first be available to select customers through a new program called Project Fair Shot which aims to help with the problem of overwhelming demand for COVID-19 vaccinations causing appointment registration websites to fail. General availability in our Business and Enterprise plans will be added in the near future.

Wait, you’re excited about a… Waiting Room?

Most of us are familiar with the concept of a waiting room, and rarely are we excited about the idea of being in one. Usually our first experience of one is at a doctor’s office — yes, you have an appointment, but sometimes the doctor is running late (or one of the patients was). Given the doctor can only see one person at a time… the waiting room was born, as a mechanism to queue up patients.

While servers can handle more concurrent requests than a doctor can, they too can be overwhelmed. If, in a pre-COVID world, you’ve ever tried buying tickets to a popular concert or event, you’ve probably encountered a waiting room online. It limits requests inbound to an application, and places these requests into a virtual queue. Once the number of users in the application has reduced, new users are let in within the defined thresholds the application can handle. This protects the origin servers supporting the application from being inundated with too many requests, while also ensuring equity from a user perspective — users who try to access a resource when the system is overloaded are not unfairly dropped and forced to reconnect, hoping to join their chance in the queue.

Why Now?

Given not many of us are going to live concerts any time soon, why is Cloudflare doing this now?

Well, perhaps we aren’t going to concerts, but the second order effects of COVID-19 have created a huge need for waiting rooms. First of all, given social distancing and the closing of many places of business and government, customers and citizens have shifted to online channels, putting substantially more strain on business and government infrastructure.

Second, the pandemic and the flow-on consequences of it have meant many folks around the world have come to rely on resources that they didn’t need twelve months earlier. To be specific, these are often health or government-related resources — for example, unemployment insurance websites. The online infrastructure was set up to handle a peak load that didn’t foresee the impact of COVID-19. We’re seeing a similar pattern emerge with websites that are related to vaccines.

Historically, the number of organizations that needed waiting rooms was quite small. The nature of most businesses online usually involve a more consistent user load, rather than huge crushes of people all at once. Those organizations were able to build custom waiting rooms and were integrated deeply into their application (for example, buying tickets).  With Cloudflare’s Waiting Room, no code changes to the application are necessary and a Waiting Room can be set up in a matter of minutes for any website without writing a single line of code.

Whether you are an engineering architect or a business operations analyst, setting up a Waiting Room is simple. We make it quick and easy to ensure your applications are reliable and protected from unexpected spikes in traffic.  Other features we felt were important are automatic enablement and dynamic outflow. In other words, a waiting room should turn on automatically when thresholds are exceeded and as users finish their tasks in the application, let out different sized buckets of users and intake new ones already in the queue. It should just work. Lastly, we’ve seen the major impact COVID-19 has made on users and businesses alike, especially, but not limited to, the health and government sectors. We wanted to provide another way to ensure these applications remain available and functional so all users can receive the care that they need and not errors within their browser.

How does Cloudflare’s Waiting Room work?

We built Waiting Room on top of our edge network and our Workers product. By leveraging Workers and our new Durable Objects offerings, we were able to remove the need for any customer coding and provide a seamless, out of the box product that will ‘just work’. On top of this, we get the benefits of the scale and performance of our Workers product to ensure we maintain extremely low latency overhead, keep estimated times presented to end users accurate as can be and not keep any user in the queue longer than needed. But building a centralized system in a decentralized network is no easy task. When requests come into an application from around the world, we need to be able to get a broad, accurate view of what that load looks like inbound and outbound to a given application.

Cloudflare Waiting Room
Request going through Cloudflare without a Waiting Room

These requests, as fast as they are, still take time to travel across the planet. And so, a unique edge case was presented. What if a website is getting reasonable traffic from North America and Europe, but then a sudden major spike of traffic takes place from South America – how do we know when to keep letting users into the application and when to kick in the Waiting Room to protect the origin servers from being overloaded?

Thanks to some clever engineering and our Workers product, we were able to create a system that almost immediately keeps itself synced with global demand to an application giving us the necessary insight into when we should and should not be queueing users into the Waiting Room. By leveraging our global Anycast network and over 200+ data centers, we remove any single point of failure to protect our customers’ infrastructure yet also provide a great experience to end-users who have to wait a small amount of time to enter the application under high load.

Cloudflare Waiting Room
Request going through Cloudflare with a Waiting Room

How to setup a Waiting Room

Setting up a Waiting Room is incredibly easy and very fast! At the easiest side of the scale, a user needs to fill out only five fields: 1) the name of the Waiting Room, 2) a hostname (which will already be pre-populated with the zone it’s being configured on), 3) the total active users that can be in the application at any given time, 4) the new users per minute allowed into the application, and 5) the session duration for any given user. No coding or any application changes are necessary.

Cloudflare Waiting Room

We provide the option of using our default Waiting Room template for customers who don’t want to add additional branding. This simplifies the process of getting a Waiting Room up and running.

Cloudflare Waiting Room

That’s it! Press save and the Waiting Room is ready to go!

Cloudflare Waiting Room

For customers with more time and technical ability, the same process is followed, except we give full customization capabilities to our users so they can brand the Waiting Room, ensuring it matches the look and feel of their overall product.

Cloudflare Waiting Room

Lastly, managing different Waiting Rooms is incredibly easy. With our Manage Waiting Room table, at a glance you are able to get a full snapshot of which rooms are actively queueing, not queueing, and/or disabled.

Cloudflare Waiting Room

We are very excited to put the power of our Waiting Room into the hands of our customers to ensure they continue to focus on their businesses and customers. Keep an eye out for another blog post coming soon with major updates to our Waiting Room product for Enterprise!

Health Check Analytics and how you can use it

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/health-check-analytics-and-how-you-can-use-it/

Health Check Analytics and how you can use it

At the end of last year, we introduced Standalone Health Checks – a service that lets you monitor the health of your origin servers and avoid the need to purchase additional third party services. The more that can be controlled from Cloudflare decreases maintenance cost, vendor management, and infrastructure complexity. This is important as it ensures you are able to scale your infrastructure seamlessly as your company grows. Today, we are introducing Standalone Health Check Analytics to help decrease your time to resolution for any potential issues. You can find Health Check Analytics in the sub-menu under the Traffic tab in your Cloudflare Dashboard.

Health Check Analytics and how you can use it

As a refresher, Standalone Health Checks is a service that monitors an IP address or hostname for your origin servers or application and notifies you in near real-time if there happens to be a problem. These Health Checks support fine-tuned configurations based on expected codes, interval, protocols, timeout and more. These configurations enable you to properly target your checks based on the unique setup of your infrastructure. An example of a Health Check can be seen below which is monitoring an origin server in a staging environment with a notification set via email.

Health Check Analytics and how you can use it

Once you set up a notification, you will be alerted when there is a change in the health of your origin server. In the example above, if your staging environment starts responding with anything other than a 200 OK response code, we will send you an email within seconds so you can take the necessary action before customers are impacted.

Introducing Standalone Health Check Analytics

Once you get the notification email, we provide tools that help to quickly debug the possible cause of the issue with detailed logs as well as data visualizations enabling you to better understand the context around the issue. Let’s walk through a real-world scenario and see how Health Check Analytics helps decrease our time to resolution.

A notification email has been sent to you letting you know that Staging is unhealthy. You log into your dashboard and go into Health Check Analytics for this particular Health Check. In the screenshot below, you can see that Staging is up 76% of the time vs 100% of the time for Production. Now that we see the graph validating the email notification that there is indeed a problem, we need to dig in further.  Below the graph you can see a breakdown of the type of errors that have taken place in both the Staging and Production addresses over the specified time period. We see there is only one error taking place, but very frequently, in the staging environment – a TCP Connection Failed error, leading to the lower availability.

This starts to narrow the funnel for what the issue could be. We know that there is something wrong with either the Staging server’s ability to receive connections, maybe an issue during the SYN-ACK handshake, or possibly an issue with the router being used and not an issue with the origin server at all but instead receiving a down-stream consequence. With this information, you can quickly make the necessary checks to validate your hypothesis and minimize your time to resolution. Instead of having to dig through endless logs, or try to make educated guesses at where the issue could stem from, Health Check Analytics allows you to quickly hone in on detailed areas that could be the root cause. This in turn maximizes your application reliability but more importantly, keeps trust and brand expectation with your customers.

Health Check Analytics and how you can use it

Being able to quickly understand an overview of your infrastructure is important, but sometimes being able to dig deeper into each healthcheck can be more valuable to understand what is happening at a granular level. In addition to general information like address, response-code, round trip time (RTT) and failure reason,  we are adding more features to help you understand the Health Check result(s). We have also added extra information into the event table so you can quickly understand a given problem. In the case of a Response Code Mismatch Error, we now provide the expected response code for a given Health Check along with the received code. This removes the need to go back and check the configuration that may have been setup long ago and keep focus on the problem at hand.

Health Check Analytics and how you can use it

The availability of different portions of your infrastructure is very important, but this does not provide the complete view. Performance is continuing to skyrocket in importance and value to customers. If an application is not performant, they will quickly go to a competitor without a second thought. Sometimes RTT is not enough to understand why requests have higher latency and where the root of the issue may reside. To better understand where time is spent for a given request, we are introducing the waterfall view of a request within the Event Log. With this view, you can understand the time taken for the TCP connection, time taken for the TLS handshake, and the time to first byte (TTFB) for the request. The waterfall will give you a chronological idea about time spent in different stages of the request.

  1. Time taken for establishing the initial TCP connection.(in dark blue, approx 41ms)
  2. Once the TCP connection is established, time is spent doing the TLS handshake. This is another component that takes up time for HTTPS websites. (in light blue, approx 80ms)
  3. Once the SYN-ACK handshake and connection is complete, then the time taken for the first byte to be received is also exposed. (in dark orange, approx 222ms)
  4. The total round trip time (RTT) is the time taken to load the complete page once the TLS handshake is complete. The difference between the RTT and the TTFB gives you the time spent downloading content from a page. If your page has a large amount of content, the difference between TTFB and RTT will be high. (in light orange, approx 302ms). The page load time is approximately 80 ms for the address.

Using the information above lends to a number of steps that can be taken for the website owner. The delay in initial TCP connection time could be decreased by making the website available in different geo locations around the world. This could also reduce the time for TLS handshake as each round trip will be faster. Another thing that is visible is the page load time of 80ms. This might be because of the contents of the page, maybe compression can be applied on the server side to make the load time better or unnecessary content can be removed. The information in the waterfall view can also tell if an additional external library increases the time to load the website after a release.

Cloudflare has over 200 edge locations around the world making it one of the largest Anycast networks on the planet. When a health check is configured, it can be run across the different regions on the Cloudflare infrastructure, enabling you to see the variation in latency around the world for specific Health Checks.

Health Check Analytics and how you can use it
Waterfall from India
Health Check Analytics and how you can use it
Waterfall from Western North America‌‌

Based on the new information provided from Health Check Analytics, users can definitively validate that the address performs better from Western North America compared to India due to the page load time and overall RTT.

How do health checks run?

To understand and decipher the logs that are found in the analytics dashboard, it is important to understand how Cloudflare runs the Health Checks. Cloudflare has data centers in more than 200 cities across 90+ countries throughout the world [more]. We don’t run health checks from every single of these data centers (that would be a lot of requests to your servers!). Instead, we let you pick between one and thirteen regions from which to run health checks [Regions].

The Internet is not the same everywhere around the world. So your users may not have the same experience on your application according to where they are. Running Health Checks from different regions lets you know the health of your application from the point of view of the Cloudflare network in each of these regions.

Imagine you configure a Health Check from two regions, Western North America and South East Asia, at an interval of 10 seconds. You may have been expecting to get two requests to your origin server every 10 seconds, but if you look at your server’s logs you will see that you are actually getting six. That is because we send Health Checks not just from one location in each region but three.

Health Check Analytics and how you can use it

For a health check configured from All Regions (thirteen regions) there will be 39 requests to your server per configured interval.

You may wonder: ‘Why do you probe from multiple locations within a region?’ We do this to make sure the health we report represents the overall performance of your service as seen from that region. Before we report a change, we check that at least two locations agree on the status. We added a third one to make sure that the system keeps running even if there is an issue at one of our locations.

Conclusion

Health Check Analytics is now live and available to all Pro, Business and Enterprise customers!  We are very excited to help decrease your time to resolution and ensure your application reliability is maximised.

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/adding-the-fallback-pool-to-the-load-balancing-ui/

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

The Cloudflare Load Balancer was introduced over three years ago to provide our customers with a powerful, easy to use tool to intelligently route traffic to their origins across the world. During the initial design process, one of the questions we had to answer was ‘where do we send traffic if all pools are down?’ We did not think it made sense just to drop the traffic, so we used the concept of a ‘fallback pool’ to send traffic to a ‘pool of last resort’ in the case that no pools were detected as available. While this may still result in an error, it gave an eyeball request a chance at being served successfully in case the pool was still up.

As a brief reminder, a load balancer helps route traffic across your origin servers to ensure your overall infrastructure stays healthy and available. Load Balancers are made up of pools, which can be thought of as collections of servers in a particular location.

Over the past three years, we’ve made many updates to the dashboard. The new designs now support the fallback pool addition to the dashboard UI. The use of a fallback pool is incredibly helpful in a tight spot, but not having it viewable in the dashboard led to confusion around which pool was set as the fallback. Was there a fallback pool set at all? We want to be sure you have the tools to support your day-to-day work, while also ensuring our dashboard is usable and intuitive.

You can now check which pool is set as the fallback in any given Load Balancer, along with being able to easily designate any pool in the Load Balancer as the fallback. If no fallback pool is set, then the last pool in the list will automatically be chosen. We made the decision to auto-set a pool to be sure that customers are always covered in case the worst scenario happens. You can access the fallback pool within the Traffic App of the Cloudflare dashboard when creating or editing a Load Balancer.

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

Load Balancing UI Improvements

Not only did we add the fallback pool to the UI, but we saw this as an opportunity to update other areas of the Load Balancing app that have caused some confusion in the past.

Facelift and De-modaling

As a start, we gave the main Load Balancing page a face lift as well as de-modaling (moving content out of a smaller modal screen into a larger area) the majority of the Load Balancing UI. We felt moving this content out of a small web element would allow users to more easily understand the content on the page and allow us to better use the larger available space rather than being limited to the small area of a modal. This change has been applied when you create or edit a Load Balancer and manage monitors and/or pools.

Before:

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

After:

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

The updated UI has combined the health status and icon to declutter the available space and make it clear at a glance what the status is for a particular Load Balancer or Pool. We have also updated to a smaller toggle button across the Load Balancing UI, which allows us to update the action buttons with the added margin space gained. Now that we are utilizing the page surface area more efficiently, we moved forward to add more information in our tables so users are more aware of the shared aspects of their Load Balancer.

Shared Objects and Editing

Shared objects have caused some level of concern for companies who have teams across the world – all leveraging the Cloudflare dashboard.

Some of the shared objects, Monitors and Pools, have a new column added outlining which Pools or Load Balancers are currently in use by a particular Monitor or Pool. This brings more clarity around what will be affected by any changes made by someone from your organization. This supports users to be more autonomous and confident when they make an update in the dashboard. If someone from team A wants to update a monitor for a production server, they can do so without the worry of monitoring for another pool possibly breaking or have to speak to team B first. The time saved and empowerment to make updates as things change in your business is incredibly valuable. It supports velocity you may want to achieve while maintaining a safe environment to operate in. The days of having to worry about unforeseen consequences that could crop up later down the road are swiftly coming to a close.

This helps teams understand the impact of a given change and what else would be affected. But, we did not feel this was enough. We want to be sure that everyone is confident in the changes they are making. On top of the additional columns, we added in a number of confirmation modals to drive confidence about a particular change. Also, a list in the modal of the other Load Balancers or Pools that would be impacted. We really wanted to drive the message home around which objects are shared: we made a final change to allow edits of monitors to take place only within the Manage Monitors page. We felt that having users navigate to the manage page in itself gives more understanding that these items are shared. For example, allowing edits to a Monitor in the same view of editing a Load Balancer can make it seem like those changes are only for that Load Balancer, which is not always the case.

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

Manage Monitors before:

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

Manage monitors after:

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

Updated CTAs/Buttons

Lastly, when users would expand the Manage Load Balancer table to view more details about their Pools or Origins within that specific Load Balancer, they would click the large X icon in the top right of that expanded card to close it – seems reasonable in the expanded context.

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

But, the X icon did not close the expanded card, but rather deleted the Load Balancer altogether. This is dangerous and we want to prevent users from making mistakes. With the added space we gained from de-modaling large areas of the UI, we have updated these buttons to be clickable text buttons that read ‘Edit’ or ‘Delete’ instead of the icon buttons. The difference is providing clearly defined text around the action that will take place, rather than leaving it up to a users interpretation of what the icon on the button means and the action it would result in. We felt this was much clearer to users and not be met with unwanted changes.

We are very excited about the updates to the Load Balancing dashboard and look forward to improving day in and day out.

After:

Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements

New tools to monitor your server and avoid downtime

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/new-tools-to-monitor-your-server-and-avoid-downtime/

New tools to monitor your server and avoid downtime

New tools to monitor your server and avoid downtime

When your server goes down, it’s a big problem. Today, Cloudflare is introducing two new tools to help you understand and respond faster to origin downtime — plus, a new service to automatically avoid downtime.

The new features are:

  • Standalone Health Checks, which notify you as soon as we detect problems at your origin server, without needing a Cloudflare Load Balancer.
  • Passive Origin Monitoring, which lets you know when your origin cannot be reached, with no configuration required.
  • Zero-Downtime Failover, which can automatically avert failures by retrying requests to origin.

Standalone Health Checks

Our first new tool is Standalone Health Checks, which will notify you as soon as we detect problems at your origin server — without needing a Cloudflare Load Balancer.

A Health Check is a service that runs on our edge network to monitor whether your origin server is online. Health Checks are a key part of our load balancing service because they allow us to quickly and actively route traffic to origin servers that are live and ready to serve requests. Standalone Health Checks allow you to monitor the health of your origin even if you only have one origin or do not yet need to balance traffic across your infrastructure.

We’ve provided many dimensions for you to hone in on exactly what you’d like to check, including response code, protocol type, and interval. You can specify a particular path if your origin serves multiple applications, or you can check a larger subset of response codes for your staging environment. All of these options allow you to properly target your Health Check, giving you a precise picture of what is wrong with your origin.

New tools to monitor your server and avoid downtime

If one of your origin servers becomes unavailable, you will receive a notification letting you know of the health change, along with detailed information about the failure so you can take action to restore your origin’s health.  

Lastly, once you’ve set up your Health Checks across the different origin servers, you may want to see trends or the top unhealthy origins. With Health Check Analytics, you’ll be able to view all the change events for a given health check, isolate origins that may be top offenders or not performing at par, and move forward with a fix. On top of this, in the near future, we are working to provide you with access to all Health Check raw events to ensure you have the detailed lens to compare Cloudflare Health Check Event logs against internal server logs.

New tools to monitor your server and avoid downtime

Users on the Pro, Business, or Enterprise plan will have access to Standalone Health Checks and Health Check Analytics to promote top-tier application reliability and help maximize brand trust with their customers. You can access Standalone Health Checks and Health Check Analytics through the Traffic app in the dashboard.

Passive Origin Monitoring

Standalone Health Checks are a super flexible way to understand what’s happening at your origin server. However, they require some forethought to configure before an outage happens. That’s why we’re excited to introduce Passive Origin Monitoring, which will automatically notify you when a problem occurs — no configuration required.

Cloudflare knows when your origin is down, because we’re the ones trying to reach it to serve traffic! When we detect downtime lasting longer than a few minutes, we’ll send you an email.

Starting today, you can configure origin monitoring alerts to go to multiple email addresses. Origin Monitoring alerts are available in the new Notification Center (more on that below!) in the Cloudflare dashboard:

New tools to monitor your server and avoid downtime

Passive Origin Monitoring is available to customers on all Cloudflare plans.

Zero-Downtime Failover

What’s better than getting notified about downtime? Never having downtime in the first place! With Zero-Downtime Failover, we can automatically retry requests to origin, even before Load Balancing kicks in.

How does it work? If a request to your origin fails, and Cloudflare has another record for your origin server, we’ll just try another origin within the same HTTP request. The alternate record could be either an A/AAAA record configured via Cloudflare DNS, or another origin server in the same Load Balancing pool.

Consider an website, example.com, that has web servers at two different IP addresses: 203.0.113.1 and 203.0.113.2. Before Zero-Downtime Failover, if 203.0.113.1 becomes unavailable, Cloudflare would attempt to connect, fail, and ultimately serve an error page to the user. With Zero-Downtime Failover, if 203.0.113.1 cannot be reached, then Cloudflare’s proxy will seamlessly attempt to connect to 203.0.113.2. If the second server can respond, then Cloudflare can avert serving an error to example.com’s user.

Since we rolled Zero-Downtime Failover a few weeks ago, we’ve prevented tens of millions of requests per day from failing!

Zero-Downtime Failover works in conjunction with Load Balancing, Standalone Health Checks, and Passive Origin Monitoring to keep your website running without a hitch. Health Checks and Load Balancing can avert failure, but take time to kick in. Zero-Downtime failover works instantly, but adds latency on each connection attempt. In practice, Zero-Downtime Failover is helpful at the start of an event, when it can instantly recover from errors; once a Health Check has detected a problem, a Load Balancer can then kick in and properly re-route traffic. And if no origin is available, we’ll send an alert via Passive Origin Monitoring.

To see an example of this in practice, consider an incident from a recent customer. They saw a spike in errors at their origin that would ordinarily cause availability to plummet (red line), but thanks to Zero-Downtime failover, their actual availability stayed flat (blue line).

New tools to monitor your server and avoid downtime

During a 30 minute time period, Zero-Downtime Failover improved overall availability from 99.53% to 99.98%, and prevented 140,000 HTTP requests from resulting in an error.

It’s important to note that we only attempt to retry requests that have failed during the TCP or TLS connection phase, which ensures that HTTP headers and payload have not been transmitted yet. Thanks to this safety mechanism, we’re able to make Zero-Downtime Failover Cloudflare’s default behavior for Pro, Business, and Enterprise plans. In other words, Zero-Downtime Failover makes connections to your origins more reliable with no configuration or action required.

Coming soon: more notifications, more flexibility

Our customers are always asking us for more insights into the health of their critical edge infrastructure. Health Checks and Passive Origin monitoring are a significant step towards Cloudflare taking a proactive instead of reactive approach to insights.

To support this work, today we’re announcing the Notification Center as the central place to manage notifications. This is available in the dashboard today, accessible from your Account Home.

From here, you can create new notifications, as well as view any existing notifications you’ve already set up. Today’s release allows you to configure  Passive Origin Monitoring notifications, and set multiple email recipients.

New tools to monitor your server and avoid downtime

We’re excited about today’s launches to helping our customers avoid downtime. Based on your feedback, we have lots of improvements planned that can help you get the timely insights you need:

  • New notification delivery mechanisms
  • More events that can trigger notifications
  • Advanced configuration options for Health Checks, including added protocols, threshold based notifications, and threshold based status changes.
  • More ways to configure Passive Health Checks, like the ability to add thresholds, and filter to specific status codes

Introducing Load Balancing Analytics

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/introducing-load-balancing-analytics/

Introducing Load Balancing Analytics

Introducing Load Balancing Analytics

Cloudflare aspires to make Internet properties everywhere faster, more secure, and more reliable. Load Balancing helps with speed and reliability and has been evolving over the past three years.

Let’s go through a scenario that highlights a bit more of what a Load Balancer is and the value it can provide.  A standard load balancer comprises a set of pools, each of which have origin servers that are hostnames and/or IP addresses. A routing policy is assigned to each load balancer, which determines the origin pool selection process.

Let’s say you build an API that is using cloud provider ACME Web Services. Unfortunately, ACME had a rough week, and their service had a regional outage in their Eastern US region. Consequently, your website was unable to serve traffic during this period, which resulted in reduced brand trust from users and missed revenue. To prevent this from happening again, you decide to take two steps: use a secondary cloud provider (in order to avoid having ACME as a single point of failure) and use Cloudflare’s Load Balancing to take advantage of the multi-cloud architecture. Cloudflare’s Load Balancing can help you maximize your API’s availability for your new architecture. For example, you can assign health checks to each of your origin pools. These health checks can monitor your origin servers’ health by checking HTTP status codes, response bodies, and more. If an origin pool’s response doesn’t match what is expected, then traffic will stop being steered there. This will reduce downtime for your API when ACME has a regional outage because traffic in that region will seamlessly be rerouted to your fallback origin pool(s). In this scenario, you can set the fallback pool to be origin servers in your secondary cloud provider. In addition to health checks, you can use the ‘random’ routing policy in order to distribute your customers’ API requests evenly across your backend. If you want to optimize your response time instead, you can use ‘dynamic steering’, which will send traffic to the origin determined to be closest to your customer.

Our customers love Cloudflare Load Balancing, and we’re always looking to improve and make our customers’ lives easier. Since Cloudflare’s Load Balancing was first released, the most popular customer request was for an analytics service that would provide insights on traffic steering decisions.

Today, we are rolling out Load Balancing Analytics in the Traffic tab of the Cloudflare  dashboard. The three major components in the analytics service are:

  • An overview of traffic flow that can be filtered by load balancer, pool, origin, and region.
  • A latency map that indicates origin health status and latency metrics from Cloudflare’s global network spanning 194 cities and growing!
  • Event logs denoting changes in origin health. This feature was released in 2018 and tracks pool and origin transitions between healthy and unhealthy states. We’ve moved these logs under the new Load Balancing Analytics subtab. See the documentation to learn more.

In this blog post, we’ll discuss the traffic flow distribution and the latency map.

Traffic Flow Overview

Our users want a detailed view into where their traffic is going, why it is going there, and insights into what changes may optimize their infrastructure. With Load Balancing Analytics, users can graphically view traffic demands on load balancers, pools, and origins over variable time ranges.

Understanding how traffic flow is distributed informs the process of creating new origin pools, adapting to peak traffic demands, and observing failover response during origin pool failures.

Introducing Load Balancing Analytics
Figure 1

In Figure 1, we can see an overview of traffic for a given domain. On Tuesday, the 24th, the red pool was created and added to the load balancer. In the following 36 hours, as the red pool handled more traffic, the blue and green pool both saw a reduced workload. In this scenario, the traffic distribution graph did provide the customer with new insights. First, it demonstrated that traffic was being steered to the new red pool. It also allowed the customer to understand the new level of traffic distribution across their network. Finally, it allowed the customer to confirm whether traffic decreased in the expected pools. Over time, these graphs can be used to better manage capacity and plan for upcoming infrastructure needs.

Latency Map

The traffic distribution overview is only one part of the puzzle. Another essential component is understanding request performance around the world. This is useful because customers can ensure user requests are handled as fast as possible, regardless of where in the world the request originates.

The standard Load Balancing configuration contains monitors that probe the health of customer origins. These monitors can be configured to run from a particular region(s) or, for Enterprise customers, from all Cloudflare locations. They collect useful information, such as round-trip time, that can be aggregated to create the latency map.

The map provides a summary of how responsive origins are from around the world, so customers can see regions where requests are underperforming and may need further investigation. A common metric used to identify performance is request latency. We found that the p90 latency for all Load Balancing origins being monitored is 300 milliseconds, which means that 90% of all monitors’ health checks had a round trip time faster than 300 milliseconds. We used this value to identify locations where latency was slower than the p90 latency seen by other Load Balancing customers.

Introducing Load Balancing Analytics
Figure 2

In Figure 2, we can see the responsiveness of the Northeast Asia pool. The Northeast Asia pool is slow specifically for monitors in South America, the Middle East, and Southern Africa, but fast for monitors that are probing closer to the origin pool. Unfortunately, this means users for the pool in countries like Paraguay are seeing high request latency. High page load times have many unfortunate consequences: higher visitor bounce rate, decreased visitor satisfaction rate, and a lower search engine ranking. In order to avoid these repercussions, a site administrator could consider adding a new origin pool in a region closer to underserved regions. In Figure 3, we can see the result of adding a new origin pool in Eastern North America. We see the number of locations where the domain was found to be unhealthy drops to zero and the number of slow locations cut by more than 50%.

Introducing Load Balancing Analytics
Figure 3

Tied with the traffic flow metrics from the Overview page, the latency map arms users with insights to optimize their internal systems, reduce their costs, and increase their application availability.

GraphQL Analytics API

Behind the scenes, Load Balancing Analytics is powered by the GraphQL Analytics API. As you’ll learn later this week, GraphQL provides many benefits to us at Cloudflare. Customers now only need to learn a single API format that will allow them to extract only the data they require. For internal development, GraphQL eliminates the need for customized analytics APIs for each service, reduces query cost by increasing cache hits, and reduces developer fatigue by using a straightforward query language with standardized input and output formats. Very soon, all Load Balancing customers on paid plans will be given the opportunity to extract insights from the GraphQL API.  Let’s walk through some examples of how you can utilize the GraphQL API to understand your Load Balancing logs.

Suppose you want to understand the number of requests the pools for a load balancer are seeing from the different locations in Cloudflare’s global network. The query in Figure 4 counts the number of unique (location, pool ID) combinations every fifteen minutes over the course of a week.

Introducing Load Balancing Analytics
Figure 4

For context, our example load balancer, lb.example.com, utilizes dynamic steering. Dynamic steering directs requests to the most responsive, available, origin pool, which is often the closest. It does so using a weighted round-trip time measurement. Let’s try to understand why all traffic from Singapore (SIN) is being steered to our pool in Northeast Asia (asia-ne). We can run the query in Figure 5. This query shows us that the asia-ne pool has an avgRttMs value of 67ms, whereas the other two pools have avgRttMs values that exceed 150ms. The lower avgRttMs value explains why traffic in Singapore is being routed to the asia-ne pool.

Introducing Load Balancing Analytics
Figure 5

Notice how the query in Figure 4 uses the loadBalancingRequestsGroups schema, whereas the query in Figure 5 uses the loadBalancingRequests schema. loadBalancingRequestsGroups queries aggregate data over the requested query interval, whereas loadBalancingRequests provides granular information on individual requests. For those ready to get started, Cloudflare has written a helpful guide. The GraphQL website is also a great resource. We recommend you use an IDE like GraphiQL to make your queries. GraphiQL embeds the schema documentation into the IDE, autocompletes, saves your queries, and manages your custom headers, all of which help make the developer experience smoother.

Conclusion

Now that the Load Balancing Analytics solution is live and available to all Pro, Business, Enterprise customers, we’re excited for you to start using it! We’ve attached a survey to the Traffic overview page, and we’d love to hear your feedback.