Know When You’ve Been DDoS’d

Post Syndicated from Omer Yoachimik original https://blog.cloudflare.com/announcing-ddos-alerts/

Today we’re announcing the availability of DDoS attack alerts. The alerts are available free to all Cloudflare customers on paid plans.

Unmetered DDoS protection

Last week we celebrated Cloudflare’s 10th birthday in what we call Birthday Week. Every year, on each day of Birthday Week, we announce a new product with the goal of helping make the Internet a better place — one that is safer and faster. To do that, over the years we’ve democratized many products that were previously only available to large enterprises by making them available for free (or at very low cost) to all. For example, on Cloudflare’s 7th birthday in 2017, we announced free unmetered DDoS protection as part of every Cloudflare product and every plan, including the free plan.

DDoS attacks aim to take down websites or online services and make them unavailable to the public. We wanted to make sure that every organization and every website is available and accessible, regardless of whether it can afford enterprise-grade DDoS protection. This has been a core part of our mission. We’ve been heavily investing in our DDoS protection capabilities over the last 10 years, and we will continue to do so in the future.

Real-time DDoS attack alerts

I’ve recently published a few blog posts that provide a look under the hood of our DDoS protection systems. These systems run autonomously: they detect and mitigate attacks without any human intervention, as was the case with the 654 Gbps attack in July and the 754 Mpps attack in June. We’ve been successful at blocking DDoS attacks and providing our users with important analytics and insights about the attacks, but our customers also want to be notified in real time when they are targeted.

So today, we’re excited to announce the availability of DDoS alerts. The current delivery methods by Cloudflare plan type are listed in the table below. Additional delivery methods will be made available in the future.

Delivery methods by plan

Delivery method | Free | Pro | Business | Enterprise
Email           | —    | ✓   | ✓        | ✓
PagerDuty       | —    | —   | ✓        | ✓

There are two types of DDoS alerts: HTTP DDoS alerts and L3/4 DDoS alerts. Whether you are eligible for one or both depends on the Cloudflare services you are subscribed to. The table below lists the alert types by Cloudflare service.

Alert types by service

Alert type       | WAF/CDN | Spectrum    | Spectrum BYOIP | Magic Transit
HTTP DDoS alerts | ✓       | —           | —              | —
L3/4 DDoS alerts | —       | Coming soon | Coming soon    | ✓

Creating a DDoS alert policy

In order to receive alerts on DDoS attacks that target your Cloudflare-protected Internet property, you must first create a notification policy. That’s fast and easy:

  1. Log in to your Cloudflare account dashboard: https://dash.cloudflare.com
  2. In the Account Home page, navigate to the Notifications tab
  3. In the Notifications card, click Create
  4. Give your notification a name, add an optional description, and enter the email addresses of the recipients.

If you are on the Business plan or higher, you’ll need to connect to PagerDuty before creating the alert policy. Once you’ve done so, you’ll have the option to send the alert to your PagerDuty service.

Receive the alert, view the attack, and give feedback

When developing and designing the alert template, we interviewed many of our customers to understand what information is important to them and what would make the alert useful and easy to understand. We’ve intentionally kept the alert short. The email subject is also straightforward: DDoS Attack Detected, and the alert will only be sent from our official Cloudflare notification address. Add that address to your list of trusted senders to ensure you don’t miss the alerts.

The alert includes the following information:

  1. A short description of what happened
  2. The date and time the attack was initially detected and mitigated by our systems
  3. The attack type
  4. The max rate of the attack when the alert was triggered
  5. The attack target

The attack may still be ongoing when you receive the alert, so we also include a link to view the attack in the Cloudflare dashboard and a link to provide feedback on the protection and visibility.

We’d love to get your feedback!

We’d love your feedback on our DDoS protection solution. When you receive a DDoS alert, you’ll be provided with a link to submit feedback on the protection and on the attack analytics we provide in the dashboard. User satisfaction is one of the main Key Performance Indicators (KPIs) we monitor closely for our DDoS protection service, and measuring it helps us build better products. So give your feedback, and help us make DDoS protection better for everyone.

Not a Cloudflare customer yet? Sign up to get started.

Birthday week: Cloudflare turns 10

Post Syndicated from James Allworth original https://blog.cloudflare.com/birthday-week-cloudflare-turns-10/

2020 marks a major milestone for Cloudflare: it’s our 10th birthday.

We’ve always used birthdays as an opportunity to give back to the Internet. But this year — a year in which the Internet has been so central to giving us all some degree of connectedness and normalcy — it feels like giving back to the Internet has been more important than ever.

And while we couldn’t celebrate in person, we were humbled by some of the incredible minds that joined us online to talk about how the Internet has changed over the last ten years — and what we might see over the next ten.

With that, let’s recap the key announcements from Birthday Week 2020.

Day 1, Monday: Workers

During Birthday Week in 2017, Cloudflare announced Workers — a serverless platform that represented a completely new way to build applications: by writing your code directly onto our network edge. On Monday of this year’s Birthday Week, we announced Durable Objects and Cron Triggers — both of which continue to expand the use cases that Workers can address.

Many folks associate the serverless paradigm with functions as a service — which, at its core, is stateless. Workers KV started down the path of changing this, providing high-availability storage on the edge. However, there are use cases where consistency (a client making a request to a database gets the same view of the data) is more important than availability (a client making a request to a database always receives a response). Say you want to sell tickets to a concert — you don’t want two people to be able to purchase the same ticket. With a traditional application, with a database running in one location, that’s relatively easy to ensure. But with Workers running in Cloudflare’s data centers all over the world, ensuring consistency is a little more challenging. Workers Durable Objects solves this for developers, giving them access to strongly consistent storage when building on the Workers platform.
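
To make the ticket-selling example concrete, here is a minimal sketch of what a Durable Object class might look like. It is illustrative only: the class name, seat count, and responses are hypothetical, but the state.storage interface is the strongly consistent storage the platform provides.

export class TicketCounter {
  constructor(state) {
    this.state = state // access to this object's strongly consistent storage
  }

  async fetch(request) {
    // All requests for a given concert are routed to the same single object,
    // so this read-check-write sequence cannot race with another purchase.
    let remaining = (await this.state.storage.get('remaining')) ?? 100
    if (remaining <= 0) {
      return new Response('Sold out', { status: 409 })
    }
    await this.state.storage.put('remaining', remaining - 1)
    return new Response(`Ticket reserved; ${remaining - 1} left`)
  }
}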

Similarly, triggering a Worker has historically required a user to do something — a user visiting a URL, for example. But developers have use cases where they want a Worker to run independent of a user doing something right now: syncing, batch jobs, or perhaps doing something 24 hours after a user has acted. This is where Cron Triggers come in — now, for developers on the Workers platform, there’s no more need to rely on an eyeball to get things rolling.
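
As a sketch of how this looks in practice, a Worker listens for a scheduled event rather than a fetch, and the schedule itself is declared as a cron expression in the project's wrangler.toml. The job body below is hypothetical.

// In wrangler.toml:
//   [triggers]
//   crons = ["0 3 * * *"]   # run once a day at 03:00 UTC

addEventListener('scheduled', event => {
  event.waitUntil(runNightlyJob())
})

async function runNightlyJob() {
  // Hypothetical batch job: kick off a sync against an illustrative endpoint.
  await fetch('https://example.com/api/nightly-sync', { method: 'POST' })
}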

Day 2, Tuesday: Analytics

There are a lot of website analytics products out there on the market. Many of those products are, not surprisingly, very good.

But the way they’ve been implemented often leaves a lot to be desired. Most of them operate by tracking individual users, using client-side state like cookies or localStorage — or even fingerprints. This is increasingly a problem. There’s the principle of it: we don’t want to be tracked individually, so why would we want visitors to our web properties to feel tracked? Beyond that, because so many people are uncomfortable with how they’re being tracked around the web, they’re simply blocking a lot of these analytics products. As a result, these products are becoming less and less accurate.

On Tuesday, we announced a new Web Analytics product that allows you to get the best of both worlds — detailed and accurate analytics, without compromising on the privacy of your users. We don’t use any client-side state, like cookies or localStorage, for the purposes of tracking users. And we don’t “fingerprint” individuals via their IP address, User Agent string, or any other data for the purpose of displaying analytics (we consider fingerprinting even more intrusive than cookies, because users have no way to opt out). Because Cloudflare’s business has never been built around tracking users or selling advertising, we don’t do it. Just the metrics, ma’am.

That wasn’t all on Tuesday, though. Another crucial aspect of owning a web property is website performance. Not only does it impact user experience, but Google also uses a blended measure of performance to inform site ranking in its search results. Google’s Chrome team has been doing great work on metricizing site performance, culminating in Web Vitals. We’ve worked with the Chrome team to integrate Web Vitals into our Browser Insights product. You’ve always gotten edge-side performance analytics from Cloudflare, but now you’re not just seeing the server-side view of your web performance: it’s blended with how your users perceive performance, too. We take all that data and present it in a pragmatic way to help you figure out what you need to do to optimize the performance of your site.

Day 3, Wednesday: Cloudflare Radar and Speeding up HTTPS/HTTP3

As of today, Cloudflare sits in front of 14.5% of the world’s top 10 million websites. The privilege of getting to serve so many different customers means we get visibility into a lot of things on the web. Wednesday of birthday week was about us taking advantage of that for everyone who is out on the web today.

If you think about the traffic flowing through a city at any given time, it’s like a living, breathing creature. It ebbs and flows; it has rhythms that follow the sun and moon. Unusual events can cause traffic jams, as can accidents. Many cities have traffic reporting services for exactly this reason: knowing what’s going on can immensely help those who need to navigate the city streets. The web is a global version of this, and given the role the Internet now plays for humanity, understanding what’s going on is probably as important as all those city traffic reports around the world combined.

And yet, when you want to get the equivalent of that traffic report, where do you go?

Cloudflare Radar is our answer to that question. Each second, Cloudflare handles on average 18 million HTTP requests and 6 million DNS requests. We block 72 billion cyberthreats every day. Add to that the 1 billion unique IP addresses connecting to Cloudflare’s network, and we have one of the most representative views of Internet traffic worldwide. Before Radar, all this activity, good and bad, was only visible internally at Cloudflare: we used it to help improve our service and protect our customers. With the release of Radar, we’re exposing it externally: shining a light on the Internet’s patterns for the world to see.

Speaking of spotting interesting patterns: back in late June, our team noticed a weird spike in DNS requests for the 65479 Resource Record. It turned out these spikes were part of Apple’s iOS 14 beta release — Apple was testing out a new SVCB/HTTPS record type. The aim: to patch a limitation that’s been inherent in the HTTPS and HTTP/3 protocols. When a user types in a URL without specifying the protocol (e.g., HTTPS), the initial negotiation happens in plaintext because browsers start with HTTP. Only once it’s established that an HTTPS or HTTP/3 resource exists will the browser transition over to it. The problem here is twofold: latency, and also security.

But you know what happens before any HTTP negotiation? A DNS request. And that’s what Apple had implemented to create this interesting pattern: the DNS request effectively asks whether the site supports HTTPS or HTTP/3. As of Wednesday of Birthday Week, Cloudflare’s DNS servers automatically generate HTTPS records on the fly to advertise whether a particular zone supports HTTP/3 and/or HTTP/2, based on whether those features are enabled on the zone. The result: better performance and improved security. Who says you need to pick just one?
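
For illustration, a generated HTTPS record could look like the following in zone-file presentation format (the hostname and TTL are placeholders). The alpn parameter is what tells the browser, at DNS resolution time, that it can negotiate HTTP/3 or HTTP/2 straight away:

example.com. 300 IN HTTPS 1 . alpn="h3,h2"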

Day 4, Thursday: API day

Nobody has ever doubted the importance of user interfaces. Finding ways for humans and computers to engage each other has been an area of focus since the very first computers were invented. But as the web has grown, data has become the new oil, and applications have proliferated, there’s another interface that has grown in importance: the interface between different types of applications. Day 4 of Birthday Week was all about APIs.

The first announcement was beta support for gRPC: a new type of protocol intended for building APIs at scale. Most REST APIs use HTTPS and JSON to communicate values. The problem is that those are really designed for the other type of interface mentioned above: humans talking to computers. The upside is that it makes things human-readable; the downside is that it’s really inefficient, and as the use of APIs continues to explode, this inefficiency proliferates. The gRPC protocol is an answer to this: an efficient protocol for computers to talk to each other. But up until now, that also came at a price: because gRPC uses newer technology (like HTTP/2) under the covers, existing security and performance tools did not support gRPC traffic out of the box. This meant that customers adopting gRPC to power their APIs had to pick between modernity on one hand, and things like security, performance, and reliability on the other.

Cloudflare’s announcement of support for gRPC fixes this trade-off: when you put your gRPC APIs on Cloudflare, you get all the traditional benefits of Cloudflare along with them. Apprehensive of exposing your APIs to bad actors? Add security features such as Bot Management and the WAF. Need more performance? Turn on Argo Smart Routing to decrease time to first byte. Or increase reliability by adding a Load Balancer.

Speaking of the WAF: if you think about the way our WAF works, it secures web applications from attacks by looking for attack patterns — say, bot patterns that try to imitate human patterns, or abuse of how a browser interacts with a site; in both instances, the attack is intended to break something. But because what computers need to talk to each other is different from what computers need to talk to humans, the attack vectors are different too. Protecting APIs therefore isn’t quite the same as protecting websites.

API Shield is purpose-built for just this. It makes it simple to secure APIs through the use of strong client certificate-based authentication and strict schema-based validation. On the authentication side, API Shield uses mutual TLS, which is not vulnerable to the reuse or sharing of passwords or tokens. And once developers can be sure that only legitimate clients (with SSL certificates in hand) are connecting to their APIs, the next step in API Shield is making sure that those clients are making valid requests. It works by matching the contents of API requests — the query parameters that come after the URL and the contents of the POST body — against a contract or “schema” that contains the rules for what is expected. If validation fails, the API call is blocked, protecting the origin from an invalid request or a malicious payload.

And, as you’d expect from Cloudflare, gRPC and API Shield support each other out of the box.

Day 5, Friday: Automatic Platform Optimization (starting with WordPress)

The idea of caching static assets is not new, and it’s something Cloudflare has supported from its inception. It works wonders in speeding up websites, particularly if your origin is slow or your user is far from the origin server, where every performance metric suffers without it. Caching also has the added benefit of reducing load on origin servers.

However, things get a little trickier when it comes to dynamic assets: if the asset could change, shouldn’t you go back to the origin just to make sure? For this reason, by default, Cloudflare doesn’t cache HTML content: there’s a chance it will change for each user. The reality is, though, that most HTML isn’t really dynamic. It needs to be able to change relatively quickly when the site is updated, but for a huge portion of the web, the content is static for months or years at a time. There are special cases, like when a user is logged in (as the admin or otherwise), where the content needs to differ, but the vast majority of visits are from anonymous users.

Automatic Platform Optimization, which was announced on Friday, brings more intelligence to this — allowing us to figure out when we should be caching HTML and when we shouldn’t. The advantage is that it moves more content closer to the user, and it does so automagically — there’s no configuration required. The benefits aren’t trivial: a 72% reduction in Time to First Byte (TTFB), a 23% reduction in First Contentful Paint, and a 13% reduction in Speed Index for desktop users at the 90th percentile. We’re starting off with support for WordPress — which powers 38% of all websites — but the plan is to expand this to other platforms in the near future.

All day, every day: Cloudflare TV

Ten years is a long time. The milestone for Cloudflare seemed to be the perfect opportunity to look back over the last ten years of the Internet — what’s changed, what’s surprised us? And more than that: what’s coming over the next ten years?

To look back and then peer out into the future, we were humbled to be joined by some of the most celebrated names in tech and beyond. Among the highlights: Apple co-founder Steve Wozniak, Zoom CEO Eric Yuan, OpenTable CEO Debby Soo, Stripe co-founder and President John Collison, Former CEO & Executive Chairman of Google and Co-Founder of Schmidt Futures Eric Schmidt, former McAfee CEO Chris Young, former Seal Team 6 Commander Dave Cooper, Project Include CEO Ellen Pao, and so many more. All told, it was 24 hours of live discussions over the course of the week.

And with that, it’s a wrap! To everyone who has been a part of the Cloudflare journey over the past 10 years: our customers, folks on the team, friends and supporters, and our partners all around the world: thank you. It’s been an incredible ride.

And, as our co-founder Michelle likes to say, we’re just getting started.

Introducing support for the AVIF image format

Post Syndicated from Kornel Lesiński original https://blog.cloudflare.com/generate-avif-images-with-image-resizing/

We’ve added support for the new AVIF image format in Image Resizing. It compresses images significantly better than older-generation formats such as WebP and JPEG. It’s supported in Chrome desktop today, and support is coming to other Chromium-based browsers, as well as Firefox.

What’s the benefit?

More than half of an average website’s bandwidth is spent on images. Improved image compression can save bandwidth and improve the overall performance of the web. The compression in AVIF is so good that images can shrink to half the size of JPEG and WebP.

What is AVIF?

AVIF is a combination of the HEIF ISO standard and the royalty-free AV1 codec developed by Mozilla, Xiph, Google, Cisco, and many others.

Currently JPEG is the most popular image format on the Web. It’s doing remarkably well for its age, and it will likely remain popular for years to come thanks to its excellent compatibility. There have been many previous attempts at replacing JPEG, such as JPEG 2000, JPEG XR and WebP. However, these formats offered only modest compression improvements, and didn’t always beat JPEG on image quality. Compression and image quality in AVIF is better than in all of them, and by a wide margin.

Image comparison at equal file size: JPEG (17KB), WebP (17KB), AVIF (17KB)

Why a new image format?

One of the big things AVIF does is it fixes WebP’s biggest flaws. WebP is over 10 years old, and AVIF is a major upgrade over the WebP format. These two formats are technically related: they’re both based on a video codec from the VPx family. WebP uses the old VP8 version, while AVIF is based on AV1, which is the next generation after VP10.

WebP is limited to 8-bit color depth, and in its best-compressing mode of operation, it can only store color at half of the image’s resolution (known as chroma subsampling). This causes edges of saturated colors to be smudged or pixelated in WebP. AVIF supports 10- and 12-bit color at full resolution, and high dynamic range (HDR).

Image comparison: JPEG vs. WebP vs. WebP (sharp YUV option) vs. AVIF

AV1 also uses a new compression technique: chroma-from-luma. Most image formats store brightness separately from color hue. AVIF uses the brightness channel to guess what the color channel may look like. They are usually correlated, so a good guess gives smaller file sizes and sharper edges.

Adoption of AV1 and AVIF

VP8 and WebP suffered from reluctant industry adoption. Safari only added WebP support very recently, 10 years after Chrome.

Chrome 85 already supports AVIF. Support is a work in progress in other Chromium-based browsers and in Firefox. Apple hasn’t announced whether Safari will support AVIF. However, this time Apple is one of the companies in the Alliance for Open Media, creators of AVIF. The AV1 codec is already seeing faster adoption than previous royalty-free codecs: the latest GPUs from Nvidia, AMD, and Intel already have hardware decoding support for AV1.

AVIF uses the same HEIF container as the HEIC format used in iOS’s camera. AVIF and HEIC offer similar compression performance. The difference is that HEIC is based on a commercial, patent-encumbered H.265 format. In countries that allow software patents, H.265 is illegal to use without obtaining patent licenses. It’s covered by thousands of patents, owned by hundreds of companies, which have fragmented into two competing licensing organizations. Costs and complexity of patent licensing used to be acceptable when videos were published by big studios, and the cost could be passed on to the customer in the price of physical discs and hardware players. Nowadays, when video is mostly consumed via free browsers and apps, the old licensing model has become unsustainable. This has created a huge incentive for Web giants like Google, Netflix, and Amazon to get behind the royalty-free alternative. AV1 is free to use by anyone, and the alliance of tech giants behind it will defend it from patent trolls’ lawsuits.

Non-standard image formats usually can only be created using their vendor’s own implementation. AVIF and AV1 are already ahead with multiple independent implementations: libaom, Intel SVT-AV1, libgav1, dav1d, and rav1e. This is a sign of healthy adoption and a prerequisite for becoming a Web standard. Our Image Resizing is implemented in Rust, so we’ve chosen the rav1e encoder to create a pure-Rust implementation of AVIF.

Caveats

AVIF is a feature-rich format. Most of its features are for smartphone cameras, such as “live” photos, depth maps, bursts, HDR, and lossless editing. Browsers will support only a fraction of these features. In our implementation in Image Resizing we’re supporting only still standard-range images.

The two biggest problems in AVIF are encoding speed and the lack of progressive rendering.

AVIF images don’t show anything on screen until they’re fully downloaded. In contrast, a progressive JPEG can display a lower-quality approximation of the image very quickly, while it’s still loading. When progressive JPEGs are delivered well, they make websites appear to load much faster. Progressive JPEG can look loaded at half of its file size. AVIF can fully load at half of JPEG’s size, so it somewhat overcomes the lack of progressive rendering with the sheer compression advantage. In this case only WebP is left behind, which has neither progressive rendering nor strong compression.

Decoding AVIF images for display takes more CPU power than older codecs, but it should be fast enough in practice. Even low-end Android devices can play AV1 videos in full HD without the help of hardware acceleration, and AVIF images are just a single frame of an AV1 video. However, encoding of AVIF images is much slower. It may take a few seconds to create a single image. AVIF supports tiling, which accelerates encoding on multi-core CPUs. We have lots of CPU cores, so we take advantage of this to make encoding faster. Image Resizing doesn’t use the maximum compression level possible in AVIF, to further increase compression speed. Resized images are cached, so the encoding speed is noticeable only on a cache miss.

We will be adding AVIF support to Polish as well. Polish converts images asynchronously in the background, which completely hides the encoding latency, and it will be able to compress AVIF images better than Image Resizing.

Note about benchmarking

It’s surprisingly difficult to fairly and accurately judge which lossy codec is better. Lossy codecs are specifically designed to mainly compress image details that are too subtle for the naked eye to see, so “looks almost the same, but the file size is smaller!” is a common feature of all lossy codecs, and not specific enough to draw conclusions about their performance.

Lossy codecs can’t be compared by comparing just file sizes. Lossy codecs can easily make files of any size. Their effectiveness is in the ratio between file size and visual quality they can achieve.

The best way to compare codecs is to make each compress an image to the exact same file size, and then to compare the actual visual quality they’ve achieved. If the file sizes differ, any judgement may be unfair, because the codec that generated the larger file may have done so only because it was asked to preserve more details, not because it can’t compress better.

How and when to enable AVIF today?

We recommend AVIF for websites that need to deliver high-quality images with as little bandwidth as possible. This is important for users of slow networks and in countries where the bandwidth is expensive.

Browsers that support the AVIF format announce it by adding image/avif to their Accept request header. To request the AVIF format from Image Resizing, set the format option to avif.
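
For example, a browser with AVIF support might send an image request with an Accept header like the following (the exact value varies by browser and version; this one is representative of Chrome):

Accept: image/avif,image/webp,image/apng,image/*,*/*;q=0.8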

The best method to detect and enable support for AVIF is to use image resizing in Workers:

addEventListener('fetch', event => {
  const imageURL = "https://jpeg.speedcf.com/cat/4.jpg";

  const resizingOptions = {
    // You can set any options here, see:
    // https://developers.cloudflare.com/images/worker
    width: 800,
    sharpen: 1.0,
  };

  const accept = event.request.headers.get("accept") || "";
  const isAVIFSupported = /image\/avif/.test(accept);
  if (isAVIFSupported) {
    resizingOptions.format = "avif";
  }
  // Note: the resizing options belong in the second argument to fetch()
  event.respondWith(fetch(imageURL, {cf: {image: resizingOptions}}))
})

The above script will auto-detect the supported format, and serve AVIF automatically. Alternatively, you can use URL-based resizing together with the <picture> element:

<picture>
    <source type="image/avif" 
            srcset="/cdn-cgi/image/format=avif/image.jpg">
    <img src="/image.jpg">
</picture>

This uses our /cdn-cgi/image/… endpoint to perform the conversion, and the alternative source will be picked only by browsers that support the AVIF format.

We have the format=auto option, but it won’t choose AVIF yet. We’re cautious, and we’d like to test the new format more before enabling it by default.

Introducing Automatic Platform Optimization, starting with WordPress

Post Syndicated from Garrett Galow original https://blog.cloudflare.com/automatic-platform-optimizations-starting-with-wordpress/

Today, we are announcing Automatic Platform Optimization (APO), a new service that serves more than just the static content of your website. With this launch, we are supporting WordPress, the most popular website hosting solution, serving 38% of all websites. Our testing, as detailed below, showed a 72% reduction in Time to First Byte (TTFB), a 23% reduction in First Contentful Paint, and a 13% reduction in Speed Index for desktop users at the 90th percentile, achieved by serving nearly all of your website’s content from Cloudflare’s network. This means visitors to your website see not only the first content sooner, but all content more quickly.

With Automatic Platform Optimization for WordPress, your customers won’t suffer any slowness caused by common issues like shared hosting congestion, slow database lookups, or misbehaving plugins. This service is now available for anyone using WordPress. It costs $5/month for customers on our Free plan and is included, at no additional cost, in our Professional, Business, and Enterprise plans. No usage fees, no surprises, just speed.

How to get started

The easiest way to get started with APO is from your WordPress admin console.

1. First, install the Cloudflare WordPress plugin on your WordPress website or update to the latest version (3.8.2 or higher).
2. Authenticate the plugin (steps here) to talk to Cloudflare if you have not already done that.
3. From the Home screen of the Cloudflare section, turn on Automatic Platform Optimization.

Free customers will first be directed to the Cloudflare Dashboard to purchase the service.

Why We Built This

At Cloudflare, we jump at the opportunity to make hard problems for our customers disappear with the click of a button. Running a consistently fast website is challenging. Many businesses don’t have the time or money to spend on complicated and expensive performance solutions for their site. Even if they do, it can be extremely costly to pay for specialized attention to ensure you get the best performance possible. Having a fast website doesn’t have to be complicated, though. The closer your content is to your customers, the better your site will perform. Static content caching does this for files like images, CSS, and JavaScript, but that is only part of the equation. Dynamic content is still fetched from the origin, incurring costly round trips and additional processing time. For more info on dynamic versus static content, see our learning center.

WordPress is one of the most open platforms in the world, but that means you are always at risk of incurring performance penalties because of plugins or other sources that, while necessary, may be hard to pinpoint and resolve. With the Automatic Platform Optimization service, we put your website into our network that is within 10 milliseconds of 99% of the Internet-connected population in the developed world, all without having to change your existing hosting provider. This means that for most requests your customers won’t even need to go to your origin, reducing many costly round trips and server processing time. These optimizations run on our edge network, so they also will not impact render or interactivity since no additional JavaScript is run on the client.

How We Measure Web Performance

Evaluating performance of a website is difficult. There are many different metrics you can track and it is not always obvious which metrics most meaningfully represent a user’s experience. As discussed when we blogged about our new Speed page, we aim to simplify this for customers by automating tests using the infrastructure of webpagetest.org, and summarizing both the results visually and numerically in one place.

The visualization gives you a clear idea of what customers are going to see when they come to your site, and the Critical Loading Times provide the most important metrics to judge your performance. On top of seeing your site’s performance, we provide a list of recommendations for ways to even further increase your performance. If you are using WordPress, then we will test your site with Automatic Platform Optimizations to estimate the benefit you will get with the service.

The Benefits of Automatic Platform Optimization

We tested APO on over 500 Cloudflare customer websites that run on WordPress to understand what the performance improvements would be. The results speak for themselves:

Test Results

Metric                       | Percentile | Baseline | Cloudflare APO Enabled | Improvement (%)
Time to First Byte (TTFB)    | 90th       | 1252 ms  | 351 ms                 | 71.96%
                             | 10th       | 254 ms   | 261 ms                 | -2.76%
First Contentful Paint (FCP) | 90th       | 2655 ms  | 2056 ms                | 22.55%
                             | 10th       | 894 ms   | 783 ms                 | 12.46%
Speed Index (SI)             | 90th       | 6428     | 5586                   | 13.11%
                             | 10th       | 1301     | 1242                   | 4.52%

Note: Results are based on test results of 505 randomly selected websites that are cached by Cloudflare. Tests were run using WebPageTest from South Carolina, USA and the following environment: Chrome, Cable connection speed.

Most importantly, with APO, a site’s TTFB is made both fast and consistent. Because we now serve the HTML from Cloudflare’s edge with zero origin processing time, getting the first byte to the eyeball is consistently fast. Under heavy load, a WordPress origin can suffer delays in building the HTML and returning it to visitors. APO removes the variance due to load, resulting in a consistent TTFB < 400 ms.

Additionally, between faster TTFB and additional caching of third party fonts, we see performance improvements in both FCP and SI for both the fastest and slowest of the sites we tested. Some of this comes naturally from reducing the TTFB, since every millisecond you shave off of TTFB is a potential millisecond gain for other metrics. Caching additional third party fonts allows us to reduce the time it takes to fetch that content. Given fonts can often block paints due to text rendering, this improves the rate at which the page paints, and improves the Speed Index.

We asked the folks at Kinsta to try out APO, given their expertise in WordPress, and tell us what they think. Brian Li, a Website Content Manager at Kinsta, ran a set of tests from around the world on a website hosted in Tokyo. I’ll let him explain what they did and the results:

At Kinsta, WordPress performance is something that’s near and dear to our hearts. So, when Cloudflare reached out about testing their new Automatic Platform Optimization (APO) service for WordPress, we were all ears.
 
This is what we did to test out the new service:

  1. We set up a test site in Tokyo, Japan – one of the 24 high-performance data center locations available for Kinsta customers.
  2. We ran several speed tests from six different locations around the world with and without Cloudflare’s APO.

 
The results were incredible!
 
By caching static HTML on Cloudflare’s edge network, we saw a 70-300% performance increase. As expected, the testing locations furthest away from Tokyo saw the biggest reduction in load time.
 
If your WordPress site uses a traditional CDN that only caches CSS, JS, and images, upgrading to Cloudflare’s WordPress APO is a no-brainer and will help you stay competitive with modern Jamstack and static sites that live on the edge by default.

Brian’s test results are summarized in this image:

Page load speeds for a website hosted in Tokyo, measured from 6 locations worldwide, comparing Kinsta, Kinsta with KeyCDN, and Kinsta with Cloudflare APO.

One of the clear benefits, from Kinsta’s testing of APO, is the consistency of performance for serving your site no matter where your visitors are in the world. The consistent sub-second performance shown with APO versus two or three second load times in other setups makes it clear that if you have a global customer base, APO delivers an improved experience for all visitors.

How Automatic Platform Optimization Works

Automatic Platform Optimization is the result of being able to use the power of Cloudflare Workers to intelligently cache dynamic content. By caching dynamic content, we can serve the entire website from our edge network. Think ‘static site’ but without any of the work of having to build or maintain a static site. Customers can keep managing and updating content on their website in the same way and leave the hard work for performance to us. Serving both static and dynamic content from our network results, generally, in no origin requests or origin processing time. This means all the communication occurs between the user’s device and our edge. Reducing the multitude of round trips typically required from our edge to the origin for dynamic content is what makes this service so effective. Let’s first see what it normally looks like to load a WordPress site for a visitor.

A sequence diagram for a typical user visiting a site.

In a regular request flow, Cloudflare is able to cache some of the content, like images, CSS, or JS, while other requests go to either the origin or a third-party service to fetch the content. Most importantly, the first request, which fetches the HTML for the site, needs to go to the origin; this is a typical cause of long TTFB, since no other requests are made until the client receives the HTML and parses it to make subsequent requests.

The same site visit, but with APO enabled.

Once APO is enabled, all those trips to the origin are removed. TTFB benefits greatly, because the first hop starts and ends at Cloudflare’s network. This also means the browser starts fetching and painting the webpage sooner, so each paint event occurs earlier. Lastly, by caching third-party fonts, we remove additional requests that would need to leave Cloudflare’s network and extend the time to display text to the user. Often, websites use fonts hosted on third-party domains. While this saves the bandwidth costs of hosting them on the origin, depending on where those fonts are hosted, fetching them can still be a costly operation. By rehosting the fonts and serving them from our cache, we can remove one of the remaining costly round trips.

With APO for WordPress, you can say bye bye to database congestion or unwieldy plugins slowing down your customers’ experience. These benefits are stacked on top of our already fast TLS connection times and industry leading protocol support like HTTP/2 that ensure we are using the most efficient and the fastest way to connect and deliver your website to your customers.

For customers with WordPress sites that support authenticated sessions, you do not have to worry about us caching content from authenticated users and serving it to others. We bypass the cache on standard WordPress and WooCommerce cookies for authenticated users. This ensures customized content for a specific user is only visible to that user. While this has been available to customers with our Business-level service, it is now available for any WordPress customer that enables APO.
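
While this post doesn't publish APO's internals, the core idea can be sketched as a Worker that serves cached HTML from the edge and bypasses the cache whenever a WordPress or WooCommerce session cookie is present. The cookie patterns and structure below are illustrative, not Cloudflare's actual implementation:

addEventListener('fetch', event => {
  event.respondWith(handle(event))
})

async function handle(event) {
  const request = event.request
  const cookie = request.headers.get('Cookie') || ''

  // Authenticated sessions receive personalized HTML: always go to origin.
  if (/wordpress_logged_in|woocommerce_/.test(cookie)) {
    return fetch(request)
  }

  // Anonymous visitors: serve HTML from the edge cache when possible.
  const cache = caches.default
  let response = await cache.match(request)
  if (!response) {
    response = await fetch(request)
    // Store a copy at the edge for subsequent visitors.
    event.waitUntil(cache.put(request, response.clone()))
  }
  return response
}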

You might be wondering: “This all sounds great, but what about when I change content on my site?” Because this service works in tandem with our WordPress plugin, we are able to understand when you make changes, quickly purge the content at Cloudflare’s edge, and refresh it with the new content. With the plugin installed, we detect content changes and update our edge network worldwide with automatic cache purges. As part of this release, we have updated our WordPress plugin, so whether or not you use APO, you should upgrade to the latest version today. If you do not or cannot use our WordPress plugin, APO will still provide the same performance benefits, but it may serve stale content for up to 30 minutes before it is refreshed.

This service was built on the prototype work originally blogged about here and here. For a more in depth look at the technical side of the service and how Cloudflare Workers allowed us to build the Automatic Platform Optimization service, see the accompanying blog post.

WordPress Today, Other Platforms Coming Soon

While today’s announcement is focused on supporting WordPress, this is just the start. We plan to bring these same capabilities to other popular platforms used for web hosting. If you operate a platform and are interested in how we may be able to work together to improve things for all your customers, please get in touch. If you are running a website, let us know what platform you want to see us bring Automatic Platform Optimization to next.

Introducing API Shield

Post Syndicated from Patrick R. Donahue original https://blog.cloudflare.com/introducing-api-shield/

APIs are the lifeblood of modern Internet-connected applications. Every millisecond they carry requests from mobile applications—place this food delivery order, “like” this picture—and directions to IoT devices—unlock the car door, start the wash cycle, my human just finished a 5k run—among countless other calls.

They’re also the target of widespread attacks designed to perform unauthorized actions or exfiltrate data, as data from Gartner increasingly shows: “by 2021, 90% of web-enabled applications will have more surface area for attack in the form of exposed APIs rather than the UI, up from 40% in 2019,” and “Gartner predicted that, by 2022, API abuses will move from an infrequent to the most-frequent attack vector, resulting in data breaches for enterprise web applications.” Of the 18 million requests per second that traverse Cloudflare’s network, 50% are directed towards APIs—with the majority of these requests blocked as malicious.

To combat these threats, Cloudflare is making it simple to secure APIs through the use of strong client certificate-based identity and strict schema-based validation. As of today, these capabilities are available free for all plans within our new “API Shield” offering. And as of today, the security benefits also extend to gRPC-based APIs, which use binary formats such as protocol buffers rather than JSON, and have been growing in popularity with our customer base.

Continue reading to learn more about the new capabilities, or jump right to the “Demonstration” paragraph for examples of how to get started configuring your first API Shield rule.

Positive security models and client certificates

A “positive security” model is one that allows only known behavior and identities, while rejecting everything else. It is the opposite of the traditional “negative security” model enforced by a Web Application Firewall (WAF) that allows everything except for requests coming from problematic IPs, ASNs, countries or requests with problematic signatures (SQL injection attempts, etc.).

Implementing a positive security model for APIs is the most direct way to eliminate the noise of credential stuffing attacks and other automated scanning tools. And the first step towards a positive model is deploying strong authentication such as mutual TLS authentication, which is not vulnerable to the reuse or sharing of passwords.

Just as we simplified the issuance of server certificates back in 2014 with Universal SSL, API Shield reduces the process of issuing client certificates to clicking a few buttons in the Cloudflare Dashboard. By providing a fully hosted private public key infrastructure (PKI), you can focus on your applications and features—rather than operating and securing your own certificate authority (CA).

Enforcing valid requests with schema validation

Once developers can be sure that only legitimate clients (with SSL certificates in hand) are connecting to their APIs, the next step in implementing a positive security model is making sure that those clients are making valid requests. Extracting a client certificate from a device and reusing elsewhere is difficult, but not impossible, so it’s also important to make sure that the API is being called as intended.

Requests containing extraneous input may not have been anticipated by the API developer, and can cause problems if processed directly by the application, so these should be dropped at the edge if possible. API Schema validation works by matching the contents of API requests—the query parameters that come after the URL and the contents of the POST body—against a contract or “schema” that contains the rules for what is expected. If validation fails, the API call is blocked, protecting the origin from an invalid request or a malicious payload.
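
As an illustration, a schema for the temperature API used in the demonstration below might look like the following JSON Schema (hypothetical; the exact format accepted by the beta may differ). A POST body with extra fields or mistyped values would fail validation and be blocked:

{
  "type": "object",
  "properties": {
    "temperature": { "type": "number" },
    "time": { "type": "string", "format": "date-time" }
  },
  "required": ["temperature", "time"],
  "additionalProperties": false
}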

Schema validation is currently in closed beta for JSON payloads, with gRPC/protocol buffer support on the roadmap. If you would like to join the beta please open a support ticket with the subject “API Schema Validation Beta”. After the beta has ended, we plan to make schema validation available as part of the API Shield user interface.

Demonstration

To demonstrate how the APIs powering IoT devices and mobile applications can be secured, we have built an API Shield demonstration using client certificates and schema validation.

Temperatures are captured by an IoT device, represented in the demo by a Raspberry Pi 3 Model B+ with an external infrared temperature sensor, and then transmitted via a POST request to a Cloudflare-protected API. Temperatures are subsequently retrieved by GET requests and then displayed in a mobile application built in Swift for iOS.

In both cases, the API was actually built using Cloudflare Workers® and Workers KV, but can be replaced by any Internet-accessible endpoint.

1. API Configuration

Before configuring the IoT device and mobile application to communicate securely with the API, we need to bootstrap the API endpoints. To keep the example simple, while also allowing for additional customization, we’ve implemented the API as a Cloudflare Worker (borrowing code from the To-Do List tutorial).

In this particular example the temperatures are stored in Workers KV using the source IP address as a key, but this could easily be replaced by a value from the client certificate, e.g., the fingerprint. The code below saves a temperature and timestamp into KV when a POST is made, and returns the most recent 5 temperatures when a GET request is made.

const defaultData = { temperatures: [] }

const getCache = key => TEMPERATURES.get(key)
const setCache = (key, data) => TEMPERATURES.put(key, data)

async function addTemperature(request) {

    // pull previously recorded temperatures for this client
    const ip = request.headers.get('CF-Connecting-IP')
    const cacheKey = `data-${ip}`
    let data
    const cache = await getCache(cacheKey)
    if (!cache) {
        await setCache(cacheKey, JSON.stringify(defaultData))
        data = defaultData
    } else {
        data = JSON.parse(cache)
    }

    // append the recorded temperatures with the submitted reading (assuming it has both temperature and a timestamp)
    try {
        const body = await request.text()
        const val = JSON.parse(body)

        if (val.temperature && val.time) {
            data.temperatures.push(val)
            await setCache(cacheKey, JSON.stringify(data))
            return new Response("", { status: 201 })
        } else {
            return new Response("Unable to parse temperature and/or timestamp from JSON POST body", { status: 400 })
        }
    } catch (err) {
        return new Response(err, { status: 500 })
    }
}

function compareTimestamps(a,b) {
    return -1 * (Date.parse(a.time) - Date.parse(b.time))
}

// return the 5 most recent temperature measurements
async function getTemperatures(request) {
    const ip = request.headers.get('CF-Connecting-IP')
    const cacheKey = `data-${ip}`

    const cache = await getCache(cacheKey)
    if (!cache) {
        return new Response(JSON.stringify(defaultData), { status: 200, headers: { 'content-type': 'application/json' } })
    } else {
        const data = JSON.parse(cache)
        const retval = JSON.stringify(data.temperatures.sort(compareTimestamps).splice(0,5))
        return new Response(retval, { status: 200, headers: { 'content-type': 'application/json' } })
    }
}

async function handleRequest(request) {

    if (request.method === 'POST') {
        return addTemperature(request)
    } else {
        return getTemperatures(request)
    }

}

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

Before adding mutual TLS authentication, we’ll test POST’ing a random temperature reading:

$ TEMPERATURE=$(echo $((361 + RANDOM %11)) | awk '{printf("%.2f",$1/10.0)}')
$ TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

$ echo -e "$TEMPERATURE\n$TIMESTAMP"
36.30
2020-09-28T02:57:49Z

$ curl -v -H "Content-Type: application/json" -d '{"temperature":'''$TEMPERATURE''', "time": "'''$TIMESTAMP'''"}' https://shield.upinatoms.com/temps 2>&1 | grep "< HTTP/2"
< HTTP/2 201 

And here’s a subsequent read of that temperature, along with the previous 4 that were submitted:

$ curl -s https://shield.upinatoms.com/temps | jq .
[
  {
    "temperature": 36.3,
    "time": "2020-09-28T02:57:49Z"
  },
  {
    "temperature": 36.7,
    "time": "2020-09-28T02:54:56Z"
  },
  {
    "temperature": 36.2,
    "time": "2020-09-28T02:33:08Z"
  },
    {
    "temperature": 36.5,
    "time": "2020-09-28T02:29:22Z"
  },
  {
    "temperature": 36.9,
    "time": "2020-09-28T02:27:19Z"
  } 
]

2. Client certificate issuance

With our API in hand, it’s time to lock it down to require a valid client certificate. Before doing so we’ll want to generate those certificates. To do so, you can either go to the SSL/TLS → Client Certificates tab of the Cloudflare Dashboard and click “Create Certificate” or you can automate the process via API calls.

Because most developers at scale will be generating their own private keys and CSRs and requesting that they be signed via API, we’ll show that process here. Using Cloudflare’s PKI toolkit CFSSL, we’ll first create a bootstrap certificate for the iOS application, and then we’ll create a certificate for the IoT device:

$ cat <<'EOF' | tee -a csr.json
{
    "hosts": [
        "ios-bootstrap.devices.upinatoms.com"
    ],
    "CN": "ios-bootstrap.devices.upinatoms.com",
    "key": {
        "algo": "rsa",
        "size": 2048
    },
    "names": [{
        "C": "US",
        "L": "Austin",
        "O": "Temperature Testers, Inc.",
        "OU": "Tech Operations",
        "ST": "Texas"
    }]
}
EOF

$ cfssl genkey csr.json | cfssljson -bare certificate
2020/09/27 21:28:46 [INFO] generate received request
2020/09/27 21:28:46 [INFO] received CSR
2020/09/27 21:28:46 [INFO] generating key: rsa-2048
2020/09/27 21:28:47 [INFO] encoded CSR

$ mv certificate-key.pem ios-key.pem
$ mv certificate.csr ios.csr

// and do the same for the IoT sensor
$ sed -i.bak 's/ios-bootstrap/sensor-001/g' csr.json
$ cfssl genkey csr.json | cfssljson -bare certificate
...
$ mv certificate-key.pem sensor-key.pem
$ mv certificate.csr sensor.csr
Generate a private key and CSR for the IoT device and iOS application.

// we need to replace actual newlines in the CSR with ‘\n’ before POST’ing
$ CSR=$(cat ios.csr | perl -pe 's/\n/\\n/g')
$ request_body=$(< <(cat <<EOF
{
  "validity_days": 3650,
  "csr":"$CSR"
}
EOF
))

// save the response so we can view it and then extract the certificate
$ curl -H 'X-Auth-Email: YOUR_EMAIL' -H 'X-Auth-Key: YOUR_API_KEY' -H 'Content-Type: application/json' -d "$request_body" https://api.cloudflare.com/client/v4/zones/YOUR_ZONE_ID/client_certificates > response.json

$ cat response.json | jq .
{
  "success": true,
  "errors": [],
  "messages": [],
  "result": {
    "id": "7bf7f70c-7600-42e1-81c4-e4c0da9aa515",
    "certificate_authority": {
      "id": "8f5606d9-5133-4e53-b062-a2e5da51be5e",
      "name": "Cloudflare Managed CA for account 11cbe197c050c9e422aaa103cfe30ed8"
    },
    "certificate": "-----BEGIN CERTIFICATE-----\nMIIEkzCCA...\n-----END CERTIFICATE-----\n",
    "csr": "-----BEGIN CERTIFICATE REQUEST-----\nMIIDITCCA...\n-----END CERTIFICATE REQUEST-----\n",
    "ski": "eb2a48a19802a705c0e8a39489a71bd586638fdf",
    "serial_number": "133270673305904147240315902291726509220894288063",
    "signature": "SHA256WithRSA",
    "common_name": "ios-bootstrap.devices.upinatoms.com",
    "organization": "Temperature Testers, Inc.",
    "organizational_unit": "Tech Operations",
    "country": "US",
    "state": "Texas",
    "location": "Austin",
    "expires_on": "2030-09-26T02:41:00Z",
    "issued_on": "2020-09-28T02:41:00Z",
    "fingerprint_sha256": "84b045d498f53a59bef53358441a3957de81261211fc9b6d46b0bf5880bdaf25",
    "validity_days": 3650
  }
}

$ cat response.json | jq .result.certificate | perl -npe 's/\\n/\n/g; s/"//g' > ios.pem

// now ask that the second client certificate signing request be signed
$ CSR=$(cat sensor.csr | perl -pe 's/\n/\\n/g')
$ request_body=$(< <(cat <<EOF
{
  "validity_days": 3650,
  "csr":"$CSR"
}
EOF
))

$ curl -H 'X-Auth-Email: YOUR_EMAIL' -H 'X-Auth-Key: YOUR_API_KEY' -H 'Content-Type: application/json' -d "$request_body" https://api.cloudflare.com/client/v4/zones/YOUR_ZONE_ID/client_certificates | perl -npe 's/\\n/\n/g; s/"//g' > sensor.pem
Ask Cloudflare to sign the CSRs with the private CA issued for your zone

3. API Shield rule creation

With certificates in hand we can now configure the API endpoint to require their use. Below is a demonstration of how to create such a rule.

The steps include specifying which hostnames to prompt for certificates, e.g., shield.upinatoms.com, and then creating the API Shield rule.

4. IoT Device Communication

To prepare the IoT device for secure communication with our API endpoint we need to embed the certificate on the device, and then point our application to it so it can be used when making the POST request to the API endpoint.

We securely copied the private key and certificate into /etc/ssl/private/sensor-key.pem and /etc/ssl/certs/sensor.pem, and then modified our sample script to point to these files:

import requests
import json
from datetime import datetime

def readSensor():

    # Take a reading from the temperature sensor and return it as a measurement
    # (the demo uses a hardcoded sample value of 36.5)

    dateTimeObj = datetime.utcnow()  # the trailing 'Z' below denotes UTC
    timestampStr = dateTimeObj.strftime('%Y-%m-%dT%H:%M:%SZ')

    measurement = {'temperature':str(36.5),'time':timestampStr}
    return measurement

def main():

    print("Cloudflare API Shield [IoT device demonstration]")

    temperature = readSensor()
    payload = json.dumps(temperature)
    
    url = 'https://shield.upinatoms.com/temps'
    json_headers = {'Content-Type': 'application/json'}
    cert_file = ('/etc/ssl/certs/sensor.pem', '/etc/ssl/private/sensor-key.pem')
    
    r = requests.post(url, headers = json_headers, data = payload, cert = cert_file)
    
    print("Request body: ", r.request.body)
    print("Response status code: %d" % r.status_code)

When the script attempts to connect to https://shield.upinatoms.com/temps, Cloudflare requests that a client certificate be sent, and our script sends the contents of sensor.pem before demonstrating possession of sensor-key.pem, as required to complete the SSL/TLS handshake.

If we fail to send the client certificate or attempt to include extraneous fields in the API request, the schema validation (configuration not shown) fails and the request is rejected:

Cloudflare API Shield [IoT device demonstration]
Request body:  {"temperature": "36.5", "time": "2020-09-28T15:52:19Z"}
Response status code: 403

If instead a valid certificate is presented and the payload follows the schema previously uploaded, our script POSTs the latest temperature reading to the API.

Cloudflare API Shield [IoT device demonstration]
Request body:  {"temperature": "36.5", "time": "2020-09-28T15:56:45Z"}
Response status code: 201

5. Mobile Application (iOS) Communication

Now that temperature requests have been sent to our API endpoint, it’s time to read them securely from our mobile application using one of the client certificates.

For purposes of brevity, we’re going to embed a “bootstrap” certificate and key as a PKCS#12 file within the application bundle. In a real world deployment, this bootstrap certificate should only be used alongside users’ credentials to authenticate to an API endpoint that can return a unique user certificate. Corporate users will want to use MDM to distribute certificates so that the underlying mobile devices can be provisioned automatically, without users handling certificates themselves.

Package the certificate and private key

Before adding the bootstrap certificate and private key, we need to combine them into a binary PKCS#12 file. This binary file will then be added to our iOS application bundle.

$ openssl pkcs12 -export -out bootstrap-cert.pfx -inkey ios-key.pem -in ios.pem
Enter Export Password:
Verifying - Enter Export Password:

Add the certificate bundle to your iOS application

Within Xcode, click File → Add Files To “[Project Name]” and select your .pfx file. Make sure to check “Add to target” before confirming.

Modify your URLSession code to use the client certificate

This article provides a nice walkthrough of using a PKCS#12 class and URLSessionDelegate to modify your application to complete mutual TLS authentication when connecting to an API that requires it.

Looking Forward

In the coming months, we plan to expand API Shield with a number of additional features designed to protect API traffic. For customers that want to use their own PKI, we will provide the ability to import their own CAs, something available today as part of Cloudflare Access.

As we receive feedback on our schema validation beta, we will look to make the capability generally available to all customers. If you’re trying out the beta and have thoughts to share, we’d love to hear your feedback.

Beyond certificates and schema validation, we’re excited to layer on additional API security capabilities as well as deep analytics to help you better understand your APIs. If there are features you’d like to see, let us know in the comments below!

Announcing support for gRPC

Post Syndicated from Achiel van der Mandele original https://blog.cloudflare.com/announcing-grpc/

Announcing support for gRPC

Today we’re excited to announce beta support for proxying gRPC, a next-generation protocol that allows you to build APIs at scale. With gRPC on Cloudflare, you get access to the security, reliability and performance features that you’re used to having at your fingertips for traditional APIs. Sign up for the beta today in the Network tab of the Cloudflare dashboard.

gRPC has proven itself to be a popular new protocol for building APIs at scale: it’s more efficient and built to offer superior bi-directional streaming capabilities. However, because gRPC uses newer technology, like HTTP/2, under the covers, existing security and performance tools did not support gRPC traffic out of the box. This meant that customers adopting gRPC to power their APIs had to pick between modernity on one hand, and things like security, performance, and reliability on the other. Because supporting modern protocols and making sure people can operate them safely and performantly is in our DNA, we set out to fix this.

When you put your gRPC APIs on Cloudflare, you immediately gain all the benefits that come with Cloudflare. Apprehensive of exposing your APIs to bad actors? Add security features such as WAF and Bot Management. Need more performance? Turn on Argo Smart Routing to decrease time to first byte. Or increase reliability by adding a Load Balancer.

And naturally, gRPC plugs in to API Shield, allowing you to add more security by enforcing client authentication and schema validation at the edge.

What is gRPC?

Protocols like JSON-REST have been the bread and butter of Internet facing APIs for several years. They’re great in that they operate over HTTP, their payloads are human readable, and a large body of tooling exists to quickly set up an API for another machine to talk to. However, the same things that make these protocols popular are also weaknesses; JSON, as an example, is inefficient to store and transmit, and expensive for computers to parse.

In 2015, Google introduced gRPC, a protocol designed to be fast and efficient, relying on binary protocol buffers to serialize messages before they are transferred over the wire. This prevents (normal) humans from reading them but results in much higher processing efficiency. gRPC has become increasingly popular in the era of microservices because it neatly addresses the shortfalls laid out above.

JSON                Protocol Buffers
{ "foo": "bar" }    0b111001001100001011000100000001100001010
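
To make the size trade-off concrete, here is a minimal sketch using the protobufjs library to serialize a similar message. The example.proto file and the Greeting message type are hypothetical stand-ins for illustration:

const protobuf = require("protobufjs");

async function main() {
  // example.proto (hypothetical) would contain:
  //   syntax = "proto3";
  //   message Greeting { string foo = 1; }
  const root = await protobuf.load("example.proto");
  const Greeting = root.lookupType("Greeting");

  // Serialize to a compact binary Uint8Array.
  const buffer = Greeting.encode(Greeting.create({ foo: "bar" })).finish();
  console.log(buffer.length);                         // a handful of bytes
  console.log(JSON.stringify({ foo: "bar" }).length); // larger, but human readable

  // Round-trip back to an object.
  console.log(Greeting.decode(buffer).foo); // "bar"
}

main();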

gRPC relies on HTTP/2 as a transport mechanism. This poses a problem for customers trying to deploy common security technologies like web application firewalls, as most reverse proxy solutions (including Cloudflare’s HTTP stack, until today) downgrade HTTP requests down to HTTP/1.1 before sending them off to an origin.

Beyond microservices in a datacenter (the original use case for gRPC), adoption has grown in many other contexts. Many popular mobile apps have millions of users who all rely on messages being sent back and forth between mobile phones and servers. We’ve seen many customers wire up API connectivity for their mobile apps by using the same gRPC API endpoints they already have inside their data centers for communication with clients in the outside world.

While this solves the efficiency issues with running services at scale, it exposes critical parts of these customers’ infrastructure to the Internet, introducing security and reliability issues. Today we are introducing support for gRPC at Cloudflare, to secure and improve the experience of running gRPC APIs on the Internet.

How does gRPC + Cloudflare work?

The engineering work our team had to do to add gRPC support is composed of a few pieces:

  1. Changes to the early stages of our request processing pipeline to identify gRPC traffic coming down the wire.
  2. Additional functionality in our WAF to “understand” gRPC traffic, ensuring gRPC connections are handled correctly within the WAF, including inspecting all components of the initial gRPC connection request.
  3. Adding support to establish HTTP/2 connections to customer origins for gRPC traffic, allowing gRPC to be proxied through our edge. HTTP/2 to origin support is currently limited to gRPC traffic, though we expect to expand the scope of traffic proxied back to origin over HTTP/2 soon.

What does this mean for you, a Cloudflare customer interested in using our tools to secure and accelerate your API? Because of the hard work we’ve done, enabling support for gRPC is a click of a button in the Cloudflare dashboard.

Using gRPC to build mobile apps at scale

Why does Cloudflare supporting gRPC matter? To dig in on one use case, let’s look at mobile apps. Apps need quick, efficient ways of interacting with servers to get the information needed to show on your phone. There is no browser, so they rely on APIs to get the information. API stands for application programming interface; it’s a standardized way for machines (say, your phone and a server) to talk to each other.

Let’s say we’re a mobile app developer with thousands, or even millions of users. With this many users, using a modern protocol like gRPC allows us to run less compute infrastructure than would be necessary with older, less efficient protocols like JSON-REST. But exposing these endpoints, naked, on the Internet is really scary. Up until now there were very few, if any, options for protecting gRPC endpoints against application layer attacks with a WAF and guarding against volumetric attacks with DDoS mitigation tools. That changes today, with Cloudflare adding gRPC to its set of supported protocols.

With gRPC on Cloudflare, you get the full benefits of our security, reliability and performance products:

  • WAF for inspection of incoming gRPC requests. Use managed rules or craft your own.
  • Load Balancing to increase reliability: configure multiple gRPC backends to handle the load, let Cloudflare distribute the load across them. Backend selection can be done in round-robin fashion, based on health checks or load.
  • Argo Smart Routing to increase performance by transporting your gRPC messages faster than the Internet would be able to route them. Messages are routed around congestion on the Internet, resulting in an average reduction of time to first byte by 30%.

And of course, all of this works with API Shield, an easy way to add mTLS authentication to any API endpoint.
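
To illustrate, here is a hedged sketch of a Node.js gRPC client that presents a client certificate when calling an endpoint protected by API Shield. The @grpc/grpc-js and @grpc/proto-loader packages are real; the service definition, proto file, and certificate paths are hypothetical:

const grpc = require("@grpc/grpc-js");
const protoLoader = require("@grpc/proto-loader");
const fs = require("fs");

// temps.proto (hypothetical) is assumed to define something like:
//   service TempService { rpc GetLatestReading (Empty) returns (Reading); }
const definition = protoLoader.loadSync("temps.proto");
const proto = grpc.loadPackageDefinition(definition);

// createSsl(rootCerts, privateKey, certChain): the key and certificate are
// presented during the TLS handshake, which API Shield can require and verify.
const credentials = grpc.credentials.createSsl(
  null, // use the default root CAs to verify the server
  fs.readFileSync("/etc/ssl/private/sensor-key.pem"),
  fs.readFileSync("/etc/ssl/certs/sensor.pem")
);

const client = new proto.TempService("shield.upinatoms.com:443", credentials);
client.getLatestReading({}, (err, reading) => {
  if (err) throw err;
  console.log(reading);
});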

Enabling gRPC support

To enable gRPC support, head to the Cloudflare dashboard and go to the Network tab. From there you can sign up for the beta.

Announcing support for gRPC

We have limited seats available at launch, but will open up more broadly over the next few weeks. After signing up and toggling gRPC support, you’ll have to enable Cloudflare proxying on your domain on the DNS tab to activate Cloudflare for your gRPC API.

We’re excited to bring gRPC support to the masses, allowing you to add the security, reliability and performance benefits that you’re used to getting with Cloudflare. Enabling is just a click away. Take it for a spin and let us know what you think!

Introducing Cloudflare Radar

Post Syndicated from Marc Lamik original https://blog.cloudflare.com/introducing-cloudflare-radar/

Introducing Cloudflare Radar

Introducing Cloudflare Radar

Unlike the tides, Internet use ebbs and flows with the motion of the sun, not the moon. Across the world, usage quietens during the night and picks up as morning comes. Internet use also follows patterns that humans create, dipping down when people stopped to applaud healthcare workers fighting COVID-19, pausing to watch their country’s president address them, or slowing for religious reasons.

And while humans leave a mark on the Internet, so do automated systems. These systems might be doing useful work (like building search engine databases) or harm (like scraping content, or attacking an Internet property).

All the while Internet use (and attacks) is growing. Zoom into any day and you’ll see the familiar daily wave of Internet use reflecting day and night, zoom out and you’ll likely spot weekends when Internet use often slows down a little, zoom out further and you might spot the occasional change in use caused by a holiday, zoom out further and you’ll see that Internet use grows inexorably.

And attacks don’t only grow, they change. New techniques are invented while old ones remain evergreen. DDoS activity continues day and night roaming from one victim to another. Automated scanning tools look for vulnerabilities in anything, literally anything, connected to the Internet.

Sometimes the Internet fails in a country, perhaps because of a cable cut somewhere beneath the sea, or because of government intervention. That too is something we track and measure.

All this activity, good and bad, shows up in the trends and details that Cloudflare tracks to help improve our service and protect our customers. Until today this insight was only available internally at Cloudflare, today we are launching a new service, Cloudflare Radar, that shines a light on the Internet’s patterns.

Each second, Cloudflare handles on average 18 million HTTP requests and 6 million DNS requests. With 1 billion unique IP addresses connecting to Cloudflare’s network we have one of the most representative views on Internet traffic worldwide.

And by blocking 72 billion cyberthreats every day Cloudflare also has a unique position in understanding and mitigating Internet threats.

Our goal is to help build a better Internet and we want to do this by exposing insights, threats and trends based on the aggregated data that we have. We want to help anyone understand what is happening on the Internet from a security, performance and usage perspective. Every Internet user should have easy access to answer the questions that they have.

There are three key components that we’re launching today: Radar Internet Insights, Radar Domain Insights and Radar IP Insights.

Radar Internet Insights

At the top of Cloudflare Radar we show the latest news about events that are currently happening on the Internet. This includes news about the adoption of new technologies, browsers or operating systems. We are also keeping all users up to date with interesting events around developments in Internet traffic. This could be traffic patterns seen in specific countries or patterns related to events like the COVID-19 pandemic.

Introducing Cloudflare Radar

Sign up for Radar Alerts to always stay up-to-date.

Below the news section, users can find rapidly updated trend data, all of which can be viewed worldwide or by country. The data is available for several time frames: last hour, last 24 hours, and last 7 days. We’ll soon make a 30-day time frame available to help explore longer-term trends.

Change in Internet traffic

You can drill down on specific countries and Cloudflare Radar will show you the change in aggregate Internet traffic seen by our network for that country. We also show an info box on the right with a snapshot of interesting data points.

Introducing Cloudflare Radar

Worldwide and for individual countries, we have an algorithm calculating which domains are most popular and have recently started trending (i.e., have seen a large change in popularity). Services with multiple domains and subdomains are aggregated to ensure the best comparability. We show the relative rank of domains here and are able to spot big changes in ranking to highlight new trends as they appear.

Introducing Cloudflare Radar

The trending domains section is still in beta as we train our algorithm to best detect the next big things as they emerge.

There is also a search bar that enables a user to search for a specific domain or IP address to get detailed information about it. More on that below.

Attack activity

The attack activity section gives information about different types of cyberattacks observed by Cloudflare. First we show the attacks mitigated by our Layer 3 and 4 Denial of Service prevention systems, along with the attack protocol used and the change in attack volume over the selected time frame.

Introducing Cloudflare Radar

Secondly, we show Layer 7 threat information based on requests that we blocked. Layer 7 requests get blocked by a variety of systems (such as our WAF, our layer 7 DDoS mitigation system and our customer configurable firewall). We show the system responsible for blocking as well as the change of blocked requests over the selected time frame.

Introducing Cloudflare Radar

Based on the analytics we handle on HTTP requests we are able to show trends over a diverse set of data points. This includes the distribution of mobile vs. desktop traffic, or the percentage of traffic detected as coming from bots. We also dig into longer term trends like the use of HTTPS or the share of IPv6.

Introducing Cloudflare Radar

The bottom section shows the top browsers worldwide or for the selected country. In this example we selected Vietnam and you can see that over 6% of users are using Cốc Cốc, a local browser.

Introducing Cloudflare Radar

Radar Domain Insights

We give users the option to dig deeper into an individual domain, offering visibility into its global ranking as well as security information. This enables everyone to identify potential threats and risks.

To look up a domain or hostname in Radar, type it into the search box within the top domains on the Radar Internet Insights homepage.

Introducing Cloudflare Radar

For example, suppose you search for cloudflare.com. You’ll get sent to a domain-specific page with information about cloudflare.com.

Introducing Cloudflare Radar

At the top we provide an overview of the domain’s configuration with Domain Badges. From here you can, at a glance, understand what technologies the domain is using. For cloudflare.com you can see that it supports TLS, IPv6, DNSSEC and eSNI. There’s also an indication of the age of the domain (since registration) and its worldwide popularity.

Below you find the domain’s content categories. If you find a domain that is in the wrong category, please use our Domain Categorization Feedback to let us know.

We also show global popularity trends from our domain ranking formula. For domains with a global audience there’s also a map giving information about popularity by country.

Introducing Cloudflare Radar

Radar IP Insights

For an individual IP address (instead of a domain) we show different information. To look up an IP address, simply enter it in the search bar within the top domains on the Radar Internet Insights page. For a quick lookup of your own IP, just open radar.cloudflare.com/me.

Introducing Cloudflare Radar

For IPs we show the network (the ASN) and geographic information. For your own IP we also show more detailed location information as well as an invitation to check the speed of your Internet connection using speed.cloudflare.com.

Next Steps

The current product is just the beginning of Cloudflare’s approach to making knowledge about the Internet more accessible. Over the next few weeks and months we will add more data points and the 30-day time frame functionality. And we’ll allow users to filter the charts not only by country but also by categorization (such as by industry).

Stay tuned for more to come.

Free, Privacy-First Analytics for a Better Web

Post Syndicated from Jon Levine original https://blog.cloudflare.com/free-privacy-first-analytics-for-a-better-web/

Free, Privacy-First Analytics for a Better Web

Everyone with a website needs to know some basic facts about their website: what pages are people visiting? Where in the world are they? What other sites sent traffic to my website?

There are “free” analytics tools out there, but they come at a cost: not money, but your users’ privacy. Today we’re announcing a brand new, privacy-first analytics service that’s open to everyone — even if they’re not already a Cloudflare customer. And if you’re a Cloudflare customer, we’ve enhanced our analytics to make them even more powerful than before.

The most important analytics feature: Privacy

The most popular analytics services available were built to help ad-supported sites sell more ads. But, a lot of websites don’t have ads. So if you use those services, you’re giving up the privacy of your users in order to understand how what you’ve put online is performing.

Cloudflare’s business has never been built around tracking users or selling advertising. We don’t want to know what you do on the Internet — it’s not our business. So we wanted to build an analytics service that gets back to what really matters for web creators, not necessarily marketers, and to give web creators the information they need in a simple, clean way that doesn’t sacrifice their visitors’ privacy. And giving web creators these analytics shouldn’t depend on their use of Cloudflare’s infrastructure for performance and security. (More on that in a bit.)

What does it mean for us to make our analytics “privacy-first”? Most importantly, it means we don’t need to track individual users over time for the purposes of serving analytics. We don’t use any client-side state, like cookies or localStorage, for the purposes of tracking users. And we don’t “fingerprint” individuals via their IP address, User Agent string, or any other data for the purpose of displaying analytics. (We consider fingerprinting even more intrusive than cookies, because users have no way to opt out.)

Counting visits without tracking users

One of the most essential stats about any website is: “how many people went there?” Analytics tools frequently show counts of “unique” visitors, which requires tracking individual users by a cookie or IP address.

We use the concept of a visit: a privacy-friendly measure of how people have interacted with your website. A visit is defined simply as a successful page view that has an HTTP referer that doesn’t match the hostname of the request. This tells you how many times people came to your website and clicked around before navigating away, but doesn’t require tracking individuals.
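
In code, the check is a one-liner per record. Here is a minimal sketch; the log record fields (status, referer, hostname) are assumed for illustration and are not Cloudflare’s actual log schema:

// A visit: a successful page view whose referer hostname differs from
// the hostname of the request itself.
function isVisit(record) {
  if (record.status < 200 || record.status >= 400) return false; // not successful
  if (!record.referer) return false; // handling of missing referers is glossed over here
  return new URL(record.referer).hostname !== record.hostname;
}

const logs = [
  { status: 200, referer: "https://news.example.com/post", hostname: "upinatoms.com" },
  { status: 200, referer: "https://upinatoms.com/", hostname: "upinatoms.com" },
];
console.log(logs.filter(isVisit).length); // 1: only the external referer counts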

Free, Privacy-First Analytics for a Better Web

A visit has slightly different semantics from a “unique”, and you should expect this number to differ from other analytics tools.

All of the details, none of the bots

Our analytics deliver the most important metrics about your website, like page views and visits. But we know that an essential analytics feature is flexibility: the ability to add arbitrary filters, and slice-and-dice data as you see fit. Our analytics can show you the top hostnames, URLs, countries, and other critical metrics like status codes. You can filter on any of these metrics with a click and see the whole dashboard update.

I’m especially excited about two features in our time series charts: the ability to drag-to-zoom into a narrower time range, and the ability to “group by” different dimensions to see data in a different way. This is a super powerful way to drill into an anomaly in traffic and quickly see what’s going on. For example, you might notice a spike in traffic, zoom into that spike, and then try different groupings to see what contributed the extra clicks. A GIF is worth a thousand words:

And for customers of our Bot Management product, we’re working on the ability to detect (and remove) automated traffic. Coming very soon, you’ll be able to see which bots are reaching your website and, with just a click, block them using Firewall Rules.

This is all possible thanks to our ABR analytics technology, which enables us to serve analytics very quickly for websites large and small. Check out our blog post to learn more about how this works.

Edge or Browser analytics? Why not both?

There are two ways to collect web analytics data: at the edge (or on an origin server), or in the client using a JavaScript beacon.

Historically, Cloudflare has collected analytics data at our edge. This has some nice benefits over traditional, client-side analytics approaches:

  • It’s more accurate because you don’t miss users who block third-party scripts, or JavaScript altogether
  • You can see all of the traffic back to your origin server, even if an HTML page doesn’t load
  • We can detect (and block) bots, apply Firewall Rules, and generally scrub traffic of unwanted noise
  • You can measure the performance of your origin server

More commonly, most web analytics providers use client-side measurement. This has some benefits as well:

  • You can understand performance as your users see it — e.g. how long did the page actually take to render
  • You can detect errors in client-side JavaScript execution
  • You can define custom event types emitted by JavaScript frameworks

Ultimately, we want our customers to have the best of both worlds. We think it’s really powerful to get web traffic numbers directly from the edge. We also launched Browser Insights a year ago to augment our existing edge analytics with more performance information, and today Browser Insights are taking a big step forward by incorporating Web Vitals metrics.

But, we know not everyone can modify their DNS to take advantage of Cloudflare’s edge services. That’s why today we’re announcing a free, standalone analytics product for everyone.

How do I get it?

For existing Cloudflare customers on our Pro, Biz, and Enterprise plans, just go to your Analytics tab! Starting today, you’ll see a banner to opt-in to the new analytics experience. (We plan to make this the default in a few weeks.)

But when building privacy-first analytics, we realized it’s important to make this accessible even to folks who don’t use Cloudflare today. You’ll be able to use Cloudflare’s web analytics even if you can’t change your DNS servers — just add our JavaScript, and you’re good to go.

We’re still putting the finishing touches on our JavaScript-based analytics, but you can sign up here and we’ll let you know when it’s ready.

The evolution of analytics at Cloudflare

Just over a year ago, Cloudflare’s analytics consisted of a simple set of metrics: cached vs uncached data transfer, or how many requests were blocked by the Firewall. Today we provide flexible, powerful analytics across all our products, including Firewall, Cache, Load Balancing and Network traffic.

While we’ve been focused on building analytics about our products, we realized that our analytics are also powerful as a standalone product. Today is just the first step on that journey. We have so much more planned: from real-time analytics, to ever-more performance analysis, and even allowing customers to add custom events.

We want to hear what you want most out of analytics — drop a note in the comments to let us know what you want to see next.

Explaining Cloudflare’s ABR Analytics

Post Syndicated from Jamie Herre original https://blog.cloudflare.com/explaining-cloudflares-abr-analytics/

Explaining Cloudflare's ABR Analytics

Cloudflare’s analytics products help customers answer questions about their traffic by analyzing the mind-boggling, ever-increasing number of events (HTTP requests, Workers requests, Spectrum events) logged by Cloudflare products every day.  The answers to these questions depend on the point of view of the question being asked, and we’ve come up with a way to exploit this fact to improve the quality and responsiveness of our analytics.

Useful Accuracy

Consider the following questions and answers:

What is the length of the coastline of Great Britain? 12.4K km
What is the total world population? 7.8B
How many stars are in the Milky Way? 250B
What is the total volume of the Antarctic ice shelf? 25.4M km³
What is the worldwide production of lentils? 6.3M tonnes
How many HTTP requests hit my site in the last week? 22.6M

Useful answers do not benefit from being overly exact.  For large quantities, knowing the correct order of magnitude and a few significant digits gives the most useful answer.  At Cloudflare, the difference in traffic between different sites or when a single site is under attack can cross nine orders of magnitude and, in general, all our traffic follows a Pareto distribution, meaning that what’s appropriate for one site or one moment in time might not work for another.

Explaining Cloudflare's ABR Analytics

Because of this distribution, a query that scans a few hundred records for one customer will need to scan billions for another.  A report that needs to load a handful of rows under normal operation might need to load millions when a site is under attack.

To get a sense of the relative difference of each of these numbers, remember “Powers of Ten”, an amazing visualization that Ray and Charles Eames produced in 1977.  Notice that the scale of an image determines what resolution is practical for recording and displaying it.

Explaining Cloudflare's ABR Analytics

Using ABR to determine resolution

This basic fact informed our design and implementation of ABR for Cloudflare analytics.  ABR stands for “Adaptive Bit Rate”.  We borrowed the name from video streaming, such as Cloudflare’s own Stream Delivery, where the server selects the best resolution for a video stream to match your client and network connection.

In our case, every analytics query that supports ABR will be calculated at a resolution matching the query.  For example, if you want to know which country generated the most firewall events in the past week, the system might opt to use a lower resolution version of the firewall data than if you had opted to look at the last hour. The lower resolution version will provide the same answer but take less time and fewer resources.  By using multiple, different resolutions of the same data, our analytics can provide consistent response times and a better user experience.

You might be aware that we use a columnar store called ClickHouse to store and process our analytics data.  When using ABR with ClickHouse, we write the same data at multiple resolutions into separate tables.  Usually, we store seven resolutions – from 100% down to 0.0001% of the original events.  We wind up using an additional 12% of disk storage but enable very fast ad hoc queries on the reduced resolution tables.
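
As a rough illustration, a query planner over such tables only needs to pick the coarsest resolution that still leaves enough sampled rows, then scale counts back up. This sketch is illustrative only; the sample rates and the row threshold are assumptions, not our production logic:

// Seven resolutions, from 100% down to 0.0001% of events.
const SAMPLE_RATES = [1, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001];

// Pick the coarsest table that still yields enough sampled rows
// for a statistically useful answer.
function chooseRate(estimatedRows, minSampledRows = 10000) {
  for (const rate of [...SAMPLE_RATES].reverse()) {
    if (estimatedRows * rate >= minSampledRows) return rate;
  }
  return 1; // small data set: just scan everything
}

// A count from a sampled table is scaled by 1 / rate.
function estimateTotal(sampledCount, rate) {
  return Math.round(sampledCount / rate);
}

const rate = chooseRate(5e9);             // e.g. a site under attack
console.log(rate);                        // 0.00001: scan roughly 50k rows
console.log(estimateTotal(49876, rate));  // about 4.99B requests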

Explaining Cloudflare's ABR Analytics

Aggregations and Rollups

The ABR technique facilitates aggregations by making compact estimates of every dimension.  Another way to achieve the same ends is with a system that computes “rollups”.  Rollups save space by computing either complete or partial aggregations of the data as it arrives.  

For example, suppose we wanted to count the total number of lentils. (Lentils are legumes and among the oldest and most widely cultivated crops.  They are a staple food in many parts of the world.)  We could just count each lentil as it passed through the processing system. Of course, because there are a lot of lentils, that system is distributed – meaning that there are hundreds of separate machines.  Therefore we’ll actually have hundreds of separate counters.

Also, we’ll want to include more information than just the count, so we’ll also include the weight of each lentil and maybe 10 or 20 other attributes. And of course, we don’t want just a total for each attribute, but we’ll want to be able to break it down by color, origin, distributor and many other things, and also we’ll want to break these down by slices of time.

In the end, we’ll have tens of thousands or possibly millions of aggregations to be tabulated and saved every minute.  These aggregations are expensive to compute, especially when using aggregations more complicated than simple counters and sums.  They also destroy some information.  For example, once we’ve processed all the lentils through the rollups, we can’t say for sure that we’ve counted them all, and most importantly, whichever attributes we neglected to aggregate are unavailable.

The number we’re counting, 6.3M tonnes, only includes two significant digits, which can easily be achieved by counting a sample.  Most of the rollup computations used on each lentil (on the order of 10^13 lentils to account for 6.3M tonnes) are wasted.

Other forms of aggregations

So far, we’ve discussed ABR and its application to aggregations, but we’ve only given examples involving “counts” and “sums”.  There are other, more complex forms of aggregations we use quite heavily.  Two examples are “topK” and “count-distinct”.

A “topK” aggregation attempts to show the K most frequent items in a set.  For example, the most frequent IP address, or country.  To compute topK, just count the frequency of each item in the set and return the K items with the highest frequencies. Under ABR, we compute topK based on the set found in the matching resolution sample. Using a sample makes this computation a lot faster and less complex, but there are problems.

The estimate of topK derived from a sample is biased and dependent on the distribution of the underlying data. This can result in overestimating the significance of elements in the set as compared to their frequency in the full set. In practice this effect can only be noticed when the cardinality of the set is very high, and you’re not going to notice it on a Cloudflare dashboard.  If your site has a lot of traffic and you’re looking at the top K URLs or browser types, there will be no difference visible at different resolutions.  Also keep in mind that as long as we’re estimating the “proportion” of the element in the set and the set is large, the results will be quite accurate.
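
A minimal sketch of the sampled topK computation described above follows; because we rank by frequency and report proportions, no scale-up factor is needed:

// topK over a sample: count frequencies, sort, keep the K largest.
function topK(sampleValues, k) {
  const counts = new Map();
  for (const v of sampleValues) {
    counts.set(v, (counts.get(v) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([value, count]) => ({ value, proportion: count / sampleValues.length }));
}

console.log(topK(["US", "DE", "US", "FR", "US", "DE"], 2));
// [ { value: "US", proportion: 0.5 }, { value: "DE", proportion: 0.33... } ]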

The other fascinating aggregation we support is known as “count-distinct”, or number of uniques.  In this case we want to know the number of unique values in a set.  For example, how many unique cache keys have been used.  We can safely say that a uniform random sample of the set cannot be used to estimate this number.  However, we do have a solution.

We can generate another, alternate sample based on the value in question.  For example, instead of taking a random sample of all requests, we take a random sample of IP addresses.  This is sometimes called distinct reservoir sampling, and it allows us to estimate the true number of distinct IPs based on the cardinality of the sampled set. Again, there are techniques available to improve these estimates, and we’ll be implementing some of those.
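
One simple way to build such a value-based sample is to hash each value into [0, 1) and keep only those below a fixed fraction p; the distinct count of the sample divided by p then estimates the true cardinality. A minimal sketch, not our production implementation:

const crypto = require("crypto");

// Map a value deterministically to [0, 1), so every event carrying the
// same IP is either always sampled or never sampled.
function hashFraction(value) {
  const digest = crypto.createHash("sha256").update(value).digest();
  return digest.readUInt32BE(0) / 2 ** 32;
}

function estimateDistinct(values, p = 0.01) {
  const sampled = new Set();
  for (const v of values) {
    if (hashFraction(v) < p) sampled.add(v);
  }
  // Each distinct value lands in the sample with probability p,
  // so the sampled cardinality scales up by 1 / p.
  return Math.round(sampled.size / p);
}

// Usage (allRequestIPs is a hypothetical array of IP strings):
// console.log(estimateDistinct(allRequestIPs)); // approximate unique IPs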

ABR improves resilience and scalability

Using ABR saves us resources.  Even better, it allows us to query all the attributes in the original data, not just those included in rollups.  And even better, it allows us to check our assumptions against different sample intervals in separate tables as a check that the system is working correctly, because the original events are preserved.

However, the greatest benefits of employing ABR are the ones that aren’t directly visible. Even under ideal conditions, a large distributed system such as Cloudflare’s data pipeline is subject to high tail latency.  This occurs when any single part of the system takes longer than usual for any number of a long list of reasons.  In these cases, the ABR system will adapt to provide the best results available at that moment in time.

For example, compare this chart showing Cache Performance for a site under attack with the same chart generated a moment later while we simulate a failure of some of the servers in our cluster.  In the days before ABR, your Cloudflare dashboard would fail to load in this scenario.  Now, with ABR analytics, you won’t see significant degradation.

Explaining Cloudflare's ABR Analytics
Explaining Cloudflare's ABR Analytics

Stretching the analogy to ABR in video streaming, we want you to be able to enjoy your analytics dashboard without being bothered by issues related to faulty servers, or network latency, or long running queries.  With ABR you can get appropriate answers to your questions reliably and within a predictable amount of time.

In the coming months, we’re going to be releasing a variety of new dashboards and analytics products based on this simple but profound technology.  Watch your Cloudflare dashboard for increasingly useful and interactive analytics.

Introducing Cron Triggers for Cloudflare Workers

Post Syndicated from Nancy Gao original https://blog.cloudflare.com/introducing-cron-triggers-for-cloudflare-workers/

Introducing Cron Triggers for Cloudflare Workers

Introducing Cron Triggers for Cloudflare Workers

Today the Cloudflare Workers team is thrilled to announce the launch of Cron Triggers. Before now, Workers were triggered purely by incoming HTTP requests, but starting today you’ll be able to set a scheduler to run your Worker on a timed interval. This was a highly requested feature that we know a lot of developers will find useful, and we’ve heard your feedback after Serverless Week.

Introducing Cron Triggers for Cloudflare Workers

We are excited to offer this feature at no additional cost, and it will be available on both the Workers free tier and the paid tier, now called Workers Bundled. Since it doesn’t matter which city a Cron Trigger routes the Worker through, we are able to make the most of Cloudflare’s distributed system and send scheduled jobs to underutilized machinery. Running jobs on these quiet machines is both efficient and cost effective, and we are able to pass those cost savings down to you.

What is a Cron Trigger and how might I use such a feature?

Introducing Cron Triggers for Cloudflare Workers

In case you’re not familiar with Unix systems, the cron pattern allows you to schedule jobs to run periodically at fixed intervals or at scheduled times. Cron Triggers, in the context of Workers, allow users to set time-based invocations for a job. These Workers run on a recurring schedule and, unlike traditional Workers, do not fire on HTTP requests.

Most developers are familiar with the cron pattern and its usefulness across a wide range of applications. Pulling the latest data from APIs or running regular integration tests on a preset schedule are common examples of this.
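
For instance, here is a minimal sketch of a Worker that uses the new scheduled event to pull fresh data on a timed interval. The API URL and the RATES_CACHE KV binding are hypothetical:

// Runs on the cron schedule configured for this Worker, not on HTTP requests.
addEventListener('scheduled', event => {
  event.waitUntil(refreshRates());
});

async function refreshRates() {
  const response = await fetch('https://api.example.com/exchange-rates');
  if (!response.ok) return; // keep the last good copy on failure
  await RATES_CACHE.put('latest-rates', await response.text());
}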

“We’re excited about Cron Triggers. Workers is crucial to our stack, so using this feature for live integration tests will boost the developer experience.” – Brian Marks, Software Engineer at Bazaarvoice

How much does it cost to use Cron Triggers?

Triggers are included at no additional cost! Scheduled Workers count towards your request cap for both the free tier and Workers Bundled, but rest assured that there will be no hidden or extra fees. Our competitors charge extra for cron events, or in some cases offer a very limited free tier. We want to make this feature widely accessible and have decided not to charge on a per-trigger basis. While there are no limits for the number of triggers you can have across an account, note that there is a limit of 3 triggers per Worker script for this feature. You can read more about limits on Workers plans in this documentation.

How are you able to offer this feature at no additional cost?

Cloudflare supports a massive distributed system that spans the globe with 200+ cities. Our nodes are named for the IATA airport code that they are closest to. Most of the time we run Workers close to the request origin for performance reasons (i.e., SFO if you are in the Bay Area, or CDG if you are lucky enough to be in Paris 🥐🍷🧀). In a typical HTTP Worker, we do this because we know that performance is of material importance when someone is waiting for the response.

In the case of Cron Triggers, where the user is running a task on a timed basis, those performance needs are different. A few milliseconds of extra latency do not matter as much when the user isn’t actively waiting for the response. The nature of the feature gives us much more flexibility on where to run the job, since it doesn’t have to necessarily be in a city close to the end user.

Cron Triggers are run on underutilized machines to make the best use of our capacity and route traffic efficiently. For example, a job scheduled from San Francisco at 7pm Pacific Time might be sent to Paris because it’s 4am there and traffic across Europe is low.  Sending traffic to these machines during quiet hours is very efficient, and we are more than happy to pass those cost savings down to you. Aside from this scheduling optimization, Workers that are called by Cron Triggers behave similarly to and have all of the same performance and security benefits as typical HTTP Workers.

What’s happening below the hood?

At a high level, schedules created through our API create records in our database. These records contain the information necessary to execute the Worker on the given cron schedule. These records are then picked up by another service which continuously evaluates the state of our edge and distributes the schedules among cities. Once the schedules have been distributed to the edge, a service running in the node polls for changes to the schedules and makes sure they get sent to our runtime at the appropriate time.

If you want to know more details about how we implemented this feature, please refer to the technical blog.

What’s coming next?

With this feature, we’ve expanded what’s possible to build with Workers, and further simplified the developer experience. While Workers previously only ran on web requests, we believe the future of edge computing isn’t strictly tied to HTTP requests and responses.  We want to introduce more types of Workers in the future.

We plan to expand out triggers to include different types, such as data or event-based triggers. Our goal is to give users more flexibility and control over when their Workers run. Cron Triggers are our first step in this direction. In addition, we plan to keep iterating on Cron Triggers to make edge infrastructure selection even more sophisticated and optimized — for example, we might even consider triggers that allow our users to run in the most energy-efficient data centers.

How to try Cron Triggers

Cron triggers are live today! You can try it in the Workers dashboard by creating a new Worker and setting up a Cron Trigger.

Making Time for Cron Triggers: A Look Inside

Post Syndicated from Aaron Lisman original https://blog.cloudflare.com/cron-triggers-for-scheduled-workers/

Making Time for Cron Triggers: A Look Inside

Making Time for Cron Triggers: A Look Inside

Today, we are excited to launch Cron Triggers to the Cloudflare Workers serverless compute platform. We’ve heard the developer feedback, and we want to give our users the ability to run a given Worker on a scheduled basis. In case you’re not familiar with Unix systems, the cron pattern allows developers to schedule jobs to run at fixed intervals. This pattern is ideal for running any types of periodic jobs like maintenance or calling third party APIs to get up-to-date data. Cron Triggers has been a highly requested feature even inside Cloudflare and we hope that you will find this feature as useful as we have!

Making Time for Cron Triggers: A Look Inside

Where are Cron Triggers going to be run?

Cron Triggers are executed from the edge. At Cloudflare, we believe strongly in edge computing and wanted our new feature to get all of the performance and reliability benefits of running on our edge. Thus, we wrote a service in core that is responsible for distributing schedules to a new edge service through Quicksilver, which will then trigger the Workers themselves.

What’s happening under the hood?

At a high level, schedules created through our API create records in our database with the information necessary to execute the Worker and the given cron schedule. These records are then picked up by another service which continuously evaluates the state of our edge and distributes the schedules between cities.

Once the schedules have been distributed to the edge, a service running in the edge node polls for changes to the schedules and makes sure they get sent to our runtime at the appropriate time.

New Event Type

Making Time for Cron Triggers: A Look Inside

Cron Triggers gave us the opportunity to finally recognize a new Worker ‘type’ in our API. While Workers currently only run on web requests, we have lots of ideas for the future of edge computing that aren’t strictly tied to HTTP requests and responses. Expect to see even more new handlers in the future for other non-HTTP events like log information from your Worker (think custom wrangler tail!) or even TCP Workers.

Here’s an example of the new Javascript API:

addEventListener('scheduled', event => {
  event.waitUntil(someAsyncFunction(event))
})

Where event has the following interface in TypeScript:

interface ScheduledEvent {
  type: 'scheduled';
  scheduledTime: number; // milliseconds since the Unix epoch
}

As long as your Worker has a handler for this new event type, you’ll be able to give it a schedule.

New APIs

PUT /client/v4/accounts/:account_identifier/workers/scripts/:name

The script upload API remains the same, but during script validation we now detect and return the registered event handlers.

PUT /client/v4/accounts/:account_identifier/workers/scripts/:name/schedules
Body

[
 {"cron": "* * * * *"},
 ...
]

This will create or modify all schedules for a script, removing all schedules not in the list. For now, there’s a limit of 3 distinct cron schedules. Schedules can be set to run as often as once per minute, and don’t accept schedules with years in them (sorry, you’ll have to run your Y3K migration script another way).
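
As a quick sketch, here’s how you might call the endpoint above from JavaScript; YOUR_EMAIL, YOUR_API_KEY, ACCOUNT_ID, and SCRIPT_NAME are placeholders:

async function setSchedules() {
  const url = 'https://api.cloudflare.com/client/v4/accounts/ACCOUNT_ID' +
              '/workers/scripts/SCRIPT_NAME/schedules';
  const response = await fetch(url, {
    method: 'PUT',
    headers: {
      'X-Auth-Email': 'YOUR_EMAIL',
      'X-Auth-Key': 'YOUR_API_KEY',
      'Content-Type': 'application/json',
    },
    // Replaces the script's full set of schedules (max 3 distinct).
    body: JSON.stringify([
      { cron: '*/15 * * * *' }, // every 15 minutes
      { cron: '0 3 * * *' },    // daily at 03:00 UTC
    ]),
  });
  console.log(response.status);
}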

GET /client/v4/accounts/:account_identifier/workers/scripts/:name/schedules
Response

{
 "schedules": [
   {
     "cron": "* * * * *",
      "created_on": <time>,
      "modified_on": <time>
   },
   ...
 ]
}

The Scheduler service is responsible for reading the schedules from Postgres and generating per-node schedules to place into Quicksilver. For now, the service simply avoids trying to execute your Worker on an edge node that may be disabled for some reason, but such an approach also gives us a lot of flexibility in deciding where your Worker executes.

In addition to edge node availability, we could optimize for compute cost, bandwidth, or even latency in the future!

What’s actually executing these schedules?

To consume the schedules and actually trigger the Worker, we built a new service in Rust and deployed it to our edge using HashiCorp Nomad. Nomad ensures that the schedule runner remains running in the edge node and can move it between machines as necessary. Rust was the best choice for this service since it needed to be fast and highly available, with Cap’n Proto RPC support for calling into the runtime. With Tokio, Anyhow, Clap, and Serde, it was easy to quickly get the service up and running without having to really worry about async, error handling, or configuration.

On top of that, due to our specific needs for cron parsing, we built a specialized cron parser using nom that allowed us to quickly parse and compile expressions into values that check against a given time to determine if we should run a schedule.
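
The real parser is written in Rust with nom, but the core idea of compiling an expression into values that can be checked against a given time can be sketched in a few lines of JavaScript. This simplified version handles only * and comma-separated lists, omitting ranges, steps, and cron’s day-of-month/day-of-week rules:

function compileField(field, min, max) {
  const allowed = new Set();
  if (field === '*') {
    for (let v = min; v <= max; v++) allowed.add(v);
  } else {
    for (const part of field.split(',')) allowed.add(Number(part));
  }
  return allowed;
}

// Compile "minute hour day-of-month month day-of-week" into sets of values.
function compile(expr) {
  const [minute, hour, dom, month, dow] = expr.split(' ');
  return {
    minutes: compileField(minute, 0, 59),
    hours: compileField(hour, 0, 23),
    daysOfMonth: compileField(dom, 1, 31),
    months: compileField(month, 1, 12),
    daysOfWeek: compileField(dow, 0, 6),
  };
}

function matches(schedule, date) {
  return schedule.minutes.has(date.getUTCMinutes()) &&
         schedule.hours.has(date.getUTCHours()) &&
         schedule.daysOfMonth.has(date.getUTCDate()) &&
         schedule.months.has(date.getUTCMonth() + 1) &&
         schedule.daysOfWeek.has(date.getUTCDay());
}

const everyQuarterHour = compile('0,15,30,45 * * * *');
console.log(matches(everyQuarterHour, new Date('2020-10-01T12:15:00Z'))); // true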

Once the schedule runner has the schedules, it checks the time and selects the Workers that need to be run. To let the runtime know it’s time to run, we send a Cap’n Proto RPC message. The runtime then does its thing, calling the new ‘scheduled’ event handler instead of ‘fetch’.

How can I try this?

As of today, the Cron Triggers feature is live! Please try it out by creating a Worker and finding the Triggers tab – we’re excited to see what you build with it!

Workers Durable Objects Beta: A New Approach to Stateful Serverless

Post Syndicated from Kenton Varda original https://blog.cloudflare.com/introducing-workers-durable-objects/

Workers Durable Objects Beta:
A New Approach to Stateful Serverless

Workers Durable Objects Beta:
A New Approach to Stateful Serverless

We launched Cloudflare Workers® in 2017 with a radical vision: code running at the network edge could not only improve performance, but also be easier to deploy and cheaper to run than code running in a single datacenter. That vision means Workers is about more than just edge compute — we’re rethinking how applications are built.

Using a “serverless” approach has allowed us to make deploys dead simple, and using isolate technology has allowed us to deliver serverless more cheaply and without the lengthy cold starts that hold back other providers. We added easy-to-use eventually-consistent edge storage to the platform with Workers KV.

But up until today, it hasn’t been possible to manage state with strong consistency, or to coordinate in real time between multiple clients, entirely on the edge. Thus, these parts of your application still had to be hosted elsewhere.

Durable Objects provide a truly serverless approach to storage and state: consistent, low-latency, distributed, yet effortless to maintain and scale. They also provide an easy way to coordinate between clients, whether it be users in a particular chat room, editors of a particular document, or IoT devices in a particular smart home. Durable Objects are the missing piece in the Workers stack that makes it possible for whole applications to run entirely on the edge, with no centralized “origin” server at all.

Today we are beginning a closed beta of Durable Objects.

Request a beta invite »

What is a “Durable Object”?

I’m going to be honest: naming this product was hard, because it’s not quite like any other cloud technology that is widely-used today. This proverbial bike shed has many layers of paint, but ultimately we settled on “Unique Durable Objects”, or “Durable Objects” for short. Let me explain what they are by breaking that down:

  • Objects: Durable Objects are objects in the sense of Object-Oriented Programming. A Durable Object is an instance of a class — literally, a class definition written in JavaScript (or your language of choice). The class has methods which define its public interface. An object is an instance of this class, combining the code with some private state.
  • Unique: Each object has a globally-unique identifier. That object exists in only one location in the whole world at a time. Any Worker running anywhere in the world that knows the object’s ID can send messages to it. All those messages end up delivered to the same place.
  • Durable: Unlike a normal object in JavaScript, Durable Objects can have persistent state stored on disk. Each object’s durable state is private to it, which means not only that access to storage is fast, but the object can even safely maintain a consistent copy of the state in memory and operate on it with zero latency. The in-memory object will be shut down when idle and recreated later on-demand.

What can they do?

Durable Objects have two primary abilities:

  • Storage: Each object has attached durable storage. Because this storage is private to a specific object, the storage is always co-located with the object. This means the storage can be very fast while providing strong, transactional consistency. Durable Objects apply the serverless philosophy to storage, splitting the traditional large monolithic databases up into many small, logical units. In doing so, we get the advantages you’ve come to expect from serverless: effortless scaling with zero maintenance burden.
  • Coordination: Historically, with Workers, each request would be randomly load-balanced to a Worker instance. Since there was no way to control which instance received a request, there was no way to force two clients to talk to the same Worker, and therefore no way for clients to coordinate through Workers. Durable Objects change that: requests related to the same topic can be forwarded to the same object, which can then coordinate between them, without any need to touch storage. For example, this can be used to facilitate real-time chat, collaborative editing, video conferencing, pub/sub message queues, game sessions, and much more.

The astute reader may notice that many coordination use cases call for WebSockets — and indeed, conversely, most WebSocket use cases require coordination. Because of this complementary relationship, along with the Durable Objects beta, we’ve also added WebSocket support to Workers. For more on this, see the Q&A below.

Region: Earth

Workers Durable Objects Beta:
A New Approach to Stateful Serverless

When using Durable Objects, Cloudflare automatically determines the Cloudflare datacenter that each object will live in, and can transparently migrate objects between locations as needed.

Traditional databases and stateful infrastructure usually require you to think about geographical “regions”, so that you can be sure to store data close to where it is used. Thinking about regions can often be an unnatural burden, especially for applications that are not inherently geographical.

With Durable Objects, you instead design your storage model to match your application’s logical data model. For example, a document editor would have an object for each document, while a chat app would have an object for each chat. There is no problem creating millions or billions of objects, as each object has minimal overhead.

Killer app: Real-time collaborative document editing

Let’s say you have a spreadsheet editor application — or, really, any kind of app where users edit a complex document. It works great for one user, but now you want multiple users to be able to edit it at the same time. How do you accomplish this?

For the standard web application stack, this is a hard problem. Traditional databases simply aren’t designed to be real-time. When Alice and Bob are editing the same spreadsheet, you want every one of Alice’s keystrokes to appear immediately on Bob’s screen, and vice versa. But if you merely store the keystrokes to a database, and have the users repeatedly poll the database for new updates, at best your application will have poor latency, and at worst you may find database transactions repeatedly fail as users on opposite sides of the world fight over editing the same content.

The secret to solving this problem is to have a live coordination point. Alice and Bob connect to the same coordinator, typically using WebSockets. The coordinator then forwards Alice’s keystrokes to Bob and Bob’s keystrokes to Alice, without having to go through a storage layer. When Alice and Bob edit the same content at the same time, the coordinator resolves conflicts instantly. The coordinator can then take responsibility for updating the document in storage — but because the coordinator keeps a live copy of the document in-memory, writing back to storage can happen asynchronously.

Every big-name real-time collaborative document editor works this way. But for many web developers, especially those building on serverless infrastructure, this kind of solution has long been out-of-reach. Standard serverless infrastructure — and even cloud infrastructure more generally — just does not make it easy to assign these coordination points and direct users to talk to the same instance of your server.

Durable Objects make this easy. Not only do they make it easy to assign a coordination point, but Cloudflare will automatically create the coordinator close to the users using it and migrate it as needed, minimizing latency. The availability of local, durable storage means that changes to the document can be saved reliably in an instant, even if the eventual long-term storage is slower. Or, you can even store the entire document on the edge and abandon your database altogether.

With Durable Objects lowering the barrier, we hope to see real-time collaboration become the norm across the web. There’s no longer any reason to make users refresh for updates.

Example: An atomic counter

Here’s a very simple example of a Durable Object which can be incremented, decremented, and read over HTTP. This counter is consistent even when receiving simultaneous requests from multiple clients — none of the increments or decrements will be lost. At the same time, reads are served entirely from memory, no disk access needed.

export class Counter {
  // Constructor called by the system when the object is needed to
  // handle requests.
  constructor(controller, env) {
    // `controller.storage` is an interface to access the object's
    // on-disk durable storage.
    this.storage = controller.storage
  }

  // Private helper method called from fetch(), below.
  async initialize() {
    let stored = await this.storage.get("value");
    this.value = stored || 0;
  }

  // Handle HTTP requests from clients.
  //
  // The system calls this method when an HTTP request is sent to
  // the object. Note that these requests strictly come from other
  // parts of your Worker, not from the public internet.
  async fetch(request) {
    // Make sure we're fully initialized from storage.
    if (!this.initializePromise) {
      this.initializePromise = this.initialize();
    }
    await this.initializePromise;

    // Apply requested action.
    let url = new URL(request.url);
    switch (url.pathname) {
      case "/increment":
        ++this.value;
        await this.storage.put("value", this.value);
        break;
      case "/decrement":
        --this.value;
        await this.storage.put("value", this.value);
        break;
      case "/":
        // Just serve the current value. No storage calls needed!
        break;
      default:
        return new Response("Not found", {status: 404});
    }

    // Return current value.
    return new Response(this.value);
  }
}

Once the class has been bound to a Durable Object namespace, a particular instance of Counter can be accessed from anywhere in the world using code like:

// Derive the ID for the counter object named "my-counter".
// This name is associated with exactly one instance in the
// whole world.
let id = COUNTER_NAMESPACE.idFromName("my-counter");

// Send a request to it.
let response = await COUNTER_NAMESPACE.get(id).fetch(request);
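
Putting the two snippets together, a minimal front-end Worker might look like this; COUNTER_NAMESPACE is the Durable Object namespace binding from your deployment configuration:

addEventListener('fetch', event => {
  event.respondWith(handle(event.request));
});

async function handle(request) {
  // Every request, from anywhere in the world, reaches the same instance.
  // Deriving the name from the URL instead would give one counter per page.
  const id = COUNTER_NAMESPACE.idFromName('my-counter');
  return COUNTER_NAMESPACE.get(id).fetch(request);
}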

Demo: Chat

Chat is arguably real-time collaboration in its purest form. And to that end, we have built an open source demo chat app that runs entirely at the edge using Durable Objects.

Try the live demo »
See the source code on GitHub »

This chat app uses a Durable Object to control each chat room. Users connect to the object using WebSockets. Messages from one user are broadcast to all the other users. The chat history is also stored in durable storage, but this is only for history. Real-time messages are relayed directly from one user to others without going through the storage layer.

Additionally, this demo uses Durable Objects for a second purpose: Applying a rate limit to messages from any particular IP. Each IP is assigned a Durable Object that tracks recent request frequency, so that users who send too many messages can be temporarily blocked — even across multiple chat rooms. Interestingly, these objects don’t actually store any durable state at all, because they only care about very recent history, and it’s not a big deal if a rate limiter randomly resets on occasion. So, these rate limiter objects are an example of a pure coordination object with no storage.
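
As a rough sketch of what such a storage-free object might look like (this is our illustration, not the demo’s actual code; the 60-second window and 20-message budget are made-up numbers):

export class RateLimiter {
  constructor(controller, env) {
    // No durable storage: if this object is reset, the limiter simply
    // starts fresh, which is acceptable for rate limiting.
    this.timestamps = [];
  }

  async fetch(request) {
    let now = Date.now();
    // Keep only the requests seen in the last 60 seconds.
    this.timestamps = this.timestamps.filter(t => now - t < 60000);
    if (this.timestamps.length >= 20) {
      return new Response("rate limited", {status: 429});
    }
    this.timestamps.push(now);
    return new Response("ok");
  }
}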

This chat app is only a few hundred lines of code. The deployment configuration is only a few lines. Yet, it will scale seamlessly to any number of chat rooms, limited only by Cloudflare’s available resources. Of course, any individual chat room’s scalability has a limit, since each object is single-threaded. But, that limit is far beyond what a human participant could keep up with anyway.

Other use cases

Durable Objects have infinite uses. Here are just a few ideas, beyond the ones described above:

  • Shopping cart: An online storefront could track a user’s shopping cart in an object (see the sketch after this list). The rest of the storefront could be served as a fully static web site. Cloudflare will automatically host the cart object close to the end user, minimizing latency.
  • Game server: A multiplayer game could track the state of a match in an object, hosted on the edge close to the players.
  • IoT coordination: Devices within a family’s house could coordinate through an object, avoiding the need to talk to distant servers.
  • Social feeds: Each user could have a Durable Object that aggregates their subscriptions.
  • Comment/chat widgets: A web site that is otherwise static content can add a comment widget or even a live chat widget on individual articles. Each article would use a separate Durable Object to coordinate. This way the origin server can focus on static content only.
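
For instance, a minimal shopping cart object might look like the following sketch, built on the same storage API as the Counter above (the route and storage layout are our own invention):

export class ShoppingCart {
  constructor(controller, env) {
    this.storage = controller.storage;
  }

  async fetch(request) {
    // Load the cart (a map of item ID to quantity) from durable storage.
    let cart = await this.storage.get("cart") || {};

    let url = new URL(request.url);
    if (url.pathname === "/add") {
      let item = url.searchParams.get("item");
      cart[item] = (cart[item] || 0) + 1;
      await this.storage.put("cart", cart);
    }

    return new Response(JSON.stringify(cart));
  }
}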

The Future: True Edge Databases

We see Durable Objects as a low-level primitive for building distributed systems. Some applications, like those mentioned above, can use objects directly to implement a coordination layer, or maybe even as their sole storage layer.

However, Durable Objects today are not a complete database solution. Each object can see only its own data. To perform a query or transaction across multiple objects, the application needs to do some extra work.

That said, every big distributed database – whether it be relational, document, graph, etc. – is, at some low level, composed of “chunks” or “shards” that store one piece of the overall data. The job of a distributed database is to coordinate between chunks.

We see a future of edge databases that store each “chunk” as a Durable Object. By doing so, it will be possible to build databases that operate entirely at the edge, fully distributed with no regions or home location. These databases need not be built by us; anyone can potentially build them on top of Durable Objects. Durable Objects are only the first step in the edge storage journey.

Join the Beta

Storing data is a big responsibility which we do not take lightly. Because of the critical importance of getting it right, we are being careful. We will be making Durable Objects available gradually over the next several months.

As with any beta, this product is a work in progress, and some of what is described in this post is not fully enabled yet. Full details of beta limitations can be found in the documentation.

If you’d like to try out Durable Objects now, tell us about your use case. We’ll be selecting the most interesting use cases for early access.

Request a beta invite »

Q&A

Can Durable Objects serve WebSockets?

Yes.

As part of the Durable Objects beta, we’ve made it possible for Workers to act as WebSocket endpoints — including as a client or as a server. Before now, Workers could proxy WebSocket connections on to a back-end server, but could not speak the protocol directly.

While technically any Worker can speak WebSocket in this way, WebSockets are most useful when combined with Durable Objects. When a client connects to your application using a WebSocket, you need a way for server-generated events to be sent back to the existing socket connection. Without Durable Objects, there’s no way to send an event to the specific Worker holding a WebSocket. With Durable Objects, you can now forward the WebSocket to an Object. Messages can then be addressed to that Object by its unique ID, and the Object forwards them down the WebSocket to the client.
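
In code, the server side of this pattern looks roughly like the following sketch of a Durable Object that accepts WebSockets and broadcasts messages (the `sessions` bookkeeping is ours; error handling is omitted):

export class Room {
  constructor(controller, env) {
    this.sessions = [];
  }

  async fetch(request) {
    if (request.headers.get("Upgrade") !== "websocket") {
      return new Response("Expected WebSocket", {status: 426});
    }

    // Create the two ends of the socket: one to return to the client,
    // one for this object to read and write.
    let [client, server] = Object.values(new WebSocketPair());
    server.accept();
    this.sessions.push(server);

    server.addEventListener("message", event => {
      // Broadcast each incoming message to every connected session.
      for (let ws of this.sessions) ws.send(event.data);
    });

    return new Response(null, {status: 101, webSocket: client});
  }
}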

The chat app demo presented above uses WebSockets. Check out the source code to see how it works.

How does this compare to Workers KV?

Two years ago, we introduced Workers KV, a global key-value data store. KV is a fairly minimalist global data store that serves certain purposes well, but is not for everyone. KV is eventually consistent, which means that writes made in one location may not be visible in other locations immediately. Moreover, it implements “last write wins” semantics, which means that if a single key is being modified from multiple locations in the world at once, it’s easy for those writes to overwrite each other. KV is designed this way to support low-latency reads for data that doesn’t frequently change. However, these design decisions make KV inappropriate for state that changes frequently, or when changes need to be immediately visible worldwide.

Durable Objects, in contrast, are not primarily a storage product at all — many use cases for them do not actually utilize durable storage. To the extent that they do provide storage, Durable Objects sit at the opposite end of the storage spectrum from KV. They are extremely well-suited to workloads requiring transactional guarantees and immediate consistency. However, since transactions inherently must be coordinated in a single location, clients on the opposite side of the world from that location will experience moderate latency due to the inherent limitations of the speed of light. Durable Objects will combat this problem by auto-migrating to live close to where they are used.

In short, Workers KV remains the best way to serve static content, configuration, and other rarely-changing data around the world, while Durable Objects are better for managing dynamic state and coordination.

Going forward, we plan to utilize Durable Objects in the implementation of Workers KV itself, in order to deliver even better performance.

Why not use CRDTs?

You can build CRDT-based storage on top of Durable Objects, but Durable Objects do not require you to use CRDTs.

Conflict-free Replicated Data Types (CRDTs), or their cousins, Operational Transforms (OTs), are techniques that allow data to be edited from multiple places in the world simultaneously without synchronization, and without data loss. For example, these technologies are commonly used in the implementation of real-time collaborative document editors, so that a user’s keypresses can show up in their local copy of the document in real time, without waiting to see if anyone else edited another part of the document first. Without getting into details, you can think of these techniques like a real time version of “git fork” and “git merge”, where all merge conflicts are resolved automatically in a deterministic way, so that everyone ends up with the same state in the end.

CRDTs are a powerful technology, but applying them correctly can be challenging. Only certain kinds of data structures lend themselves to automatic conflict resolution in a way that doesn’t lead to easy data loss. Any developer familiar with git can see the problem: arbitrary conflict resolution is hard, and any automated algorithm for it will likely get things wrong sometimes. It’s all the more difficult if the algorithm has to handle merges in arbitrary order and still get the same answer.

We feel that, for most applications, CRDTs are overly complex and not worth the effort. Worse, the set of data structures that can be represented as a CRDT is too limited for many applications. It’s usually much easier to assign a single authoritative coordination point for each document, which is exactly what Durable Objects accomplish.

With that said, CRDTs can be used on top of Durable Objects. If an object’s state lends itself to CRDT treatment, then an application could replicate that object into several objects serving different regions, which then synchronize their states via CRDT. This would make sense for applications to implement as an optimization if and when they find it is worth the effort.

Last thoughts: What does it mean for state to be “serverless”?

Traditionally, serverless has focused on stateless compute. In serverless architectures, the logical unit of compute is reduced to something fine-grained: a single event, such as an HTTP request. This works especially well because events just happen to be the logical unit of work that we think about when designing server applications. No one thinks about their business logic in units of “servers” or “containers” or “processes” — we think about events. It is exactly because of this semantic alignment that serverless succeeds in shifting so much of the logistical burden of maintaining servers away from the developer and towards the cloud provider.

However, serverless architecture has traditionally been stateless. Each event executes in isolation. If you wanted to store data, you had to connect to a traditional database. If you wanted to coordinate between requests, you had to connect to some other service that provides that ability. These external services have tended to re-introduce the operational concerns that serverless was intended to avoid. Developers and service operators have to worry not just about scaling their databases to handle increasing load, but also about how to split their database into “regions” to effectively handle global traffic. The latter concern can be especially cumbersome.

So how can we apply the serverless philosophy to state? Just like serverless compute is about splitting compute into fine-grained pieces, serverless state is about splitting state into fine-grained pieces. Again, we seek to find a unit of state that corresponds to logical units in our application. The logical unit of state in an application is not a “table” or a “collection” or a “graph”. Instead, it depends on the application. The logical unit of state in a chat app is a chat room. The logical unit of state in an online spreadsheet editor is a spreadsheet. The logical unit of state in an online storefront is a shopping cart. By making the physical unit of storage provided by the storage layer match the logical unit of state inherent in the application, we can allow the underlying storage provider (Cloudflare) to take responsibility for a wide array of logistical concerns that previously fell on the developer, including scalability and regionality.

This is what Durable Objects do.

Cloudflare’s Always Online and the Internet Archive Team Up to Fight Origin Errors

Post Syndicated from Alex Krivit original https://blog.cloudflare.com/cloudflares-always-online-and-the-internet-archive-team-up-to-fight-origin-errors/

Cloudflare’s Always Online and the Internet Archive Team Up to Fight Origin Errors

Every day, all across the Internet, something bad but entirely normal happens: thousands of origin servers go down, resulting in connection errors and frustrated users. Cloudflare’s users collectively spend over four and a half years each day waiting for unreachable origin servers to respond with error messages. But visitors don’t want to see error pages; they want to see content!

Today is exciting for all those who want the Internet to be stronger, more resilient, and have important redundancies: Cloudflare is pleased to announce a partnership with the Internet Archive to bring new functionality to our Always Online service.

Always Online serves as insurance for our customers’ websites. Should a customer’s origin go offline, timeout, or otherwise break, Always Online is there to step in and serve archived copies of webpages to visitors. The Internet Archive is a nonprofit organization that runs the Wayback Machine, a service which saves snapshots of billions of websites across the Internet. By partnering with the Internet Archive, Cloudflare is able to seamlessly deliver responses for unreachable websites from the Internet Archive, while the Internet Archive can continue their mission of archiving the web to provide access to all knowledge.

Enabling Always Online in the Cloudflare dashboard allows us to work with the Internet Archive to save a copy of a website to the Wayback Machine. When a website’s origin is down, Cloudflare will go to the Internet Archive to retrieve the most recently archived version of the site, so that visitors will still be able to view the site’s content.

Trying to reach a busted origin

When a person visits a Cloudflare website, a request is made from their laptop/phone/tablet/smart fridge to Cloudflare’s edge. Our edge first looks to see if we can respond with cached content; if the requested content is not in cache, or is determined to be expired, we then obtain a fresh copy from the origin. As part of fulfilling an uncached/expired origin fetch, we also update our cache to allow subsequent requests to be served to visitors faster and more securely.

If we are unable to reach the origin, our edge tries a few more times to connect before marking the origin as being down and serving an error page to the visitor. Receiving an error page is not ideal for anyone, so we try really hard to ensure that visitors to websites using Cloudflare can get some content, even if an origin is struggling.

Cloudflare’s Always Online and the Internet Archive Team Up to Fight Origin Errors

A brief history of Always Online

When Cloudflare started 10 years ago, most of our customers were small and running on hosts that were subject to frequent downtime. These early customers feared that their host may go down at the same time a search engine was indexing their site. The search engine’s crawler would report the downed site as non-responsive and the site would drop in their search ranking. Always Online was born from that concern.

Through operating Always Online over the past 10 years, we’ve learned that our customers and their users deeply value simple, unobtrusive tools for fighting Internet downtime. Some features have undergone rewrite upon rewrite, while other parts of the code have remained relatively untouched, a testament to their robustness. One constant: Always Online clearly shows a banner indicating that it is serving an archived version of the page because the origin is unreachable, and this transparency is well received by both website owners and visitors.

We recently set out to make Always Online even better. We wanted to preserve what customers loved — as seamless an experience as possible for their users when their origin servers are down — while increasing the amount of content available through Always Online, ensuring it is as fresh as possible, and performing this archiving in a way that helps make the Internet a better place.

Cloudflare’s Always Online and the Internet Archive Team Up to Fight Origin Errors
What a visitor will see with Always Online. 

Enter the Internet Archive

Partnering with the Internet Archive’s Wayback Machine to power the next generation of Always Online accomplishes all of these goals. The Internet Archive’s mission is to provide universal access to all knowledge. Since 1996, the Internet Archive’s Wayback Machine has been archiving much of the public Web: preserving and making available millions of websites and pages that would otherwise be lost. In pursuit of that mission, they have archived more than 468 billion web pages, amounting to more than 45 petabytes of information.

Always Online’s integration with the Internet Archive will help the Archive expand their record of the Internet; many of the domains that opt-in to Always Online functionality may not have been otherwise discovered by the Archive’s crawler. And for Cloudflare customers, the Archive will seamlessly provide visitors access to content that would otherwise be errors.

In other words, Cloudflare partnering with the Internet Archive makes the Internet better, stronger, and more available to everyone.

“Through our partnership with Cloudflare, we are learning about, and archiving, webpages we might not have otherwise known about, and by integrating with Cloudflare’s Always Online service, archives of those pages are available to people trying to access them if they become unavailable via the live web.”
Mark Graham, Director of the Internet Archive’s Wayback Machine

“We are excited to work with Cloudflare and expect this partnership to bring important redundancy to the Internet and allow for us to advance our ongoing efforts to make the Internet more useful and reliable.”
Brewster Kahle, Founder and Digital Librarian of the Internet Archive

How does the new Always Online work behind the scenes?

Upgrading to the new Always Online in the Cloudflare dashboard allows us to share some basic information about your website with the Internet Archive (like hostname and popular URLs), so they can begin to crawl and archive your website at regular intervals. This information sharing and crawling ensures content is available to Always Online and also serves to deepen the library of content available directly through the Archive.

If your origin goes down or is unreachable, Cloudflare’s edge will return a status code in the 520 to 527 range, indicating an issue connecting to the origin. When this happens, Cloudflare will first look to the local edge datacenter to see if there is a stale or expired version of content we can serve to the website visitor. If there isn’t a version in the local cache, Cloudflare will then go to the Internet Archive and fetch the most recently archived version of the site to serve to your visitors. When that happens, Always Online serves the archived content with a banner to let your visitors know that your origin is having problems. The banner allows your visitors to check, with a single click, whether your origin is back online. While dynamic content that requires communication with an origin server (e.g. web applications or shopping carts) will still show an error to visitors, basic content will often be available with Always Online.
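
To make the order of fallbacks concrete, the decision flow looks roughly like this (a sketch, not Cloudflare’s actual edge code; all of the helper functions are hypothetical):

async function serveWithAlwaysOnline(request) {
  let response = await fetchFromOrigin(request);  // hypothetical helper
  if (response.status < 520 || response.status > 527) {
    return response;  // origin reachable: serve its response as usual
  }

  // Origin is down: prefer a stale copy from the local edge cache...
  let stale = await lookupStaleCache(request);  // hypothetical helper
  if (stale) return addAlwaysOnlineBanner(stale);  // hypothetical helper

  // ...otherwise fall back to the most recent Wayback Machine snapshot.
  let snapshot = await fetchWaybackSnapshot(request.url);  // hypothetical helper
  if (snapshot) return addAlwaysOnlineBanner(snapshot);

  return response;  // nothing archived: the visitor sees the error
}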

Cloudflare’s Always Online and the Internet Archive Team Up to Fight Origin Errors

Enabling the new Always Online

For now, the old Always Online service will still be available, but we plan to fully transition to the Internet Archive-backed version soon.

Cloudflare customers can enable Always Online in the dashboard:

Cloudflare’s Always Online and the Internet Archive Team Up to Fight Origin Errors

Learn More

  • For more about Always Online, and how it works, please check out our documentation.
  • To get started using Always Online, please log into your Cloudflare dashboard and toggle it on.
  • Please see the Internet Archive’s announcement of our partnership here.
  • To help improve Always Online, or other parts of our slice of the Internet, drop us a line.

Unimog – Cloudflare’s edge load balancer

Post Syndicated from David Wragg original https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/

Unimog - Cloudflare’s edge load balancer

As the scale of Cloudflare’s edge network has grown, we sometimes reach the limits of parts of our architecture. About two years ago we realized that our existing solution for spreading load within our data centers could no longer meet our needs. We embarked on a project to deploy a Layer 4 Load Balancer, internally called Unimog, to improve the reliability and operational efficiency of our edge network. Unimog has now been deployed in production for over a year.

This post explains the problems Unimog solves and how it works. Unimog builds on techniques used in other Layer 4 Load Balancers, but there are many details of its implementation that are tailored to the needs of our edge network.

Unimog - Cloudflare’s edge load balancer

The role of Unimog in our edge network

Cloudflare operates an anycast network, meaning that our data centers in 200+ cities around the world serve the same IP addresses. For example, our own cloudflare.com website uses Cloudflare services, and one of its IP addresses is 104.17.175.85. All of our data centers will accept connections to that address and respond to HTTP requests. By the magic of Internet routing, when you visit cloudflare.com and your browser connects to 104.17.175.85, your connection will usually go to the closest (and therefore fastest) data center.

Unimog - Cloudflare’s edge load balancer

Inside those data centers are many servers. The number of servers in each varies greatly (the biggest data centers have a hundred times more servers than the smallest ones). The servers run the application services that implement our products (our caching, DNS, WAF, DDoS mitigation, Spectrum, WARP, etc). Within a single data center, any of the servers can handle a connection for any of our services on any of our anycast IP addresses. This uniformity keeps things simple and avoids bottlenecks.

Unimog - Cloudflare’s edge load balancer

But if any server within a data center can handle any connection, when a connection arrives from a browser or some other client, what controls which server it goes to? That’s the job of Unimog.

There are two main reasons why we need this control. The first is that we regularly move servers in and out of operation, and servers should only receive connections when they are in operation. For example, we sometimes remove a server from operation in order to perform maintenance on it. And sometimes servers are automatically removed from operation because health checks indicate that they are not functioning correctly.

The second reason concerns the management of the load on the servers (by load we mean the amount of computing work each one needs to do). If the load on a server exceeds the capacity of its hardware resources, then the quality of service to users will suffer. The performance experienced by users degrades as a server approaches saturation, and if a server becomes sufficiently overloaded, users may see errors. We also want to prevent servers being underloaded, which would reduce the value we get from our investment in hardware. So Unimog ensures that the load is spread across the servers in a data center. This general idea is called load balancing (balancing because the work has to be done somewhere, and so for the load on one server to go down, the load on some other server must go up).

Note that in this post, we’ll discuss how Cloudflare balances the load on its own servers in edge data centers. But load balancing is a requirement that occurs in many places in distributed computing systems. Cloudflare also has a Layer 7 Load Balancing product to allow our customers to balance load across their servers. And Cloudflare uses load balancing in other places internally.

Deploying Unimog led to a big improvement in our ability to balance the load on our servers in our edge data centers. Here’s a chart for one data center, showing the difference due to Unimog. Each line shows the processor utilization of an individual server (the colour of the lines indicates server model). The load on the servers varies during the day with the activity of users close to this data center. The white line marks the point when we enabled Unimog. You can see that after that point, the load on the servers became much more uniform. We saw similar results when we deployed Unimog to our other data centers.

Unimog - Cloudflare’s edge load balancer

How Unimog compares to other load balancers

There are a variety of techniques for load balancing. Unimog belongs to a category called Layer 4 Load Balancers (L4LBs). L4LBs direct packets on the network by inspecting information up to layer 4 of the OSI network model, which distinguishes them from the more common Layer 7 Load Balancers.

The advantage of L4LBs is their efficiency. They direct packets without processing the payload of those packets, so they avoid the overheads associated with higher level protocols. For any load balancer, it’s important that the resources consumed by the load balancer are low compared to the resources devoted to useful work. At Cloudflare, we already pay close attention to the efficient implementation of our services, and that sets a high bar for the load balancer that we put in front of those services.

The downside of L4LBs is that they can only control which connections go to which servers. They cannot modify the data going over the connection, which prevents them from participating in higher-level protocols like TLS, HTTP, etc. (in contrast, Layer 7 Load Balancers act as proxies, so they can modify data on the connection and participate in those higher-level protocols).

L4LBs are not new. They are mostly used at companies which have scaling needs that would be hard to meet with L7LBs alone. Google has published about Maglev, Facebook has open-sourced Katran, and GitHub has open-sourced their GLB.

Unimog is the L4LB that Cloudflare has built to meet the needs of our edge network. It shares features with other L4LBs, and it is particularly strongly influenced by GLB. But there are some requirements that were not well-served by existing L4LBs, leading us to build our own:

  • Unimog is designed to run on the same general-purpose servers that provide application services, rather than requiring a separate tier of servers dedicated to load balancing.
  • It performs dynamic load balancing: measurements of server load are used to adjust the number of connections going to each server, in order to accurately balance load.
  • It supports long-lived connections that remain established for days.
  • Virtual IP addresses are managed as ranges (Cloudflare serves hundreds of thousands of IPv4 addresses on behalf of our customers, so it is impractical to configure these individually).
  • Unimog is tightly integrated with our existing DDoS mitigation system, and the implementation relies on the same XDP technology in the Linux kernel.

The rest of this post describes these features and the design and implementation choices that follow from them in more detail.

For Unimog to balance load, it’s not enough to send the same (or approximately the same) number of connections to each server, because the performance of our servers varies. We regularly update our server hardware, and we’re now on our 10th generation. Once we deploy a server, we keep it in service for as long as it is cost effective, and the lifetime of a server can be several years. It’s not unusual for a single data center to contain a mix of server models, due to expansion and upgrades over time. Processor performance has increased significantly across our server generations. So within a single data center, we need to send different numbers of connections to different servers to utilize the same percentage of their capacity.

It’s also not enough to give each server a fixed share of connections based on static estimates of their capacity. Not all connections consume the same amount of CPU. And there are other activities running on our servers and consuming CPU that are not directly driven by connections from clients. So in order to accurately balance load across servers, Unimog does dynamic load balancing: it takes regular measurements of the load on each of our servers, and uses a control loop that increases or decreases the number of connections going to each server so that their loads converge to an appropriate value.

Refresher: TCP connections

The relationship between TCP packets and connections is central to the operation of Unimog, so we’ll briefly describe that relationship.

(Unimog supports UDP as well as TCP, but for clarity most of this post will focus on the TCP support. We explain how UDP support differs towards the end.)

Here is the outline of a TCP packet:

Unimog - Cloudflare’s edge load balancer

The TCP connection that this packet belongs to is identified by the four labelled header fields, which span the IPv4/IPv6 (i.e. layer 3) and TCP (i.e. layer 4) headers: the source and destination addresses, and the source and destination ports. Collectively, these four fields are known as the 4-tuple. When we say that Unimog sends a connection to a server, we mean that all the packets with the 4-tuple identifying that connection are sent to that server.

A TCP connection is established via a three-way handshake between the client and the server handling that connection. Once a connection has been established, it is crucial that all the incoming packets for that connection go to that same server. If a TCP packet belonging to the connection is sent to a different server, that server will signal to the client, with a TCP RST (reset) packet, that it doesn’t know about the connection. Upon receiving this notification, the client terminates the connection, probably resulting in the user seeing an error. So a misdirected packet is much worse than a dropped packet. As usual, we consider the network to be unreliable, and it’s fine for occasional packets to be dropped. But even a single misdirected packet can lead to a broken connection.

Cloudflare handles a wide variety of connections on behalf of our customers. Many of these connections carry HTTP, and are typically short lived. But some HTTP connections are used for websockets, and can remain established for hours or days. Our Spectrum product supports arbitrary TCP connections. TCP connections can be terminated or stall for many reasons, and ideally all applications that use long-lived connections would be able to reconnect transparently, and applications would be designed to support such reconnections. But not all applications and protocols meet this ideal, so we strive to maintain long-lived connections. Unimog can maintain connections that last for many days.

Forwarding packets

The previous section described that the function of Unimog is to steer connections to servers. We’ll now explain how this is implemented.

To start with, let’s consider how one of our data centers might look without Unimog or any other load balancer. Here’s a conceptual view:

Unimog - Cloudflare’s edge load balancer

Packets arrive from the Internet, and pass through the router, which forwards them on to servers (in reality there is usually additional network infrastructure between the router and the servers, but it doesn’t play a significant role here so we’ll ignore it).

But is such a simple arrangement possible? Can the router spread traffic over servers without some kind of load balancer in between? Routers have a feature called ECMP (equal cost multipath) routing. Its original purpose is to allow traffic to be spread across multiple paths between two locations, but it is commonly repurposed to spread traffic across multiple servers within a data center. In fact, Cloudflare relied on ECMP alone to spread load across servers before we deployed Unimog. ECMP uses a hashing scheme to ensure that packets on a given connection use the same path (Unimog also employs a hashing scheme, so we’ll discuss how this can work in further detail below). But ECMP is vulnerable to changes in the set of active servers, such as when servers go in and out of service. These changes cause rehashing events, which break connections to all the servers in an ECMP group. Also, routers impose limits on the sizes of ECMP groups, which means that a single ECMP group cannot cover all the servers in our larger edge data centers. Finally, ECMP does not allow us to do dynamic load balancing by adjusting the share of connections going to each server. These drawbacks mean that ECMP alone is not an effective approach.

Ideally, to overcome the drawbacks of ECMP, we could program the router with the appropriate logic to direct connections to servers in the way we want. But although programmable network data planes have been a hot research topic in recent years, commodity routers are still essentially fixed-function devices.

We can work around the limitations of routers by having the router send the packets to some load balancing servers, and then programming those load balancers to forward packets as we want. If the load balancers all act on packets in a consistent way, then it doesn’t matter which load balancer gets which packets from the router (so we can use ECMP to spread packets across the load balancers). That suggests an arrangement like this:

Unimog - Cloudflare’s edge load balancer

And indeed L4LBs are often deployed like this.

Instead, Unimog makes every server into a load balancer. The router can send any packet to any server, and that initial server will forward the packet to the right server for that connection:

Unimog - Cloudflare’s edge load balancer

We have two reasons to favour this arrangement:

First, in our edge network, we avoid specialised roles for servers. We run the same software stack on the servers in our edge network, providing all of our product features, whether DDoS attack prevention, website performance features, Cloudflare Workers, WARP, etc. This uniformity is key to the efficient operation of our edge network: we don’t have to manage how many load balancers we have within each of our data centers, because all of our servers act as load balancers.

The second reason relates to stopping attacks. Cloudflare’s edge network is the target of incessant attacks. Some of these attacks are volumetric – large packet floods which attempt to overwhelm the ability of our data centers to process network traffic from the Internet, and so impact our ability to service legitimate traffic. To successfully mitigate such attacks, it’s important to filter out attack packets as early as possible, minimising the resources they consume. This means that our attack mitigation system needs to occur before the forwarding done by Unimog. That mitigation system is called l4drop, and we’ve written about it before. l4drop and Unimog are closely integrated. Because l4drop runs on all of our servers, and because l4drop comes before Unimog, it’s natural for Unimog to run on all of our servers too.

XDP and xdpd

Unimog implements packet forwarding using a Linux kernel facility called XDP. XDP allows a program to be attached to a network interface, and the program gets run for every packet that arrives, before it is processed by the kernel’s main network stack. The XDP program returns an action code to tell the kernel what to do with the packet:

  • PASS: Pass the packet on to the kernel’s network stack for normal processing.
  • DROP: Drop the packet. This is the basis for l4drop.
  • TX: Transmit the packet back out of the network interface. The XDP program can modify the packet data before transmission. This action is the basis for Unimog forwarding.

XDP programs run within the kernel, making this an efficient approach even at high packet rates. XDP programs are expressed as eBPF bytecode, and run within an in-kernel virtual machine. Upon loading an XDP program, the kernel compiles its eBPF code into machine code. The kernel also verifies the program to check that it does not compromise security or stability. eBPF is not only used in the context of XDP: many recent Linux kernel innovations employ eBPF, as it provides a convenient and efficient way to extend the behaviour of the kernel.

XDP is much more convenient than alternative approaches to packet-level processing, particularly in our context where the servers involved also have many other tasks. We have continued to enhance Unimog since its initial deployment. Our deployment model for new versions of our Unimog XDP code is essentially the same as for userspace services, and we are able to deploy new versions on a weekly basis if needed. Also, established techniques for optimizing the performance of the Linux network stack provide good performance for XDP.

There are two main alternatives for efficient packet-level processing:

  • Kernel-bypass networking (such as DPDK), where a program in userspace manages a network interface (or some part of one) directly without the involvement of the kernel. This approach works best when servers can be dedicated to a network function (due to the need to dedicate processor or network interface hardware resources, and awkward integration with the normal kernel network stack; see our old post about this). But we avoid putting servers in specialised roles. (Github’s open-source GLB uses DPDK, and this is one of the main factors that made GLB unsuitable for us.)
  • Kernel modules, where code is added to the kernel to perform the necessary network functions. The Linux IPVS (IP Virtual Server) subsystem falls into this category. But developing, testing, and deploying kernel modules is cumbersome compared to XDP.

The following diagram shows an overview of our use of XDP. Both l4drop and Unimog are implemented by an XDP program. l4drop matches attack packets, and uses the DROP action to discard them. Unimog forwards packets, using the TX action to resend them. Packets that are not dropped or forwarded pass through to the normal Linux network stack. To support our elaborate use of XDP, we have developed the xdpd daemon which performs the necessary supervisory and support functions for our XDP programs.

Unimog - Cloudflare’s edge load balancer

Rather than a single XDP program, we have a chain of XDP programs that must be run for each packet (l4drop, Unimog, and others we have not covered here). One of the responsibilities of xdpd is to prepare these programs, and to make the appropriate system calls to load them and assemble the full chain.

Our XDP programs come from two sources. Some are developed in a conventional way: engineers write C code, our build system compiles it (with clang) to eBPF ELF files, and our release system deploys those files to our servers. Our Unimog XDP code works like this. In contrast, the l4drop XDP code is dynamically generated by xdpd based on information it receives from attack detection systems.

xdpd has many other duties to support our use of XDP:

  • XDP programs can be supplied with data using data structures called maps. xdpd populates the maps needed by our programs, based on information received from control planes.
  • Programs (for instance, our Unimog XDP program) may depend upon configuration values which are fixed while the program runs, but do not have universal values known at the time their C code was compiled. It would be possible to supply these values to the program via maps, but that would be inefficient (retrieving a value from a map requires a call to a helper function). So instead, xdpd will fix up the eBPF program to insert these constants before it is loaded.
  • Cloudflare carefully monitors the behaviour of all our software systems, and this includes our XDP programs: They emit metrics (via another use of maps), which xdpd exposes to our metrics and alerting system (prometheus).
  • When we deploy a new version of xdpd, it gracefully upgrades in such a way that there is no interruption to the operation of Unimog or l4drop.

Although the XDP programs are written in C, xdpd itself is written in Go. Much of its code is specific to Cloudflare. But in the course of developing xdpd, we have collaborated with Cilium to develop https://github.com/cilium/ebpf, an open source Go library that provides the operations needed by xdpd for manipulating and loading eBPF programs and related objects. We’re also collaborating with the Linux eBPF community to share our experience, and extend the core eBPF technology in ways that make features of xdpd obsolete.

In evaluating the performance of Unimog, our main concern is efficiency: that is, the resources consumed for load balancing relative to the resources used for customer-visible services. Our measurements show that Unimog costs less than 1% of the processor utilization, compared to a scenario where no load balancing is in use. Other L4LBs, intended to be used with servers dedicated to load balancing, may place more emphasis on maximum throughput of packets. Nonetheless, our experience with Unimog and XDP in general indicates that the throughput is more than adequate for our needs, even during large volumetric attacks.

Unimog is not the first L4LB to use XDP. In 2018, Facebook open sourced Katran, their XDP-based L4LB data plane. We considered the possibility of reusing code from Katran. But it would not have been worthwhile: the core C code needed to implement an XDP-based L4LB is relatively modest (about 1000 lines of C, both for Unimog and Katran). Furthermore, we had requirements that were not met by Katran, and we also needed to integrate with existing components and systems at Cloudflare (particularly l4drop). So very little of the code could have been reused as-is.

Encapsulation

As discussed at the start of this post, clients make connections to one of our edge data centers with a destination IP address that can be served by any one of our servers. These addresses that do not correspond to a specific server are known as virtual IPs (VIPs). When our Unimog XDP program forwards a packet destined to a VIP, it must replace that VIP address with the direct IP (DIP) of the appropriate server for the connection, so that when the packet is retransmitted it will reach that server. But it is not sufficient to overwrite the VIP in the packet headers with the DIP, as that would hide the original destination address from the server handling the connection (the original destination address is often needed to correctly handle the connection).

Instead, the packet must be encapsulated: Another set of packet headers is prepended to the packet, so that the original packet becomes the payload in this new packet. The DIP is then used as the destination address in the outer headers, but the addressing information in the headers of the original packet is preserved. The encapsulated packet is then retransmitted. Once it reaches the target server, it must be decapsulated: the outer headers are stripped off to yield the original packet as if it had arrived directly.

Encapsulation is a general concept in computer networking, and is used in a variety of contexts. The headers to be added to the packet by encapsulation are defined by an encapsulation format. Many different encapsulation formats have been defined within the industry, tailored to the requirements in specific contexts. Unimog uses a format called GUE (Generic UDP Encapsulation), in order to allow us to re-use the glb-redirect component from GitHub’s GLB (glb-redirect is discussed below).

GUE is a relatively simple encapsulation format. It encapsulates within a UDP packet, placing a GUE-specific header between the outer IP/UDP headers and the payload packet to allow extension data to be carried (and we’ll see how Unimog takes advantage of this):

Unimog - Cloudflare’s edge load balancer

When an encapsulated packet arrives at a server, the encapsulation process must be reversed. This step is called decapsulation. The headers that were added during the encapsulation process are removed, leaving the original packet to be processed by the network stack as if it had arrived directly from the client.

An issue that can arise with encapsulation is hitting limits on the maximum packet size, because the encapsulation process makes packets larger. The de facto maximum packet size on the Internet is 1500 bytes, and not coincidentally this is also the maximum packet size on Ethernet networks. For Unimog, encapsulating a 1500-byte packet results in a 1536-byte packet. To allow for these enlarged encapsulated packets, we have enabled jumbo frames on the networks inside our data centers, so that the 1500-byte limit only applies to packets headed out to the Internet.
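
The arithmetic behind those numbers is simple; the exact split of the 36 extra bytes below is our assumption (an IPv4 and UDP outer header plus a GUE header carrying extension data):

const OUTER_IPV4 = 20;  // bytes
const OUTER_UDP  = 8;   // bytes
const GUE_HEADER = 8;   // bytes, including extension data (assumed split)
console.log(1500 + OUTER_IPV4 + OUTER_UDP + GUE_HEADER);  // 1536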

Forwarding logic

So far, we have described the technology used to implement the Unimog load balancer, but not how our Unimog XDP program selects the DIP address when forwarding a packet. This section describes the basic scheme. But as we’ll see, there is a problem, so then we’ll describe how this scheme is elaborated to solve that problem.

In outline, our Unimog XDP program processes each packet in the following way:

  1. Determine whether the packet is destined for a VIP address. Not all of the packets arriving at a server are for VIP addresses. Other packets are passed through for normal handling by the kernel’s network stack. (xdpd obtains the VIP address ranges from the Unimog control plane.)
  2. Determine the DIP for the server handling the packet’s connection.
  3. Encapsulate the packet, and retransmit it to the DIP.

In step 2, note that all the load balancers must act consistently – when forwarding packets, they must all agree about which connections go to which servers. The rate of new connections arriving at a data center is large, so it’s not practical for load balancers to agree by communicating information about connections amongst themselves. Instead L4LBs adopt designs which allow the load balancers to reach consistent forwarding decisions independently. To do this, they rely on hashing schemes: Take the 4-tuple identifying the packet’s connection, put it through a hash function to obtain a key (the hash function ensures that these key values are uniformly distributed), then perform some kind of lookup into a data structure to turn the key into the DIP for the target server.

Unimog uses such a scheme, with a data structure that is simple compared to some other L4LBs. We call this data structure the forwarding table, and it consists of an array where each entry contains a DIP specifying the target server for the relevant packets (we call these entries buckets). The forwarding table is generated by the Unimog control plane and broadcast to the load balancers (more on this below), so that it has the same contents on all load balancers.

To look up a packet’s key in the forwarding table, the low N bits from the key are used as the index for a bucket (the forwarding table is always a power-of-2 in size):

Unimog - Cloudflare’s edge load balancer

Note that this approach does not provide per-connection control – each bucket typically applies to many connections. All load balancers in a data center use the same forwarding table, so they all forward packets in a consistent manner. This means it doesn’t matter which packets are sent by the router to which servers, and so ECMP re-hashes are a non-issue. And because the forwarding table is immutable and simple in structure, lookups are fast.
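
The real lookup is implemented in eBPF C inside our XDP program, but the scheme itself is simple enough to sketch in a few lines of JavaScript (the FNV-1a hash here is only a stand-in for the actual hash function):

// Stand-in 32-bit hash (FNV-1a) to uniformly distribute keys.
function hash32(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Hash the connection's 4-tuple to a key, then use the low bits as the
// bucket index (the forwarding table size is a power of two).
function lookupDIP(forwardingTable, {srcAddr, srcPort, dstAddr, dstPort}) {
  let key = hash32(`${srcAddr}|${srcPort}|${dstAddr}|${dstPort}`);
  return forwardingTable[key & (forwardingTable.length - 1)];
}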

Although the above description only discusses a single forwarding table, Unimog supports multiple forwarding tables, each one associated with a trafficset – the traffic destined for a particular service. Ranges of VIP addresses are associated with a trafficset. Each trafficset has its own configuration settings and forwarding tables. This gives us the flexibility to differentiate how Unimog behaves for different services.

Precise load balancing requires the ability to make fine adjustments to the number of connections arriving at each server. So we make the number of buckets in the forwarding table more than 100 times the number of servers. Our data centers can contain hundreds of servers, and so it is normal for a Unimog forwarding table to have tens of thousands of buckets. The DIP for a given server is repeated across many buckets in the forwarding table, and by increasing or decreasing the number of buckets that refer to a server, we can control the share of connections going to that server. Not all buckets will correspond to exactly the same number of connections at a given point in time (the properties of the hash function make this a statistical matter). But experience with Unimog has demonstrated that the relationship between the number of buckets and resulting server load is sufficiently strong to allow for good load balancing.

But as mentioned, there is a problem with this scheme as presented so far. Updating a forwarding table, and changing the DIPs in some buckets, would break connections that hash to those buckets (because packets on those connections would get forwarded to a different server after the update). But one of the requirements for Unimog is to allow us to change which servers get new connections without impacting the existing connections. For example, sometimes we want to drain the connections to a server, maintaining the existing connections to that server but not forwarding new connections to it, in the expectation that many of the existing connections will terminate of their own accord. The next section explains how we fix this scheme to allow such changes.

Maintaining established connections

To make changes to the forwarding table without breaking established connections, Unimog adopts the “daisy chaining” technique described in the paper Stateless Datacenter Load-balancing with Beamer.

To understand how the Beamer technique works, let’s look at what can go wrong when a forwarding table changes: imagine the forwarding table is updated so that a bucket which contained the DIP of server A now refers to server B. A packet that would formerly have been sent to A by the load balancers is now sent to B. If that packet initiates a new connection (it’s a TCP SYN packet), there’s no problem – server B will continue the three-way handshake to complete the new connection. On the other hand, if the packet belongs to a connection established before the change, then the TCP implementation of server B has no matching TCP socket, and so sends a RST back to the client, breaking the connection.

This explanation hints at a solution: the problem occurs when server B receives a forwarded packet that does not match a TCP socket. If we could change its behaviour in this case to forward the packet a second time to the DIP of server A, that would allow the connection to server A to be preserved. For this to work, server B needs to know the DIP for the bucket before the change.

To accomplish this, we extend the forwarding table so that each bucket has two slots, each containing the DIP for a server. The first slot contains the current DIP, which is used by the load balancer to forward packets as discussed (and here we refer to this forwarding as the first hop). The second slot preserves the previous DIP (if any), in order to allow the packet to be forwarded again on a second hop when necessary.

For example, imagine we have a forwarding table that refers to servers A, B, and C, and then it is updated to stop new connections going to server A, but maintaining established connections to server A. This is achieved by replacing server A’s DIP in the first slot of any buckets where it appears, but preserving it in the second slot:

Unimog - Cloudflare’s edge load balancer

In addition to extending the forwarding table, this approach requires a component on each server to forward packets on the second hop when necessary. This diagram shows where this redirector fits into the path a packet can take:

Unimog - Cloudflare’s edge load balancer

The redirector follows some simple logic to decide whether to process a packet locally on the first-hop server or to forward it to the second-hop server (a sketch of this logic follows the list):

  • If the packet is a SYN packet, initiating a new connection, then it is always processed by the first-hop server. This ensures that new connections go to the first-hop server.
  • For other packets, the redirector checks whether the packet belongs to a connection with a corresponding TCP socket on the first-hop server. If so, it is processed by that server.
  • Otherwise, the packet has no corresponding TCP socket on the first-hop server. So it is forwarded on to the second-hop server to be processed there (in the expectation that it belongs to some connection established on the second-hop server that we wish to maintain).
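
In sketch form, the decision reads as follows (the production redirector is an eBPF program; `hasLocalSocket` stands in for the kernel’s socket lookup):

function redirect(packet, hasLocalSocket) {
  if (packet.tcp.syn) {
    return "deliver-locally";  // new connections stay on the first hop
  }
  if (hasLocalSocket(packet.fourTuple)) {
    return "deliver-locally";  // connection established on this server
  }
  // No local socket: forward to the second-hop server, whose DIP is
  // carried in the packet's GUE extension header (described below).
  return {forwardTo: packet.gue.secondHopDIP};
}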

In that last step, the redirector needs to know the DIP for the second hop. To avoid the need for the redirector to do forwarding table lookups, the second-hop DIP is placed into the encapsulated packet by the Unimog XDP program (which already does a forwarding table lookup, so it has easy access to this value). This second-hop DIP is carried in a GUE extension header, so that it is readily available to the redirector if it needs to forward the packet again.

This second hop, when necessary, does have a cost. But in our data centers, the fraction of forwarded packets that take the second hop is usually less than 1% (despite the significance of long-lived connections in our context). The result is that the practical overhead of the second hops is modest.

When we initially deployed Unimog, we adopted the glb-redirect iptables module from GitHub’s GLB to serve as the redirector component. In fact, some implementation choices in Unimog, such as the use of GUE, were made in order to facilitate this re-use. glb-redirect worked well for us initially, but subsequently we wanted to enhance the redirector logic. glb-redirect is a custom Linux kernel module, and developing and deploying changes to kernel modules is more difficult for us than for eBPF-based components such as our XDP programs. This is not merely due to Cloudflare having invested more engineering effort in software infrastructure for eBPF; it also results from the more explicit boundary between the kernel and eBPF programs (for example, we are able to run the same eBPF programs on a range of kernel versions without recompilation). We wanted to achieve the same ease of development for the redirector as for our XDP programs.

To that end, we decided to write an eBPF replacement for glb-redirect. While the redirector could be implemented within XDP, like our load balancer, practical concerns led us to implement it as a TC classifier program instead (TC is the traffic control subsystem within the Linux network stack). A downside to XDP is that the packet contents prior to processing by the XDP program are not visible using conventional tools such as tcpdump, complicating debugging. TC classifiers do not have this downside, and in the context of the redirector, which passes most packets through, the performance advantages of XDP would not be significant.

The result is cls-redirect, a redirector implemented as a TC classifier program. We have contributed our cls-redirect code as part of the Linux kernel test suite. In addition to implementing the redirector logic, cls-redirect also implements decapsulation, removing the need to separately configure GUE tunnel endpoints for this purpose.

There are some features suggested in the Beamer paper that Unimog does not implement:

  • Beamer embeds generation numbers in the encapsulated packets to address a potential corner case where an ECMP rehash event occurs at the same time as a forwarding table update is propagating from the control plane to the load balancers. Given the combination of circumstances required for a connection to be impacted by this issue, we believe that in our context the number of affected connections is negligible, and so the added complexity of the generation numbers is not worthwhile.
  • In the Beamer paper, the concept of daisy-chaining encompasses third hops etc. to preserve connections across a series of changes to a bucket. Unimog only uses two hops (the first and second hops above), so in general it can only preserve connections across a single update to a bucket. But our experience is that even with only two hops, a careful strategy for updating the forwarding tables permits connection lifetimes of days.

To elaborate on this second point: when the control plane is updating the forwarding table, it often has some choice in which buckets to change, depending on the event that led to the update. For example, if a server is being brought into service, then some buckets must be assigned to it (by placing the DIP for the new server in the first slot of the bucket). But there is a choice about which buckets. A strategy of choosing the least-recently modified buckets will tend to minimise the impact to connections.

Furthermore, when updating the forwarding table to adjust the balance of load between servers, Unimog often uses a novel trick: due to the redirector logic, exchanging the first-hop and second-hop DIPs for a bucket only affects which server receives new connections for that bucket, and never impacts any established connections. Unimog is able to achieve load balancing in our edge data centers largely through forwarding table changes of this type.
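
On the two-slot bucket structure described earlier, this trick is just a swap (an illustration, not Unimog’s actual code):

// Exchanging a bucket's two slots sends new connections for the bucket
// to the other server, while packets for connections established on the
// previous first-hop server still reach it via the second hop.
function swapBucketHops(bucket) {
  [bucket.firstHop, bucket.secondHop] = [bucket.secondHop, bucket.firstHop];
}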

Control plane

So far, we have discussed the Unimog data plane – the part that processes network packets. But much of the development effort on Unimog has been devoted to the control plane – the part that generates the forwarding tables used by the data plane. In order to correctly maintain the forwarding tables, the control plane consumes information from multiple sources:

  • Server information: Unimog needs to know the set of servers present in a data center, some key information about each one (such as their DIP addresses), and their operational status. It also needs signals about transitional states, such as when a server is being withdrawn from service, in order to gracefully drain connections (preventing the server from receiving new connections, while maintaining its established connections).
  • Health: Unimog should only send connections to servers that are able to correctly handle those connections, otherwise those servers should be removed from the forwarding tables. To ensure this, it needs health information at the node level (indicating that a server is available) and at the service level (indicating that a service is functioning normally on a server).
  • Load: in order to balance load, Unimog needs information about the resource utilization on each server.
  • IP address information: Cloudflare serves hundreds of thousands of IPv4 addresses, and these are something that we have to treat as a dynamic resource rather than something statically configured.

The control plane is implemented by a process called the conductor. In each of our edge data centers, there is one active conductor, but there are also standby instances that will take over if the active instance goes away.

We use Hashicorp’s Consul in a number of ways in the Unimog control plane (we have an independent Consul server cluster in each data center):

  • Consul provides a key-value store, with support for blocking queries so that changes to values can be received promptly. We use this to propagate the forwarding tables and VIP address information from the conductor to xdpd on the servers (see the sketch after this list).
  • Consul provides server- and service-level health checks. We use this as the source of health information for Unimog.
  • The conductor stores its state in the Consul KV store, and uses Consul’s distributed locks to ensure that only one conductor instance is active.
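
As an illustration of how such a blocking query looks, here is a sketch of a client watching a single key. The key path and the applyUpdate helper are hypothetical; the query parameters follow Consul’s KV API, and the sketch assumes a Node.js runtime with fetch available:

async function watchForwardingTable(key, index = 0) {
  for (;;) {
    // The query blocks server-side (up to `wait`) until the key's
    // ModifyIndex exceeds `index`, so changes arrive promptly
    // without busy polling.
    const res = await fetch(`http://127.0.0.1:8500/v1/kv/${key}?index=${index}&wait=5m`)
    const [entry] = await res.json()
    index = entry.ModifyIndex
    applyUpdate(Buffer.from(entry.Value, "base64")) // KV values are base64-encoded
  }
}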

The conductor obtains server load information from Prometheus, which we already use for metrics throughout our systems. It balances the load across the servers using a control loop, periodically adjusting the forwarding tables to send more connections to underloaded servers and fewer connections to overloaded servers. The load for a server is defined by a Prometheus metric expression which measures processor utilization (with some intricacies to better handle characteristics of our workloads). The determination of whether a server is underloaded or overloaded is based on comparison with the average value of the load metric, and the adjustments made to the forwarding table are proportional to the deviation from the average. So the result of the feedback loop is that the load metric for all servers converges on the average.
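
A rough sketch of such a control loop follows. The gain constant and the bucket-count adjustment are illustrative assumptions, not Unimog’s actual tuning:

function rebalance(servers, gain = 0.1) {
  const avg = servers.reduce((sum, s) => sum + s.load, 0) / servers.length
  for (const server of servers) {
    // Servers above the average shed buckets; servers below it gain them,
    // in proportion to their deviation from the average.
    const deviation = (server.load - avg) / avg
    server.targetBuckets = Math.round(server.buckets * (1 - gain * deviation))
  }
  // The conductor would then translate targetBuckets into forwarding table
  // changes, for example via the hop-swap trick shown earlier.
}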

Finally, the conductor queries internal Cloudflare APIs to obtain the necessary information on servers and addresses.

Unimog - Cloudflare’s edge load balancer

Unimog is a critical system: incorrect, poorly adjusted or stale forwarding tables could cause incoming network traffic to a data center to be dropped, or servers to be overloaded, to the point that a data center would have to be removed from service. To maintain a high quality of service and minimise the overhead of managing our many edge data centers, we have to be able to upgrade all components. So to the greatest extent possible, all components are able to tolerate brief absences of the other components without any impact to service. In some cases this is possible through careful design. In other cases, it requires explicit handling. For example, we have found that Consul can temporarily report inaccurate health information for a server and its services when the Consul agent on that server is restarted (for example, in order to upgrade Consul). So we implemented the necessary logic in the conductor to detect and disregard these transient health changes.

Unimog also forms a complex system with feedback loops: The conductor reacts to its observations of behaviour of the servers, and the servers react to the control information they receive from the conductor. This can lead to behaviours of the overall system that are hard to anticipate or test for. For instance, not long after we deployed Unimog we encountered surprising behaviour when data centers became overloaded. This is of course a scenario that we strive to avoid, and we have automated systems to remove traffic from overloaded data centers if that happens. But if a data center became sufficiently overloaded, then health information from its servers would indicate that many servers were degraded to the point that Unimog would stop sending new connections to those servers. Under normal circumstances, this is the correct reaction to a degraded server. But if enough servers become degraded, diverting new connections to other servers would mean those servers became degraded, while the original servers were able to recover. So it was possible for a data center that became temporarily overloaded to get stuck in a state where servers oscillated between healthy and degraded, even after the level of demand on the data center had returned to normal. To correct this issue, the conductor now has logic to distinguish between isolated degraded servers and such data center-wide problems. We have continued to improve Unimog in response to operational experience, ensuring that it behaves in a predictable manner over a wide range of conditions.

UDP Support

So far, we have described Unimog’s support for directing TCP connections. But Unimog also supports UDP traffic. UDP does not have explicit connections between clients and servers, so how it works depends upon how the UDP application exchanges packets between the client and server. There are a few cases of interest:

Request-response UDP applications

Some applications, such as DNS, use a simple request-response pattern: the client sends a request packet to the server, and expects a response packet in return. Here, there is nothing corresponding to a connection (the client only sends a single packet, so there is no requirement to make sure that multiple packets arrive at the same server). But Unimog can still provide value by spreading the requests across our servers.

To cater to this case, Unimog operates as described in previous sections, hashing the 4-tuple from the packet headers (the source and destination IP addresses and ports). But the Beamer daisy-chaining technique that allows connections to be maintained does not apply here, and so the buckets in the forwarding table only have a single slot.

UDP applications with flows

Some UDP applications have long-lived flows of packets between the client and server. Like TCP connections, these flows are identified by the 4-tuple. It is necessary that such flows go to the same server (even when Cloudflare is just passing a flow through to the origin server, it is convenient for detecting and mitigating certain kinds of attack to have that flow pass through a single server within one of Cloudflare’s data centers).

It’s possible to treat these flows by hashing the 4-tuple, skipping the Beamer daisy-chaining technique as for request-response applications. But then adding servers will cause some flows to change servers (this would effectively be a form of consistent hashing). For UDP applications, we can’t say in general what impact this has, as we can for TCP connections. But it’s possible that it causes some disruption, so it would be nice to avoid this.

So Unimog adapts the daisy-chaining technique to apply it to UDP flows. The outline remains similar to that for TCP: the same redirector component on each server decides whether to send a packet on a second hop. But UDP does not have anything corresponding to TCP’s SYN packet that indicates a new connection. So for UDP, the part that depends on SYNs is removed, and the logic applied for each packet becomes:

  • The redirector checks whether the packet belongs to a flow with a corresponding UDP socket on the first-hop server. If so, it is processed by that server.
  • Otherwise, the packet has no corresponding UDP socket on the first-hop server. So it is forwarded on to the second-hop server to be processed there (in the expectation that it belongs to some flow established on the second-hop server that we wish to maintain).

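Sketched as code, the per-packet decision looks roughly like this. Here hasLocalUdpSocket, deliverLocally, and forwardTo stand in for the socket lookup and forwarding steps the real eBPF program performs; they are not actual APIs:

function handleUdpPacket(packet, bucket) {
  if (hasLocalUdpSocket(packet.fourTuple)) {
    return deliverLocally(packet) // flow established on the first hop
  }
  // No matching socket here: assume the flow, if any, lives on the second hop
  return forwardTo(bucket.secondHopDip, packet)
}
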
Although the change compared to the TCP logic is not large, it has the effect of switching the roles of the first- and second-hop servers: For UDP, new flows go to the second-hop server. The Unimog control plane has to take account of this when it updates a forwarding table. When it introduces a server into a bucket, that server should receive new connections or flows. For a TCP trafficset, this means it becomes the first-hop server. For a UDP trafficset, it must become the second-hop server.

This difference between handling of TCP and UDP also leads to higher overheads for UDP. In the case of TCP, as new connections are formed and old connections terminate over time, fewer packets will require the second hop, and so the overhead tends to diminish. But with UDP, new flows always involve the second hop. This is why we differentiate the two cases, taking advantage of SYN packets in the TCP case.

The UDP logic also places a requirement on services. The redirector must be able to match packets to the corresponding sockets on a server according to their 4-tuple. This is not a problem in the TCP case, because all TCP connections are represented by connected sockets in the BSD sockets API (these sockets are obtained from an accept system call, so that they have a local address and a peer address, determining the 4-tuple). But for UDP, unconnected sockets (lacking a declared peer address) can be used to send and receive packets. So some UDP services only use unconnected sockets. For the redirector logic above to work, services must create connected UDP sockets in order to expose their flows to the redirector.
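
For example, in Node.js a service can create a connected UDP socket as follows (the address and port are illustrative):

const dgram = require("dgram")

const sock = dgram.createSocket("udp4")
sock.connect(4500, "203.0.113.7", () => {
  // Once connected, the socket has both a local and a peer address,
  // fixing the 4-tuple that the redirector matches against.
  sock.send("hello") // no destination needed on a connected socket
})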

UDP applications with sessions

Some UDP-based protocols have explicit sessions, with a session identifier in each packet. Session identifiers allow sessions to persist even if the 4-tuple changes. This happens in mobility scenarios – for example, if a mobile device passes from a WiFi to a cellular network, causing its IP address to change. An example of a UDP-based protocol with session identifiers is QUIC (which calls them connection IDs).

Our Unimog XDP program allows a flow dissector to be configured for different trafficsets. The flow dissector is the part of the code that is responsible for taking a packet and extracting the value that identifies the flow or connection (this value is then hashed and used for the lookup into the forwarding table). For TCP and UDP, there are default flow dissectors that extract the 4-tuple. But specialised flow dissectors can be added to handle UDP-based protocols.
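
As a purely hypothetical sketch, a dissector for a QUIC-like protocol might pull a connection ID out of the UDP payload instead of hashing the 4-tuple. The header offsets below are invented for illustration and do not correspond to the real QUIC wire format:

function dissectQuicLike(packet) {
  const payload = packet.udpPayload
  const cidLength = payload[5]            // assumed header layout
  return payload.slice(6, 6 + cidLength)  // these bytes are hashed for the lookup
}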

We have used this functionality in our WARP product. We extended the Wireguard protocol used by WARP in a backwards-compatible way to include a session identifier, and added a flow dissector to Unimog to exploit it. There are more details in our post on the technical challenges of WARP.

Conclusion

Unimog has been deployed to all of Cloudflare’s edge data centers for over a year, and it has become essential to our operations. Throughout that time, we have continued to enhance Unimog (many of the features described here were not present when it was first deployed). So the ease of developing and deploying changes, due to XDP and xdpd, has been a significant benefit. Today we continue to extend it, to support more services, and to help us manage our traffic and the load on our servers in more contexts.

Two clicks to add region-based Zero Trust compliance

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/two-clicks-to-enable-regional-zero-trust-compliance/

Two clicks to add region-based Zero Trust compliance

Your team members are probably not just working from home – they may be working from different regions or countries. The flexibility of remote work gives employees a chance to work from the towns where they grew up or countries they always wanted to visit. However, that distribution also presents compliance challenges.

Depending on your industry, keeping data inside of certain regions can be a compliance or regulatory requirement. You might require employees to connect from certain countries or exclude entire countries altogether from your corporate systems.

When we worked in physical offices, keeping data inside of a country was easy. All of your users connecting to an application from that office were, of course, in that country. Remote work changed that and teams had to scramble to find a way to keep people productive from anywhere, which often led to sacrifices in terms of compliance. Starting today, you can make geography-based compliance easy again in Cloudflare Access with just two clicks.

You can now build rules that require employees to connect from certain countries. You can also add rules that block team members from connecting from other countries. This feature works with any configured identity provider and requires no other changes for your users or administrators.

What is Cloudflare Access?

Cloudflare Access secures applications by applying Zero Trust enforcement to every request. Rather than trusting anyone on a private network, Access checks for identity any time someone attempts to reach an application. With Cloudflare’s global network, that check takes place in data centers in over 200 cities around the world to avoid compromising performance.

Behind the scenes, administrators build rules to decide who should be able to reach the tools protected by Access. In turn, when users need to connect to those tools, they are prompted to authenticate with one of the identity provider options. Cloudflare Access checks their login against the list of allowed users and, if permitted, allows the request to proceed.

Two clicks to add region-based Zero Trust compliance

Cloudflare Access can check more than just their username. As a Zero Trust platform, Access aggregates multiple sources of signal about a user and surfaces those to the administrator. Some signals include whether the user authenticated with a mutual TLS client certificate or a hard key. However, some organizations also have compliance requirements that center around region, in addition to multifactor authentication.

Allow some countries, exclude others

You can build Cloudflare Access rules to be as simple as “only allow team members with @team.com email addresses.” However, usernames and passwords alone are not always sufficient. Depending on where you operate, or where you need to operate, you can use Cloudflare Access to layer country-specific rules on top of your identity provider workflows.

With this release, you can now add rules that require users to connect from certain countries or restrict logins from other countries. For example, you can require that users only connect from Portugal.

Two clicks to add region-based Zero Trust compliance

You can also exclude countries altogether. Cloudflare does not have an office in Costa Rica, a place I know many of us would love to visit. If a member of the team was on a beach vacation there and I wanted to make sure they really unplugged from work, we could add a rule to block logins to our applications from Costa Rica.

Two clicks to add region-based Zero Trust compliance

Some applications might not need country-specific requirements. Cloudflare Access rules can be configured on an application-by-application basis. You can add rules about country connections to specific applications that contain sensitive information, while limiting others to just identity.

Audit logins by country and user

Cloudflare Access captures every request a user makes to an internal application, without the need for any code changes. Your organization can export these logs to a third-party storage or SIEM solution to audit the country of origin for each user request. With that data, your compliance and security teams can quickly audit where your corporate devices are operating without the need to deploy additional client-side software.

Layer with other Zero Trust rules

Zero trust security starts with a username. Administrators build rules to determine which users can reach specific applications. Cloudflare Access integrates with your team’s identity provider, or even multiple identity providers, to make those username-based decisions at the edge of our network.

However, identity consists of more than just a username. Cloudflare Access can aggregate multiple sources of signal in Cloudflare’s network. Access can use that information to make a decision about identity in our network – long before that request ever reaches your infrastructure.

You can combine user rules with mutual TLS requirements, or device posture checks, and even force logins to always use a hard key. All of these zero-trust rules run inline with Cloudflare’s existing security features, like our WAF and DDoS mitigation, to add layers of security to every request. The Cloudflare network gives your team a zero-trust platform to apply all of the data we can gather about a request to determine whether or not it should be allowed.

The country rules we’re announcing today become another layer in that zero trust model. Like other sources of signal, you can combine these rules to build a comprehensive policy tailored to your organization’s compliance or security needs. For example, you can build a rule that only allows users to login to your application when they connect from Germany and use a physical hard key.

Two clicks to add region-based Zero Trust compliance

How to get started

To get started, navigate to an application you have added to Cloudflare Access or create a new one. Cloudflare Access policies consist of actions that can allow, block, or bypass requests based on the criteria defined. Access follows policies in order of precedence from top to bottom in the UI.

Inside of a policy you can define the criteria with three types of operators:

  • Include: Include rules function like OR operators. Users must meet at least one criterion in an Include rule. For example, an Include rule can be constructed to allow anyone with an @cloudflare.com email domain, or one specific user’s email address, to connect.
  • Require: Require rules function like AND operators. Users must meet all Require rule criteria.
  • Exclude: Exclusion rules function like “NOT” operators. Users must not meet the criterion of an Exclude rule.

To require that users connect from a particular country, create an Allow policy that includes your users’ emails or identity provider group. Within that Allow policy, add a Require rule and choose the country that will be required. If you want to create a rule that requires multiple countries, you can add them into an Access Group.

Two clicks to add region-based Zero Trust compliance

You can then add that group into the Require rule.

Two clicks to add region-based Zero Trust compliance
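
If you prefer the API over the dashboard, a policy like this can also be created programmatically. The sketch below is illustrative only; the accountId, appId, and apiToken values are placeholders, and you should check the Access API documentation for the authoritative endpoint and rule schema before relying on the field names shown here:

await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${accountId}/access/apps/${appId}/policies`,
  {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "Allow team members connecting from Portugal",
      decision: "allow",
      // Include works like OR; Require works like AND
      include: [{ email_domain: { domain: "team.com" } }],
      require: [{ geo: { country_code: "PT" } }],
    }),
  }
)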

What’s next?

Cloudflare Access, part of Cloudflare for Teams, is available today. The country requirement rule is available in all plans. You can follow the documentation here to add the additional rule.

Asynchronous HTMLRewriter for Cloudflare Workers

Post Syndicated from Ashcon Partovi original https://blog.cloudflare.com/asynchronous-htmlrewriter-for-cloudflare-workers/

Asynchronous HTMLRewriter for Cloudflare Workers

Asynchronous HTMLRewriter for Cloudflare Workers

Last year, we launched HTMLRewriter for Cloudflare Workers, which enables developers to make streaming changes to HTML on the edge. Unlike a traditional DOM parser that loads the entire HTML document into memory, we developed a streaming parser written in Rust. Today, we’re announcing support for asynchronous handlers in HTMLRewriter. Now you can perform asynchronous tasks based on the content of the HTML document: from prefetching fonts and image assets to fetching user-specific content from a CMS.

How can I use HTMLRewriter?

We designed HTMLRewriter to have a jQuery-like experience. First, you define a handler, then you assign it to a CSS selector; Workers does the rest for you. You can look at our new and improved documentation to see our supported list of selectors, which now include nth-child selectors. The example below changes the alternative text for every second image in a document.

async function editHtml(request) {
  return new HTMLRewriter()
     .on("img:nth-child(2)", new ElementHandler())
     .transform(await fetch(request))
}

class ElementHandler {
   element(e) {
      e.setAttribute("alt", "A very interesting image")
   }
}

Since these changes are applied using streams, we maintain a low TTFB (time to first byte) and users never know the HTML was transformed. If you’re interested in how we’re able to accomplish this technically, you can read our blog post about HTML parsing.

What’s new with HTMLRewriter?

Now you can define an async handler which allows any code that uses await. This means you can inject HTML dynamically, based on the contents of the document, without any prior knowledge of what it contains. This allows you to customize HTML based on a particular user, feature flag, or even an integration with a CMS.

class UserCustomizer {
   // Remember to add the `async` keyword to the handler method
   async element(e) {
      const user = await fetch(`https://my.api.com/user/${e.getAttribute("user-id")}/online`)
      if (user.ok) {
         // Add the user’s name to the element
         e.setAttribute("user-name", await user.text())
      } else {
         // Remove the element, since this user is not online
         e.remove()
      }
   }
}

What can I build with HTMLRewriter?

To illustrate the flexibility of HTMLRewriter, I wrote an example that you can deploy on your own website. If you manage a website, you know that old links and images can expire with time. Here’s an excerpt from a years-old post I wrote on the Cloudflare Blog:

Asynchronous HTMLRewriter for Cloudflare Workers

As you can see, that missing image is not the prettiest sight. However, we can easily fix this using async handlers in HTMLRewriter. Using a service like the Internet Archive API, we can check if an image no longer exists and rewrite the URL to use the latest archive. That means users don’t see an ugly placeholder and won’t even know the image was replaced.

async function fetchAndFixImages(request) {
   return new HTMLRewriter()
      .on("img", new ImageFixer())
      .transform(await fetch(request))
}

class ImageFixer {
   async element(e) {
    var url = e.getAttribute("src")
    var response = await fetch(url)
    if (!response.ok) {
       var archive = await fetch(`https://archive.org/wayback/available?url=${url}`)
       if (archive.ok) {
          var snapshot = await archive.json()
          e.setAttribute("src", snapshot.archived_snapshots.closest.url)
       } else {
          e.remove()
       }
    }
  }
}

Using the Workers Playground, you can view a working sample of the above code. A more complex example could even alert a service like Sentry when a missing image is detected. Using the previous missing image, now you can see the image is restored and users are none the wiser.

Asynchronous HTMLRewriter for Cloudflare Workers

If you’re interested in deploying this to your own website, click on the button below:

Asynchronous HTMLRewriter for Cloudflare Workers

What else can I build with HTMLRewriter?

We’ve been blown away by developer projects using HTMLRewriter. Here are a few projects that caught our eye and are great examples of the power of Cloudflare Workers and HTMLRewriter:

If you’re interested in using HTMLRewriter, check out our documentation. Also be sure to share any creations you’ve made with @CloudflareDev, we love looking at the awesome projects you build.

Improving the Wrangler Startup Experience

Post Syndicated from Joshua Johnson original https://blog.cloudflare.com/improving-the-wrangler-startup-experience/

Improving the Wrangler Startup Experience

Improving the Wrangler Startup Experience

Today I’m excited to announce wrangler login, an easy way to get started with Wrangler! This summer for my internship on the Workers Developer Productivity team I was tasked with helping improve the Wrangler user experience. For those who don’t know, Workers is Cloudflare’s serverless platform which allows users to deploy their software directly to Cloudflare’s edge network.

This means you can run custom logic on requests heading to your site, or even run fully fledged applications, directly on the edge. Wrangler is the open-source CLI tool used to manage your Workers and has a big focus on enabling a smooth developer experience.

When I first heard I was working on Wrangler, I was excited that I would be working on such a cool product but also a little nervous. This was the first time I would be writing Rust in a professional environment, the first time making meaningful open-source contributions, and on top of that the first time doing all of this remotely. But thanks to lots of guidance and support from my mentor and team, I was able to help make the Wrangler and Workers developer experience just a little bit better.

The Problem

The main improvement I focused on this summer was the experience of getting started with Wrangler. For many of the commands to publish and develop live Workers, the user first needs to authenticate with Cloudflare. This is mainly done through the wrangler config command, which has the user create an API token and paste it into Wrangler. Creating a token involves going to the Cloudflare dashboard, opening your profile, navigating to the API tokens page, selecting a token template, adding your zones and accounts, and finally creating the token. While this is a completely valid authentication flow, it’s not as easy as it could be.

It could be frustrating to users who have to leave Wrangler and then possibly get lost in the wrong dashboard page or use the wrong settings for their token. When a group of intern candidates were given the task of using Wrangler, most of them got stuck on this step! Many users might forgo using Workers altogether if this is the first thing they encounter when sitting down to develop. Instead we wanted an experience where users could use their Cloudflare login (ie. their username, password, and possible two-factor authentication) and immediately be ready to go.

No OAuth? No Problem

What we came up with was a way to create and transfer API Tokens for a user, similar to how Argo Tunnel handles its login.

Improving the Wrangler Startup Experience

An overview of the process is shown above, which starts with Wrangler. When the user types wrangler login in their terminal, they will be prompted to open the Cloudflare dashboard in their browser. All dashboard pages require the user to sign in before loading and once the user is signed in, all actions taken by the dashboard page will use the authentication of that user.

This means we can make a dashboard page which automatically creates an API token configured to manage Workers. Then when the user loads this page, a properly configured API token will be created for that user. Our dashboard page will then hand off the token to EdgeWorker Config Service (EWC), which will temporarily store it. While this is all going on, Wrangler will be polling EWC waiting for the token to appear, and once it does, Wrangler will retrieve the token and authenticate the user. With this, we have a seamless way to authenticate a Cloudflare user.

Security

One thing we had to be mindful of was security; these are users’ tokens, after all. If someone was listening to network traffic and saw the request to the Cloudflare dashboard page, nothing would stop them from polling EWC themselves and stealing the token away from the user to wreak havoc on their Workers and zones. To solve this problem we used asymmetric RSA encryption. Asymmetric encryption lets us create two separate but mathematically connected keys: a private key, which can encrypt and decrypt information, and a public key, which can only encrypt information.

Wrangler will generate a public-private key pair and pass off the public key to our dashboard page. Once the dashboard page is finished creating our token, EWC will then encrypt the token using the public key before storing it. This means in the previous scenario where someone takes the token from our user, all they will have is an encrypted token they can’t use. The only way to decrypt it would be with the private key held by Wrangler.
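
To make the scheme concrete, here is an illustrative sketch in Node.js (the real implementation lives in Wrangler’s Rust code, and the token value here is a placeholder):

const crypto = require("crypto")

// Wrangler's side: generate a key pair and share only the public key
const { publicKey, privateKey } = crypto.generateKeyPairSync("rsa", {
  modulusLength: 2048,
})

// EWC's side: encrypt the token with the public key before storing it
const stored = crypto.publicEncrypt(publicKey, Buffer.from("the-api-token"))

// Back in Wrangler: only the holder of the private key can recover the token
const token = crypto.privateDecrypt(privateKey, stored).toString()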

Improving the Wrangler Startup Experience

In the end, this solution results in a smooth experience for Workers users. Now instead of rummaging through dashboard pages you can get started with Wrangler in only a few seconds, sometimes without having to leave the comfort of your own terminal.

Try out wrangler login in the 1.11.0 release of Wrangler and let us know how you like it. Also I would like to thank the Workers team for helping make this possible and giving me an awesome experience this summer! In order to implement this feature I had to touch different parts of Cloudflare like EWC and Stratus (Cloudflare’s front end monorepo) and work in areas unfamiliar to me such as frontend TypeScript and React. The responsiveness and encouragement I received helped get this feature created and helped make for a great summer!

How Cloudflare uses Cloudflare Spectrum: A look into an intern’s project at Cloudflare

Post Syndicated from Ryan Jacobs original https://blog.cloudflare.com/how-cloudflare-uses-cloudflare-spectrum-a-look-into-an-interns-project-at-cloudflare/

How Cloudflare uses Cloudflare Spectrum: A look into an intern’s project at Cloudflare

How Cloudflare uses Cloudflare Spectrum: A look into an intern’s project at Cloudflare

Cloudflare extensively uses its own products internally in a process known as dogfooding. As part of my onboarding as an intern on the Spectrum (a layer 4 reverse proxy) team, I learned that many internal services dogfood Spectrum, as they are exposed to the Internet and benefit from layer 4 DDoS protection. One of my first tasks was to update the configuration for an internal service that was using Spectrum. The configuration was managed in Salt (used for configuration management at Cloudflare), which was not particularly user-friendly, and required an engineer on the Spectrum team to handle updating it manually.

This process took about a week. That should instantly raise some questions, as a typical Spectrum customer can create a new Spectrum app in under a minute through the Cloudflare Dashboard. So why couldn’t I?

This question formed the basis of my intern project for the summer.

The Process

Cloudflare uses various IP ranges for its products. Some customers also authorize Cloudflare to announce their IP prefixes on their behalf (this is known as BYOIP). Collectively, we can refer to these IPs as managed addresses. To prevent Bad Stuff (defined later) from happening, we prohibit managed addresses from being used as Spectrum origins. To accomplish this, Spectrum had its own table of denied networks that included the managed addresses. For the average customer, this approach works great – they have no legitimate reason to use a managed address as an origin.

Unfortunately, the services dogfooding Spectrum all use Cloudflare IPs, preventing those teams with a legitimate use-case from creating a Spectrum app through the configuration service (i.e. Cloudflare Dashboard). To bypass this check, these internal customers needed to define a custom Spectrum configuration, which needed to be manually deployed to the edge via a pull request to our Salt repo, resulting in a time-consuming process.

If an internal customer wanted to change their configuration, the same time-consuming process had to be used. While this allowed internal customers to use Spectrum, it was tedious and error-prone.

Bad Stuff

Bad Stuff is quite vague and deserves a better definition. It may seem arbitrary that we deny Cloudflare managed addresses. To motivate this, consider two Spectrum apps, A and B, where the origin of app A is the Cloudflare edge IP of app B, and the origin of app B is the edge IP of app A. Essentially, app A will proxy incoming connections to app B, and app B will proxy incoming connections to app A, creating a cycle.

This could potentially crash the daemon or degrade performance. In practice, this configuration is useless (the proxied connection never reaches an origin) and would only be created by a malicious user, so it is never allowed.

How Cloudflare uses Cloudflare Spectrum: A look into an intern’s project at Cloudflare

In fact, the more general case of setting another Spectrum app as an origin (even when the configuration does not result in a cycle) is (almost¹) never needed, so it also needs to be avoided.

As well, since we are providing a reverse proxy to customer origins, we do not need to allow connections to IP ranges that cannot be used on the public Internet, as specified in this RFC.

The Problem

To improve usability and allow internal Spectrum customers to create apps using the Dashboard instead of the static configuration workflow, we needed a way to give particular customers permission to use Cloudflare managed addresses in their Spectrum configuration. Solving this problem was my main project for the internship.

A good starting point ended up being the Addressing API. The Addressing API is Cloudflare’s solution to IP management, an internal database and suite of tools to keep track of IP prefixes, with the goal of providing a unified source of truth for how IP addresses are being used across the organization. This makes it possible to provide a cross-product platform for products and features such as BYOIP, BGP On Demand, and Magic Transit.

The Addressing API keeps track of all Cloudflare managed IP prefixes, along with who owns the prefix. As well, the owner of a prefix can give permission for someone else to use the prefix. We call this a delegation.

A user’s permission to use an IP address managed by the Addressing API is determined as follows:

  1. Is the user the owner of the prefix containing the IP address?
    a) Yes, the user has permission to use the IP
    b) No, go to step 2
  2. Has the user been delegated a prefix containing the IP address?
    a) Yes, the user has permission to use the IP.
    b) No, the user does not have permission to use the IP.

The Solution

With the information present in the Addressing API, the solution starts to become clear. For a given customer and IP, we use the following algorithm:

  1. Is the IP managed by Cloudflare (or contained in the previous RFC)?
    a) Yes, go to step 2
    b) No, allow as origin
  2. Does the customer have permission to use the IP address?
    a) Yes, allow as origin
    b) No, deny as origin

As long as the internal customer has been given permission to use the Cloudflare IP (through a delegation in the Addressing API), this approach would allow them to use it as an origin.

However, we run into a corner case here – since BYOIP customers also have permission to use their own ranges, they would be able to set their own IP as an origin, potentially causing a cycle. To mitigate this, we need to check if the IP is a Spectrum edge IP. Fortunately, the Addressing API also contains this information, so all we have to do is check if the given origin IP is already in use as a Spectrum edge IP, and if so, deny it. Since all of the denied networks checks occur in the Addressing API, we were able to remove Spectrum’s own deny network database, reducing the engineering workload to maintain it along the way.
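
Putting the algorithm and the corner case together, the validation amounts to something like the following sketch. The helper functions stand in for Addressing API lookups and are not real endpoints:

function mayUseAsOrigin(user, ip) {
  if (!isManagedAddress(ip) && !isNonPublicRange(ip)) {
    return true                   // ordinary customer origin, nothing to check
  }
  if (isSpectrumEdgeIp(ip)) {
    return false                  // would create a proxy cycle
  }
  // Managed address: allowed only via ownership or a delegation
  return ownsPrefix(user, ip) || hasDelegation(user, ip)
}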

Let’s go through a concrete example. Consider an internal customer who wants to use 104.16.8.54/32 as an origin for their Spectrum app. This address is managed by Cloudflare, and suppose the customer has permission to use it, and the address is not already in use as an edge IP. This means the customer is able to specify this IP as an origin, since it meets all of our criteria.

For example, a request to the Addressing API could look like this:

curl --silent 'https://addr-api.internal/edge_services/spectrum/validate_origin_ip_acl?cidr=104.16.8.54/32' -H "Authorization: Bearer $JWT" | jq .
{
  "success": true,
  "errors": [],
  "result": {
    "allowed_origins": {
      "104.16.8.54/32": {
        "allowed": true,
        "is_managed": true,
        "is_delegated": true,
        "is_reserved": false,
        "has_binding": false
      }
    }
  },
  "messages": []
}

Now we have completely moved the responsibility of validating the use of origin IP addresses from Spectrum’s configuration service to the Addressing API.

Performance

This approach required making another HTTP request on the critical path of every create app request in the Spectrum configuration service. Some basic performance testing showed (as expected) increased response times for the API call (about 100ms). This led to discussion among the Spectrum team about the performance impact of different HTTP requests throughout the critical path. To investigate, we decided to use OpenTracing.

OpenTracing is a standard for providing distributed tracing of microservices. When an HTTP request is received, special headers are added to it to allow it to be traced across the different services. Within a given trace, we can see how long a SQL query took, the time a function took to complete, the amount of time a request spent at a given service, and more.
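
As a sketch of what this instrumentation looks like, here is how a request handler might continue a trace using the opentracing JavaScript library (our config service is not written in JavaScript, and validateOriginWithAddressingApi is a hypothetical helper):

const opentracing = require("opentracing")

const tracer = opentracing.globalTracer()

async function handleCreateApp(req) {
  // Continue the trace propagated in the incoming request's headers
  const parent = tracer.extract(opentracing.FORMAT_HTTP_HEADERS, req.headers)
  const span = tracer.startSpan("create_app", { childOf: parent })
  try {
    return await validateOriginWithAddressingApi(req, span)
  } finally {
    span.finish() // records how long this service spent on the request
  }
}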

We have been deploying a tracing system for our services to provide more visibility into a complex system.

After instrumenting the Spectrum config service with OpenTracing, we were able to determine that the Addressing API accounted for a very small amount of time in the overall request, and allowed us to identify potentially problematic request times to other services.

How Cloudflare uses Cloudflare Spectrum: A look into an intern’s project at Cloudflare

Lessons Learned

Reading documentation is important! Having a good understanding of how the Addressing API and the config service worked allowed me to create and integrate an endpoint that made sense for my use-case.

Writing documentation is just as important. For the final part of my project, I had to onboard Crossbow – an internal Cloudflare tool used for diagnostics – to Spectrum, using the new features I had implemented. I had written an onboarding guide, but some steps were unclear during the onboarding process, so I made sure to gather feedback from the Crossbow team to improve the guide.

Finally, I learned not to underestimate the amount of complexity required to implement relatively simple validation logic. In fact, the implementation required understanding the entire system. This includes how multiple microservices work together to validate the configuration and understanding how the data is moved from the Core to the Edge, and then processed there. I found increasing my understanding of this system to be just as important and rewarding as completing the project.

Footnotes:

¹ Regional Services actually makes use of proxying a Spectrum connection to another colocation, and then proxying to the origin, but the configuration plane is not involved in this setup.

Introducing Deploy Buttons

Post Syndicated from David Song original https://blog.cloudflare.com/introducing-deploy-buttons/

Introducing Deploy Buttons

Introducing Deploy Buttons

When I first try out new development platforms, the first thing I do is get an OSS (Open Source Software) project I find on Github up and running. I used to start by following tutorials or digging through documentation. It’s a little bit counterintuitive. Let me share with you why. One reason is that Hello, World! examples rarely show the real “magic” of the platform. I want to feel excited and get a sense of how other people are creatively using the platform.

For example, I love it when I can build and deploy an OSS Pokedex app in a few minutes on Flutter to see if the platform actually lives up to the hype. It’s so much easier to do this than to spend a few hours following tutorials and documentation to get through the initial learning curve. You can think of it as shortening the time to first dopamine.

Another reason is that it makes learning the new platform much faster. Building off of an experienced developer’s work shows me which classes and functions are most useful to learn. There’s more nuance to building out full applications than is usually explained in the documentation. I can see how the pieces fit together and get a deeper understanding.

When I started my internship on the Workers team, I realized I could help create that magical experience of quickly trying out a new platform for Workers developers. The team and I have spent the last few months working hard to make this possible. Today, I’m happy to share with you: Deploy Buttons.

Deploy Buttons let you deploy a project to the Workers Platform without even needing to set up a local development environment. Before Deploy Buttons, new developers had to jump between multiple places like the signup pages, the docs, the dashboard, and Wrangler to deploy a project. Now, it’s as easy as clicking a Deploy Button and following three short steps in our new web-based deploy tool. This is the fastest way for beginners to get something live in less than five minutes, and for developers to share their projects with others by creating their own deploy buttons.

Try this button to deploy a GraphQL db:

Introducing Deploy Buttons

We’ve also curated a few awesome projects to try deploying now with this new deploy flow.

Before Deploy Buttons, things were a bit more complicated. When you found a project you like on GitHub or an example on our template gallery, you had to go to cloudflare.com to create an account, go to the docs to set up a local development environment, clone the project, and finally, learn how to use wrangler to deploy it. Now, you can just click on the deploy button to quickly get a project up and running with Workers to experience the magic of serverless.

Introducing Deploy Buttons

Here’s how it works. For example, if you find an interesting Workers project on GitHub and want to deploy it, you would click the “Deploy to Workers Button” on the repo README. This would take you to our new web-based deploy tool where you can deploy that project in just three steps. First, connect your GitHub account. Second, connect your Cloudflare account. Third, click “Deploy” and the project will be forked to your GitHub account and deployed to Workers with our GitHub Action.


Want to create your own Deploy Button for your projects?

1) Add a GitHub Actions workflow to your project.

Add a new file to .github/workflows, such as .github/workflows/deploy.yml, and create a GitHub workflow for deploying your project. It should include a set of events, including at least repository_dispatch, but probably push and maybe schedule as well. Add a step for publishing your project using wrangler-action:

name: Build
on:
  push:
  pull_request:
  repository_dispatch:
jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    # needs: test  # add this back if your workflow also defines a test job
    steps:
      - uses: actions/checkout@v2
      - name: Publish
        uses: cloudflare/wrangler-action@1.1.0

2) Add support for CF_API_TOKEN and CF_ACCOUNT_ID in your repo workflow:

# Update "Publish" step from last code snippet
- name: Publish
  uses: cloudflare/wrangler-action@1.1.0
  with:
    apiToken: ${{ secrets.CF_API_TOKEN }}
  env:
    CF_ACCOUNT_ID: ${{ secrets.CF_ACCOUNT_ID }}

3) Add the Markdown code for your button to your project’s README, replacing the example url parameter with your repository URL.

Introducing Deploy Buttons

[![Deploy to Cloudflare Workers](https://deploy.workers.cloudflare.com/button)](https://deploy.workers.cloudflare.com/?url=https://github.com/YOURUSERNAME/YOURREPO)

Does your project use features only available in a paid Workers plan like Workers KV? Providing the paid=true query parameter to the /button and the deploy application paths will render a “Deploy to Workers Bundled” button, as seen below — it will also render a notice in the UI that the project requires Workers Bundled:

Introducing Deploy Buttons

[![Deploy to Cloudflare Workers](https://deploy.workers.cloudflare.com/button?paid=true)](https://deploy.workers.cloudflare.com/?url=https://github.com/YOURUSERNAME/YOURREPO&paid=true)

Thanks for tuning in. If you have any feedback, please fill out our feedback survey. We have more exciting features and improvements to announce soon for making the developer experience for Workers even better.

Require hard key auth with Cloudflare Access

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/require-hard-key-auth-with-cloudflare-access/

Require hard key auth with Cloudflare Access

Last month, attackers compromised a Twitter team member’s access to an internal administrative panel in order to take over high-profile accounts. Full details of the breach are still pending, but Twitter has shared that the attackers stole credentials through a coordinated spear phishing attack.

The attackers convinced a team member to share login permissions, giving the attackers the ability to access the Twitter control plane. Once authenticated, they sent password reset flows to email accounts they controlled in order to hijack the Twitter accounts.

Administrative panels like Twitter’s are a rich target for phishing attacks because they give attackers a backdoor to privileged systems. Customer-facing teams at SaaS companies rely on these administrative panels to update end-user data and troubleshoot user account issues. If an attacker can compromise a single team member’s account they can potentially impact thousands of end users.

We have our own administrative panel at Cloudflare and we’ve deployed a number of safeguards over the last several years to keep it secure from phishing attacks. However, we had no way to enforce the security feature we think would most insulate us from phishing attacks: physical hard keys.

With hard keys, users can only login when they use a physical device – one that does not produce codes that could be shared over chat. Google notably eliminated all employee phishing cases by rolling out their own hard keys. We issued them to every team member who had to login to our identity provider to reach the admin panel, but our identity provider allowed users to fall back to a less secure option for MFA.

To solve that problem, we moved the hard key requirement into Cloudflare’s network. Using Cloudflare Access, we can now restrict the ability to reach our admin panel only to team members who authenticate with a hard key. Starting today, we’re making that feature available to all teams.

Securing our own internal tools

Our admin panel gives our team members the ability to turn on features, manage settings, and investigate issues for our customers. The tool itself maintains its own set of application-level permissions that control the actions that administrators can take. Some customer accounts are scoped to specific team members; other accounts cannot be modified without the customer’s explicit approval.

We layer other safeguards on top of those permissions. Users who are inactive for a number of days need to be manually re-enabled. We regularly audit who can use the tool and trim back the list.

We used to lean on a VPN to keep the front door to the admin panel secure. Team members would connect to our VPN using a client on their device. Once on the VPN, they could reach the login page of the tool.

The VPN was painful for end users and a source of concern for our security team. Two issues, in particular, motivated us to find a better solution.

  • Segmentation was difficult. Not every user in the company needs to reach the admin panel, but any user on the VPN could connect to the login page. We wanted to limit that access to users in specific permission groups as part of a defense-in-depth strategy.
  • We could not add additional signals. Users could connect to the private network with just a password and MFA code. We could not limit VPN connections to corporate laptops, healthy devices, or other more granular restrictions.

About two years ago, we migrated that admin panel’s security perimeter to Cloudflare Access. Access gave us a zero-trust alternative to our VPN. Instead of being able to reach the admin panel because you are on the private network, Access continuously checks every request to the tool for identity against a list of allowed users.

We could use Access to limit who could reach certain tools, and it became a foundation for adding other sources of signal down the road. Access also gave us a new level of visibility. Since all requests pass through Cloudflare’s edge, Access provides our team with visibility into every request to every path. Without making any server-side code changes, Access can log every request and attribute it to an authenticated user. We can export those to our SIEM and create a comprehensive audit trail.

Hard keys without weak fallbacks

We integrated Cloudflare Access with our identity provider, which supports multifactor authentication (MFA). To login through Cloudflare Access, users would need to authenticate with their password and an MFA option. The ability to choose an option meant a less secure method could be selected.

Those fallback options were subject to a higher risk of phishing attack. SMS-based codes can be vulnerable to SIM swapping attacks. App-based time-based one-time-pin (TOTP) codes can linger on forgotten devices and, more dangerously, be transmitted as part of an attack.

Hard keys stand out because they rely on control of a physical item. With hard keys, users login with their password and then have to tap an actual key (typically in the form of a USB device). That tap presents a certificate that only lives on the key to the service configured to trust it. A user with the hard key could not inadvertently share that code over the phone or in a chat with an attacker.

We distribute hard keys to all team members at Cloudflare. However, we could not require team members to use them on an app-by-app basis. If I don’t have my hard key around, I always have the option to fall back to a TOTP code. We needed a filtering engine that could combine multiple sources of identity signal, which Cloudflare Access provided.

How Cloudflare Access solved our own problem

If you remember going to bars or clubs before the pandemic, Cloudflare Access might seem familiar. You had to show your identity card at the door to a bouncer. The bouncer (and the establishment) did not issue that card – your government did. However, they know what it should look like and how to use its information.

Once checked, the bouncer stamped the back of your hand. You could put your ID back in your wallet and the stamp became proof that the bartenders inside knew to trust.

Cloudflare Access works like that bouncer. When users connect to a resource secured by Cloudflare Access, we check for their ID by sending them to login to their identity provider like Okta or Azure AD. Users authenticate and their identity provider sends Cloudflare Access details like who they are and, for certain providers, what MFA method they used.

Like that stamp, Cloudflare Access uses the external identity information to create a distinct badge that we trust. Access generates a JSON Web Token (JWT) for the user and stores it in their browser. That token becomes the user’s badge for the rest of their session. Cloudflare Access looks for that JWT on every request the user makes to the application and enforces rules that an administrator configures about who can proceed.

Cloudflare Access can store more than just the user’s identity in the JWT. If the identity provider captures the MFA method used by a team member, Access can read that value and store it as an additional field in the JWT. RFC 8176, Authentication Method Reference Values, standardizes these values and how they are shared between systems.

We can use that standard to introduce the MFA method into the JWT created by Cloudflare Access. Access could then add an additional check that evaluates both the user’s identity and the MFA method they used to login.
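
For example, a check on the resulting JWT might look like the sketch below. The claim values come from RFC 8176; the function itself is illustrative, not Access’s actual implementation:

function loginUsedHardKey(jwtPayload) {
  const methods = jwtPayload.amr || []
  // "hwk" denotes proof-of-possession of a hardware-secured key in
  // RFC 8176; a TOTP fallback would carry "otp" instead.
  return methods.includes("hwk")
}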

The policy flexibility gave us what we needed to work with our security team to solve this problem. By adding a new rule that layers on top of the identity rule, we could immediately require that every team member logging in to our admin panel do so with their hard key.

Require hard key auth with Cloudflare Access

In our case, that includes any hard keys supported by WebAuthn and FIDO2, or keys tied to a physical device like Apple Touch ID and Windows Hello. Access would reject the attempt if they (or an attacker) used the fallback TOTP code – even if the identity provider allowed the login.

How to get started

You can begin using the same feature our own security team needed today at no additional cost. You’ll need an identity provider that supports Authentication Method Reference (amr) values. Today, that consists of Okta and Azure Active Directory. We expect others will add support and we will update our documentation as they do.

To get started, navigate to an application you have in Cloudflare Access or create a new one. In the rules that determine who is allowed to reach the application, add a rule in the “Require” section. Select “MFA” from the dropdown and then choose which options you want to require.

Require hard key auth with Cloudflare Access

What’s next?

Cloudflare Access, part of Cloudflare for Teams, is available today. You can follow the documentation here to add the additional rule. All accounts can use 50 seats of Cloudflare Access for free, including the hard key requirement feature.