What’s new with Workers KV?

Post Syndicated from Steve Klabnik original https://blog.cloudflare.com/whats-new-with-workers-kv/

The Storage team here at Cloudflare shipped Workers KV, our global, low-latency, key-value store, earlier this year. As people have started using it, we’ve gotten some feature requests, and have shipped some new features in response! In this post, we’ll talk about some of these use cases and how these new features enable them.

New KV APIs

We’ve shipped some new APIs, both via api.cloudflare.com and from within a Worker. The first provides the ability to upload and delete more than one key/value pair at once. Given that Workers KV is great for read-heavy, write-light workloads, a common pattern when getting started with KV is to write a bunch of data via the API, and then read that data from within a Worker. You can now do these bulk uploads without needing a separate API call for every key/value pair. This feature is available via api.cloudflare.com, but is not yet available from within a Worker.

For example, say we’re using KV to redirect legacy URLs to their new homes. We have a list of URLs to redirect, and where they should redirect to. We can turn this list into JSON that looks like this:

[
  {
    "key": "/old/post/1",
    "value": "/new-post-slug-1"
  },
  {
    "key": "/old/post/2",
    "value": "/new-post-slug-2"
  }
]

And then POST this JSON to the new bulk endpoint, /storage/kv/namespaces/:namespace_id/bulk. This will add both key/value pairs to our namespace.

Likewise, if we wanted to drop support for these redirects, we could issue a DELETE that has this body:

[
    "/old/post/1",
    "/old/post/2"
]

to /storage/kv/namespaces/:namespace_id/bulk, and we’d delete both key/value pairs in a single call to the API.
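If you’re scripting these calls, here’s a minimal sketch using fetch in Node.js 18+ (run as an ES module so top-level await works). The account-scoped URL prefix and token-based auth headers aren’t shown in this post, so treat them as assumptions and check the API documentation for the exact details:

// Placeholders throughout: account ID, namespace ID, and API token.
const url = "https://api.cloudflare.com/client/v4/accounts/<account_id>" +
  "/storage/kv/namespaces/<namespace_id>/bulk"
const headers = {
  "Authorization": "Bearer <api_token>",
  "Content-Type": "application/json",
}

// Upload both redirects in a single call.
await fetch(url, {
  method: "POST",
  headers,
  body: JSON.stringify([
    { key: "/old/post/1", value: "/new-post-slug-1" },
    { key: "/old/post/2", value: "/new-post-slug-2" },
  ]),
})

// ...and remove them both in a single call.
await fetch(url, {
  method: "DELETE",
  headers,
  body: JSON.stringify(["/old/post/1", "/old/post/2"]),
})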

The bulk upload API has one more trick up its sleeve: not all data is a string. For example, you may have an image as a value, which is just a bag of bytes. If you need to write some binary data, you’ll have to base64-encode the value’s contents so that it’s valid JSON. You’ll also need to set one more key:

[
  {
    "key": "profile-picture",
    "value": "aGVsbG8gd29ybGQ=",
    "base64": true
  }
]

Workers KV will decode the value from base64, and then store the resulting bytes.
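As an example, here’s a sketch in Node.js (with a made-up filename) of preparing a binary value for the bulk endpoint:

const fs = require("fs")

// Read the raw bytes and base64-encode them so the JSON stays valid.
const bytes = fs.readFileSync("./profile-picture.png")
const body = JSON.stringify([
  { key: "profile-picture", value: bytes.toString("base64"), base64: true },
])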

Beyond bulk upload and delete, we’ve also given you the ability to list all of the keys you’ve stored in any of your namespaces, from both the API and within a Worker. For example, if you wrote a blog powered by Workers + Workers KV, you might have each blog post stored as a key/value pair in a namespace called “contents”. Most blogs have some sort of “index” page that lists all of the posts that you can read. To create this page, we need to get a listing of all of the keys, since each key corresponds to a given post. We could do this from within a Worker by calling list() on our namespace binding:

const value = await contents.list()

But what we get back isn’t only a list of keys. The object looks like this:

{
  keys: [
    { name: "Title 1” },
    { name: "Title 2” }
  ],
  list_complete: false,
  cursor: "6Ck1la0VxJ0djhidm1MdX2FyD"
}

We’ll talk about this “cursor” stuff in a second, but if we wanted to get the list of titles, we’d have to iterate over the keys property, and pull out the names:

const keyNames = value.keys.map(e => e.name)

keyNames would be an array of strings:

["Title 1", "Title 2", "Title 3", "Title 4", "Title 5"]

We could then use keyNames to build our page.

So what’s up with the list_complete and cursor properties? Well, imagine that we’ve been a very prolific blogger, and we’ve now written thousands of posts. The list API is paginated, meaning that it will only return the first thousand keys. To see if there are more pages available, you can check the list_complete property. If it is false, you can use the cursor to fetch another page of results. The value of cursor is an opaque token that you pass to another call to list:

const value = await NAMESPACE.list()
const cursor = value.cursor
const next_value = await NAMESPACE.list({"cursor": cursor})

This will give us another page of results, and we can repeat this process until list_complete is true.
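Putting that together, a small loop (a sketch, using the same NAMESPACE binding as above) collects every key name across pages:

let keyNames = []
let cursor = undefined
let listComplete = false
while (!listComplete) {
  // Pass the cursor on every page after the first.
  const page = await NAMESPACE.list(cursor ? { cursor } : {})
  keyNames = keyNames.concat(page.keys.map(e => e.name))
  listComplete = page.list_complete
  cursor = page.cursor
}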

Listing keys has one more trick up its sleeve: you can also return only keys that have a certain prefix. Imagine we want to have a list of posts, but only the posts that were made in October of 2019. While Workers KV is only a key/value store, we can use the prefix functionality to do interesting things by filtering the list. In our original implementation, we had stored the titles of keys only:

  • Title 1
  • Title 2

We could change this to include the date in YYYY-MM-DD format, with a colon separating the two:

  • 2019-09-01:Title 1
  • 2019-10-15:Title 2

We can now ask for a list of all posts made in 2019:

const value = await NAMESPACE.list({"prefix": "2019"})

Or a list of all posts made in October of 2019:

const value = await NAMESPACE.list({"prefix": "2019-10"})

These calls will only return keys with the given prefix, which in our case, corresponds to a date. This technique can let you group keys together in interesting ways. We’re looking forward to seeing what you all do with this new functionality!
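For instance, to turn that October listing back into bare titles for an index page (a sketch using the "contents" binding from earlier):

const october = await contents.list({ prefix: "2019-10" })
// Strip the "YYYY-MM-DD:" prefix to recover the titles.
const titles = october.keys.map(e => e.name.split(":")[1])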

Relaxing limits

For various reasons, there are a few hard limits on what you can do with Workers KV. We’ve decided to raise some of these limits, which expands what you can do.

The first is the limit on the number of namespaces any account can have. This was previously set at 20, but some of you have made a lot of namespaces! We’ve decided to relax this limit to 100 instead, five times the number of namespaces you could previously create.

Additionally, we had a two megabyte maximum size for values. We’ve increased the limit for values to ten megabytes. With the release of Workers Sites, folks are keeping things like images inside of Workers KV, and two megabytes felt a bit cramped. While Workers KV is not a great fit for truly large values, ten megabytes gives you the ability to store larger images easily. As an example, a 4k monitor has a native resolution of 4096 x 2160 pixels. If we had an image at this resolution as a lossless PNG, for example, it would be just over five megabytes in size.

KV browser

Finally, you may have noticed that there’s now a KV browser in the dashboard! Needing to type out a cURL command just to see what’s in your namespace was a real pain, so we’ve given you the ability to check out the contents of your namespaces right on the web. When you look at a namespace, you’ll see a table of keys and values.

The browser has grown with a bunch of useful features since it initially shipped. You can not only see your keys and values, but also add new ones, edit existing ones, upload files, and even download them.

As we ship new features in Workers KV, we’ll be expanding the browser to include them too.

Wrangler integration

The Workers Developer Experience team has also been shipping some features related to Workers KV. Specifically, you can now manage your namespaces, and the key/value pairs inside of them, entirely from Wrangler.

For example, my personal website is running on Workers Sites. I have a Wrangler project named “website” to manage it. If I wanted to add another namespace, I could do this:

$ wrangler kv:namespace create new_namespace
Creating namespace with title "website-new_namespace"
Success: WorkersKvNamespace {
    id: "<id>",
    title: "website-new_namespace",
}

Add the following to your wrangler.toml:

kv-namespaces = [
    { binding = "new_namespace", id = "<id>" }
]

I’ve redacted the namespace IDs here, but Wrangler let me know that the creation was successful, and provided me with the configuration I need to put in my wrangler.toml. Once I’ve done that, I can add new key/value pairs:

$ wrangler kv:key put "hello" "world" --binding new_namespace
Success

And read it back out again:

$ wrangler kv:key get "hello" --binding new_namespace
world

If you’d like to learn more about the design of these features, “How we design features for Wrangler, the Cloudflare Workers CLI” discusses them in depth.

More to come

The Storage team is working hard at improving Workers KV, and we’ll keep shipping new features on a more regular cadence going forward. If there’s something you’d particularly like to see, please reach out!

Delegated Credentials for TLS

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/keyless-delegation/

Today we’re happy to announce support for a new cryptographic protocol that helps make it possible to deploy encrypted services in a global network while still maintaining fast performance and tight control of private keys: Delegated Credentials for TLS. We have been working with partners from Facebook, Mozilla, and the broader IETF community to define this emerging standard. We’re excited to share the gory details today in this blog post.

Deploying TLS globally

Many of the technical problems we face at Cloudflare are widely shared problems across the Internet industry. As gratifying as it can be to solve a problem for ourselves and our customers, it can be even more gratifying to solve a problem for the entire Internet. For the past three years, we have been working with peers in the industry to solve a specific shared problem in the TLS infrastructure space: How do you terminate TLS connections while storing keys remotely and maintaining performance and availability? Today we’re announcing that Cloudflare now supports Delegated Credentials, the result of this work.

Cloudflare’s TLS/SSL features are among the top reasons customers use our service. Configuring TLS is hard to do without internal expertise. By automating TLS, web site and web service operators gain the latest TLS features and the most secure configurations by default. It also reduces the risk of outages or bad press due to misconfigured or insecure encryption settings. Customers also gain early access to unique features like TLS 1.3, post-quantum cryptography, and OCSP stapling as they become available.

Unfortunately, for web services to authorize a service to terminate TLS for them, they have to trust the service with their private keys, which demands a high level of trust. For services with a global footprint, there is an additional level of nuance. They may operate multiple data centers located in places with varying levels of physical security, and each of these needs to be trusted to terminate TLS.

To tackle these problems of trust, Cloudflare has invested in two technologies: Keyless SSL, which allows customers to use Cloudflare without sharing their private key with Cloudflare; and Geo Key Manager, which allows customers to choose the data centers in which Cloudflare should keep their keys. Both technologies can be deployed without any changes to browsers or other clients, but they come with downsides in the form of availability and performance degradation.

Keyless SSL introduces extra latency at the start of a connection. In order for a server without access to a private key to establish a connection with a client, that server needs to reach out to a key server, or a remote point of presence, and ask it to perform a private key operation. This not only adds latency to the connection, causing the content to load slower, but it also introduces some troublesome operational constraints on the customer. Specifically, the server with access to the key needs to be highly available or the connection can fail. Sites often use Cloudflare to improve their availability, so having to run a high-availability key server is an unwelcome requirement.

Turning a pull into a push

The reason services like Keyless SSL that rely on remote keys are so brittle is their architecture: they are pull-based rather than push-based. Every time a client attempts a handshake with a server that doesn’t have the key, it needs to pull the authorization from the key server. An alternative way to build this sort of system is to periodically push a short-lived authorization key to the server and use that for handshakes. Switching from a pull-based model to a push-based model eliminates the additional latency, but it comes with additional requirements, including the need to change the client.

Enter the new TLS feature of Delegated Credentials (DCs). A delegated credential is a short-lasting key that the certificate’s owner has delegated for use in TLS. They work like a power of attorney: your server authorizes our server to terminate TLS for a limited time. When a browser that supports this protocol connects to our edge servers we can show it this “power of attorney”, instead of needing to reach back to a customer’s server to get it to authorize the TLS connection. This reduces latency and improves performance and reliability.

The pull model

The push model

A fresh delegated credential can be created and pushed out to TLS servers long before the previous credential expires. Momentary blips in availability will not lead to broken handshakes for clients that support delegated credentials. Furthermore, a Delegated Credentials-enabled TLS connection is just as fast as a standard TLS connection: there’s no need to connect to the key server for every handshake. This removes the main drawback of Keyless SSL for DC-enabled clients.

Delegated credentials are intended to be an Internet Standard RFC that anyone can implement and use, not a replacement for Keyless SSL. Since browsers will need to be updated to support the standard, proprietary mechanisms like Keyless SSL and Geo Key Manager will continue to be useful. Delegated credentials aren’t just useful in our context, which is why we’ve developed the standard openly and with contributions from across industry and academia. Facebook has integrated them into their own TLS implementation, and you can read more about how they view the security benefits here. When it comes to improving the security of the Internet, we’re all on the same team.

"We believe delegated credentials provide an effective way to boost security by reducing certificate lifetimes without sacrificing reliability. This will soon become an Internet standard and we hope others in the industry adopt delegated credentials to help make the Internet ecosystem more secure."

Subodh Iyengar, software engineer at Facebook

Extensibility beyond the PKI

At Cloudflare, we’re interested in pushing the state of the art forward by experimenting with new algorithms. In TLS, there are three main areas of experimentation: ciphers, key exchange algorithms, and authentication algorithms. Ciphers and key exchange algorithms are only dependent on two parties: the client and the server. This freedom allows us to deploy exciting new choices like ChaCha20-Poly1305 or post-quantum key agreement in lockstep with browsers. On the other hand, the authentication algorithms used in TLS are dependent on certificates, which introduces certificate authorities and the entire public key infrastructure into the mix.

Unfortunately, the public key infrastructure is very conservative in its choice of algorithms, making it harder to adopt newer cryptography for authentication algorithms in TLS. For instance, EdDSA, a highly-regarded signature scheme, is not supported by certificate authorities, and root programs limit the certificates that will be signed. With the emergence of quantum computing, experimenting with new algorithms is essential to determine which solutions are deployable and functional on the Internet.

Since delegated credentials introduce the ability to use new authentication key types without requiring changes to certificates themselves, this opens up a new area of experimentation. Delegated credentials can be used to provide a level of flexibility in the transition to post-quantum cryptography, by enabling new algorithms and modes of operation to coexist with the existing PKI infrastructure. It also enables tiny victories, like the ability to use smaller, faster Ed25519 signatures in TLS.

Inside DCs

A delegated credential contains a public key and an expiry time. This bundle is signed with the certificate’s private key, over the credential and the certificate itself, binding the delegated credential to the certificate for which it is acting as “power of attorney”. A supporting client indicates its support for delegated credentials by including an extension in its Client Hello.

A server that supports delegated credentials composes the TLS Certificate Verify and Certificate messages as usual, but instead of signing with the certificate’s private key, it includes the certificate along with the DC, and signs with the DC’s private key. Therefore, the private key of the certificate only needs to be used for the signing of the DC.
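To sketch what checking a delegated credential involves, here is an illustrative JavaScript fragment using Node’s crypto module. The field names and byte layout are simplified stand-ins, not the draft’s actual wire format (which, among other things, includes extra context in the signed data):

const crypto = require("crypto")

// dc: { rawCredential, publicKey, expiry, signature }
// cert: { raw, publicKey } -- illustrative shapes, not real TLS objects.
function validateDelegatedCredential(dc, cert, now = new Date()) {
  // The credential must still be inside its short validity window.
  if (now > dc.expiry) throw new Error("delegated credential expired")

  // The "power of attorney" step: the credential must be signed by the
  // certificate's own key. (A null digest works for Ed25519 key objects.)
  const signed = Buffer.concat([cert.raw, dc.rawCredential])
  if (!crypto.verify(null, signed, cert.publicKey, dc.signature)) {
    throw new Error("credential not signed by the certificate's key")
  }

  // Handshake signatures are then verified against the DC's key instead.
  return dc.publicKey
}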

Certificates used for signing delegated credentials require a special X.509 certificate extension. This requirement exists to avoid breaking assumptions people may have about the impact of temporary access to their keys on security, particularly in cases involving HSMs and the still-unfixed Bleichenbacher oracles in older TLS versions. Temporary access to a key can enable signing lots of delegated credentials that start far in the future, so support was made opt-in. Early versions of QUIC had similar issues, and ended up adopting TLS to fix them. Protocol evolution on the Internet requires working well with already existing protocols and their flaws.

Delegated Credentials at Cloudflare and Beyond

Currently we use delegated credentials as a performance optimization for Geo Key Manager and Keyless SSL. Customers can update their certificates to include the special extension for delegated credentials, and we will automatically create delegated credentials and distribute them to the edge through the Keyless SSL or Geo Key Manager. For more information, see the documentation. It also enables us to be more conservative about where we keep keys for customers, improving our security posture.

Delegated Credentials would be useless if they weren’t also supported by browsers and other HTTP clients. Christopher Patton, a former intern at Cloudflare, implemented support in Firefox and its underlying NSS security library. This feature is now in the Nightly versions of Firefox. You can turn it on by activating the configuration option security.tls.enable_delegated_credentials at about:config. Studies are ongoing on how effective this will be in a wider deployment. There is also support for Delegated Credentials in BoringSSL.

"At Mozilla we welcome ideas that help to make the Web PKI more robust. The Delegated Credentials feature can help to provide secure and performant TLS connections for our users, and we’re happy to work with Cloudflare to help validate this feature."

Thyla van der Merwe, Cryptography Engineering Manager at Mozilla

One open issue is the question of client clock accuracy. Until we have a wide-scale study, we won’t know how many connections using delegated credentials will break because of the 24-hour time limit that is imposed. Some clients, in particular mobile clients, may have inaccurately set clocks, the root cause of one third of all certificate errors in Chrome. Part of the way that we’re aiming to solve this problem is through standardizing and improving Roughtime, so web browsers and other services that need to validate certificates can do so independent of the client clock.

Cloudflare’s global scale means that we see connections from every corner of the world, and from many different kinds of connection and device. That reach enables us to find rare problems with the deployability of protocols. For example, our early deployment helped inform the development of the TLS 1.3 standard. As we enable developing protocols like delegated credentials, we learn about obstacles that inform and affect their future development.

Conclusion

As new protocols emerge, we’ll continue to play a role in their development and bring their benefits to our customers. Today’s announcement of a technology that overcomes some limitations of Keyless SSL is just one example of how Cloudflare takes part in improving the Internet not just for our customers, but for everyone. During the standardization process of turning the draft into an RFC, we’ll continue to maintain our implementation and come up with new ways to apply delegated credentials.

Announcing cfnts: Cloudflare’s implementation of NTS in Rust

Post Syndicated from Watson Ladd original https://blog.cloudflare.com/announcing-cfnts/

Several months ago we announced that we were providing a new public time service. Part of what we were providing was the first major deployment of the new Network Time Security (NTS) protocol, with a newly written implementation of NTS in Rust. In the process, we received helpful advice from the NTP community, especially from the NTPSec and Chrony projects. We’ve also participated in several interoperability events. Now we are returning something to the community: Our implementation, cfnts, is now open source and we welcome your pull requests and issues.

The journey from a blank source file to a working, deployed service was a lengthy one, and it involved many people across multiple teams.


"Correct time is a necessity for most security protocols in use on the Internet. Despite this, secure time transfer over the Internet has previously required complicated configuration on a case by case basis. With the introduction of NTS, secure time synchronization will finally be available for everyone. It is a small, but important, step towards increasing security in all systems that depend on accurate time. I am happy that Cloudflare are sharing their NTS implementation. A diversity of software with NTS support is important for quick adoption of the new protocol."

Marcus Dansarie, coauthor of the NTS specification


How NTS works

NTS is structured as a suite of two sub-protocols as shown in the figure below. The first is the Network Time Security Key Exchange (NTS-KE), which is always conducted over Transport Layer Security (TLS) and handles the creation of key material and parameter negotiation for the second protocol. The second is NTPv4, the current version of the NTP protocol, which allows the client to synchronize their time from the remote server.

In order to maintain the scalability of NTPv4, it was important that the server not maintain per-client state. A very small server can serve millions of NTP clients. Maintaining this property while providing security is achieved with cookies that the server provides to the client that contain the server state.
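As a toy illustration of that idea in JavaScript (Node’s crypto module), a server could seal the per-client state under a secret master key and later recover it from the cookie alone. Real NTS cookie formats, including cfnts’s, are server-specific and handle key rotation, so every detail below is illustrative:

const crypto = require("crypto")

// Seal the two AEAD keys into an opaque cookie: IV || auth tag || ciphertext.
// masterKey must be 32 bytes for AES-256-GCM.
function makeCookie(masterKey, c2s, s2c) {
  const iv = crypto.randomBytes(12)
  const cipher = crypto.createCipheriv("aes-256-gcm", masterKey, iv)
  const sealed = Buffer.concat([
    cipher.update(Buffer.concat([c2s, s2c])),
    cipher.final(),
  ])
  return Buffer.concat([iv, cipher.getAuthTag(), sealed])
}

// Recover the client's keys from the cookie alone: no per-client state.
function openCookie(masterKey, cookie) {
  const decipher = crypto.createDecipheriv("aes-256-gcm", masterKey, cookie.subarray(0, 12))
  decipher.setAuthTag(cookie.subarray(12, 28))
  const keys = Buffer.concat([decipher.update(cookie.subarray(28)), decipher.final()])
  return { c2s: keys.subarray(0, 32), s2c: keys.subarray(32) }
}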

In the first stage, the client sends a request to the NTS-KE server and gets a response via TLS. This exchange carries out a number of functions:

  • Negotiates the AEAD algorithm to be used in the second stage.
  • Negotiates the second protocol. Currently, the standard only defines how NTS works with NTPv4.
  • Negotiates the NTP server IP address and port.
  • Creates cookies for use in the second stage.
  • Creates two symmetric keys (C2S and S2C) from the TLS session via exporters.

In the second stage, the client securely synchronizes the clock with the negotiated NTP server. To synchronize securely, the client sends NTPv4 packets with four special extensions:

  • Unique Identifier Extension contains a random nonce used to prevent replay attacks.
  • NTS Cookie Extension contains one of the cookies that the client stores. Since currently only the client remembers the two AEAD keys (C2S and S2C), the server needs to use the cookie from this extension to extract the keys. Each cookie contains the keys encrypted under a secret key the server has.
  • NTS Cookie Placeholder Extension is a signal from the client to request additional cookies from the server. This extension is needed to make sure that the response is not much longer than the request to prevent amplification attacks.
  • NTS Authenticator and Encrypted Extension Fields Extension contains a ciphertext from the AEAD algorithm with C2S as a key and with the NTP header, timestamps, and all the previously mentioned extensions as associated data. Other possible extensions can be included as encrypted data within this field. Without this extension, the timestamp can be spoofed.

After getting a request, the server sends a response back to the client echoing the Unique Identifier Extension to prevent replay attacks, the NTS Cookie Extension to provide the client with more cookies, and the NTS Authenticator and Encrypted Extension Fields Extension with an AEAD ciphertext with S2C as a key. But in the server response, instead of sending the NTS Cookie Extension in plaintext, it needs to be encrypted with the AEAD to provide unlinkability of the NTP requests.

The second stage can be repeated many times without going back to the first, since each request and response gives the client a new cookie. The expensive public key operations in TLS are thus amortized over a large number of requests. Furthermore, specialized timekeeping devices like FPGA implementations only need to implement a few symmetric cryptographic functions and can delegate the complex TLS stack to a different device.

Why Rust?

While many of our services are written in Go, and the Crypto team has considerable experience with Go, a garbage collection pause in the middle of responding to an NTP packet would negatively impact accuracy. We picked Rust because of its zero-overhead approach to safety and its useful features:

  • Memory safety After Heartbleed, Cloudbleed, and the steady drip of vulnerabilities caused by C’s lack of memory safety, it’s clear that C is not a good choice for new software dealing with untrusted inputs. The obvious route to memory safety is garbage collection, but garbage collection imposes substantial runtime overhead; Rust achieves memory safety with far less.
  • Non-nullability Null pointers are an edge case that is frequently not handled properly. Rust explicitly marks optionality, so all references in Rust can be safely dereferenced. The type system ensures that option types are properly handled.
  • Thread safety  Data-race prevention is another key feature of Rust. Rust’s ownership model ensures that all cross-thread accesses are synchronized by default. While not a panacea, this eliminates a major class of bugs.
  • Immutability Separating types into mutable and immutable is very important for reducing bugs. For example, in Java, when you pass an object into a function as a parameter, after the function is finished, you will never know whether the object has been mutated or not. Rust allows you to pass the object reference into the function and still be assured that the object is not mutated.
  • Error handling  Rust result types help with ensuring that operations that can produce errors are identified and a choice made about the error, even if that choice is passing it on.

While Rust provides safety with zero overhead, coding in Rust involves understanding linear types and, for us, learning a new language. In this case the importance of security and performance meant we chose Rust over the potentially easier path of writing it in Go.

Dependencies we use

Because of our scale and for DDoS protection we needed a highly scalable server. For UDP protocols without the concept of a connection, the server can respond to one packet at a time easily, but for TCP this is more complex. Originally we thought about using Tokio. However, at the time Tokio suffered from scheduler problems that had caused other teams some issues. As a result we decided to use Mio directly, basing our work on the examples in Rustls.

We decided to use Rustls over OpenSSL or BoringSSL because of the crate’s consistent error codes and default support for authentication that is difficult to disable accidentally. While there are some features that are not yet supported, it got the job done for our service.

Other engineering choices

More important than our choice of programming language was our implementation strategy. A working, fully featured NTP implementation is a complicated program involving a phase-locked loop. These have a difficult reputation due to their nonlinear nature, beyond the usual complexities of closed-loop control. The response of a phase-locked loop to a disturbance can be estimated if the loop is locked and the disturbance small. However, lock acquisition, large disturbances, and the necessary filtering in NTP are all hard to analyze mathematically, since they are not captured in the linear models applied for small-scale analysis. While NTP works with the total phase, unlike the phase-locked loops of electrical engineering, there are still nonlinear elements. For NTP, testing changes to this loop requires weeks of operation to determine the performance, as the loop responds very slowly.

Computer clocks are generally accurate over short periods, while networks are plagued with inconsistent delays. This demands a slow response. Changes we make to our service have taken hours to have an effect, as the clients slowly adapt to the new conditions. While RFC 5905 provides lots of details on an algorithm to adjust the clock, later implementations such as chrony have improved upon the algorithm through much more sophisticated nonlinear filters.

Rather than implement these more sophisticated algorithms, we let chrony adjust the clock of our servers, and copy the state variables in the header from chrony and adjust the dispersion and root delay according to the formulas given in the RFC. This strategy let us focus on the new protocols.

Prague

Part of what the Internet Engineering Task Force (IETF) does is organize events like hackathons where implementers of a new standard can get together and try to make their stuff work with one another. This exposes bugs and infelicities of language in the standard and the implementations. We attended the IETF 104 hackathon to develop our server and make it work with other implementations. The NTP working group members were extremely generous with their time, and during the process we uncovered a few issues relating to the exact way one has to handle ALPN with older OpenSSL versions.

At the IETF 104 in Prague we had a working client and server for NTS-KE by the end of the hackathon. This was a good amount of progress considering we started with nothing. However, without implementing NTP we didn’t actually know that our server and client were computing the right thing. That would have to wait for later rounds of testing.

Wireshark during some NTS debugging

Crypto Week

As Crypto Week 2019 approached we were busily writing code. All of the NTP protocol had to be implemented, together with the connection between the NTP and NTS-KE parts of the server. We also had to deploy processes to synchronize the ticket encrypting keys around the world and work on reconfiguring our own timing infrastructure to support this new service.

With a few weeks to go we had a working implementation, but we needed servers and clients out there to test with. And because our server only supports TLS 1.3, which had only just landed in OpenSSL, there were some compatibility problems.

We ended up compiling a chrony branch with NTS support and NTPsec ourselves and testing against time.cloudflare.com. We also tested our client against test servers set up by the chrony and NTPsec projects, in the hopes that this would expose bugs and have our implementations work nicely together. After a few lengthy days of debugging, we found out that our nonce length wasn’t exactly in accordance with the spec, which was quickly fixed. The NTPsec project was extremely helpful in this effort. Of course, this was the day that our office had a blackout, so the testing happened outside in Yerba Buena Gardens.

Yerba Buena commons. Taken by Wikipedia user Beyond My Ken. CC-BY-SA

During the deployment of time.cloudflare.com, we had to open up our firewall to incoming NTP packets. Because of NTP reflection attacks, we had closed UDP port 123 on the router since the start of Cloudflare’s network. Since clients sometimes send NTP packets from source port 123, it’s impossible for NTP servers to filter reflection attacks without parsing the contents of the NTP packet, which routers have difficulty doing. In order to protect Cloudflare infrastructure we got an entire subnet just for the time service, so it could be aggressively throttled and rerouted in case of massive DDoS attacks. This is an exceptional case: most edge services at Cloudflare run on every available IP.

Bug fixes

Shortly after the public launch, we discovered that older Windows versions shipped with NTP version 3, and our server only spoke version 4. This was easy to fix, since the timestamps have not moved between NTP versions: we echo the version back, and most of the NTP version 3 clients still out there will understand the response.

Also tricky was the failure of Network Time Foundation ntpd clients to expand the polling interval. It turns out that one has to echo back the client’s polling interval for it to expand. Chrony does not use the polling interval from the server, and so was not affected by this incompatibility.

Both of these issues were fixed in ways suggested by other NTP implementers who had run into these problems themselves. We thank Miroslav Lichter tremendously for telling us exactly what the problem was, and the members of the Cloudflare community who posted packet captures demonstrating these issues.

Continued improvement

The original production version of cfnts was not particularly object oriented, and several contributors were just learning Rust. As a result there was quite a bit of unwrap and unnecessary mutability flying around. Much of the code lived in free functions even when it could profitably be attached to structures. All of this had to be restructured. Keep in mind that some of the best code running in the real world has been written, rewritten, and sometimes rewritten again! This is actually a good thing.

As an internal project we relied on Cloudflare’s internal tooling for building, testing, and deploying code. These were replaced with tools available to everyone like Docker to ensure anyone can contribute. Our repository is integrated with Circle CI, ensuring that all contributions are automatically tested. In addition to unit tests we test the entire end to end functionality of getting a measurement of the time from a server.

The Future

NTPsec has already released support for NTS but we see very little usage. Please try turning on NTS if you use NTPsec and see how it works with time.cloudflare.com.  As the draft advances through the standards process the protocol will undergo an incompatible change when the identifiers are updated and assigned out of the IANA registry instead of being experimental ones, so this is very much an experiment. Note that your daemon will need TLS 1.3 support and so could require manually compiling OpenSSL and then linking against it.

We’ve also added our time service to the public NTP pool. The NTP pool is a widely used volunteer-maintained service that provides NTP servers geographically spread across the world. Unfortunately, NTS doesn’t currently work well with the pool model, so for the best security, we recommend enabling NTS and using time.cloudflare.com and other NTS supporting servers.

In the future, we’re hoping that more clients support NTS, and have licensed our code liberally to enable this. We would love to hear if you incorporate it into a product and welcome contributions to make it more useful.

We’re also encouraged to see that Netnod has a production NTS service at nts.ntp.se. The more time services and clients that adopt NTS, the more secure the Internet will be.

Acknowledgements

Tanya Verma and Gabbi Fisher were major contributors to the code, especially the configuration system and the client code. We’d also like to thank Gary Miller, Miroslav Lichter, and all the people at Cloudflare who set up their laptops and home machines to point to time.cloudflare.com for early feedback.

Public keys are not enough for SSH security

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/public-keys-are-not-enough-for-ssh-security/

If your organization uses SSH public keys, it’s entirely possible you have already mislaid one. There is a file sitting in a backup or on a former employee’s computer which grants the holder access to your infrastructure. If you share SSH keys between employees, it’s likely only a few keys are enough to give an attacker access to your entire system. If you don’t share them, it’s likely your team has generated so many keys that you lost track of at least one long ago.

If an attacker can breach a single one of your client devices it’s likely there is a known_hosts file which lists every target which can be trivially reached with the keys the machine already contains. If someone is able to compromise a team member’s laptop, they could use keys on the device that lack password protection to reach sensitive destinations.

Should that happen, how would you respond and revoke the lost SSH key? Do you have an accounting of the keys which have been generated? Do you rotate SSH keys? How do you manage that across an entire organization so consumed with serving customers that security has to be effortless to be adopted?

Cloudflare Access launched support for SSH connections last year to bring zero-trust security to how teams connect to infrastructure. Access integrates with your IdP to bring SSO security to SSH connections by enforcing identity-based rules each time a user attempts to connect to a target resource.

However, once Access connected users to the server they still had to rely on legacy SSH keys to authorize their account. Starting today, we’re excited to help teams remove that requirement and replace static SSH keys with short-lived certificates.

Replacing a private network with Cloudflare Access

In traditional network perimeter models, teams secure their infrastructure with two gates: a private network and SSH keys.

The private network requires that any user attempting to connect to a server must be on the same network, or a peered equivalent (such as a VPN). However, that introduces some risk. Private networks default to trust that a user on the network can reach a machine. Administrators must proactively segment the network or secure each piece of the infrastructure with control lists to work backwards from that default.

Cloudflare Access secures infrastructure by starting from the other direction: no user should be trusted. Instead, users must prove they should be able to access any unique machine or destination by default.

We released support for SSH connections in Cloudflare Access last year to help teams leave that network perimeter model and replace it with one that evaluates every request to a server for user identity. Through integration with popular identity providers, that solution also gives teams the ability to bring their SSO pipeline into their SSH flow.

Replacing static SSH keys with short-lived certificates

Once a user is connected to a server over SSH, they typically need to authorize their session. The machine they are attempting to reach will have a set of profiles consisting of user or role identities. Those profiles define what actions the user can take.

SSH processes make a few options available for the user to log in to a profile. In some cases, users can log in with a username and password combination. However, most teams rely on public-private key pairs to handle that login. To use that flow, administrators and users need to take prerequisite steps.

Prior to the connection, the user will generate a certificate and provide the public key to an administrator, who will then configure the server to trust the certificate and associate it with a certain user and set of permissions. The user stores that certificate on their device and presents it during that last mile. However, this leaves open all of the problems that SSO attempts to solve:

  • Most teams never force users to rotate certificates. If they do, it might be required once a year at most. This leaves static credentials to core infrastructure lingering on hundreds or thousands of devices.
  • Users are responsible for securing their certificates on their devices. Users are also responsible for passwords, but organizations can enforce requirements and revocation centrally.
  • Revocation is difficult. Teams must administer a CRL or OCSP platform to ensure that lost or stolen certificates are not used.

With Cloudflare Access, you can bring your SSO accounts to user authentication within your infrastructure. No static keys required.

How does it work?

To build this we turned to three tools we already had: Cloudflare Access, Argo Tunnel and Workers.

Access is a policy engine which combines the employee data in your identity provider (like Okta or AzureAD) with policies you craft. Based on those policies Access is able to limit access to your internal applications to the users you choose. It’s not a far leap to see how the same policy concept could be used to control access to a server over SSH. You write a policy and we use it to decide which of your employees should be able to access which resources. Then we generate a short-lived certificate allowing them to access that resource for only the briefest period of time. If you remove a user from your IdP, their access to your infrastructure is similarly removed, seamlessly.

To actually funnel the traffic through our network we use another existing Cloudflare tool: Argo Tunnel. Argo Tunnel flips the traditional model of connecting a server to the Internet. When you spin up our daemon on a machine it makes outbound connections to Cloudflare, and all of your traffic then flows over those connections. This allows the machine to be a part of Cloudflare’s network without you having to expose the machine to the Internet directly.

For HTTP use cases, Argo Tunnel only needs to run on the server. In the case of the Access SSH flow, we proxy SSH traffic through Cloudflare by running the Argo Tunnel client, cloudflared, on both the server and the end user’s laptop.

When users connect over SSH to a resource secured by Access for Infrastructure, they use the command-line tool cloudflared. cloudflared takes the SSH traffic bound for that hostname and forwards it through Cloudflare based on SSH config settings. No piping or command wrapping required. cloudflared launches a browser window and prompts the user to authenticate with their SSO credentials.

Once authenticated, Access checks the user’s identity against the policy you have configured for that application. If the user is permitted to reach the resource, Access generates a JSON Web Token (JWT), signed by Cloudflare and scoped to the user and application. Access distributes that token to the user’s device through cloudflared and the tool stores it locally.

Like the core Access authentication flow, the token validation is built with a Cloudflare Worker running in every one of our data centers, making it both fast and highly available. Workers made it possible for us to deploy this SSH proxying to all 194 of Cloudflare’s data centers, meaning Access for Infrastructure often speeds up SSH sessions rather than slowing them down.

With short-lived certificates enabled, the instance of cloudflared running on the client takes one additional step. cloudflared sends that token to a Cloudflare certificate signing endpoint that creates an ephemeral certificate. The user’s SSH flow then sends both the token, which is used to authenticate through Access, and the short-lived certificate, which is used to authenticate to the server.

When the server receives the request, it validates the short-lived certificate against the signing public key it has been configured to trust and, if the certificate is authentic, authorizes the user identity to a matching Unix user. The certificate, once issued, is valid for two minutes, but the SSH connection can last longer once the session has started.

What is the end user experience?

Cloudflare Access’ SSH feature is entirely transparent to the end user and does not require any unique SSH commands, wrappers, or flags. Instead, Access requires that your team members take a couple of one-time steps to get started:

1. Install the cloudflared daemon

The same lightweight software that runs on the target server is used to proxy SSH connections from your team members’ devices through Cloudflare. Users can install it with popular package managers like brew or at the link available here. Alternatively, the software is open-source and can be built and distributed by your administrators.

2. Print and save the SSH configuration update

Once an end user has installed cloudflared, they need to run one command to generate new lines to add to their SSH config file:

cloudflared access ssh-config --hostname vm.example.com --short-lived-cert

The --hostname field will contain the hostname or wildcard subdomain of the resource protected behind Access. Once run, cloudflared will print the following configuration details:

Host vm.example.com
    ProxyCommand bash -c '/usr/local/bin/cloudflared access ssh-gen --hostname %h; ssh -tt %r@cfpipe-%h >&2 <&1'

Host cfpipe-vm.example.com
    HostName vm.example.com
    ProxyCommand /usr/local/bin/cloudflared access ssh --hostname %h
    IdentityFile ~/.cloudflared/vm.example.com-cf_key
    CertificateFile ~/.cloudflared/vm.example.com-cf_key-cert.pub

Users need to append that output to their SSH config file. Once saved, they can connect over SSH to the protected resource. Access will prompt them to authenticate with their SSO credentials in the browser, in the same way they login to any other browser-based tool. If they already have an active browser session with their credentials, they’ll just see a success page.

In their terminal, cloudflared will establish the session and issue the client certificate that corresponds to their identity.


What’s next?

With short-lived certificates, Access can become a single SSO-integrated gateway for your team and infrastructure in any environment. Users can SSH directly to a given machine and administrators can replace their jumphosts altogether, removing that overhead. The feature is available today for all Access customers. You can get started by following the documentation available here.

Tales from the Crypt(o team)

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/tales-from-the-crypt-o-team/

Halloween season is upon us. This week we’re sharing a series of blog posts about work being done at Cloudflare involving cryptography, one of the spookiest technologies around. So bookmark this page and come back every day for tricks, treats, and deep technical content.

A long-term mission

Cryptography is one of the most powerful technological tools we have, and Cloudflare has been at the forefront of using cryptography to help build a better Internet. Of course, we haven’t been alone on this journey. Making meaningful changes to the way the Internet works requires time, effort, experimentation, momentum, and willing partners. Cloudflare has been involved with several multi-year efforts to leverage cryptography to help make the Internet better.

Here are some highlights to expect this week:

  • We’re renewing Cloudflare’s commitment to privacy-enhancing technologies by sharing some of the recent work being done on Privacy Pass
  • We’re helping forge a path to a quantum-safe Internet by sharing some of the results of the Post-quantum Cryptography experiment
  • We’re sharing the Rust-based software we use to power time.cloudflare.com
  • We’re doing a deep dive into the technical details of Encrypted DNS
  • We’re announcing support for a new technique we developed with industry partners to help keep TLS private keys more secure

The milestones we’re sharing this week would not be possible without partnerships with companies, universities, and individuals working in good faith to help build a better Internet together. Hopefully, this week provides a fun peek into the future of the Internet.

Birthday Week 2019 Wrap-up

Post Syndicated from Jake Anderson original https://blog.cloudflare.com/birthday-week-2019-wrap-up/

This week we celebrated Cloudflare’s 9th birthday by launching a variety of new offerings that support our mission: to help build a better Internet.  Below is a summary recap of how we celebrated Birthday Week 2019.

Cleaning up bad bots

Every day Cloudflare protects over 20 million Internet properties from malicious bots, and this week you were invited to join in the fight!  Now you can enable “bot fight mode” in the Firewall settings of the Cloudflare Dashboard and we’ll start deploying CPU intensive code to traffic originating from malicious bots.  This wastes the bots’ CPU resources and makes it more difficult and costly for perpetrators to deploy malicious bots at scale. We’ll also share the IP addresses of malicious bot traffic with our Bandwidth Alliance partners, who can help kick malicious bots offline. Join us in the battle against bad bots – and, as you can read here – you can help the climate too!

Browser Insights

Speed matters, and if you manage a website or app, you want to make sure that you’re delivering a high performing website to all of your global end users. Now you can enable Browser Insights in the Speed section of the Cloudflare Dashboard to analyze website performance from the perspective of your users’ web browsers.  

WARP, the wait is over

Several months ago we announced WARP, a free mobile app purpose-built to address the security and performance challenges of the mobile Internet, while also respecting user privacy.  After months of testing and development, this week we (finally) rolled out WARP to approximately 2 million wait-list customers.  We also enabled WARP Plus, a WARP experience that uses Argo routing technology to route your mobile traffic across faster, less-congested routes through the Internet.  WARP and WARP Plus (WARP+) are now available in the iOS and Android app stores and we can’t wait for you to give them a try!

HTTP/3 Support

Last year we announced early support for QUIC, a UDP based protocol that aims to make everything on the Internet work faster, with built-in encryption. The IETF subsequently decided that QUIC should be the foundation of the next generation of the HTTP protocol, HTTP/3. This week, Cloudflare was the first to introduce support for HTTP/3 in partnership with Google Chrome and Mozilla.

Workers Sites

Finally, to wrap up our birthday week announcements, we announced Workers Sites. The Workers serverless platform continues to grow and evolve, and every day we discover new and innovative ways to help developers build and optimize their applications. Workers Sites enables developers to easily deploy lightweight static sites across Cloudflare’s global cloud platform without having to build out the traditional backend server infrastructure to support these sites.

We look forward to Birthday Week every year, as a chance to showcase some of our exciting new offerings — but we all know building a better Internet is about more than one week.  It’s an effort that takes place all year long, and requires the help of our partners, employees and especially you — our customers. Thank you for being a customer, providing valuable feedback and helping us stay focused on our mission to help build a better Internet.

Can’t get enough of this week’s announcements, or want to learn more? Register for next week’s Birthday Week Recap webinar to get the inside scoop on every announcement.

Workers Sites: Extending the Workers platform with our own serverless building blocks

Post Syndicated from Ashley Williams original https://blog.cloudflare.com/extending-the-workers-platform/

As of today, with the Wrangler CLI, you can now deploy entire websites directly to Cloudflare Workers and Workers KV. If you can statically generate the assets for your site, think create-react-app, Jekyll, or even the WP2Static plugin, you can deploy it to our entire global network, which spans 194 cities in more than 90 countries.

While you could deploy an entire site directly to Workers before, it wasn’t the easiest process. So, the Workers Developer Experience Team came up with a solution to make deploying static assets a significantly better experience.

Using our Workers command-line tool Wrangler, we’ve made it possible to deploy any static site to Workers in three easy steps: run wrangler init --site, configure the newly created wrangler.toml file with your account and project details, and then publish it to Cloudflare’s edge with wrangler publish. If you want to explore how this works, check out our new Workers Sites tutorial for create-react-app, where we cover how this new functionality allows you to deploy without needing to write any additional code!

While in hindsight the path we took to get to this point might not seem the most straightforward, it really highlights the flexibility of the entire Workers platform to easily support use cases that we didn’t originally envision. With this in mind, I’ll walk you through the implementation and thinking we did to get to this point. I’ll also talk a bit about how the flexibility of the Workers platform has us excited, both for the ethos it represents, and the future it enables.

So, what went into building Workers Sites?

“Filesystem?! Where we’re going, we don’t need a filesystem!”

The Workers platform is built on V8 isolates, which, while awesome, lack a filesystem. If you’ve ever deployed a static site via FTP, uploaded it to object storage, or used a computer, you’d probably agree that filesystems are important. For many use cases, like building an API or routing, you don’t need a filesystem, but as the vision for Workers grew and our audience grew with it, it became clear to us that this was a limitation we needed to address for new features.

Welcome to the simulation

Without a filesystem, we decided to simulate one on top of Workers KV! Workers KV provides access to a secure key-value store that runs across Cloudflare’s Edge alongside Workers.

When running wrangler preview or wrangler publish, we check your wrangler.toml for the site key. The site key points to a bucket: the KV namespace we’ll use to store your static assets. We then upload each of your assets, where the path relative to the entry directory is the key, and the blob of the file is the value.

When a request from a user comes in, the Worker reads the request’s URI and looks up the asset that matches the segment requested. For example, if a user fetches “my-site.com/about.html”, the Worker looks up the “about.html” key in KV and returns the blob. Behind the scenes, we’ll also detect the mime-type of the requested asset and return the response with the correct content-type headers.
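In Worker terms, that lookup is roughly the following sketch. The binding name and mime-type table are stand-ins, and the asset manifest and caching described below are deliberately left out:

addEventListener("fetch", event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  // Map "/" to "index.html" and strip the leading slash to form the KV key.
  const path = new URL(request.url).pathname.slice(1) || "index.html"
  const body = await STATIC_CONTENT.get(path, "arrayBuffer")
  if (body === null) {
    return new Response("not found", { status: 404 })
  }
  // A tiny stand-in for real mime-type detection.
  const types = { html: "text/html", css: "text/css", png: "image/png" }
  const ext = path.split(".").pop()
  return new Response(body, {
    headers: { "content-type": types[ext] || "application/octet-stream" },
  })
}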

For folks who are used to building static sites or sites with a static asset serving component, this could feel deeply overengineered. Others may argue that, indeed, this is just how filesystems are built! The interesting thing for us is that we had to build one; there wasn’t one just waiting for us.

It was great that we could put this together with Workers KV, but we still had a problem…

Cache rules everything around me

Workers KV is a database, and so it’s set up for both read and write operations. However, it’s primarily tuned for read-heavy workloads on entries that don’t generally have a long life span. This works well for applications where data is accessed frequently and often updated. But, for static websites, assets are generally written once, and then they are never (or infrequently) written to again. Static site content should be cached for a very long time, if not forever (long live Space Jam). This means we need to cache data much longer than KV is used to.

To fix this, on publish or preview, Wrangler walks the entry directory you’ve declared in your wrangler.toml and creates an asset manifest: a map of your filenames to a hash of their contents. We use this asset manifest to map a request for a particular filename, say index.html, to the content hash of the most recently uploaded version of that asset.
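
Here’s a small Node.js sketch of that idea. The hashing scheme and the hash-suffixed key format are assumptions for illustration, not Wrangler’s exact output.

// Map each filename to a key that embeds a hash of its contents.
const crypto = require("crypto");
const fs = require("fs");

function buildAssetManifest(filenames) {
  const manifest = {};
  for (const name of filenames) {
    const contents = fs.readFileSync(name);
    const hash = crypto.createHash("sha256").update(contents).digest("hex").slice(0, 10);
    // "index.html" -> "index.5f2a9c01ab.html" (illustrative key format)
    manifest[name] = name.replace(/(\.\w+)$/, `.${hash}$1`);
  }
  return manifest;
}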

You may be familiar with the concept of an asset manifest from using tools like create-react-app. Asset manifests help maintain asset fingerprints for caching in the browser. We took this idea and implemented it in Workers Sites, so that we can leverage the edge cache as well!


This allows us, after the first read in each location, to cache the static assets in the Cloudflare cache so that they can be stored on the edge indefinitely. This reduces reads to KV to almost nothing; we use KV for durability, and the longer-lived edge cache for performance. Let’s dive into exactly what this looks like:

How it works

When a new asset is created, wrangler publish pushes the new asset to KV and deploys an asset manifest to the edge alongside your Worker.


When someone first accesses your page, the Cloudflare location closest to them will run your Worker. The Worker script will determine the content hash of the asset they’ve requested by looking up that asset in the asset manifest. It will use the filename and content hash as the key to fetch the asset’s contents from KV. At this time it will also insert the asset’s contents into Cloudflare’s edge cache, again keyed by filename and content hash. It will then respond to the request with the asset.


On subsequent requests, the Worker script will look up the content hash in the asset manifest, and check the cache to see if the asset is there. Since this is a subsequent request, it will find your asset in the cache on the edge and return a response containing the asset without having to fetch the asset contents from KV.


So what happens when you update your index.html, or any of your static assets? The process is very similar to what happens on the upload of a new asset. You run wrangler publish with your new asset on your local machine. Wrangler walks your asset directory and uploads the assets to KV. At the same time, it creates a new asset manifest containing the filename and a content hash representing the new contents of the asset. When a request comes into your Worker, your Worker looks in the asset manifest and retrieves the new content hash for that asset. It then looks in the cache for the new hash, misses, fetches the new asset from KV, populates the cache, and returns the new file to your end user.
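
Putting the whole read path together, a condensed sketch of the Worker logic described above might look like this. ASSET_MANIFEST and STATIC_CONTENT stand in for the manifest and namespace Wrangler generates, and Content-Type handling is omitted for brevity; this is a sketch under those assumptions, not the real Workers Sites code.

async function serveAsset(request) {
  const url = new URL(request.url);
  // Map the filename to its hash-suffixed KV key, e.g. "index.5f2a9c01ab.html"
  const key = ASSET_MANIFEST[url.pathname.slice(1) || "index.html"];
  const cache = caches.default;

  // Cache on filename + content hash: updated content gets a new key,
  // so stale entries simply stop being requested; no purging required.
  const cacheKey = new Request(`${url.origin}/${key}`, request);
  let response = await cache.match(cacheKey);
  if (!response) {
    const body = await STATIC_CONTENT.get(key, "arrayBuffer");
    response = new Response(body, {
      headers: { "Cache-Control": "public, max-age=31536000" },
    });
    await cache.put(cacheKey, response.clone());
  }
  return response;
}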

Edge caching happens per location across 194 cities around the world, ensuring that the most frequently accessed content on your page is cached in a location closest to those requesting content, reducing latency. All of this happens in *addition* to the browser cache, which means that your assets are nearly always incredibly close to end users!

By being on the edge, a Worker is in a unique position to cache not only static assets like JS, CSS, and images, but HTML assets too! Traditional static site solutions use your site’s HTML as the entry point to the static site generator’s asset manifest. If you cache the HTML itself, it becomes impossible to bust that cache, because there is no entry point for managing your assets’ fingerprints other than the HTML. In a Worker, however, the entry point is your *Worker*! We can leverage the Wrangler asset manifest to look up and serve accurate, cacheable HTML, while still cache busting on content hash.

Making the possible imaginable

“What we have is a crisis of imagination. Albert Einstein said that you cannot solve a problem with the same mind-set that created it.” – Peter Buffett

When building a brand new developer platform, there’s often a vast number of possible applications. However, the sheer number of possibilities often makes each one difficult to imagine. That’s why we think the most important part of any platform is its flexibility to adapt to previously unimagined use cases. And we don’t mean that just for us. It’s important that everyone has the ability to customize the platform to new and interesting use cases!

At face value, the work we did to implement this feature might seem like another solution for a previously solved problem. However, it’s a great example of how a group of dedicated developers can improve the platform experience for others.

We hope that by paving a way to include static assets in a Worker, developers can use the extra cognitive space to conceive of even more new ways to use Workers that may have been hard to imagine before.

Workers Sites isn’t the end goal, but a stepping stone to continue to think critically about what it means to build a web application. We’re excited to give developers the space to explore how simple static applications can grow and evolve when combined with the dynamic power of edge computing.

Go forth and build something awesome!


Have you built something interesting with Workers? Let us know @CloudflareDev!

Workers Sites: deploy your website directly to our network

Post Syndicated from Rita Kozlov original https://blog.cloudflare.com/workers-sites/



Performance on the web has always been a battle against the speed of light — accessing a site from London that is served from Seattle, WA means every single asset request has to travel over seven thousand miles. The first breakthrough in the web performance battle was HTTP/1.1 connection keep-alive and browsers opening multiple connections. The next breakthrough was the CDN, bringing your static assets closer to your end users by caching them in data centers closer to them. Today, with Workers Sites, we’re excited to announce the next big breakthrough — entire sites distributed directly onto the edge of the Internet.

Deploying to the edge of the network

Why isn’t just caching assets sufficient? Caching improves performance, but achieving significant improvement comes with a series of headaches. The CDN can make a guess at which assets it should cache, but that is just a guess. Configuring your site for maximum performance has always been an error-prone process, requiring a wide collection of esoteric rules and headers. Even when perfectly configured, almost nothing is cached forever, and precious requests still often need to travel all the way to your origin (wherever it may be). Cache invalidation is, after all, one of the hardest problems in computer science.

This raises the question: rather than clumsily moving bytes from the origin to the edge bit by bit, why not push the whole origin to the edge?

Workers Sites: Extending the Workers platform

Two years ago for Birthday Week, we announced Cloudflare Workers, a way for developers to write and run JavaScript and WebAssembly on our network in 194 cities around the world. A year later, we released Workers KV, our distributed key-value store that gave developers the ability to store state at the edge in those same cities.

Workers Sites leverages the power of Workers and Workers KV by allowing developers to upload their sites directly to the edge, closer to their end users. Born on the edge, Workers Sites is what we think modern development on the web should look like: natively secure, fast, and massively scalable. Less of your time is spent on configuration, and more on your code and content.

How it works

Workers Sites are deployed with a few terminal commands and can serve a site generated by any static site generator, such as Hugo, Gatsby, or Jekyll. Using Wrangler (our CLI), you can upload your site’s assets directly into KV. When a request hits your Workers Site, the Cloudflare Worker generated by Wrangler reads and serves the asset from KV with the appropriate headers (no need to worry about Content-Type or Cache-Control; we’ve got you covered).

Workers Sites can be used to deploy any static site: a blog, a marketing site, or a portfolio. If you ever decide your site needs to become a little less static, your Worker is just code: edit and extend it until you have a dynamic site running all around the world.

Getting started

To get started with Workers Sites, you first need to sign up for Workers. After selecting your workers.dev subdomain, choose the Workers Unlimited plan (starting at $5 / month) to get access to Workers KV and the ability to deploy Workers Sites.

After signing up for Workers Unlimited you’ll need to install the CLI for Workers, Wrangler. Wrangler can be installed either from NPM or Cargo:

# NPM Installation
npm i @cloudflare/wrangler -g
# Cargo Installation
cargo install wrangler

Once you install Wrangler, you are ready to deploy your static site in three steps:

  1. Run wrangler init --site in the directory that contains your static site’s built assets
  2. Fill in the newly created wrangler.toml file with your account and project details
  3. Publish your site with wrangler publish

You can also check out our Workers Sites reference documentation or follow the full tutorial for create-react-app in the docs.

If you’d prefer to get started by watching a video, we’ve got you covered! This video will walk you through creating and deploying your first Workers Site.


Blazing fast: from Atlanta to Zagreb

In addition to improving the developer experience, we did a lot of work behind the scenes making sure that both deploys and the sites themselves are blazing fast — we’re excited to share the how with you in our technical blog post.

To test the performance of Workers Sites, we took one of our personal sites and deployed it to run some benchmarks. This test was for our site, but your results may vary.

One common way to benchmark the performance of your site is using Google Lighthouse, which you can run directly from the Audits tab of your Chrome browser.


So we passed the first test with flying colors — 100! However, running a benchmark from your own computer introduces a bias: your users are not necessarily where you are. In fact, your users are increasingly not where you are.

Where you’re benchmarking from is really important: running tests from different locations will yield different results. Benchmarking from Seattle and hitting a server on the West coast says very little about your global performance.

We decided to use a tool called Catchpoint to run benchmarks from cities around the world. To see how we compare, we deployed the site to three different static site deployment platforms, including Workers Sites.

Since providers offer data center regions on the coasts of the United States or in central Europe, it’s common to see good performance in regions such as North America, and we’ve got you covered there:

[Benchmark results: North America]

But what about your users in the rest of the world? Performance is even more critical in those regions: those users are not going to be connecting to your site on a MacBook Pro over a blazing fast connection. Workers Sites allows you to reach those regions without any additional effort on your part — every time our map grows, your global presence grows with it.

We’ve done the work of running some benchmarks from different parts of the world for you, and we’re pleased to share the results:

[Benchmark results: regions around the world]

One last thing…

Deploying your next site with Workers Sites is easy and leads to great performance, so we thought it was only right that we deploy with Workers Sites ourselves. With this announcement, we are also open sourcing the Cloudflare Workers docs! And, they are now served from a Cloudflare data center near you using Workers Sites.

We can’t wait to see what you deploy with Workers Sites!


Have you built something interesting with Workers or Workers Sites? Let us know @CloudflareDev!

HTTP/3: the past, the present, and the future

Post Syndicated from Alessandro Ghedini original https://blog.cloudflare.com/http3-the-past-present-and-future/


During last year’s Birthday Week we announced preliminary support for QUIC and HTTP/3 (or “HTTP over QUIC” as it was known back then), the new standard for the web, enabling faster, more reliable, and more secure connections to web endpoints like websites and APIs. We also let our customers join a waiting list to try QUIC and HTTP/3 as soon as they became available.


Since then, we’ve been working with industry peers through the Internet Engineering Task Force, including Google Chrome and Mozilla Firefox, to iterate on the HTTP/3 and QUIC standards documents. In parallel with the standards maturing, we’ve also worked on improving support on our network.

We are now happy to announce that QUIC and HTTP/3 support is available on the Cloudflare edge network. We’re excited to be joined in this announcement by Google Chrome and Mozilla Firefox, two of the leading browser vendors and partners in our effort to make the web faster and more reliable for all.

In the words of Ryan Hamilton, Staff Software Engineer at Google, “HTTP/3 should make the web better for everyone. The Chrome and Cloudflare teams have worked together closely to bring HTTP/3 and QUIC from nascent standards to widely adopted technologies for improving the web. Strong partnership between industry leaders is what makes Internet standards innovations possible, and we look forward to our continued work together.”

What does this mean for you, a Cloudflare customer who uses our services and edge network to make your web presence faster and more secure? Once HTTP/3 support is enabled for your domain in the Cloudflare dashboard, your customers can interact with your websites and APIs using HTTP/3. We’ve been steadily inviting customers on our HTTP/3 waiting list to turn on the feature (so keep an eye out for an email from us), and in the coming weeks we’ll make the feature available to everyone.

What does this announcement mean if you’re a user of the Internet interacting with sites and APIs through a browser and other clients? Starting today, you can use Chrome Canary to interact with Cloudflare and other servers over HTTP/3. For those of you looking for a command line client, curl also provides support for HTTP/3. Instructions for using Chrome and curl with HTTP/3 follow later in this post.

The Chicken and the Egg

Standards innovation on the Internet has historically been difficult because of a chicken and egg problem: which needs to come first, server support (like Cloudflare, or other large sources of response data) or client support (like browsers, operating systems, etc)? Both sides of a connection need to support a new communications protocol for it to be any use at all.

Cloudflare has a long history of driving web standards forward, from HTTP/2 (the version of HTTP preceding HTTP/3), to TLS 1.3, to things like encrypted SNI. We’ve pushed standards forward by partnering with like-minded organizations who share in our desire to help build a better Internet. Our efforts to move HTTP/3 into the mainstream are no different.

Throughout the HTTP/3 standards development process, we’ve been working closely with industry partners to build and validate client HTTP/3 support compatible with our edge support. We’re thrilled to be joined by Google Chrome and curl, both of which can be used today to make requests to the Cloudflare edge over HTTP/3. Mozilla Firefox expects to ship support in a nightly release soon as well.

Bringing this all together: today is a good day for Internet users; widespread rollout of HTTP/3 will mean a faster web experience for all, and today’s support is a large step toward that.

More importantly, today is a good day for the Internet: Chrome, curl, and Cloudflare (and soon Mozilla) rolling out experimental but functional support for HTTP/3 in quick succession shows that the Internet standards creation process works. Coordinated by the Internet Engineering Task Force, industry partners, competitors, and other key stakeholders can come together to craft standards that benefit the entire Internet, not just the behemoths.

Eric Rescorla, CTO of Firefox, summed it up nicely: “Developing a new network protocol is hard, and getting it right requires everyone to work together. Over the past few years, we’ve been working with Cloudflare and other industry partners to test TLS 1.3 and now HTTP/3 and QUIC. Cloudflare’s early server-side support for these protocols has helped us work the interoperability kinks out of our client-side Firefox implementation. We look forward to advancing the security and performance of the Internet together.”


How did we get here?

Before we dive deeper into HTTP/3, let’s have a quick look at the evolution of HTTP over the years in order to better understand why HTTP/3 is needed.

It all started back in 1996 with the publication of the HTTP/1.0 specification which defined the basic HTTP textual wire format as we know it today (for the purposes of this post I’m pretending HTTP/0.9 never existed). In HTTP/1.0 a new TCP connection is created for each request/response exchange between clients and servers, meaning that all requests incur a latency penalty as the TCP and TLS handshakes are completed before each request.


Worse still, rather than sending all outstanding data as fast as possible once the connection is established, TCP enforces a warm-up period called “slow start”, which allows the TCP congestion control algorithm to determine the amount of data that can be in flight at any given moment before congestion on the network path occurs, and avoid flooding the network with packets it can’t handle. But because new connections have to go through the slow start process, they can’t use all of the network bandwidth available immediately.

The HTTP/1.1 revision of the HTTP specification tried to solve these problems a few years later by introducing the concept of “keep-alive” connections, which allow clients to reuse TCP connections and thus amortize the cost of the initial connection establishment and slow start across multiple requests. But this was no silver bullet: while multiple requests could share the same connection, they still had to be serialized one after the other, so a client and server could only execute a single request/response exchange at any given time for each connection.

As the web evolved, browsers found themselves needing more and more concurrency when fetching and rendering web pages as the number of resources (CSS, JavaScript, images, …) required by each web site increased over the years. But since HTTP/1.1 only allowed clients to do one HTTP request/response exchange at a time, the only way to gain concurrency at the network layer was to use multiple TCP connections to the same origin in parallel, thus losing most of the benefits of keep-alive connections. While connections would still be reused to a certain (but lesser) extent, we were back at square one.

Finally, more than a decade later, came SPDY and then HTTP/2, which, among other things, introduced the concept of HTTP “streams”: an abstraction that allows HTTP implementations to concurrently multiplex different HTTP exchanges onto the same TCP connection, allowing browsers to more efficiently reuse TCP connections.


But, yet again, this was no silver bullet! HTTP/2 solves the original problem — inefficient use of a single TCP connection — since multiple requests/responses can now be transmitted over the same connection at the same time. However, all requests and responses are equally affected by packet loss (e.g. due to network congestion), even if the data that is lost only concerns a single request. This is because while the HTTP/2 layer can segregate different HTTP exchanges on separate streams, TCP has no knowledge of this abstraction, and all it sees is a stream of bytes with no particular meaning.

The role of TCP is to deliver the entire stream of bytes, in the correct order, from one endpoint to the other. When a TCP packet carrying some of those bytes is lost on the network path, it creates a gap in the stream and TCP needs to fill it by resending the affected packet when the loss is detected. While doing so, none of the successfully delivered bytes that follow the lost ones can be delivered to the application, even if they were not themselves lost and belong to a completely independent HTTP request. So they end up getting unnecessarily delayed as TCP cannot know whether the application would be able to process them without the missing bits. This problem is known as “head-of-line blocking”.

Enter HTTP/3

This is where HTTP/3 comes into play: instead of using TCP as the transport layer for the session, it uses QUIC, a new Internet transport protocol, which, among other things, introduces streams as first-class citizens at the transport layer. QUIC streams share the same QUIC connection, so no additional handshakes and slow starts are required to create new ones, but QUIC streams are delivered independently such that in most cases packet loss affecting one stream doesn’t affect others. This is possible because QUIC packets are encapsulated on top of UDP datagrams.

Using UDP allows much more flexibility compared to TCP, and enables QUIC implementations to live fully in user-space — updates to the protocol’s implementations are not tied to operating systems updates as is the case with TCP. With QUIC, HTTP-level streams can be simply mapped on top of QUIC streams to get all the benefits of HTTP/2 without the head-of-line blocking.

QUIC also combines the typical 3-way TCP handshake with TLS 1.3's handshake. Combining these steps means that encryption and authentication are provided by default, and also enables faster connection establishment. In other words, even when a new QUIC connection is required for the initial request in an HTTP session, the latency incurred before data starts flowing is lower than that of TCP with TLS.


But why not just use HTTP/2 on top of QUIC, instead of creating a whole new HTTP revision? After all, HTTP/2 also offers the stream multiplexing feature. As it turns out, it’s somewhat more complicated than that.

While it’s true that some of the HTTP/2 features can be mapped on top of QUIC very easily, that’s not true for all of them. One in particular, HTTP/2’s header compression scheme called HPACK, heavily depends on the order in which different HTTP requests and responses are delivered to the endpoints. QUIC enforces delivery order of bytes within single streams, but does not guarantee ordering among different streams.

This behavior required the creation of a new HTTP header compression scheme, called QPACK, which fixes the problem but requires changes to the HTTP mapping. In addition, some of the features offered by HTTP/2 (like per-stream flow control) are already offered by QUIC itself, so they were dropped from HTTP/3 in order to remove unnecessary complexity from the protocol.

HTTP/3, powered by a delicious quiche

QUIC and HTTP/3 are very exciting standards, promising to address many of the shortcomings of previous standards and ushering in a new era of performance on the web. So how do we go from exciting standards documents to working implementation?

Cloudflare’s QUIC and HTTP/3 support is powered by quiche, our own open-source implementation written in Rust.


You can find it on GitHub at github.com/cloudflare/quiche.

We announced quiche a few months ago and since then have added support for the HTTP/3 protocol, on top of the existing QUIC support. We have designed quiche in such a way that it can now be used to implement HTTP/3 clients and servers or just plain QUIC ones.

How do I enable HTTP/3 for my domain?

As mentioned above, we have started on-boarding customers that signed up for the waiting list. If you are on the waiting list and have received an email from us communicating that you can now enable the feature for your websites, you can simply go to the Cloudflare dashboard and flip the switch from the “Network” tab manually:

[Screenshot: the HTTP/3 toggle in the Cloudflare dashboard’s Network tab]

We expect to make the HTTP/3 feature available to all customers in the near future.

Once enabled, you can experiment with HTTP/3 in a number of ways:

Using Google Chrome as an HTTP/3 client

In order to use the Chrome browser to connect to your website over HTTP/3, you first need to download and install the latest Canary build. Then all you need to do to enable HTTP/3 support is start Chrome Canary with the “--enable-quic” and “--quic-version=h3-23” command-line arguments.

Once Chrome is started with the required arguments, you can just type your domain in the address bar, and see it loaded over HTTP/3 (you can use the Network tab in Chrome’s Developer Tools to check what protocol version was used). Note that due to how HTTP/3 is negotiated between the browser and the server, HTTP/3 might not be used for the first few connections to the domain, so you should try to reload the page a few times.

If this seems too complicated, don’t worry: as the HTTP/3 support in Chrome becomes more stable over time, enabling HTTP/3 will become easier.

This is what the Network tab in the Developer Tools shows when browsing this very blog over HTTP/3:

[Screenshot: Chrome Developer Tools Network tab showing this blog loaded over HTTP/3]

Note that due to the experimental nature of the HTTP/3 support in Chrome, the protocol is actually identified as “http2+quic/99” in Developer Tools, but don’t let that fool you, it is indeed HTTP/3.

Using curl

The curl command-line tool also supports HTTP/3 as an experimental feature. You’ll need to download the latest version from git and follow the instructions on how to enable HTTP/3 support.

If you’re running macOS, we’ve also made it easy to install an HTTP/3 equipped version of curl via Homebrew:

 % brew install --HEAD -s https://raw.githubusercontent.com/cloudflare/homebrew-cloudflare/master/curl.rb

In order to perform an HTTP/3 request, all you need to do is add the “--http3” command-line flag to a normal curl command:

 % ./curl -I https://blog.cloudflare.com/ --http3
HTTP/3 200
date: Tue, 17 Sep 2019 12:27:07 GMT
content-type: text/html; charset=utf-8
set-cookie: __cfduid=d3fc7b95edd40bc69c7d894d296564df31568723227; expires=Wed, 16-Sep-20 12:27:07 GMT; path=/; domain=.blog.cloudflare.com; HttpOnly; Secure
x-powered-by: Express
cache-control: public, max-age=60
vary: Accept-Encoding
cf-cache-status: HIT
age: 57
expires: Tue, 17 Sep 2019 12:28:07 GMT
alt-svc: h3-22=":443"; ma=86400
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 517b128df871bfe3-MAN

Using quiche’s http3-client

Finally, we also provide an example HTTP/3 command-line client (as well as a command-line server) built on top of quiche, that you can use to experiment with HTTP/3.

To get it running, first clone quiche’s GitHub repository:

$ git clone --recursive https://github.com/cloudflare/quiche

Then build it. You need a working Rust and Cargo installation for this to work (we recommend using rustup to easily set up a working Rust development environment).

$ cargo build --examples

And finally you can execute an HTTP/3 request:

$ RUST_LOG=info target/debug/examples/http3-client https://blog.cloudflare.com/

What’s next?

In the coming months we’ll be working on improving and optimizing our QUIC and HTTP/3 implementation, and will eventually allow everyone to enable this new feature without having to go through a waiting list. We’ll continue updating our implementation as standards evolve, which may result in breaking changes between draft versions of the standards.

Here are a few new features on our roadmap that we’re particularly excited about:

Connection migration

One important feature that QUIC enables is seamless and transparent migration of connections between different networks (such as your home WiFi network and your carrier’s mobile network as you leave for work in the morning) without requiring a whole new connection to be created.


This feature will require some additional changes to our infrastructure, but it’s something we are excited to offer our customers in the future.

Zero Round Trip Time Resumption

Just like TLS 1.3, QUIC supports a mode of operation that allows clients to start sending HTTP requests before the connection handshake has completed. We don’t yet support this feature in our QUIC deployment, but we’ll be working on making it available, just like we already do for our TLS 1.3 support.

HTTP/3: it’s alive!

We are excited to support HTTP/3 and allow our customers to experiment with it while efforts to standardize QUIC and HTTP/3 are still ongoing. We’ll continue working alongside other organizations, including Google and Mozilla, to finalize the QUIC and HTTP/3 standards and encourage broad adoption.

Here’s to a faster, more reliable, more secure web experience for all.

The Technical Challenges of Building Cloudflare WARP

Post Syndicated from Zack Bloom original https://blog.cloudflare.com/warp-technical-challenges/


If you have seen our other post you know that we released WARP to the last members of our waiting list today. With WARP our goal was to secure and improve the connection between your mobile devices and the Internet. Along the way we ran into problems with phone and operating system versions, diverse networks, and our own infrastructure, all while working to meet the pent up demand of a waiting list nearly two million people long.

To understand all these problems and how we solved them we first need to give you some background on how the Cloudflare network works:

How Our Network Works

The Cloudflare network is composed of data centers located in 194 cities and in more than 90 countries. Every Cloudflare data center is composed of many servers; each data center receives a continual flood of requests and has to distribute those requests among the servers that handle them. We use a set of routers to perform that operation:

[Diagram: routers distributing incoming requests across servers in a Cloudflare data center]

Our routers listen on Anycast IP addresses which are advertised over the public Internet. If you have a site on Cloudflare, your site is available via two of these addresses. In this case, I am doing a DNS query for “workers.dev”, a site which is powered by Cloudflare:

➜ dig workers.dev

;; QUESTION SECTION:
;workers.dev.      IN  A

;; ANSWER SECTION:
workers.dev.    161  IN  A  198.41.215.162
workers.dev.    161  IN  A  198.41.214.162

;; SERVER: 1.1.1.1#53(1.1.1.1)

workers.dev is available at two addresses, 198.41.215.162 and 198.41.214.162 (along with two IPv6 addresses available via the AAAA DNS query). Those two addresses are advertised from every one of our data centers around the world. When someone connects to any Internet property on Cloudflare, each networking device their packets pass through will choose the shortest path from their computer or phone to the nearest Cloudflare data center.

Once the packets hit our data center, we send them to one of the many servers which operate there. Traditionally, one might use a load balancer to do that type of traffic distribution across multiple machines. Unfortunately putting a set of load balancers capable of handling our volume of traffic in every data center would be exceptionally expensive, and wouldn’t scale as easily as our servers do. Instead, we use devices built for operating on exceptional volumes of traffic: network routers.

Once a packet hits our data center it is processed by a router. That router sends the traffic to one of a set of servers responsible for handling that address using a routing strategy called ECMP (Equal-Cost Multi-Path). ECMP refers to the situation where the router doesn’t have a clear ‘winner’ between multiple routes: it has multiple good next hops, all to the same ultimate destination. In our case we hack that concept a bit; rather than using ECMP to balance across multiple intermediary links, we make the intermediary link addresses the final destination of our traffic: our servers.


Here’s the configuration of a Juniper-brand router of the type which might be in one of our data centers, and which is configured to balance traffic across three destinations:

user@router# show routing-options

static {
  route 172.16.1.0/24 next-hop [ 172.16.2.1 172.16.2.2 172.16.2.3 ];
}
forwarding-table {
  export load-balancing-policy;
}

Since the ‘next-hop’ is our server, traffic will be split across multiple machines very efficiently.

TCP, IP, and ECMP

IP is responsible for sending packets of data from addresses like 93.184.216.34 to 208.80.153.224 (or [2606:2800:220:1:248:1893:25c8:1946] to [2620:0:860:ed1a::1] in the case of IPv6) across the Internet. It’s the “Internet Protocol”.

TCP (Transmission Control Protocol) operates on top of a protocol like IP which can send a packet from one place to another, and makes data transmission reliable and useful for more than one process at a time. It is responsible for taking the unreliable and misordered packets that might arrive over a protocol like IP and delivering them reliably, in the correct order. It also introduces the concept of a ‘port’, a number from 1-65535 which helps route traffic on a computer or phone to a specific service (such as the web or email). Each TCP connection has a source and destination port, which are included in the header TCP adds to the beginning of each packet. Without the idea of ports it would not be easy to figure out which messages were destined for which program. For example, both Google Chrome and Mail might wish to send messages over your WiFi connection at the same time, so they will each use their own port.

Here’s an example of making a request for https://cloudflare.com/ at 198.41.215.162, on the default port for HTTPS: 443. My computer has randomly assigned me the port 51602, which it will listen on for a response that will (hopefully) contain the contents of the site:

Internet Protocol Version 4, Src: 19.5.7.21, Dst: 198.41.215.162
    Protocol: TCP (6)
    Source: 19.5.7.21
    Destination: 198.41.215.162
Transmission Control Protocol, Src Port: 51602, Dst Port: 443, Seq: 0, Len: 0
    Source Port: 51602
    Destination Port: 443

Looking at the same request from the Cloudflare side shows a mirror image: a request from my public IP address originating at my source port, destined for port 443 (I’m ignoring NAT for the moment, more on that later):

Internet Protocol Version 4, Src: 198.41.215.162, Dst: 19.5.7.21
    Protocol: TCP (6)
    Source: 198.41.215.162
    Destination: 19.5.7.21
Transmission Control Protocol, Src Port: 443, Dst Port: 51602, Seq: 0, Len: 0
    Source Port: 443
    Destination Port: 51602

We can now return to ECMP! It would theoretically be possible to use ECMP to balance packets between servers randomly, but you would almost never want to do that. A message over the Internet is generally composed of multiple TCP packets. If each packet were sent to a different server, it would be impossible to reconstruct the original message in any one place and act on it. Even beyond that, it would be terrible for performance: we rely on being able to maintain long-lived TCP and TLS sessions, which require a persistent connection to a single server. To provide that persistence, our routers don’t balance traffic randomly; they use a combination of four values: the source address, the source port, the destination address, and the destination port. Traffic with the same combination of those four values will always make it to the same server. In the case of my example above, all of my messages destined for cloudflare.com will make it to a single server which can reconstruct the TCP packets into my request and return packets in a response.
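
To illustrate how that 4-tuple routing behaves, here’s a toy JavaScript sketch; the hash function is invented for explanation and is not the algorithm our routers actually run:

// Packets are routed on the 4-tuple: the same combination of values
// always lands on the same server.
function pickServer(srcIP, srcPort, dstIP, dstPort, servers) {
  const tuple = `${srcIP}:${srcPort}->${dstIP}:${dstPort}`;
  let hash = 0;
  for (const ch of tuple) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple polynomial string hash
  }
  return servers[hash % servers.length];
}

// Every packet of the connection above maps to the same machine:
pickServer("19.5.7.21", 51602, "198.41.215.162", 443, ["server1", "server2", "server3"]);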

Enter WARP

For a conventional request it is very important that our ECMP routing sends all of your packets to the same server for the duration of your request. Over the web a request commonly lasts less than ten seconds and the system works well. Unfortunately we quickly ran into issues with WARP.

WARP uses a session key negotiated with public-key encryption to secure packets. For a successful connection, both sides must negotiate a connection which is then only valid for that particular client and the specific server they are talking to. This negotiation takes time and has to be completed any time a client talks to a new server. Even worse, if packets get sent which expect one server, and end up at another, they can’t be decrypted, breaking the connection. Detecting those failed packets and restarting the connection from scratch takes so much time that our alpha testers experienced it as a complete loss of their Internet connection. As you can imagine, testers don’t leave WARP on very long when it prevents them from using the Internet.

WARP was experiencing so many failures because devices were switching servers much more often than we expected. If you recall, our ECMP router configuration uses a combination of (Source IP, Source Port, Destination IP, Destination Port) to match a packet to a server. Destination IP doesn’t generally change, WARP clients are always connecting to the same Anycast addresses. Similarly, Destination Port doesn’t change, we always listen on the same port for WARP traffic. The other two values, Source IP and Source Port, were changing much more frequently than we had planned.

One source of these changes was expected. WARP runs on cell phones, and cell phones commonly switch from Cellular to Wi-Fi connections. When you make that switch you suddenly go from communicating over the Internet via your cellular carrier’s (like AT&T or Verizon) IP address space to that of the Internet Service Provider your Wi-Fi connection uses (like Comcast or Google Fiber). It’s essentially impossible that your IP address won’t change when you move between connections.

The port changes, however, occurred even more frequently than network switches could explain. To understand why, we need to introduce one more piece of Internet lore: Network Address Translation.

NAT

An IPv4 address is composed of 32 bits (often written as four eight-bit numbers). If you exclude the reserved addresses which can’t be used, you are left with 3,706,452,992 possible addresses. This number has remained constant since IPv4 was deployed on the ARPANET in 1983, even as the number of devices has exploded (although it might go up a bit soon if the 0.0.0.0/8 block becomes available). This data is based on Gartner Research predictions and estimates:

[Chart: growth in the number of Internet-connected devices against the fixed supply of IPv4 addresses]

IPv6 is the definitive solution to this problem. It expands the length of an address from 32 to 128 bits, with 125 available in a valid Internet address at the moment (all public IPv6 addresses have the first three bits set to 001; the remaining 87.5% of the IPv6 address space is not considered necessary yet). 2^125 is an impossibly large number and would be more than enough for every device on Earth to have its own address. Unfortunately, 21 years after it was published, IPv6 still remains unsupported on many networks. Much of the Internet still relies on IPv4, and as seen above, there aren’t enough IPv4 addresses for every device to have its own.

To solve this problem, many devices are commonly put behind a single Internet-addressable IP address. A router is used to do Network Address Translation: to take messages which arrive on that single public IP and forward them to the appropriate device on the local network. In effect it’s as if everyone in your apartment building had the same street address, and the postal worker was responsible for sorting out what mail was meant for which person.

When your devices send a packet destined for the Internet, your router intercepts it. The router then rewrites the source address to the single public Internet address allocated for you, and the source port to a port which is unique among all the messages being sent across all the Internet-connected devices on your network. Just as your computer chooses a random source port for your messages which is unique among all the different processes on your computer, your router chooses a random source port which is unique for all the Internet connections across your entire network. It remembers the port it selects for you as belonging to your connection, and allows the message to continue over the Internet.


When a response arrives destined for the port allocated to you, the router matches it to your connection and rewrites it again, this time replacing the destination address with your address on the local network, and the destination port with the original source port you specified. It has transparently allowed all the devices on your network to act as if they were one big computer with a single Internet-connected IP address.

This process works very well for the duration of a common request over the Internet. Your router only has so much space, however, so it will helpfully delete old port assignments, freeing up space for new ones. It generally waits for a connection to go without any messages for thirty seconds or more before deleting an assignment, making it unlikely a response will arrive which it can no longer direct to the appropriate source. Unfortunately, WARP sessions need to last much longer than thirty seconds.
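
Here’s a toy model of that NAT behavior in JavaScript (an assumption for illustration, not real router code); it shows why a session that goes quiet for more than thirty seconds comes back on a new public port:

const IDLE_TIMEOUT_MS = 30 * 1000; // typical NAT eviction window
const mappings = new Map();        // "privateIP:privatePort" -> { publicPort, lastSeen }
let nextPublicPort = 40000;

function translateSourcePort(privateIP, privatePort) {
  const key = `${privateIP}:${privatePort}`;
  let entry = mappings.get(key);
  if (!entry || Date.now() - entry.lastSeen > IDLE_TIMEOUT_MS) {
    // Idle too long (or never seen): allocate a brand-new public port.
    // A new source port changes the ECMP 4-tuple, so packets land on a
    // different server and the WARP session breaks.
    entry = { publicPort: nextPublicPort++, lastSeen: Date.now() };
    mappings.set(key, entry);
  }
  entry.lastSeen = Date.now();
  return entry.publicPort;
}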

When you next send a message after your NAT session has expired, you are given a new source port. That new port causes your ECMP mapping (based on source IP, source port, destination IP, destination port) to change, causing us to route your requests to a new machine within the Cloudflare data center your messages are arriving at. This breaks your WARP session, and your Internet connection.

We experimented extensively with methods of keeping your NAT session fresh by periodically sending keep-alive messages which would prevent routers and mobile carriers from evicting mappings. Unfortunately, waking your device’s radio every thirty seconds has real consequences for battery life, and it was not entirely successful at preventing port and address changes. We needed a way to always map sessions to the same machine, even as their source port (and even source address) changed.

Fortunately, we had a solution which came from elsewhere at Cloudflare. We don’t use dedicated load balancers, but we do have many of the same problems load balancers solve. We have long needed to map traffic to Cloudflare servers with more control than ECMP alone allows. Rather than deploying an entire tier of load balancers, we use every server in our network as a load balancer, forwarding packets first to an arbitrary machine and then relying on that machine to forward the packet to the appropriate host. This consumes minimal resources and allows us to scale our load balancing infrastructure with each new machine we add. We have a lot more to share on how this infrastructure works and what makes it unique; subscribe to this blog to be notified when that post is released.

To make our load balancing technique work, though, we needed a way to identify which client a WARP packet was associated with before it could be decrypted. To understand how we did that, it’s helpful to understand how WARP encrypts your messages. The industry standard way of connecting a device to a remote network is a VPN. VPNs use a protocol like IPsec to allow your device to send messages securely to a remote network. Unfortunately, VPNs are generally rather disliked. They slow down connections, eat battery life, and their complexity frequently makes them a source of security vulnerabilities. Users of corporate networks which mandate VPNs often hate them, and the idea that we would convince millions of consumers to install one voluntarily seemed ridiculous.

After considering and testing several more modern options, we landed on WireGuard®. WireGuard is a modern, high-performance, and, most importantly, simple protocol created by Jason Donenfeld to solve the same problem. Its original code base is less than 1% the size of a popular IPsec implementation, making it easy for us to understand and secure. We chose Rust as the language most likely to give us the performance and safety we needed and implemented WireGuard while optimizing the code heavily to run quickly on the platforms we were targeting. Then we open sourced the project.


WireGuard changes two very relevant things about the traffic you send over the Internet. The first is it uses UDP not TCP. The second is it uses a session key negotiated with public-key encryption to secure the contents of that UDP packet.

TCP is the conventional protocol used for loading a website over the Internet. It combines the ability to address ports (which we talked about previously) with reliable delivery and flow control. Reliable delivery ensures that if a message is dropped, TCP will eventually resend the missing data. Flow control gives TCP the tools it needs to handle many clients all sharing the same link who exceed its capacity. UDP is a much simpler protocol that trades those capabilities for simplicity: it makes a best-effort attempt to send a message, and if the message is dropped or there is too much data for the links, it is simply never heard from again.

UDP’s lack of reliability would normally be a problem while browsing the Internet, but we are not simply sending UDP, we are sending a complete TCP packet _inside_ our UDP packets.

Inside the payload encrypted by WireGuard we have a complete TCP header which contains all the information necessary to ensure reliable delivery. We then wrap it with WireGuard’s encryption and use UDP to (less-than-reliably) send it over the Internet. Should it be dropped, TCP will do its job just as if a network link lost the message and resend it. If we instead wrapped our inner, encrypted TCP session in another TCP packet, as some other protocols do, we would dramatically increase the number of network messages required, destroying performance.

The second interesting component of WireGuard relevant to our discussion is public-key encryption. WireGuard allows you to secure each message you send such that only the specific destination you are sending it to can decrypt it. That is a powerful way of ensuring your security as you browse the Internet, but it means it is impossible to read anything inside the encrypted payload until the message has reached the server which is responsible for your session.

Returning to our load balancing issue, you can see that only three things are accessible to us before we can decrypt the message: the IP header, the UDP header, and the WireGuard header. Neither the IP header nor the UDP header includes the information we need, as we have already failed with the four pieces of information they contain (source IP, source port, destination IP, destination port). That leaves the WireGuard header as the one location where we can find an identifier which can be used to keep track of who the client was before decrypting the message. Unfortunately, there isn’t one. This is the format of the message used to initiate a connection:

[Diagram: WireGuard handshake initiation message format]

sender looks temptingly like a client id, but it’s randomly assigned on every handshake. Handshakes have to be performed every two minutes to rotate keys, making the sender value insufficiently persistent. We could have forked the protocol to add any number of additional fields, but it is important to us to remain wire-compatible with other WireGuard clients. Fortunately, WireGuard has a three-byte block in its header which is not currently used by other clients. We decided to put our identifier in this region and still support messages from other WireGuard clients (albeit with less reliable routing than we can offer). If this reserved section is used for other purposes we can ignore those bits or work with the WireGuard team to extend the protocol in another suitable way.

When we begin a WireGuard session we include our clientid field, which is provided by the authentication server that has to be contacted to begin a WARP session:

[Diagram: handshake initiation message including the clientid field]

Data messages similarly include the same field:

[Diagram: data message including the clientid field]

It’s important to note that the clientid is only 24 bits long. That means there are fewer possible clientid values than the current number of users waiting to use WARP. This suits us well, as we don’t need or want the ability to track individual WARP users. clientid is only necessary for load balancing; once it has served its purpose, we expunge it from our systems as quickly as we can.

The load balancing system now uses a hash of the clientid to identify which machine a packet should be routed to, meaning WARP messages always arrive at the same machine even as you change networks or move from Wi-Fi to cellular, and the problem was eliminated.
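
Here’s a sketch of that routing step, under the assumptions already described (byte 0 of a WireGuard message is its type and bytes 1-3 are the reserved block we repurposed; the modulo is a stand-in for our real hashing scheme):

function routeWarpPacket(packet /* Uint8Array */, servers) {
  // Read the 24-bit clientid from WireGuard's reserved header bytes.
  const clientid = (packet[1] << 16) | (packet[2] << 8) | packet[3];
  if (clientid === 0) {
    // A standard WireGuard client leaves these bytes zeroed; fall back
    // to the less reliable 4-tuple routing.
    return null;
  }
  // The same clientid always maps to the same machine, no matter how
  // the source IP or port changes underneath it.
  return servers[clientid % servers.length];
}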

Client Software

Cloudflare has never developed client software before. We take pride in selling a service anyone can use without needing to buy hardware or provision infrastructure. To make WARP work, however, we needed to deploy our code onto one of the most ubiquitous hardware platforms on Earth: smartphones.

While developing software on mobile devices has gotten steadily easier over the past decade, unfortunately developing low-level networking software remains rather difficult. To consider one example: we began the project using the latest iOS connection API called Network, introduced in iOS 12. Apple strongly recommends the use of Network, in their words “Your customers are going to appreciate how much better your connections, how much more reliable your connections are established, and they’ll appreciate the longer battery life from the better performance.”

The Network framework provides a pleasantly high-level API which, as they say, integrates well with the native performance features built into iOS. Creating a UDP connection (connection is a bit of a misnomer; there are no connections in UDP, just packets) is as simple as:

self.connection = NWConnection(host: hostUDP, port: portUDP, using: .udp)

And sending a message can be as easy as:

self.connection?.send(content: content, completion: .contentProcessed { _ in })

Unfortunately, at a certain point code actually gets deployed, and bug reports begin flowing in. The first issue was that the simplicity of the API made it impossible for us to process more than a single UDP packet at a time. We commonly use packets of up to 1500 bytes; running a speed test on my Google Fiber connection currently results in a speed of 370 Mbps, or almost thirty-one thousand packets per second. Attempting to process each packet individually was slowing down connections by as much as 40%. According to Apple, the best solution to get the performance we needed was to fall back to the older NWUDPSession API, introduced in iOS 9.

IPv6

If we compare the code required to create an NWUDPSession to the example above, you will notice that we suddenly care which protocol, IPv4 or IPv6, we are using:

let v4Session = NWUDPSession(upgradeFor: self.ipv4Session)
v4Session.setReadHandler(self.filteringReadHandler, maxDatagrams: 32)

In fact, NWUDPSession does not handle many of the trickier elements of creating connections over the Internet. For example, the Network framework will automatically determine whether a connection should be made over IPv4 or IPv6:

[Diagram: the Network framework choosing between IPv4 and IPv6 automatically]

NWUDPSession does not do this for you, so we began creating our own logic to determine which type of connection should be used. Once we began to experiment, it quickly became clear that the two are not created equal. It’s fairly common for a route to the same destination to have very different performance depending on whether you use its IPv4 or IPv6 address. Often this is because IPv4 addresses are fewer in number and have been around longer, making it possible for those routes to be better optimized by the Internet’s infrastructure.

Every Cloudflare product has to support IPv6 as a rule. In 2016, we enabled IPv6 for over 98% of our network, over four million sites, and made a pretty big dent in IPv6 adoption on the web:

[Chart: IPv6 adoption on the web]

We couldn’t release WARP without IPv6 support. We needed to ensure that we were always using the fastest possible connection while still supporting both protocols in equal measure. To solve that we turned to a technology we have used with DNS for years: Happy Eyeballs. As codified in RFC 6555, Happy Eyeballs is the idea that you should look for both an IPv4 and an IPv6 address when doing a DNS lookup. Whichever returns first wins. That way you can allow IPv6 websites to load quickly even in a world which does not fully support it.

As an example, I am loading the website http://zack.is/. My web browser makes a DNS request for both the IPv4 address (an “A” record) and the IPv6 address (an “AAAA” record) at the same time:

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 1.1.1.1
User Datagram Protocol, Src Port: 47447, Dst Port: 53
Domain Name System (query)
    Queries
        zack.is: type A, class IN

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 1.1.1.1
User Datagram Protocol, Src Port: 49946, Dst Port: 53
Domain Name System (query)
    Queries
        zack.is: type AAAA, class IN

In this case the response to the A query returned more quickly, and the connection is begun using that protocol:

Internet Protocol Version 4, Src: 1.1.1.1, Dst: 192.168.7.21
User Datagram Protocol, Src Port: 53, Dst Port: 47447
Domain Name System (response)
    Queries
        zack.is: type A, class IN
    Answers
        zack.is: type A, class IN, addr 104.24.101.191
       
Internet Protocol Version 4, Src: 192.168.7.21, Dst: 104.24.101.191
Transmission Control Protocol, Src Port: 55244, Dst Port: 80, Seq: 0, Len: 0
    Source Port: 55244
    Destination Port: 80
    Flags: 0x002 (SYN)

We don’t need to do DNS queries to make WARP connections, we know the IP addresses of our data centers already, but we do want to know which of the IPv4 and IPv6 addresses will lead to a faster route over the Internet. To accomplish that we perform the same technique but at the network level: we send a packet over each protocol and use the protocol which returns first for subsequent messages. With some error handling and logging removed for brevity, it appears as:

let raceFinished = Atomic<Bool>(false)

let happyEyeballsRacer: (NWUDPSession, NWUDPSession, String) -> Void = {
    (session, otherSession, name) in
    // Session is the session the racer runs for, otherSession is a session we race against

    let handleMessage: ([Data]) -> Void = { datagrams in
        // This handler will be executed twice, once for the winner, again for the loser.
        // It does not matter what reply we received. Any reply means this connection is working.

        if raceFinished.swap(true) {
            // This racer lost
            return self.filteringReadHandler(data: datagrams, error: nil)
        }

        // The winner becomes the current session
        self.wireguardServerUDPSession = session

        session.setReadHandler(self.readHandler, maxDatagrams: 32)
        otherSession.setReadHandler(self.filteringReadHandler, maxDatagrams: 32)
    }

    session.setReadHandler({ (datagrams) in
        handleMessage(datagrams)
    }, maxDatagrams: 1)

    if !raceFinished.value {
        // Send a handshake message
        session.writeDatagram(onViable())
    }
}

This technique successfully allows us to support IPv6 addressing. In fact, every device which uses WARP instantly supports IPv6 addressing, even on networks which don’t. For the 34% of Comcast’s network and the 69% of Charter’s network which didn’t support IPv6 (as of 2018), WARP allows those users to communicate with IPv6 servers successfully.

This test shows my phone’s IPv6 support before and after enabling WARP:

The Technical Challenges of Building Cloudflare WARP

The Technical Challenges of Building Cloudflare WARP

Dying Connections

Nothing is simple, however: with iOS 12.2, NWUDPSession began to trigger errors which killed connections. These errors were identified only by the code ‘55’. After some research, it appears 55 has referred to the same error since the early foundations of FreeBSD, the operating system OS X was originally built upon. In FreeBSD it’s commonly referred to as ENOBUFS, and it’s returned when the operating system does not have sufficient BUFfer Space to complete the operation. For example, looking at the source of FreeBSD today, you see this code in its IPv6 implementation:

The Technical Challenges of Building Cloudflare WARP

In this example, if enough memory cannot be allocated to accommodate the size of an IPv6 and ICMP6 header, the error ENOBUFS (which is mapped to the number 55) will be returned. Unfortunately, Apple’s take on FreeBSD is not open source: how, when, and why they might be returning the error is a mystery. This error has been experienced by other UDP-based projects, but a resolution is not forthcoming.

What is clear is that once an error 55 begins occurring, the connection is no longer usable. To handle this case we need to reconnect, but repeating the Happy Eyeballs mechanic we use on initial connection is both unnecessary (we were already talking over the fastest connection) and time-consuming. Instead we added a second connection method which is only used to recreate an already-working session:

/**
Create a new UDP connection to the server using a Happy Eyeballs like heuristic.

This function should be called when first establishing a connection to the edge server.

It will initiate a new connection over IPv4 and IPv6 in parallel, keeping the connection that receives the first response.
*/

func connect(onViable: @escaping () -> Data, onReply: @escaping () -> Void, onFailure: @escaping () -> Void, onDisconnect: @escaping () -> Void)

/**
Recreate the current connections.

This function should be called as a response to error code 55, when a quick connection is required.

Unlike `happyEyeballs`, this function will use viability as its only success criteria.
*/

func reconnect(onViable: @escaping () -> Void, onFailure: @escaping () -> Void, onDisconnect: @escaping () -> Void)

Using reconnect we are able to recreate sessions broken by code 55 errors, but it still adds a latency hit which is not ideal. As with all client software development on a closed-source platform, however, we are dependent on the platform to identify and fix platform-level bugs.

Truthfully, this is just one of a long list of platform-specific bugs we ran into building WARP. We hope to continue working with device vendors to get them fixed. There are an unimaginable number of device and connection combinations, and each connection doesn’t just exist at one moment in time; connections are always changing, entering and leaving broken states almost faster than we can track. Even now, getting WARP to work on every device and connection on Earth is not a solved problem; we still get daily bug reports which we work to triage and resolve.

WARP+

WARP is meant to be a place where we can apply optimizations which make the Internet better. We have a lot of experience making websites more performant; WARP is our opportunity to experiment with doing the same for all Internet traffic.

At Cloudflare we have a product called Argo. Argo makes websites’ time to first byte more than 30% faster on average by continually monitoring thousands of routes over the Internet between our data centers. That data builds a database which maps every IP address range to the fastest possible route to every destination. When a packet arrives it first reaches the data center closest to the client, then that data center uses data from our tests to discover the route which will get the packet to its destination with the lowest possible latency. You can think of it like a traffic-aware GPS for the Internet.

Argo has historically only operated on HTTP packets. HTTP is the protocol which powers the web, sending messages which load websites on top of TCP and IP. For example, if I load http://zack.is/, an HTTP message is sent inside a TCP packet:

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 104.24.101.191
Transmission Control Protocol, Src Port: 55244, Dst Port: 80
    Source Port: 55244
    Destination Port: 80
    TCP payload (414 bytes)
Hypertext Transfer Protocol
    GET / HTTP/1.1\r\n
    Host: zack.is\r\n
    Connection: keep-alive\r\n
    Accept-Encoding: gzip, deflate\r\n
    Accept-Language: en-US,en;q=0.9\r\n
    \r\n

The modern, secure web presents a problem for us, however: when I make the same request over HTTPS (https://zack.is) rather than plain HTTP (http://zack.is), I see a very different result over the wire:

Internet Protocol Version 4, Src: 192.168.7.21, Dst: 104.25.151.102
Transmission Control Protocol, Src Port: 55983, Dst Port: 443
    Source Port: 55983
    Destination Port: 443
    Transport Layer Security
    TCP payload (54 bytes)
Transport Layer Security
    TLSv1.2 Record Layer: Application Data Protocol: http-over-tls
        Encrypted Application Data: 82b6dd7be8c5758ad012649fae4f469c2d9e68fe15c17297…

My request has been encrypted! It’s no longer possible for WARP (or anyone but the destination) to tell what is in the payload. It might be HTTP, but it also might be any other protocol. If my site is one of the twenty million which already use Cloudflare, we can decrypt the traffic and accelerate it (along with a long list of other optimizations). But for encrypted traffic destined elsewhere, our existing HTTP-only Argo technology was not going to work.

Fortunately we now have a good amount of experience working with non-HTTP traffic through our Spectrum and Magic Transit products. To solve our problem the Argo team turned to the CONNECT protocol.

As we now know, when a WARP request is made it first communicates over the WireGuard protocol to a server running in one of our 194 data centers around the world. Once the WireGuard message has been decrypted, we examine the destination IP address to see if it is an HTTP request destined for a Cloudflare-powered site, or a request destined elsewhere. If it’s destined for us it enters our standard HTTP serving path; often we can reply to the request directly from our cache in the very same data center.

If it’s not destined for a Cloudflare-powered site we instead forward the packet to a proxy process which runs on each machine. This proxy is responsible for loading the fastest path from our Argo database and beginning an HTTP session with a machine in the data center this traffic should be forwarded to. It uses the CONNECT command to both transmit metadata (as headers) and turn the HTTP session into a connection which can transmit the raw bytes of the payload:

CONNECT 8.54.232.11:5564 HTTP/1.1\r\n
Exit-Tcp-Keepalive-Duration: 15\r\n
Application: warp\r\n
\r\n
<data to send to origin>

Once the message arrives at the destination data center it is either forwarded to another data center (if that is best for performance), or sent directly to the origin which is awaiting the traffic.

The Technical Challenges of Building Cloudflare WARP

Smart routing is just the beginning of WARP+; we have a long list of projects and plans which are all aimed at making your Internet faster, and we couldn’t be more thrilled to finally have a platform to test them with.

Our Mission

Today, after well over a year of development, WARP is available to you and to your friends and family. For us, though, this is just the beginning. With the ability to improve the full network connection for all traffic, we unlock a whole new world of optimizations and security improvements which were simply impossible before. We couldn’t be more excited to experiment, play, and eventually release all sorts of new WARP and WARP+ features.

Cloudflare’s mission is to help build a better Internet. If we are willing to experiment and solve hard technical problems together we believe we can help make the future of the Internet better than the Internet of today, and we are all grateful to play a part in that. Thank you for trusting us with your Internet connection.

WARP was built by Vlad Krasnov, Chris Branch, Dane Knecht, Naga Tripirineni, Andrew Plunk, Adam Schwartz, Irtefa, and intern Michelle Chen with support from members of our Austin, San Francisco, Champaign, London, Warsaw, and Lisbon offices.

WARP is here (sorry it took so long)

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/announcing-warp-plus/

WARP is here (sorry it took so long)

WARP is here (sorry it took so long)

Today, after a longer than expected wait, we’re opening WARP and WARP Plus to the general public. If you haven’t heard about it yet, WARP is a mobile app designed for everyone which uses our global network to secure all of your phone’s Internet traffic.

We announced WARP on April 1 of this year and expected to roll it out over the next few months at a fairly steady clip and get it released to everyone who wanted to use it by July. That didn’t happen. It turned out that building a next generation service to secure consumer mobile connections without slowing them down or burning battery was… harder than we originally thought.

Before today, there were approximately two million people on the waitlist to try WARP. That demand blew us away. It also embarrassed us. The common refrain is consumers don’t care about their security and privacy, but the attention WARP got proved to us how wrong that assumption actually is.

This post is an explanation of why releasing WARP took so long, what we’ve learned along the way, and an apology for those who have been eagerly waiting. It also talks briefly about the rationale for why we built WARP as well as the privacy principles we’ve committed to. However, if you want a deeper dive on those last two topics, I encourage you to read our original launch announcement.

And, if you just want to jump in and try it, you can download and start using WARP on your iOS or Android devices for free.

If you’ve already installed the 1.1.1.1 App on your device, you may need to update to the latest version in order to get the option to enable WARP.

Mea Culpa

Let me start with the apology. We are sorry that making WARP available took far longer than we ever intended. As a way of hopefully making amends, we’re giving everyone who was on the waitlist before today 10 GB of WARP Plus — the even faster version of WARP that uses Cloudflare’s Argo network — for their patience.

For people just signing up today, the basic WARP service is free, without bandwidth caps or limitations. The unlimited version, WARP Plus, is available for a monthly subscription fee. The fee varies by region and is designed to approximate what a McDonald’s Big Mac would cost in that region. On iOS, the WARP Plus pricing as of the publication of this post is still being adjusted on a regional basis, but that should settle out in the next couple of days.

WARP Plus uses Cloudflare’s virtual private backbone, known as Argo, to achieve higher speeds and ensure your connection is encrypted across the long haul of the Internet. We charge for it because it costs us more to provide. However, in order to help spread the word about WARP, you can earn 1GB of WARP Plus for every friend you refer to sign up for WARP. And everyone you refer gets 1GB of WARP Plus for free to get started as well.

Okay, Thanks, That’s Nice, But What Took You So Long?

So what took us so long?

WARP is an ambitious project. We set out to secure Internet connections from mobile devices to the edge of Cloudflare’s network. In doing so, however, we didn’t want to slow devices down or burn excess battery. We wanted it to just work. We also wanted to bet on the technology of the future, not the technology of the past. Specifically, we wanted to build not around legacy protocols like IPsec, but instead around the hyper-efficient WireGuard protocol.

At some level, we thought it would be easy. We already had the 1.1.1.1 App that was securing DNS requests running on millions of mobile devices. That worked great. How much harder could securing all the rest of the requests on a device be? Right??

It turns out, a lot. Zack Bloom has written up a great technical post describing many of the challenges we faced and the solutions we had to invent to deal with them. If you’re interested, I encourage you to check it out.

Some highlights:

Apple threw us a curveball by releasing iOS 12.2 just days before our planned April 1 rollout. The new version of iOS significantly changed the underlying network stack implementation in a way that made some of what we were doing to implement WARP unstable. Ultimately we had to find workarounds in our networking code, costing us valuable time.

We had a version of the WARP app that (kind of) worked on April 1. But, when we started to invite people from outside of Cloudflare to use it, we quickly realized that the mobile Internet around the world was far more wild and varied than we’d anticipated. The Internet is made up of diverse network components which do not always play nicely; we knew that. What we didn’t expect was how much more pain would be introduced by the diversity of mobile carriers, mobile operating systems, and mobile device models.

And, while phones in our testbed were relatively stationary, phones in the real world move around — a lot. When they do, their network settings can change wildly. While that doesn’t matter much for simple, stateless DNS queries, it makes things complex for the rest of Internet traffic. Keeping WireGuard fast requires long-lived sessions between your phone and a server in our network; maintaining that for hours and days was very complex. Even beyond that, we use a technology called Anycast to route your traffic to our network. Anycast meant your traffic could move not just between machines, but between entire data centers. That made things very complex.

Overcoming Challenges

But there is a huge difference between hard and impossible. From long before the announcement, the team has been hard at work and I’m deeply proud of what they’ve accomplished. We changed our roll out plan to focus on iOS and solidify the shared underpinnings of the app to ensure it would work even with future network stack upgrades. We invited beta users not in the order of when they signed up, but instead based on networks where we didn’t yet have information to help us discover as many corner cases as possible. And we invented new technologies to keep session state even when the wild west of mobile networks and Anycast routing collide.

WARP is here (sorry it took so long)

I’ve been running WARP on my phone since April 1. The first few months were… rough. Really rough. But, today, WARP has blended into the background of my mobile experience. And I sleep better knowing that the Internet connections from my phone are secure. Using my phone is as fast as, and in some cases faster than, without WARP. In other words, WARP today does what we set out to accomplish: securing your mobile Internet connection and otherwise getting out of the way.

There Will Be Bugs

While WARP is a lot better than it was when we first announced it, we know there are still bugs. The most common bug we’re seeing these days is when WARP is significantly slower than using the mobile Internet without WARP. This is usually due to traffic being misrouted. For instance, we discovered a network in Turkey earlier this week that was being routed to London rather than our local Turkish facility. Once we’re aware of these routing issues we can typically fix them quickly.

Other common bugs involved captive portals — the pages where you have to enter information when, for instance, connecting to a hotel WiFi. We’ve fixed a lot of them, but we haven’t had WARP users connect to every hotel WiFi yet, so there will inevitably still be some that are broken.

WARP is here (sorry it took so long)

We’ve made it easy to report issues that you discover. From the 1.1.1.1 App you can click on the little bug icon near the top of the screen, or just shake your phone with the app open, and quickly send us a report. We expect, over the weeks ahead, we’ll be squashing many of the bugs that you report.

Even Faster With Plus

WARP is not just a product; it’s a testbed for all of the Internet-improving technology we have spent years developing. One dream was to use our Argo routing technology to allow all of your Internet traffic to use faster, less-congested routes through the Internet. Used by Cloudflare customers for the past several years, Argo has improved the speed of their websites by an average of over 30%. Through the team’s hard work, we are making that technology available to you as WARP Plus.

WARP is here (sorry it took so long)

The WARP Plus technology is not without cost for us. Routing your traffic over our network often costs us more than if we released it directly to the Internet. To cover those costs we charge a monthly fee — $4.99/month or less — for WARP Plus. The fee depends on the region that you’re in and is intended to approximate what a Big Mac would cost in the same region.

Basic WARP is free. Our first priority is not to make money off of WARP, however; we want to grow it to secure every single phone. To help make that happen, we wanted to give you an incentive to share WARP with your friends. You can earn 1GB of free WARP Plus for every person you share WARP with, and everyone you refer also gets 1GB of WARP Plus for free. There is no limit on how much WARP Plus data you can earn by sharing.

Privacy First

The free consumer security space has traditionally not been the most reputable. Many companies have promised to keep consumers’ data safe but instead built businesses around selling it or using it to help target you with advertising. We think that’s disgusting. That is not Cloudflare’s business model and it never will be. WARP continues all the strong privacy protections that 1.1.1.1 launched with, including:

  1. We don’t write user-identifiable log data to disk;
  2. We will never sell your browsing data or use it in any way to target you with advertising;
  3. You don’t need to provide any personal information — not your name, phone number, or email address — in order to use WARP or WARP Plus; and
  4. We will regularly work with outside auditors to ensure we’re living up to these promises.

What WARP Is Not

From a technical perspective, WARP is a VPN. But it is designed for a very different audience than a traditional VPN. WARP is not designed to allow you to access geo-restricted content when you’re traveling. It will not hide your IP address from the websites you visit. If you’re looking for that kind of protection, then a traditional VPN or a service like Tor is likely a better choice for you.

WARP, instead, is built for the average consumer. It’s built to ensure that your data is secured while it’s in transit. So the networks between you and the applications you’re using can’t spy on you. It will help protect you from people sniffing your data while you’re at a local coffee shop. It will also help ensure that your ISP isn’t hoovering up data on your browsing patterns to sell to advertisers.

WARP isn’t designed for the ultra-techie who wants to specify exactly what server their traffic will be routed through. There’s basically only one button in the WARP interface: ON or OFF. It’s simple on purpose. It’s designed for my mom and dad who ask me every holiday dinner what they can do to be a bit safer online. I’m excited this year to have something easy for them to do: install the 1.1.1.1 App, enable WARP, and rest a bit easier.

How Fast Is It?

Once we got WARP to a stable place, this was my first question. My initial inclination was to go to one of the many Speed Test sites and see the results. And the results were… weird. Sometimes much faster, sometimes much slower. Overall, they didn’t make a lot of sense. The reason why is that these sites are designed to measure the speed of your ISP. WARP is different, so these test sites don’t give particularly accurate readings.

The better test is to visit common sites around the Internet and see how they load, in real conditions, on WARP versus off. We’ve built a tool that does this. Generally, in our tests, WARP is around the same speed as non-WARP connections when you’re on a high performance network. As network conditions get worse, WARP will often improve performance more. But your experience will depend on the particular conditions of your network.

We plan, in the next few weeks, to expose the test tool within the 1.1.1.1 App so you can see how your device loads a set of popular sites without WARP, with WARP, and with WARP Plus. And, again, if you’re seeing particularly poor performance, please report it to us. Our goal is to provide security without slowing you down or burning excess battery. We can already do that for many networks and devices and we won’t rest until we can do it for everyone.

Here’s to a More Secure, Fast Internet

Cloudflare’s mission is to help build a better Internet. We’ve done that by securing and making more performant millions of Internet properties since we launched almost exactly 9 years ago. WARP furthers Cloudflare’s mission by extending our network to help make every consumer’s mobile device a bit more secure. Our team is proud of what we’ve built with WARP — albeit a bit embarrassed it took us so long to get it into your hands. We hope you’ll forgive us for the delay, give WARP a try, and let us know what you think.

WARP is here (sorry it took so long)

Inside the Web Browser’s Performance API

Post Syndicated from Young Park original https://blog.cloudflare.com/browser-performance-api/

Inside the Web Browser’s Performance API

Building a beautiful, feature-rich website is easier than ever before. Not long ago, you’d have to fire up a text editor and hand-craft a lot of HTML, CSS, and JavaScript. Today, you can use WYSIWYG tools and third-party libraries that make development much simpler. The flip side of this is that it can be hard to see everything that’s going into your website — and the performance can suffer.

The good news is that modern web browsers expose lots of performance data that can help you understand how your web page performs. With the launch of Browser Insights today, we can analyze the performance from the perspective of the web browser and what the end user actually experiences. In this post, we’ll dive into how we think about performance and utilize the timing APIs in the web browser.

How web browsers measure performance

In the old days, the only way for a developer to profile performance was to intercept requests and measure the time from the beginning of the page load until the end of the load event.

Today, we can use Web APIs that are supported by modern browsers. This is part of the web standard called the Performance API. The Performance API consists of a collection of individual APIs that include:

  • Navigation Timing (for timing information related to the page and navigation)
  • Resource Timing (for timing data regarding the loading of an application’s resources)
  • Paint Timing (for timing information about paint operations during page construction)

In this blog post, we will primarily focus on the Navigation Timing API.

Inside the Performance API

To see what’s collected with the Performance API, you can open the Developer Tools in the Chrome browser and type ‘performance’ in the console tab (or type performance.timing to get direct access to the PerformanceTiming attributes).

Try expanding the Performance object by clicking the arrow before the label, then expand ‘timing’. This is the PerformanceTiming object, which includes all the timings that relate to the current page load as UNIX epoch timestamps (milliseconds). The timing attributes are not shown in load order, so for a better understanding, let’s look at the illustration provided by the W3C.

Inside the Web Browser’s Performance API
Image from https://www.w3.org/TR/navigation-timing/

As we can see from the diagram, each element (represented as a box above), in order from left to right, represents the navigation flow of the page load. Each element has attributes marking its start and end points (and some have multiple attributes!) so that we can measure the elapsed time for each phase. For example, to get the Request time, you could type a command like the one shown below into the console; in this case it comes out to 60 milliseconds.
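
Since the PerformanceTiming attributes are plain millisecond timestamps, each phase is just a subtraction. For the Request time, the console expression would be:

performance.timing.responseStart - performance.timing.requestStart
// e.g. 60 (ms between sending the request and the first response byte)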

Inside the Web Browser’s Performance API

How Cloudflare uses performance data

Once your website is proxied through Cloudflare and Browser Insights is enabled, we write and inject a JavaScript beacon into the web page. Our beacon collects metrics from the Performance API and sends them to our edge, where they can be used to understand where your website is slowing down or experiencing network problems. The reported data is shown in the Cloudflare dashboard on the Speed page as averages of each timing metric:

Inside the Web Browser’s Performance API

The metrics we surface are:

  • DNS (domainLookupEnd – domainLookupStart): How long the DNS query takes. This could appear as zero if the connection is reused or the content was stored in the local cache (memory or disk).
  • TCP (connectEnd – connectStart): How long it takes to establish a TCP connection to the server. For HTTPS, this process includes TLS negotiation time.
  • Request (responseStart – requestStart): The time elapsed between making an HTTP request and receiving the first byte of the response.
  • Response (responseEnd – responseStart): The time elapsed between the first byte and the last byte of the response. You can think of this as the resource download time.
  • Processing (domComplete – domLoading): How long it took to render the page. If this number is big, you can optimize your document architecture or resource sizes, or configure settings on the Speed page, such as Auto Minify of the source code. This phase can be broken down further with domInteractive, domContentLoadedEventStart, domContentLoadedEventEnd, and domComplete. We plan to provide more detailed analytics on this later on.
  • Load Event (loadEventEnd – loadEventStart): When the browser finishes loading its document and resources, it triggers a `load` event. This duration may be helpful to you if you have additional functions or other logic tied to the load event.
  • Total Time: The sum of the timing metrics shown on the graph. (Each of these durations can be computed directly in the browser; see the sketch after this list.)
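
A minimal sketch of deriving the same durations in the browser from the PerformanceTiming object, using only standard Navigation Timing attribute names:

const t = performance.timing

const metrics = {
  dns:        t.domainLookupEnd - t.domainLookupStart,
  tcp:        t.connectEnd - t.connectStart,
  request:    t.responseStart - t.requestStart,
  response:   t.responseEnd - t.responseStart,
  processing: t.domComplete - t.domLoading,
  loadEvent:  t.loadEventEnd - t.loadEventStart,
}

// Run this after the load event has fired; before then,
// loadEventEnd is still 0.
console.log(metrics)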

If you see any spikes or an unusual shape in the stacked line chart, you can start investigating each metric to see what is causing the problem.

For more about how to use Browser Insights, see our announcement blog post.

What’s next

In this blog post, we’ve focused on the Navigation Timing API, because it’s at the heart of our first version of Browser Insights. In the near future, we plan to incorporate metrics from some of the other APIs. For example, we can break down some of the longer timings by looking at individual resource loads, and pointing out what’s taking longer. In addition to that, we plan to track JavaScript errors, provide a way to measure A/B performance, set up monitoring/alerting based on the metrics, and so on. So stay tuned!

Introducing Browser Insights

Post Syndicated from Jon Levine original https://blog.cloudflare.com/introducing-browser-insights/

Introducing Browser Insights

Speed matters. We know that when your website or app gets faster, users have a better experience and you get more conversions and more revenue. At Cloudflare, we spend our days obsessing about speed and building new features to squeeze out as much performance as possible.

But to improve speed, you first need to measure it. That’s why we’re launching Browser Insights: a new tool that measures the performance of your website from the perspective of your users. Browser Insights lets you dive in to understand where, when, and why web pages are slow. And you can enable it today, for free, with one click.

Introducing Browser Insights

Why did we build Browser Insights?

Let’s say you run an e-commerce site, and you want to make your conversion rates better. You’ve noticed that there’s a lot of traffic from visitors in Peru, but they have worse conversion than users in North America. Maybe you theorize that it takes a long time to load your checkout page, which causes customers to drop off before checking out. How would you verify that this is happening?

There are a few ways you could do this: you could check your server logs to look at timing information, or you could load the page a few times in your browser to see what’s slow.

These approaches have a few downsides though:

  • If you only look at server-side data, you miss factors that impact the end-user experience — how long did it take for the web browser to load all the necessary scripts, execute them, and paint the page?
  • If you only measure from one computer (or a small number of them), you miss the diversity of the computing population — for example, “how does this work on a phone on a 3G connection?”

To solve these problems, we use Real User Monitoring. This gives us the best of both worlds: we can run a timer inside real web browsers. This timer captures how long it takes web pages to load, from your actual users.

How does it work?

Browser Insights can be enabled with the flip of a switch in the “Speed” section of the dashboard:

Introducing Browser Insights

When it’s enabled, we add a small snippet of JavaScript code to each HTML page load that uses the standard Performance API to collect timing info. Then we can start showing you metrics about how your web pages are performing in the real world:

Introducing Browser Insights

There’s a lot of info in this graph! At a high level, there are two main types of metrics:

  • Request-level metrics like TCP connection time, or Request time. These metrics are counted on every page load and are impacted by Internet infrastructure, like the mobile network of your end users, or the speed of your servers.
  • Page-level metrics like Page Load Time, which take into account the many requests needed to load a web page, plus the time taken to parse HTML and execute JavaScript.

For more information about what these times mean and how we chose them, see our companion blog post.
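
The beacon itself isn’t shown in this post, but a minimal sketch of the same idea, posting the Navigation Timing data to a hypothetical collection endpoint, might look like this:

// Simplified stand-in for the Browser Insights beacon; the actual
// endpoint and payload format are internal to Cloudflare.
window.addEventListener('load', () => {
  // Wait a tick so that loadEventEnd has been populated.
  setTimeout(() => {
    // PerformanceTiming has a toJSON method, so this serializes cleanly.
    const payload = JSON.stringify(performance.timing)
    // sendBeacon queues the report without blocking the page.
    navigator.sendBeacon('/hypothetical-beacon-endpoint', payload)
  }, 0)
})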

Digging into the data

In addition to seeing several metrics about your web page performance, it’s helpful to drill into the dimensions that impact performance like URL and Country. This means you can filter down to the performance of a specific page (like your home page or checkout page), and you can see the locations where your site loads the fastest and slowest.

Going back to our example above, we want to see how performance in Peru compares to North America:

Introducing Browser Insights

Sure enough, we can confirm that there’s significant traffic from Peru, but web pages take about 13 seconds to load on average — compared with just 4.2 seconds for users in the US. Theory confirmed! Now we can filter all of our metrics to just Peru to better understand what’s happening:

Introducing Browser Insights

Note that “Processing” has increased the most, all the way to 12 seconds. Request times are higher as well, likely because we are connecting to an origin server in the US. Web pages are made of many individual requests, so it makes sense that, when combined, they lead to slower load times. In this example, serving more content from cache would probably lead to significantly faster page loads.

What’s coming next?

Our launch today is just the tip of the iceberg for Browser Insights. In the near future we want to add much more information that will help you understand exactly what’s slowing down your website, and what you can do to make it faster. We plan to add:

  • More metrics and dimensions, including page-level metrics like Time to First Paint and more dimensions like browser and network type
  • Subresource analytics. The average web page loads over 100 subresources, and we can provide a waterfall chart to show exactly which one is slow.
  • A/B testing, to show you how potential configuration changes will impact the performance of your own traffic
  • Error collection to monitor issues at the network layer, in JavaScript, etc
  • Alerting so that you know when performance falls below a pre-defined threshold
  • Insights powered by Cloudflare that tell you why something might be slow – for example, how your cache hit ratio impacts page load time

Protecting user privacy

Cloudflare’s mission to help build a better Internet is based on the importance we place on establishing trust with our customers, our customers’ end users, and the Internet community globally. We have a transparent business model that aligns with the interests of our customers — we make money from protecting and speeding up our customers’ Internet properties. We do not sell our customers’ (or their end users’) data.

Browser Insights requires that end users’ browsers report timing information back to Cloudflare. We designed Browser Insights so that it reports only the bare-minimum information needed to show our customers how their websites are performing. The only metrics Browser Insights collects are about timing. We do not track individual end users across our customers’ Internet properties. We encourage you to open up the Inspector in your favorite web browser to see what we’re sending back!

Try Browser Insights today

Last May we announced the all-new Speed Page. Our mission with the Speed page is to show you how fast your website is, and what you can do to make it faster. Today, we’re excited to announce that the new Speed Page is available for everyone!

Browser Insights will be available on the Speed page in early access and we’ll be working hard to bring it to everyone as soon as possible in the coming weeks. Watch this space for updates!

Introducing Browser Insights


Subscribe to this blog for daily updates on all our Birthday Week announcements.

Cleaning up bad bots (and the climate)

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cleaning-up-bad-bots/

Cleaning up bad bots (and the climate)

From the very beginning Cloudflare has been stopping malicious bots from scraping websites or misusing APIs. Over time we’ve improved our bot detection methods and deployed large machine learning models that are able to distinguish real traffic (be it from humans or apps) from malicious bots. We’ve also built a large catalog of good bots to detect things like helpful indexing by search engines.

But it’s not enough. Malicious bots continue to be a problem on the Internet and we’ve decided to fight back. From today customers have the option of enabling “bot fight mode” in their Cloudflare Dashboard.

Cleaning up bad bots (and the climate)

Once enabled, when we detect a bad bot, we will do three things: (1) we’re going to disincentivize the bot maker economically by tarpitting them, including requiring them to solve a computationally intensive challenge that will require more of their bot’s CPU; (2) for Bandwidth Alliance partners, we’re going to hand the IP of the bot to the partner and get the bot kicked offline; and (3) we’re going to plant trees to make up for the bot’s carbon cost.

Cleaning up bad bots (and the climate)

Malicious bots harm legitimate web publishers and applications, hurt hosting providers by misusing resources, and doubly hurt the planet through the electricity used to power and cool the servers of both the bots and their victims.

Enough is enough. Our goal is nothing short of making it no longer viable to run a malicious bot on the Internet. And we think, with our scale, we can do exactly that.

How Cloudflare Detects Bots

Cloudflare’s secret sauce (ok, not very secret sauce) is our vast scale.  We currently handle traffic for over 20 million Internet properties ranging from the smallest personal web sites, through backend APIs for popular apps and IoT devices, to some of the best known names on the Internet (including 10% of the Fortune 1000).

This scale gives us a huge advantage in that we see an enormous amount and variety of traffic allowing us to build large machine learning models of Internet behavior. That scale and variety allows us to test new rules and models quickly and easily.

Our bot detection breaks down into four large components:

  • Identification of well known legitimate bots;
  • Hand written rules for simple bots that, however simple, get used day in, day out;
  • Our Bot Activity Detector model that spots the behavior of bots based on past traffic and blocks them; and
  • Our Trusted Client model that spots whether an HTTP User-Agent is what it says it is.

In addition, Gatebot, our DDoS mitigation system, fingerprints DDoS bots and blocks their traffic at the packet level. Beyond Gatebot, customers also have access to our Firewall Rules where they can write granular rules to block very specific attack types.

Another model allows us to determine whether an IP address belongs to a VPN endpoint, a home broadband subscriber, a company using NAT, or a hosting or cloud provider. It’s this last group that “bot fight mode” targets.

Today, Cloudflare challenges over 3 billion bot requests per day. Some of those bots are about to have a really bad time.

How Cloudflare Fights Bots

The cost of launching a bot attack consists of the expense of the CPU time that powers the attack. If our models show that the traffic is coming from a bot, and it’s on a hosting or a cloud provider, we’ll deploy CPU-intensive code to make the bot writer expend more CPU and slow them down. By forcing the attacker to use more CPU, we increase their costs during an attack and deter future ones.
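
Cloudflare hasn’t published exactly what this CPU-intensive code does, but a classic proof-of-work puzzle illustrates the economics: a challenge that is cheap for us to verify but expensive for the bot to solve. A hypothetical sketch:

// Illustrative only, not Cloudflare's actual challenge: the client must
// find a suffix such that SHA-256(nonce + suffix) starts with
// `difficulty` zero hex digits. Each extra digit multiplies the
// expected work by 16, while verification stays a single hash.
async function solveChallenge(nonce, difficulty) {
  const encoder = new TextEncoder()
  for (let suffix = 0; ; suffix++) {
    const digest = await crypto.subtle.digest(
      'SHA-256', encoder.encode(nonce + suffix))
    const hex = [...new Uint8Array(digest)]
      .map(b => b.toString(16).padStart(2, '0'))
      .join('')
    if (hex.startsWith('0'.repeat(difficulty))) return suffix
  }
}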

This is one of the many so-called “tarpitting” techniques we’re now deploying across our network to change the economics of running a malicious bot. Malicious bot operators, be warned: if you target resources behind Cloudflare’s IP space, we’re going to make you spin your wheels.

Every minute we tie malicious bots up is a minute they’re not harming the Internet as a whole. This means we aren’t just protecting our customers but everyone online currently terrorized by malicious bots. The spirit of Cloudflare’s Birthday Week has always been about giving back to the Internet as a whole, and we can think of no better gift than ridding the Internet of malicious bots.

Beyond just wasting bots’ time, we want to get them shut down. If the infrastructure provider hosting the bot is part of the Bandwidth Alliance, we’ll share the bot’s IP address so they can shut down the bot completely. The Bandwidth Alliance allows us to reduce transit costs with partners and, with this launch, also helps us work together with them to make the Internet safer for legitimate users.

Generally, everyone we ran Bot Fight Mode by thought it was a great idea. The only objection we heard was that as we start forcing bots to solve CPU intensive challenges in the short term, before they just give up — which we think is inevitable in the long term — we may raise carbon emissions. To combat those emissions we’re committed to estimating the extra CPU utilized by these bots, calculating their carbon cost, and then planting trees to compensate and build a better future.

Planting Trees

Dealing with climate change requires multiple efforts by people and companies. Cloudflare announced earlier this year that we had expanded our purchasing of Renewable Energy Certificates (that previously covered our North American operations) to our entire global network of 194 cities.

To figure out how much tree planting to do, we need to calculate the cost of the extra CPU used when making a bot work hard. Here’s how that will work.

Using a figure of 450 kg CO2/year (from https://www.goclimateneutral.org/blog/the-carbon-footprint-of-servers/) for the type of server a bad bot might use (a cloud server using a non-renewable energy source), we get about 8 kg CO2/year per CPU core. We are able to measure the time bots spend burning CPU, so we can directly estimate the amount of CO2 emitted by our fight back.

According to One Tree Planted, a single mature tree can absorb about 21kg CO2/year. So, very roughly, each tree can absorb a year’s worth of CO2 from 2.5 CPU cores.

Since trees take time to mature, and given the scale of the climate change challenge, we’re going to pay to overplant trees. For every tree we calculate we’d need to plant to sequester the CO2 emissions from fighting bots, we’re going to donate $25 to One Tree Planted to plant 25 trees.
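
As a rough sketch of that arithmetic, using the estimates above (8 kg CO2 per CPU core-year, 21 kg CO2 absorbed per tree-year, 25x overplanting; the input figure is hypothetical):

// Constants from the estimates cited above.
const kgCO2PerCoreYear = 8
const kgCO2PerTreeYear = 21
const overplantFactor = 25

function treesToPlant(botCoreYears) {
  const emittedKg = botCoreYears * kgCO2PerCoreYear
  const treesNeeded = Math.ceil(emittedKg / kgCO2PerTreeYear)
  return treesNeeded * overplantFactor
}

// e.g. 1,000 core-years of bot CPU: 8,000 kg CO2, 381 trees needed,
// 9,525 trees actually planted
console.log(treesToPlant(1000))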

And, of course, we’ll be handing the IPs of bad bots to our Bandwidth Alliance partners to get the bots shut down and remove their carbon cost completely. In the past, the tech community largely defeated email spammers and DDoS-for-hire services by making their efforts fruitless; we think this is the right strategy to now defeat malicious bots once and for all.

Who Do Bots Hurt?

Malicious bots can cause significant harm to our customers’ infrastructure and often result in bad experiences for our customers’ users.

For example, a recent customer was being crippled by a credential stuffing attack that not only attempted to compromise their users’ accounts, but did so in such significant volume that it effectively caused a small-scale Denial of Service on all aspects of the customer’s website.

The malicious bot was overloading the customer’s conventional threat prevention infrastructure and we rapidly onboarded them as an Under Attack customer. As a part of the onboarding, we identified that the attack could be specifically thwarted using our Bot Management product while not impacting any legitimate user traffic.

Another trend we have seen is the increasing combination of bots with botnets, particularly in the world of inventory hoarding bots. These bot operators’ motivation and willingness to spend are quite high.

The targets are generally goods in limited supply and high in demand and value. Think sneakers, concert tickets, airline seats, and popular short-run Broadway musicals. Bot operators who are able to purchase those items at retail can charge massive premiums in aftermarket sales. When the operator identifies a target site, such as an ecommerce retailer, and a specific item, such as a new pair of sneakers going on sale, they can purchase time on the new Residential Proxy as a Service market to gain access to end-user machines and (relatively) clean IPs from which to launch their attack.

They then utilize sophisticated techniques and triggers to change the characteristics of the machine, network, and software used to generate the attack, drawing on a very wide array of options and combinations and thwarting systems that rely on repetition or known patterns. This type of attack hurts multiple targets as well: the ecommerce site has real, frustrated users who can’t purchase the in-demand item; the real users lose out on inventory to an attacker who is just there to skim off the largest profit possible; and the unwitting users who are part of the botnet have their resources, such as their home broadband connection, used without their consent or knowledge.

The bottom line is that bots hurt companies and their customers.

Summary

Cloudflare has fought malicious bots from the very beginning, and over time has deployed more and more sophisticated methods to block them. Using the power of the over 20 million Internet properties we protect and accelerate, and our visibility into networks and users around the world, we have built machine learning models that sort the bad bots from the good and block the bad.

But bots continue to be a problem, and our new bot fight mode will directly disincentivize bot writers from attacking our customers. At the same time, we don’t want to contribute to climate change, so we are offsetting the carbon cost of bots by planting trees to absorb carbon and help build a better future (and Internet).

Cleaning up bad bots (and the climate)

Welcome to Birthday Week 2019

Post Syndicated from Michelle Zatlyn original https://blog.cloudflare.com/birthday-week-2019/

Welcome to Birthday Week 2019

Welcome to Birthday Week 2019

September has always been a special month for Cloudflare. Nine years ago — on September 27th — we launched Cloudflare. And, each year since, we’ve celebrated our birthday with a week full of new products and innovations that support our mission of helping to build a better Internet.

Our mission guides everything we do. One of the most intentional words in our mission is ‘helping’. Building an Internet that can meet the world’s needs cannot be done by any one company or individual; rather, it takes a global community — from nonprofit organizations and businesses to governments and individuals — collaborating to deliver new standards, technologies, and innovations. We believe Cloudflare should be an active participant in the community and help where we can and should.

Our customers and partners are an active part of the community. I often say that customers are one of my favorite parts of my job (our team is my other favorite part). Our customers give us feedback all the time about what they’d like to see to make their Internet properties more secure, more performant and more reliable. Our partners bring forward standards to help make the Internet run more smoothly. For Birthday Week this year, you are going to see many of these come to life.

No Spoilers Here

That sense of community is the spirit of this year’s Birthday Week. Here is a sneak peek of what to expect this week.

  • Monday: As cyberthreats increase, we are going on the offensive. Cybersecurity has often been called an arms race between good guys and bad guys. To kick off the week, we’re making an announcement that will give our customers the upper hand on one of the most pervasive threats on the Internet today. Already, we block 44 billion cyber threats per day. With Monday’s launch, we’ll give anyone in our community the chance to join us in this fight. If you want to stop bad actors in their tracks, you’ll love tomorrow’s announcement.
  • Tuesday: Insights = the big picture on performance. One thing we hear clearly from our customers is that they need better insights into how their products are performing — from their customers’ point of view. And, with Cloudflare’s vast network, we have an opportunity to meet this need because we see every network and every user in the world. We will announce a powerful new tool that will help our customers gather better data, allowing them to then gauge the speed and performance of their Internet properties. Tune into our blog on Tuesday for more.
  • Wednesday: The wait is over. Our product and engineering teams have been working round-the-clock to build a new experience that makes the Internet faster and safer for everyone. If I were to say anything more, it would surely give it away so I’ll leave it at that — you’ll just need to tune in to the blog on Wednesday to find out.
  • Thursday: Advancing the Internet, one protocol at a time. New protocols are the underpinning of building a modern Internet. But, to do this effectively (or at all), requires a community-based approach. On Thursday, we’re excited to launch a key new protocol with some truly remarkable partners and leaders.
  • Friday: Giving developers more agility and independence. We love developers. Since we first launched Cloudflare Workers two years ago, we’ve seen significant adoption and new innovative solutions that our customers are building with our serverless platform. This week we’ll announce a new Workers service that will let developers easily deploy new web pages and content with Workers.

There’s a lot that goes into helping build a better Internet — whether it’s improving privacy, giving developers the tools and insights they need to better serve their customers, or teaming up with our ecosystem partners to make our new products more accessible for everyone. Community is the special ingredient to making this happen. We can’t wait to share what we’ve been working on.

—  Michelle

PS: If you’re not already subscribing to the blog, sign up now to receive daily updates that will be sent to your inbox each day this week. We’ll also be hosting a Birthday Week webinar on Thursday, October 3rd – register to get a recap of the week’s announcements.

How We Design Features for Wrangler, the Cloudflare Workers CLI

Post Syndicated from Ashley M Lewis original https://blog.cloudflare.com/how-we-design-features-for-wrangler/

How We Design Features for Wrangler, the Cloudflare Workers CLI

How We Design Features for Wrangler, the Cloudflare Workers CLI

The most recent update to Wrangler, version 1.3.1, introduces important new features for developers building Cloudflare Workers — from built-in deployment environments to first class support for Workers KV. Wrangler is Cloudflare’s first officially supported CLI. Branching into this field of software has been a novel experience for us engineers and product folks on the Cloudflare Workers team.

As part of the 1.3.1 release, the folks on the Workers Developer Experience team dove into the thought process that goes into building out features for a CLI and learning to think like our users. Because while we wish building a CLI were as easy as our teammate Avery tweeted…


… it brings design challenges that many of us have never encountered. To overcome these challenges successfully requires deep empathy for users across the entire team, as well as the ability to address ambiguous questions related to how developers write Workers.

Wrangler, meet Workers KV

Our new KV functionality introduced a host of new features, from creating KV namespaces to bulk uploading key-value pairs for use within a Worker. This new functionality primarily consisted of logic for interacting with the Workers KV API, meaning that the technical work under “the hood” was relatively straightforward. Figuring out how to cleanly represent these new features to Wrangler users, however, became the fundamental question of this release.

Designing the invocations for new KV functionality unsurprisingly required multiple iterations, and taught us a lot about usability along the way!

Attempt 1

For our initial pass, the path originally seemed so obvious. (Narrator: It really, really wasn’t.) We hypothesized that having Wrangler support familiar commands — like ls and rm — would be a reasonable way to map well-known command line tools onto Workers KV, and ended up with the following set of invocations:

# creates a new KV Namespace
$ wrangler kv add myNamespace

# sets a string key that doesn't expire
$ wrangler kv set myKey="someStringValue"

# sets many keys
$ wrangler kv set myKey="someStringValue" myKey2="someStringValue2" ...

# sets a volatile (expiring) key that expires in 60 s
$ wrangler kv set myVolatileKey=path/to/value --ttl 60s

# deletes three keys
$ wrangler kv rm myNamespace myKey1 myKey2 myKey3

# lists all your namespaces
$ wrangler kv ls

# lists all the keys for a namespace
$ wrangler kv ls myNamespace

# removes all keys from a namespace, then removes the namespace		
$ wrangler kv rm -r myNamespace

While these commands invoked familiar shell utilities, they made interacting with your KV namespace feel a lot more like interacting with a filesystem than with a key-value store. The juxtaposition of a well-known command like ls with a non-command, set, was confusing. Additionally, preexisting command line tools did not map 1-1 onto KV actions (especially rm -r: there is no need to recursively delete a KV namespace like a directory when you can just delete the namespace!)

This draft also surfaced use cases we needed to support: namely, easy bulk uploads from a file. Requiring users to enter every KV pair on the command line instead of reading from a file of key-value pairs was also a non-starter.

Finally, these KV subcommands caused confusion about which actions applied to which resources. For example, the command for listing your Workers KV namespaces looked a lot like the command for listing keys within a namespace.

Going forward, we needed to meet these newly identified needs.

Attempt 2

Our next attempt shed the shell utilities in favor of simple, declarative subcommands like create, list, and delete. It also addressed the need for easy-to-use bulk uploads by allowing users to pass a JSON file of keys and values to Wrangler.

# create a namespace
$ wrangler kv create namespace <title>

# delete a namespace
$ wrangler kv delete namespace <namespace-id>

# list namespaces
$ wrangler kv list namespace

# write key-value pairs to a namespace, with an optional expiration flag
$ wrangler kv write key <namespace-id> <key> <value> --ttl 60s

# delete a key from a namespace
$ wrangler kv delete key <namespace-id> <key>

# list all keys in a namespace
$ wrangler kv list key <namespace-id>

# write bulk kv pairs. can be json file or directory; if dir keys will be file paths from root, value will be contents of files
$ wrangler kv write bulk ./path/to/assets

# delete bulk pairs; same input functionality as above
$ wrangler kv delete bulk ./path/to/assets

Given the breadth of new functionality we planned to introduce, we also built out a taxonomy of new subcommands to ensure that invocations for different resources — namespaces, keys, and bulk sets of key-value pairs — were consistent:

How We Design Features for Wrangler, the Cloudflare Workers CLI

Designing invocations with taxonomies became a crucial part of our development process going forward, and gave us a clear look at the “big picture” of our new KV features.

This approach was closer to what we wanted. It offered bulk put and bulk delete operations that would read multiple key-value pairs from a JSON file. After specifying an action subcommand (e.g. delete), users now explicitly stated which resource the action applied to (namespace, key, bulk), reducing confusion about which action applied to which KV component.

This draft, however, was still not as explicit as we wanted it to be. The distinction between operations on namespaces versus keys was not as obvious as we wanted, and we still feared the possibility of different delete operations accidentally producing unwanted deletions (a possibly disastrous outcome!)

Attempt 3

We really wanted to help differentiate where in the hierarchy of structs a user was operating at any given time. Were they operating on namespaces, keys, or bulk sets of keys in a given operation, and how could we make that as clear as possible? We looked around, comparing the ways CLIs from kubectl to Heroku’s handled commands affecting different objects. We landed on a pleasing pattern inspired by Heroku’s CLI: colon-delimited command namespacing:

plugins:install PLUGIN    # installs a plugin into the CLI
plugins:link [PATH]       # links a local plugin to the CLI for development
plugins:uninstall PLUGIN  # uninstalls or unlinks a plugin
plugins:update            # updates installed plugins

So we adopted kv:namespace, kv:key, and kv:bulk to semantically separate our commands:

# namespace commands operate on namespaces
$ wrangler kv:namespace create <title> [--env]
$ wrangler kv:namespace delete <binding> [--env]
$ wrangler kv:namespace rename <binding> <new-title> [--env]
$ wrangler kv:namespace list [--env]
# key commands operate on individual keys
$ wrangler kv:key write <binding> <key>=<value> [--env | --ttl | --exp]
$ wrangler kv:key delete <binding> <key> [--env]
$ wrangler kv:key list <binding> [--env]
# bulk commands take a user-generated JSON file as an argument
$ wrangler kv:bulk write <binding> ./path/to/data.json [--env]
$ wrangler kv:bulk delete <binding> ./path/to/data.json [--env]

And ultimately ended up with this topology:

[Diagram: the final kv:namespace, kv:key, and kv:bulk command topology]

We were even closer to our desired usage pattern; the object acted upon was explicit to users, and the action applied to the object was also clear.
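For instance, a hypothetical session with the new commands reads naturally from resource to action to arguments (the names here are invented purely for illustration):

$ wrangler kv:namespace create "CONTENTS"
$ wrangler kv:key write CONTENTS my_key=my_value
$ wrangler kv:key list CONTENTS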

There was one usage issue left. Supplying a namespace-id (the field that specifies which Workers KV namespace an action applies to) required users to track down their unwieldy KV namespace ID (a string like 06779da6940b431db6e566b4846d64db) and provide it on the command line under the namespace-id option. This namespace-id value is what our Workers KV API expects in requests, but it would be cumbersome for users to dig up and provide, let alone use frequently.

The solution we came to takes advantage of the wrangler.toml present in every Wrangler-generated Worker. To publish a Worker that uses a Workers KV store, the following field is needed in the Worker’s wrangler.toml:

kv-namespaces = [
	{ binding = "TEST_NAMESPACE", id = "06779da6940b431db6e566b4846d64db" }
]

This field specifies a Workers KV namespace that is bound to the name TEST_NAMESPACE, such that a Worker script can access it with logic like:

TEST_NAMESPACE.get("my_key");

We also decided to take advantage of this wrangler.toml field to allow users to specify a KV binding name instead of a KV namespace id. Upon providing a KV binding name, Wrangler could look up the associated id in wrangler.toml and use that for Workers KV API calls.

Wrangler users performing actions on KV namespaces could simply provide --binding TEST_NAMESPACE for their KV calls and let Wrangler retrieve the associated ID from wrangler.toml. Users can still specify --namespace-id directly if they do not have namespaces specified in their wrangler.toml.
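For example, given the wrangler.toml above, the two styles of invocation might look like this (hypothetical key and value, shown purely for illustration):

# let Wrangler look up the ID for TEST_NAMESPACE in wrangler.toml
$ wrangler kv:key write my_key=my_value --binding TEST_NAMESPACE

# or supply the raw namespace ID directly
$ wrangler kv:key write my_key=my_value --namespace-id 06779da6940b431db6e566b4846d64db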

Finally, we reached our happy point: Wrangler’s new KV subcommands were explicit, offered functionality for both individual and bulk actions with Workers KV, and felt ergonomic for Wrangler users to integrate into their day-to-day operations.

Lessons Learned

Throughout this design process, we identified the following takeaways to carry into future Wrangler work:

  1. Taxonomies of your CLI’s subcommands and invocations are a great way to ensure consistency and clarity. CLI users tend to anticipate similar semantics and workflows within a CLI, so visually documenting all paths for the CLI can greatly help with identifying where new work can be consistent with older semantics. Drawing out these taxonomies can also expose missing features that seem like a fundamental part of the “big picture” of a CLI’s functionality.
  2. Use other CLIs for inspiration and sanity checking. Drawing logic from popular CLIs helped us confirm our assumptions about what users like, and learn established patterns for complex CLI invocations.
  3. Avoid logic that requires passing in raw ID strings. Testing CLIs a lot means that remembering and re-pasting ID values gets very tedious very quickly. Emphasizing a set of purely human-readable CLI commands and arguments makes for a far more intuitive experience. When possible, taking advantage of configuration files (like we did with wrangler.toml) offers a straightforward way to provide mappings of human-readable names to complex IDs.

We’re excited to continue using these design principles we’ve learned and documented as we grow Wrangler into a one-stop Cloudflare Workers shop.

If you’d like to try out Wrangler, check it out on GitHub and let us know what you think! We would love your feedback.


Announcing the General Availability of API Tokens

Post Syndicated from Garrett Galow original https://blog.cloudflare.com/api-tokens-general-availability/


APIs at Cloudflare


Today we are announcing the general availability of API Tokens – a scalable and more secure way to interact with the Cloudflare API. As part of making a better Internet, Cloudflare strives to simplify manageability of a customer’s presence at the edge. Part of the way we do this is by ensuring that all of our products and services are configurable by API. Customers ranging from partners to enterprises to developers want to automate management of Cloudflare. Sometimes that is done via our API directly, and other times it is done via open source software we help maintain, like our Terraform provider or Cloudflare-Go library. It is critical that customers who are automating management of Cloudflare can keep their Cloudflare services as secure as possible.

Least Privilege and Why it Matters

Securing software systems is hard. Limiting what a piece of software can do is a good defense against mistakes or malicious actions having greater impact than they otherwise could. The principle of least privilege helps guide how much access a given system should have to perform actions. As originally formulated by Jerome Saltzer: “Every program and every privileged user of the system should operate using the least amount of privilege necessary to complete the job.” In the case of Cloudflare, many customers have various domains routing traffic through many different services. If a bad actor gains unauthorized access to a system, they can use whatever access that system has to cause further damage or steal additional information.

Let’s see how the capabilities of API Tokens fit into the principle of least privilege.

About API Tokens

API Tokens provide three main capabilities:

  1. Scoping API Tokens by Cloudflare resource
  2. Scoping API Tokens by permission
  3. The ability to provision multiple API Tokens

Let’s break down each of these capabilities.

Scoping API Tokens by Cloudflare Resource

Cloudflare separates service configuration by zone, which typically equates to a domain. Additionally, some customers have multiple accounts, each with many zones. It is important that when granting API access to a service, it only has access to the account’s resources and zones that are pertinent for the job at hand. API Tokens can be scoped to cover only specific accounts and specific zones. One common use case is if you have a staging zone and a production zone: an API Token can be limited to affect only the staging zone, with no access to the production zone.

Scoping API Tokens by Permission

Being able to scope an API Token to a specific zone is great, but within one zone there are many different services that can be configured: firewall rules, page rules, and load balancers, just to name a few. If a customer has a service that should only be able to create new firewall rules in response to traffic patterns, then also allowing that service to change DNS records is a violation of least privilege. API Tokens allow you to scope each token to specific permissions. Multiple permissions can be combined to create custom tokens to fit specific use cases.

Multiple API Tokens

If you use Cloudflare to protect and accelerate multiple services, then you may be making API changes to Cloudflare from multiple locations – different servers, VMs, containers, or workers. Being able to create an API Token per service means each service is insulated from changes to another. If one API Token is leaked or needs to be rolled, there won’t be any impact to the other services’ API Tokens. The capabilities mentioned previously also mean that each service can be scoped to exactly the actions and resources necessary. This allows customers to better realize the practice of least privilege for accessing Cloudflare by API.

Now let’s walk through how to create an API Token and use it.

Using API Tokens

To create your first API Token go to the ‘API Tokens’ section of your user profile which can be found here: dash.cloudflare.com/profile/api-tokens

1. On this page, you will find a list of all of your API Tokens, in addition to your Global API Key and Origin CA Key.

API Tokens Getting Started – Create Token

To create your first API Token, select ‘Create Token’.


2. On the create screen there are two ways to create your token. You can create it from scratch through the ‘Custom’ option or you can start with a predefined template by selecting ‘Start with a template’.

API Token Template Selection

For this case, we will use the ‘Edit zone DNS’ template to create an API Token that can edit a single zone’s DNS records.


3. Once the template is selected, we need to pick a zone for the API Token to be scoped to. Notice that the DNS Edit permission was already pre-selected.

Specifying the zone for which the token will be able to control DNS

In this case, ‘garrettgalow.com’ is selected as the Cloudflare zone that the API Token will be able to edit DNS records for.


4. Once I select ‘Continue to summary’, I’m given a chance to review my selection. In this case the resources and permissions are quite simple, but this gives you a chance to make sure you are granting the API Token exactly the correct amount of privilege before creating it.

Token Summary – confirmation


5. Once created, we are presented with the API Token. This screen is the only time you will be shown the secret, so be sure to put it in a safe place! Anyone with this secret can perform the granted actions on the resources specified, so protect it like a password. In the screenshot below I have blacked out the secret for obvious reasons. If you happen to lose the secret, you can always regenerate it from the API Tokens table, so you don’t have to configure all the permissions again.

Token creation completion screen with the token secret

In addition to the secret itself, this screen provides an example curl request that can be used to verify that the token was successfully created, and shows how the token should be used for any direct HTTP requests. With API Tokens we now follow the standard Authorization: Bearer scheme. Calling that API, we see a successful response telling us that the token is valid and active:

~$ curl -X GET "https://api.cloudflare.com/client/v4/user/tokens/verify" \
>      -H "Authorization: Bearer vh9awGupxxxxxxxxxxxxxxxxxxx" \
>      -H "Content-Type:application/json" | jq

{
  "result": {
    "id": "ad599f2b67cdccf24a160f5dcd7bc57b",
    "status": "active"
  },
  "success": true,
  "errors": [],
  "messages": [
    {
      "code": 10000,
      "message": "This API Token is valid and active",
      "type": null
    }
  ]
}
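Once verified, the same Authorization header authenticates any request the token is scoped for. As a hypothetical example (placeholder zone ID and secret), the ‘Edit zone DNS’ token created above could list the DNS records of its permitted zone:

~$ curl -X GET "https://api.cloudflare.com/client/v4/zones/<zone-id>/dns_records" \
>      -H "Authorization: Bearer <token-secret>" \
>      -H "Content-Type: application/json"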

What’s coming next

For anyone using the Cloudflare API, we recommend moving from API Keys, their predecessor, to API Tokens going forward. With this announcement, our Terraform provider, Cloudflare-Go library, and WordPress plugin have all been updated for API Token compatibility. Other libraries will receive updates soon. Both API Tokens and API Keys will be supported for the time being so that customers can migrate safely. We have more planned capabilities for API Tokens to further safeguard how and when tokens are used, so stay tuned for future announcements!
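In practice, migrating a script is usually just a header change: replace the legacy X-Auth-Email and X-Auth-Key headers with a single Authorization header carrying a suitably scoped token (placeholder values shown):

# before: Global API Key authentication
-H "X-Auth-Email: user@example.com" -H "X-Auth-Key: <global-api-key>"

# after: a least-privilege API Token
-H "Authorization: Bearer <token-secret>"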

Let us know what you think and what you’d like to see next regarding API security on the Cloudflare Community.

Supercharging Firewall Events for Self-Serve

Post Syndicated from Alex Cruz Farmer original https://blog.cloudflare.com/supercharging-firewall-events-for-self-serve/


Today, I’m very pleased to announce the release of a completely overhauled version of our Firewall Event log to our Free, Pro and Business customers. This new Firewall Events log is now available in your Dashboard, and you are not required to do anything to receive this new capability.


No more modals!

We have done away with those pesky modals, providing a much smoother user experience. To review more detailed information about an event, you simply click anywhere on the event list row.


In the expanded view, you are provided with all the information you may need to identify or diagnose issues with your Firewall or find more details about a potential threat to your application.

Additional matches per event

Cloudflare has several Firewall features to give customers granular control of their security. With this control comes some complexity when debugging why a request was stopped by the Firewall. To help clarify what happened, we have provided an “Additional matches” count at the bottom for events triggered by multiple services or rules for the same request. Clicking the number expands a list showing each rule and service along with the corresponding action.


Search for any field within a Firewall Event

This is one of my favourite parts of our new Firewall Event Log. Many of our customers have expressed their frustration with the difficulty of pinpointing specific events. This is where our new search capabilities come into their own. Customers can now filter and freeform search for any field that is visible in a Firewall Event!

Let’s say you want to find all the requests originating from a specific ISP or country where your Firewall Rules issued a JavaScript challenge. There are two different ways to do this in the UI.

Firstly, when in the detail view, you can create an include or exclude filter for that field value.


Secondly, you can create a freeform filter using the “+ Add Filter” button at the top, or edit one of the already filtered fields:

[Screenshot: creating a freeform filter with the “+ Add Filter” button]

As illustrated above, with our WAF Managed Rules enabled in log-only mode, we can see all the rules that would have triggered if this had been a legitimate attack. This allows you to confirm that your configuration is working as expected.

Scoping your search to a specific date and time

In our old Firewall Event log, finding an event meant paging through many screens to reach a specific date. The last major change we have added is the ability to select a time window and view events between two points in time over the last two weeks. In the time selection window, Free and Pro customers can choose a 24 hour window, and our Business customers can view up to 72 hours.


We want your feedback!

We need your help! Please feel free to leave any feedback on our Community forums, or open a Support ticket with any problems you find. Your feedback is critical to our product improvement process, and we look forward to hearing from you.

Live Preview: Build and Test Workers Faster with Wrangler CLI 1.2.0

Post Syndicated from Matt Alonso original https://blog.cloudflare.com/live-preview-build-and-test-workers-faster-with-wrangler-cli-1-2-0/


As part of my internship on the Workers Developer Experience team, I set out to polish the Wrangler CLI for Cloudflare Workers. If you’re not familiar with Workers, the premise is quite simple: Write a bit of Javascript that takes in an HTTP request, does some processing, and spits out a response. The magic lies in where your Workers scripts run: on Cloudflare’s edge network, which spans 193 cities in more than 90 countries. Workers can be used for nearly anything from configuring Cloudflare caching behavior to building entire serverless web applications. And, you don’t have to worry about operations at all.

I was excited to focus on Wrangler, because Wrangler aims to make developing and publishing Workers projects a pleasant experience for everyone, whether you’re a solo dev working on the next big thing, or an engineer at a Fortune 100 enterprise. The whole point of serverless is about reducing friction, and Wrangler reflects that ethos.

However, when I started at Cloudflare in early June, some parts of the development experience still needed some love. While working on a new WASM tutorial for the Workers documentation, I noticed a storm brewing in my browser…

[Screenshot: a browser window crowded with preview tabs]

Wrangler lets you test your Workers project with a subcommand called wrangler preview, and every time I called it to test a new change it opened a new tab. Fast iteration is the most crucial part of a good developer experience, and while the preview was fast, things were getting messy. I was fighting my tooling, having to keep track of the latest preview tab every time I wanted to test a new change. I knew that if I was annoyed about this, others would be too.

So, I thought about what our customers wanted: similarity with tooling that they already used. I set out to create an experience inspired by `webpack-dev-server` and other similar watch-and-build tools, where you would have a single tab that would refresh live with your latest changes. However, I knew that getting changes into the Workers runtime to achieve this goal would be a tall order for week 2 of my internship, so I started thinking about solutions to send updates directly to the previewer.

Wrangler is written in Rust, so I was able to utilize the crates.io ecosystem while developing this feature. I used the notify crate, which provides a cross-platform abstraction layer over the various file system event APIs provided by major OSes. However, there are some gotchas when implementing a file watcher that triggers a build and upload: you can’t simply trigger a build after every filesystem event, as a single file save can emit several events in quick succession depending on which editor you use! To prevent wasteful builds, I implemented a cooldown period, which only triggers the build process when no new file system events have been detected for at least 2 seconds. Rust’s rich standard library makes implementing concurrent behaviors like this very elegant:

/* rx.recv_timeout returns Ok if there was an event on the rx channel,
 * or Err if the cooldown period has passed with no new events. The
 * `while let Ok(_)` loop therefore restarts the cooldown on each new
 * event, and ends once the channel has stayed quiet for the full
 * cooldown period. */
while let Ok(_) = rx.recv_timeout(cooldown) {
  message::working("Detected change during cooldown...");
}

Another challenge was handling communication with the previewer. I settled on an unconventional application of WebSockets: opening a connection on localhost that allows a browser application to communicate with the Wrangler CLI running on the local machine. I coordinated with the Workers UI team to get my WebSocket client added to the preview UI, and with the security team to pass a security review for the feature, making sure script contents were properly protected from exposure.

This was the result:

[Demo: a single preview tab refreshing live with each change]

This is what Developer Experience is all about. You should feel like 💆🏻‍♀️💆🏽‍♂️ when using Wrangler, not like 😡. If this isn’t the case, we want to hear about it.

Live Preview was shipped in the 1.2.0 release of Wrangler, exposed under wrangler preview --watch. It works for all Wrangler projects, even ones that use WebAssembly.

And to the Workers Developer Experience team, Dubs, Ashley, Avery, Gabbi, Kristian, Sven, and Victoria: thank you. Y’all are motivated, talented, and I genuinely had fun every day this summer.

Magic Transit makes your network smarter, better, stronger, and cheaper to operate

Post Syndicated from Rustam Lalkaka original https://blog.cloudflare.com/magic-transit/


Today we’re excited to announce Cloudflare Magic Transit. Magic Transit provides secure, performant, and reliable IP connectivity to the Internet. Out-of-the-box, Magic Transit deployed in front of your on-premise network protects it from DDoS attack and enables provisioning of a full suite of virtual network functions, including advanced packet filtering, load balancing, and traffic management tools.


Magic Transit is built on the standards and networking primitives you are familiar with, but delivered from Cloudflare’s global edge network as a service. Traffic is ingested by the Cloudflare Network with anycast and BGP, announcing your company’s IP address space and extending your network presence globally. Today, our anycast edge network spans 193 cities in more than 90 countries around the world.

Once packets hit our network, traffic is inspected for attacks, filtered, steered, accelerated, and sent onward to the origin. Magic Transit will connect back to your origin infrastructure over Generic Routing Encapsulation (GRE) tunnels, private network interconnects (PNI), or other forms of peering.

Enterprises are often forced to pick between performance and security when deploying IP network services. Magic Transit is designed from the ground up to minimize these trade-offs: performance and security are better together. Magic Transit deploys IP security services across our entire global network. This means no more diverting traffic to small numbers of distant “scrubbing centers” or relying on on-premise hardware to mitigate attacks on your infrastructure.

We’ve been laying the groundwork for Magic Transit for as long as Cloudflare has been in existence, since 2010. Scaling and securing the IP network Cloudflare is built on has required tooling that would have been impossible or exorbitantly expensive to buy. So we built the tools ourselves! We grew up in the age of software-defined networking and network function virtualization, and the principles behind these modern concepts run through everything we do.

When we talk to our customers managing on-premise networks, we consistently hear a few things: building and managing their networks is expensive and painful, and those on-premise networks aren’t going away anytime soon.

Traditionally, CIOs trying to connect their IP networks to the Internet do this in two steps:

  1. Source connectivity to the Internet from transit providers (ISPs).
  2. Purchase, operate, and maintain network function specific hardware appliances. Think hardware load balancers, firewalls, DDoS mitigation equipment, WAN optimization, and more.

Each of these boxes costs time and money to maintain, not to mention the skilled, expensive people required to properly run them. Each additional link in the chain makes a network harder to manage.

This all sounded familiar to us. We had an aha! moment: we had the same issues managing our datacenter networks that power all of our products, and we had spent significant time and effort building solutions to those problems. Now, nine years later, we had a robust set of tools we could turn into products for our own customers.

Magic Transit aims to bring the traditional datacenter hardware model into the cloud, packaging transit with all the network “hardware” you might need to keep your network fast, reliable, and secure. Once deployed, Magic Transit allows seamless provisioning of virtualized network functions, including routing, DDoS mitigation, firewalling, load balancing, and traffic acceleration services.

Magic Transit is your network’s on-ramp to the Internet

Magic Transit delivers its connectivity, security, and performance benefits by serving as the “front door” to your IP network. This means it accepts IP packets destined for your network, processes them, and then outputs them to your origin infrastructure.

Connecting to the Internet via Cloudflare offers numerous benefits. Starting with the most basic, Cloudflare is one of the most extensively connected networks on the Internet. We work with carriers, Internet exchanges, and peering partners around the world to ensure that a bit placed on our network will reach its destination quickly and reliably, no matter the destination.

An example deployment: Acme Corp

Let’s walk through how a customer might deploy Magic Transit. Customer Acme Corp. owns the IP prefix 203.0.113.0/24, which they use to address a rack of hardware they run in their own physical datacenter. Acme currently announces routes to the Internet from their customer-premises equipment (CPE, aka a router at the perimeter of their datacenter), telling the world 203.0.113.0/24 is reachable from their autonomous system number, AS64512. Acme has DDoS mitigation and firewall hardware appliances on-premise.


Acme wants to connect to the Cloudflare Network to improve the security and performance of their own network. Specifically, they’ve been the target of distributed denial of service attacks, and want to sleep soundly at night without relying on on-premise hardware. This is where Cloudflare comes in.


Deploying Magic Transit in front of their network is simple:

  1. Cloudflare uses Border Gateway Protocol (BGP) to announce Acme’s 203.0.113.0/24 prefix from Cloudflare’s edge, with Acme’s permission.
  2. Cloudflare begins ingesting packets destined for the Acme IP prefix.
  3. Magic Transit applies DDoS mitigation and firewall rules to the network traffic. After it is ingested by the Cloudflare network, traffic that would benefit from HTTPS caching and WAF inspection can be “upgraded” to our Layer 7 HTTPS pipeline without incurring additional network hops.
  4. Acme would like Cloudflare to use Generic Routing Encapsulation (GRE) to tunnel traffic from the Cloudflare Network back to Acme’s datacenter. GRE tunnels are initiated from anycast endpoints back to Acme’s premises (see the sketch after this list). Through the magic of anycast, the tunnels are constantly and simultaneously connected to hundreds of network locations, ensuring the tunnels are highly available and resilient to network failures that would bring down traditionally formed GRE tunnels.
  5. Cloudflare egresses packets bound for Acme over these GRE tunnels.
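To make step 4 concrete, here is a minimal sketch of what terminating such a tunnel might look like on a Linux router at Acme’s premises, using standard iproute2 commands (the endpoint and tunnel addresses are hypothetical, not Cloudflare-provided configuration):

# create a GRE tunnel interface toward a Cloudflare anycast endpoint
$ ip tunnel add cf-gre0 mode gre remote 192.0.2.1 local 203.0.113.1 ttl 255
$ ip link set cf-gre0 up
# address the tunnel so internal routes can direct traffic into it
$ ip addr add 10.0.0.2/31 dev cf-gre0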

Let’s dive deeper on how the DDoS mitigation included in Magic Transit works.

Magic Transit protects networks from DDoS attack

Customers deploying Cloudflare Magic Transit instantly get access to the same IP-layer DDoS protection system that has protected the Cloudflare Network for the past 9 years. This is the same mitigation system that stopped a 942Gbps attack dead in its tracks, in seconds. This is the same mitigation system that knew how to stop memcached amplification attacks days before a 1.3Tbps attack took down Github, which did not have Cloudflare watching its back. This is the same mitigation we trust every day to protect Cloudflare, and now it protects your network.

Cloudflare has historically protected Layer 7 HTTP and HTTPS applications from attacks at all layers of the OSI model. The DDoS protection our customers have come to know and love relies on a blend of techniques, but can be broken into a few complementary defenses:

  1. Anycast and a network presence in 193 cities around the world allow our network to get close to users and attackers, letting us soak up traffic close to the source without introducing significant latency.
  2. 30+Tbps of network capacity allows us to soak up a lot of traffic close to the source. Cloudflare’s network has more capacity to stop DDoS attacks than that of Akamai Prolexic, Imperva, Neustar, and Radware — combined.
  3. Our HTTPS reverse proxy absorbs L3 (IP layer) and L4 (TCP layer) attacks by terminating connections and re-establishing them to the origin. This stops most spurious packet transmissions from ever getting close to a customer origin server.
  4. Layer 7 mitigations and rate limiting stop floods at the HTTPS application layer.

Looking at the above description carefully, you might notice something: our reverse proxy servers protect our customers by terminating connections, but our network and servers still get slammed by the L3 and L4 attacks we stop on behalf of our customers. How do we protect our own infrastructure from these attacks?

Enter Gatebot!

Gatebot is a suite of software running on every one of our servers inside each of our datacenters in the 193 cities we operate, constantly analyzing and blocking attack traffic. Part of Gatebot’s beauty is its simple architecture; it sits silently, in wait, sampling packets as they pass from the network card into the kernel and onward into userspace. Gatebot does not have a learning or warm-up period. As soon as it detects an attack, it instructs the kernel of the machine it is running on to drop the packet, log its decision, and move on.

Historically, if you wanted to protect your network from a DDoS attack, you might have purchased a specialized piece of hardware to sit at the perimeter of your network. This hardware box (let’s call it “The DDoS Protection Box”) would have been fantastically expensive, pretty to look at (as pretty as a 2U hardware box could get), and required a ton of recurring effort and money to stay on its feet, keep its license up to date, and keep its attack detection system accurate and trained.

For one thing, it would have to be carefully monitored to make sure it was stopping attacks but not stopping legitimate traffic. For another, if an attacker managed to generate enough traffic to saturate your datacenter’s transit links to the Internet, you were out of luck; no box sitting inside your datacenter can protect you from an attack generating enough traffic to congest the links running from the outside world to the datacenter itself.

Early on, Cloudflare considered buying The DDoS Protection Box(es) to protect our various network locations, but ruled them out quickly. Buying hardware would have incurred substantial cost and complexity. In addition, buying, racking, and managing specialized pieces of hardware makes a network hard to scale. There had to be a better way. We set out to solve this problem ourselves, starting from first principles and modern technology.

To make our modern approach to DDoS mitigation work, we had to invent a suite of tools and techniques to allow us to do ultra-high performance networking on a generic x86 server running Linux.

At the core of our network data plane are the eXpress Data Path (XDP) and the extended Berkeley Packet Filter (eBPF), a set of APIs that allow us to build ultra-high performance networking applications in the Linux kernel. My colleagues have written extensively about how we use XDP and eBPF to stop DDoS attacks.
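As a rough illustration of the model (generic iproute2 usage, not Cloudflare’s production tooling), a compiled eBPF program can be attached at a network interface’s XDP hook so that it runs on every incoming packet before the kernel’s networking stack ever sees it:

# attach a hypothetical compiled eBPF object at the XDP hook of eth0
$ ip link set dev eth0 xdp obj xdp_drop.o sec xdp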

At the end of the day, we ended up with a DDoS mitigation system that:

  • Is delivered by our entire network, spread across 193 cities around the world. To put this another way, our network doesn’t have the concept of “scrubbing centers” — every single one of our network locations is always mitigating attacks, all the time. This means faster attack mitigation and minimal latency impact for your users.
  • Has exceptionally fast times to mitigate, with most attacks mitigated in 10 seconds or less.
  • Was built in-house, giving us deep visibility into its behavior and the ability to rapidly develop new mitigations as we see new attack types.
  • Is deployed as a service, and is horizontally scalable. Adding x86 hardware running our DDoS mitigation software stack to a datacenter (or adding another network location) instantly brings more DDoS mitigation capacity online.

Gatebot is designed to protect Cloudflare infrastructure from attack. And today, as part of Magic Transit, customers operating their own IP networks and infrastructure can rely on Gatebot to protect their own network.

Magic Transit puts your network hardware in the cloud

We’ve covered how Cloudflare Magic Transit connects your network to the Internet, and how it protects you from DDoS attack. If you were running your network the old-fashioned way, this is where you’d stop to buy firewall hardware, and maybe another box to do load balancing.

With Magic Transit, you don’t need those boxes. We have a long track record of delivering common network functions (firewalls, load balancers, etc.) as services. Up until this point, customers deploying our services have relied on DNS to bring traffic to our edge, after which our Layer 3 (IP), Layer 4 (TCP & UDP), and Layer 7 (HTTP, HTTPS, and DNS) stacks take over and deliver performance and security to our customers.

Magic Transit is designed to handle your entire network, but does not enforce a one-size-fits-all approach to what services get applied to which portion of your traffic. To revisit Acme, our example customer from above, they have brought 203.0.113.0/24 to the Cloudflare Network. This represents 256 IPv4 addresses, some of which (e.g. 203.0.113.8/30) might front load balancers and HTTP servers, others mail servers, and others still custom UDP-based applications.

Each of these sub-ranges may have different security and traffic management requirements. Magic Transit allows you to configure specific IP addresses with their own suite of services, or apply the same configuration to large portions (or all) of your block.

Taking the above example, Acme may wish that the 203.0.113.8/30 block containing HTTP services, currently fronted by a traditional hardware load balancer, instead use the Cloudflare Load Balancer, with HTTP traffic analyzed by Cloudflare’s WAF and content cached by our CDN. With Magic Transit, deploying these network functions is straightforward: a few clicks in our dashboard or API calls will have your traffic handled at a higher layer of network abstraction, with all the attendant goodies that application-level load balancing, firewall, and caching logic bring.

This is just one example of a deployment customers might pursue. We’ve worked with several who just want pure IP passthrough, with DDoS mitigation applied to specific IP addresses. Want that? We got you!

Magic Transit runs on the entire Cloudflare Global Network. Or, no more scrubs!

When you connect your network to Cloudflare Magic Transit, you get access to the entire Cloudflare network. This means all of our network locations become your network locations. Our network capacity becomes your network capacity, at your disposal to power your experiences, deliver your content, and mitigate attacks on your infrastructure.

How expansive is the Cloudflare Network? We’re in 193 cities worldwide, with more than 30Tbps of network capacity spread across them. Cloudflare operates within 100 milliseconds of 98% of the Internet-connected population in the developed world, and 93% of the Internet-connected population globally (for context, the blink of an eye is 300-400 milliseconds).

Areas of the globe within 100 milliseconds of a Cloudflare datacenter.

Just as we built our own products in house, we also built our network in house. Every product runs in every datacenter, meaning our entire network delivers all of our services. This might not have been the case if we had assembled our product portfolio piecemeal through acquisition, or not had completeness of vision when we set out to build our current suite of services.

The end result for customers of Magic Transit: a network presence around the globe as soon as you come on board. Full access to a diverse set of services worldwide. All delivered with latency and performance in mind.

We’ll be sharing a lot more technical detail on how we deliver Magic Transit in the coming weeks and months.

Magic Transit lowers total cost of ownership

Traditional network services don’t come cheap; they require high capital outlays up front, investment in staff to operate, and ongoing maintenance contracts to stay functional. Just as our product aims to be disruptive technically, we want to disrupt traditional network cost structures as well.

Magic Transit is delivered and billed as a service. You pay for what you use, and can add services at any time. Your team will thank you for its ease of management; your management will thank you for its ease of accounting. That sounds pretty good to us!

Magic Transit is available today

We’ve worked hard over the past nine years to get our network, management tools, and network functions as a service into the state they’re in today. We’re excited to get the tools we use every day in customers’ hands.

So that brings us to naming. When we showed this to customers the most common word they used was ‘whoa.’ When we pressed what they meant by that they almost all said: ‘It’s so much better than any solution we’ve seen before. It’s, like, magic!’ So it seems only natural, if a bit cheesy, that we call this product what it is: Magic Transit.

We think this is all pretty magical, and think you will too. Contact our Enterprise Sales Team today.