All posts by Joshua Johnson

Improving Worker Tail scalability

Post Syndicated from Joshua Johnson original http://blog.cloudflare.com/improving-worker-tail-scalability/

Improving Worker Tail scalability

Improving Worker Tail scalability

Being able to get real-time information from applications in production is extremely important. Many times software passes local testing and automation, but then users report that something isn’t working correctly. Being able to quickly see what is happening, and how often, is critical to debugging.

This is why we originally developed the Workers Tail feature – to allow developers the ability to view requests, exceptions, and information for their Workers and to provide a window into what’s happening in real time. When we developed it, we also took the opportunity to build it on top of our own Workers technology using products like Trace Workers and Durable Objects. Over the last couple of years, we’ve continued to iterate on this feature – allowing users to quickly access logs from the Dashboard and via Wrangler CLI.

Today, we’re excited to announce that tail can now be enabled for Workers at any size and scale! In addition to telling you about the new and improved scalability, we wanted to share how we built it, and the changes we made to enable it to scale better.

Why Tail was limited

Tail leverages Durable Objects to handle coordination between the Worker producing messages and consumers like wrangler and the Cloudflare dashboard, and Durable Objects are a great choice for handling real-time communication like this. However, when a single Durable Object instance starts to receive a very high volume of traffic – like the kind that can come with tailing live Workers – it can see some performance issues.

As a result, Workers with a high volume of traffic could not be supported by the original Tail infrastructure. Tail had to be limited to Workers receiving 100 requests/second (RPS) or less. This was a significant limitation that resulted in many users with large, high-traffic Workers having to turn to their own tooling to get proper observability in production.

Believing that every feature we provide should scale with users during their development journey, we set out to improve Tail's performance at high loads.

Updating the way filters work

The first improvement was to the existing filtering feature. When starting a Tail with wrangler tail (and now with the Cloudflare dashboard) users have the ability to filter out messages based on information in the requests or logs.
Previously, this filtering was handled within the Durable Object, which meant that even if a user was filtering out the majority of their traffic, the Durable Object would still have to handle every message. Often users with high traffic Tails were using many filters to better interpret their logs, but wouldn’t be able to start a Tail due to the 100 RPS limit.

We moved filtering out of the Durable Object and into the Tail message producer, preventing any filtered messages from reaching the Tail Durable Object, and thereby reducing the load on the Tail Durable Object. Moving the filtering out of the Durable Object was the first step in improving Tail’s performance at scale.

Sampling logs to keep Tails within Durable Object limits

After moving log filtering outside of the Durable Object, there was still the issue of determining when Tails could be started since there was no way to determine to what degree filters would reduce traffic for a given Tail, and simply starting a Durable Object back up would mean that it more than likely hit the 100 RPS limit immediately.

The solution for this was to add a safety mechanism for the Durable Object while the Tail was running.

We created a simple controller to track the RPS hitting a Durable Object and sample messages until the desired volume of 100 RPS is reached. As shown below, sampling keeps the Tail Durable Object RPS below the target of 100.

Improving Worker Tail scalability

When messages are sampled, the following message appears every five seconds to let the user know that they are in sampling mode:

Improving Worker Tail scalability

This message goes away once the Tail is stopped or filters are applied that drop the RPS below 100.

A final failsafe

Finally as a last resort a failsafe mechanism was added in the case the Durable Object gets fully overloaded. Since RPS tracking is done within the Durable Object, if the Durable Object is overloaded due to an extremely large amount of traffic, the sampling mechanism will fail.

In the case that an overload is detected, all messages forwarded to the Durable Object are stopped periodically to prevent any issues with Workers infrastructure.

Improving Worker Tail scalability

Here we can see a user who had a large amount of traffic that started to become sampled. As the traffic increased, the number of sampled messages grew. Since the traffic was too fast for the sampling mechanism to handle, the Durable Object got overloaded. However, soon excess messages were blocked and the overload stopped.

Try it out

These new improvements are in place currently and available to all users 🎉

To Tail Workers via the Dashboard, log in, navigate to your Worker, and click on the Logs tab. You can then start a log stream via the default view.

Improving Worker Tail scalability

If you’re using the Wrangler CLI, you can start a new Tail by running wrangler tail.

Beyond Worker tail

While we're excited for tail to be able to reach new limits and scale, we also recognize users may want to go beyond the live logs provided by Tail.

For example, if you’d like to push log events to additional destinations for a historical view of your application’s performance, we offer Logpush. If you’d like more insight into and control over log messages and events themselves, we offer Tail Workers.

These products, and others, can be read about in our Logs documentation. All of them are available for use today.

Improving the Wrangler Startup Experience

Post Syndicated from Joshua Johnson original https://blog.cloudflare.com/improving-the-wrangler-startup-experience/

Improving the Wrangler Startup Experience

Improving the Wrangler Startup Experience

Today I’m excited to announce wrangler login, an easy way to get started with Wrangler! This summer for my internship on the Workers Developer Productivity team I was tasked with helping improve the Wrangler user experience. For those who don’t know, Workers is Cloudflare’s serverless platform which allows users to deploy their software directly to Cloudflare’s edge network.

This means you can write any behaviour on requests heading to your site or even run fully fledged applications directly on the edge. Wrangler is the open-source CLI tool used to manage your Workers and has a big focus on enabling a smooth developer experience.

When I first heard I was working on Wrangler, I was excited that I would be working on such a cool product but also a little nervous. This was the first time I would be writing Rust in a professional environment, the first time making meaningful open-source contributions, and on top of that the first time doing all of this remotely. But thanks to lots of guidance and support from my mentor and team, I was able to help make the Wrangler and Workers developer experience just a little bit better.

The Problem

The main improvement I focused on this summer was the experience when getting started with Wrangler. For many of the commands to publish and develop live Workers, the user first needs to authenticate with Cloudflare. This is mainly done through the wrangler config command which has the user go create an API token and paste it into Wrangler. Creating a token involves going to the Cloudflare dashboard, going to your profile, going to the API tokens page, selecting a token template, adding your zones and accounts, and finally creating the token. While this is a completely valid authentication flow, it’s not as easy as it could be.

It could be frustrating to users who have to leave Wrangler and then possibly get lost in the wrong dashboard page or use the wrong settings for their token. When a group of intern candidates were given the task of using Wrangler, most of them got stuck on this step! Many users might forgo using Workers altogether if this is the first thing they encounter when sitting down to develop. Instead we wanted an experience where users could use their Cloudflare login (ie. their username, password, and possible two-factor authentication) and immediately be ready to go.

No OAuth? No Problem

What we came up with was a way to create and transfer API Tokens for a user, similar to how Argo Tunnel does their login.

Improving the Wrangler Startup Experience

An overview of the process is shown above, which starts with Wrangler. When the user types wrangler login in their terminal, they will be prompted to open the Cloudflare dashboard in their browser. All dashboard pages require the user to sign in before loading and once the user is signed in, all actions taken by the dashboard page will use the authentication of that user.

This means we can make a dashboard page which automatically creates an API token configured to manage Workers. Then when the user loads this page, a properly configured API token will be created for that user. Our dashboard page will then hand off the token to EdgeWorker Config Service (EWC) which will temporarily store it. While this is all going on Wrangler will be polling EWC waiting for the token to appear and once it does, Wrangler will retrieve the token and authenticate the user. With this, we have a seamless way to authenticate a Cloudflare user.

Security

One thing we had to be mindful of was security, these are users’ tokens after all. If someone was listening to network traffic and saw the request to the Cloudflare dashboard page, nothing would be stopping them from polling EWC themselves and stealing the token away from the user to wreak havoc on their Workers and zones. To solve this problem we used asymmetric RSA encryption. Asymmetric encryption lets us create two separate but mathematically connected keys. One is a private key which can encrypt and decrypt information and one is a public key which can only encrypt information.

Wrangler will generate a public-private key pair and pass off the public key to our dashboard page. Once the dashboard page is finished creating our token, EWC will then encrypt the token using the public key before storing. This means in the previous scenario where someone takes the token from our user, all they will have is an encrypted token they can’t use. The only way to decrypt it would be with the private key held by Wrangler.

Improving the Wrangler Startup Experience

In the end, this solution results in a smooth experience for Workers users. Now instead of rummaging through dashboard pages you can get started with Wrangler in only a few seconds, sometimes without having to leave the comfort of your own terminal.

Try out wrangler login in the 1.11.0 release of Wrangler and let us know how you like it. Also I would like to thank the Workers team for helping make this possible and giving me an awesome experience this summer! In order to implement this feature I had to touch different parts of Cloudflare like EWC and Stratus (Cloudflare’s front end monorepo) and work in areas unfamiliar to me such as frontend TypeScript and React. The responsiveness and encouragement I received helped get this feature created and helped make for a great summer!