Tag Archives: sre

How the Cloudflare global network optimizes for system reboots during low-traffic periods

Post Syndicated from Opeyemi Onikute original http://blog.cloudflare.com/how-the-cloudflare-global-network-optimizes-for-system-reboots-during-low-traffic-periods/

To serve the huge scale of Cloudflare’s customer base, we maintain data centers spanning more than 300 cities in over 100 countries, including approximately 30 locations in Mainland China.

The Cloudflare global network is built to be continuously updated in a zero downtime manner, but some changes may need a server reboot to safely take effect. To enable this, we have mechanisms for the whole fleet to automatically reboot with changes gated on a unique identifier for the reboot cycle. Each data center has a maintenance window, which is a time period – usually a couple of hours – during which reboots are permitted.

We take our customer experience very seriously, and hence we have several mechanisms to ensure that disruption to customer traffic does not occur. One example is Unimog, our in-house load balancer that spreads load across the servers in a data center, ensuring that there is no disruption when a server is taken out for routine maintenance.

The SRE team decided to further reduce risk by only allowing reboots in a data center when customer traffic is at its lowest. We also needed to automate the existing manual process for determining the window – eliminating toil.

In this post, we’ll discuss how the team improved this manual process and automated the determination of these windows by fitting a sinusoidal (trigonometric) wave to traffic data.

When is the best time to reboot?

Thanks to how efficient our load-balancing framework is within a data center, technically we could schedule reboots throughout the day with zero impact on traffic flowing through the data center. Operationally, however, management is simpler if reboots are required to take place within a set time range for each data center. This both acts as a rate-limiter, so we avoid rebooting every server in our larger data centers in a single day, and makes remediating any unforeseen issues more straightforward, since problems can be caught within the first batch of reboots.

One of the first steps is to determine the time window during which these reboots are allowed to take place; choosing a relatively low-traffic period for each data center makes the most sense for obvious reasons. In the past, these low-traffic windows were found manually by reviewing historical traffic trends in our metrics. SRE was responsible for creating and maintaining the definition of these windows, which became particularly toilsome:

  1. Traffic trends are always changing, requiring increasingly frequent reviews of maintenance hours.
  2. We move quickly at Cloudflare: there is always a data center being provisioned, which makes it difficult to keep maintenance windows up-to-date.
  3. The system was inflexible, and provided no dynamic decision-making.
  4. This responsibility became SRE toil as it was repetitive, process-based work that could and should be automated.

Time to be more efficient

We quickly realized that we needed to make this process more efficient using automation. An ideal solution would be one that was accurate, easy to maintain, re-usable, and could be consumed by other teams.

A theoretical solution to this was sine-fitting on the CPU pattern of the data center over a configurable period (e.g. two weeks). This method transforms the pattern into a theoretical sinusoidal wave, as shown in the image below.

(Image: a data center’s daily CPU pattern with a fitted sine wave)

With a sine wave, the most common troughs can be determined. The periods where these troughs occur are then used as options for the maintenance window.

Sinusoidal wave theory – the secret sauce

We think math is fun and were excited to see how this held up in practice. To implement the logic and tests, we needed to understand the theory. This section details the important bits for anyone that is interested in implementing this for their maintenance cycles as well.

The image below shows a theoretical representation of a sine wave. It is represented by the mathematical function y(t) = Asin(2πft + φ) where A = Amplitude, f = Frequency, t = Time and φ = Phase.

(Image: a theoretical sine wave)

In practice, various programming language packages exist to fit an arbitrary dataset to a curve. For example, Python has the scipy.optimize.curve_fit function.

We used Python, and to make the result more accurate it is recommended to supply initial guesses for the parameters. These are described below:

Amplitude: This is the distance from the peak/valley to the time axis, approximated as the standard deviation multiplied by √2. For a sine wave that varies between -1 and +1, the standard deviation is approximately 0.707 (or 1/√2), so multiplying the standard deviation of the data by √2 gives an approximation of the amplitude.

Frequency: This is the number of cycles (time periods) in one second. We are concerned with the daily CPU pattern, meaning that the guess should be one full wave every 24 hours (i.e. 1/86400).

Phase: This is the position of the wave at T=0. No guess is needed for this.

Offset: To fit the sine wave on the CPU values, we need to shift upwards by the offset. This offset is the mean of the CPU values.
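
As a quick sanity check of the amplitude guess above: the standard deviation of a pure sine sampled over whole periods is A/√2, so multiplying by √2 recovers A. A minimal sketch on synthetic data (not production code):

import numpy

# One day of a pure sine with amplitude 3.0, sampled every minute.
t = numpy.arange(0, 86400, 60)
signal = 3.0 * numpy.sin(2.0 * numpy.pi * t / 86400)

# std of a full-period sine is A / sqrt(2), so std * sqrt(2) ~= A.
print(numpy.std(signal) * 2.0**0.5)  # ~= 3.0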

Here’s a basic example of how it can be implemented in Python:

import numpy
import scipy.optimize

timestamps = numpy.array(timestamps)
cpu = numpy.array(cpu)

guess_freq = 1 / 86400  # 24h periodicity
guess_amp = numpy.std(cpu) * 2.0**0.5
guess_offset = numpy.mean(cpu)
# Initial guesses: amplitude, angular frequency, phase, offset
guess = numpy.array([guess_amp, 2.0 * numpy.pi * guess_freq, 0.0, guess_offset])

def sinfunc(timestamps, amplitude, frequency, phase, offset):
    return amplitude * numpy.sin(frequency * timestamps + phase) + offset

# curve_fit returns the optimal parameters and their covariance matrix
popt, _ = scipy.optimize.curve_fit(sinfunc, timestamps, cpu, p0=guess, maxfev=2000)
amplitude, frequency, phase, offset = popt
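
With the fitted parameters in hand, the trough can be located analytically: A·sin(ωt + φ) + C is at its minimum where ωt + φ = -π/2 plus any whole number of cycles. Below is a minimal sketch of how that could translate into a window; the helper name and the two-hour window centred on the trough are illustrative choices, not our production logic.

import numpy

def daily_trough_seconds(amplitude, frequency, phase):
    """Locate the minimum of amplitude*sin(frequency*t + phase), in seconds into the cycle."""
    # A negative fitted amplitude flips the wave; shift the phase by pi so the
    # trough we compute is the real minimum. Assumes a positive fitted frequency.
    if amplitude < 0:
        phase += numpy.pi
    period = 2.0 * numpy.pi / frequency  # ~86400 s for a daily pattern
    trough = (-numpy.pi / 2.0 - phase) / frequency
    return trough % period

# Example: centre a two-hour maintenance window on the trough.
trough = daily_trough_seconds(amplitude, frequency, phase)
window_start = (trough - 3600) % 86400
window_end = (trough + 3600) % 86400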

Applying the theory

With the theory understood, we implemented this logic in our reboot system. To determine the window, the reboot system queries Prometheus for the data center CPU over a configurable period and attempts to fit a curve on the resultant pattern.
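
Fetching the CPU series itself only needs the standard Prometheus HTTP range-query API. The sketch below is illustrative: the Prometheus URL, metric name, and label are placeholders, not the actual metrics we use.

import time
import requests

def fetch_colo_cpu(prom_url, colo, days=14, step=300):
    """Query a Prometheus server for a data center's CPU over the last `days`."""
    end = time.time()
    params = {
        # Hypothetical metric/label names, for illustration only.
        "query": f'avg(colo_cpu_utilization{{colo="{colo}"}})',
        "start": end - days * 86400,
        "end": end,
        "step": step,
    }
    resp = requests.get(f"{prom_url}/api/v1/query_range", params=params, timeout=30)
    resp.raise_for_status()
    values = resp.json()["data"]["result"][0]["values"]  # [[ts, "value"], ...]
    timestamps = [float(ts) for ts, _ in values]
    cpu = [float(v) for _, v in values]
    return timestamps, cpu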

If there’s an accurate enough fit, the window is cached in Consul along with other metadata. Otherwise, fallback logic is used: for various reasons, some data centers might not have enough data for a fit at that moment, for example a data center which was only recently provisioned and hasn’t yet served enough traffic.


When a server requests to reboot, the system first checks whether the current time is within the maintenance window, before running other pre-flight checks. In most cases the window already exists because of the prefetch mechanism described below, but when it doesn’t, due to Consul session expiry or some other reason, it is computed on the spot using the CPU data in Prometheus.
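
A minimal sketch of that first check, assuming the illustrative window representation from earlier (seconds into the UTC day):

import time

def in_maintenance_window(window_start, window_end, now=None):
    """Return True if the current UTC time falls inside the window.

    window_start/window_end are seconds into the UTC day; windows that
    cross midnight (start > end) are handled too.
    """
    now = time.time() if now is None else now
    seconds_into_day = now % 86400
    if window_start <= window_end:
        return window_start <= seconds_into_day <= window_end
    return seconds_into_day >= window_start or seconds_into_day <= window_end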

The considerations in this phase were:

Caching: Calculation of the window should only be done over a pre-decided validity period. To achieve this we store the information in a Consul KV, along with a session lock that expires after the validity period. We have mentioned in the past that we use Consul as a service-discovery and key-value storage mechanism. This is an example of the latter.
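
As a rough sketch of that caching pattern using the python-consul client; the key name and TTL below are illustrative assumptions, not our actual schema:

import json
import consul

c = consul.Consul()

def cache_window(colo, window, ttl=86400):
    """Store the computed window in Consul KV, tied to a TTL session so it expires."""
    # When the session expires, Consul deletes the key, forcing a recompute.
    session_id = c.session.create(behavior="delete", ttl=ttl)
    c.kv.put(f"reboots/maintenance-window/{colo}", json.dumps(window), acquire=session_id)

def cached_window(colo):
    _, entry = c.kv.get(f"reboots/maintenance-window/{colo}")
    return json.loads(entry["Value"]) if entry else None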

Pre-fetch: In practice, it makes sense to control when this computation happens. There are several options but the most efficient was to implement a pre-fetch of this window on startup of the reboot system.

Observability: We exported a couple of metrics to Prometheus, which help us understand the decisions being made and any errors we need to address. We also export the maintenance window itself for consumption by other automated systems and teams.
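
For example, the window and the fit accuracy can be exported as labelled gauges with the standard prometheus_client library; the metric and label names below are hypothetical:

from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; one time series per data center.
WINDOW_START = Gauge(
    "reboot_maintenance_window_start_seconds",
    "Start of the computed maintenance window, seconds into the UTC day",
    ["colo"],
)
FIT_PERCENT = Gauge(
    "reboot_maintenance_window_fit_percent",
    "Goodness-of-fit percentage for the sine wave",
    ["colo"],
)

start_http_server(8000)  # expose /metrics for Prometheus to scrape
# window_start comes from the trough sketch earlier; fit_percent from the
# goodness-of-fit check described below.
WINDOW_START.labels(colo="ams01").set(window_start)
FIT_PERCENT.labels(colo="ams01").set(fit_percent)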

How accurate is this fit?

Most load patterns fit the sine wave well, but there are occasional edge cases, e.g. a smaller data center with a roughly constant CPU load. In those cases we have fallback mechanisms, but it also got us thinking about how to determine the accuracy of each fit.

With accuracy data we can make smarter decisions about accepting the automatic window, track regressions and unearth data centers with unexpected patterns. The theoretical solution here is referred to as the goodness of fit.

Curve fitting in Python with curve_fit describes curve fitting and how to calculate the chi-squared value. The formula for the goodness-of-fit chi-squared test is shown below:

χ² = Σ ((Yi - f(Xi)) / σi)²

Here Yi is the observed value, f(Xi) the expected (fitted) value, and σi the uncertainty. In this theory, the closer the chi-squared value is to the length of the sample, the better the fit: values a lot smaller than the length indicate an overestimated uncertainty, and values much larger indicate a bad fit.

This is implemented with a simple function:

import numpy

def goodness_of_fit(observed, expected):
    # Use the standard deviation of the observations as the uncertainty term
    chisq = numpy.sum(((observed - expected) / numpy.std(observed)) ** 2)

    # Present the chi-squared value as a percentage relative to the sample length
    n = len(observed)
    return ((n - chisq) / n) * 100

Since we use the standard deviation of the observed data as the uncertainty, the smaller the chi-squared value, the more accurate the fit, and vice versa. Hence we can express the fit as a percentage: the difference between the sample length and the chi-squared value, relative to the sample length.
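
Putting it together, the fit percentage can gate whether the automatic window is trusted. The sketch below reuses sinfunc, the fitted parameters, and the illustrative daily_trough_seconds helper from earlier; the 80% threshold is an example value, not our production cut-off.

expected = sinfunc(timestamps, amplitude, frequency, phase, offset)
fit_percent = goodness_of_fit(cpu, expected)

if fit_percent >= 80.0:
    # Accept the automatically computed window.
    window = daily_trough_seconds(amplitude, frequency, phase)
else:
    # Fall back to an arbitrary / manually chosen window.
    window = None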

There are three main types of fit, as shown in the images below.

Bad Fit

These data centers do not exhibit a sinusoidal pattern, so the maintenance window can only be determined arbitrarily. This is common in test data centers which do not handle customer traffic. Here it makes sense to turn off load-based reboots and use an arbitrary window; it is also common to require faster reboots on a different schedule to catch any potential issues early.

(Image: example of a bad fit)

Skewed Fit

These data centers exhibit sinusoidal traffic patterns but are a bit skewed, with some smaller troughs within the wave cycle. The troughs (and hence the windows) are still correct, but the accuracy of fit is reduced.

(Image: example of a skewed fit)

Great Fit

These are data centers with very clear patterns and great fits. This is the ideal scenario, and most data centers fall into this category.

(Image: example of a great fit)

What’s next?

We will continue to iterate on this to make it more accurate, and provide more ways to consume the information. We have a variety of maintenance use-cases that cut across multiple organizations, and it’s exciting to see the information used more widely besides reboots. For example, teams can use maintenance windows to make automated decisions in downstream services such as running compute-intensive background tasks only in those periods.

I want to do this type of work

If you found this post very interesting and want to contribute, take a look at our careers page for open positions! We’d love to have a conversation with you.

Ensuring the Successful Launch of Ads on Netflix

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/ensuring-the-successful-launch-of-ads-on-netflix-f99490fdf1ba

By Jose Fernandez, Ed Barker, Hank Jacobs

Introduction

In November 2022, we introduced a brand new tier — Basic with ads. This tier extended existing infrastructure by adding new backend components and a new remote call to our ads partner on the playback path. As we were gearing up for launch, we wanted to ensure it would go as smoothly as possible. To do this, we devised a novel way to simulate the projected traffic weeks ahead of launch by building upon the traffic migration framework described here. We used this simulation to help us surface problems of scale and validate our Ads algorithms.

Basic with ads was launched worldwide on November 3rd. In this blog post, we’ll discuss the methods we used to ensure a successful launch, including:

  • How we tested the system
  • Netflix technologies involved
  • Best practices we developed

Realistic Test Traffic

Netflix traffic ebbs and flows throughout the day in a sinusoidal pattern. New content or national events may drive brief spikes, but, by and large, traffic is usually smoothly increasing or decreasing. An exception to this trend is when we redirect traffic between AWS data centers during regional evacuations, which leads to sudden spikes in traffic in multiple regions. Region evacuations can occur at any time, for a variety of reasons.

Fig. 1: Traffic patterns: typical SPS distribution across data centers, and SPS distribution during regional traffic shifts

While evaluating options to test the anticipated load and to evaluate our ad selection algorithms at scale, we realized that mimicking member viewing behavior, the seasonality of our organic traffic, and abrupt regional shifts were all important requirements. Replaying real traffic and making it appear as Basic with ads traffic was a better solution than artificially simulating Netflix traffic. Replay traffic enabled us to test our new systems and algorithms at scale before launch, while keeping the traffic as realistic as possible.

The Setup

A key objective of this initiative was to ensure that our customers were not impacted. We used member viewing habits to drive the simulation, but customers did not see any ads as a result. Achieving this goal required extensive planning and implementation of measures to isolate the replay traffic environment from the production environment.

Netflix’s data science team provided projections of what the Basic with ads subscriber count would look like a month after launch. We used this information to simulate a subscriber population through our AB testing platform. When traffic matching our AB test criteria arrived at our playback services, we stored copies of those requests in a Mantis stream.

Next, we launched a Mantis job that processed all requests in the stream and replayed them in a duplicate production environment created for replay traffic. We set the services in this environment to “replay traffic” mode, which meant that they did not alter state and were programmed to treat the request as being on the ads plan, which activated the components of the ads system.

The replay traffic environment generated responses containing a standard playback manifest, a JSON document containing all the necessary information for a Netflix device to start playback. It also included metadata about ads, such as ad placement and impression-tracking events. We stored these responses in a Keystone stream with outputs for Kafka and Elasticsearch. A Kafka consumer retrieved the playback manifests with ad metadata and simulated a device playing the content and triggering the impression-tracking events. We used Elasticsearch dashboards to analyze results.
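
A rough sketch of the consumer side, written here in Python with kafka-python; the topic name, manifest fields, and tracking mechanism are assumptions for illustration, not the actual Netflix schema or stack:

import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "replay-playback-manifests",          # hypothetical topic name
    bootstrap_servers=["broker:9092"],    # placeholder broker address
    value_deserializer=lambda raw: json.loads(raw),
)

for message in consumer:
    manifest = message.value
    # "Play" the title by firing the impression-tracking events the manifest
    # describes, as a real device would (field names are illustrative).
    for event in manifest.get("adImpressionEvents", []):
        requests.post(event["trackingUrl"], timeout=5)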

Ultimately, we accurately simulated the projected Basic with ads traffic weeks ahead of the launch date.

A diagram of the systems involved in traffic replay
Fig. 2: The Traffic Replay Setup

The Rollout

To fully replay the traffic, we first validated the idea with a small percentage of traffic. The Mantis query language allowed us to set the percentage of replay traffic to process. We informed our engineering and business partners, including customer support, about the experiment and ramped up traffic incrementally while monitoring the success and error metrics through Lumen dashboards. We continued ramping up and eventually reached 100% replay. At this point we felt confident to run the replay traffic 24/7.

To validate handling traffic spikes caused by regional evacuations, we utilized Netflix’s region evacuation exercises which are scheduled regularly. By coordinating with the team in charge of region evacuations and aligning with their calendar, we validated our system and third-party touchpoints at 100% replay traffic during these exercises.

We also constructed and checked our ad monitoring and alerting system during this period. Having representative data allowed us to be more confident in our alerting thresholds. The ads team also made necessary modifications to the algorithms to achieve the desired business outcomes for launch.

Finally, we conducted chaos experiments using the ChAP experimentation platform. This allowed us to validate our fallback logic and our new systems under failure scenarios. By intentionally introducing failure into the simulation, we were able to identify points of weakness and make the necessary improvements to ensure that our ads systems were resilient and able to handle unexpected events.

The availability of replay traffic 24/7 enabled us to refine our systems and boost our launch confidence, reducing stress levels for the team.

Takeaways

The above summarizes three months of hard work by a tiger team consisting of representatives from various backend teams and Netflix’s centralized SRE team. This work helped ensure a successful launch of the Basic with ads tier on November 3rd.

To briefly recap, here are a few of the things that we took away from this journey:

  • Accurately simulating real traffic helps build confidence in new systems and algorithms more quickly.
  • Large scale testing using representative traffic helps to uncover bugs and operational surprises.
  • Replay traffic has other applications outside of load testing that can be leveraged to build new products and features at Netflix.

What’s Next

Replay traffic at Netflix has numerous applications, one of which has proven to be a valuable tool for development and launch readiness. The Resilience team is streamlining this simulation strategy by integrating it into the ChAP experimentation platform, making it accessible to all development teams without the need for extensive infrastructure setup. Keep an eye out for updates on this.

