In February 2021, Cloudflare launched Project Fair Shot — a program that gave our Waiting Room product free of charge to any government, municipality, private/public business, or anyone responsible for the scheduling and/or dissemination of the COVID-19 vaccine.
By having our Waiting Room technology in front of the vaccine scheduling application, it ensured that:
Applications would remain available, reliable, and resilient against massive spikes of traffic for users attempting to get their vaccine appointment scheduled.
Visitors could wait for their long-awaited vaccine with confidence, arriving at a branded queuing page that provided accurate, estimated wait times.
Vaccines would get distributed equitably, and not just to folks with faster reflexes or Internet connections.
Since February, we’ve seen a good number of participants in Project Fair Shot. To date, we have helped more than 100 customers across more than 10 countries to schedule approximately 100 million vaccinations. Even better, these vaccinations went smoothly, with customers like the County of San Luis Obispo regularly dealing with more than 20,000 appointments in a day. “The bottom line is Cloudflare saved lives today. Our County will forever be grateful for your participation in getting the vaccine to those that need it most in an elegant, efficient and ethical manner” — Web Services Administrator for the County of San Luis Obispo.
We are happy to have helped not just in the US, but worldwide as well. In Canada, we partnered with a number of organizations and the Canadian government to increase access to the vaccine. One partner stated: “Our relationship with Cloudflare went from ‘Let’s try Waiting Room’ to ‘Unless you have this, we’re not going live with that public-facing site.’” — CEO of Verto Health. In another country in Europe, we saw over three million people go through the Waiting Room in less than 24 hours, leading to a significantly smoother and less stressful experience. Cities in Japan, — working closely with our partner, Classmethod — have been able to vaccinate over 40 million people and are on track to complete their vaccination process across 317 cities. If you want more stories from Project Fair Shot, check out our case studies.
We are continuing to add more customers to Project Fair Shot every day to ensure we are doing all that we can to help distribute more vaccines. With the emergence of the Delta variant and others, vaccine distribution (and soon, booster shots) is still very much a real problem to keep everyone healthy and resilient. Because of these new developments, Cloudflare will be extending Project Fair Shot until at least July 1, 2022. Though we are not excited to see the pandemic continue, we are humbled to be able to provide our services and be a critical part in helping us collectively move towards a better tomorrow.
We use Kubernetes to run many of the diverse services that help us control Cloudflare’s edge. We have five geographically diverse clusters, with hundreds of nodes in our largest cluster. These clusters are self-managed on bare-metal machines which gives us a good amount of power and flexibility in the software and integrations with Kubernetes. However, it also means we don’t have a cloud provider to rely on for virtualizing or managing the nodes. This distinction becomes even more prominent when considering all the different reasons that nodes degrade. With self-managed bare-metal machines, the list of reasons that cause a node to become unhealthy include:
Kernel-level software failures
Kubernetes cluster-level software failures
Degraded network communication
Software updates are required
We have plenty of examples of failures in the aforementioned categories, but one example has been particularly tedious to deal with. It starts with the following log line from the kernel:
unregister_netdevice: waiting for lo to become free. Usage count = 1
The issue is further observed with the number of network interfaces on the node owned by the Container Network Interface (CNI) plugin getting out of proportion with the number of running pods:
$ ip link | grep cali | wc -l
This is unexpected as it shouldn’t exceed the maximum number of pods allowed on a node (we use the default limit of 110). While this issue is interesting and perhaps worthy of a whole separate blog, the short of it is that the Linux network interfaces owned by the CNI are not getting cleaned up after a pod terminates.
Some history on this can be read in a Docker GitHub issue. We found this seems to plague nodes with a longer uptime, and after rebooting the node it would be fine for about a month. However, with a significant number of nodes, this was happening multiple times per day. Each instance would need rebooting, which means going through our worker reboot procedure which looked like this:
Cordon off the affected node to prevent new workloads from scheduling on it.
Collect any diagnostic information for later investigation.
Drain the node of current workloads.
Reboot and wait for the node to come back.
Verify the node is healthy.
Re-enable scheduling of new workloads to the node.
While solving the underlying issue would be ideal, we needed a mitigation to avoid toil in the meantime — an automated node remediation process.
Existing Detection and Remediation Solutions
While not complicated, the manual remediation process outlined previously became tedious and distracting, as we had to reboot nodes multiple times a day. Some manual intervention is unavoidable, but for cases matching the following, we wanted automation:
Generic worker nodes
Software issues confined to a given node
Already researched and diagnosed issues
Limiting automatic remediation to generic worker nodes is important as there are other node types in our clusters where more care is required. For example, for control-plane nodes the process has to be augmented to check etcd cluster health and ensure proper redundancy for components servicing the Kubernetes API. We are also going to limit the problem space to known software issues confined to a node where we expect automatic remediation to be the right answer (as in our ballooning network interface problem). With that in mind, we took a look at existing solutions that we could use.
Node Problem Detector
Node problem detector is a daemon that runs on each node that detects problems and reports them to the Kubernetes API. It has a pluggable problem daemon system such that one can add their own logic for detecting issues with a node. Node problems are distinguished between temporary and permanent problems, with the latter being persisted as status conditions on the Kubernetes node resources.2
Draino and Cluster-Autoscaler
Draino as its name implies, drains nodes but does so based on Kubernetes node conditions. It is meant to be used with cluster-autoscaler which then can add or remove nodes via the cluster plugins to scale node groups.
Kured is a daemon that looks at the presence of a file on the node to initiate a drain, reboot and uncordon of the given node. It uses a locking mechanism via the Kubernetes API to ensure only a single node is acted upon at a time.
The Kubernetes cluster-lifecycle SIG has been working on the cluster-api project to enable declaratively defining clusters to simplify provisioning, upgrading, and operating multiple Kubernetes clusters. It has a concept of machine resources which back Kubernetes node resources and furthermore has a concept of machine health checks. Machine health checks use node conditions to determine unhealthy nodes and then the cluster-api provider is then delegated to replace that machine via create and delete operations.
Proof of Concept
Interestingly, with all the above except for Kured, there is a theme of pluggable components centered around Kubernetes node conditions. We wanted to see if we could build a proof of concept using the existing theme and solutions. For the existing solutions, draino with cluster-autoscaler didn’t make sense in a non-cloud environment like our bare-metal set up. The cluster-api health checks are interesting, however they require a more complete investment into the cluster-api project to really make sense. That left us with node-problem-detector and kured. Deploying node-problem-detector was simple, and we ended up testing a custom-plugin-monitor like the following:
With that in place, the actual remediation needed to happen. Kured seemed to do most everything we needed, except that it was looking at a file instead of Kubernetes node conditions. We hacked together a patch to change that and tested it successfully end to end — we had a working proof of concept!
Revisiting Problem Detection
While the above worked, we found that node-problem-detector was unwieldy because we were replicating our existing monitoring into shell scripts and node-problem-detector configuration. A 2017 blog post describes Cloudflare’s monitoring stack, although some things have changed since then. What hasn’t changed is our extensive usage of Prometheus and Alertmanager.
For the network interface issue and other issues we wanted to address, we already had the necessary exported metrics and alerting to go with them. Here is what our already existing alert looked like3:
Note that we use a “notify” label to drive the routing logic in Alertmanager. However, that got us asking, could we just route this to a Kubernetes node condition instead?
Sciuro is our open-source replacement of node-problem-detector that has one simple job: synchronize Kubernetes node conditions with currently firing alerts in Alertmanager. Node problems can be defined with a more holistic view and using already existing exporters such as node exporter, cadvisor or mtail. It also doesn’t run on affected nodes which allows us to rely on out-of-band remediation techniques. Here is a high level diagram of how Sciuro works:
Starting from the top, nodes are scraped by Prometheus, which collects those metrics and fires relevant alerts to Alertmanager. Sciuro polls Alertmanager for alerts with a matching receiver, matches them with a corresponding node resource in the Kubernetes API and updates that node’s conditions accordingly.
In more detail, we can start by defining an alert in Prometheus like the following:
Note the two differences from the previous alert. First, we use a new name with a more sensitive trigger. The idea is that we want automatic node remediation to try fixing the node first as soon as possible, but if the problem worsens or automatic remediation is failing, humans will still get notified to act. The second difference is that instead of notifying chat rooms, we route to a target called “node-condition-k8s”.
Sciuro then comes into play, polling the Altertmanager API for alerts matching the “node-condition-k8s” receiver. The following shows the equivalent using amtool:
$ amtool alert query -r node-condition-k8s
Alertname Starts At Summary
CalicoTooManyInterfacesEarly 2021-05-11 03:25:21 UTC Kubernetes node worker1a has too many Calico interfaces
Note the node and instance labels which Sciuro will use for matching with the corresponding Kubernetes node. Sciuro uses the excellent controller-runtime to keep track of and update node sources in the Kubernetes API. We can observe the updated node condition on the status field via kubectl:
$ kubectl get node worker1a -o json | jq '.status.conditions | select(.type | test("^AlertManager"))'
"message": "[P6] Kubernetes node worker1a has too many Calico interfaces",
One important note is Sciuro added the AlertManager_ prefix to the node condition type to prevent conflicts with other node condition types. For example, DiskPressure, a kubelet managed condition, could also be an alert name. Sciuro will also properly update heartbeat and transition times to reflect when it first saw the alert and its last update. With node conditions synchronized by Sciuro, remediation can take place via one of the existing tools. As mentioned previously we are using a modified version of Kured for now.
We’re happy to announce that we’ve open sourced Sciuro, and it can be found on GitHub where you can read the code, find the deployment instructions, or open a Pull Request for changes.
Managing Node Uptime
While we began using automatic node remediation for obvious problems, we’ve expanded its purpose to additionally keep node uptime low. Low node uptime is desirable to further reduce drift on nodes, keep the node initialization process well-oiled, and encourage the best deployment practices on the Kubernetes clusters. To expand on the last point, services that are deployed with best practices and in a high availability fashion should see negligible impact when a single node leaves the cluster. However, services that are not deployed with best practices will most likely have problems especially if they rely on singleton pods. By draining nodes more frequently, it introduces regular chaos that encourages best practices. To enable this with automatic node remediation the following alert was defined:
There is a bit of juggling with the kube_node_roles metric in the above to isolate the alert to generic worker nodes, but at a high level it looks at node_boot_time_seconds, a metric from prometheus node_exporter. Again the notify label is configured to send to node conditions which kicks off the automatic node remediation. One further detail is the priority here is set to “9” which is of lower precedence than our other alerts. Note that the message field of the node condition is prefixed with the alert priority in brackets. This allows the remediation process to take priority into account when choosing which node to remediate first, which is important because Kured uses a lock to act on a single node at a time.
In the past 30 days, we’ve used the above automatic node remediation process to action 571 nodes. That has saved our humans a considerable amount of time. We’ve also been able to reduce the time to repair for some issues as automatic remediation can act at all times of the day and with a faster response time.
As mentioned before, we’re open sourcing Sciuro and its code can be found on GitHub. We’re open to issues, suggestions, and pull requests. We do have some ideas for future improvements. For Sciuro, we may look to reduce latency which is mainly due to polling and potentially add a push model from Altermanager although this isn’t a need we’ve had yet. For the larger node remediation story, we hope to do an overhaul of the remediating component. As mentioned previously, we are currently using a fork of kured, but a future replacement component should include the following:
Use out-of-band management interfaces to be able to shut down and power on nodes without a functional operating system.
Move from decentralized architecture to a centralized one that can integrate more complicated logic. This might include being able to act on entire failure domains in parallel.
Handle specialized nodes such as masters or storage nodes.
Finally, we’re looking for more people passionate about Kubernetes to join our team. Come help us push Kubernetes to the next level to serve Cloudflare’s many needs!
1Exhaustion can be applied to hardware resources, kernel resources, or logical resources like the amount of logging being produced. 2Nearly all Kubernetes objects have spec and status fields. The status field is used to describe the current state of an object. For node resources, typically the kubelet manages a conditions field under the status field for reporting things like if the node is ready for servicing pods. 3The format of the following alert is documented on Prometheus Alerting Rules.
Serving more than approximately25 million Internet properties is not an easy thing, and neither is serving 20 million requests per second on average. At Cloudflare, we achieve this by running a homogeneous edge environment: almost every Cloudflare server runs all Cloudflare products.
As we offer more and more products and enjoy the benefit of horizontal scalability, our edge stack continues to grow in complexity. Originally, we only operated at the application layer with our CDN service and DoS protection. Then we launched transport layer products, such as Spectrum and Argo. Now we have further expanded our footprint into the IP layer and physical link with Magic Transit. They all run on every machine we have. The work of our engineers enables our products to evolve at a fast pace, and to serve our customers better.
However, such software complexity presents a sheer challenge to operation: the more changes you make, the more likely it is that something is going to break. And we don’t tolerate any of our mistakes slipping into the production environment and affecting our customers.
In this article, we will discuss one of the techniques we use to fight such software complexity: simulations. Simulations are basically system tests that run with synthesized customer traffic and applications. We would like to introduce our simulation system, SOAR, i.e. Simulation for Observability, reliAbility, and secuRity.
What is SOAR? Simply put, it’s a data center built specifically for simulations. It runs the same software stack as our production data centers, but without any production traffic. Within SOAR, there are end-user servers, product servers, and origin servers (Figure 2). The product servers behave exactly the same as servers in our production edge network, and they are the targets that we want to test. End-user servers and origin servers run applications that try to simulate customer behaviors. The simplest case is to run network benchmarks through product servers, in order to evaluate how effective the corresponding products are. Instead of sending test traffic over the Internet, everything happens in the LAN environment that Cloudflare tightly controls. This gives us great flexibility in testing network features such as bring-your-own-IP (BYOIP) products.
To demonstrate how this works, let’s go through a running example using Magic Transit.
Magic Transit is a product that provides IP layer protection and acceleration. One of the main functions of Magic Transit is to shield customers from DDoS attacks.
Customers bring their IP ranges to advertise from Cloudflare edge. When attackers initiate a DoS attack, Cloudflare absorbs all the customer’s traffic, drops the attack traffic, and encapsulates clean traffic to customers. For this product, operational concerns are multifold, and here are some examples:
Have we properly configured our data plane so that traffic can reach customers? Is BGP ready? Are ECMP routes programmed correctly? Are health probes working correctly?
Do we have any untested corner cases that only manifest with a large amount of traffic?
Is our DoS system dropping malicious traffic as intended? How effective
Will any other team’s changes break Magic Transit as our edge keeps growing?
To ease these concerns, we run simulated customers with SOAR. Yes, simulated, not real. For example, assume a customer Alice onboarded an IP range 192.0.2.0/24 to Magic Transit. To simulate this customer, in SOAR we can configure a test application (e.g. iperf server) on one origin server to represent Alice’s service. We also bring up a product server to run Magic Transit. This product server will filter traffic toward a.b.c.0/24, and GRE encapsulated cleansed traffic to Alice’s specified GRE endpoint. To make it work, we also add routing rules to forward packets destined to 192.0.2.0/24 to go through the product server above. Similarly, we add routing rules to deliver GRE packets from the product server to the origin servers. Lastly, we start running test clients as eyeballs to evaluate the functional correctness, performance metrics, and resource usage.
For the rest of this article, we will talk about the design and implementation of this simulation system, as well as several real cases in which it helped us catch problems early or avoid problems altogether.
From performance simulation to config simulation
Before we created SOAR, we had already built a “performance simulation” for our layer 7 services. It is based on SaltStack, our configuration management software. All the simulation cases are system test cases against Cloudflare-owned HTTP sites. These cases are statically configured and run non-stop. Each simulation case produces multiple Prometheus metrics such as requests per second and latency. We monitor these metrics daily on our Grafana dashboard.
While this simulation system is very useful, it becomes less efficient as we have more and more simulation cases and products to run and analyze.
Isolation and Coordination
As more types of simulations are onboarded, it is critical to ensure each simulation runs in a clean environment, and all tasks of a simulation run together. This challenge is specific to providers like Cloudflare, whose products are not virtualized because we want to maximize our edge performance. As a result, we have to isolate simulations and clean up by ourselves; otherwise, different simulations may cross-affect each other.
For example, for Magic Transit simulations, we need to create a GRE tunnel on an origin server and set up several routes on all three servers, to make sure simulated traffic can flow as real Magic Transit customers would. We cannot leave these routes after the simulation finishes, or there might be a conflict. We once ran into a situation where different simulations required different source IP addresses to the same destination. In our original performance simulation environment, we will have to modify simulation applications to avoid these conflicts. This approach is less desirable as different engineering teams have to pay attention to other teams’ code.
Moreover, the performance simulation addresses only the most basic system test cases: a client sends traffic to a server and measures the performance characteristics like request per second and latency quantile. But the situation we want to simulate and validate in our production environment can be far more complex.
In our previous example of Magic Transit, customers can configure complicated network topology. Figure 4 is one simplified case. Let’s say Alice establishes four GRE tunnels with Cloudflare; two connect to her data center 1, and traffic will be ECMP hashed between these two tunnels. Similarly, she also establishes another two tunnels to her data center 2 with similar ECMP settings. She would like to see traffic hit data center 1 normally and fail over to data center 2 when tunnels 1 and 2 are both down.
In order to examine the effectiveness of route failover, we would need to inject errors on the product servers only after the traffic on the eyeball server has started. But this type of coordination is not achievable with statically defined simulations.
Engineer friendliness and Interactiveness
Our performance simulation is not engineer-friendly. Not just because it is all statically configured in SaltStack (most engineering teams do not possess Salt expertise), but it is also not integrated with an engineer’s daily routine. Can engineers trigger a simulation on every branch build? Can simulation results get back in time to inform that a performance problem occurs? The answer is no, it is not possible with our static configuration. An engineer can submit a Salt PR to config a new simulation, but this simulation may have to wait for several hours because all other unfinished simulations need to complete first (recall it is just a static loop). Nor can an engineer add a test to the team’s repository to run on every build, as it needs to reside in the SRE-managed Salt repository, making it unmanageable as the number of simulations grows.
To address the above limitations, we designed SOAR.
The architecture is a performance simulation structure, extended. We created an internal coordinator service to:
Interface with engineers, so they could now submit one-time simulations from their laptop or within the building pipeline, or view previous execution results.
Dispatch and coordinate simulation tasks to each simulation server, where a simulation agent executes these tasks. The coordinator will isolate simulations properly so none of them contends on system resources. For example, the simplest policy we implemented is to never run two simulations on the same server at the same time.
The coordinator is secured by Cloudflare Access, so that only employees can visit this service. The coordinator will serve two types of simulations: one-time simulation to be run in an ad-hoc way and mainly on a per pull request manner, to ease development testing. It’s also callable from our CI system. Another type is repetitive simulations that are stored in the coordinator’s persistent storage. These simulations serve daily monitoring purposes and will be executed periodically.
Each simulation server runs a simulation agent. This agent will execute two types of tasks received from the coordinator: system tasks and user tasks. System tasks change the system-wide configurations and will be reverted after each simulation terminates. These will include but are not limited to route change, link change, address change, ipset change and iptables change.
User tasks, on the other hand, run benchmarks that we are interested in evaluating, and will be terminated if it exceeds an allocated execution budget. Each user task is isolated in a cgroup, and the agent will ensure all user tasks are executed with dedicated resources. The generic runtime metrics of user tasks is monitored by Cadvisor and sent to Prometheus and Alert Manager. A user task can export its own metrics to Prometheus as well.
For SOAR to run reliably, we provisioned a dedicated environment that enforces the same settings for the production environment and operates it as a production system: hardened security, standard alerts on watch, no engineer access except approved tools. This to a large extent allows us to run simulations as a stable source of anomaly detection.
Simulating with customer-specific configuration
An important ability of SOAR is to simulate for a specific customer. This will provide the customer with more guarantees that both their configurations and our services are battle-tested with traffic before they go live. It can also be used to bisect problems during a customer escalation, helping customer support to rule out unrelated factors more easily.
All of our edge servers know how to dispatch an incoming customer packet. This factor greatly reduces difficulties in simulating a specific customer. What we need to do in simulation is to mock routing and domain translation on simulated eyeballs and origins, so that they will correctly send traffic to designated product servers. And the problem is solved—magic!
The actual implementation is also straightforward: as simulations run in a LAN environment, we have tight control over how to route packets (servers are on the same broadcast domain). Any eyeball, origin, or product server can just use a direct routing rule and a static DNS entry in /etc/hosts to redirect packets to go to the correct destination.
Running a simulation this way allows us to separate customer configuration management from the simulation service: our products will manage it, so any time a customer configuration is changed, they will already reflect in simulations without special care.
Implementation and Integration
All SOAR components are built with Golang from scratch on Linux servers. It took three engineer-months to build the prototype and onboard the first engineering use case. While there are other mature platforms for job scheduling, task isolation, and monitoring, building our own allows us to better absorb new requirements from engineering teams, which is much easier and quicker than an external dependency.
In addition, we are currently integrating the simulation service into our release pipeline. Cloudflare built a release manager internally to schedule product version changes in controlled steps: a new product is first deployed into dogfooding data centers. After the product has been trialed by Cloudflare employees, it moves to several canary data centers with limited customer traffic. If nothing bad happens, in an hour or so, it starts to land in larger data centers spread across three tiers. Tier-3 will receive the changes an hour earlier than tier-2, and the same applies to tier-2 and tier-1. This ensures a product would be battle-tested enough before it can serve the majority of Cloudflare customers.
Now we move this further by adding a simulation step even before dogfooding. In this step, all changes are deployed into the simulation environment, and engineering teams will configure which simulations to run. Dogfooding starts only when there is no performance regression or functional breakage. Here performance regression is based on Prometheus metrics, where each engineering team can define their own Prometheus query to interpret the performance results. All configured simulations will run periodically to detect problems in releases that do not tie to a specific product, e.g. a Linux kernel upgrade. SREs receive notifications asynchronously if any issue is discovered.
Simulations at Cloudflare: Case Studies
Simulations are very useful inside Cloudflare. Let’s see some real experiences we had in the past.
Detecting an anomaly on data center specific releases
In our Magic Transit example, the engineering team was about to release physical network interconnect (PNI) support. With PNI, customer data centers physically peer with Cloudflare routers.
However, this PNI functionality introduces a problem in our normal release process. However, PNI data centers are typically different from our dogfooding and canary data centers. If we still release with the normal process, then two critical phases are skipped. And what’s worse, the PNI data center could be a choke point in front of that customer’s traffic. If the PNI data center is taken down, no other data center can replace its role.
SOAR in this case is an important utility to help. The basic idea is to configure a server with PNI information. This server will act as if it runs in a PNI data center. Then we run simulated eyeball and origin to examine if there is any functional breakage:
With such simulation capability, we were able to detect several problems early on and before releasing. For example, we caught a problem that impacts checksum offloading, which could encapsulate TCP packets with the wrong inner checksum and cause the packets to be dropped at the origin side. This problem does not exist in our virtualized testing environment and integration tests; it only happens when production hardware comes into play. We then use this simulation as a success indicator to test various fixes until we get the packet flow running normally again.
Continuously monitor performance on the edge stack
When a team configures a simulation, it runs on the same stack where all other teams run their products as well. This means when a simulation starts to show unhealthy results, it may or may not directly relate to the product associated with that simulation.
But with continuous simulations, we will have more chances to detect issues before things go south, or at least it will serve as a hint to quickly react to emerging problems. In an example early this year, we noticed one of our performance simulation dashboards showed that some HTTP request throughput was dropping by 20%. After digging into the case, we found our bot detection system had made a change that affected related requests. Luckily enough we moved fast thanks to the hint from the simulation (and some other useful tools like Opentracing).
Our recent enhancement from just HTTP performance simulation to SOAR makes it even more useful for customers. This is because we are now able to simulate with customer-specific configurations, so we might expose customer-specific problems. We are still dogfooding this, and hopefully, we can deploy it to our customers soon.
DoS Attacks as Simulations
When we started to develop Magic Transit, a question worth monitoring was how effective our mitigation pipeline is, and how to apply thresholds for different customers. For our new ACK flood mitigation system, flowtrackd, we onboarded its performance simulation cases together with tunable ACK flood. Combined with customer-specific configuration, this allows us to compare the throughput result under different volumes of attacks, and systematically tune our mitigation threshold.
Another important factor that we will be able to achieve with our “attack simulation” system is to mount attacks we have seen in the past, making sure the development of our mitigation pipelines won’t ever pass on these known attacks to our customers.
In this article, we introduced Cloudflare’s simulation system, SOAR. While simulation is not a new tool, we can use it to improve reliability, observability, and security. Our adoption of SOAR is still in its early stages, but we are pretty confident that, by fully leveraging simulations, we will push our quality of service to a new level.
Cloud solutions architects should ideally “build today with tomorrow in mind,” meaning their solutions need to cater to current scale requirements as well as the anticipated growth of the solution. This growth can be either the organic growth of a solution or it could be related to a merger and acquisition type of scenario, where its size is increased dramatically within a short period of time.
Still, when a solution scales, many architects experience added complexity to the overall architecture in terms of its manageability, performance, security, etc. By architecting your solution or application to scale reliably, you can avoid the introduction of additional complexity, degraded performance, or reduced security as a result of scaling.
Generally, a solution or service’s reliability is influenced by its up time, performance, security, manageability, etc. In order to achieve reliability in the context of scale, take into consideration the following primary design principals.
Modularity aims to break a complex component or solution into smaller parts that are less complicated and easier to scale, secure, and manage.
Figure 1: Monolithic architecture vs. modular architecture
Modular design is commonly used in modern application developments. where an application’s software is constructed of multiple and loosely coupled building blocks (functions). These functions collectively integrate through pre-defined common interfaces or APIs to form the desired application functionality (commonly referred to as microservices architecture).
This design principle can also be applied to different components of the solution’s architecture. For example, when building a cloud solution on a single Amazon VPC, it may reach certain scaling limits and make it harder to introduce changes at scale due to the higher level of dependencies. This single complex VPC can be divided into multiple smaller and simpler VPCs. The architecture based on multiple VPCs can vary. For example, the VPCs can be divided based on a service or application building block, a specific function of the application, or on organizational functions like a VPC for various departments. This principle can also be leveraged at a regional level for very high scale global architectures. You can make the architecture modular at a global level by distributing the multiple VPCs across different AWS Regions to achieve global scale (facilitated by AWS Global Infrastructure).
In addition, modularity promotes separation of concerns by having well-defined boundaries among the different components of the architecture. As a result, each component can be managed, secured, and scaled independently. Also, it helps you avoid what is commonly known as “fate sharing,” where a vertically scaled server hosts a monolithic application, and any failure to this server will impact the entire application.
Horizontal scaling, commonly referred to as scale-out, is the capability to automatically add systems/instances in a distributed manner in order to handle an increase in load. Examples of this increase in load could be the increase of number of sessions to a web application. With horizontal scaling, the load is distributed across multiple instances. By distributing these instances across Availability Zones, horizontal scaling not only increases performance, but also improves the overall reliability.
In order for the application to work seamlessly in a scale-out distributed manner, the application needs to be designed to support a stateless scaling model, where the application’s state information is stored and requested independently from the application’s instances. This makes the on-demand horizontal scaling easier to achieve and manage.
This principle can be complemented with a modularity design principle, in which the scaling model can be applied to certain component(s) or microservice(s) of the application stack. For example, only scale-out Amazon Elastic Cloud Compute (EC2) front-end web instances that reside behind an Elastic Load Balancing (ELB) layer with auto-scaling groups. In contrast, this elastic horizontal scalability might be very difficult to achieve for a monolithic type of application.
Leverage the content delivery network
Leveraging Amazon CloudFront and its edge locations as part of the solution architecture can enable your application or service to scale rapidly and reliably at a global level, without adding any complexity to the solution. The integration of a CDN can take different forms depending on the solution use case.
For example, CloudFront played an important role to enable the scale required throughout Amazon Prime Day 2020 by serving up web and streamed content to a worldwide audience, which handled over 280 million HTTP requests per minute.
Go serverless where possible
As discussed earlier in this post, modular architectures based on microservices reduce the complexity of the individual component or microservice. At scale it may introduce a different type of complexity related to the number of these independent components (microservices). This is where serverless services can help to reduce such complexity reliably and at scale. With this design model you no longer have to provision, manually scale, maintain servers, operating systems, or runtimes to run your applications.
To avoid any major changes at a later stage to accommodate security requirements, it’s essential that security is taken into consideration as part of the initial solution design. For example, if the cloud project is new or small, and you don’t consider security properly at the initial stages, once the solution starts to scale, redesigning the entire cloud project from scratch to accommodate security best practices is usually not a simple option, which may lead to consider suboptimal security solutions that may impact the desired scale to be achieved. By leveraging CDN as part of the solution architecture (as discussed above), using Amazon CloudFront, you can minimize the impact of distributed denial of service (DDoS) attacks as well as perform application layer filtering at the edge. Also, when considering serverless services and the Shared Responsibility Model, from a security lens you can delegate a considerable part of the application stack to AWS so that you can focus on building applications. See The Shared Responsibility Model for AWS Lambda.
Design with security in mind by incorporating the necessary security services as part of the initial cloud solution. This will allow you to add more security capabilities and features as the solution grows, without the need to make major changes to the design.
Design for failure
The reliability of a service or solution in the cloud depends on multiple factors, the primary of which is resiliency. This design principle becomes even more critical at scale because the failure impact magnitude typically will be higher. Therefore, to achieve a reliable scalability, it is essential to design a resilient solution, capable of recovering from infrastructure or service disruptions. This principle involves designing the overall solution in such a way that even if one or more of its components fail, the solution is still be capable of providing an acceptable level of its expected function(s). See AWS Well-Architected Framework – Reliability Pillar for more information.
Designing for scale alone is not enough. Reliable scalability should be always the targeted architectural attribute. The design principles discussed in this blog act as the foundational pillars to support it, and ideally should be combined with adopting a DevOps model.
Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Everyone slows to a crawl, sometimes for a minor issue or sometimes for no reason at all. As engineers at Netflix, we are constantly reevaluating how to redesign traffic management. What if we knew the urgency of each traveler and could selectively route cars through, rather than making everyone wait?
In Netflix engineering, we’re driven by ensuring Netflix is there when you need it to be. Yet, as recent as last year, our systems were susceptible to metaphorical traffic jams; we had on/off circuit breakers, but no progressive way to shed load. Motivated by improving the lives of our members, we’ve introduced priority-based progressive load shedding.
The animation below shows the behavior of the Netflix viewer experience when the backend is throttling traffic based on priority. While the lower priority requests are throttled, the playback experience remains uninterrupted and the viewer is able to enjoy their title. Let’s dig into how we accomplished this.
Failure can occur due to a myriad of reasons: misbehaving clients that trigger a retry storm, an under-scaled service in the backend, a bad deployment, a network blip, or issues with the cloud provider. All such failures can put a system under unexpected load, and at some point in the past, every single one of these examples has prevented our members’ ability to play. With these incidents in mind, we set out to make Netflix more resilient with these goals:
Consistently prioritize requests across device types (Mobile, Browser, and TV)
Progressively throttle requests based on priority
Validate assumptions by using Chaos Testing (deliberate fault injection) for requests of specific priorities
The resulting architecture that we envisioned with priority throttling and chaos testing included is captured below.
Building a request taxonomy
We decided to focus on three dimensions in order to categorize request traffic: throughput, functionality, and criticality. Based on these characteristics, traffic was classified into the following:
NON_CRITICAL: This traffic does not affect playback or members’ experience. Logs and background requests are examples of this type of traffic. These requests are usually high throughput which contributes to a large percentage of load in the system.
DEGRADED_EXPERIENCE: This traffic affects members’ experience, but not the ability to play. The traffic in this bucket is used for features like: stop and pause markers, language selection in the player, viewing history, and others.
CRITICAL: This traffic affects the ability to play. Members will see an error message when they hit play if the request fails.
Using attributes of the request, the API gateway service (Zuul) categorizes the requests into NON_CRITICAL, DEGRADED_EXPERIENCE and CRITICAL buckets, and computes a priority score between 1 to 100 for each request given its individual characteristics. The computation is done as a first step so that it is available for the rest of the request lifecycle.
Most of the time, the request workflow proceeds normally without taking the request priority into account. However, as with any service, sometimes we reach a point when either one of our backends is in trouble or Zuul itself is in trouble. When that happens requests with higher priority get preferential treatment. The higher priority requests will get served, while the lower priority ones will not. The implementation is analogous to a priority queue with a dynamic priority threshold. This allows Zuul to drop requests with a priority lower than the current threshold.
Finding the best place to throttle traffic
Zuul can apply load shedding in two moments during the request lifecycle: when it routes requests to a specific back-end service (service throttling) or at the time of initial request processing, which affects all back-end services (global throttling).
Zuul can sense when a back-end service is in trouble by monitoring the error rates and concurrent requests to that service. Those two metrics are approximate indicators of failures and latency. When the threshold percentage for one of these two metrics is crossed, we reduce load on the service by throttling traffic.
Another case is when Zuul itself is in trouble. As opposed to the scenario above, global throttling will affect all back-end services behind Zuul, rather than a single back-end service. The impact of this global throttling can cause much bigger problems for members. The key metrics used to trigger global throttling are CPU utilization, concurrent requests, and connection count. When any of the thresholds for those metrics are crossed, Zuul will aggressively throttle traffic to keep itself up and healthy while the system recovers. This functionality is critical: if Zuul goes down, no traffic can get through to our backend services, resulting in a total outage.
Once we had the prioritization piece in place, we were able to combine it with our load shedding mechanism to dramatically improve streaming reliability. When we’re in a bad situation (i.e. any of the thresholds above are exceeded), we progressively drop traffic, starting with the lowest priority. A cubic function is used to manage the level of throttling. If things get really, really bad the level will hit the sharp side of the curve, throttling everything.
The graph above is an example of how the cubic function is applied. As the overload percentage increases (i.e. the range between the throttling threshold and the max capacity), the priority threshold trails it very slowly: at 35%, it’s still in the mid-90s. If the system continues to degrade, we hit priority 50 at 80% exceeded and then eventually 10 at 95%, and so on.
Given that a relatively small amount of requests impact streaming availability, throttling low priority traffic may affect certain product features but will not prevent members pressing “play” and watching their favorite show. By adding progressive priority-based load shedding, Zuul can shed enough traffic to stabilize services without members noticing.
Handling retry storms
When Zuul decides to drop traffic, it sends a signal to devices to let them know that we need them to back off. It does this by indicating how many retries they can perform and what kind of time window they can perform them in. For example:
Using this backpressure mechanism, we can stop retry storms much faster than we could in the past. We automatically adjust these two dials based on the priority of the request. Requests with higher priority will retry more aggressively than lower ones, also increasing streaming availability.
Validating which requests are right for the job
To validate our request taxonomy assumptions on whether a specific request fell into the NON_CRITICAL, DEGRADED, or CRITICAL bucket, we needed a way to test the user’s experience when that request was shed. To accomplish this, we leveraged our internal failure injection tool (FIT) and created a failure injection point in Zuul that allowed us to shed any request based on a supplied priority. This enabled us to manually simulate a load shedded experience by blocking ranges of priorities for a specific device or member, giving us an idea of which requests could be safely shed without impacting the user.
Continually ensuring those requests are still right for the job
One of the goals here is to reduce members’ pain by shedding requests that are not expected to affect the user’s streaming experience. However, Netflix changes quickly and requests that were thought to be noncritical can unexpectedly become critical. In addition, Netflix has a wide variety of client devices, client versions, and ways to interact with the system. To make sure we weren’t causing members pain when throttling NON_CRITICAL requests in any of these scenarios, we leveraged our infrastructure experimentation platform ChAP.
This platform allows us to stage an A/B experiment that will allocate a small number of production users to either a control or treatment group for 45 minutes while throttling a range of priorities for the treatment group. This lets us capture a variety of live use cases and measure the impact to their playback experience. ChAP analyzes the members’ KPIs per device to determine if there is a deviation between the control and the treatment groups.
In our first experiment, we detected a race condition in both Android and iOS devices for a low priority request that caused sporadic playback errors. Since we practice continuous experimentation, once the initial experiments were run and the bugs were fixed, we scheduled them to run on a periodic basis. This allows us to detect regressions early and keep users streaming.
Reaping the benefits
In 2019, before progressive load shedding was in place, the Netflix streaming services experienced an outage that resulted in a sizable percentage of members who were not able to play for a period of time. In 2020, days after the implementation was deployed, the team started seeing the benefit of the solution. Netflix experienced a similar issue with the same potential impact as the outage seen in 2019. Unlike then, Zuul’s progressive load shedding kicked in and started shedding traffic until the service was in a healthy state without impacting members’ ability to play at all.
The graph below shows a stable streaming availability metric stream per second (SPS) while Zuul is performing progressive load shedding based on request priority during the incident. The different colors in the graph represent requests with different priority being throttled.
Members were happily watching their favorite show on Netflix while the infrastructure was self-recovering from a system failure.
We are not done yet
For future work, the team is looking into expanding the use of request priority for other use cases like better retry policies between devices and back-ends, dynamically changing load shedding thresholds, tuning the request priorities using Chaos Testing as a guiding principle, and other areas that will make Netflix even more resilient.
If you’re interested in helping Netflix stay up in the face of shifting systems and unexpected failures, reach out to us. We’re hiring!
October 1 was this year’s DNS Flag Day. Read on to find out all about DNS Flag Day and how it affects Cloudflare’s DNS services (hint: it doesn’t, we already did the work to be compliant).
What is DNS Flag Day?
DNS Flag Day is an initiative by several DNS vendors and operators to increase the compliance of implementations with DNS standards. The goal is to make DNS more secure, reliable and robust. Rather than a push for new features, DNS flag day is meant to ensure that workarounds for non-compliance can be reduced and a common set of functionalities can be established and relied upon.
Last year’s flag day was February 1, and it set forth that servers and clients must be able to properly handle the Extensions to DNS (EDNS0) protocol (first RFC about EDNS0 are from 1999 – RFC 2671). This way, by assuming clients have a working implementation of EDNS0, servers can resort to always sending messages as EDNS0. This is needed to support DNSSEC, the DNS security extensions. We were, of course, more than thrilled to support the effort, as we’re keen to push DNSSECadoption forward .
DNS Flag Day 2020
The goal for this year’s flag day is to increase DNS messaging reliability by focusing on problems around IP fragmentation of DNS packets. The intention is to reduce DNS message fragmentation which continues to be a problem. We can do that by ensuring cleartext DNS messages sent over UDP are not too large, as large messages risk being fragmented during the transport. Additionally, when sending or receiving large DNS messages, we have the ability to do so over TCP.
Problem with DNS transport over UDP
A potential issue with sending DNS messages over UDP is that the sender has no indication of the recipient actually receiving the message. When using TCP, each packet being sent is acknowledged (ACKed) by the recipient, and the sender will attempt to resend any packets not being ACKed. UDP, although it may be faster than TCP, does not have the same mechanism of messaging reliability. Anyone still wishing to use UDP as their transport protocol of choice will have to implement this reliability mechanism in higher layers of the network stack. For instance, this is what is being done in QUIC, the new Internet transport protocol used by HTTP/3 that is built on top of UDP.
Even the earliest DNS standards (RFC 1035) specified the use of sending DNS messages over TCP as well as over UDP. Unfortunately, the choice of supporting TCP or not was up to the implementer/operator, and then firewalls were sometimes set to block DNS over TCP. More recent updates to RFC 1035, on the other hand, require that the DNS server is available to query using DNS over TCP.
DNS message fragmentation
Sending data over networks and the Internet is restricted to the limitation of how large each packet can be. Data is chopped up into a stream of packets, and sized to adhere to the Maximum Transmission Unit (MTU) of the network. MTU is typically 1500 bytes for IPv4 and, in the case of IPv6, the minimum is 1280 bytes. Subtracting both the IP header size (IPv4 20 bytes/IPv6 40 bytes) and the UDP protocol header size (8 bytes) from the MTU, we end up with a maximum DNS message size of 1472 bytes for IPv4 and 1232 bytes in order for a message to fit within a single packet. If the message is any larger than that, it will have to be fragmented into more packets.
Sending large messages causes them to get fragmented into more than one pack. This is not a problem with TCP transports since each packet is ACK:ed to ensure proper delivery. However, the same does not hold true when sending large DNS messages over UDP. For many intents and purposes, UDP has been treated as a second-class citizen to TCP as far as network routing is concerned. It is quite common to see UDP packet fragments being dropped by routers and firewalls, potentially causing parts of a message to be lost. To avoid fragmentation over UDP it is better to truncate the DNS message and set the Truncation Flag in the DNS response. This tells the recipient that more data is available if the query is retried over TCP.
DNS Flag Day 2020 wants to ensure that DNS message fragmentation does not happen. When larger DNS messages need to be sent, we need to ensure it can be done reliably over TCP.
DNS servers need to support DNS message transport over TCP in order to be compliant with this year’s flag day. Also, DNS messages sent over UDP must never exceed the limit over which they risk being fragmented.
Cloudflare authoritative DNS and 18.104.22.168
We fully support the DNS Flag Day initiative, as it aims to make DNS more reliable and robust, and it ensures a common set of features for the DNS community to evolve on. In the DNS ecosystem, we are as much a client as we are a provider. When we perform DNS lookups on behalf of our customers and users, we rely on other providers to follow standards and be compliant. When they are not, and we can’t work around the issues, it leads to problems resolving names and reaching resources.
Both our public resolver 22.214.171.124 as well as our authoritative DNS service, set and enforce reasonable limits on DNS message sizes when sent over UDP. Of course, both services are available over TCP. If you’re already using Cloudflare, there is nothing you need to do but to keep using our DNS services! We will continually work on improving DNS.
If you already understand how Secondary DNS works, please feel free to skip this section. It does not provide any Cloudflare-specific information.
Secondary DNS has many use cases across the Internet; however, traditionally, it was used as a synchronized backup for when the primary DNS server was unable to respond to queries. A more modern approach involves focusing on redundancy across many different nameservers, which in many cases broadcast the same anycasted IP address.
Secondary DNS involves the unidirectional transfer of DNS zones from the primary to the Secondary DNS server(s). One primary can have any number of Secondary DNS servers that it must communicate with in order to keep track of any zone updates. A zone update is considered a change in the contents of a zone, which ultimately leads to a Start of Authority (SOA) serial number increase. The zone’s SOA serial is one of the key elements of Secondary DNS; it is how primary and secondary servers synchronize zones. Below is an example of what an SOA record might look like during a dig query.
example.com 3600 IN SOA ashley.ns.cloudflare.com. dns.cloudflare.com.
2034097105 // Serial
10000 // Refresh
2400 // Retry
604800 // Expire
3600 // Minimum TTL
Each of the numbers is used in the following way:
Serial – Used to keep track of the status of the zone, must be incremented at every change.
Refresh – The maximum number of seconds that can elapse before a Secondary DNS server must check for a SOA serial change.
Retry – The maximum number of seconds that can elapse before a Secondary DNS server must check for a SOA serial change, after previously failing to contact the primary.
Expire – The maximum number of seconds that a Secondary DNS server can serve stale information, in the event the primary cannot be contacted.
Minimum TTL – Per RFC 2308, the number of seconds that a DNS negative response should be cached for.
Using the above information, the Secondary DNS server stores an SOA record for each of the zones it is tracking. When the serial increases, it knows that the zone must have changed, and that a zone transfer must be initiated.
Serial increases can be detected in the following ways:
The fastest way for the Secondary DNS server to keep track of a serial change is to have the primary server NOTIFY them any time a zone has changed using the DNS protocol as specified in RFC 1996, Secondary DNS servers will instantly be able to initiate a zone transfer.
Another way is for the Secondary DNS server to simply poll the primary every “Refresh” seconds. This isn’t as fast as the NOTIFY approach, but it is a good fallback in case the notifies have failed.
One of the issues with the basic NOTIFY protocol is that anyone on the Internet could potentially notify the Secondary DNS server of a zone update. If an initial SOA query is not performed by the Secondary DNS server before initiating a zone transfer, this is an easy way to perform an amplification attack. There is two common ways to prevent anyone on the Internet from being able to NOTIFY Secondary DNS servers:
Using transaction signatures (TSIG) as per RFC 2845. These are to be placed as the last record in the extra records section of the DNS message. Usually the number of extra records (or ARCOUNT) should be no more than two in this case.
Using IP based access control lists (ACL). This increases security but also prevents flexibility in server location and IP address allocation.
Generally NOTIFY messages are sent over UDP, however TCP can be used in the event the primary server has reason to believe that TCP is necessary (i.e. firewall issues).
In addition to serial tracking, it is important to ensure that a standard protocol is used between primary and Secondary DNS server(s), to efficiently transfer the zone. DNS zone transfer protocols do not attempt to solve the confidentiality, authentication and integrity triad (CIA); however, the use of TSIG on top of the basic zone transfer protocols can provide integrity and authentication. As a result of DNS being a public protocol, confidentiality during the zone transfer process is generally not a concern.
Authoritative Zone Transfer (AXFR)
AXFR is the original zone transfer protocol that was specified in RFC 1034 and RFC 1035 and later further explained in RFC 5936. AXFR is done over a TCP connection because a reliable protocol is needed to ensure packets are not lost during the transfer. Using this protocol, the primary DNS server will transfer all of the zone contents to the Secondary DNS server, in one connection, regardless of the serial number. AXFR is recommended to be used for the first zone transfer, when none of the records are propagated, and IXFR is recommended after that.
Incremental Zone Transfer (IXFR)
IXFR is the more sophisticated zone transfer protocol that was specified in RFC 1995. Unlike the AXFR protocol, during an IXFR, the primary server will only send the secondary server the records that have changed since its current version of the zone (based on the serial number). This means that when a Secondary DNS server wants to initiate an IXFR, it sends its current serial number to the primary DNS server. The primary DNS server will then format its response based on previous versions of changes made to the zone. IXFR messages must obey the following pattern:
Current latest SOA
Secondary server current SOA
DNS record deletions
Secondary server current SOA + changes
DNS record additions
Current latest SOA
Steps 2,3,4,5,6 can be repeated any number of times, as each of those represents one change set of deletions and additions, ultimately leading to a new serial.
IXFR can be done over UDP or TCP, but again TCP is generally recommended to avoid packet loss.
How Does Secondary DNS Work at Cloudflare?
The DNS team loves microservice architecture! When we initially implemented Secondary DNS at Cloudflare, it was done using Mesos Marathon. This allowed us to separate each of our services into several different marathon apps, individually scaling apps as needed. All of these services live in our core data centers. The following services were created:
Zone Transferer – responsible for attempting IXFR, followed by AXFR if IXFR fails.
Zone Transfer Scheduler – responsible for periodically checking zone SOA serials for changes.
Rest API – responsible for registering new zones and primary nameservers.
In addition to the marathon apps, we also had an app external to the cluster:
Notify Listener – responsible for listening for notifies from primary servers and telling the Zone Transferer to initiate an AXFR/IXFR.
Each of these microservices communicates with the others through Kafka.
Once the zone transferer completes the AXFR/IXFR, it then passes the zone through to our zone builder, and finally gets pushed out to our edge at each of our 200 locations.
Although this current architecture worked great in the beginning, it left us open to many vulnerabilities and scalability issues down the road. As our Secondary DNS product became more popular, it was important that we proactively scaled and reduced the technical debt as much as possible. As with many companies in the industry, Cloudflare has recently migrated all of our core data center services to Kubernetes, moving away from individually managed apps and Marathon clusters.
What this meant for Secondary DNS is that all of our Marathon-based services, as well as our NOTIFY Listener, had to be migrated to Kubernetes. Although this long migration ended up paying off, many difficult challenges arose along the way that required us to come up with unique solutions in order to have a seamless, zero downtime migration.
Challenges When Migrating to Kubernetes
Although the entire DNS team agreed that kubernetes was the way forward for Secondary DNS, it also introduced several challenges. These challenges arose from a need to properly scale up across many distributed locations while also protecting each of our individual data centers. Since our core does not rely on anycast to automatically distribute requests, as we introduce more customers, it opens us up to denial-of-service attacks.
The two main issues we ran into during the migration were:
How do we create a distributed and reliable system that makes use of kubernetes principles while also making sure our customers know which IPs we will be communicating from?
When opening up a public-facing UDP socket to the Internet, how do we protect ourselves while also preventing unnecessary spam towards primary nameservers?.
As was previously mentioned, one form of protection in the Secondary DNS protocol is to only allow certain IPs to initiate zone transfers. There is a fine line between primary servers allow listing too many IPs and them having to frequently update their IP ACLs. We considered several solutions:
Allowlist all Cloudflare IPs and dynamically update
Proxy egress traffic
Ultimately we decided to proxy our egress traffic from k8s, to the DNS primary servers, using static proxy addresses. Shadowsocks-libev was chosen as the SOCKS5 implementation because it is fast, secure and known to scale. In addition, it can handle both UDP/TCP and IPv4/IPv6.
The partnership of k8s and Shadowsocks combined with a large enough IP range brings many benefits:
Efficient load balancing
Primary server ACLs only need to be updated once
It allows us to make use of kubernetes for both the Zone Transferer and the Local ShadowSocks Proxy.
Shadowsocks proxy can be reused by many different Cloudflare services.
The Notify Listener requires listening on static IPs for NOTIFY Messages coming from primary DNS servers. This is mostly a solved problem through the use of k8s services of type loadbalancer, however exposing this service directly to the Internet makes us uneasy because of its susceptibility to attacks. Fortunately DDoS protection is one of Cloudflare’s strengths, which lead us to the likely solution of dogfooding one of our own products, Spectrum.
Spectrum provides the following features to our service:
Figure 3 shows two interesting attributes of the system:
Spectrum <-> k8s IPv4 only:
This is because our custom k8s load balancer currently only supports IPv4; however, Spectrum has no issue terminating the IPv6 connection and establishing a new IPv4 connection.
Spectrum <-> k8s routing decisions based of L4 protocol:
This is because k8s only supports one of TCP/UDP/SCTP per service of type load balancer. Once again, spectrum has no issues proxying this correctly.
One of the problems with using a L4 proxy in between services is that source IP addresses get changed to the source IP address of the proxy (Spectrum in this case). Not knowing the source IP address means we have no idea who sent the NOTIFY message, opening us up to attack vectors. Fortunately, Spectrum’s proxy protocol feature is capable of adding custom headers to TCP/UDP packets which contain source IP/Port information.
As we are using miekg/dns for our Notify Listener, adding proxy headers to the DNS NOTIFY messages would cause failures in validation at the DNS server level. Alternatively, we were able to implement custom read and write decorators that do the following:
Reader: Extract source address information on inbound NOTIFY messages. Place extracted information into new DNS records located in the additional section of the message.
Writer: Remove additional records from the DNS message on outbound NOTIFY replies. Generate a new reply using proxy protocol headers.
There is no way to spoof these records, because the server only permits two extra records, one of which is the optional TSIG. Any other records will be overwritten.
This custom decorator approach abstracts the proxying away from the Notify Listener through the use of the DNS protocol.
Although knowing the source IP will block a significant amount of bad traffic, since NOTIFY messages can use both UDP and TCP, it is prone to IP spoofing. To ensure that the primary servers do not get spammed, we have made the following additions to the Zone Transferer:
Always ensure that the SOA has actually been updated before initiating a zone transfer.
Only allow at most one working transfer and one scheduled transfer per zone.
Additional Technical Challenges
Zone Transferer Scheduling
As shown in figure 1, there are several ways of sending Kafka messages to the Zone Transferer in order to initiate a zone transfer. There is no benefit in having a large backlog of zone transfers for the same zone. Once a zone has been transferred, assuming no more changes, it does not need to be transferred again. This means that we should only have at most one transfer ongoing, and one scheduled transfer at the same time, for any zone.
If we want to limit our number of scheduled messages to one per zone, this involves ignoring Kafka messages that get sent to the Zone Transferer. This is not as simple as ignoring specific messages in any random order. One of the benefits of Kafka is that it holds on to messages until the user actually decides to acknowledge them, by committing that messages offset. Since Kafka is just a queue of messages, it has no concept of order other than first in first out (FIFO). If a user is capable of reading from the Kafka topic concurrently, it is entirely possible that a message in the middle of the queue be committed before a message at the end of the queue.
Most of the time this isn’t an issue, because we know that one of the concurrent readers has read the message from the end of the queue and is processing it. There is one Kubernetes-related catch to this issue, though: pods are ephemeral. The kube master doesn’t care what your concurrent reader is doing, it will kill the pod and it’s up to your application to handle it.
Consider the following problem:
Read offset 1. Start transferring zone 1.
Read offset 2. Start transferring zone 2.
Zone 2 transfer finishes. Commit offset 2, essentially also marking offset 1.
Read offset 3 Start transferring zone 3.
If these events happen, zone 1 will never be transferred. It is important that zones stay up to date with the primary servers, otherwise stale data will be served from the Secondary DNS server. The solution to this problem involves the use of a list to track which messages have been read and completely processed. In this case, when a zone transfer has finished, it does not necessarily mean that the kafka message should be immediately committed. The solution is as follows:
Keep a list of Kafka messages, sorted based on offset.
If finished transfer, remove from list:
If the message is the oldest in the list, commit the messages offset.
This solution is essentially soft committing Kafka messages, until we can confidently say that all other messages have been acknowledged. It’s important to note that this only truly works in a distributed manner if the Kafka messages are keyed by zone id, this will ensure the same zone will always be processed by the same Kafka consumer.
Life of a Secondary DNS Request
Although Cloudflare has a large global network, as shown above, the zone transferring process does not take place at each of the edge datacenter locations (which would surely overwhelm many primary servers), but rather in our core data centers. In this case, how do we propagate to our edge in seconds? After transferring the zone, there are a couple more steps that need to be taken before the change can be seen at the edge.
Zone Builder – This interacts with the Zone Transferer to build the zone according to what Cloudflare edge understands. This then writes to Quicksilver, our super fast, distributed KV store.
Authoritative Server – This reads from Quicksilver and serves the built zone.
What About Performance?
At the time of writing this post, according to dnsperf.com, Cloudflare leads in global performance for both Authoritative and Resolver DNS. Here, Secondary DNS falls under the authoritative DNS category here. Let’s break down the performance of each of the different parts of the Secondary DNS pipeline, from the primary server updating its records, to them being present at the Cloudflare edge.
Primary Server to Notify Listener – Our most accurate measurement is only precise to the second, but we know UDP/TCP communication is likely much faster than that.
NOTIFY to Zone Transferer – This is negligible
Zone Transferer to Primary Server – 99% of the time we see ~800ms as the average latency for a zone transfer.
4. Zone Transferer to Zone Builder – 99% of the time we see ~10ms to build a zone.
5. Zone Builder to Quicksilver edge: 95% of the time we see less than 1s propagation.
End to End latency: less than 5 seconds on average. Although we have several external probes running around the world to test propagation latencies, they lack precision due to their sleep intervals, location, provider and number of zones that need to run. The actual propagation latency is likely much lower than what is shown in figure 10. Each of the different colored dots is a separate data center location around the world.
An additional test was performed manually to get a real world estimate, the test had the following attributes:
Primary server: NS1 Number of records changed: 1 Start test timer event: Change record on NS1 Stop test timer event: Observe record change at Cloudflare edge using dig Recorded timer value: 6 seconds
Cloudflare serves 15.8 trillion DNS queries per month, operating within 100ms of 99% of the Internet-connected population. The goal of Cloudflare operated Secondary DNS is to allow our customers with custom DNS solutions, be it on-premise or some other DNS provider, to be able to take advantage of Cloudflare’s DNS performance and more recently, through Secondary Override, our proxying and security capabilities too. Secondary DNS is currently available on the Enterprise plan, if you’d like to take advantage of it, please let your account team know. For additional documentation on Secondary DNS, please refer to our support article.
In this blog post, we will walk you through the reliability model of services running in our more than 200 edge cities worldwide. Then, we will go over how deploying a new dynamic task scheduling system, HashiCorp Nomad, helped us improve the availability of services in each of those data centers, covering how we deployed Nomad and the challenges we overcame along the way. Finally, we will show you both how we currently use Nomad and how we are planning on using it in the future.
Reliability model of services running in each data center
For this blog post, we will distinguish between two different categories of services running in each data center:
Customer-facing services: all of our stack of products that our customers use, such as caching, WAF, DDoS protection, rate-limiting, load-balancing, etc.
Management services: software required to operate the data center, that is not in the direct request path of customer traffic.
The reliability model of our customer-facing services is to run them on all machines in each data center. This works well as it allows each data center’s capacity to scale dynamically by adding more machines.
Scaling is especially made easy thanks to our dynamic load balancing system, Unimog, which runs on each machine. Its role is to continuously re-balance traffic based on current resource usage and to check the health of services. This helps provide resiliency to individual machine failures and ensures resource usage is close to identical on all machines.
As an example, here is the CPU usage over a day in one of our data centers where each time series represents one machine and the different colors represent different generations of hardware. Unimog keeps all machines processing traffic and at roughly the same CPU utilization.
Some of our larger data centers have a substantial number of machines, but sometimes we need to reliably run just a single or a few instances of a management service in each location.
There are currently a couple of options to do this, each have their own pros and cons:
Deploying the service to all machines in the data center:
Pro: it ensures the service’s reliability
Con: it unnecessarily uses resources which could have been used to serve customer traffic and is not cost-effective
Deploying the service to a static handful of machines in each data center:
Pro: it is less wasteful of resources and more cost-effective
Con: it runs the risk of service unavailability when those handful of machines unexpectedly fail
A third, more viable option, is to use dynamic task scheduling so that only the right amount of resources are used while ensuring reliability.
A need for more dynamic task scheduling
Having to pick between two suboptimal reliability model options for management services we want running in each data center was not ideal.
Indeed, some of those services, even though they are not in the request path, are required to continue operating the data center. If the machines running those services become unavailable, in some cases we have to temporarily disable the data center while recovering them. Doing so automatically re-routes users to the next available data center and doesn’t cause disruption. In fact, the entire Cloudflare network is designed to operate with data centers being disabled and brought back automatically. But it’s optimal to route end users to a data center near them so we want to minimize any data center level downtime.
This led us to realize we needed a system to ensure a certain number of instances of a service were running in each data center, regardless of which physical machine ends up running it.
Customer-facing services run on all machines in each data center and do not need to be onboarded to that new system. On the other hand, services currently running on a fixed subset of machines with sub-optimal reliability guarantees and services which don’t need to run on all machines are good candidates for onboarding.
Our pick: HashiCorp Nomad
Armed with our set of requirements, we conducted some research on candidate solutions.
While Kubernetes was another option, we decided to use HashiCorp’s Nomad for the following reasons:
Satisfies our initial requirement, which was reliably running a single instance of a binary with resource isolation in each data center.
Has few dependencies and a straightforward integration with Consul. Consul is another piece of HashiCorp software we had already deployed in each datacenter. It provides distributed key-value storage and service discovery capabilities.
Is lightweight (single Go binary), easy to deploy and provision new clusters which is a plus when deploying as many clusters as we have data centers.
Has a modular task driver (part responsible for executing tasks and providing resource isolation) architecture to support not only containers but also binaries and any custom task driver.
Is open source and written in Go. We have Go language experience within the team, and Nomad has a responsive community of maintainers on GitHub.
Nomad is split in two different pieces:
Nomad Server: instances forming the cluster responsible for scheduling, five per data center to provide sufficient failure tolerance
Nomad Client: instances executing the actual tasks, running on all machines in every data center
To guarantee Nomad Server cluster reliability, we deployed instances on machines which are part of different failure domains:
In different inter-connected physical data centers forming a single location
In different racks, connected to different switches
In different multi-node chassis (most of our edge hardware comes in the form of multi-node chassis, one chassis contains four individual servers)
We also added logic to our configuration management tool to ensure we always keep a consistent number of Nomad Server instances regardless of the expansions and decommissions of servers happening on a close to daily basis.
The logic is rather simple, as server expansions and decommissions happen, the Nomad Server role gets redistributed to a new list of machines. Our configuration management tool then ensures that Nomad Server runs on the new machines before turning it off on the old ones.
Additionally, because server expansions and decommissions affect a subset of racks at a time and the Nomad Server role assignment logic provides rack-diversity guarantees, the cluster stays healthy as quorum is kept at all times.
Nomad job files are templated and checked into a git repository. Our configuration management tool then ensures the jobs are scheduled in every data center. From there, Nomad takes over and ensures the jobs are running at all times in each data center.
By exposing rack metadata to each Nomad Client, we are able to make sure each instance of a particular service runs in a different rack and is tied to a different failure domain. This way we make sure that the failure of one rack of servers won’t impact the service health as the service is also running in a different rack, unaffected by the failure.
We achieve this with the following job file constraint:
We leveraged Nomad integration with Consul to get Nomad jobs dynamically added to the Consul Service Catalog. This allows us to discover where a particular service is currently running in each data center by querying Consul. Additionally, with the Consul DNS Interface enabled, we can also use DNS-based lookups to target services running on Nomad.
To be able to properly operate as many Nomad clusters as we have data centers, good observability on Nomad clusters and services running on those clusters was essential.
We use Prometheus to scrape Nomad Server and Client instances running in each data center and Alertmanager to alert on key metrics. Using Prometheus metrics, we built a Grafana dashboard to provide visibility on each cluster.
We set up our Prometheus instances to discover services running on Nomad by querying the Consul Service Directory and scraping their metrics periodically using the following Prometheus configuration:
We then use those metrics to create Grafana dashboards and set up alerts for services running on Nomad.
To restrict access to Nomad API endpoints, we enabled mutual TLS authentication and are generating client certificates for each entity interacting with Nomad. This way, only entities with a valid client certificate can interact with Nomad API endpoints in order to schedule jobs or perform any CLI operation.
Deploying a new component always comes with its set of challenges; here is a list of a few hurdles we have had to overcome along the way.
Ramdisk rootfs and pivot_root
When starting to use the exec driver to run binaries isolated in a chroot environment, we noticed our stateless root partition running on ramdisk was not supported as the task would not start and we got this error message in our logs:
Feb 12 19:49:03 machine nomad-client: 2020-02-12T19:49:03.332Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=fa202-63b-33f-924-42cbd5 task=server error="failed to launch command with executor: rpc error: code = Unknown desc = container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:109: jailing process inside rootfs caused \\\"pivot_root invalid argument\\\"\"""
We filed a GitHub issue and submitted a workaround pull request which was promptly reviewed and merged upstream.
In parallel, to properly fix it, other team members patched our Linux image to support pivot_root and sent the patch upstream for review on the kernel mailing list.
Resource usage containment
One very important aspect was to make sure the resource usage of tasks running on Nomad would not disrupt other services colocated on the same machine.
Disk space is a shared resource on every machine and being able to set a quota for Nomad was a must. We achieved this by isolating the Nomad data directory to a dedicated fixed-size mount point on each machine. Limiting disk bandwidth and IOPS, however, is not currently supported out of the box by Nomad.
Nomad job files have a resources section where memory and CPU usage can be limited (memory is in MB, cpu is in MHz):
memory = 2000
cpu = 500
This uses cgroups under the hood and our testing showed that while memory limits are enforced as one would expect, the CPU limits are soft limits and not enforced as long as there is available CPU on the host machine.
As mentioned above, all machines currently run the same customer-facing workload. Scheduling individual jobs dynamically with Nomad to run on single machines challenges that assumption.
While our dynamic load balancing system, Unimog, balances requests based on resource usage to ensure it is close to identical on all machines, batch type jobs with spiky resource usage can pose a challenge.
We will be paying attention to this as we onboard more services and:
attempt to limit resource usage spikiness of Nomad jobs with constraints aforementioned
ensure Unimog adjusts to this batch type workload and does not end up in a positive feedback loop
What we are running on Nomad
Now Nomad has been deployed in every data center, we are able to improve the reliability of management services essential to operations by gradually onboarding them. We took a first step by onboarding our reboot and maintenance management service.
Reboot and maintenance management service
In each data center, we run a service which facilitates online unattended rolling reboots and maintenance of machines. This service used to run on a single well-known machine in each data center. This made it vulnerable to single machine failures and when down prevented machines from enabling automatically after a reboot. Therefore, it was a great first service to be onboarded to Nomad to improve its reliability.
We now have a guarantee this service is always running in each data center regardless of individual machine failures. Instead of other machines relying on a well-known address to target this service, they now query Consul DNS and dynamically figure out where the service is running to interact with it.
This is a big improvement in terms of reliability for this service, therefore many more management services are expected to follow in the upcoming months and we are very excited for this to happen.
Keeping Customers Streaming — The Centralized Site Reliability Practice at Netflix
By Hank Jacobs, Senior Site Reliability Engineer on CORE
We’re privileged to be in the business of bringing joy to our customers at Netflix. Whether it’s a compelling new series or an innovative product feature, we strive to provide a best-in-class service that people love and can enjoy anytime, anywhere. A key underpinning to keeping our customers happy and streaming is a strong focus on reliability.
Reliability, formally speaking, is the ability of a system to function under stated conditions for a period of time. Put simply, reliability means a system should work and continue working. From failure injection testing to regularly exercising our region evacuation abilities, Netflix engineers invest a lot in ensuring the services that comprise Netflix are robust and reliable. Many teams contribute to the reliability of Netflix and own the reliability of their service or area of expertise. The Critical Operations and Reliability Engineering team at Netflix (CORE) is responsible for the reliability of the Netflix service as a whole.
CORE is a team consisting of Site Reliability Engineers, Applied Resilience Engineers, and Performance Engineers. Our group is responsible for the reliability of business-critical operations. Unlike most SRE teams, we do not own or operate any customer-serving services nor do we routinely make production code changes, build infrastructure, or embed on service teams. Our primary focus is ensuring Netflix stays up. Practically speaking, this includes activities such as systemic risk identification, handling the lifecycle of an incident, and reliability consulting.
Teams at Netflix follow the service ownership model: they operate what they build. Most of the time, service owners catch issues before they impact customers. Things still do go sideways and incidents happen that impact the customer experience. This is where the CORE team steps in: CORE configures, maintains, and responds to alerts that monitor high-level business KPIs (e.g. stream starts per second). When one of those alerts fires, the CORE on-call engineer assesses the situation to determine the scope of impact, identify involved services, and engage service owners to assist with mitigation. From there, CORE begins to manage the incident.
Incident management at Netflix doesn’t follow common management practices like the ITIL model. In an incident, the CORE on-call engineer generally operates as the Incident Manager. The Incident Manager is responsible for performing or delegating activities such as:
Coordination — bringing in relevant service owners to help with the investigation and focus on mitigation
Decision Making — making key choices to facilitate the mitigation and remediation of customer impact (e.g. deciding if we should evacuate a region)
Scribe — keeping track of incident details such as involved teams, mitigation efforts, graphs of the current impact, etc.
Technical Sleuthing — assisting the responding service owners with understanding what systems are contributing to the incident
Liaison — communicating information about the incident across business functions with both internal and external teams as necessary
Once the customer impact is successfully mitigated, CORE is then responsible for coordinating the post-incident analysis. Analysis comes in many shapes and sizes depending on the impact and uniqueness of the incident, but most incidents go through what we call “memorialization”. This process includes a write-up of what happened, what mitigations took place, and what follow-up work was discussed. For particularly unique, interesting, or impactful incidents, CORE may host an Incident Review or engage in a deeper, long-form investigation. Most post-incident analysis, especially for impactful incidents, is done in partnership with one of CORE’s Applied Resilience Engineers. A key point to emphasize is that all incident analysis work focuses on the sociotechnical aspects of an incident. Consequently, post-incident analysis tends to uncover many practical learnings and improvements for all involved. We frequently socialize these findings outside of those directly involved to help share learnings across the company.
So what happens when a CORE engineer is not on-call or doing incident analysis? Unsurprisingly, the response varies widely based on the skillset and interests of the individual team member. In broad strokes, examples include:
Preserving operational visibility and response capabilities — fixing and improving our dashboards, alerts, and automation
Reliability consulting — discussing various aspects including architectural decisions, systemic observability, application performance, and on-call health training
Systematic risk identification and mitigation — partner with various teams to identify and fix systematic risks revealed by incidents
Internal tooling — build and maintain tools that support and augment our incident response capabilities
Learning and re-learning the changes to a complex, ever-moving system
Building and maintaining relationships with other teams
Overall, we’ve found that this form of reliability work best suits the needs and goals of Netflix. Reliability being CORE’s primary focus affords us the bandwidth to both proactively explore potential business-critical risks as well as effectively respond to those risks. Additionally, having a broad view of the system allows us to spot systematic risks as they develop. By being a separate and central team, we can more efficiently share learnings across the larger engineering organization and more easily consult with teams on an ad hoc basis. Ultimately, CORE’s singular focus on reliability empowers us to reveal business-critical sociotechnical risks, facilitate effective responses to those risks and ensure Netflix continues to bring joy to our customers.
Open Sourcing Mantis: A Platform For Building Cost-Effective, Realtime, Operations-Focused Applications
By Jeff Chao on behalf of the Mantis team
Today we’re excited to announce that we’re open sourcing Mantis, a platform that helps Netflix engineers better understand the behavior of their applications to ensure the highest quality experience for our members. We believe the challenges we face here at Netflix are not necessarily unique to Netflix which is why we’re sharing it with the broader community.
As a streaming microservices ecosystem, the Mantis platform provides engineers with capabilities to minimize the costs of observing and operating complex distributed systems without compromising on operational insights. Engineers have built cost-efficient applications on top of Mantis to quickly identify issues, trigger alerts, and apply remediations to minimize or completely avoid downtime to the Netflix service. Where other systems may take over ten minutes to process metrics accurately, Mantis reduces that from tens of minutes down to seconds, effectively reducing our Mean-Time-To-Detect. This is crucial because any amount of downtime is brutal and comes with an incredibly high impact to our members — every second counts during an outage.
As the company continues to grow our member base, and as those members use the Netflix service even more, having cost-efficient, rapid, and precise insights into the operational health of our systems is only growing in importance. For example, a five-minute outage today is equivalent to a two-hour outage at the time of our last Mantis blog post.
Mantis Makes It Easy to Answer New Questions
The traditional way of working with metrics and logs alone is not sufficient for large-scale and growing systems. Metrics and logs require that you to know what you want to answer ahead of time. Mantis on the other hand allows us to sidestep this drawback completely by giving us the ability to answer new questions without having to add new instrumentation. Instead of logs or metrics, Mantis enables a democratization of events where developers can tap into an event stream from any instrumented application on demand. By making consumption on-demand, you’re able to freely publish all of your data to Mantis.
Mantis is Cost-Effective in Answering Questions
Publishing 100% of your operational data so that you’re able to answer new questions in the future is traditionally cost prohibitive at scale. Mantis uses an on-demand, reactive model where you don’t pay the cost for these events until something is subscribed to their stream. To further reduce cost, Mantis reissues the same data for equivalent subscribers. In this way, Mantis is differentiated from other systems by allowing us to achieve streaming-based observability on events while empowering engineers with the tooling to reduce costs that would otherwise become detrimental to the business.
From the beginning, we’ve built Mantis with this exact guiding principle in mind: Let’s make sure we minimize the costs of observing and operating our systems without compromising on required and opportunistic insights.
Guiding Principles Behind Building Mantis
The following are the guiding principles behind building Mantis.
We should have access to raw events. Applications that publish events into Mantis should be free to publish every single event. If we prematurely transform events at this stage, then we’re already at a disadvantage when it comes to getting insight since data in its original form is already lost.
We should be able to access these events in realtime. Operational use cases are inherently time sensitive by nature. The traditional method of publishing, storing, and then aggregating events in batch is too slow. Instead, we should process and serve events one at a time as they arrive. This becomes increasingly important with scale as the impact becomes much larger in far less time.
We should be able to ask new questions of this data without having to add new instrumentation to your applications. It’s not possible to know ahead of time every single possible failure mode our systems might encounter despite all the rigor built in to make these systems resilient. When these failures do inevitably occur, it’s important that we can derive new insights with this data. You should be able to publish as large of an event with as much context as you want. That way, when you think of a new questions to ask of your systems in the future, the data will be available for you to answer those questions.
We should be able to do all of the above in a cost-effective way. As our business critical systems scale, we need to make sure the systems in support of these business critical systems don’t end up costing more than the business critical systems themselves.
With these guiding principles in mind, let’s take a look at how Mantis brings value to Netflix.
How Mantis Brings Value to Netflix
Mantis has been in production for over four years. Over this period several critical operational insight applications have been built on top of the Mantis platform.
A few noteworthy examples include:
Realtime monitoring of Netflix streaming health which examines all of Netflix’s streaming video traffic in realtime and accurately identifies negative impact on the viewing experience with fine-grained granularity. This system serves as an early warning indicator of the overall health of the Netflix service and will trigger and alert relevant teams within seconds.
Contextual Alerting which analyzes millions of interactions between dozens of Netflix microservices in realtime to identify anomalies and provide operators with rich and relevant context. The realtime nature of these Mantis-backed aggregations allows the Mean-Time-To-Detect to be cut down from tens of minutes to a few seconds. Given the scale of Netflix this makes a huge impact.
Raven which allows users to perform ad-hoc exploration of realtime data from hundreds of streaming sources using our Mantis Query Language (MQL).
Cassandra Health check which analyzes rich operational events in realtime to generate a holistic picture of the health of every Cassandra cluster at Netflix.
Alerting on Log data which detects application errors by processing data from thousands of Netflix servers in realtime.
Chaos Experimentation monitoring which tracks user experience during a Chaos exercise in realtime and triggers an abort of the chaos exercise in case of an adverse impact.
Realtime Personally Identifiable Information (PII) data detection samples data across all streaming sources to quickly identify transmission of sensitive data.
A lot of work has gone into making Mantis successful at Netflix. We’d like to thank all the contributors, in alphabetical order by first name, that have been involved with Mantis at various points of its existence:
Andrei Ushakov, Ben Christensen, Ben Schmaus, Chris Carey, Cody Rioux, Daniel Jacobson, Danny Yuan, Erik Meijer, Indrajit Roy Choudhury, Jeff Chao, Josh Evans, Justin Becker, Kathrin Probst, Kevin Lew, Neeraj Joshi, Nick Mahilani, Piyush Goyal, Prashanth Ramdas, Ram Vaithalingam, Ranjit Mavinkurve, Sangeeta Narayanan, Santosh Kalidindi, Seth Katz, Sharma Podila, Zhenzhong Xu.
Trust on the Internet is underpinned by the Public Key Infrastructure (PKI). PKI grants servers the ability to securely serve websites by issuing digital certificates, providing the foundation for encrypted and authentic communication.
Certificates make HTTPS encryption possible by using the public key in the certificate to verify server identity. HTTPS is especially important for websites that transmit sensitive data, such as banking credentials or private messages. Thankfully, modern browsers, such as Google Chrome, flag websites not secured using HTTPS by marking them “Not secure,” allowing users to be more security conscious of the websites they visit.
Now that we know what certificates are used for, let’s talk about where they come from.
Certificate Authorities (CAs) are the institutions responsible for issuing certificates.
When issuing a certificate for any given domain, they use Domain Control Validation (DCV) to verify that the entity requesting a certificate for the domain is the legitimate owner of the domain. With DCV the domain owner:
creates a DNS resource record for a domain;
uploads a document to the web server located at that domain; OR
proves ownership of the domain’s administrative email account.
The DCV process prevents adversaries from obtaining private-key and certificate pairs for domains not owned by the requestor.
Preventing adversaries from acquiring this pair is critical: if an incorrectly issued certificate and private-key pair wind up in an adversary’s hands, they could pose as the victim’s domain and serve sensitive HTTPS traffic. This violates our existing trust of the Internet, and compromises private data on a potentially massive scale.
For example, an adversary that tricks a CA into mis-issuing a certificate for gmail.com could then perform TLS handshakes while pretending to be Google, and exfiltrate cookies and login information to gain access to the victim’s Gmail account. The risks of certificate mis-issuance are clearly severe.
Domain Control Validation
To prevent attacks like this, CAs only issue a certificate after performing DCV. One way of validating domain ownership is through HTTP validation, done by uploading a text file to a specific HTTP endpoint on the webserver they want to secure. Another DCV method is done using email verification, where an email with a validation code link is sent to the administrative contact for the domain.
Suppose Alice buys the domain name aliceswonderland.com and wants to get a dedicated certificate for this domain. Alice chooses to use Let’s Encrypt as their certificate authority. First, Alice must generate their own private key and create a certificate signing request (CSR). She sends the CSR to Let’s Encrypt, but the CA won’t issue a certificate for that CSR and private key until they know Alice owns aliceswonderland.com. Alice can then choose to prove that she owns this domain through HTTP validation.
When Let’s Encrypt performs DCV over HTTP, they require Alice to place a randomly named file in the /.well-known/acme-challenge path for her website. The CA must retrieve the text file by sending an HTTP GET request to http://aliceswonderland.com/.well-known/acme-challenge/<random_filename>. An expected value must be present on this endpoint for DCV to succeed.
For HTTP validation, Alice would upload a file to http://aliceswonderland.com/.well-known/acme-challenge/YnV0dHNz
where the body contains:
HTTP/1.1 200 OK
The CA instructs them to use the Base64 token YnV0dHNz. TEST_CLIENT_KEY in an account-linked key that only the certificate requestor and the CA know. The CA uses this field combination to verify that the certificate requestor actually owns the domain. Afterwards, Alice can get her certificate for her website!
Another way users can validate domain ownership is to add a DNS TXT record containing a verification string or token from the CA to their domain’s resource records. For example, here’s a domain for an enterprise validating itself towards Google:
$ dig TXT aliceswonderland.com
aliceswonderland.com. 28 IN TXT "google-site-verification=COanvvo4CIfihirYW6C0jGMUt2zogbE_lC6YBsfvV-U"
Here, Alice chooses to create a TXT DNS resource record with a specific token value. A Google CA can verify the presence of this token to validate that Alice actually owns her website.
Types of BGP Hijacking Attacks
Certificate issuance is required for servers to securely communicate with clients. This is why it’s so important that the process responsible for issuing certificates is also secure. Unfortunately, this is not always the case.
Researchers at Princeton University recently discovered that common DCV methods are vulnerable to attacks executed by network-level adversaries. If Border Gateway Protocol (BGP) is the “postal service” of the Internet responsible for delivering data through the most efficient routes, then Autonomous Systems (AS) are individual post office branches that represent an Internet network run by a single organization. Sometimes network-level adversaries advertise false routes over BGP to steal traffic, especially if that traffic contains something important, like a domain’s certificate.
Bamboozling Certificate Authorities with BGP highlights five types of attacks that can be orchestrated during the DCV process to obtain a certificate for a domain the adversary does not own. After implementing these attacks, the authors were able to (ethically) obtain certificates for domains they did not own from the top five CAs: Let’s Encrypt, GoDaddy, Comodo, Symantec, and GlobalSign. But how did they do it?
Attacking the Domain Control Validation Process
There are two main approaches to attacking the DCV process with BGP hijacking:
These attacks create a vulnerability when an adversary sends a certificate signing request for a victim’s domain to a CA. When the CA verifies the network resources using an HTTP GET request (as discussed earlier), the adversary then uses BGP attacks to hijack traffic to the victim’s domain in a way that the CA’s request is rerouted to the adversary and not the domain owner. To understand how these attacks are conducted, we first need to do a little bit of math.
Every device on the Internet uses an IP (Internet Protocol) address as a numerical identifier. IPv4 addresses contain 32 bits and follow a slash notation to indicate the size of the prefix. So, in the network address 126.96.36.199/24, “/24” refers to how many bits the network contains. This means that there are 8 bits left that contain the host addresses, for a total of 256 host addresses. The smaller the prefix number, the more host addresses remain in the network. With this knowledge, let’s jump into the attacks!
Attack one: Sub-Prefix Attack
When BGP announces a route, the router always prefers to follow the more specific route. So if 188.8.131.52/8 and 184.108.40.206/24 are advertised, the router will use the latter as it is the more specific prefix. This becomes a problem when an adversary makes a BGP announcement to a specific IP address while using the victim’s domain IP address. Let’s say the IP address for our victim, leagueofentropy.com, is 220.127.116.11/8. If an adversary announces the prefix 18.104.22.168/24, then they will capture the victim’s traffic, launching a sub-prefix hijack attack.
For example, in an attack during April 2018, routes were announced with the more specific /24 vs. the existing /23. In the diagram below, /23 is Texas and /24 is the more specific Austin, Texas. The new (but nefarious) routes overrode the existing routes for portions of the Internet. The attacker then ran a nefarious DNS server on the normal IP addresses with DNS records pointing at some new nefarious web server instead of the existing server. This attracted the traffic destined for the victim’s domain within the area the nefarious routes were being propagated. The reason this attack was successful was because a more specific prefix is always preferred by the receiving routers.
Attack two: Equally-Specific-Prefix Attack
In the last attack, the adversary was able to hijack traffic by offering a more specific announcement, but what if the victim’s prefix is /24 and a sub-prefix attack is not viable? In this case, an attacker would launch an equally-specific-prefix hijack, where the attacker announces the same prefix as the victim. This means that the AS chooses the preferred route between the victim and the adversary’s announcements based on properties like path length. This attack only ever intercepts a portion of the traffic.
There are more advanced attacks that are covered in more depth in the paper. They are fundamentally similar attacks but are more stealthy.
Once an attacker has successfully obtained a bogus certificate for a domain that they do not own, they can perform a convincing attack where they pose as the victim’s domain and are able to decrypt and intercept the victim’s TLS traffic. The ability to decrypt the TLS traffic allows the adversary to completely Monster-in-the-Middle (MITM) encrypted TLS traffic and reroute Internet traffic destined for the victim’s domain to the adversary. To increase the stealthiness of the attack, the adversary will continue to forward traffic through the victim’s domain to perform the attack in an undetected manner.
Another way an adversary can gain control of a domain is by spoofing DNS traffic by using a source IP address that belongs to a DNS nameserver. Because anyone can modify their packets’ outbound IP addresses, an adversary can fake the IP address of any DNS nameserver involved in resolving the victim’s domain, and impersonate a nameserver when responding to a CA.
This attack is more sophisticated than simply spamming a CA with falsified DNS responses. Because each DNS query has its own randomized query identifiers and source port, a fake DNS response must match the DNS query’s identifiers to be convincing. Because these query identifiers are random, making a spoofed response with the correct identifiers is extremely difficult.
Adversaries can fragment User Datagram Protocol (UDP) DNS packets so that identifying DNS response information (like the random DNS query identifier) is delivered in one packet, while the actual answer section follows in another packet. This way, the adversary spoofs the DNS response to a legitimate DNS query.
Say an adversary wants to get a mis-issued certificate for victim.com by forcing packet fragmentation and spoofing DNS validation. The adversary sends a DNS nameserver for victim.com a DNS packet with a small Maximum Transmission Unit, or maximum byte size. This gets the nameserver to start fragmenting DNS responses. When the CA sends a DNS query to a nameserver for victim.com asking for victim.com’s TXT records, the nameserver will fragment the response into the two packets described above: the first contains the query ID and source port, which the adversary cannot spoof, and the second one contains the answer section, which the adversary can spoof. The adversary can continually send a spoofed answer to the CA throughout the DNS validation process, in the hopes of sliding their spoofed answer in before the CA receives the real answer from the nameserver.
In doing so, the answer section of a DNS response (the important part!) can be falsified, and an adversary can trick a CA into mis-issuing a certificate.
At first glance, one could think a Certificate Transparency log could expose a mis-issued certificate and allow a CA to quickly revoke it. CT logs, however, can take up to 24 hours to include newly issued certificates, and certificate revocation can be inconsistently followed among different browsers. We need a solution that allows CAs to proactively prevent this attacks, not retroactively address them.
We’re excited to announce that Cloudflare provides CAs a free API to leverage our global network to perform DCV from multiple vantage points around the world. This API bolsters the DCV process against BGP hijacking and off-path DNS attacks.
Given that Cloudflare runs 175+ datacenters around the world, we are in a unique position to perform DCV from multiple vantage points. Each datacenter has a unique path to DNS nameservers or HTTP endpoints, which means that successful hijacking of a BGP route can only affect a subset of DCV requests, further hampering BGP hijacks. And since we use RPKI, we actually sign and verify BGP routes.
This DCV checker additionally protects CAs against off-path, DNS spoofing attacks. An additional feature that we built into the service that helps protect against off-path attackers is DNS query source IP randomization. By making the source IP unpredictable to the attacker, it becomes more challenging to spoof the second fragment of the forged DNS response to the DCV validation agent.
By comparing multiple DCV results collected over multiple paths, our DCV API makes it virtually impossible for an adversary to mislead a CA into thinking they own a domain when they actually don’t. CAs can use our tool to ensure that they only issue certificates to rightful domain owners.
Our multipath DCV checker consists of two services:
DCV agents responsible for performing DCV out of a specific datacenter, and
a DCV orchestrator that handles multipath DCV requests from CAs and dispatches them to a subset of DCV agents.
When a CA wants to ensure that DCV occurred without being intercepted, it can send a request to our API specifying the type of DCV to perform and its parameters.
The DCV orchestrator then forwards each request to a random subset of over 20 DCV agents in different datacenters. Each DCV agent performs the DCV request and forwards the result to the DCV orchestrator, which aggregates what each agent observed and returns it to the CA.
This approach can also be generalized to performing multipath queries over DNS records, like Certificate Authority Authorization (CAA) records. CAA records authorize CAs to issue certificates for a domain, so spoofing them to trick unauthorized CAs into issuing certificates is another attack vector that multipath observation prevents.
As we were developing our multipath checker, we were in contact with the Princeton research group that introduced the proof-of-concept (PoC) of certificate mis-issuance through BGP hijacking attacks. Prateek Mittal, coauthor of the Bamboozling Certificate Authorities with BGP paper, wrote:
“Our analysis shows that domain validation from multiple vantage points significantly mitigates the impact of localized BGP attacks. We recommend that all certificate authorities adopt this approach to enhance web security. A particularly attractive feature of Cloudflare’s implementation of this defense is that Cloudflare has access to a vast number of vantage points on the Internet, which significantly enhances the robustness of domain control validation.”
Our DCV checker follows our belief that trust on the Internet must be distributed, and vetted through third-party analysis (like that provided by Cloudflare) to ensure consistency and security. This tool joins our pre-existing Certificate Transparency monitor as a set of services CAs are welcome to use in improving the accountability of certificate issuance.
An Opportunity to Dogfood
Building our multipath DCV checker also allowed us to dogfood multiple Cloudflare products.
The DCV orchestrator as a simple fetcher and aggregator was a fantastic candidate for Cloudflare Workers. We implemented the orchestrator in TypeScript using this post as a guide, and created a typed, reliable orchestrator service that was easy to deploy and iterate on. Hooray that we don’t have to maintain our own dcv-orchestrator server!
We use Argo Tunnel to allow Cloudflare Workers to contact DCV agents. Argo Tunnel allows us to easily and securely expose our DCV agents to the Workers environment. Since Cloudflare has approximately 175 datacenters running DCV agents, we expose many services through Argo Tunnel, and have had the opportunity to load test Argo Tunnel as a power user with a wide variety of origins. Argo Tunnel readily handled this influx of new origins!
Getting Access to the Multipath DCV Checker
If you and/or your organization are interested in trying our DCV checker, email [email protected] and let us know! We’d love to hear more about how multipath querying and validation bolsters the security of your certificate issuance.
As a new class of BGP and IP spoofing attacks threaten to undermine PKI fundamentals, it’s important that website owners advocate for multipath validation when they are issued certificates. We encourage all CAs to use multipath validation, whether it is Cloudflare’s or their own. Jacob Hoffman-Andrews, Tech Lead, Let’s Encrypt, wrote:
“BGP hijacking is one of the big challenges the web PKI still needs to solve, and we think multipath validation can be part of the solution. We’re testing out our own implementation and we encourage other CAs to pursue multipath as well”
Hopefully in the future, website owners will look at multipath validation support when selecting a CA.
Yesterday, we celebrated the fifth anniversary of Project Galileo. More than 550 websites are part of this program, and they have something in common: each and every one of them has been subject to attacks in the last month. In this blog post, we will look at the security events we observed between the 23 April 2019 and 23 May 2019.
Project Galileo sites are protected by the Cloudflare Firewall and Advanced DDoS Protection which contain a number of features that can be used to detect and mitigate different types of attack and suspicious traffic. The following table shows how each of these features contributed to the protection of sites on Project Galileo.
Although not the most impressive in terms of blocked requests, the WAF is the most interesting as it identifies and blocks malicious requests, based on heuristics and rules that are the result of seeing attacks across all of our customers and learning from those. The WAF is available to all of our paying customers, protecting them against 0-days, SQL/XSS exploits and more. For the Project Galileo customers the WAF rules blocked more than 4.5 million requests in the month that we looked at, matching over 130 WAF rules and approximately 150k requests per day.
This heat map may initially appear confusing but reading one is easy once you know what to expect so bear with us! It is a table where each line is a website on Project Galileo and each column is a day. The color represents the number of requests triggering WAF rules – on a scale from 0 (white) to a lot (dark red). The darker the cell, the more requests were blocked on this day.
We observe malicious traffic on a daily basis for most websites we protect. The average Project Galileo site saw malicious traffic for 27 days in the 1 month observed, and for almost 60% of the sites we noticed daily events.
Fortunately, the vast majority of websites only receive a few malicious requests per day, likely from automated scanners. In some cases, we notice a net increase in attacks against some websites – and a few websites are under a constant influx of attacks.
This heat map shows the WAF rules that blocked requests by day. At first, it seems some rules are useless as they never match malicious requests, but this plot makes it obvious that some attack vectors become active all of a sudden (isolated dark cells). This is especially true for 0-days, malicious traffic starts once an exploit is published and is very active on the first few days. The dark active lines are the most common malicious requests, and these WAF rules protect against things like XSS and SQL injection attacks.
DoS (Denial of Service)
A DoS attack prevents legitimate visitors from accessing a website by flooding it with bad traffic. Due to the way Cloudflare works, websites protected by Cloudflare are immune to many DoS vectors, out of the box. We block layer 3 and 4 attacks, which includes SYN floods and UDP amplifications. DNS nameservers, often described as the Internet’s phone book, are fully managed by Cloudflare, and protected – visitors know how to reach the websites.
Can you spot the attack?
As for layer 7 attacks (for instance, HTTP floods), we rely on Gatebot, an automated tool to detect, analyse and block DoS attacks, so you can sleep. The graph shows the requests per second we received on a zone, and whether or not it reached the origin server. As you can see, the bad traffic was identified automatically by Gatebot, and more than 1.6 million requests were blocked as a result.
For websites with specific requirements we provide tools to allow customers to block traffic to precisely fit their needs. Customers can easily implement complex logic using Firewall Rules to filter out specific chunks of traffic, block IPs / Networks / Countries using Access Rules and Project Galileo sites have done just that. Let’s see a few examples.
Firewall Rules allows website owners to challenge or block as much or as little traffic as they desire, and this can be done as a surgical tool “block just this request” or as a general tool “challenge every request”.
For instance, a well-known website used Firewall Rules to prevent twenty IPs from fetching specific pages. 3 of these IPs were then used to send a total of 4.5 million requests over a short period of time, and the following chart shows the requests seen for this website. When this happened Cloudflare, mitigated the traffic ensuring that the website remains available.
Another website, built with WordPress, is using Cloudflare to cache their webpages. As POST requests are not cacheable, they always hit the origin machine and increase load on the origin server – that’s why this website is using firewall rules to block POST requests, except on their administration backend. Smart!
Website owners can also deny or challenge requests based on the visitor’s IP address, Autonomous System Number (ASN) or Country. Dubbed Access Rules, it is enforced on all pages of a website – hassle-free.
For example, a news website is using Cloudflare’s Access Rules to challenge visitors from countries outside of their geographic region who are accessing their website. We enforce the rules globally even for cached resources, and take care of GeoIP database updates for them, so they don’t have to.
The Zone Lockdown utility restricts a specific URL to specific IP addresses. This is useful to protect an internal but public path being accessed by external IP addresses. A non-profit based in the United Kingdom is using Zone Lockdown to restrict access to their WordPress’ admin panel and login page, hardening their website without relying on non official plugins. Although it does not prevent very sophisticated attacks, it shields them against automated attacks and phishing attempts – as even if their credentials are stolen, they can’t be used as easily.
Cloudflare acts as a CDN, caching resources and happily serving them, reducing bandwidth used by the origin server … and indirectly the costs. Unfortunately, not all requests can be cached and some requests are very expensive to handle. Malicious users may abuse this to increase load on the server, and website owners can rely on our Rate Limit to help them: they define thresholds, expressed in requests over a time span, and we make sure to enforce this threshold. A non-profit fighting against poverty relies on rate limits to protect their donation page, and we are glad to help!
Last but not least, one of Cloudflare’s greatest assets is our threat intelligence. With such a wide lens of the threat landscape, Cloudflare uses our Firewall data, combined with machine learning to curate our IP Reputation databases. This data is provided to all Cloudflare customers, and is configured through our Security Level feature. Customers then may define their threshold sensitivity, ranging from Essentially Off to I’m Under Attack. For every incoming request, we ask visitors to complete a challenge if the score is above a customer defined threshold. This system alone is responsible for 25% of the requests we mitigated: it’s extremely easy to use, and it constantly learns from the other protections.
When taken together, the Cloudflare Firewall features provide our Project Galileo customers comprehensive and effective security that enables them to ensure their important work is available. The majority of security events were handled automatically, and this is our strength – security that is always on, always available, always learning.
If you want to start an intense conversation in the halls of Cloudflare, try describing us as a “CDN”. CDNs don’t generally provide you with Load Balancing, they don’t allow you to deploy Serverless Applications, and they certainly don’t get installed onto your phone. One of the costs of that confusion is many people don’t realize everything Cloudflare can do for people who want to operate in multiple public clouds, or want to operate in both the cloud and on their own hardware.
Cloudflare has countless servers located in 180 data centers around the world. Each one is capable of acting as a Layer 7 load balancer, directing incoming traffic between origins wherever they may be. You could, for example, add load balancing between a set of machines you have in AWS’ EC2, and another set you keep in Google Cloud.
This load balancing isn’t just round-robining traffic. It supports weighting to allow you to control how much traffic goes to each cluster. It supports latency-based routing to automatically route traffic to the cluster which is closer (so adding geographic distribution can be as simple as spinning up machines). It even supports health checks, allowing it to automatically direct traffic to the cloud which is currently healthy.
Most importantly, it doesn’t run in any of the provider’s clouds and isn’t dependent on them to function properly. Even better, since the load balancing runs near virtually every Internet user around the world it doesn’t come at any performance cost. (Using our Argo technology performance often gets better!).
One of the hardest components to managing a multi-cloud deployment is networking. Each provider has their own method of defining networks and firewalls, and even tools which can deploy clusters across multiple clouds often can’t quite manage to get the networking configuration to work in the same way. The task of setting it up can often be a trial-and-error game where the final config is best never touched again, leaving ‘going multi-cloud’ as a failed experiment within organizations.
At Cloudflare we have a technology called Argo Tunnel which flips networking on its head. Rather than opening ports and directing incoming traffic, each of your virtual machines (or k8s pods) makes outbound tunnels to the nearest Cloudflare PoPs. All of your Internet traffic then flows over those tunnels. You keep all your ports closed to inbound traffic, and never have to think about Internet networking again.
What’s so powerful about this configuration is is makes it trivial to spin up machines in new locations. Want a dozen machines in Australia? As long as they start the Argo Tunnel daemon they will start receiving traffic. Don’t need them any more? Shut them down and the traffic will be routed elsewhere. And, of course, none of this relies on any one public cloud provider, making it reliable even if they should have issues.
Argo Tunnel makes it trivial to add machines in new clouds, or to keep machines on-prem even as you start shifting workloads into the Cloud.
One thing you’ll realize about using Argo Tunnel is you now have secure tunnels which connect your infrastructure with Cloudflare’s network. Once traffic reaches that network, it doesn’t necessarily have to flow directly to your machines. It could, for example, have access control applied where we use your Identity Provider (like Okta or Active Directory) to decide who should be able to access what. Rather than wrestling with VPCs and VPN configurations, you can move to a zero-trust model where you use policies to decide exactly who can access what on a per-request basis.
In fact, you can now do this with SSH as well! You can manage all your user accounts in a single place and control with precision who can access which piece of infrastructure, irrespective of which cloud it’s in.
No computer system is perfect, and ours is no exception. We make mistakes, have bugs in our code, and deal with the pain of operating at the scale of the Internet every day. One great innovation in the recent history of computers, however, is the idea that it is possible to build a reliable system on top of many individually unreliable components.
Each of Cloudflare’s PoPs is designed to function without communication or support from others, or from a central data center. That alone greatly increases our tolerance for network partitions and moves us from maintaining a single system to be closer to maintaining 180 independent clouds, any of which can serve all traffic.
We are also a system built on anycast which allows us to tap into the fundamental reliability of the Internet. The Internet uses a protocol called BGP which asks each system who would like to receive traffic for a particular IP address to ‘advertise’ it. Each router then will decide to forward traffic based on which person advertising an address is the closest. We advertise all of our IP addresses in every one of our data centers. If a data centers goes down, it stops advertising BGP routes, and the very same packets which would have been destined for it arrive in another PoP seamlessly.
Ultimately we are trying to help build a better Internet. We don’t believe that Internet is built on the back of a single provider. Many of the services provided by these cloud providers are simply too complex to be as reliable as the Internet demands.
True reliability and cost control both require existing on multiple clouds. It is clear that the tools which the Internet of the 80s and 90s gave us may be insufficient to move into that future. With a smarter network we can do more, better.
Today I’m excited to announce built-in authentication support in Application Load Balancers (ALB). ALB can now securely authenticate users as they access applications, letting developers eliminate the code they have to write to support authentication and offload the responsibility of authentication from the backend. The team built a great live example where you can try out the authentication functionality.
Identity-based security is a crucial component of modern applications and as customers continue to move mission critical applications into the cloud, developers are asked to write the same authentication code again and again. Enterprises want to use their on-premises identities with their cloud applications. Web developers want to use federated identities from social networks to allow their users to sign-in. ALB’s new authentication action provides authentication through social Identity Providers (IdP) like Google, Facebook, and Amazon through Amazon Cognito. It also natively integrates with any OpenID Connect protocol compliant IdP, providing secure authentication and a single sign-on experience across your applications.
How Does ALB Authentication Work?
Authentication is a complicated topic and our readers may have differing levels of expertise with it. I want to cover a few key concepts to make sure we’re all on the same page. If you’re already an authentication expert and you just want to see how ALB authentication works feel free to skip to the next section!
Authentication verifies identity.
Authorization verifies permissions, the things an identity is allowed to do.
OpenID Connect (OIDC) is a simple identity, or authentication, layer built on top on top of the OAuth 2.0 protocol. The OIDC specification document is pretty well written and worth a casual read.
Identity Providers (IdPs) manage identity information and provide authentication services. ALB supports any OIDC compliant IdP and you can use a service like Amazon Cognito or Auth0 to aggregate different identities from various IdPs like Active Directory, LDAP, Google, Facebook, Amazon, or others deployed in AWS or on premises.
When we get away from the terminology for a bit, all of this boils down to figuring out who a user is and what they’re allowed to do. Doing this securely and efficiently is hard. Traditionally, enterprises have used a protocol called SAML with their IdPs, to provide a single sign-on (SSO) experience for their internal users. SAML is XML heavy and modern applications have started using OIDC with JSON mechanism to share claims. Developers can use SAML in ALB with Amazon Cognito’s SAML support. Web app or mobile developers typically use federated identities via social IdPs like Facebook, Amazon, or Google which, conveniently, are also supported by Amazon Cognito.
ALB Authentication works by defining an authentication action in a listener rule. The ALB’s authentication action will check if a session cookie exists on incoming requests, then check that it’s valid. If the session cookie is set and valid then the ALB will route the request to the target group with X-AMZN-OIDC-* headers set. The headers contain identity information in JSON Web Token (JWT) format, that a backend can use to identify a user. If the session cookie is not set or invalid then ALB will follow the OIDC protocol and issue an HTTP 302 redirect to the identity provider. The protocol is a lot to unpack and is covered more thoroughly in the documentation for those curious.
ALB Authentication Walkthrough
I have a simple Python flask app in an Amazon ECS cluster running in some AWS Fargate containers. The containers are in a target group routed to by an ALB. I want to make sure users of my application are logged in before accessing the authenticated portions of my application. First, I’ll navigate to the ALB in the console and edit the rules.
I want to make sure all access to /account* endpoints is authenticated so I’ll add new rule with a condition to match those endpoints.
Now, I’ll add a new rule and create an Authenticate action in that rule.
I’ll have ALB create a new Amazon Cognito user pool for me by providing some configuration details.
After creating the Amazon Cognito pool, I can make some additional configuration in the advanced settings.
I can change the default cookie name, adjust the timeout, adjust the scope, and choose the action for unauthenticated requests.
I can pick Deny to serve a 401 for all unauthenticated requests or I can pick Allow which will pass through to the application if unauthenticated. This is useful for Single Page Apps (SPAs). For now, I’ll choose Authenticate, which will prompt the IdP, in this case Amazon Cognito, to authenticate the user and reload the existing page.
Now I’ll add a forwarding action for my target group and save the rule.
Over on the Facebook side I just need to add my Amazon Cognito User Pool Domain to the whitelisted OAuth redirect URLs.
I would follow similar steps for other authentication providers.
Now, when I navigate to an authenticated page my Fargate containers receive the originating request with the X-Amzn-Oidc-* headers set by ALB. Using the information in those headers (claims-data, identity, access-token) my application can implement authorization.
All of this was possible without having to write a single line of code to deal with each of the IdPs. However, it’s still important for the implementing applications to verify the signature on the JWT header to ensure the request hasn’t been tampered with.
Of course everything we’ve seen today is also available in the the API and AWS Command Line Interface (CLI). You can find additional information on the feature in the documentation. This feature is provided at no additional charge.
With authentication built-in to ALB, developers can focus on building their applications instead of rebuilding authentication for every application, all the while maintaining the scale, availability, and reliability of ALB. I think this feature is a pretty big deal and I can’t wait to see what customers build with it. Let us know what you think of this feature in the comments or on twitter!
What do I do with a Mac that still has personal data on it? Do I take out the disk drive and smash it? Do I sweep it with a really strong magnet? Is there a difference in how I handle a hard drive (HDD) versus a solid-state drive (SSD)? Well, taking a sledgehammer or projectile weapon to your old machine is certainly one way to make the data irretrievable, and it can be enormously cathartic as long as you follow appropriate safety and disposal protocols. But there are far less destructive ways to make sure your data is gone for good. Let me introduce you to secure erasing.
Which Type of Drive Do You Have?
Before we start, you need to know whether you have a HDD or a SSD. To find out, or at least to make sure, you click on the Apple menu and select “About this Mac.” Once there, select the “Storage” tab to see which type of drive is in your system.
The first example, below, shows a SATA Disk (HDD) in the system.
In the next case, we see we have a Solid State SATA Drive (SSD), plus a Mac SuperDrive.
The third screen shot shows an SSD, as well. In this case it’s called “Flash Storage.”
Make Sure You Have a Backup
Before you get started, you’ll want to make sure that any important data on your hard drive has moved somewhere else. OS X’s built-in Time Machine backup software is a good start, especially when paired with Backblaze. You can learn more about using Time Machine in our Mac Backup Guide.
With a local backup copy in hand and secure cloud storage, you know your data is always safe no matter what happens.
Once you’ve verified your data is backed up, roll up your sleeves and get to work. The key is OS X Recovery — a special part of the Mac operating system since OS X 10.7 “Lion.”
How to Wipe a Mac Hard Disk Drive (HDD)
NOTE: If you’re interested in wiping an SSD, see below.
Make sure your Mac is turned off.
Press the power button.
Immediately hold down the command and R keys.
Wait until the Apple logo appears.
Select “Disk Utility” from the OS X Utilities list. Click Continue.
Select the disk you’d like to erase by clicking on it in the sidebar.
Click the Erase button.
Click the Security Options button.
The Security Options window includes a slider that enables you to determine how thoroughly you want to erase your hard drive.
There are four notches to that Security Options slider. “Fastest” is quick but insecure — data could potentially be rebuilt using a file recovery app. Moving that slider to the right introduces progressively more secure erasing. Disk Utility’s most secure level erases the information used to access the files on your disk, then writes zeroes across the disk surface seven times to help remove any trace of what was there. This setting conforms to the DoD 5220.22-M specification.
Once you’ve selected the level of secure erasing you’re comfortable with, click the OK button.
Click the Erase button to begin. Bear in mind that the more secure method you select, the longer it will take. The most secure methods can add hours to the process.
Once it’s done, the Mac’s hard drive will be clean as a whistle and ready for its next adventure: a fresh installation of OS X, being donated to a relative or a local charity, or just sent to an e-waste facility. Of course you can still drill a hole in your disk or smash it with a sledgehammer if it makes you happy, but now you know how to wipe the data from your old computer with much less ruckus.
The above instructions apply to older Macintoshes with HDDs. What do you do if you have an SSD?
Securely Erasing SSDs, and Why Not To
Most new Macs ship with solid state drives (SSDs). Only the iMac and Mac mini ship with regular hard drives anymore, and even those are available in pure SSD variants if you want.
If your Mac comes equipped with an SSD, Apple’s Disk Utility software won’t actually let you zero the hard drive.
With an SSD drive, Secure Erase and Erasing Free Space are not available in Disk Utility. These options are not needed for an SSD drive because a standard erase makes it difficult to recover data from an SSD.
In fact, some folks will tell you not to zero out the data on an SSD, since it can cause wear and tear on the memory cells that, over time, can affect its reliability. I don’t think that’s nearly as big an issue as it used to be — SSD reliability and longevity has improved.
If “Standard Erase” doesn’t quite make you feel comfortable that your data can’t be recovered, there are a couple of options.
FileVault Keeps Your Data Safe
One way to make sure that your SSD’s data remains secure is to use FileVault. FileVault is whole-disk encryption for the Mac. With FileVault engaged, you need a password to access the information on your hard drive. Without it, that data is encrypted.
There’s one potential downside of FileVault — if you lose your password or the encryption key, you’re screwed: You’re not getting your data back any time soon. Based on my experience working at a Mac repair shop, losing a FileVault key happens more frequently than it should.
When you first set up a new Mac, you’re given the option of turning FileVault on. If you don’t do it then, you can turn on FileVault at any time by clicking on your Mac’s System Preferences, clicking on Security & Privacy, and clicking on the FileVault tab. Be warned, however, that the initial encryption process can take hours, as will decryption if you ever need to turn FileVault off.
With FileVault turned on, you can restart your Mac into its Recovery System (by restarting the Mac while holding down the command and R keys) and erase the hard drive using Disk Utility, once you’ve unlocked it (by selecting the disk, clicking the File menu, and clicking Unlock). That deletes the FileVault key, which means any data on the drive is useless.
FileVault doesn’t impact the performance of most modern Macs, though I’d suggest only using it if your Mac has an SSD, not a conventional hard disk drive.
Securely Erasing Free Space on Your SSD
If you don’t want to take Apple’s word for it, if you’re not using FileVault, or if you just want to, there is a way to securely erase free space on your SSD. It’s a little more involved but it works.
Before we get into the nitty-gritty, let me state for the record that this really isn’t necessary to do, which is why Apple’s made it so hard to do. But if you’re set on it, you’ll need to use Apple’s Terminal app. Terminal provides you with command line interface access to the OS X operating system. Terminal lives in the Utilities folder, but you can access Terminal from the Mac’s Recovery System, as well. Once your Mac has booted into the Recovery partition, click the Utilities menu and select Terminal to launch it.
From a Terminal command line, type:
diskutil secureErase freespace VALUE /Volumes/DRIVE
That tells your Mac to securely erase the free space on your SSD. You’ll need to change VALUE to a number between 0 and 4. 0 is a single-pass run of zeroes; 1 is a single-pass run of random numbers; 2 is a 7-pass erase; 3 is a 35-pass erase; and 4 is a 3-pass erase. DRIVE should be changed to the name of your hard drive. To run a 7-pass erase of your SSD drive in “JohnB-Macbook”, you would enter the following:
And remember, if you used a space in the name of your Mac’s hard drive, you need to insert a leading backslash before the space. For example, to run a 35-pass erase on a hard drive called “Macintosh HD” you enter the following:
diskutil secureErase freespace 3 /Volumes/Macintosh\ HD
Something to remember is that the more extensive the erase procedure, the longer it will take.
When Erasing is Not Enough — How to Destroy a Drive
As of March 31, 2018 we had 100,110 spinning hard drives. Of that number, there were 1,922 boot drives and 98,188 data drives. This review looks at the quarterly and lifetime statistics for the data drive models in operation in our data centers. We’ll also take a look at why we are collecting and reporting 10 new SMART attributes and take a sneak peak at some 8 TB Toshiba drives. Along the way, we’ll share observations and insights on the data presented and we look forward to you doing the same in the comments.
Since April 2013, Backblaze has recorded and saved daily hard drive statistics from the drives in our data centers. Each entry consists of the date, manufacturer, model, serial number, status (operational or failed), and all of the SMART attributes reported by that drive. Currently there are about 97 million entries totaling 26 GB of data. You can download this data from our website if you want to do your own research, but for starters here’s what we found.
Hard Drive Reliability Statistics for Q1 2018
At the end of Q1 2018 Backblaze was monitoring 98,188 hard drives used to store data. For our evaluation below we remove from consideration those drives which were used for testing purposes and those drive models for which we did not have at least 45 drives. This leaves us with 98,046 hard drives. The table below covers just Q1 2018.
Notes and Observations
If a drive model has a failure rate of 0%, it only means there were no drive failures of that model during Q1 2018.
The overall Annualized Failure Rate (AFR) for Q1 is just 1.2%, well below the Q4 2017 AFR of 1.65%. Remember that quarterly failure rates can be volatile, especially for models that have a small number of drives and/or a small number of Drive Days.
There were 142 drives (98,188 minus 98,046) that were not included in the list above because we did not have at least 45 of a given drive model. We use 45 drives of the same model as the minimum number when we report quarterly, yearly, and lifetime drive statistics.
Welcome Toshiba 8TB drives, almost…
We mentioned Toshiba 8 TB drives in the first paragraph, but they don’t show up in the Q1 Stats chart. What gives? We only had 20 of the Toshiba 8 TB drives in operation in Q1, so they were excluded from the chart. Why do we have only 20 drives? When we test out a new drive model we start with the “tome test” and it takes 20 drives to fill one tome. A tome is the same drive model in the same logical position in each of the 20 Storage Pods that make up a Backblaze Vault. There are 60 tomes in each vault.
In this test, we created a Backblaze Vault of 8 TB drives, with 59 of the tomes being Seagate 8 TB drives and 1 tome being the Toshiba drives. Then we monitored the performance of the vault and its member tomes to see if, in this case, the Toshiba drives performed as expected.
So far the Toshiba drive is performing fine, but they have been in place for only 20 days. Next up is the “pod test” where we fill a Storage Pod with Toshiba drives and integrate it into a Backblaze Vault comprised of like-sized drives. We hope to have a better look at the Toshiba 8 TB drives in our Q2 report — stay tuned.
Lifetime Hard Drive Reliability Statistics
While the quarterly chart presented earlier gets a lot of interest, the real test of any drive model is over time. Below is the lifetime failure rate chart for all the hard drive models which have 45 or more drives in operation as of March 31st, 2018. For each model, we compute their reliability starting from when they were first installed.
Notes and Observations
The failure rates of all of the larger drives (8-, 10- and 12 TB) are very good, 1.2% AFR (Annualized Failure Rate) or less. Many of these drives were deployed in the last year, so there is some volatility in the data, but you can use the Confidence Interval to get a sense of the failure percentage range.
The overall failure rate of 1.84% is the lowest we have ever achieved, besting the previous low of 2.00% from the end of 2017.
Our regular readers and drive stats wonks may have noticed a sizable jump in the number of HGST 8 TB drives (model: HUH728080ALE600), from 45 last quarter to 1,045 this quarter. As the 10 TB and 12 TB drives become more available, the price per terabyte of the 8 TB drives has gone down. This presented an opportunity to purchase the HGST drives at a price in line with our budget.
We purchased and placed into service the 45 original HGST 8 TB drives in Q2 of 2015. They were our first Helium-filled drives and our only ones until the 10 TB and 12 TB Seagate drives arrived in Q3 2017. We’ll take a first look into whether or not Helium makes a difference in drive failure rates in an upcoming blog post.
New SMART Attributes
If you have previously worked with the hard drive stats data or plan to, you’ll notice that we added 10 more columns of data starting in 2018. There are 5 new SMART attributes we are tracking each with a raw and normalized value:
177 – Wear Range Delta
179 – Used Reserved Block Count Total
181- Program Fail Count Total or Non-4K Aligned Access Count
182 – Erase Fail Count
235 – Good Block Count AND System(Free) Block Count
The 5 values are all related to SSD drives.
Yes, SSD drives, but before you jump to any conclusions, we used 10 Samsung 850 EVO SSDs as boot drives for a period of time in Q1. This was an experiment to see if we could reduce boot up time for the Storage Pods. In our case, the improved boot up speed wasn’t worth the SSD cost, but it did add 10 new columns to the hard drive stats data.
Speaking of hard drive stats data, the complete data set used to create the information used in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose, all we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone. It is free.
If you just want the summarized data used to create the tables and charts in this blog post, you can download the ZIP file containing the MS Excel spreadsheet.
Good luck and let us know if you find anything interesting.
[Ed: 5/1/2018 – Updated Lifetime chart to fix error in confidence interval for HGST 4TB drive, model: HDS5C4040ALE630]
We launched AWS Support a full decade ago, with Gold and Silver plans focused on Amazon EC2, Amazon S3, and Amazon SQS. Starting from that initial offering, backed by a small team in Seattle, AWS Support now encompasses thousands of people working from more than 60 locations.
A Quick Look Back Over the years, that offering has matured and evolved in order to meet the needs of an increasingly diverse base of AWS customers. We aim to support you at every step of your cloud adoption journey, from your initial experiments to the time you deploy mission-critical workloads and applications.
We have worked hard to make our support model helpful and proactive. We do our best to provide you with the tools, alerts, and knowledge that will help you to build systems that are secure, robust, and dependable. Here are some of our most recent efforts toward that goal:
Trusted Advisor S3 Bucket Policy Check – AWS Trusted Advisor provides you with five categories of checks and makes recommendations that are designed to improve security and performance. Earlier this year we announced that the S3 Bucket Permissions Check is now free, and available to all AWS users. If you are signed up for the Business or Professional level of AWS Support, you can also monitor this check (and many others) using Amazon CloudWatch Events. You can use this to monitor and secure your buckets without human intervention.
Personal Health Dashboard – This tool provides you with alerts and guidance when AWS is experiencing events that may affect you. You get a personalized view into the performance and availability of the AWS services that underlie your AWS resources. It also generates Amazon CloudWatch Events so that you can initiate automated failover and remediation if necessary.
Well Architected / Cloud Ops Review – We’ve learned a lot about how to architect AWS-powered systems over the years and we want to share everything we know with you! The AWS Well-Architected Framework provide proven, detailed guidance in critical areas including operational excellence, security, reliability, performance efficiency, and cost optimization. You can read the materials online and you can also sign up for the online training course. If you are signed up for Enterprise support, you can also benefit from our Cloud Ops review.
Infrastructure Event Management – If you are launching a new app, kicking off a big migration, or hosting a large-scale event similar to Prime Day, we are ready with guidance and real-time support. Our Infrastructure Event Management team will help you to assess the readiness of your environment and work with you to identify and mitigate risks ahead of time.
The Amazon retail site makes heavy use of AWS. You can read my post, Prime Day 2017 – Powered by AWS, to learn more about the process of preparing to sustain a record-setting amount of traffic and to accept a like number of orders.
Come and Join Us The AWS Support Team is in continuous hiring mode and we have openings all over the world! Here are a couple of highlights:
Today, our customers use AWS CloudHSM to meet corporate, contractual and regulatory compliance requirements for data security by using dedicated Hardware Security Module (HSM) instances within the AWS cloud. CloudHSM delivers all the benefits of traditional HSMs including secure generation, storage, and management of cryptographic keys used for data encryption that are controlled and accessible only by you.
As a managed service, it automates time-consuming administrative tasks such as hardware provisioning, software patching, high availability, backups and scaling for your sensitive and regulated workloads in a cost-effective manner. Backup and restore functionality is the core building block enabling scalability, reliability and high availability in CloudHSM.
You should consider using AWS CloudHSM if you require:
Keys stored in dedicated, third-party validated hardware security modules under your exclusive control
FIPS 140-2 compliance
Integration with applications using PKCS#11, Java JCE, or Microsoft CNG interfaces
Healthcare applications subject to HIPAA regulations
Streaming video solutions subject to contractual DRM requirements
We recently released a whitepaper, “Security of CloudHSM Backups” that provides in-depth information on how backups are protected in all three phases of the CloudHSM backup lifecycle process: Creation, Archive, and Restore.
About the Author
Balaji Iyer is a senior consultant in the Professional Services team at Amazon Web Services. In this role, he has helped several customers successfully navigate their journey to AWS. His specialties include architecting and implementing highly-scalable distributed systems, operational security, large scale migrations, and leading strategic AWS initiatives.
Backblaze is growing, and with it our need to cater to a lot of different use cases that our customers bring to us. We needed a Solutions Engineer to help out, and after a long search we’ve hired our first one! Lets learn a bit more about Nathan shall we?
What is your Backblaze Title?
Solutions Engineer. Our customers bring a thousand different use cases to both B1 and B2, and I’m here to help them figure out how best to make those use cases a reality. Also, any odd jobs that Nilay wants me to do.
Where are you originally from?
I am native to the San Francisco Bay Area, studying mathematics at UC Santa Cruz, and then computer science at California University of Hayward (which has since renamed itself California University of the East Hills. I observe that it’s still in Hayward).
What attracted you to Backblaze?
As a stable, growing company with huge growth and even bigger potential, the business model is attractive, and the team is outstanding. Add to that the strong commitment to transparency, and it’s a hard company to resist. We can store – and restore – data while offering superior reliability at an economic advantage to do-it-yourself, and that’s a great place to be.
What do you expect to learn while being at Backblaze?
Everything I need to, but principally how our customers choose to interact with web storage. Storage isn’t a solution per se, but it’s an important component of any persistent solution. I’m looking forward to working with all the different concepts our customers have to make use of storage.
Where else have you worked?
All sorts of places, but I’ll admit publicly to EMC, Gemalto, and my own little (failed, alas) startup, IC2N. I worked with low-level document imaging.
Where did you go to school?
UC Santa Cruz, BA Mathematics CU Hayward, Master of Science in Computer Science.
What’s your dream job?
Sipping tea in the California redwood forest. However, solutions engineer at Backblaze is a good second choice!
Favorite place you’ve traveled?
Ashland, Oregon, for the Oregon Shakespeare Festival and the marble caves (most caves form from limestone).
Theater. Pathfinder. Writing. Baking cookies and cakes.
Of what achievement are you most proud?
Marrying the most wonderful man in the world.
Star Trek or Star Wars?
Star Trek’s utopian science fiction vision of humanity and science resonates a lot more strongly with me than the dystopian science fantasy of Star Wars.
Coke or Pepsi?
Neither. I’d much rather have a cup of jasmine tea.
It varies, but I love Indian and Thai cuisine. Truly excellent Italian food is marvelous – wood fired pizza, if I had to pick only one, but the world would be a boring place with a single favorite food.
Why do you like certain things?
If I knew that, I’d be in marketing.
Anything else you’d like you’d like to tell us?
If you haven’t already encountered the amazing authors Patricia McKillip and Lois McMasters Bujold – go encounter them. Be happy.
There’s nothing wrong with a nice cup of tea and a long game of Pathfinder. Sign us up! Welcome to the team Nathan!
Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2, robust and reliable object storage at just $0.005/gb/mo. We offer the lowest price of any of the big players and are still profitable.
Backblaze has a culture of openness. The hardware designs for our storage pods are open source. Key parts of the software, including the Reed-Solomon erasure coding are open-source. Backblaze is the only company that publishes hard drive reliability statistics.
We’ve managed to nurture a team-oriented culture with amazingly low turnover. We value our people and their families. The team is distributed across the U.S., but we work in Pacific Time, so work is limited to work time, leaving evenings and weekends open for personal and family time. Check out our “About Us” page to learn more about the people and some of our perks.
We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.
Our engineering team is 10 software engineers, and 2 quality assurance engineers. Most engineers are experienced, and a couple are more junior. The team will be growing as the company grows to meet the demand for our products; we plan to add at least 6 more engineers in 2018. The software includes the storage systems that run in the data center, the web APIs that clients access, the web site, and client programs that run on phones, tablets, and computers.
As the Director of Engineering, you will be:
managing the software engineering team
ensuring consistent delivery of top-quality services to our customers
collaborating closely with the operations team
directing engineering execution to scale the business and build new services
transforming a self-directed, scrappy startup team into a mid-size engineering organization
A successful director will have the opportunity to grow into the role of VP of Engineering. Backblaze expects to continue our exponential growth of our storage services in the upcoming years, with matching growth in the engineering team..
This position is located in San Mateo, California.
We are a looking for a director who:
has a good understanding of software engineering best practices
has experience scaling a large, distributed system
gets energized by creating an environment where engineers thrive
understands the trade-offs between building a solid foundation and shipping new features
has a track record of building effective teams
Required for all Backblaze Employees:
Good attitude and willingness to do whatever it takes to get the job done
Strong desire to work for a small fast-paced company
Desire to learn and adapt to rapidly changing technologies and work environment
Rigorous adherence to best practices
Relentless attention to detail
Excellent interpersonal skills and good oral/written communication
Excellent troubleshooting and problem solving skills
Some Backblaze Perks:
Competitive healthcare plans
Competitive compensation and 401k
All employees receive Option grants
Unlimited vacation days
Fully stocked Micro kitchen
Catered breakfast and lunches
Awesome people who work on awesome projects
New Parent Childcare bonus
Normal work hours
Get to bring your pets into the office
San Mateo Office — located near Caltrain and Highways 101 & 280.
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.