All posts by John Graham-Cumming

Creating serendipity with Python

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/creating-serendipity-with-python/

Creating serendipity with Python

We’ve been experimenting with breaking up employees into random groups (of size 4) and setting up video hangouts between them. We’re doing this to replace the serendipitous meetings that sometimes occur around coffee machines, in lunch lines or while waiting for the printer. And also, we just want people to get to know each other.

Which led to me writing some code, the core of which is: divide n elements into groups of at least size g, minimizing the size of each group. So, if an office has 15 employees it would be divided into three groups of sizes 5, 5, 5; if it had 16 employees it would be 4, 4, 4, 4; if it had 17 employees it would be 4, 4, 4, 5; and so on.

I initially wrote the following code (in Python):

    groups = [g] * (n//g)

    for e in range(0, n % g):
        groups[e % len(groups)] += 1

The first line creates n//g (// is integer division) entries of size g (for example, if g == 4 and n == 17 then groups == [4, 4, 4, 4]). The for loop deals with the ‘left over’ parts that don’t divide exactly into groups of size g. If g == 4 and n == 17 then there will be one left over element to add to one of the existing [4, 4, 4, 4] groups resulting in [5, 4, 4, 4].

The e % len(groups) is needed because it’s possible that there are more elements left over after dividing into equal sized groups than there are entries in groups. For example, if g == 4 and n == 11 then groups is initially set to [4, 4] with three left over elements that have to be distributed into just two entries in groups.

So, that code works and here’s the output for various sizes of n (and g == 4):

    4 [4]
    5 [5]
    6 [6]
    7 [7]
    8 [4, 4]
    9 [5, 4]
    10 [5, 5]
    11 [6, 5]
    12 [4, 4, 4]
    13 [5, 4, 4]
    14 [5, 5, 4]
    15 [5, 5, 5]
    16 [4, 4, 4, 4]
    17 [5, 4, 4, 4]
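
For reference, here's a small driver that produces a table like the one above (a sketch; the function name make_groups is my own wrapper around the code shown earlier):

    def make_groups(n, g=4):
        groups = [g] * (n // g)          # start with n//g groups of the minimum size g
        for e in range(0, n % g):        # share out the left over elements
            groups[e % len(groups)] += 1
        return groups

    for n in range(4, 18):
        print(n, make_groups(n))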

But the code irritated me because I felt there must be a simple formula to work out how many elements should be in each group. After noodling on this problem I decided to do something that’s often helpful… make the problem simple and naive, or, at least, the solution simple and naive, and so I wrote this code:

    groups = [0] * (n//g)

    for i in range(n):
        groups[i % len(groups)] += 1

This is a really simple implementation. I don’t like it because it loops n times but it helps visualize something. Imagine that g == 4 and n == 17. This loop ‘fills up’ each entry in groups like this (each square is an entry in groups and numbers in the squares are values of i for which that entry was incremented by the loop).

Creating serendipity with Python

So groups ends up being [5, 4, 4, 4]. What this helps us see is that the number of times groups[i] is incremented depends on how many times the for loop ‘loops around’ onto the ith element. And that’s something that’s easy to calculate without looping.

So this means that the code is now simply:

    groups = [1+max(0,n-(i+1))//(n//g) for i in range(n//g)]

And to me that is more satisfying. n//g is the size of groups, so the list comprehension sets each entry in groups exactly once. Each entry is set to 1 + max(0, n-(i+1))//(n//g). You can think of this as follows:

1. The 1 is the first element to be dropped into each entry in groups.

2. max(0, n-(i+1)) is the number of elements left over once you’ve placed 1 in each of the elements of groups up to position i. It’s divided by n//g to work out how many times the process of sharing out elements (see the naive loop above) will loop around.

If #2 isn’t clear, consider the image above and imagine we are computing groups[0] (n == 17 and g == 4). We place 1 in groups[0], leaving 16 elements to share out. If you naively shared them out you’d loop around four times and thus need to add 16/4 elements to groups[0], making it 5.

Move on to groups[1] and place a 1 in it. Now there are 15 elements to share out; that’s 15/4 (which is 3 in integer division), so groups[1] ends up as 4. And so on…
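
If you’d like to convince yourself that the closed-form version really does agree with the naive loop, a quick brute-force comparison (a sketch, with g == 4 as in the examples above) settles it:

    # Compare the naive round-robin loop against the closed-form expression
    # for a range of n; they should always produce identical groups.
    g = 4
    for n in range(g, 500):
        naive = [0] * (n // g)
        for i in range(n):
            naive[i % len(naive)] += 1
        formula = [1 + max(0, n - (i + 1)) // (n // g) for i in range(n // g)]
        assert naive == formula, (n, naive, formula)
    print("naive loop and formula agree")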

And that solution pleases me most. It succinctly creates groups in one shot. Of course, I might have overthought this… and others might think the other solutions are clearer or more maintainable.

Cloudflare Radar’s 2020 Year In Review

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-radar-2020-year-in-review/

Cloudflare Radar's 2020 Year In Review

Throughout 2020, we tracked changing Internet trends as the SARS-CoV-2 pandemic forced us all to change the way we were living, working, exercising and learning. In early April, we created a dedicated website https://builtforthis.net/ that showed some of the ways in which Internet use had changed, suddenly, because of the crisis.

On that website, we showed how traffic patterns had changed; for example, where people accessed the Internet from, how usage had jumped up dramatically, and how Internet attacks continued unabated and ultimately increased.

Today we are launching a dedicated Year In Review page with interactive maps and charts you can use to explore what changed on the Internet in 2020. Year In Review is part of Cloudflare Radar. We launched Radar in September 2020 to give anyone access to Internet use and abuse trends that Cloudflare had previously reserved for employees.

Where people accessed the Internet

To get a sense for the Year In Review, let’s zoom in on London (you can do the same with any city from a long list of locations that we’ve analyzed). Here’s a map showing the change in Internet use comparing April (post-lockdown) and February (pre-lockdown). This map compares working hours Internet use on a weekday between those two months.

As you can clearly see, with offices closed in central London (and elsewhere), Internet use dropped (the blue colour) while usage increased in largely residential areas. Looking out to the west of London, a blue area near Windsor shows how Internet usage dropped at London’s Heathrow airport and surrounding areas.

Cloudflare Radar's 2020 Year In Review

A similar story plays out slightly later in the San Francisco Bay Area.

Cloudflare Radar's 2020 Year In Review

But that trend reverses in July, with an increase in Internet use in many places that saw a rapid decrease in April.

Cloudflare Radar's 2020 Year In Review

When you select a city from the map, a second chart shows the overall trend in Internet use for the country in which that city is located. For example, here’s the chart for the United States. The Y-axis shows the percentage change in Internet traffic compared to the start of the year.

Cloudflare Radar's 2020 Year In Review

Internet use really took off in March (when the lockdowns began) and rapidly increased to 40% higher than the start of the year. And usage has pretty much stayed there for all of 2020: that’s the new normal.

Here’s what happened in France (when selecting Paris) on the map view.

Cloudflare Radar's 2020 Year In Review

Internet use was flat until the lockdowns began. At that point, it took off and grew close to 40% over the beginning of the year. But there’s a visible slowdown during the summer months, with Internet use up “only” 20% over the start of the year. Usage picked up again at “la rentrée” in September, with a new normal of about 30% growth in 2020.

What people did on the Internet

Returning to London, we can zoom into what people did on the Internet as the lockdowns began. The UK government announced a lockdown on March 23. On that day, the mixture of Internet use looked like this:

Cloudflare Radar's 2020 Year In Review

A few days later, the E-commerce category had jumped from 12.9% to 15.1% as people shopped online for groceries, clothing, webcams, school supplies, and more. Travel dropped from 1.5% of traffic to 1.1% (a decline of 30%).

Cloudflare Radar's 2020 Year In Review

And then by early-to-mid April E-commerce had increased to 16.2% of traffic, with Travel remaining low.

Cloudflare Radar's 2020 Year In Review

But not all the trends are pandemic-related. One question is: to what extent is Black Friday (November 27, 2020) an event outside the US? We can answer that by moving the London slider to late November and looking at the change in E-commerce. Watch carefully as E-commerce traffic grows towards Black Friday and actually peaks at 21.8% of traffic on Saturday, November 28.

As Christmas approached, E-commerce dropped off, but another category became very important: Entertainment. Notice how it peaked on Christmas Eve, as Britons, no doubt, turned to entertainment online during a locked-down Christmas.


And Hacking 2020

Of course, a pandemic didn’t mean that hacking activity decreased. Throughout 2020 and across the world, hackers continued to run their tools to attack websites, overwhelm APIs, and try to exfiltrate data.

Cloudflare Radar's 2020 Year In Review

Explore More

To explore data for 2020, you can check out Cloudflare Radar’s Year In Review page. To go deep into any specific country with up-to-date data about current trends, start at Cloudflare Radar’s homepage.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/internet-traffic-disruption-caused-by-the-christmas-day-bombing-in-nashville/

Internet traffic disruption caused by the Christmas Day bombing in Nashville

On Christmas Day 2020, an apparent suicide bomb exploded in Nashville, TN. The explosion happened outside an AT&T network building on Second Avenue in Nashville at 1230 UTC. Damage to the AT&T building and its power supply and generators quickly caused a telephone and Internet service outage for local residents. These outages continued for two days.

Looking at traffic flowing from AT&T in the Nashville area to Cloudflare, we can see that services continued operating (on battery power, according to reports) for over five hours after the explosion, but at 1748 UTC we saw a dramatic drop in traffic. 1748 UTC is close to noon in Nashville, when reports indicate that people lost phone and Internet service.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

We saw traffic from Nashville via AT&T start to recover over a 45-minute period beginning at 1822 UTC on December 27, making the total outage 2 days and 34 minutes.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

Traffic flows continue to be normal and no further disruption has been seen.

Privacy needs to be built into the Internet

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/internet-privacy/

Privacy needs to be built into the Internet

The first phase of the Internet lasted until the early 1990s. During that time it was created and debugged, and grew globally. Its growth was not hampered by concerns about data security or privacy. Until the 1990s the race was for connectivity.

Connectivity meant that people could get online and use the Internet wherever they were. Because the “inter” in Internet implied interoperability, the network was able to grow rapidly using a variety of technologies. Think dialup modems using ordinary phone lines, cable modems sending the Internet over coax originally designed for television, Ethernet, and, later, fibre optic connections and WiFi.

By the 1990s, the Internet was being used widely and for uses far beyond its academic origins. Early web pioneers, like Netscape, realized that the potential for e-commerce was gigantic but would be held back if people couldn’t have confidence in the security of online transactions.

Thus, with the introduction of SSL in 1994, the Internet moved to a second phase where security became paramount. Securing the web, and the Internet more generally, helped create the dotcom rush and the secure, online world we live in today. But this security was misunderstood by some as providing guarantees about privacy, which it did not.

People feel safe going online to shop, read the news, look up ailments and search for a life partner because cryptography prevents an eavesdropper from seeing what they are doing, and provides a guarantee that a website is who it claims to be. But it does not provide any privacy guarantee. The website you are visiting knows, at the very least, the IP address of your Internet connection.

And even with encryption, a well-placed eavesdropper can learn at least the names of the websites you are visiting, because that information leaks from protocols that weren’t designed to preserve privacy.

People who aim to remain anonymous on the Internet therefore turn to technologies like Tor or VPNs. But remaining anonymous from a website you shop from or an airline’s online booking site doesn’t make any sense. In those instances, the company you are dealing with will know who you are because you tell them your home address, name, passport number etc. You want them to know.

That makes privacy a nuanced thing: you want to remain anonymous to an eavesdropper but make sure a retailer knows where you live.

The connectivity phase of the Internet made it possible for you to connect to a computer anywhere in the world just as easily as one in your own city. The security phase of the Internet solved the problem of giving you confidence to hand over information to an airline or a retailer. Combining these two phases resulted in an Internet you can trust to transmit your data, but gave you little control over where that data ultimately ended up.

Phase 3

A French citizen could just as easily buy goods from a Spanish website as from a North American one. In both cases, the retailer would know the French name and address where the purchases were to be delivered. This creates a conundrum for a privacy-conscious citizen. The Internet created an amazing global platform for commerce, news and information (how easy it is for the French citizen to stay in contact with family in Cote d’Ivoire and even read the local news there from afar).

And while shopping an eavesdropper (such as an ISP, a coffee shop owner or an intelligence agency) could tell which website the French citizen was visiting.

The Internet also meant that your information and mine is dispersed across the world. Different countries have different rules about how that data is to be stored and shared, and countries and regions have data sharing agreements to allow cross-border transfer of private information about citizens.

Concerns about eavesdropping and where data ends up have created the world we are living in today where privacy concerns are coming to the forefront, especially in Europe but in many other countries as well.

In addition, the economics and flexibility of SaaS and cloud applications meant that it made sense to transfer data to a limited number of large data centers (which are sometimes confusingly called regions) where data from people all over the world can be processed. And, by and large, that was the world of the Internet: universal connectivity, widespread security, and data sharing through cross-border agreements.

This apparent utopia got snowed on by the leaking of secret documents describing the relationship between the US NSA (and its Five Eyes partners) and large Internet companies, and revealing that intelligence agencies were scooping up data from choke points on the Internet. Those revelations brought to the public’s attention the fact that their data could, in some cases, be accessed by foreign intelligence agencies.

Quite quickly those large data centers in far-flung countries looked like a bad idea, and governments and citizens started to demand control of data. This is the third phase of the Internet: privacy joins universal connectivity and security as a core requirement.

But what is control over data or privacy? Different governments have different ideas and different requirements, which can differ for different data sets. Some countries are convinced that the only way to control data is to keep it inside their countries, where they believe they can control who gets access to it. Other countries believe that they can address the risks by putting restrictions in place to prevent certain governments or companies from getting access to data. And the regulatory challenges are only getting more complicated.

This will be an enormous challenge for companies that have built a business on aggregating citizens’ information in order to target advertising, but it is also a challenge for anyone offering an Internet service. Just as companies have had to face the scourge of DDoS attacks and hacking, and have had to stay up to date with the latest in encryption technology, they will fundamentally have to store and process their customers’ data in different countries in different ways.

The European Union, in particular, has pushed a comprehensive approach to data privacy. Although the EU has had data protection principles in place since 1995, the implementation of the EU’s General Data Protection Regulation (GDPR) in 2018 has generated a new era of privacy online. GDPR imposes limitations on how the personal data of EU residents can be collected, stored, deleted, modified and otherwise processed.

Among the GDPR’s requirements are provisions on how EU personal data should be protected if that personal data leaves the EU. Although the US and the EU worked together to develop a set of voluntary commitments to make it easier for companies to transfer data between the two countries, that framework — the Privacy Shield — was invalidated this past summer. As a result, companies are grappling with how they can transfer data outside the EU, consistent with GDPR requirements. Recommendations recently issued by the European Data Protection Board (EDPB), which require data exporters to assess the law in third countries, determine whether that law adequately protects privacy, and if necessary, obtain guarantees of additional safeguards from data importers, have only added to companies’ concerns.

This anxiety over whether there are controls over data adequate to address the concerns of European regulators has prompted many of our customers to explore whether it is possible to prevent data subject to the GDPR from leaving the EU at all.

Gone are the days when all the world’s data could be processed in a massive data center regardless of its provenance.

One reaction to this change could be a retreat into every country building its own online email services, HR systems, e-commerce providers, and more. This would be a massive wasted effort. There are economies of scale if the same service can be used by Germans, Peruvians, Indonesians, Australians…

The answer to this privacy challenge is the same as the answer to the connectivity and security phases of the Internet: build it! We need to build a privacy-respecting Internet and give companies the tools to easily build privacy-respecting applications.

This week we’ll be talking about new tools from Cloudflare that make building privacy-respecting applications easy by allowing companies to situate their users’ data in the countries and regions of their choosing. And we’ll be talking about new protocols that build privacy into the very structure of the Internet. We’ll update on the latest quantum-resistant algorithms that help keep private data private today and into the far future.

We’ll show how it’s possible to run a massive DNS resolver service like 1.1.1.1 and preserve users’ privacy through a clever new protocol. We’ll look at how to make passwords that can’t be leaked. And we’ll give everyone the power to get web analytics without tracking people.

Welcome to Phase 3 of the Internet: always on, always secure, always private.

Introducing the Cloudflare Data Localization Suite

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/introducing-the-cloudflare-data-localization-suite/

Introducing the Cloudflare Data Localization Suite

Today we’re excited to announce the Cloudflare Data Localization Suite, which helps businesses get the performance and security benefits of Cloudflare’s global network, while making it easy to set rules and controls at the edge about where their data is stored and protected.

The Data Localization Suite is available now as an add-on for Enterprise customers.

Cloudflare’s network is private and compliant by design. Preserving end-user privacy is core to our mission of helping to build a better Internet; we’ve never sold personal data about customers or end users of our network. We comply with laws like GDPR and maintain certifications such as ISO-27001.

Today, we’re announcing tools that make it simple for our customers to build the same rigor into their own applications. In this post, I’ll explain the different types of data that we process and how the Data Localization Suite keeps this data local.

We’ll also talk about how Cloudflare makes it possible to build applications that comply with data locality laws, while remaining fast, secure and scalable.

Why keep data local?

Cloudflare’s customers increasingly want, or face legal requirements for, data locality: they want to control the geographic location where their data is handled. Many categories of data that our customers process (including healthcare, legal, or financial data) may be subject to obligations that specify the data be stored or processed in a specific location. The preference or requirement for data localization is growing across jurisdictions such as the EU, India, and Brazil; over time, we expect more customers in more places will be expected to keep data local.

Although “data locality” sounds like a simple concept, our conversations with Cloudflare customers make clear that there are a number of unique challenges they face in the attempt to move toward this goal. The availability of information on their Internet properties will remain global (they don’t want to limit access to their websites to local jurisdictions), but they want to make sure data stays local. Variously, they are trying to figure out:

  • How do I build local requirements into my global online operations?
  • How do I make sure unencrypted traffic is only available locally?
  • How do I make sure personal data is handled according to localization obligations?
  • How do I make sure my applications only store data in certain locations?

The Cloudflare Data Localization Suite attempts to respond to these questions.

Until now, customers who wanted to localize their data had to choose to restrict their application to one data center, or to one cloud provider’s region. This is a fragile approach, fraught with performance, reliability, and security challenges. Cloudflare is creating a new paradigm: customers should be able to get the performance and security benefits of our global network, while effortlessly keeping their data local.

Encryption is the backbone of privacy

Before we go into data locality, we should discuss encryption. Privacy isn’t possible without strong encryption; otherwise, anyone could snoop your customers’ data, regardless of where it’s stored.

Data is often described as being “in transit” and “at rest”. It’s critically important that both are encrypted. Data “in transit” refers to just that—data while it’s moving about on the wire, whether a local network or the public Internet. “At rest” generally means stored on a disk somewhere, whether a spinning HDD or a modern SSD.

In transit, Cloudflare can enforce that all traffic to end-users uses modern TLS and gets the highest level of encryption possible. We can also enforce that all traffic back to customers’ origin servers is always encrypted. Communication between all our edge and core data centers is always encrypted.

Cloudflare encrypts all of the data we handle at rest, usually with disk-level encryption. From cached files on our edge network, to configuration state in databases in our core data centers—every byte is encrypted at rest.

Control where TLS private keys can be accessed

Given the importance of encryption, among the most sensitive pieces of data that our customers trust us to protect are their cryptographic keys, which enable data to be decrypted. Cloudflare offers two ways for customers to ensure that their private keys are only accessible in locations they specify.

Keyless SSL allows a customer to store and manage their own SSL private keys for use with Cloudflare on any external infrastructure of their choosing. Customers can use a variety of systems for their keystore, including hardware security modules (“HSMs”), virtual servers, and hardware running Unix/Linux and Windows that is housed in environments customers control. Cloudflare never has access to the private key with Keyless SSL.

Geo Key Manager gives customers granular control over which locations should store their keys. For example, a customer can choose for the private keys required for inspection of traffic to only be accessible inside data centers located in the European Union.

Manage where HTTPS requests and responses are inspected

In order to deploy our WAF, or detect malicious bot traffic, Cloudflare must terminate TLS in our edge data centers and inspect HTTPS request and response payloads.

Regional Services gives organizations control over where their traffic is inspected. With Regional Services enabled, traffic is ingested on Cloudflare’s global Anycast network at the location closest to the client, where we can provide L3 and L4 DDoS protection. Instead of being inspected at the HTTP level at that data center, this traffic is securely transmitted to Cloudflare data centers inside the region selected by the customer and handled there.

Introducing the Cloudflare Data Localization Suite

Control the logs and analytics generated by your traffic

In addition to making our customers’ infrastructure and teams faster, more secure, and more reliable, we also provide insights into what our services do, and how customers can make better use of them. We gather metadata about the traffic that goes through our edge data centers, and use this to improve the operation of our own network: for example, by crafting WAF rules to block the latest attacks, or by developing machine learning models to detect malicious bots. We also make this data available to our customers in the form of logs and analytics.

This only requires a subset of the metadata to be processed in our core data centers in the US/EU. This data contains information about how many requests were served, how much data was sent, how long requests took, and other information that is essential for the operation of our network.

With Edge Log Delivery, customers can send logs directly from the edge to their partner of choice—for example, an Azure storage bucket in their preferred region, or an instance of Splunk that runs in an on-premise data center. With this option, customers can still get their complete logs in their preferred region, without these logs first flowing through either of our US or EU core data centers.

Introducing the Cloudflare Data Localization Suite

Edge Log Delivery is in early beta for Enterprise customers today—please visit our product page for more information.

Ultimately, we are working towards providing customers full control over where their metadata is stored, and for how long. In the coming year, we plan to allow customers to choose exactly which fields are stored, for how long, and in which location.

Building location-aware applications from the ground up

So far, we’ve discussed how Cloudflare’s products can offer global performance and security solutions for our customers, while keeping their existing keys, application data, and metadata local.

But we know that customers are also struggling to use existing, traditional cloud systems to manage their data locality needs. Existing platforms may allow code or data to be deployed to a specific region, but having copies of applications in each region, and managing state across each of them, can be challenging at best (or impossible at worst).

The ultimate promise of serverless has been to allow any developer to say “I don’t care where my code runs, just make it scale.” Increasingly, another promise will need to be “I do care where my code runs, and I need more control to satisfy my compliance department.” Cloudflare Workers gives you the best of both worlds, with instant scaling, locations that span more than 100 countries around the world, and the granularity to choose exactly what you need.

Introducing the Cloudflare Data Localization Suite

We are announcing a major improvement that lets customers control where their applications store data: Workers Durable Objects will support Jurisdiction Restrictions.  Durable Objects provide globally consistent state and coordination to serverless applications running on the Cloudflare Workers platform. Jurisdiction Restrictions will make it possible for users to ensure that their Durable Objects do not store data or run outside of a given jurisdiction—making it trivially simple to build applications that combine global performance with local compliance. With automatic migration of Durable Objects, adapting to new rules will be as simple as adding a tag to a set of Durable Objects.

Building for the long haul

The data localization landscape is constantly evolving. Since we began working on the Data Localization Suite, the European Data Protection Board has released new guidance about how data may be transferred between the EU and the US. And we know this is just the beginning — over time, more regions and more industries will have data localization requirements.

At Cloudflare, we stay on top of the latest developments around data protection so our customers don’t have to. The Data Localization Suite gives our customers the tools to set rules and controls at the edge about where their data is stored and protected, while taking advantage of our global network.

Welcome to Birthday Week 2020

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/welcome-to-birthday-week-2020/

Welcome to Birthday Week 2020

Each year we celebrate our launch on September 27, 2010 with a week of product announcements. We call this Birthday Week, but rather than receiving gifts, we give them away. This year is no different, except that it is… Cloudflare is 10 years old.

Before looking forward to the coming week, let’s take a look back at announcements from previous Birthday Weeks.

Welcome to Birthday Week 2020

A year into Cloudflare’s life (in 2011) we launched automatic support for IPv6. This was the first of a long line of announcements in support of our goal of making the latest technologies available to everyone. If you’ve been following Cloudflare’s growth you’ll know those include SPDY/HTTP/2, TLS 1.3, QUIC/HTTP/3, DoH and DoT, WebP, … At two years old we celebrated with a timeline of our first two years and the fact that we’d reached 500,000 domains using the service. A year later that number had tripled.

Welcome to Birthday Week 2020

In 2014 we released Universal SSL and gave all our customers SSL certificates. In one go we massively increased the size of the encrypted web and made it free and simple to go from http:// to https://. Other HTTPS related features we’ve rolled out include: Automatic HTTPS Rewrites, Encrypted SNI and our CT Log.

Welcome to Birthday Week 2020

In 2017 we unwrapped a bunch of goodies: Unmetered DDoS Mitigation, our video streaming service Cloudflare Stream, and the ability to control where private SSL keys are stored through Geo Key Manager. And, last but not least, our hugely popular serverless platform Cloudflare Workers. It’s hard to believe that it’s been three years since we changed the way people think about serverless with our massively distributed, secure and fast-to-update platform.

Welcome to Birthday Week 2020

Two years ago Cloudflare became a domain registrar with the launch of our “at cost” service: Cloudflare Registrar. We also announced the Bandwidth Alliance which is designed to reduce or eliminate high cloud egress fees. We rolled out support for QUIC and Cloudflare Workers got a globally distributed key value store: Workers KV.

Welcome to Birthday Week 2020

Which brings us to last year, with the launch of WARP Plus to speed up and secure the “last mile” connection between a device and Cloudflare’s network, and Browser Insights so that customers can optimize their website’s performance and see how each Cloudflare tool helps.

We greatly enhanced our bot management tools with Bot Fight Mode, and rolled out Workers Sites to bring the power of Workers and Workers KV to entire websites.

Welcome to Birthday Week 2020

No Spoilers Here

Here are some hints about what to expect this year for our 10th anniversary Birthday Week:

Welcome to Birthday Week 2020
  • Monday: We’re fundamentally changing how people think about Serverless

If you studied computer science you’ll probably have come across Niklaus Wirth’s book “Algorithms + Data Structures = Programs”. We’re going to start the week with two enhancements to Cloudflare Workers that are fundamentally going to change how people think about serverless. The lambda calculus is a nice theoretical foundation, but it’s Turing machines that won the day. If you want to build large, real programs you need to have algorithms and data structures.

Welcome to Birthday Week 2020
  • Tuesday and Wednesday are all about observability. Of an Internet property and of the Internet itself. And they are also about privacy. We’ll roll out new functionality so you can see what’s happening without the need to track people.
Welcome to Birthday Week 2020
  • Thursday is security day with a new service to protect the parts of websites and Internet applications that are behind the scenes. And, finally, on Friday it’s all about one-click performance improvements that leverage our network of more than 200 cities to speed up static and dynamic content.

Welcome to Birthday Week 2020!

Cloudflare outage on July 17, 2020

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/

Cloudflare outage on July 17, 2020

Today a configuration error in our backbone network caused an outage for Internet properties and Cloudflare services that lasted 27 minutes. We saw traffic drop by about 50% across our network. Because of the architecture of our backbone this outage didn’t affect the entire Cloudflare network and was localized to certain geographies.

The outage occurred because, while working on an unrelated issue with a segment of the backbone from Newark to Chicago, our network engineering team updated the configuration on a router in Atlanta to alleviate congestion. This configuration contained an error that caused all traffic across our backbone to be sent to Atlanta. This quickly overwhelmed the Atlanta router and caused Cloudflare network locations connected to the backbone to fail.

The affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre. Other locations continued to operate normally.

For the avoidance of doubt: this was not caused by an attack or breach of any kind.

We are sorry for this outage and have already made a global change to the backbone configuration that will prevent it from being able to occur again.

The Cloudflare Backbone

Cloudflare outage on July 17, 2020

Cloudflare operates a backbone between many of our data centers around the world. The backbone is a series of private lines between our data centers that we use for faster and more reliable paths between them. These links allow us to carry traffic between different data centers, without going over the public Internet.

We use this, for example, to reach a website origin server sitting in New York, carrying requests over our private backbone from locations as far away as San Jose, California, Frankfurt, or São Paulo. This additional option to avoid the public Internet allows a higher quality of service, as the private network can be used to avoid Internet congestion points. With the backbone, we have far greater control over where and how to route Internet requests and traffic than the public Internet provides.

Timeline

All timestamps are UTC.

First, an issue occurred on the backbone link between Newark and Chicago, which led to backbone congestion between Atlanta and Washington, DC.

In responding to that issue, a configuration change was made in Atlanta. That change started the outage at 21:12. Once the outage was understood, the Atlanta router was disabled and traffic began flowing normally again at 21:39.

Shortly after, we saw congestion at one of our core data centers that processes logs and metrics, causing some logs to be dropped. During this period the edge network continued to operate normally.

  • 20:25: Loss of backbone link between EWR and ORD
  • 20:25: Backbone between ATL and IAD is congesting
  • 21:12 to 21:39: ATL attracted traffic from across the backbone
  • 21:39 to 21:47: ATL dropped from the backbone, service restored
  • 21:47 to 22:10: Core congestion caused some logs to drop, edge continues operating
  • 22:10: Full recovery, including logs and metrics

Here’s a view of the impact from Cloudflare’s internal traffic manager tool. The red and orange region at the top shows CPU utilization in Atlanta reaching overload, and the white regions show affected data centers seeing CPU drop to near zero as they were no longer handling traffic. This is the period of the outage.

Other, unaffected data centers show no change in their CPU utilization during the incident. That’s indicated by the fact that the green color does not change during the incident for those data centers.

Cloudflare outage on July 17, 2020

What happened and what we’re doing about it

As there was backbone congestion in Atlanta, the team had decided to remove some of Atlanta’s backbone traffic. But instead of removing the Atlanta routes from the backbone, a one-line change started leaking all BGP routes into the backbone.

{master}[edit]
atl01# show | compare 
[edit policy-options policy-statement 6-BBONE-OUT term 6-SITE-LOCAL from]
!       inactive: prefix-list 6-SITE-LOCAL { ... }

The complete term looks like this:

from {
    prefix-list 6-SITE-LOCAL;
}
then {
    local-preference 200;
    community add SITE-LOCAL-ROUTE;
    community add ATL01;
    community add NORTH-AMERICA;
    accept;
}

This term sets the local-preference, adds some communities, and accepts the routes that match the prefix-list. Local-preference is a transitive property on iBGP sessions (it will be transferred to the next BGP peer).

The correct change would have been to deactivate the term instead of the prefix-list.

By removing the prefix-list condition, the router was instructed to send all its BGP routes to all other backbone routers, with an increased local-preference of 200. Unfortunately at the time, local routes that the edge routers received from our compute nodes had a local-preference of 100. As the higher local-preference wins, all of the traffic meant for local compute nodes went to Atlanta compute nodes instead.

With the routes sent out, Atlanta started attracting traffic from across the backbone.
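
To make the local-preference comparison concrete, here is a toy illustration (a Python sketch, not router configuration; the numbers are the local-preference values described above) of why the leaked routes won best-path selection:

    # Toy model: for routes to the same destination, BGP prefers the route
    # with the highest local-preference. The leaked Atlanta routes carried
    # local-preference 200; routes to local compute nodes carried 100.
    routes = [
        {"next_hop": "local compute nodes", "local_pref": 100},
        {"next_hop": "atl01 (leaked)",      "local_pref": 200},
    ]
    best = max(routes, key=lambda r: r["local_pref"])
    print(best["next_hop"])  # -> "atl01 (leaked)": traffic is pulled to Atlanta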

We are making the following changes:

  • Introduce a maximum-prefix limit on our backbone BGP sessions – this would have shut down the backbone in Atlanta, but our network is built to function properly without a backbone. This change will be deployed on Monday, July 20.
  • Change the BGP local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change has been deployed following the incident.

Conclusion

We had never before experienced an outage on our backbone, and our team responded quickly to restore service in the affected locations, but this was a very painful period for everyone involved. We are sorry for the disruption to our customers and to all the users who were unable to access Internet properties while the outage was happening.

We’ve already made changes to the backbone configuration to make sure that this cannot happen again, and further changes will resume on Monday.

Cloudflare’s first year in Lisbon

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflares-first-year-in-lisbon/

Cloudflare's first year in Lisbon

A year ago I wrote about the opening of Cloudflare’s office in Lisbon; it’s hard to believe that a year has flown by. At the time I wrote:

Lisbon’s combination of a large and growing existing tech ecosystem, attractive immigration policy, political stability, high standard of living, as well as logistical factors like time zone (the same as the UK) and direct flights to San Francisco made it the clear winner.

We landed in Lisbon with a small team of transplants from other Cloudflare offices. Twelve of us moved from the UK, US and Singapore to bootstrap here. Today we are 35 people with another 10 having accepted offers; we’ve almost quadrupled in a year and we intend to keep growing to around 80 by the end of 2020.

Cloudflare's first year in Lisbon

If you read back to my description of why we chose Lisbon only one item hasn’t turned out quite as we expected. Sure enough TAP Portugal does have direct flights to San Francisco but the pandemic put an end to all business flying worldwide for Cloudflare. We all look forward to getting back to being able to visit our colleagues in other locations.

The pandemic also put us in the odd position of needing to move from one empty office to another. Back in January the Cloudflare Lisbon office was in the Chiado and only had capacity for about 14 people. With our rapid growth we moved, in February, to a larger, temporary location on Avenida da Liberdade which had room for about 25 people.

Cloudflare's first year in Lisbon
Leaving the Chiado‌‌

And in early April, we moved to our longer term office on Praça Marquês de Pombal. Of course, by that time the State of Emergency had been declared in Portugal and the office move took place in our absence. But it sits waiting for our return sometime in early 2021.

The team that landed in Lisbon covered Customer Support, Security, IT, Technology, and  Emerging Technology and Incubation, but, as we suspected, we’ve grown in many other departments and the rest of Cloudflare is realizing how much Lisbon and Portugal have to offer. In addition to the original team we now have people in SRE, Payroll, Accounting, Trust and Safety, People and Places, Product Management and Infrastructure.

Cloudflare's first year in Lisbon
View from the Cloudflare Lisbon office‌‌

Despite the pandemic we’re continuing to invest in Lisbon with 24 open roles in Customer Support, Infrastructure, People and Places, Engineering, Accounting and Finance, Security, Business Intelligence, Product Management and Emerging Technology and Incubation.

As I said in an interview with AICEP earlier this year “É nosso objetivo construir em Lisboa um dos maiores escritórios da Cloudflare” (“It’s our objective to build in Lisbon one of the major Cloudflare offices”). You can read the full Portuguese-language interview here. We continue to believe that Lisbon is a vital part of Cloudflare’s growth.

Cloudflare's first year in Lisbon

I’ve spent a huge amount of my career on aircraft and the last few months have felt very odd, but I couldn’t have been happier to find myself temporarily stuck in Lisbon. No doubt we’ll all be traveling again but this last year has confirmed my impression that Lisbon is a great place to live.

I asked our team what they’d found they love about living in Lisbon and Portugal. They came back with pasteis de nata, sunshine every day, the jacaranda trees, feijoada, empada de galinha, Joker, Super Bock, chocolate mousse being an everyday staple, Maria biscuits, quality fresh produce, dolphins, lizards in the gardens, MB Way, ovos moles de Aveiro, so great that only ~30/40min from here you get such nice beaches like the ones in Setubal, Sintra, Cascais, Sesimbra, bica, sardines, the Alentejo coastline, the chicken from Bonjardim, family friendliness and how nice it is to raise children here, fast, reliable and cheap Internet access, and so much more.

If you’d like to join us please visit our careers page for Lisbon.

When people pause the Internet goes quiet

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/when-people-pause/

When people pause the Internet goes quiet

Recent news about the Internet has mostly been about the great increase in usage as those workers who can work from home have been told to do so. I’ve written about this twice recently, first in early March and then last week, looking at how Internet use has risen to a new normal.

When people pause the Internet goes quiet

As human behaviour has changed in response to the pandemic, it’s left a mark on the charts that network operators look at day in, day out to ensure that their networks are running correctly.

Most Internet traffic has a fairly simple rhythm to it. Here, for example, is daily traffic seen on the Amsterdam Internet Exchange. It’s a pattern that’s familiar to most network operators. People sleep at night, and there’s a peak of usage in the early evening when people get home and perhaps stream a movie, or listen to music or use the web for things they couldn’t do during the workday.

When people pause the Internet goes quiet

But sometimes that rhythm gets broken. Recently we’ve seen the evening peak joined by morning peaks as well. Here’s a graph from the Milan Internet Exchange. There are three peaks: morning, afternoon and evening. These peaks seem to be caused by people working from home and children being schooled and playing at home.

When people pause the Internet goes quiet

But there are other ways human behaviour shows up on graphs like these. When humans pause, the Internet goes quiet. Here are two examples that I’ve seen recently.

The UK and #ClapForNHS

Here’s a chart of Internet traffic last week in the UK. The triple peak is clearly visible (see circle A). But circle B shows a significant drop in traffic on Thursday, April 23.

When people pause the Internet goes quiet

That’s when people in the UK clapped for NHS workers to show their appreciation for those on the front line dealing with people sick with COVID-19.

Ramadan

Ramadan started last Friday, April 24 and it shows up in Internet traffic in countries with large Muslim populations. Here, for example, is a graph of traffic in Tunisia over the weekend. A similar pattern is seen across the Muslim world.

When people pause the Internet goes quiet

Two important parts of the day during Ramadan show up on the chart. These are the iftar and sahoor. Circle A shows the iftar, the evening meal at which Muslims break the fast. Circle B shows the sahoor, the early morning meal before the day’s fasting.

Looking at the previous weekend (in green) you can see that the Ramadan-related changes are not present and that Internet use is generally higher (by 10% to 15%).

When people pause the Internet goes quiet

Conclusion

We built the Internet for ourselves, and despite all the machine-to-machine traffic that takes place (think IoT devices chatting to their APIs, or computers updating software in the night), human-directed traffic dominates.

I’d love to hear from readers about other ways human activity might show up in these Internet trends.

Internet performance during the COVID-19 emergency

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/recent-trends-in-internet-traffic/

A month ago I wrote about changes in Internet traffic caused by the COVID-19 emergency. At the time I wrote:

Cloudflare is watching carefully as Internet traffic patterns around the world alter as people alter their daily lives through home-working, cordon sanitaire, and social distancing. None of these traffic changes raise any concern for us. Cloudflare’s network is well provisioned to handle significant spikes in traffic. We have not seen, and do not anticipate, any impact on our network’s performance, reliability, or security globally.

That holds true today; our network is performing as expected under increased load. Overall the Internet has shown that it was built for this: designed to handle huge changes in traffic, outages, and a changing mix of use. As we are well into April I thought it was time for an update.

Growth

Here’s a chart showing the relative change in Internet use as seen by Cloudflare since the beginning of the year. I’ve calculated a moving average of the trailing seven days for each country, using December 29, 2019 as the reference point.
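
For anyone wanting to reproduce this kind of chart from their own data, the calculation is roughly the following (a sketch using pandas; the input file and column layout are assumptions, not something Cloudflare publishes):

    # Sketch: seven-day trailing moving average of daily traffic per country,
    # expressed as a percentage change relative to December 29, 2019.
    import pandas as pd

    # Assumed input: one row per day, one column per country.
    daily = pd.read_csv("requests_by_country.csv", index_col="date", parse_dates=True)

    smoothed = daily.rolling(window=7).mean()      # trailing seven-day average
    baseline = smoothed.loc["2019-12-29"]          # reference point
    relative = (smoothed / baseline - 1) * 100     # percentage change vs. the reference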

On this chart the highest growth in Internet use has been in Portugal: it’s currently running at about a 50% increase, with Spain close behind, followed by the UK. Italy flattened out at about a 40% increase in usage towards the end of March, and France seems to be plateauing at a little over 30% up on the end of last year.

It’s interesting to see how steeply Internet use grew in the UK, Spain and Portugal (the red, yellow and blue lines rise very steeply), with Spain and Portugal almost in unison and the UK lagging by about two weeks.

Looking at some other major economies we see other, yet similar patterns.

Similar increases in utilization are seen here. The US, Canada, Australia and Brazil are all running at between 40% and 50% above the level of use at the beginning of the year.

Stability

We measure the TCP RTT (round trip time) between our servers and visitors to Internet properties that are Cloudflare customers. This gives us a measure of the speed of the networks between us and end users, and if the RTT increases it is also a measure of congestion along the path.
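
The two statistics discussed below (median and 95th percentile RTT) are simply percentiles over the RTT samples collected; as a rough sketch (using numpy, with made-up sample values):

    # Sketch: median and 95th percentile of a set of TCP RTT samples (in ms).
    import numpy as np

    rtt_ms = np.array([12.1, 13.4, 11.8, 45.0, 12.9, 14.2, 80.3, 13.1])  # made-up samples
    print("median:", np.percentile(rtt_ms, 50), "ms")
    print("p95:   ", np.percentile(rtt_ms, 95), "ms")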

Looking at TCP RTT over the last 90 days can help identify changes in congestion or the network. Cloudflare connects widely to the Internet via peering (and through the use of transit) and we connect to the largest number of Internet exchanges worldwide to ensure fast access for all users.

Cloudflare is also present in 200 cities worldwide; thus the TCP RTT seen by Cloudflare gives a measure of the performance of end-user networks within a country. Here’s a chart showing the median and 95th percentile TCP RTT in the UK in the last 90 days.

What’s striking in this chart is that despite the massive increase in Internet use (the grey line), the TCP RTT hasn’t changed significantly. From our vantage point UK networks are coping well.

Here’s the situation in Italy:

The picture here is slightly different. Both median and 95th percentile TCP RTT increased as traffic increased. This indicates that networks aren’t operating as smoothly in Italy. It’s noticeable, though, that as traffic has plateaued the TCP RTT has improved somewhat (take a look at the 95th percentile) indicating that ISPs and other network providers in Italy have likely taken action to improve the situation.

This doesn’t mean that Italian Internet is in trouble, just that it’s strained more than, say, the Internet in the UK.

Conclusion

The Internet has seen incredible, sudden growth in traffic but continues to operate well. What Cloudflare sees reflects what we’ve heard anecdotally: some end-user networks are feeling the strain of the sudden change of load but are working and helping us all cope with the societal effects of COVID-19.

It’s hard to imagine another utility (say electricity, water or gas) coping with a sudden and continuous increase in demand of 50%.

Cloudflare Workers Now Support COBOL

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-workers-now-support-cobol/

Cloudflare Workers Now Support COBOL

Recently, COBOL has been in the news as the State of New Jersey has asked for help with a COBOL-based system for unemployment claims. The system has come under heavy load because of the societal effects of the SARS-CoV-2 virus. This appears to have prompted IBM to offer free online COBOL training.

Cloudflare Workers Now Support COBOL

As old as COBOL is (60 years old this month), it is still heavily used in information management systems and pretty much anywhere there’s an IBM mainframe around. Three years ago Thomson Reuters reported that COBOL is used in 43% of banking systems, is behind 80% of in-person financial transactions, and is involved 95% of the time an ATM card is used. They also reported hundreds of billions of lines of running COBOL.

COBOL is often a source of amusement for programmers because it is seen as old, verbose, clunky, and difficult to maintain. And it’s often the case that people making the jokes have never actually written any COBOL. We plan to give them a chance: COBOL can now be used to write code for Cloudflare’s serverless platform Workers.

Here’s a simple “Hello, World!” program written in COBOL and accessible at https://hello-world.cobol.workers.dev/. It doesn’t do much (it just outputs “Hello, World!”) but it does it using COBOL.

        IDENTIFICATION DIVISION.
        PROGRAM-ID. HELLO-WORLD.
        DATA DIVISION.
        WORKING-STORAGE SECTION.
        01 HTTP_OK   PIC X(4)  VALUE "200".
        01 OUTPUT_TEXT PIC X(14) VALUE "Hello, World!".
        PROCEDURE DIVISION.
            CALL "set_http_status" USING HTTP_OK.
            CALL "append_http_body" USING OUTPUT_TEXT.
        STOP RUN.

If you’ve never seen a COBOL program before, it might look very odd. The language emerged in 1960 from the work of a committee designing a language for business (COBOL = COmmon Business Oriented Language) and was intended to be easy to read and understand (hence the verbose syntax). It was partly based on an early language called FLOW-MATIC created by Grace Hopper.

IDENTIFICATION DIVISION.

To put COBOL in context: FORTRAN arrived in 1957, LISP and ALGOL in 1958, APL in 1962 and BASIC in 1964. The C language didn’t arrive on scene until 1972. The late 1950s and early 1960s saw a huge amount of work on programming languages, some coming from industry (such as FORTRAN and COBOL) and others from academia (such as LISP and ALGOL).

COBOL is a compiled language and can easily be compiled to WebAssembly and run on Cloudflare Workers. If you want to get started with COBOL, the GnuCOBOL project is a good place to begin.

Here’s a program that waits for you to press ENTER and then adds up the numbers 1 to 1000 and outputs the result:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. ADD.
       ENVIRONMENT DIVISION.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       77 IDX  PICTURE 9999.
       77 SUMX PICTURE 999999.
       77 X    PICTURE X.
       PROCEDURE DIVISION.
       BEGIN.
           ACCEPT X.
           MOVE ZERO TO IDX.
           MOVE ZERO TO SUMX.
           PERFORM ADD-PAR UNTIL IDX = 1001.
           DISPLAY SUMX.
           STOP RUN.
       ADD-PAR.
           COMPUTE SUMX = SUMX + IDX.
           ADD 1 TO IDX.

You can compile it and run it using GnuCOBOL like this (I put this in a file called `terminator.cob`):

$ cobc -x terminator.cob
$ ./terminator
500500
$

cobc compiles the COBOL program to an executable file. It can also output a C file containing C code to implement the COBOL program:

$ cobc -C -o terminator.c -x terminator.cob

This .c file can then be compiled to WebAssembly. I’ve done that and placed this program (with small modifications to make it output via HTTP, as in the Hello, World! program above) at https://terminator.cobol.workers.dev/. Note that the online version doesn’t wait for you to press ENTER; it just does the calculation and gives you the answer.

DATA DIVISION.

You might be wondering why I called this `terminator.cob`. That’s because this is part of the code that appears in The Terminator, James Cameron’s 1984 film. The film features a ton of code from the Apple ][ and a little snippet of COBOL (see the screenshot from the film below).

Cloudflare Workers Now Support COBOL

The screenshot shows the view from one of the HK-Aerial hunter-killer VTOL craft used by Skynet to try to wipe out the remnants of humanity. Using COBOL.

You can learn all about that in this YouTube video I produced:

For those of you of the nerdy persuasion, here’s the original code as it appeared in the May 1984 edition of “73 Magazine” and was copied to look cool on screen in The Terminator.

Cloudflare Workers Now Support COBOL

If you want to scale your own COBOL-implemented Skynet, it only takes a few steps to convert COBOL to WebAssembly and have it run in over 200 cities worldwide on Cloudflare’s network.

PROCEDURE DIVISION.

Here’s how you can take your COBOL program and turn it into a Worker.

There are multiple compiler implementations of the COBOL language and a few of them are proprietary. We decided to use GnuCOBOL (formerly OpenCOBOL) because it’s free software.

Given that Cloudflare Workers supports WebAssembly, it sounded quite straightforward: GnuCOBOL can compile COBOL to C and Emscripten compiles C/C++ to WebAssembly. However, we need to make sure that our WebAssembly binary is as small and fast as possible to maximize the time available for user code to run rather than COBOL’s runtime.

GnuCOBOL has a runtime library called libcob, which implements COBOL’s runtime semantics, using GMP (GNU Multiple Precision Arithmetic Library) for arithmetic. After we compiled both these libraries to WebAssembly and linked against our compiled COBOL program, we threw the WebAssembly binary in a Cloudflare Worker.

It was too big and it hit the CPU limit (you can find Cloudflare Workers’ limits here), so it was time to optimize.

GMP turns out to be a big library, but luckily for us someone made an optimized version for JavaScript (https://github.com/kripken/gmp.js), which was much smaller and reduced the WebAssembly instantiation time. As a side note, it’s often the case that functions implemented in C could be removed in favour of a JavaScript implementation already existing on the web platform. But for this project we didn’t want to rewrite GMP.

While Emscripten can emulate a file system with all its syscalls, it didn’t seem necessary in a Cloudflare Worker. We patched GnuCOBOL to remove the support for local user configuration and other small things, allowing us to remove the emulated file system.

Our Wasm binary ends up relatively small compared to other languages: for example, the Game of Life demo later in this blog post is around 230KB with optimizations enabled.

Now that we have a COBOL program running in a Cloudflare Worker, we still need a way to generate an HTTP response.

The HTTP response generation and manipulation is written in JavaScript (for now… some changes to WebAssembly are currently being discussed that would allow a better integration). Emscripten imports these functions and makes them available in C, and finally we link all the C code with our COBOL program. COBOL already has good interoperability with C code.

As an example, we implemented the rock-paper-scissors game (https://github.com/cloudflare/cobol-worker/blob/master/src/worker.cob). See the full source (https://github.com/cloudflare/cobol-worker).

Our work can be used by anyone wanting to compile COBOL to WebAssembly; the toolchain we used is available on GitHub (https://github.com/cloudflare/cobaul) and is free to use.

To deploy your own COBOL Worker, you can run the following commands. Make sure that you have wrangler installed on your machine (https://github.com/cloudflare/wrangler).

wrangler generate cobol-worker https://github.com/cloudflare/cobol-worker-template

It will generate a cobol-worker directory containing the Worker. Follow the instructions in your terminal to configure your Cloudflare account with wrangler.

Your Worker is ready to go; run `npm run deploy` and, once deployed, the URL will be displayed in the console.

STOP RUN.

I am very grateful to Sven Sauleau for doing the work to make it easy to port a COBOL program into a Workers file and for writing the PROCEDURE DIVISION section above and to Dane Knecht for suggesting Conway’s Game of Life.

Cloudflare Workers with WebAssembly is an easy-to-use serverless platform that’s fast and cheap and scalable. It supports a wide variety of languages–including COBOL (and C, C++, Rust, Go, JavaScript, etc.). Give it a try today.

AFTERWORD

We learnt the other day of the death of John Conway who is well known for Conway’s Game of Life. In tribute to Conway, XKCD dedicated a cartoon:

Cloudflare Workers Now Support COBOL

I decided to implement the Game of Life in COBOL and reproduce the cartoon.

Here’s the code:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. worker.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01 PARAM-NAME PIC X(7).
       01 PARAM-VALUE PIC 9(10).
       01 PARAM-OUTPUT PIC X(10).
       01 PARAM PIC 9(10) BINARY.
       01 PARAM-COUNTER PIC 9(2) VALUE 0.
       01 DREW PIC 9 VALUE 0.
       01 TOTAL-ROWS PIC 9(2) VALUE 20.
       01 TOTAL-COLUMNS PIC 9(2) VALUE 15.
       01 ROW-COUNTER PIC 9(2) VALUE 0.
       01 COLUMN-COUNTER PIC 9(2) VALUE 0.
       01 OLD-WORLD PIC X(300).
       01 NEW-WORLD PIC X(300).
       01 CELL PIC X(1) VALUE "0".
       01 X PIC 9(2) VALUE 0.
       01 Y PIC 9(2) VALUE 0.
       01 POS PIC 9(3).
       01 ROW-OFFSET PIC S9.
       01 COLUMN-OFFSET PIC S9.
       01 NEIGHBORS PIC 9 VALUE 0.
       PROCEDURE DIVISION.
           CALL "get_http_form" USING "state" RETURNING PARAM.
	   IF PARAM = 1 THEN
	      PERFORM VARYING PARAM-COUNTER FROM 1 BY 1 UNTIL PARAM-COUNTER > 30
	         STRING "state" PARAM-COUNTER INTO PARAM-NAME
	         CALL "get_http_form" USING PARAM-NAME RETURNING PARAM-VALUE
		 COMPUTE POS = (PARAM-COUNTER - 1) * 10 + 1
		 MOVE PARAM-VALUE TO NEW-WORLD(POS:10)
	      END-PERFORM
 	  ELSE
	    MOVE "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001110000000000001010000000000001010000000000000100000000000101110000000000010101000000000000100100000000001010000000000001010000000000000000000000000000000000000000000000000000000000000000000" TO NEW-WORLD.
           PERFORM PRINT-WORLD.
           MOVE NEW-WORLD TO OLD-WORLD.
           PERFORM VARYING ROW-COUNTER FROM 1 BY 1 UNTIL ROW-COUNTER > TOTAL-ROWS
               PERFORM ITERATE-CELL VARYING COLUMN-COUNTER FROM 1 BY 1 UNTIL COLUMN-COUNTER > TOTAL-COLUMNS
	   END-PERFORM.
	   PERFORM PRINT-FORM.
           STOP RUN.
       ITERATE-CELL.
           PERFORM COUNT-NEIGHBORS.
	   COMPUTE POS = (ROW-COUNTER - 1) * TOTAL-COLUMNS + COLUMN-COUNTER.
           MOVE OLD-WORLD(POS:1) TO CELL.
           IF CELL = "1" AND NEIGHBORS < 2 THEN
               MOVE "0" TO NEW-WORLD(POS:1).
           IF CELL = "1" AND (NEIGHBORS = 2 OR NEIGHBORS = 3) THEN
               MOVE "1" TO NEW-WORLD(POS:1).
           IF CELL = "1" AND NEIGHBORS > 3 THEN
               MOVE "0" TO NEW-WORLD(POS:1).
           IF CELL = "0" AND NEIGHBORS = 3 THEN
               MOVE "1" TO NEW-WORLD(POS:1).
       COUNT-NEIGHBORS.
           MOVE 0 TO NEIGHBORS.
	   PERFORM COUNT-NEIGHBOR
	       VARYING ROW-OFFSET FROM -1 BY 1 UNTIL ROW-OFFSET > 1
	          AFTER COLUMN-OFFSET FROM -1 BY 1 UNTIL COLUMN-OFFSET > 1.
       COUNT-NEIGHBOR.
           IF ROW-OFFSET <> 0 OR COLUMN-OFFSET <> 0 THEN
               COMPUTE Y = ROW-COUNTER + ROW-OFFSET
               COMPUTE X = COLUMN-COUNTER + COLUMN-OFFSET
               IF X >= 1 AND X <= TOTAL-COLUMNS AND Y >= 1 AND Y <= TOTAL-ROWS THEN
	       	   COMPUTE POS = (Y - 1) * TOTAL-COLUMNS + X
                   MOVE OLD-WORLD(POS:1) TO CELL
		   IF CELL = "1" THEN
		      COMPUTE NEIGHBORS = NEIGHBORS + 1.
       PRINT-FORM.
           CALL "append_http_body" USING "<form name=frm1 method=POST><input type=hidden name=state value=".
	   CALL "append_http_body" USING DREW.
	   CALL "append_http_body" USING ">".
	   PERFORM VARYING PARAM-COUNTER FROM 1 BY 1 UNTIL PARAM-COUNTER > 30
    	       CALL "append_http_body" USING "<input type=hidden name=state"
	       CALL "append_http_body" USING PARAM-COUNTER
    	       CALL "append_http_body" USING " value="
	       COMPUTE POS = (PARAM-COUNTER - 1) * 10 + 1
	       MOVE NEW-WORLD(POS:10) TO PARAM-OUTPUT
	       CALL "append_http_body" USING PARAM-OUTPUT
    	       CALL "append_http_body" USING ">"
	   END-PERFORM
           CALL "append_http_body" USING "</form>".
       PRINT-WORLD.
           MOVE 0 TO DREW.
           CALL "set_http_status" USING "200".
	   CALL "append_http_body" USING "<html><body onload='setTimeout(function() { document.frm1.submit() }, 1000)'>"
	   CALL "append_http_body" USING "<style>table { background:-color: white; } td { width: 10px; height: 10px}</style>".
           CALL "append_http_body" USING "<table>".
           PERFORM PRINT-ROW VARYING ROW-COUNTER FROM 3 BY 1 UNTIL ROW-COUNTER >= TOTAL-ROWS - 1.
           CALL "append_http_body" USING "</table></body></html>".
       PRINT-ROW.
           CALL "append_http_body" USING "<tr>".
           PERFORM PRINT-CELL VARYING COLUMN-COUNTER FROM 3 BY 1 UNTIL COLUMN-COUNTER >= TOTAL-COLUMNS - 1.
           CALL "append_http_body" USING "</tr>".
       PRINT-CELL.
	   COMPUTE POS = (ROW-COUNTER - 1) * TOTAL-COLUMNS + COLUMN-COUNTER.
	   MOVE NEW-WORLD(POS:1) TO CELL.
           IF CELL = "1" THEN
	       MOVE 1 TO DREW
               CALL "append_http_body" USING "<td bgcolor=blue></td>".
           IF CELL = "0" THEN
               CALL "append_http_body" USING "<td></td>".

If you want to run your own simulation, you can do an HTTP POST with 30 parameters that, when concatenated, form the layout of the 15×20 world simulated in COBOL.
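
As a rough illustration (not part of the original deployment), here’s how such a POST might be built in Python. Reading the COBOL above, the Worker appears to expect a field named state set to 1 plus thirty ten-digit fields named state01 through state30; those field names, and the URL (the example deployment from the install steps below), are assumptions inferred from the code rather than documented parameters:

    import requests  # third-party: pip install requests

    # A 15x20 world flattened into 300 characters of "0"s and "1"s.
    world = ["0"] * 300

    # Seed a blinker: three live cells on row 10 (cell numbering inferred
    # from the COBOL: pos = (row - 1) * 15 + col).
    for col in (7, 8, 9):
        world[(10 - 1) * 15 + (col - 1)] = "1"

    form = {"state": "1"}
    for i in range(30):
        # Thirty 10-character chunks: state01 .. state30 (assumed names).
        form["state%02d" % (i + 1)] = "".join(world[i * 10:(i + 1) * 10])

    r = requests.post("https://my-first-cobol.my-cool-name.workers.dev/", data=form)
    print(r.text)  # HTML page rendering the world plus a form for the next generation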

If you want to install this yourself, take the following steps:

  1. Sign up for Cloudflare
  2. Sign up for a workers.dev subdomain. I’ve already grabbed cobol.workers.dev, but imagine you’ve managed to grab my-cool-name.workers.dev
  3. Install wrangler, Cloudflare’s CLI for deploying Workers
  4. Create a new COBOL Worker using the template

wrangler generate cobol-worker https://github.com/cloudflare/cobol-worker-template

  5. Configure wrangler.toml to point to your account and set a name for this project, let’s say my-first-cobol.

  6. Grab the files src/index.js and src/worker.cob from my repo here: https://github.com/jgrahamc/game-of-life and replace them in the cobol-worker.

  7. npm run deploy

  8. The COBOL Worker will be running at https://my-first-cobol.my-cool-name.workers.dev/

Cloudflare Dashboard and API Outage on April 15, 2020

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-dashboard-and-api-outage-on-april-15-2020/

Starting at 1531 UTC and lasting until 1952 UTC, the Cloudflare Dashboard and API were unavailable because of the disconnection of multiple, redundant fibre connections from one of our two core data centers.

This outage was not caused by a DDoS attack, or related to traffic increases caused by the COVID-19 crisis. Nor was it caused by any malfunction of software or hardware, or any misconfiguration.

What happened

As part of planned maintenance at one of our core data centers, we instructed technicians to remove all the equipment in one of our cabinets. That cabinet contained old inactive equipment we were going to retire and had no active traffic or data on any of the servers in the cabinet. The cabinet also contained a patch panel (switchboard of cables) providing all external connectivity to other Cloudflare data centers. Over the space of three minutes, the technician decommissioning our unused hardware also disconnected the cables in this patch panel.

This data center houses Cloudflare’s main control plane and database and as such, when we lost connectivity, the Dashboard and API became unavailable immediately. The Cloudflare network itself continued to operate normally and proxied customer websites and applications continued to operate. As did Magic Transit, Cloudflare Access, and Cloudflare Spectrum. All security services, such as our Web Application Firewall, continued to work normally.

But the following were not possible:

  • Logging into the Dashboard
  • Using the API
  • Making any configuration changes (such as changing a DNS record)
  • Purging cache
  • Running automated Load Balancing health checks
  • Creating Argo Tunnel connections
  • Creating or updating Cloudflare Workers
  • Transferring domains to Cloudflare Registrar
  • Accessing Cloudflare Logs and Analytics
  • Encoding videos on Cloudflare Stream
  • Logging information from edge services (customers will see a gap in log data)

No configuration data was lost as a result of the outage. Our customers’ configuration data is both backed up and replicated off-site, but neither backups nor replicas were needed. All configuration data remained in place.

How we responded

During the outage period, we worked simultaneously to cut over to our disaster recovery core data center and restore connectivity.

Dozens of engineers worked in two virtual war rooms, as Cloudflare is mostly working remotely because of the COVID-19 emergency. One room was dedicated to restoring connectivity, the other to disaster recovery failover.

We quickly failed over our internal monitoring systems so that we had visibility of the entire Cloudflare network. This gave us global control and the ability to see issues in any of our network locations in more than 200 cities worldwide. This cutover meant that Cloudflare’s edge service could continue running normally and the SRE team could deal with any problems that arose in the day to day operation of the service.

As we were working the incident, we made a decision every 20 minutes on whether to fail over the Dashboard and API to disaster recovery or to continue trying to restore connectivity. If there had been physical damage to the data center (e.g. if this had been a natural disaster) the decision to cut over would have been easy, but because we had run tests on the failover we knew that the failback from disaster recovery would be very complex and so we were weighing the best course of action as the incident unfolded.

  • At 1944 UTC the first link from the data center to the Internet came back up. This was a backup link with 10Gbps of connectivity.
  • At 1951 UTC we restored the first of four large links to the Internet.
  • At 1952 UTC the Cloudflare Dashboard and API became available.
  • At 2016 UTC the second of four links was restored.
  • At 2019 UTC the third of four links was restored.
  • At 2031 UTC fully-redundant connectivity was restored.

Moving forward

We take this incident very seriously, and recognize the magnitude of impact it had. We have identified several steps we can take to reduce the risk of these sorts of problems recurring in the future, and we plan to start working on these matters immediately:

  • Design: While the external connectivity used diverse providers and led to diverse data centers, we had all the connections going through only one patch panel, creating a single physical point of failure. This should be spread out across multiple parts of our facility.
  • Documentation: After the cables were removed from the patch panel, we lost valuable time identifying, for the data center technicians, which critical cables provided external connectivity and needed to be restored. We should take steps to ensure the various cables and panels are labeled for quick identification by anyone working to remediate the problem, and that the documentation needed to do so is quick to access.
  • Process: While sending our technicians instructions to retire hardware, we should call out clearly the cabling that should not be touched.

We will be running a full internal post-mortem to ensure that the root causes of this incident are found and addressed.

We are very sorry for the disruption.

Announcing the Results of the 1.1.1.1 Public DNS Resolver Privacy Examination

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/announcing-the-results-of-the-1-1-1-1-public-dns-resolver-privacy-examination/

Announcing the Results of the 1.1.1.1 Public DNS Resolver Privacy Examination

On April 1, 2018, we took a big step toward improving Internet privacy and security with the launch of the 1.1.1.1 public DNS resolver — the Internet’s fastest, privacy-first public DNS resolver. And we really meant privacy first. We were not satisfied with the status quo and believed that secure DNS resolution with transparent privacy practices should be the new normal. So we committed to our public resolver users that we would not retain any personal data about requests made using our 1.1.1.1 resolver. We also built in technical measures to facilitate DNS over HTTPS to help keep your DNS queries secure. We’ve never wanted to know what individuals do on the Internet, and we took technical steps to ensure we can’t know.

We knew there would be skeptics. Many consumers believe that if they aren’t paying for a product, then they are the product. We don’t believe that has to be the case. So we committed to retaining a Big 4 accounting firm to perform an examination of our 1.1.1.1 resolver privacy commitments.

Today we’re excited to announce that the 1.1.1.1 resolver examination has been completed and a copy of the independent accountants’ report can be obtained from our compliance page.

The examination process

We gained a number of observations and lessons from the privacy examination of the 1.1.1.1 resolver. First, we learned that it takes much longer to agree on terms and complete an examination when you ask an accounting firm to do what we believe is a first-of-its-kind examination of custom privacy commitments for a recursive resolver.

We also observed that privacy by design works. Not that we were surprised — we use privacy by design principles in all our products and services. Because we baked anonymization best practices into the 1.1.1.1 resolver when we built it, we were able to demonstrate that we didn’t have any personal data to sell. More specifically, in accordance with RFC 6235, we decided to truncate the client/source IP at our edge data centers so that we never store in non-volatile storage the full IP address of the 1.1.1.1 resolver user.
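
To make that truncation concrete, here’s a small Python sketch of the kind of anonymization described (illustrative only, not Cloudflare’s actual code), using the truncation lengths from the commitments listed below: the last octet of an IPv4 address and the last 80 bits of an IPv6 address are zeroed before anything is written to non-volatile storage.

    import ipaddress

    def truncate_ip(ip_str):
        ip = ipaddress.ip_address(ip_str)
        # IPv4: keep the first 24 bits (drop the last octet).
        # IPv6: keep the first 48 bits (drop the last 80 bits).
        prefix = 24 if ip.version == 4 else 48
        network = ipaddress.ip_network("%s/%d" % (ip_str, prefix), strict=False)
        return str(network.network_address)

    print(truncate_ip("203.0.113.7"))            # 203.0.113.0
    print(truncate_ip("2001:db8:1234:5678::1"))  # 2001:db8:1234::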

We knew that a truncated IP address would be enough to help us understand general Internet trends and where traffic is coming from. In addition, we also further improved our privacy-first approach by replacing the truncated IP address with the network number (the ASN) for our internal logs. On top of that, we committed to only retaining those anonymized logs for a limited period of time. It’s the privacy version of belt plus suspenders plus another belt.

Finally, we learned that aligning our examination of the 1.1.1.1 resolver with our SOC 2 report most efficiently demonstrated that we had the appropriate change control procedures and audit logs in place to confirm that our IP truncation logic and limited data retention periods were in effect during the examination period. The 1.1.1.1 resolver examination period of February 1, 2019, through October 31, 2019, was the earliest we could go back to while relying on our SOC 2 report.

Details on the examination

When we launched the 1.1.1.1 resolver, we committed that we would not track what individual users of our 1.1.1.1 resolver are searching for online. The examination validated that our system is configured to achieve what we think is the most important part of this commitment — we never write the querying IP addresses together with the DNS query to disk and therefore have no idea who is making a specific request using the 1.1.1.1 resolver. This means we don’t track which sites any individual visits, and we won’t sell your personal data, ever.

We want to be fully transparent that during the examination we uncovered that our routers randomly capture up to 0.05% of all requests that pass through them, including the querying IP address of resolver users. We do this separately from the 1.1.1.1 service for all traffic passing into our network and we retain such data for a limited period of time for use in connection with network troubleshooting and mitigating denial of service attacks.

To explain — if a specific IP address is flowing through one of our data centers a large number of times, then it is often associated with malicious requests or a botnet. We need to keep that information to mitigate attacks against our network and to prevent our network from being used as an attack vector itself. This limited subsample of data is not linked up with DNS queries handled by the 1.1.1.1 service and does not have any impact on user privacy.

We also want to acknowledge that when we made our privacy promises about how we would handle non-personally identifiable log data for 1.1.1.1 resolver requests, we made what we now see were some confusing statements about how we would handle those anonymous logs.

For example, we learned that our blog post commitment about retention of anonymous log data was not written clearly enough: we referred to temporary logs, transactional logs, and permanent logs in ways that could have been better defined. Our 1.1.1.1 resolver privacy FAQs stated that we would not retain transactional logs for more than 24 hours but that some anonymous logs would be retained indefinitely. However, our blog post announcing the public resolver didn’t capture that distinction. You can see a clearer statement about our handling of anonymous logs on our privacy commitments page mentioned below.

With this in mind, we updated and clarified our privacy commitments for the 1.1.1.1 resolver as outlined below. The most critical part of these commitments remains unchanged: We don’t want to know what you do on the Internet — it’s none of our business — and we’ve taken the technical steps to ensure we can’t.

Our 1.1.1.1 public DNS resolver commitments

We have refined our commitments to 1.1.1.1 resolver privacy as part of our examination effort. The nature and intent of our commitments remain consistent with our original commitments. These updated commitments are what was included in the examination:

  1. Cloudflare will not sell or share public resolver users’ personal data with third parties or use personal data from the public resolver to target any user with advertisements.
  2. Cloudflare will only retain or use what is being asked, not information that will identify who is asking it. Except for randomly sampled network packets captured from at most 0.05% of all traffic sent to Cloudflare’s network infrastructure, Cloudflare will not retain the source IP from DNS queries to the public resolver in non-volatile storage (more on that below). The randomly sampled packets are solely used for network troubleshooting and DoS mitigation purposes.
  3. A public resolver user’s IP address (referred to as the client or source IP address) will not be stored in non-volatile storage. Cloudflare will anonymize source IP addresses via IP truncation methods (last octet for IPv4 and last 80 bits for IPv6). Cloudflare will delete the truncated IP address within 25 hours.
  4. Cloudflare will retain only the limited transaction and debug log data (“Public Resolver Logs”) for the legitimate operation of our Public Resolver and research purposes, and Cloudflare will delete the Public Resolver Logs within 25 hours.
  5. Cloudflare will not share the Public Resolver Logs with any third parties except for APNIC pursuant to a Research Cooperative Agreement. APNIC will only have limited access to query the anonymized data in the Public Resolver Logs and conduct research related to the operation of the DNS system.

Proving privacy commitments

We created the 1.1.1.1 resolver because we recognized significant privacy problems: ISPs, WiFi networks you connect to, your mobile network provider, and anyone else listening in on the Internet can see every site you visit and every app you use — even if the content is encrypted. Some DNS providers even sell data about your Internet activity or use it to target you with ads. DNS can also be used as a tool of censorship against many of the groups we protect through our Project Galileo.

If you use DNS-over-HTTPS or DNS-over-TLS to our 1.1.1.1 resolver, your DNS lookup request will be sent over a secure channel. This means that if you use the 1.1.1.1 resolver then in addition to our privacy guarantees an eavesdropper can’t see your DNS requests. We promise we won’t be looking at what you’re doing.

We strongly believe that consumers should expect their service providers to be able to show proof that they are actually abiding by their privacy commitments. If we were able to have our 1.1.1.1 resolver privacy commitments examined by an independent accounting firm, we think other organizations can do the same. We encourage other providers to follow suit and help improve privacy and transparency for Internet users globally. And for our part, we will continue to engage well-respected auditing firms to audit our 1.1.1.1 resolver privacy commitments. We also appreciate the work that Mozilla has undertaken to encourage entities that operate recursive resolvers to adopt data handling practices that protect the privacy of user data.

Details of the 1.1.1.1 resolver privacy examination and our accountant’s opinion can be found on Cloudflare’s Compliance page.

Visit https://developers.cloudflare.com/1.1.1.1/ from any device to get started with the Internet’s fastest, privacy-first DNS service.

PS Cloudflare has traditionally used tomorrow, April 1, to release new products. Two years ago we launched the 1.1.1.1 free, fast, privacy-focused public DNS resolver. One year ago we launched Warp, our way of securing and accelerating mobile Internet access.

And tomorrow?

Then three key changes
One before the weft, also
Safety to the roost

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/covid-19-impacts-on-internet-traffic-seattle-italy-and-south-korea/

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

The last few weeks have seen unprecedented changes in how people live and work around the world. Over time more and more companies have given their employees the right to work from home, restricted business travel and, in some cases, outright sent their entire workforce home. In some countries, quarantines are in place keeping people restricted to their homes.

These changes in daily life are showing up as changes in patterns of Internet use around the world. In this blog post I take a look at changing patterns in northern Italy, South Korea and the Seattle area of Washington state.

Seattle

To understand how Internet use is changing, it’s first helpful to start with what a normal pattern looks like. Here’s a chart of traffic from our Dallas point of presence in the middle of January 2020.

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

This is a pretty typical pattern. If you look carefully you can see that Internet use is down a little at the weekend and that Internet usage is diurnal: Internet use drops down during the night and then picks up again in the morning. The peaks occur at around 2100 local time and the troughs in the dead of night at around 0300. This sort of pattern repeats worldwide with the only real difference being whether a peak occurs in the early morning (at work) or evening (at home).

Now here’s Seattle in the first week of January this year. I’ve zoomed in to a single week so we see a little more of the bumpiness of traffic during the day but it’s pretty much the same story.

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

Now let’s zoom out to the time period January 15 to March 12. Here’s what the chart looks like for traffic coming from Cloudflare’s Seattle PoP over that period (the gaps in the chart are just missing data in the measurement tool I’m using).

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

Focus in on the beginning of the chart. Looks like the familiar diurnal pattern with quieter weekends. But around January 30 something changes. There’s a big spike of traffic and traffic stays elevated. The weekends aren’t so clear either. The first reported case of COVID-19 was on January 21 in the Seattle area.

Towards the end of February, the first deaths occurred in Washington state. In early March employees of Facebook, Microsoft and Amazon in the Seattle area were all confirmed to be infected. At this point, employers began encouraging or requiring their staff to work from home. If you focus on the last part of the chart and compare it with the first, two things stand out: Internet usage has grown greatly and the nighttime troughs are less evident. People seem to be using the Internet more and for more hours.

Throughout the period there are also days with double spikes of traffic. If I zoom into the period March 5 to March 12 it’s interesting to compare with the week in January above.

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

Firstly, traffic is up about 40% and nighttime troughs are now above the levels seen in January during the day. The traffic is also spiky and continues through the weekend at similar levels to the week.

Next we can zoom in on traffic to residential ISPs in the Seattle area. Here’s a chart showing the first three days of this week (March 9 to March 11) compared to Monday to Wednesday a month prior in early February (February 10 to February 12).

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

Traffic to residential ISPs appears to be up about 5% month on month during the work day. We might have expected this to be higher given the number of local companies asking employees to work from home but many appear to be using VPNs that route all Internet traffic back through the corporate gateway.

Northern Italy

Turning to Italy, and in particular the north of the country, where there has been a serious outbreak of COVID-19 leading first to a local quarantine and then a national one. Most of the traffic in northern Italy is served from our Milan point of presence.

For reference here’s what traffic looked like the first week in January.

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

A familiar pattern with peak traffic typically in the evening. Here’s traffic for March 5 to 12.

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

Traffic has grown by more than 30% with Internet usage up at all hours of the day and night. Another change that’s a little harder to see is that traffic is ramping up earlier in the morning than in early January. In early January traffic started rising rapidly at 0900 UTC and reached the daytime plateaus you see above around 1400 UTC. In March, we see the traffic jump up more rapidly at 0900 UTC and reach a first plateau before tending to jump up again.

Drilling into the types of domains that Italians are accessing we see changes in how people are using the Internet. Online chat systems are up 1.3x to 3x of normal usage. Video streaming appears to have roughly doubled. People are accessing news and information websites about 30% to 60% more and online gaming is up about 20%.

One final look at northern Italy. Here’s the period that covers the introduction of the first cordon sanitaire in communes in the north.

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

The big spike of traffic is the evening of Monday, February 24 when the first cordons sanitaire came into full effect.

South Korea

Here’s the normal traffic pattern in Seoul, South Korea using the first week of January as an example of what traffic looked like before the outbreak of COVID-19:

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

And here’s March 5 to 12 for comparison:

COVID-19 impacts on Internet traffic: Seattle, Northern Italy and South Korea

There’s no huge change in traffic patterns other than that Internet traffic seen by Cloudflare is up about 5%.

Digging into the websites and APIs that people are accessing in South Korea shows some significant changes: traffic to websites offering anime streaming up over 2x, online chat up 1.2x to 1.8x and online gaming up about 30%.

In both northern Italy and South Korea traffic associated with fitness trackers is down, perhaps reflecting that people are unable to take part in their usual exercise, sports and fitness activities.

Conclusion

Cloudflare is watching carefully as Internet traffic patterns around the world alter as people alter their daily lives through home-working, cordon sanitaire, and social distancing. None of these traffic changes raise any concern for us. Cloudflare’s network is well provisioned to handle significant spikes in traffic. We have not seen, and do not anticipate, any impact on our network’s performance, reliability, or security globally.

This holiday’s biggest online shopping day was… Black Friday

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/this-holidays-biggest-online-shopping-day-was-black-friday/

This holiday's biggest online shopping day was... Black Friday

What’s the biggest day of the holiday season for holiday shopping? Black Friday, the day after US Thanksgiving, has been embraced globally as the day retail stores announce their sales. But it was believed that the following Monday, dubbed “Cyber Monday,” may be even bigger. Or, with the explosion of reliable 2-day and even 1-day shipping, maybe another day closer to Christmas has taken the crown. At Cloudflare, we aimed to answer this question for the 2019 holiday shopping season.

Black Friday was the biggest online shopping day but the second biggest wasn’t Cyber Monday… it was Thanksgiving Day itself (the day before Black Friday!). Cyber Monday was the fourth biggest day.

Here’s a look at checkout events seen across Cloudflare’s network since before Thanksgiving in the US.

This holiday's biggest online shopping day was... Black Friday
Checkout events as a percentage of checkouts on Black Friday

The weekends are shown in yellow and Black Friday and Cyber Monday are shown in green. You can see that checkouts ramped up during Thanksgiving week and then continued through the weekend into Cyber Monday.

Black Friday had twice the number of checkouts as the preceding Friday and the entire Thanksgiving week dominates. Post-Cyber Monday, no day reached 50% of the number of checkouts we saw on Black Friday. And Cyber Monday was just 60% of Black Friday.

So, Black Friday is the peak day but Thanksgiving Day is the runner up. Perhaps it deserves its own moniker: Thrifty Thursday anyone?

Checkouts occur more frequently from Monday to Friday and then drop off over the weekend.  After Cyber Monday only one other day showed an interesting peak. Looking at last week it does appear that Tuesday, December 17 was the pre-Christmas peak for online checkouts. Perhaps fast online shipping made consumers feel they could use online shopping as long as they got their purchases by the weekend before Christmas.

Happy Holidays from everyone at Cloudflare!

Talk Transcript: How Cloudflare Thinks About Security

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/talk-transcript-how-cloudflare-thinks-about-security/

Talk Transcript: How Cloudflare Thinks About Security
Image courtesy of Unbabel

Talk Transcript: How Cloudflare Thinks About Security

This is the text I used for a talk at artificial intelligence powered translation platform, Unbabel, in Lisbon on September 25, 2019.

Bom dia. Eu sou John Graham-Cumming o CTO do Cloudflare. E agora eu vou falar em inglês. (Good morning. I’m John Graham-Cumming, Cloudflare’s CTO. And now I’m going to speak in English.)

Thanks for inviting me to talk about Cloudflare and how we think about security. I’m about to move to Portugal permanently so I hope I’ll be able to do this talk in Portuguese in a few months.

I know that most of you don’t have English as a first language so I’m going to speak a little more deliberately than usual. And I’ll make the text of this talk available for you to read.

But there are no slides today.

I’m going to talk about how Cloudflare thinks about internal security, how we protect ourselves and how we secure our day to day work. This isn’t a talk about Cloudflare’s products.

Culture

Let’s begin with culture.

Many companies have culture statements. I think almost 100% of these are pure nonsense. Culture is how you act every day, not words written on the wall.

One significant piece of company culture is the internal Security Incident mailing list which anyone in the company can send a message to. And they do! So far this month there have been 55 separate emails to that list reporting a security problem.

These mails come from all over the company, from every department. Two to three per day. And each mail is investigated by the internal security team. Each mail is assigned a Security Incident issue in our internal Atlassian Jira instance.

People send: reports that their laptop or phone has been stolen (their credentials get immediately invalidated), suspicions about a weird email that they’ve received (it might be phishing or malware in an attachment), a concern about physical security (for example, someone wanders into the office and starts asking odd questions), that they clicked on a bad link, that they lost their access card, and, occasionally, a security concern about our product.

Things like stolen or lost laptops and phones happen way more often than you’d imagine. We seem to lose about two per month. For that reason and many others we use full disk encryption on devices, complex passwords and two factor auth on every service employees need to access. And we discourage anyone from storing anything on their laptop and ask them to primarily use cloud apps for work. Plus we centrally manage machines and can remote wipe.

We have a 100% blame free culture. You clicked on a weird link? We’ll help you. Lost your phone? We’ll help you. Think you might have been phished? We’ll help you.

This has led to a culture of reporting problems, however minor, when they occur. It’s our first line of internal defense.

Just this month I clicked on a link that sent my web browser crazy hopping through redirects until I ended up at a bad place. I reported that to the mailing list.

I’ve never worked anywhere with such a strong culture of reporting security problems big and small.

Hackers

We also use HackerOne to let people report security problems from the outside. This month we’ve received 14 reports of security problems. To be honest, most of what we receive through HackerOne is very low priority. People run automated scanning tools and report the smallest of configuration problems, or, quite often, things that they don’t understand but that look like security problems to them. But we triage and handle them all.

And people do on occasion report things that we need to fix.

We also have a private paid bug bounty program where we work with a group of individual hackers (around 150 right now) who get paid for the vulnerabilities that they’ve found.

We’ve found that this combination of a public responsible disclosure program and then a private paid program is working well. We invite the best hackers who come in through the public program to work with us closely in the private program.

Identity

So, that’s all about people, internal and external, reporting problems, vulnerabilities, or attacks. A very short step from that is knowing who the people are.

And that’s where identity and authentication become critical. In fact, as an industry trend identity management and authentication are one of the biggest areas of spending by CSOs and CISOs. And Cloudflare is no different.

OK, well it is different: instead of spending a lot on identity and authentication we’ve built our own solutions.

We did not always have good identity practices. In fact, for many years our systems had different logins and passwords and it was a complete mess. When a new employee started, accounts had to be made on Google for email and calendar, on Atlassian for Jira and Wiki, on the VPN, on the WiFi network and then on a myriad of other systems for the blog, HR, SSH, build systems, etc. etc.

And when someone left all that had to be undone. And frequently this was done incorrectly. People would leave and accounts would still be left running for a period of time. This was a huge headache for us and is a huge headache for literally every company.

If I could tell companies one thing they can do to improve their security it would be: sort out identity and authentication. We did and it made things so much better.

This makes the process of bringing someone on board much smoother and the same when they leave. We can control who accesses what systems from a single control panel.

I have one login via a product we built called Cloudflare Access and I can get access to pretty much everything. I looked in my LastPass Vault while writing this talk and there are a total of just five username and password combinations, and two of those needed deleting because we’ve migrated those systems to Access.

So, yes, we use password managers. And we lock down everything with high quality passwords and two factor authentication. Everyone at Cloudflare has a Yubikey and access to TOTP (such as Google Authenticator). There are three golden rules: all passwords should be created by the password manager, all authentication has to have a second factor and the second factor cannot be SMS.

We had great fun rolling out Yubikeys to the company because we did it during our annual retreat in a single company wide sitting. Each year Cloudflare gets the entire company together (now over 1,000 people) in a hotel for two to three days of working together, learning from outside experts and physical and cultural activities.

Last year the security team gave everyone a pair of physical security tokens (a Yubikey and a Titan Key from Google for Bluetooth) and in an epic session configured everyone’s accounts to use them.

Note: do not attempt to get 500 people to sync Bluetooth devices in the same room at the same time. Bluetooth cannot cope.

Another important thing we implemented is automatic timeout of access to a system. If you don’t use access to a system you lose it. That way we don’t have accounts that might have access to sensitive systems that could potentially be exploited.

Openness

To return to the subject of Culture for a moment an important Cloudflare trait is openness.

Some of you may know that back in 2017 Cloudflare had a horrible bug in our software that became called Cloudbleed. This bug leaked memory from inside our servers into people’s web browsing. Some of that web browsing was being done by search engine crawlers and ended up in the caches of search engines like Google.

We had to do two things: stop the actual bug (this was relatively easy and was done in under an hour) and then clean up the equivalent of an oil spill of data. That took longer (about a week to ten days) and was very complicated.

But from the very first night when we were informed of the problem we began documenting what had happened and what we were doing. I opened an EMACS buffer in the dead of night and started keeping a record.

That record turned into a giant disclosure blog post that contained the gory details of the error we made, its consequences and how we reacted once the error was known.

We followed up a few days later with a further long blog post assessing the impact and risk associated with the problem.

This approach to being totally open ended up being a huge success for us. It increased trust in our product and made people want to work with us more.

I was on my way to Berlin to give a talk to a large retailer about Cloudbleed when I suddenly realized that the company I was giving the talk at was NOT a customer. And I asked the salesperson I was with what I was doing.

I walked in to their 1,000 person engineering team all assembled to hear my talk. Afterwards the VP of Engineering thanked me saying that our transparency had made them want to work with us rather than their current vendor. My talk was really a sales pitch.

Similarly, at RSA last year I gave a talk about Cloudbleed and a very large company’s CSO came up and asked to use my talk internally to try to encourage their company to be so open.

When on July 2 this year we had an outage, which wasn’t security related, we once again blogged in incredible detail about what happened. And once again we heard from people about how our transparency mattered to them.

The lesson is that being open about mistakes increases trust. And if people trust you then they’ll tend to tell you when there are problems. I get a ton of reports of potential security problems via Twitter or email.

Change

After Cloudbleed we started changing how we write software. Cloudbleed was caused, in part, by the use of memory-unsafe languages. In that case it was C code that could run past the end of a buffer.

We didn’t want that to happen again and so we’ve prioritized languages where that simply cannot happen. Such as Go and Rust. We were very well known for using Go. If you’ve ever visited a website on Cloudflare, or used an app that uses us for its API (and you have, because of our scale), then you’ve first done a DNS query to one of our servers.

That DNS query will have been responded to by a Go program called RRDNS.

There’s also a lot of Rust being written at Cloudflare and some of our newer products are being created using it. For example, Firewall Rules which do arbitrary filtering of requests to our customers are handled by a Rust program that needs to be low latency, stable and secure.

Security is a company wide commitment

The other post-Cloudbleed change was that any crashes on our machines came under the spotlight from the very top. If a process crashes I personally get emailed about it. And if the team doesn’t take those crashes seriously they get me poking at them until they do.

We missed the fact that Cloudbleed was crashing our machines and we won’t let that happen again. We use Sentry to correlate information about crashes and the Sentry output is one of the first things I look at in the morning.

Which, I think, brings up an important point. I spoke earlier about our culture of “If you see something weird, say something” but it’s equally important that security comes from the top down.

Our CSO, Joe Sullivan, doesn’t report to me, he reports to the CEO. That sends a clear message about where security sits in the company. But, also, the security team itself isn’t sitting quietly in the corner securing everything.

They are setting standards, acting as trusted advisors, and helping deal with incidents. But their biggest role is to be a source of knowledge for the rest of the company. Everyone at Cloudflare plays a role in keeping us secure.

You might expect me to have access to all our systems, a passcard that gets me into any room, a login for any service. But the opposite is true: I don’t have access to most things. I don’t need it to get my job done and so I don’t have it.

This makes me a less attractive target for hackers, and we apply the same rule to everyone. If you don’t need access for your job you don’t get it. That’s made a lot easier by the identity and authentication systems and by our rule about timing out access if you don’t use a service. You probably didn’t need it in the first place.

The flip side of all of us owning security is that deliberately doing the wrong thing has severe consequences.

Making a mistake is just fine. The person who wrote the bad line of code that caused Cloudbleed didn’t get fired, the person who wrote the bad regex that brought our service to a halt on July 2 is still with us.

Detection and Response

Naturally, things do go wrong internally. Things that didn’t get reported. To deal with them we need to detect problems quickly. This is an area where the security team does have real expertise and data.

We do this by collecting data about how our endpoints (my laptop, a company phone, servers on the edge of our network) are behaving. And this is fed into a homebuilt data platform that allows the security team to alert on anomalies.

It also allows them to look at historical data in case of a problem that occurred in the past, or to understand when a problem started.

Initially the team was going to use a commercial data platform or SIEM but they quickly realized that these platforms are incredibly expensive and they could build their own at a considerably lower price.

Also, Cloudflare handles a huge amount of data. When you’re looking at operating system level events on machines in 194 cities, plus every employee’s device, you’re dealing with a huge stream. And the commercial data platforms love to charge by the size of that stream.

We are integrating internal DNS data, activity on individual machines, network netflow information, badge reader logs and operating system level events to get a complete picture of what’s happening on any machine we own.

When someone joins Cloudflare they travel to our head office in San Francisco for a week of training. Part of that training involves getting their laptop and setting it up and getting familiar with our internal systems and security.

During one of these orientation weeks a new employee managed to download malware while setting up their laptop. Our internal detection systems spotted this happening and the security team popped over to the orientation room and helped the employee get a fresh laptop.

The time between the malware being downloaded and detected was about 40 minutes.

If you don’t want to build something like this yourself, take a look at Google’s Chronicle product. It’s very cool.

One really rich source of data about your organization is DNS. For example, you can often spot malware just by the DNS queries it makes from a machine. If you do one thing then make sure all your machines use a single DNS resolver and get its logs.

Edge Security

In some ways the most interesting part of Cloudflare is the least interesting from a security perspective. Not because there aren’t great technical challenges to securing machines in 194 cities but because some of the more apparently mundane things I’ve talked about have such huge impact.

Identity, Authentication, Culture, Detection and Response.

But, of course, the edge needs securing. And it’s a combination of physical data center security and software.

To give you one example let’s talk about SSL private keys. Those keys need to be distributed to our machines so that when an SSL connection is made to one of our servers we can respond. But SSL private keys are… private!

And we have a lot of them. So we have to distribute private key material securely. This is a hard problem. We encrypt the private keys while at rest and in transport with a separate key that is distributed to our edge machines securely.

Access to that key is tightly controlled so that no one can start decrypting keys in our database. And if our database leaked then the keys couldn’t be decrypted since the key needed is stored separately.

And that key is itself GPG encrypted.

But wait… there’s more!

We don’t actually want to have decrypted keys stored in any process that is accessible from the Internet. So we use a technology called Keyless SSL where the keys are kept by a separate process and accessed only when needed to perform operations.

And Keyless SSL can run anywhere. For example, it doesn’t have to be on the same machine as the machine handling an SSL connection. It doesn’t even have to be in the same country. Some of our customers make use of that to specify where their keys are distributed to.

Use Cloudflare to secure Cloudflare

One key strategy of Cloudflare is to eat our own dogfood. If you’ve not heard that term before it’s quite common in the US. The idea is that if you’re making food for dogs you should be so confident in its quality that you’d eat it yourself.

Cloudflare does the same for security. We use our own products to secure ourselves. But more than that if we see that there’s a product we don’t currently have in our security toolkit then we’ll go and build it.

Since Cloudflare is a cybersecurity company we face the same challenges as our customers, but we can also build our way out of those challenges. In  this way, our internal security team is also a product team. They help to build or influence the direction of our own products.

The team is also a Cloudflare customer using our products to secure us and we get feedback internally on how well our products work. That makes us more secure and our products better.

Our customers’ data is more precious than ours

The data that passes through Cloudflare’s network is private and often very personal. Just think of your web browsing or app use. So we take great care of it.

We’re handling that data on behalf of our customers. They are trusting us to handle it with care and so we think of it as more precious than our own internal data.

Of course, we secure both because the security of one is related to the security of the other. But it’s worth thinking about the data you have that, in a way, belongs to your customer and is only in your care.

Finally

I hope this talk has been useful. I’ve tried to give you a sense of how Cloudflare thinks about security and operates. We don’t claim to be the ultimate geniuses of security and would love to hear your thoughts, ideas and experiences so we can improve.

Security is not static and requires constant attention and part of that attention is listening to what’s worked for others.

Thank you.

Cleaning up bad bots (and the climate)

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cleaning-up-bad-bots/

Cleaning up bad bots (and the climate)

From the very beginning Cloudflare has been stopping malicious bots from scraping websites, or misusing APIs. Over time we’ve improved our bot detection methods and deployed large machine learning models that are able to distinguish real traffic (be it from humans or apps) from malicious bots. We’ve also built a large catalog of good bots to detect things like helpful indexing by search engines.

But it’s not enough. Malicious bots continue to be a problem on the Internet and we’ve decided to fight back. From today customers have the option of enabling “bot fight mode” in their Cloudflare Dashboard.

Cleaning up bad bots (and the climate)

Once enabled, when we detect a bad bot, we will do three things: (1) we’re going to disincentivize the bot maker economically by tarpitting them, including requiring them to solve a computationally intensive challenge that will require more of their bot’s CPU; (2) for Bandwidth Alliance partners, we’re going to hand the IP of the bot to the partner and get the bot kicked offline; and (3) we’re going to plant trees to make up for the bot’s carbon cost.

Cleaning up bad bots (and the climate)

Malicious bots harm legitimate web publishers and applications, hurt hosting providers by misusing resources, and they doubly hurt the planet through the cost of electricity for servers and cooling for their bots and their victims.

Enough is enough. Our goal is nothing short of making it no longer viable to run a malicious bot on the Internet. And we think, with our scale, we can do exactly that.

How Cloudflare Detects Bots

Cloudflare’s secret sauce (ok, not very secret sauce) is our vast scale.  We currently handle traffic for over 20 million Internet properties ranging from the smallest personal web sites, through backend APIs for popular apps and IoT devices, to some of the best known names on the Internet (including 10% of the Fortune 1000).

This scale gives us a huge advantage in that we see an enormous amount and variety of traffic allowing us to build large machine learning models of Internet behavior. That scale and variety allows us to test new rules and models quickly and easily.

Our bot detection breaks down into four large components:

  • Identification of well known legitimate bots;
  • Hand written rules for simple bots that, however simple, get used day in, day out;
  • Our Bot Activity Detector model that spots the behavior of bots based on past traffic and blocks them; and
  • Our Trusted Client model that spots whether an HTTP User-Agent is what it says it is.

In addition, Gatebot, our DDoS mitigation system, fingerprints DDoS bots and blocks their traffic at the packet level. Beyond Gatebot, customers also have access to our Firewall Rules where they can write granular rules to block very specific attack types.

Another model allows us to determine whether an IP address belongs to a VPN endpoint, a home broadband subscriber, a company using NAT or a hosting or cloud provider. It’s this last group that “Bot Fight Mode” targets.

Today, Cloudflare challenges over 3 billion bot requests per day. Some of those bots are about to have a really bad time.

How Cloudflare Fights Bots

The cost of launching a bot attack is largely the cost of the CPU time that powers it. If our models show that the traffic is coming from a bot, and it’s on a hosting or cloud provider, we’ll deploy CPU-intensive code to make the bot writer expend more CPU and slow them down. By forcing the attacker to use more CPU, we increase their costs during an attack and deter future ones.

This is one of the many so-called “tarpitting” techniques we’re now deploying across our network to change the economics of running a malicious bot. Malicious bot operators be warned: if you target resources behind Cloudflare’s IP space, we’re going to make you spin your wheels.
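
To make the idea concrete, here is a minimal sketch of what a hashcash-style, CPU-intensive challenge could look like. It is purely illustrative: the function names and difficulty parameter are mine, and this is not how Cloudflare’s actual challenge works.

    import hashlib
    import os
    import time

    def make_challenge(difficulty_bits=20):
        # The server hands out a random nonce; the client must find a counter such
        # that SHA-256(nonce + counter) starts with `difficulty_bits` zero bits.
        return os.urandom(16), difficulty_bits

    def solve(nonce, difficulty_bits):
        # The brute-force search the bot has to pay for in CPU time.
        target = 1 << (256 - difficulty_bits)
        counter = 0
        while True:
            digest = hashlib.sha256(nonce + counter.to_bytes(8, "big")).digest()
            if int.from_bytes(digest, "big") < target:
                return counter
            counter += 1

    def verify(nonce, difficulty_bits, answer):
        # Verification is a single hash, so it stays cheap for the server.
        digest = hashlib.sha256(nonce + answer.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

    nonce, bits = make_challenge()
    start = time.perf_counter()
    answer = solve(nonce, bits)
    print(f"solved in {time.perf_counter() - start:.1f}s, valid: {verify(nonce, bits, answer)}")

The asymmetry is the point: solving costs the client on the order of a million hash computations, while verifying costs the server exactly one.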

Every minute we tie malicious bots up is a minute they’re not harming the Internet as a whole. This means we aren’t just protecting our customers but everyone online currently terrorized by malicious bots. The spirit of Cloudflare’s Birthday Week has always been about giving back to the Internet as a whole, and we can think of no better gift than ridding the Internet of malicious bots.

Beyond just wasting bots’ time, we also want to get them shut down. If the infrastructure provider hosting the bot is part of the Bandwidth Alliance, we’ll share the bot’s IP address so they can shut down the bot completely. The Bandwidth Alliance allows us to reduce transit costs with partners and, with this launch, also helps us work together with them to make the Internet safer for legitimate users.

Generally, everyone we ran Bot Fight Mode by thought it was a great idea. The only objection we heard was that as we start forcing bots to solve CPU intensive challenges in the short term, before they just give up — which we think is inevitable in the long term — we may raise carbon emissions. To combat those emissions we’re committed to estimating the extra CPU utilized by these bots, calculating their carbon cost, and then planting trees to compensate and build a better future.

Planting Trees

Dealing with climate change requires multiple efforts by people and companies. Cloudflare announced earlier this year that we had expanded our purchasing of Renewable Energy Certificates (that previously covered our North American operations) to our entire global network of 194 cities.

To figure out how much tree planting we need to do we need to calculate the cost of the extra CPU used when making a bot work hard. Here’s how that will work.

Using a figure of 450 kg CO2/year (from https://www.goclimateneutral.org/blog/the-carbon-footprint-of-servers/) for the type of server that a bad bot might use (a cloud server using a non-renewable energy source), we get about 8 kg CO2/year per CPU core. We are able to measure the time bots spend burning CPU, so we can directly estimate the amount of CO2 emitted by our fight back.

According to One Tree Planted, a single mature tree can absorb about 21kg CO2/year. So, very roughly, each tree can absorb a year’s worth of CO2 from 2.5 CPU cores.

Since trees take time to mature, and given the scale of the climate change challenge, we’re going to pay to overplant. For every tree we calculate we’d need to plant to sequester the CO2 emitted while fighting bots, we’re going to donate $25 to One Tree Planted, enough to plant 25 trees.
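
Putting those numbers together, the arithmetic looks roughly like the sketch below. The 450 kg and 21 kg figures come from the sources above; the per-server core count is an assumption of mine, chosen only so the result reproduces the ~8 kg CO2 per core per year estimate.

    import math

    SERVER_CO2_KG_PER_YEAR = 450   # goclimateneutral.org figure for a cloud server on non-renewable energy
    CORES_PER_SERVER = 56          # assumed core count, chosen to give roughly 8 kg CO2 per core per year
    TREE_ABSORBS_KG_PER_YEAR = 21  # One Tree Planted figure for a mature tree
    OVERPLANT_FACTOR = 25          # we donate enough to plant 25 trees for every tree required

    def trees_to_fund(bot_core_years):
        # Estimate how many trees to fund for a given amount of tarpitted bot CPU time.
        co2_per_core_year = SERVER_CO2_KG_PER_YEAR / CORES_PER_SERVER      # ~8 kg per core per year
        emitted_kg = bot_core_years * co2_per_core_year
        trees_required = math.ceil(emitted_kg / TREE_ABSORBS_KG_PER_YEAR)  # ~1 tree per 2.5 core-years
        return trees_required * OVERPLANT_FACTOR

    print(trees_to_fund(10))   # 10 core-years of bot CPU -> 100 trees funded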

And, of course, we’ll be handing the IPs of bad bots to our Bandwidth Alliance partners to get the bots shut down and remove their carbon cost completely. In the past, the tech community largely defeated email spammers and DDoS-for-hire services by making their efforts fruitless, and we think this is the right strategy to now defeat malicious bots once and for all.

Who Do Bots Hurt?

Malicious bots can cause significant harm to our customers’ infrastructure and often result in bad experiences for our customers’ users.

For example, a recent customer was being crippled by a credential stuffing attack that was not only attempting to compromise their users’ accounts but was doing so in such significant volume that it was effectively causing a small-scale Denial of Service against every part of the customer’s website.

The malicious bot was overloading the customer’s conventional threat prevention infrastructure and we rapidly onboarded them as an Under Attack customer. As a part of the onboarding, we identified that the attack could be specifically thwarted using our Bot Management product while not impacting any legitimate user traffic.

Another trend we have seen is the increase of the combination of bots with botnets, particularly in the world of inventory hoarding bots. The motivation and willingness to spend for these bot operators is quite high.

The targets are generally goods in limited supply and in high demand, and therefore high in value. Think sneakers, concert tickets, airline seats, and popular short-run Broadway musicals. Bot operators who are able to purchase those items at retail can charge massive premiums in aftermarket sales. When an operator identifies a target site, such as an ecommerce retailer, and a specific item, such as a new pair of sneakers going on sale, they can purchase time on the new Residential Proxy as a Service market to gain access to end-user machines and (relatively) clean IPs from which to launch their attack.

They then use sophisticated techniques and triggers to vary the characteristics of the machine, network, and software generating the attack across a very wide array of options and combinations, thwarting systems that rely on repetition or known patterns. This type of attack hurts multiple parties: the ecommerce site ends up with genuinely frustrated users who can’t purchase the in-demand item; the real users lose out on inventory to an attacker who is only there to skim off the largest profit possible; and the unwitting users who are part of the botnet have their resources, such as their home broadband connections, used without their consent or knowledge.

The bottom line is that bots hurt companies and their customers.

Summary

Cloudflare has fought malicious bots from the very beginning and over time has deployed more and more sophisticated methods to block them. Using the power of the over 20 million Internet properties we protect and accelerate, and our visibility into networks and users around the world, we have built machine learning models that separate the good bots from the bad and block the bad ones.

But bots continue to be a problem, and our new bot fight mode will directly disincentivize bot writers from attacking our customers. At the same time, we don’t want to contribute to climate change, so we are offsetting the carbon cost of bots by planting trees to absorb carbon and help build a better future (and Internet).

Cleaning up bad bots (and the climate)

Details of the Cloudflare outage on July 2, 2019

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/404/

About nine years ago, Cloudflare was a tiny company and I was a customer, not an employee. Cloudflare had been around for just a month. One day I got an alert that DNS for my little website, jgc.org, had stopped working. Cloudflare had made a change to its use of Protocol Buffers, and that change had broken DNS.

I emailed Matthew Prince directly with the subject line “Where’s my dns?” and he responded with a long, detailed, technical explanation (you can read the full email exchange here), to which I replied:

From: John Graham-Cumming
Date: Thu, Oct 7, 2010 at 9:14 AM
Subject: Re: Where’s my dns?
To: Matthew Prince

Great report, thanks. I’ll definitely call if there’s a
problem.  It would probably be worth writing all this up in a blog
post once you have all the technical details, because I think
customers really appreciate openness and honesty about these things.
You could also illustrate the increase in traffic after the
implementation with some charts.

I have pretty solid monitoring set up for my sites, so I get an SMS when
anything goes down.  My data shows the site was unavailable from 13:03:07 to
14:04:12.  The tests run every five minutes.

It was only a minor blip and I’m sure you’ll learn from it.  But are you sure
you don’t need someone in Europe? :-)

To which he replied:

From: Matthew Prince
Date: Thu, Oct 7, 2010 at 9:57 AM
Subject: Re: Where’s my dns?
To: John Graham-Cumming

Thanks. We’ve replied to everyone who got in touch. I’m on my way
to the office now and we’ll put something on the blog or pin an
official post to the top of our bulletin board system. I agree 100%
that transparency is the right way to go.

And so, today, as an employee of a much larger Cloudflare, here I am providing that transparency by writing about our mistakes, their impact, and what we are doing about them.

The events of July 2

On July 2, we deployed a new rule in our WAF Managed Rules that caused CPUs to become exhausted on every CPU core that handles HTTP/HTTPS traffic on the Cloudflare network worldwide. We are constantly improving WAF Managed Rules to respond to new vulnerabilities and threats. In May, for example, we used a rapid WAF update to deploy a rule protecting against a serious SharePoint vulnerability. Being able to deploy rules quickly and globally is a critical feature of our WAF.

Unfortunately, last Tuesday’s update contained a regular expression that triggered enormous backtracking and exhausted the CPUs used for HTTP/HTTPS processing. This brought down Cloudflare’s core proxying, CDN and WAF functionality. The following graph shows CPUs dedicated to HTTP/HTTPS traffic spiking to nearly 100% usage across the servers in our network.

CPU utilization in one of our PoPs during the incident

As a result, our customers (and their customers) were served a 502 error page when visiting any Cloudflare domain. The 502 errors were generated by the Cloudflare web servers, which still had CPU cores available but could not reach the processes that handle HTTP/HTTPS traffic.

We know how much this hurt our customers. We’re ashamed it happened. It also had a negative impact on our own operations while we were dealing with the incident.

If you were one of our customers, the outage must have been enormously stressful, frustrating, perhaps even distressing. We hadn’t had a global outage in six years, which made it all the more painful for us.

The CPU exhaustion was caused by a single WAF rule containing a poorly written regular expression that triggered enormous backtracking. This is the regular expression at the heart of the outage: (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Although the regular expression itself is of interest to many people (and is discussed in more detail below), the full story of why the Cloudflare service was unavailable for 27 minutes is considerably more complex than “a badly written regular expression was deployed”. We’ve taken the time to write down the chain of events that led to the outage and slowed our response. If you want to know more about regular expression backtracking and what can be done about it, see the appendix at the end of this post.

What happened

Let’s go through the events in order. All times in this blog are UTC.

At 13:42, an engineer on the firewall team deployed a minor change to the rules for XSS detection via an automated process. This generated a change request ticket. We use Jira to manage these tickets, and a screenshot is below.

Three minutes later, the first PagerDuty page went out, indicating a fault with the WAF. This was a synthetic test that checks, from outside Cloudflare, that the WAF is functioning correctly (we run hundreds of such tests). It was quickly followed by pages from other end-to-end tests reporting failures of Cloudflare services on websites, an alert for a rapid drop in global traffic, a huge number of 502 errors, and then many reports from our points of presence (PoPs) in cities around the world indicating CPU exhaustion.

Some of these alerts showed up on my watch; I jumped out of the meeting I was in and was on my way to my desk when a senior Solutions Engineer told me we had lost 80% of our traffic. I ran over to SRE, where the team was debugging the situation. In the initial moments there was even speculation that it might be an attack of a scale we had never seen before.

Cloudflare’s SRE team is distributed around the world, giving us continuous, around-the-clock monitoring coverage. Alerts like these, most of which concern very specific issues of limited impact, are monitored in internal dashboards and reviewed and addressed many times a day. But this volume of pages and alerts indicated that something seriously wrong had happened, so SRE immediately declared a P0 incident and escalated it to engineering leadership and systems engineering.

The London engineering team was, at that moment, in our main event space listening to an internal tech talk. The talk was interrupted, the team assembled in a large conference room, and others dialed in. This was not a normal problem that SRE could handle on its own: every relevant team needed to be available at once.

At 14:00, the WAF was identified as the source of the problem and an attack was ruled out. The Performance team pulled live CPU data that clearly showed the WAF was responsible. A team member confirmed this with strace. Another team saw error logs indicating the WAF was in trouble. At 14:02, the whole team looked at me when a “global kill” was proposed, a Cloudflare mechanism that disables a single component worldwide.

But actually getting to the global WAF kill was another matter; it wasn’t straightforward. We use our own products, and because our Access service was down we couldn’t authenticate to our internal control panel (and we discovered that some team members had lost access because a security feature disables their credentials if they don’t use the internal control panel regularly).

We also couldn’t reach other internal services like Jira or the build system. To get around that we had to use a bypass mechanism that was rarely used (another process we scrutinized after the incident). Eventually, a team member executed the global WAF kill at 14:07, and by 14:09 traffic and CPU levels were back to normal worldwide. The rest of Cloudflare’s protection mechanisms were operating again.

Then we moved on to restoring the WAF. Given the severity of the incident, we performed both negative tests (asking “was it really this one change that caused the problem?”) and positive tests (verifying that the rollback had actually worked) in a single city on a portion of traffic, after moving our paying customers’ traffic away from that location.

At 14:52, we were 100% satisfied that we understood the cause, that the problem was fixed, and the WAF was re-enabled globally.

How Cloudflare operates

Cloudflare has an engineering team that works on WAF Managed Rules. They continuously improve detection rates, minimize false positives, and respond rapidly to new threats. In the last 60 days, 476 change requests have been handled for the WAF Managed Rules (an average of one every three hours).

This particular change was deployed in “simulate” mode, in which real customer traffic is checked against the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and to measure its false positive and false negative rates. But even in simulate mode the rules actually have to execute, and in this case the rule contained a regular expression that caused CPU exhaustion.

As can be seen from the change request above, there was a deployment plan, a rollback plan, and a link to the internal Standard Operating Procedure (SOP) for this type of deployment. The SOP explicitly allowed a rule change to be deployed globally. This is very different from our normal approach to software releases, where the SOP first deploys the software to an internal dogfooding network PoP (point of presence) that only our employees pass through, then to a small number of customers at an isolated location, followed by a large number of customers, and finally worldwide. (“Dogfooding”, by the way, is the term for a company using its own product.)

The software release process looks like this: we use git internally via BitBucket. Engineers working on changes push code, which is built by TeamCity. When the build passes, reviewers are assigned. Once a pull request is approved, the code is built and the test suite runs (again).

If the build and tests pass, a change request is created in Jira and the change has to be approved by the relevant manager or technical lead. Once approved, deployment happens to what we call the “animal PoPs”: DOG, PIG, and the Canaries.

The DOG PoP is a Cloudflare PoP (just like any of our cities around the world), but it is used only by Cloudflare employees. This dogfooding PoP lets us catch problems before any customer traffic touches the code. And it frequently does.

If the DOG test passes successfully, the code moves on to PIG (as in “guinea pig”). This is a Cloudflare PoP where a small portion of traffic from free customers passes through the new code.

If that is successful, the code moves on to the Canaries. We have three Canary PoPs spread across the world, through which traffic from paying and free customers runs, as one last check for errors in the new code.

Cloudflare’s software release process

Once the code passes in the Canaries, it is cleared for global deployment. Depending on the code change, the entire DOG, PIG, Canary, Global process can take hours or days to complete. The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a new release is deployed globally to all our customers. But, by design, the WAF doesn’t use this process, because threats demand a rapid response.

WAF threats

Over the past few years we have seen a dramatic increase in vulnerabilities in common applications. This is driven by the increased availability of software testing tools that use techniques such as fuzzing (we recently published a new blog post on fuzzing here).

Source: https://cvedetails.com/

What is commonly seen is a proof of concept (PoC) being created and quickly published on GitHub so that teams running and maintaining applications can test it and make sure they have adequate protections in place. That’s why it is imperative for Cloudflare to react to new threats as quickly as possible and push software patches out to its customers.

A good example of Cloudflare providing this proactive protection was the deployment of protections against the SharePoint vulnerability in May (blogged about here). Shortly after the public announcement, we saw a significant spike in attempts to exploit our customers’ SharePoint installations. Our team continuously monitors for new threats and writes rules to mitigate them on behalf of our customers.

The specific rule that caused last Tuesday’s outage targeted cross-site scripting (XSS) attacks. These, too, have increased significantly in recent years.

Source: https://cvedetails.com/

The standard procedure for a change to the WAF Managed Rules requires that continuous integration (CI) tests pass before the change is deployed globally. That happened normally last Tuesday and the rules were deployed. At 13:31, an engineer on the team had merged a pull request containing the change after it had been approved.

At 13:37, TeamCity built the rules, ran its tests, and gave the green light. The WAF test suite verifies the core functionality of the WAF and consists of a large collection of tests for individual matching functions. After those tests run, the individual WAF rules are tested by executing a large number of HTTP requests against the WAF. These HTTP requests are designed to test requests that should be blocked by the WAF (to make sure it catches attacks) and requests that should pass through (to make sure it isn’t over-blocking and creating false positives). What the suite did not test was runaway CPU utilization by the WAF, and examining the log files from previous WAF builds showed no increase in test suite run time for this rule that might have hinted at the CPU exhaustion it would eventually cause.

The tests passed, and at 13:42 TeamCity automatically began deploying the change.

Quicksilver

Because WAF rules need to address emergent threats, they are deployed using our distributed key-value store, Quicksilver, which pushes changes globally in seconds. This technology is used by all our customers when they make configuration changes in our dashboard or via the API, and it is the basis of our ability to respond to changes extremely quickly.

We haven’t talked much about Quicksilver before. We previously used Kyoto Tycoon as a globally distributed key-value (KV) store, but we ran into operational problems with it and built our own KV store, which is replicated across our more than 180 locations. Quicksilver is how we push changes to customer configuration, update WAF rules, and distribute JavaScript code written by customers using Cloudflare Workers.

From clicking a button in the dashboard or making an API call to change configuration, it takes just a few seconds for the change to be live, globally. Customers have come to love this high-speed configurability. With Workers they expect near-instant, global software deployment. On average, Quicksilver distributes about 350 changes per second.

And Quicksilver is very fast. On average, our p99 for distributing a change to every machine worldwide was 2.29s. Usually this speed is a great thing. When you enable a feature or purge your cache, you know the change is live, globally, almost instantly. Every code push with Cloudflare Workers happens at the same speed. This is part of Cloudflare’s promise of fast updates when you need them.

In this case, however, that speed meant that a change to the rules went live worldwide within seconds. As you can see, the WAF code uses Lua. Cloudflare makes extensive use of Lua in production, and details of Lua in the WAF have been discussed before. The Lua WAF uses PCRE internally and relies on backtracking for matching; it has no safeguard against a runaway expression. More on that, and what we’re doing about it, below.

Everything up to the point of the rule deployment was done “correctly”: a pull request was raised, it was approved, CI/CD built and tested the code, a change request was submitted with an SOP detailing rollout and rollback, and the rollout was executed.

Cloudflare WAF deployment process

What went wrong?

As mentioned, we deploy dozens of new rules to the WAF every week, and we have numerous systems in place to prevent deployments having a negative impact. So when things do go wrong, it is usually the unlikely confluence of multiple causes. Getting to a single root cause, while satisfying, often obscures reality. Here are the vulnerabilities that combined until the point where Cloudflare’s HTTP/HTTPS services went offline.

  1. An engineer wrote a regular expression that could easily cause enormous backtracking.
  2. A protection against excessive CPU use by a regular expression had accidentally been removed a few weeks earlier during a refactoring of the WAF. The refactoring was part of an effort to make the WAF use less CPU.
  3. The regular expression engine being used didn’t have complexity guarantees.
  4. The test suite had no way of identifying excessive CPU consumption.
  5. The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.
  6. The rollback plan required running the complete WAF build twice, which took too long.
  7. The first alert for the global traffic drop took too long to fire.
  8. Our status page was not updated quickly enough.
  9. We had difficulty accessing our own systems because of the outage, and staff were not well trained on the bypass procedure.
  10. SREs had lost access to some systems because their credentials had been suspended for security reasons.
  11. Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.

What’s happened since last Tuesday

First, we stopped all release work on the WAF completely and are doing the following:

  1. Re-introducing the excessive CPU usage protection that was removed. (Done)
  2. Manually inspecting all 3,868 rules in the WAF Managed Rules to find and correct any other instances of potentially excessive backtracking. (Inspection complete)
  3. Introducing performance profiling for all rules into the test suite (expected: July 2019) — a sketch of such a check appears after this list.
  4. Switching to either the re2 or Rust regex engine, both of which offer run-time guarantees. (Expected: July 31)
  5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare, while retaining the ability to do emergency global deployments during active attacks.
  6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare’s edge.
  7. Automating updates to the Cloudflare Status page.
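
Here is the kind of check item 3 above describes, sketched in Python rather than the WAF’s own Lua/PCRE stack: time each rule’s pattern against sample payloads and fail the build if any exceeds a CPU budget. The rule ID, budget, and payload below are made up for illustration.

    import re
    import time

    CPU_BUDGET_SECONDS = 0.01   # hypothetical per-rule, per-payload budget

    def check_rule_performance(rule_id, pattern, payloads):
        # Fail the build if a rule's regular expression takes too long on any sample payload.
        compiled = re.compile(pattern)
        for payload in payloads:
            start = time.perf_counter()
            compiled.search(payload)
            elapsed = time.perf_counter() - start
            if elapsed > CPU_BUDGET_SECONDS:
                raise AssertionError(
                    f"rule {rule_id} exceeded CPU budget: {elapsed:.4f}s "
                    f"on a {len(payload)}-byte payload"
                )

    # The simplified pattern from the outage against a payload that provokes backtracking.
    try:
        check_rule_performance("demo-xss", r".*.*=.*;", ["x=" + "x" * 5000])
    except AssertionError as e:
        print("caught:", e)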

Longer term, we want to move away from the Lua WAF that I wrote years ago. We are porting the WAF to use the new firewall engine, which will make the WAF faster and add yet another layer of protection.

Conclusion

This was an upsetting outage for our customers and for the team. We responded quickly to correct the situation and are now correcting the flaws in the process that allowed it to occur, and going deeper to protect against any further potential problems with the way we use regular expressions by replacing the underlying technology.

We are ashamed of this outage and sorry for the impact on our customers. We believe the changes we’ve made mean such an outage will never recur.

Appendix: About regular expression backtracking

To fully understand how (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*))) caused CPU exhaustion, you need to know a little about how a standard regular expression engine works. The critical part is .*(?:.*=.*). The (?: and matching ) are a non-capturing group (i.e. the expression inside the parentheses is grouped together as a single expression).

For the purposes of discussing why this pattern caused CPU exhaustion, we can safely ignore it and treat the pattern as .*.*=.*. Reduced to this, the pattern obviously looks unnecessarily complex; what’s important, though, is that any “real world” expression (like the complex expressions in our WAF rules) that asks the engine to “match anything followed by anything” can lead to catastrophic backtracking. Here’s why.

In a regular expression, . means match a single character, and .* means match zero or more characters “greedily” (i.e. match as much as possible). So .*.*=.* means match zero or more characters, then match zero or more characters, then find a literal = character, then match zero or more characters.

Consider the test string x=x. It matches the expression .*.*=.*. The .*.* before the equals sign can match the first x (one of the .* matches the x, the other matches zero characters). The .* after the = matches the final x.

This match takes 23 steps. The first .* in .*.*=.* acts greedily and matches the entire x=x string. The engine moves on to the next .*. There are no more characters left to match, so the second .* matches zero characters (that’s allowed). Then the engine moves on to the =. As there are no characters left to match (the first .* having consumed all of x=x), the match fails.

At this point the regular expression engine backtracks. It returns to the first .* and matches it against x= (instead of x=x), then moves on to the second .*. That .* matches the second x, and now there are no more characters left. So when the engine tries to match the = in .*.*=.*, the match fails. The engine backtracks again.

This time it backtracks so that the first .* still matches x=, but the second .* no longer matches x; it matches zero characters. The engine then tries to find the literal = in the .*.*=.* pattern, but that fails (because it was already matched by the first .*). The engine backtracks again.

This time the first .* matches only the first x. But the second .* acts greedily and matches =x. You can see what’s coming. When the engine tries to match the literal =, it fails and backtracks again.

The first .* still matches only the first x. Now the second .* matches only =. But, you guessed it, the engine can’t match the literal = because the second .* matched it. So the engine backtracks again. Remember, all of this is to match a three-character string.

Finally, with the first .* matching only the first x and the second .* matching zero characters, the engine is able to match the literal = in the expression with the = in the string. It moves on, and the final .* matches the final x.

23 steps to match x=x. Here’s a short video of that using the Perl Regexp::Debugger showing the steps and backtracking as they occur.

That’s a lot of work, but what happens if the string is changed from x=x to x=xx? This time it takes 33 steps to match. And if the input is x=xxx, it takes 45. That’s not linear. Here’s a chart showing matching from x=x to x=xxxxxxxxxxxxxxxxxxxx (20 x’s after the =). With 20 x’s after the =, the engine takes 555 steps to match! (And if the x= were missing, so the string was just 20 x’s, the engine would take 4,067 steps to find that the pattern doesn’t match.)
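
You can reproduce this super-linear growth with a toy backtracking matcher. The sketch below supports just enough syntax (literals, . and *) for this pattern; its step counts won’t match the Perl debugger’s exactly, since it counts matcher calls rather than the debugger’s internal operations, but the growth curve tells the same story.

    def backtrack_match(pattern, text):
        # A deliberately naive backtracking matcher: literals, '.' and '*' only.
        # Returns (matched, steps) where a step is one call to the matcher.
        steps = 0

        def match(p, t):
            nonlocal steps
            steps += 1
            if p == len(pattern):
                return t == len(text)                 # require a full match, as in the examples above
            if p + 1 < len(pattern) and pattern[p + 1] == "*":
                # Greedy: consume as much as possible, then backtrack one character at a time.
                i = t
                while i < len(text) and pattern[p] in (".", text[i]):
                    i += 1
                while i >= t:
                    if match(p + 2, i):
                        return True
                    i -= 1
                return False
            if t < len(text) and pattern[p] in (".", text[t]):
                return match(p + 1, t + 1)
            return False

        return match(0, 0), steps

    for n in (1, 2, 3, 10, 20):
        matched, steps = backtrack_match(".*.*=.*", "x=" + "x" * n)
        print(f"{n:2d} x's after '=': matched={matched}, steps={steps}")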

This video shows all the backtracking necessary to match x=xxxxxxxxxxxxxxxxxxxx:

That’s bad because as the input size goes up the match time goes up super-linearly. But things could have been even worse with a slightly different regular expression. Suppose it had been .*.*=.*; (i.e. with a literal semicolon at the end of the pattern). This could easily have been written to try to match an expression like foo=bar;.

This time the backtracking would have been catastrophic. Matching x=x takes 90 steps instead of 23. And the number of steps grows very quickly. Matching x= followed by 20 x’s takes 5,353 steps. Here’s the corresponding chart. Look carefully at the Y-axis values compared to the previous chart.

To complete the picture, here are all 5,353 steps of failing to match x=xxxxxxxxxxxxxxxxxxxx against .*.*=.*;

Using “lazy” rather than greedy matches helps reduce the amount of backtracking that occurs in this case. If the original expression is changed to .*?.*?=.*?, matching x=x takes 11 steps (instead of 23), as does matching x=xxxxxxxxxxxxxxxxxxxx. That’s because the ? after the .* instructs the engine to match the smallest number of characters first before moving on.

But laziness isn’t the total solution to this backtracking behavior. Changing the catastrophic example .*.*=.*; to .*?.*?=.*?; doesn’t change its run time at all. x=x still takes 555 steps, and x= followed by 20 x’s still takes 5,353 steps.
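
The same effect shows up as wall-clock time in any backtracking engine. Here is a small sketch using Python’s re module (a backtracking engine, though not the Lua/PCRE one the WAF used); exact timings will vary by machine, but the time grows super-linearly for both the greedy and the lazy variants.

    import re
    import time

    PATTERNS = {
        "greedy .*.*=.*;":    re.compile(r".*.*=.*;"),
        "lazy   .*?.*?=.*?;": re.compile(r".*?.*?=.*?;"),
    }

    for n in (1000, 2000, 4000, 8000):
        text = "x=" + "x" * n                # no ';' in the text, so every match attempt must fail
        timings = []
        for name, pattern in PATTERNS.items():
            start = time.perf_counter()
            pattern.match(text)              # forced to explore the backtracking space before failing
            timings.append(f"{name}: {time.perf_counter() - start:.3f}s")
        print(f"n={n:5d}  " + "  ".join(timings))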

The only real solution, short of completely rewriting the pattern, is to move away from a regular expression engine that uses this backtracking mechanism. Which is exactly what we are doing within the next few weeks.

The solution to this problem has been known since 1968, when Ken Thompson published the paper “Programming Techniques: Regular expression search algorithm”. It describes a mechanism for converting a regular expression into an NFA (non-deterministic finite automaton), and then following the state transitions in the NFA using an algorithm that executes in time linear in the size of the string being matched.

Thompson’s paper doesn’t actually mention the NFA by name, but the linear-time algorithm is clearly explained, and an ALGOL-60 program that generates assembly language code for the IBM 7094 is presented. The implementation may be arcane; the idea it presents is not.

Here’s what the regular expression .*.*=.* would look like if drawn in the style of the diagrams in Thompson’s paper:

Figure 0 has five states, starting at 0. The three circles show states 1, 2 and 3, which correspond to the three .* in the regular expression. The three lozenges with dots in them match a single character. The lozenge with an = sign in it matches the literal = character. State 4 is the ending state; if it is reached, the regular expression has matched.

To see how such a state diagram can be used to match the regular expression .*.*=.*, let’s walk through matching the string x=x. The program starts in state 0, as shown in Figure 1.

The key to making this algorithm work is that the state machine is in multiple states at the same time. The NFA takes every transition it can, simultaneously.

Even before reading any input, it immediately transitions to both state 1 and state 2, as shown in Figure 2.

Figure 2 shows what happens when it considers the first x in x=x. The x can match the top dot by transitioning from state 1 back to state 1, or it can match the dot below it by transitioning from state 2 back to state 2.

So after matching the first x in x=x, the states are still 1 and 2. It’s not possible to reach state 3 or 4, because a literal = character is needed.

Next the algorithm considers the = in x=x. Much like the x before it, it can be matched by either of the two upper loops, transitioning from state 1 to state 1 or from state 2 to state 2, but additionally the literal = can be matched, and the algorithm can transition from state 2 to state 3 (and immediately to state 4). That’s illustrated in Figure 3.

Next the algorithm reaches the final x in x=x. From states 1 and 2 the same transitions back to states 1 and 2 are possible. From state 3, the x can match the dot on the right and transition back to state 3.

At that point every character of x=x has been considered, and because state 4 was reached, the regular expression matches that string. Each character was processed once, so the algorithm was linear in the length of the input string. And no backtracking was needed.

It might also be obvious that once state 4 was reached (after x= was matched), the regular expression had matched and the algorithm could terminate without considering the final x at all.

The algorithm is linear in the size of its input.
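
To make the walk-through concrete, here is a minimal simulation of that state machine in Python. The state numbering follows the figures (0 is the start, 1–3 are the three .* loops, 4 accepts), and the transition tables are hand-built for this one pattern rather than compiled from the regular expression, so treat it as an illustration of the idea, not a general engine.

    # Hand-built NFA for .*.*=.* using the state numbering from the figures above.
    EPSILON = {0: {1}, 1: {2}, 3: {4}}           # transitions that consume no input
    ANY_CHAR = {1: {1}, 2: {2}, 3: {3}}          # the '.' self-loops
    LITERAL = {("=", 2): {3}}                    # the literal '=' between states 2 and 3

    def closure(states):
        # Expand a set of states with everything reachable via epsilon transitions.
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in EPSILON.get(s, ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def matches(text):
        # Simulate the NFA: every character is examined exactly once, no backtracking.
        current = closure({0})
        for ch in text:
            nxt = set()
            for s in current:
                nxt |= ANY_CHAR.get(s, set())            # '.' matches any character
                nxt |= LITERAL.get((ch, s), set())       # the literal '=' transition
            current = closure(nxt)
        return 4 in current                              # accepted if the end state was reached

    for sample in ("x=x", "x=" + "x" * 20, "xxxx"):
        print(f"{sample!r:28} -> {matches(sample)}")

The work per character is bounded by the (constant) number of states, which is what makes the run time linear in the input, exactly as Thompson’s construction promises.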

Cloudflare in Lisbon

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-lisbon-office-portuguese/

Cloudflare in Lisbon

I was Cloudflare’s 24th employee and the first to work outside San Francisco. Working out of a makeshift office at home, I wrote a large chunk of Cloudflare’s software before hiring a team in London. Today, Cloudflare London, our headquarters for EMEA (Europe, the Middle East and Africa), has more than 200 people working in the historic County Hall building opposite the British Parliament. My makeshift office is now ancient history.

Cloudflare in Lisbon
CC BY-SA 2.0 image by Sridhar Saraf

Cloudflare didn’t stop at London. We have people in Munich, Singapore, Beijing, Austin, Texas, Chicago and Champaign, Illinois, New York, Washington, DC, San Jose, California, Miami, Florida, and Sydney, Australia, as well as San Francisco and London. Today we are announcing the opening of a new office in Lisbon, Portugal. As part of the office opening, I will be relocating to Lisbon this summer along with a small number of technical staff from other Cloudflare offices.

We are recruiting in Lisbon right now. You can visit this link to see all the current openings. We are looking for candidates to fill roles in Engineering, Security, Product, Product Strategy, Technology Research, and Customer Support.

If you are interested in a role that isn’t currently listed on our careers page, you can also email our recruiting team at [email protected] to express your interest.

Cloudflare in Lisbon
CC BY-SA 2.0 Image by Rustam Aliyev

My first real idea of Lisbon took shape 30 years ago with the 1989 publication of John Le Carré’s The Russia House. As real, of course, as any of Le Carré’s views of the world:

[…] ten years ago, on a whim, Barley Blair, having inherited a few thousand from a distant aunt, bought himself a modest pied-à-terre in Lisbon, where he used to take regular rests from the burden of his many-sided soul. It could have been Cornwall, it could have been Provence or even Timbuktu. But Lisbon, by an accident, had got him […]

Cloudflare’s choice of Lisbon did not come about by accident, but through a careful search for a new continental European city in which to locate an office. I was invited back to Lisbon in 2014 to speak at SAPO Codebits and was impressed by the size and variety of technical talent present at the event. Subsequently, we visited 45 cities across 29 countries, narrowing down to a final list of three.

The combination of Lisbon’s large and growing tech ecosystem, an attractive immigration policy, political stability and a high standard of living, as well as logistical factors such as the time zone (the same as Great Britain’s) and direct flights to San Francisco, made it the clear winner.

I started learning Portuguese three months ago… and I am looking forward to discovering the country and the culture, and building a new office for Cloudflare.

We have found a thriving local technology ecosystem, supported both by the government and by a myriad of exciting startups, and we look forward to collaborating with them to continue raising Lisbon’s profile.

Cloudflare’s new Lisbon office

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-lisbon-office/

Cloudflare's new Lisbon office

I was the 24th employee of Cloudflare and the first outside of San Francisco. Working out of my spare bedroom, I wrote a chunk of Cloudflare’s software before starting to recruit a team in London. Today, Cloudflare London, our EMEA headquarters, has more than 200 people working in the historic County Hall building opposite the Houses of Parliament. My spare bedroom is ancient history.

Cloudflare's new Lisbon office
CC BY-SA 2.0 image by Sridhar Saraf

And Cloudflare didn’t stop at London. We now have people in Munich, Singapore, Beijing, Austin, TX, Chicago and Champaign, IL, New York, Washington, DC, San Jose, CA, Miami, FL, and Sydney, Australia, as well as San Francisco and London. And today we’re announcing the establishment of a new technical hub in Lisbon, Portugal. As part of that office opening I will be relocating to Lisbon this summer along with a small number of technical folks from other Cloudflare offices.

We’re recruiting in Lisbon starting today. Go here to see all the current opportunities. We’re looking for people to fill roles in Engineering, Security, Product, Product Strategy, Technology Research, and Customer Support.

Cloudflare's new Lisbon office
CC BY-SA 2.0 Image by Rustam Aliyev

My first real idea of Lisbon dates to 30 years ago with the 1989 publication of John Le Carré’s The Russia House. As real, of course, as any Le Carré view of the world:

[…] ten years ago on a whim Barley Blair, having inherited a stray couple of thousand from a remote aunt, bought himself a scruffy pied-a-terre in Lisbon, where he was accustomed to take periodic rests from the burden of his many-sided soul. It could have been Cornwall, it could have been Provence or Timbuktu. But Lisbon by an accident had got him […]

Cloudflare’s choice of Lisbon, however, came not by way of an accident but a careful search for a new continental European city in which to locate a technical office. I had been invited to Lisbon back in 2014 to speak at SAPO Codebits and been impressed by the size and range of technical talent present at the event. Subsequently, we looked at 45 cities across 29 countries, narrowing down to a final list of three.

Lisbon’s combination of a large and growing existing tech ecosystem, attractive immigration policy, political stability, high standard of living, as well as logistical factors like time zone (the same as the UK) and direct flights to San Francisco made it the clear winner.

Eu começei a aprender Português há três meses… and I’m looking forward to discovering a country and a culture, and building a new technical hub for Cloudflare. We have found a thriving local technology ecosystem, supported both by the government and a myriad of exciting startups, and we look forward to collaborating with them to continue to raise Lisbon’s profile.