Tag Archives: Better Internet

The challenges of sanctioning the Internet

Post Syndicated from Laura Klick original https://blog.cloudflare.com/the-challenges-of-sanctioning-the-internet/

Following Russia’s invasion of Ukraine, governments around the world, including the US, UK, and EU, announced sweeping sanctions targeting the Russian and Belarusian economies. These sanctions prohibit a specified level of economic activity in an effort to use economic pressure to punish targeted countries. Almost overnight, we saw unprecedented restrictions put in place for multinational companies doing business in Russia or Belarus.

Separately, recent events in Iran led the US government to authorize additional Internet/communications activities, which were being used widely by average Iranians protesting against the government. This was done by expanding some existing licenses, or exceptions, to sanctions the US has imposed on Iran.

While the use of sanctions as a tool for responding to foreign relations crises is nothing new, the wide-ranging multilateral sanctions that have been imposed on Russia and the recent authorizations in Iran are significant and provide fresh examples of how sanctions can affect access to a free and open global Internet.

Balancing interests in sanctions policy

Cloudflare is committed to complying with all applicable sanctions, including US, UK, and EU sanctions, and we have put in place programs to ensure that compliance. At the same time, we recognize the important role we and other Internet infrastructure companies play in protecting a key human right and principle also supported by the US, UK, and EU governments: free expression online.

One overarching principle of sanctions policy is that sanctions are intended to increase the cost of violating international norms and ultimately force authoritarian regimes and malicious actors to change behavior. The purpose of sanctions is not to punish or isolate ordinary citizens of a particular country or region. In fact, ordinary citizens can be powerful catalysts for the policy changes that sanctions are seeking to achieve. However, as we’ve seen over and over again, changes in policy, particularly in countries that have authoritarian regimes, do not happen overnight, and they often depend on the ability of individuals to communicate with each other and with the rest of the world. For example, in Iran, we’ve witnessed the important role that social media has played in helping support and spread the protest movement sparked by the killing of Mahsa Amini. Similarly, in the wake of Russia’s invasion of Ukraine, ordinary Russians continue to look for ways to access non-Russian news sources via private Internet access tools and VPNs.

It’s a tricky balance to impose costs on bad actors while maintaining open lines of communication for ordinary citizens, but it’s a balance that we’ve seen the US Government take a leading role in preserving, even in areas where most other transactions/activities might otherwise be prohibited. For example, the key US law authorizing the executive branch to deploy sanctions exempts “any postal, telegraphic, telephonic or other personal communication, which does not involve a transfer of anything of value.” The US government also has a long tradition of issuing authorizations, also known as General Licenses, permitting additional telecommunications and Internet-related activities, including in Cuba, Iran, Russia, Syria, and certain restricted regions of Ukraine. This means that US companies, like Cloudflare, can continue to provide many products and services that support free and secure Internet communications.

Although these exemptions and licenses can help the US Government advance the policy goal of supporting Internet freedom, they are only effective if private sector companies make use of them. That may be easier said than done. Because of the financial and reputational penalties that can be imposed if a company violates sanctions, even inadvertently, companies often have an incentive to take a simple and blunt approach to sanctions compliance rather than the more nuanced path of availing themselves of the exceptions in the General Licenses. Companies have to invest significant time and money into understanding the legal requirements and applicable exemptions and licenses when deciding whether to provide services in high risk countries. Cloudflare has made these investments because they align with our goal of helping build a better Internet and making a free and secure Internet accessible to all.

As governments continue to use sanctions as a foreign policy tool, we think it’s important that Internet infrastructure companies discuss how the legal framework is impacting their ability to support a global Internet. Described below are some of the key issues we’ve identified and ways that regulators can help balance the policy goals of sanctions with the need to support the free flow of communications for ordinary citizens around the world.

There are two broad categories of sanctions: (1) country-/region-based, and (2) individual/entity list-based. Sanctions vary across jurisdictions: US sanctions look different from EU and UK sanctions, and the differences can be significant. Companies that operate around the world have to pay close attention to individual rules and regulations to ensure compliance with sanctions.

Country-/region-based sanctions

With respect to country-/region-based sanctions, the US government has imposed comprehensive sanctions on doing business in Cuba, Iran, North Korea, Syria, and certain restricted regions of Ukraine (Crimea, Luhansk, and Donetsk). The purpose of comprehensive sanctions is to impose severe punishments on state actors in these countries by denying them access to valuable US goods/services. You might think that this means that Internet companies are therefore barred from providing services to these countries/regions, but that’s where things get complicated. The US government has issued General Licenses, which authorize US companies to engage in certain Internet- and telecommunications-related activities.

While these General Licenses are helpful in that they may authorize peering services, VPN, SSL certificates, and other services incident to the exchange of communications over the Internet, the activities authorized vary across sanctioned jurisdictions. In some countries/regions (e.g., Cuba, Iran, and the Donetsk and Luhansk regions), except for government parties, some free and paid services are authorized, but in other instances (e.g., Crimea and Syria), all authorized services must be available at no cost to the user. Along the same lines, some General Licenses list specific types of services/products that may be provided, while others leave it up to a company to make their own determination whether a product/service is authorized by the terms of the license. Neither the UK nor the EU has issued any Internet-related General Licenses, which has become a particular issue in the context of Russia where there are now significant restrictions in place.

With respect to Iran, the US government recently issued a new General License that broadens the products/services authorized and provides other clarifications to make it easier for companies to provide Internet services to ordinary Iranians. The new General License is encouraging for companies, like Cloudflare, that would like to help support access to the broader Internet for ordinary Iranian citizens. But as with any new policy, it takes time for companies to understand the changes and make decisions about whether to invest additional time and resources to expand services offerings in a high risk country like Iran. Given the significant restrictions that have been imposed on doing business in Iran over the years, there are a number of logistical challenges with seeking to enter a market where so many activities remain prohibited. Moreover, there is always a risk that sanctions policies can change, so companies will take this into account when weighing whether to deploy expensive hardware/equipment or make other long-term investments.

Party-based sanctions

Apart from country-based sanctions, many governments, including the US, UK, and EU, maintain list-based sanctions, which prohibit dealings with specific listed parties. Like many multinational companies, Cloudflare screens customers and other third parties to identify links to sanctioned parties. We do not engage in any transactions with or provide services to any parties that have been listed on applicable sanctions lists or any parties that are owned or controlled by such parties, and our Terms of Service prohibit sanctioned parties from using our services.

Over the years, the US government has continued to add parties to its sanctions list. Notably, when the US government adds a party to the sanctions list, it will include corresponding identifying information, including possible aliases, physical address, as well as email address and domain names to the extent they are known. The UK has also started adding domains and email addresses, but those domains and email addresses do not always align with what is on the US list, creating further complexities for multinational companies in this space.

While there are a number of sanctions screening providers that will help companies conduct due diligence on third parties they are considering doing business with, email addresses and domains are not automatically screened. This can be challenging for Internet infrastructure companies for whom email addresses and domain names are critical pieces of data when onboarding a customer. With limited automated solutions, companies must invest significant time and resources building proprietary tools that block sanctioned domains and email addresses from signing up for their services.
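To make the kind of tooling described above more concrete, here is a minimal sketch in Python of a signup check against listed email addresses and domains. It is not Cloudflare's implementation. The list entries, the function names, and the choice to treat subdomains of a listed domain as matches are all assumptions made for illustration.

```python
# Hypothetical list entries; real entries come from the published sanctions lists.
SANCTIONED_DOMAINS = {"blocked.example"}
SANCTIONED_EMAILS = {"owner@listed.example"}


def domain_is_sanctioned(domain: str) -> bool:
    """Return True if the domain or any parent domain is on the list."""
    labels = domain.lower().rstrip(".").split(".")
    # Check "a.b.c", then "b.c", then "c", so subdomains of a listed domain match.
    # Treating subdomains as matches is an assumption of this sketch.
    return any(".".join(labels[i:]) in SANCTIONED_DOMAINS for i in range(len(labels)))


def signup_is_blocked(email: str) -> bool:
    """Screen a signup email against listed addresses and listed domains."""
    normalized = email.strip().lower()
    if normalized in SANCTIONED_EMAILS:
        return True
    _, _, domain = normalized.rpartition("@")
    return domain_is_sanctioned(domain)
```

An exact-match check like this only catches what is actually on a list, which is one reason consistent, published domain and email identifiers matter so much to companies doing this screening.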

Cloudflare may also receive abuse reports alleging that domains are operated by sanctioned parties. However, unless a domain is listed on a sanctions list, it can be challenging to determine if a domain is subject to sanctions. Without clear guidance from regulators, companies must develop their own processes for reviewing these reports. While it is important that companies terminate services to domains owned or operated by a sanctioned party, it’s also critical that they do so in a way that is fair and consistent.

Implications for a free and open Internet

Sanctions are an important tool for responding to geopolitical challenges, and they can help impose economic costs on parties that violate international norms, including human rights. However, sanctions can also have unintended consequences when they are not properly deployed. While regulators have learned a number of lessons over the years when imposing sanctions on more traditional sanctions targets, like the financial and energy sectors, the global Internet remains a complicated area that has only recently become a more prominent focus of sanctions. With the technology constantly evolving and a number of different parties involved in maintaining a secure and reliable Internet, it is critical that regulators are clear about their expectations and seek to minimize any chilling effects.

Key stakeholders involved in maintaining a free and open Internet are likely to continue exiting sensitive markets in the absence of clear guidance from regulators. This will only lead to further fragmentation of the global Internet and open the door for authoritarian governments to monitor and control global communications, an outcome that clearly undermines the policy goals of the sanctions. These are complicated issues, and we don’t pretend to have all the answers. But there are things that regulators can do to mitigate unintended consequences of sanctions policies and promote a free and open Internet. Here are a few key points that we advocate to policymakers:

  • Continue partnering with stakeholders to understand practical implications before imposing new sanctions and determine where additional clarifying guidance might be helpful.
  • Apply a consistent and coordinated approach to exemptions/authorizations to make it easier for multinational companies to provide services in challenging jurisdictions.
  • Provide clear guidelines for Internet-related companies as to when a domain or user may be subject to sanctions (e.g., by adding domain names and email addresses to applicable sanctions lists) and ensure consistency across jurisdictions.

Looking forward

An integral part of Cloudflare’s mission to help build a better Internet involves making sure that ordinary individuals have access to a free and secure Internet. While global sanctions will continue to present challenges to Internet infrastructure companies like Cloudflare, we are committed to both compliance with applicable sanctions and helping to maintain open lines of communication around the world, and we will continue to advocate for policies that do the same.

Project A11Y: how we upgraded Cloudflare’s dashboard to adhere to industry accessibility standards

Post Syndicated from Emily Flannery original https://blog.cloudflare.com/project-a11y/

At Cloudflare, we believe the Internet should be accessible to everyone. And today, we’re happy to announce a more inclusive Cloudflare dashboard experience for our users with disabilities. Recent improvements mean our dashboard now adheres to industry accessibility standards, including Web Content Accessibility Guidelines (WCAG) 2.1 AA and Section 508 of the Rehabilitation Act.

Over the past several months, the Cloudflare team and our partners have been hard at work to make the Cloudflare dashboard¹ as accessible as possible for every single one of our current and potential customers. This means incorporating accessibility features that comply with the latest Web Content Accessibility Guidelines (WCAG) and Section 508 of the US’s federal Rehabilitation Act. We are invested in working to meet or exceed these standards; to demonstrate that commitment and share openly about the state of accessibility on the Cloudflare dashboard, we have completed the Voluntary Product Accessibility Template (VPAT), a document used to evaluate our level of conformance today.

Conformance with a technical and legal spec is a bit abstract, but for us, accessibility simply means that as many people as possible can be successful users of the Cloudflare dashboard. This is important because each day, more and more individuals and businesses rely upon Cloudflare to administer and protect their websites.

For individuals with disabilities who work on technology, we believe that an accessible Cloudflare dashboard could mean improved economic and technical opportunities, safer websites, and equal access to tools that are shaping how we work and build on the Internet.

For designers and developers at Cloudflare, our accessibility remediation project has resulted in an overhaul of our component library. Our newly WCAG-compliant components expedite and simplify our work building accessible products. They make it possible for us to deliver on our commitment to an accessible dashboard going forward.

Our Journey to an Accessible Cloudflare Dashboard

In 2021, we initiated an audit with third party experts to identify accessibility challenges in the Cloudflare dashboard. This audit came back with a daunting 213-page document—a very, very long list of compliance gaps.

We learned from the audit that there were many users we had unintentionally failed to design and build for in Cloudflare dashboard user interfaces. Most especially, we had not done well accommodating keyboard users and screen reader users, who often rely upon these technologies because of an impairment. Those impairments include low vision or blindness, motor disabilities (examples include tremors and repetitive strain injury), or cognitive disabilities (examples include dyslexia and dyscalculia).

As a product and engineering organization, we had spent more than a decade in cycles of rapid growth and product development. While we’re proud of what we have built, the audit made clear to us that there was a great need to address the design and technical debt we had accrued along the way.

One year, four hundred Jira tickets, and over 25 new, accessible web components later, we’re ready to celebrate our progress with you. Major categories of work included:

  1. Forms: We re-wrote our internal form components with accessibility and developer experience top of mind. We improved form validation and error handling, labels, required field annotations, and made use of persistent input descriptions instead of placeholders. Then, we deployed those component upgrades across the dashboard.
  2. Data visualizations: After conducting a rigorous re-evaluation of their design, we re-engineered charts and graphs to be accessible to keyboard and screen reader users. See below for a brief case study.
  3. Heading tags: We corrected page structure throughout the dashboard by replacing all our heading tags (<h1>, <h2>, etc.) with a technique we borrowed from Heydon Pickering. This technique is an approach to heading level management that uses React Context and basic arithmetic.
  4. SVGs: We reworked how we create SVGs (Scalable Vector Graphics), so that they are labeled properly and only exposed to assistive technology when useful.
  5. Node modules: We jumped several major versions of old, inaccessible node modules that our UI components depend upon (and we broke many things along the way).
  6. Color: We overhauled our use of color, and contributed a new volume of accessible sequential colors to our design system.
  7. Bugs: We squashed a lot of bugs that had made their way into the dashboard over the years. The most common type of bug we encountered related to incorrect or non-semantic use of HTML elements, for example using a <div> where we should have used a <td> (table data) or <tr> (table row) element within a table.
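The heading-level item in the list above (context plus basic arithmetic) can be illustrated with a small sketch. This is not the dashboard's React code. It is a language-neutral illustration written in Python, with a context variable standing in for React Context. The names `section`, `heading`, and `render_example`, and the cap at `<h6>`, are assumptions of the sketch.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Current heading depth; plays the role a React Context value would play.
_heading_level = ContextVar("heading_level", default=1)


@contextmanager
def section():
    """Enter a nested section: headings inside render one level deeper."""
    # The arithmetic: parent level + 1, capped at 6 because HTML stops at <h6>.
    token = _heading_level.set(min(_heading_level.get() + 1, 6))
    try:
        yield
    finally:
        _heading_level.reset(token)


def heading(text: str) -> str:
    """Render a heading at whatever level the enclosing sections imply."""
    level = _heading_level.get()
    return f"<h{level}>{text}</h{level}>"


def render_example() -> list:
    """Build a small page; no heading call ever hard-codes its level."""
    page = [heading("Analytics")]
    with section():
        page.append(heading("Traffic"))
        with section():
            page.append(heading("Requests"))
        page.append(heading("Security"))
    return page
```

Because the level is derived from nesting, moving a component to a different depth on the page keeps the heading outline correct without touching the component itself.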

Case Study: Accessibility Work On Cloudflare Dashboard Data & Analytics

The Cloudflare dashboard is replete with analytics and data visualizations designed to offer deep insight into users’ websites’ performance, traffic, security, and more. Making those data visualizations accessible proved to be among the most complex and interdisciplinary issues we faced in the remediation work.

An example of a problem we needed to solve related to WCAG success criterion 1.4.1, which pertains to the use of color. 1.4.1 specifies that color cannot be the only means by which to convey information, such as the differentiation between two items compared in a chart or graph.

Our charts were clearly nonconforming with this standard, using color alone to represent different data being compared. For example, a typical graph might have used the color blue to show the number of requests to a website that were 200 OK, and the color orange to show 403 Forbidden, but failed to offer users another way to discern between the two status codes.

Our UI team went to work on the problem, and chose to focus our effort first on the Cloudflare dashboard time series graphs.

Interestingly, we found that design patterns recommended even by accessibility experts created wholly unusable visualizations when placed into the context of real world data. Examples of such recommended patterns include using different line weights, patterns (dashed, dotted or other line styles), and terminal glyphs (symbols set at the beginning and end of the lines) to differentiate items being compared.

We tried, and failed, to apply a number of these patterns; you can see the evolution of this work on our time series graph component in the three different images below.

v.1

Here is an early attempt at using both terminal glyphs and patterns to differentiate data in a time series graph. You can see that the terminal glyphs pile up and become indistinguishable; the differences among the line patterns are very hard to discern. This code never made it into production.

v.2

In this version, we eliminated terminal glyphs but kept line patterns. Additionally, we faded the unfocused items in the graph to help bring highlighted data to the forefront. This latter technique made it into our final solution.

v.3

Here we eliminated patterns altogether, simplified the user interface to only use the fading technique on unfocused items, and put our new, sequentially accessible colors to use. Finally, a visual design solution approved by accessibility and data visualization experts, as well as our design and engineering teams.

After arriving at our design solution, we had some engineering work to do.

In order to meet WCAG success criterion 2.1.1, we rewrote our time series graphs to be fully keyboard accessible by adding focus handling to every data point, and enabling the traversal of data using arrow keys.
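As a rough illustration of arrow-key traversal, the sketch below computes which data point should receive focus after a key press. It is written in Python for brevity and is not the dashboard's code. The function name, and the decision to stop at the ends of the series rather than wrap around, are assumptions. The key names follow the browser's `KeyboardEvent.key` values.

```python
def next_focus(index: int, key: str, count: int) -> int:
    """Return the index of the data point to focus after a key press."""
    if count <= 0:
        raise ValueError("a series needs at least one data point")
    moves = {"ArrowLeft": -1, "ArrowRight": 1}
    # Unrelated keys move by 0; clamping stops focus at the first and last
    # points instead of wrapping around (an assumption of this sketch).
    return max(0, min(count - 1, index + moves.get(key, 0)))
```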

Navigating time series data points by keyboard on the Cloudflare dashboard.

We did some fine-tuning, specifically to support screen readers: we eliminated auditory “chartjunk” (unnecessary clutter or information in a chart or graph) and cleaned up decontextualized data (a scenario in which numbers are exposed to and read by a screen reader, but contextualizing information, like x- and y-axis labels, is not).

And lastly, to meet WCAG 1.1.1, we engineered new UI component wrappers to make chart and graph data downloadable in CSV format. We deployed this part of the solution across all charts and graphs, not just the time series charts like those shown above. No matter how you browse and interact with the web, we hope you’ll notice this functionality around the Cloudflare dashboard and find value in it.
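A CSV export of chart data can be sketched in a few lines. The example below is a minimal Python illustration, not the dashboard's component wrapper. The function name, its parameters, and the sample values in the usage note are hypothetical.

```python
import csv
import io


def series_to_csv(x_label: str, x_values: list, series: dict) -> str:
    """Serialize chart series to CSV text, one row per x-axis value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The header row carries the axis and series labels, so the numbers
    # keep their context for anyone reading the file.
    writer.writerow([x_label, *series.keys()])
    for i, x in enumerate(x_values):
        writer.writerow([x, *(values[i] for values in series.values())])
    return buffer.getvalue()
```

Called with made-up data such as `series_to_csv("time", ["11:00", "11:05"], {"200 OK": [10, 12], "403 Forbidden": [1, 0]})`, this yields a header row followed by one row per timestamp.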

Making all of this data available to low vision, keyboard, and assistive technology users was an interesting challenge for us, and a true team effort. It necessitated a separate data visualization report conducted by another, more specialized team of third party experts, deep collaboration between engineering and design, and many weeks of development.

Applying this thorough treatment to all data visualizations on the Cloudflare dashboard is our goal, but it is still a work in progress. Please stay tuned for more accessible updates to our chart and graph components.

Conclusion

There’s a lot of nuance to accessibility work, and we were novices at the beginning: researching and learning as we were doing. We also broke a lot of things in the process, which (as any engineering team knows!) can be stressful.

Overall, our team’s biggest challenge was figuring out how to complete a high volume of cross-functional work in the shortest time possible, while also setting a foundation for these improvements to persist over time.

As a frontend engineering and design team, we are very grateful for having had the opportunity to focus on this problem space and to learn from truly world-class accessibility experts along the way.

Accessibility matters to us, and we know it does to you. We’re proud of our progress, and there’s always more to do to make Cloudflare more usable for all of our customers. This is a critical piece of our foundation at Cloudflare, where we are building the most secure, performant and reliable solutions for the Internet. Stay tuned for what’s next!

Not using Cloudflare yet? Get started today and join us on our mission to build a better Internet.

¹ All references to “dashboard” in this post are specific to the primary user authenticated Cloudflare web platform. This does not include Cloudflare’s product-specific dashboards, marketing, support, educational materials, or third party integrations.

Route leaks and confirmation biases

Post Syndicated from Maximilian Wilhelm original https://blog.cloudflare.com/route-leaks-and-confirmation-biases/

This is not what I imagined my first blog article would look like, but here we go.

On February 1, 2022, a configuration error on one of our routers caused a route leak of up to 2,000 Internet prefixes to one of our Internet transit providers. The leak lasted for 32 seconds and recurred later for 7 seconds. We did not see any traffic spikes or drops in our network and did not see any customer impact because of this error, but this may have caused an impact to external parties, and we are sorry for the mistake.

Timeline

All timestamps are UTC.

As part of our efforts to build the best network, we regularly update our Internet transit and peering links throughout our network. On February 1, 2022, we had a “hot-cut” scheduled with one of our Internet transit providers to simultaneously update router configurations on Cloudflare and ISP routers to migrate one of our existing Internet transit links in Newark to a link with more capacity. Doing a “hot-cut” means that both parties will change cabling and configuration at the same time, usually while being on a conference call, to reduce downtime and impact on the network. The migration started off-peak at 10:45 (05:45 local time) with our network engineer entering the bridge call with our data center engineers and remote hands on site as well as operators from the ISP.

At 11:17, we connected the new fiber link and established the BGP sessions to the ISP successfully. We had BGP filters in place on our end to not accept and send any prefixes, so we could evaluate the connection and settings without any impact on our network and services.

As the connection between our router and the ISP, like most Internet connections, was realized over a fiber link, the first item to check is the “light levels” of that link. This shows the strength of the optical signal received by our router from the ISP router and can indicate a bad connection when it’s too low. Low light levels are likely caused by unclean fiber ends or not fully seated connectors, but may also indicate a defective optical transceiver, which connects the fiber link to the router. All of these can degrade service quality.

The next item on the checklist is interface errors, which will occur when a network device receives incorrect or malformed network packets, which would also indicate a bad connection and would likely lead to a degradation in service quality, too.

As light levels were good, and we observed no errors on the link, we deemed it ready for production and removed the BGP reject filters at 11:22.

This immediately triggered the maximum prefix-limit protection the ISP had configured on the BGP session and shut down the session, preventing further impact. The maximum prefix-limit is a safeguard in BGP to prevent the spread of route leaks and to protect the Internet. The limit is usually set just a little higher than the expected number of Internet prefixes from a peer to leave some headroom for growth but also catch configuration errors fast. The configured value was just 40 prefixes short of the number of prefixes we were advertising at that site, so this was considered the reason for the session to be shut down. After checking back internally, we asked the ISP to raise the prefix-limit, which they did.
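The behavior of a maximum prefix-limit can be sketched as follows. This is a simplified Python model, not router code. The class and method names are hypothetical, and so are all numbers in the usage note.

```python
from dataclasses import dataclass


@dataclass
class BgpSession:
    """Receiving side of a BGP session guarded by a maximum prefix-limit."""

    max_prefixes: int
    received: int = 0
    established: bool = True

    def receive(self, prefix_count: int) -> bool:
        """Count announced prefixes; tear the session down once over the limit."""
        if not self.established:
            return False
        self.received += prefix_count
        if self.received > self.max_prefixes:
            # The only signal the sender sees is that the session went down.
            self.established = False
        return self.established
```

With a hypothetical limit of 1,000, a peer announcing 1,040 prefixes and a peer announcing many thousands produce the same visible outcome: `receive` returns `False` and the session is down. The safeguard does its job in both cases, but it does not tell the operator by how much the limit was exceeded.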

The BGP session was reestablished at 12:08 and immediately shut down again. The problem was identified and fixed at 12:14.

10:45: Start of scheduled maintenance

11:17: New link was connected and BGP sessions went up (filters still in place)

11:22: Link was deemed ready for production and filters removed

11:23: BGP sessions were torn down by ISP router due to configured prefix-limit

12:08: ISP configured higher prefix-limits; BGP sessions briefly came up again and were shut down

12:14: Issue identified and configuration updated

What happened and what we’re doing about it

The route leak occurred while migrating one of our Internet transit links to a link with more capacity. Once the new link and a BGP session had been established, and the link deemed error-free, our network engineering team followed the peer-reviewed deployment plan. The team removed the filters from the BGP sessions that had been preventing the Cloudflare router from accepting and sending prefixes via BGP.

Due to an oversight in the deployment plan, which had been peer-reviewed before without noticing this issue, no BGP filters to only export prefixes of Cloudflare and our customers were added. A peer review on the internal chat did not notice this either, so the network engineer performing this change went ahead.

ewr02# show |compare                                     
[edit protocols bgp group 4-ORANGE-TRANSIT]
-  import REJECT-ALL;
-  export REJECT-ALL;
[edit protocols bgp group 6-ORANGE-TRANSIT]
-  import REJECT-ALL;
-  export REJECT-ALL;

The change resulted in our router sending all known prefixes to the ISP router, which shut down the session as the number of prefixes received exceeded the maximum prefix-limit configured.

As the configured values for the maximum prefix-limits turned out to be rather low for the number of prefixes on our network, this didn’t come as a surprise to our network engineering team and no investigation into why the BGP session went down was started. The prefix-limit being too low seemed to be a perfectly valid reason.

We asked the ISP to increase the prefix-limit, which they did after they received approval on their side. Once the prefix-limit had been increased and the previously shutdown BGP sessions reset, the sessions were reestablished but were shut down immediately as the maximum prefix-limit was triggered again. This is when our network engineer started questioning whether there was another issue at fault and found and corrected the configuration error previously overlooked.

We made the following change in response to this event: we introduced an implicit reject policy for BGP sessions which will take effect if no import/export policy is configured for a specific BGP neighbor or neighbor group. This change has been deployed.
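The idea behind an implicit reject can be shown in a few lines. The following is a minimal sketch in Python, not router configuration. The `Policy` type and the `apply_policy` function are hypothetical names, and the prefixes in the usage note come from the reserved documentation ranges.

```python
from typing import Callable, List, Optional

# A policy takes a prefix and returns True to accept it.
Policy = Callable[[str], bool]


def apply_policy(prefixes: List[str], policy: Optional[Policy]) -> List[str]:
    """Filter prefixes through a policy; with no policy configured, reject all."""
    if policy is None:
        # Implicit reject: a missing policy means nothing is sent or accepted,
        # rather than everything.
        return []
    return [prefix for prefix in prefixes if policy(prefix)]
```

With this default, `apply_policy(["192.0.2.0/24", "198.51.100.0/24"], None)` returns an empty list. Forgetting to attach a policy then fails closed instead of exporting every known prefix.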

BGP security & preventing route-leaks — what’s in the cards?

Route leaks aren’t new, and they keep happening. The industry has come up with many approaches to limit the impact or even prevent route-leaks. Policies and filters are used to control which prefixes should be exported to or imported from a given peer. RPKI can help to make sure only allowed prefixes are accepted from a peer and a maximum prefix-limit can act as a last line of defense when everything else fails.

BGP policies and filters are commonly used to ensure only explicitly allowed prefixes are sent out to BGP peers, usually only allowing prefixes owned by the entity operating the network and its customers. They can also be used to tweak some knobs (BGP local-pref, MED, AS path prepend, etc.) to influence routing decisions and balance traffic across links. This is what the policies we have in place for our peers and transits do. As explained above, the maximum prefix-limit is intended to tear down BGP sessions if more prefixes are being sent or received than expected. We have talked about RPKI before: it’s the required cryptographic upgrade to BGP routing, and we are still on our path to securing Internet Routing.
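An export filter of the kind described above, one that only lets through prefixes belonging to the operator and its customers, might look like this in outline. This is a Python sketch using the standard `ipaddress` module, not actual router policy. The allowed ranges are hypothetical and taken from the reserved documentation address space, and production filters typically also constrain prefix lengths and other attributes.

```python
import ipaddress

# Hypothetical address space of the operator and its customers.
OWN_AND_CUSTOMER = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def may_export(prefix: str) -> bool:
    """Allow a prefix only if it lies inside explicitly allowed address space."""
    network = ipaddress.ip_network(prefix)
    return any(
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        network.version == allowed.version and network.subnet_of(allowed)
        for allowed in OWN_AND_CUSTOMER
    )


def export_routes(routing_table: list) -> list:
    """Return the subset of the routing table that may be announced to a peer."""
    return [prefix for prefix in routing_table if may_export(prefix)]
```

Anything learned from other peers or transits, such as `198.51.100.0/24` in this made-up setup, is dropped from the export and never reaches the neighbor.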

To improve the overall stability of the Internet even more, in 2017, a new Internet standard was proposed, which adds another layer of protection into the mix: RFC8212 defines Default External BGP (EBGP) Route Propagation Behavior without Policies which pretty much tackles the exact issues we were facing.

This RFC updates the BGP-4 standard (RFC4271) which defines how BGP works and what vendors are expected to implement. On the Juniper operating system, JunOS, this can be activated by setting defaults ebgp no-policy reject-always on the protocols bgp hierarchy level starting with Junos OS Release 20.3R1.

If you are running an older version of JunOS, a similar effect can be achieved by defining a REJECT-ALL policy and setting this as import/export policy on the protocols bgp hierarchy level. Note that this will also affect iBGP sessions, which the solution above will have no impact on.

policy-options {
  policy-statement REJECT-ALL {
    then reject;
  }
}

protocols {
  bgp {
    import REJECT-ALL;
    export REJECT-ALL;
  }
}
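
The default-deny behavior that RFC 8212 describes can be illustrated with a short Python sketch. This is only a model of the decision, assuming a policy is a function that returns True to accept a prefix; the names `ebgp_accepts`, `export_own`, and `OWN_PREFIXES` are hypothetical and do not correspond to any router API.

```python
from typing import Callable, Optional

# A policy is modeled as a function that returns True to accept a prefix.
Policy = Callable[[str], bool]


def ebgp_accepts(prefix: str, policy: Optional[Policy]) -> bool:
    """Decide whether a route may cross an EBGP session.

    If no import/export policy is configured, nothing is accepted
    (the implicit reject described in the text).
    """
    if policy is None:
        return False
    return policy(prefix)


# Documentation prefix (RFC 5737) standing in for "our own" address space.
OWN_PREFIXES = {"192.0.2.0/24"}


def export_own(prefix: str) -> bool:
    """Example export policy: only announce prefixes we originate."""
    return prefix in OWN_PREFIXES
```

With this model, a neighbor that was accidentally left without a policy exchanges no routes at all, instead of exchanging every route.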

Conclusion

We are sorry for leaking routes for prefixes that did not belong to Cloudflare or our customers, and we apologize to the network engineers who got paged as a result.

We have processes in place to make sure that changes to our infrastructure are reviewed before being executed, so potential issues can be spotted before they reach production. In this case, the review process failed to catch this configuration error. In response, we will increase our efforts to further our network automation, to fully derive the device configuration from an intended state.

While this configuration error was caused by human error, it could have been detected and mitigated significantly faster had confirmation bias not kicked in, making the operator think the observed behavior was to be expected. This error underlines the importance of our existing efforts to train our people to be aware of the biases we all have. It also serves as a great example of how confirmation bias can influence and impact our work, and of why we should question our conclusions (early).

It also shows how important protocols like RPKI are. Route leaks are something even experienced network operators can cause accidentally, and technical solutions are needed to reduce the impact of leaks whether they are intentional or the result of an error.

Heard in the halls of Web Summit 2021

Post Syndicated from João Tomé original https://blog.cloudflare.com/web-summit-2021-internet/

Opening night of Web Summit 2021, at the Altice Arena in Lisbon, Portugal. Photo by Sam Barnes/Web Summit

Global in-person events were back in a big way at the start of November (1-4) in Lisbon, Portugal, with Web Summit 2021 gathering more than 42,000 attendees from 128 countries. I was there to discover Internet trends and meet interesting people. What I saw was the contagious excitement of people from all corners of the world coming together for what seemed like a type of normality in a time when the Internet “is almost as important as having water”, according to Sonia Jorge from the World Wide Web Foundation.

Here’s some of what I heard in the halls.

With a lot happening on a screen, the lockdowns throughout the pandemic showed us a glimpse of what the metaverse could be, just without VR or AR headsets. Think about the way many were able to use virtual tools to work all day, learn, collaborate, order food, supplies, and communicate with friends and family — all from their homes.

While many had this experience, many others were unable to, with some talks at the event focusing on the digital divide and how “Internet access is a basic human right”, according to the grandson of Nelson Mandela — we interviewed him, and you can watch the conversation below.

The future already has some paths laid out, and many were discussed at the event.

The pandemic helped to accelerate most of them, especially by bringing more people (in some countries) to the digital world.

The CPO of Meta, Chris Cox, shared how the company previously known as Facebook has some ideas about the future of augmented reality, and how they want to see those ideas play out in the next five to 10 years. “We want to get the conversation going,” he said.

Also present at the event was Jon Vlassopulos, Global Head of Music, Roblox. He explained how virtual concerts on the video game platform could be the future of music performances, and even bring free tickets to fans of famous music stars like Adele. Stars like Zara Larsson, KSI and Ava Max have already performed on Roblox and “they’re making big money from selling digital merchandise”.

On the other hand, Paddy Cosgrave, CEO of Web Summit, says that there’s something magical about in-person big events that can’t be replicated in full online events. However, the real and virtual world can complement each other — it was announced that CES 2022 will use a combination of Web Summit online and offline software.

Web3 was another big part of the discussion, sometimes in clear sight, other times embedded in the many conversations about blockchain, NFTs and cryptocurrencies, and as a vision for a decentralized web (we’re actually working on that).

Speakers also focused on data privacy and security, ethics in AI and data protection. Ownership to the user and sovereignty were topics discussed and emphasized by Sir Tim Berners-Lee on the last day of the event.

The workplace was also a popular topic, as well as the changes it underwent in the past couple of years. We heard about the importance of diversity in the workplace, as well as the future of work — is it going to be flexible, hybrid, full remote or something in between? Speakers also mentioned The Great Resignation and the reset of people’s and organizations’ mindsets.

Using AI to hire and motivate people was also in the air, as well as big topics like the digitalization of healthcare, mental health, behaviour changes in humans (young and adult) who are more and more on the Internet and even the decentralization of financial services.

And here are some examples of the different speakers at the event we talked to:

Vice-Admiral Gouveia e Melo: Vaccination, misinformation and leadership

Portuguese Navy officer and coordinator of the Task Force for the Portugal COVID-19 vaccination plan

Portugal has achieved an 86% vaccination rate on the vice-admiral’s watch. He brought a sense of mission to a task that involved organization, focus and the use of both digital and communication tools.

The country started the vaccination process late but now has one of the highest vaccination rates in the world. We talked with the vice-admiral about how the Internet helped, but also how it created problems related to disinformation and misinformation, and we asked about the dangers of controlling speech online. Finally, we asked for bits of leadership advice.

Sonia Jorge: The need for Internet — affordable, fast and for everyone

Executive Director World Wide Web Foundation (Alliance for Affordable Internet)

“The Internet is now an essential public good that everybody needs at this time just like we need to drink water or to have electricity and shelter. We should do more to bring everyone into the digital society.”

In some countries around the world, Internet access is very limited. In some places, people have to go to a particular plaza to have access to the Internet; five years ago, John Graham-Cumming saw something similar in Cuba. Sonia Jorge knows that very well. She is trying to bring affordable Internet to everyone, and that challenge is more difficult than it appears.

She explains that the world is far behind in the UN’s goals for Internet access — today only about half of the earth’s population has any Internet access at all. But many of those who have access to the World Wide Web have limited possibilities to be online: “some have access once a month, for example.” So the digital divide is real, and it “should worry everyone”.

The pandemic caused health and economic difficulties that didn’t help the mission of bringing good, fast and reliable Internet to everyone. Nevertheless, Sonia — who is Portuguese and moved to the US to study when she was 17 — saw that many African countries like Nigeria began to realize that the Internet is really important for knowledge and also for the possibilities it opens in terms of cultural, financial and societal growth.

Sonia also highlights that there is a big disparity in the world between men and women in terms of Internet access.

David Kiron: The future of work and how AI (and philosophy) can help

Editorial director of MIT Sloan Management Review

Technology will play a significant role in the future of work. In a way, that “future” is already here, but isn’t evenly distributed — and researchers are just beginning to study it. David Kiron goes on to explain the challenge for some people to be “really seen by their leadership when you’re not in the office.”

The former senior researcher at Harvard Business School tells us how companies started valuing employees even more through the pandemic. There’s also an opportunity for different ways of work interaction through digital tools — “Zoom calls aren’t it.” He’s also worried that the pandemic caused a great reset that is driving many out of the workforce entirely: “There’s a trend of working moms opting out,” for example.

About the metaverse and a universe of universes: “If tech leaders spent more time reading philosophy they might have a better sense of where the world is going (…) more and more leaders of companies are taking on the philosopher’s role.”

And how can AI help? “Once you get AI going in a company we saw in our new study that there’s a big bump in morale, collaboration, learning and people’s sense on what they should be doing”. AI can also help better identify talent and match candidates to skills that are already represented in a company, but he also highlights that “humans play a role in all the stages of the hiring and working process.”

David Kiron explains that “if you’re not asking the right questions to your AI teams you’re going to be behind other companies that are doing better questions”. He adds that AI can help with performance, but it also helps “redefine what performance means in your organization by finding other metrics to look at.”

Ana Maiques: neuroscience & women in tech

Co-founder and CEO of neuroscience-based medical device company Neuroelectrics

We talked to Ana about the future of the Internet. She thinks that, moving forward, there will be more fluid interfaces: not limited to computers and smartphones, but spanning different devices that go beyond VR headsets and lead to new types of interactions. In the neuroscience field, she has high hopes for the technology that Neuroelectrics, her company, is developing in Barcelona, Spain. They work with devices that use non-invasive transcranial electrical stimulation to treat brain diseases like epilepsy, depression and Alzheimer’s.

Neuroelectrics is also developing a process called digital copy (for better personalized treatments) that could be useful in the future if someone develops one of these problems. But she says humankind is still very far from the dangers of something like a mind-reading device or the possibility of reading and downloading thoughts and dreams: “it’s fun to think of science fiction possibilities, but we need to act now on things and problems that are affecting us today.”

She also talks about the difficulties of being a woman in the tech business and raising money. “But little by little I see more women and that’s why it’s important to get out there and explain to women that they can do it.”

Siyabulela Mandela: The Internet is a human right

Director for Africa Journalists for Human Rights

The grandson of Nelson Mandela is on a mission to help journalists in Africa to be free to publish human rights stories. He explains how the Internet is critical for this mission and “a human rights issue”. Not only does the Internet give communities access to trustworthy information, but it also helps them become aware of their rights, gives access to financial tools and allows them to grow in our era.

He also highlights how the Internet can be misused, for example when it becomes a vehicle for misinformation, or when governments shut down Internet access to control communities — in Sudan the Internet has been cut off since October 25, 2021 (you can track that information on Cloudflare Radar).

Carlos Moedas: The light (and innovation) in Lisbon

Newly elected Mayor of Lisbon; previous European Commissioner for Research, Science and Innovation

Why is Lisbon attracting so many tech companies and talent? Carlos Moedas welcomes Cloudflare to his city — we’re growing fast in the city, and we have more than 80 job openings in the country. He also talks about why Portugal’s capital is so special and should be considered by company leaders who want to grow innovative companies. Paddy Cosgrave, from the Web Summit, told us something similar four weeks ago.

The ambition? “Make Lisbon the capital of innovation of the world” or, at least, of Europe. The new mayor also has a project called Unicorn Factory to achieve just that.

Sudarsan Reddy: Why is Cloudflare Tunnel relevant?

Cloudflare engineer from the Tunnel Team

Also at the event was our very own engineer Sudarsan Reddy (based in Lisbon). We asked him some questions about Cloudflare Tunnel, our tunneling software that lets you quickly secure and encrypt application traffic to any type of infrastructure, so you can hide your server IP addresses, block direct attacks, and get back to delivering great applications.

Sudarsan focuses on what Tunnel is, why it is relevant, how it works and examples of situations where it can make a difference.

Yusuf Sherwani: Addiction treated online

Co-founder & CEO, Quit Genius

Yusuf graduated as a doctor from Imperial College School of Medicine, in London, but joined two passions, healthcare and technology, when he co-founded Quit Genius. He explains how in just 18 months the pandemic accelerated the adoption of digital health by 10 years, and there’s no going back. “The Internet enables people to unlock improvements to their lives, and digital healthcare went from being convenient to a necessity”.

We dig into the benefits of digital healthcare, but also the scrutiny that is needed in technology, now that it is more powerful than ever and cemented in people’s lives. Yusuf also gives examples of how his digital clinic is helping people in treating tobacco, vaping, alcohol, and opioid addictions.

Yusuf has co-authored 12 peer-reviewed studies on behavioural health and substance addictions. He was featured on the Forbes 30 Under 30 List of 2018 and in Fast Company’s 100 Most Creative People in Business.

David Shrier: From sharing economy to blockchain

American futurist and Professor of Practice, AI & Innovation with Imperial College Business School in London

David sums up how the pandemic has affected people’s relationship with technology: “Everyone is tired of Zoom calls, but the convenience opened people’s minds”.

We also talk about the digital divide, about human-centered ways of working with AI, and we also address the potential in VR and AR and how nobody saw the sharing economy coming 20 years ago and, now, “it’s incredible to see how people embraced blockchain and the digitalization of financial services”.

Dame Til Wykes: The mental health discussion went viral

Professor of Clinical Psychology and Rehabilitation at King’s College London, Director of the NIHR Clinical Research Network: Mental Health

As someone with experience in the psychology field for more than 50 years, Dame Til Wykes still had to learn new ways of engaging with patients throughout the pandemic — and even learn which buttons to push on a computer to make Zoom calls. COVID-19 and the hardships of the pandemic made people more aware and ready to talk about their mental health issues, like anxiety or depression. But the pandemic wasn’t the same for everyone and Dame Til Wykes is worried about some of the effects, “most of them remain to be seen”.

Remote consultations were a big help, but she reminds us that in her field it is important to see the whole person and not just the face. For example, “if someone is tapping a foot nervously while giving us a smile, that tells us something that we cannot see in a Zoom call”. She also mentions that the adoption of meditation apps, which brought a form of help to some, was another positive trend in this difficult period, as was the reset button the pandemic brought to some people’s lives.

Multi-User IP Address Detection

Post Syndicated from Alex Chen original https://blog.cloudflare.com/multi-user-ip-address-detection/

Cloudflare provides our customers with security tools that help them protect their Internet applications against malicious or undesired traffic. Malicious traffic can include scraping content from a website, spamming form submissions, and a variety of other cyberattacks. To protect themselves from these types of threats while minimizing the blocking of legitimate site visitors, Cloudflare’s customers need to be able to identify traffic that might be malicious.

We know some of our customers rely on IP addresses to distinguish between traffic from legitimate users and potentially malicious users. However, in many cases the IP address of a request does not correspond to a particular user or even device. Furthermore, Cloudflare believes that in the long term, the IP address will be an even more unreliable signal for identifying the origin of a request. We envision a day where IP will be completely unassociated with identity. With that vision in mind, multi-user IP address detection represents our first step: pointing out situations where the IP address of a request cannot be assumed to be a single user. This gives our customers the ability to make more judicious decisions when responding to traffic from an IP address, instead of indiscriminately treating that traffic as though it was coming from a single user.

Historically, companies commonly treated IP addresses like mobile phone numbers: each phone number in theory corresponds to a single person. If you get several spam calls within an hour from the same phone number, you might safely assume that phone number represents a single person and ignore future calls or even block that number. Similarly, many Internet security detection engines rely on IP addresses to discern which requests are legitimate and which are malicious.

However, this analogy is flawed and can present a problem for security. In practice, IP addresses are more like postal addresses because they can be shared by more than one person at a time (and because of NAT and CG-NAT the number of people sharing an IP can be very large!). Many existing Internet security tools accept IP addresses as a reliable way to distinguish between site visitors. However, if multiple visitors share the same IP address, security products cannot rely on the IP address as a unique identifying signal. Thousands of requests from thousands of different users need to be treated differently from thousands of requests from the same user. The former is likely normal traffic, while the latter is almost certainly automated, malicious traffic.

For example, if several people in the same apartment building accessed the same site, it’s possible all of their requests would be routed through a middlebox operated by their Internet service provider that has only one IP address. But this sudden series of requests from the same IP address could closely resemble the behavior of a bot. In this case, IP addresses can’t be used by our customers to distinguish this activity from a real threat, leading them to mistakenly block or challenge their legitimate site visitors.

By adding multi-user IP address detection to Cloudflare products, we’re improving the quality of our detection techniques and reducing false positives for our customers.

Examples of Multi-User IP Addresses

Multi-user IP addresses take on many forms. When your company uses an enterprise VPN, for example, employees may share the same IP address when accessing external websites. Other types of VPNs and proxies also place multiple users behind a single IP address.

Another type of multi-user IP address originated from the core communications protocol of the Internet. IPv4 was developed in the 1980s. The protocol uses a 32-bit address space, allowing for over four billion unique addresses. Today, however, there are many times more devices than IPv4 addresses, meaning that not every device can have a unique IP address. Though IPv6 (IPv4’s successor protocol) solves the problem with 128-bit addresses (supporting 2^128 unique addresses), IPv4 still routes the majority of Internet traffic (76% of human-only traffic is IPv4, as shown on Cloudflare Radar).
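
The size difference between the two address spaces is easy to check with Python's standard `ipaddress` module. This is just the arithmetic behind the numbers above; the variable names are our own.

```python
import ipaddress

# The whole IPv4 and IPv6 address spaces, expressed as a single network each.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

# How many IPv6 addresses exist for every single IPv4 address.
ipv6_per_ipv4 = ipv6_total // ipv4_total
```

Here `ipv4_total` is 2^32 (a little under 4.3 billion) and `ipv6_total` is 2^128, so there are 2^96 IPv6 addresses for every IPv4 address.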

To solve this issue, many devices in the same Local Area Network (LAN) can share a single Internet-addressable IP address to communicate with the public Internet, while using private Internet addresses to communicate within the LAN. Since private addresses are to be used only within a LAN, different LANs can number their hosts using the same private IP address space. The Internet gateway of the LAN performs Network Address Translation (NAT): it takes messages that arrive on that single public IP and forwards them to the private IP of the appropriate device on the local network. In effect, it’s similar to how everyone in an office building shares the same street address, and the front desk worker is responsible for sorting out what mail was meant for which person.
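
The front-desk analogy can be sketched as a small translation table. This is a simplified, hypothetical model of port-based NAT: the `NatGateway` class, the starting port 40000, and the addresses are assumptions for illustration, and real gateways also track protocol, remote endpoint, and timeouts.

```python
class NatGateway:
    """Toy NAT: maps (private IP, private port) pairs onto ports of one public IP."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._next_port = first_port
        self._out = {}  # (private_ip, private_port) -> public_port
        self._in = {}   # public_port -> (private_ip, private_port)

    def outbound(self, private_ip, private_port):
        """Return the public (IP, port) that the outside world sees for this flow."""
        key = (private_ip, private_port)
        if key not in self._out:
            self._out[key] = self._next_port
            self._in[self._next_port] = key
            self._next_port += 1
        return (self.public_ip, self._out[key])

    def inbound(self, public_port):
        """Return the private (IP, port) a reply should be forwarded to, if known."""
        return self._in.get(public_port)


gw = NatGateway("203.0.113.7")
a = gw.outbound("192.168.1.10", 51000)
b = gw.outbound("192.168.1.11", 51000)
```

Both devices appear to the outside as `203.0.113.7`, distinguished only by the public port the gateway picked, which is why a remote server cannot tell them apart by IP address alone.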

While NAT allows multiple devices behind the same Internet gateway to share the same public IP address, the explosive growth of the Internet population necessitated further reuse of the limited IPv4 address space. Internet Service Providers (ISPs) required users in different LANs to share the same IP address for their service to scale. Carrier-Grade Network Address Translation (CG-NAT) emerged as another solution for address space reuse. Network operators can use CG-NAT middleboxes to translate hundreds or thousands of private IPv4 addresses into a single public IPv4 address (or a pool of them). However, this sharing is not without side effects. CG-NAT results in IP addresses that cannot be tied to single devices, users, or broadband subscriptions, creating issues for security products that rely on the IP address as a way to distinguish between requests from different users.

What We Built

We built a tool to help our customers detect when a /24 IP prefix (set of IP addresses that have the same first 24 bits) is likely to contain multi-user IP addresses, so they can more finely tune the security rules that protect their websites. In order to identify multi-user IP prefixes, we leverage both internal data and public data sources. Within this data, we look at a few key parameters.

Each TCP connection between a source (client) and a destination (server) is identified by 4 identifiers (source IP, source port, destination IP, destination port)

When an Internet user visits a website, the underlying TCP stack opens a number of connections in order to send and receive data from remote servers. Each connection is identified by a 4-tuple (source IP, source port, destination IP, destination port). Repeating requests from the same web client will likely be mapped to the same source port, so the number of distinct source ports can serve as a good indication of the number of distinct client applications. By counting the number of open source ports for a given IP address, you can estimate whether this address is shared by multiple users.
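
As a rough illustration of the source-port signal, the sketch below counts distinct source ports per source IP from a list of connection 4-tuples. The function name and the sample connections are hypothetical, and any threshold for "shared" would have to be chosen from real data.

```python
from collections import defaultdict


def distinct_source_ports(connections):
    """Count distinct source ports seen for each source IP.

    `connections` is an iterable of 4-tuples:
    (source IP, source port, destination IP, destination port).
    """
    ports = defaultdict(set)
    for src_ip, src_port, _dst_ip, _dst_port in connections:
        ports[src_ip].add(src_port)
    return {ip: len(seen) for ip, seen in ports.items()}


# Sample data using documentation addresses (RFC 5737).
sample = [
    ("198.51.100.23", 50001, "203.0.113.1", 443),
    ("198.51.100.23", 50001, "203.0.113.1", 443),  # same client, same port
    ("198.51.100.23", 50002, "203.0.113.1", 443),
    ("198.51.100.23", 61234, "203.0.113.1", 443),
    ("192.0.2.9", 40000, "203.0.113.1", 443),
]
port_counts = distinct_source_ports(sample)
```

An address with many distinct source ports in a short window is a candidate multi-user IP, while an address with only one or two is more likely a single client.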

User agents provide device-reported information about themselves such as browser and operating system versions. For multi-user IP detection, you can count the number of distinct user agents in requests from a given IP. To avoid overcounting web clients per device, we exclude requests that are identified as triggered by bots and only count requests from user agents that are used by web browsers. There are some tradeoffs to this approach: some users may use multiple web browsers, and some other users may have exactly the same user agent. Nevertheless, past research has shown that the number of unique web browser user agents is the best tradeoff to most accurately determine CG-NAT usage.

Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0

For our inferences, we group IP addresses to their corresponding /24 IP prefix. The figure below shows the distribution of browser User Agents per /24 IP prefix, based on data accumulated over the period of a day. About 35% of the prefixes have more than 100 different browser clients behind them.
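
The grouping and counting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production pipeline: the `BROWSER_TOKENS` substring check is a deliberately crude stand-in for real browser detection, and the `is_bot` flag is assumed to come from some upstream classifier.

```python
import ipaddress
from collections import defaultdict

# Crude, hypothetical heuristic for "looks like a web browser".
BROWSER_TOKENS = ("Firefox/", "Chrome/", "Safari/")


def prefix_of(ip):
    """Return the /24 prefix containing an IPv4 address, e.g. '198.51.100.0/24'."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def browser_agents_per_prefix(requests):
    """Count distinct browser user agents per /24, skipping bot traffic.

    `requests` is an iterable of (ip, user_agent, is_bot) tuples.
    """
    agents = defaultdict(set)
    for ip, user_agent, is_bot in requests:
        if is_bot or not any(token in user_agent for token in BROWSER_TOKENS):
            continue
        agents[prefix_of(ip)].add(user_agent)
    return {prefix: len(seen) for prefix, seen in agents.items()}


sample_requests = [
    ("198.51.100.23", "Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0", False),
    ("198.51.100.77", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36", False),
    ("198.51.100.77", "curl/7.79.1", False),  # not a browser: ignored
    ("198.51.100.9", "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0", True),  # bot: ignored
]
agent_counts = browser_agents_per_prefix(sample_requests)
```

Only the two non-bot browser requests are counted, and both fall into the same /24 even though they come from different addresses.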

Our service also uses other publicly available data sources to further refine the accuracy of our identification and to classify the type of multi-user IP address. For example, we collect data from PeeringDB, which is a database where network operators self-identify their network type, traffic levels, interconnection points, and peering policy. This data only covers a fraction of the Internet’s autonomous systems (ASes). To overcome this limitation, we use this data and our own data (number of requests per AS, number of websites in each AS) to infer AS type. We also use external data sources such as IRR to identify requests from VPNs and proxy servers.

These details (especially AS type) can provide more information on the type of multi-user IP address. For instance, CG-NAT systems are almost exclusively deployed by broadband providers, so by inferring the AS type (ISP, CDN, Enterprise, etc.), we can more confidently infer the type of each multi-user IP address. A scheduled job periodically executes code to pull data from these sources, process it, and write the list of multi-user IP addresses to a database. That IP info data is then ingested by another system that deploys it to Cloudflare’s edge, enabling our security products to detect potential threats with minimal latency.

To validate our inferences for which IP addresses are multi-user, we created a dataset relying on separate data and measurements which we believe are more reliable indicators. One method we used was running traceroute queries through RIPE Atlas, from each RIPE Atlas probe to the probe’s public IP address. By examining the traceroute hops, we can determine if an IP is behind a CG-NAT or another middlebox. For example, if an IP is not behind a CG-NAT, the traceroute should terminate immediately or just have one hop (likely a home NAT). On the other hand, if a traceroute path includes addresses within the RFC 6598 CGNAT prefix or other hops in the private or shared address space, it is likely the corresponding probe is behind CG-NAT.
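
The traceroute check can be approximated with the standard `ipaddress` module. The sketch below is a simplification of the reasoning above: it flags a path if any hop falls in the RFC 6598 shared address space (100.64.0.0/10), or if more than one hop is in RFC 1918 private space, on the assumption that a single private hop is just a home NAT. The function name and the "more than one" cutoff are our own assumptions.

```python
import ipaddress

# Shared address space reserved for CG-NAT (RFC 6598).
CGNAT_SHARED = ipaddress.ip_network("100.64.0.0/10")

# Private address space (RFC 1918).
PRIVATE = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def likely_behind_cgnat(hops):
    """Guess whether a probe sits behind CG-NAT from the hops toward its public IP.

    `hops` is the list of intermediate hop addresses (strings) seen before the
    probe's own public IP is reached.
    """
    addrs = [ipaddress.ip_address(hop) for hop in hops]
    if any(addr in CGNAT_SHARED for addr in addrs):
        return True
    private_hops = sum(any(addr in net for net in PRIVATE) for addr in addrs)
    # One private hop is assumed to be a home router; more suggests another NAT layer.
    return private_hops > 1
```

For example, a path of `["192.168.1.1"]` looks like a plain home NAT, while `["192.168.1.1", "100.72.0.1"]` contains a shared-address hop and is flagged.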

To further improve our validation datasets, we’re also reaching out to our ISP partners to confirm the known IP addresses of CG-NATs. As we refine our validation data, we can more accurately tune our multi-user IP address inference parameters and provide a better experience to ISP customers on sites protected by Cloudflare security products.

The multi-user IP detection service currently recognizes approximately 500,000 unique multi-user IP addresses and is being tuned to further improve detection accuracy. Be on the lookout for an upcoming technical blog post, where we will take a deeper look at the system we built and the metrics collected after running this service for a longer period of time.

How Will This Impact Bot Management and Rate Limiting Customers?

Our initial launch will integrate multi-user IP address detection into our Bot Management and Rate Limiting products.

The three modules that comprise the bot detection system. 

The Cloudflare Bot Management product has five detection mechanisms. The integration will improve three of the five: the machine learning (ML) detection mechanism, the heuristics engine, and the behavioral analysis models. Multi-user IP addresses and their types will serve as additional features to train our ML model. Furthermore, logic will be added to ensure multi-user IP addresses are treated differently in our other detection mechanisms. For instance, our behavioral analysis detection mechanism shouldn’t treat a series of requests from a multi-user IP the same as a series of requests from a single-user IP. There won’t be any new ways to see or interact with this feature, but you should expect to see a decrease in false positive bot detections involving multi-user IP addresses.

The integration with Rate Limiting will allow us to increase the set rate limiting threshold when receiving requests coming from multi-user IP addresses. The factor by which we increase the threshold will be conservative so as not to completely bypass the rate limit. However, the increased threshold should greatly reduce cases where legitimate users behind multi-user IP addresses are blocked or challenged.
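
One way to picture the threshold adjustment is a simple multiplier applied only to multi-user IP addresses. The sketch below is purely illustrative: the function name and the default factor of 2.0 are hypothetical, since the actual factor is not stated here beyond being conservative.

```python
def effective_threshold(base_threshold, is_multi_user, factor=2.0):
    """Return the rate-limit threshold to apply to an IP address.

    Multi-user IPs get a higher threshold, scaled by a conservative factor
    (hypothetical value) so that the limit is raised but never bypassed.
    """
    if factor < 1.0:
        raise ValueError("factor must not lower the configured threshold")
    if is_multi_user:
        return int(base_threshold * factor)
    return base_threshold
```

With a configured limit of 100 requests, a single-user IP is still limited at 100, while a multi-user IP would be limited at 200 under this assumed factor.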

Looking Forward

We plan to further integrate across all of Cloudflare’s products that rely upon IP addresses as a measure of uniqueness, including but not limited to DDoS Protection, Cloudflare One Intel, and Web Application Firewall.

We will also continue to make improvements to our multi-user IP address detection system to incorporate additional data sources and improve accuracy. One data source would allow us to compute the ratio of the estimated number of subscribers to the total number of IPs advertised (owned) by an AS. ASes that have more estimated subscribers than available IPs would have to rely on CG-NAT to provide service to all subscribers.
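
The subscriber-to-address comparison is simple arithmetic, sketched below. The function names and the sample numbers are hypothetical; real subscriber estimates and advertised address counts would come from the data source mentioned above.

```python
def subscriber_to_ip_ratio(estimated_subscribers, advertised_ips):
    """Ratio of estimated subscribers to IP addresses advertised by an AS."""
    if advertised_ips <= 0:
        raise ValueError("an AS must advertise at least one IP address")
    return estimated_subscribers / advertised_ips


def must_share_addresses(estimated_subscribers, advertised_ips):
    """True when there are more subscribers than addresses, so sharing is unavoidable."""
    return subscriber_to_ip_ratio(estimated_subscribers, advertised_ips) > 1.0
```

For instance, an AS with an estimated 1,000,000 subscribers and 262,144 advertised addresses (a /14) has a ratio of roughly 3.8, so it could not give every subscriber a dedicated public IPv4 address.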

As mentioned above, with the help of our ISP partners we hope to improve the validation datasets we use to test and refine the accuracy of our inferences. Additionally, our integration with Bot Management will also unlock an opportunity to create a feedback loop that further validates our datasets. The challenge solve rate (CSR) is a metric generated by Bot Management that indicates the proportion of requests that were challenged and solved (and thus assumed to be human). Examining requests with both high and low CSRs will allow us to check if the multi-user IP addresses we have initially identified indeed represent mostly legitimate human traffic that our customers should not block.
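
As a small worked example, the challenge solve rate can be computed as solved challenges divided by issued challenges. This is our reading of the definition above, and the function name and sample counts are hypothetical.

```python
def challenge_solve_rate(challenged, solved):
    """Fraction of challenged requests that were solved, or None if none were challenged."""
    if solved > challenged:
        raise ValueError("cannot solve more challenges than were issued")
    if challenged == 0:
        return None
    return solved / challenged


# Hypothetical counts for two multi-user IP prefixes.
mostly_human = challenge_solve_rate(challenged=200, solved=150)
mostly_automated = challenge_solve_rate(challenged=200, solved=4)
```

A prefix with a high CSR supports the inference that it carries mostly legitimate human traffic, while a very low CSR suggests the inference deserves a second look.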

The continued adoption of IPv6 might someday make CG-NATs and other IPv4 sharing technologies irrelevant, as the address space will no longer be limited. This could reduce the prevalence of multi-user IP addresses. However, with the development of new networking technologies that obfuscate IP addresses for user privacy (for example, IPv6 randomized address assignment), it seems unlikely it will become any easier to tie an IP address to a single user. Cloudflare firmly believes that eventually, IP will be completely unassociated with identity.

Yet in the short term, we recognize that IP addresses still play a pivotal role for the security of our customers. By integrating this multi-user IP address detection capability into our products, we aim to deliver a more free and fluid experience for everyone using the Internet.

Coalescing Connections to Improve Network Privacy and Performance

Post Syndicated from Talha Paracha original https://blog.cloudflare.com/connection-coalescing-experiments/

Web pages typically have a large number of embedded subresources (e.g., JavaScript, CSS, image files, ads, beacons) that are fetched by a browser on page loads. Requests for these subresources can prompt browsers to perform further DNS lookups, TCP connections, and TLS handshakes, which can have a significant impact on how long it takes for the user to see the content and interact with the page. Further, each additional request exposes metadata (such as plaintext DNS queries, or unencrypted SNI in TLS handshake) which can have potential privacy implications for the user. With these factors in mind, we carried out a measurement study to understand how we can leverage Connection Coalescing (aka Connection Reuse) to address such concerns, and study its feasibility.

Background

The web has come a long way and initially consisted of very simple protocols. One of them was HTTP/1.0, which required browsers to make a separate connection for every subresource on the page. This design was quickly recognized as having significant performance bottlenecks and was extended with HTTP pipelining and persistent connections in the HTTP/1.1 revision, which allowed HTTP requests to reuse the same TCP connection. But, yet again, this was no silver bullet: while multiple requests could share the same connection, they still had to be serialized one after the other, so a client and server could only execute a single request/response exchange at any given time for each connection. As time passed, websites became more complex in structure and dynamic in nature, and HTTP/1.1 was identified as a major bottleneck. The only way to gain concurrency at the network layer was to use multiple TCP connections to the same origin in parallel, but this meant losing most benefits of persistent connections and ended up overloading the origin servers which were unable to meet the concurrency demand.

To address these performance limitations, the SPDY protocol was introduced over a decade later. SPDY supported stream multiplexing, where requests to and responses from the server used a single interleaved TCP connection, and allowed browsers to prioritize requests for critical subresources that were blocking page rendering. A modified variant of SPDY was adopted by the IETF as the basis for HTTP/2 in 2012 and published as RFC 7540 in 2015.

HTTP/2 and onwards retained this new standard for connection reuse. More specifically, all subresources on the same domain were able to reuse the same TCP/TLS (or UDP/QUIC) connection without any head-of-line blocking (at least on the application layer). This resulted in a single connection for all the subresources — reducing extraneous requests on page loads — potentially speeding up some websites and applications.

Interestingly, the protocol has a lesser-known feature to also enable subresources at different hostnames to be fetched over the same connection. We studied the real-world feasibility and benefits of this technique as an effort to improve users’ experience for websites across our network.

Connection Coalescing allows reusing a TLS connection across different domains

Connection Coalescing

The technique is often referred to as Connection Coalescing and, to put it simply, is a way to access resources from different hostnames that are accessible from the same web server.

There are several reasons why a single server could handle requests for different hosts, ranging from low-cost virtual hosting to the usage of CDNs and cloud providers (including Cloudflare, which acts as a reverse proxy for approximately 25 million Internet properties). Before going into the technical conditions required to enable connection coalescing, we should take a look at some benefits such a strategy can provide.

  • Privacy. When resources at different hostnames are loaded via separate TLS connections, those connections expose metadata to ISPs and other observers via the Server Name Indication (SNI) field about the destinations that are being contacted (i.e., in the absence of encrypted SNI). This set of exposed SNIs can allow an on-path adversary to fingerprint traffic and possibly determine user interactions on the webpage. On the other hand, coalescing requests for more than one hostname on a single connection exposes only one destination, and helps avoid such threats.
  • Performance. Additional TLS handshakes and TCP connections can incur significant costs in terms of CPU, memory and other resources. Thus, coalescing requests to use the same connection can optimize resource utilization.
  • Resource Prioritization. Multiplexing requests on a single connection means that applications have better visibility and more direct control over how related resources are prioritized and scheduled. In the absence of coalescing, the network properties (for example, route congestion) can interfere with the intended order of delivery for resources. This reliability gained through connection coalescing opens up new optimization opportunities to improve web page load times, among other things.

However, along with all these potential benefits, connection coalescing also has some associated risk factors that need to be considered in practice. First, TCP incorporates “fair” congestion control mechanisms — if there are ten connections on the same route, each gets approximately 1/10th of the total bandwidth. So with a route congested and bandwidth restricted, a client relying on multiple connections might be better off (for example, if they have five of the ten connections, their total share of bandwidth would be half). Second, browsers will use different parallelization routines for scheduling requests on multiple connections versus the same connection — it is not immediately clear whether the former or latter would perform better. Third, multiple connections exhibit an inherent form of load balancing for TLS-termination processes. That’s because multiple requests on the same connection must be answered by the same TLS-termination process that holds the session keys (often on the same physical server). So, it is important to study connection coalescing carefully before rolling it out widely.
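The bandwidth argument above can be made concrete with a little arithmetic. The sketch below is only an idealized model: it assumes perfectly fair congestion control in which every connection on a route gets an equal slice, and the function name `bandwidth_share` is a hypothetical helper, not anything from the study.

```python
from fractions import Fraction


def bandwidth_share(client_connections: int, total_connections: int) -> Fraction:
    """Idealized share of a congested route's bandwidth held by one client.

    Assumption: congestion control is perfectly "fair", so every connection
    on the route receives an equal slice of the total bandwidth.
    """
    if total_connections <= 0 or not 0 <= client_connections <= total_connections:
        raise ValueError("connection counts are inconsistent")
    return Fraction(client_connections, total_connections)


# The client holds five of the ten connections on the route.
multi = bandwidth_share(5, 10)

# The same client after coalescing its five connections into one: the route
# now carries six connections in total, and the client owns only one of them.
coalesced = bandwidth_share(1, 6)

print(multi, coalesced)  # 1/2 1/6
```

Under this simplified model the coalescing client drops from half of the route to a sixth of it, which is why the effect needed to be measured rather than assumed away.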

With this context in mind, we studied the feasibility of connection coalescing on real-world traffic. More specifically, the two questions we wanted to answer were: (a) can we empirically demonstrate and quantify the theoretical benefits of connection coalescing, and (b) could coalescing cause unintended side effects, such as performance degradation, due to the risks highlighted above?

In order to answer these questions, we first made the observation that a large number of Cloudflare customers request subresources from cdnjs — which is also powered by Cloudflare. For context, cdnjs has public JavaScript and CSS libraries (like jQuery), and is used by more than 12% of all websites on the Internet. One popular way these websites include resources from cdnjs is by using <script src="https://cdnjs.cloudflare.com/..." ></script> HTML tags. But there are other ways as well, such as the usage of XMLHttpRequest or Fetch APIs. Regardless of the way these resources are included, browsers will need to fetch them for completely loading a website.

We then identified a list of approximately four thousand websites using Cloudflare (on the Free plan) that likely used cdnjs. We divided this list of sites into evenly-sized and randomly-picked control and experiment groups. Our plan was to enable coalescing only for the experiment group, so that subresource requests generated from their web pages for cdnjs could reuse existing connections. In this way, we were able to compare results obtained on the experiment group, with the ones for the control group, and attribute any differences observed to connection coalescing.

In order to signal browsers that the requests can be coalesced, we served cdnjs and the sites from the same IP address in a few regions around the world. This meant the same DNS responses for all the zones that were part of the study — eventually load balanced by our Anycast network. These sites also had TLS certificates that included cdnjs.

The above two conditions (same IP and compatible certificate) are required to achieve coalescing as per the HTTP/2 spec. However, the QUIC spec allows coalescing even if only the second condition is met. Major web browsers are yet to adopt the QUIC coalescing mechanism, and currently use only the HTTP/2 coalescing logic for both protocols.

Requests to Experiment Group Zones and cdnjs being coalesced on the same TLS connection
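The two conditions described above (same IP address and a certificate that covers the new hostname) can be sketched as a small decision function. This is a simplified illustration, not browser code: the `Connection` class, `name_matches`, and `can_coalesce` are hypothetical names, the IP addresses and `site.example` are placeholders, and real browsers apply additional checks that are not modeled here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Connection:
    """An already-open TLS connection (hypothetical model for illustration)."""
    server_ip: str
    cert_names: FrozenSet[str]  # DNS names listed in the server certificate


def name_matches(pattern: str, hostname: str) -> bool:
    """Match a certificate name against a hostname, allowing one leading wildcard label."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        # "*.example.com" covers "a.example.com", but not "example.com"
        # and not "a.b.example.com".
        head, _, tail = hostname.partition(".")
        return bool(head) and tail == pattern[2:]
    return pattern == hostname


def can_coalesce(conn: Connection, hostname: str, resolved_ips: Set[str],
                 require_same_ip: bool = True) -> bool:
    """Decide whether a request for `hostname` may reuse `conn`.

    require_same_ip=True models the HTTP/2 conditions (same IP and compatible
    certificate); False models the certificate-only condition that the text
    attributes to the QUIC spec.
    """
    cert_ok = any(name_matches(name, hostname) for name in conn.cert_names)
    if not require_same_ip:
        return cert_ok
    return cert_ok and conn.server_ip in resolved_ips


conn = Connection("192.0.2.10", frozenset({"site.example", "cdnjs.cloudflare.com"}))
print(can_coalesce(conn, "cdnjs.cloudflare.com", {"192.0.2.10"}))          # True
print(can_coalesce(conn, "cdnjs.cloudflare.com", {"192.0.2.99"}))          # False
print(can_coalesce(conn, "cdnjs.cloudflare.com", {"192.0.2.99"}, False))   # True
```

The second call shows why the experiment had to serve cdnjs and the participating sites from the same IP address: a compatible certificate alone is not enough under the HTTP/2 logic.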

Results

We started noticing evidence of real-world coalescing from the day our experiment was launched. The following graph shows that approximately 50% of requests to cdnjs from our experiment group sites are coalesced (i.e., their TLS SNI does not equal cdnjs) as compared to 0% of requests from the control group sites.

Coalesced Requests to cdnjs from Control and Experimental Group Zones
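The classification used for this measurement (a request to cdnjs counts as coalesced when the TLS SNI of its connection is not cdnjs) can be sketched as follows. The record layout of `(http_host, tls_sni)` pairs and the sample records are invented for illustration and are not the study's data or log format.

```python
from typing import Iterable, Tuple

CDNJS_HOST = "cdnjs.cloudflare.com"


def coalesced_fraction(requests: Iterable[Tuple[str, str]]) -> float:
    """Fraction of cdnjs requests that arrived on a connection opened for another host.

    Each record is an (http_host, tls_sni) pair; the layout is an assumption
    made for this sketch.
    """
    cdnjs_requests = [(host, sni) for host, sni in requests if host == CDNJS_HOST]
    if not cdnjs_requests:
        return 0.0
    coalesced = sum(1 for _, sni in cdnjs_requests if sni != CDNJS_HOST)
    return coalesced / len(cdnjs_requests)


# Made-up sample: four cdnjs requests, two of which reused a site's connection.
sample = [
    ("cdnjs.cloudflare.com", "site-a.example"),
    ("cdnjs.cloudflare.com", "cdnjs.cloudflare.com"),
    ("site-a.example", "site-a.example"),
    ("cdnjs.cloudflare.com", "site-b.example"),
    ("cdnjs.cloudflare.com", "cdnjs.cloudflare.com"),
]
print(coalesced_fraction(sample))  # 0.5
```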

In addition, we conducted active measurements using our private WebPageTest instances at the landing pages of experiment and control sites, using two well-supported browsers: Google Chrome and Firefox. From our results, Chrome created about 78% fewer TLS connections to cdnjs for our experiment group sites, as compared to the control group. But surprisingly, Firefox created only about 22% fewer connections. As TLS handshakes are computationally expensive because they involve cryptographic signatures and key exchange algorithms, fewer handshakes meant fewer CPU cycles spent by both the client and the server.

Upon further analysis, we were able to make two observations from the data:

  • A fraction of sites that never coalesced connections with either browser appeared to load subresources with CORS enabled (i.e., <script src="https://cdnjs.cloudflare.com/..." integrity="sha512-894Y..." crossorigin="anonymous">). This is the default way cdnjs recommends inclusion of subresources, as CORS is needed for integrity checks that provide substantial mitigations against script-manipulation attacks. We do not recommend removing this attribute. Our testing also revealed that using XMLHttpRequest or Fetch APIs to load subresources disabled coalescing as well. It is unclear why browsers choose to not coalesce such connections, and we are in contact with the vendors to find out.
  • Although both Firefox and Chrome coalesced requests for cdnjs on existing connections, the discrepancy in the number of TLS connections to cdnjs (approximately 78% vs roughly 22%) arises because Firefox appears to open new connections even if it does not end up using them.

After evaluating the potential benefits of coalescing, we wanted to understand if coalescing caused any unintended side effects. Hence, the final measurement we conducted was to check whether our experiments were detrimental to a website’s performance. We tracked Page Load Times (PLT) and Largest Contentful Paint (LCP) across a variety of simulated network conditions using both Chrome and Firefox and found no statistically significant difference between the experiment and control groups.

Page load times for control and experiment group sites. Each site was loaded once, and the “fullyLoaded” metric from WebPageTest is reported

Conclusion

We consider our experimentation successful in determining the feasibility of connection coalescing and highlighting its potential benefits in terms of privacy and performance. More specifically, we observed the privacy benefits of coalescing in more than 50% of requests to cdnjs from real-world traffic. In addition, our active testing demonstrated that browsers create fewer TLS connections with coalescing enabled. Interestingly, our results also revealed that the benefits might not always occur (i.e., CORS-enabled requests, Firefox creating additional TLS connections despite coalescing). Finally, we did not find any evidence that coalescing can cause harm to real-world users’ experience on the Internet.

Some future directions we would like to explore include:

  • More aggressive connection reuse with multiple hostnames, while identifying conditions most suitable for coalescing.
  • Understanding how different connection reuse methods compare, e.g., IP-based coalescing vs. use of Origin Frames, and what effects they have on user experience over the Internet.
  • Evaluating coalescing support among different browser vendors, and encouraging adoption of HTTP/3 QUIC based coalescing.
  • Reaping the full benefits of connection coalescing by experimenting with custom priority schemes for requests within the same connection.

Please send questions and feedback to [email protected]. We’re excited to continue this line of work in our effort to help build a better Internet! For those interested in joining our team please visit our Careers Page.

Project Myriagon: Cloudflare Passes 10,000 Connected Networks

Post Syndicated from Ticiane Takami original https://blog.cloudflare.com/10000-networks-and-beyond/

During Speed Week, we’ve talked a lot about the products we’ve improved and the places we’ve expanded to. Today, we have a final exciting announcement: Cloudflare now connects with more than 10,000 other networks. Put another way, over 10,000 networks have direct on-ramps to the Cloudflare network.

This is the culmination of a special project we’ve been working on for the last few months dubbed Project Myriagon, a reference to the 10,000-sided polygon of the same name. In going about this project, we have learned a lot about the performance impact of adding more direct connections to our network — in one recent case, we saw a 90% reduction in median round-trip end-user latency.

But to really explain why this is such a big milestone, we first need to explain a bit about how the Internet works.

More roads leading to Rome

The Internet that we all know and rely on is, on a basic level, an interconnected series of independently run local networks. Each network is defined as its own “autonomous system.” These networks are delineated numerically with Autonomous System Numbers, or ASNs. An ASN is like the Internet version of a zip code, a short number directly mapping to a distinct region of IP space using a clearly defined methodology. Network interconnection is all about bringing together different ASNs to exponentially multiply the number of possible paths between source and destination.

Most of us have a home network behind a modem and router, connecting that individual miniature network to an ISP. The ISP then connects with other networks to fetch the web pages or other Internet traffic you request. These networks in turn have connections to different networks, which in turn connect to other interconnected networks, and so on, until your data reaches its destination. The fewer networks your request has to traverse, generally, the lower the end-to-end latency and the odds that something will get lost along the way.

The average number of hops from any one network on the Internet to any other network is around 5.7 and 4.7 for the IPv4 and IPv6 networks, respectively.

Source: https://blog.apnic.net/2020/01/14/bgp-in-2019-the-bgp-table/

How do ASNs work?

ASNs are a key part of BGP, the routing protocol that directs traffic along the Internet. The Internet Assigned Numbers Authority (IANA), the global coordinator of the DNS Root, IP addressing, and other Internet protocol resources like AS Numbers, delegates ASN-making authority to Regional Internet Registries (RIRs), which in turn assign individual ASNs to network operators in line with their regional policies. The five RIRs are AFRINIC, APNIC, ARIN, LACNIC and RIPE, each entitled to assign and attribute ASNs in their respective appointed regions.

Cloudflare’s ASN is 13335, one of the approximately 70,000 ASNs advertised on the Internet. While we’d like to — and plan on — connecting to every one of these ASNs eventually, our team tries to prioritize those with the greatest impact on our overall breadth and improving the proximity to as many people on Earth as possible.

As enabling optimal routes is key to our core business and services, we continuously track how many ASNs we connect to (technically referred to as “adjacent networks”). With Project Myriagon, we aimed to speed up our rate of interconnection and pass 10,000 adjacent networks by the end of the year. By September 2021, we reached that milestone, bringing us from 8,300 at the start of 2020 to over 10,000 today.

As shown in the table below, that milestone is part of a continuous effort towards gradually hitting more of the total advertised ASNs on the Internet.

The Regional Internet Registries and their Regions

Table 1: Cloudflare’s peer ASNs and their respective RIR

Given that there are 70,000+ ASNs out there, you might be wondering: why is 10,000 a big deal? To understand this, we need to look deeply at BGP, the protocol that glues the Internet together. There are three different classes of ASNs:

  • Transit Only ASNs: these networks only provide connectivity to other networks. They don’t have any IP addresses inside their networks. These networks are quite rare, as it’s very unusual to not have any IP addresses inside your network. Instead, these networks are often used primarily for distinct management purposes within a single organization.
  • Origin Only ASNs: these are networks that do not provide connectivity to other networks. They are a stub network, and often, like your home network, only connected to a single ISP.
  • Mixed ASNs: these networks both have IP addresses inside their network, and provide connectivity to other networks.

Origin Only ASNs | Mixed ASNs | Transit Only ASNs
61,127 | 11,128 | 443

Source: https://bgp.potaroo.net/as6447/

One interesting fact: of the 61,127 origin only ASNs, nearly 43,000 are connected only to their ISP. As such, our direct connections to over 10,000 networks indicate that, of the networks that connect to more than one network, a very good percentage are already connected to Cloudflare.
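As a rough back-of-the-envelope check of that claim, the figures above can be combined as follows. This is only an estimate under stated assumptions: "nearly 43,000" and "over 10,000" are treated as exact round numbers, and every network connected to Cloudflare is assumed to be one that connects to more than one network.

```python
# Counts from the table above.
origin_only, mixed, transit_only = 61_127, 11_128, 443
total_asns = origin_only + mixed + transit_only

# "Nearly 43,000" origin only ASNs connect only to their ISP (approximate).
single_homed = 43_000
multi_connected = total_asns - single_homed

# "Over 10,000" networks connect directly to Cloudflare (rounded down).
cloudflare_adjacent = 10_000
share = cloudflare_adjacent / multi_connected

print(total_asns, multi_connected, f"{share:.0%}")  # 72698 29698 34%
```

Under these assumptions, roughly a third of the networks that connect to more than one network would already be adjacent to Cloudflare.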

Cutting out the middle man

Directly connecting to a network, and eliminating the hops in between, can greatly improve performance in two ways. First, connecting with a network directly allows Internet traffic to be exchanged locally rather than detouring through remote cities; second, direct connections help avoid the congestion caused by bottlenecks that sometimes happen between networks.

To take a recent real-world example, turning up a direct peering session with a European network caused a 90% improvement in median end-user latency, from an average of 76ms to an average of 7ms.

Immediate 90% improvement in median end-user latency after peering with a new network. 

By using our own on-ramps to other networks, we both ensure superior performance for our users and avoid adding load and causing congestion on the Internet at large.

And AS13335 is just getting started

Cloudflare is an anycast network, meaning that the better connected we are, the faster and better-protected we are, obviating legacy concepts like scrubbing centers and slow origins. Hitting five digits of connected networks is something we’re really proud of as a step toward our goal of helping to build a better Internet. As we’ve mentioned throughout the week, we’re all about high speed without having to pay a security or reliability cost.

There’s still work to do! While we believe Project Myriagon has made us one of the top 5 most connected networks in the world, we estimate Google is connected to 12,000-15,000 networks. And so, today, we are kicking off Project CatchG. We won’t rest until we’re #1.

Interested in peering with us to help build a better Internet? Reach out to [email protected] with your request. More details on the locations we are present at can be found at http://as13335.peeringdb.com/.

Cloudflare Backbone: A Fast Lane on the Busy Internet Highway

Post Syndicated from Tanner Ryan original https://blog.cloudflare.com/cloudflare-backbone-internet-fast-lane/

The Internet is an amazing place. It’s a communication superhighway, allowing people and machines to exchange exabytes of information every day. But it’s not without its share of issues: whether it’s DDoS attacks, route leaks, cable cuts, or packet loss, the components of the Internet do not always work as intended.

The reason Cloudflare exists is to help solve these problems. As we continue to grow our rapidly expanding global network in more than 250 cities, while directly connecting with more than 9,800 networks, it’s important that our network continues to help bring improved performance and resiliency to the Internet. To accomplish this, we built our own backbone. Other than improving redundancy, the immediate advantage to you as a Cloudflare user? It can reduce your website loading times by up to 45% — and you don’t have to do a thing.

The Cloudflare Backbone

We began building out our global backbone in 2018. It comprises a network of long-distance fiber optic cables connecting various Cloudflare data centers across North America, South America, Europe, and Asia. This also includes Cloudflare’s metro fiber network, directly connecting data centers within a metropolitan area.

Our backbone is a dedicated network, providing guaranteed network capacity and consistent latency between various locations. It gives us the ability to securely, reliably, and quickly route packets between our data centers, without having to rely on other networks.

This dedicated network can be thought of as a fast lane on a busy highway. When traffic in the normal lanes of the highway encounters slowdowns from congestion and accidents, vehicles can make use of a fast lane to bypass the traffic and get to their destination on time.

Our software-defined network is like a smart GPS device, as we’re always calculating the performance of routes between various networks. If a route on the public Internet becomes congested or unavailable, our network automatically adjusts routing preferences in real-time to make use of all routes we have available, including our dedicated backbone, helping to deliver your network packets to the destination as fast as we can.
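The idea of picking the fastest healthy route can be sketched in a few lines. This is a minimal illustration, not Cloudflare's routing software: the `Route` class, its fields, the route names, and the latency figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Route:
    """A candidate path to the destination (hypothetical model)."""
    name: str
    latency_ms: float  # most recent measured latency
    healthy: bool      # False if the path is congested or unavailable


def fastest_healthy_route(routes: Iterable[Route]) -> Optional[Route]:
    """Return the lowest-latency healthy route, or None if no route is healthy."""
    candidates = [route for route in routes if route.healthy]
    return min(candidates, key=lambda route: route.latency_ms, default=None)


# Illustrative measurements only.
routes = [
    Route("public-internet-a", 336.0, healthy=True),
    Route("public-internet-b", 200.0, healthy=False),  # fast but unavailable
    Route("backbone", 230.0, healthy=True),
]
best = fastest_healthy_route(routes)
print(best.name)  # backbone
```

Because the measurements change over time, the selection would need to be re-run whenever fresh latency or health data arrives; the same function could just as easily return a public Internet route.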

Measuring backbone improvements

As we grow our global infrastructure, it’s important that we analyze our network to quantify the impact we’re having on performance.

Here’s a simple, real-world test we’ve used to validate that our backbone helps speed up our global network. We deployed a simple API service hosted on a public cloud provider, located in Chicago, Illinois. Once placed behind Cloudflare, we performed benchmarks from various geographic locations with the backbone disabled and enabled to measure the change in performance.

Instead of comparing the difference in latency our backbone creates, it is important that our experiment captures a real-world performance gain that an API service or website would experience. To validate this, our primary metric is measuring the average request time when accessing an API service from Miami, Seattle, San Jose, São Paulo, and Tokyo. To capture the response of the network itself, we disabled caching on the Cloudflare dashboard and sent 100 requests from each testing location, both while forcing traffic through our backbone, and through the public Internet.
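A measurement loop of this shape can be sketched as follows. This is not the harness used for the test: `average_request_time_ms` is a hypothetical helper, and the request itself is passed in as a callable so the sketch stays independent of any particular HTTP client or endpoint.

```python
import time
from statistics import mean
from typing import Callable


def average_request_time_ms(send_request: Callable[[], object], samples: int = 100) -> float:
    """Average wall-clock time, in milliseconds, of `samples` sequential requests."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    durations_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        send_request()  # one full request/response exchange
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(durations_ms)


# Stand-in for a real request: sleep for about one millisecond.
demo_avg_ms = average_request_time_ms(lambda: time.sleep(0.001), samples=5)
print(f"{demo_avg_ms:.2f} ms")
```

Running the same loop once with traffic forced through the backbone and once over the public Internet would give the two averages that are compared per location.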

Now, before we claim our backbone solves all Internet problems, you will notice that for some tests (Seattle, WA and San Jose, CA), there was actually an increase in response time when we forced traffic through the backbone. Since latency is directly proportional to the distance of fiber optic cables, and since we have over 9,800 direct connections with other Internet networks, there is a possibility that an uncongested path on the public Internet might be geographically shorter, making it faster than our backbone.

Luckily for us, we have technologies like Argo Smart Routing, Argo Tiered Caching, WARP+, and the most recently announced Orpheus, which dynamically calculate the performance of each route at our data centers, choosing the fastest healthy route at that time. What might be the fastest path during this test may not be the fastest at the time you are reading this.

With that disclaimer out of the way, now onto the test.

With the backbone disabled, if a visitor from São Paulo performed a request to our service, they would be routed to our São Paulo data center via BGP Anycast. With caching disabled, our São Paulo data center forwarded the request over the public Internet to the origin server in Chicago. On average, the entire process to fetch data from the origin server and return the response to the requesting user took 335.8 milliseconds.

Once the backbone was enabled and requests were created, our software performed tests to determine the fastest healthy route to the origin, whether it was a route on the public Internet or through our private backbone. For this test the backbone was faster, resulting in an average total request time of 230.2 milliseconds. Just by routing the request through our private backbone, we improved the average response time by 31%.

We saw even better improvement when testing from Tokyo. When routing the request over the public Internet, the request took an average of 424 milliseconds. By enabling our backbone, which created a faster path, the request took an average of 234 milliseconds, an average response time improvement of 44.8%.

Visitor Location | Distance to Chicago | Avg. response time using public Internet (ms) | Avg. response time using backbone (ms) | Change in response time
Miami, FL, US | 1917 km | 84 | 75 | 10.7% decrease
Seattle, WA, US | 2785 km | 118 | 124 | 5.1% increase
San Jose, CA, US | 2856 km | 122 | 132 | 8.2% increase
São Paulo, BR | 8403 km | 336 | 230 | 31.5% decrease
Tokyo, JP | 10129 km | 424 | 234 | 44.8% decrease
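The last column of the table follows directly from the two averages. As a quick check, the short sketch below recomputes it from the rounded millisecond values shown; `percent_change` is a helper name introduced here for illustration.

```python
def percent_change(public_ms: float, backbone_ms: float) -> float:
    """Change in response time relative to the public Internet path, in percent.

    Negative values mean the backbone was faster.
    """
    return (backbone_ms - public_ms) / public_ms * 100.0


# (public Internet ms, backbone ms), copied from the table.
results = {
    "Miami": (84, 75),
    "Seattle": (118, 124),
    "San Jose": (122, 132),
    "São Paulo": (336, 230),
    "Tokyo": (424, 234),
}

for city, (public_ms, backbone_ms) in results.items():
    change = percent_change(public_ms, backbone_ms)
    direction = "decrease" if change < 0 else "increase"
    print(f"{city}: {abs(change):.1f}% {direction}")
# Miami: 10.7% decrease
# Seattle: 5.1% increase
# San Jose: 8.2% increase
# São Paulo: 31.5% decrease
# Tokyo: 44.8% decrease
```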

We also observed a smaller deviation in the response time of packets routed through our backbone over larger distances.

Our next generation network

Cloudflare is built on top of lossy, unreliable networks that we do not have control over. It’s our software that turns these traditional tubes of the Internet into a smart, high performing, and reliable network Cloudflare customers get to use today. Coupled with our new, but rapidly expanding backbone, it is this software that produces significant performance gains over traditional Internet networks.

Whether you visit a website powered by Cloudflare’s Argo Smart Routing, Argo Tiered Caching, Orpheus, or use our 1.1.1.1 service with WARP+ to access the Internet, you get direct access to the Internet fast lane we call the Cloudflare backbone.

For Cloudflare, a better Internet means improving Internet security, reliability, and performance. The backbone gives us the ability to build out our network in areas that have typically lacked infrastructure investments by other networks. Even with issues on the public Internet, these initiatives allow us to be located within 50 milliseconds of 95% of the Internet-connected population.

In addition to our growing global infrastructure providing 1.1.1.1, WARP, Roughtime, NTP, IPFS Gateway, Drand, and F-Root to the greater Internet, it’s important that we extend our services to those who are most vulnerable. This is why we extend all our infrastructure benefits directly to the community, through projects like Galileo, Athenian, Fair Shot, and Pangea.

And while these thousands of fiber optic connections are already fixing today’s Internet issues, we truly are just getting started.

Want to help build the future Internet? Networks that are faster, safer, and more reliable than they are today? The Cloudflare Infrastructure team is currently hiring!

If you operate an ISP or transit network and would like to bring your users faster and more reliable access to websites and services powered by Cloudflare’s rapidly expanding network, please reach out to our Edge Partnerships team at [email protected].

Introducing Project Fair Shot: Ensuring COVID-19 Vaccine Registration Sites Can Keep Up With Demand

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/project-fair-shot/

Around the world, government and medical organizations are struggling with one of the most difficult logistics challenges in history: equitably and efficiently distributing the COVID-19 vaccine. There are challenges around communicating who is eligible to be vaccinated, registering those who are eligible for appointments, ensuring they show up for their appointments, transporting the vaccine under the required handling conditions, ensuring that there are trained personnel to administer the vaccine, and then doing it all over again as most of the vaccines require two doses.

Cloudflare can’t help with most of that problem, but there is one key part that we realized we could help facilitate: ensuring that registration websites don’t crash under load when they first begin scheduling vaccine appointments. Project Fair Shot provides Cloudflare’s new Waiting Room service for free for any government, municipality, hospital, pharmacy, or other organization responsible for distributing COVID-19 vaccines. It is open to eligible organizations around the world and will remain free until at least July 1, 2021 or longer if there is still more demand for appointments for the vaccine than there is supply.

Crashing Registration Websites

The problem of vaccine scheduling registration websites crashing under load isn’t theoretical: it is happening over and over as organizations attempt to schedule the administration of the vaccine. This hit home at Cloudflare last weekend. The wife of one of our senior team members was trying to register her parents to receive the vaccine. They met all the criteria and the municipality where they lived was scheduled to open appointments at noon.

When the time came for the site to open, it immediately crashed. The cause wasn’t hackers or malicious activity. It was merely that so many people were trying to access the site at once. “Why doesn’t Cloudflare build a service that organizes a queue into an orderly fashion so these sites don’t get overwhelmed?” she asked her husband.

A Virtual Waiting Room

Turns out, we were already working on such a feature, but not for this use case. The problem of fairly distributing something where there is more demand than supply comes up with several of our clients. Whether selling tickets to a hot concert, the latest new sneaker, or access to popular national park hikes, it is a difficult challenge to ensure that everyone eligible has a fair chance.

The solution is to open registration to acquire the scarce item ahead of the actual sale. Anyone who visits the site ahead of time can be put into a queue. The moment before the sale opens, the order of the queue can be randomly (and fairly) shuffled. People can then be let in in order of their new, random position in the queue — allowing only so many at any time as the backend of the site can handle.
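The shuffle-then-admit idea can be sketched as follows. This is a simplified illustration of the approach described above, not the Waiting Room implementation: the function names and visitor IDs are hypothetical, and admitting fixed-size batches stands in for "only so many at any time as the backend can handle."

```python
import random
from typing import List, Sequence


def shuffled_queue(early_visitors: Sequence[str], rng: random.Random) -> List[str]:
    """Return the early visitors in a uniformly random order.

    Arrival order before the sale opens is deliberately ignored, so showing
    up first gives no advantage.
    """
    queue = list(early_visitors)
    rng.shuffle(queue)
    return queue


def admit_in_batches(queue: Sequence[str], capacity: int) -> List[List[str]]:
    """Split the queue into groups no larger than the backend's capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return [list(queue[i:i + capacity]) for i in range(0, len(queue), capacity)]


visitors = [f"visitor-{n}" for n in range(1, 8)]    # seven early arrivals
queue = shuffled_queue(visitors, random.Random())   # shuffled just before opening
batches = admit_in_batches(queue, capacity=3)

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Every early visitor ends up in exactly one batch, and batch sizes never exceed the configured capacity; only the order is left to chance. A deployment that needs unpredictable ordering could pass `random.SystemRandom()` as `rng` instead.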

At Cloudflare, we were building this functionality for our customers as a feature called Waiting Room. (You can learn more about the technical details of Waiting Room in this post by Brian Batraski who helped build it.) The technology is powerful because it can be used in front of any existing web registration site without needing any code changes or hardware installation. Simply deploy Cloudflare through a simple DNS change and then configure Waiting Room to ensure any transactional site, no matter how meagerly resourced, can keep up with demand.

Recognizing a Critical Need; Moving Up the Launch

We planned to release it in February. Then, when we saw vaccine sites crashing under load and frustration building among people eligible for the vaccine, we realized we needed to move the launch up and offer the service for free to organizations struggling to fairly distribute the vaccine. With that, Project Fair Shot was born.

Government, municipal, hospital, pharmacy, clinic, and any other organizations charged with scheduling appointments to distribute the vaccine can apply to participate in Project Fair Shot by visiting: projectfairshot.org

Giving Front Line Organizations the Technical Resources They Need

The service will be free for qualified organizations until at least July 1, 2021, or longer if there is still more demand for appointments for the vaccine than there is supply. We are not experts in medical cold storage and I get squeamish at the sight of needles, so we can’t help with many of the logistical challenges of distributing the vaccine. But, seeing how we could support this aspect, our team knew we needed to do all we could to help.

The superheroes of this crisis are the medical professionals who are taking care of the sick and the scientists who so quickly invented these miraculous vaccines. We’re proud of the supporting role Cloudflare has played helping ensure the Internet has continued to function well when the world needed it most. Project Fair Shot is one more way we are living up to our mission of helping build a better Internet.

Cloudflare Waiting Room

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/cloudflare-waiting-room/

Today, we are excited to announce Cloudflare Waiting Room! It will first be available to select customers through a new program called Project Fair Shot, which aims to help with the problem of overwhelming demand for COVID-19 vaccinations causing appointment registration websites to fail. General availability in our Business and Enterprise plans will be added in the near future.

Wait, you’re excited about a… Waiting Room?

Most of us are familiar with the concept of a waiting room, and rarely are we excited about the idea of being in one. Usually our first experience of one is at a doctor’s office — yes, you have an appointment, but sometimes the doctor is running late (or one of the patients was). Given the doctor can only see one person at a time… the waiting room was born, as a mechanism to queue up patients.

While servers can handle more concurrent requests than a doctor can, they too can be overwhelmed. If, in a pre-COVID world, you’ve ever tried buying tickets to a popular concert or event, you’ve probably encountered a waiting room online. It limits inbound requests to an application and places them in a virtual queue. Once the number of users in the application has dropped, new users are let in within the defined thresholds the application can handle. This protects the origin servers supporting the application from being inundated with too many requests, while also ensuring equity from a user perspective: users who try to access a resource when the system is overloaded are not unfairly dropped and forced to reconnect, hoping for another chance to join the queue.

Why Now?

Given not many of us are going to live concerts any time soon, why is Cloudflare doing this now?

Well, perhaps we aren’t going to concerts, but the second-order effects of COVID-19 have created a huge need for waiting rooms. First of all, given social distancing and the closing of many places of business and government, customers and citizens have shifted to online channels, putting substantially more strain on business and government infrastructure.

Second, the pandemic and the flow-on consequences of it have meant many folks around the world have come to rely on resources that they didn’t need twelve months earlier. To be specific, these are often health or government-related resources — for example, unemployment insurance websites. The online infrastructure was set up to handle a peak load that didn’t foresee the impact of COVID-19. We’re seeing a similar pattern emerge with websites that are related to vaccines.

Historically, the number of organizations that needed waiting rooms was quite small. Most online businesses see a fairly consistent user load rather than huge crushes of people all at once. The organizations that did need waiting rooms were able to build custom ones that were integrated deeply into their application (for example, buying tickets). With Cloudflare’s Waiting Room, no changes to the application are necessary, and a Waiting Room can be set up in a matter of minutes for any website without writing a single line of code.

Whether you are an engineering architect or a business operations analyst, setting up a Waiting Room is simple. We make it quick and easy to ensure your applications are reliable and protected from unexpected spikes in traffic. Other features we felt were important are automatic enablement and dynamic outflow. In other words, a waiting room should turn on automatically when thresholds are exceeded and, as users finish their tasks in the application, release variably sized batches of users from the queue into the application. It should just work. Lastly, we’ve seen the major impact COVID-19 has had on users and businesses alike, especially, but not limited to, the health and government sectors. We wanted to provide another way to ensure these applications remain available and functional so all users can receive the care that they need and not errors within their browser.
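To make "automatic enablement" and "dynamic outflow" concrete, here is a minimal sketch in Python. It is a toy model under assumptions of our own, not the product's actual logic: the `RoomState` class, the `is_queueing` and `tick` functions, and the one-minute time step are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RoomState:
    total_active_limit: int    # most users the application can serve at once
    new_users_per_minute: int  # most users admitted in any one minute
    active_users: int = 0      # users currently inside the application
    queued_users: int = 0      # users currently waiting


def is_queueing(state: RoomState) -> bool:
    """Automatic enablement: the room is on when anyone waits or no slot is free."""
    return state.queued_users > 0 or state.active_users >= state.total_active_limit


def tick(state: RoomState, arrivals: int, departures: int) -> int:
    """Advance one minute and return how many waiting users were admitted."""
    state.active_users = max(0, state.active_users - departures)
    state.queued_users += arrivals
    free_slots = max(0, state.total_active_limit - state.active_users)
    # Dynamic outflow: the batch size depends on how many users just left.
    admitted = min(state.queued_users, free_slots, state.new_users_per_minute)
    state.queued_users -= admitted
    state.active_users += admitted
    return admitted
```

In this model, light traffic passes straight through and the room stays off; once arrivals outpace the per-minute rate or the application fills up, a queue forms, and it drains in batches whose size tracks how many users finish.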

How does Cloudflare’s Waiting Room work?

We built Waiting Room on top of our edge network and our Workers product. By leveraging Workers and our new Durable Objects offerings, we were able to remove the need for any customer coding and provide a seamless, out of the box product that will ‘just work’. On top of this, we get the benefits of the scale and performance of our Workers product to ensure we maintain extremely low latency overhead, keep the estimated wait times presented to end users as accurate as possible, and avoid keeping any user in the queue longer than needed. But building a centralized system in a decentralized network is no easy task. When requests come into an application from around the world, we need to be able to get a broad, accurate view of what that load looks like inbound and outbound to a given application.

Request going through Cloudflare without a Waiting Room

These requests, as fast as they are, still take time to travel across the planet. And so, a unique edge case was presented. What if a website is getting reasonable traffic from North America and Europe, but then a sudden major spike of traffic takes place from South America – how do we know when to keep letting users into the application and when to kick in the Waiting Room to protect the origin servers from being overloaded?

Thanks to some clever engineering and our Workers product, we were able to create a system that almost immediately keeps itself synced with global demand for an application, giving us the insight we need into when we should and should not be queueing users into the Waiting Room. By leveraging our global Anycast network and more than 200 data centers, we remove any single point of failure to protect our customers’ infrastructure while still providing a great experience to end users who have to wait a short time to enter the application under high load.
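The South America example can be pictured with a small Python sketch. It shows only the idea of deciding from a worldwide total rather than from any one location's traffic; it says nothing about how the synchronization itself is engineered. The `GlobalLoadView` class, its methods, and the location names and numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalLoadView:
    """Hypothetical aggregate of per-location counts for one application."""
    total_active_limit: int
    active_by_location: dict = field(default_factory=dict)

    def report(self, location: str, active_users: int) -> None:
        """Record the latest active-user count seen at one location."""
        self.active_by_location[location] = active_users

    def global_active_users(self) -> int:
        return sum(self.active_by_location.values())

    def should_queue(self) -> bool:
        """Queue new arrivals once the worldwide total reaches the limit."""
        return self.global_active_users() >= self.total_active_limit


view = GlobalLoadView(total_active_limit=1000)
view.report("north-america", 400)
view.report("europe", 350)
print(view.should_queue())  # False: 750 active users against a limit of 1000

view.report("south-america", 600)
print(view.should_queue())  # True: 1350 active users against a limit of 1000
```

No single location exceeds the limit on its own here; only the combined view reveals that the origin is about to be overloaded, which is why each location needs a timely picture of global demand.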

Request going through Cloudflare with a Waiting Room

How to set up a Waiting Room

Setting up a Waiting Room is incredibly easy and very fast! At the simplest end of the scale, a user needs to fill out only five fields: 1) the name of the Waiting Room, 2) a hostname (which will already be pre-populated with the zone it’s being configured on), 3) the total active users that can be in the application at any given time, 4) the new users per minute allowed into the application, and 5) the session duration for any given user. No coding or any application changes are necessary.
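Although the setup itself requires no code, it can help to picture the five fields as a small record. The Python sketch below is purely illustrative: the `WaitingRoomSettings` class and its field names are our own labels for the five fields, and `estimated_wait_minutes` is an assumed back-of-the-envelope formula, not the estimate the product computes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WaitingRoomSettings:
    name: str                      # 1) name of the Waiting Room
    hostname: str                  # 2) hostname it protects
    total_active_users: int        # 3) users allowed in the application at once
    new_users_per_minute: int      # 4) users admitted per minute
    session_duration_minutes: int  # 5) session duration for any given user

    def __post_init__(self):
        if not self.name or not self.hostname:
            raise ValueError("name and hostname are required")
        for field_name in ("total_active_users", "new_users_per_minute",
                           "session_duration_minutes"):
            if getattr(self, field_name) <= 0:
                raise ValueError(f"{field_name} must be positive")


def estimated_wait_minutes(settings: WaitingRoomSettings, position_in_queue: int) -> int:
    """Rough wait, assuming admissions proceed at the full per-minute rate."""
    return math.ceil(position_in_queue / settings.new_users_per_minute)


settings = WaitingRoomSettings(
    name="vaccine-signup",
    hostname="example.com",
    total_active_users=500,
    new_users_per_minute=100,
    session_duration_minutes=5,
)
print(estimated_wait_minutes(settings, position_in_queue=250))  # 3
```

The two numeric limits do most of the work: the total active users caps concurrent load on the origin, while the new users per minute smooths how quickly the queue drains into the application.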

We provide the option of using our default Waiting Room template for customers who don’t want to add additional branding. This simplifies the process of getting a Waiting Room up and running.

That’s it! Press save and the Waiting Room is ready to go!

For customers with more time and technical ability, the same process is followed, except we give full customization capabilities to our users so they can brand the Waiting Room, ensuring it matches the look and feel of their overall product.

Lastly, managing different Waiting Rooms is incredibly easy. With our Manage Waiting Room table, at a glance you are able to get a full snapshot of which rooms are actively queueing, not queueing, and/or disabled.

We are very excited to put the power of our Waiting Room into the hands of our customers to ensure they continue to focus on their businesses and customers. Keep an eye out for another blog post coming soon with major updates to our Waiting Room product for Enterprise!