Custom Headers for Cloudflare Pages

Post Syndicated from Nevi Shah original https://blog.cloudflare.com/custom-headers-for-pages/

Until today, Cloudflare Workers has been a great solution for setting headers, but we wanted to create an even smoother developer experience. Today, we’re excited to announce that Pages now natively supports custom headers on your projects! Simply create a _headers file in the build directory of your project and, within it, define the rules you want to apply.

  X-Hiring: Looking for a job? We're hiring engineers

What can you set with custom headers?

Being able to set custom headers is useful for a variety of reasons — let’s explore some of your most popular use cases.

Search Engine Optimization (SEO)

When you create a Pages project, a pages.dev deployment is created for your project which enables you to get started immediately and easily preview changes as you iterate. However, we realize this poses an issue — publishing multiple copies of your website can harm your rankings in search engine results. One way to solve this is by disabling indexing on all pages.dev subdomains, but we see many using their pages.dev subdomain as their primary domain. With today’s announcement you can attach headers such as X-Robots-Tag to hint to Google and other search engines how you’d like your deployment to be indexed.

For example, to prevent your pages.dev deployment from being indexed, you can add the following to your _headers file:

  X-Robots-Tag: noindex


Security

Customizing headers doesn’t just help with your site’s search result ranking; a number of browser security features can also be configured with headers. A few headers that can enhance your site’s security are:

  • X-Frame-Options: You can prevent click-jacking by informing browsers not to embed your application inside another (e.g. with an <iframe>).
  • X-Content-Type-Options: nosniff: To prevent browsers from interpreting a response as any content type other than the one defined with the Content-Type header.
  • Referrer-Policy: This allows you to customize how much information visitors give about where they’re coming from when they navigate away from your page.
  • Permissions-Policy: Browser features can be disabled to varying degrees with this header (recently renamed from Feature-Policy).
  • Content-Security-Policy: And if you need fine-grained control over the content in your application, this header allows you to configure a number of security settings, including similar controls to the X-Frame-Options header.

You can configure these headers to protect an /app/* path, with the following in your _headers file:

/app/*
  X-Frame-Options: DENY
  X-Content-Type-Options: nosniff
  Referrer-Policy: no-referrer
  Permissions-Policy: document-domain=()
  Content-Security-Policy: script-src 'self'; frame-ancestors 'none';


CORS

Modern browsers implement a security mechanism called CORS, or Cross-Origin Resource Sharing, which governs when a page on one origin may read responses from another. By default, browsers restrict such cross-origin requests, which prevents one domain from being able to force a user’s action on another. Without this protection, a malicious site owner might be able to do things like make requests to unsuspecting visitors’ banks and initiate a transfer on their behalf.

There are, however, some cases where it is safe to allow these cross-origin requests. So-called “simple requests” (such as linking to an image hosted on a different domain) are permitted by the browser. Fetching these resources dynamically is often where the difficulty arises, and the browser is sometimes overzealous in its protection. Simple static assets on Pages are safe to serve to any domain, since the request takes no action and there is no visitor session. Because of this, a domain owner can attach CORS headers in the _headers file to specify exactly which requests are allowed, for fine-grained and explicit control.

For example, the use of the asterisk will enable any origin to request any asset from your Pages deployment:

  Access-Control-Allow-Origin: *

To be more restrictive and limit requests to only be allowed from a ‘staging’ subdomain, we can do the following:

  Access-Control-Allow-Origin: https://staging.:project.pages.dev
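To make the effect of these two values concrete, here is a minimal Python sketch of the check a browser performs on the Access-Control-Allow-Origin response header. The function name and the sample origins are hypothetical, and the sketch deliberately ignores credentialed requests, preflight requests and the other Access-Control-* headers.

```python
from typing import Optional


def origin_allowed(allow_origin_header: Optional[str], request_origin: str) -> bool:
    """Simplified check for a non-credentialed cross-origin request.

    allow_origin_header is the value of Access-Control-Allow-Origin on the
    response (None if the header is absent); request_origin is the Origin
    of the page making the request.
    """
    if allow_origin_header is None:
        return False  # no CORS header: the response stays unreadable
    value = allow_origin_header.strip()
    if value == "*":
        return True  # wildcard: any origin may read the response
    return value == request_origin  # otherwise the origins must match exactly


if __name__ == "__main__":
    staging = "https://staging.my-project.pages.dev"  # hypothetical project name
    print(origin_allowed("*", "https://example.com"))
    print(origin_allowed(staging, staging))
    print(origin_allowed(staging, "https://example.com"))
```

With the wildcard rule every origin passes the check; with the staging rule only the staging subdomain does.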

How we built support for custom headers

To support all these use cases for custom headers, we had to build a new engine to determine which rules to apply for each incoming request. Backed, of course, by Workers, this engine supports splats and placeholders, and allows you to include those matched values in your headers.
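As a rough illustration of what splat and placeholder matching involves, the Python sketch below compiles a path rule into a regular expression and substitutes the captured values into a header value. It is not the Pages engine: the function names are hypothetical, it handles paths only, and it assumes at most one splat per rule, that a placeholder matches a single path segment, and that the splat is exposed under the name "splat".

```python
import re
from typing import Dict, Optional

# A rule token is either a splat (*) or a placeholder such as :year.
_TOKEN = re.compile(r"(\*)|:([A-Za-z]\w*)")


def compile_rule(pattern: str) -> "re.Pattern[str]":
    """Turn a rule such as '/blog/:year/*' into an anchored regex."""
    parts = []
    pos = 0
    for m in _TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text
        if m.group(1):
            parts.append("(?P<splat>.*)")  # splat: matches anything, greedily
        else:
            parts.append(f"(?P<{m.group(2)}>[^/]+)")  # placeholder: one segment
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("^" + "".join(parts) + "$")


def match_rule(pattern: str, path: str) -> Optional[Dict[str, str]]:
    """Return the captured values if path matches the rule, else None."""
    m = compile_rule(pattern).match(path)
    return m.groupdict() if m else None


def render(value: str, captures: Dict[str, str]) -> str:
    """Replace ':name' references in a value with the captured text."""
    return re.sub(
        r":([A-Za-z]\w*)",
        lambda m: captures.get(m.group(1), m.group(0)),
        value,
    )


if __name__ == "__main__":
    captures = match_rule("/blog/:year/*", "/blog/2021/custom-headers/intro")
    print(captures)
    print(render("posts-from-:year", captures or {}))
```

Unknown names are left untouched by render, so a literal colon in a value (as in "https://") is not mistaken for a placeholder.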

Although we don’t support all of its features, we’ve modeled this matching engine after the URLPattern specification which was recently shipped with Chrome 95. We plan to be able to fully implement this specification for custom headers once URLPattern lands in the Workers runtime, and there should hopefully be no breaking changes to migrate.

Enhanced support for redirects

With this same engine, we’re bringing these features to your _redirects file as well. You can now configure your redirects with splats, placeholders and status codes as shown in the example below:

/blog/* https://blog.example.com/:splat 301
/products/:code/:name /products?name=:name&code=:code
/submit-form https://static-form.example.com/submit 307
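The three-column layout above can be read mechanically. The Python sketch below parses such lines into structured rules; the names are hypothetical, and two details are assumptions rather than something stated here: that a rule without a status code falls back to 302, and that lines starting with # are comments.

```python
from typing import List, NamedTuple


class Redirect(NamedTuple):
    source: str
    destination: str
    status: int


DEFAULT_STATUS = 302  # assumption: status used when a line omits the code


def parse_redirects(text: str) -> List[Redirect]:
    """Parse '<source> <destination> [status]' lines into Redirect rules."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' starts a comment
            continue
        fields = line.split()
        if len(fields) not in (2, 3):
            raise ValueError(f"malformed redirect rule: {raw!r}")
        status = int(fields[2]) if len(fields) == 3 else DEFAULT_STATUS
        rules.append(Redirect(fields[0], fields[1], status))
    return rules


if __name__ == "__main__":
    sample = (
        "/blog/* https://blog.example.com/:splat 301\n"
        "/products/:code/:name /products?name=:name&code=:code\n"
        "/submit-form https://static-form.example.com/submit 307\n"
    )
    for rule in parse_redirects(sample):
        print(rule)
```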

Get started

Custom headers and redirects for Cloudflare Pages can be configured today. Check out our documentation to get started, and let us know how you’re using it in our Discord server. We’d love to hear about what this unlocks for your projects!

Coming up…

And finally, if a _headers file and enhanced support for _redirects just isn’t enough for you, we also have something big coming very soon which will give you the power to build even more powerful projects. Stay tuned!

Cloudflare for SaaS for All, now Generally Available!

Post Syndicated from Dina Kozlov original https://blog.cloudflare.com/cloudflare-for-saas-for-all-now-generally-available/

During Developer Week a few months ago, we opened up the Beta for Cloudflare for SaaS: a one-stop shop for SaaS providers looking to provide fast load times, unparalleled redundancy, and the strongest security to their customers.

Since then, we’ve seen numerous developers integrate with our technology, allowing them to spend their time building out their solution instead of focusing on the burdens of running a fast, secure, and scalable infrastructure — after all, that’s what we’re here for.

Today, we are very excited to announce that Cloudflare for SaaS is generally available, so that every customer, big and small, can use Cloudflare for SaaS to continue scaling and building their SaaS business.

What is Cloudflare for SaaS?

If you’re running a SaaS company, you have customers that are fully reliant on you for your service. That means you’re responsible for keeping their domain fast, secure, and protected. But this isn’t simple. There’s a long checklist you need to get through to put a solution in your customers’ hands:

  • Set up an origin server
  • Encrypt your customers’ traffic
  • Keep your customers online
  • Boost the performance of global customers
  • Support vanity domains
  • Protect against attacks and bots
  • Scale for growth
  • Provide insights and analytics

And on top of that, you need to also focus on building out your solution and your business. As a developer or startup with limited resources, this can delay your product launch by weeks or months.

That’s what we’re here to help with! We have numerous engineering teams whose sole focus is to work on products that take care of each one of these tasks, so you don’t have to!

The Cloudflare solution:

  • Set up an origin server → Workers
  • Encrypt your customers’ traffic → SSL for SaaS
  • Keep your customers online → Cloudflare’s global Anycast network
  • Boost the performance of global customers → Argo Smart Routing/Cache
  • Support vanity domains → Custom Hostnames
  • Protect against attacks and bots → WAF and Bot Management
  • Scale for growth → Workers
  • Provide insights and analytics → Custom Hostname Analytics

Pricing, Made for Developers

Starting today, Cloudflare for SaaS is available to purchase on Free, Pro, and Business plans. We wanted to make sure that the pricing made sense for developers. At the time of building, you don’t know how many customers you’ll have, so we wanted to offer flexibility by keeping the pricing as simple as possible: you only pay for the customer domains you actually onboard.

Each customer domain using the service is called a Custom Hostname. For each Custom Hostname, we automatically provision a TLS certificate. But not just that! Beyond the TLS certificate, each of your Custom Hostnames inherits the full suite of Cloudflare products that you set up on your SaaS zone. From Bot Management to Argo Smart Routing, you can extend these add-ons that protect and accelerate your domain to your customers.

Each Custom Hostname costs two dollars per month. We will only charge you after a Custom Hostname has been onboarded, and the charge is prorated according to when you created it. That means that if you created 10 Custom Hostnames at the start of the month and 10 Custom Hostnames halfway through, at the end of the month you will be billed $30: $20 for the first ten and $10 for the ten that were active for half the month.
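The arithmetic behind that $30 figure can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that proration is linear in the fraction of the month a hostname was active; the exact billing rules are not spelled out here.

```python
from decimal import Decimal
from typing import Iterable, Tuple

PRICE_PER_HOSTNAME = Decimal("2.00")  # USD per Custom Hostname per month


def monthly_bill(hostnames: Iterable[Tuple[int, float]]) -> Decimal:
    """Sum the charge for (count, fraction_of_month_active) pairs.

    Assumes linear proration: a hostname active for half the month
    costs half the monthly price.
    """
    total = sum(
        (PRICE_PER_HOSTNAME * count * Decimal(str(fraction))
         for count, fraction in hostnames),
        Decimal("0"),
    )
    return total.quantize(Decimal("0.01"))


if __name__ == "__main__":
    # 10 hostnames for the full month, 10 more created halfway through.
    print(monthly_bill([(10, 1), (10, 0.5)]))
```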

This way, you’re only charged for the Custom Hostnames that you provision. It’s also a great incentive to make sure you clean up after your churned customers.

If you’re an Enterprise customer and want to learn more about the benefits that you can get from Cloudflare for SaaS, make sure you check out our blog post about the latest developments.

Show us what you’re building!

During the beta alone, we’ve seen incredible projects built out on the platform. We wanted to showcase these developers to show you what’s possible. And even better, some of these have been built on our Workers platform! We’d love to see what you’re working on. Join our Discord channel and showcase your work! Have feature requests for us? Let us know!

mmm.page: Simple Personal Websites

mmm.page is a drag-and-drop website builder that makes it dead simple to create auto-responsive, collage-like websites: websites with overlapping text, images, GIFs, YouTube videos, Spotify embeds, and (a lot) more. To make it easier, all the standard website tedium — uptime, usability, performance, reliability, responsiveness, SEO, etc. — are handled under the hood so all you have to worry about is adding content and arranging it how you want.

Under their hood is Cloudflare. Cloudflare’s CDN allows both the flexibility of server-side pages as well as the instant loading times of static pages — not to mention an 80% reduction in server costs. Custom Hostnames alone saved months of development time by handling domain names and SSL management (which are otherwise tricky to get perfect and reliable).

They’ve used Workers for a growing number of tasks that would otherwise have taken an order of magnitude more time to implement in their current backend monolith. The ease of deployment and comparatively low cost of Workers keep them coming back.

The longer-term hope is for pages to be used as a sort of beacon signal, an easy-to-make yet unbounded way to express to others the things you’re interested in, especially for things that aren’t so easily describable or captured in words. They look forward to a world of a ton more DIY micro-sites. Cloudflare has been crucial in taking care of much of the difficult technical plumbing and giving them more time to work on designs and features that get them closer to this hope.


Lightfunnels is a performance-driven e-commerce and lead generation platform. It focuses on delivering fast, reliable, and highly converting sales funnels to its users and their customers.

With Cloudflare for SaaS, Lightfunnels allows users to preserve their brand by easily connecting their own domain names with SSL to use on their funnels.

The platform handles large e-commerce traffic volume through Cloudflare Workers. This helps Lightfunnels serve pages from the closest edge to the customer, wherever they are in the world, allowing for blazing fast page load speeds.

Workers also come with a powerful caching API that eliminates a great percentage of back-end trips and reduces the stress on their servers.

“Our aim is to build the best performing e-commerce and lead generation platform on the market. Page load speeds play a significant role in performance. Using Cloudflare for SaaS along with Cloudflare Workers made building a reliable, secure, and fast infrastructure a breeze.”
Yassir Ennazk, Co-founder & CEO at Lightfunnels


Ventrata is a SaaS multi-channel booking platform for large attractions and tour operators. They power booking sites and B2B booking portals for clients that run on other domains. Cloudflare for SaaS has allowed them to leverage all of Cloudflare’s tools, including Firewall, image caching, Workers, and free TLS certificates on Custom Hostnames, while allowing their clients to keep full control of their brand. Their implementation involved just 4 lines of code without any infrastructure/DevOps help required, which would have been impossible before.

Currently a part of the Beta?

If you were accepted as a part of the Cloudflare for SaaS Beta, you will get a notice next week about migrating to the paid version.  

Help build a better Internet

Want to be a part of the Cloudflare team and work on the products that power Cloudflare for SaaS? We’re hiring!

Getting Cloudflare Tunnels to connect to the Cloudflare Network with QUIC

Post Syndicated from Sudarsan Reddy original https://blog.cloudflare.com/getting-cloudflare-tunnels-to-connect-to-the-cloudflare-network-with-quic/

I work on Cloudflare Tunnel, which lets customers quickly connect their private services and networks through the Cloudflare network without having to expose their public IPs or ports through their firewall. Tunnel is managed for users by cloudflared, a tool that runs on the same network as the private services. It proxies traffic for these services via Cloudflare, and users can then access these services securely through the Cloudflare network.

Recently, I was trying to get Cloudflare Tunnel to connect to the Cloudflare network using QUIC, a UDP-based protocol. While doing this, I ran into an interesting connectivity problem unique to UDP. In this post I will talk about how I went about debugging this connectivity issue beyond the land of firewalls, and how some interesting differences between UDP and TCP came into play when sending network packets.

How does Cloudflare Tunnel work?

cloudflared works by opening several connections to different servers on the Cloudflare edge. Currently, these are long-lived TCP-based connections proxied over HTTP/2 frames. When Cloudflare receives a request to a hostname, it is proxied through these connections to the local service behind cloudflared.

While our HTTP/2 protocol mode works great, we’d like to improve a few things. First, TCP traffic sent over HTTP/2 is susceptible to Head of Line (HoL) blocking — this affects both HTTP traffic and traffic from WARP routing. Additionally, it is currently not possible to initiate communication from cloudflared’s HTTP/2 server in an efficient way. With the current Go implementation of HTTP/2, we could use Server-Sent Events, but this is not very useful in the scheme of proxying L4 traffic.

The upgrade to QUIC solves possible HoL blocking issues and opens up avenues that allow us to initiate communication from cloudflared to a different cloudflared in the future.

Naturally, QUIC required a UDP-based listener on our edge servers which cloudflared could connect to. We already connect to a TCP-based listener for the existing protocols, so this should be nice and easy, right?

Failed to dial to the edge

Things weren’t as straightforward as they first looked. I added a QUIC listener on the edge, and the ability for cloudflared to connect to this new UDP-based listener. I tried to run my brand new QUIC tunnel and this happened.

$  cloudflared tunnel run --protocol quic my-tunnel
2021-09-17T18:44:11Z ERR Failed to create new quic connection, err: failed to dial to edge: timeout: no recent network activity

cloudflared wasn’t even establishing a connection to the edge. I started looking at the obvious places first. Did I add a firewall rule allowing traffic to this port? Check. Did I have iptables rules ACCEPTing or DROPping appropriate traffic for this port? Check. They seemed to be in order. So what else could I do?

tcpdump all the packets

I started by logging for UDP traffic on the machine my server was running on to see what could be happening.

$  sudo tcpdump -n -i eth0 port 7844 and udp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:44:27.742629 IP > UDP, length 1252
14:44:27.743298 IP > UDP, length 37

Looking at this tcpdump helped me understand why I had no connectivity! Not only was this port getting UDP traffic, but I was also seeing traffic flow out. But there seemed to be something strange afoot: incoming packets were being sent to one address, while responses were being sent back from a different one instead.

Why is this a problem? If a host (in this case, the server) chooses an address from a network unable to communicate with a public Internet host, it is likely that the return half of the communication will never arrive. But wait a minute. Why is some other IP getting prioritized over the address my packets were already being sent to? Let’s take a deeper look at some IP addresses. (Note that I’ve deliberately oversimplified and scrambled the results to minimally illustrate the problem.)

$  ip addr list
eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1600 qdisc noqueue state UP group default qlen 1000
inet scope global eth0
inet scope global eth0 

$ ip route show
default via dev eth0

So this was clearly why the server was working fine on my machine but not on the Cloudflare edge servers. It looks like I have multiple IPs on the interface my service is bound to. The IP that is the default route is being sent back as the source address of the packet.

Why does this work for TCP but not UDP?

Connection-oriented protocols, like TCP, initiate a connection (connect()) with a three-way handshake. The kernel therefore maintains a state about ongoing connections and uses this to determine the source IP address at the time of a response.

Because UDP (unless SOCK_SEQPACKET is involved) is connectionless, the kernel cannot maintain state like TCP does. The recvfrom system call is invoked from the server side and tells us who the data comes from. Unfortunately, recvfrom does not tell us which IP this data is addressed to. Therefore, when the UDP server invokes the sendto system call to respond to the client, we can only tell it which address to send the data to. The responsibility of determining the source address then falls to the kernel. The kernel has certain heuristics that it uses to determine the source address. These may or may not work, and in the ip route example above, they did not. The kernel naturally (and wrongly) picks the address of the default route to respond with.

Telling the kernel what to do

I had to rely on my application to set the source address explicitly and therefore not rely on kernel heuristics.

Linux has some generic I/O system calls, namely recvmsg and sendmsg. Their function signatures allow us to both read and write additional out-of-band data, which is where we can pass the source address. This control information is passed via the msghdr struct’s msg_control field.

ssize_t sendmsg(int socket, const struct msghdr *message, int flags);
ssize_t recvmsg(int socket, struct msghdr *message, int flags);

struct msghdr {
     void            *msg_name;       /* Socket name          */
     int             msg_namelen;     /* Length of name       */
     struct iovec    *msg_iov;        /* Data blocks          */
     __kernel_size_t msg_iovlen;      /* Number of blocks     */
     void            *msg_control;    /* Per protocol magic (eg BSD file descriptor passing) */
     __kernel_size_t msg_controllen;  /* Length of cmsg list  */
     unsigned int    msg_flags;
};

We can now copy the control information we’ve gotten from recvmsg back when calling sendmsg, providing the kernel with information about the source address. The library I used (https://github.com/lucas-clemente/quic-go) had a recent update that did exactly this! I pulled the changes into my service and gave it a spin.
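To show the idea of copying the control information from recvmsg to sendmsg, here is a minimal Python sketch for a Linux IPv4 UDP socket. It is not the quic-go implementation: the helper names are hypothetical, the fallback value 8 for IP_PKTINFO is the Linux constant used when the Python build does not export it, and the socket is assumed to have been created with AF_INET and configured with setsockopt(IPPROTO_IP, IP_PKTINFO, 1) so that the kernel attaches the packet info to each datagram.

```python
import socket
import struct

# Linux value of IP_PKTINFO; some Python versions do not export the constant.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)

# struct in_pktinfo { int ipi_ifindex; in_addr ipi_spec_dst; in_addr ipi_addr; }
_PKTINFO = struct.Struct("=I4s4s")


def local_addr_from_ancdata(ancdata):
    """Return the IPv4 address a datagram was addressed to, or None."""
    for level, ctype, data in ancdata:
        if level == socket.IPPROTO_IP and ctype == IP_PKTINFO:
            _ifindex, _spec_dst, addr = _PKTINFO.unpack(data[:_PKTINFO.size])
            return socket.inet_ntoa(addr)
    return None


def ancdata_with_source(src_ip):
    """Build sendmsg() ancillary data that sets ipi_spec_dst to src_ip."""
    pktinfo = _PKTINFO.pack(0, socket.inet_aton(src_ip), b"\x00" * 4)
    # The level must be IPPROTO_IP for the kernel to honour the request.
    return [(socket.IPPROTO_IP, IP_PKTINFO, pktinfo)]


def serve_once(sock):
    """Echo one datagram back from the address it was received on."""
    data, ancdata, _flags, peer = sock.recvmsg(
        2048, socket.CMSG_SPACE(_PKTINFO.size)
    )
    local_ip = local_addr_from_ancdata(ancdata)
    reply_anc = ancdata_with_source(local_ip) if local_ip else []
    sock.sendmsg([data], reply_anc, 0, peer)
```

The reply then leaves from the address the client originally contacted, rather than from whichever address the kernel's routing heuristics would pick.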

But alas. It did not work! A quick tcpdump showed that the same source address was being sent back. It seemed clear from reading the source code that the recvmsg and sendmsg were being called with the right values. It did not make sense.

So I had to see for myself if these system calls were being made.

strace all the system calls

strace is an extremely useful tool that tracks all system calls and signals sent/received by a process. Here’s what it had to say. I’ve removed all the information not relevant to this specific issue.

17:39:09.130346 recvmsg(3, {msg_name={sa_family=AF_INET6,
sin6_port=htons(35224), inet_pton(AF_INET6, "::ffff:", 
&sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=112->28, msg_iov=
[{iov_base="_\5S\30\273]\[email protected]\34\24\322\243{2\361\312|\325\n\1\314\316`\3
03\250\301X\20", iov_len=1452}], msg_iovlen=1, msg_control=[{cmsg_len=36, 
cmsg_level=SOL_IPV6, cmsg_type=0x32}, {cmsg_len=28, cmsg_level=SOL_IP, 
cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=if_nametoindex("eth0"),
{cmsg_len=17, cmsg_level=SOL_IP, 
cmsg_type=IP_TOS, cmsg_data=[0]}], msg_controllen=96, msg_flags=0}, 0) = 28 <0.000007>
17:39:09.165160 sendmsg(3, {msg_name={sa_family=AF_INET6, 
sin6_port=htons(35224), inet_pton(AF_INET6, "::ffff:", 
&sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, 
msg_iov=[{iov_base="Oe4\37:3\344 &\243W\10~c\\\316\2640\255*\231 
OY\326b\26\300\264&\33\""..., iov_len=1302}], msg_iovlen=1, msg_control=
[{cmsg_len=28, cmsg_level=SOL_TCP, cmsg_type=0x8}], msg_controllen=28, 
msg_flags=0}, 0) = 1302 <0.000054>

Let’s start with recvmsg. We can clearly see that the ipi_addr for the source is being passed correctly: ipi_addr=inet_addr(“”). This part works as expected. Looking at sendmsg almost instantly tells us where the problem is. The field we want, ipi_spec_dst, is not being set as we make this system call. So the kernel continues to make wrong guesses as to what the source address may be.

This turned out to be a bug where the library was using IPPROTO_TCP instead of IPPROTO_IPV4 as the control message level while making the sendmsg call (strace reports this as cmsg_level=SOL_TCP where SOL_IP was expected). Was that it? Seemed a little anticlimactic. I submitted a slightly more typesafe fix and sure enough, straces now showed me what I was expecting to see.

18:22:08.334755 sendmsg(3, {msg_name={sa_family=AF_INET6, 
sin6_port=htons(37783), inet_pton(AF_INET6, "::ffff:", 
&sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, 
70\227\3023_\353n\364"..., iov_len=33}], msg_iovlen=1, msg_control=
[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data=
], msg_controllen=32, msg_flags=0}, 0) =
33 <0.000049>

cloudflared is now able to connect with UDP (QUIC) to the Cloudflare network from anywhere in the world!

$  cloudflared tunnel --protocol quic run sudarsans-tunnel
2021-09-21T11:37:30Z INF Starting tunnel tunnelID=a72e9cb7-90dc-499b-b9a0-04ee70f4ed78
2021-09-21T11:37:30Z INF Version 2021.9.1
2021-09-21T11:37:30Z INF GOOS: darwin, GOVersion: go1.16.5, GoArch: amd64
2021-09-21T11:37:30Z INF Settings: map[p:quic protocol:quic]
2021-09-21T11:37:30Z INF Initial protocol quic
2021-09-21T11:37:32Z INF Connection 3ade6501-4706-433e-a960-c793bc2eecd4 registered connIndex=0 location=AMS

While the programmatic bug causing this issue was a trivial one, the journey into systematically discovering the issue and understanding how Linux internals worked for UDP along the way turned out to be very rewarding for me. It also reinforced my belief that tcpdump and strace are indeed invaluable tools in anybody’s arsenal when debugging network problems.

What’s next?

You can give this a try with the latest cloudflared release at https://github.com/cloudflare/cloudflared/releases/latest. Just remember to set the protocol flag to quic. We plan to leverage this new mode to roll out some exciting new features for Cloudflare Tunnel. So upgrade away and keep watching this space for more information on how you can take advantage of this.

Zero Trust — Not a Buzzword

Post Syndicated from Fernando Serto original https://blog.cloudflare.com/zero-trust-not-a-buzzword/

Over the last few years, Zero Trust, a term coined by Forrester, has picked up a lot of steam. Zero Trust, at its core, is a network architecture and security framework that draws no distinction between external and internal access environments and never implicitly trusts users or roles.

In the Zero Trust model, the network only delivers applications and data to authenticated and authorised users and devices, and gives organisations both visibility into what is being accessed and the ability to apply controls based on behavioural analysis. It gained popularity as the media reported on several high profile breaches caused by misuse, abuse or exploitation of VPN systems, breaches into end-users’ devices with access to other systems within the network, or breaches through third parties, either by exploiting access or by compromising software repositories in order to deploy malicious code. This would later be used to provide further access into internal systems, or to deploy malware and potentially ransomware into environments well within the network perimeter.

When we first started talking to CISOs about Zero Trust, it felt like it was just a buzzword, and CISOs were bombarded with messaging from different cybersecurity vendors offering them Zero Trust solutions. Recently, another term, SASE (Secure Access Service Edge), a framework released by Gartner, also came up and added even more confusion to the mix.

Then came COVID-19 in 2020, and with it the reality of lockdowns and remote work. Some organizations took that as an opportunity to accelerate projects around modernising their access infrastructure. Others, due to procurement processes or earlier technology decisions, ended up having to take a more tactical approach, ramping up existing remote access infrastructure by adding more licenses or capacity. They had no opportunity to rethink their approach, or to take into account the impact on their employees’ experience of working remotely full time in the early days of the pandemic.

So we thought it might be a good time to check on organizations in Asia Pacific, and look at the following:

  • The pandemic’s impact on businesses
  • Current IT security approaches and challenges
  • Awareness, adoption and implementation of Zero Trust
  • Key drivers and challenges in adopting Zero Trust

In August 2021, we commissioned a research company called The Leading Edge to conduct a survey that touches on these topics. The survey was conducted across five countries (Australia, India, Japan, Malaysia, and Singapore), and 1,006 IT and cybersecurity decision-makers and influencers from companies with more than 500 employees participated.

For example, 54% of organisations said they saw an increase in security incidents in 2021, when compared to the previous year, with 83% of respondents who experienced security incidents saying they had to make significant changes to their IT security procedures as a result.

Increase in security incidents when compared to 2020. ▲▼ Significantly higher/lower than total sample

And while the overall APAC stats are already quite interesting, I thought it would be even more fascinating to look at the unique characteristics of each of the five countries, so let’s have a look:


Australia

Australian organisations reported the highest impact of COVID-19 when it comes to their IT security approach, with 87% of the 203 respondents surveyed saying the pandemic had a moderate to significant impact on their IT security posture. The two biggest cities in Australia (Sydney and Melbourne) were each in lockdown for over 100 days in the second half of 2021 alone. With the extensive lockdowns, it’s not a surprise that 48% of respondents reported challenges with maximising remote workers’ productivity without exposing them or their devices to new risks.

With 94% of organisations in Australia having reported they will be implementing a combination of return to office and work from home, building an effective and uniform security approach can be quite challenging. If you combine that with the fact that 62% saw an increase in security incidents over the last year, we can safely assume IT and cybersecurity decision-makers and influencers in Australia have been working on improving their security posture over the last year, even though 40% of respondents indicated they struggled to secure the right level of funding for such projects.

Australia seems to be well advanced on the journey into implementing Zero Trust when compared to the other four countries included in the report, with 45% of the organisations that have adopted Zero Trust starting their Zero Trust journey over the last one to four years. Australian organisations have always been known for fast cloud adoption, and even in the early 2010s Australians were already consuming IaaS quite heavily.


India

When compared to the other countries in the report, India has a very challenging environment when it comes to working from home: Internet connectivity is inconsistent, even though there’s been significant improvement in Internet speeds in the country, and problems like power outages occur regularly in certain areas outside of city centres. Surprisingly, the biggest challenge reported by Indian organisations was that they could benefit from newer security functionality, which goes to show that legacy security approaches are still widely present in India. Likewise, 37% of the respondents reported that their access technologies are too complex, which supports the previous point that newer security functionality would be beneficial to the same organisations.

When asked about their concerns around the shift in how their users will access applications, one of the biggest concerns raised by 59% of the respondents was around applications being protected by VPN or IP address controls alone. This shows Zero Trust would fit really well with their IT strategy moving forward, as controls can now be applied to users and their devices.

Another interesting point, and one where Zero Trust can be leveraged, is that 65% of respondents said internal IT and security staff shortages and cuts are a huge challenge. Most security technologies out there require special skills to build, maintain and operate, and this is where simplifying access with the right Zero Trust approach could really help improve the productivity of those teams.


Japan

When we look at the results of the survey across all five countries, it’s fairly obvious that Japan didn’t seem to have quite the same challenges as the other countries when the pandemic started. Businesses continued to operate normally for most of 2020 and 2021, which would explain why the impact wasn’t in line with the other countries. Having said that, 51% of the respondents surveyed in Japan still reported they saw a moderate to significant impact on their IT security approach, which is still significant, even though lower than the other countries.

Japanese organisations also reported an increase in the number of security incidents: even though the impact of the pandemic wasn’t as severe as in other countries, 45% of the respondents still reported an increase in security incidents, and 63% still had to make changes to their IT security procedures as a direct result of incidents.


Malaysia rated second highest (at 80%) in our report on the impact the pandemic has had on organisations’ IT security approach, and rated highest on both employees using their home networks and using personal devices for work, at 94% and 92% respectively. From a security perspective, that has a significant impact on an organisation’s security posture and substantially increases its attack surface.

From a risk perspective, Malaysian organisations rated lack of management over employees’ devices pretty highly, with 65% of them expressing concerns over it. Other areas worth calling out were applications and data being exposed to the public Internet, and lack of visibility into staff activity inside applications.

With 57% of the respondents calling out an increase in security incidents when compared to the previous year, 89% of the respondents said they had to make significant changes to their IT security procedures due to either security incidents or attack attempts against their environments.


In Singapore, 79% of IT and cybersecurity decision-makers and influencers reported that the pandemic has impacted their IT security approach, and two in five organisations said they could benefit from more modern security functionality as a direct result of the impact caused by the pandemic. 52% of the organisations also reported an increase in security incidents compared to 2020, with almost half having seen an increase in phishing attempts.

Singaporean organisations were also not immune to a significant increase in IT security spend as a direct result of the pandemic, with 62% of them having reported more investment in security. Some of the challenges these organisations were facing were related to applications being directly exposed to the public Internet, limited oversight on third party access and applications being protected by username and password only.

While Singapore is known for high speed home Internet, it was quite a surprise for me to see that 40% of organisations surveyed reported issues with latency or slow connectivity into applications via VPN. This goes to show that concentrating traffic into a single location can impact application performance even across relatively small geographies, and even when bandwidth is not necessarily a problem, as is the case in Singapore.

The work in IT security never stops

While there were distinct differences in each country around IT security posture and Zero Trust adoption, across Asia Pacific, the similarities are what stand out the most:

  • Cyberattacks continue to rise
  • Flexible work is here to stay
  • Skilled in-house IT security workers are a scarce resource
  • Need to educate stakeholders around Zero Trust

These challenges are not easy to tackle. Add to them the required focus on improving employee experience, reducing operational complexity, gaining better visibility into third-party activity, and tightening controls in response to the increase in security incidents, and you’ve got a heck of a huge responsibility for IT.

And this is where Cloudflare comes in. Not only have we been helping our employees work securely throughout the pandemic, we have also been helping organisations all over the globe streamline their IT security operations, whether that means users accessing applications through Cloudflare Access or securing their activity on the Internet through our Secure Web Gateway services, which even include controls around SaaS applications and browser isolation, all with the best possible user experience.

So come talk to us!

Introducing the Security at the Edge: Core Principles whitepaper

Post Syndicated from Maddie Bacon original https://aws.amazon.com/blogs/security/introducing-the-security-at-the-edge-core-principles-whitepaper/

Amazon Web Services (AWS) recently released the Security at the Edge: Core Principles whitepaper. Today’s business leaders know that it’s critical to ensure that both the security of their environments and the security present in traditional cloud networks are extended to workloads at the edge. The whitepaper provides security executives the foundations for implementing a defense in depth strategy for security at the edge by addressing three areas of edge security:

  • AWS services at AWS edge locations
  • How those services and others can be used to implement the best practices outlined in the design principles of the AWS Well-Architected Framework Security Pillar
  • Additional AWS edge services, which customers can use to help secure their edge environments or expand operations into new, previously unsupported environments

Together, these elements offer core principles for designing a security strategy at the edge, and demonstrate how AWS services can provide a secure environment extending from the core cloud to the edge of the AWS network and out to customer edge devices and endpoints. You can find more information in the Security at the Edge: Core Principles whitepaper.



Maddie Bacon

Maddie (she/her) is a technical writer for AWS Security with a passion for creating meaningful content. She previously worked as a security reporter and editor at TechTarget and has a BA in Mathematics. In her spare time, she enjoys reading, traveling, and all things Harry Potter.


Jana Kay

Since 2018, Jana has been a cloud security strategist with the AWS Security Growth Strategies team. She develops innovative ways to help AWS customers achieve their objectives, such as security table top exercises and other strategic initiatives. Previously, she was a cyber, counter-terrorism, and Middle East expert for 16 years in the Pentagon’s Office of the Secretary of Defense.

Accepting API keys as a query string in Amazon API Gateway

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/accepting-api-keys-as-a-query-string-in-amazon-api-gateway/

This post was written by Ronan Prenty, Sr. Solutions Architect and Zac Burns, Cloud Support Engineer & API Gateway SME

Amazon API Gateway is a fully managed service that makes it easier for developers to create, publish, maintain, monitor, and secure APIs at any scale. APIs act as the front door to applications and allow developers to offload tasks like authorization, throttling, caching, and more.

A common feature requested by customers is the ability to track usage for specific users or services through API keys. API Gateway REST APIs support this feature and, for added security, require that the API key resides in a header or an authorizer.

Developers may also need to pass API keys in the query string parameters. Best practices encourage refactoring the requests at the client level to move API keys to the header. However, this may not be possible during the migration.

This blog explains how to build an API Gateway REST API that temporarily accepts API keys as query string parameters. This post helps customers who have APIs that accept API keys as query string parameters and want to migrate to API Gateway with minimal impact on their clients. The post also discusses increasing security by refactoring the client to send API keys as a header instead of a query string.

There is also an example project for you to test and evaluate. This solution uses a custom authorizer AWS Lambda function to extract the API key from the query string parameter and apply it to a usage plan. The sample application uses the AWS Serverless Application Model (AWS SAM) for deployment.

Key concepts

API keys and usage plans

API keys are alphanumeric strings that are distributed to developers to grant access to an API. API Gateway can generate these on your behalf, or you can import them.

Usage plans let you provide API keys to your customers so that you can track and limit their usage. API keys are not a primary authorization mechanism for your APIs. If multiple APIs are associated with a usage plan, a user with a valid API key can access all APIs in that usage plan. We provide numerous options for securing access to your APIs, including resource policies, Lambda authorizers, and Amazon Cognito user pools.

Usage plans define who can access deployed API stages and methods along with metering their usage. Usage plans use API keys to identify who is making requests and apply throttling and quota limits.

How API Gateway handles API keys

API Gateway supports API keys sent as headers in a request. It does not support API keys sent as a query string parameter. API Gateway only accepts requests over HTTPS, which means that the request is encrypted. When sending API keys as query string parameters, there is still a risk that URLs are logged in plaintext by the client sending requests.

API Gateway has two settings to accept API keys:

  1. Header: The request contains the API key as the value of the X-API-Key header. API Gateway then validates the key against a usage plan.
  2. Authorizer: The authorizer includes the API key as part of the authorization response. Once API Gateway receives the API key as part of the response, it validates it against a usage plan.

Solution overview

To accept an API key as a query string parameter temporarily, create a custom authorizer using a Lambda function:

Note: the apiKeySource property of your API must be set to Authorizer instead of Header.


  1. The client sends an HTTP request to the API Gateway endpoint with the API key in the query string.
  2. API Gateway sends the request to a REQUEST type custom authorizer.
  3. The custom authorizer function extracts the API key from the payload. It constructs the response object with the API key as the value for the `usageIdentifierKey` property.
  4. The response gets sent back to API Gateway for validation.
  5. API Gateway validates the API key against a usage plan.
  6. If valid, API Gateway passes the request to the backend.
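Step 3 above can be sketched as a small Lambda handler. This is a minimal illustration, not the code from the sample project: it assumes the key arrives in a query string parameter named `apiKey` (as in the request example later in this post), and the `principalId` value is a hypothetical placeholder. The `usageIdentifierKey` field is the one API Gateway reads when the API key source is set to Authorizer.

```python
def lambda_handler(event, context):
    """REQUEST-type authorizer sketch: copy the query string API key
    into usageIdentifierKey so API Gateway can check it against a usage plan."""
    # queryStringParameters can be None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    api_key = params.get("apiKey")  # assumed parameter name

    response = {
        "principalId": "query-string-client",  # hypothetical identifier
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow" if api_key else "Deny",
                    "Resource": event["methodArn"],
                }
            ],
        },
    }
    if api_key:
        # API Gateway validates this value against the usage plan.
        response["usageIdentifierKey"] = api_key
    return response
```

The handler only forwards the key; the actual validation still happens in API Gateway (steps 4 and 5), so the function never needs access to the list of valid keys.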

Deploying the solution


This solution requires no pre-existing AWS resources and deploys everything you need from the template. Deploying the solution requires:

You can find the solution on GitHub using this link.

With the prerequisites completed, deploy the template with the following commands:

git clone https://github.com/aws-samples/amazon-apigateway-accept-apikeys-as-querystring.git
cd amazon-apigateway-accept-apikeys-as-querystring
sam build --use-container
sam deploy --guided

Long term considerations

This temporary solution enables developers to migrate APIs to API Gateway and maintain query string-based API keys. While this solution does work, it does not follow best practices.

In addition to security, there is also a cost factor. Each time the client request contains an API key, the custom authorizer AWS Lambda function will be invoked, increasing the total amount of Lambda invocations you are billed for. To ensure you are billed only for valid requests, you can add an identity source to the custom authorizer meaning that only requests containing this identity source will be sent to the Lambda function. Requests that do not contain this identity source will not be billed by Lambda or API Gateway. Migrating to a header-based API key removes the need for a custom authorizer and the extra Lambda function invocations. You can find out more information on AWS Lambda billing here.

Customer migration process

With this in mind, the structure of the request sent by API clients must change from:

GET /some-endpoint?apiKey=abc123456789

to:


GET /some-endpoint
x-api-key: abc123456789
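On the client side, the change can be as small as moving one value. The following sketch uses only the Python standard library; the function name `move_api_key_to_header` and the `api.example.com` host are made up for illustration, and the parameter and header names are taken from the example above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def move_api_key_to_header(url, param="apiKey", header="x-api-key"):
    """Return (url_without_key, headers) for a URL that carries the API key
    in its query string. Other query parameters are preserved in order."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    key = next((value for name, value in query if name == param), None)
    remaining = [(name, value) for name, value in query if name != param]
    new_url = urlunsplit(parts._replace(query=urlencode(remaining)))
    headers = {header: key} if key is not None else {}
    return new_url, headers
```

For instance, `move_api_key_to_header("https://api.example.com/some-endpoint?apiKey=abc123456789")` yields the bare endpoint URL plus a headers dictionary that can be passed to whichever HTTP client the application already uses.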

You can provide clients with a notice period during which this temporary solution remains operational. After that, they must migrate to a new API endpoint that uses a header to provide the API keys. Once the client migration is complete, you can retire the custom solution.

Developer portal

In addition to migrating API keys to a header-based solution, customers also ask us how to manage customer keys and usage plans. One option is to deploy the API Gateway developer portal.

This portal enables your customers to discover available APIs, browse API documentation, register for API keys, test APIs in the user interface, and monitor their API usage. This portal also allows you to publish non-API Gateway managed APIs by uploading OpenAPI definitions. The serverless developer portal can be customized and branded to suit your organization.


This blog post demonstrates how to use custom authorizers in API Gateway to accept API keys as a query string parameter. It also provides an AWS SAM template to deploy an example application for testing. Finally, it discusses the importance of moving customers to header-based API keys and managing those keys with the developer portal.

For more serverless content, visit Serverless Land.

Privacy-Preserving Compromised Credential Checking

Post Syndicated from Luke Valenta original https://blog.cloudflare.com/privacy-preserving-compromised-credential-checking/


Today we’re announcing a public demo and an open-sourced Go implementation of a next-generation, privacy-preserving compromised credential checking protocol called MIGP (“Might I Get Pwned”, a nod to Troy Hunt’s “Have I Been Pwned”). Compromised credential checking services are used to alert users when their credentials might have been exposed in data breaches. Critically, the ‘privacy-preserving’ property of the MIGP protocol means that clients can check for leaked credentials without leaking any information to the service about the queried password, and only a small amount of information about the queried username. Thus, not only can the service inform you when one of your usernames and passwords may have become compromised, but it does so without exposing any unnecessary information, keeping credential checking from becoming a vulnerability itself. The ‘next-generation’ property comes from the fact that MIGP advances upon the current state of the art in credential checking services by allowing clients to not only check if their exact password is present in a data breach, but to check if similar passwords have been exposed as well.

For example, suppose your password last year was amazon20$, and you change your password each year (so your current password is amazon21$). If last year’s password got leaked, MIGP could tell you that your current password is weak and guessable as it is a simple variant of the leaked password.

The MIGP protocol was designed by researchers at Cornell Tech and the University of Wisconsin-Madison, and we encourage you to read the paper for more details. In this blog post, we provide motivation for why compromised credential checking is important for security hygiene, and how the MIGP protocol improves upon the current generation of credential checking services. We then describe our implementation and the deployment of MIGP within Cloudflare’s infrastructure.

Our MIGP demo and public API are not meant to replace existing credential checking services today, but rather demonstrate what is possible in the space. We aim to push the envelope in terms of privacy and are excited to employ some cutting-edge cryptographic primitives along the way.

The threat of data breaches

Data breaches are rampant. The regularity of news articles detailing how tens or hundreds of millions of customer records have been compromised has made us almost numb to the details. Perhaps we all hope to stay safe just by being a small fish in the middle of a very large school of similar fish that is being predated upon. But we can do better than just hope that our particular authentication credentials are safe. We can actually check those credentials against known databases of the very same compromised user information we learn about from the news.

Many of the security breaches we read about involve leaked databases containing user details. In the worst cases, user data entered during account registration on a particular website is made available (often offered for sale) after a data breach. Think of the addresses, password hints, credit card numbers, and other private details you have submitted via an online form. We rely on the care taken by the online services in question to protect those details. On top of this, consider that the same (or quite similar) usernames and passwords are commonly used on more than one site. Our information across all of those sites may be as vulnerable as the site with the weakest security practices. Attackers take advantage of this fact to actively compromise accounts and exploit users every day.

Credential stuffing is an attack in which malicious parties use leaked credentials from an account on one service to attempt to log in to a variety of other services. These attacks are effective because of the prevalence of reused credentials across services and domains. After all, who hasn’t at some point had a favorite password they used for everything? (Quick plug: please use a password manager like LastPass to generate unique and complex passwords for each service you use.)

Website operators have (or should have) a vested interest in making sure that users of their service are using secure and non-compromised credentials. Given the sophistication of techniques employed by malevolent actors, the standard requirement to “include uppercase, lowercase, digit, and special characters” really is not enough (and can be actively harmful according to NIST’s latest guidance). We need to offer better options to users that keep them safe and preserve the privacy of vulnerable information. Dealing with account compromise and recovery is an expensive process for all parties involved.

Users and organizations need a way to know if their credentials have been compromised, but how can they do it? One approach is to scour dark web forums for data breach torrent links, download and parse gigabytes or terabytes of archives to their laptop, and then search the dataset to see if their credentials have been exposed. This approach is not workable for the majority of Internet users and website operators, but fortunately there’s a better way: have someone with terabytes to spare do it for you!

Making compromise checking fast and easy

This is exactly what compromised credential checking services do: they aggregate breach datasets and make it possible for a client to determine whether a username and password are present in the breached data. Have I Been Pwned (HIBP), launched by Troy Hunt in 2013, was the first major public breach alerting site. It provides a service, Pwned Passwords, where users can efficiently check if their passwords have been compromised. The initial version of Pwned Passwords required users to send the full password hash to the service to check if it appears in a data breach. In a 2018 collaboration with Cloudflare, the service was upgraded to allow users to run range queries over the password dataset, leaking only the salted hash prefix rather than the entire hash. Cloudflare continues to support the HIBP project by providing CDN and security support for organizations to download the raw Pwned Password datasets.

The HIBP approach was replicated by Google Password Checkup (GPC) in 2019, with the primary difference that GPC alerts are based on username-password pairs instead of passwords alone, which limits the rate of false positives. Enzoic and Microsoft Password Monitor are two other similar services. This year, Cloudflare also released Exposed Credential Checks as part of our Web Application Firewall (WAF) to help inform opted-in website owners when login attempts to their sites use compromised credentials. In fact, we use MIGP on the backend for this service to ensure that plaintext credentials never leave the edge server on which they are being processed.

Most standalone credential checking services work by having a user submit a query containing their password’s or username-password pair’s hash prefix. However, this leaks some information to the service, which could be problematic if the service turns out to be malicious or is compromised. In a collaboration with researchers at Cornell Tech published at CCS’19, we showed just how damaging this leaked information can be. Malevolent actors with access to the data shared with most credential checking services can drastically improve the effectiveness of password-guessing attacks. This left open the question: how can you do compromised credential checking without sharing (leaking!) vulnerable credentials to the service provider itself?
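To make the hash-prefix query model concrete, here is a small in-memory sketch of a range query. It is only an illustration under stated assumptions: it uses unsalted SHA-1 and a five-character hex prefix for brevity, the function names are invented, and the "server" is just a dictionary. The point to notice is which value crosses the boundary between client and service.

```python
import hashlib

PREFIX_HEX_LEN = 5  # assumed prefix length, in hex characters


def split_hash(password):
    """Hash the password and split the digest into (prefix, suffix)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:PREFIX_HEX_LEN], digest[PREFIX_HEX_LEN:]


def build_index(breached_passwords):
    """Server side: group hash suffixes by their prefix."""
    index = {}
    for password in breached_passwords:
        prefix, suffix = split_hash(password)
        index.setdefault(prefix, set()).add(suffix)
    return index


def is_breached(password, index):
    """Client side: send only the prefix, then compare suffixes locally."""
    prefix, suffix = split_hash(password)
    bucket = index.get(prefix, set())  # the service learns `prefix` only
    return suffix in bucket
```

Even though the full hash never leaves the client, the prefix still narrows down the set of candidate passwords, which is the leakage discussed above.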

What does a privacy-preserving credential checking service look like?

In the aforementioned CCS’19 paper, we proposed an alternative system in which only the hash prefix of the username is exposed to the MIGP server (independent work out of Google and Stanford proposed a similar system). No information about the password leaves the user device, alleviating the risk of password-guessing attacks. These credential checking services help to preserve password secrecy, but still have a limitation: they can only alert users if the exact queried password appears in the breach.

The present evolution of this work, Might I Get Pwned (MIGP), proposes a next-generation similarity-aware compromised credential checking service that supports checking if a password similar to the one queried has been exposed in the data breach. This approach supports the detection of credential tweaking attacks, an advanced version of credential stuffing.

Credential tweaking takes advantage of the fact that many users, when forced to change their password, use simple variants of their original password. Rather than just attempting to log in using an exact leaked password, say ‘password123’, a credential tweaking attacker might also attempt to log in with easily-predictable variants of the password such as ‘password124’ and ‘password123!’.

There are two main mechanisms described in the MIGP paper to add password variant support: client-side generation and server-side precomputation. With client-side generation, the client simply applies a series of transform rules to the password to derive the set of variants (e.g., truncating the last letter or adding a ‘!’ at the end), and runs multiple queries to the MIGP service with each username and password variant pair. The second approach is server-side precomputation, where the server applies the transform rules to generate the password variants when encrypting the dataset, essentially treating the password variants as additional entries in the breach dataset. The MIGP paper describes tradeoffs between the two approaches and techniques for generating variants in more detail. Our demo service includes variant support via server-side precomputation.
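As a rough illustration of what transform rules can look like, the sketch below derives a handful of variants from a password. The rules here (drop the last character, append ‘!’, flip the case of the first character, increment the last run of digits) are illustrative assumptions, not the rule set from Appendix A of the MIGP paper, and `password_variants` is a made-up helper name.

```python
import re
from typing import List


def password_variants(password: str) -> List[str]:
    """Return simple variants of `password`, excluding the password itself."""
    candidates = [
        password[:-1],                           # truncate the last character
        password + "!",                          # append a common suffix
        password[:1].swapcase() + password[1:],  # flip case of first character
    ]
    digit_runs = list(re.finditer(r"\d+", password))
    if digit_runs:
        last = digit_runs[-1]
        # Increment the last run of digits, keeping any zero padding.
        bumped = str(int(last.group()) + 1).zfill(len(last.group()))
        candidates.append(password[:last.start()] + bumped + password[last.end():])

    seen = {password, ""}
    variants = []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants
```

With client-side generation, each returned variant would be sent as a separate query; with server-side precomputation, the same kind of function would run once per breach entry while the buckets are being built.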

Breach extraction attacks and countermeasures

One challenge for credential checking services is breach extraction attacks, in which an adversary attempts to learn username-password pairs that are present in the breach dataset (which might not be publicly available) so that they can attempt to use them in future credential stuffing or tweaking attacks. Similarity-aware credential checking services like MIGP can make these attacks more effective, since adversaries can potentially check for more breached credentials per API query. Fortunately, additional measures can be incorporated into the protocol to help counteract these attacks. For example, if it is problematic to leak the number of ciphertexts in a given bucket, dummy entries and padding can be employed, or an alternative length-hiding bucket format can be used. Slow hashing and API rate limiting are other common countermeasures that credential checking services can deploy to slow down breach extraction attacks. For instance, our demo service applies the memory-hard slow hash algorithm scrypt to credentials as part of the key derivation function to slow down these attacks.

Let’s now get into the nitty-gritty of how the MIGP protocol works. For readers not interested in the cryptographic details, feel free to skip to the demo below!

MIGP protocol

There are two parties involved in the MIGP protocol: the client and the server. The server has access to a dataset of plaintext breach entries (username-password pairs), and a secret key used for both the precomputation and the online portions of the protocol. In brief, the client performs some computation over the username and password and sends the result to the server; the server then returns a response that allows the client to determine if their password (or a similar password) is present in the breach dataset.

Full protocol description from the MIGP paper: clients learn if their credentials are in the breach dataset, leaking only the hash prefix of the queried username to the server


At a high level, the MIGP server partitions the breach dataset into buckets based on the hash prefix of the username (the bucket identifier), which is usually 16-20 bits in length.
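A bucket identifier of this kind can be computed in a couple of lines. The sketch below assumes SHA-256 as the hash function and takes the leading bits of the digest; the actual hash and encoding used by the MIGP implementation may differ, so treat `bucket_id` as a hypothetical helper.

```python
import hashlib


def bucket_id(username: str, prefix_bits: int = 20) -> int:
    """Return the first `prefix_bits` bits of the username hash as an integer."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    # Keep only the most significant bits of the 256-bit digest.
    return int.from_bytes(digest, "big") >> (256 - prefix_bits)
```

With a 20-bit prefix there are 2^20 possible identifiers, so many unrelated usernames share each bucket, and that sharing is what limits how much the server learns from a query.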

During the precomputation phase of the MIGP protocol, the server derives password variants, encrypts entries, and stores them in buckets based on the hash prefix of the username

We use server-side precomputation as the variant generation mechanism in our implementation. The server derives one ciphertext for each exact username-password pair in the dataset, and an additional ciphertext per password variant. A bucket consists of the set of ciphertexts for all breach entries and variants with the same username hash prefix. For instance, suppose there are n breach entries assigned to a particular bucket. If we compute m variants per entry, counting the original entry as one of the variants, there will be n*m ciphertexts stored in the bucket. This introduces a large expansion in the size of the processed dataset, so in practice it is necessary to limit the number of variants computed per entry. Our demo server stores 10 ciphertexts per breach entry in the input: the exact entry, eight variants (see Appendix A of the MIGP paper), and a special variant for allowing username-only checks.

Each ciphertext is the encryption of a username-password (or password variant) pair along with some associated metadata. The metadata describes whether the entry corresponds to an exact password appearing in the breach, or a variant of a breached password. The server derives a per-entry secret key pad using a key derivation function (KDF) with the username-password pair and server secret as inputs, and uses XOR encryption to derive the entry ciphertext. The bucket format also supports storing optional encrypted metadata, such as the date the breach was discovered.

Inputs:
  Secret sk       // Server secret key
  String u        // Username
  String w        // Password (or password variant)
  Byte mdFlag     // Metadata flag
  String mdString // Optional metadata string

Output:
  String C        // Ciphertext

function Encrypt(sk, u, w, mdFlag, mdString):
  padHdr = KDF1(u, w, sk)
  padBody = KDF2(u, w, sk)
  zeros = [0] * KEY_CHECK_LEN
  C = XOR(padHdr, zeros || mdFlag) || mdString.length || XOR(padBody, mdString)
  return C
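The pseudocode above can be turned into a small runnable sketch. Several things here are assumptions made for illustration only: SHAKE-256 stands in for KDF1 and KDF2 (the demo service uses scrypt and an OPRF in its key derivation), `KEY_CHECK_LEN` is set to 16 bytes, the metadata length is encoded in two bytes, and `try_decrypt` takes the server secret directly, whereas a real client derives the pads through the OPRF exchange and never sees `sk`.

```python
import hashlib
from typing import Optional, Tuple

KEY_CHECK_LEN = 16  # assumed length (bytes) of the all-zero key-check prefix


def _kdf(label: bytes, sk: bytes, u: str, w: str, n: int) -> bytes:
    """Stand-in KDF: SHAKE-256 over length-prefixed inputs (not MIGP's KDF)."""
    h = hashlib.shake_256()
    for part in (label, sk, u.encode("utf-8"), w.encode("utf-8")):
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest(n)


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(sk: bytes, u: str, w: str, md_flag: int, md_string: bytes) -> bytes:
    """C = XOR(padHdr, zeros || mdFlag) || len(mdString) || XOR(padBody, mdString)"""
    pad_hdr = _kdf(b"hdr", sk, u, w, KEY_CHECK_LEN + 1)
    pad_body = _kdf(b"body", sk, u, w, len(md_string))
    header = _xor(pad_hdr, bytes(KEY_CHECK_LEN) + bytes([md_flag]))
    return header + len(md_string).to_bytes(2, "big") + _xor(pad_body, md_string)


def try_decrypt(sk: bytes, u: str, w: str, c: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (mdFlag, mdString) if the pads match this entry, else None."""
    pad_hdr = _kdf(b"hdr", sk, u, w, KEY_CHECK_LEN + 1)
    header = _xor(pad_hdr, c[:KEY_CHECK_LEN + 1])
    if header[:KEY_CHECK_LEN] != bytes(KEY_CHECK_LEN):
        return None  # key check failed: entry belongs to other credentials
    start = KEY_CHECK_LEN + 3
    n = int.from_bytes(c[KEY_CHECK_LEN + 1:start], "big")
    pad_body = _kdf(b"body", sk, u, w, n)
    return header[-1], _xor(pad_body, c[start:start + n])
```

The run of zero bytes in the header is what lets a client recognise a successful decryption: XORing with the wrong pad produces random-looking bytes, so the check fails for every entry that does not match the queried credentials.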

The precomputation phase only needs to be done rarely, such as when the MIGP parameters are changed (in which case the entire dataset must be re-processed), or when new breach datasets are added (in which case the new data can be appended to the existing buckets).

Online phase

During the online phase of the MIGP protocol, the client requests a bucket of encrypted breach entries corresponding to the queried username, and with the server’s help derives a key that allows it to decrypt an entry corresponding to the queried credentials

The online phase of the MIGP protocol allows a client to check if a username-password pair (or variant) appears in the server’s breach dataset, while only leaking the hash prefix of the username to the server. The client and server engage in an OPRF protocol message exchange to allow the client to derive the per-entry decryption key, without leaking the username and password to the server, or the server’s secret key to the client. The client then computes the bucket identifier from the queried username and downloads the corresponding bucket of entries from the server. Using the decryption key derived in the previous step, the client scans through the entries in the bucket attempting to decrypt each one. If the decryption succeeds, this signals to the client that their queried credentials (or a variant thereof) are in the server’s dataset. The decrypted metadata flag indicates whether the entry corresponds to the exact password or a password variant.
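The blinding idea behind the OPRF exchange can be shown with a toy example. This is not the OPRF used by MIGP (our library relies on CIRCL for that) and it is not secure: it works with plain integers modulo a Mersenne prime purely so the arithmetic is easy to follow, and all function names are invented for this sketch. The client raises a hash of its input to a random exponent r, the server raises the result to its secret key, and the client removes r again.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy modulus, far too weak for real use


def hash_to_group(data: bytes) -> int:
    """Map input bytes to a nonzero integer modulo P."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % (P - 1) + 1


def client_blind(data: bytes):
    """Return (r, blinded element); r must be invertible modulo P - 1."""
    while True:
        r = secrets.randbelow(P - 1)
        if r > 1 and math.gcd(r, P - 1) == 1:
            return r, pow(hash_to_group(data), r, P)


def server_evaluate(blinded: int, server_key: int) -> int:
    """Server side: apply the secret key without seeing the client's input."""
    return pow(blinded, server_key, P)


def client_finalize(data: bytes, r: int, evaluated: int) -> bytes:
    """Remove the blinding factor and hash the result into an output key."""
    r_inv = pow(r, -1, P - 1)  # modular inverse (Python 3.8+)
    unblinded = pow(evaluated, r_inv, P)  # equals hash_to_group(data) ** key
    return hashlib.sha256(data + unblinded.to_bytes(16, "big")).digest()
```

Because r is fresh for every query, the blinded element sent to the server changes each time even when the credentials are the same, while the final output depends only on the input and the server key.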

The MIGP protocol solves many of the shortcomings of existing credential checking services with its solution that avoids leaking any information about the client’s queried password to the server, while also providing a mechanism for checking for similar password compromise. Read on to see the protocol in action!

MIGP demo

As the state of the art in attack methodologies evolves with new techniques such as credential tweaking, so must the defenses. To that end, we’ve collaborated with the designers of the MIGP protocol to prototype and deploy the MIGP protocol within Cloudflare’s infrastructure.

Our MIGP demo server is deployed at migp.cloudflare.com, and runs entirely on top of Cloudflare Workers. We use Workers KV for efficient storage and retrieval of buckets of encrypted breach entries, capping out each bucket size at the current KV value limit of 25MB. In our instantiation, we set the username hash prefix length to 20 bits, so that there are a total of 2^20 (or just over 1 million) buckets.

There are currently two ways to interact with the demo MIGP service: via the browser client at migp.cloudflare.com, or via the Go client included in our open-sourced MIGP library. As shown in the screenshots below, the browser client displays the request from your device and the response from the MIGP service. You should take caution to not input any sensitive credentials in a third-party service (feel free to use the test credentials [email protected] and password1 for the demo).

Keep in mind that “absence of evidence is not evidence of absence”, especially in the context of data breaches. We intend to periodically update the breach datasets used by the service as new public breaches become available, but no breach alerting service will be able to provide 100% accuracy in assuring that your credentials are safe.

See the MIGP demo in action in the attached screenshots. Note that in all cases, the username ([email protected]) and corresponding username prefix hash (000f90f4) remain the same, so the client retrieves the exact same bucket contents from the server each time. However, the blindElement parameter in the client request differs per request, allowing the client to decrypt different bucket elements depending on the queried credentials.

Example query in which the credentials are exposed in the breach dataset
Example query in which similar credentials were exposed in the breach dataset
Example query in which the username is present in the breach dataset
Example query in which the credentials are not found in the dataset

Open-sourced MIGP library

We are open-sourcing our implementation of the MIGP library under the BSD-3 License. The code is written in Go and is available at https://github.com/cloudflare/migp-go. Under the hood, we use Cloudflare’s CIRCL library for OPRF support and Go’s supplementary cryptography library for scrypt support. Check out the repository for instructions on setting up the MIGP client to connect to Cloudflare’s demo MIGP service. Community contributions and feedback are welcome!

Future directions

In this post, we announced our open-sourced implementation and demo deployment of MIGP, a next-generation breach alerting service. Our deployment is intended to lead the way for other credential compromise checking services to migrate to a more privacy-friendly model, but is not itself currently meant for production use. However, we identify several concrete steps that can be taken to improve our service in the future:

  • Add more breach datasets to the database of precomputed entries
  • Increase the number of variants in server-side precomputation
  • Add library support in more programming languages to reach a broader developer base
  • Hide the number of ciphertexts per bucket by padding with dummy entries
  • Add support for efficient client-side variant checking by batching API calls to the server
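
To make the padding idea in the list above concrete, here is a minimal sketch that pads a bucket to the next multiple of a fixed block size. It assumes that ciphertexts have a fixed length and look like random bytes, so random dummy entries cannot be told apart from real ones; the function name and parameters are hypothetical and not part of the MIGP library.

```python
import secrets
from typing import List


def pad_bucket(ciphertexts: List[bytes], block: int, entry_len: int) -> List[bytes]:
    """Pad a bucket with random dummy entries up to a multiple of `block`."""
    if ciphertexts:
        missing = (block - len(ciphertexts) % block) % block
    else:
        missing = block  # an empty bucket still gets one full block
    # Assumption: real ciphertexts are `entry_len` random-looking bytes.
    dummies = [secrets.token_bytes(entry_len) for _ in range(missing)]
    padded = list(ciphertexts) + dummies
    # Shuffle so that dummy entries are not all grouped at the end.
    secrets.SystemRandom().shuffle(padded)
    return padded
```

With this approach an observer learns only the padded size, at the cost of transferring some extra entries per bucket.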

For exciting future research directions that we are investigating — including one proposal to remove the transmission of plaintext passwords from client to server entirely — take a look at https://blog.cloudflare.com/research-directions-in-password-security.

We are excited to share and build upon these ideas with the wider Internet community, and hope that our efforts drive positive change in the password security ecosystem. We are particularly interested in collaborating with stakeholders in the space to develop, test, and deploy next-generation protocols to improve user security and privacy. You can reach us with questions, comments, and research ideas at [email protected]. For those interested in joining our team, please visit our Careers Page.

Research Directions in Password Security

Post Syndicated from Ian McQuoid original https://blog.cloudflare.com/research-directions-in-password-security/

As Internet users, we all deal with passwords every day. With so many different services, each with their own login systems, we have to somehow keep track of the credentials we use with each of these services. This situation leads some users to delegate credential storage to password managers like LastPass or a browser-based password manager, but this is far from universal. Instead, many people still rely on old-fashioned human memory, which has its limitations — leading to reused passwords and to security problems. This blog post discusses how Cloudflare Research is exploring how to minimize password exposure and thwart password attacks.

The Problem of Password Reuse

Because it’s too difficult to remember many distinct passwords, people often reuse them across different online services. When breached password datasets are leaked online, attackers can take advantage of these to conduct “credential stuffing attacks”. In a credential stuffing attack, an attacker tests breached credentials against multiple online login systems in an attempt to hijack user accounts. These attacks are highly effective because users tend to reuse the same credentials across different websites, and they have quickly become one of the most prevalent types of online guessing attacks. Automated attacks can be run at a large scale, testing out exposed passwords across multiple systems, under the assumption that some of these passwords will unlock accounts somewhere else (if they have been reused). When a data breach is detected, users of that service will likely receive a security notification and will reset that account password. However, if this password was reused elsewhere, they may easily forget that it needs to be changed for those accounts as well.

How can we protect against credential stuffing attacks? There are a number of methods that have been deployed — with varying degrees of success. Password managers address the problem of remembering a strong, unique password for every account, but many users have yet to adopt them. Multi-factor authentication is another potential solution — that is, using another form of authentication in addition to the username/password pair. This can work well, but has limits: for example, such solutions may rely on specialized hardware that not all clients have. Consumer systems are often reluctant to mandate multi-factor authentication, given concerns that people may find it too complicated to use; companies do not want to deploy something that risks impeding the growth of their user base.

Since there is no perfect solution, security researchers continue to try to find improvements. Two different approaches we will discuss in this blog post are hardening password systems using cryptographically secure keys, and detecting the reuse of compromised credentials, so they don’t leave an account open to guessing attacks.

Improved Authentication with PAKEs

Investigating how to securely authenticate a user using only what they can remember has been an important area in secure communication. To this end, the subarea of cryptography known as Password Authenticated Key Exchange (PAKE) came about. PAKEs deal with protocols for establishing cryptographically secure keys where the only source of authentication is a human-memorizable (low-entropy, attacker-guessable) password, that is, the “what you know” side of authentication.

Before diving into the details, we’ll provide a high-level overview of the basic problem. Although passwords are typically protected in transit by being sent over HTTPS, servers handle them in plaintext to verify them once they arrive. Handling plaintext passwords increases security risk — for instance, they might get inadvertently logged and exposed. Ideally, the user’s password never gets sent to the server in the first place. This is where PAKEs come in — a means of verifying that the user and server share a password, ideally without revealing information about the password that could help attackers to discover or crack it.

A few words on PAKEs

PAKE protocols let two parties turn a password into a shared key. Each party only gets one guess at the password the other holds. If a user tries to log in to the wrong server with a PAKE, that server will not be able to turn around and impersonate the user. As such, PAKEs guarantee that communication with one of the parties is the only way for an attacker to test their (single) password guess. This may seem like an unneeded level of complexity when we could use already available tools like a key distribution mechanism along with password-over-TLS, but this puts a lot of trust in the service. You may trust a service with learning your password on that service, but what if you accidentally enter a password for a different service when trying to log in? Note the particular risks of a reused password: it is no longer just a secret shared between a user and a single service, but is now a secret shared between a user and multiple services. This therefore increases the password’s privacy sensitivity: a service should not know users’ account login information for other services.

A comparison of shared secrets between passwords over TLS versus PAKEs. With passwords over TLS, a service might learn passwords used on another service. This problem does not arise with PAKEs.

PAKE protocols are built with the assumption that the server isn’t always working in the best interest of the client and, even more, cannot use any kind of public-key infrastructure during login (although it doesn’t hurt to have both!). This precludes the user from sending their plaintext password (or any information that could be used to derive it, in a computational sense) to the server during login.

PAKE protocols have expanded into new territory since the seminal EKE paper of Bellovin and Merritt, where the client and server both remembered a plaintext version of the password. As mentioned above, when the server stores the plaintext password, the client risks having the password logged or leaked. To address this, new protocols were developed, referred to as augmented, verifier-based, or asymmetric PAKEs (aPAKEs), where the server stored a modified version (similar to a hash) of the password instead of the plaintext password. This mirrors the way many of us were taught to store passwords in a database, specifically as a hash of the password with accompanying salt and pepper. However, in these cases, attackers can still use traditional methods of attack such as targeted rainbow tables. To avoid these kinds of attacks, a new kind of PAKE was born, the strong asymmetric PAKE (saPAKE).

OPAQUE was the first saPAKE and it guarantees defense against precomputation by hiding the password dictionary itself! It does this by replacing the noninteractive hash function with an interactive protocol referred to as an Oblivious Pseudorandom Function (OPRF) where one party inputs their “salt”, another inputs their “password”, and only the password-providing party learns the output of the function. The fact that the password-providing party learns nothing (computationally) about the salt prevents offline precomputation by disallowing an attacker from evaluating the function in their head.
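
The blinding idea behind an OPRF can be sketched with Diffie-Hellman-style exponentiation, one common way to construct such a function. The code below is a toy model only: the group parameters are deliberately tiny and insecure, all function names are hypothetical, and it is not the construction or API used by OPAQUE, MIGP, or CIRCL.

```python
import hashlib
import secrets

# Toy parameters, far too small to be secure: P = 2*Q + 1, both prime.
P = 2039
Q = 1019


def hash_to_group(data: bytes) -> int:
    """Map bytes to a non-identity element of the order-Q subgroup mod P."""
    h = int.from_bytes(hashlib.sha256(data).digest(), "big")
    base = h % (P - 3) + 2   # base lies in [2, P-2], so it is never 0, 1 or P-1
    return pow(base, 2, P)   # squares form the subgroup of prime order Q


def client_blind(password: bytes):
    """Client: hide the hashed password behind a random exponent r."""
    r = secrets.randbelow(Q - 1) + 1          # r in [1, Q-1]
    return r, pow(hash_to_group(password), r, P)


def server_evaluate(blinded: int, key: int) -> int:
    """Server: apply its secret key (the "salt") to the blinded element."""
    return pow(blinded, key, P)


def client_finalize(password: bytes, r: int, evaluated: int) -> bytes:
    """Client: remove the blinding factor and derive the final output."""
    r_inv = pow(r, -1, Q)                     # modular inverse (Python 3.8+)
    unblinded = pow(evaluated, r_inv, P)      # equals hash_to_group(password)^key
    return hashlib.sha256(password + unblinded.to_bytes(2, "big")).digest()
```

The server only ever sees a randomly blinded group element, and the client never sees the key, yet the client ends up with a value that depends on both. An attacker without access to the server cannot evaluate the function offline, which is the property the paragraph above describes.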

Another way to think about the three PAKE paradigms has to do with how each of them treats the password dictionary:

PAKE type | Password dictionary | Threat model
PAKE | The password dictionary is public and common to every user. | Without any guessing, the attacker learns the user’s password upon compromise of the server.
aPAKE | Each user gets their own password dictionary; a description of the dictionary (e.g., the “salt”) is leaked to the client when they attempt to log in. | The attacker must perform an independent precomputation for each client they want to attack.
saPAKE (e.g., OPAQUE) | Each user gets their own password dictionary; the server only provides an online interface (the OPRF) to the dictionary. | The adversary must wait until after they compromise the server to run an offline attack on the user’s password [1].

OPAQUE also goes one step further and allows the user to perform the password transformation on their own device so that the server doesn’t see the plaintext password during registration either. Cloudflare Research has been involved with OPAQUE for a while now — for instance, you can read about our previous implementation work and demo if you want to learn more.

But OPAQUE is not a panacea: in the event of server compromise, the attacker can learn the salt that the server uses to evaluate the OPRF and can still run the same offline attack that was available in the aPAKE world, although this is now considerably more time-consuming and can be made increasingly difficult through the use of memory-hard hash functions like scrypt. This means that despite our best efforts, when a server is breached, the attacker can eventually come out with a list of plaintext passwords. Indeed, this attack is unavoidable, since the attacker can always run the (sa)PAKE protocol in their head, acting as both parties, to test each password. With this being the case, we still need to take steps to defend against automated password attacks such as credential stuffing and have ways of mitigating them.
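
To illustrate why a memory-hard function slows down this kind of offline attack without preventing it, the sketch below hashes a password with scrypt from Python's standard library and then plays the attacker who holds the stolen salt and hash. The cost parameters and function names are illustrative assumptions, not values used by OPAQUE or by any Cloudflare service.

```python
import hashlib
import hmac
import os
from typing import Iterable, Optional


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte verifier with scrypt (illustrative cost parameters)."""
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=2 ** 14,                 # CPU/memory cost; roughly 16 MiB per guess
        r=8,
        p=1,
        maxmem=64 * 1024 * 1024,   # allow enough memory for these parameters
        dklen=32,
    )


def offline_guess(stolen_hash: bytes, stolen_salt: bytes,
                  candidates: Iterable[str]) -> Optional[str]:
    """Attacker's view after a breach: every guess costs one scrypt call."""
    for guess in candidates:
        if hmac.compare_digest(hash_password(guess, stolen_salt), stolen_hash):
            return guess
    return None
```

Raising `n` makes each guess more expensive for the attacker and for the legitimate server alike, so the parameters are a trade-off. A weak password will still fall to a long enough candidate list.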

Are You Overexposed?

To help detect and respond to credential stuffing, Cloudflare recently rolled out the Exposed Credential Checks feature on the Web Application Firewall (WAF), which can alert the origin if a user’s login credentials have appeared in a recent breach. Historically, compromised credential checking services have allowed users to be proactive against credential stuffing attacks when their username and password appear together in a breach. However, they do not account for recently proposed credential tweaking attacks, in which an attacker tries variants of a breached password, under the assumption that users often use slight modifications of the same password for different accounts, such as “sunshineFB”, “sunshineIG”, and so on. Therefore, compromised credential check services should incorporate methods of checking for credential tweaks.

Under the hood, Cloudflare’s Exposed Credential Checks feature relies on an underlying protocol called Might I Get Pwned (MIGP). MIGP uses the bucketization method proposed in Li et al. to avoid sending the plaintext username or password to the server while handling a large breach dataset. After receiving a user’s credentials, MIGP hashes the username and sends a portion of that hash as a “bucket identifier” to the server. The client and server can then perform a private membership test protocol to verify whether the user’s username/password pair appeared in that bucket, without ever having to send plaintext credentials to the server.

Unlike previous compromised credential check services, MIGP also enables credential tweaking checks by augmenting the original breach dataset with a set of password “variants”. For each leaked password, it generates a list of password variants, which are labeled as such to differentiate them from the original leaked password and added to the original dataset. For more information, you can check out the Cloudflare Research blog post detailing our open-source implementation and deployment of the MIGP protocol.  
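
The variant augmentation described above might look roughly like the following sketch. The tweak rules here are hypothetical examples chosen for illustration; they are not the variant-generation rules used by MIGP.

```python
from typing import List, Tuple


def password_variants(password: str, max_variants: int = 10) -> List[str]:
    """Generate a few simple tweaks of a leaked password (hypothetical rules)."""
    candidates = [
        password.capitalize(),           # first letter upper, rest lower
        password.lower(),
        password + "1",
        password + "123",
        password + "!",
        password.rstrip("0123456789"),   # drop trailing digits
    ]
    seen = {password, ""}                # never emit the original or an empty string
    variants = []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)
    return variants[:max_variants]


def augment(breached: List[str]) -> List[Tuple[str, str]]:
    """Label entries so exact matches and variants can be told apart."""
    entries = []
    for pw in breached:
        entries.append((pw, "exact"))
        entries.extend((v, "variant") for v in password_variants(pw))
    return entries
```

Keeping the label alongside each entry is what lets a service report "similar credentials were exposed" separately from an exact match.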

Measuring Credential Compromises

The question remains, just how important are these exposed credential checks for detecting and preventing credential stuffing attacks in practice? To answer this question, the Research Team has initiated a study investigating login requests to our own Cloudflare dashboard. For this study, we are collecting the data logged by Cloudflare’s Exposed Credential Check feature (described above), designed to be privacy-preserving: this check does not reveal a password, but provides a “yes/no” response on whether the submitted credentials appear in our breach dataset. Along with this signal, we are looking at other fields that may be indicative of malicious behavior such as bot score and IP reputation. As this project develops, we plan to cluster the data to find patterns of different types of credential stuffing attacks that we can generalize to form attack fingerprints. We can then feed these fingerprints into the alert logs for the Cloudflare Detection & Response team to see if they provide useful information for the security analysts.

Additionally, we hope to investigate potential post-compromise behavior as it relates to these compromise check fields. After an attacker successfully hijacks an account, they may take a number of actions such as changing the password, revoking all valid access tokens, or setting up a malicious script. By analyzing compromised credential checks along with these signals, we may be able to better differentiate benign from malicious behavior.

Future directions: OPAQUE and MIGP combined

This post has discussed how we’re approaching the problem of preventing credential stuffing attacks from two different angles. Through the deployment and analysis of compromised credential checks, we aim to prevent server compromise by detecting and preventing credential stuffing attacks before they happen. In addition, in the case that a server does get compromised, the wider use of OPAQUE would help address the problem of leaking passwords to an attacker by avoiding the reception and storage of plaintext passwords on the server as well as preventing precomputation attacks.

However, there are still remaining research challenges to address. Notably, the current method for interfacing with MIGP still requires the server to either pass along a plaintext version of the client’s password, or trust the client to honestly communicate with the MIGP service on behalf of the server. If we want to leverage the security guarantees of OPAQUE (or generally an saPAKE) with the analytics and alert system provided by MIGP in a privacy-preserving way, we need additional mechanisms.

At first glance, the privacy-preserving goals of both protocols seem to be perfect matches for each other. Both OPAQUE and MIGP are built upon the idea of replacing the traditional salted password hashes with an OPRF as a way of keeping the client’s plaintext passwords from ever leaving their device. However, the interfaces for both protocols rely on user-provided inputs which aren’t cryptographically tied to each other. This allows an attacker to provide a false password to MIGP while providing their actual password to the OPAQUE server. Further, the security analyses of both protocols assume that their idealized building blocks are separated in an important way. This isn’t to say that the two protocols are incompatible, and indeed, much of both protocols can be preserved.

The next stage for password privacy will be to integrate these two protocols, so that a server can learn about credential stuffing attacks and patterns of compromised account usage (protecting it against the compromise of other servers) while providing the same privacy guarantees that OPAQUE does. Our goal is to allow you to protect yourself from other compromised servers while protecting your clients from compromise of your server. Stay tuned for updates!

We’re always keen to collaborate with others to build more secure systems, and would love to hear from those interested in password research. You can reach us with questions, comments, and research ideas at [email protected]. For those interested in joining our team, please visit our Careers Page.

[1] There are other ways of constructing saPAKE protocols. The curious reader can see this CRYPTO 2019 paper for details.

Introducing SSL/TLS Recommender

Post Syndicated from Suleman Ahmad original https://blog.cloudflare.com/ssl-tls-recommender/

Seven years ago, Cloudflare made HTTPS availability for any Internet property easy and free with Universal SSL. At the time, few websites — other than those that processed sensitive data like passwords and credit card information — were using HTTPS because of how difficult it was to set up.

However, as we all started using the Internet for more and more private purposes (communication with loved ones, financial transactions, shopping, healthcare, etc.) the need for encryption became apparent. Tools like Firesheep demonstrated how easily attackers could snoop on people using public Wi-Fi networks at coffee shops and airports. The Snowden revelations showed the ease with which governments could listen in on unencrypted communications at scale. We have seen attempts by browser vendors to increase HTTPS adoption such as the recent announcement by Chromium for loading websites on HTTPS by default. Encryption has become a vital part of the modern Internet, not just to keep your information safe, but to keep you safe.

When it was launched, Universal SSL doubled the number of sites on the Internet using HTTPS. We are building on that with SSL/TLS Recommender, a tool that guides you to stronger configurations for the backend connection from Cloudflare to origin servers. Recommender has been available in the SSL/TLS tab of the Cloudflare dashboard since August 2020 for self-serve customers. Over 500,000 zones are currently signed up. As of today, it is available for all customers!

How Cloudflare connects to origin servers

Cloudflare operates as a reverse proxy between clients (“visitors”) and customers’ web servers (“origins”), so that Cloudflare can protect origin sites from attacks and improve site performance. This happens, in part, because visitor requests to websites proxied by Cloudflare are processed by an “edge” server located in a data center close to the client. The edge server either responds directly back to the visitor, if the requested content is cached, or creates a new request to the origin server to retrieve the content.

The backend connection to the origin can be made with an unencrypted HTTP connection or with an HTTPS connection where requests and responses are encrypted using the TLS protocol (historically known as SSL). HTTPS is the secured form of HTTP and should be used whenever possible to avoid leaking information or allowing content tampering by third-party entities. The origin server can further authenticate itself by presenting a valid TLS certificate to prevent active monster-in-the-middle attacks. Such a certificate can be obtained from a certificate authority such as Let’s Encrypt or Cloudflare’s Origin CA. Origins can also set up authenticated origin pull, which ensures that any HTTPS requests that do not come from Cloudflare will not receive a response from your origin.

Cloudflare Tunnel provides an even more secure option for the connection between Cloudflare and origins. With Tunnel, users run a lightweight daemon on their origin servers that proactively establishes secure and private tunnels to the nearest Cloudflare data centers. With this configuration, users can completely lock down their origin servers to only receive requests routed through Cloudflare. While we encourage customers to set up tunnels if feasible, it’s important to encourage origins with more traditional configurations to adopt the strongest possible security posture.

Detecting HTTPS support

You might wonder, why doesn’t Cloudflare always connect to origin servers with a secure TLS connection? To start, some origin servers have no TLS support at all (for example, certain shared hosting providers and even government sites have been slow adopters) and rely on Cloudflare to ensure that the client request is at least encrypted over the Internet from the browser to Cloudflare’s edge.

Then why don’t we simply probe the origin to determine if TLS is supported? It turns out that many sites only partially support HTTPS, making the problem non-trivial. A single customer site can be served from multiple separate origin servers with differing levels of TLS support. For instance, some sites support HTTPS on their landing page but serve certain resources only over unencrypted HTTP. Further, site content can differ when accessed over HTTP versus HTTPS (for example, http://example.com and https://example.com can return different results).

Such content differences can arise from misconfiguration on the origin server or from mistakes made by developers when migrating their servers to HTTPS, or they can even be intentional, depending on the use case.

A study by researchers at Northeastern University, the Max Planck Institute for Informatics, and the University of Maryland highlights reasons for some of these inconsistencies. They found that 1.5% of surveyed sites had at least one page that was unavailable over HTTPS — despite the protocol being supported on other pages — and 3.7% of sites served different content over HTTP versus HTTPS for at least one page. Thus, always using the most secure TLS setting detected on a particular resource could result in unforeseen side effects and usability issues for the entire site.

We wanted to tackle all such issues and maximize the number of TLS connections to origin servers, but without compromising a website’s functionality and performance.

Content differences on sites when loaded over HTTPS vs HTTP; images taken from https://www.cs.umd.edu/~dml/papers/https_tma20.pdf with author permission

Configuring the SSL/TLS encryption mode

Cloudflare relies on customers to indicate the level of TLS support at their origins via the zone’s SSL/TLS encryption mode. The following SSL/TLS encryption modes can be configured from the Cloudflare dashboard:

  • Off indicates that client requests reaching Cloudflare as well as Cloudflare’s requests to the origin server should only use unencrypted HTTP. This option is never recommended, but is still in use by a handful of customers for legacy reasons or testing.
  • Flexible allows clients to connect to Cloudflare’s edge via HTTPS, but requests to the origin are over HTTP only. This is the most common option for origins that do not support TLS. However, we encourage customers to upgrade their origins to support TLS whenever possible and only use Flexible as a last resort.
  • Full enables encryption for requests to the origin when clients connect via HTTPS, but Cloudflare does not attempt to validate the certificate. This is useful for origins that have a self-signed or otherwise invalid certificate at the origin, but leaves open the possibility for an active attacker to impersonate the origin server with a fake certificate. Client HTTP requests result in HTTP requests to the origin.
  • Full (strict) indicates that Cloudflare should validate the origin certificate to fully secure the connection. The origin certificate can either be issued by a public CA or by Cloudflare Origin CA. HTTP requests from clients result in HTTP requests to the origin, exactly the same as in Full mode. We strongly recommend Full (strict) over weaker options if supported by the origin.
  • Strict (SSL-Only Origin Pull) causes all traffic to the origin to go over HTTPS, even if the client request was HTTP. This differs from Full (strict) in that HTTP client requests will result in an HTTPS request to the origin, not HTTP. Most customers do not need to use this option, and it is available only to Enterprise customers. The preferred way to ensure that no HTTP requests reach your origin is to enable Always Use HTTPS in conjunction with Full or Full (strict) to redirect visitor HTTP requests to the HTTPS version of the content.
SSL/TLS encryption modes determine how Cloudflare connects to origins
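
The mode descriptions above can be summarized as a small lookup from encryption mode and client scheme to the origin connection. This is a reading of the list in this post expressed as code, not Cloudflare's implementation; the mode strings, the `OriginPolicy` type, and the assumption that Strict (SSL-Only Origin Pull) validates certificates in the same way as Full (strict) are choices made for this sketch.

```python
from typing import NamedTuple


class OriginPolicy(NamedTuple):
    scheme: str          # scheme used for the request to the origin
    validate_cert: bool  # only meaningful when scheme is "https"


def origin_policy(mode: str, client_scheme: str) -> OriginPolicy:
    """Map an encryption mode and client scheme to the origin connection."""
    if mode in ("off", "flexible"):
        # The origin is always contacted over plain HTTP.
        return OriginPolicy("http", False)
    if mode == "full":
        # Mirrors the client scheme; the origin certificate is not validated.
        return OriginPolicy(client_scheme, False)
    if mode == "full_strict":
        # Mirrors the client scheme; the certificate is validated for HTTPS.
        return OriginPolicy(client_scheme, client_scheme == "https")
    if mode == "strict":
        # Always HTTPS to the origin (assumed here to validate the certificate).
        return OriginPolicy("https", True)
    raise ValueError(f"unknown mode: {mode}")
```

For example, `origin_policy("flexible", "https")` yields a plain HTTP request to the origin, which is the case the post recommends only as a last resort.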

The SSL/TLS encryption mode is a zone-wide setting, meaning that Cloudflare applies the same policy to all subdomains and resources. If required, you can configure this setting more granularly via Page Rules. Misconfiguring this setting can make site resources unavailable. For instance, suppose your website loads certain assets from an HTTP-only subdomain. If you set your zone to Full or Full (strict), you might make these assets unavailable for visitors that request the content over HTTPS, since the HTTP-only subdomain lacks HTTPS support.

Importance of secure origin connections

When an end-user visits a site proxied by Cloudflare, there are two connections to consider: the front-end connection between the visitor and Cloudflare and the back-end connection between Cloudflare and the customer origin server. The front-end connection typically presents the largest attack surface (for example, think of the classic case of an attacker snooping on a coffee shop’s Wi-Fi network), but securing the back-end connection is equally important. While all SSL/TLS encryption modes (except Off) secure the front-end connection, less secure modes leave open the possibility of malicious activity on the backend.

Consider a zone set to Flexible where the origin is connected to the Internet via an untrustworthy ISP. In this case, spyware deployed by the customer’s ISP in an on-path middlebox could inspect the plaintext traffic from Cloudflare to the origin server, potentially resulting in privacy violations or leaks of confidential information. Upgrading the zone to Full or a stronger mode, so that traffic crossing the ISP is encrypted, would help prevent this basic form of snooping.

Similarly, consider a zone set to Full where the origin server is hosted in a shared hosting provider facility. An attacker colocated in the same facility could generate a fake certificate for the origin (since the certificate isn’t validated for Full) and deploy an attack technique such as ARP spoofing to direct traffic intended for the origin server to an attacker-owned machine instead. The attacker could then leverage this setup to inspect and filter traffic intended for the origin, resulting in site breakage or content unavailability. The attacker could even inject malicious JavaScript into the response served to the visitor to carry out other nefarious goals. Deploying a valid Cloudflare-trusted certificate on the origin and configuring the zone to use Full (strict) would prevent Cloudflare from trusting the attacker’s fake certificate in this scenario, preventing the hijack.
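
The difference between encrypting without validation and encrypting with validation can be seen in any TLS client. The sketch below uses Python's standard `ssl` module as an analogy for the Full and Full (strict) behaviors; it is not how Cloudflare's edge is implemented, and the function names are made up for this example.

```python
import ssl


def validating_context() -> ssl.SSLContext:
    """Comparable to Full (strict): verify the chain and the hostname."""
    # The default context requires a certificate that chains to a trusted CA
    # and matches the hostname being contacted.
    return ssl.create_default_context()


def non_validating_context() -> ssl.SSLContext:
    """Comparable to Full: encrypted, but any certificate is accepted."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be disabled before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # a forged certificate would go unnoticed
    return ctx
```

A client using the second context still gets an encrypted channel, but it has no way to tell the real origin from an attacker presenting a fake certificate, which is the hijack scenario described above.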

Since a secure backend only improves your website security, we strongly encourage setting your zone to the strongest SSL/TLS encryption mode that your origin supports.

Balancing functionality and security

When Universal SSL was launched, Cloudflare’s goal was to get as many sites away from the status quo of HTTP as possible. To accomplish this, Cloudflare provisioned TLS certificates for all customer domains to secure the connection between the browser and the edge. Customer sites that did not already have TLS support were defaulted to Flexible, to preserve existing site functionality. Although Flexible is not recommended for most zones, we continue to support this option as some Cloudflare customers still rely on it for origins that do not yet support TLS. Disabling this option would make these sites unavailable. Currently, the default option for newly onboarded zones is Full if we detect a TLS certificate on the origin zone, and Flexible otherwise.

Further, the SSL/TLS encryption mode configured at the time of zone sign-up can become suboptimal as a site evolves. For example, a zone might switch to a hosting provider that supports origin certificate installation. An origin server that is able to serve all content over TLS should at least be on Full. An origin server that has a valid TLS certificate installed should use Full (strict) to ensure that communication between Cloudflare and the origin server is not susceptible to monster-in-the-middle attacks.

The Research team combined lessons from academia and our engineering efforts to make encryption easy, while ensuring the highest level of security possible for our customers. Because of that goal, we’re proud to introduce SSL/TLS Recommender.

SSL/TLS Recommender

Cloudflare’s mission is to help build a better Internet, and that includes ensuring that requests from visitors to our customers’ sites are as secure as possible. To that end, we began by asking ourselves the following question: how can we detect when a customer is able to use a more secure SSL/TLS encryption mode without impacting site functionality?

To answer this question, we built the SSL/TLS Recommender. Customers can enable Recommender for a zone via the SSL/TLS tab of the Cloudflare dashboard. Using a zone’s currently configured SSL/TLS option as the baseline for expected site functionality, the Recommender performs a series of checks to determine if an upgrade is possible. If so, we email the zone owner with the recommendation. If a zone is currently misconfigured — for example, an HTTP-only origin configured on Full — Recommender will not recommend a downgrade.

The checks that Recommender runs are determined by the site’s currently configured SSL/TLS option.

The simplest check is to determine if a customer can upgrade from Full to Full (strict). In this case, all site resources are already served over HTTPS, so the check comprises a few simple tests of the validity of the TLS certificate for the domain and all subdomains (which can be on separate origin servers).

The check to determine if a customer can upgrade from Off or Flexible to Full is more complex. A site can be upgraded if all resources on the site are available over HTTPS and the content matches when served over HTTP versus HTTPS. Recommender carries out this check as follows:

  • Crawl customer sites to collect links. For large sites where it is impractical to scan every link, Recommender tests only a subset of links (up to some threshold), leading to a trade-off between performance and potential false positives. Similarly, for sites where the crawl turns up an insufficient number of links, we augment our results with a sample of links from recent visitor requests to the zone to provide a high-confidence recommendation. The crawler uses the user agent Cloudflare-SSLDetector and has been added to Cloudflare’s list of known good bots. Similar to other Cloudflare crawlers, Recommender ignores robots.txt (except for rules explicitly targeting the crawler’s user agent) to avoid negatively impacting the accuracy of the recommendation.
  • Download the content of each link over both HTTP and HTTPS. Recommender makes only idempotent GET requests when scanning origin servers to avoid modifying server resource state.
  • Run a content similarity algorithm to determine if the content matches. The algorithm is adapted from a research paper called “A Deeper Look at Web Content Availability and Consistency over HTTP/S” (TMA Conference 2020) and is designed to provide an accurate similarity score even for sites with dynamic content.
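The three steps above can be sketched roughly as follows. This is only an illustration: it uses Python's `difflib.SequenceMatcher` as a stand-in for the similarity algorithm from the cited paper, and the `can_upgrade` name, the 0.9 threshold, and the input format are all assumptions. The page bodies are passed in as strings so that the sketch makes no network requests.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

def similarity(http_body: str, https_body: str) -> float:
    """Score in [0, 1]; 1.0 means the two bodies are identical."""
    return SequenceMatcher(None, http_body, https_body).ratio()

def can_upgrade(pages: List[Tuple[str, Optional[str]]],
                threshold: float = 0.9) -> bool:
    """pages holds (http_body, https_body) per sampled link.

    https_body is None when the link could not be fetched over HTTPS.
    The threshold is a hypothetical value chosen for illustration.
    """
    if not pages:
        return False  # no evidence, so stay conservative
    for http_body, https_body in pages:
        if https_body is None:
            return False
        if similarity(http_body, https_body) < threshold:
            return False
    return True

sampled = [
    ("<h1>Hello</h1>", "<h1>Hello</h1>"),
    # Dynamic content: a timestamp differs slightly between fetches.
    ("<p>Welcome! Time: 10:01</p>", "<p>Welcome! Time: 10:02</p>"),
]
print(can_upgrade(sampled))
```

A score threshold rather than an exact match is what lets pages with small dynamic differences still count as equivalent.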

Recommender is conservative with recommendations, erring on the side of maintaining current site functionality rather than risking breakage and usability issues. If a zone is non-functional, if the zone owner blocks all types of bots, or if misconfigured SSL-specific Page Rules are applied to the zone, then Recommender will not be able to complete its scans and provide a recommendation. Therefore, it is not intended to resolve issues with website or domain functionality, but rather to maximize your zone’s security when possible.

Please send questions and feedback to [email protected]. We’re excited to continue this line of work to improve the security of customer origins!


While this work is led by the Research team, we have been extremely privileged to get support from all across the company!

Special thanks to the incredible team of interns that contributed to SSL/TLS Recommender. Suleman Ahmad (now full-time), Talha Paracha, and Ananya Ghose built the current iteration of the project and Matthew Bernhard helped to lay the groundwork in a previous iteration of the project.

Dynamic Process Isolation: Research by Cloudflare and TU Graz

Post Syndicated from Kenton Varda original https://blog.cloudflare.com/spectre-research-with-tu-graz/

Last year, I wrote about the Cloudflare Workers security model, including how we fight Spectre attacks. In that post, I explained that there is no known complete defense against Spectre — regardless of whether you’re using isolates, processes, containers, or virtual machines to isolate tenants. What we do have, though, is a huge number of tools to increase the cost of a Spectre attack, to the point where it becomes infeasible. Cloudflare Workers has been designed from the very beginning with protection against side channel attacks in mind, and because of this we have been able to incorporate many defenses that other platforms — such as virtual machines and web browsers — cannot. However, the performance and scalability requirements of edge compute make it infeasible to run every Worker in its own private process, so we cannot rely on the usual defenses provided by the operating system kernel and address space separation.

Given our different approach, we cannot simply rely on others to tell us if we are safe. We had to do our own research. To do this we partnered with researchers at Graz Technical University (TU Graz) to study the impact of Spectre on our environment. The team at TU Graz are some of the foremost experts on the topic, having co-discovered Spectre initially as well as discovered several follow-on bugs like NetSpectre, ZombieLoad, Fallout, and others.

Today we are publishing a paper describing our findings, authored by Martin Schwarzl, Pietro Borrello, Andreas Kogler, Thomas Schuster, Daniel Gruss, Michael Schwarz, and myself. This paper covers research done in 2019 and early 2020. The research both tests the possibility of attacking Workers using Spectre, and proposes a new defense mechanism, which we now employ in production.

For this research, the team at TU Graz had full access to the Workers Runtime source code and were able to compile and run it locally for testing.

The research has two basic components.

Part 1: Develop an attack

A side channel attack (of which Spectre is one variety) is kind of like playing poker with a CPU. In poker, players try to understand what their opponents are thinking by looking for subtle unconscious behaviors, such as a nervous look or a hand motion. These behaviors are called “tells”. In a side channel attack, the attacker wants to find out secrets that the CPU knows. The CPU won’t reveal these secrets directly, but they can sometimes subtly affect how long the CPU spends to perform certain operations, kind of like a poker tell. If an attacker can carefully time the CPU’s actions, they can potentially discover the underlying secrets. Spectre attacks in particular focus on side channels that result from the CPU’s use of speculative execution, in which the CPU executes code that it is not yet sure should be executed, and then attempts to roll it back if not. Speculative execution is a particularly potent tool in side channel attacks because it essentially allows the attacker to program custom side channels in speculatively-executed code.

Many Spectre defenses focus on eliminating the “tells” by trying to prevent the variability in the CPU’s timing. This is hard, because CPUs are extremely complex and there are many ways that their timing can be affected. While many specific “tells” have been found and mitigated, there are undoubtedly many more that haven’t been disclosed. This has led to a game of whack-a-mole, where researchers continuously find new “tells” while CPU vendors rush out kernel and microcode patches to solve them — often with large performance losses as a side effect.

In Workers, we have focused on a different approach: preventing the attacker from seeing the “tells”. The Workers Runtime is designed to prevent a Worker from measuring its own execution time, as well as to prevent other forms of non-deterministic behavior like multithreading that could be used in place of a timer. I described these techniques in detail in last year’s post.

However, this approach can’t be perfect as long as Workers are allowed to talk to the rest of the world. A Worker could always communicate with a remote time server to measure time. Such communications will be far less accurate than a local timer, and since the timing differences are extremely small, they will be hard to measure this way. But, by using amplification techniques to improve the strength of the signal, repeating the attack many times and applying statistics, it could still be possible to derive secrets.
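To see why repetition and statistics can recover a signal that a single noisy measurement cannot, consider the following simulation. It is a toy model, not the attack from the paper: the base latency, the 50 ns "tell", the 1000 ns noise level, and the assumption that the attacker can also collect a known baseline are all made up for illustration.

```python
import random
import statistics

def measure(secret_bit: int, rng: random.Random,
            noise_sd: float = 1000.0) -> float:
    """One simulated remote timing sample in nanoseconds (values hypothetical)."""
    base = 50_000.0
    signal = 50.0 if secret_bit else 0.0  # the tiny timing "tell"
    return base + signal + rng.gauss(0.0, noise_sd)

def guess_bit(secret_bit: int, trials: int, rng: random.Random) -> int:
    """Average many samples so that the noise shrinks relative to the signal."""
    baseline = statistics.fmean(measure(0, rng) for _ in range(trials))
    probe = statistics.fmean(measure(secret_bit, rng) for _ in range(trials))
    # Decide using the midpoint between the two possible mean differences.
    return 1 if probe - baseline > 25.0 else 0

rng = random.Random(1)
print(guess_bit(1, 100_000, rng), guess_bit(0, 100_000, rng))
```

A single sample is dominated by noise twenty times larger than the signal, but the error of the averaged estimate shrinks with the square root of the number of trials, which is why the attacker needs many repetitions.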

We therefore set out to develop an attack based on this approach. Upon applying the best techniques available to us, we were indeed able to produce a working Spectre variant 1 attack that could leak memory at a rate of 120 bits per hour. Compared to attacks demonstrated on many other platforms, 120 bits per hour is pretty slow. However, it’s obviously still fast enough to be a problem.

It’s important to note, though, that this speed was achieved in an ideal scenario:

  • Since the Workers Runtime prevents Workers from measuring their own execution time, any attack would need to rely on a remote time server. But for the purpose of our test, the “remote” server was in fact located on the same machine. In a real-world scenario, such a server would need to be accessed over the Internet, making the timing less accurate.
  • The machine running the test had no other load. A real-world machine would be processing hundreds or thousands of requests concurrently, creating noise.
  • The attack only demonstrated that it could read some bits that it shouldn’t. In order to read interesting bits, an attacker would first need to locate those bits, which likely would require reading hundreds or thousands of other bits first.

In the real world, these factors appear to make an attack too slow to be interesting. If an attack takes days or weeks to carry out, the contents of memory are highly likely to change before it can read them. For example, we update the Workers Runtime code at least once a week, which causes a restart of all processes.
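A back-of-the-envelope calculation shows how quickly these factors add up. The 120 bits per hour figure comes from the experiment above; the 256-bit secret and the 100x real-world slowdown are purely hypothetical values chosen for illustration.

```python
LEAK_RATE_BITS_PER_HOUR = 120  # measured under ideal lab conditions

def hours_to_leak(bits: int, slowdown: float = 1.0) -> float:
    """Hours needed to leak `bits`, given a hypothetical slowdown factor."""
    return bits * slowdown / LEAK_RATE_BITS_PER_HOUR

ideal = hours_to_leak(256)            # about 2.1 hours
degraded = hours_to_leak(256, 100.0)  # about 213 hours, or roughly 8.9 days
print(f"{ideal:.1f} h ideal, {degraded / 24:.1f} days with 100x slowdown")
```

Under that assumed slowdown, even a secret whose location is already known would take longer to read than the weekly restart interval mentioned above.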

That said, we did not feel comfortable relying on this argument as our defense. Instead, we set out to do better.

Part 2: Enhance our defenses

In the second part of the research, we designed and implemented a novel Spectre defense which we call Dynamic Process Isolation.

Dynamic Process Isolation was described in my blog post last year. At the time, this system was still in testing, but it has since been fully deployed in production.

In short, our defense uses hardware performance counters to detect Workers whose performance characteristics could be indicative of an attack. Before the attack has had enough time to leak any bits, we move the Worker into a separate operating system process, thus taking advantage of the additional defenses implemented by the OS kernel. Crucially, since a benign Worker can still operate normally while in an isolated process, we are able to use a detector that produces false positives, as long as the rate is relatively low. This affordance made it possible for us to develop a working classifier where previous work in the area had struggled.

Specifically, we developed a detector based on measuring branch mispredictions. Spectre variant 1 attacks (the fastest and easiest kind of Spectre attack) work by fooling the CPU’s branch predictor to trigger speculative code execution. Such an attack, when running in our environment, must trigger repeated mispredictions in a loop, in order to get enough data to apply statistics to overcome the noise floor. We can see these mispredictions in the hardware performance counters. While an attack could try to evade the detector by spreading out its trials over a longer time period, doing so would slow down the attack by orders of magnitude, which is exactly our goal. Classifiers for other Spectre variants might be straightforward to build as well. However, we find that other variants already produce much lower bandwidth or are otherwise effectively mitigated by our existing defenses.
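The general shape of such a detector can be sketched as a sliding-window threshold over per-interval counter readings. This is not the classifier from the paper: the class name, the 5% rate threshold, and the window sizes are hypothetical, and the counter values are passed in directly rather than read from real hardware.

```python
from collections import deque

class MispredictionDetector:
    """Flags a Worker whose branch misprediction rate stays high.

    All thresholds here are hypothetical and chosen for illustration.
    """

    def __init__(self, rate_threshold: float = 0.05,
                 window: int = 5, min_flagged: int = 4) -> None:
        self.rate_threshold = rate_threshold
        self.min_flagged = min_flagged
        self.recent = deque(maxlen=window)  # one boolean per sampling interval

    def observe(self, branches: int, mispredictions: int) -> bool:
        """Record one interval; return True if the Worker should be isolated."""
        rate = mispredictions / branches if branches else 0.0
        self.recent.append(rate > self.rate_threshold)
        return sum(self.recent) >= self.min_flagged

detector = MispredictionDetector()
benign = [detector.observe(1_000_000, 10_000) for _ in range(5)]
attack = [detector.observe(1_000_000, 200_000) for _ in range(4)]
print(any(benign), attack[-1])
```

Because an isolated Worker keeps running normally, an occasional false positive costs only some efficiency. An attacker who spreads trials out to stay under the threshold pays for it with a much slower attack.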

This defense successfully detects and mitigates the attack we developed. We also tested it against a number of Spectre proofs of concept and found it caught all of them. Meanwhile, the rate of false positives is well within the range we can tolerate: Out of many thousands of Workers running on our platform, we see only about 20 being falsely detected as attacks.

For more details, check out the paper and my blog post from last year.

Read the Paper

Collaborating with TU Graz was a great experience. We are very happy to work with some of the world’s foremost experts on this problem, and to have produced not just an attack but also a constructive defense.

For more details, download the full paper on arXiv.

Handshake Encryption: Endgame (an ECH update)

Post Syndicated from Christopher Wood original https://blog.cloudflare.com/handshake-encryption-endgame-an-ech-update/

Privacy and security are fundamental to Cloudflare, and we believe in and champion the use of cryptography to help provide these fundamentals for customers, end-users, and the Internet at large. In the past, we helped specify, implement, and ship TLS 1.3, the latest version of the transport security protocol underlying the web, to all of our users. TLS 1.3 vastly improved upon prior versions of the protocol with respect to security, privacy, and performance: simpler cryptographic algorithms, more handshake encryption, and fewer round trips are just a few of the many great features of this protocol.

TLS 1.3 was a tremendous improvement over TLS 1.2, but there is still room for improvement. Sensitive metadata relating to application or user intent is still visible in plaintext on the wire. In particular, all client parameters, including the name of the target server the client is connecting to, are visible in plaintext. For obvious reasons, this is problematic from a privacy perspective: Even if your application traffic to crypto.cloudflare.com is encrypted, the fact you’re visiting crypto.cloudflare.com can be quite revealing.

And so, in collaboration with other participants in the standardization community and members of industry, we embarked towards a solution for encrypting all sensitive TLS metadata in transit. The result: TLS Encrypted ClientHello (ECH), an extension to protect this sensitive metadata during connection establishment.

Last year, we described the current status of this standard and its relation to the TLS 1.3 standardization effort, as well as ECH’s predecessor, Encrypted SNI (ESNI). The protocol has come a long way since then, but how will we know when it’s ready? There are many ways by which one can measure a protocol. Is it implementable? Is it easy to enable? Does it seamlessly integrate with existing protocols or applications? In order to assess these questions and see if the Internet is ready for ECH, the community needs deployment experience. Hence, for the past year, we’ve been focused on making the protocol stable, interoperable, and, ultimately, deployable. And today, we’re pleased to announce that we’ve begun our initial deployment of TLS ECH.

What does ECH mean for connection security and privacy on the network? How does it relate to similar technologies and concepts such as domain fronting? In this post, we’ll dig into ECH details and describe what this protocol does to move the needle to help build a better Internet.

Connection privacy

For most Internet users, connections are made to perform some type of task, such as loading a web page, sending a message to a friend, purchasing some items online, or accessing bank account information. Each of these connections reveals some limited information about user behavior. For example, a connection to a messaging platform reveals that one might be trying to send or receive a message. Similarly, a connection to a bank or financial institution reveals when the user typically makes financial transactions. Individually, this metadata might seem harmless. But consider what happens when it accumulates: does the set of websites you visit on a regular basis uniquely identify you as a user? The safe answer is: yes.

This type of metadata is privacy-sensitive, and ultimately something that should only be known by two entities: the user who initiates the connection, and the service which accepts the connection. However, the reality today is that this metadata is known to more than those two entities.

Making this information private is no easy feat. The nature or intent of a connection, i.e., the name of the service such as crypto.cloudflare.com, is revealed in multiple places during the course of connection establishment: during DNS resolution, wherein clients map service names to IP addresses; and during connection establishment, wherein clients indicate the service name to the target server. (Note: there are other small leaks, though DNS and TLS are the primary problems on the Internet today.)

As is common in recent years, the solution to this problem is encryption. DNS-over-HTTPS (DoH) is a protocol for encrypting DNS queries and responses to hide this information from on-path observers. Encrypted Client Hello (ECH) is the complementary protocol for TLS.

The TLS handshake begins when the client sends a ClientHello message to the server over a TCP connection (or, in the context of QUIC, over UDP) with relevant parameters, including those that are sensitive. The server responds with a ServerHello, encrypted parameters, and all that’s needed to finish the handshake.

The goal of ECH is as simple as its name suggests: to encrypt the ClientHello so that privacy-sensitive parameters, such as the service name, are unintelligible to anyone listening on the network. The client encrypts this message using a public key it learns by making a DNS query for a special record known as the HTTPS resource record. This record advertises the server’s various TLS and HTTPS capabilities, including ECH support. The server decrypts the encrypted ClientHello using the corresponding secret key.

Conceptually, DoH and ECH are somewhat similar. With DoH, clients establish an encrypted connection (HTTPS) to a DNS recursive resolver and, within that connection, perform DNS transactions.

With ECH, clients establish an encrypted connection to a TLS-terminating server such as crypto.cloudflare.com, and within that connection, request resources for an authorized domain such as cloudflareresearch.com.

There is one very important difference between DoH and ECH that is worth highlighting. Whereas a DoH recursive resolver is specifically designed to allow queries for any domain, a TLS server is configured to allow connections for a select set of authorized domains. Typically, the set of authorized domains for a TLS server are those which appear on its certificate, as these constitute the set of names for which the server is authorized to terminate a connection.

Basically, this means the DNS resolver is open, whereas the ECH client-facing server is closed. And this closed set of authorized domains is informally referred to as the anonymity set. (This will become important later on in this post.) Moreover, the anonymity set is assumed to be public information. Anyone can query DNS to discover what domains map to the same client-facing server.

Why is this distinction important? It means that one cannot use ECH for the purposes of connecting to an authorized domain and then interacting with a different domain, a practice commonly referred to as domain fronting. When a client connects to a server using an authorized domain but then tries to interact with a different domain within that connection, e.g., by sending HTTP requests for an origin that does not match the domain of the connection, the request will fail.
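The rule described above can be expressed as a simple check on the server side. This is a sketch under stated assumptions: the function name and the contents of the authorized set are hypothetical, and a real server's matching rules are more involved than a string comparison.

```python
# Hypothetical set of domains this server is authorized to terminate.
AUTHORIZED = {"crypto.cloudflare.com", "cloudflareresearch.com"}

def accept_request(connection_name: str, http_host: str,
                   authorized=AUTHORIZED) -> bool:
    """connection_name is the server name from the decrypted ClientHello."""
    if connection_name.lower() not in authorized:
        return False
    # The HTTP request must target the domain the connection was made for.
    return http_host.lower() == connection_name.lower()

print(accept_request("cloudflareresearch.com", "cloudflareresearch.com"))
print(accept_request("cloudflareresearch.com", "unrelated.example"))
```

The second call models a fronting attempt: the connection is for an authorized name, but the request inside it asks for something else, so it is refused.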

From a high level, encrypting names in DNS and TLS may seem like a simple feat. However, as we’ll show, ECH demands a different look at security and an updated threat model.

A changing threat model and design confidence

The typical threat model for TLS is known as the Dolev-Yao model, in which an active network attacker can read, write, and delete packets from the network. This attacker’s goal is to derive the shared session key. There has been a tremendous amount of research analyzing the security of TLS to gain confidence that the protocol achieves this goal.

The threat model for ECH is somewhat stronger than considered in previous work. Not only should it be hard to derive the session key, it should also be hard for the attacker to determine the identity of the server from a known anonymity set. That is, ideally, it should have no more advantage in identifying the server than if it simply guessed from the set of servers in the anonymity set. And recall that the attacker is free to read, write, and modify any packet as part of the TLS connection. This means, for example, that an attacker can replay a ClientHello and observe the server’s response. It can also extract pieces of the ClientHello — including the ECH extension — and use them in its own modified ClientHello.

The design of ECH ensures that this sort of attack is virtually impossible by ensuring the server certificate can only be decrypted by either the client or client-facing server.

Something else an attacker might try is to masquerade as the server and actively interfere with the client to observe its behavior. If the client reacted differently based on whether the server-provided certificate was correct, this would allow the attacker to test whether a given connection using ECH was for a particular name.

ECH also defends against this attack by ensuring that an attacker without access to the private ECH key material cannot actively inject anything into the connection.

The attacker can also be entirely passive and try to infer encrypted information from other visible metadata, such as packet sizes and timing. (Indeed, traffic analysis is an open problem for ECH and in general for TLS and related protocols.) Passive attackers simply sit and listen to TLS connections, and use what they see and, importantly, what they know to make determinations about the connection contents. For example, if a passive attacker knows that the name of the client-facing server is crypto.cloudflare.com, and it sees a ClientHello with ECH to crypto.cloudflare.com, it can conclude, with reasonable certainty, that the connection is to some domain in the anonymity set of crypto.cloudflare.com.

The number of potential attack vectors is astonishing, and something that the TLS working group has tripped over in prior iterations of the ECH design. Before any sort of real world deployment and experiment, we needed confidence in the design of this protocol. To that end, we are working closely with external researchers on a formal analysis of the ECH design which captures the following security goals:

  1. Use of ECH does not weaken the security properties of TLS without ECH.
  2. TLS connection establishment to a host in the client-facing server’s anonymity set is indistinguishable from a connection to any other host in that anonymity set.

We’ll write more about the model and analysis when they’re ready. Stay tuned!

There are plenty of other subtle security properties we desire for ECH, and some of these drill right into the most important question for a privacy-enhancing technology: Is this deployable?

Focusing on deployability

With confidence in the security and privacy properties of the protocol, we then turned our attention towards deployability. In the past, significant protocol changes to fundamental Internet protocols such as TCP or TLS have been complicated by some form of benign interference. Network software, like any software, is prone to bugs, and sometimes these bugs manifest in ways that we only detect when there’s a change elsewhere in the protocol. For example, TLS 1.3 unveiled middlebox ossification bugs that ultimately led to the middlebox compatibility mode for TLS 1.3.

Although ECH is itself just an extension, the risk of it exposing (or introducing!) similar bugs is real. To combat this problem, ECH supports a variant of GREASE whose goal is to ensure that all ECH-capable clients produce syntactically equivalent ClientHello messages. In particular, if a client supports ECH but does not have the corresponding ECH configuration, it uses GREASE. Otherwise, it produces a ClientHello with real ECH support. In both cases, the syntax of the ClientHello messages is equivalent.

This hopefully avoids network bugs that would otherwise trigger upon real or fake ECH. Or, in other words, it helps ensure that all ECH-capable client connections are treated similarly in the presence of benign network bugs or otherwise passive attackers. Interestingly, active attackers can easily distinguish — with some probability — between real or fake ECH. Using GREASE, the ClientHello carries an ECH extension, though its contents are effectively randomized, whereas a real ClientHello using ECH has information that will match what is contained in DNS. This means an active attacker can simply compare the ClientHello against what’s in the DNS. Indeed, anyone can query DNS and use it to determine if a ClientHello is real or fake:

$ dig +short crypto.cloudflare.com TYPE65
\# 134 0001000001000302683200040008A29F874FA29F884F000500480046 FE0D0042D500200020E3541EC94A36DCBF823454BA591D815C240815 77FD00CAC9DC16C884DF80565F0004000100010013636C6F7564666C 6172652D65736E692E636F6D00000006002026064700000700000000 0000A29F874F260647000007000000000000A29F884F

Despite this obvious distinguisher, the end result isn’t that interesting. If a server is capable of ECH and a client is capable of ECH, then the connection most likely used ECH, and whether clients and servers are capable of ECH is assumed public information. Thus, GREASE is primarily intended to ease deployment against benign network bugs and otherwise passive attackers.
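The record returned by the dig query above is in the generic wire format, which makes it hard to read. The following sketch unpacks it into its priority, target name, and parameters. The parameter key numbers (1 for alpn, 4 for ipv4hint, 5 for ech, 6 for ipv6hint) follow the SVCB/HTTPS record registry; the function names are our own, and the ECH configuration itself is left as opaque bytes.

```python
import ipaddress
import struct

# RDATA copied from the dig output above (134 bytes, hex-encoded).
RDATA_HEX = """
0001000001000302683200040008A29F874FA29F884F000500480046
FE0D0042D500200020E3541EC94A36DCBF823454BA591D815C240815
77FD00CAC9DC16C884DF80565F0004000100010013636C6F7564666C
6172652D65736E692E636F6D00000006002026064700000700000000
0000A29F874F260647000007000000000000A29F884F
"""

ALPN, IPV4HINT, ECH, IPV6HINT = 1, 4, 5, 6  # SvcParamKey numbers

def parse_https_rdata(rdata: bytes) -> dict:
    """Split HTTPS record RDATA into priority, target name, and raw params."""
    (priority,) = struct.unpack_from("!H", rdata, 0)
    pos = 2
    labels = []
    while rdata[pos] != 0:  # target name: length-prefixed DNS labels
        length = rdata[pos]
        labels.append(rdata[pos + 1 : pos + 1 + length].decode("ascii"))
        pos += 1 + length
    pos += 1  # skip the terminating root label
    params = {}
    while pos < len(rdata):  # each param: 2-byte key, 2-byte length, value
        key, length = struct.unpack_from("!HH", rdata, pos)
        params[key] = rdata[pos + 4 : pos + 4 + length]
        pos += 4 + length
    return {"priority": priority, "target": ".".join(labels) or ".",
            "params": params}

def ipv4_hints(value: bytes) -> list:
    """Decode an ipv4hint value into dotted-quad strings."""
    return [str(ipaddress.IPv4Address(value[i : i + 4]))
            for i in range(0, len(value), 4)]

record = parse_https_rdata(bytes.fromhex(RDATA_HEX))
print(record["priority"], record["target"], sorted(record["params"]))
print(ipv4_hints(record["params"][IPV4HINT]))
print(len(record["params"][ECH]), "bytes of ECH configuration")
```

The value under key 5 is the published ECH configuration, which is the information an observer can compare against the ECH extension in a ClientHello.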

Note, importantly, that GREASE (or fake) ECH ClientHello messages are semantically different from real ECH ClientHello messages. This presents a real problem for networks such as enterprise settings or school environments that otherwise use plaintext TLS information for the purposes of implementing various features like filtering or parental controls. (Encrypted DNS protocols like DoH also encountered similar obstacles in their deployment.) Fundamentally, this problem reduces to the following: How can networks securely disable features like DoH and ECH? Fortunately, there are a number of approaches that might work, with the more promising one centered around DNS discovery. In particular, if clients could securely discover encrypted recursive resolvers that can perform filtering in lieu of it being done at the TLS layer, then TLS-layer filtering might be wholly unnecessary. (Other approaches, such as the use of canary domains to give networks an opportunity to signal that certain features are not permitted, may work, though it’s not clear if these could or would be abused to disable ECH.)

We are eager to collaborate with browser vendors, network operators, and other stakeholders to find a feasible deployment model that works well for users without ultimately stifling connection privacy for everyone else.

Next steps

ECH is rolling out for some FREE zones on our network in select geographic regions. We will continue to expand the set of zones and regions that support ECH slowly, monitoring for failures in the process. Ultimately, the goal is to work with the rest of the TLS working group and IETF towards updating the specification based on this experiment in hopes of making it safe, secure, usable, and, ultimately, deployable for the Internet.

ECH is one part of the connection privacy story. Like a leaky boat, it’s important to look for and plug all the gaps before taking on lots of passengers! Cloudflare Research is committed to these narrow technical problems and their long-term solutions. Stay tuned for more updates on this and related protocols.

Privacy Pass v3: the new privacy bits

Post Syndicated from Pop Chunhapanya original https://blog.cloudflare.com/privacy-pass-v3/

In November 2017, we released our implementation of a privacy preserving protocol to let users prove that they are humans without enabling tracking. When you install Privacy Pass’s browser extension, you get tokens when you solve a Cloudflare CAPTCHA which can be used to avoid needing to solve one again… The redeemed token is cryptographically unlinkable to the token originally provided by the server. That is why Privacy Pass is privacy preserving.

In October 2019, Privacy Pass reached another milestone. We released Privacy Pass Extension v2.0, which includes a new service provider (hCaptcha) and provides a way to redeem a token not only with CAPTCHAs on Cloudflare challenge pages but also with hCaptcha CAPTCHAs on any website. When you encounter an hCaptcha CAPTCHA on any website, including ones not behind Cloudflare, you can redeem a token to pass the CAPTCHA.

We believe Privacy Pass solves an important problem (balancing privacy and security for bot mitigation), but we think there’s more to be done in terms of both the codebase and the protocol. We improved the codebase by redesigning how the service providers interact with the core extension. At the same time, we made progress on the standardization at IETF and improved the protocol by adding metadata which allows us to do more fabulous things with Privacy Pass.

Announcing Privacy Pass Extension v3.0

The current implementation of our extension is functional, but it is difficult to maintain two Privacy Pass service providers: Cloudflare and hCaptcha. So we decided to refactor the browser extension to improve its maintainability. We also used this opportunity to make the following improvements:

  • Implement the extension using TypeScript instead of plain JavaScript.
  • Build the project using a module bundler instead of custom build scripts.
  • Refactor the code and define the API for the cryptographic primitive.
  • Treat provider-specific code as an encapsulated software module rather than a list of configuration properties.

As a result of the improvements listed above, the extension will be less error-prone and each service provider will have more flexibility and can be integrated seamlessly with other providers.

In the new extension we use TypeScript instead of plain JavaScript because it is a syntactic superset of JavaScript, and we already use TypeScript in Workers. One of the things that makes TypeScript special is that it offers features otherwise found only in modern programming languages, like null safety.

Support for Future Service Providers

Another big improvement in v3.0 is that it is designed for modularity, meaning that it will be very easy to add a new potential service provider in the future. A new provider can use an API provided by us to implement their own request flow to use the Privacy Pass protocol and to handle the HTTP requests. By separating the provider-specific code from the core extension code using the API, the extension will be easier to update when there is a need for more service providers.

On a technical level, we allow each service provider to have its own WebRequest API event listeners instead of having central event listeners for all the providers. This allows providers to extend the browser extension’s functionality and implement any request handling logic they want.

Another major change that enables us to do this is that we moved away from configuration to programmable modularization.

Configuration vs Modularization

As mentioned in 2019, it would be impossible to expect different service providers to all abide by the same exact request flow, so we decided to use a JSON configuration file in v2.0 to define the request flow. The configuration allows the service providers to easily modify the extension characteristics without dealing too much with the core extension code. However, we recently figured out that we can improve on this by using modules instead of a configuration file.

Using a configuration file limits the flexibility of the provider by the number of possible configurations. In addition, when the logic of each provider evolves and deviates from one another, the size of configuration will grow larger and larger which makes it hard to document and keep track of. So we decided to refactor how we determine the request flow from using a configuration file to using a module file written specifically for each service provider instead.

By using a programmable module, the providers are not limited by the available fields in the configuration. In addition, the providers can use the available implementations of the necessary cryptographic primitives at any point of the request flow because we factored out the crypto bits into a separate module which can be used by any provider. In the future, if the cryptographic primitives ever change, the providers can update the code and use it any time.
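The difference between configuration and modules can be illustrated with a small sketch. The extension itself is written in TypeScript; this language-neutral Python outline only shows the shape of the idea, and every class, method, and hostname in it is hypothetical rather than taken from the extension's real API.

```python
from abc import ABC, abstractmethod
from typing import List, Optional

class Provider(ABC):
    """A provider-specific module that owns its request-handling logic."""

    @abstractmethod
    def on_before_request(self, url: str) -> Optional[str]:
        """Return a token to redeem for this request, or None to ignore it."""

class ExampleCaptchaProvider(Provider):
    """Hypothetical provider: redeems tokens only on its own challenge host."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens

    def on_before_request(self, url: str) -> Optional[str]:
        if "captcha.example" in url and self.tokens:
            return self.tokens.pop()
        return None

class Extension:
    """Core extension: knows nothing about any provider's request flow."""

    def __init__(self) -> None:
        self.providers: List[Provider] = []

    def register(self, provider: Provider) -> None:
        self.providers.append(provider)

    def dispatch(self, url: str) -> Optional[str]:
        # Each provider runs its own listener; the first one to answer wins.
        for provider in self.providers:
            token = provider.on_before_request(url)
            if token is not None:
                return token
        return None

extension = Extension()
extension.register(ExampleCaptchaProvider(["token-1", "token-2"]))
print(extension.dispatch("https://captcha.example/challenge"))
print(extension.dispatch("https://other.example/"))
```

Adding a provider means writing one more class with whatever logic it needs, rather than extending a shared configuration schema to cover every possible flow.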

Towards Standard Interoperability

The Privacy Pass protocol was first published at the PoPETS symposium in 2018. As explained in this previous post, the core of the Privacy Pass protocol is a secure way to generate tokens between server and client. To that end, the protocol requires evaluating a pseudorandom function that is oblivious and verifiable. The first property prevents the server from learning information about the client’s tokens, while the client learns nothing about the server’s private key. This is useful to protect the privacy of users. The token generation must also be verifiable in the sense that the client can attest to the fact that its token was minted using the server’s private key.

The original implementation of Privacy Pass has seen real-world use in our browser extension, helping to reduce CAPTCHAs for hundreds of thousands of people without compromising privacy. But to guarantee interoperability between services implementing Privacy Pass, what’s required is an accurate specification of the protocol and its operations. With this motivation, the Privacy Pass protocol was proposed as an Internet draft at the Internet Engineering Task Force (IETF). To learn more about our participation at the IETF, see the post.

In March 2020, the protocol was presented at IETF-107 for the first time. The session was a Birds-of-a-Feather, a place where the IETF community discusses the creation of new working groups that will write the actual standards. In the session, the working group’s charter was presented; it proposed developing a secure protocol for redeeming unforgeable tokens that attest to the validity of some attribute being held by a client. The charter was later approved, and three documents were integrated covering the protocol, the architecture, and an HTTP API for supporting Privacy Pass. The working group at IETF can be found at https://datatracker.ietf.org/wg/privacypass/.

In addition to its core functionality, the Privacy Pass protocol can be extended to improve its usability or to add new capabilities. For instance, adding a mechanism for public verifiability will allow a third party, someone who did not participate in the protocol, to verify the validity of tokens. Public verifiability can be implemented using a blind-signature scheme: a special type of digital signature, first proposed by David Chaum, in which signers can produce signatures on messages without learning the content of the message. A variety of algorithms for implementing blind signatures exist; however, there is still work to be done to define a good candidate for public verifiability.

Another extension for Privacy Pass is the support for including metadata in the tokens. As this is a feature with high impact on the protocol, we devote a larger section to explain the benefits of supporting metadata in the face of hoarding attacks.

Future work: metadata

What is research without new challenges that arise? What does development look like if there are no other problems to solve? During the design and development of Privacy Pass (as a service, as an idea, and as a protocol), a potential vector for abuse was noted, which we will refer to as a “hoarding” or “farming” attack. In this attack, individual users or groups of users gather tokens over a long period of time and redeem them all at once with the aim of, for example, overwhelming a website and making the service unavailable for other users. In a more complex scenario, an attacker can build up a stock of tokens that they could then redistribute amongst other clients. This redistribution is possible because tokens are not linked to specific clients, which is a property of the Privacy Pass protocol.

There have been several proposed solutions to this attack. One can, for example, make the token verification procedure very efficient, so attackers will need to hoard an even larger number of tokens in order to overwhelm a service. But the problem is not only about making verification faster, so this does not completely solve it. Note that in Privacy Pass, a successful token redemption could be exchanged for a single-origin cookie. These cookies allow clients to avoid future challenges for a particular domain without using more tokens. In the case of a hoarding attack, an attacker could trade in their hoarded tokens for a number of cookies. The attacker can then mount a layer 7 DDoS attack with the “hoarded” cookies, which would render the service unavailable.

In the next sections, we will explore other solutions to this attack.

A simple solution and its limitations: key rotation

What does “key rotation” mean in the context of Privacy Pass? In Privacy Pass, each token is attested by keys held by the service. These keys are further used to verify the validity of a token presented by a client when trying to access a challenge-protected service. “Key rotation” means updating these keys with regard to a chosen epoch (for example, if the epoch is two weeks, the keys are rotated every two weeks). Regular key rotation, then, implies that tokens belong to these epochs and cannot be used outside them, which prevents stocks of tokens from being useful for longer than the epoch they belong to.
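
As a rough illustration of how epochs bound a token's lifetime, the sketch below maps timestamps to epoch indices and treats a token as usable only within the epoch it was issued in. The two-week period comes from the example above; the start date and function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH_LENGTH = timedelta(weeks=2)  # rotation period from the example above
ROTATION_START = datetime(2021, 1, 1, tzinfo=timezone.utc)  # hypothetical start date


def epoch_index(when: datetime) -> int:
    """Number of complete rotation periods elapsed since ROTATION_START."""
    return (when - ROTATION_START) // EPOCH_LENGTH


def token_usable(issued_at: datetime, redeemed_at: datetime) -> bool:
    """A token is only useful while the key of its issuance epoch is current."""
    return epoch_index(issued_at) == epoch_index(redeemed_at)
```

A token issued on day 1 can still be redeemed on day 13, but not on day 15, because the key has been rotated in between.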

Keys, however, should not be rotated frequently as:

  • Rotating a key can lead to security implications
  • Establishing trust in a frequently-rotating key service can be a challenging problem
  • The unlinkability of the client when using tokens can be diminished

Let’s explore these problems one by one now:

Rotating a key can lead to security implications, as past keys need to be deleted from secure storage locations and replaced with new ones. This process is prone to failure if done regularly, and can lead to potential key material leakage.

Establishing trust in a frequently-rotating key service can be a challenging problem, as keys will have to be verified by the needed parties each time they are regenerated. Keys need to be verified as it has to be attested that they belong to the entity one is trying to communicate with. If keys rotate too frequently, this verification procedure will have to happen frequently as well, so that an attacker will not be able to impersonate the honest entity with a “fake” public key.

The unlinkability of the client when using tokens can be diminished as a savvy attacker (a malicious server, for example) could link token generation and token future-use. In the case of a malicious server, it can, for example, rotate their keys too often to violate unlinkability or could pick a separate public key for each client issuance. In these cases, this attack can be solved by the usage of public mechanisms to record which server’s public keys are used; but this requires further infrastructure and coordination between actors. Other cases are not easily solvable by this “public verification”: if keys are rotated every minute, for example, and a client was the only one to visit a “privacy pass protected” site in that minute, then, it’s not hard to infer (to “link”) that the token came only from this specific client.

A novel solution: Metadata

A novel solution to this “hoarding” problem that does not require key rotation or further optimization of verification times is the addition of metadata. This approach was introduced in the paper “A Fast and Simple Partially Oblivious PRF, with Applications”, and it is called the “POPRF with metadata” construction. The idea is to add a metadata field to the token generation procedure in such a way that tokens are cryptographically linked to this added metadata. The added metadata can be, for example, a number that signals which epoch this token belongs to. The service, when presented with this token on verification, promptly checks that it corresponds to its internal epoch number (this epoch number can correspond to a period of time, a threshold of number of tokens issued, etc.). If it does not correspond, this token is expired and cannot be further used. Metadata, then, can be used to expire tokens without performing key rotations, thereby avoiding some issues outlined above.
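
The binding between a token and its metadata can be illustrated with a deliberately simplified sketch. Note the substitution: the code below uses a plain HMAC, which is not oblivious (the issuer sees the nonce) and therefore offers none of the privacy of the POPRF construction. It only shows how an epoch value that is cryptographically bound to a token lets the service expire tokens without rotating its key. All function names are hypothetical.

```python
import hashlib
import hmac
import os
from typing import Tuple


def _tag(key: bytes, epoch: int, nonce: bytes) -> bytes:
    # The metadata (epoch) is part of the authenticated input, so it cannot
    # be swapped for another value after issuance.
    return hmac.new(key, epoch.to_bytes(8, "big") + nonce, hashlib.sha256).digest()


def issue_token(key: bytes, epoch: int) -> Tuple[bytes, int, bytes]:
    """Issue a token bound to the given epoch (HMAC stand-in, NOT a POPRF)."""
    nonce = os.urandom(16)
    return nonce, epoch, _tag(key, epoch, nonce)


def redeem_token(key: bytes, current_epoch: int, nonce: bytes, epoch: int, tag: bytes) -> bool:
    """Accept the token only if it is authentic and belongs to the current epoch."""
    if epoch != current_epoch:
        return False  # expired: no key rotation needed
    return hmac.compare_digest(_tag(key, epoch, nonce), tag)
```

In this sketch the same key stays in place across epochs; a token from epoch 7 simply stops verifying once the service moves to epoch 8, and relabeling it as epoch 8 invalidates its tag.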

Other kinds of metadata can be added to the Partially Oblivious PRF (PO-PRF) construction as well. Geographic location can be added, which signals that tokens can only be used in a specific region.

The limits of metadata

Note, nevertheless, that the addition of this “metadata” should be carefully considered: in the case of “time-metadata”, adding an explicit time-bound signal will diminish the unlinkability set of the tokens. If an explicit time-bound signal is added (for example, the specific time, down to the year, month, day, hour, minute and second, at which this token was generated and the amount of time it is valid for), it will allow a malicious server to link generation and usage. The recommendation is to use “opaque metadata”: metadata that is public to both client and service but whose precise meaning only the service knows. A server, for example, can set a counter that gets increased after a period of time (for example, every two weeks). The server will add this counter as metadata rather than the period of time. The client, in this case, knows what this counter is but does not know which period it refers to.

Geographic location metadata should be coarse as well: it should refer to a large geographical area, such as a continent, or political and economic union rather than an explicit location.

Wrap up

The Privacy Pass protocol provides users with a secure way to redeem tokens. At Cloudflare, we use the protocol to reduce the number of CAPTCHAs, improving the user experience while browsing websites. A natural evolution of the protocol is expected, ranging from its standardization to innovating with new capabilities that help to prevent abuse of the service.

On the service side, we refactored the Privacy Pass browser extension aiming to improve the quality of the code, so bugs can be detected in earlier phases of the development. The code is available at the challenge-bypass-extension repository, and we invite you to try the release candidate version.

An appealing extension for Privacy Pass is the inclusion of metadata as it provides a non-cumbersome way to solve hoarding attacks, while preserving the anonymity (in general, the privacy) of the protocol itself. Our paper provides you more information about the technical details behind this idea.

The application of the Privacy Pass protocol in other use cases or to create other service providers requires a certain degree of compatibility. People wanting to implement Privacy Pass must be able to have a standard specification, so implementations can interoperate. The efforts along these lines are centered on the Privacy Pass working group at IETF, a space open for anyone to participate in delineating the future of the protocol. Feel free to be part of these efforts too.

We are continuously working on new ways of improving our services and helping the Internet be a better and a more secure place. You can join us on this effort and can reach us at research.cloudflare.com. See you next time.

Helping Apache Servers stay safe from zero-day path traversal attacks (CVE-2021-41773)

Post Syndicated from Michael Tremante original https://blog.cloudflare.com/helping-apache-servers-stay-safe-from-zero-day-path-traversal-attacks/

On September 29, 2021, the Apache Security team was alerted to a path traversal vulnerability being actively exploited (zero-day) against Apache HTTP Server version 2.4.49. The vulnerability, in some instances, can allow an attacker to fully compromise the web server via remote code execution (RCE) or at the very least access sensitive files. CVE number 2021-41773 has been assigned to this issue. Both Linux and Windows based servers are vulnerable.

An initial patch was made available on October 4 with an update to 2.4.50. However, this was found to be insufficient, resulting in an additional patch that bumped the version number to 2.4.51 on October 7 (CVE-2021-42013).

Customers using Apache HTTP Server versions 2.4.49 and 2.4.50 should immediately update to version 2.4.51 to mitigate the vulnerability. Details on how to update can be found on the official Apache HTTP Server project site.

Any Cloudflare customer with the setting normalize URLs to origin turned on has always been protected against this vulnerability.

Additionally, customers who have access to the Cloudflare Web Application Firewall (WAF) receive additional protection by turning on the rule with the following IDs:

  • 1c3d3022129c48e9bb52e953fe8ceb2f (for our new WAF)
  • 100045A (for our legacy WAF)

The rule can also be identified by the following description:

Rule message: Anomaly:URL:Query String - Multiple Slashes, Relative Paths, CR, LF or NULL.

Given the nature of the vulnerability, attackers would normally try to access sensitive files (for example /etc/passwd), and as such, many other Cloudflare Managed Rule signatures are also effective at stopping exploit attempts depending on the file being accessed.

How the vulnerability works

The vulnerability leverages missing path normalization logic. If the Apache server is not configured with a require all denied directive for files outside the document root, attackers can craft special URLs to read any file on the file system accessible by the Apache process. Additionally, this flaw could also leak the source of interpreted files like CGI scripts and, in some cases, also allow the attacker to take over the web server by executing shell scripts.

For example, the following path:


would allow the attacker to climb the directory tree (../ indicates parent directory) outside of the web server document root and then subsequently access /etc/passwd.

Well-implemented path normalization logic would correctly collapse the path into the shorter $hostname/etc/passwd by normalizing all ../ character sequences, nullifying the attempt to climb up the directory tree.

Correct normalization is not easy as it also needs to take into consideration character encoding, such as percent encoded characters used in URLs. For example, the following path is equivalent to the first one provided:


as the characters %2e represent the percent encoded version of dot “.”. Not taking this properly into account was the cause of the vulnerability.
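
The interaction between percent-decoding and normalization can be sketched as below. This is not Apache's code or Cloudflare's URL normalization, just a minimal Python illustration using the standard library; the request paths and the `/cgi-bin` root are hypothetical examples chosen for this sketch.

```python
import posixpath
from urllib.parse import unquote


def normalize(path: str) -> str:
    """Percent-decode once, then collapse '.' and '..' segments."""
    decoded = unquote(path)
    # Anchor at a single leading slash so '..' can never climb above the root.
    return posixpath.normpath("/" + decoded.lstrip("/"))


def is_within(path: str, root: str = "/cgi-bin") -> bool:
    """True if the normalized path stays inside the allowed root."""
    normalized = normalize(path)
    return normalized == root or normalized.startswith(root + "/")


def naive_has_traversal(path: str) -> bool:
    """A flawed check that looks for '../' without decoding first."""
    return "../" in path


# Hypothetical request paths for illustration.
plain = "/cgi-bin/../../etc/passwd"
encoded = "/cgi-bin/.%2e/%2e%2e/etc/passwd"
```

Here `naive_has_traversal(encoded)` is false even though both paths normalize to the same location outside the root, which is why decoding has to happen before the check. The sketch decodes only once, so it does not address repeated encoding.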

The PoC for this vulnerability is straightforward and simply relies on attempting to access sensitive files on vulnerable Apache web servers.

Exploit Attempts

Cloudflare has seen a sharp increase in attempts to exploit and find vulnerable servers since October 5.

Most exploit attempts observed have been probing for static file paths — indicating heavy scanning activity before attackers (or researchers) may have attempted more sophisticated techniques that could lead to remote code execution. The most commonly attempted file paths are reported below:



Keeping web environments safe is not an easy task. Attackers will normally gain access and try to exploit vulnerabilities even before PoCs become widely available — we reported such a case not too long ago with Atlassian’s Confluence OGNL vulnerability.

It is vital to employ all security measures available. Cloudflare features such as our URL normalization and the WAF are easy to implement and can buy time to deploy any relevant patches offered by the affected software vendors.

May I ask who’s calling, please? A recent rise in VoIP DDoS attacks

Post Syndicated from Patrick R. Donahue original https://blog.cloudflare.com/attacks-on-voip-providers/

Over the past month, multiple Voice over Internet Protocol (VoIP) providers have been targeted by Distributed Denial of Service (DDoS) attacks from entities claiming to be REvil. The multi-vector attacks combined both L7 attacks targeting critical HTTP websites and API endpoints, as well as L3/4 attacks targeting VoIP server infrastructure. In some cases, these attacks resulted in significant impact to the targets’ VoIP services and website/API availability.

Cloudflare’s network is able to effectively protect and accelerate voice and video infrastructure because of our global reach, sophisticated traffic filtering suite, and unique perspective on attack patterns and threat intelligence.

If you or your organization have been targeted by DDoS attacks, ransom attacks and/or extortion attempts, seek immediate help to protect your Internet properties. We recommend not paying the ransom and reporting the incident to your local law enforcement agencies.

Voice (and video, emojis, conferences, cat memes and remote classrooms) over IP

Voice over IP (VoIP) is a term that’s used to describe a group of technologies that allow for communication of multimedia over the Internet. This technology enables your FaceTime call with your friends, your virtual classroom lessons over Zoom and even some “normal” calls you make from your cell phone.

The principles behind VoIP are similar to traditional digital calls over circuit-switched networks. The main difference is that the encoded media, e.g., voice or video, is partitioned into small units of bits that are transferred over the Internet as the payloads of IP packets according to specially defined media protocols.

This “packet switching” of voice data, as compared to traditional “circuit switching”, results in much more efficient use of network resources. As a result, calling over VoIP can be much more cost-effective than calls made over the POTS (“plain old telephone service”). Switching to VoIP can cut down telecom costs for businesses by more than 50%, so it’s no surprise that one in every three businesses has already adopted VoIP technologies. VoIP is flexible, scalable, and has been especially useful in bringing people together remotely during the pandemic.

A key protocol behind most VoIP calls is the heavily adopted Session Initiation Protocol (SIP). SIP was originally defined in RFC-2543 (1999) and designed to serve as a flexible and modular protocol for initiating calls (“sessions”), whether voice or video, or two-party or multiparty.

Speed is key for VoIP

Real-time communication between people needs to feel natural, immediate and responsive. Therefore, one of the most important features of a good VoIP service is speed. The user experiences this as natural sounding audio and high definition video, without lag or stutter. Users’ perceptions of call quality are typically closely measured and tracked using metrics like Perceptual Evaluation of Speech Quality and Mean Opinion Scores. While SIP and other VoIP protocols can be implemented using TCP or UDP as the underlying protocols, UDP is typically chosen because it’s faster for routers and servers to process.

UDP is a protocol that is unreliable, stateless and comes with no Quality of Service (QoS) guarantees. What this means is that the routers and servers typically use less memory and computational power to process UDP packets and therefore can process more packets per second. Processing packets faster results in quicker assembly of the packets’ payloads (the encoded media), and therefore a better call quality.

Under the guidelines of faster is better, VoIP servers will attempt to process the packets as fast as possible on a first-come-first-served basis. Because UDP is stateless, it doesn’t know which packets belong to existing calls and which attempt to initiate a new call. Those details are in the SIP headers in the form of requests and responses which are not processed until further up the network stack.

When the rate of packets per second increases beyond the router’s or server’s capacity, the faster is better guideline actually turns into a disadvantage. While a traditional circuit-switched system will refuse new connections when its capacity is reached and attempt to maintain the existing connections without impairment, a VoIP server, in its race to process as many packets as possible, will not be able to handle all packets or all calls when its capacity is exceeded. This results in latency and disruptions for ongoing calls, and failed attempts of making or receiving new calls.

Without proper protection in place, the race for a superb call experience comes at a security cost which attackers learned to take advantage of.

DDoSing VoIP servers

Attackers can take advantage of UDP and the SIP protocol to overwhelm unprotected VoIP servers with floods of specially-crafted UDP packets. One way attackers overwhelm VoIP servers is by pretending to initiate calls. Each time a malicious call initiation request is sent to the victim, their server uses computational power and memory to authenticate the request. If the attacker can generate enough call initiations, they can overwhelm the victim’s server and prevent it from processing legitimate calls. This is a classic DDoS technique applied to SIP.

A variation on this technique is a SIP reflection attack. As with the previous technique, malicious call initiation requests are used. However, in this variation, the attacker doesn’t send the malicious traffic to the victim directly. Instead, the attacker sends the requests to many thousands of random unwitting SIP servers all across the Internet, spoofing the source address of the malicious traffic to be that of the intended victim. That causes thousands of SIP servers to start sending unsolicited replies to the victim, who must then use computational resources to discern whether they are legitimate. This too can starve the victim server of resources needed to process legitimate calls, resulting in a widespread denial of service event for users. Without the proper protection in place, VoIP services can be extremely susceptible to DDoS attacks. Once again, this is a classic DDoS attack type being used against SIP.

The graph below shows a recent multi-vector UDP DDoS attack that targeted VoIP infrastructure protected by Cloudflare’s Magic Transit service. The attack peaked just above 70 Gbps and 16M packets per second. While it’s not the largest attack we’ve ever seen, attacks of this size can have large impact on unprotected infrastructure. This specific attack lasted a bit over 10 hours and was automatically detected and mitigated.

[Alt text: Graph of a 70 Gbps DDoS attack against a VoIP provider]
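
As a back-of-the-envelope check, the two peak figures above imply an average packet size. The calculation below assumes the bit-rate and packet-rate peaks occurred at the same moment, which the text does not state; the function name is ours.

```python
def average_packet_bytes(bits_per_second: float, packets_per_second: float) -> float:
    """Average packet size implied by a bit rate and a packet rate."""
    return bits_per_second / 8 / packets_per_second


# 70 Gbps and 16M packets per second, as quoted above.
avg_bytes = average_packet_bytes(70e9, 16e6)
```

Under that assumption the average works out to roughly 547 bytes per packet.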

Below are two additional graphs of similar attacks seen last week against SIP infrastructure. In the first chart we see multiple protocols being used to launch the attack, with the bulk of traffic coming from (spoofed) DNS reflection and other common amplification and reflection vectors. These attacks peaked at over 130 Gbps and 17.4M pps.

[Alt text: Graph of a 130 Gbps DDoS attack against a different VoIP provider]

Protecting VoIP services without sacrificing performance

One of the most important factors for delivering a quality VoIP service is speed. The lower the latency, the better. Cloudflare’s Magic Transit service can help protect critical VoIP infrastructure without impacting latency and call quality.

[alt text: Diagram of Cloudflare Magic Transit routing]

Cloudflare’s Anycast architecture, coupled with the size and scale of our network, minimizes and can even improve latency for traffic routed through Cloudflare versus the public Internet. Check out our recent post from Cloudflare’s Speed Week for more details on how this works, including test results demonstrating a performance improvement of 36% on average across the globe for a real customer network using Magic Transit.

[alt text: World map of Cloudflare locations]

Furthermore, every packet that is ingested in a Cloudflare data center is analyzed for DDoS attacks using multiple layers of out-of-path detection to avoid latency. Once an attack is detected, the edge generates a real-time fingerprint that matches the characteristics of the attack packets. The fingerprint is then matched in the Linux kernel eXpress Data Path (XDP) to quickly drop attack packets at wirespeed without inflicting collateral damage on legitimate packets. We have also recently deployed additional specific mitigation rules to inspect UDP traffic to determine whether it is valid SIP traffic.
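
To give a sense of what checking a UDP payload for valid SIP might involve, here is a minimal sketch that inspects only the first line of the payload. It is not Cloudflare's mitigation rule: the function name is hypothetical, only the six core methods from RFC 3261 are recognized, and a real check would go much further than the start line.

```python
import re

# Core request methods defined in RFC 3261; extension methods are ignored here.
SIP_METHODS = (b"INVITE", b"ACK", b"BYE", b"CANCEL", b"OPTIONS", b"REGISTER")

# Request line: METHOD SP Request-URI SP SIP/2.0
REQUEST_LINE = re.compile(rb"(?:" + b"|".join(SIP_METHODS) + rb") \S+ SIP/2\.0")
# Status line: SIP/2.0 SP 3-digit-code SP reason phrase
STATUS_LINE = re.compile(rb"SIP/2\.0 [1-6]\d{2} .*")


def looks_like_sip(payload: bytes) -> bool:
    """Cheap plausibility check on the start line of a UDP payload."""
    first_line, separator, _rest = payload.partition(b"\r\n")
    if not separator:
        return False  # SIP start lines are terminated by CRLF
    return bool(REQUEST_LINE.fullmatch(first_line) or STATUS_LINE.fullmatch(first_line))
```

A check like this only filters out traffic that is not SIP at all; a flood of well-formed call initiations would pass it and would need rate-based or stateful handling.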

The detection and mitigation is done autonomously within every single Cloudflare edge server — there is no “scrubbing center” with limited capacity and limited deployment scope in the equation. Additionally, threat intelligence is automatically shared across our network in real-time to ‘teach’ other edge servers about the attack.

Edge detections are also completely configurable. Cloudflare Magic Transit customers can use the L3/4 DDoS Managed Ruleset to tune and optimize their DDoS protection settings, and also craft custom packet-level (including deep packet inspection) firewall rules using the Magic Firewall to enforce a positive security model.

Bringing people together, remotely

Cloudflare’s mission is to help build a better Internet. A big part of that mission is making sure that people around the world can communicate with their friends, family and colleagues uninterrupted — especially during these times of COVID. Our network is uniquely positioned to help keep the world connected, whether that is by helping developers build real-time communications systems or by keeping VoIP providers online.

Our network’s speed and our always-on, autonomous DDoS protection technology help VoIP providers to continue serving their customers without sacrificing performance or having to give in to ransom DDoS extortionists.

Talk to a Cloudflare specialist to learn more.

Under attack? Contact our hotline to speak with someone immediately.

Data at Cloudflare just got a lot faster: Announcing Live-updating Analytics and Instant Logs

Post Syndicated from Jon Levine original https://blog.cloudflare.com/instant-logs/

Today, we’re excited to introduce Live-updating Analytics and Instant Logs. For Pro, Business, and Enterprise customers, our analytics dashboards now update live to show you data as it arrives. In addition to this, Enterprise customers can now view their HTTP request logs instantly in the Cloudflare dashboard.

Cloudflare’s data products are essential for our customers’ visibility into their network and applications. Having this data in real time makes it even more powerful — could you imagine trying to navigate using a GPS that showed your location a minute ago? That’s the power of real time data!

Real time data unlocks entirely new use cases for our customers. They can respond to threats and resolve errors as soon as possible, keeping their applications secure and minimising disruption to their end users.

Lightning fast, in-depth analytics

Cloudflare products generate petabytes of log data daily and are designed for scale. To make sense of all this data, we summarize it using analytics: the ability to see time series data, top Ns, and slices and dices of the data generated by Cloudflare products. This allows customers to identify trends and anomalies and drill deep into problems.

We take it a step further from just showing you high-level metrics. With Cloudflare Analytics you have the ability to quickly drill down into the most important data — narrow in on a specific time period and add a chain of filters to slice your data further and see all the reflecting analytics.

Video of Cloudflare analytics showing live updating and drill down capabilities

Let’s say you’re a developer who’s made some recent changes to your website: you’ve deleted some old content and created new web pages. You want to know as soon as possible if these changes have led to any broken links, so you can quickly identify them and make fixes. With live-updating analytics, you can monitor your traffic by status code. If you notice an uptick in 404 errors, add a filter to get details on all 404s and view the top referrers causing the errors. From there, take steps to resolve the problem, whether by creating a redirect page rule or fixing broken links on your own site.
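
The same drill-down can be pictured as a small aggregation over request events: count by status code, then filter to 404s and rank referrers. The sketch below uses made-up records with hypothetical field names (`status`, `referer`, `path`); it does not reflect the dashboard's implementation or Cloudflare's log schema.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Made-up request events for illustration only.
SAMPLE_EVENTS: List[Dict[str, object]] = [
    {"status": 200, "referer": "https://example.com/", "path": "/"},
    {"status": 404, "referer": "https://example.com/old-post", "path": "/removed"},
    {"status": 404, "referer": "https://example.com/old-post", "path": "/removed"},
    {"status": 404, "referer": "https://other.example/links", "path": "/gone"},
    {"status": 200, "referer": "https://example.com/", "path": "/new"},
]


def status_counts(events: List[Dict[str, object]]) -> Counter:
    """Traffic broken down by status code."""
    return Counter(event["status"] for event in events)


def top_referrers(events: List[Dict[str, object]], status: int = 404, n: int = 3) -> List[Tuple[object, int]]:
    """Filter to one status code, then rank the referrers that produced it."""
    filtered = (event["referer"] for event in events if event["status"] == status)
    return Counter(filtered).most_common(n)
```

With the sample data, the most frequent referrer for 404s is the old post, which points to the page whose links need fixing or redirecting.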

Instant Logs at your fingertips

While Analytics are a great way to see data at an aggregate level, sometimes you need event level information, too. Logs are powerful because they record every single event that flows through a network, so you can figure out what occurred on a granular level.

Our Logpush system is already able to get logs from our global edge network to a customer’s storage destination or analytics provider within seconds. However, setting this up has a lot of overhead and often customers incur long processing times at their destination. We wanted logs to be instant — instant to set up, deliver and take action on.

It’s that easy.

With Instant Logs, customers can actively monitor the traffic that’s flowing through their network and make key decisions that affect their applications now. Real time data unlocks totally new use cases:

  • For Security Engineers: Stop an attack as it’s developing. For example, apply a Firewall rule and see its impact within seconds. If it’s not what you were intending, try another rule and check again.
  • For Developers: Roll out a config change (to Cloudflare, or to your origin) and have peace of mind as you watch your error rates stay flat (we hope!).

(By the way, if you’re a fan of Workers and want to see real time Workers logging, check out the recently released dashboard for Workers logs.)

Logs at the speed of sight

“Real time” or “instant” can mean different things to different people in different contexts. At Cloudflare, we’re striving to make it as close to the speed of sight as possible. For us, this means we wanted the “glass-to-glass” time — from when you hit “enter” in your browser until when the logs appear — to be under one second.

How did we do?

Today, Cloudflare’s Instant Logs have an average delay of two seconds, and we’re continuing to make improvements to drive that down.

“Real-time” is a very fuzzy term. Looking at other services we see Akamai talking about real-time data as “within minutes” or “latency of 10 minutes”, Amazon talks about “near real-time” for CloudWatch, Google Cloud Logging provides log tailing with a configurable buffer “up to 60 seconds” to deal with potential out-of-order log delivery, and we benchmarked Fastly logs at 25 seconds.

Our goal is to drive down the delay as much as possible (within the laws of physics). We’re happy to have shipped Instant Logs that arrive in two seconds, but we’re not satisfied and will continue to bring that number down.

In time-sensitive scenarios such as an attack or an outage, a few minutes or even 30 seconds of delay can have a big impact on customers. At Cloudflare, our goal is to get our customers’ data into their hands as fast as possible, and we’re just getting started.

How to get access?

Live-updating Analytics is available now on all Pro, Business, and Enterprise plans. Select the “Last 30 minutes” view of your traffic in the Analytics tab to start monitoring your analytics live.

We’ll be starting our Beta for Instant Logs in a couple of weeks. Join the waitlist to get notified about when you can get access!

If you’re eager for details on the inner workings of Instant Logs, check out our blog post about how we built Instant Logs.

What’s next

We’re hard at work to make Instant Logs available for all Enterprise customers — stay tuned after joining our waitlist. We’re also planning to bring all of our datasets to Instant Logs, including Firewall Events. In addition, we’re working on the next set of features like the ability to download logs from your session and compute running aggregates from logs.

For a peek into what we have our sights on next, we know how important it is to perform analysis on not only up-to-date data, but also historical data. We want to give customers the ability to analyze logs, draw insights and perform forensics straight from the Cloudflare platform.

If this sounds cool, we’re hiring engineers for our data team in Lisbon, London and San Francisco. We would love to have you help us build the future of data at Cloudflare.

The Show Must Go On: Securing Netflix Studios At Scale

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/the-show-must-go-on-securing-netflix-studios-at-scale-19b801c86479

Written by Jose Fernandez, Arthur Gonigberg, Julia Knecht, and Patrick Thomas

Netflix Zuul Open Source Logo

In 2017, Netflix Studios was hitting an inflection point from a period of merely rapid growth to the sort of explosive growth that throws “how do we scale?” into every conversation. The vision was to create a “Studio in the Cloud”, with applications supporting every part of the business from pitch to play. The security team was working diligently to support this effort, faced with two apparently contradictory priorities:

  • 1) streamline any security processes so that we could get applications built and deployed to the public internet faster
  • 2) raise the overall security bar so that the accumulated risk of this giant and growing portfolio of newly internet-facing, high-sensitivity assets didn’t exceed its value

The journey to resolve that contradiction has been a collaboration that we’re proud of, and that we think exemplifies how Netflix approaches infrastructure product development and product security partnerships. You’ll hear from two teams here: first Application Security, and then Cloud Gateway.

Julia & Patrick (Netflix Application Security): In deciding how to address this, we focused on two observations. The first was that there were too many security things that each software team needed to think about — things like TLS certificates, authentication, security headers, request logging, rate limiting, among many others. There were security checklists for developers, but they were lengthy and mostly manual, neither of which contributed to the goal of accelerating development. Adding to the complexity, many of the checklist items themselves had a variety of different options to fulfill them (“new apps do this, but legacy apps do that”; “Java apps should use this approach, but Ruby apps should try one of these four things”… yes, there were flowcharts inside checklists. Ouch.). For development teams, just working through the flowcharts of requirements and options was a monumental task. Supporting developers through those checklists for edge cases, and then validating that each team’s choices resulted in an architecture with all the desired security properties, was similarly not scalable for our security engineers.

Our second observation centered on strong authentication as our highest-leverage control. Missing or incomplete authentication in an application was the most critical type of issue we regularly faced, while at the same time, an application that had a bulletproof authentication story was an application we considered to be lower risk. Concepts like Zero Trust, Beyond Corp, and Identity Aware Proxies all seemed to point the same way: there is powerful assurance in making 100% authentication a property of the architecture of the application rather than an implementation detail within an application.

With both of these observations in hand, we looked at the challenge through a lens that we have found incredibly valuable: how do we productize it? Netflix engineers talk a lot about the concept of a “Paved Road”. One especially attractive part of a Paved Road approach for security teams with a large portfolio is that it helps turn lots of questions into a boolean proposition: Instead of “Tell me how your app does this important security thing?”, it’s just “Are you using this paved road product that handles that?”. So, what would a product look like that could tackle most of the security checklist for a team, and that also could give us that architectural property of guaranteed authentication? With these lofty goals in mind, we turned to our central engineering teams to help get us there.

Partnering to Productize Security

Jose & Arthur (Netflix Cloud Gateway): The Cloud Gateway team develops and operates Netflix’s “Front Door”. Historically we have been responsible for connecting, routing, and steering internet traffic from Netflix subscribers to services in the cloud. Our gateways are powered by our flagship open-source technology Zuul. When Netflix Studios and our security partners approached us, the proposal was conceptually simple and a good fit for our modular, filter-based approach. To try it out, we deployed a custom Zuul build (which we named “API Wall” and eventually, more affectionately, “Wall-E”) with a new filter for Netflix’s Single-Sign-On provider, enabled it for all requests, and boom! — an application deployment strategy that guarantees authentication for services behind it.

Wall-E logical diagram showing a proxy with distinct filters
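
To make the filter idea concrete, here is a minimal sketch of a proxy that runs every request through an ordered chain of filters before it reaches the application. This is not Zuul's API: the `Request`, `Response`, `require_sso`, and `make_proxy` names are hypothetical, and the `X-Session-User` header is only a stand-in for a real Single-Sign-On check.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Request:
    path: str
    headers: dict = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str = ""


# A filter returns a Response to stop the request, or None to let it continue.
Filter = Callable[[Request], Optional[Response]]


def require_sso(request: Request) -> Optional[Response]:
    """Hypothetical authentication filter (stand-in for a real SSO check)."""
    if "X-Session-User" not in request.headers:
        return Response(401, "authentication required")
    return None


def add_request_log(log: List[str]) -> Filter:
    """Build a filter that records each requested path and never blocks."""
    def log_filter(request: Request) -> Optional[Response]:
        log.append(request.path)
        return None
    return log_filter


def make_proxy(filters: List[Filter], origin: Callable[[Request], Response]):
    """Return a handler that applies every filter, in order, before the origin."""
    def handle(request: Request) -> Response:
        for apply_filter in filters:
            early_response = apply_filter(request)
            if early_response is not None:
                return early_response
        return origin(request)
    return handle
```

Because the origin is only reachable through `handle`, any application placed behind the proxy inherits the authentication filter without implementing it. That is the architectural property the security team was after.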

Killing the Checklist

Once we worked together to integrate our SSO with Wall-E, we had established a pretty exciting pattern of adding security requirements as filters. We thought back to our checklist through the lens of: which of these things are consistent enough across applications to add as a required filter? Our web application firewall (WAF), DDoS prevention, security header validation, and durable logging all fit the bill. One by one, we saw our checklists’ requirements bite the dust, and shift from ‘individual app developer-owned’ to ‘Wall-E owned’ (and consistently implemented!).

By this point, it was clear that we had achieved the vision in the AppSec team’s original request. We eventually were able to add so much security leverage into Wall-E that the bulk of the “going internet-facing” checklist for Studio applications boiled down to one item: Will you use Wall-E?

A small section of our go-external security questionnaire and checklist for studio apps before Wall-E and after Wall-E.

The Early Adopter Challenge

Wall-E’s early adopters were handpicked and nudged along by the Application Security team. Back then, the Cloud Gateway team had to work closely with application developers to provide a seamless migration without disrupting users. These joint efforts took several weeks for both parties. During our initial consultations, it was clear that developers preferred prioritizing product work over security or infrastructure improvements. Our meetings usually ended like this: “Security suggested we talk to you, and we like the idea of improving our security posture, but we have product goals to meet. Let’s talk again next quarter”. These conversations surfaced a couple of problems we knew we had to overcome to address this early adopter challenge:

  1. Setting up Wall-E for an application took too much time and effort, and the hands-on approach would not scale.
  2. Security improvements alone were not enough to drive organic adoption in Netflix’s “context not control” culture.

We were under pressure to improve our adoption numbers and decided to focus first on the setup friction by improving the developer experience and automating the onboarding process.

Scaling With Developer Experience

Developers in the Netflix streaming world compose the customer-facing Netflix experience out of hundreds of microservices, reachable by complex routing rules. On the Netflix Studio side, in Content Engineering, each team develops distinct products with simpler routing needs. To support that very different model, we did another thing that seemed simple at the time but has had an outsized impact over the years: we asked app teams to integrate with us by creating a version-controlled YAML file. Originally this was intended as a simplified and developer-friendly way to help collect domain names and some routing rules into a versionable package, but we quickly realized we had stumbled into a powerful model: we were harvesting developer intent.

An interactive Wall-E configuration wizard, and a concise declarative format for an application’s routing, resource, and authentication decisions

This small change was a kind of magic, and completely flipped our relationship with development teams: since we had a concise, standardized definition of the app they intended to expose, we could proactively automate a lot of the setup. Specify a domain name? Wall-E can ensure that it automagically exists, with DNS and TLS configured correctly. Iterating on this experience eventually led to other intent-based streamlining, like asking about intended user populations and related applications (to select OAuth configs and claims). We could now tell developers that setting up Wall-E would only take a few minutes and that our tooling would automate everything.
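
As a rough illustration of "harvesting developer intent", the sketch below turns a declarative description of an app into a list of setup tasks that tooling could automate. The config keys (`domains`, `routes`, `audience`) and the task strings are invented for this example; the actual Wall-E file format is not shown in this post. A plain dictionary stands in for the parsed YAML file.

```python
from typing import List


def plan_setup(config: dict) -> List[str]:
    """Derive automatable setup tasks from a declarative app description."""
    tasks = []
    for domain in config.get("domains", []):
        # A declared domain implies both DNS and TLS work.
        tasks.append(f"ensure DNS record for {domain}")
        tasks.append(f"ensure TLS certificate for {domain}")
    for route in config.get("routes", []):
        tasks.append(f"route {route['path']} -> {route['target']}")
    audience = config.get("audience")
    if audience:
        # The intended user population drives the authentication setup.
        tasks.append(f"select OAuth config for {audience} users")
    return tasks


# Hypothetical app description, as it might look after parsing the YAML file.
example_config = {
    "domains": ["app.example.com"],
    "routes": [{"path": "/api", "target": "backend"}],
    "audience": "employees",
}
```

The point of the design is that developers state what they want exposed, and the platform works out the steps.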

Going Faster, Faster

As all of these pieces came together, app teams outside Studio took notice. For a typical paved road application with no unusual security complications, a team could go from “git init” to a production-ready, fully authenticated, internet accessible application in a little less than 10 minutes. The automation of the infrastructure setup, combined with reducing risk enough to streamline security review, saves developers days, if not weeks, on each application. Developers didn’t necessarily care that the original motivating factor was about security: what they saw in practice was that apps using Wall-E could get in front of users sooner, and iterate faster.

This created that virtuous cycle that core engineering product teams get incredibly excited about: more users make the amortized platform investment more valuable, but they also bring more ideas and clarity for feature ideas, which in turn attract more users. This set the tone for the next year of development, along two tracks: fixing adoption blockers, and turning more “developer intent” into product features to just handle things for them.

For adoption, both the security team and our team were asking the same question of developers: Is there anything that prevents you from using Wall-E? Each time we got an answer to that question, we tried to figure out how we could address it. Nearly all of the blockers related to systems in which (usually for historical reasons) some application team was solving both authentication and application routing in a custom way. Examples include legacy mTLS and various webhook schemes​. With Wall-E as a clear, durable, paved road choice, we finally had enough of a carrot to move these teams away from supporting unique, potentially risky features. The value proposition wasn’t just “let us help you migrate and you’ll only ever have to deal with incoming traffic that is already properly authenticated”, it was also “you can throw away the services and manual processes that handled your custom mechanisms and offload any responsibility for authentication, WAF integration and monitoring, and DDoS protection to the platform”. Overall, we cannot overstate the value of organizationally committing to a single paved road product to handle these kinds of concerns. It creates an amazing clarity and strategic pressure that helps align actual services that teams operate to the charters and expertise that define them. The difference between 2–4 “right-ish” ways and a single paved road one is powerful.

Also, with fewer exceptions and clearer criteria for apps that should adopt this paved road, our AppSec Engineering and User Focused Security Engineering (UFSE) teams could automate security guidance to give more appropriate automated nudges for adoption. Every leader’s security risk dashboard now includes a Wall-E adoption metric, and roughly ⅔ of recommended apps have chosen to adopt it. Wall-E now fronts over 350 applications, and is adding roughly 3 new production applications (mostly internet-facing) per week.

Automated guidance data, showing the percentage of applications recommended to use Wall-E which have taken it up. The jumpiness in the number of apps recommended for adoption is real: as adoption blockers were discovered then eventually solved, and as we standardized guidance across the company, our automated recommendations reflected these developments.

As adoption continued to increase, we looked at various signals of developer intent for good functionality to move from development-team-owned to platform-owned. One particularly pleasing example turned out to be UI hosting: it popped up over and over again as both an awkward exception to our “full authentication” goal, and also oftentimes the only thing that required Single Page App (SPA) UI teams to run actual cloud instances and have to be on-call for infrastructure. This eventually matured into an opinionated, declarative asset service that abstracts static file hosting for application teams: developers get fast static asset deployments, security gets strong guardrails around UI applications, and Netflix overall has fewer cloud instances to manage (and pay for!). Wall-E became a requirement for the best UI developer experience, and that drove even more adoption.

A productized approach also meant that we could efficiently enable lots of complex but “nice to have” features to enhance the developer experience, like Atlas metrics for free, and integration with our request tracing tool, Edgar.

From Product to Platform

You may have noticed a word sneak into the conversation up there… “platform”. Netflix has a Developer Productivity organization: teams dedicated to helping other developers be more effective. A big part of their work is this idea of harvesting developer intent and automating the necessary touchpoints across our systems. As these teams came to see Wall-E as the clear answer for many of their customers, they started integrating their tools to configure Wall-E from the even higher level developer intents they were harvesting. In effect, this moves authentication and traffic routing (and everything else that Wall-E handles) from being a specific product that developers need to think about and make a choice about, to just a fact that developers can trust and generally ignore. In 2019, essentially 100% of the Wall-E app configuration was done manually by developers. In 2021, that interaction has changed dramatically: now more than 50% of app configuration in Wall-E is done by automated tools (which are acting on higher-level abstractions on behalf of developers).

This scale and standardization again multiplies value: our internal risk quantification forecasts show compelling annualized savings in risk and incident response costs across the Wall-E portfolio. These applications have fewer, less severe, and less exploitable bugs compared to non-Wall-E apps, and we rarely need an urgent response from app owners (we call this not-getting-paged-at-midnight-as-a-service). Developer time saved on initial application setup and unneeded services additionally adds up on the order of team-months of productivity per year.

Looking back to the core need that started us down this road (“streamline any security processes […]” and “raise the overall security bar […]”), Wall-E’s evolution to being part of the platform cements and extends the initial success. Going forward, more and more apps and developers can benefit from these security assurances while needing to think less and less about them. It’s an outcome we’re quite proud of.

Let’s Do More Of That

To briefly recap, here are a few of the things that we take away from this journey:

  • If you can do one thing to manage a large product security portfolio, do bulletproof authentication; preferably as a property of the architecture
  • Security teams and central engineering teams can and should have a collaborative, mutually supportive partnership
  • “Productizing” a capability (e.g., clearly articulated; defined value proposition; branded; measured), even for internal tools, is useful to drive adoption and find further value
  • A specific product makes the “paved road” clearer; a boolean “uses/doesn’t use” is strongly preferable to various options with subtle caveats
  • Hitch the security wagon to developer productivity
  • Harvesting intent is powerful; it lets many teams add value

What’s Next

We see incredible power in this kind of security/infrastructure partnership work, and we’re excited to leverage these wins into our next goal: to truly become an infrastructure-as-service provider by building a full-fledged Gateway API, thereby handing off ownership of the developer experience to our partner teams in the Developer Productivity organization. This will allow us to focus on the challenges that will come on our way to the next milestone: 1000 applications behind Wall-E.

If this kind of thing is exciting to you, we are hiring for both of these teams: Senior Software Engineer and Engineering Manager on Application Networking; and Senior Security Partner and Appsec Senior Software Engineer.

With special thanks to Cloud Gateway and InfoSec team members past and present, especially Sunil Agrawal, Mikey Cohen, Will Rose, Dilip Kancharla, our partners on Studio & Developer Productivity, and the early Wall-E adopters that provided valuable feedback and ideas. And also to Queen for the song references we slipped in; tell us if you find ’em all.

The Show Must Go On: Securing Netflix Studios At Scale was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Cloudflare helped mitigate the Atlassian Confluence OGNL vulnerability before the PoC was released

Post Syndicated from Michael Tremante original https://blog.cloudflare.com/how-cloudflare-helped-mitigate-the-atlassian-confluence-ognl-vulnerability-before-the-poc-was-released/

On August 25, 2021, Atlassian released a security advisory for their Confluence Server and Data Center. The advisory highlighted an Object-Graph Navigation Language (OGNL) injection that would result in an unauthenticated attacker being able to execute arbitrary code.

A full proof of concept (PoC) of the attack was made available by a security researcher on August 31, 2021. Cloudflare immediately reviewed the PoC and prepared a mitigation rule via an emergency release. The rule, once tested, was deployed on September 1, 2021, at 15:32 UTC with a default action of BLOCK and the following IDs:

  • 100400 (for our legacy WAF)
  • e8c550810618437c953cf3a969e0b97a (for our new WAF)

All customers using the Cloudflare WAF to protect their self-hosted Confluence applications have automatically been protected since the new rule was deployed last week. Additionally, the Cloudflare WAF started blocking a high number of potentially malicious requests to Confluence applications even before the rule was deployed.

And customers who had deployed Cloudflare Access in front of their Confluence applications were already protected even before the emergency release. Access checks every request made to a protected hostname for a JSON Web Token (JWT) containing a user’s identity. Any unauthenticated users attempting this exploit would have been blocked before they could reach the Confluence server.

Customers must, however, immediately update their self-hosted Confluence installations to the versions listed in the advisory to ensure full protection.

This vulnerability was assigned the CVE number 2021-26084 and has a base score of 9.8 — critical. A detailed technical write-up of the vulnerability along with the PoC can be found on GitHub.

Timeline of Events

A timeline of events is provided below:

2021-08-25 at 17:00 UTC Atlassian security advisory released
2021-08-28 Cloudflare WAF starts seeing and blocking malicious traffic targeting vulnerable endpoints related to the security advisory
2021-08-31 at 22:22 UTC A PoC becomes widely available
2021-09-01 at 15:32 UTC Additional Cloudflare WAF rule to target CVE-2021-26084

How soon were attackers probing vulnerable endpoints?

High profile vulnerabilities tend to be exploited very quickly by attackers once a PoC or a patch becomes available. Cloudflare maintains aggregated and sampled data on WAF blocks[1] that can be used to explore how quickly vulnerable endpoints start receiving malicious traffic, highlighting the importance of deploying patches as quickly as possible.

Cloudflare data suggests that scanning for the vulnerability started up to three days before the first publicly available PoC was published, as high WAF activity was observed on vulnerable endpoints beginning August 28, 2021. This activity may indicate that attackers or researchers had successfully reverse engineered the patch within that time frame.

It also shows that, even without the specific WAF rule that we rolled out for this vulnerability, Cloudflare was blocking attempts to exploit it. Other WAF rules picked up the suspect behavior.

For this vulnerability, two endpoints are highlighted that can be used to explore attack traffic:

  • /pages/doenterpagevariables.action
  • /pages/createpage-entervariables.action

The following graph shows traffic matching Cloudflare’s WAF security feature from August 21 to September 5, 2021. Specifically:

  • In blue: HTTP requests blocked by Cloudflare’s WAF matching the two chosen paths.
  • In red: HTTP requests blocked by Cloudflare’s WAF matching the two paths and the specific rule deployed to cover this vulnerability.

By looking at the data, an increase in activity can be seen starting from August 28, 2021 — far beyond normal Internet background noise levels. Additionally, more than 64% of the traffic increase was detected and blocked by the Cloudflare WAF as malicious on the day the PoC was available.
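
For readers who want to run a similar check on their own logs, the sketch below counts blocked requests per day for the two endpoints listed above. The event field names (`timestamp`, `path`, `action`) are assumptions made for illustration and do not reflect Cloudflare's actual log schema.

```python
from collections import Counter

# The two endpoints highlighted for this vulnerability.
VULNERABLE_PATHS = {
    "/pages/doenterpagevariables.action",
    "/pages/createpage-entervariables.action",
}


def blocked_per_day(events):
    """Count blocked requests to the vulnerable paths, keyed by UTC date.

    Each event is assumed to be a dict with an ISO 8601 'timestamp',
    a request 'path', and an 'action' such as 'block' or 'allow'.
    """
    counts = Counter()
    for event in events:
        if event["action"] == "block" and event["path"].lower() in VULNERABLE_PATHS:
            day = event["timestamp"][:10]  # YYYY-MM-DD prefix of the timestamp
            counts[day] += 1
    return dict(counts)
```

A sudden rise in these daily counts relative to the preceding days is the kind of signal described above.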

What were attackers trying before a PoC was widely available?

Just before a PoC became widely available, an increasing number of requests were blocked by customer-configured IP-based rules, followed by our Managed Rulesets and rate limiting. Most custom WAF rules and IP-based rules are created by customers either in response to malicious activity in the WAF logs, or as a positive security model to lock down applications that should simply not have public access from the Internet.

We can zoom into the Managed WAF rule matches to explore what caused the WAF to trigger before the specific rule was deployed:

Command injection based attacks were the most common vector attempted before a PoC was made widely available, indicating again that some attackers may have been at least partially aware of the nature of the vulnerability. These attacks are aimed at executing remote code on the target application servers and are often platform specific. Other attack types observed, in order of frequency, were:

  • Request Port Anomalies: these are HTTP requests to uncommon ports that are normally not exposed for HTTP traffic.
  • Fake Bot Signatures: these requests matched many of our rules aimed at detecting user agents spoofing themselves as popular bots such as Google, Yandex, Bing and others.
  • OWASP Inbound Anomaly Score Exceeded: these are requests that were flagged by our implementation of the OWASP ModSecurity Core Ruleset. The OWASP ruleset is a score-based system that scans requests for patterns of characters that normally identify malicious requests.
  • HTTP Request Anomalies: these requests triggered many of our HTTP based validation checks including but not limited to RFC compliance checks.
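
To illustrate what "score-based" means in the OWASP item above, here is a toy anomaly scorer: each matching pattern adds its weight to a running score, and the request is blocked once the total reaches a threshold. The patterns, weights, and threshold are invented for this sketch and are not taken from the OWASP ruleset or from Cloudflare's implementation.

```python
import re

# (pattern, weight) pairs; illustrative only, not real ruleset entries.
RULES = [
    (re.compile(r"\$\{|%\{"), 5),                        # expression-language delimiters
    (re.compile(r"\b(?:wget|curl)\b", re.IGNORECASE), 5),  # common download commands
    (re.compile(r"\.\./"), 3),                            # path traversal sequence
]

BLOCK_THRESHOLD = 5  # assumed threshold for this example


def anomaly_score(request_text: str) -> int:
    """Sum the weights of every rule whose pattern appears in the request."""
    return sum(weight for pattern, weight in RULES if pattern.search(request_text))


def should_block(request_text: str) -> bool:
    """Block when the accumulated score reaches the threshold."""
    return anomaly_score(request_text) >= BLOCK_THRESHOLD
```

The useful property of this design is that no single pattern has to be decisive. Several weak signals in one request can add up to a block, which is how generic rules can catch attempts against a vulnerability that has no dedicated rule yet.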


Patching zero-day attacks as quickly as possible is vital for security. No single approach can be 100% successful at mitigating intrusion attempts. By observing patterns and triggers for this specific CVE, it is clear that a layered approach is most effective for protecting critical infrastructure. Cloudflare data also implies that, at least in part, some attackers or researchers were aware of the nature of the vulnerability at least since August 28, 2021, three days before a PoC was made widely available.

[1] The WAF block data consists of sampled matches of request fields including path, geography, rule ID, timestamp and other similar metadata.

Ransomware mitigation: Top 5 protections and recovery preparation actions

Post Syndicated from Brad Dispensa original https://aws.amazon.com/blogs/security/ransomware-mitigation-top-5-protections-and-recovery-preparation-actions/

In this post, I’ll cover the top five things that Amazon Web Services (AWS) customers can do to help protect and recover their resources from ransomware. This blog post focuses specifically on preemptive actions that you can take.

#1 – Set up the ability to recover your apps and data

In order for a traditional encrypt-in-place ransomware attempt to be successful, the actor responsible for the attempt must be able to prevent you from accessing your data, and then hold your data for ransom. The first thing that you should do to protect your account is to ensure that you have the ability to recover your data, regardless of how it was made inaccessible. Backup solutions protect and restore data, and disaster recovery (DR) solutions offer fast recovery of data and workloads.

AWS makes this process significantly easier for you with services like AWS Backup, or CloudEndure Disaster Recovery, which offer robust infrastructure DR. I’ll go over how you can use both of these services to help recover your data. When you choose a data backup solution, simply creating a snapshot of an Amazon Elastic Compute Cloud (Amazon EC2) instance isn’t enough. A powerful function of the AWS Backup service is that when you create a backup vault, you can use a different customer master key (CMK) in the AWS Key Management Service (AWS KMS). This is powerful because the CMK can have a key policy that allows AWS operators to use the key to encrypt the backup, but you can limit decryption to a completely different principal.

In Figure 1, I show an account that locally encrypted their EC2 Amazon Elastic Block Store (Amazon EBS) volume by using CMK A, but AWS Backup uses CMK B. If the user in account A with a decrypt grant on CMK A attempts to access the backup, even if the user is authorized by the AWS Identity and Access Management (IAM) principal access policy, the CMK policy won’t allow access to the encrypted data.

Figure 1: An account using AWS Backup that stores data in a separate account with different key material
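
As a rough sketch of the split between encrypting and decrypting, the following builds the two relevant statements of a key policy for the backup CMK. The role ARNs are hypothetical, and this is not a complete key policy: a real one also needs a key administration statement and whatever permissions AWS Backup itself requires, so treat the action lists as assumptions to verify against the AWS documentation.

```python
import json


def backup_key_policy(backup_role_arn: str, recovery_role_arn: str) -> dict:
    """Build key policy statements that separate encrypt and decrypt rights."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The principal that writes backups can encrypt but not decrypt.
                "Sid": "AllowBackupToEncrypt",
                "Effect": "Allow",
                "Principal": {"AWS": backup_role_arn},
                "Action": ["kms:Encrypt", "kms:GenerateDataKey*", "kms:DescribeKey"],
                "Resource": "*",
            },
            {
                # Only a separate recovery principal can decrypt the backups.
                "Sid": "AllowRecoveryToDecrypt",
                "Effect": "Allow",
                "Principal": {"AWS": recovery_role_arn},
                "Action": ["kms:Decrypt", "kms:DescribeKey"],
                "Resource": "*",
            },
        ],
    }


# Hypothetical role ARNs, for illustration only.
policy = backup_key_policy(
    "arn:aws:iam::111111111111:role/backup-writer",
    "arn:aws:iam::222222222222:role/backup-recovery",
)
policy_json = json.dumps(policy, indent=2)
```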

If you place the backup or replication into a separate account that is dedicated just for backup, this also helps to reduce the likelihood that a threat actor would be able to destroy or tamper with the backup. AWS Backup now natively supports this cross-account capability, which makes the backup process even easier. The AWS Backup Developer Guide provides instructions for using this functionality, as well as the policy that you will need to apply.

Make sure that you’re backing up your data in all supported services and that your backup schedule is based on your business recovery time objective (RTO) and recovery point objective (RPO).

Now, let’s take a look at how CloudEndure Disaster Recovery works.

Figure 2: An overview of how CloudEndure Disaster Recovery works

The high-level architecture diagram in Figure 2 illustrates how CloudEndure Disaster Recovery keeps your entire on-premises environment in sync with replicas in AWS and ready to fail over to AWS at any time, with aggressive recovery objectives and significantly reduced total cost of ownership (TCO). On the left is the source environment, which can be composed of different types of applications—in this case, I give Oracle databases and SQL Servers as examples. And although I’m highlighting DR from on-premises to AWS in this example, CloudEndure Disaster Recovery can provide the same functionality and improved recovery performance between AWS Regions for your workloads that are already in AWS.

The CloudEndure Agent is deployed on the source machines without requiring any kind of reboot and without impacting performance. That initiates nearly continuous replication of that data into AWS. CloudEndure Disaster Recovery also provisions a low-cost staging area that helps reduce the cost of cloud infrastructure during replication, and until that machine actually needs to be spun up during failover or disaster recovery tests.

When a customer experiences an outage, CloudEndure Disaster Recovery launches the machines in the appropriate AWS Region VPC and target subnets of your choice. The dormant lightweight state, called the Staging Area, is now launched into the actual servers that have been migrated from the source environment (the Oracle databases and SQL Servers, in this example). One of the features of CloudEndure Disaster Recovery is point-in-time recovery, which is important in the event of a ransomware event, because you can use this feature to recover your environment to a previous consistent point in time of your choosing. In other words, you can go back to the environment you had prior to the event.
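
The idea behind point-in-time recovery can be sketched in a few lines: given the timestamps of the available recovery points and the time the incident began, pick the most recent point that predates it. This is only an illustration of the selection logic, not how CloudEndure Disaster Recovery implements it, and the function name is hypothetical.

```python
from datetime import datetime
from typing import List, Optional


def latest_clean_point(
    recovery_points: List[datetime], incident_start: datetime
) -> Optional[datetime]:
    """Return the newest recovery point taken strictly before the incident."""
    candidates = [point for point in recovery_points if point < incident_start]
    return max(candidates) if candidates else None
```

In practice, the hard part is establishing when the incident actually started. The earlier the estimate, the more data is lost but the lower the chance of restoring an already compromised state.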

The machine conversion technology in CloudEndure Disaster Recovery means that those replicated machines can run natively within AWS, and the process typically takes just minutes for the machines to boot. You can also conduct frequent DR readiness tests without impacting replication or user activities.

Another service that’s useful for data protection is the AWS object storage service, Amazon Simple Storage Service (Amazon S3), where you can use features such as object versioning to help prevent objects from being overwritten with ransomware-encrypted files, or Object Lock, which provides a write once, read many (WORM) solution to help prevent objects from ever being modified or overwritten.
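
Here is a minimal sketch of turning on both features with the AWS SDK for Python. The function takes an S3 client (for example, one created with `boto3.client("s3")`) so that it can be exercised without AWS credentials. The `RecordingClient` class is a hypothetical stand-in that only records calls, and the 30-day compliance-mode retention is an assumed default, not a recommendation. Note that Object Lock must also be enabled on the bucket itself, so check the Amazon S3 documentation for the prerequisites.

```python
def protect_bucket(s3_client, bucket: str, retention_days: int = 30) -> None:
    """Enable versioning and a default Object Lock retention on a bucket."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    s3_client.put_object_lock_configuration(
        Bucket=bucket,
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {
                "DefaultRetention": {"Mode": "COMPLIANCE", "Days": retention_days}
            },
        },
    )


class RecordingClient:
    """Hypothetical stand-in for an S3 client that records calls for a dry run."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record
```

Running `protect_bucket(RecordingClient(), "my-bucket")` shows the requests that would be sent without touching any real bucket.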

For more information on developing a DR plan and a business continuity plan, see the following pages:

#2 – Encrypt your data

In addition to holding data for ransom, more recent ransomware events increasingly use double extortion schemes. A double extortion is when the actor not only encrypts the data, but exfiltrates the data and threatens to release the data if the ransom isn’t paid.

To help protect your data, you should always enable encryption of the data and segment your workflow so that authorized systems and users have limited access to use the key material to decrypt the data.

As an example, let’s say that you have a web application that uses an API to write data objects into an S3 bucket. Rather than allowing the application to have full read and write permissions, limit the application to just a single operation (for example, PutObject). Smaller, more reusable code is also easier to manage, so segmenting the workflow also helps developers to be able to work more quickly. An example of this type of workflow, in which separate CMK policies are used for read operations and write operations to limit access, is laid out in Figure 3.

Figure 3: A serverless workflow that uses separate CMK policies for read operations and write operations
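As a rough illustration of the write-only idea described above, the following sketch builds an identity-based IAM policy document for the writing component. It is not the key-policy setup shown in Figure 3. The bucket name and key ARN are hypothetical placeholders, and the sketch assumes the bucket uses SSE-KMS, in which case a writer needs kms:GenerateDataKey but not kms:Decrypt.

```python
import json


def write_only_policy(bucket_name: str, key_arn: str) -> dict:
    """Build an IAM policy document that allows writing objects and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The application may add objects, but not read, list, or delete them.
                "Sid": "WriteObjectsOnly",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # Writing with SSE-KMS needs a data key; kms:Decrypt is deliberately absent.
                "Sid": "EncryptOnly",
                "Effect": "Allow",
                "Action": ["kms:GenerateDataKey"],
                "Resource": key_arn,
            },
        ],
    }


# Hypothetical names, for illustration only.
policy = write_only_policy(
    "example-upload-bucket",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(json.dumps(policy, indent=2))
```

A separate reader role would get the mirror image of this policy (s3:GetObject and kms:Decrypt), so that no single component can both write and decrypt.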


It’s important to note that although AWS managed CMKs can help you to meet regulatory requirements for data at rest encryption, they don’t support customer key policies. Customers who want to control how their key material is used must use a customer managed CMK.

For data that is stored locally on Amazon EBS, remember that while the blocks are encrypted by using AWS KMS, after the server boots, your data is unencrypted locally at the operating system level. If you have sensitive data that is being stored as part of your application locally, consider using tooling like the AWS Encryption SDK or Encryption CLI to store that data in an encrypted format.

As Amazon Chief Technology Officer Werner Vogels says, encrypt everything!

Figure 4: Amazon Chief Technology Officer Werner Vogels wants customers to encrypt everything


#3 – Apply critical patches

For an actor to get access to a system, they must take advantage of a vulnerability or misconfiguration. Although many organizations patch their infrastructure, some only do so on a weekly or monthly basis, and that can be inadequate for patching critical systems that require 24/7 operation. Increasingly, threat actors have the ability to reverse engineer patches or Common Vulnerabilities and Exposures (CVE) announcements in hours. You should deploy security-related patches, especially those that are high severity, with the least amount of delay possible.

AWS Systems Manager can help you to automate this process in the cloud and on premises. With Systems Manager patch baselines, you can apply patches based on machine tags (for example, development versus production) but also based on patch type. For example, the predefined patch baseline AWS-AmazonLinuxDefaultPatchBaseline approves all operating system patches that are classified as “Security” and that have a severity level of “Critical” or “Important.” Patches are auto-approved seven days after release. The baseline also auto-approves all patches with a classification of “Bugfix” seven days after release.

If you want a more aggressive patching posture, you can instead create a custom baseline. For example, in Figure 5, I’ve created a baseline for all Windows versions with a critical severity.

Figure 5: An example of the creation of a custom patch baseline for Systems Manager
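For readers who prefer to script this step, here is a minimal sketch of what a comparable baseline definition might look like. The dictionary follows the parameter shape of the Systems Manager CreatePatchBaseline API as I understand it. The baseline name and the zero-day approval delay are assumptions, not values taken from Figure 5.

```python
def critical_windows_baseline(name: str, approve_after_days: int = 0) -> dict:
    """Build request parameters for a Windows baseline that approves critical patches."""
    return {
        "Name": name,
        "OperatingSystem": "WINDOWS",
        "Description": "Approve critical-severity patches with minimal delay",
        "ApprovalRules": {
            "PatchRules": [
                {
                    "PatchFilterGroup": {
                        "PatchFilters": [
                            # Match only patches rated Critical.
                            {"Key": "MSRC_SEVERITY", "Values": ["Critical"]},
                        ]
                    },
                    # Assumption: 0 means patches are approved as soon as they are released.
                    "ApproveAfterDays": approve_after_days,
                }
            ]
        },
    }


# Hypothetical baseline name, for illustration only.
request = critical_windows_baseline("example-critical-windows")
print(request["ApprovalRules"]["PatchRules"][0])
```

With the AWS SDK for Python installed and credentials configured, a dictionary like this could be passed as keyword arguments to the SSM client's create_patch_baseline call. Check the current API reference before relying on the exact field names.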


I can then set up an hourly scheduled event to scan all or part of my fleet and patch based on this baseline. In Figure 6, I show an example of this type of workflow taken from this AWS blog post, which gives an overview of the patch baseline process and covers how to use it in your cloud environment.

Figure 6: Example workflow showing how to scan, check, patch, and report by using Systems Manager


In addition, if you’re using AWS Organizations, this blog post will show you how you can apply this method organization-wide.

AWS offers many tools to make patching easier, and making sure that your servers are fully patched will greatly reduce your susceptibility to ransomware.

#4 – Follow a security standard

Don’t guess whether your environment is secure. Most commercial and public-sector customers are subject to some form of regulation or compliance standard. You should be measuring your security and risk posture against recognized standards in an ongoing practice. If you don’t have a framework that you need to follow, consider using the AWS Well-Architected Framework as your baseline.

With AWS Security Hub, you can view data from AWS security services and third-party tools in a single view and also benchmark your account against standards or frameworks like the CIS AWS Foundations Benchmark, the Payment Card Industry Data Security Standard (PCI DSS), and the AWS Foundational Security Best Practices. These are automated scans of your environment that can alert you when drifts in compliance occur. You can also choose to use AWS Config conformance packs to automate a subset of controls for NIST 800-53, Health Insurance Portability and Accountability Act (HIPAA), Korea – Information Security Management System (ISMS), as well as a growing list of over 60 conformance pack templates at the time of this publication.

Another important aspect of following best practices is to implement least privilege at all levels. In AWS, you can use IAM to write policies that enforce least privilege. These policies, when applied through roles, will limit the actor’s capability to advance in your environment. Access Analyzer is a new feature of IAM that allows you to more easily generate least privilege permissions, and it is covered in this blog post.

#5 – Make sure you’re monitoring and automating responses

Make sure you have robust monitoring and alerting in place. Each of the items I described earlier is a powerful tool to help you to protect against a ransomware event, but none will work unless you have strong monitoring in place to validate your assumptions.

Here, I want to provide some specific examples based on the examples earlier in this post.

If you’re backing up your data by using AWS Backup, as described in item #1 (Set up the ability to recover your apps and data), you should have Amazon CloudWatch set up to send alerts when a backup job fails. When an alert is triggered, you also need to act on it. If your response to an AWS alert email would be to re-run the job, you should automate that workflow by using AWS Lambda. If a subsequent failure occurs, open a ticket in your ticketing service automatically or page your operations team.
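The retry-then-escalate logic in the previous paragraph can be sketched as a small decision function that a Lambda handler might call. This is only an outline under stated assumptions: the job identifier, the attempt counter, and the two callbacks (rerun_job and open_ticket) are hypothetical stand-ins for whatever your backup and ticketing integrations provide.

```python
from typing import Callable

MAX_AUTOMATIC_RERUNS = 1  # assumption: re-run once, then hand over to a human


def handle_backup_failure(
    job_id: str,
    reruns_so_far: int,
    rerun_job: Callable[[str], None],
    open_ticket: Callable[[str], None],
) -> str:
    """Re-run a failed backup job, or escalate if it has already been re-run."""
    if reruns_so_far < MAX_AUTOMATIC_RERUNS:
        rerun_job(job_id)
        return "retried"
    # A subsequent failure: stop retrying and notify the operations team.
    open_ticket(f"Backup job {job_id} failed again after {reruns_so_far} automatic re-run(s)")
    return "escalated"


# Usage with in-memory stand-ins for the real integrations.
reruns, tickets = [], []
print(handle_backup_failure("job-123", 0, reruns.append, tickets.append))
print(handle_backup_failure("job-123", 1, reruns.append, tickets.append))
```

Passing the integrations in as callables keeps the decision logic easy to test without touching AWS.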

If you’re encrypting all of your data, as described in item #2 (Encrypt your data), are you watching AWS CloudTrail to see when AWS KMS denies permission to an operation?
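One way to watch for this is to filter CloudTrail records for AWS KMS calls that were denied. The sketch below works on already-parsed records and relies on the eventSource, eventName, and errorCode fields of the CloudTrail record format. The sample records are made up for illustration, and the prefix match on the error code is an assumption intended to cover both "AccessDenied" and "AccessDeniedException".

```python
from typing import Dict, List


def kms_denials(records: List[Dict]) -> List[Dict]:
    """Return the CloudTrail records in which AWS KMS refused an operation."""
    return [
        record
        for record in records
        if record.get("eventSource") == "kms.amazonaws.com"
        and str(record.get("errorCode", "")).startswith("AccessDenied")
    ]


# Made-up sample records, for illustration only.
sample = [
    {"eventSource": "kms.amazonaws.com", "eventName": "Decrypt", "errorCode": "AccessDenied"},
    {"eventSource": "kms.amazonaws.com", "eventName": "GenerateDataKey"},
    {"eventSource": "s3.amazonaws.com", "eventName": "GetObject", "errorCode": "AccessDenied"},
]
for denied in kms_denials(sample):
    print(denied["eventName"], denied["errorCode"])
```

In practice the same filter would more likely live in an event rule or a metric filter that raises an alarm, rather than in a script.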

Additionally, are you monitoring and acting on patch management baselines as described in item #3 (Apply critical patches) and responding when a patch isn’t able to successfully deploy?

Last, are you watching the compliance status of your Security Hub compliance reports and taking action on findings? You also need to monitor your environment for suspicious activity, investigate, and act quickly to mitigate risks. This is where Amazon GuardDuty, Security Hub, and Amazon Detective can be valuable.

AWS makes it easier to create automated responses to the alerts I mentioned earlier. The multi-account response solution in this blog post provides a good starting point that you can use to customize a response based on the needs of your workload.


In this blog post, I showed you the top five actions that you can take to protect and recover from a ransomware event.

In addition to the advice provided here, NIST has recently published guidance on the prevention of ransomware, which you can view in the NIST SP1800-25 publication.



Brad Dispensa

Brad is a principal security specialist solutions architect for Amazon Web Services in the worldwide public sector group.

Confidential computing: an AWS perspective

Post Syndicated from David Brown original https://aws.amazon.com/blogs/security/confidential-computing-an-aws-perspective/

Customers around the globe—from governments and highly regulated industries to small businesses and start-ups—trust Amazon Web Services (AWS) with their most sensitive data and applications. At AWS, keeping our customers’ workloads secure and confidential, while helping them meet their privacy and data sovereignty requirements, is our highest priority. Our investments in security technologies and rigorous operational practices meet and exceed even our most demanding customers’ confidential computing and data privacy standards. Over the years, we’ve made many long-term investments in purpose-built technologies and systems to keep raising the bar of security and confidentiality for our customers.

In the past year, there has been an increasing interest in the phrase confidential computing in the industry and in our customer conversations. We’ve observed that this phrase is being applied to various technologies that solve very different problems, leading to confusion about what it actually means. With the mission of innovating on behalf of our customers, we want to offer you our perspective on confidential computing.

At AWS, we define confidential computing as the use of specialized hardware and associated firmware to protect customer code and data during processing from outside access. Confidential computing has two distinct security and privacy dimensions. The most important dimension—the one we hear most often from customers as their key concern—is the protection of customer code and data from the operator of the underlying cloud infrastructure. The second dimension is the ability for customers to divide their own workloads into more-trusted and less-trusted components, or to design a system that allows parties that do not, or cannot, fully trust one another to build systems that work in close cooperation while maintaining confidentiality of each party’s code and data.

In this post, I explain how the AWS Nitro System intrinsically meets the requirements of the first dimension by providing those protections to customers who use Nitro-based Amazon Elastic Compute Cloud (Amazon EC2) instances, without requiring any code or workload changes from the customer side. I also explain how AWS Nitro Enclaves provides a way for customers to use familiar toolsets and programming models to meet the requirements of the second dimension. Before we get to the details, let’s take a closer look at the Nitro System.

What is the Nitro System?

The Nitro System, the underlying platform for all modern Amazon EC2 instances, is a great example of how we have invented and innovated on behalf of our customers to provide additional confidentiality and privacy for their applications. For ten years, we have been reinventing the EC2 virtualization stack by moving more and more virtualization functions to dedicated hardware and firmware, and the Nitro System is a result of this continuous and sustained innovation. The Nitro System is comprised of three main parts: the Nitro Cards, the Nitro Security Chip, and the Nitro Hypervisor. The Nitro Cards are dedicated hardware components with compute capabilities that perform I/O functions, such as the Nitro Card for Amazon Virtual Private Cloud (Amazon VPC), the Nitro Card for Amazon Elastic Block Store (Amazon EBS), and the Nitro Card for Amazon EC2 instance storage.

Nitro Cards, which are designed, built, and tested by Annapurna Labs, our in-house silicon development subsidiary, enable us to move key virtualization functionality off the EC2 servers (the underlying host infrastructure) that run EC2 instances. We engineered the Nitro System with a hardware-based root of trust using the Nitro Security Chip, allowing us to cryptographically measure and validate the system. This provides a significantly higher level of trust than can be achieved with traditional hardware or virtualization systems. The Nitro Hypervisor is a lightweight hypervisor that manages memory and CPU allocation, and delivers performance that is indistinguishable from bare metal (we recently compared it against our bare metal instances in the Bare metal performance with the AWS Nitro System post).

The Nitro approach to confidential computing

There are three main types of protection provided by the Nitro System. The first two protections underpin the key dimension of confidential computing—customer protection from the cloud operator and from cloud system software—and the third reinforces the second dimension—division of customer workloads into more-trusted and less-trusted elements.

  1. Protection from cloud operators: At AWS, we design our systems to ensure workload confidentiality between customers, and also between customers and AWS. We’ve designed the Nitro System to have no operator access. With the Nitro System, there’s no mechanism for any system or person to log in to EC2 servers (the underlying host infrastructure), read the memory of EC2 instances, or access any data stored on instance storage and encrypted EBS volumes. If any AWS operator, including those with the highest privileges, needs to do maintenance work on the EC2 server, they can do so only by using a strictly limited set of authenticated, authorized, and audited administrative APIs. None of these APIs have the ability to access customer data on the EC2 server. Because these technological restrictions are built into the Nitro System itself, no AWS operator can bypass these controls and protections. For additional defense-in-depth against physical attacks at the memory interface level, we offer memory encryption on various EC2 instances. Today, memory encryption is enabled by default on all Graviton2-based instances (T4g, M6g, C6g, C6gn, R6g, X2g), and Intel-based M6i instances, which have Total Memory Encryption (TME). Upcoming EC2 platforms based on the AMD Milan processor will feature Secure Memory Encryption (SME).
  2. Protection from AWS system software: The unique design of the Nitro System utilizes low-level, hardware-based memory isolation to eliminate direct access to customer memory, as well as to eliminate the need for a hypervisor on bare metal instances.

    • For virtualized EC2 instances (as shown in Figure 1), the Nitro Hypervisor coordinates with the underlying hardware-virtualization systems to create virtual machines that are isolated from each other as well as from the hypervisor itself. Network, storage, GPU, and accelerator access use SR-IOV, a technology that allows instances to interact directly with hardware devices using a pass-through connection securely created by the hypervisor. Other EC2 features such as instance snapshots and hibernation are all facilitated by dedicated agents that employ end-to-end memory encryption that is inaccessible to AWS operators.

      Figure 1: Virtualized EC2 instances


    • For bare metal EC2 instances (as shown in Figure 2), there’s no hypervisor running on the EC2 server, and customers get dedicated and exclusive access to all of the underlying main system board. Bare metal instances are designed for customers who want access to the physical resources for applications that take advantage of low-level hardware features—such as performance counters and Intel® VT—that aren’t always available or fully supported in virtualized environments, and also for applications intended to run directly on the hardware or licensed and supported for use in non-virtualized environments. Bare metal instances feature the same storage, networking, and other EC2 capabilities as virtualized instances because the Nitro System implements all of the system functions normally provided by the virtualization layer in an isolated and independent manner using dedicated hardware and purpose-built system firmware. We used the very same technology to create Amazon EC2 Mac instances. Because the Nitro System operates over an independent bus, we can attach Nitro cards directly to Apple’s Mac mini hardware without any other physical modifications.

      Figure 2: Bare metal EC2 instance


  3. Protection of sensitive computing and data elements from customers’ own operators and software: Nitro Enclaves provides the second dimension of confidential computing. Nitro Enclaves is a hardened and highly-isolated compute environment that’s launched from, and attached to, a customer’s EC2 instance. By default, there’s no ability for any user (even a root or admin user) or software running on the customer’s EC2 instance to have interactive access to the enclave. Nitro Enclaves has cryptographic attestation capabilities that allow customers to verify that all of the software deployed to their enclave has been validated and hasn’t been tampered with. A Nitro enclave has the same level of protection from the cloud operator as a normal Nitro-based EC2 instance, but adds the capability for customers to divide their own systems into components with different levels of trust. A Nitro enclave provides a means of protecting particularly sensitive elements of customer code and data not just from AWS operators but also from the customer’s own operators and other software. As the main goal of Nitro Enclaves is to protect against the customers’ own users and software on their EC2 instances, a Nitro enclave considers the EC2 instance to reside outside of its trust boundary. Therefore, a Nitro enclave shares no memory or CPU cores with the customer instance. To significantly reduce the attack surface area, a Nitro enclave also has no IP networking and offers no persistent storage. We designed Nitro Enclaves to be a platform that is highly accessible to all developers without the need to have advanced cryptography knowledge or CPU micro-architectural expertise, so that these developers can quickly and easily build applications to process sensitive data. At the same time, we focused on creating a familiar developer experience so that developing the trusted code that runs in a Nitro enclave is as easy as writing code for any Linux environment.


To summarize, the Nitro System’s unique approach to virtualization and isolation enables our customers to secure and isolate sensitive data processing from AWS operators and software at all times. It provides the most important dimension of confidential computing as an intrinsic, on-by-default, set of protections from the system software and cloud operators, and optionally via Nitro Enclaves even from customers’ own software and operators.

What’s next?

As mentioned earlier, the Nitro System represents our almost decade-long commitment to raising the bar for security and confidentiality for compute workloads in the cloud. It has allowed us to do more for our customers than is possible with off-the-shelf technology and hardware. But we’re not stopping here, and will continue to add more confidential computing capabilities in the coming months.



David Brown

David is the Vice President of Amazon EC2, a web service that provides secure, resizable compute capacity in the cloud. He joined AWS in 2007, as a software developer based in Cape Town, working on the early development of Amazon EC2. Over the last 12 years, he has had several roles within Amazon EC2, working on shaping the service into what it is today. Prior to joining Amazon, David worked as a software developer within a financial industry startup.

Making Magic Transit health checks faster and more responsive

Post Syndicated from Meyer Zinn original https://blog.cloudflare.com/making-magic-transit-health-checks-faster-and-more-responsive/


Magic Transit advertises our customers’ IP prefixes directly from our edge network, applying DDoS mitigation and firewall policies to all traffic destined for the customer’s network. After the traffic is scrubbed, we deliver clean traffic to the customer over GRE tunnels (over the public Internet or Cloudflare Network Interconnect). But sometimes, we experience inclement weather on the Internet: network paths between Cloudflare and the customer can become unreliable or go down. Customers often configure multiple tunnels through different network paths and rely on Cloudflare to pick the best tunnel to use if, for example, some router on the Internet is having a stormy day and starts dropping traffic.


Because we use Anycast GRE, every server across Cloudflare’s 200+ locations globally can send GRE traffic to customers. Every server needs to know the status of every tunnel, and every location has completely different network routes to customers. Where to start?

In this post, I’ll break down my work to improve the Magic Transit GRE tunnel health check system, creating a more stable experience for customers and dramatically reducing CPU and memory usage at Cloudflare’s edge.

Everybody has their own weather station

To decide where to send traffic, Cloudflare edge servers need to periodically send health checks to each customer tunnel endpoint.

When Magic Transit was first launched, every server sent a health check to every tunnel once per minute. This naive, “shared-nothing” approach was simple to implement and served customers well, but would occasionally deliver less than optimal health check behavior in two specific ways.

Way #1: Inconsistent weather reports

Sometimes a server just runs into bad luck, and a check randomly fails. From there, the server would mark the tunnel as degraded and immediately start shifting traffic towards a fallback tunnel. Imagine you and I were standing right next to each other under a clear sky, and I felt a single drop of water and declared, “It’s raining!” whereas you felt no raindrops and declared, “It’s all clear!”

With relatively little data per server, health determinations can be imprecise, and individual servers can overreact to individual failures. From a customer’s point of view, it looks as though Cloudflare detected a problem with the primary tunnel. In reality, the server just got a bad weather forecast and made a different judgement call.

Way #2: Slow to respond to storms

Even when tunnel states are consistent across servers, they can be slow to respond. In this case, if a server runs a health check which succeeds, but a second later the tunnel goes down, the next health check won’t happen for another 59 seconds. Until that next health check fails, the server has no idea anything is wrong, so it keeps sending traffic over unhealthy tunnels, leading to packet loss and latency for the customer.

Much like how a live, up-to-the-minute rain forecast helps you decide when to leave to avoid the rain, servers that send tunnel checks more frequently get a finer view of the Internet weather and can respond faster to localized storms. But if every server across Cloudflare’s edge sent health checks too frequently, we would very quickly start to overwhelm our customers’ networks.

All of the weather stations nearby start sharing observations

Clearly, we needed to hammer out some kinks. We wanted servers in the same location to come to the same conclusions about where to send traffic, and we wanted faster detection of issues without increasing the frequency of tunnel checks.

Health checks sent from servers in the same data center take the same route across the Internet. Why not share the results among them?

Instead of a single raindrop causing me to declare that it’s raining, I’d tell you about the raindrop I felt, and you’d tell me about the clear sky you’re looking at. Together, we come to the same conclusion: there isn’t enough rain to open an umbrella.

There is even a special networking protocol that allows us to easily share information between servers in the same private network. From the makers of Unicast and Anycast, now presenting: Multicast!

A single IP address does not necessarily represent a single machine in a network. The Internet Protocol specifies a way to send one message that gets delivered to a group of machines, like writing to an email list. Every machine has to opt into the group—we can’t just enroll people at random for our email list—but once a machine joins, it receives a copy of any message sent to the group’s address.


The servers in a Cloudflare edge data center are part of the same private network, so for “version 2” of our health check system, we had each server in a data center join a multicast group and share their health check results with one another. Each server still made an independent assessment for each tunnel, but that assessment was based on data collected by all servers in the same location.
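To make the opt-in step concrete, here is a minimal sketch of joining an IPv4 multicast group and sending a message to it with standard sockets. It is not Cloudflare's implementation: the group address, port, and function names are hypothetical, and the message format is left to the caller.

```python
import socket

GROUP = "239.1.2.3"  # hypothetical address in the administratively scoped range
PORT = 5007          # hypothetical port


def membership_request(group: str) -> bytes:
    """Pack the group address followed by 0.0.0.0 (let the OS pick the interface)."""
    return socket.inet_aton(group) + socket.inet_aton("0.0.0.0")


def open_receiver(group: str = GROUP, port: int = PORT) -> socket.socket:
    """Bind a UDP socket and opt in to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining is explicit: without this call the socket receives no group traffic.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
    return sock


def send_result(message: bytes, group: str = GROUP, port: int = PORT) -> None:
    """Send one datagram that every current group member receives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        # A TTL of 1 keeps the datagram on the local network segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(message, (group, port))
```

A server would call open_receiver once at startup and then call send_result after each health check. Every peer that has joined the group gets a copy without the sender needing to know who the peers are.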

This second version of tunnel health checks resulted in more consistent tunnel health determinations by servers in the same data center. It also resulted in faster response times, especially in large data centers where servers receive updates from their peers very rapidly.

However, we started seeing scaling problems. As we added more customers, we added more tunnel endpoints where we need to check the weather. In some of our larger data centers, each server was receiving close to half a billion messages per minute.

Imagine it’s not just you and me telling each other about the weather above us. You’re in a crowd of hundreds of people, and now everyone is shouting the weather updates for thousands of cities around the world!

One weather station to rule them all

As an engineering intern on the Magic Transit team, my project this summer has been developing a third approach. Rather than having every server infrequently check the weather for every tunnel and shout the observation to everyone else, now every server can frequently check the weather for a few tunnels. With this new approach, servers would then only tell the others about the overall weather report, not every individual measurement.

That scenario sounds more efficient, but we need to distribute the task of sending tunnel health checks across all the servers in a location so one server doesn’t get an overwhelming amount of work. So how can we assign tunnels to servers in a way that doesn’t require a centralized orchestrator or shared database? Enter consistent hashing, the single coolest distributed computing concept I got to apply this summer.

Every server sends a multicast “heartbeat” every few seconds. Then, by listening for multicast heartbeats, each server can construct a list of the IP addresses of peers known to be alive, including its own address, sorted by taking the hash of each address. Every server in a data center has the same list of peers in the same order.
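A peer list of this kind can be sketched as a table of last-seen timestamps. The class name, the ten-second timeout, and the choice of SHA-256 as the hash are assumptions made for illustration. The post does not say which hash or timeout the real system uses.

```python
import hashlib
import time
from typing import Dict, List, Optional

PEER_TIMEOUT_SECONDS = 10.0  # assumption: silence for this long means the peer is gone


def address_hash(address: str) -> int:
    """Map an address to an integer with a hash that is identical on every server."""
    return int.from_bytes(hashlib.sha256(address.encode()).digest()[:8], "big")


class PeerTable:
    """Track which peers have sent a heartbeat recently."""

    def __init__(self, own_address: str, timeout: float = PEER_TIMEOUT_SECONDS) -> None:
        self.own_address = own_address
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, address: str, now: Optional[float] = None) -> None:
        self.last_seen[address] = time.monotonic() if now is None else now

    def live_peers(self, now: Optional[float] = None) -> List[str]:
        """Return live peers, including this server, sorted by the hash of each address."""
        current = time.monotonic() if now is None else now
        alive = {a for a, seen in self.last_seen.items() if current - seen <= self.timeout}
        alive.add(self.own_address)
        return sorted(alive, key=address_hash)
```

Because the ordering depends only on the set of live addresses, two servers that have heard the same heartbeats end up with identical lists without exchanging the lists themselves.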

When a server needs to decide which tunnels it is responsible for sending health checks to, the server simply hashes each tunnel to an integer and searches through the list of peer addresses to find the peer with the smallest hash greater than the tunnel’s hash, wrapping around to the first peer if no peer is found. The server is responsible for sending health checks to the tunnel when the assigned peer’s address equals the server’s address.

If a server stops sending messages for a long enough period of time, the server gets removed from the known peers list. As a consequence, the next time another server tries to hash a tunnel the removed peer was previously assigned, the tunnel simply gets reassigned to the next peer in the list.

And like magic, we have devised a scheme to consistently assign tunnels to servers in a way that is resilient to server failures and does not require any extra coordination between servers beyond heartbeats. Now, the assigned server can send health checks way more frequently, compose more precise weather forecasts, and share those forecasts without being drowned out by the crowd.
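The assignment rule described above can be sketched in a few lines. This is an illustration rather than the production code: the function names, the use of SHA-256, and the string tunnel identifiers are assumptions.

```python
import bisect
import hashlib
from typing import List


def stable_hash(value: str) -> int:
    """Hash a string to an integer the same way on every server."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def assigned_peer(tunnel_id: str, peers: List[str]) -> str:
    """Pick the peer with the smallest hash greater than the tunnel's hash."""
    ring = sorted(peers, key=stable_hash)
    hashes = [stable_hash(peer) for peer in ring]
    index = bisect.bisect_right(hashes, stable_hash(tunnel_id))
    # No peer hash is larger than the tunnel hash: wrap around to the first peer.
    return ring[index % len(ring)]


def is_responsible(tunnel_id: str, own_address: str, peers: List[str]) -> bool:
    """A server checks a tunnel only when the tunnel is assigned to its own address."""
    return assigned_peer(tunnel_id, peers) == own_address


peers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
tunnels = [f"tunnel-{i}" for i in range(8)]
before = {t: assigned_peer(t, peers) for t in tunnels}
survivors = [p for p in peers if p != "10.0.0.3"]
after = {t: assigned_peer(t, survivors) for t in tunnels}
moved = [t for t in tunnels if before[t] != after[t]]
print(f"{len(moved)} of {len(tunnels)} tunnels moved after one peer was removed")
```

The property that matters is that removing a peer moves only the tunnels that peer owned. Every other tunnel stays where it was, which is why a server failure does not reshuffle the whole data center.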


Releasing the new health check system globally reduced Magic Transit’s CPU usage by over 70% and memory usage by nearly 85%.

[Chart: memory usage, measured in terabytes]

[Chart: CPU usage, measured in CPU-seconds per two-minute interval, averaged over two days]

Reducing the number of multicast messages means that servers can now keep up with the Internet weather, even in the largest data centers. We’re now poised for the next stage of Magic Transit’s growth, just in time for our two-year anniversary.

If you want to help build the future of networking, join our team.