connect() – why are you so slow?

Post Syndicated from Frederick Lawler http://blog.cloudflare.com/author/frederick/ original https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance


It is no secret that Cloudflare is encouraging companies to deprecate their use of IPv4 addresses and move to IPv6 addresses. We have published a couple of articles on the subject this year, and many more in our catalog.

To help with this, we spent time this last year investigating and implementing infrastructure to reduce our internal and egress use of IPv4 addresses. We prefer to re-allocate our addresses rather than purchase more, given rising costs. In this effort, we discovered that our cache service is one of our bigger consumers of IPv4 addresses. Before we remove IPv4 addresses from our cache services, we first need to understand how cache works at Cloudflare.

How does cache work at Cloudflare?

A full description of the architecture is beyond the scope of this article; however, we can provide a basic outline:

  1. Internet User makes a request to pull an asset
  2. Cloudflare infrastructure routes that request to a handler
  3. Handler machine returns the cached asset, or, on a cache miss:
  4. Handler machine reaches out to the origin server (owned by a customer) to pull the requested asset

The particularly interesting part is the cache miss case. When a very popular origin has an uncached asset that many Internet Users are trying to access at once, we may make upwards of 50k TCP unicast connections to a single destination.

That is a lot of connections! We have strategies in place to limit the impact of this or avoid this problem altogether. But in these rare cases when it occurs, we will then balance these connections over two source IPv4 addresses.

Our goal is to remove the load balancing and prefer one IPv4 address. To do that, we need to understand the performance impact of two IPv4 addresses vs one.

TCP connect() performance of two source IPv4 addresses vs one IPv4 address

We leveraged a tool called wrk, and modified it to distribute connections over multiple source IP addresses. Then we ran a workload of 70k connections over 48 threads for a period of time.
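
Our modified wrk is not public, but the essence of the change can be sketched in a few lines (a Python sketch; the source addresses are placeholders, and the real modification lives in wrk's C code):

import itertools
import socket

IP_BIND_ADDRESS_NO_PORT = 24  # not exposed by every Python version's socket module

# Alternate connections across two source addresses, as our patched wrk does.
sources = itertools.cycle(["192.0.2.1", "192.0.2.2"])

def connect(dest):
    sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Defer source port selection to connect() so the kernel's late-bind
    # port search (the subject of this post) is exercised.
    sk.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
    sk.bind((next(sources), 0))
    sk.connect(dest)
    return sk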

During the test we measured the function tcp_v4_connect() with the BPF BCC libbpf-tool funclatency tool to gather latency metrics as time progresses.

Note that throughout the rest of this article, all the numbers are specific to a single machine with no production traffic. We are making the assumption that if we can improve a worst-case scenario in an algorithm on a best-case machine, the results can be extrapolated to production. Lock contention was specifically taken out of the equation, but it will have production implications.

Two IPv4 addresses

The y-axis shows buckets of nanoseconds in powers of ten. The x-axis represents the number of connections made per bucket. Therefore, having more connections in the lower powers-of-ten buckets is better.

We can see that the majority of the connections occur in the fast case, with roughly ~20k in the slow case. We should expect this bimodal distribution to become more pronounced over time as wrk continuously closes and establishes connections.

Now let us look at the performance of one IPv4 address under the same conditions.

One IPv4 address

In this case, the bimodal distribution is even more pronounced: more than half of the total connections fall in the slow case! We may conclude that simply switching to one IPv4 address for cache egress is going to introduce significant latency on our connect() syscalls.

The next logical step is to figure out where this bottleneck is happening.

Port selection is not what you think it is

To investigate this, we first took a flame graph of a production machine:

Flame graphs depict the run-time function call stack of a system. The y-axis depicts call-stack depth, and the x-axis shows each function as a horizontal bar whose width represents the number of times the function was sampled. Check out this in-depth guide about flame graphs for more details.

Most of the samples are taken in the function __inet_hash_connect(). We can see that there are also many samples for __inet_check_established() with some lock contention sampled between. We have a better picture of a potential bottleneck, but we do not have a consistent test to compare against.

Wrk introduces a bit more variability than we would like to see. Still focusing on the function tcp_v4_connect(), we performed another synthetic test with a homegrown benchmark tool to test one IPv4 address. A tool such as stress-ng may also be used, but some modification is necessary to implement the socket option IP_LOCAL_PORT_RANGE. There is more about that socket option later.
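
Our benchmark tool is homegrown, but a toy version of the idea looks something like this (a sketch assuming a listener at dest; sockets are kept open so source ports stay occupied):

import socket
import time

def timed_connects(dest, count):
    results = []
    open_socks = []  # keep sockets open so their source ports stay in use
    for _ in range(count):
        sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        t0 = time.perf_counter_ns()
        sk.connect(dest)
        elapsed_ns = time.perf_counter_ns() - t0
        port = sk.getsockname()[1]  # which source port did the kernel pick?
        results.append((elapsed_ns, port, "even" if port % 2 == 0 else "odd"))
        open_socks.append(sk)
    return results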

We are now going to ensure a deterministic number of connections, and remove lock contention from the problem. The result is something like this:

On the y-axis we measured the latency between the start and end of a connect() syscall. The x-axis denotes when a connect() was called. Green dots are even numbered ports, and red dots are odd numbered ports. The orange line is a linear-regression on the data.

The disparity in average port-allocation time between even and odd ports provides us with a major clue: connections with odd ports are established significantly more slowly than those with even ports. Further, odd ports are not interleaved with earlier connections, which implies that we exhaust our even ports before attempting the odd ones. The chart also confirms our bimodal distribution.

__inet_hash_connect()

At this point we wanted to understand this split a bit better. We know from the flame graph that __inet_hash_connect() holds the algorithm for port selection. For context, this function is responsible for associating the socket with a source port in a late bind. If a port was previously provided with bind(), the algorithm just tests for a unique TCP 4-tuple (src ip, src port, dest ip, dest port) and skips port selection.

Before we dive in, there is a little bit of setup work that happens first. Linux first generates a time-based hash that is used as the basis for the starting port, then adds randomization, and then puts that information into an offset variable. This is always set to an even integer.

net/ipv4/inet_hashtables.c

    offset &= ~1U;    /* clear the low bit: the scan starts on even offsets */

other_parity_scan:
    port = low + offset;
    for (i = 0; i < remaining; i += 2, port += 2) {
        if (unlikely(port >= high))
            port -= remaining;    /* wrap around within the port range */

        /* is there already a bind bucket (i.e. a user) for this port? */
        inet_bind_bucket_for_each(tb, &head->chain) {
            if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
                /* port in use: usable only if the 4-tuple is unique */
                if (!check_established(death_row, sk, port, &tw))
                    goto ok;
                goto next_port;
            }
        }
    }

    offset++;    /* flip parity... */
    if ((offset & 1) && remaining > 1)
        goto other_parity_scan;    /* ...and scan the other half */

In a nutshell: for each connection, loop through one half of the ports in our range (all even or all odd) before looping through the other half (all odd or all even, respectively). Specifically, this is a variation of the Double-Hash Port Selection Algorithm. We will ignore the bind bucket functionality, since that is not our main concern.

Depending on your port range, you either start with an even port or an odd port. In our case, our low port, 9024, is even. Then the port is picked by adding the offset to the low port:

net/ipv4/inet_hashtables.c

port = low + offset;

If low were odd, we would have an odd starting port, because odd + even = odd.

There is a bit too much going on in the loop to explain in text, so here is an example instead:

This example is bound by 8 ports and 8 possible connections. All ports start unused. As a port is used up, the port is grayed out. Green boxes represent the next chosen port. All other colors represent open ports. Blue arrows are even port iterations of offset, and red are the odd port iterations of offset. Note that the offset is randomly picked, and once we cross over to the odd range, the offset is incremented by one.
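
To make that iteration order concrete, here is a simplified Python rendition of the scan (an illustrative sketch: hashing, bind buckets, and availability checks are omitted, and we assume the random offset is smaller than the number of ports in the range):

def scan_order(low, high, random_offset):
    """Yield candidate ports in the order __inet_hash_connect() tries them."""
    remaining = (high + 1 - low) & ~1  # the kernel rounds the span down to even
    offset = random_offset & ~1        # clear bit 0: start on the even-parity half
    while True:
        port = low + offset
        for _ in range(0, remaining, 2):
            if port > high:
                port -= remaining      # wrap around within the range
            yield port
            port += 2
        offset += 1                    # flip parity
        if offset & 1 == 0:            # both halves scanned: stop
            break

# With low=9024, high=9031 and a random offset of 4, the even ports
# 9028, 9030, 9024, 9026 are all tried before the odd ports 9029, 9031, 9025, 9027.
print(list(scan_order(9024, 9031, 4)))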

For each selected port, the algorithm then calls check_established(), which points to __inet_check_established(). This function loops over sockets to verify that the TCP 4-tuple is unique. The takeaway is that this socket list is usually short, but it grows as more unique TCP 4-tuples are introduced to the system, and longer socket lists may eventually slow down port selection. We have a blog post that dives into the socket list and port uniqueness criteria.
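
Conceptually, that uniqueness test is nothing more than the following sketch (the kernel walks a hash chain of sockets rather than a set, but the idea is the same):

# 4-tuples already in use, e.g. by established or TIME_WAIT sockets.
established = {("192.0.2.1", 9028, "203.0.113.7", 443)}

def four_tuple_unique(src_ip, src_port, dst_ip, dst_port):
    """Return True if the candidate 4-tuple is free to use."""
    return (src_ip, src_port, dst_ip, dst_port) not in established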

At this point, we can summarize that the odd/even port split is what is causing our performance bottleneck. During the investigation, it was not obvious to me (or maybe even to you) why the offset was initially calculated the way it was, or why the odd/even port split was introduced. After some git archaeology, the decisions became clearer.

Security considerations

Port selection has been used for device fingerprinting in the past. This led the authors to introduce more randomization into the initial port selection. Previously, ports were picked predictably, based solely on their initial hash and a salt value that does not change often. This helps explain the offset, but does not explain the split.

Why the even/odd split?

Prior to this patch and that patch, services could see conflicts between connect()-heavy and bind()-heavy workloads. To avoid those conflicts, the split was added: an even offset was chosen for the connect() workloads, and an odd offset for the bind() workloads. However, the split only works well for connect() workloads that do not exceed one half of the allotted port range.

Now we have an explanation for the flame graph and charts. So what can we do about this?

User space solution (kernel < 6.8)

We have a couple of strategies that would work best for us. Infrastructure or architectural strategies are not considered due to significant development effort. Instead, we prefer to tackle the problem where it occurs.

Select, test, repeat

For the “select, test, repeat” approach, you may have code that ends up looking like this:

import random

sys = get_ip_local_port_range()          # the system's ephemeral range: (lo, hi)
target = 70_000                          # connections we want to establish
estab = 0
attempts_left = 10 * (sys.hi - sys.lo)   # give up eventually as the range fills

while estab < target and attempts_left > 0:
    attempts_left -= 1
    random_port = random.randint(sys.lo, sys.hi)
    connection = attempt_connect(random_port)   # bind(random_port) + connect()
    if connection is None:
        continue   # port taken: pick another at random and retry

    estab += 1

The algorithm simply loops, randomly picking a port from the system range on each iteration and then testing whether the connect() worked. If not, rinse and repeat until range exhaustion.

This approach is good for up to roughly 70-80% port-range utilization, but may take eight to twelve attempts per connection as we approach exhaustion. The major downside is the extra syscall overhead on each conflict. To reduce this overhead, we can consider another approach that lets the kernel select the port for us.

Select port by random shifting range

This approach leverages the IP_LOCAL_PORT_RANGE socket option, and with it we were able to achieve performance like this:

That is much better! The chart also introduces black dots, which represent errored connections. They have a tendency to clump at the very end of our port range as we approach exhaustion, not unlike what we may see with “select, test, repeat”.

The way this solution works is something like:

from random import randint
from struct import pack
from socket import socket, AF_INET, SOCK_STREAM, IPPROTO_IP

IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51

sys = get_local_port_range()   # the system's ephemeral range: (lo, hi)

# Randomly shift a fixed-size window within the system range.
window_size = 1000
window_lo = randint(sys.lo, sys.hi - window_size)
window_hi = window_lo + window_size

sk = socket(AF_INET, SOCK_STREAM)
# Defer port selection from bind() to connect().
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
# The option packs the window into 32 bits: low 16 = lo, high 16 = hi.
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE,
              pack("@I", window_lo | (window_hi << 16)))
sk.bind((src_ip, 0))   # src_ip, dest_ip, dest_port defined elsewhere
sk.connect((dest_ip, dest_port))

We first fetch the system’s local port range, define a custom window size, and then randomly shift that window within the system range. Introducing this randomization lets the kernel start port selection at a random odd or even port, and reduces the loop’s search space down to the size of the custom window.

We tested a few different window sizes, and determined that a size of five hundred or one thousand works fairly well for our port range:

Window size | Errors | Total test time | Connections/second
------------|--------|-----------------|-------------------
500         | 868    | ~1.8 seconds    | ~30,139
1,000       | 1,129  | ~2 seconds      | ~27,260
5,000       | 4,037  | ~6.7 seconds    | ~8,405
10,000      | 6,695  | ~17.7 seconds   | ~3,183

As the window size increases, the error rate increases, because a larger window leaves less room for a random offset. A max window size of 56,512 is no different from the kernel's default behavior, so a smaller window works better. But you do not want it too small either: a window size of one is no different from “select, test, repeat”.

In kernels >= 6.8, we can do even better.

Kernel solution (kernel >= 6.8)

A new patch was introduced that eliminates the need for window shifting altogether. This solution is available in the 6.8 kernel.

Instead of picking a random window offset for setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, …) as in the previous solution, we just pass the full system port range to activate the new behavior. The code may look something like this:

from struct import pack
from socket import socket, AF_INET, SOCK_STREAM, IPPROTO_IP

IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51

sys = get_local_port_range()   # the system's ephemeral range: (lo, hi)

sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
# Pass the full system range to activate the new kernel behavior.
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE,
              pack("@I", sys.lo | (sys.hi << 16)))
sk.bind((src_ip, 0))   # src_ip, dest_ip, dest_port defined elsewhere
sk.connect((dest_ip, dest_port))

Setting the IP_LOCAL_PORT_RANGE option is what tells the kernel to use an approach similar to “select port by random shifting range”: the start offset is randomized to be even or odd, but the search then loops incrementally rather than skipping every other port. We end up with results like this:

The performance of this approach is quite comparable to our user space implementation, albeit a little faster, due in part to general improvements and to the fact that the algorithm can always find a port given the full search space of the range: no cycles are wasted on a potentially filled sub-range.

These results are great for TCP, but what about other protocols?

Other protocols & connect()

It is worth mentioning at this point that the port selection algorithms are mostly the same for IPv4 and IPv6. Typically, the key differences are how sockets are compared to determine uniqueness and where the port search happens. We did not compare performance for all protocols, but it is worth mentioning some similarities and differences between TCP and a couple of others.

DCCP

The DCCP protocol leverages the same port selection algorithm as TCP. Therefore, this protocol benefits from the recent kernel changes. It is also possible the protocol could benefit from our user space solution, but that is untested. We will let the reader exercise DCCP use-cases.

UDP & UDP-Lite

UDP leverages a different algorithm, found in the function udp_lib_get_port(). Similar to TCP, the algorithm loops over the whole port-range space incrementally, but only if the port is not already supplied in the bind() call. The key difference from TCP is that a random number is generated as a step variable. Once a first port is identified, the algorithm steps from that port by the random number, relying on uint16_t overflow to eventually loop back to the chosen port. If all ports are used, it increments the starting port by one and repeats. There is no port splitting between even and odd ports.
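
A toy rendition of that search order might look like this (a sketch of the behavior described above, not kernel-accurate; reuse checks and hash-slot details are omitted):

import random

def udp_scan_order(low, high):
    """Yield candidate ports roughly the way udp_lib_get_port() walks them."""
    first = random.randint(low, high)      # random starting port
    step = random.randint(0, 0xFFFF) | 1   # odd step, so the walk visits every value
    port = first
    while True:
        if low <= port <= high:            # only ports inside the range are candidates
            yield port
        port = (port + step) & 0xFFFF      # uint16_t overflow wraps the walk...
        if port == first:                  # ...until we are back where we started
            break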

The best comparison to the TCP measurements is a UDP setup similar to:

from socket import socket, AF_INET, SOCK_DGRAM

sk = socket(AF_INET, SOCK_DGRAM)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))

And the results should be unsurprising with one IPv4 source address:

UDP fundamentally behaves differently from TCP, and there is less work overall for port lookups. The outliers in the chart represent the worst-case scenario, when we hit a fairly bad random-number collision and need to loop over more of the ephemeral range to find a port.

UDP has another problem. Given the socket option SO_REUSEADDR, the port you get back may conflict with another UDP socket. This is in part due to the function udp_lib_lport_inuse() skipping the UDP 2-tuple (src ip, src port) check when that socket option is set. When this happens, a new socket may overwrite a previous one, so extra care is needed. We wrote more in depth about these cases in a previous blog post.
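
A quick sketch of the hazard, using the loopback address for demonstration: both bind() calls succeed, and the older socket may effectively be shadowed:

import socket

def make_udp(addr):
    sk = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sk.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sk.bind(addr)
    return sk

a = make_udp(("127.0.0.1", 50000))
b = make_udp(("127.0.0.1", 50000))  # no error: the 2-tuple check was skipped
# Which socket now receives datagrams sent to 127.0.0.1:50000 is an
# implementation detail; socket `a` may no longer see any traffic.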

In summary

Cloudflare can make a lot of unicast egress connections to origin servers with popular uncached assets. To avoid port-resource exhaustion, we balance the load over a couple of IPv4 source addresses during those peak times. We then asked: “what is the performance impact of one IPv4 source address for our connect()-heavy workloads?” Port selection is not only difficult to get right, but is also a performance bottleneck, as evidenced by measuring connect() latency with a flame graph and synthetic workloads. That led us to discover TCP’s quirky port selection process, which loops over half your ephemeral ports before the other half on each connect().

We then proposed three solutions to solve the problem outside of adding more IP addresses or other architectural changes: “select, test, repeat”, “select port by random shifting range”, and the IP_LOCAL_PORT_RANGE socket option solution in newer kernels. Finally, we closed out with honorable mentions of other protocols and their quirks.

Do not just take our numbers! Please explore and measure your own systems. With a better understanding of your workloads, you can make a good decision about which strategy works best for your needs. Even better if you come up with your own strategy!

Using DNS to estimate the worldwide state of IPv6 adoption

Post Syndicated from Carlos Rodrigues http://blog.cloudflare.com/author/carlos-rodrigues/ original https://blog.cloudflare.com/ipv6-from-dns-pov


In order for one device to talk to other devices on the Internet using the aptly named Internet Protocol (IP), it must first be assigned a unique numerical address. What this address looks like depends on the version of IP being used: IPv4 or IPv6.

IPv4 was first deployed in 1983. It’s the IP version that gave birth to the modern Internet and still remains dominant today. IPv6 can be traced back to as early as 1998, but only in the last decade did it start to gain significant traction — rising from less than 1% to somewhere between 30% and 40%, depending on who’s reporting, and on what and how they’re measuring.

With the growth in connected devices far exceeding the number of IPv4 addresses available, and with IPv4 costs rising, the much larger address space provided by IPv6 should have made it the dominant protocol by now. However, as we’ll see, this is not the case.

Cloudflare has been a strong advocate of IPv6 for many years and, through Cloudflare Radar, we’ve been closely following IPv6 adoption across the Internet. At three years old, Radar is still a relatively recent platform. To go further back in time, we can briefly turn to our friends at APNIC[1] — one of the five Regional Internet Registries (RIRs). Through their data, going back to 2012, we can see that IPv6 experienced a period of seemingly exponential growth until mid-2017, after which it entered a period of linear growth that’s still ongoing:

IPv6 adoption is slowed by the lack of compatibility between the two protocols — devices must be assigned both an IPv4 and an IPv6 address — along with the fact that virtually all devices on the Internet still support IPv4. Nevertheless, IPv6 is critical for the future of the Internet, and continued effort is required to increase its deployment.

Cloudflare Radar, like APNIC and most other sources today, publishes numbers that primarily reflect the extent to which Internet Service Providers (ISPs) have deployed IPv6: the client side. It’s a very important angle, and one that directly impacts end users, but there’s also the other end of the equation: the server side.

With this in mind, we invite you to follow us on a quick experiment where we aim for a glimpse of server side IPv6 adoption, and how often clients are actually (or likely) able to talk to servers over IPv6. We’ll rely on DNS for this exploration and, as they say, the results may surprise you.

IPv6 Adoption on the Client Side (from HTTP)

By the end of October 2023, from Cloudflare’s perspective, IPv6 adoption across the Internet was at roughly 36% of all traffic, with slight variations depending on the time of day and day of the week. Excluding bots pushes the estimate up to just over 46%, while excluding humans pushes it down to close to 24%. These numbers refer to the share of HTTP requests served over IPv6 across all IPv6-enabled content (the default setting).

For this exercise, what matters most is the number for both humans and bots. There are many reasons for the adoption gap between both kinds of traffic — from varying levels of IPv6 support in the plethora of client software used, to varying levels of IPv6 deployment inside the many networks where traffic comes from, to the varying size of such networks, etc. — but that’s a rabbit hole for another day. If you’re curious about the numbers for a particular country or network, you can find them on Cloudflare Radar and in our Year in Review report for 2023.

It Takes Two to Dance

You, the reader, might point out that measuring the client side of the client-server equation only tells half the story: for an IPv6-capable client to establish a connection with a server via IPv6, the server must also be IPv6-capable.

This raises two questions:

  1. What’s the extent of IPv6 adoption on the server side?
  2. How well does IPv6 adoption on the client side align with adoption on the server side?

There are several possible answers, depending on whether we’re talking about users, devices, bytes transferred, and so on. We’ll focus on connections (it will become clear why in a moment), and the combined question we’re asking is:

How often can an IPv6-capable client use IPv6 when connecting to servers on the Internet, under typical usage patterns?

Typical usage patterns include people going about their day, visiting some websites more often than others, and automated clients calling APIs. We’ll turn to DNS to get this perspective.

Enter DNS

Before a client can attempt to establish a connection with a server by name, using either the classic IPv4 protocol or the more modern IPv6, it must look up the server’s IP address in the phonebook of the Internet, the Domain Name System (DNS).

Looking up a hostname in DNS is a recursive process. To find the IP address of a server, the domain hierarchy (the dot-separated components of a server’s name) must be followed across several DNS authoritative servers until one of them returns the desired response[2]. Most clients, however, don’t do this directly and instead ask an intermediary server called a recursive resolver to do it for them. Cloudflare operates one such recursive resolver that anyone can use: 1.1.1.1.

As a simplified example, when a client asks 1.1.1.1 for the IP address where “www.example.com” lives, 1.1.1.1 will go out and ask the DNS root servers[3] about “.com”, then ask the .com DNS servers about “example.com”, and finally ask the example.com DNS servers about “www.example.com”; these last servers have direct knowledge of it and answer with an IP address. To make things faster for the next client asking a similar question, 1.1.1.1 caches (remembers for a while) both the final answer and the steps in between.

This means 1.1.1.1 is in a very good position to count how often clients try to look up IPv4 addresses (A-type queries) vs. how often they try to look up IPv6 addresses (AAAA-type queries), covering most of the observable Internet.

But how does a client know when to ask for a server’s IPv4 address or its IPv6 address?

The short answer is that clients with IPv6 available to them just ask for both — doing separate A and AAAA lookups for every server they wish to connect to. These IPv6-capable clients will prioritize connecting over IPv6 when they get a non-empty AAAA answer, whether or not they also get a non-empty A answer (which they almost always do, as we’ll see). The algorithm driving this preference for modernity is called Happy Eyeballs, if you’re interested in the details.
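
You can observe this from any dual-stack machine. In Python, for example, a single getaddrinfo() call triggers both the A and AAAA lookups behind the scenes, and typically lists IPv6 destinations first when the system considers them usable (a sketch; www.example.com stands in for any dual-stack hostname):

import socket

for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        "www.example.com", 443, type=socket.SOCK_STREAM):
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    print(label, sockaddr[0])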

We’re now ready to start looking at some actual data…

IPv6 Adoption on the Client Side (from DNS)

The first step is establishing a baseline by measuring client IPv6 deployment from 1.1.1.1’s perspective and comparing it with the numbers from HTTP requests that we started with.

It’s tempting to count how often clients connect to 1.1.1.1 using IPv6, but the results would be misleading for a couple of reasons, the strongest one hiding in plain sight: 1.1.1.1 is the most memorable address of the set of IPv4 and IPv6 addresses that clients can use to perform DNS lookups through the 1.1.1.1 service. Ideally, IPv6-capable clients using 1.1.1.1 as their recursive resolver should have all four of the following IP addresses configured, not just the first two:

  • 1.1.1.1 (IPv4)
  • 1.0.0.1 (IPv4)
  • 2606:4700:4700::1111 (IPv6)
  • 2606:4700:4700::1001 (IPv6)

But, when manual configuration is involved[4], humans find IPv6 addresses less memorable than IPv4 addresses and are less likely to configure them, viewing the IPv4 addresses as enough.

A related, but less obvious, confounding factor is that many IPv6-capable clients will still perform DNS lookups over IPv4 even when they have 1.1.1.1’s IPv6 addresses configured, as spreading lookups over the available addresses is a popular default option.

A more sensible approach to assessing IPv6 adoption from DNS client activity is calculating the percentage of AAAA-type queries over the total number of A-type queries, assuming IPv6-capable clients always perform both[5], as mentioned earlier.

Then, from 1.1.1.1’s perspective, IPv6 adoption on the client side is estimated at 30.5% by query volume. This is a bit under what we observed from HTTP traffic over the same time period (35.9%) but such a difference between two different perspectives is not unexpected.

A Note on TTLs

It’s not only recursive resolvers that cache DNS responses; most DNS clients have their own local caches as well. Your web browser, operating system, and even your home router keep answers around hoping to speed up subsequent queries.

How long each answer remains in cache depends on the time-to-live (TTL) field sent back with DNS records. If you’re familiar with DNS, you might be wondering if A and AAAA records have similar TTLs. If they don’t, we may be getting fewer queries for just one of these two types (because it gets cached for longer at the client level), biasing the resulting adoption figures.

The pie charts here break down the minimum TTLs sent back by 1.1.1.1 in response to A and AAAA queries[6]. There is some difference between the two types, but it is very small.

IPv6 Adoption on the Server Side

The following graph shows how often A and AAAA-type queries get non-empty responses, shedding light on server side IPv6 adoption and getting us closer to the answer we’re after:

IPv6 adoption by servers is estimated at 43.3% by query volume, noticeably higher than what was observed for clients.

How Often Both Sides Swipe Right

If 30.5% of the IP address lookups handled by 1.1.1.1 could make use of an IPv6 address to connect to their intended destinations, but only 43.3% of those lookups get a non-empty response, multiplying the two gives us a pretty good estimate of how often IPv6 connections are made between client and server: 30.5% × 43.3% ≈ 13.2% of the time.

The Potential Impact of Popular Domains

IPv6 server side adoption measured by query volume for the domains in Radar’s Top 100 list is 60.8%. If we exclude these domains from our overall calculations, the previous 13.2% figure drops to 8%. This is a significant difference, but not unexpected, as the Top 100 domains make up over 55% of all A and AAAA queries to 1.1.1.1.

If just a few more of these highly popular domains were to deploy IPv6 today, observed adoption would noticeably increase and, with it, the chance of IPv6-capable clients establishing connections using IPv6.

Closing Thoughts

Observing the extent of IPv6 adoption across the Internet can mean different things:

  • Counting users with IPv6-capable Internet access;
  • Counting IPv6-capable devices or software on those devices (clients and/or servers);
  • Calculating the amount of traffic flowing through IPv6 connections, measured in bytes;
  • Counting the fraction of connections (or individual requests) over IPv6.

In this exercise we chose to look at connections and requests. Keeping in mind that the underlying reality can only be truly understood by considering several different perspectives, we saw three different IPv6 adoption figures:

  • 35.9% (client side) when counting HTTP requests served from Cloudflare’s CDN;
  • 30.5% (client side) when counting A and AAAA-type DNS queries handled by 1.1.1.1;
  • 43.3% (server side) of positive responses to AAAA-type DNS queries, also from 1.1.1.1.

We combined the client side and server side figures from the DNS perspective to estimate how often connections to third-party servers are likely to be established over IPv6 rather than IPv4: just 13.2% of the time.

To improve on these numbers, ISPs, cloud and hosting providers, and corporations alike must increase the rate at which they make IPv6 available for devices in their networks. But large websites and content sources also have a critical role to play in enabling IPv6-capable clients to use IPv6 more often, as 39.2% of queries for domains in the Radar Top 100 (representing over half of all A and AAAA queries to 1.1.1.1) are still limited to IPv4-only responses.

On the road to full IPv6 adoption, the Internet isn’t quite there yet. But continued effort from all those involved can help it to continue to move forward, and perhaps even accelerate progress.

On the server side, Cloudflare has been helping with this worldwide effort for many years by providing free IPv6 support for all domains. On the client side, the 1.1.1.1 app automatically enables your device for IPv6 even if your ISP doesn’t provide any IPv6 support. And, if you happen to manage an IPv6-only network, 1.1.1.1’s DNS64 support also has you covered.

***
[1] Cloudflare’s public DNS resolver (1.1.1.1) is operated in partnership with APNIC. You can read more about it in the original announcement blog post and in 1.1.1.1’s privacy policy.
[2] There’s more information on how DNS works in the “What is DNS?” section of our website. If you’re a hands-on learner, we suggest you take a look at Julia Evans’ “mess with dns”.
[3] Any recursive resolver will already know the IP addresses of the 13 root servers beforehand. Cloudflare also participates at the topmost level of DNS by providing anycast service to the E and F-Root instances, which means 1.1.1.1 doesn’t need to go far for that first lookup step.
[4] When using the 1.1.1.1 app, all four IP addresses are configured automatically.
[5] For simplification, we assume the number of IPv6-only clients is still negligibly small. It’s a reasonable assumption in general, and other datasets available to us confirm it.
[6] 1.1.1.1, like other recursive resolvers, returns adjusted TTLs: the record’s original TTL minus the number of seconds since the record was last cached. Empty A and AAAA answers get cached for the amount of time defined in the domain’s Start of Authority (SOA) record, as specified by RFC 2308.

Amazon’s $2bn IPv4 tax — and how you can avoid paying it

Post Syndicated from Anie Jacob original http://blog.cloudflare.com/amazon-2bn-ipv4-tax-how-avoid-paying/

One of the wonderful things about the Internet is that, whether as a consumer or producer, the cost has continued to come down. Back in the day, it used to be that you needed a server room, a whole host of hardware, and an army of folks to help keep everything up and running. The cloud changed that, but even with that shift, services like SSL or unmetered DDoS protection were out of reach for many. We think that the march towards a more accessible Internet — both through ease of use, and reduced cost — is a wonderful thing, and we’re proud to have played a part in making it happen.

Every now and then, however, the march of progress gets interrupted.

On July 28, 2023, Amazon Web Services (AWS) announced that they would begin to charge “per IP per hour for all public IPv4 addresses, whether attached to a service or not”, starting February 1, 2024. At $0.005 per IP per hour, this adds at least $43 extra per year for every IPv4 address Amazon customers use ($0.005 × 8,760 hours ≈ $43.80); that may not sound like much, but we’ve seen back-of-the-napkin analysis suggesting it will amount to an approximately $2bn tax on the Internet.

In this blog, we’ll explain a little bit more about the technology involved but, most importantly, give you a step-by-step walkthrough of how Cloudflare can help you eliminate the need to pay Amazon for something they shouldn’t be charging you for in the first place. Better yet, if you’re a Pro or Business subscriber, we want to put $43 in your pocket instead of taking it out. Don’t give Amazon $43 for IPv4; let us give you $43 and throw in IPv4 as well.

How can Cloudflare help?

The only way to avoid Amazon’s IPv4 tax is to transition to IPv6 with AWS. But we recognize that not everyone is ready to make that shift: it can be an expensive and challenging process, and may present problems with hardware compatibility and network performance. We cover the finer details of these challenges below, so keep reading! Cloudflare can help ease this transition by handling the IPv6 communication with AWS for you. Not only that, you’ll get all the rest of the benefits of using Cloudflare and our global network — including all the performance and security that Cloudflare is known for — and a $43 credit for using us!

IPv6 services like these are something we’ve been offering at Cloudflare for years — in fact, this was first announced during Cloudflare's first birthday week in 2011! We’ve made this process simple to enable, so you can set it up as soon as today.

To set this feature up, you will need to both enable IPv6 Compatibility on Cloudflare and set up an IPv6 origin on AWS.

To configure this feature, simply follow these steps:

  1. Log in to your Cloudflare account.
  2. Select the appropriate domain.
  3. Click the Network app.
  4. Make sure IPv6 Compatibility is toggled on.

To get an IPv6 origin from Amazon, you will likely have to follow these steps:

  1. Associate an IPv6 CIDR block with your VPC and subnets
  2. Update your route tables
  3. Update your security group rules
  4. Change your instance type
  5. Assign IPv6 addresses to your instances
  6. (Optional) Configure IPv6 on your instances

(For more information about this migration, check out this link.)

Once you have your IPv6 origins, you’ll want to update your origins on Cloudflare to use the IPv6 addresses. In the simple example of a single origin at root, this is done by creating a proxied (orange-cloud) AAAA record in your Cloudflare DNS editor.
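
If you manage DNS records programmatically, the equivalent record can be created through Cloudflare's API. A sketch using Python's requests library (the zone ID, API token, record name, and origin IPv6 address are placeholders):

import requests

ZONE_ID = "your-zone-id"      # placeholder
API_TOKEN = "your-api-token"  # placeholder

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "type": "AAAA",
        "name": "example.com",     # the single origin at root, as in this example
        "content": "2001:db8::1",  # your AWS origin's IPv6 address (placeholder)
        "proxied": True,           # orange-cloud: traffic flows through Cloudflare
    },
)
resp.raise_for_status()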

If you are using Load Balancers, you will want to update the origin(s) there.

Once that’s done, you can remove the A/IPv4 record(s), and traffic will move over to the IPv6 address. While this process is easy now, we’re working on making the move to IPv6 on Cloudflare even easier.

Once you have these features configured and have had traffic running through Cloudflare to your origin for at least six months, you will be eligible to have a $43 credit deposited right into your Cloudflare account! You can use this credit for your Pro or Biz subscription, or even for Workers and R2 usage. See here for more information on how to opt in to this offer.

Through this feature, Cloudflare provides the flexibility to manage your IPv6 settings as your requirements dictate. By leveraging Cloudflare's robust IPv6 support, you can ensure seamless connectivity for your users while avoiding the additional costs associated with public IPv4 addresses.

What’s wrong with IPv4?

So if Cloudflare has this solution, why should you even move to IPv6? To explain this clearly, let's start with the problem with IPv4.

IP addresses are used to identify and reach resources on a network, which could be a private network, like your office's private network, or a complex public network like the Internet. An example of an IPv4 address would be 198.51.100.1 or 198.51.100.50. And there are approximately 4.3 billion unique IPv4 addresses like these for websites, servers, and other destinations on the Internet to use for routing.

4.3 billion IPv4 addresses may sound like a lot, but it’s not enough: IPv4 space has effectively run out. In September 2015, ARIN, one of the regional Internet registries that allocate IP addresses, announced that it had no available space; if you want to buy an IPv4 address, you have to go and talk to the private companies selling them. These companies charge a pretty penny: about $40 per IPv4 address today. Buying a grouping of IPv4 addresses, known as a prefix, of which the minimum required size is 256 IP addresses, costs about $10,000 (256 × $40).

IP addresses are necessary for having a domain or device on the Internet, but today IPv4 addresses are an increasingly complicated resource to acquire. To facilitate the growth of the Internet, more unique addresses needed to be made available without breaking the bank. That’s where IPv6 comes in.

IPv4 vs. IPv6

In 1995, the IETF (Internet Engineering Task Force) published the RFC for IPv6, which proposed to solve the problem of the limited IPv4 space. Instead of 32 bits of addressable space, IPv6 expanded to 128 bits. This means that instead of 4.3 billion addresses, there are approximately 340 undecillion (2^128 ≈ 3.4 × 10^38) IPv6 addresses available, far more than the number of grains of sand on Earth.

So if this problem is solved, why should you care? Because many networks on the Internet still prefer IPv4, and companies like AWS are starting to charge money for IPv4 usage.

Let's speak about AWS first: AWS today owns one of the largest chunks of the IPv4 space. During the period when IPv4 addresses could be purchased on the private market for a few dollars per address, AWS chose to use its large capital to its advantage and bought up a large amount of the space. Today AWS owns 1.7% of the IPv4 address space, which equates to ~100 million IPv4 addresses.

So you would think that moving to IPv6 is the right move; however, for the Internet community, it has proven to be quite a challenge.

When IPv6 was published in the 90s, very few networks had devices that supported it. Today, in 2023, that is no longer the case: the share of global networks supporting IPv6 has increased to 46 percent, so the hardware limitations around supporting it are decreasing. Additionally, anti-abuse and security tools initially had no idea how to deal with attacks or traffic that used the IPv6 address space, and this remains an issue for some of these tools. In 2014, we made it even easier for origin tools to convert by creating Pseudo IPv4 to help bridge the gap for those tools.

Despite all of this, many networks don’t have good support infrastructure for IPv6 networking, since most networks were built on IPv4. At Cloudflare, we have built our network to support both protocols, an approach known as “dual-stack”.

For a while, many networks also had markedly worse performance over IPv6 than over IPv4. This is no longer true: today we see only a slight degradation in IPv6 performance across the whole Internet compared to IPv4. The reasons for the remaining gap include legacy hardware, sub-optimal IPv6 connectivity outside our network, and the high cost of deploying IPv6. You can see in the chart below the additional latency of IPv6 traffic on Cloudflare’s network as compared to IPv4 traffic:

[Chart: additional latency of IPv6 traffic on Cloudflare’s network compared to IPv4 traffic]

There were many challenges to adopting IPv6, and for some, these issues with hardware compatibility and network performance are still worries. This is why continuing to use IPv4 can be useful while transitioning to IPv6, and it is what makes AWS’s decision to charge for IPv4 so impactful for many websites.

So, don’t pay the AWS tax

At the end of the day, the choice is clear: you can pay Amazon more to rent their IPs than it would cost to buy them, or you can move to Cloudflare and use our free service to help with the transition to IPv6 with little overhead.

Cloudflare’s handling of a bug in interpreting IPv4-mapped IPv6 addresses

Post Syndicated from Lucas Ferreira original https://blog.cloudflare.com/cloudflare-handling-bug-interpreting-ipv4-mapped-ipv6-addresses/

In November 2022, our bug bounty program received a critical and very interesting report. The report stated that certain types of DNS records could be used to bypass some of our network policies and connect to ports on the loopback address (e.g. 127.0.0.1) of our servers. This post will explain how we dealt with the report, how we fixed the bug, and the outcome of our internal investigation to see if the vulnerability had been previously exploited.

RFC 4291 defines ways to embed an IPv4 address into IPv6 addresses. One of the methods defined in the RFC is to use IPv4-mapped IPv6 addresses, which have the following format:

   |                80 bits               | 16 |      32 bits        |
   +--------------------------------------+--------------------------+
   |0000..............................0000|FFFF|    IPv4 address     |
   +--------------------------------------+----+---------------------+

In IPv6 notation, the corresponding mapping for 127.0.0.1 is ::ffff:127.0.0.1 (RFC 4038).

The researcher was able to use DNS entries based on mapped addresses to bypass some of our controls and access ports on the loopback address or non-routable IPs.

This vulnerability was reported on November 27 to our bug bounty program. Our Security Incident Response Team (SIRT) was contacted, and incident response activities began shortly after the report was filed. A hotpatch was deployed three hours later to prevent exploitation of the bug.

Date             | Time (UTC) | Activity
-----------------|------------|--------------------------------------------------
27 November 2022 | 20:42      | Initial report to Cloudflare’s bug bounty program
                 | 21:04      | SIRT on-call is paged
                 | 21:15      | SIRT manager on call starts working on the report
                 | 21:22      | Incident declared, team assembled, debugging starts
                 | 23:20      | A hotfix is ready and deployment starts
                 | 23:47      | Team confirms that the hotfix is deployed and working
                 | 23:58      | Team investigates whether other products are affected; Load Balancers and Spectrum are potential targets, and both are found to be unaffected
28 November 2022 | 21:14      | A permanent fix is ready
29 November 2022 | 21:34      | Permanent fix is merged

Blocking exploitation

Immediately after the vulnerability was reported to our Bug Bounty program, the team began working to understand the issue and find ways to quickly block potential exploitation. It was determined that the fastest way to prevent exploitation would be to block the creation of the DNS records required to execute the attack.

The team then began to implement a patch to prevent the creation of DNS records that include IPv6 addresses which map loopback or RFC 1918 (private) IPv4 addresses. The fix was fully deployed and confirmed three hours after the report was filed. We later realized that this change was insufficient, because records hosted on external DNS servers could also be used in this attack.

The exploit

The exploit provided consisted of two parts: a DNS entry and a Cloudflare Worker. The DNS entry was an AAAA record pointing to ::ffff:127.0.0.1:

exploit.example.com AAAA ::ffff:127.0.0.1

The Worker included the following code:

export default {
    async fetch(request) {
        // Read the target URL from the request body...
        const requestJson = await request.json()
        // ...and proxy the request to it from Cloudflare's own infrastructure.
        return fetch(requestJson.url, requestJson)
    }
}

The Worker was given a custom URL such as proxy.example.com.

With that setup, it was possible to make the Worker attempt connections to the loopback interface of the server where it was running. The call would look like this:

curl https://proxy.example.com/json -d '{"url":"http://exploit.example.com:80/url_path"}'

The attack could then be scripted to attempt to connect to multiple ports on the server.

It was also found that a similar setup could be used with other IPv4 addresses to attempt connections into internal services. In this case, the DNS entry would look like:

exploit.example.com AAAA ::ffff:10.0.0.1

This exploit would allow an attacker to connect to services running on the loopback interface of the server. If the attacker were able to bypass the security and authentication mechanisms of a service, the exploit could impact the confidentiality and integrity of data. For services running on other servers, the attacker could also use the Worker to attempt connections and map the services available over the network. As in most networks, Cloudflare’s network policies and ACLs must leave a few ports accessible; those ports would be reachable by an attacker using this exploit.

Investigation

We started an investigation to understand the root cause of the problem and created a proof-of-concept that allowed the team to debug the issue. At the same time, we started a parallel investigation to determine if the issue had been previously exploited.

It all happened when two bugs collided.

The first bug happened in our internal DNS system, which is responsible for mapping hostnames to the IP addresses of our customers’ origin servers (the DNS system). When the DNS system tried to answer a query for the DNS record from exploit.example.com, it serialized the IP as a string. The Golang net library used for DNS automatically converted the IP ::ffff:10.0.0.1 to the string “10.0.0.1”. However, the DNS system still treated it as an IPv6 address, so the query response {ipv6: “10.0.0.1”} was returned.
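
This dual nature of IPv4-mapped addresses is easy to reproduce in other languages too. As an illustration of the concept (not the code involved here), Python's ipaddress module parses the string as an IPv6 address while exposing the embedded IPv4 address:

import ipaddress

addr = ipaddress.ip_address("::ffff:10.0.0.1")
print(type(addr).__name__)  # IPv6Address: the value is an IPv6 address...
print(addr.ipv4_mapped)     # 10.0.0.1: ...that embeds an IPv4 address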

The second bug was in our internal HTTP system (the proxy), which is responsible for forwarding HTTP traffic to customers’ origin servers. The bug was in how the proxy validates this DNS response, {ipv6: “10.0.0.1”}. The proxy keeps two deny lists of IPs that are not allowed to be used, one for IPv4 and one for IPv6; these lists contain localhost IPs and private IPs. The bug was that the proxy compared the address 10.0.0.1 against the IPv6 deny list, because the address arrived in the “ipv6” section of the response. Naturally, the address didn’t match any entry in that deny list, so it was allowed to be used as an origin IP address.
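
Put together, the failure mode looked something like this simplified sketch (illustrative Python, not the actual proxy code):

import ipaddress

DENY_V4 = [ipaddress.ip_network("127.0.0.0/8"), ipaddress.ip_network("10.0.0.0/8")]
DENY_V6 = [ipaddress.ip_network("::1/128")]

def origin_allowed_buggy(answer):
    # answer = {"ipv6": "10.0.0.1"}, as produced by the first bug
    addr = ipaddress.ip_address(answer["ipv6"])  # actually parses as IPv4!
    deny = DENY_V6  # bug: list chosen by the JSON field name, not the address family
    return not any(addr in net for net in deny)  # an IPv4 address never matches IPv6 networks

def origin_allowed_fixed(answer):
    addr = ipaddress.ip_address(answer["ipv6"])
    deny = DENY_V4 if addr.version == 4 else DENY_V6  # fix: use the parsed family
    return not any(addr in net for net in deny)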

The second investigation team searched through the logs and found no evidence of previous exploitation of this vulnerability. The team also checked Cloudflare DNS for entries using IPv4-mapped IPv6 addresses and determined that all the existing entries had been used for testing purposes. As of now, there are no signs that this vulnerability could have been previously used against Cloudflare systems.

Remediating the vulnerability

To address this issue, we implemented a fix in the proxy service to validate the IP address against the deny list for the parsed address’s family, not for the IP family the DNS API response claimed it to be. We confirmed in both our test and production environments that the fix prevents the issue from happening again.

Beyond maintaining a bug bounty program, we regularly perform internal security reviews and hire third-party firms to audit the software we develop. But it is through our bug bounty program that we receive some of the most interesting and creative reports. Each report has helped us improve the security of our services. We invite anyone who finds a security issue in any of Cloudflare’s services to report it to us through HackerOne.