Tag Archives: Network

Cloudflare partners with Microsoft to protect joint customers with a Global Zero Trust Network

Post Syndicated from Abhi Das original https://blog.cloudflare.com/cloudflare-partners-with-microsoft-to-protect-joint-customers-with-global-zero-trust-network/

As a company, we are constantly asking ourselves what we can do to provide more value to our customers, including integrated solutions with our partners. Joint customers benefit from our integrations with Azure Active Directory in three ways:

First, centralized identity and access management via Azure Active Directory, which provides single sign-on, multi-factor authentication, and conditional access.

Second, policy-based access to specific applications using Cloudflare Access, a VPN replacement service.

Third, an additional layer of security for internal applications, which connect to Cloudflare's global network without being opened up to the whole Internet.

Let’s step back a bit.

Why Zero Trust?

Companies of all sizes are faced with an accelerating digital transformation of their IT stack and an increasingly distributed workforce, changing the definition of the security perimeter. We are moving away from the castle-and-moat model to a perimeter that spans the whole Internet, requiring security checks for every user accessing every resource. As a result, all companies, especially those whose use of Azure's broad cloud portfolio is increasing, are adopting Zero Trust architectures as an essential part of their cloud and SaaS journey.

Cloudflare Access provides secure access to Azure-hosted applications and on-premises applications. It also acts as an on-ramp to Azure and the rest of the Internet over the world's fastest network. Users connect from their devices or offices via Cloudflare's network in over 250 cities around the world. You can use Cloudflare Zero Trust on that global network to ensure that every request to every resource is evaluated for security, including user identity. We are excited to bring this secure global network on-ramp to Azure-hosted applications and on-premises applications.

Also, performance is one of our key advantages in addition to security. Cloudflare serves over 32 million HTTP requests per second, on average, for the millions of customers who secure their applications on our network. When those applications do not run on our network, we can rely on our own global private backbone and our connectivity with over 10,000 networks globally to connect the user.

We are excited to bring this global security and performance perimeter, as our Cloudflare Zero Trust product, to your Azure-hosted and on-premises applications.

Cloudflare Access: a modern Zero Trust approach

Cloudflare’s Zero Trust solution Cloudflare Access provides a modern approach to authentication for internally managed applications. When corporate applications on Azure or on-premise are protected with Cloudflare Access, they feel like SaaS applications, and employees can log in to them with a simple and consistent flow. Cloudflare Access acts as a unified reverse proxy to enforce access control by making sure every request is authenticated, authorized, and encrypted.

Identity: Cloudflare Access integrates out of the box with all the major identity providers, including Azure Active Directory, allowing use of the policies and users you already created to provide conditional access to your web applications. For example, you can use Cloudflare Access to ensure that only company employees and no contractors can get to your internal kanban board, or you can lock down the SAP finance application hosted on Azure or on-premise.

Devices: You can use TLS with Client Authentication and limit connections only to devices with a unique client certificate. Cloudflare will ensure the connecting device has a valid client certificate signed by the corporate CA, and authenticate user credentials to grant access to an internal application.

Additional security: Want to use Cloudflare Access in front of an internal application but don’t want to open up that application to the whole Internet? For additional security, you can combine Access with Cloudflare Tunnel. Cloudflare Tunnel will connect from your Azure environment directly to Cloudflare’s network, so there is no publicly accessible IP.

Secure both your legacy and Azure-hosted applications jointly with Azure AD and Cloudflare Access

For on-premises legacy applications, we are excited to announce that Cloudflare is an Azure Active Directory secure hybrid access partner. Azure AD secure hybrid access enables customers to centrally manage access to their legacy on-premises applications using SSO authentication, without incremental development. Starting today, joint customers can easily use the Cloudflare Access solution, with its built-in performance, as an additional layer of security in front of their legacy applications.

Traditionally, for on-premises applications, customers have had to change their existing code or add additional layers of code to integrate Azure AD or Cloudflare Access-like capabilities. With Azure Active Directory secure hybrid access, customers can integrate these capabilities seamlessly without significant code changes. Once integrated, customers can take advantage of the following Azure AD features and more:

  1. Multi-factor authentication (MFA)
  2. Single sign-on (SSO)
  3. Passwordless authentication
  4. Unified user access management
  5. Azure AD Conditional Access and device trust

Similarly, the Azure AD and Cloudflare Access combination can also be used to secure your Azure-hosted applications. Cloudflare Access enables a secure on-ramp to Azure-hosted or on-premises applications via the two integrations below:

1. Cloudflare Access integration with Azure AD:

Cloudflare Access is a Zero Trust Network Access (ZTNA) solution that allows you to configure precise access policies across your applications. You can integrate Microsoft Azure Active Directory with Cloudflare Zero Trust and build rules based on user identity, group membership, and Azure AD Conditional Access policies. Users authenticate with their Azure AD credentials and connect to Cloudflare Access. Additional policy controls include Device Posture, Network, Location, and more. Setup typically takes less than a few hours!

2. Cloudflare Tunnel integration with Azure:

Cloudflare Tunnel can expose applications running on the Microsoft Azure platform; see our guide to install and configure Cloudflare Tunnel. To simplify the process of connecting Azure applications to Cloudflare's network, a prebuilt Cloudflare Linux image is now available on the Azure Marketplace: deploy the prebuilt image to an Azure resource group to get started.

“The hybrid work environment has accelerated the cloud transition and increased the need for CIOs everywhere to provide secure and performant access to applications for their employees. This is especially true for self-hosted applications. Cloudflare’s global network security perimeter via Cloudflare Access provides this additional layer of security together with Azure Active Directory to enable employees to get work done from anywhere securely and performantly,” said David Gregory, Principal Program Manager, Microsoft Azure Active Directory.

Conclusion

Over the last ten years, Cloudflare has built one of the fastest, most reliable, and most secure networks in the world. You now have the ability to use that network as a global security and performance perimeter for your Azure-hosted applications via the integrations above, and it is easy.

What’s next?

In the coming months, we will be further strengthening our integrations with Microsoft Azure allowing customers to better implement our SASE perimeter.

If you’re using Cloudflare Zero Trust products today and are interested in using this integration with Azure, please visit our above developer documentation to learn about how you can enable it. If you want to learn more or have additional questions, please fill out the form or get in touch with your Cloudflare CSM or AE, and we’ll be happy to help you.

Interconnect Anywhere — Reach Cloudflare’s network from 1,600+ locations

Post Syndicated from Matt Lewis original https://blog.cloudflare.com/interconnect-anywhere/

Customers choose Cloudflare for our network performance, privacy, and security. Cloudflare Network Interconnect is the best on-ramp for our customers to utilize our diverse product suite. In the past, we’ve talked about Cloudflare’s physical footprint in more than 200 data centers, and how Cloudflare Network Interconnect enabled companies in those data centers to connect securely to Cloudflare’s network. Today, Cloudflare is excited to announce expanded partnerships that allow customers to connect to Cloudflare from their own Layer 2 service fabric. There are now over 1,600 locations where enterprise security and network professionals have the option to connect to Cloudflare securely and privately from their existing fabric.

Interconnect Anywhere is a journey

Since we launched Cloudflare Network Interconnect (CNI) in August 2020, we’ve been focused on extending the availability of Cloudflare’s network to as many places as possible. The initial launch opened up 150 physical locations alongside 25 global partner locations. During Security Week this year, we grew that availability by adding data center partners to our CNI Partner Program. Today, we are adding even more connectivity options by expanding Cloudflare availability to all of our partners’ locations, as well as welcoming CoreSite Open Cloud Exchange (OCX) and Infiny by Epsilon into our CNI Partner Program. This totals 1,638 locations where our customers can now connect securely to the Cloudflare network. As we continue to expand, customers are able to connect the fabric of their choice to Cloudflare from a growing list of data centers.

Fabric Partner                  Enabled Locations
PacketFabric                    180+
Megaport                        700+
Equinix Fabric                  46+
Console Connect                 440+
CoreSite Open Cloud Exchange    22+
Infiny by Epsilon               250+

“We are excited to expand our partnership with Cloudflare to ensure that our mutual customers can benefit from our carrier-class Software Defined Network (SDN) and Cloudflare’s network security in all Packetfabric locations. Now customers can easily connect from wherever they are located to access best of breed security services alongside Packetfabric’s Cloud Connectivity options.”
Alex Henthorn-Iwane, PacketFabric’s Chief Marketing Officer

“With the significant rise in DDoS attacks over the past year, it’s becoming ever more crucial for IT and Operations teams to prevent and mitigate network security threats. We’re thrilled to enable Cloudflare Interconnect everywhere on Megaport’s global Software Defined Network, which is available in over 700 enabled locations in 24 countries worldwide. Our partnership will give organizations the ability to reduce their exposure to network attacks, improve customer experiences, and simplify their connectivity across our private on-demand network in a matter of a few mouse clicks.”
Misha Cetrone, Megaport VP of Strategic Partnerships

“Expanding the connectivity options to Cloudflare ensures that customers can provision hybrid architectures faster and more easily, leveraging enterprise-class network services and automation on the Open Cloud Exchange. Simplifying the process of establishing modern networks helps achieve critical business objectives, including reducing total cost of ownership, and improving business agility as well as interoperability.”
Brian Eichman, CoreSite Senior Director of Product Development

“Partner accessibility is key in our cloud enablement and interconnection strategy. We are continuously evolving to offer our customers and partners simple and secure connectivity no matter where their network resides in the world. Infiny enables access to Cloudflare from across a global footprint while delivering high-quality cloud connectivity solutions at scale. Customers and partners gain an innovative Network-as-a-Service SDN platform that supports them with programmable and automated connectivity for their cloud and networking needs.”
Mark Daley, Epsilon Director of Digital Strategy

Uncompromising security and increased reliability from your choice of network fabric

Now, companies can connect to Cloudflare’s suite of network and security products without traversing shared public networks by taking advantage of software-defined networking providers. No matter where a customer is connected to one of our fabric partners, Cloudflare’s 200+ data centers ensure that world-class network security is close by and readily available via a secure, low latency, and cost-effective connection. An increased number of locations further allows customers to have multiple secure connections to the Cloudflare network, increasing redundancy and reliability. As we further expand our global network and increase the number of data centers where Cloudflare and our partners are connected, latency becomes shorter and customers will reap the benefits.

Let’s talk about how a customer can use Cloudflare Network Interconnect to improve their security posture through a fabric provider.

Plug and Play Fabric connectivity

Acme Corp is an example company that wants to deliver highly performant digital services to their customers and ensure employees can securely connect to their business apps from anywhere. They’ve purchased Magic Transit and Cloudflare Access and are evaluating Magic WAN to secure their network while getting the performance Cloudflare provides. They want to avoid potential network traffic congestion and latency delays, so they have designed a network architecture with their software-defined network fabric and Cloudflare using Cloudflare Network Interconnect.

With Cloudflare Network Interconnect, provisioning this connection is simple. Acme goes to their partner portal, and requests a virtual layer 2 connection to Cloudflare with the bandwidth of their choice. Cloudflare accepts the connection, provides the BGP session establishment information and organizes a turn up call if required. Easy!

Let’s talk about how Cloudflare and our partners have worked together to simplify the interconnectivity experience for the customer.

With Cloudflare Network Interconnect, availability only gets better

[Diagram: Connection established with Cloudflare Network Interconnect]

When a customer uses CNI to establish a connection between a fabric partner and Cloudflare, that connection runs over layer 2, configured via the partner user interface. Our new partnership model allows customers to connect privately and securely to Cloudflare’s network even when the customer is not located in the same data center as Cloudflare.

[Diagram: Shorter customer path with new partner locations]

The diagram above shows a shorter customer path thanks to an incremental partner-enabled location. Every time Cloudflare brings a new data center online and connects with a partner fabric, all customers in that region immediately benefit from the closer proximity and reduced latency.

Network fabrics in action

For those who want to self-serve, we’ve published documentation that details the steps for provisioning these connections with each partner.

As we expand our network, it’s critical we provide more ways to allow our customers to connect easily. We will continue to shorten the time it takes to set up a new interconnection, drive down costs, strengthen security and improve customer experience with all of our Network On-Ramp partners.

If you are using one of our software-defined networking partners and would like to connect to Cloudflare via their fabric, contact your fabric partner account team or reach out to us using the Cloudflare Network Interconnect page. If you are not using a fabric today, but would like to take advantage of software-defined networking to connect to Cloudflare, reach out to your account team.

Conntrack turns a blind eye to dropped SYNs

Post Syndicated from Jakub Sitnicki original https://blog.cloudflare.com/conntrack-turns-a-blind-eye-to-dropped-syns/

Intro

We have been working with conntrack, the connection tracking layer in the Linux kernel, for years. And yet, despite the collected know-how, questions about its inner workings occasionally come up. When they do, it is hard to resist the temptation to go digging for answers.

One such question popped up while writing the previous blog post on conntrack:

“Why are there no entries in the conntrack table for SYN packets dropped by the firewall?”

Ready for a deep dive into the network stack? Let’s find out.

[Image by chulmin park from Pixabay]

We already know from last time that conntrack is in charge of tracking incoming and outgoing network traffic. By running conntrack -L we can inspect existing network flows, or as conntrack calls them, connections.

So if we spin up a toy VM, connect to it over SSH, and inspect the contents of the conntrack table, we will see…

$ vagrant init fedora/33-cloud-base
$ vagrant up
…
$ vagrant ssh
Last login: Sun Jan 31 15:08:02 2021 from 192.168.122.1
[vagrant@ct-vm ~]$ sudo conntrack -L
conntrack v1.4.5 (conntrack-tools): 0 flow entries have been shown.

… nothing!

Even though the conntrack kernel module is loaded:

[vagrant@ct-vm ~]$ lsmod | grep '^nf_conntrack\b'
nf_conntrack          163840  1 nf_conntrack_netlink

Hold on a minute. Why is the SSH connection to the VM not listed in conntrack entries? SSH is working. With each keystroke we are sending packets to the VM. But conntrack doesn’t register it.

Isn’t conntrack an integral part of the network stack that sees every packet passing through it? 🤔

[Diagram based on an image by Jan Engelhardt, CC BY-SA 3.0]

Clearly everything we learned about conntrack last time is not the whole story.

Calling into conntrack

Our little experiment with SSH’ing into a VM begs the question — how does conntrack actually get notified about network packets passing through the stack?

We can walk the receive path step by step and we won’t find any direct calls into the conntrack code in either the IPv4 or IPv6 stack. Conntrack does not interface with the network stack directly.

Instead, it relies on the Netfilter framework, and its set of hooks baked into the stack:

int ip_rcv(struct sk_buff *skb, struct net_device *dev, …)
{
    …
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
               net, NULL, skb, dev, NULL,
               ip_rcv_finish);
}

Netfilter users, like conntrack, can register callbacks with it. Netfilter will then run all registered callbacks when its hook processes a network packet.
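
Netfilter's registration API is worth a quick illustration. Below is a minimal sketch of how an out-of-tree module might hang its own callback off the same PRE_ROUTING hook that conntrack uses; this is not conntrack's actual registration code, the names my_hook_fn and my_ops are made up, and the exact API details vary a bit between kernel versions:

#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <net/net_namespace.h>

/* Called by Netfilter for every packet that reaches the hook. */
static unsigned int my_hook_fn(void *priv, struct sk_buff *skb,
                               const struct nf_hook_state *state)
{
    /* Inspect (or mangle) skb here. */
    return NF_ACCEPT; /* let the packet continue through the stack */
}

static const struct nf_hook_ops my_ops = {
    .hook     = my_hook_fn,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,   /* the hook conntrack also uses */
    .priority = NF_IP_PRI_CONNTRACK,   /* -200, conntrack's slot */
};

static int __init my_hook_init(void)
{
    /* Attach the callback in the initial network namespace. */
    return nf_register_net_hook(&init_net, &my_ops);
}

static void __exit my_hook_exit(void)
{
    nf_unregister_net_hook(&init_net, &my_ops);
}

module_init(my_hook_init);
module_exit(my_hook_exit);
MODULE_LICENSE("GPL");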

For the INET family, that is IPv4 and IPv6, there are five Netfilter hooks [1] to choose from:

[Diagram of the five Netfilter hooks, based on "Nftables – Packet flow and Netfilter hooks in detail", thermalcircle.de, CC BY-SA 4.0]

Which ones does conntrack use? We will get to that in a moment.

First, let’s focus on the trigger. What makes conntrack register its callbacks with Netfilter?

The SSH connection doesn’t show up in the conntrack table just because the module is loaded. We already saw that. This means that conntrack doesn’t register its callbacks with Netfilter at module load time.

Or at least, it doesn’t do it by default. Since Linux v5.1 (May 2019) the conntrack module has the enable_hooks parameter, which causes conntrack to register its callbacks on load:

[vagrant@ct-vm ~]$ modinfo nf_conntrack
…
parm:           enable_hooks:Always enable conntrack hooks (bool)

Going back to our toy VM, let’s try to reload the conntrack module with enable_hooks set:

[vagrant@ct-vm ~]$ sudo rmmod nf_conntrack_netlink nf_conntrack
[vagrant@ct-vm ~]$ sudo modprobe nf_conntrack enable_hooks=1
[vagrant@ct-vm ~]$ sudo conntrack -L
tcp      6 431999 ESTABLISHED src=192.168.122.204 dst=192.168.122.1 sport=22 dport=34858 src=192.168.122.1 dst=192.168.122.204 sport=34858 dport=22 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.5 (conntrack-tools): 1 flow entries have been shown.
[vagrant@ct-vm ~]$

Nice! The conntrack table now contains an entry for our SSH session.

The Netfilter hook notified conntrack about SSH session packets passing through the stack.

Now that we know how conntrack gets called, we can go back to our question — can we observe a TCP SYN packet dropped by the firewall with conntrack?

Listing Netfilter hooks

That is easy to check:

  1. Add a rule to drop anything coming to port tcp/2570 [2]

[vagrant@ct-vm ~]$ sudo iptables -t filter -A INPUT -p tcp --dport 2570 -j DROP

  2. Connect to the VM on port tcp/2570 from the outside

host $ nc -w 1 -z 192.168.122.204 2570

  3. List conntrack table entries

[vagrant@ct-vm ~]$ sudo conntrack -L
tcp      6 431999 ESTABLISHED src=192.168.122.204 dst=192.168.122.1 sport=22 dport=34858 src=192.168.122.1 dst=192.168.122.204 sport=34858 dport=22 [ASSURED] mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.5 (conntrack-tools): 1 flow entries have been shown.

No new entries. Conntrack didn’t record a new flow for the dropped SYN.

But did it process the SYN packet? To answer that we have to find out which callbacks conntrack registered with Netfilter.

Netfilter keeps track of callbacks registered for each hook in instances of struct nf_hook_entries. We can reach these objects through the Netfilter state (struct netns_nf), which lives inside network namespace (struct net).

struct netns_nf {
    …
    struct nf_hook_entries __rcu *hooks_ipv4[NF_INET_NUMHOOKS];
    struct nf_hook_entries __rcu *hooks_ipv6[NF_INET_NUMHOOKS];
    …
}

struct nf_hook_entries, if you look at its definition, is a bit of an exotic construct. A glance at how the object size is calculated during its allocation gives a hint about its memory layout:

    struct nf_hook_entries *e;
    size_t alloc = sizeof(*e) +
               sizeof(struct nf_hook_entry) * num +
               sizeof(struct nf_hook_ops *) * num +
               sizeof(struct nf_hook_entries_rcu_head);

It’s an element count, followed by two arrays glued together, and some RCU-related state which we’re going to ignore. The two arrays have the same size, but hold different kinds of values.

We can walk the second array, holding pointers to struct nf_hook_ops, to discover the registered callbacks and their priority. Priority determines the invocation order.

With drgn, a programmable C debugger tailored for the Linux kernel, we can locate the Netfilter state in kernel memory and walk its contents relatively easily, given we know what we are looking for.

[vagrant@ct-vm ~]$ sudo drgn
drgn 0.0.8 (using Python 3.9.1, without libkdumpfile)
…
>>> pre_routing_hook = prog['init_net'].nf.hooks_ipv4[0]
>>> for i in range(0, pre_routing_hook.num_hook_entries):
...     pre_routing_hook.hooks[i].hook
...
(nf_hookfn *)ipv4_conntrack_defrag+0x0 = 0xffffffffc092c000
(nf_hookfn *)ipv4_conntrack_in+0x0 = 0xffffffffc093f290
>>>

Neat! We have a way to access Netfilter state.

Let’s take it to the next level and list all registered callbacks for each Netfilter hook (using less than 100 lines of Python):

[vagrant@ct-vm ~]$ sudo /vagrant/tools/list-nf-hooks
🪝 ipv4 PRE_ROUTING
       -400 → ipv4_conntrack_defrag     ☜ conntrack callback
       -300 → iptable_raw_hook
       -200 → ipv4_conntrack_in         ☜ conntrack callback
       -150 → iptable_mangle_hook
       -100 → nf_nat_ipv4_in

🪝 ipv4 LOCAL_IN
       -150 → iptable_mangle_hook
          0 → iptable_filter_hook
         50 → iptable_security_hook
        100 → nf_nat_ipv4_fn
 2147483647 → ipv4_confirm
…

The output from our script shows that conntrack has two callbacks registered with the PRE_ROUTING hook – ipv4_conntrack_defrag and ipv4_conntrack_in. But are they being called?

[Diagram based on "Netfilter PRE_ROUTING hook", thermalcircle.de, CC BY-SA 4.0]

Tracing conntrack callbacks

We expect that when the Netfilter PRE_ROUTING hook processes a TCP SYN packet, it will invoke ipv4_conntrack_defrag and then ipv4_conntrack_in callbacks.

To confirm it we will put to use the tracing powers of BPF 🐝. BPF programs can run on entry to functions. These kinds of programs are known as BPF kprobes. In our case we will attach BPF kprobes to conntrack callbacks.

Usually, when working with BPF, we would write the BPF program in C and use clang -target bpf to compile it. However, for tracing it will be much easier to use bpftrace. With bpftrace we can write our BPF kprobe program in a high-level language inspired by AWK:

kprobe:ipv4_conntrack_defrag,
kprobe:ipv4_conntrack_in
{
    $skb = (struct sk_buff *)arg1;
    $iph = (struct iphdr *)($skb->head + $skb->network_header);
    $th = (struct tcphdr *)($skb->head + $skb->transport_header);

    if ($iph->protocol == 6 /* IPPROTO_TCP */ &&
        $th->dest == 2570 /* htons(2570) */ &&
        $th->syn == 1) {
        time("%H:%M:%S ");
        printf("%s:%u > %s:%u tcp syn %s\n",
               ntop($iph->saddr),
               (uint16)($th->source << 8) | ($th->source >> 8),
               ntop($iph->daddr),
               (uint16)($th->dest << 8) | ($th->dest >> 8),
               func);
    }
}

What does this program do? It is roughly an equivalent of a tcpdump filter:

dst port 2570 and tcp[tcpflags] & tcp-syn != 0

But only for packets passing through conntrack PRE_ROUTING callbacks.

(If you haven’t used bpftrace, it comes with an excellent reference guide and gives you the ability to explore kernel data types on the fly with bpftrace -lv 'struct iphdr'.)

Let’s run the tracing program while we connect to the VM from the outside (nc -z 192.168.122.204 2570):

[vagrant@ct-vm ~]$ sudo bpftrace /vagrant/tools/trace-conntrack-prerouting.bt
Attaching 3 probes...
Tracing conntrack prerouting callbacks... Hit Ctrl-C to quit
13:22:56 192.168.122.1:33254 > 192.168.122.204:2570 tcp syn ipv4_conntrack_defrag
13:22:56 192.168.122.1:33254 > 192.168.122.204:2570 tcp syn ipv4_conntrack_in
^C

[vagrant@ct-vm ~]$

Conntrack callbacks have processed the TCP SYN packet destined to tcp/2570.

But if conntrack saw the packet, why is there no corresponding flow entry in the conntrack table?

Going down the rabbit hole

What actually happens inside the conntrack PRE_ROUTING callbacks?

To find out, we can trace the call chain that starts on entry to the conntrack callback. The function_graph tracer built into the Ftrace framework is perfect for this task.

But because all incoming traffic goes through the PRE_ROUTING hook, including our SSH connection, our trace will be polluted with events from SSH traffic. To avoid that, let’s switch from SSH access to a serial console.

When using libvirt as the Vagrant provider, you can connect to the serial console with virsh:

host $ virsh -c qemu:///session list
 Id   Name                State
-----------------------------------
 1    conntrack_default   running

host $ virsh -c qemu:///session console conntrack_default

Once connected to the console and logged into the VM, we can record the call chain using the trace-cmd wrapper for Ftrace:

[vagrant@ct-vm ~]$ sudo trace-cmd start -p function_graph -g ipv4_conntrack_defrag -g ipv4_conntrack_in
  plugin 'function_graph'
[vagrant@ct-vm ~]$ # … connect from the host with `nc -z 192.168.122.204 2570` …
[vagrant@ct-vm ~]$ sudo trace-cmd stop
[vagrant@ct-vm ~]$ sudo cat /sys/kernel/debug/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   1.219 us    |  finish_task_switch();
 1)   3.532 us    |  ipv4_conntrack_defrag [nf_defrag_ipv4]();
 1)               |  ipv4_conntrack_in [nf_conntrack]() {
 1)               |    nf_conntrack_in [nf_conntrack]() {
 1)   0.573 us    |      get_l4proto [nf_conntrack]();
 1)               |      nf_ct_get_tuple [nf_conntrack]() {
 1)   0.487 us    |        nf_ct_get_tuple_ports [nf_conntrack]();
 1)   1.564 us    |      }
 1)   0.820 us    |      hash_conntrack_raw [nf_conntrack]();
 1)   1.255 us    |      __nf_conntrack_find_get [nf_conntrack]();
 1)               |      init_conntrack.constprop.0 [nf_conntrack]() {  ❷
 1)   0.427 us    |        nf_ct_invert_tuple [nf_conntrack]();
 1)               |        __nf_conntrack_alloc [nf_conntrack]() {      ❶
                             … 
 1)   3.680 us    |        }
                           … 
 1) + 15.847 us   |      }
                         … 
 1) + 34.595 us   |    }
 1) + 35.742 us   |  }
 …
[vagrant@ct-vm ~]$

What catches our attention here is the allocation, __nf_conntrack_alloc() (❶), inside init_conntrack() (❷). __nf_conntrack_alloc() creates a struct nf_conn object which represents a tracked connection.

This object is not created in vain. A glance at init_conntrack() source shows that it is pushed onto a list of unconfirmed connections [3].

What does it mean that a connection is unconfirmed? As conntrack(8) man page explains:

unconfirmed:
       This table shows new entries, that are not yet inserted into the
       conntrack table. These entries are attached to packets that  are
       traversing  the  stack, but did not reach the confirmation point
       at the postrouting hook.

Perhaps we have been looking for our flow in the wrong table? Does the unconfirmed table have a record for our dropped TCP SYN?

Pulling the rabbit out of the hat

I have bad news…

[vagrant@ct-vm ~]$ sudo conntrack -L unconfirmed
conntrack v1.4.5 (conntrack-tools): 0 flow entries have been shown.
[vagrant@ct-vm ~]$

The flow is not present in the unconfirmed table. We have to dig deeper.

Let’s for a moment assume that a struct nf_conn object was added to the unconfirmed list. If the list is now empty, then the object must have been removed from the list before we inspected its contents.

Has an entry been removed from the unconfirmed table? What function removes entries from the unconfirmed table?

It turns out that nf_ct_add_to_unconfirmed_list(), which init_conntrack() invokes, has its opposite defined right beneath it – nf_ct_del_from_dying_or_unconfirmed_list().

It is worth a shot to check if this function is being called, and if so, from where. For that we can again use a BPF tracing program, attached to function entry. However, this time our program will record a kernel stack trace:

kprobe:nf_ct_del_from_dying_or_unconfirmed_list { @[kstack()] = count(); exit(); }

With bpftrace running our one-liner, we connect to the VM from the host with nc as before:

[vagrant@ct-vm ~]$ sudo bpftrace -e 'kprobe:nf_ct_del_from_dying_or_unconfirmed_list { @[kstack()] = count(); exit(); }'
Attaching 1 probe...

@[
    nf_ct_del_from_dying_or_unconfirmed_list+1 ❹
    destroy_conntrack+78
    nf_conntrack_destroy+26
    skb_release_head_state+78
    kfree_skb+50 ❸
    nf_hook_slow+143 ❷
    ip_local_deliver+152 ❶
    ip_sublist_rcv_finish+87
    ip_sublist_rcv+387
    ip_list_rcv+293
    __netif_receive_skb_list_core+658
    netif_receive_skb_list_internal+444
    napi_complete_done+111
    …
]: 1

[vagrant@ct-vm ~]$

Bingo. The conntrack delete function was called, and the captured stack trace shows that on the local delivery path (❶), where the LOCAL_IN Netfilter hook runs (❷), the packet is destroyed (❸). Conntrack must be getting called when the sk_buff (the packet and its metadata) is destroyed. This causes conntrack to remove the unconfirmed flow entry (❹).

It makes sense. After all we have a DROP rule in the filter/INPUT chain. And that iptables -j DROP rule has a significant side effect. It cleans up an entry in the conntrack unconfirmed table!

This explains why we can’t observe the flow in the unconfirmed table. It lives for only a very short period of time.

Not convinced? You don’t have to take my word for it. I will prove it with a dirty trick!

Making the rabbit disappear, or actually appear

If you recall the output from list-nf-hooks that we’ve seen earlier, there is another conntrack callback there – ipv4_confirm, which I have ignored:

[vagrant@ct-vm ~]$ sudo /vagrant/tools/list-nf-hooks
…
🪝 ipv4 LOCAL_IN
       -150 → iptable_mangle_hook
          0 → iptable_filter_hook
         50 → iptable_security_hook
        100 → nf_nat_ipv4_fn
 2147483647 → ipv4_confirm              ☜ another conntrack callback
… 

ipv4_confirm is “the confirmation point” mentioned in the conntrack(8) man page. When a flow gets confirmed, it is moved from the unconfirmed table to the main conntrack table.

The callback is registered with a “weird” priority – 2,147,483,647. It’s the maximum positive value a 32-bit signed integer can hold and, at the same time, the lowest possible priority a callback can have.

This ensures that the ipv4_confirm callback runs last. We want the flows to graduate from the unconfirmed table to the main conntrack table only once we know the corresponding packet has made it through the firewall.

Luckily for us, it is possible to have more than one callback registered with the same priority. In such cases, the order of registration matters. We can put that to use. Just for educational purposes.

Good old iptables won’t be of much help here. Its Netfilter callbacks have hard-coded priorities which we can’t change. But nftables, the iptables successor, is much more flexible in this regard. With nftables we can create a rule chain with arbitrary priority.

So this time, let’s use nftables to install a filter rule to drop traffic to port tcp/2570. The trick, though, is to register our chain before conntrack registers itself. This way our filter will run last.

First, delete the tcp/2570 drop rule in iptables and unregister conntrack.

vm # iptables -t filter -F
vm # rmmod nf_conntrack_netlink nf_conntrack

Then add a tcp/2570 drop rule in nftables, with the lowest possible priority.

vm # nft add table ip my_table
vm # nft add chain ip my_table my_input { type filter hook input priority 2147483647 \; }
vm # nft add rule ip my_table my_input tcp dport 2570 counter drop
vm # nft -a list ruleset
table ip my_table { # handle 1
        chain my_input { # handle 1
                type filter hook input priority 2147483647; policy accept;
                tcp dport 2570 counter packets 0 bytes 0 drop # handle 4
        }
}

Finally, re-register conntrack hooks.

vm # modprobe nf_conntrack enable_hooks=1

The registered callbacks for the LOCAL_IN hook now look like this:

vm # /vagrant/tools/list-nf-hooks
…
🪝 ipv4 LOCAL_IN
       -150 → iptable_mangle_hook
          0 → iptable_filter_hook
         50 → iptable_security_hook
        100 → nf_nat_ipv4_fn
 2147483647 → ipv4_confirm, nft_do_chain_ipv4
…

What happens if we connect to port tcp/2570 now?

vm # conntrack -L
tcp      6 115 SYN_SENT src=192.168.122.1 dst=192.168.122.204 sport=54868 dport=2570 [UNREPLIED] src=192.168.122.204 dst=192.168.122.1 sport=2570 dport=54868 mark=0 secctx=system_u:object_r:unlabeled_t:s0 use=1
conntrack v1.4.5 (conntrack-tools): 1 flow entries have been shown.

We have fooled conntrack 💥

Conntrack promoted the flow from the unconfirmed to the main conntrack table despite the fact that the firewall dropped the packet. We can observe it.

Outro

Conntrack processes every received packet [4] and creates a flow for it. A flow entry is always created even if the packet is dropped shortly after. The flow might never be promoted to the main conntrack table and can be short lived.

However, this blog post is not really about conntrack. Its internals have been covered in magazines, papers, books, and on other blogs long before. We probably could have learned elsewhere all that has been shown here.

For us, conntrack was really just an excuse to demonstrate various ways to discover the inner workings of the Linux network stack. As good as any other.

Today we have powerful introspection tools like drgn, bpftrace, or Ftrace, and a cross referencer to plow through the source code, at our fingertips. They help us look under the hood of a live operating system and gradually deepen our understanding of its workings.

I have to warn you, though. Once you start digging into the kernel, it is hard to stop…

Footnotes

[1] Actually, since Linux v5.10 (Dec 2020), there is an additional Netfilter hook for the INET family named NF_INET_INGRESS. The new hook type allows users to attach nftables chains to the Traffic Control ingress hook.
[2] Why did I pick this port number? Because 2570 = 0x0a0a. As we will see later, this saves us the trouble of converting between the network byte order and the host byte order.
[3] To be precise, there are multiple lists of unconfirmed connections, one per CPU. This is a common pattern in the kernel. Whenever we want to prevent CPUs from contending for access to a shared state, we give each CPU a private instance of the state.
[4] Unless we explicitly exclude it from being tracked with iptables -j NOTRACK.

Cloudflare recognized as a ‘Leader’ in The Forrester Wave for DDoS Mitigation Solutions

Post Syndicated from Vivek Ganti original https://blog.cloudflare.com/cloudflare-is-named-a-leader-in-the-forrester-wave-for-ddos-mitigation-solutions/

We’re thrilled to announce that Cloudflare has been named a leader in The Forrester Wave™: DDoS Mitigation Solutions, Q1 2021. You can download a complimentary copy of the report here.

According to the report, written by Forrester Senior Analyst for Security and Risk David Holmes, “Cloudflare protects against DDoS from the edge, and fast… customer references view Cloudflare’s edge network as a compelling way to protect and deliver applications.”

Unmetered and unlimited DDoS protection for all

Cloudflare was founded with the mission to help build a better Internet — one where the impact of DDoS attacks is a thing of the past. Over the last 10 years, we have been unwavering in our efforts to protect our customers’ Internet properties from DDoS attacks of any size or kind. In 2017, we announced unmetered DDoS protection for free — as part of every Cloudflare service and plan including the Free plan — to make sure every organization can stay protected and available.

Thanks to our home-grown automated DDoS protection systems, we’re able to provide unmetered and unlimited DDoS protection for free. Our automated systems constantly analyze traffic samples asynchronously so as to avoid impacting performance. They scan for DDoS attacks across layers 3-7 of the OSI model, looking for patterns in IP packets, HTTP requests, and HTTP responses. When an attack is identified, a real-time signature is generated in the form of an ephemeral mitigation rule. The rule is propagated to the optimal location at our edge for the most cost-efficient mitigation: either the Linux kernel’s eXpress Data Path (XDP), Linux userspace iptables, or the HTTP reverse proxy. A cost-efficient mitigation strategy means that we can mitigate the most volumetric, distributed attacks without impacting performance.

Read more about how Cloudflare’s DDoS protection systems work here.

DDoS attacks increasing

We’d like to say DDoS attacks are a thing of the past. But unfortunately, they are not.

On the contrary, we continue to see the frequency, sophistication, and geographical distribution of DDoS attacks rise every quarter – in quantity or size. See our reports from last year (Q1 ‘20, Q2 ‘20, Q3 ‘20, and Q4 ‘20) and view overall Internet traffic trends here on Cloudflare Radar.

Over the past year, Cloudflare has seen and automatically mitigated some of the largest and arguably the most creative cyber attacks. As attackers are getting bolder and smarter in their ways, organizations are looking for ways to battle these kinds of attacks with no disruption to the services they provide.

[Figure: DDoS attacks in 2020]

Organizations are being extorted under threat of DDoS

In January this year, we shared the story of how we helped a Fortune Global 500 company stay online and protected whilst they were targeted by a ransom DDoS attack. They weren’t the only one. In fact, in the fourth quarter of 2020, 17% of surveyed Cloudflare customers reported receiving a ransom or a threat of DDoS attack. In Q1 2021, this increased to 26% — roughly 1 out of every 4 respondents reported a ransom threat and a subsequent DDoS attack on their network infrastructure.

Whether organizations are targeted with ransom attacks or amateur ‘cyber vandalism’, it’s important for organizations to utilize an always-on, automated DDoS protection service that doesn’t require manual human intervention in the hour of need. We take great pride in being able to provide this level of protection to our customers.

Continuous improvement

As attacks have continued to evolve, and the number of customers using our services has increased, Cloudflare has continually invested in our technology to stay several steps ahead of attackers. We’ve made significant investments in bolstering our mitigation capacity, honing our detection algorithms, and providing better analytics capabilities to our customers. Our aim is to make impact from DDoS attacks a thing of the past, for all customers, just like spam in the 90s.

In 2019, we rolled out our autonomous DDoS detection and mitigation system, dosd. This component of our mitigation stack is fully software-defined, leverages Linux’s eXpress Data Path (XDP), and allows us to quickly and automatically deploy eBPF rules that run on each packet received for inspection — mitigating the most sophisticated attacks within less than 3 seconds on average at the edge and other common attacks instantly. It works by detecting patterns in the attack traffic and then quickly deploying rules autonomously to drop the offenders at wire speed. Additionally, because dosd operates independently within each data center, with no reliance on a centralized data center, it greatly increases the resilience of our network.

While dosd is great at mitigating attacks by detecting patterns in the traffic, what about patternless attacks? That’s where flowtrackd comes in: our novel TCP state classification engine, built in 2020 to defend against disruptive L3/L4 attacks targeting our Magic Transit customers. It’s able to detect and mitigate the most randomized, sophisticated attacks. Additionally, at L7, we also learn our customers’ traffic baselines and identify when their origins are in distress. When an origin server shows signs of deterioration, our systems begin soft mitigation in order to reduce the impact on the server and allow it to recuperate.

Building advanced DDoS protection systems is not only about detection, but also about cost-efficient mitigation. We aim to mitigate attacks without the performance impact that excessive computational consumption can cause. This requirement is why we introduced IP Jails to the world: IP Jails is a Gatebot capability that mitigates the most volumetric and distributed attacks without impacting performance. Gatebot activates IP Jails when attacks become significantly volumetric; then, instead of blocking at L7, IP Jails temporarily drops the connection of the offending IP address that generated the request matching the attack signature Gatebot created. IP Jails leverages the Linux iptables mechanism to drop packets at wire speed. Dropping L7 attacks at L4 is significantly more cost-efficient, and benefits both our customers and our Site Reliability Engineering team.

Lastly, to provide our customers better visibility and insight into the increasingly sophisticated attacks we’re seeing and mitigating, we released the Firewall Analytics dashboard in 2019. This dashboard provides insights into both HTTP application security and DDoS activity at L7, allowing customers to configure rules directly from within analytics dashboards thus tightening the feedback loop for responding to events. Later in 2020, we released an equivalent dashboard for L3/4 activity for our enterprise Magic Transit and Spectrum customers, in the form of the Network Analytics dashboard. Network Analytics provides insight into packet-level traffic and DDoS attack activity, along with periodical Insights and Trends. To complement the dashboards and provide our users the right information as they need it, we rolled out real-time DDoS alerts and also periodical DDoS reports — right into your inboxes. Or if you prefer, directly into your SIEM dashboards.

Cloudflare received the top score in the strategy category

This year, due to our advanced DDoS protection capabilities, Cloudflare received the top score in the strategy category and among the top three in the current offering category. Additionally, we were given the highest possible scores in 15 criteria in the report, including:

  • Threat detection
  • Burst attacks
  • Response automation
  • Speed of implementation
  • Product vision
  • Performance
  • Security operation center (SOC) service

We believe that our standing stems from the sustained investments we’ve made over the last few years in our global Anycast network — which serves as the foundation of all services we provide to our customers.

Our network is architected for scale — every service runs on every server in every Cloudflare data center that spans over 200 cities globally. And as opposed to some of the other vendors in the report, every Cloudflare service is delivered from every one of our edge data centers.

Integrated security and performance

A leading application performance monitoring company that uses Cloudflare’s services for serverless compute and content delivery recently told us that they wanted to consolidate their performance and security services under one provider. They got rid of their incumbent L3 services provider and onboarded Cloudflare for their application and network services (with Magic Transit) for easier management and better support.

We see this more and more. The benefits of using a single cloud provider for bundled security and performance services are plentiful:

  • Easier management — users can manage all of Cloudflare’s services such as DDoS protection, WAF, CDN, bot management and serverless compute from a single dashboard and a single API endpoint.
  • Deep service integration – all of our services are deeply integrated which allows our users to truly leverage the power of Cloudflare. As an example, Bot Management rules are implemented with our Application Firewall.
  • Easier troubleshooting — instead of having to reach out to multiple providers, our customers have a single point of contact when troubleshooting. Additionally, we provide immediate human response in our under attack hotline.
  • Lower latency — because every one of our services is delivered from all of our data centers, there are no performance penalties. As an example, there are no additional routing hops from the DDoS service to the Bot Management service to the CDN service.

However, not all cloud services are built the same; most vendors today do not have a comprehensive and robust solution to offer. Cloudflare’s unique architecture enables it to offer an integrated solution that comprises an all-star cast featuring the following, to name a few:

  • CDN: Customer’s Choice LEADER in 2020 Gartner Peer Insights ‘Voice of the Customer’: Global CDN [1]
  • DDoS: Received the highest number of high scores in the 2020 Gartner report for Solution Comparison for DDoS Cloud Scrubbing Centers [2]
  • WAF: Cloudflare is a CHALLENGER in the 2020 Gartner Magic Quadrant for Web Application Firewall (receiving the highest placement in the ‘Ability to Execute’) [3]
  • Zero Trust: Cloudflare is a LEADER in the Omdia Market Radar: Zero-Trust Access Report, 2020 [4]
  • Bot Management: Leader in the 2020 SPARK Matrix of Bot Management Market [5]
  • Integrated solution: Innovation leader in the Global Holistic Web Protection Market for 2020 by Frost & Sullivan [6]

We are pleased to be named a LEADER in The Forrester Wave™: DDoS Mitigation Solutions, Q1 2021 report, and will continue to work tirelessly to remain, as the report puts it, a “compelling way to protect and deliver applications” for our customers.

For more information about Cloudflare’s DDoS protection, reach out to us here, or for a hands-on evaluation of Cloudflare, sign up here.

[1] https://www.gartner.com/reviews/market/global-cdn/vendor/cloudflare/product/cloudflare-cdn
[2] https://www.gartner.com/en/documents/3983636/solution-comparison-for-ddos-cloud-scrubbing-centers
[3] Gartner, “Magic Quadrant for Web Application Firewalls”, Analyst(s): Jeremy D’Hoinne, Adam Hils, John Watts, Rajpreet Kaur, October 19, 2020. https://www.gartner.com/doc/reprints?id=1-249JQ6L1&ct=200929&st=sb
[4] https://www.cloudflare.com/lp/omdia-zero-trust
[5] https://www.cloudflare.com/lp/qks-bot-management-leader/
[6] https://www.cloudflare.com/lp/frost-radar-holistic-web/

An introduction to three-phase power and PDUs

Post Syndicated from Rob Dinh original https://blog.cloudflare.com/an-introduction-to-three-phase-power-and-pdus/

Our fleet of over 200 locations comprises various generations of servers and routers. And with the ever changing landscape of services and computing demands, it’s imperative that we manage power in our data centers right. This blog is a brief Electrical Engineering 101 session going over specifically how power distribution units (PDU) work, along with some good practices on how we use them. It appears to me that we could all use a bit more knowledge on this topic, and more love and appreciation of something that’s critical but usually taken for granted, like hot showers and opposable thumbs.

A PDU is a device used in data centers to distribute power to multiple rack-mounted machines. It’s an industrial-grade power strip typically designed to supply an average consumption of about seven US households. Advanced models have monitoring features and can be accessed via SSH or a web GUI to turn power outlets on and off. How we choose a PDU depends on what country the data center is in and what it provides in terms of voltage, phase, and plug type.

[Photo: two vertically mounted PDUs in a rack, feeding dual-PSU servers via red and blue cables]

For each of our racks, all of our dual power-supply (PSU) servers are cabled to one of the two vertically mounted PDUs. As shown in the picture above, one PDU feeds a server’s PSU via a red cable, and the other PDU feeds that server’s other PSU via a blue cable. This is to ensure we have power redundancy maximizing our service uptime; in case one of the PDUs or server PSUs fail, the redundant power feed will be available keeping the server alive.

Faraday’s Law and Ohm’s Law

Like most high-voltage applications, PDUs and servers are designed to use AC power, meaning voltage and current aren’t constant — they’re sine waves whose magnitudes alternate between positive and negative at a certain frequency. For example, a voltage feed of 100V is not constantly at 100V; it bounces between 100V and -100V like a sine wave. One complete sine wave cycle is one phase of 360 degrees, and running at 50Hz means there are 50 cycles per second.
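
Written as an equation (following the example above, which treats 100V as the peak value), that 100V, 50Hz feed is:

v(t) = 100V x sin(2π x 50Hz x t)

which sweeps from +100V to -100V and back 50 times every second.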

The sine wave can be explained by Faraday’s Law and by looking at how an AC power generator works. Faraday’s Law tells us that a current is induced to flow due to a changing magnetic field. The figure below illustrates a simple generator with a permanent magnet rotating at constant speed and a coil sitting within the magnet’s magnetic field. Magnetic force is strongest at the North and South ends of the magnet. So as the magnet rotates near the coil, the current flow in the coil fluctuates. One complete rotation of the magnet represents one phase. As the North end approaches the coil, current increases from zero. Once the North end leaves, current decreases to zero. The South end in turn approaches, and now the current “increases” in the opposite direction. Finally, finishing the phase, the South end leaves, returning the current back to zero. Current alternates its direction every half cycle, hence the name Alternating Current.

[Diagram: a simple AC generator with a permanent magnet rotating past a coil]
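
To put Faraday’s Law in equation form: if we approximate the flux through the coil from the rotating magnet as a cosine (a simplification of the real field geometry, for illustration only), the induced voltage comes out as exactly the sine wave described above:

Φ(t) = Φmax x cos(ωt)
e(t) = -dΦ/dt = ω x Φmax x sin(ωt)

The voltage is largest when the flux is changing fastest (a pole sweeping past the coil) and zero when the flux momentarily stops changing.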

Current and voltage in AC power fluctuate in-phase, or “in tandem”, with each other. So by Ohm’s Law of Power = Voltage x Current, power will always be positive. Notice on the graph below that AC power (Watts) has two peaks per cycle. But for practical purposes, we’d like to use a constant power value. We do that by interpreting AC power as “DC” power using root-mean-square (RMS) averaging, which takes the peak value and divides it by √2. For example, in the US, our conditions are 208V 24A at 60Hz. When we look at spec sheets, all of these values can be assumed to be RMS’d into their constant DC-equivalent values. When we say we’re fully utilizing a PDU’s max capacity of 5kW, it actually means that the instantaneous power consumption of our machines bounces between 0 and about 10kW (2 x 5kW), while averaging out to 5kW.

[Graphs: AC voltage, current, and power over one cycle, and their RMS equivalents]
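
To make the RMS bookkeeping concrete, here is the arithmetic for the US example, assuming an idealized, purely resistive load:

Vpeak = 208V x √2 ≈ 294V
Ipeak = 24A x √2 ≈ 34A
Average power = Vrms x Irms = 208V x 24A ≈ 5kW
Instantaneous power = Vpeak x Ipeak x sin²(ωt), which swings between 0 and ≈ 10kW

The average of sin² over a full cycle is 1/2, which is exactly where the two factors of √2 (one for voltage, one for current) go, bringing the 10kW peak back down to the 5kW average.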

It’s also critical to figure out the sum of power our servers will need in a rack so that it falls under the PDU’s design max power capacity. For our US example, a PDU is typically 5kW (208 volts x 24 amps); therefore, we budget 5kW and fit as many machines as we can under that. If we need more machines and the total power goes above 5kW, we’d need to provision another power source. That could lead to another set of PDUs and racks that we may not fully use depending on demand, i.e. more underutilized cost. All we can do is abide by P = V x I.

However, there is a way to increase the max power capacity economically — the 3-phase PDU. Compared to single-phase, its max capacity is √3, or about 1.7, times higher. A 3-phase PDU with the same US specs above has a capacity of 8.6kW (5kW x √3), allowing us to power more machines from the same source. A 3-phase setup means thicker cables and a bigger plug, and the PDU is more expensive than a 1-phase one; but its value is higher compared to two 1-phase rack setups for these reasons (the numbers are worked out right after the list):

  • It’s more cost-effective, because there are fewer hardware resources to buy:
    • Say the computing demand adds up to 215kW of hardware; we would need 25 3-phase racks compared to 43 1-phase racks.
    • Each rack needs two PDUs for power redundancy. Using the example above, we would need 50 3-phase PDUs compared to 86 1-phase PDUs to power 215kW worth of hardware.
    • That also means a smaller rack footprint and fewer power sources provided and charged by the data center, saving us up to √3 or 1.7 times in opex.
  • It’s more resilient, because there are more circuit breakers in a 3-phase PDU — one more than in a 1-phase. For example, a 48-outlet PDU that is 1-phase would be split into two circuits of 24 outlets, while a 3-phase one would be split into three circuits of 16 outlets. If a breaker tripped, we’d lose 16 outlets using a 3-phase PDU instead of 24.
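
As a quick sanity check of the numbers in the list above (same US feed as before, and assuming the load is spread evenly across the three phases):

1-phase capacity: 208V x 24A ≈ 5kW
3-phase capacity: √3 x 208V x 24A ≈ 8.6kW
Racks for 215kW of hardware: 215 / 5 = 43 (1-phase) vs. 215 / 8.6 ≈ 25 (3-phase)
PDUs at two per rack: 86 (1-phase) vs. 50 (3-phase)
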
[Photo: a 3-phase, 48-outlet PDU]

The PDU shown above is a 3-phase model with 48 outlets. We can see three pairs of circuit breakers for the three branches, which are interleaved with each other in white, grey, and black. Industry demands today pressure engineers to maximize compute performance and minimize physical footprint, making the 3-phase PDU a widely used part of operating a data center.

What is 3-phase?

A 3-phase AC generator has three coils instead of one, with the coils 120 degrees apart inside the cylindrical core, as shown in the figure below. Just like in the 1-phase generator, current flow is induced by the rotation of the magnet, so each coil produces power sequentially at every one-third of the magnet’s rotation cycle. In other words, we’re generating three 1-phase power waveforms offset by 120 degrees.


A 3-phase feed is set up by joining any of its three coils into line pairs. L1, L2, and L3 coils are live wires with each on their own phase carrying their own phase voltage and phase current. Two phases joining together form one line carrying a common line voltage and line current. L1 and L2 phase voltages create the L1/L2 line voltage. L2 and L3 phase voltages create the L2/L3 line voltage. L1 and L3 phase voltages create the L1/L3 line voltage.


Let’s take a moment to clarify the terminology. Some other sources may refer to line voltage (or current) as line-to-line or phase-to-phase voltage (or current). It can get confusing, because line voltage is the same as phase voltage in 1-phase circuits, as there’s only one phase. Also, the magnitude of the line voltage is equal to the magnitude of the phase voltage in 3-phase Delta circuits, while the magnitude of the line current is equal to the magnitude of the phase current in 3-phase Wye circuits.

Conversely, the line current equals the phase current times √3 in Delta circuits, and in Wye circuits the line voltage equals the phase voltage times √3.

In Delta circuits:
Vline = Vphase
Iline = √3 x Iphase

In Wye circuits:
Vline = √3 x Vphase
Iline = Iphase

Delta and Wye are the two ways in which the three wires can be joined together. This happens both at the power source with its three coils and at the PDU end with its three branches of outlets. Note that the generator and the PDU don’t need to match each other’s circuit type.


On PDUs, these phases join when we plug servers into the outlets. Conceptually, we can take the coil wirings above and replace the coils with resistors representing servers. Below is a simplified wiring diagram of a 3-phase Delta PDU showing the three line pairs as three modular branches. Each branch carries two phase currents and one common voltage drop.


And this one below is of a 3-phase Wye PDU. Note that Wye circuits have an additional line known as the neutral line where all three phases meet at one point. Here each branch carries one phase and a neutral line, therefore one common current. The neutral line isn’t considered as one of the phases.


Thanks to a neutral line, a Wye PDU can offer a second voltage source that is √3 times lower for smaller devices, like laptops or monitors. Common voltages for Wye PDUs are 230V/400V or 120V/208V, particularly in North America.

Where does the √3 come from?

Why are we multiplying by √3? As the name implies, we are adding phasors. Phasors are complex numbers representing sine wave functions. Adding phasors is like adding vectors. Say your GPS tells you to walk 1 mile East (vector a), then walk 1 mile North (vector b). You walked 2 miles, but you only moved 1.4 miles NE of the original location (vector a+b). That 1.4 miles of “work” is what we want.

For our application, let’s take L1 and L2 in a Delta circuit: joining phases L1 and L2 gives us the L1/L2 line. We assume the two coils are identical, and let α represent the voltage magnitude of each phase. The two phases are offset by 120 degrees, as designed in the 3-phase power generator:

|L1| = |L2| = α
L1 = |L1|∠0° = α∠0°
L2 = |L2|∠-120° = α∠-120°

Using vector addition to solve for L1/L2:

The line voltage is the potential difference between the two phases, so we add L1 and the reverse of L2:

L1/L2 = L1 + (−L2)
L1/L2 = α∠0° − α∠−120°
L1/L2 = α(1 + j0) − α(−1/2 − j√3/2)
L1/L2 = α(3/2 + j√3/2)

Convert L1/L2 into polar form:

L1/L2 = √3α∠30°

Since voltage is a scalar, we’re only interested in the “work”:

|L1/L2| = √3α

The same α applies to L3 as well. This means that for any of the three line pairs, we multiply the phase voltage by √3 to calculate the line voltage.

Vline = √3 x Vphase

Now, with the three phases carrying equal power, we can add them all up to get the overall effective power. The relationships below hold for both Delta and Wye circuits (in Delta, Vphase = Vline and Iphase = Iline/√3; in Wye, Vphase = Vline/√3 and Iphase = Iline, so both substitutions lead to the same line-quantity form):

Poverall = 3 x Pphase
Poverall = 3 x (Vphase x Iphase)
Poverall = √3 x Vline x Iline

Using the US example, the line voltage is 208V and the line current is 24A. This gives an overall 3-phase power of 8646W (√3 x 208V x 24A), or 8.6kW. Therein lies the biggest advantage of 3-phase systems: by adding two more sets of coils and wires (ignoring the neutral wire), we’ve built a feed that delivers √3, or about 1.7, times more power.
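
The phasor math above can be double-checked with a few lines of Python (a sketch added for illustration, taking 120V as the per-coil phase voltage from the Wye example above):

import cmath, math

alpha = 120                                  # US phase (coil) voltage in the Wye example
L1 = cmath.rect(alpha, math.radians(0))      # α∠0°
L2 = cmath.rect(alpha, math.radians(-120))   # α∠-120°

line = L1 - L2                               # L1/L2 = L1 + (−L2)
print(abs(line))                             # ≈ 208 V, i.e. √3 x 120
print(math.degrees(cmath.phase(line)))       # ≈ 30°

print(math.sqrt(3) * 208 * 24)               # overall power ≈ 8646 W, the 8.6kW figure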

Dealing with 3-phase

The derivation in the section above assumes that the magnitude of all three phases is equal, but in practice that’s rarely the case. In fact, it hardly ever is. We rarely have servers and switches evenly distributed across all three branches on a PDU. Each machine may have different loads and different specs, so power draw can be wildly different, potentially causing a dramatic phase imbalance. A heavily imbalanced setup can significantly reduce the PDU’s usable capacity.

A perfectly balanced and fully utilized PDU at 8.6kW means that each of its three branches has 2.88kW of power consumed by machines. Laid out simply, it’s spread 2.88 + 2.88 + 2.88. This is the best case scenario. If we take 1kW worth of machines out of one branch, the spread becomes 2.88 + 1.88 + 2.88: imbalance is introduced and the PDU is underutilized, but we’re fine. However, if we put that 1kW back into another branch, making it 3.88 + 1.88 + 2.88, the PDU is over capacity, even though the sum is still 8.6kW. In fact, it would be over capacity even if we added just 500W instead of 1kW on the wrong branch, reaching 3.38 + 1.88 + 2.88 (8.1kW).

That’s because an 8.6kW PDU is spec’d for a maximum of 24A on each phase conductor. Overloading one of the branches can push a phase current over 24A. In the extreme, we could load a single branch until its current reaches 24A and leave the other two branches unused; that effectively turns it into a 1-phase PDU and throws away the benefit of the √3 multiplier. In reality, each branch has fuses rated below 24A (usually 20A) to ensure we never get that far and cause overcurrent issues. Therefore the same 8.6kW PDU would have one of its branches trip at 4.2kW (208V x 20A).

Loading up one branch is the easiest way to overload the PDU. Being heavily imbalanced significantly lowers PDU capacity and increases risk of failure. To help minimize that, we must:

  • Ensure that total power consumption of all machines is under the PDU’s max power capacity
  • Try to be as phase-balanced as possible by spreading cabling evenly across the three branches
  • Ensure that the sum of phase currents from powered machines at each branch is under the fuse rating at the circuit breaker.

This spreadsheet from Raritan is very useful when designing racks.

For the sake of simplicity, let’s ignore other machines like switches. Our latest 2U4N servers are rated at 1800W. That means we can only fit a maximum of four of these 2U4N chassis (8600W / 1800W = 4.7 chassis). Rounding them up to 5 would reach a total rack level power consumption of 9kW, so that’s a no-no.

Splitting 4 chassis evenly across 3 branches is impossible, so one of the branches has to carry 2 chassis. That leads to non-ideal phase balancing, with two of the phase conductors already at about 22.9A.


Keeping phase currents under 24A, there’s only 1.1A (24A – 22.9A) of headroom on L1 or L2 before the PDU gets overloaded. Say we want to add as many machines as we can under the PDU’s power capacity. One option is to add up to 242W on the L1/L2 branch, at which point both the L1 and L2 currents reach their 24A limit.


Alternatively, we can add up to 298W on the L2/L3 branch until L2 current reaches 24A. Note we can also add another 298W on the L3/L1 branch until L1 current reaches 24A.


In the examples above, we can see that various solutions are possible. Adding a 298W machine to each of the L2/L3 and L3/L1 branches is the most phase-balanced solution given the parameters. Even so, the PDU isn’t fully utilized, topping out at about 7.8kW.
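
The 22.9A, 1.1A, and 242W figures above come from combining the two branch currents that share each phase conductor. Here is a minimal Python sketch of that calculation (an illustration assuming a 208V Delta PDU and purely resistive loads):

import math

VOLTS = 208            # branch (line-to-line) voltage on the Delta PDU
PHASE_LIMIT = 24.0     # per-conductor current limit from the spec

def phase_currents(l1l2_w, l2l3_w, l3l1_w):
    """Per-conductor currents given the wattage on each Delta branch.

    The two branch-current phasors attached to a conductor are 120 degrees
    apart, so their combination has magnitude sqrt(Ia^2 + Ib^2 + Ia*Ib).
    """
    i12, i23, i31 = (w / VOLTS for w in (l1l2_w, l2l3_w, l3l1_w))
    combine = lambda a, b: math.sqrt(a * a + b * b + a * b)
    return {"L1": combine(i12, i31), "L2": combine(i12, i23), "L3": combine(i23, i31)}

# Four 1800W 2U4N chassis: two on L1/L2, one each on L2/L3 and L3/L1
currents = phase_currents(2 * 1800, 1800, 1800)
print({k: round(v, 1) for k, v in currents.items()})       # {'L1': 22.9, 'L2': 22.9, 'L3': 15.0}
print("headroom on L1:", round(PHASE_LIMIT - currents["L1"], 1), "A")   # ≈ 1.1 A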

Dealing with an 1800W server is not ideal, because whichever branch we choose to power it from will swing the phase balance significantly. Thankfully, our Gen X servers take up less space and are more power efficient. A smaller footprint gives us more flexibility and finer-grained control over our racks in many of our diverse data centers. Assuming each 1U server is 450W, as if we physically split the 1800W 2U4N into four nodes each with its own power supply, we’re now able to fit 18 nodes. That’s 2 more nodes than the four-chassis 2U4N setup.


Adding two more servers means we’ve increased the compute capacity of the rack by 12.5%. While a full Total Cost of Ownership calculation involves more factors than are considered here, this is still a great way to show we can be smarter with asset costs.

Cloudflare provides the back-end services so that our customers can enjoy the performance, reliability, security, and global scale of our edge network. Meanwhile, we manage all of our hardware in over 100 countries with various power standards and compliances, and ensure that our physical infrastructure is as reliable as it can be.

There’s no Cloudflare without hardware, and there’s no Cloudflare without power. Want to know more? Watch this Cloudflare TV segment about power: https://cloudflare.tv/event/7E359EDpCZ6mHahMYjEgQl.

ASICs at the Edge

Post Syndicated from Tom Strickx original https://blog.cloudflare.com/asics-at-the-edge/


At Cloudflare we pride ourselves on our global network that spans more than 200 cities in over 100 countries. To handle all the traffic passing through our network, there are multiple technologies at play. So let’s have a look at one of the cornerstones that makes all of this work… ASICs. No, not the running shoes.

What’s an ASIC?

ASIC stands for Application Specific Integrated Circuit. The name already says it, it’s a chip with a very narrow use case, geared towards a single application. This is in stark contrast to a CPU (Central Processing Unit), or even a GPU (Graphics Processing Unit). A CPU is designed and built for general purpose computation, and does a lot of things reasonably well. A GPU is more geared towards graphics (it’s in the name), but in the last 15 years, there’s been a drastic shift towards GPGPU (General Purpose GPU), in which technologies such as CUDA or OpenCL allow you to use the highly parallel nature of the GPU to do general purpose computing. A good example of GPU use is video encoding, or more recently, computer vision, used in applications such as self-driving cars.

Unlike CPUs or GPUs, ASICs are built with a single function in mind. Great examples are the Google Tensor Processing Units (TPU), used to accelerate machine learning functions[1], or for orbital maneuvering[2], in which specific orbital maneuvers are encoded, like the Hohmann Transfer, used to move rockets (and their payloads) to a new orbit at a different altitude. And they are also heavily used in the networking industry. Technically, the use case in the network industry should be called an ASSP (Application Specific Standard Product), but network engineers are simple people, so we prefer to call it an ASIC.

Why an ASIC

ASICs have the major benefit of being hyper-efficient. The more complex hardware is, the more cooling and power it needs. As ASICs only contain the hardware components needed for their function, their overall size can be reduced, and so can their power requirements. This has a positive impact on the overall physical size of the network (devices don’t need to be as bulky to provide sufficient cooling) and helps reduce the power consumption of a data center.

Reducing hardware complexity also reduces the failure rate of the manufacturing process, and allows for easier production.

The downside is that you need to embed a lot of your features in hardware, and once a new technology or specification comes around, any chips made without that technology baked in won’t be able to support it (VXLAN, for example).

For network equipment, this works perfectly. Overall, the networking industry is slow-moving, and considerable time is taken before new technologies make it to the market (as can be seen with IPv6, MPLS implementations, xDSL availability, …). This means the chips don’t need to evolve on a yearly basis, and can instead be created on a much slower cycle, with bigger leaps in technology. For example, it took Broadcom two years to go from Tomahawk 3 to Tomahawk 4, but in that process they doubled the throughput. The benefits listed earlier are super helpful for network equipment, as they allow for considerable throughput in a small form factor.

Building an ASIC

As with chips of any kind, building an ASIC is a long-term process. Just like with CPUs, if there’s a defect in the hardware design, you have to start from scratch, and scrap the entire build line. As such, the development lifecycle is incredibly long. It starts with prototyping in an FPGA (Field Programmable Gate Array), in which chip designers can program their required functionality and confirm compatibility. All of this is done in a HDL (Hardware Description Language), such as Verilog.

Once the prototyping stage is over, they move to baking the new packet processing pipeline into the chip at a foundry. After that, no more changes can be made to the chip, as it’s literally baked into the hardware (unlike an FPGA, which can be reprogrammed). Further difficulty is added by the fact that there are a very small number of hardware companies that will buy ASICs in bulk to build equipment with; as such the unit cost can increase drastically.

All of this means that the iteration cycle of an ASIC tends to be on the slower side of things (compared to the yearly refreshes in the Intel Process-Architecture-Optimization model for example), and will usually be smaller incremental updates: For example, increases in port-speeds are incremental (1G → 10G → 25G → 40G → 100G → 400G → 800G → …), and are tied into upgrades to the SerDes (Serialiser/Deserialiser) part of the chip.

New protocol support is a lot harder, and might require multiple development cycles before it shows up in a chip.

What ASICs do

The ASICs in our network equipment are responsible for the switching and routing of packets, as well as being the first layer of defense (in the form of a stateless firewall). Due to the sheer speed at which packets get switched, fast memory access is a primary concern. Most ASICs use a special sort of memory, called TCAM (Ternary Content-Addressable Memory). This memory is used to store all sorts of lookup tables. These may be forwarding tables (where does this packet go), ACL (Access Control List) tables (is this packet allowed), or CoS (Class of Service) tables (which priority should be given to this packet).

CAM, and its more advanced sibling, TCAM, are fascinating kinds of memory, as they operate fundamentally differently from traditional Random Access Memory (RAM). While you have to use a memory address to access data in RAM, with CAM and TCAM you can directly refer to the content you are looking for. It is a physical implementation of a key-value store.

In CAM you use the exact binary representation of a word; in a network application, that word is likely going to be an IP address, for example 11001011.00000000.01110001.00000000 (203.0.113.0). While this is definitely useful, networks operate on big collections of IP addresses, and storing each one individually would require significant memory. To remedy this memory requirement, TCAM can store three states instead of the binary two. This third state, sometimes called the ‘ignore’ state, allows for the storage of multiple sequential data words as a single entry.

In networking, these sequential data words are IP prefixes. So for the previous example, if we wanted to store that IP address and the 255 addresses following it, the TCAM entry would look as follows: 11001011.00000000.01110001.XXXXXXXX (203.0.113.0/24). This storage method means we can ask questions of the ASIC such as “where should I send packets with the destination IP address of 203.0.113.19”, to which the ASIC can have a reply ready in a single clock cycle, as it does not need to run through all memory, but instead can directly reference the key. This reply will usually be a reference to a memory address in traditional RAM, where more data can be stored, such as the output port, or firewall requirements for the packet.
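
To make the idea concrete, here is a toy Python model of that kind of lookup (purely illustrative: a real TCAM evaluates every entry in parallel in hardware, and the prefixes, slot names, and port names below are invented):

import ipaddress

# "TCAM": value/mask entries where the most specific matching prefix wins;
# the result points at a "RAM" slot holding the actual forwarding data.
TCAM = [
    (ipaddress.ip_network("203.0.113.0/24"), "ram_slot_17"),  # 11001011.00000000.01110001.XXXXXXXX
    (ipaddress.ip_network("203.0.0.0/16"), "ram_slot_4"),
    (ipaddress.ip_network("0.0.0.0/0"), "ram_slot_0"),        # default route
]

RAM = {
    "ram_slot_17": {"out_port": "et-0/0/1"},
    "ram_slot_4": {"out_port": "et-0/0/5"},
    "ram_slot_0": {"out_port": "et-0/0/0"},
}

def lookup(dst: str):
    dst_ip = ipaddress.ip_address(dst)
    matches = [(net, slot) for net, slot in TCAM if dst_ip in net]
    net, slot = max(matches, key=lambda m: m[0].prefixlen)    # longest prefix wins
    return net, RAM[slot]

print(lookup("203.0.113.19"))   # matched by the /24 entry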

To dig a bit deeper into what ASICs do in network equipment, let’s briefly go over some fundamentals.

Networking can be split into two primary components: routing and switching. Switching allows you to directly interconnect multiple devices, so they can talk with each other across the network. It’s what allows your phone to connect to your TV to play a new family video. Routing is the next level up. It’s the mechanism that interconnects all these switched networks into a network of networks, and eventually, the Internet.

So routers are the devices responsible for steering traffic through this complex maze of networks, so it gets to its destination safely, and hopefully, as fast as possible. On the Internet, routers will usually use a routing protocol called BGP (Border Gateway Protocol) to exchange reachability information for a prefix (a collection of IP addresses), also called NLRI (Network Layer Reachability Information).

As with navigating the roads, there are multiple ways to get from point A to point B on the Internet. To make sure the router makes the right decision, it will store all of the reachability information in the RIB (Routing Information Base). That way, if anything changes with one route, the router still has other options immediately available.

With this information, a BGP daemon can calculate the ideal path to take for any given destination from its own point-of-view. This Cisco documentation explains the decision process the daemon goes through to calculate that ideal path.

Once we have this ideal path for a given destination, we should store this information, as it would be very inefficient to calculate this every single time we need to go there. The storage database is called the FIB (Forwarding Information Base). The FIB will be a subset of the RIB, as it will only ever contain the best path for a destination at any given time, while the RIB keeps all the available paths, even the non-ideal ones.
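
As a rough illustration of the RIB-to-FIB relationship (not Cloudflare’s actual implementation; the routes and the shortest-AS-path tie-breaker below are simplified stand-ins for the full BGP best-path algorithm):

# RIB: every learned path per prefix. FIB: only the best one gets installed.
RIB = {
    "203.0.113.0/24": [
        {"next_hop": "192.0.2.1", "as_path": [64500, 64501, 64502]},
        {"next_hop": "192.0.2.9", "as_path": [64510, 64502]},
    ],
    "198.51.100.0/24": [
        {"next_hop": "192.0.2.1", "as_path": [64500]},
    ],
}

FIB = {
    prefix: min(paths, key=lambda p: len(p["as_path"]))   # prefer the shortest AS path
    for prefix, paths in RIB.items()
}

for prefix, best in FIB.items():
    print(prefix, "->", best["next_hop"])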

With these individual components, routers can make packets go from point A to point B in a blink of an eye.

Here are some of the more specific functions our ASICs need to perform:

  1. FIB install: Once the router has calculated its FIB, it’s important the router can access this as quickly as possible. To do so, the ASIC will install (write) this calculated FIB into the TCAM, so any lookups can happen as quickly as possible.

  2. Packet forwarding lookups: as we need to know where to send a received packet, we look up this information in TCAM, which is, as we mentioned, incredibly fast.

  3. Stateless Firewall: while a router routes packets between destinations, you also want to ensure that certain packets don’t reach a destination at all. This can be done using either a stateless or stateful firewall. “State” in this case refers to TCP state, so the router would need to understand if a connection is new, or already established. As maintaining state is a complex issue, which requires storing tables, and can quickly consume a lot of memory, most routers will only operate a stateless firewall.
    Instead, stateful firewalls often have their own appliances. At Cloudflare, we’ve opted to move maintaining state to our compute nodes, as that severely reduces the state-table (one router for all state vs X metals for all state combined). A stateless firewall makes use of the TCAM again to store rules on what to do with specific types of packets. For example, one of the rules we employ at our edge is DENY-BOGON-RANGES, in which we discard traffic sourced from RFC1918 space (and other unroutable space); a small illustration of such a rule follows this list. As this makes use of TCAM, it can all be done at line rate (the maximum speed of the interface).

  4. Advanced features, such as GRE encapsulation: modern networking isn’t just packet switching and packet routing anymore, and more advanced features are needed. One of these is encapsulation. With packet encapsulation, a system will put a data packet into another data packet. Using this technique, it’s possible to build a network on top of an existing network (an overlay). Overlays can be used to build a virtual backbone for example, in which multiple locations can be virtually connected through the Internet.
    While you can encapsulate packets on a CPU (we do this for Magic Transit), there are considerable challenges in doing so in software. As such, the ASIC can have built-in functionality to encapsulate a packet in a multitude of protocols, such as GRE. You may not want encapsulated packets to have to take a second trip through your entire pipeline, as this adds latency, so these shortcuts can also be built into the chip.

  5. MPLS, EVPN, VXLAN, SDWAN, SDN, …: I ran out of buzzwords to enumerate here, but while MPLS isn’t new (the first RFC was created in 2001), it’s a rather advanced requirement, just as the others listed, which means not all ASIC vendors will implement this for all their chips due to the increased complexity.
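
Here is the bogon-drop rule from point 3 sketched in Python (an illustration of the logic only: the real rule is a TCAM entry evaluated at line rate in the ASIC, and the rule set is simplified here to just the RFC1918 ranges):

import ipaddress

RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def permit(src_addr: str) -> bool:
    """Stateless check: drop packets whose source is unroutable private space."""
    src = ipaddress.ip_address(src_addr)
    return not any(src in net for net in RFC1918)

print(permit("192.168.1.50"))   # False: discarded at the edge
print(permit("203.0.113.7"))    # True: forwarded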

Vendor Landscape

At Cloudflare, we interact with both hardware and software vendors on a daily basis while operating our global network. As we’re talking about ASICs today, we’ll explore the hardware landscape, but some hardware vendors also have their own NOS (Network Operating System).
There’s a vast selection of hardware out there, all with different features and pricing. It can become incredibly hard to see the wood for the trees, so we’ll focus on four important distinguishing factors:

  • Throughput: how many bits can the ASIC push through?
  • Buffer size: how many bits can the ASIC store in memory in case of resource contention?
  • Programmability: how easy is it for a third party programmer like Cloudflare to interact directly with the ASIC?
  • Feature set: how many advanced things outside of routing/switching can the ASIC do?

The landscape is so varied because different companies have different requirements. A company like Cloudflare has different expectations for its network hardware than your typical corner shop. Even within our own network we’ll have different requirements for the different layers that make up our network.

Broadcom

The elephant in the networking room (or is it the jumbo frame in the switch?) is Broadcom. Broadcom is a semiconductor company, with their primary revenue in the wired infrastructure segment (over 50% of revenue[3]). While they’ve been around since 1991, they’ve become an unstoppable force in the last 10 years, in part due to their reliance on Apple (25% of revenue). As a semiconductor manufacturer, their market dominance is primarily achieved by acquiring other companies. A great example is the acquisition of Dune Networks, which has become an excellent revenue generator as the StrataDNX series of ASIC (Arad, QumranMX, Jericho). As such, they have become the biggest ASIC vendor by far, and own 59% of the entire Ethernet Integrated Circuits market[4].

As such, they supply a lot of merchant silicon to Cisco, Juniper, Arista and others. Up until recently, if you wanted to use the Broadcom SDK to accelerate your packet forwarding, you had to sign so many NDAs you might get a hand cramp, which made programming their chips a lot trickier. This changed recently when Broadcom open-sourced their SDK. Let’s have a quick look at some of their products.

Tomahawk

The Tomahawk line of ASICs is the bread-and-butter of the enterprise market. They’re cheap and incredibly fast. The first generation of Tomahawk chips did 3.2Tbps linerate, with low-latency switching. The latest generation of this chip (Tomahawk 4) does 25.6Tbps in a 7nm transistor footprint[5]. As you can’t have cheap, fast, and fully featured in a single package, this means you lose out on features: you’re missing most of the more advanced networking technologies such as VXLAN, and you have no buffer to speak of.
As an example of a different vendor using this silicon, you can have a look at the Juniper QFX5200 switching platform.

StrataDNX (Arad, QumranMX, Jericho)

These chipsets came through the acquisition of Dune Networks, and are a collection of high-bandwidth, deep-buffer (a large amount of memory available to store, or buffer, packets) chips, allowing them to be deployed in versatile environments, including the Cloudflare edge. The Arista DCS-7280SR that we run in some of our edge locations as an edge router runs on the Jericho chipset. Since then, the chips have evolved, and with Jericho2, Broadcom now has a 10Tbps deep-buffer chip[6]. With their fabric chip (which links multiple ASICs together), you can build switches with 48x400G ports[7] without much effort.
Cisco built their NCS5500 line of routers using the QumranMX[8].

Trident

This ASIC is an upgrade from the Tomahawk chipset, with a complex and extensive feature set, while maintaining high throughput rates. The latest Trident4 does 12.8Tbps at incredibly low latencies[9], making it an incredibly flexible platform. It unfortunately has no buffer space to speak of, which limits its scope for Cloudflare, as we need the buffer space to be able to switch between the different port speeds we have on our edge routers. The Arista 7050X and 7300X are built on top of this.

Intel

Intel is well known in the network industry for building stable and high-performance 10G NICs (Network Interface Controller). They’re not known for ASICs. They made an initial attempt with their acquisition of Fulcrum[10], which built the FM6000[11] series of ASIC, but nothing of note was really built with them. Intel decided to try again in 2019 with their acquisition of Barefoot. This small manufacturer is responsible for the Barefoot Tofino ASIC, which may well be a fundamental paradigm shift in the network industry.

Barefoot Tofino

The Tofino[12] is built using a PISA (Protocol Independent Switch Architecture), and using P4 (Programming Protocol-Independent Packet Processors)[13], you can program the data-plane (packet forwarding) as you see fit. It’s a drastic move away from the traditional method of networking, in which direct programming of the ASIC isn’t easily possible, and definitely not through a standard programming language. As an added benefit, P4 also allows you to perform a formal verification of your forwarding program, and be sure that it will do what you expect it to. Caveat: OpenFlow tried this, but unfortunately never really got much traction.

There are multiple variations of the Tofino 1 available, but the top-end ASIC has a 6.5Tbps linerate capacity. As the ASIC is programmable, its featureset is as rich as you’d want it to be. Unfortunately, the chip does not come with a lot of buffer memory, so we can’t deploy these as edge devices (yet). Both Arista (7170 Series[15]) and Cisco (Nexus 34180YC and 3464C series[16]) have built equipment with the Tofino chip inside.

Mellanox

As some of you may know, Mellanox is the vendor that recently got acquired by Nvidia, which also provides our 25G NICs in our compute nodes. Besides NICs, Mellanox has a well-established line of ASICs, mostly for switching.

Spectrum

The latest iteration of this ASIC, Spectrum 3 offers 12.8Tbps switching capacity, with an extensive featureset, including Deep Packet Inspection and NAT. This chip allows for building dense high-speed port devices, going up to 25.6Tbps[17]. Buffering wise, there’s none to really speak of (64MB). Mellanox also builds their own hardware platforms. Unlike the other vendors below, they aren’t shipped with the Mellanox Operating System, instead, they offer you a variety of choices to run on top, including Cumulus Linux (which was also acquired by Nvidia 🤔).

As mentioned, while we use their NIC technology extensively, we currently don’t have any Mellanox ASIC silicon in our network.

Juniper

Juniper is a network hardware supplier, and currently the biggest supplier of network equipment for Cloudflare. As previously mentioned in the Broadcom section, Juniper buys some of their silicon from Broadcom, but they also have a significant lineup of home-grown silicon, which can be split into 2 families: Trio and Express.

Express

The Express family is the switching-skewed family, where bandwidth is a priority, while still maintaining a broad range of feature capabilities. These chips live in the same application landscape as the Broadcom StrataDNX chips.

Paradise (Q5)

The Q5 is the new generation of the Juniper switching ASIC[18]. While by itself it doesn’t boast high linerates (500Gbps), when combined into a chassis with a fabric chip (Clos network in this case), they can produce switches (or line cards) with up to 12Tbps of throughput capacity[19]. In addition to allowing for high-throughput, dense network appliances, the chip also comes with a staggering amount of buffer space (4GB per ASIC), provided by external HMC (Hybrid Memory Cube). In this HMC, they’ve also decided to put the FIB, MAC and other tables (so no TCAM).
The Q5 chip is used in their QFX1000 lineup of switches, which include the QFX10002-36Q, QFX10002-60C, QFX10002-72Q and QFX10008, all of which are deployed in our datacenters, as either edge routers or core aggregation switches.

ExpressPlus (ZX)

The ExpressPlus is the more feature-rich and faster evolution of the Paradise chip. It offers double the bandwidth per chip (1Tbps) and is built into a combined Clos-fabric reaching 6Tbps in a 2U form-factor (PTX10002). It also has an increased logical scale, which comes with bigger buffers, larger FIB storage, and more ACL space.

The ExpressPlus drives some of the PTX line of IP routers, together with its newest sibling, Triton.

Triton (BT)

Triton is the latest generation of ASIC in the Express family, with 3.6Tbps of capacity per chip, making way for some truly bandwidth-dense hardware. Both Triton and ExpressPlus are 400GE capable.

Trio

The Trio family of chips is primarily used in the feature-heavy MX routing platform, and is currently in its 5th generation.

A Juniper MPC4E-3D-32XGE line card

Trio Eagle (Trio 4.0) (EA)

The Trio Eagle is the previous generation of the Trio Penta, and can be found on the MPC7E line cards, for example. It’s a feature-rich ASIC, with 400Gbps of forwarding capacity and significant buffer and TCAM capacity (as is to be expected from a routing platform ASIC).

Trio Penta (Trio 5.0) (ZT)

Penta is the new generation routing chip, built for the MX platform routers. On top of being a very beefy chip capable of 500Gbps per ASIC, which allows Juniper to build line cards with up to 4Tbps of capacity, it also has a lot of baked-in features, offering advanced hardware offloading for MACsec and Layer 3 IPsec, for example.

The Penta chip is packaged on the MPC10E and MPC11E line card, which can be installed in multiple variations of the MX chassis routers (MX480 included).

Cisco

Last but not least, there’s Cisco. As the saying goes “nobody ever got fired for buying Cisco”, they’re the biggest vendor of network solutions around. Just like Juniper, they have a mixed product fleet of merchant silicon, as well as home-grown. While we used to operate Cisco routers as edge routers (Cisco ASR 9000), this is no longer the case. We do still use them heavily for our ToR (Top-of-Rack) switching needs, utilizing both their Nexus 5000 series and Nexus 9000 series switches.

Bigsur

Bigsur is custom silicon developed for the Nexus 6000 line of switches (confusingly, the switches themselves are called Cisco Nexus 5672UP and Cisco Nexus 6001). In our specific model, the Cisco Nexus 5672UP, there are 7 of them interconnected, providing 10G and 40G connectivity. Unfortunately, Cisco is a lot more tight-lipped about their ASIC capabilities, so I can’t go as deep as I did with the Juniper chips. Feature-wise, there’s not a lot we require from them in our edge network: they’re simple Layer 2 forwarding switches with no added requirements. Buffer-wise, they use a system called Virtual Output Queueing, just like the Juniper Express chip. Unlike the Juniper silicon, the Bigsur ASIC doesn’t come with a lot of TCAM or buffer space.

Tahoe

The Tahoe is the Cisco ASIC found in the Cisco 9300-EX switches, also known as the LSE (Leaf Spine Engine). It offers higher-density port configurations compared to the Bigsur (1.6Tbps)[20]. Overall, this ASIC is a maturation of the Bigsur silicon, offering more advanced features such as advanced VXLAN+EVPN fabrics, greater port flexibility (10G, 25G, 40G and 100G), and increased buffer sizes (40MB). We use this ASIC extensively in both our edge data centers as well as in our core data centers.

Conclusion

A lot of different factors come into play when making the decision to purchase the next generation of Cloudflare network equipment. This post only scratches the surface of technical considerations to be made, and doesn’t come near any other factors, such as ecosystem contributions, openness, interoperability, or pricing. None of this would’ve been possible without the contributions from other network engineers—this post was written on the shoulders of giants. In particular, thanks to the excellent work by Jim Warner at UCSC, the engrossing book on the new MX platforms, written by David Roy (Day One: Inside the MX 5G), as well as the best book on the Juniper QFX lineup: Juniper QFX10000 Series by Douglas Richard Hanks Jr, and to finish it off, the Summary of Network ASICs post by Justin Pietsch.


  1. https://cloud.google.com/tpu/ ↩︎

  2. https://angel.co/company/spacex/jobs/744408-sr-fpga-asic-design-engineer ↩︎

  3. https://marketrealist.com/2017/02/wired-infrastructure-segment-protects-broadcom/ ↩︎

  4. https://www.wsj.com/articles/broadcom-lands-deals-to-place-components-in-apple-smartphones-11579821914 ↩︎

  5. https://www.globenewswire.com/news-release/2019/12/09/1958047/0/en/Broadcom-Ships-Tomahawk-4-Industry-s-Highest-Bandwidth-Ethernet-Switch-Chip-at-25-6-Terabits-per-Second.html ↩︎

  6. https://www.broadcom.com/products/ethernet-connectivity/switching/stratadnx/BCM88690 ↩︎

  7. https://www.ufispace.com/products/telco/core-edge/s9705-48d ↩︎

  8. https://www.ciscolive.com/c/dam/r/ciscolive/emea/docs/2019/pdf/BRKSPG-2900.pdf ↩︎

  9. https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56880-series ↩︎

  10. https://newsroom.intel.com/news-releases/intel-to-acquire-fulcrum-microsystems/ ↩︎

  11. https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/ethernet-switch-fm5000-fm6000-datasheet.pdf ↩︎

  12. https://barefootnetworks.com/products/brief-tofino/ ↩︎

  13. http://www.sigcomm.org/node/3503 ↩︎

  14. https://github.com/p4lang/p4lang.github.io/blob/master/assets/p4_switch_model-600px.png ↩︎

  15. https://www.arista.com/assets/data/pdf/Whitepapers/7170_White_Paper.pdf ↩︎

  16. https://www.barefootnetworks.com/press-releases/barefoot-networks-to-showcase-technologies-to-build-fast-and-resilient-networks-using-deep-insight-and-tofino-powered-cisco-nexus-switches-at-cisco-live-us-2019/ ↩︎

  17. https://www.mellanox.com/products/ethernet-switches/sn4000 ↩︎

  18. https://www.juniper.net/assets/us/en/local/pdf/whitepapers/2000599-en.pdf ↩︎

  19. https://www.juniper.net/assets/us/en/local/pdf/datasheets/1000531-en.pdf#page=7 ↩︎

  20. https://www.cisco.com/c/dam/global/fr_ch/solutions/data-center-virtualization/pdf/Cisco_Nexus_9300_EX_Platform.pdf#page=8 ↩︎

Investigate VPC flow with Amazon Detective

Post Syndicated from Ross Warren original https://aws.amazon.com/blogs/security/investigate-vpc-flow-with-amazon-detective/

Many Amazon Web Services (AWS) customers need enhanced insight into IP network flow. Traditionally, cost, the complexity of collection, and the time required for analysis have led to incomplete investigations of network flows. Having good telemetry is paramount, and VPC Flow Logs are a very important part of a robust centralized logging architecture. The information that VPC Flow Logs provide is frequently used by security analysts to determine the scope of security issues, to validate that network access rules are working as expected, and to help analysts investigate issues and diagnose network behaviors. Flow logs capture information about the IP traffic going to and from EC2 interfaces in a VPC. Each record describes aspects of the traffic flow, such as where it originated and where it was sent to, what network ports were used, and how many bytes were sent.
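
As a rough illustration of what those records contain, the Python sketch below sums accepted bytes per source address from a file of flow log records (this assumes the default flow log field order and a hypothetical local export file; it is not how Detective ingests the data):

from collections import Counter

# Default VPC Flow Log fields: version account-id interface-id srcaddr dstaddr
# srcport dstport protocol packets bytes start end action log-status
bytes_by_src = Counter()
with open("flow-logs.txt") as fh:            # hypothetical export of flow log records
    for line in fh:
        fields = line.split()
        if len(fields) < 14 or fields[0] == "version":
            continue                         # skip headers and malformed lines
        srcaddr, nbytes, action = fields[3], fields[9], fields[12]
        if action == "ACCEPT" and nbytes != "-":
            bytes_by_src[srcaddr] += int(nbytes)

for src, total in bytes_by_src.most_common(5):
    print(src, total)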

Amazon Detective now enables you to interactively examine the details of the virtual private cloud (VPC) network flows of your Amazon Elastic Compute Cloud (Amazon EC2) instances. Amazon Detective makes it easy to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities. Detective automatically collects VPC flow logs from your monitored accounts, aggregates them by EC2 instance, and presents visual summaries and analytics about these network flows. Detective doesn’t require VPC Flow Logs to be configured and doesn’t impact existing flow log collection.

In this blog post, I describe how to use the new VPC flow feature in Detective to investigate an UnauthorizedAccess:EC2/TorClient finding from Amazon GuardDuty. Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data stored in Amazon S3. GuardDuty documentation states that this alert can indicate unauthorized access to your AWS resources with the intent of hiding the unauthorized user’s true identity. I’ll demonstrate how to use Amazon Detective to investigate an instance that was flagged by Amazon GuardDuty to determine whether it is compromised or not.

Starting the investigation in GuardDuty

In my GuardDuty console, I’m going to select the UnauthorizedAccess:EC2/TorClient finding shown in Figure 1, choose the Actions menu, and select Investigate.
 

Figure 1: Investigating from the GuardDuty console


This opens a new browser tab and launches the Amazon Detective console, where I’m presented with the profile page for this finding, shown in Figure 2. You must have Detective enabled to pivot between a GuardDuty finding and Detective. Detective provides profile pages for supported GuardDuty findings and AWS resources (for example, IP address, EC2 instance, user, and role) that include information and data visualizations that summarize observed behaviors and give guidance for interpreting them. Profiles help analysts to determine whether the finding is of genuine concern or a false positive. For resources, profiles provide supporting details for an investigation into a finding or for a general hunt for suspicious activity.
 

Figure 2: Finding profile panel


In this case, the profile page for this GuardDuty UnauthorizedAccess:EC2/TorClient finding provides contextual and behavioral data about the EC2 instance on which GuardDuty has noted the issue. As I dive into this finding, I’m going to be asking questions that help assess whether the instance was in fact accessed unintentionally, such as, “What IP port or network service was in use at that time?,” “Were any large data transfers involved?,” “Was the traffic allowed by my security groups?” Profile pages in Detective organize content that helps security analysts investigate GuardDuty findings, examine unexpected network behavior, and identify other AWS resources that might be affected by a potential security issue.

I begin scrolling down the page and notice the Findings associated with EC2 instance i-9999999999999999 panel. Detective displays related findings to provide analysts with additional evidence and context about potentially related issues. The finding I’m investigating is listed there, as well as an Unusual Behaviors/VM/Behavior:EC2-NetworkPortUnusual finding. GuardDuty builds a baseline on your network traffic and will generate findings where there is traffic outside the calculated normal. While we might not investigate every instance of anomalous traffic, having these alerts correlated by Detective provides context for validating the issue. Keeping this in mind as I scroll down, at the bottom of this profile page, I find the Overall VPC flow volume panel. If you choose the Info link next to the panel title, you can see helpful tips that describe how to use the visualizations and provide ideas for questions to ask within your investigation. These info links are available throughout Detective. Check them out!

Investigating VPC flow in Detective

In this investigation, I’m very curious about the two large spikes in inbound traffic that I see in the Overall VPC flow volume panel, which seem to be visually associated with some unusual outbound traffic spikes. It’s most likely that these outbound spikes are related to the Unusual Behaviors/VM/Behavior:EC2-NetworkPortUnusual finding I mentioned earlier. To start the investigation, I choose the display details for scope time button, shown circled at the bottom of Figure 2. This expands the VPC Flow Details, shown in Figure 3.
 

Figure 3: Our first look at VPC Flow Details


We now can see that each entry displays the volume of inbound traffic, the volume of outbound traffic, and whether the access request was accepted or rejected. Detective provides annotations on the VPC flows to help guide your investigation. These From finding annotations make it clear which flows and resources were involved in the finding. In this case, we can easily see (in Figure 3) the three IP addresses at the top of the list that triggered this GuardDuty finding.

I’m first going to focus on the spikes in traffic that are above the baseline. When I click on one of the spikes in the graph, the time window for the VPC flow activity now matches the dates of these spikes I’m investigating.

If I choose the Inbound Traffic column header, shown in Figure 4, I can find the flows that contributed to the spike during this time window.
 

Figure 4: Inbound traffic spikes


Note that the two large inbound spikes aren’t associated with the IP address from the UnauthorizedAccess:EC2/TorClient finding, based on the Detective annotation From finding. Let’s check the outbound traffic. If I do a quick sort of the table based on the outbound traffic column, as shown in Figure 5, we can also see the outbound spikes, and it isn’t immediately evident whether the spikes are associated with this finding. I could continue to investigate the spikes (because they are a visual anomaly), or focus just on the VPC flow traffic that GuardDuty and Detective have labeled as associated with this TOR finding.
 

Figure 5: Outbound traffic spikes


Let’s focus on the outbound and inbound spikes and see if we can determine what’s happening. The inbound spikes are on port 443, typically an HTTPS port, or a secure web connection. The outbound spikes are on port 22 (ssh), but go to IP addresses that look to be internal based on their addresses of 172.16.x.x. The port 443 traffic might indicate a web server that’s open to the internet and receiving traffic. With further investigation, we can determine if this idea is valid, and continue hunting for potentially malicious traffic.

A good next step would be to investigate the two specific IP addresses to rule out their involvement in the finding. I can do this by right-clicking on either of the external IP addresses and opening a new tab, where I can focus on investigating these two specific IP addresses. I would take this line of investigation to possibly rule out the involvement of these IP addresses in this finding, determine if they regularly communicate with my resources, find out what instance(s) they’re related to, and see if there are other findings associated with these instances or IP addresses. This deeper investigation is outside the scope of this blog post, but it’s something you should be doing in your own environment.

IP addresses in AWS are ephemeral in nature. The unique identifier in VPC flow logs is the Instance ID. At the time of this investigation, 172.16.0.7 is assigned to the instance related to this finding, so let’s continue to take a look at the internal 172.16.0.7 IP address with 218 MB outbound traffic on port 22. I choose 172.16.0.7, and Detective opens up the profile page for this specific IP address, as shown in Figure 6. Here we see some interesting correlations: two other GuardDuty findings related to SSH brute-force attacks. These could be related to our outbound port 22 spikes, because they’re certainly in the window of time we’re investigating.
 

Figure 6: IP address profile panel


As part of a deeper investigation, you would look into the SSH brute-force findings for 198.51.100.254 and 203.0.113.83, but for now I’m interested in what this IP is involved in. Detective easily associates this 172.16.0.7 IP address with the instance that was assigned the IP during the scope time. I scroll down to the bottom of the profile page for 172.16.0.7 and investigate the i-9999999999999999 instance by choosing the instance name.

Filtering VPC flow activity

In Detective, as the investigator we are looking at an instance profile panel, similar to the one in Figure 2, and since we’re interested in VPC flow details, I’m going to scroll down and select display details for scope time.

To focus on specific activity, I can filter the activity details by the following values:

  • IP address
  • Local or remote port
  • Direction
  • Protocol
  • Whether the request was accepted or rejected

I’m going to filter these VPC flow details and just look at port 22 (sshd) inbound traffic. I select the Filter check box and select Local Port and 22, as shown in Figure 7. Detective fills in all the available ports for you, making it easy to complete this filter.
 

Figure 7: Port 22 traffic


The activity details show a few IP addresses related to port 22, and we’re still following the large outbound spikes of traffic. It’s outside the scope of this blog post, but now it would be time to start looking at your security groups and network access control lists (ACLs) and determine why port 22 is open to the internet and sending all this traffic.

Understanding traffic behavior

As an investigator, I now have a good picture of the traffic related to the initial finding, and by diving deeper we’re able to discover other interesting traffic during the same timeframe. While we may not always determine “who has done it,” the goal should be to improve our understanding of the behavior of our environment and gather important technical evidence. Detective helps you identify and investigate anomalies to give you insight into your environment. If we were to continue our investigation into the finding, here are some actions we can take within Detective.

Investigate VPC findings with Detective:

  • Perform ports and utilization analysis
    • Identify service and ephemeral ports
    • Determine whether traffic was accepted or rejected based on security groups and NACL configurations
    • Investigate possible reconnaissance traffic by exploring the significant amount of rejected traffic
  • Correlate EC2 instances to TCP/IP ports and IPs
  • Analyze traffic spikes and anomalies
  • Discover traffic patterns and make behavioral correlations

Explore EC2 instance behavior with Detective:

  • Directional Traffic Analysis
  • Investigate possible data exfiltration events by digging into large transfers
  • Enumerate distinct IP connections and sort and filter by protocol, amount of traffic, and traffic direction
  • Gather data related to a spike in port count from a single IP address (potential brute force) or multiple IP addresses (distributed denial of service (DDoS))

Additional forensics steps to consider

  • Snapshot EC2 Volumes
  • Memory dump of EC2 instance
  • Isolate EC2 instance
  • Review your authentication strategy and assess whether the chosen authentication method is sufficient to protect your asset

Summary

Without requiring you to set up infrastructure or spend time configuring log ingestion, Detective collects, organizes, and presents relevant data for your threat analysis and investigations. Security and operations teams will find this new capability helpful for simplifying EC2 traffic analysis, validating security group permissions, and diagnosing EC2 instance behavior. Detective does the heavy lifting of storing, and analyzing VPC flow data so you can focus on quickly answering your investigative questions. VPC network flow details are available now in all Detective supported Regions and are included as part of your service subscription.

To get started, you can enable a 30-day free trial of Amazon Detective. See the AWS Regional Services page for all the regions where Detective is available. To learn more, visit the Amazon Detective product page.

Are you a visual learner? Check out Amazon Detective Overview and Demonstration. This video helps you learn how and when to use Amazon Detective to improve the security of your AWS resources.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Ross Warren

Ross Warren is a Solution Architect at AWS based in Northern Virginia. Prior to his work at AWS, Ross’ areas of focus included cyber threat hunting and security operations. Ross has worked at a handful of startups and has enjoyed the transition to AWS because he can continue to build solutions for customers on today’s most innovative platform.

Author

Jim Miller

Jim is a Solution Architect at AWS based in Connecticut. Jim has worked within cyber security his entire career with areas of focus including cyber security architecture and incident response. At AWS he loves building secure solutions for customers to enable teams to build and innovate with confidence.

Delivering HTTP/2 upload speed improvements

Post Syndicated from Junho Choi original https://blog.cloudflare.com/delivering-http-2-upload-speed-improvements/


Cloudflare recently shipped improved upload speeds across our network for clients using HTTP/2. This post describes our journey from troubleshooting an issue to fixing it and delivering faster upload speeds to the global Internet.

We launched speed.cloudflare.com in May 2020 to give our users insight into how well their networks perform. The test provides download, upload and latency tests. Soon after release, we received reports from a small number of users that upload speeds were sometimes underreported. Our investigation determined that it seemed to happen for end users with high upload bandwidth available (several-hundred-Mbps-class cable modem or fiber service). Our speed tests are performed via browser JavaScript, and most browsers use HTTP/2 by default. We found that HTTP/2 upload speeds were sometimes much slower than HTTP/1.1 (both over TLS) when the user had high available upload bandwidth.

Upload speed is more important than ever, especially for people using home broadband connections. As many people have been forced to work from home they’re using their broadband connections differently than before. Prior to the pandemic broadband traffic was very asymmetric (you downloaded way more than you uploaded… think listening to music, or streaming a movie), but now we’re seeing an increase in uploading as people video conference from home or create content from their home office.

Initial Tests

User reports were focused on particularly fast home networks. I set up a dummynet network simulator to test upload speed in a controlled environment. I launched a linux VM running our code inside my Macbook Pro and set up a dummynet between the VM and Mac host.  Measuring upload speed is simple – create a file and upload using curl to an endpoint which accepts a request body. I ran the same test 20 times and took a median upload speed (Mbps).

% dd if=/dev/urandom of=test.dat bs=1M count=10
% curl --http1.1 -w '%{speed_upload}\n' -sf -o/dev/null --data-binary @test.dat https://edge/upload-endpoint
% curl --http2 -w '%{speed_upload}\n' -sf -o/dev/null --data-binary @test.dat https://edge/upload-endpoint

Stepping up to uploading a 10MB object over a network with 200Mbps of available bandwidth and 40ms RTT, the result was surprising. Using our production configuration, HTTP/2 upload speed tested at almost half of HTTP/1.1 under the same test conditions (higher is better).


The result may differ depending on your network, but the gap is bigger when the network is fast. On a slow network, like my home cable connection (5Mbps upload and 20ms RTT), HTTP/2 upload speed was almost identical to the performance observed with HTTP/1.1.

Receiver Flow Control

Before we get into more detail on this topic, my intuition suggested the issue was related to receiver flow control. Usually the client (browser or any HTTP client) is the receiver of data, but in the case the client is uploading content to the server, the server is the receiver of data. And the receiver needs some type of flow control of the receive buffer.

How we handle receiver flow control differs between HTTP/1.1 and HTTP/2. For example, HTTP/1.1 doesn’t define protocol-level receiver flow control since there is no multiplexing of requests in the connection and it’s up to the TCP layer which handles receiving data. Note that most of the modern OS TCP stacks have auto tuning of the receive buffer (we will revisit that later) and they tune based on the current BDP (bandwidth-delay product).

In the case of HTTP/2, there is a stream-level flow control mechanism because the protocol supports multiplexing of streams. Each HTTP/2 stream has its own flow control window and there is connection level flow control for all streams in the connection. If it’s too tight, the sender will be blocked by the flow control. If it’s too loose we may end up wasting memory for buffering. So keeping it optimal is important when implementing flow control and the most optimal strategy is to keep the receive buffer matching the current BDP. BDP represents the maximum bytes in flight in the given network and can be used as an optimal buffer size.
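
To put numbers on that, here is a quick back-of-the-envelope BDP calculation in Python for the 200Mbps / 40ms lab network used in this post (added for illustration):

bandwidth_bps = 200 * 1_000_000      # 200 Mbps of available upload bandwidth
rtt_s = 0.040                        # 40 ms round-trip time

bdp_bytes = bandwidth_bps / 8 * rtt_s
print(f"BDP ≈ {bdp_bytes / 1024:.0f} KB")   # ≈ 977 KB, far larger than a 64KB or 128KB window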

How NGINX handles the request body buffer

Initially I tried to find a parameter that controls NGINX upload buffering, to see whether tuning its value improved the result. There are a couple of parameters related to uploading a request body, such as proxy_buffering and client_body_buffer_size.

And this one is HTTP/2 specific: http2_body_preread_size.

Cloudflare does not use the proxy_buffering directive, so it can be immediately discounted. client_body_buffer_size is the size of the request body buffer, which is used regardless of the protocol, so it applies to HTTP/1.1 and HTTP/2 as well.

When looking into the code, here is how it works:


  • HTTP/1.1: use a buffer of client_body_buffer_size bytes between the client and the upstream, simply repeating reads from the client and writes to the upstream through this buffer.
  • HTTP/2: since we need to send flow control window updates for HTTP/2 DATA frames, two parameters are involved:
    • http2_body_preread_size: the size of the initial request body read before NGINX starts sending it to the upstream.
    • client_body_buffer_size: the size of the request body buffer.
    • These two parameters are used to allocate the request body buffer during an upload. Here is a brief summary of how unbuffered upload works:
      • Allocate a single request body buffer whose size is the maximum of http2_body_preread_size and client_body_buffer_size. This means that if http2_body_preread_size is 64KB and client_body_buffer_size is 128KB, a 128KB buffer is allocated. We use 128KB for client_body_buffer_size.
      • The HTTP/2 SETTINGS_INITIAL_WINDOW_SIZE of the stream is set to http2_body_preread_size, and we use 64KB as the default (the RFC 7540 default value).
      • The HTTP/2 module reads up to http2_body_preread_size before starting to send to the upstream.
      • After flushing the preread buffer, keep reading from the client, writing to the upstream, and sending WINDOW_UPDATE frames back to the client as necessary until the request body is fully received.

To summarise what this means: HTTP/1.1 simply uses a single buffer, so the TCP socket buffers do the flow control. With HTTP/2, however, the application layer also performs receiver flow control, and NGINX uses a fixed-size buffer for the receiver. This limits upload speed whenever the link's BDP is larger than the request body buffer size: the bottleneck is HTTP/2 flow control because the buffer is too tight.
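
To put a number on that bottleneck: a flow-control-limited sender can push at most roughly one receive window per RTT. A minimal sketch using the 128KB production buffer and the 200Mbps / 40ms lab network from above (again just illustrative arithmetic, not NGINX code):

#include <stdio.h>

int main(void) {
    double window_bytes = 128.0 * 1024.0; /* client_body_buffer_size, 128KB */
    double rtt_s = 0.040;                 /* 40 ms */

    /* Sustained throughput is capped at roughly window / RTT. */
    double cap_mbit_per_s = window_bytes * 8.0 / rtt_s / 1e6;
    printf("upload cap ~= %.0f Mbit/s\n", cap_mbit_per_s); /* ~26 Mbit/s */
    return 0;
}

That is only a fraction of the 200Mbps the link can carry, which explains why HTTP/2 uploads lag behind HTTP/1.1, whose flow control is handled by the auto-tuned TCP receive buffer.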

We’re going to need a bigger buffer?

In theory, bigger buffer sizes should avoid upload bottlenecks, so I tried a few out by running my tests again. The previous chart result is now labelled “prod” and plotted alongside HTTP/2 tests with client_body_buffer_size set to 256KB, 512KB and 1024KB:

[Chart: HTTP/2 upload speed with client_body_buffer_size of 128KB ("prod"), 256KB, 512KB and 1024KB]

It appears 512KB is an optimal value for client_body_buffer_size.

What if I test with a different network parameter? Here is the same comparison with a 10ms RTT; in this case 256KB looks optimal, which makes sense if the available bandwidth stays at 200Mbps, since a 10ms RTT then gives a BDP of roughly 250KB.

[Chart: the same buffer size comparison with a 10ms RTT]

Both cases look much better than the current 128KB and reach performance similar to, or even better than, HTTP/1.1. However, the optimal buffer size is a moving target, and making the buffer too large can also hurt performance: we need a smart way to find the optimal size.

Autotuning request body buffer size

One of the ideas that can help in this kind of situation is autotuning. For example, modern TCP stacks autotune their receive buffers automatically. In production, our edge also has TCP receive buffer autotuning enabled by default:

net.ipv4.tcp_moderate_rcvbuf = 1

But in the case of HTTP/2, TCP buffer autotuning is not very effective because the HTTP/2 layer does its own flow control, and the existing 128KB buffer was too small for a high-BDP link. At this point, I decided to pursue autotuning of the HTTP/2 receive buffer size as well, similar to what TCP does.

The basic idea is that NGINX doubles the HTTP/2 request body buffer size based on the measured BDP. Here is the algorithm currently implemented in our version of NGINX (a simplified sketch follows the list):

  • Allocate a request body buffer as explained above.
  • For every RTT (using Linux tcp_info), update the current BDP.
  • Double the request body buffer size when the current BDP > (receiver window / 4).
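
Below is a simplified, self-contained model of that rule, not the actual NGINX patch; the function name, the constant BDP value and the 4MB cap are assumptions made purely for illustration:

#include <stdio.h>
#include <stddef.h>

/* Illustrative upper bound on how far the buffer may grow (assumed, not
 * taken from the real implementation). */
#define BODY_BUFFER_MAX (4u * 1024u * 1024u)

/* Called once per RTT tick: double the request body buffer while the
 * estimated BDP exceeds a quarter of the current receiver window. */
static size_t grow_if_needed(size_t window, size_t bdp)
{
    if (bdp > window / 4 && window < BODY_BUFFER_MAX)
        window *= 2;
    return window;
}

int main(void)
{
    size_t window = 128 * 1024;     /* start from the 128KB production buffer */
    const size_t bdp = 1000 * 1000; /* ~1MB BDP, e.g. 200Mbit/s at 40ms RTT */

    /* In the real implementation the BDP estimate is refreshed every RTT
     * from Linux tcp_info; here it is just a constant. */
    for (int rtt = 1; rtt <= 6; rtt++) {
        window = grow_if_needed(window, bdp);
        printf("after RTT %d: window = %zu KB\n", rtt, window / 1024);
    }
    return 0;
}

On this assumed input the buffer walks up from 128KB until it comfortably covers the BDP, mirroring the behaviour seen in the lab results below.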

Test Result

Lab Test

Here is a test result with HTTP/2 autotuned upload enabled (still using a client_body_buffer_size of 128KB). You can see that “h2 autotune” does pretty well: similar to, or even slightly faster than, HTTP/1.1 speed, which was the initial goal. It might be slightly worse than a hand-picked optimal buffer size for the given conditions, but NGINX now picks a good buffer size automatically based on network conditions.

[Charts: lab upload speed with “h2 autotune” compared to HTTP/1.1 and fixed HTTP/2 buffer sizes]

Production test

After we deployed this feature, I ran similar tests against our production edge, uploading a 10MB file from well connected client nodes to our edge. I created a Linux VM instance in Google Cloud and ran the upload test where the network is very fast (a few Gbps) and low latency (<10ms).

Here is the result of running the test from the Google Cloud Belgium region against our CDG (Paris) PoP, which has a 7ms RTT. It looks very good, with almost a 3x improvement.

[Chart: production upload speed from Google Cloud Belgium to the CDG PoP (7ms RTT)]

I also tested between the Google Cloud Tokyo region and our NRT (Tokyo) PoP, which had a 2.3ms RTT. Although this setup is not realistic for home users, the results are interesting: a 128KB fixed buffer already performs well at such a low RTT, but HTTP/2 with buffer autotuning outperforms even HTTP/1.1.

[Chart: production upload speed from Google Cloud Tokyo to the NRT PoP (2.3ms RTT)]

Summary

HTTP/2 upload buffer autotuning is now fully deployed on the Cloudflare edge. Customers should see improved upload performance for all HTTP/2 connections, including the speed tests on speed.cloudflare.com. The autotuning logic works well in most cases, and HTTP/2 upload is now much faster than before! When we think about performance we usually think about download speed or latency reduction, but faster uploading also helps users working from home who need to push a lot of data: photo and video sharing apps, content creation, video conferencing or self-broadcasting.

Many thanks to Lucas Pardue and Rustam Lalkaka for great feedback and suggestions on the article.