Tag Archives: Spectrum

Bringing Your Own IPs to Cloudflare (BYOIP)

Post Syndicated from Tom Brightbill original https://blog.cloudflare.com/bringing-your-own-ips-to-cloudflare-byoip/

Today we’re thrilled to announce general availability of Bring Your Own IP (BYOIP) across our Layer 7 products as well as Spectrum and Magic Transit services. When BYOIP is configured, the Cloudflare edge will announce a customer’s own IP prefixes and the prefixes can be used with our Layer 7 services, Spectrum, or Magic Transit. If you’re not familiar with the term, an IP prefix is a range of IP addresses. Routers create a table of reachable prefixes, known as a routing table, to ensure that packets are delivered correctly across the Internet.

As part of this announcement, we are listing BYOIP on the relevant product pages, adding developer documentation, and adding UI support for controlling your prefixes. Previously, support was API only.

Customers choose BYOIP with Cloudflare for a number of reasons. It may be the case that your IP prefix is already allow-listed in many important places, and updating firewall rules to also allow Cloudflare address space may represent a large administrative hurdle. Additionally, you may have hundreds of thousands, or even millions, of end users pointed directly to your IPs via DNS, and it would be hugely time consuming to get them all to update their records to point to Cloudflare IPs.

Over the last several quarters we have been building tooling and processes to support customers bringing their own IPs at scale. At the time of writing this post we’ve successfully onboarded hundreds of customer IP prefixes. Of these, 84% have been for Magic Transit deployments, 14% for Layer 7 deployments, and 2% for Spectrum deployments.

When you BYOIP with Cloudflare, we announce your IP space in over 200 cities around the world and tie your IP prefix to the service (or services!) of your choosing. Your IP space will be protected and accelerated as if it were Cloudflare’s own. We can also support regional deployments for BYOIP prefixes if you have technical and/or legal requirements limiting where your prefixes can be announced, such as data sovereignty.

You can turn on advertisement of your IPs from the Cloudflare edge with a click of a button and be live across the world in a matter of minutes.

All BYOIP customers receive network analytics on their prefixes. Additionally all IPs in BYOIP prefixes can be considered static IPs. There are also benefits specific to the service you use with your IP prefix on Cloudflare.

Layer 7 + BYOIP:

Cloudflare has a robust Layer 7 product portfolio, including products like Bot Management, Rate Limiting, Web Application Firewall, and Content Delivery, to name just a few. You can choose to BYOIP with our Layer 7 products and receive all of their benefits on your IP addresses.

For Layer 7 services, we can support a variety of IP to domain mapping requests including sharing IPs between domains or putting domains on dedicated IPs, which can help meet requirements for things such as non-SNI support.

If you are also an SSL for SaaS customer using BYOIP, you have increased flexibility to change the IP addresses returned for custom hostnames in the event an IP is unserviceable for some reason.

Spectrum + BYOIP:

Spectrum is Cloudflare’s solution to protect and accelerate applications that run any UDP or TCP protocol. The Spectrum API supports BYOIP today. Spectrum customers who use BYOIP can specify, through Spectrum’s API, which IPs they would like associated with a Spectrum application.
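
As a rough illustration, creating a Spectrum app that answers on addresses from your own prefix might look something like the sketch below. This is a hypothetical example, not production configuration: the zone ID, token, hostname and addresses are placeholders, and the exact request body (in particular the edge_ips field) is an assumption to be checked against the developer documentation.

# Hypothetical sketch: creating a Spectrum app pinned to IPs from a BYOIP prefix.
import requests

ZONE_ID = "0123456789abcdef0123456789abcdef"  # placeholder zone ID
API_TOKEN = "REDACTED"                         # token with Spectrum edit permissions

app = {
    "protocol": "tcp/22",                                  # proxy SSH
    "dns": {"type": "CNAME", "name": "ssh.example.com"},   # placeholder hostname
    "origin_direct": ["tcp://192.0.2.10:22"],               # placeholder origin
    # Assumption: static edge IPs drawn from the onboarded BYOIP prefix.
    "edge_ips": {"type": "static", "ips": ["198.51.100.4", "2001:db8::4"]},
}

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/spectrum/apps",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=app,
)
resp.raise_for_status()
print(resp.json()["result"]["id"])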

Magic Transit + BYOIP:

Magic Transit is a Layer 3 security service which processes all your network traffic by announcing your IP addresses and attracting that traffic to the Cloudflare edge for processing. Magic Transit supports sophisticated packet filtering and firewall configurations. BYOIP is a requirement for using the Magic Transit service: as Magic Transit is an IP-level service, Cloudflare must be able to announce your IPs in order to provide it.

Bringing Your IPs to Cloudflare: What is Required?

Before Cloudflare can announce your prefix we require some documentation to get started. The first is something called a ‘Letter of Authorization’ (LOA), which details information about your prefix and how you want Cloudflare to announce it. We then share this document with our Tier 1 transit providers in advance of provisioning your prefix. This step is done to ensure that Tier 1s are aware we have authorization to announce your prefixes.

Secondly, we require that your Internet Routing Registry (IRR) records are up to date and reflect the data in the LOA. This typically means ensuring the entry in your regional registry (e.g. ARIN, RIPE, or APNIC) is updated.

Once the administrivia is out of the way, work with your account team to learn when your prefixes will be ready to announce.

We also encourage customers to use RPKI and can support this for customer prefixes. We have blogged and built extensive tooling to make adoption of this protocol easier. If you’re interested in BYOIP with RPKI support just let your account team know!

Configuration

Each customer prefix can be announced via the ‘dynamic advertisement’ toggle in either the UI or API, which will cause the Cloudflare edge to either announce or withdraw a prefix on your behalf. This can be done as soon as your account team lets you know your prefixes are ready to go.
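
For illustration, toggling dynamic advertisement from a script might look roughly like the sketch below. The account ID, prefix ID and token are placeholders, and the endpoint path is an assumption based on the BYOIP developer documentation, so treat the docs as the source of truth.

# Hypothetical sketch: announce or withdraw a BYOIP prefix from the Cloudflare edge.
import requests

ACCOUNT_ID = "0123456789abcdef0123456789abcdef"  # placeholder account ID
PREFIX_ID = "fedcba9876543210fedcba9876543210"   # placeholder prefix ID
API_TOKEN = "REDACTED"

def set_advertisement(advertised: bool) -> dict:
    # advertised=True announces the prefix; advertised=False withdraws it.
    resp = requests.patch(
        f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
        f"/addressing/prefixes/{PREFIX_ID}/bgp/status",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"advertised": advertised},
    )
    resp.raise_for_status()
    return resp.json()["result"]

# Announce once your account team confirms the prefix is ready.
print(set_advertisement(True))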

Once the IPs are ready to be announced, you may want to set up ‘delegations’ for your prefixes. Delegations manage how the prefix can be used across multiple Cloudflare accounts and have slightly different implications depending on which service your prefix is bound to. A prefix is owned by a single account, but a delegation can extend some of the prefix functionality to other accounts. This is also captured on our developer docs. Today, delegations can affect Layer 7 and Spectrum BYOIP prefixes.

Layer 7: If you use BYOIP + Layer 7 and also use the SSL for SaaS service, a delegation to another account will allow that account to also use that prefix to validate custom hostnames in addition to the original account which owns the prefix. This means that multiple accounts can use the same IP prefix to serve up custom hostname traffic. Additionally, all of your IPs can serve traffic for custom hostnames, which means you can easily change IP addresses for these hostnames if an IP is blocked for any reason.

Spectrum: If you use BYOIP + Spectrum, you can specify, via the Spectrum API, which IP in your prefix you want to create a Spectrum app with. If you create a delegation for a prefix to another account, that second account will also be able to specify an IP from that prefix to create an app.

If you are interested in learning more about BYOIP across either Magic Transit, CDN, or Spectrum, please reach out to your account team if you’re an existing customer or contact [email protected] if you’re a new prospect.

Cloudflare for SSH, RDP and Minecraft

Post Syndicated from Achiel van der Mandele original https://blog.cloudflare.com/cloudflare-for-ssh-rdp-and-minecraft/

Almost exactly two years ago, we launched Cloudflare Spectrum for our Enterprise customers. Today, we’re thrilled to extend DDoS protection and traffic acceleration with Spectrum for SSH, RDP, and Minecraft to our Pro and Business plan customers.

When we think of Cloudflare, a lot of the time we think about protecting and improving the performance of websites. But the Internet is so much more, ranging from gaming, to managing servers, to cryptocurrencies. How do we make sure these applications are secure and performant?

With Spectrum, you can put Cloudflare in front of your SSH, RDP and Minecraft services, protecting them from DDoS attacks and improving network performance. This allows you to protect the management of your servers, not just your website. Better yet, by leveraging the Cloudflare network you also get increased reliability and increased performance: lower latency!

Remote access to servers

While access to websites from home is incredibly important, being able to remotely manage your servers can be equally critical. Losing access to your infrastructure can be disastrous: people need to know their infrastructure is safe and connectivity is good and performant. Usually, server management is done through SSH (Linux or Unix based servers) and RDP (Windows based servers). With these protocols, performance and reliability are key: you need to know you can always reliably manage your servers and that the bad guys are kept out. What’s more, low latency is really important. Every time you type a key in an SSH terminal or click a button in a remote desktop session, that key press or button click has to traverse the Internet to your origin before the server can process the input and send feedback. While increasing bandwidth can help, lowering latency can help even more in getting your sessions to feel like you’re working on a local machine and not one half-way across the globe.

All work and no play makes Jack Steve a dull boy

While we stay at home, many of us are also looking to play and not only work. Video games in particular have seen a huge increase in popularity. As personal interaction becomes more difficult to come by, Minecraft has become a popular social outlet. Many of us at Cloudflare are using it to stay in touch and have fun with friends and family in the current age of quarantine. And it’s not just employees at Cloudflare that feel this way, we’ve seen a big increase in Minecraft traffic flowing through our network. Traffic per week had remained steady for a while but has more than tripled since many countries have put their citizens in lockdown:

Minecraft is a particularly popular target for DDoS attacks: it’s not uncommon for people to develop feuds whilst playing the game. When they do, some of the more tech-savvy players opt to take matters into their own hands and launch a (D)DoS attack, rendering the server unusable for the duration of the attack. Our friends at Hypixel and Nodecraft have known this for many years, which is why they’ve chosen to protect their servers using Spectrum.

While we love recommending their services, we realize some of you prefer to run your own Minecraft server on a VPS (virtual private server like a DigitalOcean droplet) that you maintain. To help you protect your Minecraft server, we’re providing Spectrum for Minecraft as well, available on Pro and Business plans. You’ll be able to use the entire Cloudflare network to protect your server and increase network performance.

How does it work?

Configuring Spectrum is easy: just log in to your dashboard and head over to the Spectrum tab. From there you can choose a protocol and configure the IP of your server:

After that, all you have to do is connect using the subdomain you configured instead of your IP. Traffic will be proxied using Spectrum on the Cloudflare network, keeping the bad guys out and your services safe.

So how much does this cost? We’re happy to announce that all paid plans will get access to Spectrum for free, with a generous free data allowance. Pro plans will be able to use SSH and Minecraft, up to 5 gigabytes for free each month. Biz plans can go up to 10 gigabytes for free and also get access to RDP. After the free cap you will be billed on a per gigabyte basis.

Spectrum is complementary to Access: it offers DDoS protection and improved network performance as a ‘drop-in’ product, no configuration necessary on your origins. If you want more control over who has access to which services, we highly recommend taking a look at Cloudflare for Teams.

We’re very excited to extend Cloudflare’s services beyond just HTTP traffic, allowing you to protect your core management services and Minecraft gaming servers. In the future, we’ll add support for more protocols. If you have a suggestion, let us know! In the meantime, if you have a Pro or Business account, head on over to the dashboard and enable Spectrum today!

Project Crossbow: Lessons from Refactoring a Large-Scale Internal Tool

Post Syndicated from Junade Ali original https://blog.cloudflare.com/project-crossbow-lessons-from-refactoring-a-large-scale-internal-tool/

Cloudflare’s global network currently spans 200 cities in more than 90 countries. Engineers working in product, technical support and operations often need to be able to debug network issues from particular locations or individual servers.

Crossbow is the internal tool for doing just this; allowing Cloudflare’s Technical Support Engineers to perform diagnostic activities from running commands (like traceroutes, cURL requests and DNS queries) to debugging product features and performance using bespoke tools.

In September last year, an Engineering Manager at Cloudflare asked to transition Crossbow from a Product Engineering team to the Support Operations team. The tool had been a secondary focus and had been passed through multiple engineering teams without any one team developing subject-matter knowledge of it.

The Support Operations team at Cloudflare is closely aligned with Cloudflare’s Technical Support Engineers; developing diagnostic tooling and Natural Language Processing technology to drive efficiency. Based on this alignment, it was decided that Support Operations was the best team to own this tool.

Learning from Sisyphus

Whilst we were seeking advice on the transition process, an SRE Engineering Manager at Cloudflare suggested reading “A Case Study in Community-Driven Software Adoption”. This proved a truly invaluable read for anyone thinking of doing internal tool development or contributing to such tooling. The book describes why multiple tools are often created for the same purpose by different autonomous teams and how this issue can be overcome. It also describes the challenges of, and approaches to, gaining adoption of tooling, especially where this requires some behaviour change from the engineers who use such tools.

That said, there are some things we learnt along the way while taking over Crossbow and refactoring and revamping a large-scale internal tool. This blog post seeks to be an addendum to such guidance and to provide some further practical advice.

In this blog post we won’t dwell too much on the work of the Cloudflare Support Operations team, but this can be found in the SRECon talk: “Support Operations Engineering: Scaling Developer Products to the Millions”. The software development methodology used in Cloudflare’s Support Operations Group closely resembles Extreme Programming.

Cutting The Fat

There were two ways of using Crossbow: a CLI (command line interface) and a web UI embedded in Cloudflare’s internal tool for Technical Support Engineers. Maintaining both interfaces clearly had significant overhead for improvement efforts, so we took the decision to deprecate one of them. This allowed us to focus our efforts on one platform to achieve large-scale improvements across technology, usability and functionality.

We set up a poll to allow engineering, operations, solutions engineering and technical support teams to provide feedback on how they used the tooling. Polling was critical not only for gaining vital information about how different teams used the tool, but also for ensuring that, prior to deprecation, people knew their views had been taken on board. We polled not only on the option people preferred, but also on which options they felt were necessary to them and why.

We found that the reasons for favouring the web UI primarily revolved around the absence of documentation and training. Those who used the CLI, by contrast, found it far more critical to their workflow. Product Engineering teams do not routinely have access to the support UI, but some needed Crossbow for their jobs, and users wanted to be able to automate commands with shell scripts.

Technically, the UI was written in JavaScript, with an API Gateway service that converted HTTP requests to gRPC, alongside some configuration to allow it to work in the support UI. The CLI interfaced directly with the gRPC API, so it was a simpler system. Given that the Cloudflare Support Operations team primarily works on Systems Engineering projects and had limited UI resources, the decision to deprecate the UI was also in our own interest.

We rolled out a new internal Crossbow user group, trained up teams, created new documentation, provided advance notification of the deprecation and retired the source code of these services. We also dramatically improved the CLI user experience through simple improvements to the help information and easier CLI usage.

Rearchitecting Pub/Sub with Cloudflare Access

One of the primary challenges we encountered was how the system architecture for Crossbow was designed many years ago. A gRPC API ran commands at Cloudflare’s edge network using a configuration management tool which the SRE team expressed a desire to deprecate (with Crossbow being the last user of it).

During a visit to the Singapore office, the local Edge SRE Engineering Manager wanted his team to understand Crossbow and how to contribute to it. During this meeting, we provided an overview of the current architecture, and the team there were forthcoming with potential refactoring ideas to improve global network stability and move away from the old pipeline. This provided invaluable insight into the issues common to the different technical approaches and into the instances where the tool would fail, requiring Technical Support Engineers to consult the SRE team.

We decided to adopt a simpler pub/sub pipeline: instead, the edge network would expose a gRPC daemon that would listen for new jobs, execute them, and then make a callback to the API service with the results (which would be relayed on to the client).

For authentication between the API service and the client or the API service and the network edge, we implemented a JWT authentication scheme. For a CLI user, the authentication was done by querying an HTTP endpoint behind Cloudflare Access using cloudflared, which provided a JWT the client could use for authentication with gRPC. In practice, this looks something like this:

  1. CLI makes request to authentication server using cloudflared
  2. Authentication server responds with signed JWT token
  3. CLI makes gRPC request with JWT authentication token to API service
  4. API service validates token using a public key
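
As a rough sketch of step 4 (not Crossbow’s actual code), validating such a token on the API service might look like this, assuming an RS256-signed JWT and the PyJWT library; the key and audience handling here are placeholders.

# Hypothetical sketch of step 4: the API service checking the JWT from the CLI.
import jwt  # PyJWT

def validate_token(token: str, public_key_pem: str, audience: str) -> dict:
    # Raises jwt.InvalidTokenError if the token is expired, mis-signed,
    # or was issued for a different audience.
    return jwt.decode(token, public_key_pem, algorithms=["RS256"], audience=audience)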

The gRPC API endpoint was placed on Cloudflare Spectrum; as users were authenticated using Cloudflare Access, we could remove the requirement for users to be on the company VPN to use the tool. The new authentication pipeline, combined with a single user interface, also allowed us to improve the collection of metrics and usage logs of the tool.

Risk Management

Risk is inherent in the activities undertaken by engineering professionals, meaning that members of the profession have a significant role to play in managing and limiting it.

As with all engineering projects, it was critical to manage risk. However, the risk to manage differs between projects. Availability wasn’t the largest factor, given that Technical Support Engineers could escalate issues to the SRE team if the tool wasn’t available. The main risks were the security of the Cloudflare network and ensuring Crossbow did not affect the availability of any other services. To this end we took methodical steps to improve isolation and engaged the InfoSec team early to assist with specification and code reviews of the new pipeline. Where a risk to availability existed, we ensured it was properly communicated to the support team and the internal Crossbow user group, making the risk/reward trade-off clear.

Feedback, Build, Refactor, Measure

The Support Operations team at Cloudflare works using a methodology based on Extreme Programming. A key tenet of Extreme Programming is Test Driven Development, often described as the “red-green-refactor” pattern: first the engineer enshrines the requirements in tests, then makes those tests pass, and then refactors to improve code quality before pushing the software.

As we took on this project, the Cloudflare Support and SRE teams were working on Project Baton – an effort to allow Technical Support Engineers to handle more customer escalations without handover to the SRE teams.

As part of this effort, they had already created an invaluable resource in the form of a feature wish list for Crossbow. We associated JIRAs with all of these items and prioritised the work to deliver these feature requests using a Test Driven Development workflow and the introduction of Continuous Integration. Critically, we measured the improvements once deployed. Adding simple functionality, like support for MTR (a Linux network diagnostic tool) and exposing support for different cURL flags, drove measurable increases in usage.

We were also able to embed Crossbow support for other tools available at the network edge created by other teams, allowing them to maintain such tools and expose features to Crossbow users. Through the creation of an improved development environment and documentation, we were able to drive Product Engineering teams to contribute functionality that was in the mutual interest of them and the customer support team.

Finally, we owned a number of tools which were used by Technical Support Engineers to discover what Cloudflare configuration was applied to a given URL and to perform distributed performance testing; we deprecated these tools and rolled them into Crossbow. Another tool owned by the Cloudflare Workers team, called Edge Worker Debug, was also rolled into Crossbow, and that team deprecated their standalone tool.

Results

From implementing user analytics on the tool on 16 December 2019 to the week ending 22 January 2020, we saw a 4.5x increase in usage. This growth happened primarily within a four-week period; by adding the most wanted functionality, we were able to reach a critical saturation of usage amongst Technical Support Engineers.

Beyond this point, it became critical to use the number of checks being run as a metric to evaluate how useful the tool was. For example, the week starting January 27 saw no meaningful increase in unique users (a 14% increase over the previous week – within the normal fluctuation of stable usage). However, over the same timeframe, we saw a 2.6x increase in the number of tests being run – coinciding with the introduction of a number of new high-usage functionality.

Conclusion

Through removing low-value/high-maintenance functionality and merciless refactoring, we were able to dramatically improve the quality of Crossbow and therefore the velocity of delivery. We dramatically improved usage by adding the ability to measure usage, running feedback loops with users for feature requests, and applying test-driven development. Consolidating tooling reduced the overhead of developing support tooling across the business, providing a common framework for developing and exposing functionality for Technical Support Engineers.

There are two key counterintuitive learnings from this project. The first is that cutting functionality can drive usage, provided this is done intelligently. In our case, the web UI contained no functionality that wasn’t also in the CLI, yet caused substantial engineering overhead to maintain. By deprecating it, we were able to reduce technical debt and thereby improve the velocity of delivering more important functionality. This effort requires effective communication of the decision-making process and involvement from those who are impacted by such a decision.

Secondly, tool development efforts are often guided by user feedback but lack a means of objectively measuring the resulting improvements. When logging is added, it is often done purely for security and audit purposes. Whilst feedback loops with users are invaluable, it is critical to have an objective measure of how successful a feature is and how it is used. Effective measurement drives the decision-making process for future tooling and therefore, in the long run, the usage data can be more important than the original feature itself.

If you’re interested in debugging interesting technical problems on a network with these tools, we’re hiring for Support Engineers (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.

When TCP sockets refuse to die

Post Syndicated from Marek Majkowski original https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

While working on our Spectrum server, we noticed something weird: the TCP sockets which we thought should have been closed were lingering around. We realized we don’t really understand when TCP sockets are supposed to time out!

Image by Sergiodc2 CC BY SA 3.0

In our code, we wanted to make sure we don’t hold connections to dead hosts. In our early code we naively thought enabling TCP keepalives would be enough… but it isn’t. It turns out the fairly modern TCP_USER_TIMEOUT socket option is equally important. Furthermore, it interacts with TCP keepalives in subtle ways. Many people are confused by this.

In this blog post, we’ll try to show how these options work. We’ll show how a TCP socket can timeout during various stages of its lifetime, and how TCP keepalives and user timeout influence that. To better illustrate the internals of TCP connections, we’ll mix the outputs of the tcpdump and the ss -o commands. This nicely shows the transmitted packets and the changing parameters of the TCP connections.

SYN-SENT

Let’s start from the simplest case – what happens when one attempts to establish a connection to a server which discards inbound SYN packets?

The scripts used here are available on our Github.

$ sudo ./test-syn-sent.py
# all packets dropped
00:00.000 IP host.2 > host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,940ms,0)

00:01.028 IP host.2 > host.1: Flags [S] # first retry
00:03.044 IP host.2 > host.1: Flags [S] # second retry
00:07.236 IP host.2 > host.1: Flags [S] # third retry
00:15.427 IP host.2 > host.1: Flags [S] # fourth retry
00:31.560 IP host.2 > host.1: Flags [S] # fifth retry
01:04.324 IP host.2 > host.1: Flags [S] # sixth retry
02:10.000 connect ETIMEDOUT

Ok, this was easy. After the connect() syscall, the operating system sends a SYN packet. Since it didn’t get any response, the OS will by default retry sending it 6 times. This can be tweaked by the sysctl:

$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6

It’s possible to overwrite this setting per-socket with the TCP_SYNCNT setsockopt:

int syn_retries = 6;
setsockopt(sd, IPPROTO_TCP, TCP_SYNCNT, &syn_retries, sizeof(syn_retries));

The retries are staggered at the 1s, 3s, 7s, 15s, 31s, 63s marks (the inter-retry time starts at 2s and then doubles each time). By default the whole process takes about 130 seconds, until the kernel gives up with the ETIMEDOUT errno: the last retry goes out at the 63s mark and is itself given a 64s timeout, so the timers add up to roughly 127 seconds before slack. At this moment in the lifetime of a connection, SO_KEEPALIVE settings are ignored, but TCP_USER_TIMEOUT is not. For example, setting it to 5000 ms will cause the following interaction:

$ sudo ./test-syn-sent.py 5000
# all packets dropped
00:00.000 IP host.2 > host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,996ms,0)

00:01.016 IP host.2 > host.1: Flags [S] # first retry
00:03.032 IP host.2 > host.1: Flags [S] # second retry
00:05.016 IP host.2 > host.1: Flags [S] # what is this?
00:05.024 IP host.2 > host.1: Flags [S] # what is this?
00:05.036 IP host.2 > host.1: Flags [S] # what is this?
00:05.044 IP host.2 > host.1: Flags [S] # what is this?
00:05.050 connect ETIMEDOUT

Even though we set the user timeout to 5s, we still saw six SYN retries on the wire. This behaviour is probably a bug (as tested on a 5.2 kernel): we would expect only two retries to be sent – at the 1s and 3s marks – and the socket to expire at the 5s mark. Instead, we saw those two retries, but also four further retransmitted SYN packets aligned to the 5s mark – which makes no sense. Anyhow, we learned a thing – TCP_USER_TIMEOUT does affect the behaviour of connect().

SYN-RECV

SYN-RECV sockets are usually hidden from the application. They live as mini-sockets on the SYN queue. We wrote about the SYN and Accept queues in the past. Sometimes, when SYN cookies are enabled, the sockets may skip the SYN-RECV state altogether.

In SYN-RECV state, the socket will retry sending SYN+ACK 5 times as controlled by:

$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5

Here is how it looks on the wire:

$ sudo ./test-syn-recv.py
00:00.000 IP host.2 > host.1: Flags [S]
# all subsequent packets dropped
00:00.000 IP host.1 > host.2: Flags [S.] # initial SYN+ACK

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,996ms,0)

00:01.033 IP host.1 > host.2: Flags [S.] # first retry
00:03.045 IP host.1 > host.2: Flags [S.] # second retry
00:07.301 IP host.1 > host.2: Flags [S.] # third retry
00:15.493 IP host.1 > host.2: Flags [S.] # fourth retry
00:31.621 IP host.1 > host.2: Flags [S.] # fifth retry
01:04.610 SYN-RECV disappears

With default settings, the SYN+ACK is re-transmitted at 1s, 3s, 7s, 15s, 31s marks, and the SYN-RECV socket disappears at the 64s mark.

Neither SO_KEEPALIVE nor TCP_USER_TIMEOUT affect the lifetime of SYN-RECV sockets.

Final handshake ACK

After receiving the second packet in the TCP handshake – the SYN+ACK – the client socket moves to an ESTABLISHED state. The server socket remains in SYN-RECV until it receives the final ACK packet.

Losing this ACK doesn’t change anything – the server socket will just take a bit longer to move from SYN-RECV to ESTAB. Here is how it looks:

00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.] # initial ACK, dropped

State    Recv-Q Send-Q Local:Port  Peer:Port
SYN-RECV 0      0      host:1      host:2 timer:(on,1sec,0)
ESTAB    0      0      host:2      host:1

00:01.014 IP host.1 > host.2: Flags [S.]
00:01.014 IP host.2 > host.1: Flags [.]  # retried ACK, dropped

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,1.012ms,1)
ESTAB    0      0      host:2     host:1

As you can see, SYN-RECV has the "on" timer, the same as in the example before. We might argue this final ACK doesn’t really carry much weight. This thinking led to the development of the TCP_DEFER_ACCEPT feature – it basically causes the third ACK to be silently dropped. With this flag set, the socket remains in the SYN-RECV state until it receives the first packet with actual data:

$ sudo ./test-syn-ack.py
00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.] # delivered, but the socket stays as SYN-RECV

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,7.192ms,0)
ESTAB    0      0      host:2     host:1

00:08.020 IP host.2 > host.1: Flags [P.], length 11  # payload moves the socket to ESTAB

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 11     0      host:1     host:2
ESTAB 0      0      host:2     host:1

The server socket remained in the SYN-RECV state even after receiving the final TCP-handshake ACK. It has a funny "on" timer, with the counter stuck at 0 retries. It is converted to ESTAB – and moved from the SYN to the accept queue – after the client sends a data packet or after the TCP_DEFER_ACCEPT timer expires. Basically, with DEFER ACCEPT the SYN-RECV mini-socket discards the data-less inbound ACK.
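
For reference, here is a minimal sketch of enabling this behaviour from Python on Linux; the port and the 8-second defer timeout are arbitrary example values.

# Minimal sketch: a listening socket with TCP_DEFER_ACCEPT enabled (Linux only).
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Keep connections in SYN-RECV until data arrives or ~8 seconds pass.
srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT, 8)
srv.bind(("0.0.0.0", 8080))
srv.listen(16)
# accept() typically returns once the client has sent data (or the defer timer expires).
conn, addr = srv.accept()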

Idle ESTAB is forever

Let’s move on and discuss a fully-established socket connected to an unhealthy (dead) peer. After completion of the handshake, the sockets on both sides move to the ESTABLISHED state, like:

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:2     host:1
ESTAB 0      0      host:1     host:2

These sockets have no running timer by default – they will remain in that state forever, even if the communication is broken. The TCP stack will notice problems only when one side attempts to send something. This raises a question – what to do if you don’t plan on sending any data over a connection? How do you make sure an idle connection is healthy, without sending any data over it?

This is where TCP keepalives come in. Let’s see it in action – in this example we used the following toggles:

  • SO_KEEPALIVE = 1 – Let’s enable keepalives.
  • TCP_KEEPIDLE = 5 – Send first keepalive probe after 5 seconds of idleness.
  • TCP_KEEPINTVL = 3 – Send subsequent keepalive probes after 3 seconds.
  • TCP_KEEPCNT = 3 – Time out after three failed probes.
$ sudo ./test-idle.py
00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.]

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      0      host:2     host:1  timer:(keepalive,2.992ms,0)

# all subsequent packets dropped
00:05.083 IP host.2 > host.1: Flags [.], ack 1 # first keepalive probe
00:08.155 IP host.2 > host.1: Flags [.], ack 1 # second keepalive probe
00:11.231 IP host.2 > host.1: Flags [.], ack 1 # third keepalive probe
00:14.299 IP host.2 > host.1: Flags [R.], seq 1, ack 1

Indeed! We can clearly see the first probe sent at the 5s mark, and the two remaining probes 3s apart – exactly as we specified. After a total of three sent probes, and a further three seconds of delay, the connection dies with ETIMEDOUT, and finally the RST is transmitted.

For keepalives to work, the send buffer must be empty. You can notice the keepalive timer active in the "timer:(keepalive)" line.
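
For reference, here is a minimal sketch (separate from the test scripts above) of setting these keepalive toggles from Python on Linux; the peer address is a placeholder.

# Minimal sketch: the keepalive toggles used in this example, set from Python.
import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # enable keepalives
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # first probe after 5s idle
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)  # subsequent probes every 3s
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # give up after 3 failed probes
sd.connect(("192.0.2.1", 80))                               # placeholder peer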

Keepalives with TCP_USER_TIMEOUT are confusing

We mentioned the TCP_USER_TIMEOUT option before. It sets the maximum amount of time that transmitted data may remain unacknowledged before the kernel forcefully closes the connection. On its own, it doesn’t do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped. However, this socket option does change the semantics of TCP keepalives. The tcp(7) manpage is somewhat confusing:

Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will override keepalive to determine when to close a connection due to keepalive failure.

The original commit message has slightly more detail:

To understand the semantics, we need to look at the kernel code in linux/net/ipv4/tcp_timer.c:693:

if ((icsk->icsk_user_timeout != 0 &&
    elapsed >= msecs_to_jiffies(icsk->icsk_user_timeout) &&
    icsk->icsk_probes_out > 0) ||

For the user timeout to have any effect, the icsk_probes_out must not be zero. The check for user timeout is done only after the first probe went out. Let’s check it out. Our connection settings:

  • TCP_USER_TIMEOUT = 5*1000 – 5 seconds
  • SO_KEEPALIVE = 1 – enable keepalives
  • TCP_KEEPIDLE = 1 – send first probe quickly – 1 second idle
  • TCP_KEEPINTVL = 11 – subsequent probes every 11 seconds
  • TCP_KEEPCNT = 3 – send three probes before timing out
00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.]

# all subsequent packets dropped
00:01.001 IP host.2 > host.1: Flags [.], ack 1 # first probe
00:12.233 IP host.2 > host.1: Flags [R.] # timer for second probe fired, socket aborted due to TCP_USER_TIMEOUT

So what happened? The connection sent the first keepalive probe at the 1s mark. Seeing no response, the TCP stack woke up 11 seconds later to send a second probe. This time, though, it executed the USER_TIMEOUT code path, which decided to terminate the connection immediately.

What if we bump TCP_USER_TIMEOUT to larger values, say between the second and third probe? Then, the connection will be closed on the third probe timer. With TCP_USER_TIMEOUT set to 12.5s:

00:01.022 IP host.2 > host.1: Flags [.] # first probe
00:12.094 IP host.2 > host.1: Flags [.] # second probe
00:23.102 IP host.2 > host.1: Flags [R.] # timer for third probe fired, socket aborted due to TCP_USER_TIMEOUT

We’ve shown how TCP_USER_TIMEOUT interacts with keepalives for small and medium values. The last case is when TCP_USER_TIMEOUT is extraordinarily large. Say we set it to 30s:

00:01.027 IP host.2 > host.1: Flags [.], ack 1 # first probe
00:12.195 IP host.2 > host.1: Flags [.], ack 1 # second probe
00:23.207 IP host.2 > host.1: Flags [.], ack 1 # third probe
00:34.211 IP host.2 > host.1: Flags [.], ack 1 # fourth probe! But TCP_KEEPCNT was only 3!
00:45.219 IP host.2 > host.1: Flags [.], ack 1 # fifth probe!
00:56.227 IP host.2 > host.1: Flags [.], ack 1 # sixth probe!
01:07.235 IP host.2 > host.1: Flags [R.], seq 1 # TCP_USER_TIMEOUT aborts conn on 7th probe timer

We saw six keepalive probes on the wire! With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:

TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT
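
In other words, derive the user timeout from the keepalive settings rather than picking it independently. Here is a minimal Python sketch of that rule on Linux (socket.TCP_USER_TIMEOUT is exposed by Python 3.6+; note it takes milliseconds while the keepalive options take seconds):

# Minimal sketch: keep TCP_USER_TIMEOUT consistent with the keepalive settings
# so that TCP_KEEPCNT still has an effect. Values are examples only.
import socket

KEEPIDLE, KEEPINTVL, KEEPCNT = 5, 3, 3
# Slightly smaller than KEEPIDLE + KEEPINTVL * KEEPCNT (14s here), in milliseconds.
USER_TIMEOUT_MS = (KEEPIDLE + KEEPINTVL * KEEPCNT) * 1000 - 500

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPIDLE)
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPINTVL)
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPCNT)
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, USER_TIMEOUT_MS)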

Busy ESTAB socket is not forever

Thus far we have discussed the case where the connection is idle. Different rules apply when the connection has unacknowledged data in a send buffer.

Let’s prepare another experiment – after the three-way handshake, let’s set up a firewall to drop all packets. Then, let’s do a send on one end to have some dropped packets in-flight. An experiment shows the sending socket dies after ~16 minutes:

00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.]
00:00.000 IP host.2 > host.1: Flags [.]

# All subsequent packets dropped
00:00.206 IP host.2 > host.1: Flags [P.], length 11 # first data packet
00:00.412 IP host.2 > host.1: Flags [P.], length 11 # early retransmit, doesn't count
00:00.620 IP host.2 > host.1: Flags [P.], length 11 # 1st retry
00:01.048 IP host.2 > host.1: Flags [P.], length 11 # 2nd retry
00:01.880 IP host.2 > host.1: Flags [P.], length 11 # 3rd retry

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      11     host:2     host:1    timer:(on,1.304ms,3)

00:03.543 IP host.2 > host.1: Flags [P.], length 11 # 4th
00:07.000 IP host.2 > host.1: Flags [P.], length 11 # 5th
00:13.656 IP host.2 > host.1: Flags [P.], length 11 # 6th
00:26.968 IP host.2 > host.1: Flags [P.], length 11 # 7th
00:54.616 IP host.2 > host.1: Flags [P.], length 11 # 8th
01:47.868 IP host.2 > host.1: Flags [P.], length 11 # 9th
03:34.360 IP host.2 > host.1: Flags [P.], length 11 # 10th
05:35.192 IP host.2 > host.1: Flags [P.], length 11 # 11th
07:36.024 IP host.2 > host.1: Flags [P.], length 11 # 12th
09:36.855 IP host.2 > host.1: Flags [P.], length 11 # 13th
11:37.692 IP host.2 > host.1: Flags [P.], length 11 # 14th
13:38.524 IP host.2 > host.1: Flags [P.], length 11 # 15th
15:39.500 connection ETIMEDOUT

The data packet is retransmitted 15 times, as controlled by:

$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15

From the ip-sysctl.txt documentation:


The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout.

The connection indeed died at ~940 seconds. Notice the socket has the "on" timer running. It doesn’t matter at all if we set SO_KEEPALIVE – when the "on" timer is running, keepalives are not engaged.

TCP_USER_TIMEOUT keeps on working though. The connection will be aborted exactly after the user-timeout-specified time has passed since the last received packet. With the user timeout set, the tcp_retries2 value is ignored.

Zero window ESTAB is… forever?

There is one final case worth mentioning. If the sender has plenty of data, and the receiver is slow, then TCP flow control kicks in. At some point the receiver will ask the sender to stop transmitting new data. This is a slightly different condition than the one described above.

In this case, with flow control engaged, there is no in-flight or unacknowledged data. Instead the receiver throttles the sender with a "zero window" notification. Then the sender periodically checks if the condition is still valid with "window probes". In this experiment we reduced the receive buffer size for simplicity. Here’s how it looks on the wire:

00:00.000 IP host.2 > host.1: Flags [S]
00:00.000 IP host.1 > host.2: Flags [S.], win 1152
00:00.000 IP host.2 > host.1: Flags [.]

00:00.202 IP host.2 > host.1: Flags [.], length 576 # first data packet
00:00.202 IP host.1 > host.2: Flags [.], ack 577, win 576
00:00.202 IP host.2 > host.1: Flags [P.], length 576 # second data packet
00:00.244 IP host.1 > host.2: Flags [.], ack 1153, win 0 # throttle it! zero-window

00:00.456 IP host.2 > host.1: Flags [.], ack 1 # zero-window probe
00:00.456 IP host.1 > host.2: Flags [.], ack 1153, win 0 # nope, still zero-window

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 1152   0      host:1     host:2
ESTAB 0      129920 host:2     host:1  timer:(persist,048ms,0)

The packet capture shows a couple of things. First, we can see two packets with data, each 576 bytes long. They were both immediately acknowledged. The second ACK carried a "win 0" notification: the sender was told to stop sending data.

But the sender is eager to send more! The last two packets show a first "window probe": the sender will periodically send payload-less "ack" packets to check if the window size has changed. As long as the receiver keeps on answering, the sender will keep sending such probes forever.

The socket information shows three important things:

  • The read buffer of the reader is filled – thus the "zero window" throttling is expected.
  • The write buffer of the sender is filled – we have more data to send.
  • The sender has a "persist" timer running, counting the time until the next "window probe".

In this blog post we are interested in timeouts – what will happen if the window probes are lost? Will the sender notice?

By default the window probe is retried 15 times – adhering to the usual tcp_retries2 setting.

The TCP timer is in the persist state, so TCP keepalives will not be running. The SO_KEEPALIVE settings don’t make any difference when window probing is engaged.

As expected, the TCP_USER_TIMEOUT toggle keeps on working. A slight difference is that similarly to user-timeout on keepalives, it’s engaged only when the retransmission timer fires. During such an event, if more than user-timeout seconds since the last good packet passed, the connection will be aborted.

Note about using application timeouts

In the past we have shared an interesting war story:

Our HTTP server gave up on the connection after an application-managed timeout fired. This was a bug – a slow connection might have correctly slowly drained the send buffer, but the application server didn’t notice that.

We abruptly dropped slow downloads, even though this wasn’t our intention. We just wanted to make sure the client connection was still healthy. It would be better to use TCP_USER_TIMEOUT than rely on application-managed timeouts.

But this is not sufficient. We also wanted to guard against a situation where a client stream is valid, but is stuck and doesn’t drain the connection. The only way to achieve this is to periodically check the amount of unsent data in the send buffer, and see if it shrinks at a desired pace.

For typical applications sending data to the Internet, I would recommend:

  1. Enable TCP keepalives. This is needed to keep some data flowing in the idle-connection case.

  2. Set TCP_USER_TIMEOUT to TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT.

  3. Be careful when using application-managed timeouts. To detect TCP failures use TCP keepalives and user-timeout. If you want to spare resources and make sure sockets don’t stay alive for too long, consider periodically checking if the socket is draining at the desired pace. You can use ioctl(TIOCOUTQ) for that, but it counts both data buffered (notsent) on the socket and in-flight (unacknowledged) bytes. A better way is to use TCP_INFO tcpi_notsent_bytes parameter, which reports only the former counter.

An example of checking the draining pace:

# get_tcp_info(c) is a helper (not shown here) that reads struct tcp_info
# for the socket via getsockopt(IPPROTO_TCP, TCP_INFO).
while True:
    notsent1 = get_tcp_info(c).tcpi_notsent_bytes  # bytes queued but not yet sent
    notsent1_ts = time.time()
    ...
    poll.poll(POLL_PERIOD)
    ...
    notsent2 = get_tcp_info(c).tcpi_notsent_bytes
    notsent2_ts = time.time()
    pace_in_bytes_per_second = (notsent1 - notsent2) / (notsent2_ts - notsent1_ts)
    if pace_in_bytes_per_second > 12000:
        pass  # pace is above effective rate of 96Kbps, ok!
    else:
        pass  # socket is too slow...

There are ways to further improve this logic. We could use TCP_NOTSENT_LOWAT, although it’s generally only useful for situations where the send buffer is relatively empty. Then we could use the SO_TIMESTAMPING interface for notifications about when data gets delivered. Finally, if we are done sending the data to the socket, it’s possible to just call close() and defer handling of the socket to the operating system. Such a socket will be stuck in FIN-WAIT-1 or LAST-ACK state until it correctly drains.

Summary

In this post we discussed five cases where the TCP connection may notice the other party going away:

  • SYN-SENT: The duration of this state can be controlled by TCP_SYNCNT or tcp_syn_retries.
  • SYN-RECV: It’s usually hidden from application. It is tuned by tcp_synack_retries.
  • An idling ESTABLISHED connection will never notice any issues on its own. A solution is to use TCP keepalives.
  • A busy ESTABLISHED connection adheres to the tcp_retries2 setting, and ignores TCP keepalives.
  • A zero-window ESTABLISHED connection adheres to the tcp_retries2 setting, and ignores TCP keepalives.

Especially the last two ESTABLISHED cases can be customized with TCP_USER_TIMEOUT, but this setting also affects other situations. Generally speaking, it can be thought of as a hint to the kernel to abort the connection after so-many seconds since the last good packet. This is a dangerous setting though, and if used in conjunction with TCP keepalives should be set to a value slightly lower than TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.

In this post we presented scripts showing the effects of timeout-related socket options under various network conditions. Interleaving the tcpdump packet capture with the output of ss -o is a great way of understanding the networking stack. We were able to create reproducible test cases showing the "on", "keepalive" and "persist" timers in action. This is a very useful framework for further experimentation.

Finally, it’s surprisingly hard to tune a TCP connection to be confident that the remote host is actually up. During our debugging we found that looking at the send buffer size and currently active TCP timer can be very helpful in understanding whether the socket is actually healthy. The bug in our Spectrum application turned out to be a wrong TCP_USER_TIMEOUT setting – without it sockets with large send buffers were lingering around for way longer than we intended.

The scripts used in this article can be found on our Github.

Figuring this out has been a collaboration across three Cloudflare offices. Thanks to Hiren Panchasara from San Jose, Warren Nelson from Austin and Jakub Sitnicki from Warsaw. Fancy joining the team? Apply here!