Tag Archives: research

Metasploit Wrap-Up

Post Syndicated from Spencer McIntyre original https://blog.rapid7.com/2021/11/05/metasploit-wrap-up-137/

GitLab RCE


New Rapid7 team member jbaines-r7 wrote an exploit targeting GitLab via the ExifTool command. Exploiting this vulnerability results in unauthenticated remote code execution as the git user. What makes this module extra neat is that it chains two vulnerabilities together to achieve the desired effect. The first vulnerability is in GitLab itself and can be leveraged to pass invalid image files to the ExifTool parser, which contained the second vulnerability, whereby a specially constructed image could be used to execute code. For even more information on these vulnerabilities, check out Rapid7’s post.

Less Than BulletProof

This week, community member h00die submitted another WordPress module. This one leverages an information disclosure vulnerability in the WordPress BulletProof Security plugin that can disclose user credentials from a backup file. These credentials could then be used by an attacker to log in to WordPress if the hashed password can be cracked in an offline attack.

Metasploit Masterfully Manages Meterpreter Metadata

Each Meterpreter implementation is a unique snowflake that often incorporates API commands that others may not. A great example of this is the set of Kiwi commands missing from the Linux Meterpreter. Metasploit now has much better support for modules to identify the functionality they require a Meterpreter session to have in order to run. This will help alleviate the frustration users encounter when they try to run a post module with a Meterpreter type that doesn’t offer the needed functionality. It furthers the Metasploit project goal of providing more meaningful error information about post module incompatibilities, which has been an ongoing effort this year.

New module content (3)

  • WordPress BulletProof Security Backup Disclosure by Ron Jost (Hacker5preme) and h00die, which exploits CVE-2021-39327 – This adds an auxiliary module that leverages an information disclosure vulnerability in the BulletProof Security plugin for WordPress. This vulnerability is identified as CVE-2021-39327. The module retrieves a backup file, which is publicly accessible, and extracts user credentials from the database backup.
  • GitLab Unauthenticated Remote ExifTool Command Injection by William Bowling and jbaines-r7, which exploits CVE-2021-22204 and CVE-2021-22205 – This adds an exploit for an unauthenticated remote command injection in GitLab via a separate vulnerability within ExifTool. The vulnerabilities are identified as CVE-2021-22204 and CVE-2021-22205.
  • WordPress Plugin Pie Register Auth Bypass to RCE by Lotfi13-DZ and h00die – This exploits an authentication bypass leading to arbitrary code execution in versions 3.7.1.4 and below of the WordPress plugin pie-register. Supplying a valid admin ID to the user_id_social_site parameter in a POST request returns a valid session cookie. With that session cookie, a PHP payload is uploaded as a plugin and then requested, resulting in code execution.

Enhancements and features

  • #15665 from adfoster-r7 – This adds additional metadata to exploit modules to specify Meterpreter command requirements. Metadata information is used to add a descriptive warning when running modules with a Meterpreter implementation that doesn’t support the required command functionality.
  • #15782 from k0pak4 – This updates the iis_internal_ip module to include coverage for the PROPFIND internal IP address disclosure as described by CVE-2002-0422.

Bugs fixed

  • #15805 from timwr – This bumps the metasploit-payloads version to include two bug fixes for the Python Meterpreter.

Get it

As always, you can update to the latest Metasploit Framework with msfupdate
and you can get more details on the changes since the last blog post from
GitHub:

If you are a git user, you can clone the Metasploit Framework repo (master branch) for the latest.
To install fresh without using git, you can use the open-source-only Nightly Installers or the
binary installers (which also include the commercial edition).

Hands-On IoT Hacking: Rapid7 at DefCon 29 IoT Village, Part 3

Post Syndicated from Deral Heiland original https://blog.rapid7.com/2021/11/04/hands-on-iot-hacking-rapid7-at-defcon-29-iot-village-part-3/


In our first post in this series, we covered the setup of Rapid7’s hands-on exercise at Defcon 29’s IoT Village. Last week, we discussed how to determine whether the header we created is UART and how to actually start hacking on the IoT device. The goal in this next phase of the IoT hacking exercise is to turn the console back on.

To accomplish this, we need to re-enter the bootargs variable without the console setting. To change the bootargs variable, use the “setenv” command. For this exercise, enter the command shown in Figure 16; you can see that “console=off” has been removed. This will overwrite the current bootargs environment variable setting.

Figure 16: setenv command

Once you’ve run this command, we recommend verifying that the changes to the bootargs variable were made correctly by running the “printenv” command again and confirming that “console=off” no longer appears in the output. It is very common to accidentally mistype an environment variable, which will cause errors on reboot or simply create an entirely new variable with no usable value. The correct bootargs variable line should read as shown in Figure 17:

Figure 17: bootargs setting
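If you are driving the UART from a script rather than typing into GtkTerm, a quick check like the following can confirm the change before you save it. This is only a rough sketch: it assumes pyserial is installed, the USB-to-UART adapter shows up as /dev/ttyUSB0 at 115200 baud (the settings used elsewhere in this series), and the device is currently sitting at the U-Boot prompt.

```python
import serial  # pyserial

# Ask U-Boot for just the bootargs variable and make sure console=off is gone.
with serial.Serial("/dev/ttyUSB0", 115200, timeout=2) as console:
    console.write(b"printenv bootargs\n")
    reply = console.read(2048).decode("utf-8", errors="replace")

print(reply)
if "console=off" in reply:
    print("console=off is still set -- re-run setenv before saving")
else:
    print("bootargs looks correct; safe to run saveenv")
```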

Once you’re sure the changes made to bootargs are correct, you’ll need to save the environment variable settings. To do this, you’ll use the “saveenv” command. Enter this command in the UART console, and hit enter. If you miss this step, then none of the changes made to the environment variables of U-Boot will be saved and all will be lost on reboot.

The saveenv command should cause the U-Boot environment variables to be written to flash and return a response indicating they are being saved. An example of this is shown in Figure 18:

Figure 18: saveenv command response

Reboot and capture logs for review

Once you’ve made all the needed changes to the U-Boot environment variables and saved them, you can reboot the device, observe console logs from the boot process, and save the console log data to a file for further review. The boot log data from the console will play a critical role in the next steps as you work toward gaining full root access to the device.

Next, reboot the system. You can do this in a couple of different ways: either type the “reset” command within the U-Boot console and hit enter, which tells the MCU to reset and causes the system to restart, or just cycle the power on the device. After entering the reset command or power cycling the device, the device should reboot. The console should now be unlocked, and you should see the kernel boot up. If you still do not have a functioning console, you either entered the wrong data for bootargs or failed to save the settings with the “saveenv” command. I must admit I have personally been guilty of both many times.

During the Defcon IoT Village exercise, we had the attendees capture console logs to a file for review using the following process in GtkTerm. If you are using a different serial console application, the process for capturing and saving logs will be different.

To capture logs for review in GtkTerm, select “Log” from the menu bar pulldown, as shown below in Figure 19:

Figure 19: Enable logging

Once “Log” is selected, a window will pop up. From here, you need to select the file to write the logs to. In this case, we had the attendees select the defcon_log.txt file on the laptop’s desktop, as shown below in Figure 20:

Figure 20: Select defcon_log.txt file

Once you’ve selected a log file, you should now start capturing logs to that file. From here, the device can be powered back on or restarted to start capturing logs for review. Let the system boot up completely. Once it appears to be up and running, you can turn off logging by selecting “Log” and then selecting “Stop” in the dropdown, as shown in Figure 21:

Figure 21: Stop log capture

Once logging is stopped, you can open the captured log file and review the contents. During the Defcon IoT Village exercise, we had the participants search for the keyword “failsafe” in the captured logs. Searching for failsafe should take you to the log entry containing the line:

  • “Press the [f] key and hit [enter] to enter failsafe mode”

This is a prompt that allows you to hit the “f” key followed by return to boot the system into single-user mode. You won’t find this mode on all IoT devices, but you will find it on some, like in this case with the LUMA device. Single-user mode starts the system with limited functionality and is often used for conducting maintenance on an operating system — and, yes, this is root-level access to the device, but with none of the critical system functions running that would provide network services, USB access, or the applications that run as part of the device’s normal operation. Our goal later is to use this access and the following data to eventually gain root access to the fully running system.

There is also another critical piece of data in the log file shortly after the failsafe mode prompt, which we need to note. Approximately 8 lines below the failsafe prompt, there is a reference to “rootfs_data” as shown in Figure 22:

Figure 22: Log review

The piece of data we need from this line is the Unsorted Block Image File System (UBIFS) device number and the volume number. This will let us properly mount the rootfs_data partition later. With the LUMA, we found this to be one of the following two values (the short sketch after this list shows one way to pull these lines out of the captured log).

  • Device 0, volume 2
  • Device 0, volume 3
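If you’d rather not scroll through the capture by hand, a few lines of Python can pull out the relevant entries. This is a minimal sketch that assumes the log was saved as defcon_log.txt, as in the exercise; adjust the path to wherever you wrote your own capture.

```python
from pathlib import Path

# Print every log line that mentions the failsafe prompt or the rootfs_data
# UBIFS attach message, along with its line number for quick reference.
log = Path("defcon_log.txt")
for number, line in enumerate(log.read_text(errors="replace").splitlines(), start=1):
    if "failsafe" in line or "rootfs_data" in line:
        print(f"{number:5}: {line.strip()}")
```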

Boot into single-user mode

Now that the captured logs have been reviewed, allowing us to identify the failsafe mode and the UBIFS mount data, the next step is to reboot the system into single-user mode so we can work on getting full root access to the device. To do this, you’ll need to monitor the system booting up in the UART console, watching for the failsafe mode prompt as shown below in Figure 23:

Figure 23: Failsafe mode prompt

When this prompt shows up, you will only have a couple of seconds to press the “f” key followed by the return key to get the system to launch into single-user root access mode. If you miss this, you’ll need to reboot and start over. If you’re successful, the UART console should show the following prompt (Figure 24):

Figure 24: Single-user mode

In single-user mode, you’ll have root access, although most of the partitions, applications, network services, and associated functions will not be loaded or running. Our goal will be to make changes so you can boot the device into full operating system mode and have root access.

In our fourth and final installment of this series, we’ll go over how to configure user accounts, and finally, how to reboot the device and login. Check back with us next week!


Sneaking Through Windows: Infostealer Malware Masquerades as Windows Application

Post Syndicated from Andrew Iwamaye original https://blog.rapid7.com/2021/10/28/sneaking-through-windows-infostealer-malware-masquerades-as-windows-application/


This post also includes contributions from Reese Lewis, Andrew Christian, and Seth Lazarus.

Rapid7’s Managed Detection and Response (MDR) team leverages specialized toolsets, malware analysis, tradecraft, and collaboration with our colleagues on the Threat Intelligence and Detection Engineering (TIDE) team to detect and remediate threats.

Recently, we identified a malware campaign whose payload installs itself as a Windows application after delivery via a browser ad service and bypasses User Account Control (UAC) by abusing a Windows environment variable and a native scheduled task, ensuring it persistently executes with elevated privileges. The malware is classified as a stealer: it is intended to steal sensitive data from an infected asset (such as browser credentials and cryptocurrency), prevent browser updates, and allow for arbitrary command execution.

Detection

The MDR SOC first became aware of this malware campaign upon analysis of “UAC Bypass – Disk Cleanup Utility” and “Suspicious Process – TaskKill Multiple Times” alerts (authored by Rapid7’s TIDE team) within Rapid7’s InsightIDR platform.

As the “UAC Bypass – Disk Cleanup Utility” name implies, the alert identified a possible UAC bypass using the Disk Cleanup utility due to a vulnerability in some versions of Windows 10 that allows a native scheduled task to execute arbitrary code by modifying the content of an environment variable. Specifically, the alert detected a PowerShell command spawned by a suspicious executable named HoxLuSfo.exe. We determined that HoxLuSfo.exe was spawned by sihost.exe, a background process that launches and maintains the Windows action and notification centers.

Figure 1: PowerShell command identified by Rapid7’s MDR on infected assets

We determined the purpose of the PowerShell command was, after sleeping, to attempt to perform a Disk Cleanup Utility UAC bypass. The command works because, on some Windows systems, it is possible for the Disk Cleanup Utility to run via the native scheduled task “SilentCleanup” that, when triggered, executes the following command with elevated privileges:

%windir%\system32\cleanmgr.exe /autoclean /d %systemdrive%

The PowerShell command exploited the use of the environment variable %windir% in the path specified in the “SilentCleanup” scheduled task by altering the value set for the environment variable %windir%. Specifically, the PowerShell command deleted the existing %windir% environment variable and replaced it with a new %windir% environment variable set to:

%LOCALAPPDATA%\Microsoft\OneDrive\setup\st.exe REM

The environment variable replacement therefore configured the scheduled task “SilentCleanup” to execute the following command whenever the task “SilentCleanup” was triggered:

%LOCALAPPDATA%\Microsoft\OneDrive\setup\st.exe REM\system32\cleanmgr.exe /autoclean /d %systemdrive%

The binary st.exe was a copied version of HoxLuSfo.exe from the file path C:\Program Files\WindowsApps\3b76099d-e6e0-4e86-bed1-100cc5fa699f_113.0.2.0_neutral__7afzw0tp1da5e\HoxLuSfo\.

The trailing “REM” at the end of the Registry entry commented out the rest of the native command for the “SilentCleanup” scheduled task, effectively configuring the task to execute:

%LOCALAPPDATA%\Microsoft\OneDrive\setup\st.exe

After making the changes to the %windir% environment variable, the PowerShell command ran the “SilentCleanup” scheduled task, thereby hijacking the “SilentCleanup” scheduled task to run st.exe with elevated privileges.
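Because the hijack leaves a per-user windir value sitting in HKCU\Environment (see the IOC table at the end of this post), one quick way to check a host is to look for that value. The sketch below is a rough triage check, not a Rapid7 detection: it assumes Python is available on the Windows host in question and only inspects the registry value, so treat a hit as a lead to investigate rather than proof of compromise.

```python
import winreg

# A clean Windows install does not normally define "windir" under the per-user
# environment key; the UAC bypass described above depends on creating one.
try:
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, r"Environment") as key:
        value, _ = winreg.QueryValueEx(key, "windir")
    print(f"Suspicious per-user windir override found: {value}")
except FileNotFoundError:
    print("No per-user windir override in HKCU\\Environment")
```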

The alert for “Suspicious Process – TaskKill Multiple Times” later detected st.exe spawning multiple commands attempting to kill any process named Google*, MicrosoftEdge*, or setu*.

Analysis of HoxLuSfo.exe

Rapid7’s MDR could not remotely acquire the files HoxLuSfo.exe and st.exe from the infected assets because they were no longer present at the time of the investigation. However, we obtained a copy of the executable from VirusTotal based on its MD5 hash, 1cc0536ae396eba7fbde9f35dc2fc8e3.

Figure 2: Overview of HoxLuSfo.exe (originally named TorE.exe) within dnSpy and its partially obfuscated contents.

Rapid7’s MDR concluded that HoxLuSfo.exe had the following characteristics and behaviors:

  • 32-bit Microsoft Visual Studio .NET executable containing obfuscated code
  • Originally named TorE.exe
  • At the time of writing, only 10 antivirus solutions detected HoxLuSfo.exe as malicious

Figure 3: Low detection rate for HoxLuSfo.exe on VirusTotal

  • Fingerprints the infected asset
  • Drops and leverages a 32-bit Microsoft Visual Studio .NET DLL, JiLuT64.dll (MD5: 14ff402962ad21b78ae0b4c43cd1f194), which is an Agile .NET obfuscator signed by SecureTeam Software Ltd, likely to (de)obfuscate contents
  • Modifies the hosts file on the infected asset so that common browser update URLs no longer resolve correctly, preventing browser updates (a quick way to review the hosts file for this tampering is shown after this list)

Figure 4: Modifications made to the hosts file on infected assets
  • Enumerates installed browsers and steals credentials from them
  • Kills processes named Google*, MicrosoftEdge*, setu*
  • Contains functionality to steal cryptocurrency
  • Contains functionality for the execution of arbitrary commands on the infected asset
Figure 5: Sample of the functionality within HoxLuSfo.exe to execute arbitrary commands

  • Communicates with s1.cleancrack[.]tech and s4.cleancrack[.]tech (both of which resolve to 172.67.187[.]162 and 104.21.92[.]68 at the time of analysis) via AES-encrypted messages with a key of e84ad660c4721ae0e84ad660c4721ae0. The encryption scheme employed appears to be reused code from here.
  • Has a PDB path of E:\msix\ChromeRceADMIN4CB\TorE\obj\Release\TorE.pdb.
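As noted in the list above, the stealer also rewrites the hosts file to break browser updates. A quick way to review a host for that kind of tampering is to print any active (non-comment) hosts entries; a default Windows hosts file usually contains nothing but comments. This is a hedged triage sketch, assuming the default hosts file location.

```python
from pathlib import Path

# Default hosts file location on Windows; adjust if %SystemRoot% is non-standard.
hosts = Path(r"C:\Windows\System32\drivers\etc\hosts")

for line in hosts.read_text(encoding="utf-8", errors="replace").splitlines():
    entry = line.strip()
    if entry and not entry.startswith("#"):
        # Any active entry deserves a look, especially ones covering
        # browser or OS update hostnames.
        print(entry)
```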

Rapid7’s MDR interacted with s4.cleancrack[.]tech and discovered what appears to be a login portal for the attacker to access stolen data.

Figure 6: Login page hosted at hXXps://s4.cleancrack[.]tech/login

Source of infection

Rapid7’s MDR observed the execution of chrome.exe just prior to HoxLuSfo.exe spawning the PowerShell command we detected with our alert.

In one of our investigations, our analysis of the user’s Chrome browser history file showed redirects to suspicious domains before initial infection:
hXXps://getredd[.]biz/ →
hXXps://eu.postsupport[.]net →
hXXp://updateslives[.]com/

In another investigation, DNS logs showed a redirect chain that followed a similar pattern:
hXXps://getblackk[.]biz/ →
hXXps://eu.postsupport[.]net →
hXXp://updateslives[.]com/ →
hXXps://chromesupdate[.]com

In the first investigation, the user’s Chrome profile revealed that the site permission settings for a suspicious domain, birchlerarroyo[.]com, were altered just prior to the redirects. Specifically, the user granted permission to the site hosted at birchlerarroyo[.]com to send notifications to the user.

Figure 7: Notifications enabled for birchlerarroyo[.]com within the user’s site settings of Chrome

Rapid7’s MDR visited the website hosted at birchlerarroyo[.]com and found that the website presented a browser notification requesting permission to show notifications to the user.

Figure 8: Website hosted at birchlerarroyo[.]com requesting permission to show notifications to the user

We suspect that the website hosted at birchlerarroyo[.]com was compromised, as its source code contained a reference to a suspicious JavaScript file hosted at fastred[.]biz:

Figure 9: Suspicious JavaScript file found within the source code of the website hosted at birchlerarroyo[.]com

We determined that the JavaScript file hosted at fastred[.]biz was responsible for the notification observed at birchlerarroyo[.]com via the code in Figure 10.

Figure 10: Partial contents of the JavaScript file hosted at fastred[.]biz

Pivoting off of the string “Код RedPush” within the source code of birchlerarroyo[.]com (see highlighted lines in Figure 9), as well as the workerName and applicationServerKey settings within the JavaScript file in Figure 10, Rapid7’s MDR discovered additional websites containing similar source code: ostoday[.]com and magnetline[.]ru.

Rapid7’s MDR analyzed the websites hosted at each of birchlerarroyo[.]com, ostoday[.]com, and magnetline[.]ru and found that each:

  • Displayed the same type of browser notification shown in Figure 8
  • Was built using WordPress and employed the same WordPress plugin, “WP Rocket”
  • Had source code that referred to similar Javascript files hosted at either fastred[.]biz or clickmatters[.]biz and the JavaScript files had the same applicationServerKey: BIbjCoVklTIiXYjv3Z5WS9oemREJPCOFVHwpAxQphYoA5FOTzG-xOq6GiK31R-NF--qzgT3_C2jurmRX_N6nY4g

Figure 11: Partial contents of the JavaScript file hosted at clickmatters[.]biz. The unicode in the “text” key decodes to “Нажмите \"Разрешить\", чтобы получать уведомления”, which translates to “Click ‘Allow’ to receive notifications”.

  • Had source code that contained a similar rbConfig parameter referencing takiparkrb[.]site and a varying rotator value

Figure 12: Example rbConfig parameter found in website source code

  • Had source code that contained references to either “Код RedPush” (translates to “Redpush code”), “Код РБ” (translates to “CodeRB”), or “Код нативного ПУШа RB” (translates to “Native PUSH code RB”)

Pivoting off of the similar strings of “CodeRB” and “Redpush” within source code led to other findings.

First, Rapid7’s MDR discovered an advertising business, RedPush (see redpush[.]biz). RedPush provides its customers with advertisement code to host on customers’ websites. The code produces pop-up notifications to allow for advertisements to be pushed to users browsing the customers’ websites. RedPush’s customers make a profit based on the number of advertisement clicks generated from their websites that contain RedPush’s code.

Figure 13: Summary of RedPush’s ad delivery model via push notifications

Second, Rapid7’s MDR discovered a publication by Malwaretips describing a browser pop-up malware family known as Redpush. Upon visiting a website compromised with Redpush code, the code presents a browser notification requesting permission to send notifications to the user. After the user grants permission, the compromised site appears to gain the ability to push toast notifications, which could range from spam advertisements to notifications for malicious fake software updates. Similar publications by McAfee here and here describe that threat actors have recently been employing toast notifications that advertise fake software updates to trick users into installing malicious Windows applications.

Rapid7’s MDR could not reproduce a push of malicious software after visiting the compromised website at birchlerarroyo[.]com, possibly for several reasons:

  • Notification-enabled sites may send notifications at varying frequencies, as explained here, and varying times of day.
  • Malicious packages are known to be selectively pushed to users based on geolocation, as explained here. (Note: Rapid7’s MDR interacted with the website using IP addresses having varying geolocations in North America and Europe.)
  • The malware was no longer being served at the time of investigation.

However, the malware delivery techniques described by Malwaretips and McAfee were likely employed to trick the users in our investigations into installing the malware while they were browsing the Internet. As explained in the “Forensic analysis” section, in one of our investigations, there was evidence of an initial toast notification, a fake update masquerade, and installation of a malicious Windows application. Additionally, the grandparent process of the PowerShell command we detected, sihost.exe, indicated to us that the malware may have leveraged the Windows Notification Center during the infection chain.

Forensic analysis

Analysis of the User’s Chrome profile and Microsoft-Windows-PushNotifications-Platform Windows Event Logs suggests that upon the user enabling notifications to be sent from the compromised site at birchlerarroyo[.]com, the user was presented with and cleared a toast notification. We could not determine what the contents of the toast notification were based on available evidence.

Figure 14: Windows Event Log for the user clearing a toast notification to proceed with the malware’s infection chain

Based on our analysis of timestamp evidence, the user was likely directed to each of getredd[.]biz, postsupport[.]net, and updateslives[.]com after clicking the toast notification, and was presented with a fake update webpage.

Similar to the infection mechanism described by McAfee, the installation path of the malware on disk within C:\Program Files\WindowsApps\ suggests that the users were tricked into installing a malicious Windows application. The Microsoft-Windows-AppXDeploymentServerOperational and Microsoft-Windows-AppxPackagingOperational Windows Event logs contained suspicious entries confirming installation of the malware as a Windows application, as shown in Figures 15-19.

Figure 15: Windows Event Log displaying the reading of the contents of a suspicious application package “3b76099d-e6e0-4e86-bed1-100cc5fa699f_113.0.2.0_neutral__7afzw0tp1da5e”
Figure 16: Windows Event Log displaying the deployment of the application package and the passing of suspicious installation parameters to the application via App Installer, as explained here. (Note: Rapid7’s MDR noticed the value of the ran parameter changed across separate and distinct interactions with the threat actor’s infrastructure, suggesting the ran parameter may be employed for tracking purposes.)
Figure 17: Windows Event Log that appears to show the URI installation parameter being processed by the application via App Installer
Figure 18: Windows Event Log showing validation of the application package’s digital signature. See here for more information about signatures for Windows application packages
Figure 19: Windows Event Log showing successful deployment of the application package

The events in Figures 15-19 illustrate that the malicious Windows application was distributed through the web with App Installer as an MSIX file, oelgfertgokejrgre.msix.
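If you want to hunt for similar activity on your own systems, the relevant AppX deployment and packaging channels can be dumped with wevtutil. The sketch below is a rough triage helper, not part of Rapid7’s tooling: it enumerates the available channels, picks out anything with “AppX” in the name (channel naming varies slightly across Windows builds), and prints the most recent events from each.

```python
import subprocess

# List all event log channels, then dump the newest entries from the AppX-related
# ones so deployments like the package in Figures 15-19 stand out.
channels = subprocess.run(
    ["wevtutil", "el"], capture_output=True, text=True, check=True
).stdout.splitlines()

for channel in channels:
    if "appx" in channel.lower():
        print(f"==== {channel} ====")
        result = subprocess.run(
            ["wevtutil", "qe", channel, "/f:text", "/c:10", "/rd:true"],
            capture_output=True, text=True
        )
        print(result.stdout)
```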

Analysis of oelgfertgokejrgre.msix

Rapid7’s MDR visited chromesupdate[.]com in a controlled environment and discovered that it was hosting a convincing Chrome-update-themed webpage.

Figure 20: Lure hosted at chromesupdate[.]com

The website title, “Google Chrome – Download the Fast, Secure Browser from Google,” was consistent with those we observed of the redirect URLs getredd[.]biz, postsupport[.]net, and updateslives[.]com. The users in our investigations likely arrived at the website in Figure 20 after clicking a malicious toast notification, and proceeded to click the “Install” link presented on the website to initiate the Windows application installation.

The “Install” link presented at the website led to a Windows application installer URL (similar to that seen in Figure 17), which is consistent with MSIX distribution via the web.

Figure 21: Portion of the source code of the webpage hosted at chromesupdate[.]com showing a Windows application installer URL for a malicious MSIX package

Rapid7’s MDR obtained the MSIX file, oelgfertgokejrgre.msix, hosted at chromesupdate[.]com, and confirmed that it was a Windows application package.

Figure 22: Extracted contents of oelgfertgokejrgre.msix
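Since an MSIX package is a ZIP-based archive, its contents can be listed and the manifest pulled out with standard tooling for static review. A minimal sketch, assuming a copy of the sample is available in an isolated analysis environment:

```python
import zipfile

# oelgfertgokejrgre.msix is the package served from chromesupdate[.]com.
# Only handle it inside an isolated analysis VM.
with zipfile.ZipFile("oelgfertgokejrgre.msix") as package:
    for entry in package.infolist():
        print(f"{entry.file_size:>10}  {entry.filename}")
    # AppxManifest.xml carries the spoofed identity and display-name details
    # discussed below.
    manifest = package.read("AppxManifest.xml").decode("utf-8", errors="replace")

print(manifest[:2000])
```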

Analysis of the contents extracted from oelgfertgokejrgre.msix revealed the following notable characteristics and features:

  • Two files, HoxLuSfo.exe and JiLutime.dll, were contained within the HoxLuSfo subdirectory. JiLutime.dll (MD5: 60bb67ebcffed2f406ac741b1083dc80) was a 32-bit Agile .NET obfuscator DLL signed by SecureTeam Software Ltd, likely to (de)obfuscate contents.
  • The AppxManifest.xml file contained more references to the Windows application’s masquerade as a Google Chrome update, as well as details related to its package identity and signature.

Figure 24: Partial contents of AppxManifest.xml

  • The DeroKuilSza.build.appxrecipe file contained strings that referenced a project “DeroKuilSza,” which is likely associated with the malware author.

Figure 25: References to a “DeroKuilSza” project found within DeroKuilSza.build.appxrecipe

Our dynamic analysis of oelgfertgokejrgre.msix provided clarity around the malware’s installation process. Detonation of oelgfertgokejrgre.msix caused a Windows App Installer window to appear, which displayed information about a fake Google Chrome update.

Figure 26: Windows App Installer window showing a fake Google Chrome update installation prompt

The information displayed to the user in Figure 26 is spoofed to masquerade as a legitimate Google Chrome update. The information correlates to the AppxManifest.xml configuration shown in Figure 24.

Once we proceeded with the installation, the MSIX package registered a notification sender via App Installer and immediately presented a notification to launch the fake Google Chrome update.

Figure 27: Registration of App Installer as a notification sender and notification to launch the fake Google Chrome update

Since the malicious Windows application package installed by the MSIX file was not hosted on the Microsoft Store, a prompt is presented (if sideloading is not already enabled) asking the user to allow installation of sideloaded applications from unofficial sources.

Figure 28: Requirement for sideload apps mode to be enabled to proceed with installation

Figure 29: Menu presented to the user to enable sideload apps mode to complete the installation of the malware

The malware needs “Sideload apps” to be enabled in order to complete its installation.

Pulling off the mask

The malware we summarized in this blog post has several tricks up its sleeve. Its delivery via an ad service as a Windows application (which does not leave typical web-based download forensic artifacts behind), its Windows application installation path, and its UAC bypass technique (manipulating an environment variable and a native scheduled task) can go undetected by various security solutions or even by a seasoned SOC analyst. Rapid7’s MDR customers can rest assured that, by leveraging our attacker behavior analytics detection methodology, our analysts will detect and respond to this infection chain before the malware can steal valuable data.

IOCs

Type Indicator
Domain Name updateslives[.]com
Domain Name getredd[.]biz
Domain Name postsupport[.]net
Domain Name eu.postsupport[.]net
Domain Name cleancrack[.]tech
Domain Name s1.cleancrack[.]tech
Domain Name s4.cleancrack[.]tech
Domain Name getblackk[.]biz
Domain Name chromesupdate[.]com
Domain Name fastred[.]biz
Domain Name clickmatters[.]biz
Domain Name takiparkrb[.]site
IP Address 172.67.187[.]162
IP Address 104.21.92[.]68
IP Address 104.21.4[.]200
IP Address 172.67.132[.]99
Directory C:\Program Files\WindowsApps\3b76099d-e6e0-4e86-bed1-100cc5fa699f_113.0.2.0_neutral__7afzw0tp1da5e\HoxLuSfo
Filepath C:\Program Files\WindowsApps\3b76099d-e6e0-4e86-bed1-100cc5fa699f_113.0.2.0_neutral__7afzw0tp1da5e\HoxLuSfo\HoxLuSfo.exe
Filename HoxLuSfo.exe
MD5 1cc0536ae396eba7fbde9f35dc2fc8e3
SHA1 b7ac2fd5108f69e90ad02a1c31f8b50ab4612aa6
SHA256 5dc8aa3c906a469e734540d1fea1549220c63505b5508e539e4a16b841902ed1
Filepath %USERPROFILE%\AppData\Local\Microsoft\OneDrive\setup\st.exe
Filename st.exe
Registry Value + Registry Data HKCU\Environment.%windir% –> %LOCALAPPDATA%\Microsoft\OneDrive\setup\st.exe
Filename oelgfertgokejrgre.msix
MD5 6860c43374ad280c3927b16af66e3593
SHA1 94658e04988b02c395402992f46f1e975f9440e1
SHA256 0a127dfa75ecdc85e88810809c94231949606d93d232f40dad9823d3ac09b767


Hands-On IoT Hacking: Rapid7 at DefCon IoT Village, Part 2

Post Syndicated from Deral Heiland original https://blog.rapid7.com/2021/10/28/hands-on-iot-hacking-rapid7-at-defcon-29-iot-village-pt-2/


In our last post, we discussed how we set up Rapid7’s hands-on exercise at the Defcon 29 IoT Village. Now, with that foundation laid, we’ll get into how to determine whether the header we created is UART.

When trying to determine the baud rate for IoT devices, I often just guess. Generally, for typical IoT hardware, the baud rate is going to be one of the following:

  • 9600
  • 19200
  • 38400
  • 57600
  • 115200

Typically, 115200 and 57600 are the most commonly encountered baud rates on consumer-grade IoT devices. Other settings that need to be made are data bits, stop bits, and parity bits. Typically, these will be set to the following standard defaults, as shown in Figure 5:

Figure 5: Logic 2 Async Serial Decoder Settings

Once all the correct settings have been determined, and if the test point is UART, then the decoder in the Logic 2 application should decode the bit stream and reveal console text data for the device booting up. An example of this is shown in Figure 6:

Figure 6: UART Decode of Channel 1

FTDI UART Setup

Once you’ve properly determined the header is UART and identified the transmit, receive, and ground pins, you can hook up a USB-to-UART FTDI adapter and start analyzing and hacking on the IoT device. During the IoT Village exercise, we used a Shikra for the UART connection. Unfortunately, the Shikra appears to no longer be available, but any USB-to-UART FTDI device supporting 3.3 VDC can be used for this exercise. However, I do recommend purchasing a multi-voltage FTDI device if possible. It’s common to encounter IoT devices that require 1.8, 3.3, or 5 VDC, so having a product that supports these voltage levels is the best solution.

The software we used to connect to the FTDI device for the exercise at Defcon IoT Village was GtkTerm running on Ubuntu Linux — but again, any terminal software that supports a TTY terminal connection will work for this. For example, I have also used CoolTerm and PuTTY, which both work fine. So just find the terminal software that works best for you and substitute it for what is referenced here.
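If you prefer a scripted connection over a GUI terminal, pyserial can read the same console and write everything to a log file for later review. This is a minimal sketch under the same assumptions as the GtkTerm setup that follows: the adapter enumerates as /dev/ttyUSB0 and the device console runs at 115200 baud.

```python
import serial  # pyserial

# Open the FTDI adapter, echo console output to the screen, and append a copy
# to console.log for later review.
with serial.Serial("/dev/ttyUSB0", 115200, timeout=1) as console, \
        open("console.log", "ab") as log:
    print("Power on the device; press Ctrl+C to stop.")
    try:
        while True:
            data = console.read(4096)
            if data:
                log.write(data)
                print(data.decode("utf-8", errors="replace"), end="", flush=True)
    except KeyboardInterrupt:
        pass
```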

The next step is to attach the Shikra (pin out) or whatever brand of FTDI USB device you’re using to the UART header on Luma (Figure 7) using this table:

Shikra Pins Luma Header J19 Pins
Pin 1 TXT Pin 2 RCV
Pin 2 RCV Pin 3 TXT
Pin 18 GND Pin 1 GND

Figure 7: LUMA UART PINOUT

Once the UART to USB device is connected to the LUMA, double-click on the GtkTerm icon located on the Linux Desktop and configure the application by selecting Configuration on the menu bar followed by Port in the drop-down menu. From there, set the Port (/dev/ttyUSB0) and Baud Rate (115200) to match the figure below (Figure 8), and click OK.

Figure 8: Serial Port Settings

Once configured, power on the LUMA device. At this point, you should start to see the device’s boot process logged to the UART console. For the Defcon IoT hacking exercise, we had preconfigured the devices to disable the console, so once U-Boot loaded and started the system kernel image, the console became disabled, as shown in Figure 9:

Figure 9: Console Stops at Kernel Starting

We made these changes so the attendees working on the exercise would experience a common setting often encountered, where the UART console is disabled during the booting process, and they’d have the chance to conduct another common attack that would allow them to break out of this lockdown.

For example, during the boot sequence, it’s often possible to force the device to break out of the boot process and drop into a U-Boot console. For standard U-Boot, this will often happen when the kernel image is inaccessible, causing the boot process to error out and drop into a U-Boot console prompt. This condition can sometimes be forced by shorting the data line (serial out) from the flash memory chip containing the kernel image to ground during the boot process. This prevents the boot process from loading the kernel into memory. Figure 10 shows a pin-out image of the flash memory chip currently in use on this device. The data out from the flash memory chip is Serial Out (SO) on pin 2.

Figure 10: Flash Memory Pinout

Also, I would like to note that during the Defcon IoT Village exercises, I had a conversation with several like-minded IoT hackers who said that they typically do this same attack but use the clock pin (SCLK). So, that is another viable option when conducting this type of attack on an IoT device to gain access to the U-Boot console.

During our live exercises at Defcon IoT Village, to help facilitate the process of grounding the data line Pin 2 Serial Out (SO) — and to avoid ending up with a bunch of dead devices because of accidentally grounding the wrong pins — we attached a lead from pin 2 of the flash memory chip, as shown in Figure 11:

Figure 11: Pin Glitch Lead Connected to Flash

To conduct this “pin glitch” attack and gain access to the U-Boot console, you will need to first power down the device. Then, restart the device by powering it back on while monitoring the UART console for U-Boot to start loading. Once you see U-Boot loading, hold the shorting lead against the metal shielding or some other point of ground within the device, as shown in Figure 12:

Figure 12: Short Pin2 Serial Out (SO) to Ground

Shorting this pin or the clock pin to ground will prevent U-Boot from being able to load the kernel. If your timing is accurate, you should be successful and now see the U-Boot console prompt (IPQ40xx), as shown below in Figure 13. Once you see this prompt, you can lay the shorting lead to the side. If the prompt does not show up, you will need to repeat the process.

Figure 13: U-Boot Console Prompt

With the LUMA device used in this example, this attack is more forgiving and easier to carry out successfully. The main reason, in my opinion, is that the U-Boot image and the kernel image are on separate flash memory chips. In my experience, this seems to cause more of a delay between U-Boot loading and kernel loading, allowing a longer window of time for the pin glitch to succeed.

Alter U-boot environment variables

U-Boot environment variables are used to control the boot process of the device. During this phase of the exercise, we used the following three U-Boot console commands to view, alter, and save changes to the U-Boot environment variables in order to re-enable the console, which we had disabled before the exercise.

  • “Printenv” is used to list the current environment variable settings.
  • “Setenv” is used to create or modify environment variables.
  • “Saveenv” is used to write the environment variables back to memory so they are permanent.

Once connected to the U-Boot console, use the “printenv” command to view the device’s configured environment variables. This command will return output that looks like Figure 14 below. Scrolling down and viewing the environment settings will reveal a lot about how the device’s boot process is configured. In the case of the Defcon IoT Village exercise, we had attendees pay close attention to the bootargs variable, because this is where the console was disabled.

Figure 14: printenv

With a closer look at the bootargs variable as shown below in Figure 15, we can see that the console had been set to off. This is the reason the UART console halted during the boot process once the kernel was loaded.

Figure 15: bootargs Environment Variable Setting

In our third post, we’ll cover the next phase of our IoT Village exercise: turning the console back on and achieving single-user mode. Check back with us next week!


Recog: Data Rules Everything Around Me

Post Syndicated from Matthew Kienow original https://blog.rapid7.com/2021/10/25/recog-data-rules-everything-around-me/


The recog project — a recognition framework used to identify products, operating systems, and hardware through matching network probe data against its extensive fingerprint collection — has been around for many years. In the beginning, Rapid7 used it internally as part of the Nexpose vulnerability scanner. Then, in 2014, the fingerprints and Ruby implementation of the framework were released as open-source software, in keeping with Rapid7’s continued commitment to open-source initiatives. Later, in 2018, we released a Java implementation of the framework, recog-java, as open-source, and later that year, Rumble released a Go implementation of the framework, recog-go.

Still, there remained one problem to solve with the framework: balancing the roles of content and code. In recog, three different language implementations, with varying levels of feature parity, all support the most basic requirements of processing the XML fingerprint data, matching input data against the fingerprint collection and returning a collection of enrichment parameters, both static and dynamic. The value of these implementations (the code) isn’t fully realized without being combined with the fingerprint data (the content).
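For readers new to the project, the core idea is small enough to show as a toy: a fingerprint pairs a regular expression with enrichment parameters, some static and some captured from the match. The sketch below is purely illustrative Python, not recog’s actual API or fingerprint format, and the parameter names are hypothetical.

```python
import re

# A toy "fingerprint": a pattern plus static parameters, with dynamic values
# captured from named groups in the match.
fingerprint = {
    "pattern": re.compile(r"^OpenSSH_(?P<version>[\d.]+)"),
    "params": {"service.family": "OpenSSH", "service.vendor": "OpenBSD"},
}

def match(banner: str):
    m = fingerprint["pattern"].match(banner)
    if m is None:
        return None
    return {**fingerprint["params"], "service.version": m.group("version")}

print(match("OpenSSH_8.4p1 Debian-5"))
# -> {'service.family': 'OpenSSH', 'service.vendor': 'OpenBSD', 'service.version': '8.4'}
```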

However, the Ruby implementation is clearly an outlier, since it stores the framework code alongside the fingerprint data. The problem of content versus code would not be as great of a concern if there were only one language implementation — but instead, we have three, and there have been recent conversations about the possibility of a fourth!

Solving the content vs. code conundrum

Carving off the Ruby implementation from the existing repository would leave the content while creating a consistent structure between all language implementations. Since this act would also remove the fingerprint testing performed by the Ruby implementation, it provides an opportunity to assess fingerprint verification across all recog implementations.

In the past, there were delayed reports of issues discovered between the different regular expression engines used in other language implementations after fingerprint pull requests were merged. Prevention required either the contributor or maintainer to verify fingerprint changes against the Java and Go implementations, and while the Go implementation has a verify tool, this was missing from Java.

In order to facilitate future content separation, the Java implementation would need a fingerprint verification tool. This was not as straightforward, since the Java library neither retained the data parsed from the fingerprint examples nor interpolated all parameters. But after some modifications to the `parse` and `match` methods, I was able to remove these impediments. I created an implementation of the recog fingerprint verification tool that matches both the features and behaviors of the Ruby tool as a new module within the Java implementation.

The final step is automation, which will allow contributors and maintainers to efficiently process fingerprint content changes and focus on the correctness of the regular expressions and enrichment parameters. This helps alleviate concerns around any issues with one or more of the language implementations.

I created a new GitHub Actions verify workflow for this purpose. The initial workflow simply runs the `recog_standardize` tool to ensure each fingerprint asserts known identifiers. The latest update to the workflow adds jobs, in which each language implementation’s fingerprint verification tool runs against any updated fingerprint XML files. The verify workflow provides necessary feedback to contributors and maintainers, improving the content modification process.

View of successful verify workflow

These steps are the first of more to come that will aid users, contributors, and maintainers of the recog recognition framework project. Recog content and language implementations form a component within other projects in the information security domain.

Recog is often used as a component in large projects, and we have plans for additional tooling to make the framework more directly usable for end users. As recog develops and grows, the Rapid7 team looks forward to watching projects built on top of it develop and grow.


2022 Planning: Designing Effective Strategies to Manage Supply Chain Risk

Post Syndicated from Jesse Mack original https://blog.rapid7.com/2021/10/22/2022-planning-designing-effective-strategies-to-manage-supply-chain-risk/


Supply chains are on everyone’s mind right now — from consumer-tech bottlenecks to talks of holiday-season toy shortages. Meanwhile, cyberattacks targeting elements of the supply chain have become increasingly common and impactful — making this area of security a top priority as organizations ensure their digital defense plans are ready for 2022.

Here’s the thing, though: Supply chains are enormously complex, and securing all endpoints in your partner ecosystem can be a herculean challenge.

On Thursday, October 21, 2 members of Rapid7’s Research team — Erick Galinkin, Principal Artificial Intelligence Researcher, and Bob Rudis, Chief Security Data Scientist — sat down to get the perspectives of 2 industry panelists: Loren Morgan, VP of Global IT Operations, Infrastructure and Delivery at Owens & Minor; and Dan Walsh, CISO at VillageMD. They discussed the dynamics of supply chain security, how they think about vendor risk, and what they’re doing to tackle these challenges at their organizations.




Head to our 2022 Planning series page for more – full replay available soon!

What is supply chain risk, anyway?

The conversation kicked off with a foundational question: What do we mean when we talk about supply chain risk? The answer here is particularly important, given how sprawling and multivariate modern-day supply chains have become.

Dan defined the concept as “the risk inherent in the way we deliver business results.” For example, you might be working with a solutions provider whose software relies on open-source libraries, which could introduce vulnerabilities. The impact can be particularly high when a vendor your organization relies on in a strategic, business-critical capacity experiences a security issue.

Bob noted that the nature of supply chain risk hasn’t fundamentally changed in the past decade-plus — what’s different today is the scale of the problem. That includes not only the size of supply chains themselves but also the magnitude of the risks, as attacks increase in frequency and scope.

For Loren, acknowledging and acting on these growing risks means asking a central question: How are our partners investing in their own defenses? And further, how can we get visibility into the actions our vendors are taking to counteract their vulnerabilities?

Dropping the SBOM

Erick pointed out that one of the more practical ways of achieving visibility with technology vendors is the software bill of materials (SBOM). An SBOM is a list of all the libraries, dependencies, third-party modules, and other components that a provider brings into their software product.

“It’s like an ingredient list on a package of food,” Dan said. Because of the level of detail it provides, an SBOM can offer much greater insight into vulnerabilities than a compliance certification like SOC2 would.

“Ultimately, from our vendors, what we’re looking for is trust,” Dan noted. The visibility an SBOM provides can go a long way toward achieving that trust.

But not all vendors might jump at the request to produce an SBOM. And how do you know the SBOM is fully accurate and complete? The cloud complicates the picture considerably, too.

“A SaaSBOM is a lot trickier,” Erick noted. With fully cloud-based applications, verifying what’s in an SBOM becomes a much tougher task. And cloud misconfigurations have become an increasingly prominent source of vulnerabilities — especially as today’s end users are leveraging an array of easy-to-use SaaS tools and browser extensions, multiplying the potential points of risk.

Dan suggested that in the future, the industry might move to an ABOM — a highly memorable shorthand for “application bill of materials” — which would include all source code, infrastructure, and other key components that make an application tick. This would help provide a deeper level of visibility and trust when evaluating the risks inherent in the ever-growing lists of applications that enterprises rely on in today’s cloud-first technology ecosystem.

Taking action

So, what key concepts and practices should you implement as you put together a 2022 cybersecurity plan that factors in supply chain risk? Here are a few suggestions our panel discussed.

  • Invest in talent: “Find somebody who’s been there, done that,” Loren urged. Having experienced people on board who can stand up a third-party risk assessment program and handle everything it entails — from interviewing vendors to reviewing SBOMs and other artifacts — can help make this complex task more manageable.
  • Tailor scrutiny by vendor: Not all third parties carry the same level of risk, primarily because of the type of data they access. Accordingly, your vetting process should reflect the vendor you’re evaluating and the specific level of risk associated with them. This will save time and energy when evaluating partners who don’t introduce as much risk and ensure the higher-risk vendors get the appropriate level of scrutiny. In Dan’s work at VillageMD, for example, private health information (PHI) is the most critical type of data that needs the highest security, so vendors handling PHI need to be more rigorously vetted.
  • Think about your internal supply chain: As Bob pointed out, virtually all organizations today are doing some amount of development — whether they’re a full-on software provider or simply building their own website. That means we’re all susceptible to introducing the same kinds of vulnerabilities that our vendors might, impacting not just our own security but our customers’ as well. For example, what happens if a developer introduces a vulnerable component into your product’s source code? Or what if your DevOps team introduced a misconfiguration? Does your security operations team have a clear way to know that? Be sure to put guardrails in place by establishing a foundational software development life cycle (SDLC) process for all areas where you’re doing development.
  • Identify your no-go’s: Each of our panelists also had a few things they considered make-or-break when it comes to vendor assessments — requests that, if not met, would sink any conversation with a potential partner. For Bob, it was a vendor’s ability to supply a penetration test with complete findings. Loren echoed this, and also said he insists that partners share their data handling processes. For Dan, it was the right to audit the vendor and their software annually. Identify what these no-go’s are for your organization, and build them into vendor conversations and contracts.

Ultimately, holding your vendors accountable is the most important step you can take in the effort to build a secure supply chain.

“It’s incumbent on consumers to hold their vendors’ feet to the fire and say, ‘How are you doing this?'” Erick commented. Demand real data and clear documentation rather than vague responses. When we do this for our own organizations, we make each other safer by demanding more of vendors and raising the bar for security across the supply chain.

Stay tuned for the next 2 installments in our 2022 Planning webcast series! Next up, we’ll be discussing the path to effective cybersecurity maturity and how to factor that journey into your 2022 cybersecurity program. Sign up today!

Hands-On IoT Hacking: Rapid7 at DefCon IoT Village, Part 1

Post Syndicated from Deral Heiland original https://blog.rapid7.com/2021/10/21/hands-on-iot-hacking-rapid7-at-defcon-iot-village-pt-1/


This year, Rapid7 participated in the IoT Village at DefCon 29 by running a hands-on hardware hacking exercise, with the goal of exposing attendees to concepts and methods for IoT hacking. Over the years, these exercises have covered several different embedded device topics, including how to use a logic analyzer, extract firmware, and gain root access to an embedded IoT device.

This year’s exercise focused on the latter and covered the following aspects:

  • Interaction with Universal Asynchronous Receiver Transmitter (UART)
  • Escaping the boot process to gain access to a U-Boot console
  • Modification of U-Boot environment variables
  • Monitoring system console during boot process for information
  • Accessing failsafe (single-user mode)
  • Mounting UBIFS partitions
  • Modifying file system for root access

While at DefCon, we had many IoT Village attendees request a copy of our exercise manual, so I decided to create a series of in-depth write-ups about the exercise we ran there, with better explanation of several of these key topic areas. Over the course of four posts, we’ll detail the exercise and add some expanded context to answer several questions and expand on the discussion we had with attendees at this year’s DefCon IoT Village.

The device we used in our exercise was a Luma Mesh WiFi device. The only change I made to the Luma devices for the exercise was to modify the U-Boot environment variables and add console=off to the bootargs variable to disable the console. I did this to add more complexity to the exercise and show a state that is often encountered.

Identify UART

One of the first steps in gaining root access to an IoT device is to identify possible entry points, such as a UART connection. In the case of our exercise, we performed this ahead of time by locating the UART connection and soldering a 2.54 mm header onto the board. This helped streamline the exercise, so attendees could complete it in a reasonable timeframe. However, the typical method to do this is to examine the device’s circuit board looking for an empty header, as in the example shown in Figure 1:

Hands-On IoT Hacking: Rapid7 at DefCon IoT Village, Part 1
Figure 1: Common 4 port 2.54mm header

This example shows a 4-port header. Although 4-port headers are common for UART, it is not always the rule. UART connections can be included in larger port headers or may not have an exposed header at all. So, when you find a header that you believe to be UART, you’ll need to validate it.

To do this, we first recommend soldering male pins into the exposed socket. This will allow easier connectivity of test equipment. An example of this is shown in Figure 2:

Hands-On IoT Hacking: Rapid7 at DefCon IoT Village, Part 1
Figure 2: Soldered 2.54mm header

Once you’ve installed a header, I recommend using a logic analyzer to examine the connection for UART data. There are many different logic analyzers available on the market, ranging in price from around $12 or $15 to hundreds of dollars. In my case, I prefer using a Saleae logic analyzer.

The next step is to identify whether any of the header pins are ground. To do this, first make sure the device is powered off. Then, set a multimeter to continuity check and attach the ground lead (black) to one of the metal shields covering various components on the circuit board, or to one of the screws used to hold the circuit board in the case — both are often electrically ground.

Next, touch each pin in the header with the positive lead (red) until the multimeter beeps. This indicates which pin is electrically ground. Once you’ve identified ground, you can attach the logic analyzer’s ground to that header pin and then connect the logic channel leads to the remaining pins, as shown in Figure 3:

Hands-On IoT Hacking: Rapid7 at DefCon IoT Village, Part 1
Figure 3: Logic Analyzer hooked up

Once hooked up, make sure the appropriate analyzer software is installed and running. In my case, I used Saleae’s Logic2. You can then power on the device and capture data on this header to analyze and identify:

  • Whether or not this header is UART
  • What the baud rate is
  • Which pin is transmit
  • Which pin is receive

As shown in the capture example in Figure 4, I captured 30 seconds of data on channels 0 and 1 during power-up of the device. Here, we can see that data is present on channel 1, which indicates that channel 1, if this header is determined to be UART, is most likely connected to the transmit pin. Since we are not sending any data to the device, channel 0 shows nothing, indicating it is most likely the receive pin.

Hands-On IoT Hacking: Rapid7 at DefCon IoT Village, Part 1
Figure 4: Logic-2 Capture 30 seconds

The next step is to make a final determination: is this actually a UART header? And if so, what is the baud rate?
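As a preview of one common approach (not the method from the exercise manual itself): if you export the captured channel as timestamped transitions, the width of the narrowest pulse approximates one bit time, and its inverse is roughly the baud rate. Below is a minimal sketch in Python, assuming a hypothetical CSV export with a time column and a 0/1 level column for the suspected transmit channel; the column names are placeholders and will vary by analyzer software.

```python
import csv

# Common UART baud rates to snap the estimate to.
STANDARD_BAUDS = [9600, 19200, 38400, 57600, 115200, 230400, 460800, 921600]

def estimate_baud(csv_path, time_col="Time [s]", data_col="Channel 1"):
    """Estimate the baud rate from a digital capture exported as CSV.

    Assumes each row carries a timestamp and a 0/1 level for the suspected
    TX channel. The shortest time the line stays at one level approximates
    a single bit period.
    """
    edges = []
    last_level = None
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            t, level = float(row[time_col]), int(float(row[data_col]))
            if last_level is None or level != last_level:
                edges.append(t)        # record the time of each transition
                last_level = level

    # Width of the narrowest pulse between consecutive transitions.
    min_pulse = min(b - a for a, b in zip(edges, edges[1:]) if b > a)
    raw_baud = 1.0 / min_pulse
    # Snap to the closest standard rate (115200 is typical for U-Boot consoles).
    return min(STANDARD_BAUDS, key=lambda b: abs(b - raw_baud))

# Example usage: print(estimate_baud("capture.csv"))
```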

We’ll cover this and the subsequent steps in our next post. Check back next week for more!

Introducing Cloudflare’s Technology Partner Program

Post Syndicated from Matt Lewis original https://blog.cloudflare.com/technology-partner-program/

Introducing Cloudflare’s Technology Partner Program

The Internet is built on a series of shared protocols, all working in harmony to deliver the collective experience that has changed the way we live and work. These open standards have created a platform such that a myriad of companies can build unique services and products that work together seamlessly. As a steward and supporter of an open Internet, we aspire to provide an interoperable platform that works with all the complementary technologies that our customers use across their technology stack. This has been the guiding principle for the multiple partnerships we have launched over the last few years.  

One example is our Bandwidth Alliance — launched in 2018, this alliance with 18 cloud and storage providers aims to reduce egress fees, also known as data transfer fees, for our customers. The Bandwidth Alliance has broken the norms of the cloud industry so that customers can move data more freely. Since then, we have launched several technology partner programs with more than 40 partners, including:

  • Analytics — Visualize Cloudflare logs and metrics easily, and help customers better understand events and trends from websites and applications on the Cloudflare network.
  • Network Interconnect — Partnerships with best-in-class Interconnection platforms offer private, secure, software-defined links with near instant-turn-up of ports.
  • Endpoint Protection Partnerships — With these integrations, every connection to our customer’s corporate application gets an additional layer of identity assurance without the need to connect to VPN.
  • Identity Providers — Easily integrate your organization’s single sign-on provider and benefit from the ease-of-use and functionality of Cloudflare Access.
Introducing Cloudflare’s Technology Partner Program

These partner programs have helped us serve our customers better alongside our partners with our complementary solutions. The integrations we have driven have made it easy for thousands of customers to use Cloudflare with other parts of their stack.

We aim to continue expanding the Cloudflare Partner Network to make it seamless for our customers to use Cloudflare. To support our growing ecosystem of partners, we are excited to launch our Technology Partner Program.

Announcing Cloudflare’s Technology Partner Program

Cloudflare’s Technology Partner Program facilitates innovative integrations that create value for our customers, our technology partners, and Cloudflare. Our partners not only benefit from technical integrations with us, but also have the opportunity to drive sales and marketing efforts to better serve mutual customers and prospects.

This program offers a guiding structure so that our partners can benefit across three key areas:

  • Build with Cloudflare: Sandbox access to Cloudflare enterprise features and APIs to build and test integrations. Opportunity to collaborate with Cloudflare’s product teams to build innovative solutions.
  • Market with Cloudflare: Develop joint solution briefs and host joint events to drive awareness and adoption of integrations. Leverage a range of our partners’ tools and resources to bring our joint solutions to market.
  • Sell with Cloudflare: Align with our sales teams to jointly target relevant customer segments across geographies.

Technology Partner Tiers

Depending on the maturity of the integration and fit with Cloudflare’s product portfolio, we have two types of partners:

  • Strategic partners: Strategic partners have mature integrations across the Cloudflare product suite. They are leaders in their industries and have a significant overlap with our customer base. These partners are strategically aligned with our sales and marketing efforts, and they collaborate with our product teams to bring innovative solutions to market.
  • Integration partners: Integration partners are early participants in Cloudflare’s partnership ecosystem. They already have or are on a path to build validated, functional integrations with Cloudflare. These partners have programmatic access to resources that will help them experiment with and build integrations with Cloudflare.

Work with Us

If you are interested in working with our Technology Partnerships team to develop and bring to market a joint solution, we’d love to hear from you!  Partners can complete the application on our Technology Partner Program website and we will reach out quickly to discuss how we can help build solutions for our customers together.

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement

Post Syndicated from Vasilis Giotsas original https://blog.cloudflare.com/cdn-latency-passive-measurement/

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement

Something that comes up a lot at Cloudflare is how well our network and systems are performing. Like many service providers, we are engaged in a constant process of introspection, evaluating aspects of Cloudflare’s service with respect to customers, within our own network and systems, and, as was the case in a recent blog post, in the clients themselves (such as web browsers). Many of these questions are obvious to ask, but answering them is decisive in opening paths to new and improved services. The important point here is that it’s relatively straightforward to monitor and assess aspects of our service we can see or measure directly.

However, for certain aspects of our performance we may not have access to the necessary data, for a number of reasons. For instance, the data sources may be outside our network perimeter, or we may avoid collecting certain measurements that would violate the privacy of end users. In particular, the questions below are important to gain a better understanding of our performance, but harder to answer due to limitations in data availability:

  • How much better (or worse!) are we doing compared to other service providers (CDNs) by being in certain locations?
  • Can we know “a priori” where new data centers would deliver the greatest improvement, and which locations might degrade service?

The last question is particularly important because it requires the predictive power of synthesising available network measurements to model and infer network features that cannot be directly observed. For such predictions to be informative and meaningful, it’s critical to distill our measurements in a way that illuminates the interdependence of network structure, content distribution practices and routing policies, and their impact on network performance.

Active measurements are inadequate or unavailable

Measuring and comparing the performance of Content Distribution Networks (CDNs) is critical to understanding the level of service offered to end users, detecting and debugging network issues, and planning the deployment of new network locations. Measuring our own existing infrastructure is relatively straightforward, for example, by collecting DNS and HTTP request statistics received at each one of our data centers.

But what if we want to understand and evaluate the performance of other networks? Understandably, such data is not shared among networks due to privacy and business concerns. An alternative to data sharing is direct observation with what are called “active measurements.” An example of active measurement is when a measuring tape is used to determine the size of a room — one must take an action to perform the measurement.

Active measurements from Cloudflare data centers to other CDNs, however, don’t say much about the client experience. The only way to actively measure CDNs is by probing from third-party points of view, namely some type of end-client or globally distributed measurement platform. For example, ping probes might be launched from RIPE Atlas clients to many CDNs; alternatively, we might rely on data obtained from Real User Measurements (RUM) services that embed JavaScript requests into various services and pages around the world.

Active measurements are extremely valuable, and we heavily rely on them to collect a wide range of performance metrics. However, active measurements are not always reliable. Consider ping probes from RIPE Atlas. A collection of direct pings is most assuredly accurate. The weakness is that the distribution of its probes is heavily concentrated in Europe and North America, and it offers very sparse coverage of Autonomous Systems (ASes) in other regions (Asia, Africa, South America). Additionally, the distribution of RIPE Atlas probes across ASes does not reflect the distribution of users across ASes; university networks, hosting providers, and enterprises are overrepresented in the probe population.

Similarly, data from third-party Real User Measurements (RUM) services has weaknesses too. RUM platforms compare CDNs by embedding JavaScript request probes in websites visited by users all over the world. This sounds great, except the data cannot be validated by outside parties, which is an important aspect of measurement. For example, consider the following chart, which shows CloudFront’s median Round-Trip Time (RTT) in Greece as measured by the two leading RUM platforms, Cedexis and Perfops. While both platforms follow the same measurement method, their results for the same time period and the same networks differ considerably. If the two sets of measurements for the same thing differ, then neither can be relied upon.

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement
Comparison of Real User Measurements (RUM) from two leading RUM providers, Cedexis and Perfops. While both RUM datasets were collected during the same period for the same location, there is a pronounced disparity between the two measurements which highlights the sensitivity of RUM data on specific deployment details.

Ultimately, active measurements are always limited to and by the things that they directly see. Simply relying on existing measurements does not in and of itself translate to predictive models that help assess the potential impact of infrastructure and policy changes on performance. However, when the biases of active measurements are well understood, they can do two things really well: inform our understanding, and help validate models of our understanding of the world — and we’re going to showcase both as we develop a mechanism for evaluating CDN latencies passively.

Predicting CDNs’ RTTs with Passive Network Measurements

So, how might we measure without active probes? We’ve devised a method to understand latency across CDNs by using our own RTT measurements. In particular, we can use these measurements as a proxy for estimating the latency between clients and other CDNs. With this technique, we can understand latency to locations where CDNs have deployed their infrastructure, as well as show performance improvements in locations where one CDN exists but others do not. Importantly, we have validated the assumptions shown below through a large-scale traceroute and ping measurement campaign, and we’ve designed this technique so that it can be reproduced by others. After all, independent validation is important across measurement communities.

Step 1. Predicting Anycast Catchments

The first step in RTT inference is to predict the anycast catchments, namely predict the set of data centers that will be used by an IP. To this end, we compile the network footprint of each CDN provider whose performance we want to predict, which allows us to predict the CDN location where a request from a particular client AS will arrive. In particular, we collect the following data:

  • List of ISPs that host off-net server caches of CDNs, using the methodology and code developed in the Gigis et al. paper.
  • List of on-net city-level data centers according to PeeringDB, the network maps in the websites of each individual CDN, and IP geolocation measurements.
  • List of Internet eXchange Points (IXPs) where each CDN is connected, in conjunction with the other ASes that are also members of the same IXPs, from IXP databases such as PeeringDB, the Euro-IX IXP-DB, and Packet Clearing House.
  • List of CDN interconnections to other ASes extracted from BGP data collected from RouteViews and RIPE RIS.

The figure below shows the IXP connections for nine CDNs, according to the above-mentioned datasets. Cloudflare is present in 258 IXPs, which is 56 IXPs more than Google, the second CDN in the list.

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement
Heatmap of IXP connections per country for 9 major service providers, according to data from PeeringDB, Euro-IX and Packet Clearing House (PCH) for October 2021.

With the above data, we can compute the possible paths between a client AS and the CDN’s data centers and infer the Anycast Catchments using techniques similar to the recent papers by Zhang et al. and Sermpezis and Kotronis, which predict paths by reproducing the Internet’s inter-domain routing policies. For CDNs that use BGP-based anycast, we can predict which data center will receive a request based on the possible routing paths between the client and the CDN. For CDNs that rely on DNS-based redirection, we don’t infer a catchment directly; instead, we first predict the latency to each data center and select the lowest-latency one, assuming the CDN’s traffic steering succeeds in directing clients to their fastest location.

The challenge in predicting paths emanates from the incomplete knowledge of the varying routing policies implemented by individual ASes, which are either hosting web clients (for instance an ISP or an enterprise network), or are along the path between the CDN and the client’s network. However, in our prediction problem, we can already partition the IP address space to Anycast Catchment regions (as proposed by Schomp and Al-Dalky) based on our extensive data center footprint, which allows us to reverse engineer the routing decisions of client ASes that are visible to Cloudflare. That’s a lot to unpack, so let’s go through an example.

Example

First, assume that an ISP has two potential paths to a CDN: one over a transit provider and one through a direct peering connection over an IXP, and each path terminates at a different data center, as shown in the figure below. In the example below, routing through a transit AS incurs a cost, while IXP peering links do not incur transit exchange costs. Therefore, we would predict that the client ISP would use the path to data center 2 through the IXP.

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement
A client ISP may have paths to multiple data centers of a CDN. The prediction of which data center will eventually be used by the client, the so-called anycast catchment, combines both topological and routing policy data.
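To make the routing-preference intuition concrete, here is a toy sketch of catchment prediction that prefers peering (IXP or private interconnect) paths over transit paths, breaking ties by AS-path length. The data structures and route attributes below are hypothetical simplifications; the real inference reproduces the full inter-domain routing policy models from the papers cited above.

```python
from dataclasses import dataclass

@dataclass
class CandidateRoute:
    data_center: str      # CDN data center the path terminates at
    relationship: str     # how the client AS reaches the CDN: "peer" or "transit"
    as_path_len: int      # number of AS hops on the path

# Typical local-preference ordering: settlement-free peering beats paid transit.
PREFERENCE = {"peer": 0, "transit": 1}

def predict_catchment(routes: list[CandidateRoute]) -> str:
    """Pick the data center a client AS is expected to reach (toy model)."""
    best = min(routes, key=lambda r: (PREFERENCE[r.relationship], r.as_path_len))
    return best.data_center

# The example above: a transit path to data center 1 vs. an IXP peering path
# to data center 2 -> the peering path wins, so data center 2 is the catchment.
routes = [
    CandidateRoute("DC1", "transit", 2),
    CandidateRoute("DC2", "peer", 2),
]
print(predict_catchment(routes))  # DC2
```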

Step 2. Predicting CDN Path Latencies

The next step is to estimate the RTT between the client AS and the corresponding CDN location. To this end, we utilize passive RTT measurements from Cloudflare’s own infrastructure. For each of our data centers, we calculate the median TCP RTT for each IP /24 subnet that sends us HTTP requests. We then assume that a request from a given IP subnet to a data center that is common between Cloudflare and another CDN will have a comparable RTT (our approach focuses on the performance of the anycast network and omits host software differences). This assumption is generally true, because the distance between the two endpoints is the dominant factor in determining latency.

Note that the median RTT is selected to represent client performance. In contrast, the minimum RTT is an indication of closeness to clients, not of expected performance.

Our approach to estimating latencies is similar to the work of Madhyastha et al., who combined the median RTT of existing measurements with a path prediction technique informed by network topologies to infer end-to-end latencies that cannot be measured directly. While that work reported an accuracy of 65% for arbitrary ASes, we focus on CDNs which, on average, have much shorter paths (most clients are within one AS hop), making the path prediction problem significantly easier (as noted by Chiu et al. and Singh and Gill). Also note that for the purposes of RTT estimation, it’s important to predict which CDN data center the request from a client IP will use, not the actual hops along the path.

Example

Assume that for a certain IP subnet used by AS3379 (a Greek ISP), the following table shows the median RTT for each Cloudflare data center that receives HTTP requests from that subnet. Note that while requests from an IP typically land at the nearest data center (Athens in that case), some requests may arrive at different data centers due to traffic load management and different service tiers.

| Data Center | Athens | Sofia | Milan | Frankfurt | Amsterdam |
| --- | --- | --- | --- | --- | --- |
| Median RTT | 22 ms | 42 ms | 43 ms | 70 ms | 75 ms |

Assume that another CDN B does not have data centers or cache servers in Athens or Sofia, but only in Milan, Frankfurt, and Amsterdam. Based on the topology and colocation data of CDN B, we predict the Anycast Catchment and find that for AS3379 the data center in Frankfurt will be used. In that case, we use the corresponding latency as the estimate of the median latency between CDN B and the given prefix.
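A minimal sketch of this inference step, using the numbers from the table above. The per-prefix RTT map and the CDN footprint are hypothetical inputs here; in the real pipeline they come from Cloudflare’s passive measurements and the topology data described in Step 1.

```python
# Median RTT (ms) observed from one /24 of AS3379 to Cloudflare data centers.
rtt_by_dc = {"Athens": 22, "Sofia": 42, "Milan": 43, "Frankfurt": 70, "Amsterdam": 75}

# Cities where the other CDN ("CDN B") has data centers or caches.
cdn_b_footprint = {"Milan", "Frankfurt", "Amsterdam"}

def estimate_rtt(rtt_by_dc, footprint, predicted_catchment=None):
    """Estimate a client prefix's RTT to another CDN.

    If an Anycast Catchment was predicted (BGP-anycast CDNs), use that data
    center's RTT; for DNS-steered CDNs, assume the CDN sends clients to the
    lowest-latency location it operates.
    """
    common = {dc: rtt for dc, rtt in rtt_by_dc.items() if dc in footprint}
    if predicted_catchment in common:
        return common[predicted_catchment]
    return min(common.values())

print(estimate_rtt(rtt_by_dc, cdn_b_footprint, predicted_catchment="Frankfurt"))  # 70
```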

The above methodology works well because Cloudflare’s global network lets us collect network measurements between 63,832 ASes (virtually every AS that hosts clients) and 300 cities in 115 different countries where Cloudflare infrastructure is deployed, covering the vast majority of regions where other CDNs have deployed infrastructure.

Step 3. Validation

To validate the above methodology, we ran a global campaign of traceroute and ping measurements from 9,990 Atlas probes in 161 different countries (see the interactive map for real-time data on the geographical distribution of probes).

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement
Geographical distribution of the RIPE Atlas probes used for the validation of our predictions

For each CDN we measured, we selected a destination hostname that is anycasted from all locations, and we configured DNS resolution to run on each measurement probe so that the returned IP corresponds to the probe’s nearest location.

After the measurements were completed, we first evaluated the Anycast Catchment prediction, namely the prediction of which CDN data center will be used by each RIPE Atlas probe. To this end, we geolocated the destination IP of each completed traceroute and compared it against the predicted data center. Nearly 90% of our predicted data centers agreed with the measured data centers.

We also validated our RTT predictions. The figure below shows the absolute difference between the measured RTT and the predicted RTT in milliseconds, across all data centers. More than 50% of the predictions have an RTT difference of 3 ms or less, while almost 95% of the predictions have an RTT difference of at most 10 ms.

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement
Histogram of the absolute difference in milliseconds between the predicted RTT and the RTT measured through the RIPE Atlas measurement campaign.

Results

We applied our methodology to nine major CDNs, including Cloudflare, in September 2021. As shown in the boxplot below, Cloudflare exhibits the lowest median RTT across all observed clients, at close to 10 ms.

“Look, Ma, no probes!” — Characterizing CDNs’ latencies with passive measurement
Boxplot of the global RTT distributions for each of the 9 networks we considered in our experiments. We anonymize the rest of the networks since the focus of this measurement is not to provide a ranking of content providers but to contextualize the performance of Cloudflare’s network compared to other comparable networks.

Limitations of measurement methodology

Because our approach relies on estimating latency, it cannot produce millisecond-accurate measurements. However, such accuracy is essentially infeasible even with real user measurements, because network conditions are highly dynamic and the measured RTT may differ significantly from one measurement to the next.

Secondly, our approach obviously cannot be used to monitor network hygiene in real time and detect performance issues that may often lie outside Cloudflare’s network. Instead, our approach is useful for understanding the expected performance of our network topology and connectivity, and we can test what-if scenarios to predict the impact on performance that different events may have (e.g. deployment of a new data center, interruption of connectivity to an ISP or IXP).

Finally, while Cloudflare has the most extensive coverage of data centers and IXPs compared to other CDNs, there are certain countries where Cloudflare does not have a data center while other CDNs do. In some other countries, Cloudflare is present in a partner data center but not in a carrier-neutral data center, which may restrict the number of direct peering links between Cloudflare and regional ISPs. In such countries, client IPs may be routed to a data center outside the country because the BGP decision process typically prioritizes cost over proximity. Therefore, for about 7% of the client /24 IP prefixes, we do not have a measured RTT to a data center in the same country as the IP. We are working to alleviate this with traceroute measurements and will report back later.

Looking Ahead

The ability to predict and compare the performance of different CDN networks allows us to evaluate the impact of different peering and data center strategies, as well as identify shortcomings in our Anycast Catchments and traffic engineering policies. Our ongoing work focuses on measuring and quantifying the impact of peering on IXPs on end-to-end latencies, as well as identifying cases of local Internet ecosystems where an open peering policy may lead to latency increases. This work will eventually enable us to optimize our infrastructure placement and control-plane policies to the specific topological properties of different regions and minimize latency for end users.

Multi-User IP Address Detection

Post Syndicated from Alex Chen original https://blog.cloudflare.com/multi-user-ip-address-detection/

Multi-User IP Address Detection

Multi-User IP Address Detection

Cloudflare provides our customers with security tools that help them protect their Internet applications against malicious or undesired traffic. Malicious traffic can include scraping content from a website, spamming form submissions, and a variety of other cyberattacks. To protect themselves from these types of threats while minimizing the blocking of legitimate site visitors, Cloudflare’s customers need to be able to identify traffic that might be malicious.

We know some of our customers rely on IP addresses to distinguish between traffic from legitimate users and potentially malicious users. However, in many cases the IP address of a request does not correspond to a particular user or even device. Furthermore, Cloudflare believes that in the long term, the IP address will be an even more unreliable signal for identifying the origin of a request. We envision a day where IP will be completely unassociated with identity. With that vision in mind, multi-user IP address detection represents our first step: pointing out situations where the IP address of a request cannot be assumed to be a single user. This gives our customers the ability to make more judicious decisions when responding to traffic from an IP address, instead of indiscriminately treating that traffic as though it was coming from a single user.

Historically, companies commonly treated IP addresses like mobile phone numbers: each phone number in theory corresponds to a single person. If you get several spam calls within an hour from the same phone number, you might safely assume that phone number represents a single person and ignore future calls or even block that number. Similarly, many Internet security detection engines rely on IP addresses to discern which requests are legitimate and which are malicious.

However, this analogy is flawed and can present a problem for security. In practice, IP addresses are more like postal addresses because they can be shared by more than one person at a time (and because of NAT and CG-NAT the number of people sharing an IP can be very large!). Many existing Internet security tools accept IP addresses as a reliable way to distinguish between site visitors. However, if multiple visitors share the same IP address, security products cannot rely on the IP address as a unique identifying signal. Thousands of requests from thousands of different users need to be treated differently from thousands of requests from the same user. The former is likely normal traffic, while the latter is almost certainly automated, malicious traffic.

Multi-User IP Address Detection

For example, if several people in the same apartment building accessed the same site, it’s possible all of their requests would be routed through a middlebox operated by their Internet service provider that has only one IP address. But this sudden series of requests from the same IP address could closely resemble the behavior of a bot. In this case, IP addresses can’t be used by our customers to distinguish this activity from a real threat, leading them to mistakenly block or challenge their legitimate site visitors.

By adding multi-user IP address detection to Cloudflare products, we’re improving the quality of our detection techniques and reducing false positives for our customers.

Examples of Multi-User IP Addresses

Multi-user IP addresses take on many forms. When your company uses an enterprise VPN, for example, employees may share the same IP address when accessing external websites. Other types of VPNs and proxies also place multiple users behind a single IP address.

Another type of multi-user IP address originated from the core communications protocol of the Internet. IPv4 was developed in the 1980s. The protocol uses a 32-bit address space, allowing for over four billion unique addresses. Today, however, there are many times more devices than IPv4 addresses, meaning that not every device can have a unique IP address. Though IPv6 (IPv4’s successor protocol) solves the problem with 128-bit addresses (supporting 2^128 unique addresses), IPv4 still routes the majority of Internet traffic (76% of human-only traffic is IPv4, as shown on Cloudflare Radar).

Multi-User IP Address Detection

To solve this issue, many devices in the same Local Area Network (LAN) can share a single Internet-addressable IP address to communicate with the public Internet, while using private IP addresses to communicate within the LAN. Since private addresses are only used within a LAN, different LANs can number their hosts using the same private IP address space. The LAN’s Internet gateway performs Network Address Translation (NAT): it takes messages that arrive on that single public IP and forwards them to the private IP of the appropriate device on the local network. In effect, it’s similar to how everyone in an office building shares the same street address, and the front desk worker is responsible for sorting out what mail was meant for which person.

While NAT allows multiple devices behind the same Internet gateway to share the same public IP address, the explosive growth of the Internet population necessitated further reuse of the limited IPv4 address space. Internet Service Providers (ISPs) required users in different LANs to share the same IP address for their service to scale. Carrier-Grade Network Address Translation (CG-NAT) emerged as another solution for address space reuse. Network operators can use CG-NAT middleboxes to translate hundreds or thousands of private IPv4 addresses into a single (or pool of) public IPv4 address. However, this sharing is not without side-effects. CG-NAT results in IP addresses that cannot be tied to single devices, users, or broadband subscriptions, creating issues for security products that rely on the IP address as a way to distinguish between requests from different users.

What We Built

We built a tool to help our customers detect when a /24 IP prefix (set of IP addresses that have the same first 24 bits) is likely to contain multi-user IP addresses, so they can more finely tune the security rules that protect their websites. In order to identify multi-user IP prefixes, we leverage both internal data and public data sources. Within this data, we look at a few key parameters.

Multi-User IP Address Detection
Each TCP connection between a source (client) and a destination (server) is identified by 4 identifiers (source IP, source port, destination IP, destination port)

When an Internet user visits a website, the underlying TCP stack opens a number of connections in order to send and receive data from remote servers. Each connection is identified by a 4-tuple (source IP, source port, destination IP, destination port). Repeating requests from the same web client will likely be mapped to the same source port, so the number of distinct source ports can serve as a good indication of the number of distinct client applications. By counting the number of open source ports for a given IP address, you can estimate whether this address is shared by multiple users.
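As a rough illustration, the source-port signal boils down to a simple aggregation over flow records. The record format below is hypothetical; in practice the counts would be taken over a time window, and the thresholds tuned empirically.

```python
from collections import defaultdict

# Hypothetical flow records: (source IP, source port, destination IP, destination port)
flows = [
    ("203.0.113.7", 52311, "198.51.100.10", 443),
    ("203.0.113.7", 49822, "198.51.100.10", 443),
    ("203.0.113.7", 52311, "198.51.100.11", 443),  # same client, reused source port
    ("203.0.113.9", 40100, "198.51.100.10", 443),
]

ports_per_ip = defaultdict(set)
for src_ip, src_port, _dst_ip, _dst_port in flows:
    ports_per_ip[src_ip].add(src_port)

# Many distinct source ports from one IP hints at many clients behind it (NAT/CG-NAT).
for ip, ports in ports_per_ip.items():
    print(ip, len(ports))
```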

User agents provide device-reported information such as browser and operating system versions. For multi-user IP detection, you can count the number of distinct user agents in requests from a given IP. To avoid overcounting web clients per device, we exclude requests identified as triggered by bots and only count requests from user agents used by web browsers. There are some tradeoffs to this approach: some users may use multiple web browsers, and some users may have exactly the same user agent. Nevertheless, past research has shown that the number of unique web browser user agents is the best tradeoff to most accurately determine CG-NAT usage.

Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0

For our inferences, we group IP addresses to their corresponding /24 IP prefix. The figure below shows the distribution of browser User Agents per /24 IP prefix, based on data accumulated over the period of a day. About 35% of the prefixes have more than 100 different browser clients behind them.

Multi-User IP Address Detection
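A sketch of the user-agent signal, grouped by /24 prefix as described above. The request records are hypothetical, and the bot flag here is just a placeholder for the real bot and browser classification.

```python
import ipaddress
from collections import defaultdict

# Hypothetical request log entries: (client IP, user agent, flagged as bot?)
requests = [
    ("203.0.113.7", "Mozilla/5.0 (X11; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0", False),
    ("203.0.113.8", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...", False),
    ("203.0.113.9", "curl/7.79.1", True),   # excluded: identified as non-browser/bot traffic
]

uas_per_prefix = defaultdict(set)
for ip, user_agent, is_bot in requests:
    if is_bot:
        continue  # only count browser user agents to avoid overcounting clients
    prefix = ipaddress.ip_network(f"{ip}/24", strict=False)
    uas_per_prefix[str(prefix)].add(user_agent)

# A /24 with hundreds of distinct browser user agents is likely multi-user.
for prefix, uas in uas_per_prefix.items():
    print(prefix, len(uas))
```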

Our service also uses other publicly available data sources to further refine the accuracy of our identification and to classify the type of multi-user IP address. For example, we collect data from PeeringDB, which is a database where network operators self-identify their network type, traffic levels, interconnection points, and peering policy. This data only covers a fraction of the Internet’s autonomous systems (ASes). To overcome this limitation, we use this data and our own data (number of requests per AS, number of websites in each AS) to infer AS type. We also use external data sources such as IRR to identify requests from VPNs and proxy servers.

These details (especially AS type) can provide more information on the type of multi-user IP address. For instance, CG-NAT systems are almost exclusively deployed by broadband providers, so by inferring the AS type (ISP, CDN, Enterprise, etc.), we can more confidently infer the type of each multi-user IP address. A scheduled job periodically executes code to pull data from these sources, process it, and write the list of multi-user IP addresses to a database. That IP info data is then ingested by another system that deploys it to Cloudflare’s edge, enabling our security products to detect potential threats with minimal latency.

To validate our inferences for which IP addresses are multi-user, we created a dataset relying on separate data and measurements which we believe are more reliable indicators. One method we used was running traceroute queries through RIPE Atlas, from each RIPE Atlas probe to the probe’s public IP address. By examining the traceroute hops, we can determine if an IP is behind a CG-NAT or another middlebox. For example, if an IP is not behind a CG-NAT, the traceroute should terminate immediately or just have one hop (likely a home NAT). On the other hand, if a traceroute path includes addresses within the RFC 6598 CGNAT prefix or other hops in the private or shared address space, it is likely the corresponding probe is behind CG-NAT.
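A minimal version of that check, using only the address-space test: hop IPs that fall inside the RFC 6598 shared range (100.64.0.0/10) or the RFC 1918 private ranges beyond the home gateway suggest a middlebox between the probe and the public Internet. The hop lists below are stand-ins for real traceroute output.

```python
import ipaddress

# RFC 6598 shared address space (used by CG-NAT) and RFC 1918 private ranges.
CGNAT = ipaddress.ip_network("100.64.0.0/10")
PRIVATE = [ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def looks_like_cgnat(hops):
    """Heuristic: any hop in shared/private space beyond the home gateway
    suggests the probe sits behind CG-NAT or another middlebox."""
    for hop in hops[1:]:          # hop 0 is typically the home NAT/router
        addr = ipaddress.ip_address(hop)
        if addr in CGNAT or any(addr in net for net in PRIVATE):
            return True
    return False

# Hypothetical traceroutes from a probe to its own public IP.
print(looks_like_cgnat(["192.168.1.1", "100.64.12.1", "203.0.113.50"]))  # True
print(looks_like_cgnat(["192.168.1.1", "203.0.113.50"]))                 # False
```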

To further improve our validation datasets, we’re also reaching out to our ISP partners to confirm the known IP addresses of CG-NATs. As we refine our validation data, we can more accurately tune our multi-user IP address inference parameters and provide a better experience to ISP customers on sites protected by Cloudflare security products.

The multi-user IP detection service currently recognizes approximately 500,000 unique multi-user IP addresses and is being tuned to further improve detection accuracy. Be on the lookout for an upcoming technical blog post, where we will take a deeper look at the system we built and the metrics collected after running this service for a longer period of time.

How Will This Impact Bot Management and Rate Limiting Customers?

Our initial launch will integrate multi-user IP address detection into our Bot Management and Rate Limiting products.

Multi-User IP Address Detection
The three modules that comprise the bot detection system. 

The Cloudflare Bot Management product has five detection mechanisms. The integration will improve three of the five: the machine learning (ML) detection mechanism, the heuristics engine, and the behavioral analysis models. Multi-user IP addresses and their types will serve as additional features to train our ML model. Furthermore, logic will be added to ensure multi-user IP addresses are treated differently in our other detection mechanisms. For instance, our behavioral analysis detection mechanism shouldn’t treat a series of requests from a multi-user IP the same as a series of requests from a single-user IP. There won’t be any new ways to see or interact with this feature, but you should expect to see a decrease in false positive bot detections involving multi-user IP addresses.

The integration with Rate Limiting will allow us to increase the set rate limiting threshold when receiving requests coming from multi-user IP addresses. The factor by which we increase the threshold will be conservative so as not to completely bypass the rate limit. However, the increased threshold should greatly reduce cases where legitimate users behind multi-user IP addresses are blocked or challenged.

Looking Forward

We plan to further integrate across all of Cloudflare’s products that rely upon IP addresses as a measure of uniqueness, including but not limited to DDoS Protection, Cloudflare One Intel, and Web Application Firewall.

We will also continue to make improvements to our multi-user IP address detection system to incorporate additional data sources and improve accuracy. One data source would give us the ratio of the estimated number of subscribers to the total number of IPs advertised (owned) by an AS. ASes that have more estimated subscribers than available IPs would have to rely on CG-NAT to provide service to all subscribers.

As mentioned above, with the help of our ISP partners we hope to improve the validation datasets we use to test and refine the accuracy of our inferences. Additionally, our integration with Bot Management will also unlock an opportunity to create a feedback loop that further validates our datasets. The challenge solve rate (CSR) is a metric generated by Bot Management that indicates the proportion of requests that were challenged and solved (and thus assumed to be human). Examining requests with both high and low CSRs will allow us to check if the multi-user IP addresses we have initially identified indeed represent mostly legitimate human traffic that our customers should not block.

The continued adoption of IPv6 might someday make CG-NATs and other IPv4 sharing technologies irrelevant, as the address space will no longer be limited. This could reduce the prevalence of multi-user IP addresses. However, with the development of new networking technologies that obfuscate IP addresses for user privacy (for example, IPv6 randomized address assignment), it seems unlikely it will become any easier to tie an IP address to a single user. Cloudflare firmly believes that eventually, IP will be completely unassociated with identity.

Yet in the short term, we recognize that IP addresses still play a pivotal role for the security of our customers. By integrating this multi-user IP address detection capability into our products, we aim to deliver a more free and fluid experience for everyone using the Internet.

Geo Key Manager: Setting up a service for scale

Post Syndicated from Tanya Verma original https://blog.cloudflare.com/scaling-geo-key-manager/

Geo Key Manager: Setting up a service for scale

In 2017, we launched Geo Key Manager, a service that allows Cloudflare customers to choose where they store their TLS certificate private keys. For example, if a US customer only wants its private keys stored in US data centers, we can make that happen. When a user from Tokyo makes a request to that website or API, it first hits the Tokyo data center. As the Tokyo data center lacks access to the private key, it contacts a data center in the US to terminate the TLS request. Once the TLS session is established, the Tokyo data center can serve future requests. For a detailed description of how this works, refer to this post on Geo Key Manager.

This is a story about the evolution of systems in response to increase in scale and scope. Geo Key Manager started off as a small research project and, as it got used more and more, wasn’t scaling as well as we wanted it to. This post describes the challenges Geo Key Manager is facing today, particularly from a networking standpoint, and some of the steps along its way to a truly scalable service.

Geo Key Manager started out as a research project that leveraged two key innovations: Keyless SSL, an early Cloudflare innovation; and identity-based encryption and broadcast encryption, relatively new cryptographic techniques that can be used to construct identity-based access management schemes — in this case, identities based on geography. Keyless SSL was originally designed as a keyserver that customers would host on their own infrastructure, allowing them to retain complete ownership of their own private keys while still reaping the benefits of Cloudflare.

Eventually we started using Keyless SSL for Geo Key Manager requests, and later, all TLS terminations at Cloudflare were switched to an internal version of Keyless SSL. We made several tweaks to make the transition go more smoothly, but this meant that we were using Keyless SSL in ways we hadn’t originally intended.

With the increasing risks of balkanization of the Internet in response to geography-specific regulations like GDPR, demand for products like Geo Key Manager which enable users to retain geographical control of their information has surged. A lot of the work we do on the Research team is exciting because we get to apply cutting-edge advancements in the field to Cloudflare scale systems. It’s also fascinating to see projects be used in new and unexpected ways. But inevitably, many of our systems use technology that has never before been used at this scale which can trigger failures as we shall see.

A trans-pacific voyage

In late March, we started seeing an increase in TLS handshake failures in Melbourne, Australia. When we start observing failures in a specific data center, depending on the affected service, we have several options to mitigate impact. For critical systems like TLS termination, one approach is to reroute all traffic using anycast to neighboring data centers. But when we rerouted traffic away from the Melbourne data center, the same failures moved to the nearby data centers of Adelaide and Perth. As we were investigating the issue, we noticed a large spike in timeouts.

Geo Key Manager: Setting up a service for scale

The first service in the chain that makes up TLS termination at Cloudflare performs all the parts of TLS termination that do not require access to customers’ private keys. Since it is an Internet-facing process, we try to keep sensitive information out of it as much as possible, in case of memory disclosure bugs. So it forwards the key signing request to keynotto, a service written in Rust that performs RSA and ECDSA key signatures.

We continued our investigation. This service was timing out on the requests it sent to keynotto. Next, we looked into the requests itself. We observed that there was an increase in total requests, but not by a very large factor. You can see a large drop in traffic in the Melbourne data center below. That indicates the time where we dropped traffic from Melbourne and rerouted it to other nearby data centers.

Geo Key Manager: Setting up a service for scale

We decided to track one of these timed-out requests using our distributed tracing infrastructure. The Jaeger UI allows you to chart traces that take the longest. Thanks to this, we quickly figured out that most of these failures were being caused by a single, new zone that was getting a large amount of API traffic. And interestingly, this zone also had Geo Key Manager enabled, with a policy set to US-only data centers.

Geo Key Manager routes requests to the closest data center that satisfies the policy. This happened to be a data center in San Francisco. That meant adding ~175ms to the actual time spent performing the key signing (median is 3 ms), to account for the trans-pacific voyage 🚢. So, it made sense that the remote key signatures were relatively slow, but what was causing TLS timeouts and degradation for unrelated zones with nothing to do with Geo Key Manager?

Below are graphs depicting the increase by quantile in RSA key signatures latencies in Melbourne. Plotting them all on the same graph didn’t work right with the scale, so the p50 is shown separately from p70 and p90.

Geo Key Manager: Setting up a service for scale
Geo Key Manager: Setting up a service for scale

To answer why unrelated zones had timeouts and performance degradation, we have to understand the architecture of the three services involved in terminating TLS.

Geo Key Manager: Setting up a service for scale

Life of a TLS request

TLS requests arrive at a data center routed by anycast. Each server runs the same stack of services, so each has its own instance of the initial service, keynotto, and gokeyless (discussed shortly). The first service has a worker pool of half the number of CPU cores. There are 96 cores on each server, so 48 workers. Each of these workers creates its own connection to keynotto, which it uses to send key-signing requests and receive the responses.

keynotto, at that point in time, could multiplex between all 48 of these connections because it spawned a new thread to handle each connection. However, it processed all requests on the same connection sequentially. Given that there could be dozens of requests per second on the same connection, if even a single one was slow, it would cause head-of-line blocking of all other requests enqueued after it. So long as most requests were short lived, this bug went unnoticed. But when a lot of traffic needed to be processed via Geo Key Manager, the head-of-line blocking created problems. This type of flaw is usually only exposed under heavy load or when load testing, and will make more sense after I introduce gokeyless and explain the history of keynotto.
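To see why one slow remote request hurts unrelated zones, here is a toy back-of-the-envelope model (not keynotto’s actual code) of a single connection processed strictly in order: every local signature queued behind a trans-pacific request inherits its latency. The request names and timings are illustrative.

```python
# Toy model: requests on one connection, processed strictly in order.
# Remote Geo Key Manager requests cost ~175 ms of network round trip plus
# signing; local signatures take ~3 ms each.
queue = [("local-1", 3), ("remote-geo", 178), ("local-2", 3), ("local-3", 3)]

clock = 0
for name, service_ms in queue:
    clock += service_ms
    print(f"{name}: completed at {clock} ms")

# local-2 and local-3 finish at ~184 ms and ~187 ms even though they each
# need only 3 ms of work -- classic head-of-line blocking.
```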

gokeyless-internal, very imaginatively named, is the internal instance of our Keyless SSL keyserver written in Go. I’ll abbreviate it to gokeyless for the sake of simplicity. Before we introduced keynotto as the designated key signing process, the first service sent all key signing requests directly to gokeyless. gokeyless created worker pools, one per type of operation, made up of goroutines, the very lightweight threads managed by the Go runtime. The pools of interest are the RSA, ECDSA, and remote worker pools. RSA and ECDSA are fairly obvious: these goroutines performed RSA and ECDSA key signatures. Any requests involving Geo Key Manager were placed in the remote pool. This prevented the network-dependent remote requests from affecting local key signatures.

Worker pools were an artifact of a previous generation of Keyless, which didn’t need to account for remote requests. Using benchmarks, we had noticed that spinning up worker pools provided some marginal latency benefits. For local operations only, performance was optimal, as the computation was very fast and CPU bound. However, when we started adding remote operations that could block the workers needed for performing local operations, we decided to create a new worker pool only for remote ops.

When gokeyless was designed in 2014, Rust was not a mature language. That changed recently, and we decided to experiment with a minimal Rust proxy placed between the first service and gokeyless. This proxy would handle RSA and ECDSA signatures, which were about 99% of all key signing operations, while handing off the more esoteric operations like Geo Key Manager to gokeyless running locally on the same server. The hope was that we could eke out some performance gains from the tight runtime control afforded by Rust and the use of Rust’s cryptographic libraries.

Performance is incredibly important to Cloudflare, since CPU is one of the main limiting factors for edge servers. Go’s RSA is notorious for being slow and using a lot of CPU. Given that one in three handshakes uses RSA, it is important to optimize it. CGo seemed to create unnecessary overhead, and there was no assembly-only RSA implementation that we could use. We tried to speed up RSA using only assembly, without CGo, and made some strides, but it was still somewhat suboptimal. So keynotto was built to take advantage of the fast RSA implementation in BoringSSL.

The next section is a quick diversion into keynotto that isn’t strictly related to the story, but who doesn’t enjoy a hot take on Go vs Rust?

Go vs Rust: The battle continues

While keynotto was initially deployed because it objectively lowered CPU usage, we weren’t sure what fraction of its benefits over gokeyless came from the different cryptography libraries used versus the choice of language. keynotto used BoringCrypto, the cryptographic module that is part of BoringSSL, whereas gokeyless used Go’s standard library crypto packages. And keynotto was implemented in Rust, while gokeyless was written in Go.

However, recently, during the process of FedRAMP certification, we had to switch gokeyless to use the same cryptography library as keynotto. This was because the Go standard library doesn’t use a FIPS-validated cryptography module, while the Google-maintained branch of Go that uses BoringCrypto does. We then turned off keynotto in a few very large data centers for a couple of days and compared this to the control group with keynotto turned on. We found that moving gokeyless to BoringCrypto provided a very marginal benefit, forcing us to revisit our stance on CGo. This result meant we could attribute the difference to using Rust over Go and to implementation differences between keynotto and gokeyless.

Turning off keynotto resulted in an average 26% increase in maximum memory consumed and 71% increase in maximum CPU consumed. In the case of all quantiles, we observed a significant increase in latency for both RSA and ECDSA. ECDSA as measured across four different large data centers (each is a different color) is shown:

Geo Key Manager: Setting up a service for scale

Remote ops don’t play fair

The first service, keynotto, and gokeyless communicate over TCP via the custom Keyless protocol. This protocol was developed in 2014 and was originally intended for communicating with external key servers, i.e. gokeyless. After the protocol started being used internally, by an increasing number of clients in several different languages, challenges started appearing. In particular, each client had to implement the serialization code by hand and make sense of all the features the protocol provided. The custom protocol also makes tracing harder to do right. So while older clients like the first service have tuned their implementations, newer clients like keynotto can find it difficult to get every property right, such as the correct way to propagate traces.

One such property that the Keyless protocol offers is multiplexing of requests within a connection. It does so by provisioning unique IDs for each request, which allows for out-of-order delivery of the responses similar to protocols like HTTP/2. While the first service and gokeyless leveraged this property to handle situations where the order of responses is different from the order of requests, keynotto, a much newer service, didn’t. This is why it had to process requests on the same connection sequentially. This led to local key signing requests — which took roughly 3ms — being blocked on the remote requests that took 60x that duration!

We now know why most TLS requests were degraded or dropped. But it’s also worth examining the remote requests themselves. The gokeyless remote worker pool had 200 workers on each server. One of the mitigations we applied was bumping that number to 2,000. Increasing the concurrency in a system when faced with problems that resemble resource constraints is an understandable thing to do; the main reason not to is higher resource utilization. Let’s get some context on why bumping up remote workers 10x wasn’t necessarily a problem. When gokeyless was created, it had separate pools with a configurable number of workers for RSA, ECDSA, and remote ops. RSA and ECDSA had a large fraction of workers, and remote was relatively smaller. Post keynotto, the need for the RSA and ECDSA pools was obviated, since keynotto handled all local key signing operations.

Let’s do some napkin math now to prove that bumping it by 10x could not possibly have helped. We were receiving five thousand requests per second in addition to the usual traffic in Melbourne, of which 200 per second were remote requests. Melbourne has 60 servers. Looking at a breakdown of remote requests by server, the maximum any one server handled was ten remote requests per second (rps). The graph below shows the percentage of workers in gokeyless actually performing operations; “other” is the remote workers. We can see that while Melbourne worker utilization peaked at 47%, that is still not 100%. And we see that the utilization in San Francisco was much lower, so it couldn’t be SF that was slowing things down.

Geo Key Manager: Setting up a service for scale

Given that we had 200 workers per server and at most ten remote requests per second, gokeyless was not the bottleneck: even if each request took 200 ms, that is only 10 rps × 0.2 s ≈ 2 requests in flight per server, far below the 200 available workers. We had no reason to increase the number of remote workers. The culprit here was keynotto’s serial queue.

The solution here was to prevent remote signing ops from blocking local ops in keynotto. We did this by changing the internal task handling model of keynotto to not just handle each connection concurrently, but also to process each request on a connection concurrently. This was done by using a multi-producer, single-consumer queue. When a request from a connection was read, it was handed off to a new thread for processing while the connection thread went back to reading requests. When the request was done processing, the response was written to a channel. A write thread for each connection was also created, which polled the read side of the channel and wrote the response back to the connection.

The next thing we added was carefully chosen timeout values in keynotto. Timeouts allow for faster failure and faster cleanup of the resources associated with a timed-out request. Choosing appropriate timeouts can be tricky, especially when composed across multiple services. Since keynotto is downstream of the first service but upstream of gokeyless, its timeout should be smaller than the former’s but larger than the latter’s, so upstream services can diagnose which process triggered the timeout. Timeouts are further complicated when network latency is involved, because trans-pacific p99 latency is very different from p99 latency between two US states. Using the worst-case network latency can be a fair compromise. We chose keynotto timeouts to be an order of magnitude larger than the p99 request latency, since the purpose was to prevent continued stalls and resource consumption.

A more general solution to avoid the two issues outlined above would be to use gRPC. gRPC was released in 2015, which was one year too late to be used in Keyless at that time. It has several advantages over custom protocol implementations such as multi-language client libraries, improved tracing, easy timeouts, load balancing, protobuf for serialization and so on, but directly relevant here is multiplexing. gRPC supports multiplexing out of the box which makes it unnecessary to handle request/response multiplexing manually.

Tracking requests across the Atlantic

Then in early June, we started seeing TLS termination timeouts in Chicago. This time, there were timeouts across the first service, keynotto, and gokeyless. We had on our hands a zone with Geo Key Manager policy set to the EU that had suddenly started receiving around 80 thousand remote rps, which was significantly more than the 200 remote rps in Australia. This time, keynotto was not responsible since its head of line blocking issues had been fixed. It was timing out waiting for gokeyless to perform the remote key signatures. A quick glance at the maximum worker utilization of gokeyless for the entire datacenter revealed that it was maxed out. To counteract the issue, we increased the remote workers from 200 to 20,000, and then to 200,000. It wasn’t enough. Requests were still timing out and worker utilization was squarely at 100% until we rate limited the traffic to that zone.

Geo Key Manager: Setting up a service for scale

Why didn’t increasing the number of workers by a factor of 1,000 help? We didn’t even have 250,000 rps for the entire data center, let alone for every server.

File descriptors don’t scale

gokeyless maintains a map of ping latencies to each data center, as measured from the data center that the instance is running in. It updates this map frequently and uses it to decide which data center to route remote requests to. A long time ago, every server in every data center maintained its own connections to every other data center. This quickly grew out of hand, and we started running out of file descriptors as the number of data centers grew. So we switched to a new system that groups servers into pods of 32 and divides the connections to other data centers among the members of each pod. This greatly reduced the number of file descriptors used by each server.

An example will illustrate this better. In a data center with 100 servers, there would be four pods: three with 32 servers each and one with four servers. If there were a total of 65 data centers in the world, then each server in a pod of 32 would be responsible for maintaining connections to two data centers, and each server in the pod of four would handle 16 data centers. Remote requests are routed to the data center with the shortest latency, so in a pod of 32 servers, the server that maintains the connection with the closest data center is responsible for sending remote requests from every member of the pod to that target data center. In case of failure or timeout, the second-closest data center (and then the third-closest) takes over. If all three closest data centers return failures, we give up and return a failure. This means that if a data center with 130 servers is receiving 80k remote rps, which is 615 rps/server, then there are approximately four servers actually responsible for routing all remote traffic at any given time, each handling ~20k rps. This was the case in Chicago. These requests were being routed to a data center in Hamburg, Germany, and there could be a maximum of four connections at the same time between Chicago and Hamburg.

Geo Key Manager: Setting up a service for scale

Little’s Law

At that time, gokeyless used a worker pool architecture with a queue for buffering additional tasks. This queue blocked after 1,024 requests were queued, to create backpressure on the client. This is a common technique to prevent overloading — not accepting more tasks than the server knows it can handle. p50 RSA latency was 4 ms. The latency from Chicago to Hamburg was ~100 ms. After we bumped up the remote workers for each server to 200,000 in Chicago, we were surely not constrained on the sending end, since we only had 80k rps for the whole data center and each request shouldn’t take more than 100 ms in the vast majority of cases. We know this because of Little’s Law. Little’s Law states that the average number of requests in flight (the concurrency a system needs) equals the arrival rate multiplied by the average time taken to service each request. Let’s see how it applies here, and how it allowed us to prove why increasing concurrency or queue size did not help.

Consider the queue in the German data center, Hamburg in this case. We hadn’t bumped the number of remote workers there, so only 200 remote workers were available. Assuming 20k rps arrived at one server in Hamburg and each took ~5 ms to process, the necessary concurrency in the system should be 100 workers. We had 200 workers. Even without a queue, Hamburg should have easily been able to handle the 20k rps thrown at it. We investigated how many remote rps were actually processed per server in Hamburg:

Geo Key Manager: Setting up a service for scale

These numbers didn’t make any sense! The maximum number at any time was a little over 700. We should’ve been able to process orders of magnitude more requests. By the time we investigated this, the incident had ended. We had no visibility into the size of the service level queue — or the size of the TCP queue — to understand what could have caused the timeouts. We speculated that while p50 for the population of all RSA key signatures might be 4 ms, perhaps for the specific population of Geo Key Manager RSA signatures, things took a whole lot longer. With 200 workers, we processed ~500 rps, which meant each operation would have to take 400 ms. p99.9 for signatures can be around this much, so it is possible that was how long things took.
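
As a quick sanity check, both directions of that Little’s Law arithmetic can be written down directly. The numbers below are the ones quoted above, not new measurements.

package main

import "fmt"

func main() {
    // Expected: 20k rps at ~5 ms per signature should need about 100 workers.
    fmt.Println("workers needed:", 20000*0.005)

    // Observed: 200 workers sustaining only ~500 rps implies ~0.4 s per operation.
    fmt.Println("implied service time (s):", 200.0/500.0)
}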

We recently ran a load test to establish an upper bound on remote rps, and discovered that one server in the US could process 7,000 remote rps to the EU — much lower than Little’s Law suggests. We identified several slow remote requests through tracing and noticed that these usually corresponded to multiple tries to different remote hosts. These retries are necessary to get through to an otherwise healthy data center because gokeyless uses a hardcoded list of data centers and corresponding server hostnames to establish connections to, and hostnames can change when servers are added or removed. If we are attempting to connect to an invalid host and waiting on a one-second timeout, we will certainly need to retry against another host, which can cause large increases in average processing time.

After the Hamburg outage, we decided to remove gokeyless worker pools and switch to a model of one goroutine per request. Goroutines are extremely lightweight, with benchmarks suggesting that a 4 GB machine can handle around one million goroutines. This change made sense as far back as 2017, when we extended gokeyless to support remote requests, but because of the complexity involved in changing the entire request-processing architecture, we held off on it. Furthermore, when we launched Geo Key Manager, we didn’t have enough traffic to prompt us to rethink this architecture urgently. The switch removes complexity from the code, because tuning worker pool sizes and queue sizes can be complicated. We have also observed, on average, a 7x drop in memory consumption after switching to the one-goroutine-per-request model, because we no longer keep idle goroutines around as was the case with worker pools. It also makes it easier to trace the life of an individual request, because you can follow along all the steps in its execution, which helps when chasing down per-request slowdowns.
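
The structural difference is small but meaningful. The sketch below shows the goroutine-per-request shape in Go; the listener and handler are hypothetical stand-ins rather than gokeyless code.

package main

import (
    "fmt"
    "net"
)

// serve gives every accepted connection its own goroutine. There is no pool
// size or queue length to tune, and no idle workers are kept around.
func serve(l net.Listener, handle func(net.Conn)) error {
    for {
        conn, err := l.Accept()
        if err != nil {
            return err
        }
        go handle(conn)
    }
}

func main() {
    l, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        panic(err)
    }
    fmt.Println("listening on", l.Addr())
    serve(l, func(c net.Conn) {
        defer c.Close()
        c.Write([]byte("hello\n"))
    })
}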

Conclusion

The complexity of distributed systems can quickly scale up to unmanageable levels, making it important to have deep visibility into them. An extremely important tool to have is distributed tracing, which tracks a request through multiple services and provides information about the time spent in each part of the journey. We didn’t propagate gokeyless trace spans through parts of the remote operation, which prevented us from identifying why Hamburg only processed 700 rps. Having examples on hand for various error cases can also make diagnosis easier, especially in the midst of an outage. While we load-tested our TLS termination system in the common case, where key signatures happen locally, we hadn’t load-tested the less common and lower-volume use case of remote operations. Using sources of truth that update dynamically to reflect the current state of the world, instead of static sources, is also important. In our case, the list of data centers that gokeyless connected to for performing remote operations was hardcoded a long time ago and never updated. So while we added more data centers, gokeyless was unable to make use of them, and in some cases may have been connecting to servers with invalid hostnames.

We’re now working on overhauling several pieces of Geo Key Manager to make it significantly more flexible and scalable, so think of this as setting the stage for a future blog post where we finally solve some of the issues outlined here, and stay tuned for updates!

Privacy-Preserving Compromised Credential Checking

Post Syndicated from Luke Valenta original https://blog.cloudflare.com/privacy-preserving-compromised-credential-checking/

Privacy-Preserving Compromised Credential Checking

Today we’re announcing a public demo and an open-sourced Go implementation of a next-generation, privacy-preserving compromised credential checking protocol called MIGP (“Might I Get Pwned”, a nod to Troy Hunt’s “Have I Been Pwned”). Compromised credential checking services are used to alert users when their credentials might have been exposed in data breaches. Critically, the ‘privacy-preserving’ property of the MIGP protocol means that clients can check for leaked credentials without leaking any information to the service about the queried password, and only a small amount of information about the queried username. Thus, not only can the service inform you when one of your usernames and passwords may have become compromised, but it does so without exposing any unnecessary information, keeping credential checking from becoming a vulnerability itself. The ‘next-generation’ property comes from the fact that MIGP advances upon the current state of the art in credential checking services by allowing clients to not only check if their exact password is present in a data breach, but to check if similar passwords have been exposed as well.

For example, suppose your password last year was amazon20$, and you change your password each year (so your current password is amazon21$). If last year’s password got leaked, MIGP could tell you that your current password is weak and guessable as it is a simple variant of the leaked password.

The MIGP protocol was designed by researchers at Cornell Tech and the University of Wisconsin-Madison, and we encourage you to read the paper for more details. In this blog post, we provide motivation for why compromised credential checking is important for security hygiene, and how the MIGP protocol improves upon the current generation of credential checking services. We then describe our implementation and the deployment of MIGP within Cloudflare’s infrastructure.

Our MIGP demo and public API are not meant to replace existing credential checking services today, but rather demonstrate what is possible in the space. We aim to push the envelope in terms of privacy and are excited to employ some cutting-edge cryptographic primitives along the way.

The threat of data breaches

Data breaches are rampant. The regularity of news articles detailing how tens or hundreds of millions of customer records have been compromised has made us almost numb to the details. Perhaps we all hope to stay safe just by being a small fish in the middle of a very large school of similar fish that is being preyed upon. But we can do better than just hope that our particular authentication credentials are safe. We can actually check those credentials against known databases of the very same compromised user information we learn about from the news.

Many of the security breaches we read about involve leaked databases containing user details. In the worst cases, user data entered during account registration on a particular website is made available (often offered for sale) after a data breach. Think of the addresses, password hints, credit card numbers, and other private details you have submitted via an online form. We rely on the care taken by the online services in question to protect those details. On top of this, consider that the same (or quite similar) usernames and passwords are commonly used on more than one site. Our information across all of those sites may be as vulnerable as the site with the weakest security practices. Attackers take advantage of this fact to actively compromise accounts and exploit users every day.

Credential stuffing is an attack in which malicious parties use leaked credentials from an account on one service to attempt to log in to a variety of other services. These attacks are effective because of the prevalence of reused credentials across services and domains. After all, who hasn’t at some point had a favorite password they used for everything? (Quick plug: please use a password manager like LastPass to generate unique and complex passwords for each service you use.)

Website operators have (or should have) a vested interest in making sure that users of their service are using secure and non-compromised credentials. Given the sophistication of techniques employed by malevolent actors, the standard requirement to “include uppercase, lowercase, digit, and special characters” really is not enough (and can be actively harmful according to NIST’s latest guidance). We need to offer better options to users that keep them safe and preserve the privacy of vulnerable information. Dealing with account compromise and recovery is an expensive process for all parties involved.

Users and organizations need a way to know if their credentials have been compromised, but how can they do it? One approach is to scour dark web forums for data breach torrent links, download and parse gigabytes or terabytes of archives to your laptop, and then search the dataset to see if their credentials have been exposed. This approach is not workable for the majority of Internet users and website operators, but fortunately there’s a better way — have someone with terabytes to spare do it for you!

Making compromise checking fast and easy

This is exactly what compromised credential checking services do: they aggregate breach datasets and make it possible for a client to determine whether a username and password are present in the breached data. Have I Been Pwned (HIBP), launched by Troy Hunt in 2013, was the first major public breach alerting site. It provides a service, Pwned Passwords, where users can efficiently check if their passwords have been compromised. The initial version of Pwned Passwords required users to send the full password hash to the service to check if it appears in a data breach. In a 2018 collaboration with Cloudflare, the service was upgraded to allow users to run range queries over the password dataset, leaking only a short hash prefix rather than the entire hash. Cloudflare continues to support the HIBP project by providing CDN and security support for organizations to download the raw Pwned Password datasets.

The HIBP approach was replicated by Google Password Checkup (GPC) in 2019, with the primary difference that GPC alerts are based on username-password pairs instead of passwords alone, which limits the rate of false positives. Enzoic and Microsoft Password Monitor are two other similar services. This year, Cloudflare also released Exposed Credential Checks as part of our Web Application Firewall (WAF) to help inform opted-in website owners when login attempts to their sites use compromised credentials. In fact, we use MIGP on the backend for this service to ensure that plaintext credentials never leave the edge server on which they are being processed.

Most standalone credential checking services work by having a user submit a query containing their password’s or username-password pair’s hash prefix. However, this leaks some information to the service, which could be problematic if the service turns out to be malicious or is compromised. In a collaboration with researchers at Cornell Tech published at CCS’19, we showed just how damaging this leaked information can be. Malevolent actors with access to the data shared with most credential checking services can drastically improve the effectiveness of password-guessing attacks. This left open the question: how can you do compromised credential checking without sharing (leaking!) vulnerable credentials to the service provider itself?

What does a privacy-preserving credential checking service look like?

In the aforementioned CCS’19 paper, we proposed an alternative system in which only the hash prefix of the username is exposed to the MIGP server (independent work out of Google and Stanford proposed a similar system). No information about the password leaves the user device, alleviating the risk of password-guessing attacks. These credential checking services help to preserve password secrecy, but still have a limitation: they can only alert users if the exact queried password appears in the breach.

The present evolution of this work, Might I Get Pwned (MIGP), proposes a next-generation similarity-aware compromised credential checking service that supports checking if a password similar to the one queried has been exposed in the data breach. This approach supports the detection of credential tweaking attacks, an advanced version of credential stuffing.

Credential tweaking takes advantage of the fact that many users, when forced to change their password, use simple variants of their original password. Rather than just attempting to log in using an exact leaked password, say ‘password123’, a credential tweaking attacker might also attempt to log in with easily-predictable variants of the password such as ‘password124’ and ‘password123!’.

There are two main mechanisms described in the MIGP paper to add password variant support: client-side generation and server-side precomputation. With client-side generation, the client simply applies a series of transform rules to the password to derive the set of variants (e.g., truncating the last letter or adding a ‘!’ at the end), and runs multiple queries to the MIGP service with each username and password variant pair. The second approach is server-side precomputation, where the server applies the transform rules to generate the password variants when encrypting the dataset, essentially treating the password variants as additional entries in the breach dataset. The MIGP paper describes tradeoffs between the two approaches and techniques for generating variants in more detail. Our demo service includes variant support via server-side precomputation.
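
As an illustration of the client-side option, the sketch below applies a few transform rules to a queried password. The rules shown are examples invented for this sketch, not the exact rule set from the MIGP paper or from our deployment (which, as noted, uses server-side precomputation instead).

package main

import "fmt"

// variants returns the queried password plus a few simple tweaks that a
// credential tweaking attacker might also try.
func variants(password string) []string {
    v := []string{password} // always include the exact password
    if len(password) > 1 {
        v = append(v, password[:len(password)-1]) // truncate the last character
    }
    v = append(v, password+"!", password+"1") // append common suffixes
    return v
}

func main() {
    // A client would run one MIGP query per (username, variant) pair.
    fmt.Println(variants("amazon20$"))
}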

Breach extraction attacks and countermeasures

One challenge for credential checking services is breach extraction attacks, in which an adversary attempts to learn username-password pairs that are present in the breach dataset (which might not be publicly available) so that they can attempt to use them in future credential stuffing or tweaking attacks. Similarity-aware credential checking services like MIGP can make these attacks more effective, since adversaries can potentially check for more breached credentials per API query. Fortunately, additional measures can be incorporated into the protocol to help counteract these attacks. For example, if it is problematic to leak the number of ciphertexts in a given bucket, dummy entries and padding can be employed, or an alternative length-hiding bucket format can be used. Slow hashing and API rate limiting are other common countermeasures that credential checking services can deploy to slow down breach extraction attacks. For instance, our demo service applies the memory-hard slow hash algorithm scrypt to credentials as part of the key derivation function to slow down these attacks.

Let’s now get into the nitty-gritty of how the MIGP protocol works. For readers not interested in the cryptographic details, feel free to skip to the demo below!

MIGP protocol

There are two parties involved in the MIGP protocol: the client and the server. The server has access to a dataset of plaintext breach entries (username-password pairs), and a secret key used for both the precomputation and the online portions of the protocol. In brief, the client performs some computation over the username and password and sends the result to the server; the server then returns a response that allows the client to determine if their password (or a similar password) is present in the breach dataset.

Privacy-Preserving Compromised Credential Checking
Full protocol description from the MIGP paper: clients learn if their credentials are in the breach dataset, leaking only the hash prefix of the queried username to the server

Precomputation

At a high level, the MIGP server partitions the breach dataset into buckets based on the hash prefix of the username (the bucket identifier), which is usually 16-20 bits in length.

Privacy-Preserving Compromised Credential Checking
During the precomputation phase of the MIGP protocol, the server derives password variants, encrypts entries, and stores them in buckets based on the hash prefix of the username

We use server-side precomputation as the variant generation mechanism in our implementation. The server derives one ciphertext for each exact username-password pair in the dataset, and an additional ciphertext per password variant. A bucket consists of the set of ciphertexts for all breach entries and variants with the same username hash prefix. For instance, suppose there are n breach entries assigned to a particular bucket. If we compute m variants per entry, counting the original entry as one of the variants, there will be n*m ciphertexts stored in the bucket. This introduces a large expansion in the size of the processed dataset, so in practice it is necessary to limit the number of variants computed per entry. Our demo server stores 10 ciphertexts per breach entry in the input: the exact entry, eight variants (see Appendix A of the MIGP paper), and a special variant for allowing username-only checks.

Each ciphertext is the encryption of a username-password (or password variant) pair along with some associated metadata. The metadata describes whether the entry corresponds to an exact password appearing in the breach, or a variant of a breached password. The server derives a per-entry secret key pad using a key derivation function (KDF) with the username-password pair and server secret as inputs, and uses XOR encryption to derive the entry ciphertext. The bucket format also supports storing optional encrypted metadata, such as the date the breach was discovered.

Input:
  Secret sk       // Server secret key
  String u        // Username
  String w        // Password (or password variant)
  Byte mdFlag     // Metadata flag
  String mdString // Optional metadata string

Output:
  String C        // Ciphertext

function Encrypt(sk, u, w, mdFlag, mdString):
  padHdr = KDF1(u, w, sk)       // one-time pad for the header
  padBody = KDF2(u, w, sk)      // one-time pad for the metadata body
  zeros = [0] * KEY_CHECK_LEN   // zero prefix lets the client recognize a successful decryption
  C = XOR(padHdr, zeros || mdFlag) || mdString.length || XOR(padBody, mdString)
  return C
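
For concreteness, here is a minimal Go rendering of that encryption step. It follows the pseudocode above, but the choice of HKDF-SHA256 to stand in for KDF1 and KDF2, the 16-byte key-check length, and the single length byte are assumptions made for this sketch; the actual scheme is the one in the MIGP paper and the migp-go repository.

package main

import (
    "crypto/sha256"
    "fmt"
    "io"

    "golang.org/x/crypto/hkdf"
)

const keyCheckLen = 16 // zero prefix that lets a client recognize a successful decryption

// kdf derives n pad bytes bound to the username, password, and server secret.
// The label separates the header pad (KDF1) from the body pad (KDF2).
func kdf(label string, sk, u, w []byte, n int) []byte {
    secret := append(append(append([]byte{}, u...), w...), sk...)
    pad := make([]byte, n)
    if _, err := io.ReadFull(hkdf.New(sha256.New, secret, nil, []byte(label)), pad); err != nil {
        panic(err)
    }
    return pad
}

func xor(pad, msg []byte) []byte {
    out := make([]byte, len(msg))
    for i := range msg {
        out[i] = pad[i] ^ msg[i]
    }
    return out
}

// encrypt mirrors the pseudocode: XOR(padHdr, zeros || mdFlag) || len || XOR(padBody, mdString).
// It assumes mdString is shorter than 256 bytes so its length fits in one byte.
func encrypt(sk, u, w []byte, mdFlag byte, mdString []byte) []byte {
    header := append(make([]byte, keyCheckLen), mdFlag)
    c := xor(kdf("header", sk, u, w, len(header)), header)
    c = append(c, byte(len(mdString)))
    return append(c, xor(kdf("body", sk, u, w, len(mdString)), mdString)...)
}

func main() {
    ct := encrypt([]byte("server-secret"), []byte("someuser"), []byte("password1"), 1, []byte("example-breach-2021"))
    fmt.Printf("%d-byte ciphertext: %x\n", len(ct), ct)
}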

The precomputation phase only needs to be done rarely, such as when the MIGP parameters are changed (in which case the entire dataset must be re-processed), or when new breach datasets are added (in which case the new data can be appended to the existing buckets).

Online phase

Privacy-Preserving Compromised Credential Checking
During the online phase of the MIGP protocol, the client requests a bucket of encrypted breach entries corresponding to the queried username, and with the server’s help derives a key that allows it to decrypt an entry corresponding to the queried credentials

The online phase of the MIGP protocol allows a client to check if a username-password pair (or variant) appears in the server’s breach dataset, while only leaking the hash prefix of the username to the server. The client and server engage in an OPRF protocol message exchange to allow the client to derive the per-entry decryption key, without leaking the username and password to the server, or the server’s secret key to the client. The client then computes the bucket identifier from the queried username and downloads the corresponding bucket of entries from the server. Using the decryption key derived in the previous step, the client scans through the entries in the bucket attempting to decrypt each one. If the decryption succeeds, this signals to the client that their queried credentials (or a variant thereof) are in the server’s dataset. The decrypted metadata flag indicates whether the entry corresponds to the exact password or a password variant.

The MIGP protocol addresses many of the shortcomings of existing credential checking services: it avoids leaking any information about the client’s queried password to the server, while also providing a mechanism for checking whether similar passwords have been compromised. Read on to see the protocol in action!

MIGP demo

As the state of the art in attack methodologies evolve with new techniques such as credential tweaking, so must the defenses. To that end, we’ve collaborated with the designers of the MIGP protocol to prototype and deploy the MIGP protocol within Cloudflare’s infrastructure.

Our MIGP demo server is deployed at migp.cloudflare.com, and runs entirely on top of Cloudflare Workers. We use Workers KV for efficient storage and retrieval of buckets of encrypted breach entries, capping out each bucket size at the current KV value limit of 25MB. In our instantiation, we set the username hash prefix length to 20 bits, so that there are a total of 2^20 (or just over 1 million) buckets.
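
The bucket identifier itself is just the first 20 bits of a hash of the username. The sketch below shows the idea; the use of SHA-256 here is an assumption for illustration, and the real derivation (including any normalization of the username) is the one in migp-go.

package main

import (
    "crypto/sha256"
    "encoding/binary"
    "fmt"
)

const prefixBits = 20 // 2^20 buckets, just over one million

// bucketID keeps only the top prefixBits bits of the username's hash.
func bucketID(username string) uint32 {
    sum := sha256.Sum256([]byte(username))
    top := binary.BigEndian.Uint32(sum[:4])
    return top >> (32 - prefixBits)
}

func main() {
    fmt.Printf("bucket %05x\n", bucketID("someuser@example.com"))
}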

There are currently two ways to interact with the demo MIGP service: via the browser client at migp.cloudflare.com, or via the Go client included in our open-sourced MIGP library. As shown in the screenshots below, the browser client displays the request from your device and the response from the MIGP service. Take care not to enter any sensitive credentials into a third-party service (feel free to use the test credentials [email protected] and password1 for the demo).

Keep in mind that “absence of evidence is not evidence of absence”, especially in the context of data breaches. We intend to periodically update the breach datasets used by the service as new public breaches become available, but no breach alerting service will be able to provide 100% accuracy in assuring that your credentials are safe.

See the MIGP demo in action in the attached screenshots. Note that in all cases, the username ([email protected]) and corresponding username prefix hash (000f90f4) remain the same, so the client retrieves the exact same bucket contents from the server each time. However, the blindElement parameter in the client request differs per request, allowing the client to decrypt different bucket elements depending on the queried credentials.

Privacy-Preserving Compromised Credential Checking
Example query in which the credentials are exposed in the breach dataset
Privacy-Preserving Compromised Credential Checking
Example query in which similar credentials were exposed in the breach dataset
Privacy-Preserving Compromised Credential Checking
Example query in which the username is present in the breach dataset
Privacy-Preserving Compromised Credential Checking
Example query in which the credentials are not found in the dataset

Open-sourced MIGP library

We are open-sourcing our implementation of the MIGP library under the BSD-3 License. The code is written in Go and is available at https://github.com/cloudflare/migp-go. Under the hood, we use Cloudflare’s CIRCL library for OPRF support and Go’s supplementary cryptography library for scrypt support. Check out the repository for instructions on setting up the MIGP client to connect to Cloudflare’s demo MIGP service. Community contributions and feedback are welcome!

Future directions

In this post, we announced our open-sourced implementation and demo deployment of MIGP, a next-generation breach alerting service. Our deployment is intended to lead the way for other credential compromise checking services to migrate to a more privacy-friendly model, but is not itself currently meant for production use. However, we identify several concrete steps that can be taken to improve our service in the future:

  • Add more breach datasets to the database of precomputed entries
  • Increase the number of variants in server-side precomputation
  • Add library support in more programming languages to reach a broader developer base
  • Hide the number of ciphertexts per bucket by padding with dummy entries
  • Add support for efficient client-side variant checking by batching API calls to the server

For exciting future research directions that we are investigating — including one proposal to remove the transmission of plaintext passwords from client to server entirely — take a look at https://blog.cloudflare.com/research-directions-in-password-security.

We are excited to share and build upon these ideas with the wider Internet community, and hope that our efforts impact positive change in the password security ecosystem. We are particularly interested in collaborating with stakeholders in the space to develop, test, and deploy next-generation protocols to improve user security and privacy. You can reach us with questions, comments, and research ideas at [email protected]. For those interested in joining our team, please visit our Careers Page.

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

Post Syndicated from Marwan Fayed original https://blog.cloudflare.com/addressing-agility/

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

At large operational scales, IP addressing stifles innovation in network- and web-oriented services. For every architectural change, and certainly when starting to design new systems, the first questions we are forced to ask are:

  • Which block of IP addresses do or can we use?
  • Do we have enough in IPv4? If not, where or how can we get them?
  • How do we use IPv6 addresses, and does this affect other uses of IPv6?
  • Oh, and what careful plan, checks, time, and people do we need for migration?

Having to stop and worry about IP addresses costs time, money, resources. This may sound surprising, given the visionary and resilient advent of IP, 40+ years ago. By their very design, IP addresses should be the last thing that any network has to think about. However, if the Internet has laid anything bare, it’s that small or seemingly unimportant weaknesses — often invisible or impossible to see at design time — always show up at sufficient scale.

One thing we do know: “more addresses” should never be the answer. In IPv4, that type of thinking only contributes to their scarcity, driving their market prices up further. IPv6 is absolutely necessary, but only one part of the solution. For example, in IPv6, the best practice says that the smallest allocation, just for personal use, is /56 — that’s 2^72 or about 4,722,000,000,000,000,000,000 addresses. I certainly can’t reason about numbers that large. Can you?

In this blog post, we’ll explain why IP addressing is a problem for web services, the underlying causes, and then describe an innovative solution that we’re calling Addressing Agility, alongside the lessons we’ve learned. The best part of all may be the kinds of new systems and architectures enabled by Addressing Agility. The full details are available in our recent paper from ACM SIGCOMM 2021. As a preview, here is a summary of some of the things we learned:

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

It’s true! There is no limit to the number of names that can appear on any single address; the address of any name can change with every new query, anywhere; and address changes can be made for any reason, be it service provisioning or policy or performance evaluation, or others we’ve yet to encounter…

Explained below are the reasons this is all true, the way we get there, and the reasons these lessons matter for HTTP and TLS services of any size. The key insight on which we build: in the design of the Internet Protocol (IP), much like the global postal system, addresses have never been, should never be, and in no way are ever, needed to represent names. We just sometimes treat addresses as if they do. Instead, this work shows that all names can share all of their addresses, any set of their addresses, or even just one address.

The narrow waist is a funnel, but also a choke point

Decades-old conventions artificially tie IP addresses to names and resources. This is understandable since the architecture and software that drive the Internet evolved from a setting in which one computer had one name and (most often) one network interface card. It would be natural, then, for the Internet to evolve such that one IP address would be associated with names and software processes.

Among end clients and network carriers, where there is little need for names and less need for listening processes, these IP bindings have little impact. However, the name and process conventions create strong limitations on all content hosting, distribution, and content-service providers (CSPs). Once assigned to names, interfaces, and sockets, addresses become largely static and require effort, planning, and care to change if change is possible at all.

The “narrow waist” of IP has enabled the Internet, but much like TCP has been to transport protocols and HTTP to application protocols, IP has become a stifling bottleneck to innovation. The idea is depicted by the figure below, in which we see that otherwise separate communication bindings (with names) and connection bindings (with interfaces and sockets) create transitive relationships between them.

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

The transitive lock is hard to break, because changing either can have an impact on the other. Moreover, service providers often use IP addresses to represent policies and service levels that themselves exist independently of names. Ultimately the IP bindings are one more thing to think about — and for no good reason.

Let’s put this another way. When thinking of new designs, new architectures, or just better resource allocations, the first set of questions should never be “which IP addresses do we use?” or “do we have IP addresses for this?” Questions like these and their answers slow development and innovation.

We realised that IP bindings are not only artificial but, according to the original visionary RFCs and standards, also incorrect. In fact, the notion of IP addresses as being representative of anything other than reachability runs counter to their original design. In the original RFC and related drafts, the architects are explicit, “A distinction is made between names, addresses, and routes. A name indicates what we seek. An address indicates where it is. A route indicates how to get there.” Any association to IP of information like SNI or HTTP host in higher-layer protocols is a clear violation of the layering principle.

Of course none of our work exists in isolation. It does, however, complete a long-standing evolution to decouple IP addresses from their conventional use, an evolution that consists of standing on the shoulders of giants.

The Evolving Past…

Looking backwards over the last 20 years, it’s easy to see that a quest for addressing agility has been ongoing for some time, and one in which Cloudflare has been deeply invested.

The decades-old one-to-one binding between IP addresses and network interface cards was first broken a few years ago when Google’s Maglev combined Equal Cost MultiPath (ECMP) and consistent hashing to disseminate traffic from one “virtual” IP address among many servers. As an aside, according to the original Internet Protocol RFCs, this use of IP is entirely consistent with the design, and there is nothing virtual about it.

Many similar systems have since emerged at GitHub, Facebook, and elsewhere, including our very own Unimog. More recently, Cloudflare designed a new programmable sockets architecture called bpf_sk_lookup to decouple IP addresses from sockets and processes.

But what about those names? The value of “virtual hosting” was cemented in 1997 when HTTP/1.1 made the Host header field mandatory. This was the first official acknowledgement that multiple names can coexist on a single IP address, and it was necessarily reproduced by TLS in the Server Name Indication (SNI) field. These are absolute requirements, since the number of possible names is greater than the number of IP addresses.

…Indicates an Agile Future

Looking ahead, Shakespeare was wise to ask, “What’s in a Name?” If the Internet could speak then it might say, “That name which we label by any other address would be just as reachable.”

If Shakespeare instead asked, “What is in an address?” then the Internet would similarly answer, “That address which we label by any other name would be just as reachable, too.”

A strong implication emerges from the truth of those answers: The mapping between names and addresses is any-to-any. If this is true then any address can be used to reach a name as long as a name is reachable at an address.

In fact, a version of many addresses for a name has been available since 1995 with the introduction of DNS-based load-balancing. Then why not all addresses for all names, or any addresses at any given time for all names? Or — as we’ll soon discover — one address for all names! But first let’s talk about the manner in which addressing agility is achieved.

Achieving Addressing Agility: Ignore names, map policies

The key to addressing agility is authoritative DNS — but not in the static name-to-IP mappings stored in some form of a record or lookup table. Consider that from any client’s perspective, the binding only appears ‘on-query’. For all practical uses of the mapping, the query’s response is the last possible moment in the lifetime of a request where a name can be bound to an address.

This leads to the observation that name mappings are actually made, not in some record or zone file, but at the moment the response is returned. It’s a subtle, but important distinction. Today’s DNS systems use a name to look up a set of addresses, and then sometimes use some policy to decide which specific address to return. The idea is shown in the figure below. When a query arrives, a lookup reveals the addresses associated with that name, and then returns one or more of those addresses. Often, additional policy or logic filters are used to narrow the address selection, such as service level or geo-regional coverage. The important detail is that addresses are identified with a name first, and policies are only applied afterwards.

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

(a) Conventional Authoritative DNS

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

(b) Addressing Agility

Addressing agility is achieved by inverting this relationship. Instead of IP addresses pre-assigned to a name, our architecture begins with a policy that may (or, in our case, may not) include a name. For example, a policy may be represented by attributes such as location and account type and ignore the name entirely (which is what we did in our deployment). The attributes identify a pool of addresses that are associated with that policy. The pool itself may be isolated to that policy or have elements shared with other pools and policies. Moreover, all the addresses in the pool are equivalent. This means that any of the addresses may be returned — or even selected at random — without inspecting the DNS query name.
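
A toy version of that lookup makes the inversion visible: the responder keys its pools on policy attributes and never consults the query name. The attributes, pool contents, and TEST-NET addresses below are made up for illustration.

package main

import (
    "fmt"
    "math/rand"
    "net/netip"
)

// policy captures attributes that matter for address selection; note the
// absence of any hostname field.
type policy struct {
    Region      string
    ServiceTier string
}

var pools = map[policy][]netip.Addr{
    {Region: "ca", ServiceTier: "free"}: {
        netip.MustParseAddr("192.0.2.1"),
        netip.MustParseAddr("192.0.2.2"),
        netip.MustParseAddr("192.0.2.3"),
    },
}

// answer returns any address from the policy's pool; every address in the
// pool is equivalent, so a random pick is fine.
func answer(p policy) netip.Addr {
    pool := pools[p]
    return pool[rand.Intn(len(pool))]
}

func main() {
    fmt.Println(answer(policy{Region: "ca", ServiceTier: "free"}))
}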

Now pause for a moment, because there are two really noteworthy implications that fall out of per-query responses:

i. IP addresses can be, and are, computed and assigned at runtime or query-time.

ii. The lifetime of the IP-to-name mapping is the larger of the ensuing connection lifetime and the TTL in downstream caches.

The outcome is powerful and means that the binding itself is otherwise ephemeral and can be changed without regard to previous bindings, resolvers, clients, or purpose. Also, scale is no issue, and we know because we deployed it at the edge.

IPv6 — new clothes, same emperor

Before talking about our deployment, let’s first address the proverbial elephant in the room: IPv6. The first thing to make clear is that everything — everything — discussed here in the context of IPv4 applies equally in IPv6. As is true of the global postal system, addresses are addresses, whether in Canada, Cambodia, Cameroon, Chile, or China — and that includes their relatively static, inflexible nature.

Despite this equivalence, the obvious question remains: surely all the reasons to pursue Addressing Agility are satisfied simply by changing to IPv6? Counter-intuitive as it may be, the answer is a definite, absolute no! IPv6 may mitigate address exhaustion, at least for the lifetimes of everyone alive today, but the abundance of IPv6 prefixes and addresses makes reasoning about their bindings to names and resources difficult.

The abundance of IPv6 addresses also risks inefficiencies because operators can take advantage of the bit length and large prefix sizes to embed meaning into the IP address. This is a powerful feature of IPv6, but also means many, many, addresses in any prefix will go unused.

To be clear, Cloudflare is demonstrably one of the biggest advocates of IPv6, and for good reasons, not least that the abundance of addresses ensures longevity. Even so, IPv6 changes little about the way addresses are tied to names and resources, whereas an address’ agility ensures flexibility and responsiveness for their lifetimes.

A Side-note: Agility is for Everyone

One last comment on the architecture and its transferability — Addressing Agility is usable, even desirable, for any service that operates authoritative DNS. Other content-oriented service providers are obvious contenders, but so too are smaller operators. Universities, enterprises, and governments are just a few examples of organizations that can operate their own authoritative services. So long as the operators are able to accept connections on the IP addresses that are returned, all are potential beneficiaries of addressing agility as a result.

Policy-based randomized addresses — at scale

We’ve been working with Addressing Agility live at the edge, with production traffic, since June 2020, as follows:

  • More than 20 million hostnames and services
  • All data centers in Canada (giving a reasonable population and multiple time zones)
  • /20 (4096 addresses) in IPv4 and /44 in IPv6
  • /24 (256 addresses) in IPv4 from January 2021 to June 2021
  • For every query, generate a random host-portion within the prefix (sketched just below).
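
The per-query randomization in the last item above is little more than filling in the host bits of a fixed prefix. A minimal sketch, using a documentation prefix rather than one of the prefixes we actually announced:

package main

import (
    "encoding/binary"
    "fmt"
    "math/rand"
    "net/netip"
)

// randomIn keeps the network bits of an IPv4 prefix fixed and draws the
// remaining host bits at random (IPv4 only, for brevity).
func randomIn(prefix netip.Prefix) netip.Addr {
    base := binary.BigEndian.Uint32(prefix.Addr().AsSlice())
    hostBits := 32 - prefix.Bits()
    host := rand.Uint32() & ((1 << hostBits) - 1)
    var out [4]byte
    binary.BigEndian.PutUint32(out[:], base|host)
    return netip.AddrFrom4(out)
}

func main() {
    p := netip.MustParsePrefix("198.51.100.0/24") // 256 addresses, like the later /24 experiment
    fmt.Println(randomIn(p))
}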

After all, the most extreme test of agility is generating a random address for every query that hits our servers. Then we decided to truly put the idea to the test: in June 2021, in our Montreal data center and soon after in Toronto, all 20+ million zones were mapped to one single address.

Over the course of one year, every query for a domain captured by the policy received an address selected at random — from a set of as few as 4096 addresses, then 256, and then one. Internally, we refer to the address set of one as Ao1, and we’ll return to this point later.

The measure of success: “Nothing to see here”

There may be a number of questions our readers are quietly asking themselves:

  • What did this break on the Internet?
  • What effect did this have on Cloudflare systems?
  • What would I see happening if I could?

The short answer to each question above is nothing. But — and this is important — address randomization does expose weaknesses in the designs of systems that rely on the Internet. The weaknesses occur, every one of them, because the designers ascribe meaning to IP addresses beyond reachability. (And, if only incidentally, every one of those weaknesses is circumvented by the use of one address, or ‘Ao1.’)

To better understand the nature of “nothing”, let’s answer the above questions starting from the bottom of the list.

What would I see if I could?

The answer is shown by the example in the figure below. From all data centers in the “Rest of World” outside our deployment, a query for a zone returns the same addresses (such is Cloudflare’s global anycast system). In contrast, every query that lands in a deployment data center receives a random address. These can be seen below in successive dig commands to two different data centers.

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

For those who may be wondering about subsequent request traffic, yes, this means that servers are configured to accept connection requests for any of the 20+ million domains on all addresses in the address pool.

Ok, but surely Cloudflare’s surrounding systems needed modification?

Nope. This is a drop-in, transparent change to the data pipeline for authoritative DNS. Routing prefix advertisements in BGP, DDoS protections, load balancers, the distributed cache: none of them required any changes.

There is, however, one fascinating side effect: randomization is to IP addresses as a good hash function is to a hash table — it evenly maps an arbitrary size input to a fixed number of outputs. The effect can be seen by looking at measures of load-per-IP before and after randomization as in the graphs below, with data taken from 1% samples of requests at one data center over seven days.

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services
Before Addressing Agility
Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services
Randomization on /20
Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services
Randomization on /24

Before randomization, for only a small portion of Cloudflare’s IP space, (a) the difference between greatest and least requests per IP (y1-axis on the left) is three orders of magnitude; similarly, bytes per IP (y2-axis on the right) is almost six orders of magnitude. After randomization, (b) for all domains on a single /20 that previously occupied multiple /20s, these reduce to 2 and 3 orders of magnitude, respectively. Taking this one step further down to /24 in (c), per-query randomization of 20+ million zones onto 256 addresses reduces differences in load to small constant factors.

This might matter to any content service provider that thinks about provisioning resources by IP address. A priori predictions of the load generated by a customer can be hard. The above graphs are evidence that the best path forward is to give all the addresses to all the names.

Surely this breaks something on the wider Internet?

Here, too, the answer is no! Well, perhaps more precisely stated as, “no, randomization breaks nothing… but it can expose weaknesses in systems and their designs.”

Any system that might be affected by address randomization appears to have a prerequisite: some meaning is ascribed to the IP address beyond just reachability. Addressing Agility keeps and even restores the semantics of IP addresses and the core Internet architecture, but it will break software systems that make assumptions about their meaning.

Let’s first cover a few examples, why they don’t matter, and then follow with a small change to addressing agility that bypasses weaknesses (by using one single IP address):

  • HTTP Connection Coalescing enables a client to re-use existing connections to request resources from different origins. Clients such as Firefox that permit coalescing when the URI authority matches the connection are unaffected. However, clients that require a URI host to resolve to the same IP address as the existing connection will fail to coalesce.
  • Non-TLS or non-HTTP services may be affected. One example is ssh, which maintains a hostname-to-IP mapping in its known_hosts file. This association, while understandable, is outdated and already broken, given that many DNS records presently return more than one IP address.
  • Non-SNI TLS certificates require a dedicated IP address. Providers are forced to charge a premium because each address can only support a single certificate without SNI. The bigger issue, independent of IP, is the use of TLS without SNI. We have launched efforts to understand non-SNI to hopefully end this unfortunate legacy.
  • DDoS protections that rely on destination IPs may be hindered, initially. We would argue that addressing agility is beneficial for two reasons. First, IP randomization distributes the attack load across all addresses in use, effectively serving as a layer-3 load-balancer. Second, DoS mitigations often work by changing IP addresses, an ability that is inherent in Addressing Agility.

All for One, and One for All

We started with 20+ million zones spread across tens of thousands of addresses, and successfully served them from 4096 addresses in a /20 and then 256 addresses in a /24. Surely this trend raises the following question:

If randomization works over n addresses, then why not randomization over 1 address?

Indeed, why not? Recall from above the comment about randomization over IPs as being equivalent to a perfect hash function in a hash table. The thing about well-designed hash-based structures is that they preserve their properties for any size of the structure, even a size of 1. Such a reduction would be a true test of the foundations on which Addressing Agility is constructed.

So, test we did. We went from a /20 address set, to a /24, and then, from June 2021, to an address set of one: a single /32, and equivalently a /128 (Ao1). It doesn’t just work. It really works. Concerns that might be exposed by randomization are resolved by Ao1. For example, non-TLS or non-HTTP services get a stable IP address (or at least a non-random one, until there is a policy change on the name). Also, HTTP connection coalescing falls out as if for free, and, yes, we see increased levels of coalescing where Ao1 is being used.

But why in IPv6 where there are so many addresses?

One argument against binding to a single IPv6 address is that there is no need, because address exhaustion is unlikely. This is a pre-CIDR position that, we claim, is benign at best and irresponsible at worst. As mentioned above, the number of IPv6 addresses makes reasoning about them difficult. In lieu of asking why use a single IPv6 address, we should be asking, “why not?”

Are there upstream implications? Yes, and opportunities!

Ao1 reveals an entirely different set of implications from IP randomization that, arguably, gives us a window into the future of Internet routing and reachability by amplifying the effects that seemingly small actions might have.

Why? The number of possible variable-length names in the universe will always exceed the number of fixed-length addresses. This means that, by the pigeonhole principle, single IP addresses must be shared by multiple names, and different content from unrelated parties.

The possible upstream effects amplified by Ao1 are worth raising and are described below. So far, though, we’ve seen none of these in our evaluations, nor have they come up in communications with upstream networks.

  • Upstream Routing Errors are Immediate and Total. If all traffic arrives on a single address (or prefix), then upstream routing errors affect all content equally. (This is the reason Cloudflare returns two addresses in non-contiguous address ranges.) Note, however, the same is true of threat blocking.
  • Upstream DoS Protections could be triggered. It is conceivable that the concentration of requests and traffic on a single address could be perceived upstream as a DoS attack and trigger upstream protections that may exist.

In both cases, the risks are mitigated by Addressing Agility’s ability to change addresses en masse, and quickly. Prevention is also possible, but requires open communication and discourse.

One last upstream effect remains:

  • Port exhaustion in IPv4 NAT might be accelerated, and is solved by IPv6! From the client side, the number of permissible concurrent connections to one address is upper-bounded by the size of a transport protocol’s port field, for example about 65K in TCP.

For example, in TCP on Linux this was an issue until recently (see this commit and SO_BIND_ADDRESS_NO_PORT in the ip(7) man page). In UDP the issue remains. In QUIC, connection identifiers can prevent port exhaustion, but they have to be used. So far, though, we have yet to see any evidence that this is an issue.

Even so — and here is the best part — to the best of our knowledge this is the only risk to one-address uses, and is also immediately resolved by migrating to IPv6. (So, ISPs and network administrators, go forth and implement IPv6!)

We’re just getting started!

And so we end as we began. With no limit to the number of names on any single IP address, and the ability to change the address per query, for any reason, what could you build?

Unbuckling the narrow waist of IP: Addressing Agility for Names and Web Services

We are, indeed, just getting started! The flexibility and future-proofing enabled by Addressing Agility is enabling us to imagine, design, and build new systems and architectures. We’re planning BGP route leak detection and mitigation for anycast systems, measurement platforms, and more.

Further technical details on all the above, as well as acknowledgements to so many who helped make this possible, can be found in this paper and short talk. Even with these new possibilities, challenges remain. There are many open questions that include, but are in no way limited to the following:

  • What policies can be reasonably expressed or implemented?
  • Is there an abstract syntax or grammar with which to express them?
  • Could we use formal methods and verification to prevent erroneous or conflicting policies?

Addressing Agility is for everyone, even necessary for these ideas to succeed more widely. Input and ideas are welcomed at [email protected].

If you are a student enrolled in a PhD or equivalent research program and looking for an internship for 2022 in the USA, Canada, the EU, or the UK, we would love to hear from you.

If you’re interested in contributing to projects like this or helping Cloudflare develop its traffic and address management systems, our Addressing Engineering team is hiring!

Research Directions in Password Security

Post Syndicated from Ian McQuoid original https://blog.cloudflare.com/research-directions-in-password-security/

Research Directions in Password Security

As Internet users, we all deal with passwords every day. With so many different services, each with their own login systems, we have to somehow keep track of the credentials we use with each of these services. This situation leads some users to delegate credential storage to password managers like LastPass or a browser-based password manager, but this is far from universal. Instead, many people still rely on old-fashioned human memory, which has its limitations — leading to reused passwords and to security problems. This blog post discusses how Cloudflare Research is exploring how to minimize password exposure and thwart password attacks.

The Problem of Password Reuse

Because it’s too difficult to remember many distinct passwords, people often reuse them across different online services. When breached password datasets are leaked online, attackers can take advantage of these to conduct “credential stuffing attacks”. In a credential stuffing attack, an attacker tests breached credentials against multiple online login systems in an attempt to hijack user accounts. These attacks are highly effective because users tend to reuse the same credentials across different websites, and they have quickly become one of the most prevalent types of online guessing attacks. Automated attacks can be run at a large scale, testing out exposed passwords across multiple systems, under the assumption that some of these passwords will unlock accounts somewhere else (if they have been reused). When a data breach is detected, users of that service will likely receive a security notification and will reset that account password. However, if this password was reused elsewhere, they may easily forget that it needs to be changed for those accounts as well.

How can we protect against credential stuffing attacks? There are a number of methods that have been deployed — with varying degrees of success. Password managers address the problem of remembering a strong, unique password for every account, but many users have yet to adopt them. Multi-factor authentication is another potential solution — that is, using another form of authentication in addition to the username/password pair. This can work well, but has limits: for example, such solutions may rely on specialized hardware that not all clients have. Consumer systems are often reluctant to mandate multi-factor authentication, given concerns that people may find it too complicated to use; companies do not want to deploy something that risks impeding the growth of their user base.

Since there is no perfect solution, security researchers continue to try to find improvements. Two different approaches we will discuss in this blog post are hardening password systems using cryptographically secure keys, and detecting the reuse of compromised credentials, so they don’t leave an account open to guessing attacks.

Improved Authentication with PAKEs

Investigating how to securely authenticate a user just using what they can remember has been an important area in secure communication. To this end, the subarea of cryptography known as Password Authenticated Key Exchange (PAKE) came about. PAKEs deal with protocols for establishing cryptographically secure keys where the only source of authentication is a human memorizable (low-entropy, attacker-guessable) password — that is, the “what you know” side of authentication.

Before diving into the details, we’ll provide a high-level overview of the basic problem. Although passwords are typically protected in transit by being sent over HTTPS, servers handle them in plaintext to verify them once they arrive. Handling plaintext passwords increases security risk — for instance, they might get inadvertently logged and exposed. Ideally, the user’s password never gets sent to the server in the first place. This is where PAKEs come in — a means of verifying that the user and server share a password, ideally without revealing information about the password that could help attackers to discover or crack it.

A few words on PAKEs

PAKE protocols let two parties turn a password into a shared key. Each party only gets one guess at the password the other holds. If a user tries to log in to the wrong server with a PAKE, that server will not be able to turn around and impersonate the user. As such, PAKEs guarantee that communication with one of the parties is the only way for an attacker to test their (single) password guess. This may seem like an unneeded level of complexity when we could use already available tools like a key distribution mechanism along with password-over-TLS, but this puts a lot of trust in the service. You may trust a service to learn the password you use for that service, but what if you accidentally type the password for a different service when trying to log in? Note the particular risks of a reused password: it is no longer just a secret shared between a user and a single service, but is now a secret shared between a user and multiple services. This therefore increases the password’s privacy sensitivity — a service should not know users’ account login information for other services.

A comparison of shared secrets between passwords over TLS versus PAKEs. With passwords over TLS, a service might learn passwords used on another service; this problem does not arise with PAKEs.

PAKE protocols are built with the assumption that the server isn’t always working in the best interest of the client and, even more, cannot use any kind of public-key infrastructure during login (although it doesn’t hurt to have both!). This precludes the user from sending their plaintext password (or any information that could be used to derive it, in a computational sense) to the server during login.

PAKE protocols have expanded into new territory since the seminal EKE paper of Bellovin and Merritt, where the client and server both remembered a plaintext version of the password. As mentioned above, when the server stores the plaintext password, the client risks having the password logged or leaked. To address this, new protocols were developed, referred to as augmented, verifier-based, or asymmetric PAKEs (aPAKEs), where the server stored a modified version (similar to a hash) of the password instead of the plaintext password. This mirrors the way many of us were taught to store passwords in a database, specifically as a hash of the password with accompanying salt and pepper. However, in these cases, attackers can still use traditional methods of attack such as targeted rainbow tables. To avoid these kinds of attacks, a new kind of PAKE was born, the strong asymmetric PAKE (saPAKE).

OPAQUE was the first saPAKE and it guarantees defense against precomputation by hiding the password dictionary itself! It does this by replacing the noninteractive hash function with an interactive protocol referred to as an Oblivious Pseudorandom Function (OPRF) where one party inputs their “salt”, another inputs their “password”, and only the password-providing party learns the output of the function. The fact that the password-providing party learns nothing (computationally) about the salt prevents offline precomputation by disallowing an attacker from evaluating the function in their head.
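
To give some intuition for how an OPRF can serve as an interactive replacement for the hash, here is a sketch of one common construction (often called 2HashDH); OPAQUE's OPRF differs in its details, so treat this as illustrative only. Here $H$ hashes the password into the group, $H'$ is an ordinary hash, $r$ is a random blinding scalar chosen by the client, and $k$ is the server's secret key, playing the role of the salt:

$$ \text{Client: } \quad B = r \cdot H(\text{pw}) $$
$$ \text{Server: } \quad B' = k \cdot B $$
$$ \text{Client: } \quad y = H'(\text{pw},\; r^{-1} \cdot B') = H'(\text{pw},\; k \cdot H(\text{pw})) $$

The blinding scalar $r$ hides the password (and even its hash) from the server, while the client never learns $k$, so the only way to evaluate the function is through this online exchange.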

Another way to think about the three PAKE paradigms has to do with how each of them treats the password dictionary:

| PAKE type | Password Dictionary | Threat Model |
| --- | --- | --- |
| PAKE | The password dictionary is public and common to every user. | Without any guessing, the attacker learns the user’s password upon compromise of the server. |
| aPAKE | Each user gets their own password dictionary; a description of the dictionary (e.g., the “salt”) is leaked to the client when they attempt to log in. | The attacker must perform an independent precomputation for each client they want to attack. |
| saPAKE (e.g., OPAQUE) | Each user gets their own password dictionary; the server only provides an online interface (the OPRF) to the dictionary. | The adversary must wait until after they compromise the server to run an offline attack on the user’s password [1]. |

OPAQUE also goes one step further and allows the user to perform the password transformation on their own device so that the server doesn’t see the plaintext password during registration either. Cloudflare Research has been involved with OPAQUE for a while now — for instance, you can read about our previous implementation work and demo if you want to learn more.

But OPAQUE is not a panacea: in the event of server compromise, the attacker can learn the salt that the server uses to evaluate the OPRF and can still run the same offline attack that was available in the aPAKE world, although this is now considerably more time-consuming and can be made increasingly difficult through the use of memory-hard hash functions like scrypt. This means that despite our best efforts, when a server is breached, the attacker can eventually come out with a list of plaintext passwords. Indeed, this attack is ultimately unavoidable, as the attacker can always run the (sa)PAKE protocol in their head, acting as both parties, to test each password. With this being the case, we still need to take steps to defend against automated password attacks such as credential stuffing attacks and have ways of mitigating them.

Are You Overexposed?

To help detect and respond to credential stuffing, Cloudflare recently rolled out the Exposed Credential Checks feature on the Web Application Firewall (WAF), which can alert the origin if a user’s login credentials have appeared in a recent breach. Historically, compromised credential checking services have allowed users to be proactive against credential stuffing attacks when their username and password appear together in a breach. However, they do not account for recently proposed credential tweaking attacks, in which an attacker tries variants of a breached password, under the assumption that users often use slight modifications of the same password for different accounts, such as “sunshineFB”, “sunshineIG”, and so on. Therefore, compromised credential check services should incorporate methods of checking for credential tweaks.

Under the hood, Cloudflare’s Exposed Credential Checks feature relies on an underlying protocol deemed Might I Get Pwned (MIGP). MIGP uses the bucketization method proposed in Li et al. to avoid sending the plaintext username or password to the server while handling a large breach dataset. After receiving a user’s credentials, MIGP hashes the username and sends a portion of that hash as a “bucket identifier” to the server. The client and server can then perform a private membership test protocol to verify whether the user’s username/password pair appeared in that bucket, without ever having to send plaintext credentials to the server.
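
For illustration, the bucketization step might look like the sketch below; the exact hashing and the prefix length used by the real MIGP deployment differ, so treat the 20-bit prefix here as an arbitrary example.

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// bucketID hashes the username and keeps only a short prefix of the digest.
// Only this prefix is sent to the server, so the server learns which bucket to
// search but not the username itself. The 20-bit prefix length used here is an
// arbitrary choice for illustration.
func bucketID(username string) string {
    digest := sha256.Sum256([]byte(username))
    prefix := []byte{digest[0], digest[1], digest[2] & 0xF0} // first 20 bits of the hash
    return hex.EncodeToString(prefix)
}

func main() {
    fmt.Println(bucketID("alice@example.com"))
}

Because many usernames map to the same truncated hash, the bucket identifier alone does not reveal which user is being checked; the private membership test then resolves the exact credential pair inside that bucket.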

Unlike previous compromised credential check services, MIGP also enables credential tweaking checks by augmenting the original breach dataset with a set of password “variants”. For each leaked password, it generates a list of password variants, which are labeled as such to differentiate them from the original leaked password and added to the original dataset. For more information, you can check out the Cloudflare Research blog post detailing our open-source implementation and deployment of the MIGP protocol.  

Measuring Credential Compromises

The question remains, just how important are these exposed credential checks for detecting and preventing credential stuffing attacks in practice? To answer this question, the Research Team has initiated a study investigating login requests to our own Cloudflare dashboard. For this study, we are collecting the data logged by Cloudflare’s Exposed Credential Check feature (described above), designed to be privacy-preserving: this check does not reveal a password, but provides a “yes/no” response on whether the submitted credentials appear in our breach dataset. Along with this signal, we are looking at other fields that may be indicative of malicious behavior such as bot score and IP reputation. As this project develops, we plan to cluster the data to find patterns of different types of credential stuffing attacks that we can generalize to form attack fingerprints. We can then feed these fingerprints into the alert logs for the Cloudflare Detection & Response team to see if they provide useful information for the security analysts.

Additionally, we hope to investigate potential post-compromise behavior as it relates to these compromise check fields. After an attacker successfully hijacks an account, they may take a number of actions such as changing the password, revoking all valid access tokens, or setting up a malicious script. By analyzing compromised credential checks along with these signals, we may be able to better differentiate benign from malicious behavior.

Future directions: OPAQUE and MIGP combined

This post has discussed how we’re approaching the problem of preventing credential stuffing attacks from two different angles. Through the deployment and analysis of compromised credential checks, we aim to prevent server compromise by detecting and preventing credential stuffing attacks before they happen. In addition, in the case that a server does get compromised, the wider use of OPAQUE would help address the problem of leaking passwords to an attacker by avoiding the reception and storage of plaintext passwords on the server as well as preventing precomputation attacks.

However, there are still remaining research challenges to address. Notably, the current method for interfacing with MIGP still requires the server to either pass along a plaintext version of the client’s password, or trust the client to honestly communicate with the MIGP service on behalf of the server. If we want to leverage the security guarantees of OPAQUE (or generally an saPAKE) with the analytics and alert system provided by MIGP in a privacy-preserving way, we need additional mechanisms.

At first glance, the privacy-preserving goals of both protocols seem to be perfect matches for each other. Both OPAQUE and MIGP are built upon the idea of replacing the traditional salted password hashes with an OPRF as a way of keeping the client’s plaintext passwords from ever leaving their device. However, the interfaces for these protocols rely on user-provided inputs which aren’t cryptographically tied to each other. This allows an attacker to provide a false password to MIGP while providing their actual password to the OPAQUE server. Further, the security analyses of both protocols assume that their idealized building blocks are kept separate in an important way. This isn’t to say that the two protocols are incompatible; indeed, much of their machinery can be reused.

The next stage for password privacy will be an integration of these two protocols, so that a server can be made aware of credential stuffing attacks and patterns of compromised account usage, protecting it against the compromise of other servers while providing the same privacy guarantees that OPAQUE does. Our goal is to allow you to protect yourself from other compromised servers while protecting your clients from compromise of your server. Stay tuned for updates!

We’re always keen to collaborate with others to build more secure systems, and would love to hear from those interested in password research. You can reach us with questions, comments, and research ideas at [email protected]. For those interested in joining our team, please visit our Careers Page.


[1] There are other ways of constructing saPAKE protocols. The curious reader can see this CRYPTO 2019 paper for details.

Should we teach AI and ML differently to other areas of computer science? A challenge

Post Syndicated from Sue Sentance original https://www.raspberrypi.org/blog/research-seminar-data-centric-ai-ml-teaching-in-school/

Between September 2021 and March 2022, we’re partnering with The Alan Turing Institute to host a series of free research seminars about how to teach AI and data science to young people.

In the second seminar of the series, we were excited to hear from Professor Carsten Schulte, Yannik Fleischer, and Lukas Höper from the University of Paderborn, Germany, who presented on the topic of teaching AI and machine learning (ML) from a data-centric perspective. Their talk raised the question of whether and how AI and ML should be taught differently from other themes in the computer science curriculum at school.

Machine behaviour — a new field of study?

The rationale behind the speakers’ work is a concept they call hybrid interaction system, referring to the way that humans and machines interact. To explain this concept, Carsten referred to a 2019 article published in Nature by Iyad Rahwan and colleagues: Machine behaviour. The article’s authors propose that the study of AI agents (complex and simple algorithms that make decisions) should be a separate, cross-disciplinary field of study, because of the ubiquity and complexity of AI systems, and because these systems can have both beneficial and detrimental impacts on humanity, which can be difficult to evaluate. (Our previous seminar by Mhairi Aitken highlighted some of these impacts.) The authors state that to study this field, we need to draw on scientific practices from across different fields, as shown below:

Machine behaviour as a field sits at the intersection of AI engineering and behavioural science. Quantitative evidence from machine behaviour studies feeds into the study of the impact of technology, which in turn feeds questions and practices into engineering and behavioural science.
The interdisciplinarity of machine behaviour. (Image taken from Rahwan et al [1])

In establishing their argument, the authors compare the study of animal behaviour and machine behaviour, citing that both fields consider aspects such as mechanism, development, evolution and function. They describe how part of this proposed machine behaviour field may focus on studying individual machines’ behaviour, while collective machines and what they call ‘hybrid human-machine behaviour’ can also be studied. By focusing on the complexities of the interactions between machines and humans, we can think both about machines shaping human behaviour and humans shaping machine behaviour, and a sort of ‘co-behaviour’ as they work together. Thus, the authors conclude that machine behaviour is an interdisciplinary area that we should study in a different way to computer science.

Carsten and his team said that, as educators, we will need to draw on the parameters and frameworks of this machine behaviour field to be able to effectively teach AI and machine learning in school. They argue that our approach should be centred on data, rather than on code. I believe this is a challenge to those of us developing tools and resources to support young people, and that we should be open to these ideas as we forge ahead in our work in this area.

Ideas or artefacts?

In the interpretation of computational thinking that Jeannette Wing popularised in 2006, she introduces computational thinking as being about ‘ideas, not artefacts’. When we, the computing education community, started to think about computational thinking, we moved from focusing on specific technology — and how to understand and use it — to the ideas or principles underlying the domain. The challenge now is: have we gone too far in that direction?

Carsten argued that, if we are to understand machine behaviour, and in particular, human-machine co-behaviour, which he refers to as the hybrid interaction system, then we need to be studying artefacts as well as ideas.

Throughout the seminar, the speakers reminded us to keep in mind artefacts, issues of bias, the role of data, and potential implications for the way we teach.

Studying machine learning: a different focus

In addition, Carsten highlighted a number of differences between learning ML and learning other areas of computer science, including traditional programming:

  1. The process of problem-solving is different. Traditionally, we might try to understand the problem, derive a solution in terms of an algorithm, then understand the solution. In ML, the data shapes the model, and we do not need a deep understanding of either the problem or the solution.
  2. Our tolerance of inaccuracy is different. Traditionally, we teach young people to design programs that lead to an accurate solution. However, the nature of ML means that there will be an error rate, which we strive to minimise. 
  3. The role of code is different. Rather than the code doing the work as in traditional programming, the code is only a small part of a real-world ML system. 

These differences imply that our teaching should adapt too.

A graphic demonstrating that in machine learning, as compared to other areas of computer science, the process of problem-solving, the tolerance of inaccuracy, and the role of code are different.

ProDaBi: a programme for teaching AI, data science, and ML in secondary school

In Germany, education is devolved to state governments. Although computer science (known as informatics) was only last year introduced as a mandatory subject in lower secondary schools in North Rhine-Westphalia, where Paderborn is located, it has been taught at the upper secondary levels for many years. ProDaBi is a project that researchers have been running at Paderborn University since 2017, with the aim of developing a secondary school curriculum around data science, AI, and ML.

The ProDaBi curriculum includes:

  • Two modules for 11- to 12-year-olds covering decision trees and data awareness (ethical aspects), introduced this year
  • A short course for 13-year-olds covering aspects of artificial intelligence, through the game Hexapawn
  • A set of modules for 14- to 15-year-olds, covering data science, data exploration, decision trees, neural networks, and data awareness (ethical aspects), using Jupyter notebooks
  • A project-based course for 18-year-olds, including the above topics at a more advanced level, using Codap and Jupyter notebooks to develop practical skills through projects; this course has been running the longest and is currently in its fourth iteration

Although the ProDaBi project site is in German, an English translation is available.

Modules developed as part of the ProDaBi project.

Our speakers described example activities from three of the modules:

  • Hexapawn, a two-player game inspired by the work of Donald Michie in 1961. The purpose of this activity is to support learners in reflecting on the way the machine learns. Children can then relate the activity to the behaviour of AI agents such as autonomous cars. An English version of the activity is available.
  • Data cards, a series of activities to teach about decision trees. The cards are designed in a ‘Top Trumps’ style, and based on food items, with unplugged and digital elements. 
  • Data awareness, a module focusing on the amount of data an individual can generate as they move through a city, in this case through the mobile phone network. Children are encouraged to reflect on personal data in the context of the interaction between the human and data-driven artefact, and how their view of the world influences their interpretation of the data that they are given.

Questioning how we should teach AI and ML at school

There was a lot to digest in this seminar: challenging ideas and some new concepts, for me anyway. An important takeaway for me was how much we do not yet know about the concepts and skills we should be teaching in school around AI and ML, and about the approaches that we should be using to teach them effectively. Research such as that being carried out in Paderborn, demonstrating a data-centric approach, can really augment our understanding, and I’m looking forward to following the work of Carsten and his team.

Carsten and colleagues ended with this summary and discussion point for the audience:

“‘AI education’ requires developing an adequate picture of the hybrid interaction system — a kind of data-driven, emergent ecosystem which needs to be made explicitly to understand the transformative role as well as the technological basics of these artificial intelligence tools and how they are related to data science.”

You can catch up on the seminar, including the Q&A with Carsten and his colleagues, here:

Join our next seminar

This seminar really extended our thinking about AI education, and we look forward to introducing new perspectives from different researchers each month. At our next seminar on Tuesday 2 November at 17:00–18:30 BST / 12:00–13:30 EDT / 9:00–10:30 PDT / 18:00–19:30 CEST, we will welcome Professor Matti Tedre and Henriikka Vartiainen (University of Eastern Finland). The two Finnish researchers will talk about emerging trajectories in ML education for K-12. We look forward to meeting you there.

Carsten and his colleagues are also running a series of seminars on AI and data science: you can find out about these on their registration page.

You can increase your own understanding of machine learning by joining our latest free online course!


[1] Rahwan, I., Cebrian, M., Obradovich, N., Bongard, J., Bonnefon, J. F., Breazeal, C., … & Wellman, M. (2019). Machine behaviour. Nature, 568(7753), 477-486.

The post Should we teach AI and ML differently to other areas of computer science? A challenge appeared first on Raspberry Pi.

Cloudflare and the IETF

Post Syndicated from Jonathan Hoyland original https://blog.cloudflare.com/cloudflare-and-the-ietf/

Cloudflare and the IETF

The Internet, far from being just a series of tubes, is a huge, incredibly complex, decentralized system. Every action and interaction in the system is enabled by a complicated mass of protocols woven together to accomplish their task, each handing off to the next like trapeze artists high above a virtual circus ring. Stop to think about details, and it is a marvel.

Consider one of the simplest tasks enabled by the Internet: Sending a message from sender to receiver.

The location (address) of a receiver is discovered using DNS, a connection between sender and receiver is established using a transport protocol like TCP, and (hopefully!) secured with a protocol like TLS. The sender’s message is encoded in a format that the receiver can recognize and parse, like HTTP, because the two disparate parties need a common language to communicate. Then, ultimately, the message is sent and carried in an IP datagram that is forwarded from sender to receiver based on routes established with BGP.

Even an explanation this dense is laughably oversimplified. For example, the four protocols listed are just the start, and ignore many others with acronyms of their own. The truth is that things are complicated. And because things are complicated, how these protocols and systems interact and influence the user experience on the Internet is complicated. Extra round trips to establish a secure connection increase the amount of time before useful work is done, harming user performance. The use of unauthenticated or unencrypted protocols reveals potentially sensitive information to the network or, worse, to malicious entities, which harms user security and privacy. And finally, consolidation and centralization — seemingly a prerequisite for reducing costs and protecting against attacks — makes it challenging to provide high availability even for essential services. (What happens when that one system goes down or is otherwise unavailable, or to extend our earlier metaphor, when a trapeze isn’t there to catch?)

These four properties — performance, security, privacy, and availability — are crucial to the Internet. At Cloudflare, and especially in the Cloudflare Research team, where we use all these various protocols, we’re committed to improving them at every layer in the stack. We work on problems as diverse as helping network security and privacy with TLS 1.3 and QUIC, improving DNS privacy via Oblivious DNS-over-HTTPS, reducing end-user CAPTCHA annoyances with Privacy Pass and Cryptographic Attestation of Personhood (CAP), performing Internet-wide measurements to understand how things work in the real world, and much, much more.

Above all else, these projects are meant to do one thing: focus beyond the horizon to help build a better Internet. We do that by developing, advocating, and advancing open standards for the many protocols in use on the Internet, all backed by implementation, experimentation, and analysis.

Standards

The Internet is a network of interconnected autonomous networks. Computers attached to these networks have to be able to route messages to each other. However, even if computers can send messages back and forth across the Internet, to achieve anything they have to use a common language, a lingua franca, so to speak; otherwise, much like the builders of the storied Tower of Babel, they will talk past one another. And for the Internet, standards are that common language.

Many of the parts of the Internet that Cloudflare is interested in are standardized by the IETF, which is a standards development organization responsible for producing technical specifications for the Internet’s most important protocols, including IP, BGP, DNS, TCP, TLS, QUIC, HTTP, and so on. The IETF’s mission is:

> to make the Internet work better by producing high-quality, relevant technical documents that influence the way people design, use, and manage the Internet.

Our individual contributions to the IETF help further this mission, especially given our role on the Internet. We can only do so much on our own to improve the end-user experience. So, through standards, we engage with those who use, manage, and operate the Internet to achieve three simple goals that lead to a better Internet:

1. Incrementally improve existing and deployed protocols with innovative solutions;

2. Provide holistic solutions to long-standing architectural problems and enable new use cases; and

3. Identify key problems and help specify reusable, extensible, easy-to-implement abstractions for solving them.

Below, we’ll give an example of how we helped achieve each goal, touching on a number of important technical specifications produced in recent years, including DNS-over-HTTPS, QUIC, and (the still work-in-progress) TLS Encrypted Client Hello.

Incremental innovation: metadata privacy with DoH and ECH

The Internet is not only complicated — it is leaky. Metadata seeps like toxic waste from nearly every protocol in use, from DNS to TLS, and even to HTTP at the application layer.

One critically important piece of metadata that still leaks today is the name of the server that clients connect to. When a client opens a connection to a server, it reveals the name and identity of that server in many places, including DNS, TLS, and even sometimes at the IP layer (if the destination IP address is unique to that server). Linking client identity (IP address) to target server names enables third parties to build a profile of per-user behavior without end-user consent. The result is a set of protocols that does not respect end-user privacy.

Fortunately, it’s possible to incrementally address this problem without regressing security. For years, Cloudflare has been working with the standards community to plug all of these individual leaks through separate specialized protocols:

  • DNS-over-HTTPS encrypts DNS queries between clients and recursive resolvers, ensuring only clients and trusted recursive resolvers see plaintext DNS traffic.
  • TLS Encrypted Client Hello encrypts metadata in the TLS handshake, ensuring only the client and authoritative TLS server see sensitive TLS information.

These protocols impose a barrier between the client and server and everyone else. However, neither of them prevent the server from building per-user profiles. Servers can track users via one critically important piece of information: the client IP address. Fortunately, for the overwhelming majority of cases, the IP address is not essential for providing a service. For example, DNS recursive resolvers do not need the full client IP address to provide accurate answers, as is evidenced by the EDNS(0) Client Subnet extension. To further reduce information exposure on the web, we helped push further with two more incremental improvements:

  • Oblivious DNS-over-HTTPS (ODoH) uses cryptography and network proxies to break linkability between client identity (IP address) and DNS traffic, ensuring that recursive resolvers have only the minimal amount of information to provide DNS answers — the queries themselves, without any per-client information.
  • MASQUE is standardizing techniques for proxying UDP and IP protocols over QUIC connections, similar to the existing HTTP CONNECT method for TCP-based protocols. Generally, the CONNECT method allows clients to use services without revealing any client identity (IP address).

While each of these protocols may seem only an incremental improvement over what we have today, together, they raise many possibilities for the future of the Internet. Are DoH and ECH sufficient for end-user privacy, or are technologies like ODoH and MASQUE necessary? How do proxy technologies like MASQUE complement or even subsume protocols like ODoH and ECH? These are questions the Cloudflare Research team strives to answer through experimentation, analysis, and deployment together with other stakeholders on the Internet through the IETF. And we could not ask the questions without first laying the groundwork.
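
For a concrete sense of what the DoH building block looks like on the wire, here is a minimal sketch of an RFC 8484 GET query in Go. It assumes the public resolver endpoint at https://cloudflare-dns.com/dns-query and hand-assembles a single A-record question rather than using a DNS library.

package main

import (
    "encoding/base64"
    "fmt"
    "io"
    "net/http"
    "strings"
)

// buildQuery assembles a minimal DNS wire-format query (RFC 1035) asking for
// the A record of name: a 12-byte header with recursion desired and one
// question, followed by the name as length-prefixed labels, QTYPE=A, QCLASS=IN.
func buildQuery(name string) []byte {
    msg := []byte{
        0x00, 0x00, // ID (0 is cache-friendly for DoH GET requests)
        0x01, 0x00, // flags: recursion desired
        0x00, 0x01, // QDCOUNT = 1
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ANCOUNT, NSCOUNT, ARCOUNT = 0
    }
    for _, label := range strings.Split(name, ".") {
        msg = append(msg, byte(len(label)))
        msg = append(msg, label...)
    }
    msg = append(msg, 0x00)                   // end of QNAME
    msg = append(msg, 0x00, 0x01, 0x00, 0x01) // QTYPE = A, QCLASS = IN
    return msg
}

func main() {
    q := base64.RawURLEncoding.EncodeToString(buildQuery("example.com"))
    req, err := http.NewRequest("GET", "https://cloudflare-dns.com/dns-query?dns="+q, nil)
    if err != nil {
        panic(err)
    }
    req.Header.Set("Accept", "application/dns-message")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Printf("got %s with %d bytes of DNS wire-format response\n", resp.Status, len(body))
}

Because the query and response travel inside an ordinary HTTPS exchange, on-path observers see only a TLS connection to the resolver, not the names being looked up.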

Architectural advancement: QUIC and HTTP/3

QUIC and HTTP/3 are transformative technologies. Whilst the TLS handshake forms the heart of QUIC’s security model, QUIC is an improvement beyond TLS over TCP in many respects, including more encryption (privacy), better protection against active attacks and ossification at the network layer, fewer round trips to establish a secure connection, and generally better security properties. QUIC and HTTP/3 give us a clean slate for future innovation.

Perhaps one of QUIC’s most important contributions is that it challenges and even breaks many established conventions and norms used on the Internet. For example, the antiquated socket API for networking, which treats the network connection as an in-order bit pipe, is no longer appropriate for modern applications and developers. Modern networking APIs such as Apple’s Network.framework provide high-level interfaces that take advantage of the new transport features provided by QUIC. Applications using this or even higher-level HTTP abstractions can take advantage of the many security, privacy, and performance improvements of QUIC and HTTP/3 today with minimal code changes, and without being constrained by sockets and their inherent limitations.

Another salient feature of QUIC is its wire format. Nearly every bit of every QUIC packet is encrypted and authenticated between sender and receiver. And within a QUIC packet, individual frames can be rearranged, repackaged, and otherwise transformed by the sender.

Together, these are powerful tools to help mitigate future network ossification and enable continued extensibility. (TLS’s wire format ultimately led to the middlebox compatibility mode for TLS 1.3 due to the many middlebox ossification problems that were encountered during early deployment tests.)

Exercising these features of QUIC is important for the long-term health of the protocol and applications built on top of it. Indeed, this sort of extensibility is what enables innovation.

In fact, we’ve already seen a flurry of new work based on QUIC: extensions to enable multipath QUIC, different congestion control approaches, and ways to carry data unreliably in the DATAGRAM frame.

Beyond functional extensions, we’ve also seen a number of new use cases emerge as a result of QUIC. DNS-over-QUIC is an upcoming proposal that complements DNS-over-TLS for recursive to authoritative DNS query protection. As mentioned above, MASQUE is a working group focused on standardizing methods for proxying arbitrary UDP and IP protocols over QUIC connections, enabling a number of fascinating solutions and unlocking the future of proxy and VPN technologies. In the context of the web, the WebTransport working group is standardizing methods to use QUIC as a “supercharged WebSocket” for transporting data efficiently between client and server while also depending on the WebPKI for security.

By definition, these extensions are nowhere near complete. The future of the Internet with QUIC is sure to be a fascinating adventure.

Specifying abstractions: Cryptographic algorithms and protocol design

Standards allow us to build abstractions. An ideal standard is one that is usable in many contexts and contains all the information a sufficiently skilled engineer needs to build a compliant implementation that successfully interoperates with other independent implementations. Writing a new standard is sort of like creating a new Lego brick. Creating a new Lego brick allows us to build things that we couldn’t have built before. For example, one new “brick” that’s nearly finished (as of this writing) is Hybrid Public Key Encryption (HPKE). HPKE allows us to efficiently encrypt arbitrary plaintexts under the recipient’s public key.

Mixing asymmetric and symmetric cryptography for efficiency is a common technique that has been used for many years in all sorts of protocols, from TLS to PGP. However, each of these applications has come up with their own design, each with its own security properties. HPKE is intended to be a single, standard, interoperable version of this technique that turns this complex and technical corner of protocol design into an easy-to-use black box. The standard has undergone extensive analysis by cryptographers throughout its development and has numerous implementations available. The end result is a simple abstraction that protocol designers can include without having to consider how it works under-the-hood. In fact, HPKE is already a dependency for a number of other draft protocols in the IETF, such as TLS Encrypted Client Hello, Oblivious DNS-over-HTTPS, and Message Layer Security.
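
As an illustration of the hybrid paradigm itself (a generic sketch, not the HPKE construction, which fixes its own KEM/KDF/AEAD choices, framing, and encodings), the following Go code combines an ephemeral X25519 key agreement with HKDF and AES-GCM.

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/ecdh"
    "crypto/rand"
    "crypto/sha256"
    "fmt"
    "io"

    "golang.org/x/crypto/hkdf"
)

// seal shows the hybrid pattern: an ephemeral X25519 key agreement yields a
// shared secret, HKDF turns it into a symmetric key, and AES-GCM encrypts the
// message. The ephemeral public key travels with the ciphertext so the
// recipient can repeat the derivation with its own private key.
func seal(recipient *ecdh.PublicKey, plaintext []byte) (ephPub, nonce, ct []byte, err error) {
    eph, err := ecdh.X25519().GenerateKey(rand.Reader)
    if err != nil {
        return nil, nil, nil, err
    }
    shared, err := eph.ECDH(recipient)
    if err != nil {
        return nil, nil, nil, err
    }
    key := make([]byte, 32)
    if _, err = io.ReadFull(hkdf.New(sha256.New, shared, nil, []byte("toy hybrid example")), key); err != nil {
        return nil, nil, nil, err
    }
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, nil, nil, err
    }
    aead, err := cipher.NewGCM(block)
    if err != nil {
        return nil, nil, nil, err
    }
    nonce = make([]byte, aead.NonceSize())
    if _, err = rand.Read(nonce); err != nil {
        return nil, nil, nil, err
    }
    return eph.PublicKey().Bytes(), nonce, aead.Seal(nil, nonce, plaintext, nil), nil
}

func main() {
    recipient, _ := ecdh.X25519().GenerateKey(rand.Reader)
    ephPub, nonce, ct, err := seal(recipient.PublicKey(), []byte("hello"))
    if err != nil {
        panic(err)
    }
    fmt.Println(len(ephPub), len(nonce), len(ct)) // 32-byte ephemeral key, 12-byte nonce, ciphertext plus 16-byte tag
}

HPKE standardizes exactly this kind of composition, with precise wire formats and proofs, so protocol designers do not have to reinvent and re-analyze it for every protocol.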

Modes of Interaction

We engage with the IETF in the specification, implementation, experimentation, and analysis phases of a standard to help achieve our three goals of incremental innovation, architectural advancement, and production of simple abstractions.

Our participation in the standards process hits all four phases. Individuals in Cloudflare bring a diversity of knowledge and domain expertise to each phase, especially in the production of technical specifications. This week, we’ll have a blog about an upcoming standard that we’ve been working on for a number of years and will be sharing details about how we used formal analysis to make sure that we ruled out as many security issues in the design as possible. We work in close collaboration with people from all around the world as an investment in the future of the Internet. Open standards mean that everyone can take advantage of the latest and greatest in protocol design, whether they use Cloudflare or not.

Cloudflare’s scale and perspective on the Internet are essential to the standards process. We have experience rapidly implementing, deploying, and experimenting with emerging technologies to gain confidence in their maturity. We also have a proven track record of publishing the results of these experiments to help inform the standards process. Moreover, we open source as much of the code we use for these experiments as possible to enable reproducibility and transparency. Our unique collection of engineering expertise and wide perspective allows us to help build standards that work in a wide variety of use cases. By investing time in developing standards that everyone can benefit from, we can make a clear contribution to building a better Internet.

One final contribution we make to the IETF is more procedural and based around building consensus in the community. A challenge to any open process is gathering consensus to make forward progress and avoiding deadlock. We help build consensus through the production of running code, leadership on technical documents such as QUIC and ECH, and even logistically by chairing working groups. (Working groups at the IETF are chaired by volunteers, and Cloudflare numbers a few working group chairs amongst its employees, covering a broad spectrum of the IETF (and its related research-oriented group, the IRTF) from security and privacy to transport and applications.) Collaboration is a cornerstone of the standards process and a hallmark of Cloudflare Research, and we apply it most prominently in the standards process.

If you too want to help build a better Internet, check out some IETF Working Groups and mailing lists. All you need to start contributing is an Internet connection and an email address, so why not give it a go? And if you want to join us on our mission to help build a better Internet through open and interoperable standards, check out our open positions, visiting researcher program, and many internship opportunities!

Pairings in CIRCL

Post Syndicated from Armando Faz-Hernández original https://blog.cloudflare.com/circl-pairings-update/

Pairings in CIRCL

In 2019, we announced the release of CIRCL, an open-source cryptographic library written in Go that provides optimized implementations of several primitives for key exchange and digital signatures. We are pleased to announce a major update of our library: we have included more packages for elliptic curve-based cryptography (ECC), pairing-based cryptography, and quantum-resistant algorithms.

All of these packages are the foundation of work we’re doing on bringing the benefits of cutting edge research to Cloudflare. In the past we’ve experimented with post-quantum algorithms, used pairings to keep keys safe around the world, and implemented advanced elliptic curves. Now we’re continuing that work, and sharing the foundation with everyone.

In this blog post we’re going to focus on pairing-based cryptography and give you a brief overview of some properties that make this topic so pleasant. If you are not so familiar with elliptic curves, we recommend this primer on ECC.

Otherwise, let’s get ready, pairings have arrived!

What are pairings?

Elliptic curve cryptography enables an efficient instantiation of several cryptographic applications: public-key encryption, signatures, zero-knowledge proofs, and many other more exotic applications like oblivious transfer and OPRFs. With all of those applications, you might wonder what additional value pairings offer. To see that, we first need to understand the basic properties of an elliptic curve system, and from there we can highlight the gap that pairings fill.

Conventional elliptic curve systems work with a single group $\mathbb{G}$: the points of an elliptic curve $E$. In this group, usually denoted additively, we can add the points $P$ and $Q$ and get another point on the curve $R=P+Q$; also, we can multiply a point $P$ by an integer scalar $k$ by repeatedly adding:

$$ kP = \underbrace{P+P+\dots+P}_{k \text{ terms}} $$
This operation is known as scalar multiplication, which resembles exponentiation, and there are efficient algorithms for it, such as the double-and-add method sketched below. But given the points $Q=kP$ and $P$, it is very hard for an adversary who doesn’t know $k$ to recover it. This is the Elliptic Curve Discrete Logarithm Problem (ECDLP).
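
To give a feel for those algorithms, here is a sketch of double-and-add. To keep it self-contained, plain integers stand in for curve points; with a real curve library the two marked lines become point doubling and point addition, but the loop stays the same.

package main

import (
    "fmt"
    "math/big"
)

// scalarMult shows the structure of the double-and-add algorithm. Scanning the
// bits of k from most significant to least significant, it doubles at every
// step and adds p whenever the current bit is set, so computing kP takes a
// number of group operations proportional to the bit length of k, not to k.
func scalarMult(k *big.Int, p int64) int64 {
    var r int64 // the identity element (the point at infinity on a curve)
    for i := k.BitLen() - 1; i >= 0; i-- {
        r = 2 * r // double
        if k.Bit(i) == 1 {
            r = r + p // add
        }
    }
    return r
}

func main() {
    fmt.Println(scalarMult(big.NewInt(13), 5)) // 65, i.e. 13*5, using only a handful of doublings and additions
}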

Now we show a property of scalar multiplication that can help us to understand the properties of pairings.

Scalar Multiplication is a Linear Map

Note the following equivalences:

$ (a+b)P = aP + bP $

$ b (aP) = a (bP) $.

These are very useful properties for many protocols: for example, the last identity allows Alice and Bob to arrive at the same value when following the Diffie-Hellman key-agreement protocol.

But while point addition and scalar multiplication are nice, it’s also useful to be able to multiply points: if we had points $P$, $aP$ and $bP$, getting $abP$ out would be very cool and would let us do all sorts of things. Unfortunately, in a group where that is easy, Diffie-Hellman would immediately be insecure, so we can’t get exactly what we want.

Guess what? Pairings provide an efficient, useful sort of intermediate point multiplication.

It’s an intermediate sort of multiplication because, although the operation takes two points as operands, the result of a pairing is not a point but an element of a different group; thus, in a pairing there are more groups involved, and all of them must contain the same number of elements.

Pairing is denoted as  $$ e \colon\; \mathbb{G}_1 \times \mathbb{G}_2 \rightarrow \mathbb{G}_T $$
Groups $\mathbb{G}_1$ and $\mathbb{G}_2$ contain points of an elliptic curve $E$. More specifically, they are the $r$-torsion points, for a fixed prime $r$. Some pairing instances fix $\mathbb{G}_1=\mathbb{G}_2$, but it is common to use disjoint sets for efficiency reasons. The third group $\mathbb{G}_T$ has notable differences. First, it is written multiplicatively, unlike the other two groups. $\mathbb{G}_T$ is not the set of points on an elliptic curve. It’s instead a subgroup of the multiplicative group over some larger finite field. It contains the elements that satisfy $x^r=1$, better known as the $r$-roots of unity.

Source: “Pairings are not dead, just resting” by Diego Aranha, ECC-2017 (inspired by Avanzi’s talk at SPEED-2009).

While every elliptic curve has a pairing, very few have ones that are efficiently computable. Those that do, we call them pairing-friendly curves.

The Pairing Operation is a Bilinear Map

What makes pairings special is that $e$ is a bilinear map. Yes, the linear property of the scalar multiplication is present twice, once per group. Let’s see the first linear map.

For points $P, Q, R$ and scalars $a$ and $b$ we have:

$ e(P+Q, R) = e(P, R) * e(Q, R) $

$ e(aP, Q) = e(P, Q)^a $

So, a scalar $a$ acting in the first operand as $aP$, finds its way out and escapes from the input of the pairing and appears in the output of the pairing as an exponent in $\mathbb{G}_T$. The same linear map is observed for the second group:

$ e(P, Q+R) = e(P, Q) * e(P, R) $

$ e(P, bQ) = e(P, Q)^b $

Hence, the pairing is bilinear. We will see below how this property becomes useful.

Can bilinear pairings help solving ECDLP?

The MOV (by Menezes, Okamoto, and Vanstone) attack reduces the discrete logarithm problem on elliptic curves to finite fields. An attacker with knowledge of $kP$ and public points $P$ and $Q$ can recover $k$ by computing:

$ g_k = e(kP, Q) = e(P, Q)^k $,

$ g = e(P, Q) $

$ k = \log_g(g_k) $

Note that the discrete logarithm to be solved was moved from $\mathbb{G}_1$ to $\mathbb{G}_T$. The attack is therefore only useful when the discrete logarithm is easier to solve in $\mathbb{G}_T$ than in $\mathbb{G}_1$, and surprisingly, for some curves this is the case.

Fortunately, pairings do not present a threat for standard curves (such as the NIST curves or Curve25519) because these curves are constructed in such a way that $\mathbb{G}_T$ gets very large, which makes the pairing operation not efficient anymore.

This attacking strategy was one of the first applications of pairings in cryptanalysis as a tool to solve the discrete logarithm. Later, more people noticed that the properties of pairings are so useful, and can be used constructively to do cryptography. One of the fascinating truisms of cryptography is that one person’s sledgehammer is another person’s brick: while pairings yield a generic attack strategy for the ECDLP problem, it can also be used as a building block in a ton of useful applications.

Applications of Pairings

In the 2000s, a large wave of research was aimed at applying pairings to many practical problems. An iconic pairing-based system was created by Antoine Joux, who constructed a one-round Diffie-Hellman key exchange for three parties.

Let’s see first how a three-party Diffie-Hellman is done without pairings. Alice, Bob and Charlie want to agree on a shared key, so they compute, respectively, $aP$, $bP$ and $cP$ for a public point $P$. Then, they send to each other the points they computed. So Alice receives $cP$ from Charlie and sends $aP$ to Bob, who can then send $baP$ to Charlie and get $acP$ from Alice and so on. After all this is done, they can all compute $k=abcP$. Can this be performed in a single round trip?

Two round Diffie-Hellman without pairings.
One round Diffie-Hellman with pairings.

The 3-party Diffie-Hellman protocol needs two communication rounds (on the top), but with the use of pairings a one-round trip protocol is possible.

Affirmative! Antoine Joux showed how to agree on a shared secret in a single round of communication. Alice announces $aP$, gets $bP$ and $cP$ from Bob and Charlie respectively, and then computes $k = e(bP, cP)^a$. Likewise Bob computes $e(aP,cP)^b$ and Charlie does $e(aP,bP)^c$. It’s not difficult to convince yourself that all these values are equivalent, just by looking at the bilinear property.

$e(bP,cP)^a  = e(aP,cP)^b  = e(aP,bP)^c = e(P,P)^{abc}$

With pairings we’ve done in one round what would otherwise take two.
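
Here is a minimal sketch of the one-round protocol, mirroring the CIRCL bls12381 API used in the example later in this post. Because BLS12-381 pairings are asymmetric ($\mathbb{G}_1 \neq \mathbb{G}_2$), each party broadcasts its secret scalar applied to both generators so the other two shares can always be paired.

package main

import (
    "crypto/rand"
    "fmt"

    e "github.com/cloudflare/circl/ecc/bls12381"
)

func main() {
    P, Q := e.G1Generator(), e.G2Generator()

    // Private scalars held by Alice, Bob, and Charlie.
    a, b, c := new(e.Scalar), new(e.Scalar), new(e.Scalar)
    a.Random(rand.Reader)
    b.Random(rand.Reader)
    c.Random(rand.Reader)

    // Public shares broadcast in the single round: each scalar times P and Q.
    aP, bP, cP := new(e.G1), new(e.G1), new(e.G1)
    aP.ScalarMult(a, P)
    bP.ScalarMult(b, P)
    cP.ScalarMult(c, P)

    aQ, bQ, cQ := new(e.G2), new(e.G2), new(e.G2)
    aQ.ScalarMult(a, Q)
    bQ.ScalarMult(b, Q)
    cQ.ScalarMult(c, Q)

    // Each party pairs the other two parties' shares and raises the result to
    // its own secret scalar; bilinearity makes all three values e(P,Q)^{abc}.
    kAlice, kBob, kCharlie := new(e.Gt), new(e.Gt), new(e.Gt)
    kAlice.Exp(e.Pair(bP, cQ), a)
    kBob.Exp(e.Pair(cP, aQ), b)
    kCharlie.Exp(e.Pair(aP, bQ), c)

    fmt.Println(kAlice.IsEqual(kBob) && kBob.IsEqual(kCharlie))
    // Output: true
}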

Another application in cryptography addresses a problem posed by Shamir in 1984: does there exist an encryption scheme in which the public key is an arbitrary string? Imagine if your public key was your email address. It would be easy to remember and certificate authorities and certificate management would be unnecessary.

A solution to this problem came some years later, in 2001, and is the Identity-based Encryption (IBE) scheme proposed by Boneh and Franklin, which uses bilinear pairings as the main tool.

Nowadays, pairings are used for the zk-SNARKS that make Zcash an anonymous currency, and are also used in drand to generate public-verifiable randomness. Pairings and the compact, aggregatable BLS signatures are used in Ethereum. We have used pairings to build Geo Key Manager: pairings let us implement a compact broadcast and negative broadcast scheme that together make Geo Key Manager work.

In order to make these schemes, we have to implement pairings, and to do that we need to understand the mathematics behind them.

Where do pairings come from?

In order to deeply understand pairings we must understand the interplay of geometry and arithmetic, and the origins of the group law. The starting point is the concept of a divisor, a formal combination of points on the curve.

$D = \sum n_i P_i$

The sum of all the coefficients $n_i$ is the degree of the divisor. If we have a function on the curve that has poles and zeros, we can count them with multiplicity to get a principal divisor. Poles are counted as negative, while zeros as positive. For example if we take the projective line, the function $x$ has the divisor $(0)-(\infty)$.

All principal divisors have degree $0$. The group of degree-zero divisors modulo the principal divisors is the Jacobian: we take all the degree-zero divisors, freely add or subtract principal divisors, and the result is an abelian variety.

Until now our constructions have worked for any curve.  Elliptic curves have a special property: since a line intersects the curve in three points, it’s always possible to turn an element of the Jacobian into one of the form $(P)-(O)$ for a point $P$. This is where the addition law of elliptic curves comes from.

The relation between addition on an elliptic curve and the geometry of the curve. (Source file: ECClines.svg)

Given a function $f$ we can evaluate it on a divisor $D=\sum n_i P_i$ by taking the product $\prod f(P_i)^{n_i}$. And if two functions $f$ and $g$ have disjoint divisors, we have the Weil duality:

$ f(\text{div}(g)) = g(\text{div}(f)) $,

The existence of Weil duality is what gives us the bilinear map we seek. Given an $r$-torsion point $T$ we have a function $f$ whose divisor is $r(T)-r(O)$. We can write down an auxiliary function $g$ such that $f(rP)=g^r(P)$ for any $P$. We then get a pairing by taking:

$e_r(S,T)=\frac{g(X+S)}{g(X)}$.

The auxiliary point $X$ is any point that makes the numerator and denominator defined.

In practice, the pairing we have defined above, the Weil pairing, is little used. It was historically the first pairing and is extremely important in the mathematics of elliptic curves, but faster alternatives based on more complicated underlying mathematics are used today. These faster pairings have different $\mathbb{G}_1$ and $\mathbb{G}_2$, while the Weil pairing made them the same.

Shift in parameters

As we saw earlier, the discrete logarithm problem can be attacked either on the group of elliptic curve points or on the third group (an extension of a prime field) whichever is weaker. This is why the parameters that define a pairing must balance the security of the three groups.

Before 2015 a good balance between the extension degree, the size of the prime field, and the security of the scheme was achieved by the family of Barreto-Naehrig (BN) curves. For 128 bits of security, BN curves use an extension of degree 12, and have a prime of size 256 bits; as a result they are an efficient choice for implementation.

A breakthrough for pairings occurred in 2015 when Kim and Barbulescu published a result that accelerated the attacks in finite fields. This resulted in increasing the size of fields to comply with standard security levels. Just as short hashes like MD5 got deprecated as they became insecure and $2^{64}$ was no longer enough, and RSA-1024 was replaced with RSA-2048, we regularly change parameters to deal with improved attacks.

For pairings this implied the use of larger primes for all the groups. Roughly speaking, the previous 192-bit security level becomes the new 128-bit level after this attack. Also, this shift in parameters brings the family of Barreto-Lynn-Scott (BLS) curves to the stage, because pairings on BLS curves are faster than BN in this new setting. Hence, BLS curves using an extension of degree 12 and primes of around 384 bits currently provide the equivalent of 128-bit security.

The IETF draft (currently in preparation) draft-irtf-cfrg-pairing-friendly-curves specifies secure pairing-friendly elliptic curves. It includes parameters for BN and BLS families of curves. It also targets different security levels to provide crypto agility for some applications relying on pairing-based cryptography.

Implementing Pairings in Go

Historically, notable examples of software libraries implementing pairings include PBC by Ben Lynn, Miracl by Michael Scott, and Relic by Diego Aranha. All of them are written in C/C++ and some ports and wrappers to other languages exist.

In the Go standard library we can find the golang.org/x/crypto/bn256 package by Adam Langley, an implementation of a pairing using a BN curve with 256-bit prime. Our colleague Brendan McMillion built github.com/cloudflare/bn256 that dramatically improves the speed of pairing operations for that curve. See the RWC-2018 talk to see our use case of pairing-based cryptography. This time we want to go one step further, and we started looking for alternative pairing implementations.

Although one can find many libraries that implement pairings, our goal is to rely on one that is efficient, includes protection against side channels, and exposes a flexible API oriented towards applications that permit generality, while avoiding common security pitfalls. This motivated us to include pairings in CIRCL. We followed best practices on secure code development and we want to share with you some details about the implementation.

We started by choosing a pairing-friendly curve. Due to the attack previously mentioned, the BN256 curve does not meet the 128-bit security level. Thus there is a need for using stronger curves. Such a stronger curve is the BLS12-381 curve that is widely used in the zk-SNARK protocols and short signature schemes. Using this curve allows us to make our Go pairing implementation interoperable with other implementations available in other programming languages, so other projects can benefit from using CIRCL too.

This code snippet tests the linearity property of pairings and shows how easy it is to use our library.

import (
    "crypto/rand"
    "fmt"
    e "github.com/cloudflare/circl/ecc/bls12381"
)

func ExamplePairing() {
    P,  Q := e.G1Generator(), e.G2Generator()
    a,  b := new(e.Scalar), new(e.Scalar)
    aP, bQ := new(e.G1), new(e.G2)
    ea, eb := new(e.Gt), new(e.Gt)

    a.Random(rand.Reader)
    b.Random(rand.Reader)

    aP.ScalarMult(a, P)
    bQ.ScalarMult(b, Q)

    g  := e.Pair( P, Q)
    ga := e.Pair(aP, Q)
    gb := e.Pair( P,bQ)

    ea.Exp(g, a)
    eb.Exp(g, b)
    linearLeft := ea.IsEqual(ga) // e(P,Q)^a == e(aP,Q)
    linearRight:= eb.IsEqual(gb) // e(P,Q)^b == e(P,bQ)

    fmt.Print(linearLeft && linearRight)
    // Output: true
}

We applied several optimizations that allowed us to improve on performance and security of the implementation. In fact, as the parameters of the curve are fixed, some other optimizations become easier to apply; for example, the code for prime field arithmetic and the towering construction for extension fields as we detail next.

Formally-verified arithmetic using fiat-crypto

One of the more difficult parts of a cryptography library to implement correctly is the prime field arithmetic. Typically people specialize it for speed, but there are many tedious constraints on what the inputs to operations can be to ensure correctness. Vulnerabilities have happened when people get it wrong, across many libraries. However, this code is perfect for machines to write and check. One such tool is fiat-crypto.

Using fiat-crypto to generate the prime field arithmetic means that we have a formal verification that the code does what we need. The fiat-crypto tool is invoked in this script, and produces Go code for addition, subtraction, multiplication, and squaring over the 381-bit prime field used in the BLS12-381 curve. Other operations are not covered, but those are much easier to check and analyze by hand.

Another advantage is that it avoids relying on the generic big.Int package, which is slow, frequently spills variables to the heap, causing dynamic memory allocations, and, most importantly, does not run in constant time. Instead, the code produced is straight-line code with no branches at all, and it relies on the math/bits package for accessing machine-level instructions. Automated code generation also means that it’s easier to apply new techniques to all primitives.

Tower Field Arithmetic

In addition to prime field arithmetic, say integers modulo a prime number p, a pairing also requires high-level arithmetic operations over extension fields.

To better understand what an extension field is, think of the analogous case of going from the reals to the complex numbers: the operations are the usual ones, addition, subtraction, multiplication and division $(+, -, \times, /)$, however they are computed quite differently.

The complex numbers are a quadratic extension over the reals, so imagine a two-level house. The first floor is where the real numbers live, however, they cannot access the second floor by themselves. On the other hand, the complex numbers can access the entire house through the use of a staircase. The equation $f(x)=x^2+1$ was not solvable over the reals, but is solvable over the complex numbers, since they have the number $i$ with $i^2=-1$. And because they have the number $i$, they must also have numbers like $3i$ and $5+i$ that solve other equations that weren’t solvable over the reals either. This second story has given the roots of the polynomials a place to live.

Algebraically, we can view the complex numbers as $\mathbb{R}[x]/(x^2+1)$, the space of polynomials in which we set $x^2=-1$. Given a polynomial like $x^3+5x+1$, we can turn it into $4x+1$, which is another way of writing $1+4i$. In this new field, $x^2+1=0$ holds automatically, and we have added $x$ (playing the role of $i$) as a root of the polynomial we picked. This process of building a field extension by adjoining a root of a polynomial works over any field, including finite fields.
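To make this concrete, here is a small sketch (not CIRCL’s code) of multiplication in a quadratic extension $\mathbb{F}_p[u]/(u^2+1)$ over a toy prime; it is exactly the complex-number rule $(a_0+a_1u)(b_0+b_1u) = (a_0b_0 - a_1b_1) + (a_0b_1 + a_1b_0)u$.

package main

import (
    "fmt"
    "math/big"
)

// A toy prime with p = 3 (mod 4), so that u^2+1 has no root in Fp and
// Fp[u]/(u^2+1) is a field. BLS12-381 does the same over a 381-bit prime.
var p = big.NewInt(19)

// Fp2 represents A0 + A1*u with u^2 = -1, just like a complex number.
type Fp2 struct{ A0, A1 *big.Int }

func mod(x *big.Int) *big.Int { return x.Mod(x, p) }

// Mul applies the complex-multiplication rule
// (a0+a1*u)(b0+b1*u) = (a0*b0 - a1*b1) + (a0*b1 + a1*b0)*u.
func Mul(a, b Fp2) Fp2 {
    t0 := new(big.Int).Mul(a.A0, b.A0)
    t1 := new(big.Int).Mul(a.A1, b.A1)
    t2 := new(big.Int).Mul(a.A0, b.A1)
    t3 := new(big.Int).Mul(a.A1, b.A0)
    return Fp2{A0: mod(t0.Sub(t0, t1)), A1: mod(t2.Add(t2, t3))}
}

func main() {
    u := Fp2{big.NewInt(0), big.NewInt(1)}
    r := Mul(u, u)
    fmt.Println(r.A0, r.A1) // 18 0, that is, u^2 = -1 in F_19
}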

Following this analogy, one can construct not a house but a tower, say a $k=12$ floor building whose ground floor is the prime field $\mathbb{F}_p$. The reason to build such a large tower is that we want to host VIP guests: a group called $\mu_r$, the $r$-th roots of unity. It has exactly $r$ members and behaves as an (algebraic) group, i.e., for all $x,y \in \mu_r$, we have $x \cdot y \in \mu_r$ and $x^r = 1$.

One particularity is that operations on the top floor can be costly. Assume that whenever an operation is needed on the top floor, the residents of the lower floors are required to provide their assistance; an operation on the top floor therefore has to go all the way down to the ground floor. For this reason, our tower needs more than a staircase: it needs an efficient way to move between levels, something even better than an elevator.

What if we use portals? Imagine anyone on the twelfth floor using a portal to go down immediately to the sixth floor, then another portal connecting the sixth to the second floor, and finally another connecting to the ground floor. This way, one can reach the ground floor much faster than by descending a long staircase.

The building analogy can be used to understand and construct a tower of finite fields. We use only a few intermediate extensions to build a twelfth-degree extension from the (ground) prime field $\mathbb{F}_p$.

$\mathbb{F}_{p}$ ⇒  $\mathbb{F}_{p^2}$ ⇒  $\mathbb{F}_{p^4}$ ⇒  $\mathbb{F}_{p^{12}}$

In fact, the extension of finite fields is as follows:

  • $\mathbb{F}_{p^2}$ is built as polynomials in $\mathbb{F}_p[u]$ reduced modulo $u^2+1=0$.
  • $\mathbb{F}_{p^4}$ is built as polynomials in $\mathbb{F}_{p^2}[v]$ reduced modulo $v^2+u+1=0$ (similarly, $\mathbb{F}_{p^6}$ is a cubic extension of $\mathbb{F}_{p^2}$).
  • $\mathbb{F}_{p^{12}}$ is built either as polynomials in $\mathbb{F}_{p^6}[w]$ reduced modulo $w^2+v=0$, or as polynomials in $\mathbb{F}_{p^4}[w]$ reduced modulo $w^3+v=0$.

The portals here are the polynomials used as modulus, as they allow us to move from one extension to the other.

Different constructions for the higher extensions have an impact on the number of operations performed. We implemented the latter construction of $\mathbb{F}_{p^{12}}$ (over $\mathbb{F}_{p^4}$), as it results in a lower number of operations. The arithmetic at this level is quite easy to implement and verify by hand, so formal verification is not as valuable here as it is for the prime field arithmetic. Still, an automated tool that generates code for tower field arithmetic would be useful for developers not familiar with the internals of field towering; the fiat-crypto project is tracking this idea [Issue 904, Issue 851].
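In code, a tower is just nested types: each floor stores a small array of coefficients over the floor below, and the reduction polynomials above define how multiplication wraps around. The following schematic is illustrative only and does not reflect CIRCL’s internal representation.

// Package tower sketches the shape of the extension-field tower; this
// layout is illustrative and is not CIRCL's internal representation.
package tower

type Fp [6]uint64 // element of the 381-bit prime field, six 64-bit limbs
type Fp2 [2]Fp    // a0 + a1*u,            with u^2 = -1
type Fp4 [2]Fp2   // b0 + b1*v,            with v^2 = -(u + 1)
type Fp12 [3]Fp4  // c0 + c1*w + c2*w^2,   with w^3 = -v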

Next, we describe the core operations of a bilinear pairing in more detail.

The Miller loop and the final exponentiation

The pairing function we implemented is the optimal r-ate pairing, which is defined as:

$ e(P,Q) = f_Q(P)^{\text{exp}} $

That is, we construct a function $f_Q$ determined by $Q$, evaluate it at the point $P$, and raise the result to a fixed power. The function evaluation is performed efficiently using “the Miller loop”, an algorithm devised by Victor Miller that has a structure similar to the double-and-add algorithm for scalar multiplication.
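The shape of that loop is the familiar square-and-multiply pattern. The sketch below shows the pattern on ordinary modular exponentiation with math/big; it is not the pairing code, but the Miller loop follows the same skeleton with line-function evaluations folded into each step.

package main

import (
    "fmt"
    "math/big"
)

// squareAndMultiply computes g^e mod m by scanning the bits of e from
// the most significant down: square at every bit, multiply when the
// bit is one. The Miller loop has exactly this shape, except that each
// step also accumulates the value of a line function.
func squareAndMultiply(g, e, m *big.Int) *big.Int {
    acc := big.NewInt(1)
    for i := e.BitLen() - 1; i >= 0; i-- {
        acc.Mod(acc.Mul(acc, acc), m) // doubling/squaring step
        if e.Bit(i) == 1 {
            acc.Mod(acc.Mul(acc, g), m) // addition/multiplication step
        }
    }
    return acc
}

func main() {
    g, e, m := big.NewInt(7), big.NewInt(560), big.NewInt(561)
    fmt.Println(squareAndMultiply(g, e, m)) // 1
    fmt.Println(new(big.Int).Exp(g, e, m))  // cross-check against the standard library
}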

After the Miller loop, the value $f_Q(P)$ is an element of $\mathbb{F}_{p^{12}}$, but it is not yet an $r$-th root of unity; the final exponentiation takes care of mapping it into that group. Since the exponent is constant for each curve, special algorithms can be tuned for it.

One interesting acceleration opportunity presents itself: the elements of $\mathbb{F}_{p^{12}}$ that the Miller loop multiplies by are special — as polynomials, they have no linear term and their constant term lives in $\mathbb{F}_{p^{2}}$. We created a specialized multiplication routine that skips the partial products whose inputs are known to be zero. This specialization accelerated the pairing computation by 12%.

So far, we have described how the internal operations of a pairing are performed. There are still some other functions floating around regarding the use of pairings in cryptographic protocols. It is also important to optimize these functions and now we will discuss some of them.

Product of Pairings

Often protocols will want to evaluate a product of pairings, rather than a single pairing. This is the case if we’re evaluating multiple signatures, or if the protocol uses cancellation between different equations to ensure security, as in the dual system encoding approach to designing protocols. If each pairing were evaluated individually, this would require multiple evaluations of the final exponentiation. Instead, we can evaluate the product of the Miller loop outputs first, and then apply the final exponentiation once. This requires a different interface that can take vectors of points.

Occasionally, there is a sign or an exponent in the factors of the product. It’s very easy to deal with a sign explicitly by negating one of the input points, almost for free. General exponents are more complicated, especially when considering the need for side channel protection. But since we expose the interface, later work on the library will accelerate it without changing applications.
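To illustrate why such an interface matters, the following sketch checks $e(aP,Q)\cdot e(P,bQ) = e(P,Q)^a \cdot e(P,Q)^b$ using the CIRCL types from the earlier example. It assumes the Gt type also exposes a Mul method (check the package documentation); the point of a batched product-of-pairings entry point is that the library can share a single final exponentiation across all the factors, instead of paying for one per pairing as this naive version does.

package main

import (
    "crypto/rand"
    "fmt"

    e "github.com/cloudflare/circl/ecc/bls12381"
)

func main() {
    P, Q := e.G1Generator(), e.G2Generator()
    a, b := new(e.Scalar), new(e.Scalar)
    a.Random(rand.Reader)
    b.Random(rand.Reader)

    aP, bQ := new(e.G1), new(e.G2)
    aP.ScalarMult(a, P)
    bQ.ScalarMult(b, Q)

    // Naive product: two full pairings, hence two final exponentiations.
    // Note: Gt.Mul is assumed here; check the CIRCL documentation.
    lhs := new(e.Gt)
    lhs.Mul(e.Pair(aP, Q), e.Pair(P, bQ)) // e(aP,Q) * e(P,bQ)

    g := e.Pair(P, Q)
    ga, gb := new(e.Gt), new(e.Gt)
    ga.Exp(g, a)
    gb.Exp(g, b)
    rhs := new(e.Gt)
    rhs.Mul(ga, gb) // e(P,Q)^a * e(P,Q)^b

    fmt.Println(lhs.IsEqual(rhs)) // true, by bilinearity
}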

Regarding the exposed API, one of the trickiest and most error-prone aspects of software engineering is input validation, so we must check that raw binary inputs decode correctly into the points used for a pairing. Part of this validation is subgroup membership testing, which is the topic we discuss next.

Subgroup Membership Testing

Checking that a point is on the curve is easy, but checking that it has the right order is not: the classical way to do this is an entire expensive scalar multiplication. But implementing pairings involves the use of many clever tricks that help to make things run faster.

One example is twisting: the group $\mathbb{G}_2$ consists of points with coordinates in $\mathbb{F}_{p^{12}}$; however, one can use a smaller field to reduce the number of operations. The trick is to use an associated curve $E’$, a twist of the original curve $E$, which allows us to work over the subfield $\mathbb{F}_{p^{2}}$, where operations are cheaper.

Additionally, the twist used for $\mathbb{G}_2$ carries some efficiently computable endomorphisms coming from the field structure. For the cost of two field multiplications, we can compute such an endomorphism, dramatically decreasing the cost of scalar multiplication.

By searching for the smallest combination of scalars that could zero out the $r$-torsion points, Sean Bowe came up with a much more efficient way to do subgroup checks. We implement his trick, with a big reduction in the complexity of some applications.

As can be seen, implementing a pairing is full of subtleties. We just saw that point validation in the pairing setting is a bit more challenging than in the conventional elliptic curve setting, and the same care applies to other operations as well. Another example is how to encode binary strings as elements of the group $\mathbb{G}_1$ (or $\mathbb{G}_2$). Although this operation might sound simple, implementing it securely requires attention to several aspects, so we expand on this topic next.

Hash to Curve

An important piece of the Boneh-Franklin Identity-based Encryption scheme is a special hash function that maps an arbitrary string — the identity, e.g., an email address — to a point on an elliptic curve, while still behaving like a conventional cryptographic hash function (such as SHA-256): hard to invert and collision-resistant. This operation is commonly known as hashing to curve.

Boneh and Franklin found a particular way to perform hashing to curve: apply a conventional hash function to the input bitstring and interpret the result as the $y$-coordinate of a point; then, from the curve equation $y^2=x^3+b$, recover the $x$-coordinate as $x=\sqrt[3]{y^2-b}$. The cube root always exists over fields of characteristic $p$ with $p\equiv 2 \pmod{3}$, but the algorithm does not generalize to other fields, which restricts the parameters that can be used.

Another popular algorithm — which, we must stress, is an insecure way to perform hash to curve — is the following. Let the hash of the input be the $x$-coordinate, and from it compute the $y$-coordinate as the square root $y= \sqrt{x^3+b}$. Not every $x$-coordinate yields a value $x^3+b$ that has a square root, so the algorithm may fail; it is probabilistic. To make sure it always succeeds, a counter can be added to $x$ and increased every time a square root is not found. Although this variant always finds a point on the curve, the number of iterations depends on the input, i.e., it is a non-constant-time algorithm. The lack of this property in implementations of cryptographic algorithms makes them susceptible to timing attacks. The DragonBlood attack is an example of how a non-constant-time hashing algorithm led to full key recovery of WPA3 Wi-Fi passwords.
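A sketch of this try-and-increment approach over a toy curve follows; it is illustrative only and deliberately not something to deploy, precisely because the number of loop iterations depends on the input.

package main

import (
    "crypto/sha256"
    "fmt"
    "math/big"
)

// A toy curve y^2 = x^3 + 4 over a small prime, for illustration only.
var (
    p = big.NewInt(2147483647) // the Mersenne prime 2^31-1, not a real curve parameter
    b = big.NewInt(4)
)

// hashToCurveTryAndIncrement hashes msg to a candidate x-coordinate and
// keeps bumping a counter until x^3+b has a square root mod p. The
// number of iterations, and therefore the running time, depends on the
// input: exactly the property exploited by attacks such as DragonBlood.
func hashToCurveTryAndIncrement(msg []byte) (x, y *big.Int, tries int) {
    h := sha256.Sum256(msg)
    x = new(big.Int).SetBytes(h[:])
    x.Mod(x, p)
    for {
        tries++
        rhs := new(big.Int).Exp(x, big.NewInt(3), p)
        rhs.Add(rhs, b).Mod(rhs, p)
        if y = new(big.Int).ModSqrt(rhs, p); y != nil {
            return x, y, tries
        }
        x.Add(x, big.NewInt(1)).Mod(x, p) // the counter bump leaks timing
    }
}

func main() {
    for _, msg := range []string{"alice@example.com", "bob@example.com"} {
        x, y, tries := hashToCurveTryAndIncrement([]byte(msg))
        fmt.Printf("%s -> (%v, %v) after %d tries\n", msg, x, y, tries)
    }
}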

Secure hash-to-curve algorithms must guarantee several properties. Any input passed to the hash must produce a point in the targeted group: no special input may trigger an exceptional case, and the output point must belong to the correct group. We emphasize the correct group because in some applications the target group is the entire set of points on the curve, while in others, such as the pairing setting, the target group is a subgroup of the curve — recall that $\mathbb{G}_1$ and $\mathbb{G}_2$ are groups of $r$-torsion points. Finally, some cryptographic protocols are proven secure only if the hash-to-curve function behaves as a random oracle over points, which adds another level of complexity to the construction.

Fortunately, researchers have addressed most of these problems and have worked toward a concrete specification of secure algorithms for hashing to curves, extending the sort of trick that worked for the Boneh-Franklin curve. We have participated in the Crypto Forum Research Group (CFRG) at the IETF on the work-in-progress Internet Draft draft-irtf-cfrg-hash-to-curve. This document specifies secure hashing algorithms for several elliptic curves, including BLS12-381. At Cloudflare, we actively collaborate in several IETF working groups; see Jonathan Hoyland’s post to learn more.

Our implementation complies with the recommendations given in the hash-to-curve draft and also includes many implementation techniques and tricks derived from a vast body of academic research on pairing-based cryptography. A good compilation of most of these tricks is delightfully explained by Craig Costello.
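For instance, hashing an identity string to a point of $\mathbb{G}_1$ is a one-call operation for users of the library. The sketch below assumes the G1 type exposes Hash and IsEqual methods, as in recent CIRCL releases; the domain-separation tag is only an example, so check the package documentation and the draft for the exact API and tag format.

package main

import (
    "fmt"

    e "github.com/cloudflare/circl/ecc/bls12381"
)

func main() {
    // Domain-separation tag, as required by the hash-to-curve draft so
    // that different protocols hashing the same input obtain unrelated
    // points. The tag below is only an example.
    dst := []byte("EXAMPLE-APP-V01-CS01-with-BLS12381G1_XMD:SHA-256_SSWU_RO_")

    // We assume G1 exposes Hash(input, dst) and IsEqual, as in recent
    // CIRCL releases; check the package documentation for exact names.
    P, Q := new(e.G1), new(e.G1)
    P.Hash([]byte("alice@example.com"), dst)
    Q.Hash([]byte("alice@example.com"), dst)

    // Hashing is deterministic for a fixed input and tag, and the
    // output always lies in the prime-order group G1.
    fmt.Println(P.IsEqual(Q))
}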

We hope this post sheds some light on, and provides some guidance for, the development of pairing-based cryptography, which has become much more relevant these days. We will soon share an interesting use case in which pairing-based cryptography helps us harden the security of our infrastructure.

What’s next?

We invite you to use our CIRCL library, now equipped with bilinear pairings. But there is more: take a look at the other primitives already available, such as HPKE, VOPRF, and Post-Quantum algorithms. On our side, we will continue improving the performance and security of the library. If any of your projects uses CIRCL, let us know — we would like to hear about your use case. Reach us at research.cloudflare.com.

You’ll soon hear more about how we’re using CIRCL across Cloudflare.

Exported Authenticators: The long road to RFC

Post Syndicated from Jonathan Hoyland original https://blog.cloudflare.com/exported-authenticators-the-long-road-to-rfc/

Exported Authenticators: The long road to RFC

Exported Authenticators: The long road to RFC

Our earlier blog post talked in general terms about how we work with the IETF. In this post we’re going to talk about a particular IETF project we’ve been working on, Exported Authenticators (EAs). Exported Authenticators is a new extension to TLS that we think will prove really exciting. It unlocks all sorts of fancy new authentication possibilities, from TLS connections with multiple certificates attached, to logging in to a website without ever revealing your password.

Now, you might have thought that, given the innumerable hours that went into the design of TLS 1.3, it couldn’t possibly be improved, but it turns out that there are a number of places where the design falls a little short. TLS allows us to establish a secure connection between a client and a server. During that connection’s handshake, the server presents a certificate to the browser, which proves the server is authorised to use the name written on the certificate, for example blog.cloudflare.com. One of the most common things we use that ability for is delivering webpages. In fact, if you’re reading this, your browser has already done this for you. The Cloudflare Blog is delivered over TLS, and by presenting a certificate for blog.cloudflare.com the server proves that it’s allowed to deliver Cloudflare’s blog.

When your browser requests blog.cloudflare.com you receive a big blob of HTML that your browser then starts to render. In the dim and distant past, this might have been the end of the story. Your browser would render the HTML, and display it. Nowadays, the web has become more complex, and the HTML your browser receives often tells it to go and load lots of other resources. For example, when I loaded the Cloudflare blog just now, my browser made 73 subrequests.

As we mentioned in our connection coalescing blog post, sometimes those resources are also served by Cloudflare, but on a different domain. In our connection coalescing experiment, we acquired certificates with a special extension, called a Subject Alternative Name (SAN), that tells the browser that the owner of the certificate can act as two different websites. Along with some further shenanigans that you can read about in our blog post, this lets us serve the resources for both the domains over a single TLS connection.

Cloudflare, however, services millions of domains, and we have millions of certificates. It’s possible to generate certificates that cover lots of domains, and in fact this is what Cloudflare used to do. We used to use so-called “cruise-liner” certificates, with dozens of names on them. But for connection coalescing this quickly becomes impractical, as we would need to know what sub-resources each webpage might request, and acquire certificates to match. We switched away from this model because issues with individual domains could affect other customers.

What we’d like to be able to do is serve as much content as possible down a single connection. When a user requests a resource from a different domain they need to perform a new TLS handshake, costing valuable time and resources. Our connection coalescing experiment showed the benefits when we know in advance what resources are likely to be requested, but most of the time we don’t know what subresources are going to be requested until the requests actually arrive. What we’d rather do is attach extra identities to a connection after it’s been established, and we know what extra domains the client actually wants. Because the TLS connection is just a transport mechanism and doesn’t understand the information being sent across it, it doesn’t actually know what domains might subsequently be requested. This is only available to higher-layer protocols such as HTTP. However, we don’t want any website to be able to impersonate another, so we still need to have strong authentication.

Exported Authenticators

Enter Exported Authenticators. They give us even more than we asked for. They allow us to do application layer authentication that’s just as strong as the authentication you get from TLS, and then tie it to the TLS channel. Now that’s a pretty complicated idea, so let’s break it down.

To understand application layer authentication we first need to explain what the application layer is. The application layer is a reference to the OSI model. The OSI model describes the various layers of abstraction we use to make things work across the Internet. When you’re developing your latest web application you don’t want to have to worry about how light is flickered down a fibre optic cable, or even how the TLS handshake is encoded (although that’s a fascinating topic in its own right, let’s leave that for another time.)

All you want to care about is having your content delivered to your end-user, and using TLS gives you a guaranteed in-order, reliable, authenticated channel over which you can communicate. You just shove bits in one end of the pipe, and after lots of blinky lights, fancy routing, maybe a touch of congestion control, and a little decoding, *poof*, your data arrives at the end-user.

The application layer is the top of the OSI stack, and contains things like HTTP. Because the TLS handshake is lower in the stack, the application is oblivious to this process. So, what Exported Authenticators give us is the ability for the very top of the stack to reliably authenticate its partner.

Exported Authenticators: The long road to RFC
The seven-layered OSI model

Now let’s jump back a bit, and discuss what we mean when we say that EAs give us authentication that’s as strong as TLS authentication. TLS, as we know, is used to create a secure connection between two endpoints, but lots of us are hazy when we try and pin down exactly what we mean by “secure”. The TLS standard makes eight specific promises, but rather than get buried in that particular ocean of weeds, let’s just pick out the one guarantee that we care about most: Peer Authentication.

Peer authentication: The client's view of the peer identity should reflect the server's identity. [...]

In other words, if the client thinks that it’s talking to example.com then it should, in fact, be talking to example.com.

What we want from EAs is that if I receive an EA then I have cryptographic proof that the person I’m talking to is the person I think I’m talking to. Now at this point you might be wondering what an EA actually looks like, and what it has to do with certificates. Well, an EA is actually a trio of messages, the first of which is a Certificate. The second is a CertificateVerify, a cryptographic proof that the sender knows the private key for the certificate. Finally there is a Finished message, which acts as a MAC, and proves the first two parts of the message haven’t been tampered with. If this structure sounds familiar to you, it’s because it’s the same structure as used by the server in the TLS handshake to prove it is the owner of the certificate.

The final piece of unpacking we need to do is explaining what we mean by tying the authentication to the TLS channel. Because EAs are an application layer construct they don’t provide any transport mechanism. So, whilst I know that the EA was created by the server I want to talk to, without binding the EA to a TLS connection I can’t be sure that I’m talking directly to the server I want.

Exported Authenticators: The long road to RFC
Without protection, a malicious server can move Exported Authenticators from one connection to another.

For all I know, the TLS server I’m talking to is creating a new TLS connection to the EA Server, and relaying my request, and then returning the response. This would be very bad, because it would allow a malicious server to impersonate any server that supports EAs.

Exported Authenticators: The long road to RFC
Because EAs are bound to a single TLS connection, if a malicious server copies an EA from one connection to another it will fail to verify.

EAs therefore have an extra security feature. They use the fact that every TLS connection is guaranteed to produce a unique set of keys. EAs take one of these keys and use it to construct the EA. This means that if some malicious third-party copies an EA from one TLS session to another, the recipient wouldn’t be able to validate it. This technique is called channel binding, and is another fascinating topic, but this post is already getting a bit long, so we’ll have to revisit channel binding in a future blog post.
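Go’s standard library already exposes the underlying primitive that this relies on: per-connection exporter secrets. The sketch below is not an EA implementation; it simply derives exporter keying material from a live TLS connection to show the kind of connection-specific value that authenticators are keyed from (the label is illustrative, and the EA specification defines its own labels).

package main

import (
    "crypto/tls"
    "encoding/hex"
    "fmt"
    "log"
)

func main() {
    // Open an ordinary TLS 1.3 connection.
    conn, err := tls.Dial("tcp", "blog.cloudflare.com:443", &tls.Config{MinVersion: tls.VersionTLS13})
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Derive 32 bytes of exporter keying material (RFC 5705 / TLS 1.3
    // exporters). Each connection yields different output, which is the
    // property EAs rely on: an authenticator keyed from one connection's
    // exporter cannot be replayed on another. The label is illustrative;
    // the EA specification defines its own labels.
    state := conn.ConnectionState()
    ekm, err := state.ExportKeyingMaterial("EXPERIMENTAL channel binding demo", nil, 32)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(hex.EncodeToString(ekm))
}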

How the sausage is made

OK, now we know what EAs do, let’s talk about how they were designed and built. EAs are going through the IETF standardisation process. Draft standards move through the IETF process starting as Internet Drafts (I-Ds), and ending up as published Requests For Comment (RFCs). RFCs are voluntary standards that underpin much of the global Internet plumbing, and not just for security protocols like TLS. RFCs define DNS, UDP, TCP, and many, many more.

The first step in producing a new IETF standard is coming up with a proposal. Designing security protocols is a very conservative business, firstly because it’s very easy to introduce really subtle bugs, and secondly, because if you do introduce a security issue, things can go very wrong, very quickly. A flaw in the design of a protocol can be especially problematic as it can be replicated across multiple independent implementations — for example the TLS renegotiation vulnerabilities reported in 2009 and the custom EC(DH) parameters vulnerability from 2012. To minimise the risks of design issues, EAs hew closely to the design of the TLS 1.3 handshake.

Security and Assurance

Before making a big change to how authentication works on the Internet, we want as much assurance as possible that we’re not going to break anything. To give us more confidence that EAs are secure, they reuse parts of the design of TLS 1.3. The TLS 1.3 design was carefully examined by dozens of experts, and underwent multiple rounds of formal analysis — more on that in a moment. Using well understood design patterns is a super important part of security protocols. Making something secure is incredibly difficult, because security issues can be introduced in thousands of ways, and an attacker only needs to find one. By starting from a well understood design we can leverage the years of expertise that went into it.

Another vital step in catching design errors early is baked into the IETF process: achieving rough consensus. Although the ins and outs of the IETF process are worthy of their own blog post, suffice it to say the IETF works to ensure that all technical objections get addressed, and even if they aren’t solved they are given due care and attention. Exported Authenticators were proposed way back in 2016, and after many rounds of comments, feedback, and analysis the TLS Working Group (WG) at the IETF has finally reached consensus on the protocol. All that’s left before the EA I-D becomes an RFC is for a final revision of the text to be submitted and sent to the RFC Editors, leading hopefully to a published standard very soon.

As we just mentioned, the WG has to come to a consensus on the design of the protocol. One thing that can hold up achieving consensus are worries about security. After the Snowden revelations there was a barrage of attacks on TLS 1.2, not to mention some even earlier attacks from academia. Changing how trust works on the Internet can be pretty scary, and the TLS WG didn’t want to be caught flat-footed. Luckily this coincided with the maturation of some tools and techniques we can use to get mathematical guarantees that a protocol is secure. This class of techniques is known as formal methods. To help ensure that people are confident in the security of EAs I performed a formal analysis.

Formal Analysis

Formal analysis is a special technique that can be used to examine security protocols. It creates a mathematical description of the protocol, the security properties we want it to have, and a model attacker. Then, aided by some sophisticated software, we create a proof that the protocol has the properties we want even in the presence of our model attacker. This approach is able to catch incredibly subtle edge cases, which, if not addressed, could lead to attacks, as has happened before. Trotting out a formal analysis gives us strong assurances that we haven’t missed any horrible issues. By sticking as closely as possible to the design of TLS 1.3 we were able to repurpose much of the original analysis for EAs, giving us a big leg up in our ability to prove their security. Our EA model is available in Bitbucket, along with the proofs. You can check it out using Tamarin, a theorem prover for security protocols.

Formal analysis, and formal methods in general, give very strong guarantees that rule out entire classes of attack. However, they are not a panacea. TLS 1.3 was subject to a number of rounds of formal analysis, and yet an attack was still found. However, this attack in many ways confirms our faith in formal methods. The attack was found in a blind spot of the proof, showing that attackers have been pushed to the very edges of the protocol. As our formal analyses get more and more rigorous, attackers will have fewer and fewer places to search for attacks. As formal analysis has become more and more practical, more and more groups at the IETF have been asking to see proofs of security before standardising new protocols. This hopefully will mean that future attacks on protocol design will become rarer and rarer.

Once the EA I-D becomes an RFC, then all sorts of cool stuff gets unlocked — for example OPAQUE-EAs, which will allow us to do password-based login on the web without the server ever seeing the password! Watch this space.

Exported Authenticators: The long road to RFC

Coalescing Connections to Improve Network Privacy and Performance

Post Syndicated from Talha Paracha original https://blog.cloudflare.com/connection-coalescing-experiments/

Coalescing Connections to Improve Network Privacy and Performance

Coalescing Connections to Improve Network Privacy and Performance

Web pages typically have a large number of embedded subresources (e.g., JavaScript, CSS, image files, ads, beacons) that are fetched by a browser on page loads. Requests for these subresources can prompt browsers to perform further DNS lookups, TCP connections, and TLS handshakes, which can have a significant impact on how long it takes for the user to see the content and interact with the page. Further, each additional request exposes metadata (such as plaintext DNS queries, or unencrypted SNI in TLS handshake) which can have potential privacy implications for the user. With these factors in mind, we carried out a measurement study to understand how we can leverage Connection Coalescing (aka Connection Reuse) to address such concerns, and study its feasibility.

Background

The web has come a long way and initially consisted of very simple protocols. One of them was HTTP/1.0, which required browsers to make a separate connection for every subresource on the page. This design was quickly recognized as having significant performance bottlenecks and was extended with HTTP pipelining and persistent connections in HTTP/1.1 revision, which allowed HTTP requests to reuse the same TCP connection. But, yet again, this was no silver bullet: while multiple requests could share the same connection, they still had to be serialized one after the other, so a client and server could only execute a single request/response exchange at any given time for each connection. As time passed, websites became more complex in structure and dynamic in nature, and HTTP/1.1 was identified as a major bottleneck. The only way to gain concurrency at the network layer was to use multiple TCP connections to the same origin in parallel, but this meant losing most benefits of persistent connections and ended up overloading the origin servers which were unable to meet the concurrency demand.

To address these performance limitations, the SPDY protocol was introduced over a decade later. SPDY supported stream multiplexing, where requests to and responses from the server used a single interleaved TCP connection, and allowed browsers to prioritize requests for critical subresources — those blocking page rendering — first. A modified variant of SPDY was adopted by the IETF as the basis for HTTP/2 in 2012 and published as RFC 7540 in 2015.

HTTP/2 and onwards retained this new standard for connection reuse. More specifically, all subresources on the same domain were able to reuse the same TCP/TLS (or UDP/QUIC) connection without any head-of-line blocking (at least on the application layer). This resulted in a single connection for all the subresources — reducing extraneous requests on page loads — potentially speeding up some websites and applications.

Interestingly, the protocol has a lesser-known feature to also enable subresources at different hostnames to be fetched over the same connection. We studied the real-world feasibility and benefits of this technique as an effort to improve users’ experience for websites across our network.

Coalescing Connections to Improve Network Privacy and Performance
Connection Coalescing allows reusing a TLS connection across different domains

Connection Coalescing

The technique is often referred to as Connection Coalescing and, to put it simply, is a way to access resources from different hostnames that are accessible from the same web server.

There are several reasons for why a single server could handle requests for different hosts, ranging from low-cost virtual hosting to the usage of CDNs and cloud providers (including Cloudflare, that acts as a reverse proxy for approximately 25 million Internet properties). Before going into the technical conditions required to enable connection coalescing, we should take a look at some benefits such a strategy can provide.

  • Privacy. When resources at different hostnames are loaded via separate TLS connections, those connections expose metadata to ISPs and other observers via the Server Name Indicator (SNI) field about the destinations that are being contacted (i.e., in the absence of encrypted SNI). This set of exposed SNI’s can allow an on-path adversary to fingerprint traffic and possibly determine user interactions on the webpage. On the other hand, coalesced requests for more than one hostname on a single connection exposes only one destination, and helps avoid such threats.
  • Performance. Additional TLS handshakes and TCP connections can incur significant costs in terms of CPU, memory and other resources. Thus, coalescing requests to use the same connection can optimize resource utilization.
  • Resource Prioritization. Multiplexing requests on a single connection means that applications have better visibility and more direct control over how related resources are prioritized and scheduled. In the absence of coalescing, the network properties (for example, route congestion) can interfere with the intended order of delivery for resources. This reliability gained through connection coalescing opens up new optimization opportunities to improve web page load times, among other things.

However, along with all these potential benefits, connection coalescing also has some associated risk factors that need to be considered in practice. First, TCP incorporates “fair” congestion control mechanisms — if there are ten connections on the same route, each gets approximately 1/10th of the total bandwidth. So with a route congested and bandwidth restricted, a client relying on multiple connections might be better off (for example, if they have five of the ten connections, their total share of bandwidth would be half). Second, browsers will use different parallelization routines for scheduling requests on multiple connections versus the same connection — it is not immediately clear whether the former or latter would perform better. Third, multiple connections exhibit an inherent form of load balancing for TLS-termination processes. That’s because multiple requests on the same connection must be answered by the same TLS-termination process that holds the session keys (often on the same physical server). So, it is important to study connection coalescing carefully before rolling it out widely.

With this context in mind, we studied the feasibility of connection coalescing on real-world traffic. More specifically, the two questions we wanted to answer were
(a) can we empirically demonstrate and quantify the theoretical benefits of connection coalescing?, and (b) could coalescing cause unintended side effects, such as performance degradation, due to the risks highlighted above?

In order to answer these questions, we first made the observation that a large number of Cloudflare customers request subresources from cdnjs — which is also powered by Cloudflare. For context, cdnjs has public JavaScript and CSS libraries (like jQuery), and is used by more than 12% of all websites on the Internet. One popular way these websites include resources from cdnjs is by using <script src="https://cdnjs.cloudflare.com/..." ></script> HTML tags. But there are other ways as well, such as the usage of XMLHttpRequest or Fetch APIs. Regardless of the way these resources are included, browsers will need to fetch them for completely loading a website.

We then identified a list of approximately four thousand websites using Cloudflare (on the Free plan) that likely used cdnjs. We divided this list of sites into evenly-sized and randomly-picked control and experiment groups. Our plan was to enable coalescing only for the experiment group, so that subresource requests generated from their web pages for cdnjs could reuse existing connections. In this way, we were able to compare results obtained on the experiment group, with the ones for the control group, and attribute any differences observed to connection coalescing.

In order to signal browsers that the requests can be coalesced, we served cdnjs and the sites from the same IP address in a few regions around the world. This meant the same DNS responses for all the zones that were part of the study — eventually load balanced by our Anycast network. These sites also had TLS certificates that included cdnjs.

The above two conditions (same IP and compatible certificate) are required to achieve coalescing as per the HTTP/2 spec. However, the QUIC spec allows coalescing even if only the second condition is met. Major web browsers are yet to adopt the QUIC coalescing mechanism, and currently use only the HTTP/2 coalescing logic for both protocols.
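Those two prerequisites are easy to check from the outside. The sketch below is a small diagnostic, with example hostnames, that tests whether two names share an IP address and whether the certificate served for the first also covers the second.

package main

import (
    "crypto/tls"
    "fmt"
    "log"
    "net"
)

func main() {
    // Example hostnames only. Condition (1): the names resolve to a
    // shared IP address. Condition (2): the certificate served for the
    // first host is also valid for the second one.
    siteA, siteB := "example.com", "cdnjs.cloudflare.com"

    ipsA, err := net.LookupHost(siteA)
    if err != nil {
        log.Fatal(err)
    }
    ipsB, err := net.LookupHost(siteB)
    if err != nil {
        log.Fatal(err)
    }
    shared := false
    for _, a := range ipsA {
        for _, b := range ipsB {
            if a == b {
                shared = true
            }
        }
    }
    fmt.Println("shared IP address:", shared)

    conn, err := tls.Dial("tcp", siteA+":443", &tls.Config{ServerName: siteA})
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    leaf := conn.ConnectionState().PeerCertificates[0]
    // VerifyHostname covers both exact SAN entries and wildcards.
    fmt.Println("certificate also covers", siteB+":", leaf.VerifyHostname(siteB) == nil)
}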

Coalescing Connections to Improve Network Privacy and Performance
Requests to Experiment Group Zones and cdnjs being coalesced on the same TLS connection

Results

We started noticing evidence of real-world coalescing from the day our experiment was launched. The following graph shows that approximately 50% of requests to cdnjs from our experiment group sites are coalesced (i.e., their TLS SNI does not equal cdnjs) as compared to 0% of requests from the control group sites.

Coalescing Connections to Improve Network Privacy and Performance
Coalesced Requests to cdnjs from Control and Experimental Group Zones

In addition, we conducted active measurements using our private WebPageTest instances at the landing pages of experiment and control sites — using the two well-supported browsers: Google Chrome and Firefox. From our results, Chrome created about 78% fewer TLS connections to cdnjs for our experiment group sites, as compared to the control group. But surprisingly, Firefox created just roughly 22% fewer connections. As TLS handshakes are computationally expensive because they involve cryptographic signatures and key exchange algorithms, fewer handshakes meant less CPU cycles spent by both the client and the server.

Upon further analysis, we were able to make two observations from the data:

  • A fraction of sites that never coalesced connections with either browser appeared to load subresources with CORS enabled (i.e., <script src="https://cdnjs.cloudflare.com/..." integrity="sha512-894Y..." crossorigin="anonymous">). This is the default way cdnjs recommends inclusion of subresources, as CORS is needed for integrity checks that provide substantial mitigations against script-manipulation attacks. We do not recommend removing this attribute. Our testing also revealed that using XMLHttpRequest or Fetch APIs to load subresources disabled coalescing as well. It is unclear why browsers choose to not coalesce such connections, and we are in contact with the vendors to find out.
  • Although both Firefox and Chrome coalesced requests for cdnjs on existing connections, the reason for the discrepancy in the number of TLS connections to cdnjs (approximately 78% vs roughly 22%) is because Firefox appears to open new connections even if it does not end up using them.

After evaluating the potential benefits of coalescing, we wanted to understand if coalescing caused any unintended side effects. Hence, the final measurement we conducted was to check whether our experiments were detrimental to a website’s performance. We tracked Page Load Times (PLT) and Largest Contentful Paint (LCP) across a variety of simulated network conditions using both Chrome and Firefox and found the results for experiment vs control group to not be statistically significant.

Coalescing Connections to Improve Network Privacy and Performance
Page load times for control and experiment group sites. Each site was loaded once, and the “fullyLoaded” metric from WebPageTest is reported

Conclusion

We consider our experimentation successful in determining the feasibility of connection coalescing and highlighting its potential benefits in terms of privacy and performance. More specifically, we observed the privacy benefits of coalescing in more than 50% of requests to cdnjs from real-world traffic. In addition, our active testing demonstrated that browsers create fewer TLS connections with coalescing enabled. Interestingly, our results also revealed that the benefits might not always occur (i.e., CORS-enabled requests, Firefox creating additional TLS connections despite coalescing). Finally, we did not find any evidence that coalescing can cause harm to real-world users’ experience on the Internet.

Some future directions we would like to explore include:

  • More aggressive connection reuse with multiple hostnames, while identifying conditions most suitable for coalescing.
  • Understanding how different connection reuse methods compare, e.g., IP-based coalescing vs. use of Origin Frames, and what effects do they have on user experience over the Internet.
  • Evaluating coalescing support among different browser vendors, and encouraging adoption of HTTP/3 QUIC based coalescing.
  • Reaping the full benefits of connection coalescing by experimenting with custom priority schemes for requests within the same connection.

Please send questions and feedback to [email protected]. We’re excited to continue this line of work in our effort to help build a better Internet! For those interested in joining our team please visit our Careers Page.

Introducing SSL/TLS Recommender

Post Syndicated from Suleman Ahmad original https://blog.cloudflare.com/ssl-tls-recommender/

Introducing SSL/TLS Recommender

Introducing SSL/TLS Recommender

Seven years ago, Cloudflare made HTTPS availability for any Internet property easy and free with Universal SSL. At the time, few websites — other than those that processed sensitive data like passwords and credit card information — were using HTTPS because of how difficult it was to set up.

However, as we all started using the Internet for more and more private purposes (communication with loved ones, financial transactions, shopping, healthcare, etc.) the need for encryption became apparent. Tools like Firesheep demonstrated how easily attackers could snoop on people using public Wi-Fi networks at coffee shops and airports. The Snowden revelations showed the ease with which governments could listen in on unencrypted communications at scale. We have seen attempts by browser vendors to increase HTTPS adoption such as the recent announcement by Chromium for loading websites on HTTPS by default. Encryption has become a vital part of the modern Internet, not just to keep your information safe, but to keep you safe.

When it was launched, Universal SSL doubled the number of sites on the Internet using HTTPS. We are building on that with SSL/TLS Recommender, a tool that guides you to stronger configurations for the backend connection from Cloudflare to origin servers. Recommender has been available in the SSL/TLS tab of the Cloudflare dashboard since August 2020 for self-serve customers. Over 500,000 zones are currently signed up. As of today, it is available for all customers!

How Cloudflare connects to origin servers

Cloudflare operates as a reverse proxy between clients (“visitors”) and customers’ web servers (“origins”), so that Cloudflare can protect origin sites from attacks and improve site performance. This happens, in part, because visitor requests to websites proxied by Cloudflare are processed by an “edge” server located in a data center close to the client. The edge server either responds directly back to the visitor, if the requested content is cached, or creates a new request to the origin server to retrieve the content.

Introducing SSL/TLS Recommender

The backend connection to the origin can be made with an unencrypted HTTP connection or with an HTTPS connection where requests and responses are encrypted using the TLS protocol (historically known as SSL). HTTPS is the secured form of HTTP and should be used whenever possible to avoid leaking information or allowing content tampering by third-party entities. The origin server can further authenticate itself by presenting a valid TLS certificate to prevent active monster-in-the-middle attacks. Such a certificate can be obtained from a certificate authority such as Let’s Encrypt or Cloudflare’s Origin CA. Origins can also set up authenticated origin pull, which ensures that any HTTPS requests outside of Cloudflare will not receive a response from your origin.

Cloudflare Tunnel provides an even more secure option for the connection between Cloudflare and origins. With Tunnel, users run a lightweight daemon on their origin servers that proactively establishes secure and private tunnels to the nearest Cloudflare data centers. With this configuration, users can completely lock down their origin servers to only receive requests routed through Cloudflare. While we encourage customers to set up tunnels if feasible, it’s important to encourage origins with more traditional configurations to adopt the strongest possible security posture.

Detecting HTTPS support

You might wonder, why doesn’t Cloudflare always connect to origin servers with a secure TLS connection? To start, some origin servers have no TLS support at all (for example, certain shared hosting providers and even government sites have been slow adopters) and rely on Cloudflare to ensure that the client request is at least encrypted over the Internet from the browser to Cloudflare’s edge.

Then why don’t we simply probe the origin to determine if TLS is supported? It turns out that many sites only partially support HTTPS, making the problem non-trivial. A single customer site can be served from multiple separate origin servers with differing levels of TLS support. For instance, some sites support HTTPS on their landing page but serve certain resources only over unencrypted HTTP. Further, site content can differ when accessed over HTTP versus HTTPS (for example, http://example.com and https://example.com can return different results).

Such content differences can arise due to misconfiguration on the origin server, accidental mistakes by developers when migrating their servers to HTTPS, or can even be intentional depending on the use case.

A study by researchers at Northeastern University, the Max Planck Institute for Informatics, and the University of Maryland highlights reasons for some of these inconsistencies. They found that 1.5% of surveyed sites had at least one page that was unavailable over HTTPS — despite the protocol being supported on other pages — and 3.7% of sites served different content over HTTP versus HTTPS for at least one page. Thus, always using the most secure TLS setting detected on a particular resource could result in unforeseen side effects and usability issues for the entire site.

We wanted to tackle all such issues and maximize the number of TLS connections to origin servers, but without compromising a website’s functionality and performance.

Introducing SSL/TLS Recommender
Content differences on sites when loaded over HTTPS vs HTTP; images taken from https://www.cs.umd.edu/~dml/papers/https_tma20.pdf with author permission

Configuring the SSL/TLS encryption mode

Cloudflare relies on customers to indicate the level of TLS support at their origins via the zone’s SSL/TLS encryption mode. The following SSL/TLS encryption modes can be configured from the Cloudflare dashboard:

  • Off indicates that client requests reaching Cloudflare as well as Cloudflare’s requests to the origin server should only use unencrypted HTTP. This option is never recommended, but is still in use by a handful of customers for legacy reasons or testing.
  • Flexible allows clients to connect to Cloudflare’s edge via HTTPS, but requests to the origin are over HTTP only. This is the most common option for origins that do not support TLS. However, we encourage customers to upgrade their origins to support TLS whenever possible and only use Flexible as a last resort.
  • Full enables encryption for requests to the origin when clients connect via HTTPS, but Cloudflare does not attempt to validate the certificate. This is useful for origins that have a self-signed or otherwise invalid certificate at the origin, but leaves open the possibility for an active attacker to impersonate the origin server with a fake certificate. Client HTTP requests result in HTTP requests to the origin.
  • Full (strict) indicates that Cloudflare should validate the origin certificate to fully secure the connection. The origin certificate can either be issued by a public CA or by Cloudflare Origin CA. HTTP requests from clients result in HTTP requests to the origin, exactly the same as in Full mode. We strongly recommend Full (strict) over weaker options if supported by the origin.
  • Strict (SSL-Only Origin Pull) causes all traffic to the origin to go over HTTPS, even if the client request was HTTP. This differs from Full (strict) in that HTTP client requests will result in an HTTPS request to the origin, not HTTP. Most customers do not need to use this option, and it is available only to Enterprise customers. The preferred way to ensure that no HTTP requests reach your origin is to enable Always Use HTTPS in conjunction with Full or Full (strict) to redirect visitor HTTP requests to the HTTPS version of the content.
Introducing SSL/TLS Recommender
SSL/TLS encryption modes determine how Cloudflare connects to origins

The SSL/TLS encryption mode is a zone-wide setting, meaning that Cloudflare applies the same policy to all subdomains and resources. If required, you can configure this setting more granularly via Page Rules. Misconfiguring this setting can make site resources unavailable. For instance, suppose your website loads certain assets from an HTTP-only subdomain. If you set your zone to Full or Full (strict), you might make these assets unavailable for visitors that request the content over HTTPS, since the HTTP-only subdomain lacks HTTPS support.
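The zone-wide mode can be changed from the dashboard or programmatically. The sketch below uses the Cloudflare API’s zone settings endpoint; the zone ID and token are placeholders, and the endpoint path and accepted values should be double-checked against the current API documentation.

package main

import (
    "bytes"
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    // Placeholders: substitute your own zone ID and an API token that
    // has permission to edit zone settings.
    zoneID := "ZONE_ID"
    apiToken := "API_TOKEN"

    // Accepted values for this setting include "off", "flexible",
    // "full" and "strict"; verify against the current API docs.
    body := bytes.NewBufferString(`{"value":"strict"}`)
    req, err := http.NewRequest(http.MethodPatch,
        "https://api.cloudflare.com/client/v4/zones/"+zoneID+"/settings/ssl", body)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Authorization", "Bearer "+apiToken)
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    out, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(out))
}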

Importance of secure origin connections

When an end-user visits a site proxied by Cloudflare, there are two connections to consider: the front-end connection between the visitor and Cloudflare and the back-end connection between Cloudflare and the customer origin server. The front-end connection typically presents the largest attack surface (for example, think of the classic example of an attacker snooping on a coffee shop’s Wi-Fi network), but securing the back-end connection is equally important. While all SSL/TLS encryption modes (except Off) secure the front-end connection, less secure modes leave open the possibility of malicious activity on the backend.

Consider a zone set to Flexible where the origin is connected to the Internet via an untrustworthy ISP. In this case, spyware deployed by the customer’s ISP in an on-path middlebox could inspect the plaintext traffic from Cloudflare to the origin server, potentially resulting in privacy violations or leaks of confidential information. Upgrading the zone to Full or a stronger mode to encrypt traffic to the ISP would help prevent this basic form of snooping.

Similarly, consider a zone set to Full where the origin server is hosted in a shared hosting provider facility. An attacker colocated in the same facility could generate a fake certificate for the origin (since the certificate isn’t validated for Full) and deploy an attack technique such as ARP spoofing to direct traffic intended for the origin server to an attacker-owned machine instead. The attacker could then leverage this setup to inspect and filter traffic intended for the origin, resulting in site breakage or content unavailability. The attacker could even inject malicious JavaScript into the response served to the visitor to carry out other nefarious goals. Deploying a valid Cloudflare-trusted certificate on the origin and configuring the zone to use Full (strict) would prevent Cloudflare from trusting the attacker’s fake certificate in this scenario, preventing the hijack.

Since a secure backend only improves your website security, we strongly encourage setting your zone to the highest possible SSL/TLS encryption mode whenever possible.

Balancing functionality and security

When Universal SSL was launched, Cloudflare’s goal was to get as many sites away from the status quo of HTTP as possible. To accomplish this, Cloudflare provisioned TLS certificates for all customer domains to secure the connection between the browser and the edge. Customer sites that did not already have TLS support were defaulted to Flexible, to preserve existing site functionality. Although Flexible is not recommended for most zones, we continue to support this option as some Cloudflare customers still rely on it for origins that do not yet support TLS. Disabling this option would make these sites unavailable. Currently, the default option for newly onboarded zones is Full if we detect a TLS certificate on the origin zone, and Flexible otherwise.

Further, the SSL/TLS encryption mode configured at the time of zone sign-up can become suboptimal as a site evolves. For example, a zone might switch to a hosting provider that supports origin certificate installation. An origin server that is able to serve all content over TLS should at least be on Full. An origin server that has a valid TLS certificate installed should use Full (strict) to ensure that communication between Cloudflare and the origin server is not susceptible to monster-in-the-middle attacks.

The Research team combined lessons from academia and our engineering efforts to make encryption easy, while ensuring the highest level of security possible for our customers. Because of that goal, we’re proud to introduce SSL/TLS Recommender.

SSL/TLS Recommender

Cloudflare’s mission is to help build a better Internet, and that includes ensuring that requests from visitors to our customers’ sites are as secure as possible. To that end, we began by asking ourselves the following question: how can we detect when a customer is able to use a more secure SSL/TLS encryption mode without impacting site functionality?

To answer this question, we built the SSL/TLS Recommender. Customers can enable Recommender for a zone via the SSL/TLS tab of the Cloudflare dashboard. Using a zone’s currently configured SSL/TLS option as the baseline for expected site functionality, the Recommender performs a series of checks to determine if an upgrade is possible. If so, we email the zone owner with the recommendation. If a zone is currently misconfigured — for example, an HTTP-only origin configured on Full — Recommender will not recommend a downgrade.

Introducing SSL/TLS Recommender

The checks that Recommender runs are determined by the site’s currently configured SSL/TLS option.

The simplest check is to determine if a customer can upgrade from Full to Full (strict). In this case, all site resources are already served over HTTPS, so the check comprises a few simple tests of the validity of the TLS certificate for the domain and all subdomains (which can be on separate origin servers).

The check to determine if a customer can upgrade from Off or Flexible to Full is more complex. A site can be upgraded if all resources on the site are available over HTTPS and the content matches when served over HTTP versus HTTPS. Recommender carries out this check as follows:

  • Crawl customer sites to collect links. For large sites where it is impractical to scan every link, Recommender tests only a subset of links (up to some threshold), leading to a trade-off between performance and potential false positives. Similarly, for sites where the crawl turns up an insufficient number of links, we augment our results with a sample of links from recent visitors requests to the zone to provide a high-confidence recommendation. The crawler uses the user agent Cloudflare-SSLDetector and has been added to Cloudflare’s list of known good bots. Similar to other Cloudflare crawlers, Recommender ignores robots.txt (except for rules explicitly targeting the crawler’s user agent) to avoid negatively impacting the accuracy of the recommendation.
  • Download the content of each link over both HTTP and HTTPS. Recommender makes only idempotent GET requests when scanning origin servers to avoid modifying server resource state.
  • Run a content similarity algorithm to determine if the content matches. The algorithm is adapted from a research paper called “A Deeper Look at Web Content Availability and Consistency over HTTP/S” (TMA Conference 2020) and is designed to provide an accurate similarity score even for sites with dynamic content.

Recommender is conservative with recommendations, erring on the side of maintaining current site functionality rather than risking breakage and usability issues. If a zone is non-functional, the zone owner blocks all types of bots, or if misconfigured SSL-specific Page Rules are applied to the zone, then Recommender will not be able to complete its scans and provide a recommendation. Therefore, it is not intended to resolve issues with website or domain functionality, but rather maximize your zone’s security when possible.
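As a rough illustration of the HTTP-versus-HTTPS comparison step, the sketch below fetches a page over both schemes and computes a crude token-level similarity score. It is a deliberately simplified stand-in: the production Recommender uses the similarity algorithm from the TMA 2020 paper, which is robust to dynamic content.

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "strings"
)

func fetch(url string) string {
    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    return string(body)
}

// jaccard is a crude token-level similarity score in [0,1]; the real
// Recommender uses a more robust content-similarity algorithm.
func jaccard(a, b string) float64 {
    setA, setB := map[string]bool{}, map[string]bool{}
    for _, t := range strings.Fields(a) {
        setA[t] = true
    }
    for _, t := range strings.Fields(b) {
        setB[t] = true
    }
    inter := 0
    for t := range setA {
        if setB[t] {
            inter++
        }
    }
    union := len(setA) + len(setB) - inter
    if union == 0 {
        return 1
    }
    return float64(inter) / float64(union)
}

func main() {
    // Example hostname only; the crawler compares every sampled link.
    link := "example.com/"
    score := jaccard(fetch("http://"+link), fetch("https://"+link))
    fmt.Printf("similarity %.2f\n", score)
}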

Please send questions and feedback to [email protected]. We’re excited to continue this line of work to improve the security of customer origins!

Mentions

While this work is led by the Research team, we have been extremely privileged to get support from all across the company!

Special thanks to the incredible team of interns that contributed to SSL/TLS Recommender. Suleman Ahmad (now full-time), Talha Paracha, and Ananya Ghose built the current iteration of the project and Matthew Bernhard helped to lay the groundwork in a previous iteration of the project.