Is It Really Two-Factor Authentication?

Post Syndicated from Bozho original https://techblog.bozho.net/is-it-really-two-factor-authentication/

Terminology-wise, there is a clear distinction between two-factor authentication (multi-factor authentication) and two-step verification (authentication), as this article explains. 2FA/MFA is authentication using more than one factor, i.e. “something you know” (password), “something you have” (token, card) and “something you are” (biometrics). Two-step verification is basically using two passwords – one permanent and another that is short-lived and one-time.

At least that’s the theory. In practice it’s more complicated to say which authentication method belongs to which category (“something you X”). Let me illustrate that with a few examples:

  • An OTP hardware token is considered “something you have”. But it uses a symmetric secret shared with the server, so that both can generate the same code at the same time (if using TOTP), or the same sequence (see the sketch after this list). This means the secret is effectively “something you know”, because someone may steal it from the server, even though the hardware token is protected. Unless, of course, the server stores the shared secret in an HSM and does the OTP comparison on the HSM itself (some support that). And there’s still a theoretical possibility for the keys to leak prior to being stored on hardware. So is a hardware token “something you have” or “something you know”? For practical purposes it can be considered “something you have”.
  • Smartphone OTP is often not considered as secure as a hardware token, but it should be, due to the secure storage of modern phones. The secret is shared once during enrollment (usually by scanning an on-screen QR code), so it should be “something you have” as much as a hardware token is.
  • SMS is not considered secure and is often given as an example of 2-step verification, because it’s effectively just another password. While that’s true, it is because of a particular SS7 vulnerability (allowing the interception of mobile communication). If mobile communication standards were secure, the SIM card would be tied to the number and only the SIM card holder would be able to receive the message, making it “something you have”. But with the known vulnerabilities, it is “something you know”, and that something is actually the phone number.
  • Fingerprint scanners represent “something you are”. And in most devices they are built in a way that the scanner authenticates to the phone (being cryptographically bound to the CPU) while transmitting the fingerprint data, so you can’t just intercept the transferred bytes and replay them later. That’s the theory; it’s not publicly documented how it’s implemented. But if it were not so, then “something you are” would be “something you have” – a sequence of bytes representing your fingerprint scan, and that can leak. This is precisely why biometric identification should only be done locally, on the phone, without any server interaction – the server can’t tell whether it is receiving sensor-scanned data or captured-and-replayed data. That said, biometric factors are tied to the proper implementation of the authenticating smartphone application – if, say, your banking application needs a fingerprint scan to run, a malicious actor should not be able to bypass that by stealing shared credentials (user IDs, secrets) and making API calls to your service. So to the server there’s no “something you are”. It’s always “something that the client-side application has verified that you are, if implemented properly”.
  • A digital signature (via a smartcard or YubiKey or even a smartphone with secure hardware storage for private keys) is “something you have” – it works by signing one-time challenges sent by the server and verifying that the signature has been created by the private key associated with the previously enrolled public key. Knowing the public key gives you nothing, because of how public-key cryptography works. There’s no shared secret and no intermediary whose data flow can be intercepted. A private key is still “something you know”, but by putting it in hardware it becomes “something you have”, i.e. a true second factor. Of course, until someone finds out that the random generation of primes used for generating the private key has been broken and you can derive the private key from the public key (as happened recently with one vendor).
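
Here is the sketch referenced above: a minimal TOTP computation in Python (RFC 6238 defaults: HMAC-SHA1, 30-second steps, 6 digits). Both the token and the server run exactly this computation, which is why the shared secret can leak from either side. This is illustrative, not production code.

```python
import base64, hashlib, hmac, struct, time

def totp(shared_secret_b32: str, period: int = 30, digits: int = 6) -> str:
    # Both sides hold the same key -- the "shared secret" discussed above.
    key = base64.b32decode(shared_secret_b32, casefold=True)
    # The moving factor is just the current Unix time divided into steps.
    counter = int(time.time()) // period
    msg = struct.pack(">Q", counter)
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    # Dynamic truncation per RFC 4226.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % (10 ** digits)).zfill(digits)
```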

There isn’t an obvious boundary between theoretical and practical. “Something you are” and “something you have” can eventually be turned into “something you know” (or “something someone stores”). Some theoretical attacks can become very practical overnight.

I’d suggest we stick to calling everything “two-factor authentication”, because it’s more important to have mass understanding of the usefulness of the technique than to nitpick on the terminology. 2FA does not solve phishing, unfortunately, but it solves leaked credentials, which is good enough, and everyone should have some form of it. Even SMS is better than nothing (obviously, for high-profile systems, digital signatures are the way to go).
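
To illustrate why digital signatures avoid the shared-secret problem entirely, here is a minimal challenge-response sketch (assuming Python with the cryptography package and Ed25519 keys; a real deployment would keep the private key inside the smartcard or secure element rather than in process memory):

```python
import os
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Enrollment: the key pair is generated on (ideally inside) the device;
# only the public key is ever shared with the server.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# Authentication: the server sends a random one-time challenge...
challenge = os.urandom(32)

# ...the client signs it with the private key...
signature = private_key.sign(challenge)

# ...and the server verifies against the enrolled public key.
# Raises cryptography.exceptions.InvalidSignature on any mismatch.
public_key.verify(signature, challenge)
```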

Making Sense of the Information Security Landscape

Post Syndicated from Bozho original https://techblog.bozho.net/making-sense-of-the-information-security-landscape/

There are hundreds of different information security solutions out there and choosing which one to pick can be hard. Usually decisions are driven by recommendations, vendor familiarity, successful upsells, compliance needs, etc. I’d like to share my understanding of the security landscape by providing one-line descriptions of each of the different categories of products.

Note that these categories are sometimes not strictly defined, and they may overlap. They have evolved over time, and a certain category can include several products from legacy categories. The explanations will be slightly simplified. For a generalization and summary, skip the lists and go to the paragraphs after them. This post aims to summarize a lot of Gartner and Forrester reports, as well as product data sheets, combined with some real-world observations, and to bring all of that to a technical level rather than describing broad, business-focused capabilities. I’ll split the products into several groups, though the groups themselves may overlap.

Monitoring and auditing

  • SIEM (Security Information and Event Management) – collects logs from all possible sources (applications, OSs, network appliances) and raises alarms if there are anomalies (a toy correlation rule is sketched after this list)
  • IDS (Intrusion Detection System) – listens to network packets and finds malicious signatures or statistical anomalies. There are multiple ways to listen to the traffic: proxy, port mirroring, network tap, host-based interface listener. Deep packet inspection is sometimes involved, which requires sniffing TLS at the host or terminating it at a proxy in order to be able to inspect encrypted communication (especially for TLS 1.3), effectively performing an MITM “attack” on the organization’s users.
  • IPS (Intrusion Prevention System) – basically a marketing upgrade of IDS with the added option to “block” traffic rather than just “report” the intrusion.
  • UEBA (User and Entity Behavior Analytics) – a system that listens to system activity (via logs and/or by directly monitoring endpoints for user and system activity, including via screen capture), tries to identify user behavior patterns (as well as system component behavior patterns), reports on any anomalies and changes in the patterns, and classifies users as less or more “risky”. Recently, UEBA has become part of next-gen SIEMs
  • SUBA (Security User Behavior Analytics) – the same as UEBA, but named after the purpose (security) rather than the entities monitored. Used by Forrester (whereas UEBA is used by Gartner)
  • DAM (Database Activity Monitoring) – tools that monitor and log database queries and configuration changes, looking for suspicious patterns and potentially blocking them based on policies. Implemented via proxy or agents installed at the host
  • DAP (Database Audit and Protection) – based on DAM, but with added features for content classification (similar to DLPs), vulnerability detection and more clever behavior analysis (e.g. through UEBA)
  • FIM (File Integrity Monitoring) – usually a feature of other tools, FIM is constantly monitoring files for potentially suspicious changes
  • SOC (Security Operations Center) – this is more of an organizational unit that employs multiple tools (usually a SIEM, DLP, CASB) to fully handle the security of an organization.
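
To make the SIEM idea concrete, here is a toy correlation rule of the kind such a system evaluates continuously (Python; the event fields, window and threshold are invented for the example – real SIEM rules are expressed in vendor-specific query or rule languages):

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 5  # alert on 5+ failures from one source within the window

# source IP -> timestamps of recent authentication failures
failures = defaultdict(deque)

def ingest(event: dict) -> None:
    """Toy SIEM rule: flag repeated failed logins from the same source."""
    if event["type"] != "auth_failure":
        return
    ts, src = event["timestamp"], event["source_ip"]
    window = failures[src]
    window.append(ts)
    # Drop failures that fell out of the sliding window.
    while window and ts - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= THRESHOLD:
        print(f"ALERT: {len(window)} failed logins from {src} "
              f"within {WINDOW_SECONDS}s")
```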

Access proxies

  • CASB (Cloud Access Security Broker) – a proxy (usually) that organizations go through when connecting to cloud services, which allows them to enforce security policies and detect anomalies, e.g. regarding authentication and authorization, or the input and retrieval of sensitive data. CASBs may involve additional encryption options for the data being used.
  • CSG (Cloud Security Gateway) – effectively the same as CASB
  • SWG (Secure Web Gateway) – a proxy for accessing the web, includes filtering malicious websites, filtering potentially malicious downloads, limiting uploads
  • SASE (Secure Access Service Edge) – like CASB/CSG, but also providing additional bundled functionalities like a Firewall, SWG, VPN, DNS management, etc.

Firewalls

  • WAF (Web Application Firewall) – a firewall (working as a reverse proxy) that you put in front of web applications to protect them from typical web vulnerabilities that may not be addressed by the application developer – SQL injections, XSS, CSRF, etc. (a naive signature check is sketched after this list)
  • NF (Network Firewall) – the typical firewall that allows you to allow or block traffic based on protocol, port, source/destination
  • NGFW (Next Generation Firewall) – a firewall that combines a network firewall and a (web) application firewall, and also analyzes the traffic, thus detecting potential anomalies/intrusions/data exfiltration
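
As promised above, here is a deliberately naive sketch of the request inspection a WAF performs (Python; the two signature patterns are simplified far beyond what real rule sets such as the OWASP Core Rule Set contain):

```python
import re

# Naive signature rules of the kind a WAF applies to request parameters.
RULES = {
    "sqli": re.compile(r"(?i)\b(union\s+select|or\s+1=1|drop\s+table)\b"),
    "xss": re.compile(r"(?i)<\s*script\b"),
}

def inspect(params: dict) -> list:
    """Return (rule, parameter) pairs for every triggered signature."""
    hits = []
    for name, value in params.items():
        for rule, pattern in RULES.items():
            if pattern.search(value):
                hits.append((rule, name))
    return hits

# Example: a request a WAF would block before it reaches the application.
print(inspect({"q": "1 OR 1=1 UNION SELECT password FROM users"}))
# -> [('sqli', 'q')]
```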

Data protection

  • DLP (Data Leak Prevention / Data Loss Prevention) – a broad category of tools that aim at preventing data loss – mostly accidental, but sometimes malicious as well. Sometimes it involves installing an agent on each machine; in other cases it’s proxy-based. Many other solutions provide DLP functionality, like IPS/IDS, WAFs and CASBs, but DLPs are focused on inspecting user activities (including via UEBA/SUBA), network traffic (including via SWGs), communication (most often email) and publicly facing storage (e.g. FTP, S3) that may lead to leaking data. DLPs include discovering sensitive data in structured (databases) and unstructured (office documents) sources. Other DLP features are encryption of data at rest and tokenization of sensitive data.
  • ILDP (Information Leak Detection and Prevention) – same as DLP
  • IPC (Information Protection and Control) – same as DLP
  • EPS (Extrusion Prevention System) – same as DLP, focused on monitoring outbound traffic for exfiltration attempts
  • CMF (Content Monitoring and Filtering) – part of DLP. May overlap with SWG functionalities.
  • CIP (Critical Information Protection) – part of DLP, focused on critical information, e.g. through encryption and tokenization
  • CDP (Continuous Data Protection) – basically incremental/real-time backup management, with retention settings and possibly encryption

Vulnerability testing

  • RASP (Runtime Application Self-protection) – tools (usually in the form of libraries that are included in the application runtime) that monitor in real-time the application usage and can block certain actions (at binary level) or even shut down the application if a cyber attack is detected.
  • IAST (Interactive Application Security Testing) – similar to RASP, the subtle difference being that IAST is usually used in pre-production environments while RASP is used in production
  • SAST (Static Application Security Testing) – tools that scan application source code for vulnerabilities
  • DAST (Dynamic Application Security Testing) – tools that scan web applications for vulnerabilities through their exposed HTTP endpoints
  • VA (Vulnerability assessment) – a process helped by many tools (including those above, and more) for finding, assessing and eliminating vulnerabilities

Identity and access

  • IAM (Identity and Access Management) – products that allow organizations to centralize authentication and enrollment of their users, providing single sign-on capabilities, centralized monitoring of authentication activity, applying access policies (e.g. working hours), enforcing 2FA, etc.
  • SSO (Single Sign-On) – the ability to use the same credentials for logging into multiple (preferably all) applications in an organization.
  • WAM (Web Access Management) – the “older” version of IAM, lacking flexibility and some features like centralized user enrollment/provisioning
  • PAM (Privileged access management) – managing credentials of privileged users (e.g. system administrators). Instead of having admin credentials stored in local password managers (or worse – sticky notes or files on the desktop), credentials are stored in a centralized, protected vault and “released” for use only after a certain approval process for executing a given admin task, in some cases monitoring and logging the executed activities. The PAM handles regular password changes. It basically acts as a proxy (though not necessarily in the network sense) between a privileged user and a system that requires elevated privileges.

Endpoint protection

  • AV (Anti-Virus) – the good old antivirus software that gets malicious software signatures from a centrally managed blacklist and blocks programs that match those signatures
  • NGAV (Next Generation Anti-Virus) – going beyond signature matching, NGAV looks for suspicious activities (e.g. filesystem, memory, registry access/modification) and uses policies and rules to block such activity even from previously unknown and not yet blacklisted programs. Machine learning is usually said to be employed, but in many cases that’s mostly marketing.
  • EPP (Endpoint Protection Platform) – includes NGAV as well as a management layer that allows centrally provisioning and managing policies, reporting and workflows for remediation
  • EDR (Endpoint Detection and Response) – using an agent to collect endpoint (device) data, centralize it, combine it with network logs and analyze that in order to detect malicious activity. After suspected malicious activity is detected, allow centralized response, including blocking/shutting down/etc. Compared to NGAV, EDR makes use of the data across the organization, while NGAV usually focuses on individual machines, but that’s not universally true
  • ATP (Advanced threat protection) – same as EDR
  • ATD (Advanced threat detection) – same as above, with just monitoring and analytics capabilities

Coordination and automation

  • UTM (Unified Threat Management) – combining multiple monitoring and prevention tools in one suite – antivirus/NGAV/EDR, DLP, firewalls, VPNs, etc. The benefit is that you purchase one thing rather than finding your way through the jungle described above. At least that’s on paper; in reality you still get different modules, sometimes not even properly integrated with each other.
  • SOAR (Security Orchestration, Automation and Response) – tools for centralizing security alerts and configuring automated actions in response. Alert fatigue is a real thing, with many false positives generated by tools like SIEMs/DLPs/EDRs. Reducing those false alarms is often harder than just scripting the way they are handled. SOAR provides that – it ingests alerts and allows you to use pre-built or custom response “cookbooks” that include checking data (e.g. whether an IP is in some blacklist, whether there are attachments of a certain content type in a flagged email, whether an employee is on holiday, etc.), creating tickets and alerting via multiple channels (email/SMS/other types of push). A toy playbook is sketched after this list.
  • TIP (Threat Intelligence Platform) – threat intelligence is often part of other solutions like SIEMs, EDRs and DLPs and involves collecting information (intelligence) about certain resources like IP addresses, domain names, certificates. When these items are discovered in the collected logs, the TIP can enrich the event with what it knows about the given item and even act in order to block a request, if a threat threshold is reached. In short – scanning public and private databases to find information about malicious actors and their assets.
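
Here is the toy SOAR playbook referenced above (Python; the enrichment source and ticketing call are stubs standing in for real integrations such as a threat feed, an HR system and a ticketing API):

```python
BLACKLISTED_IPS = {"203.0.113.7"}  # e.g. pulled from a threat feed

def is_on_holiday(user: str) -> bool:
    return False  # stub: would query the HR or calendar system

def create_ticket(severity: str, alert: dict) -> str:
    return f"TICKET-{severity}-{alert['id']}"  # stub: would call a ticketing API

def handle_alert(alert: dict) -> str:
    """Toy SOAR playbook: enrich the alert, then pick a response path."""
    known_bad = alert["source_ip"] in BLACKLISTED_IPS
    odd_timing = is_on_holiday(alert["user"])  # activity while user is away?
    if known_bad or odd_timing:
        return create_ticket("high", alert)   # escalate and page someone
    return create_ticket("low", alert)        # queue for routine review

print(handle_alert({"id": 42, "source_ip": "203.0.113.7", "user": "alice"}))
```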

Email

  • SEG (Secure email gateway) – a proxy for all incoming and outgoing email that scans them for malicious attachments, potential phishing and in some cases data exfiltration attempts.
  • MFT (Managed File Transfer) – a tool that allows sharing files securely with someone by replacing attachments. Shared files can be tracked, monitored, audited and scanned for vulnerabilities, and access can be cut once the file has been downloaded by the recipient, reducing the risk of data leaks.

DDoS

  • DDoS mitigation/protection – services that hide your actual IP in an attempt to block malicious DDoS traffic before it reaches your network (where it would be too late). They usually rely on large global networks and data centers (called “scrubbing centers”) to send clean traffic to your servers.

Compliance

  • GRC (Governance, Risk and Compliance) – a management tool for handling all the policies, audits, risk assessments, workflows and reports regarding different aspects of compliance, including security compliance
  • IRM (Integrated Risk Management) – allegedly philosophically different, more modern and advanced; in reality, the same as GRC with some additional monitoring features

So let’s summarize the ways that all of these solutions work:

  • Monitoring logs and other events
  • Inspecting incoming traffic and finding malicious activities
  • Inspecting outgoing traffic and applying policies
  • Application vulnerability detection
  • Automating certain aspects of the alerting, investigation and response handling

Monitoring (which is central to most tools) is usually done via proxies, port mirroring, network taps or host-based interface listeners, each having its pros and cons. Enforcement is almost always done via proxies. Bypassing these proxies should not be possible, but for cloud services you can’t really block access if the service is accessed outside your corporate environment (unless the SaaS provider has an IP whitelist feature).

In most cases, even though machine learning/AI is advertised as “the new thing”, tools make decisions based on configured policies (rules). Organizations are drowned in complex policies that they have to keep up to date and synchronize across tools. Policy management, especially given that there’s no industry standard for how policies should be defined, is a huge burden. In theory it gives flexibility and should be there; in practice it may lead to a messy and hard-to-manage environment.

Monitoring is universally seen as the way to receive actionable intelligence from systems. This is much messier in reality than in demos and often leads to systems being left unmonitored and alerts being ignored. Alert fatigue, which follows from the complexity of policy management, is a big problem in information security. SOAR is a way to remedy that, but it sounds like a band-aid on a broken process rather than a true solution – false alarms should be reduced rather than closed quasi-automatically. If handling an alert is automatable, then the tool that generates it should be able to know it’s not a real problem.

The complexity of the security landscape is obviously huge – product categories are defined based on multiple criteria – what problem they solve, how they solve it, or to what extent they solve it. Is a SIEM also a DLP if it uses UEBA to block certain traffic (next-gen SIEMs may be able to invoke blocking actions, even if they require another system to carry them out)? Is a DLP a CASB if it encrypts data that’s stored in cloud services? Should you have both an EPP and a SIEM, if the EPP gives you a good enough overview of the events being logged in your infrastructure? Is a CASB a WAF for SaaS? Is a SIEM a DAM if it supports native database audit logs? You can’t answer these questions at the category level; you have to look at particular products and how well they implement a certain feature.

Can you have a unified proxy (THE proxy) that monitors everything incoming and outgoing and collects that data, acting as a WAF, DLP, SIEM, CASB and SEG? Can you have just one agent that is both an EDR and a DLP? Well, certainly categories like SASE and UTM go in that direction, trying to ease the decision-making process.

I think it’s most important to start from the attack targets, rather than from the means to get there or the means to prevent getting there. Unfortunately, enterprise security is often driven by “I need to have this type of product”. This leads to semi-abandoned and partially configured tools for which organizations pay millions, because there are never enough people to go into the intricate details of yet another security solution, and organizations rely on consultants to set things up.

I don’t have solutions to the problems stated above, but I hope I’ve given a good overview of the landscape. And I think we should focus less on “security products” and more on “security techniques” and on the people who can implement them. No billion-dollar corporation can sell you a silver bullet (which you couldn’t fire anyway). You need trained experts. That’s hard. There aren’t enough of them. And the security team is often undervalued in the enterprise. Yes, cybersecurity is very important, but I’m not sure whether it will ever get enough visibility and be prioritized over purely business goals. And maybe it shouldn’t, if risk is properly calculated.

All the products above are ways to buy some feeling of security. If used properly and in the right combination, it can be more than a feeling. But too often a feeling is just good enough.

Reduce Cost and Increase Security with Amazon VPC Endpoints

Post Syndicated from Nigel Harris original https://aws.amazon.com/blogs/architecture/reduce-cost-and-increase-security-with-amazon-vpc-endpoints/

Introduction

This blog explains the benefits of using Amazon VPC endpoints and highlights a self-paced workshop that will help you to learn more about them. Amazon Virtual Private Cloud (Amazon VPC) enables you to launch AWS resources into a virtual network that you’ve defined. This virtual network resembles a traditional network that you’d operate in your own data center, with the benefits of using the scalable infrastructure of AWS.

A VPC endpoint allows you to privately connect your VPC to supported AWS services without requiring an Internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Endpoints are virtual devices that are horizontally scaled, redundant, and highly available VPC components. They allow communication between instances in your VPC and services without imposing availability risks or bandwidth constraints on your network traffic.

VPC endpoints enable you to reduce data transfer charges resulting from network communication between private VPC resources (such as Amazon Elastic Compute Cloud, or EC2, instances) and AWS services (such as Amazon Quantum Ledger Database, or QLDB). Without VPC endpoints configured, communications that originate from within a VPC destined for public AWS services must egress AWS to the public Internet in order to access AWS services. This network path incurs outbound data transfer charges. Data transfer charges for traffic egressing from Amazon EC2 to the Internet vary based on volume. However, at the time of writing, after the first 1 GB/month ($0.00 per GB), transfers are charged at a rate of $0.09/GB (for AWS US-East-1, Virginia). With VPC endpoints configured, communication between your VPC and the associated AWS service does not leave the Amazon network. If your workload requires you to transfer significant volumes of data between your VPC and AWS, you can reduce costs by leveraging VPC endpoints.

There are two types of VPC endpoints: interface endpoints and gateway endpoints. Amazon Simple Storage Service (S3) and Amazon DynamoDB are accessed using gateway endpoints. You can configure resource policies on both the gateway endpoint and the AWS resource that the endpoint provides access to. A VPC endpoint policy is an AWS Identity and Access Management (AWS IAM) resource policy that you can attach to an endpoint. It is a separate policy for controlling access from the endpoint to the specified service. This enables granular access control and private network connectivity from within a VPC. For example, you could create a policy that restricts access to a specific DynamoDB table through a VPC endpoint.
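
As a sketch of that last example, the boto3 snippet below creates a gateway endpoint for DynamoDB whose endpoint policy allows access to a single table only (the VPC, route table, account, and table identifiers are placeholders you would replace with your own):

```python
import json
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Endpoint policy limiting the endpoint to one (hypothetical) table.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "dynamodb:*",
        "Resource": "arn:aws:dynamodb:us-east-1:111122223333:table/Orders",
    }],
}

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # placeholder VPC ID
    ServiceName="com.amazonaws.us-east-1.dynamodb",
    RouteTableIds=["rtb-0123456789abcdef0"],   # placeholder route table
    PolicyDocument=json.dumps(policy),
)
```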

Figure 1: Accessing S3 via a Gateway VPC Endpoint

Interface endpoints enable you to connect to services powered by AWS PrivateLink. This includes a large number of AWS services, services hosted by other AWS customers and partners in their own VPCs, and supported AWS Marketplace partner services. Like gateway endpoints, interface endpoints can be secured using resource policies on the endpoint itself and the resource that the endpoint provides access to. Interface endpoints enable the use of security groups to restrict access to the endpoint.

Figure 2: Accessing QLDB via an Interface VPC Endpoint

In larger multi-account AWS environments, network design can vary considerably. Consider an organization that has built a hub-and-spoke network with AWS Transit Gateway. VPCs have been provisioned into multiple AWS accounts, perhaps to facilitate network isolation or to enable delegated network administration. When deploying distributed architectures such as this, a popular approach is to build a “shared services” VPC, which provides access to services required by workloads in each of the VPCs. This might include directory services or VPC endpoints. Sharing resources from a central location instead of building them in each VPC may reduce administrative overhead and cost. This approach was outlined by my colleague Bhavin Desai in his blog post Centralized DNS management of hybrid cloud with Amazon Route 53 and AWS Transit Gateway.

Figure 3: Centralized VPC Endpoints (multiple VPCs)

Alternatively, an organization may have centralized its network and chosen to leverage VPC sharing to enable multiple AWS accounts to create application resources (such as Amazon EC2 instances, Amazon Relational Database Service (RDS) databases, and AWS Lambda functions) in a shared, centrally managed network. With either pattern, establishing a granular set of controls to limit access to resources can be critical to support organizational security and compliance objectives while maintaining operational efficiency.

Figure 4: Centralized VPC Endpoints (shared VPC)

Learn how with the VPC Endpoint Workshop

Understanding how to appropriately restrict access to endpoints and the services they provide connectivity to is an often-misunderstood topic. I recently authored a hands-on workshop to help customers learn how to provision appropriate levels of access. Continue to learn about Amazon VPC Endpoints by taking the VPC Endpoint Workshop and then improve the security posture of your cloud workloads by leveraging network controls and VPC endpoint policies to manage access to your AWS resources.

Protecting Remote Desktops at Scale with Cloudflare Access

Post Syndicated from Mike Borkenstein original https://blog.cloudflare.com/protecting-remote-desktops-at-scale-with-cloudflare-access/

Early last year, before any of us knew that so many people would be working remotely in 2020, we announced that Cloudflare Access, Cloudflare’s Zero Trust authentication solution, would begin protecting the Remote Desktop Protocol (RDP). To protect RDP, customers would deploy Argo Tunnel to create an encrypted connection between their RDP server and our edge – effectively locking down RDP resources from the public Internet. Once locked down with Tunnel, customers could use Cloudflare Access to create identity-driven rules enforcing who could login to their resources.

Setting Tunnel up initially required installing the Cloudflare daemon, cloudflared, on each RDP server. However, as the adoption of remote work increased we learned that installing and provisioning a new daemon on every server in a network was a tall order for customers managing large fleets of servers.

What should have been a simple, elegant VPN replacement became a deployment headache. As organizations helped tens of thousands of users switch to remote work, no one had the bandwidth to deploy tens of thousands of daemons.

Message received: today we are announcing Argo Tunnel RDP Bastion mode, a simpler way to protect RDP connections at scale. 🎉 By functioning as a jump-host, cloudflared can reside on a single node in your network and proxy requests to any internal server, eliminating deployment headaches.

Previously, if a user wanted to RDP to a resource not yet protected with a dedicated cloudflared tunnel, they would have to reach out to a member of their infrastructure team and request that it be provisioned manually. For larger enterprises managing thousands of network assets, this could pose a significant burden, involving new configuration management manifests and tunnel health monitoring.

Argo Tunnel RDP Bastion mode enables teams to reach any machine through a single cloudflared instance – a single tunnel, gated by Cloudflare Access, to reach hundreds of remote desktops.

Why does RDP matter?

RDP is one of the most popular protocols used by employees to access their office computers from remote devices. It is installed by default on Windows, and is supported on *nix and macOS operating systems. Many companies rely on RDP to allow their employees to work from home.

Utilization of the remote desktop protocol has increased significantly in correlation with increased work from home due to the Coronavirus pandemic. Unfortunately, in a rush to make machines available to remote users, many organizations have misconfigured RDP, which has given attackers a new opportunity to target remote desktops.

This increase is due primarily to two factors. The first factor is exposure. Many RDP servers are inadvertently exposed directly to the open Internet due to incomplete enforcement of firewall rules or unpatched vulnerabilities. Quickly exposing desktop fleets in a rush to help employees work from home might result in more security oversights.

Second, most RDP servers are not protected with corporate SSO tools. When users connect over RDP, they often enter a local password to login to the target machine. However, organizations don’t always manage these credentials properly. Instead, users set and save passwords on an ad-hoc basis outside of the single sign-on credentials used for other services. That oversight leads to outdated, reused, and ultimately weak passwords that are potentially securing Internet-exposed resources.

Where does Cloudflare Access fit?

Cloudflare Access adds stronger authentication to RDP sessions by first locking down access to the remote machine via Argo Tunnel, then enforcing identity-based policies to determine who can gain access. Whether your organization uses Okta, Azure AD, or another provider, your users will be prompted to authenticate with those credentials before starting any RDP sessions.

With RDP connections protected by Access, organizations can enforce the same password strength and rotation requirements for RDP connections as they do for other critical tools.

How does it work?

On the origin side, an admin will configure a single cloudflared instance to run in bastion mode. That bastion will reach out to the two closest Cloudflare edge data centers and create a long-lived HTTP2 session. Once set up, cloudflared will wait for incoming connections from clients to specify which final origin to connect to. This is unlike conventional cloudflared tunnel behavior, which immediately creates a single outgoing connection to a pre-configured origin.

On the RDP user side, a cloudflared instance running as a client will be configured with the final destination of the RDP session. This isn’t the address of the cloudflared bastion, but rather the internal hostname the user wants to connect to.

Next, the user’s primary RDP client (i.e. “Remote Desktop Connection” on Windows) will initiate a connection to the local cloudflared client. cloudflared will launch a browser window and navigate to the Access app’s login page, prompting the user to authenticate with an IdP.

Once authenticated, the cloudflared client will tunnel the RDP traffic over HTTPS requests to the Cloudflare edge, including the final RDP destination and Access JWT in the request headers. The edge will verify the Access JWT to ensure that the client is authorized to reach the origin and, if it is, will use a special PoP to PoP route called Argo Smart Routing to forward the connection to the bastion over the shortest path possible.
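
For teams that also want the origin itself to validate tokens (defense in depth), the same JWT check can be reproduced in a few lines. A sketch using the PyJWT library; the team domain and the application audience (AUD) tag are placeholders you would take from your own Access configuration, and the signing keys are fetched from the certs endpoint Cloudflare publishes under the team domain:

```python
import jwt  # PyJWT, installed with its optional 'cryptography' dependency

TEAM_DOMAIN = "https://yourteam.cloudflareaccess.com"  # placeholder
POLICY_AUD = "your-application-aud-tag"                # placeholder

# Access publishes its public signing keys under the team domain.
jwks = jwt.PyJWKClient(f"{TEAM_DOMAIN}/cdn-cgi/access/certs")

def verify_access_jwt(token: str) -> dict:
    """Check the signature, audience and issuer; raises on any mismatch."""
    signing_key = jwks.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=POLICY_AUD,
        issuer=TEAM_DOMAIN,
    )
```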

For each incoming connection, the bastion will initiate an outgoing RDP session to the final internal destination and proxy traffic back and forth to the client.

What’s next?

While today we are proxying just RDP traffic in bastion mode, we will eventually be expanding this functionality to protocols like FTP, SSH, and generic TCP.

In the effort to make protecting internal resources easier than ever before, cloudflared can now also be conveniently found in the Cloudflare package repo, in tagged releases on the cloudflared Github repo, and in the cloudflared Docker hub repo.

Network-layer DDoS attack trends for Q2 2020

Post Syndicated from Vivek Ganti original https://blog.cloudflare.com/network-layer-ddos-attack-trends-for-q2-2020/

In the first quarter of 2020, within a matter of weeks, our way of life shifted. We’ve become reliant on online services more than ever. Employees who can are working from home, students of all ages and grades are taking classes online, and we’ve redefined what it means to stay connected. The more the public depends on staying connected, the larger the potential reward for attackers to cause chaos and disrupt our way of life. It is therefore no surprise that in Q1 2020 (January 1, 2020 to March 31, 2020) we reported an increase in the number of attacks, especially after various government stay-at-home (shelter-in-place) mandates went into effect in the second half of March.

In Q2 2020 (April 1, 2020 to June 30, 2020), this trend of increasing DDoS attacks continued and even accelerated:

  • The number of L3/4 DDoS attacks observed over our network doubled compared with the first three months of the year.
  • The scale of the largest L3/4 DDoS attacks increased significantly. In fact, we observed some of the largest attacks ever recorded over our network.
  • We observed more attack vectors being deployed and attacks were more geographically distributed.

The number of global L3/4 DDoS attacks in Q2 doubled

Gatebot is Cloudflare’s primary DDoS protection system. It automatically detects and mitigates globally distributed DDoS attacks. A global DDoS attack is an attack that we observe in more than one of our edge data centers. These attacks are usually generated by sophisticated attackers employing botnets in the range of tens of thousands to millions of bots.

Sophisticated attackers kept Gatebot busy in Q2. The total number of global L3/4 DDoS attacks that Gatebot detected and mitigated in Q2 doubled quarter over quarter. In our Q1 DDoS report, we reported a spike in the number and size of attacks. We continue to see this trend accelerate through Q2; over 66% of all global DDoS attacks in 2020 occurred in the second quarter (nearly 100% increase). May was the busiest month in the first half of 2020, followed by June and April. Almost a third of all L3/4 DDoS attacks occurred in May.

In fact, 63% of all L3/4 DDoS attacks that peaked over 100 Gbps occurred in May. As the global pandemic continued to intensify around the world in May, attackers were especially eager to take down websites and other Internet properties.

Small attacks continue to dominate in numbers as big attacks get bigger in size

A DDoS attack’s strength is equivalent to its size—the actual number of packets or bits flooding the link to overwhelm the target. A ‘large’ DDoS attack refers to an attack that peaks at a high rate of Internet traffic. The rate can be measured in terms of packets or bits. Attacks with high bit rates attempt to saturate the Internet link, and attacks with high packet rates attempt to overwhelm the routers or other in-line hardware devices.

Similar to Q1, the majority of L3/4 DDoS attacks that we observed in Q2 were also relatively ‘small’ with regard to the scale of Cloudflare’s network. In Q2, nearly 90% of all L3/4 DDoS attacks that we saw peaked below 10 Gbps. Small attacks that peak below 10 Gbps can still easily cause an outage to most websites and Internet properties around the world if they are not protected by a cloud-based DDoS mitigation service.

Similarly, from a packet rate perspective, 76% of all L3/4 DDoS attacks in Q2 peaked at up to 1 million packets per second (pps). Typically, a 1 Gbps Ethernet interface can deliver anywhere between 80k and 1.5M pps (the quick calculation below shows where that range comes from). Assuming the interface also serves legitimate traffic, and that most organizations have much less than a 1 Gbps interface, you can see how even these ‘small’ packet rate DDoS attacks can easily take down Internet properties.
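
That 80k-to-1.5M range follows from Ethernet framing overhead. A quick back-of-the-envelope calculation, assuming standard (non-jumbo) frame sizes plus the 20 bytes of preamble and inter-frame gap on the wire:

```python
LINK_BPS = 1_000_000_000  # a 1 Gbps link

# On-the-wire frame sizes, including 20 bytes of preamble + inter-frame gap:
MIN_FRAME_BYTES = 64 + 20     # minimum Ethernet frame
MAX_FRAME_BYTES = 1518 + 20   # maximum standard frame

max_pps = LINK_BPS / (MIN_FRAME_BYTES * 8)  # tiny packets saturate pps first
min_pps = LINK_BPS / (MAX_FRAME_BYTES * 8)  # full-size packets saturate bits
print(f"{min_pps:,.0f} to {max_pps:,.0f} pps")  # 81,274 to 1,488,095 pps
```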

In terms of duration, 83% of all attacks lasted between 30 and 60 minutes. We saw a similar trend in Q1, with 79% of attacks falling in the same duration range. This may seem like a short duration, but imagine it as a 30 to 60 minute cyber battle between your security team and the attackers. Now it doesn’t seem so short. Additionally, if a DDoS attack creates an outage or service degradation, the recovery time to reboot your appliances and relaunch your services can be much longer, costing you revenue and reputation for every minute.

In Q2, we saw the largest DDoS attacks on our network, ever

This quarter, we saw an increasing number of large-scale attacks, both in terms of packet rate and bit rate. In fact, 88% of all DDoS attacks in 2020 that peaked above 100 Gbps were launched after shelter-in-place went into effect in March. Once again, May was not just the busiest month with the largest number of attacks; it also saw the greatest number of large attacks above 100 Gbps.

From the packet perspective, June took the lead with a whopping 754 million pps attack. Besides that attack, the maximum packet rates stayed mostly consistent throughout the quarter at around 200 million pps.

The 754 million pps attack was automatically detected and mitigated by Cloudflare. The attack was part of an organized four-day campaign that lasted from June 18 to June 21. As part of the campaign, attack traffic from over 316,000 IP addresses targeted a single Cloudflare IP address.

Cloudflare’s DDoS protection systems automatically detected and mitigated the attack, and due to the size and global coverage of our network, there was no impact to performance. A globally interconnected network is crucial when mitigating large attacks in order to absorb the attack traffic and mitigate it close to the source, while continuing to serve legitimate customer traffic without inducing latency or service interruptions.

The United States is targeted with the most attacks

When we look at the L3/4 DDoS attack distribution by country, our data centers in the United States received the largest number of attacks (22.6%), followed by Germany (4.4%), Canada (2.7%) and Great Britain (2.6%).

However, when we look at the total attack bytes mitigated by each Cloudflare data center, the United States still leads (34.9%), followed by Hong Kong (6.6%), Russia (6.5%), Germany (4.5%) and Colombia (3.7%). The reason for this change is the total amount of bandwidth generated in each attack. For instance, while Hong Kong did not make it into the top 10 list due to the relatively small number of attacks observed there (1.8%), the attacks it did see were highly volumetric and generated so much attack traffic that they pushed Hong Kong into second place.

When analyzing L3/4 DDoS attacks, we bucket the traffic by the Cloudflare edge data center locations and not by the location of the source IP. The reason is that when attackers launch L3/4 attacks, they can ‘spoof’ (alter) the source IP address in order to obfuscate the attack source. If we were to derive the country based on a spoofed source IP, we would get a spoofed country. Cloudflare is able to overcome the challenges of spoofed IPs by displaying the attack data by the location of the Cloudflare data center in which the attack was observed. We’re able to achieve geographical accuracy in our report because we have data centers in over 200 cities around the world.

57% of all L3/4 DDoS attacks in Q2 were SYN floods

An attack vector is a term used to describe the attack method. In Q2, we observed an increase in the number of vectors used by attackers in L3/4 DDoS attacks. A total of 39 different types of attack vectors were used in Q2, compared to 34 in Q1. SYN floods formed the majority with over 57% in share, followed by RST (13%), UDP (7%), CLDAP (6%) and SSDP (3%) attacks.

SYN flood attacks aim to exploit the handshake process of a TCP connection. By repeatedly sending initial connection request packets with a synchronize flag (SYN), the attacker attempts to overwhelm the router’s connection table that tracks the state of TCP connections. The router replies with a packet that contains a synchronized acknowledgment flag (SYN-ACK), allocates a certain amount of memory for each given connection and falsely waits for the client to respond with a final acknowledgment (ACK). Given a sufficient number of SYNs that occupy the router’s memory, the router is unable to allocate further memory for legitimate clients causing a denial of service.

No matter the attack vector, Cloudflare automatically detects and mitigates stateful and stateless DDoS attacks using our three-pronged protection approach, comprising our home-built DDoS protection systems:

  1. Gatebot – Cloudflare’s centralized DDoS protection system for detecting and mitigating globally distributed volumetric DDoS attacks. Gatebot runs in our network’s core data center. It receives samples from every one of our edge data centers, analyzes them, and automatically sends mitigation instructions when attacks are detected. Gatebot is also synchronized with each of our customers’ web servers to identify their health and trigger tailored protection accordingly.
  2. dosd (denial of service daemon) – Cloudflare’s decentralized DDoS protection systems. dosd runs autonomously in each server in every Cloudflare data center around the world, analyzes traffic, and applies local mitigation rules when needed. Besides being able to detect and mitigate attacks at super fast speeds, dosd significantly improves our network resilience by delegating the detection and mitigation capabilities to the edge.
  3. flowtrackd (flow tracking daemon) – Cloudflare’s TCP state tracking machine for detecting and mitigating the most randomized and sophisticated TCP-based DDoS attacks in unidirectional routing topologies. flowtrackd is able to identify the state of a TCP connection and then drops, challenges or rate-limits packets that don’t belong to a legitimate connection.

In addition to our automated DDoS protection systems, Cloudflare also generates real-time threat intelligence that automatically mitigates attacks. Furthermore, Cloudflare provides its customers with firewall, rate-limiting, and additional tools to further customize and optimize their protection.

Cloudflare DDoS mitigation

As Internet usage continues to evolve for businesses and individuals, expect DDoS tactics to adapt as well. Cloudflare protects websites, applications, and entire networks from DDoS attacks of any size, kind, or level of sophistication.

Our customers and industry analysts recommend our comprehensive solution for three main reasons:

  • Network scale: Cloudflare’s 37 Tbps network can easily block attacks of any size, type, or level of sophistication. The Cloudflare network has a DDoS mitigation capacity that is higher than the next four competitors—combined.
  • Time-to-mitigation: Cloudflare mitigates most network layer attacks in under 10 seconds globally, and immediate mitigation (0 seconds) when static rules are preconfigured. With our global presence, Cloudflare mitigates attacks close to the source with minimal latency. In some cases, traffic is even faster than over the public Internet.
  • Threat intelligence: Cloudflare’s DDoS mitigation is powered by threat intelligence harnessed from the over 27 million Internet properties on our network. Additionally, the threat intelligence is incorporated into customer-facing firewalls and tools in order to empower our customers.

Cloudflare is uniquely positioned to deliver DDoS mitigation with unparalleled scale, speed, and smarts because of the architecture of our network. Cloudflare’s network is like a fractal—every service runs on every server in every Cloudflare data center that spans over 200 cities globally. This enables Cloudflare to detect and mitigate attacks close to the source of origin, no matter the size, source, or type of attack.

To learn more about Cloudflare’s DDoS solution, contact us or get started.

You can also join an upcoming live webinar where we will be discussing these trends, and strategies enterprises can implement to combat DDoS attacks and keep their networks online and fast. You can register here.

Architecting Secure Serverless Applications

Post Syndicated from Brian McNamara original https://aws.amazon.com/blogs/architecture/architecting-secure-serverless-applications/

Introduction

Cloud security at AWS is our top priority, and we have a deep set of cloud security tools consisting of more than 200 security, compliance, and governance services and key features. It’s why a broad set of customers — from enterprises, to the public sector, to startups — continue to rely on the capabilities we provide to ensure their workloads are secure.

In this series of blog posts, we will outline the controls that AWS Serverless services expose, while illustrating how their native capabilities can be used to meet security and compliance needs.

In this introductory post, I’ll talk about the value proposition of serverless architectures, drawing specific attention to changes in the shared security model for serverless applications. I will also call out specific personas – Developers, DevOps engineers, and Compliance teams – who have an interest in ensuring that serverless applications are deployed and managed securely.

What is serverless?

Serverless is the native architecture of the cloud that enables you to shift more of your operational responsibilities to your cloud provider, such as AWS, increasing your agility and innovation. Serverless allows you to build and run applications and services without thinking about servers. It eliminates infrastructure management tasks, such as server or cluster provisioning, patching, operating system maintenance, and capacity provisioning. You can build serverless architectures for nearly any type of application or backend service, and they handle everything you require to run and scale your application with high availability.

There are four benefits of serverless:

  1. No server management
  2. Flexible scaling
  3. Pay for value
  4. Automated high availability

Shared security model

Security and compliance are shared responsibilities between AWS and you, the customer. You benefit from a datacenter and network architecture that is built to meet the requirements of the most security-sensitive organizations, while AWS is responsible for protecting the infrastructure that runs all of the AWS cloud. AWS also provides you with services that you can use securely. Third-party auditors regularly test and verify the effectiveness of our security as part of the AWS compliance programs. Your responsibilities are determined by the AWS services you use as well as other factors, including the sensitivity of data, company requirements, and applicable laws and regulations. This is known as “security in the cloud.”

In the serverless model, customers are free to focus on the security of application code, the storage and accessibility of sensitive data, observing the behavior of their applications through monitoring and logging, and identity and access management (IAM) to the respective service.

Pay particular attention to the dotted box around Platform management, Code encryption, Network traffic, Firewall config, and Operating system and network configuration. While AWS assumes these responsibilities for serverless architectures, you still need to address them for non-serverless architectures.

Security personas

If you have spent time designing and operating server-based applications, consider the following to better understand how serverless changes your security practices:

  • Compliance teams need to understand how AWS assumes more of the security responsibilities in serverless applications, whether a service is covered by a compliance standard, and whether any additional configuration must be implemented to ensure compliance.
  • DevOps teams need to employ available protective and detective controls to securely deploy and manage serverless applications.
  • Developers and their teams need to understand how best to utilize least privilege and use sensitive data in their applications.

The series ahead

In upcoming blog posts, we’ll address topics such as:

  • How users are authenticated and authorized
  • How to address the risk of data loss
  • How to deal with code injection
  • How to address data exfiltration
  • How to address escalation of privileges and denial of service

Each post will also address the concerns of each persona for the relevant services.

New – Using Amazon GuardDuty to Protect Your S3 Buckets

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-using-amazon-guardduty-to-protect-your-s3-buckets/

As we anticipated in this post, the anomaly and threat detection for Amazon Simple Storage Service (S3) activities that was previously available in Amazon Macie has now been enhanced and reduced in cost by over 80% as part of Amazon GuardDuty. This expands GuardDuty threat detection coverage beyond workloads and AWS accounts to also help you protect your data stored in S3.

This new capability enables GuardDuty to continuously monitor and profile S3 data access events (usually referred to as data plane operations) and S3 configurations (control plane APIs) to detect suspicious activities such as requests coming from an unusual geo-location, disabling of preventative controls such as S3 Block Public Access, or API call patterns consistent with an attempt to discover misconfigured bucket permissions. To detect possibly malicious behavior, GuardDuty uses a combination of anomaly detection, machine learning, and continuously updated threat intelligence. For your reference, here’s the full list of GuardDuty S3 threat detections.

When threats are detected, GuardDuty produces detailed security findings to the console and to Amazon EventBridge, making alerts actionable and easy to integrate into existing event management and workflow systems, or to trigger automated remediation actions using AWS Lambda. You can optionally deliver findings to an S3 bucket to aggregate findings from multiple regions, and to integrate with third-party security analysis tools.

If you are not using GuardDuty yet, S3 protection will be on by default when you enable the service. If you are already using GuardDuty, you can simply enable this new capability with one click in the GuardDuty console or through the API. For simplicity, and to optimize your costs, GuardDuty has now been integrated directly with S3. In this way, you don’t need to manually enable or configure S3 data event logging in AWS CloudTrail to take advantage of this new capability. GuardDuty also intelligently processes only the data events that can be used to generate threat detections, significantly reducing the number of events processed and lowering your costs.

If you are part of a centralized security team that manages GuardDuty across your entire organization, you can manage all accounts from a single account using the integration with AWS Organizations.

Enabling S3 Protection for an AWS Account
I already have GuardDuty enabled for my AWS account in this region. Now, I want to add threat detection for my S3 buckets. In the GuardDuty console, I select S3 Protection and then Enable. That’s it. For broader protection, I repeat this process for all regions enabled in my account.
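
The API path is equally short. Here is a sketch with boto3 that flips on the S3Logs data source; it assumes a detector already exists in each region (i.e. GuardDuty is already enabled there), and the region list is a placeholder for whichever regions you use:

```python
import boto3

for region in ["us-east-1", "eu-west-1"]:  # extend to all regions you use
    guardduty = boto3.client("guardduty", region_name=region)
    # Assumes GuardDuty is already enabled, i.e. a detector exists.
    detector_id = guardduty.list_detectors()["DetectorIds"][0]
    guardduty.update_detector(
        DetectorId=detector_id,
        DataSources={"S3Logs": {"Enable": True}},
    )
```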

After a few minutes, I start seeing new findings related to my S3 buckets. I can select each finding to get more information on the possible threat, including details on the source actor and the target action.

After a few days, I select the Usage section of the console to monitor the estimated monthly costs of GuardDuty in my account, including the new S3 protection. I can also see which S3 buckets contribute most to the costs. Well, it turns out I didn’t have lots of traffic on my buckets recently.

Enabling S3 Protection for an AWS Organization
To simplify management of multiple accounts, GuardDuty uses its integration with AWS Organizations to allow you to delegate an account to be the administrator for GuardDuty for the whole organization.

Now, the delegated administrator can enable GuardDuty for all accounts in the organization in a region with one click. You can also set Auto-enable to ON to automatically include new accounts in the organization. If you prefer, you can add accounts by invitation. You can then go to the S3 Protection page under Settings to enable S3 protection for the entire organization.

When selecting Auto-enable, the delegated administrator can also choose to enable S3 protection automatically for new member accounts.

Available Now
As always, with Amazon GuardDuty, you only pay for the quantity of logs and events processed to detect threats. This includes API control plane events captured in CloudTrail, network flows captured in VPC Flow Logs, DNS request and response logs, and, with S3 protection enabled, S3 data plane events. These sources are ingested by GuardDuty through internal integrations when you enable the service, so you don’t need to configure any of these sources directly. The service continually optimizes the logs and events processed to reduce your cost, and displays your usage split by source in the console. If configured in multi-account mode, usage is also split by account.

There is a 30-day free trial for the new S3 threat detection capabilities. This also applies to accounts that already have GuardDuty enabled and add the new S3 protection capability. During the trial, the estimated cost based on your S3 data event volume is shown in the GuardDuty console Usage tab. In this way, while you evaluate these new capabilities at no cost, you can understand what your monthly spend would be.

GuardDuty for S3 protection is available in all regions where GuardDuty is offered. For regional availability, please see the AWS Region Table. To learn more, please see the documentation.

Danilo

Building a serverless tokenization solution to mask sensitive data

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/building-a-serverless-tokenization-solution-to-mask-sensitive-data/

This post is courtesy of Anuj Gupta, Senior Solutions Architect, and Steven David, Senior Solutions Architect.

Customers tell us that security and compliance are top priorities regardless of industry or location. Government and industry regulations are regularly updated and companies must move quickly to remain compliant. Organizations must balance the need to generate value from data with the need to ensure data privacy. There are many situations where it is prudent to obfuscate data to reduce the risk of exposure, while also improving the ability to innovate.

This blog discusses data obfuscation and how it can be used to reduce the risk of unauthorized access. It can also simplify PCI DSS compliance by reducing the number of components to which compliance requirements apply.

Comparing tokenization and encryption

There is a difference between encryption and tokenization. Encryption is the process of using an algorithm to transform plaintext into ciphertext. An algorithm and an encryption key are required to decrypt the original plaintext.

Tokenization is the process of transforming a piece of data into a random string of characters called a token. It does not have direct meaningful value in relation to the original data. Tokens serve as a reference to the original data, but cannot be used to derive that data.

Unlike encryption, tokenization does not use a mathematical process to transform the sensitive information into the token. Instead, tokenization uses a database, often called a token vault, which stores the relationship between the sensitive value and the token. The real data in the vault is then secured, often via encryption. The token value can be used in various applications as a substitute for the original data.

For example, to process a recurring credit card payment, the token is submitted to the vault, which uses it as an index to fetch the original data for the authorization process. Increasingly, tokens are also being used to secure other types of sensitive or personally identifiable information. This includes data like Social Security numbers (SSNs), telephone numbers, and email addresses.
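
To make the vault concept concrete, here is a minimal in-memory Python sketch (an illustration only: a real token vault persists the mapping, encrypts the stored values, and controls access):

import uuid

class TokenVault:
    def __init__(self):
        self._store = {}  # token -> sensitive value

    def tokenize(self, sensitive_value):
        token = str(uuid.uuid4())  # random token, unrelated to the input
        self._store[token] = sensitive_value
        return token

    def detokenize(self, token):
        return self._store[token]  # the token is only a reference

vault = TokenVault()
token = vault.tokenize("4111-1111-1111-1111")
print(token)                    # safe to store in application databases
print(vault.detokenize(token))  # original value, fetched from the vault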

Overview

In this blog, we show how to design a secure, reliable, scalable, and cost-optimized tokenization solution. It can be integrated with applications to generate tokens, store ciphertext in an encrypted token vault, and exchange tokens for the original text.

In an example use-case, a data analyst needs access to a customer database. The database includes the customer’s name, SSN, credit card, order history, and preferences. Some of the customer information qualifies as sensitive data. To satisfy the required information security policy, you must apply methods such as column-level access control, role-based access control, column-level encryption, and protection from unauthorized access.

Providing access to the customer database increases the complexity of managing fine-grained access policies. Tokenization replaces the sensitive data with random unique tokens, which are stored in an application database. This lowers the complexity and the cost of managing access, while helping with data protection.

Walkthrough

This serverless application uses Amazon API Gateway, AWS Lambda, Amazon Cognito, Amazon DynamoDB, and AWS Key Management Service (AWS KMS).

Serverless architecture diagram

The client authenticates with Amazon Cognito and receives an authorization token. This token is used to validate calls to the Customer Order Lambda function. The function calls the tokenization layer, providing sensitive information in the request. This layer includes the logic to generate unique random tokens and store encrypted text in a cipher database.

Lambda calls KMS to obtain an encryption key. It then uses the DynamoDB client-side encryption library to encrypt the original text and store the ciphertext in the cipher database. The Lambda function retrieves the generated token in the response from the tokenization layer. This token is then stored in the application database for future reference.

AWS KMS makes it easy to create and manage cryptographic keys. It provides logs of all key usage to help you meet regulatory and compliance needs.

One of the most important decisions when using the DynamoDB Encryption Client is selecting a cryptographic materials provider (CMP). The CMP determines how encryption and signing keys are generated, and whether new key materials are generated for each item or reused. It also sets the encryption and signing algorithms that are used. To identify a CMP for your workload, refer to this documentation.

The current solution selects the Direct KMS Provider as the CMP. This cryptographic materials provider returns a unique encryption key and signing key for every table item. To do this, it calls KMS every time you encrypt or decrypt an item.

The KMS process

  • To generate encryption materials, the Direct KMS Provider asks AWS KMS to generate a unique data key for each item using a customer master key (CMK) that you specify. It derives encryption and signing keys for the item from the plaintext copy of the data key, and then returns the encryption and signing keys, along with the encrypted data key, which is stored in the material description attribute of the item.
  • The item encryptor uses the encryption and signing keys and removes them from memory as soon as possible. Only the encrypted copy of the data key from which they were derived is saved in the encrypted item.
  • To generate decryption materials, the Direct KMS Provider asks AWS KMS to decrypt the encrypted data key. Then, it derives verification and signing keys from the plaintext data key, and returns them to the item encryptor.

The item encryptor verifies the item and, if verification succeeds, decrypts the encrypted values. Finally, it removes the keys from memory as soon as possible.
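
The Direct KMS Provider builds on the standard KMS data key pattern. The following boto3 sketch illustrates that underlying pattern (the key ARN is a placeholder; in this solution the DynamoDB Encryption Client makes these calls for you):

import boto3

kms = boto3.client("kms")
cmk_arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"  # placeholder CMK

# Ask KMS for a fresh data key: we get a plaintext key (use, then discard)
# and an encrypted copy (safe to store alongside the item).
resp = kms.generate_data_key(KeyId=cmk_arn, KeySpec="AES_256")
plaintext_key = resp["Plaintext"]        # derive encryption/signing keys from this
encrypted_key = resp["CiphertextBlob"]   # stored in the item's material description

# Later, to decrypt, hand the encrypted copy back to KMS.
plaintext_key_again = kms.decrypt(CiphertextBlob=encrypted_key)["Plaintext"]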

For enhanced security, the example creates the Lambda function inside a VPC with a security group attached to allow incoming HTTPS traffic from private IPs only. The Lambda function connects to DynamoDB and KMS via VPC endpoints instead of going through the public internet: it connects to DynamoDB using a gateway endpoint and to KMS using an interface endpoint, providing highly available and secure connections.

Additionally, VPC endpoints can use endpoint policies to allow only permitted operations for KMS and DynamoDB over this connection. To further control the management of encryption keys, the KMS master key has a resource-based policy. It allows the Lambda layer to generate data keys for encryption and decryption, and restricts any administrative activity on the master key.

To deploy this solution, follow the instructions in the aws-serverless-tokenization GitHub repo. The AWS Serverless Application Model (AWS SAM) template allows you to quickly deploy this solution into your AWS account.

Understanding the code

The solution uses the tokenizer package, deployed as a Lambda layer. It uses Python UUID4 to generate random values. You can optionally update the logic in hash_gen.py to use your own tokenization technique. For example, you could generate tokens with the same length as the original text, preserving the original format in the generated token.
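
As an illustration, a hypothetical format-preserving generator (a sketch only, not part of the repo) could keep the length and character classes of the input:

import secrets
import string

def format_preserving_token(original):
    # Digits stay digits, letters stay letters, separators are kept as-is.
    out = []
    for ch in original:
        if ch.isdigit():
            out.append(secrets.choice(string.digits))
        elif ch.isalpha():
            out.append(secrets.choice(string.ascii_letters))
        else:
            out.append(ch)
    return "".join(out)

print(format_preserving_token("4111-1111-1111-1111"))  # e.g. 8302-5519-0047-2286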

The ddb_encrypt_item.py file contains the logic for encrypting DynamoDB items and uses a DynamoDB client-side encryption library. To learn more about how this library works, refer to this documentation.

There are three methods used in the application logic:

  • encrypt_item encrypts the plaintext using the KMS customer managed key. In AttributeActions, you can specify attributes that should not be encrypted; for example, you might exclude certain keys in the JSON input from encryption. It also requires a partition key to index the encrypted text in the DynamoDB table; the hash key serves as the name of the partition key, and its value is the UUID token generated in the previous step.
import boto3
from dynamodb_encryption_sdk.encrypted.table import EncryptedTable
from dynamodb_encryption_sdk.identifiers import CryptoAction
from dynamodb_encryption_sdk.material_providers.aws_kms import AwsKmsCryptographicMaterialsProvider
from dynamodb_encryption_sdk.structures import AttributeActions

def encrypt_item(plaintext_item, table_name):
    table = boto3.resource('dynamodb').Table(table_name)

    # aws_cmk_id is the ARN of the KMS customer managed key (configured elsewhere)
    aws_kms_cmp = AwsKmsCryptographicMaterialsProvider(key_id=aws_cmk_id)

    actions = AttributeActions(
        default_action=CryptoAction.ENCRYPT_AND_SIGN,
        # leave the partition key unencrypted so it remains queryable
        attribute_actions={'Account_Id': CryptoAction.DO_NOTHING}
    )

    encrypted_table = EncryptedTable(
        table=table,
        materials_provider=aws_kms_cmp,
        attribute_actions=actions
    )
    response = encrypted_table.put_item(Item=plaintext_item)
    return response
  • get_decrypted_item gets the plaintext for a given partition key (for example, the UUID token), using the KMS customer managed key; see the sketch after this list.
  • get_item gets the obfuscated text, that is, the ciphertext stored in the DynamoDB table, for the provided partition key.
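
As referenced above, here is a sketch of what get_decrypted_item might look like (the repo's actual implementation may differ); it reuses the imports and setup from the encrypt_item example:

def get_decrypted_item(partition_key_value, table_name):
    table = boto3.resource('dynamodb').Table(table_name)

    # Same CMP and attribute actions as in encrypt_item above.
    aws_kms_cmp = AwsKmsCryptographicMaterialsProvider(key_id=aws_cmk_id)
    actions = AttributeActions(
        default_action=CryptoAction.ENCRYPT_AND_SIGN,
        attribute_actions={'Account_Id': CryptoAction.DO_NOTHING}
    )

    encrypted_table = EncryptedTable(
        table=table,
        materials_provider=aws_kms_cmp,
        attribute_actions=actions
    )
    # get_item on the EncryptedTable verifies and decrypts transparently.
    return encrypted_table.get_item(Key={'Account_Id': partition_key_value})['Item']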

The dynamodb-encryption-sdk requires cryptography libraries as a dependency. Both of these libraries are platform-dependent and must be installed for a specific operating system. Since Lambda functions run on Amazon Linux, you must install these libraries for Amazon Linux even if you are developing application code on a different operating system. To do this, use the get_AMI_packages_cryptography.sh script to download the Docker image, install the dependencies within the image, and export the files to be used by our Lambda layer.

If you are processing DynamoDB items at a high frequency and large scale, you might exceed the AWS KMS requests-per-second limit, causing processing delays. You can use tools such as JMeter to test the required throughput based on the expected traffic for this serverless application. If you need to exceed a quota, you can request a quota increase in Service Quotas. Use the Service Quotas console or the RequestServiceQuotaIncrease operation. For details, see Requesting a quota increase in the Service Quotas User Guide. If Service Quotas for AWS KMS are not available in the AWS Region, create a case in the AWS Support Center.
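
Independently of quota increases, client-side retries can make transient throttling less disruptive. A small boto3 sketch (the parameter values are illustrative, not tuned recommendations):

import boto3
from botocore.config import Config

# Retry throttled KMS calls with adaptive client-side rate limiting.
kms = boto3.client(
    "kms",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)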

After following this walkthrough, to avoid incurring future charges, delete the resources following step 7 of the README file.

Conclusion

This post shows how to use AWS Serverless services to design a secure, reliable, and cost-optimized tokenization solution. It can be integrated with applications to protect sensitive information and manage access using strict controls with less operational overhead.

Securing Amazon EKS workloads with Atlassian Bitbucket and Snyk

Post Syndicated from James Bland original https://aws.amazon.com/blogs/devops/securing-amazon-eks-workloads-with-atlassian-bitbucket-and-snyk/

This post was contributed by James Bland, Sr. Partner Solutions Architect, AWS, Jay Yeras, Head of Cloud and Cloud Native Solution Architecture, Snyk, and Venkat Subramanian, Group Product Manager, Bitbucket

 

One of our goals at Atlassian is to make the software delivery and development process easier. This post explains how you can set up a software delivery pipeline using Bitbucket Pipelines and Snyk, a tool that finds and fixes vulnerabilities in open-source dependencies and container images, to deploy secured applications on Amazon Elastic Kubernetes Service (Amazon EKS). By presenting important development information directly on pull requests inside the product, you can proactively diagnose potential issues, shorten test cycles, and improve code quality.

Atlassian Bitbucket Cloud is a Git-based code hosting and collaboration tool, built for professional teams. Bitbucket Pipelines is an integrated CI/CD service that allows you to automatically build, test, and deploy your code. With its best-in-class integrations with Jira, Bitbucket Pipelines allows different personas in an organization to collaborate and get visibility into the deployments. Bitbucket Pipes are small chunks of code that you can drop into your pipeline to make it easier to build powerful, automated CI/CD workflows.

In this post, we go over the following topics:

  • The importance of security as practices shift-left in DevOps
  • How embedding security into pull requests helps developer workflows
  • Deploying an application on Amazon EKS using Bitbucket Pipelines and Snyk

Shift-left on security

Security is usually an afterthought. Developers tend to focus on delivering software first and addressing security issues later when IT Security, Ops, or InfoSec teams discover them. However, research from the 2016 State of DevOps Report shows that you can achieve better outcomes by testing for security earlier in the process within a developer’s workflow. This concept is referred to as shift-left, where left indicates earlier in the process, as illustrated in the following diagram.

There are two main challenges in shifting security left to developers:

  • Developers aren’t security experts – They develop software in the most efficient way they know how, which can mean importing libraries to take care of lower-level details. And sometimes these libraries import other libraries within them, and so on. This makes it almost impossible for a developer, who is not a security expert, to keep track of security.
  • It’s time-consuming – There is no automation. Developers have to run tests to understand what’s happening and then figure out how to fix it. This slows them down and takes them away from their core job: building software.

Time spent on SDLC testing

Embedding security into a developer’s workflow

Code Insights is a new feature in Bitbucket that provides contextual information as part of the pull request interface. It surfaces information relevant to a pull request so issues related to code quality or security vulnerabilities can be viewed and acted upon during the code review process. The following screenshot shows Code Insights on the pull request sidebar.

 

Code insights

In the security space, we’ve partnered with Snyk, McAfee, Synopsys, and Anchore. When you use any of these integrations in your Bitbucket Pipeline, security vulnerabilities are automatically surfaced within your pull request, prompting developers to address them. By bringing the vulnerability information into the pull request interface before the actual deployment, it’s much easier for code reviewers to assess the impact of the vulnerability and provide actionable feedback.

When security issues are fixed as part of a developer’s workflow instead of post-deployment, it means fewer sev1 incidents, which saves developer time and IT resources down the line, and leads to a better user experience for your customers.

 

Securing your Atlassian Workflow with Snyk

To demonstrate how you can easily introduce a few steps to your workflow that improve your security posture, we take advantage of the new Snyk integration to Atlassian’s Code Insights and other Snyk integrations to Bitbucket Cloud, Amazon Elastic Container Registry (Amazon ECR; for more information see Container security with Amazon Elastic Container Registry (ECR): integrate and test), and Amazon EKS (for more information see Kubernetes workload and image scanning). We reference sample code in a publicly available Bitbucket repository. In this repository, you can find resources such as a multi-stage build Dockerfile for a sample Java web application, a sample bitbucket-pipelines.yml configured to perform Snyk scans and push container images to Amazon ECR, and a reference Kubernetes manifest to deploy your application.

Prerequisites

You first need to have a few resources provisioned, such as an Amazon ECR repository and an Amazon EKS cluster. You can quickly create these using the AWS Command Line Interface (AWS CLI) by invoking the create-repository command and following the Getting started with eksctl guide. Next, make sure that you have enabled the new code review experience in your Bitbucket account.
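
If you prefer Python over the CLI for the prerequisites, creating the ECR repository might look like this boto3 sketch (the region and repository name are assumptions matching the pipeline below):

import boto3

ecr = boto3.client("ecr", region_name="us-west-2")
response = ecr.create_repository(repositoryName="petstore")
print(response["repository"]["repositoryUri"])  # use this URI to tag and push images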

To take a closer look at the bitbucket-pipelines.yml file, see the following code:

script:
 - IMAGE_NAME="petstore"
 # build the container image from the Dockerfile in the repository
 - docker build -t $IMAGE_NAME .
 # scan the image with Snyk and publish the results to Code Insights
 - pipe: snyk/snyk-scan:0.4.3
   variables:
     SNYK_TOKEN: $SNYK_TOKEN
     LANGUAGE: "docker"
     IMAGE_NAME: $IMAGE_NAME
     TARGET_FILE: "Dockerfile"
     CODE_INSIGHTS_RESULTS: "true"
     SEVERITY_THRESHOLD: "high"
     DONT_BREAK_BUILD: "true"
 # push the scanned image to the private registry on Amazon ECR
 - pipe: atlassian/aws-ecr-push-image:1.1.2
   variables:
     AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID
     AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY
     AWS_DEFAULT_REGION: "us-west-2"
     IMAGE_NAME: $IMAGE_NAME

In the preceding code, we invoke two Bitbucket Pipes to configure our pipeline and complete two critical tasks in just a few lines: scan our container image and push it to our private registry. This saves time, allows for reuse across repositories, and opens the door to further pipeline automation through an extensive catalog of integrations.

Snyk pipe for Bitbucket Pipelines

In the following use case, we build a container image from the Dockerfile included in the Bitbucket repository and scan the image using the Snyk pipe. We also invoke the aws-ecr-push-image pipe to securely store our image in a private registry on Amazon ECR. When the pipeline runs, we see results as shown in the following screenshot.

Bitbucket pipeline

If we choose the available report, we can view the detailed results of our Snyk scan. In the following screenshot, we see detailed insights into the content of that report: three high, one medium, and five low-severity vulnerabilities were found in our container image.

container image report

 

Snyk scans of Bitbucket and Amazon ECR repositories

Because we use Snyk’s integration to Amazon ECR and Snyk’s Bitbucket Cloud integration to scan and monitor repositories, we can dive deeper into these results by linking our Dockerfile stored in our Bitbucket repository to the results of our last container image scan. By doing so, we can view recommendations for upgrading our base image, as in the following screenshot.

 

ECR scan recommendations

As a result, we can move past informational insights and on to actionable recommendations. In the preceding screenshot, our current image of jboss/wildfly:11.0.0.Final contains 76 vulnerabilities. We also see two recommendations: a major upgrade to jboss/wildfly:18.0.1.FINAL, which brings our total vulnerabilities down to 65, and an alternative upgrade, which is less desirable.

 

We can investigate further by drilling down into the report to view additional context on how a potential vulnerability was introduced, and also create a Jira issue to Atlassian Jira Software Cloud. The following screenshot shows a detailed report on the Issues tab.

 

Jira issue

We can also explore the Dependencies tab for a list of all the direct dependencies, transitive dependencies, and the vulnerabilities those may contain. See the following screenshot.

 

dependency vulnerabilities

Snyk scan of Amazon EKS configuration

The final step in securing our workflow involves integrating Snyk with Kubernetes and deploying to Amazon EKS using Bitbucket Pipelines. Sample Kubernetes manifest files and a bitbucket-pipeline.yml are available for you to use in the accompanying Bitbucket repository for this post. Our bitbucket-pipeline.yml contains the following step:

script:
 - pipe: atlassian/aws-eks-kubectl-run:1.2.3
   variables:
     AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID
     AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY
     AWS_DEFAULT_REGION: $AWS_DEFAULT_REGION
     CLUSTER_NAME: "my-kube-cluster"
     KUBECTL_COMMAND: "apply"
     RESOURCE_PATH: "java-app.yaml"

In the preceding code, we call the aws-eks-kubectl-run pipe and pass in a few repository variables we previously defined (see the following screenshot).

 

repository variables

For more information about generating the necessary access keys in AWS Identity and Access Management (IAM) to make programmatic requests to the AWS API, see Creating an IAM User in Your AWS Account.

Now that we have provisioned the supporting infrastructure and invoked kubectl apply -f java-app.yaml to deploy our pods using our container images in Amazon ECR, we can monitor our project details and view some initial results. The following screenshot shows that our initial configuration isn’t secure.

secure config scan results

The reason for this is that we didn’t explicitly define a few parameters in our Kubernetes manifest under securityContext. For example, parameters such as readOnlyRootFilesystem, runAsNonRoot, allowPrivilegeEscalation, and capabilities either aren’t defined or are set incorrectly in our template. As a result, we see this in our findings with the FAIL flag. Hovering over these on the Snyk console provides specific insights on how to fix them, for example:

  • Run as non-root – Whether any containers in the workload have securityContext.runAsNonRoot set to false or unset
  • Read-only root file system – Whether any containers in the workload have securityContext.readOnlyRootFilesystem set to false or unset
  • Drop capabilities – Whether all capabilities are dropped and CAP_SYS_ADMIN isn’t added

 

To save you the trouble of researching this, we provide another sample template, java-app-snyk.yaml, which you can apply against your running pods. The difference in this template is that we have added the following lines to the manifest, which address the three failed findings in our report:

securityContext:
 allowPrivilegeEscalation: false
 readOnlyRootFilesystem: true
 runAsNonRoot: true
 capabilities:
   drop:
     - all

After a subsequent scan, we can validate our changes propagated successfully and our Kubernetes configuration is secure (see the following screenshot).

secure config scan passing results

Conclusion

This post demonstrated how to secure your entire flow proactively with Atlassian Bitbucket Cloud and Snyk. Seamless integrations to Bitbucket Cloud provide you with actionable insights at each step of your development process.

Get started for free with Bitbucket and Snyk and learn more about the Bitbucket-Snyk integration.

 

“The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.”

 

flowtrackd: DDoS Protection with Unidirectional TCP Flow Tracking

Post Syndicated from Omer Yoachimik original https://blog.cloudflare.com/announcing-flowtrackd/


Magic Transit is Cloudflare’s L3 DDoS Scrubbing service for protecting network infrastructure. As part of our ongoing investment in Magic Transit and our DDoS protection capabilities, we’re excited to talk about a new piece of software helping to protect Magic Transit customers: flowtrackd. flowtrackd is a software-defined DDoS protection system that significantly improves our ability to automatically detect and mitigate even the most complex TCP-based DDoS attacks. If you are a Magic Transit customer, this feature will be enabled by default at no additional cost on July 29, 2020.


TCP-Based DDoS Attacks

In the first quarter of 2020, one out of every two L3/4 DDoS attacks Cloudflare mitigated was an ACK Flood, and over 66% of all L3/4 attacks were TCP based. Most types of DDoS attacks can be mitigated by finding unique characteristics that are present in all attack packets and using that to distinguish ‘good’ packets from the ‘bad’ ones. This is called “stateless” mitigation, because any packet that has these unique characteristics can simply be dropped without remembering any information (or “state”) about the other packets that came before it. However, when attack packets have no unique characteristics, then “stateful” mitigation is required, because whether a certain packet is good or bad depends on the other packets that have come before it.

The most sophisticated types of TCP flood require stateful mitigation, where every TCP connection must be tracked in order to know whether any particular TCP packet is part of an active connection. That kind of mitigation is called “flow tracking”, and it is typically implemented in Linux by the iptables conntrack module. However, DDoS protection with conntrack is not as simple as flipping the iptables switch, especially at the scale and complexity at which Cloudflare operates. If you’re interested in learning more, in this blog we talk about the technical challenges of implementing iptables conntrack.

Complex TCP DDoS attacks pose a threat as they can be harder to detect and mitigate. They therefore have the potential to cause service degradation, outages and increased false positives with inaccurate mitigation rules. So how does Cloudflare block patternless DDoS attacks without affecting legitimate traffic?

Bidirectional TCP Flow Tracking

Using Cloudflare’s traditional products, HTTP applications can be protected by the WAF service, and TCP/UDP applications can be protected by Spectrum. These services are “reverse proxies“, meaning that traffic passes through Cloudflare in both directions. In this bidirectional topology, we see the entire TCP flow (i.e., segments sent by both the client and the server) and can therefore track the state of the underlying TCP connection. This way, we know if a TCP packet belongs to an existing flow or if it is an “out of state” TCP packet. Out of state TCP packets look just like regular TCP packets, but they don’t belong to any real connection between a client and a server. These packets are most likely part of an attack and are therefore dropped.

Reverse Proxy: What Cloudflare Sees

While not trivial, tracking TCP flows can be done when we serve as a proxy between the client and server, allowing us to absorb and mitigate out of state TCP floods. However it becomes much more challenging when we only see half of the connection: the ingress flow. This visibility into ingress but not egress flows is the default deployment method for Cloudflare’s Magic Transit service, so we had our work cut out for us in identifying out of state packets.

The Challenge With Unidirectional TCP Flows

With Magic Transit, Cloudflare receives inbound internet traffic on behalf of the customer, scrubs DDoS attacks, and routes the clean traffic to the customer’s data center over a tunnel. The data center then responds directly to the eyeball client using a technique known as Direct Server Return (DSR).

Magic Transit: Asymmetric L3 Routing

Using DSR, when a TCP handshake is initiated by an eyeball client, it sends a SYN packet that gets routed via Cloudflare to the origin data center. The origin then responds with a SYN-ACK directly to the client, bypassing Cloudflare. Finally, the client responds with an ACK that once again routes to the origin via Cloudflare and the connection is then considered established.

L3 Routing: What Cloudflare Sees

In a unidirectional flow we don’t see the SYN-ACK sent from the origin to the eyeball client, and therefore can’t utilize our existing flow tracking capabilities to identify out of state packets.

Unidirectional TCP Flow Tracking

To overcome the challenges of unidirectional flows, we recently completed the development and rollout of a new system, codenamed flowtrackd (“flow tracking daemon”). flowtrackd is a state machine that hooks into the network interface. It tracks unidirectional TCP flows using only the ingress traffic that routes through Cloudflare to determine the state of the TCP connection. flowtrackd is then able to determine if a packet is part of a new connection, an open one, a connection that is closing, one that is closed, or if it’s an out of state packet. Once a high volume of out-of-state packets is detected, flowtrackd will either challenge (force RST) or drop the packets.

Snapshot from what flowtrackd sees
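
To make the idea concrete, here is a deliberately simplified Python sketch (a toy illustration only; flowtrackd’s actual state machine is proprietary, far more complete, and also handles retransmissions, timeouts, and challenges):

NEW, ESTABLISHED, CLOSING = "new", "established", "closing"

class UnidirectionalFlowTracker:
    def __init__(self):
        self.flows = {}  # (src_ip, dst_ip, src_port, dst_port) -> state

    def observe(self, key, flags):
        state = self.flows.get(key)
        if "SYN" in flags and state is None:
            self.flows[key] = NEW          # handshake initiated via our ingress
            return "allow"
        if "ACK" in flags and state == NEW:
            self.flows[key] = ESTABLISHED  # client ACK completes the handshake
            return "allow"
        if "FIN" in flags and state == ESTABLISHED:
            self.flows[key] = CLOSING
            return "allow"
        if state in (ESTABLISHED, CLOSING):
            return "allow"
        return "drop"                      # out of state: likely attack traffic

tracker = UnidirectionalFlowTracker()
print(tracker.observe(("203.0.113.7", "198.51.100.1", 1234, 443), {"ACK"}))  # drop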

The state machine that determines the state of the flows was developed in-house and complements Gatebot and dosd. Together, Gatebot, dosd, and flowtrackd provide comprehensive, multi-layer DDoS protection.

Releasing flowtrackd to the Wild

And it works! Less than a day after releasing flowtrackd to an early access customer, flowtrackd automatically detected and mitigated an ACK flood that peaked at 6 million packets per second. No downtime, service disruption, or false positives were reported.

flowtrackd Mitigates 6M pps Flood

Cloudflare’s DDoS Protection – Delivered From Every Data Center

As opposed to legacy scrubbing center providers with limited network infrastructures, Cloudflare provides DDoS Protection from every one of our data centers in over 200 locations around the world. We write our own software-defined DDoS protection systems. Notice I say systems, because as opposed to vendors that use a dedicated third-party appliance, we’re able to write and spin up whatever software we need, deploy it in the optimal location in our tech stack, and are therefore neither dependent on other vendors nor limited to the capabilities of one appliance.

flowtrackd joins the Cloudflare DDoS protection family which includes our veteran Gatebot and the younger and energetic dosd. flowtrackd will be available from every one of our data centers, with a total mitigation capacity of over 37 Tbps, protecting our Magic Transit customers against the most complex TCP DDoS attacks.

New to Magic Transit? Replace your legacy provider with Magic Transit and pay nothing until your current contract expires. Offer expires September 1, 2020. Click here for details.

No Humans Involved: Mitigating a 754 Million PPS DDoS Attack Automatically

Post Syndicated from Omer Yoachimik original https://blog.cloudflare.com/no-humans-involved-mitigating-a-754-million-pps-ddos-attack-automatically/


On June 21, Cloudflare automatically mitigated a highly volumetric DDoS attack that peaked at 754 million packets per second. The attack was part of an organized four-day campaign starting on June 18 and ending on June 21: attack traffic was sent from over 316,000 IP addresses towards a single Cloudflare IP address that was mostly used for websites on our Free plan. No downtime or service degradation was reported during the attack, and no charges accrued to customers due to our unmetered mitigation guarantee.

The attack was detected and handled automatically by Gatebot, our global DDoS detection and mitigation system without any manual intervention by our teams. Notably, because our automated systems were able to mitigate the attack without issue, no alerts or pages were sent to our on-call teams and no humans were involved at all.

Attack Snapshot – Peaking at 754 Mpps. The two different colors in the graph represent two separate systems dropping packets.

During those four days, the attack utilized a combination of three attack vectors over the TCP protocol: SYN floods, ACK floods and SYN-ACK floods. The attack campaign sustained rates exceeding 400-600 million packets per second for multiple hours, and peaked multiple times above 700 million packets per second, with a top peak of 754 million packets per second. Despite the high and sustained packet rates, our edge continued serving our customers during the attack without impacting performance at all.

The Three Types of DDoS: Bits, Packets & Requests

Attacks with high bits per second rates aim to saturate the Internet link by sending more bandwidth per second than the link can handle. Mitigating a bit-intensive flood is similar to a dam blocking gushing water in a canal with limited capacity, allowing just a portion through.

Bit Intensive DDoS Attacks as a Gushing River Blocked By Gatebot

In such cases, the Internet service provider may block or throttle the traffic above the allowance resulting in denial of service for legitimate users that are trying to connect to the website but are blocked by the service provider. In other cases, the link is simply saturated and everything behind that connection is offline.

Swarm of Mosquitoes as a Packet Intensive DDoS Attack

However in this DDoS campaign, the attack peaked at a mere 250 Gbps (I say mere, but ¼ Tbps is enough to knock pretty much anything offline if it isn’t behind some DDoS mitigation service), so it does not seem that the attacker intended to saturate our Internet links, perhaps because they know that our global capacity exceeds 37 Tbps. Instead, it appears the attacker attempted (and failed) to overwhelm our routers and data center appliances with high packet rates reaching 754 million packets per second. As opposed to water rushing towards a dam, a flood of packets can be thought of as a swarm of millions of mosquitoes that you need to zap one by one.

Zapping Mosquitoes with Gatebot

Depending on the ‘weakest link’ in a data center, a packet intensive DDoS attack may impact the routers, switches, web servers, firewalls, DDoS mitigation devices or any other appliance that is used in-line. Typically, a high packet rate may cause the memory buffer to overflow, negating the router’s ability to process additional packets. This is because there’s a small fixed CPU cost of handling each packet, so if you can send a lot of small packets you can block an Internet connection not by filling it but by overwhelming the hardware that handles the connection.

Another form of DDoS attack is one with a high HTTP request per second rate. An HTTP request intensive DDoS attack aims to overwhelm a web server’s resources with more HTTP requests per second than the server can handle. The goal of a DDoS attack with a high request per second rate is to max out the CPU and memory utilization of the server in order to crash it or prevent it from being able to respond to legitimate requests. Request intensive DDoS attacks allow the attacker to generate much less bandwidth, as opposed to bit intensive attacks, and still cause a denial of service.

Automated DDoS Detection & Mitigation

So how did we handle 754 million packets per second? First, Cloudflare’s network utilizes BGP Anycast to spread attack traffic globally across our fleet of data centers. Second, we built our own DDoS protection systems, Gatebot and dosd, which drop packets inside the Linux kernel for maximum efficiency in order to handle massive floods of packets. And third, we built our own L4 load-balancer, Unimog, which uses our appliances’ health and other various metrics to load-balance traffic intelligently within a data center.

In 2017, we published a blog introducing Gatebot, one of our two DDoS protection systems. The blog was titled Meet Gatebot – a bot that allows us to sleep, and that’s exactly what happened during this attack. The attack surface was spread out globally by our Anycast, then Gatebot detected and mitigated the attack automatically without human intervention. Traffic inside each data center was load-balanced intelligently to avoid overwhelming any one machine. And as promised in the blog title, the attack peak did in fact occur while our London team was asleep.

So how does Gatebot work? Gatebot asynchronously samples traffic from every one of our data centers in over 200 locations around the world. It also monitors our customers’ origin server health. It then analyzes the samples to identify patterns and traffic anomalies that can indicate attacks. Once an attack is detected, Gatebot sends mitigation instructions to the edge data centers.

To complement Gatebot, last year we released a new system codenamed dosd (denial of service daemon) which runs in every one of our data centers around the world in over 200 cities. Similarly to Gatebot, dosd detects and mitigates attacks autonomously but in the scope of a single server or data center. You can read more about dosd in our recent blog.

The DDoS Landscape

While in recent months we’ve observed a decrease in the size and duration of DDoS attacks, highly volumetric and globally distributed DDoS attacks such as this one still persist. Regardless of the size, type or sophistication of the attack, Cloudflare offers unmetered DDoS protection to all customers and plan levels—including the Free plans.

Sandboxing in Linux with zero lines of code

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/sandboxing-in-linux-with-zero-lines-of-code/


Modern Linux operating systems provide many tools to run code more securely. There are namespaces (the basic building blocks for containers), Linux Security Modules, Integrity Measurement Architecture, and so on.

In this post we will review Linux seccomp and learn how to sandbox any (even a proprietary) application without writing a single line of code.


Tux by Iwan Gabovitch, GPL
Sandbox, Simplified Pixabay License

Linux system calls

System calls (syscalls) are a well-defined interface between userspace applications and the operating system (OS) kernel. On modern operating systems, most applications provide only application-specific logic as code. Applications do not, and most of the time cannot, directly access low-level hardware or networking when they need to store data or send something over the wire. Instead, they use system calls to ask the OS kernel to perform specific hardware and networking tasks on their behalf.


Apart from providing a generic, high-level way for applications to interact with low-level hardware, the system call architecture allows the OS kernel to manage available resources between applications, as well as enforce policies like application permissions, networking access control lists, etc.

Linux seccomp

Linux seccomp is yet another syscall on Linux, but it is a bit special, because it influences how the OS kernel will behave when the application uses other system calls. By default, the OS kernel has almost no insight into userspace application logic, so it provides all the possible services it can. But not all applications require all services. Consider an application which converts image formats: it needs the ability to read and write data from disk, but in its simplest form probably does not need any network access. Using seccomp, an application can declare its intentions in advance to the Linux kernel. For this particular case it can notify the kernel that it will be using the read and write system calls, but never the send and recv system calls (because its intent is to work with local files and never with the network). It’s like establishing a contract between the application and the OS kernel.


But what happens if the application later breaks the contract and tries to use one of the system calls it promised not to use? The kernel will “penalise” the application, usually by immediately terminating it. Linux seccomp also allows less restrictive actions for the kernel to take:

  • instead of terminating the whole application, the kernel can be requested to terminate only the thread that issued the prohibited system call
  • the kernel may just send a SIGSYS signal to the calling thread
  • the seccomp policy can specify an error code, which the kernel will then return to the calling application instead of executing the prohibited system call
  • if the violating process is under ptrace (for example executing under a debugger), the kernel can notify the tracer (the debugger) that a prohibited system call is about to happen and let the debugger decide what to do
  • the kernel may be instructed to allow and execute the system call, but log the attempt: this is useful when we want to verify that our seccomp policy is not too tight, without the risk of terminating the application and potentially creating an outage

Although there is a lot of flexibility in defining the potential penalty for the application, from a security perspective it is usually best to stick with the complete application termination upon seccomp policy violation. The reason for that will be described later in the examples in the post.

So why would the application take the risk of being abruptly terminated and declare its intentions beforehand, if it can just stay “silent” and let the OS kernel allow any system call by default? For a normally behaving application it makes no sense, of course, but it turns out this feature is quite effective at protecting against rogue applications and arbitrary code execution exploits.

Imagine our image format converter is written in some unsafe language and an attacker was able to take control of the application by making it process a malformed image. The attacker might try to steal sensitive information from the machine running our converter and send it to themselves over the network. By default, the OS kernel will most likely allow it, and a data leak will happen. But if our image converter had “confined” (or sandboxed) itself beforehand to only reading and writing local data, the kernel would terminate the application when it tried to leak the data over the network, preventing the leak and locking the attacker out of our system!

Integrating seccomp into the application

To see how seccomp can be used in practice, let’s consider a toy example program

myos.c:

#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname name;

    if (uname(&name)) {
        perror("uname failed: ");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

This is a simplified version of the uname command line tool, which just prints your operating system name. Like its full-featured counterpart, it uses the uname system call to get the name of the current operating system from the kernel. Let’s see it in action:

$ gcc -o myos myos.c
$ ./myos
My OS is Linux!

Great! We’re on Linux, so we can experiment further with seccomp (it is a Linux-only feature). Notice that we’re properly handling the error code after invoking the uname system call. However, according to the man page it can only fail when the passed-in buffer pointer is invalid, in which case the error number is set to “EINVAL”, which translates to “invalid argument”. In our case, the “struct utsname” structure is allocated on the stack, so our pointer will always be valid. In other words, under normal circumstances the uname system call should never fail in this particular program.

To illustrate seccomp capabilities we will add a “sandbox” function to our program before the main logic

myos_raw_seccomp.c:

#include <linux/seccomp.h>
#include <linux/filter.h>
#include <linux/audit.h>
#include <sys/ptrace.h>
#include <sys/prctl.h>

#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
#include <sys/utsname.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

static void sandbox(void)
{
    struct sock_filter filter[] = {
        /* seccomp(2) says we should always check the arch */
        /* as syscalls may have different numbers on different architectures */
        /* see https://fedora.juszkiewicz.com.pl/syscalls.html */
        /* for simplicity we only allow x86_64 */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),
        /* if not x86_64, tell the kernel to kill the process */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 0, 4),
        /* get the actual syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
        /* if "uname", tell the kernel to return EPERM, otherwise just allow */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_uname, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = (unsigned short) (sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* see seccomp(2) on why this is needed */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("PR_SET_NO_NEW_PRIVS failed");
        exit(1);
    };

    /* glibc does not have a wrapper for seccomp(2) */
    /* invoke it via the generic syscall wrapper */
    if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog)) {
        perror("seccomp failed");
        exit(1);
    };
}

int main(void)
{
    struct utsname name;

    sandbox();

    if (uname(&name)) {
        perror("uname failed");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

To sandbox itself the application defines a BPF program, which implements the desired sandboxing policy. Then the application passes this program to the kernel via the seccomp system call. The kernel does some validation checks to ensure the BPF program is OK, and then runs this program on every system call the application makes. The result of executing the program tells the kernel whether the current call complies with the desired policy. In other words, the BPF program is the “contract” between the application and the kernel.

In our toy example above, the BPF program simply checks which system call is about to be invoked. If the application is trying to use the uname system call, we tell the kernel to return an EPERM (which stands for “operation not permitted”) error code. We also tell the kernel to allow any other system call. Let’s see if it works now:

$ gcc -o myos myos_raw_seccomp.c
$ ./myos
uname failed: Operation not permitted

uname now fails with the EPERM error code, and EPERM is not even described as a potential failure code in the uname man page! So we know this happened because we “told” the kernel to prohibit the uname syscall and to return EPERM instead. We can double-check this by replacing EPERM with some other error code which is totally inappropriate for this context, for example ENETDOWN (“network is down”). Why would we need the network to be up just to get the currently executing OS? Yet, recompiling and rerunning the program we get:

$ gcc -o myos myos_raw_seccomp.c
$ ./myos
uname failed: Network is down

We can also verify the other part of our “contract” works as expected. We told the kernel to allow any other system call, remember? In our program, when uname fails, we convert the error code to a human readable message and print it on the screen with the perror function. To print on the screen perror uses the write system call under the hood and since we can actually see the printed error message, we know that the kernel allowed our program to make the write system call in the first place.

seccomp with libseccomp

While it is possible to use seccomp directly, as in the examples above, BPF programs are cumbersome to write by hand and hard to debug, review and update later. That’s why it is usually a good idea to use a higher-level library, which abstracts away most of the low-level details. Luckily such a library exists: it is called libseccomp and is even recommended by the seccomp man page.

Let’s rewrite our program’s sandbox() function to use this library instead:

myos_libseccomp.c:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <seccomp.h>
#include <err.h>

static void sandbox(void)
{
    /* allow all syscalls by default */
    scmp_filter_ctx seccomp_ctx = seccomp_init(SCMP_ACT_ALLOW);
    if (!seccomp_ctx)
        err(1, "seccomp_init failed");

    /* kill the process, if it tries to use "uname" syscall */
    if (seccomp_rule_add_exact(seccomp_ctx, SCMP_ACT_KILL, seccomp_syscall_resolve_name("uname"), 0)) {
        perror("seccomp_rule_add_exact failed");
        exit(1);
    }

    /* apply the composed filter */
    if (seccomp_load(seccomp_ctx)) {
        perror("seccomp_load failed");
        exit(1);
    }

    /* release allocated context */
    seccomp_release(seccomp_ctx);
}

int main(void)
{
    struct utsname name;

    sandbox();

    if (uname(&name)) {
        perror("uname failed: ");
        return 1;
    }

    printf("My OS is %s!\n", name.sysname);
    return 0;
}

Our sandbox() function not only became shorter and much more readable, but also lets us reference syscalls in our rules by name rather than by internal number, and spares us other quirks, like setting the PR_SET_NO_NEW_PRIVS bit and dealing with system architectures.

It is worth noting we have modified our seccomp policy a bit. In the raw seccomp example above we instructed the kernel to return an error code when the application tries to execute a prohibited syscall. This is good for demonstration purposes, but in most cases a stricter action is required. Just returning an error code and allowing the application to continue gives potentially malicious code a chance to bypass the policy. There are many syscalls in Linux and some of them do the same or similar things. For example, we might want to prohibit the application from reading data from disk, so we deny the read syscall in our policy and tell the kernel to return an error code instead. However, if the application does get exploited, the exploit code/logic might look like below:

…
if (-1 == read(fd, buf, count)) {
    /* hm… read failed, but what about pread? */
    if (-1 == pread(fd, buf, count, offset)) {
        /* what about readv? */ ...
    }
    /* bypassed the prohibited read(2) syscall */
}
…

Wait, what?! There is more than one read system call? Yes: there are read, pread, readv, as well as more obscure ones like io_submit and io_uring_enter. Of course, it is our fault for providing an incomplete seccomp policy which does not block all possible read syscalls. But if we had at least instructed the kernel to terminate the process immediately upon violation of the first plain read, the malicious code above would not have had the chance to be clever and try other options.

Given the above, the libseccomp example uses a stricter policy, which tells the kernel to terminate the process upon a policy violation. Let’s see if it works:

$ gcc -o myos myos_libseccomp.c -lseccomp
$ ./myos
Bad system call

Notice that we need to link against libseccomp when compiling the application. Also, when we run the application, we don’t see the uname failed: Operation not permitted error output anymore, because we don’t give the application the ability to even print a failure message. Instead, we see a Bad system call message from the shell, which tells us that the application was terminated with a SIGSYS signal. Great!

zero code seccomp

The previous examples worked fine, but both of them have one disadvantage: we actually needed to modify the source code to embed our desired seccomp policy into the application. This is because the seccomp syscall affects the calling process and its children, but there is no interface to inject the policy from “outside”. It is expected that developers will sandbox their code themselves as part of the application logic, but in practice this rarely happens. When developers start a new project, most of the time the focus is on primary functionality, and security features are either postponed or omitted altogether. Also, most real-world software is written using some high-level programming language and/or framework, where developers do not deal with system calls directly and probably are not even aware which system calls their code uses.

On the other hand we have system operators, sysadmins, SRE and other folks, who run the above code in production. They are more incentivized to keep production systems secure, thus would probably want to sandbox the services as much as possible. But most of the time they don’t have access to the source code. So there are mismatched expectations: developers have the ability to sandbox their code, but are usually not incentivized to do so and operators have the incentive to sandbox the code, but don’t have the ability.

This is where “zero code seccomp” can help: an external operator injects the desired sandbox policy into any process without modifying any source code. Systemd is one of the popular implementations of a “zero code seccomp” approach. Systemd-managed services can have a SystemCallFilter= directive defined in their unit files, listing all the system calls the managed service is allowed to make. As an example, let’s go back to our toy application without any sandboxing code embedded:

$ gcc -o myos myos.c
$ ./myos
My OS is Linux!

Now we can run the same code with systemd, but prohibit the application for using uname without changing or recompiling any code (we’re using systemd-run to create an ephemeral systemd service unit for us):

$ systemd-run --user --pty --same-dir --wait --collect --service-type=exec --property="SystemCallFilter=~uname" ./myos
Running as unit: run-u0.service
Press ^] three times within 1s to disconnect TTY.
Finished with result: signal
Main processes terminated with: code=killed/status=SYS
Service runtime: 6ms

We don’t see the normal My OS is Linux! output anymore, and systemd conveniently tells us that the managed process was terminated with a SIGSYS signal. We can even go further and use another directive, SystemCallErrorNumber=, to configure our seccomp policy not to terminate the application, but to return an error code instead, as in our raw seccomp example earlier:

$ systemd-run --user --pty --same-dir --wait --collect --service-type=exec --property="SystemCallFilter=~uname" --property="SystemCallErrorNumber=ENETDOWN" ./myos
Running as unit: run-u2.service
Press ^] three times within 1s to disconnect TTY.
uname failed: Network is down
Finished with result: exit-code
Main processes terminated with: code=exited/status=1
Service runtime: 6ms

systemd small print

Great! We can now inject almost any seccomp policy into any process without the need to write any code or recompile the application. However, there is an interesting statement in the systemd documentation:

…Note that the execve, exit, exit_group, getrlimit, rt_sigreturn, sigreturn system calls and the system calls for querying time and sleeping are implicitly whitelisted and do not need to be listed explicitly…

Some system calls are implicitly allowed and we don’t have to list them. This is mostly related to the way how systemd manages processes and injects the seccomp policy. We established earlier that seccomp policy applies to the current process and its children. So, to inject the policy, systemd forks itself, calls seccomp in the forked process and then execs the forked process into the target application. That’s why always allowing the execve system call is necessary in the first place, because otherwise systemd cannot do its job as a service manager.

But what if we want to explicitly prohibit some of these system calls? If we continue with the execve as an example, that can actually be a dangerous system call most applications would want to prohibit. Seccomp is an effective tool to protect the code from arbitrary code execution exploits, remember? If a malicious actor takes over our code, most likely the first thing they will try is to get a shell (or replace our code with any other application which is easier to control) by directing our code to call execve with the desired binary. So, if our code does not need execve for its main functionality, it would be a good idea to prohibit it. Unfortunately, it is not possible with the systemd SystemCallFilter= approach…

Introducing Cloudflare sandbox

We really liked the “zero code seccomp” approach of the systemd SystemCallFilter= directive, but were not satisfied with its limitations. We decided to take it one step further and make it possible to prohibit any system call in any process externally, without touching its source code, so we came up with the Cloudflare sandbox. It’s a simple standalone toolkit consisting of a shared library and an executable. The shared library is meant for dynamically linked applications and the executable is for statically linked applications.

sandboxing dynamically linked executables

For dynamically linked executables it is possible to inject custom code into the process by using the LD_PRELOAD environment variable. The libsandbox.so shared library from our toolkit contains a so-called initialization routine, which is executed before the main logic. This is how we make the target application sandbox itself:

  • LD_PRELOAD tells the dynamic loader to load our libsandbox.so as part of the application, when it starts
  • the runtime executes the initialization routine from libsandbox.so before most of the main logic
  • our initialization routine configures the sandbox policy described in special environment variables
  • by the time the main application logic begins executing, the target process has the configured seccomp policy enforced

Let’s see how it works with our myos toy tool. First, we need to make sure it is actually a dynamically linked application:

$ ldd ./myos
	linux-vdso.so.1 (0x00007ffd8e1e3000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f339ddfb000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f339dfcf000)

Yes, it is. Now, let’s prohibit it from using the uname system call with our toolkit:

$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=uname ./myos
adding uname to the process seccomp filter
Bad system call

Yet again, we’ve managed to inject our desired seccomp policy into the myos application without modifying or recompiling it. The advantage of this approach is that it doesn’t have the shortcomings of systemd’s SystemCallFilter=, so we can block any system call (luckily, Bash is a dynamically linked application as well):

$ /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
I will try to execve something...
Doing arbitrary code execution!!!
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=execve /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
adding execve to the process seccomp filter
I will try to execve something...
Bad system call

The only problem here is that we may accidentally forget to LD_PRELOAD our libsandbox.so library and potentially run unprotected. Also, as described in the man page, LD_PRELOAD has some limitations. We can overcome all these problems by making libsandbox.so a permanent part of our target application:

$ patchelf --add-needed /usr/lib/x86_64-linux-gnu/libsandbox.so ./myos
$ ldd ./myos
	linux-vdso.so.1 (0x00007fff835ae000)
	/usr/lib/x86_64-linux-gnu/libsandbox.so (0x00007fc4f55f2000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc4f5425000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fc4f5647000)

Again, we didn’t need access to the source code here, but patched the compiled binary instead. Now we can just configure our seccomp policy as before without the need of LD_PRELOAD:

$ ./myos
My OS is Linux!
$ SECCOMP_SYSCALL_DENY=uname ./myos
adding uname to the process seccomp filter
Bad system call

sandboxing statically linked executables

The above method is quite convenient and easy, but it doesn’t work for statically linked executables:

$ gcc -static -o myos myos.c
$ ldd ./myos
	not a dynamic executable
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_DENY=uname ./myos
My OS is Linux!

This is because there is no dynamic loader involved in starting a statically linked executable, so LD_PRELOAD has no effect. For this case our toolkit contains a special application launcher, which injects the seccomp rules similarly to the way systemd does it:

$ sandboxify ./myos
My OS is Linux!
$ SECCOMP_SYSCALL_DENY=uname sandboxify ./myos
adding uname to the process seccomp filter

Note that we don’t see the Bad system call shell message anymore, because our target executable is started by the launcher instead of the shell directly. Unlike systemd, however, we can use this launcher to block dangerous system calls, like execve, as well:

$ sandboxify /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
I will try to execve something...
Doing arbitrary code execution!!!
$ SECCOMP_SYSCALL_DENY=execve sandboxify /bin/bash -c 'echo I will try to execve something...; exec /usr/bin/echo Doing arbitrary code execution!!!'
adding execve to the process seccomp filter
I will try to execve something...

sandboxify vs libsandbox.so

From the examples above you may notice that it is possible to use sandboxify with dynamically linked executables as well, so why even bother with libsandbox.so? The difference becomes visible when we stop using the “denylist” policy from most examples in this post and switch to the preferred “allowlist” policy, where we explicitly allow only the system calls we need and prohibit everything else.

Let’s convert our toy application back into the dynamically-linked one and try to come up with the minimal list of allowed system calls it needs to function properly:

$ gcc -o myos myos.c
$ ldd ./myos
	linux-vdso.so.1 (0x00007ffe027f6000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4f1410a000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f4f142de000)
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libsandbox.so SECCOMP_SYSCALL_ALLOW=exit_group:fstat:uname:write ./myos
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter
My OS is Linux!

So we need to allow four system calls: exit_group:fstat:uname:write. This is the tightest “sandbox” that still doesn’t break the application. If we remove any system call from this list, the application will terminate with the Bad system call message (try it yourself!).

If we use the same allowlist, but with the sandboxify launcher, things do not work anymore:

$ SECCOMP_SYSCALL_ALLOW=exit_group:fstat:uname:write sandboxify ./myos
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter

The reason is that sandboxify and libsandbox.so inject seccomp rules at different stages of the process lifecycle. Consider the following very high level diagram of a process startup:

Sandboxing in Linux with zero lines of code

In a nutshell, every process has two runtime stages: “runtime init” and the “main logic”. The main logic is basically the code in the program’s main() function, plus the other code put there by the application developers. But a process usually needs to do some work before the code in the main() function is able to execute – we call this work the “runtime init” in the diagram above. Developers do not write this code directly; most of the time it is automatically generated by the compiler toolchain used to compile the source code.

To do its job, the “runtime init” stage uses a lot of different system calls, but most of them are not needed later in the “main logic” stage. If we’re using the “allowlist” approach for our sandboxing, it does not make sense to allow these system calls for the whole duration of the program if they are only used once at program init. This is where the difference between libsandbox.so and sandboxify comes from: libsandbox.so enforces the seccomp rules after the “runtime init” stage has already executed, so we don’t have to allow most system calls from that stage. sandboxify, on the other hand, enforces the policy before the “runtime init” stage, so we have to allow all the system calls from both stages, which usually results in a bigger allowlist and thus a wider attack surface.

Going back to our toy myos example, here is the minimal list of all the system calls we need to allow to make the application work under our sandbox:

$ SECCOMP_SYSCALL_ALLOW=access:arch_prctl:brk:close:exit_group:fstat:mmap:mprotect:munmap:openat:read:uname:write sandboxify ./myos
adding access to the process seccomp filter
adding arch_prctl to the process seccomp filter
adding brk to the process seccomp filter
adding close to the process seccomp filter
adding exit_group to the process seccomp filter
adding fstat to the process seccomp filter
adding mmap to the process seccomp filter
adding mprotect to the process seccomp filter
adding munmap to the process seccomp filter
adding openat to the process seccomp filter
adding read to the process seccomp filter
adding uname to the process seccomp filter
adding write to the process seccomp filter
My OS is Linux!

That’s 13 syscalls versus 4 syscalls with the libsandbox.so approach!

Conclusions

In this post we discussed how to easily sandbox applications on Linux without the need to write any additional code. We introduced the Cloudflare sandbox toolkit and discussed the different approaches we take at sandboxing dynamically linked applications vs statically linked applications.

Having safer code online helps to build a better Internet, and we would be happy if you find our sandbox toolkit useful. We look forward to feedback, improvements and other contributions!

CVE-2020-5902: Helping to protect against the F5 TMUI RCE vulnerability

Post Syndicated from Michael Tremante original https://blog.cloudflare.com/cve-2020-5902-helping-to-protect-against-the-f5-tmui-rce-vulnerability/

CVE-2020-5902: Helping to protect against the F5 TMUI RCE vulnerability

Cloudflare has deployed a new managed rule protecting customers against a remote code execution vulnerability that has been found in F5 BIG-IP’s web-based Traffic Management User Interface (TMUI). Any customer who has access to the Cloudflare Web Application Firewall (WAF) is automatically protected by the new rule (100315) that has a default action of BLOCK.

Initial analysis of traffic on our network shows that attackers started probing and trying to exploit this vulnerability on July 3.

F5 has published detailed instructions on how to patch affected devices, how to detect if attempts have been made to exploit the vulnerability on a device and instructions on how to add a custom mitigation. If you have an F5 device, read their detailed mitigations before reading the rest of this blog post.

The most popular probe URL appears to be /tmui/login.jsp/..;/tmui/locallb/workspace/fileRead.jsp followed by /tmui/login.jsp/..;/tmui/util/getTabSet.jsp, /tmui/login.jsp/..;/tmui/system/user/authproperties.jsp and /tmui/login.jsp/..;/tmui/locallb/workspace/tmshCmd.jsp. All contain the critical pattern ..; which is at the heart of the vulnerability.

On July 3 we saw O(1k) probes ramping to O(1m) yesterday. This is because simple test patterns have been added to scanning tools and small test programs made available by security researchers.

CVE-2020-5902: Helping to protect against the F5 TMUI RCE vulnerability

The Vulnerability

The vulnerability was disclosed by the vendor on July 1 and allows both authenticated and unauthenticated users to perform remote code execution (RCE).

Remote Code Execution is a type of code injection that gives the attacker the ability to run arbitrary code on the target application, allowing them, in most scenarios such as this one, to gain privileged access and perform a full system takeover.

The vulnerability affects the administration interface only (the management dashboard), not the underlying data plane provided by the application.

How to Mitigate

If updating the application is not possible, the attack can be mitigated by blocking all requests that match the following regular expression in the URL:

.*\.\.;.*

The above regular expression matches two dot characters (.) followed by a semicolon within any sequence of characters.
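As a quick sanity check – a sketch using Python’s re module – the pattern flags each of the probe URLs listed earlier while leaving ordinary paths alone:

import re

# Two literal dots followed by a semicolon, anywhere in the URL.
pattern = re.compile(r"\.\.;")

for url in [
    "/tmui/login.jsp/..;/tmui/locallb/workspace/fileRead.jsp",
    "/tmui/login.jsp/..;/tmui/util/getTabSet.jsp",
    "/tmui/login.jsp",  # a benign path for comparison
]:
    print(url, "->", "BLOCK" if pattern.search(url) else "ALLOW")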

Customers who are using the Cloudflare WAF, that have their F5 BIG-IP TMUI interface proxied behind Cloudflare, are already automatically protected from this vulnerability with rule 100315. If you wish to turn off the rule or change the default action:

  1. Head over to the Cloudflare Firewall, click on Managed Rules, then follow the advanced link under the Cloudflare Managed Rule set,
  2. Search for rule ID: 100315,
  3. Select any appropriate action or disable the rule.

How to test HTTP/3 and QUIC with Firefox Nightly

Post Syndicated from Lucas Pardue original https://blog.cloudflare.com/how-to-test-http-3-and-quic-with-firefox-nightly/

How to test HTTP/3 and QUIC with Firefox Nightly

HTTP/3 is the third major version of the Hypertext Transfer Protocol, which takes the bold step of moving away from TCP to the new transport protocol QUIC in order to provide performance and security improvements.

During Cloudflare’s Birthday Week 2019, we were delighted to announce that we had enabled QUIC and HTTP/3 support on the Cloudflare edge network. This was joined by support from Google Chrome and Mozilla Firefox, two of the leading browser vendors and partners in our effort to make the web faster and more reliable for all. A big part of developing new standards is interoperability, which typically means different people analysing, implementing and testing a written specification in order to prove that it is precise, unambiguous, and actually implementable.

At the time of our announcement, Chrome Canary had experimental HTTP/3 support and we were eagerly awaiting a release of Firefox Nightly. Now that Firefox supports HTTP/3 we thought we’d share some instructions to help you enable and test it yourselves.

How do I enable HTTP/3 for my domain?

Simply go to the Cloudflare dashboard and flip the switch from the “Network” tab manually:

How to test HTTP/3 and QUIC with Firefox Nightly

Using Firefox Nightly as an HTTP/3 client

Firefox Nightly has experimental support for HTTP/3. In our experience things are pretty good, but you might hit some teething issues, so bear that in mind if you decide to enable and experiment with HTTP/3. If you’re happy with that responsibility, you’ll first need to download and install the latest Firefox Nightly build. Then open Firefox and enable HTTP/3 by visiting “about:config” and setting “network.http.http3.enabled” to true. There are some other parameters that can be tweaked but the defaults should suffice.

How to test HTTP/3 and QUIC with Firefox Nightly
about:config can be filtered by using a search term like “http3”.

Once HTTP/3 is enabled, you can visit your site to test it out. A straightforward way to check if HTTP/3 was negotiated is to check the Developer Tools “Protocol” column in the “Network” tab (on Windows and Linux the Developer Tools keyboard shortcut is Ctrl+Shift+I, on macOS it’s Command+Option+I). This “Protocol” column might not be visible at first, so to enable it right-click one of the column headers and check “Protocol” as shown below.

How to test HTTP/3 and QUIC with Firefox Nightly

Then reload the page and you should see that “HTTP/3” is reported.

How to test HTTP/3 and QUIC with Firefox Nightly

The aforementioned teething issues might cause HTTP/3 not to show up initially. When you enable HTTP/3 on a zone, we add a header field such as alt-svc: h3-27=":443"; ma=86400, h3-28=":443"; ma=86400, h3-29=":443"; ma=86400 to all responses for that zone. Clients see this as an advertisement to try HTTP/3 out and will take up the offer on the next request. So to make this happen you can reload the page but make sure that you bypass the local browser cache (via the “Disable Cache” checkbox, or use the Shift-F5 key combo) or else you’ll just see the protocol used to fetch the resource the first time around. Finally, Firefox provides the “about:networking” page which provides a list of visited zones and the HTTP version that was used to load them; for example, this very blog.
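If you want to check the advertisement itself, any HTTP client that prints response headers will do; as a sketch, with Python’s standard library (substitute your own URL):

import urllib.request

# Fetch headers over HTTP/1.1 (urllib does not speak HTTP/3) and print
# the alt-svc header that advertises HTTP/3 to capable clients.
req = urllib.request.Request("https://blog.cloudflare.com/", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("alt-svc"))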

How to test HTTP/3 and QUIC with Firefox Nightly
about:networking contains a table of all visited zones and the connection properties.

Sometimes browsers stick to an existing HTTP connection and will refuse to start an HTTP/3 connection. This is hard for a human to detect, so sometimes the best option is to close the app completely and reopen it. Finally, we’ve also seen some interactions with Service Workers that make it appear that a resource was fetched from the network using HTTP/1.1, when in fact it was fetched from the local Service Worker cache. In such cases, if you’re keen to see HTTP/3 in action, you’ll need to deregister the Service Worker. If you’re in doubt about what is happening on the network, it is often useful to verify things independently, for example by capturing a packet trace and dissecting it with Wireshark.

What’s next?

The QUIC Working Group recently announced a “Working Group Last Call”, which marks an important milestone in the continued maturity of the standards. From the announcement:

After more than three and a half years and substantial discussion, all 845 of the design issues raised against the QUIC protocol drafts have gained consensus or have a proposed resolution. In that time the protocol has been considerably transformed; it has become more secure, much more widely implemented, and has been shown to be interoperable. Both the Chairs and the Editors feel that it is ready to proceed in standardisation.

The coming months will see the specifications settle and we anticipate that implementations will continue to improve their QUIC and HTTP/3 support, eventually enabling it in their stable channels. We’re pleased to continue working with industry partners such as Mozilla to help build a better Internet together.

In the meantime, you might want to check out our guides to testing with other implementations such as Chrome Canary or curl. As compatibility becomes proven, implementations will shift towards optimizing their performance; you can read about Cloudflare’s efforts on comparing HTTP/3 to HTTP/2 and the work we’ve done to improve performance by adding support for CUBIC and HyStart++ to our congestion control module.

Setting up two-factor authentication on your Raspberry Pi

Post Syndicated from Alasdair Allan original https://www.raspberrypi.org/blog/setting-up-two-factor-authentication-on-your-raspberry-pi/

Enabling two-factor authentication (2FA) to boost security for your important accounts is becoming a lot more common these days. However, you might be surprised to learn that you can do the same with your Raspberry Pi. You can enable 2FA on your Raspberry Pi, and afterwards you’ll be challenged for a verification code when you access it remotely via Secure Shell (SSH).

Accessing your Raspberry Pi via SSH

A lot of people use a Raspberry Pi at home as a file or media server. This has become rather common with the launch of Raspberry Pi 4, which has both USB 3 and Gigabit Ethernet. However, when you’re setting up this sort of server you often want to run it “headless”: without a monitor, keyboard, or mouse. This is especially true if you intend to tuck your Raspberry Pi away behind your television, or somewhere else out of the way. In any case, it means that you are going to need to enable Secure Shell (SSH) for remote access.

However, it’s also pretty common to set up your server so that you can access your files when you’re away from home, making your Raspberry Pi accessible from the Internet.

Most of us aren’t going to be out of the house much for a while yet, but if you’re taking the time right now to build a file server, you might want to think about adding some extra security. Especially if you intend to make the server accessible from the Internet, you probably want to enable two-factor authentication (2FA) using Time-based One-Time Password (TOTP).

What is two-factor authentication?

Two-factor authentication is an extra layer of protection. As well as a password, “something you know,” you’ll need another piece of information to log in. This second factor will be based either on “something you have,” like a smart phone, or on “something you are,” like biometric information.

We’re going to go ahead and set up “something you have,” and use your smart phone as the second factor to protect your Raspberry Pi.
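Under the hood, the TOTP scheme we are about to configure is pleasantly simple: your phone and your Raspberry Pi will share a secret, and each side derives a six-digit code from that secret and the current time. Here is a minimal sketch of the algorithm in Python (RFC 6238 with the usual defaults of SHA-1, 30-second steps and six digits; the secret below is made up for illustration):

import base64, hashlib, hmac, struct, time

def totp(secret_b32, step=30, digits=6):
    # Compute an RFC 6238 TOTP code from a base32-encoded shared secret.
    key = base64.b32decode(secret_b32, casefold=True)
    counter = struct.pack(">Q", int(time.time()) // step)  # 30-second window
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                             # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

print(totp("JBSWY3DPEHPK3PXP"))

Because both sides compute the same function over the same secret and clock, the server can verify the code without any network round trip to your phone.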

Updating the operating system

The first thing you should do is make sure your Raspberry Pi is up to date with the latest version of Raspbian. If you’re running a relatively recent version of the operating system you can do that from the command line:

$ sudo apt-get update
$ sudo apt-get full-upgrade

If you’re pulling your Raspberry Pi out of a drawer for the first time in a while, though, you might want to go as far as to install a new copy of Raspbian using the new Raspberry Pi Imager, so you know you’re working from a good image.

Enabling Secure Shell

The Raspbian operating system has the SSH server disabled on boot. However, since we’re intending to run the board without a monitor or keyboard, we need to enable it if we want to be able to SSH into our Raspberry Pi.

The easiest way to enable SSH is from the desktop. Go to the Raspbian menu and select “Preferences > Raspberry Pi Configuration”. Next, select the “Interfaces” tab and click on the radio button to enable SSH, then hit “OK.”

You can also enable it from the command line using systemctl:

$ sudo systemctl enable ssh
$ sudo systemctl start ssh

Alternatively, you can enable SSH using raspi-config, or, if you’re installing the operating system for the first time, you can enable SSH as you burn your SD Card.

Enabling challenge-response

Next, we need to tell the SSH daemon to enable “challenge-response” passwords. Go ahead and open the SSH config file:

$ sudo nano /etc/ssh/sshd_config

Enable challenge response by changing ChallengeResponseAuthentication from the default no to yes.

Editing /etc/ssh/sshd_config.

Then restart the SSH daemon:

$ sudo systemctl restart ssh

It’s a good idea to open up a terminal on your laptop and check that you can still SSH into your Raspberry Pi at this point – although you won’t be prompted for a 2FA code quite yet – so you know everything still works before moving on.

Installing two-factor authentication

The first thing you need to do is download an app to your phone that will generate the TOTP. One of the most commonly used is Google Authenticator. It’s available for Android, iOS, and Blackberry, and there is even an open source version of the app available on GitHub.

Google Authenticator in the App Store.

So go ahead and install Google Authenticator, or another 2FA app like Authy, on your phone. Afterwards, install the Google Authenticator PAM module on your Raspberry Pi:

$ sudo apt install libpam-google-authenticator

Now we have 2FA installed on both our phone, and our Raspberry Pi, we’re ready to get things configured.

Configuring two-factor authentication

You should now run Google Authenticator from the command line — without using sudo — on your Raspberry Pi in order to generate a QR code:

$ google-authenticator

Afterwards you’re probably going to have to resize the Terminal window so that the QR code is rendered correctly. Unfortunately, it’s just slightly wider than the standard 80 characters across.

The QR code generated by google-authenticator. Don’t worry, this isn’t the QR code for my key; I generated one just for this post that I didn’t use.

Don’t move forward quite yet! Before you do anything else you should copy the emergency codes and put them somewhere safe.

These codes will let you access your Raspberry Pi — and turn off 2FA — if you lose your phone. Without them, you won’t be able to SSH into your Raspberry Pi if you lose or break the device you’re using to authenticate.

Next, before we continue with Google Authenticator on the Raspberry Pi, open the Google Authenticator app on your phone and tap the plus sign (+) at the top right, then tap on “Scan barcode.”

Your phone will ask you whether you want to allow the app access to your camera; you should say “Yes.” The camera view will open. Position the barcode squarely in the green box on the screen.

Scanning the QR code with the Google Authenticator app.

As soon as your phone app recognises the QR code it will add your new account, and it will start generating TOTP codes automatically.

The TOTP in Google Authenticator app.

Your phone will generate a new one-time password every thirty seconds. However, this code isn’t going to be all that useful until we finish what we were doing on your Raspberry Pi. Switch back to your terminal window and answer “Y” when asked whether Google Authenticator should update your .google_authenticator file.

Then answer “Y” to disallow multiple uses of the same authentication token, “N” when asked about increasing the time skew window, and “Y” to enable rate limiting to protect against brute-force attacks.

You’re done here. Now all we have to do is enable 2FA.

Enabling two-factor authentication

We’re going to use Linux Pluggable Authentication Modules (PAM), which provides dynamic authentication support for applications and services, to add 2FA to SSH on Raspberry Pi.

Now we need to configure PAM to add 2FA:

$ sudo nano /etc/pam.d/sshd

Add auth required pam_google_authenticator.so near the top of the file. You can do this either above or below the line that says @include common-auth.

Editing /etc/pam.d/sshd.

As I prefer to be prompted for my verification code after entering my password, I’ve added this line after the @include line. If you want to be prompted for the code before entering your password you should add it before the @include line.

Now restart the SSH daemon:

$ sudo systemctl restart ssh

Next, open up a terminal window on your laptop and try and SSH into your Raspberry Pi.

Wrapping things up

If everything has gone to plan, when you SSH into the Raspberry Pi, you should be prompted for a TOTP after being prompted for your password.

SSH’ing into my Raspberry Pi.

You should go ahead and open Google Authenticator on your phone, and enter the six-digit code when prompted. Then you should be logged into your Raspberry Pi as normal.

You’ll now need your phone, and a TOTP, every time you ssh into, or scp to and from, your Raspberry Pi. But because of that, you’ve just given a huge boost to the security of your device.

Now that you have the Google Authenticator app on your phone, you should probably start enabling 2FA for your important services and sites — like Google, Twitter, Amazon, and others — since most bigger sites, and many smaller ones, now support two-factor authentication.

The post Setting up two-factor authentication on your Raspberry Pi appeared first on Raspberry Pi.

Enhancing site security with new Lightsail firewall features

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/enhancing-site-security-with-new-lightsail-firewall-features/

This post is contributed by Mike Coleman, AWS Senior Developer Advocate – Lightsail

Amazon Lightsail provides an easy way to get started with AWS for many customers. The service balances ease of use, security, and flexibility. The Lightsail firewall now offers additional features to help customers secure their Lightsail instances. This update offers three new capabilities:

  • The ability to specify source IP addresses for firewall rules
  • Explicitly allowing or disallowing remote access to instances via Lightsail’s web-based console
  • Support for PING

This blog explores each of these new features in detail, starting with source IP addresses.

Before this update, any open ports in the Lightsail firewall were open to the internet. In many cases, this is a reasonable approach. For example, for new WordPress servers, you likely need broad public access.

However, in some cases you want to restrict access to an instance. If you are staging a new website and it’s not ready for publication, you may want to limit access. One way to ensure that only certain people can visit the site is to only allow certain IP addresses to connect.

Another common use case is limiting remote access to an instance. With the new changes to the Lightsail firewall, you would be able to limit SSH or RDP access by source IP address. Additionally, you can now enable or disable remote access via Lightsail’s built-in web client.

Access can be restricted from one or more IP addresses (for example, the IP address for your home computer) or a continuous range of IP addresses (such as the address range for your corporate network).

Next, I review how you configure these options to restrict remote access via SSH to a single source IP address.

Finding your IP address

Most computers do not have an internet routable IP address assigned. Internet routable IP addresses are scarce and are usually assigned to your internet gateway device. The other devices on the network are assigned private IP addresses. To communicate between the private IP network and the internet, the network router typically uses network address translation (NAT).

This tutorial assumes you are using NAT. This means the IP address used to restrict SSH access is the internet routable address of your network gateway device (usually your wireless router). Consequently, this limits access to all devices on the network behind this IP address.

There are many ways to find your internet routable IP address. You can log into your network gateway device and find it there (consult your device’s user manual for more details). Alternatively, use one of several public services to determine your IP address – search online for “what is my IP” to list several options.
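As a sketch, you can also ask from the command line of any machine on your network with a few lines of Python (checkip.amazonaws.com is a public service run by AWS):

import urllib.request

# Ask a public "what is my IP" service for our internet routable address.
with urllib.request.urlopen("https://checkip.amazonaws.com") as resp:
    print(resp.read().decode().strip())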

Restricting SSH access to a single IP address

  1. Start by creating a new Lightsail instance – you can select any blueprint.
  2. Once the instance state shows Running, choose the name of the instance to open the Instance details page.
    firewall test instance
  3. Choose Networking from the menu.
    networking tab
  4. Scroll down to find the current firewall settings. Under Allow connections from, it lists Any IP address for all of the applications. To change this, choose the edit icon for the SSH rule.
    IP address
  5. Check the box next to Restrict to IP address and enter your internet routable IP address under Source IP or range.
    Note: The next section shows how to restrict access from Lightsail’s browser-based SSH client. Currently, the Allow Lightsail browser SSH box is checked.
  6. Choose Save.

Now, SSH into your Lightsail instance from your local machine. You can learn more about how to connect to your Lightsail instance using SSH from our documentation.

You should be able to connect to your instance successfully. Next, test the connection from a different IP address. You can simulate this by restricting access to a different IP address and attempting to connect again:

  1. Edit the SSH firewall rule again, following the instructions above. This time, under Source IP or range enter 192.168.2.150.
  2. Choose Save.

Attempt to connect to your instance once more. The connection fails because your IP address does not match an IP address in the range.

Restricting access from the Lightsail browser-based SSH client

The browser-based SSH client makes it easy to access instances without needing to manage SSH keys locally. However, there may be cases where you must disable browser-based access.

To do this:

  1. Navigate to the firewall rules for the instance you created earlier.
  2. Choose the edit icon for the SSH rule.
  3. Uncheck the box next to Allow Lightsail browser SSH. Choose Save.
  4. From the menu, choose Connect, then choose Connect using SSH. The browser window opens, but you are not connected to the instance.

PING Support

There is now support for PING (ICMP), a command line utility used to check whether a computer is reachable over the network. PING sends a packet to a remote computer, which sends a simple response back. Before this release, you could not ping Lightsail instances.

To activate this feature, add a firewall rule:

  1. Navigate to the networking page for your instance.
  2. Under the firewall section, choose +Add rule.
  3. From the application list, choose PING (ICMP). Choose Save.
  4. From a terminal window on your local machine, send a ping command to your Lightsail instance’s IP address. You can find the IP address on the Connect tab of the instance details page or on the instance card on the Lightsail home page.
ping -c 5 192.168.2.143

You see a response similar to:

$ ping -c 5 192.168.2.143
PING 192.168.2.143 (192.168.2.143): 56 data bytes
64 bytes from 192.168.2.143: icmp_seq=0 ttl=54 time=19.383 ms
64 bytes from 192.168.2.143: icmp_seq=1 ttl=54 time=16.821 ms
64 bytes from 192.168.2.143: icmp_seq=2 ttl=54 time=16.363 ms
64 bytes from 192.168.2.143: icmp_seq=3 ttl=54 time=27.335 ms
64 bytes from 192.168.2.143: icmp_seq=4 ttl=54 time=19.429 ms

--- 192.168.2.143 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 16.363/19.866/27.335/3.943 ms

Conclusion

In this blog I covered how you can increase the security of your Lightsail instances by taking advantage of three new features: source IP restrictions, limiting access to the Lightsail browser SSH and RDP clients, and the addition of PING (ICMP) as an application type. These new features provide you an extra level of flexibility and security when deploying applications on Lightsail.

To learn more about the Lightsail firewall, see the documentation. Additionally, there are Getting Started tutorials for Lightsail, including launching a LAMP stack application or .NET application.

Cloudflare Bot Management: machine learning and more

Post Syndicated from Alex Bocharov original https://blog.cloudflare.com/cloudflare-bot-management-machine-learning-and-more/

Cloudflare Bot Management: machine learning and more

Introduction

Building the Cloudflare Bot Management platform is an exhilarating experience. It blends Distributed Systems, Web Development, Machine Learning, Security and Research (and every discipline in between) while fighting ever-adaptive and motivated adversaries at the same time.

This is the ongoing story of Bot Management at Cloudflare and also an introduction to a series of blog posts about the detection mechanisms powering it. I’ll start with several definitions from the Bot Management world, then introduce the product and technical requirements, leading to an overview of the platform we’ve built. Finally, I’ll share details about the detection mechanisms powering our platform.

Let’s start with Bot Management’s nomenclature.

Some Definitions

Bot – an autonomous program on a network that can interact with computer systems or users, imitating or replacing a human user’s behavior, performing repetitive tasks much faster than human users could.

Good bots – bots which are useful to businesses they interact with, e.g. search engine bots like Googlebot, Bingbot or bots that operate on social media platforms like Facebook Bot.

Bad bots – bots which are designed to perform malicious actions, ultimately hurting businesses, e.g. credential stuffing bots, third-party scraping bots, spam bots and sneakerbots.

Cloudflare Bot Management: machine learning and more

Bot Management – blocking undesired or malicious Internet bot traffic while still allowing useful bots to access web properties by detecting bot activity, discerning between desirable and undesirable bot behavior, and identifying the sources of the undesirable activity.

WAF (Web Application Firewall) – a security system that monitors and controls HTTP traffic to and from a web application based on a set of security rules.

Gathering requirements

Cloudflare has been stopping malicious bots from accessing websites or misusing APIs from the very beginning, while at the same time helping the climate by offsetting the carbon costs from the bots. Over time it became clear that we needed a dedicated platform which would unite different bot fighting techniques and streamline the customer experience. In designing this new platform, we tried to fulfill the following key requirements.

  • Complete, not complex – customers can turn on/off Bot Management with a single click of a button, to protect their websites, mobile applications, or APIs.
  • Trustworthy – customers want to know whether they can trust that a website visitor is who they claim to be, and want a certainty indicator for that trust level.
  • Flexible – customers should be able to define what subset of the traffic Bot Management mitigations should be applied to, e.g. only login URLs, pricing pages or sitewide.
  • Accurate – Bot Management detections should have a very small error rate, e.g. no or very few human visitors should ever be mistakenly identified as bots.
  • Recoverable – in case a wrong prediction is made, human visitors should still be able to access websites, and good bots should still be let through.

Moreover, the goal for new Bot Management product was to make it work well on the following use cases:

Cloudflare Bot Management: machine learning and more

Technical requirements

In addition to the product requirements above, we engineers had a list of must-haves for the new Bot Management platform. The most critical were:

  • Scalability – the platform should be able to calculate a score on every request, even at over 10 million requests per second.
  • Low latency – detections must be performed extremely quickly, not slowing down request processing by more than 100 microseconds, and not requiring additional hardware.
  • Configurability – it should be possible to configure what detections are applied on what traffic, including on per domain/data center/server level.
  • Modifiability – the platform should be easily extensible with more detection mechanisms, different mitigation actions, richer analytics and logs.
  • Security – no sensitive information from one customer should be used to build models that protect another customer.
  • Explainability & debuggability – we should be able to explain and tune predictions in an intuitive way.

Equipped with these requirements, back in 2018, our small team of engineers got to work to design and build the next generation of Cloudflare Bot Management.

Meet the Score

“Simplicity is the ultimate sophistication.”
– Leonardo Da Vinci

Cloudflare operates on a vast scale. At the time of this writing, this means covering 26M+ Internet properties, processing on average 11M requests per second (with peaks over 14M), and examining more than 250 request attributes from different protocol levels. The key question is how to harness the power of such “gargantuan” data to protect all of our customers from modern day cyberthreats in a simple, reliable and explainable way?

Bot management is hard. Some bots are much harder to detect and require looking at multiple dimensions of request attributes over a long time, and sometimes a single request attribute could give them away. More signals may help, but are they generalizable?

When we classify traffic, should customers decide what to do with it or are there decisions we can make on behalf of the customer? What concept could possibly address all these uncertainty problems and also help us to deliver on the requirements from above?

As you might’ve guessed from the section title, we came up with the concept of a Trusted Score, or simply The Score – one thing to rule them all – an integer between 0 and 100 indicating the likelihood that a request originated from a human (high score) vs. an automated program (low score).

Cloudflare Bot Management: machine learning and more
“One Ring to rule them all” by idreamlikecrazy, used under CC BY / Desaturated from original

Okay, let’s imagine that we are able to assign such a score to every incoming HTTP/HTTPS request; what are we or the customer supposed to do with it? Maybe it’s enough to provide such a score in the logs. Customers could then analyze them on their end, find the most frequent IPs with the lowest scores, and use the Cloudflare Firewall to block those IPs. Although useful, such a process would be manual, prone to error and, most importantly, could not be done in real time to protect the customer’s Internet property.

Fortunately, around the same time we started working on this system, our colleagues from the Firewall team had just announced Firewall Rules. This new capability gave customers the ability to control requests in a flexible and intuitive way, inspired by the widely known Wireshark® language. Firewall rules supported a variety of request fields, and we thought – why not make the score one of these fields? Customers could then write granular rules to block very specific attack types. That’s how the cf.bot_management.score field was born.

Having a score at the heart of Cloudflare Bot Management addressed multiple product and technical requirements in one stroke – it’s simple, flexible, configurable, and it provides customers with telemetry about bots on a per request basis. Customers can adjust the score threshold in firewall rules, depending on their sensitivity to false positives/negatives. Additionally, this intuitive score allows us to extend our detection capabilities under the hood without customers needing to adjust any configuration.
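For example, a rule like the following (illustrative only – the threshold of 30 and the /login path are placeholders you would tune for your own site and tolerance for false positives), paired with the Challenge action, challenges likely-automated login attempts:

(cf.bot_management.score lt 30 and http.request.uri.path eq "/login")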

So how can we produce this score and how hard is it? Let’s explore it in the following section.

Architecture overview

What is powering the Bot Management score? The short answer is a set of microservices. Building this platform we tried to re-use as many pipelines, databases and components as we could; however, many services had to be built from scratch. Let’s have a look at the overall architecture (this overly simplified version contains Bot Management related services):

Cloudflare Bot Management: machine learning and more

Core Bot Management services

In a nutshell, our systems process data received from the edge data centers, then produce and store the data required for our bot detection mechanisms, using the following technologies:

  • Databases & data stores – Kafka, ClickHouse, Postgres, Redis, Ceph.
  • Programming languages – Go, Rust, Python, Java, Bash.
  • Configuration & schema management – Salt, Quicksilver, Cap’n Proto.
  • Containerization – Docker, Kubernetes, Helm, Mesos/Marathon.

Each of these services is built with resilience, performance, observability and security in mind.

Edge Bot Management module

All bot detection mechanisms are applied to every request in real time during the request processing stage in the Bot Management module, which runs on every machine at Cloudflare’s edge locations. When a request comes in, we extract and transform the required request attributes and feed them to our detection mechanisms. The Bot Management module produces the following output:

  • Firewall fields:
    cf.bot_management.score – an integer between 0 and 100 indicating the likelihood that a request originated from an automated program (low score) versus a human (high score).
    cf.bot_management.verified_bot – a boolean indicating whether the request comes from a Cloudflare whitelisted bot.
    cf.bot_management.static_resource – a boolean indicating whether the request matches file extensions for many types of static resources.

  • Cookies – most notably cf_bm, which helps manage incoming traffic that matches criteria associated with bots.

  • JS challenges – for some of our detections and customers we inject invisible JavaScript challenges, providing us with more signals for bot detection.

  • Detection logs – we log details about each applied detection through our data pipelines to ClickHouse, including the features and flags used; some of these are used for analytics and customer logs, while others are used to debug and improve our models.

Once the Bot Management module has produced the required fields, the Firewall takes over the actual bot mitigation.

Firewall integration

The Cloudflare Firewall’s intuitive dashboard enables users to build powerful rules through easy clicks and also provides Terraform integration. Every request to the firewall is inspected against the rule engine. Suspicious requests can be blocked, challenged or logged as per the needs of the user while legitimate requests are routed to the destination, based on the score produced by the Bot Management module and the configured threshold.

Cloudflare Bot Management: machine learning and more

Firewall rules provide the following bot mitigation actions:

  • Log – records matching requests in the Cloudflare Logs provided to customers.
  • Bypass – allows customers to dynamically disable Cloudflare security features for a request.
  • Allow – matching requests are exempt from challenge and block actions triggered by other Firewall Rules content.
  • Challenge (Captcha) – useful for ensuring that the visitor accessing the site is human, and not automated.
  • JS Challenge – useful for ensuring that bots and spam cannot access the requested resource; browsers, however, are free to satisfy the challenge automatically.
  • Block – matching requests are denied access to the site.

Our Firewall Analytics tool, powered by ClickHouse and the GraphQL API, enables customers to quickly identify and investigate security threats using an intuitive interface. In addition to analytics, we provide detailed logs on all bot-related activity using either the Logpull API and/or LogPush, which provides an easy way to get your logs to your cloud storage.

Cloudflare Workers integration

If a customer wants more flexibility in what to do with requests based on the score – for example, injecting new or changing existing HTML page content, serving incorrect data to the bots, or stalling certain requests – Cloudflare Workers provide an option to do that. For example, using this small code snippet, we can pass the score back to the origin server for more advanced real-time analysis or mitigation:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  // Clone the request so that its headers become mutable.
  request = new Request(request);

  // Forward the Bot Management score to the origin as a custom header.
  request.headers.set("Cf-Bot-Score", request.cf.bot_management.score)

  // Proxy the annotated request to the origin server.
  return fetch(request);
}

Now let’s have a look into how a single score is produced using multiple detection mechanisms.

Detection mechanisms

Cloudflare Bot Management: machine learning and more

The Cloudflare Bot Management platform currently uses five complementary detection mechanisms, each producing its own score, which we combine to form the single score going to the Firewall. Most of the detection mechanisms are applied on every request, while some are enabled on a per-customer basis to better fit their needs.

Cloudflare Bot Management: machine learning and more

Having a score on every request for every customer has the following benefits:

  • Ease of onboarding – even before we enable Bot Management in active mode, we’re able to tell how well it’s going to work for the specific customer, including providing historical trends about bot activity.
  • Feedback loop – availability of the score on every request along with all features has tremendous value for continuous improvement of our detection mechanisms.
  • Ensures scaling – if we can compute the score for every request and customer, it means that every Internet property behind Cloudflare is a potential Bot Management customer.
  • Global bot insights – Cloudflare is sitting in front of more than 26M+ Internet properties, which allows us to understand and react to the tectonic shifts happening in security and threat intelligence over time.

Overall, globally, more than a third of the Internet traffic visible to Cloudflare comes from bad bots, while Bot Management customers see an even higher ratio of bad bots, at ~43%!

Cloudflare Bot Management: machine learning and more

Let’s dive into specific detection mechanisms in chronological order of their integration with Cloudflare Bot Management.

Machine learning

The majority of decisions about the score are made using our machine learning models. These were also the first detection mechanisms to produce a score and to onboard customers back in 2018. The successful application of machine learning requires data high in Quantity, Diversity, and Quality, and thanks to both free and paid customers, Cloudflare has all three, enabling continuous learning and improvement of our models for all of our customers.

At the core of the machine learning detection mechanism is CatBoost – a high-performance open source library for gradient boosting on decision trees. The choice of CatBoost was driven by the library’s outstanding capabilities (a toy usage sketch follows the list below):

  • Categorical features support – allowing us to train on even very high cardinality features.
  • Superior accuracy – allowing us to reduce overfitting by using a novel gradient-boosting scheme.
  • Inference speed – in our case it takes less than 50 microseconds to apply any of our models, making sure request processing stays extremely fast.
  • C and Rust API – most of our business logic on the edge is written using Lua, more specifically LuaJIT, so having a compatible FFI interface to be able to apply models is fantastic.

There are multiple CatBoost models running on Cloudflare’s edge in shadow mode on every request on every machine. One of the models runs in active mode and influences the final score going to the Firewall. All ML detection results and features are logged and recorded in ClickHouse for further analysis, model improvement, analytics and customer-facing logs. We feed both categorical and numerical features into our models, extracted from request attributes and from inter-request features built using those attributes, calculated and delivered by the Gagarin inter-request features platform.

We’re able to deploy new ML models in a matter of seconds using an extremely reliable and performant Quicksilver configuration database. The same mechanism can be used to configure which version of an ML model should be run in active mode for a specific customer.

A deep dive into our machine learning detection mechanism deserves a blog post of its own: it will cover how we train and validate our models on trillions of requests using GPUs, how model feature delivery and extraction works, and how we explain and debug model predictions both internally and externally.

Heuristics engine

Not all problems in the world are best solved with machine learning. We can tweak the ML models in various ways, but in certain cases they will likely underperform basic heuristics. Often the problems machine learning is trying to solve are not entirely new. When building the Bot Management solution it became apparent that sometimes a single attribute of the request could give a bot away. This means that we can create a set of simple rules capturing bots in a straightforward way, while also ensuring the lowest false positives.

The heuristics engine was the second detection mechanism integrated into the Cloudflare Bot Management platform in 2019, and it’s also applied on every request. We have multiple heuristic types and hundreds of specific rules based on certain attributes of the request, some of which are very hard to spoof. When a request matches any of the heuristics, we assign the lowest possible score of 1.

The engine has the following properties:

  • Speed – if ML model inference takes less than 50 microseconds per model, hundreds of heuristics can be applied just under 20 microseconds!
  • Deployability – the heuristics engine allows us to add a new heuristic in a matter of seconds using Quicksilver, and it will be applied on every request.
  • Vast coverage – using a set of simple heuristics allows us to classify ~15% of global traffic and ~30% of Bot Management customers’ traffic as bots. Not too bad for a few if conditions, right?
  • Lowest false positives – because we’re very sure and conservative on the heuristics we add, this detection mechanism has the lowest FP rate among all detection mechanisms.
  • Labels for ML – because of the high certainty, we use requests classified with heuristics to train our ML models, which can then generalize the behavior learnt from heuristics and improve detection accuracy.

So heuristics, combined with machine learning, gave us a lift; they encode a lot of intuition about bots, which helped advance the Cloudflare Bot Management platform and allowed us to onboard more customers.
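As a toy illustration of the idea – the rules below are entirely made up, and the real heuristics key on much harder-to-spoof request attributes:

def heuristic_score(request):
    # Return the lowest score (1) if a request trips an obvious rule,
    # or None to fall through to the other detection mechanisms.
    ua = request.get("user_agent", "")
    if not ua:                           # no real browser sends an empty UA
        return 1
    if "python-requests" in ua.lower():  # an honest automation library
        return 1
    return None

print(heuristic_score({"user_agent": "python-requests/2.23"}))  # prints 1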

Behavioral analysis

Machine learning and heuristics detections provide tremendous value, but both of them require human input on the labels – basically a teacher to distinguish between right and wrong. While our supervised ML models can generalize well enough, even on novel threats similar to what we taught them on, we decided to go further. What if there were an approach which doesn’t require a teacher, but instead learns to distinguish bad behavior from normal behavior?

Enter the behavioral analysis detection mechanism, initially developed in 2018 and integrated with the Bot Management platform in 2019. This is an unsupervised machine learning approach, which has the following properties:

  • Fitting specific customer needs – it’s automatically enabled for all Bot Management customers, calculating and analyzing normal visitor behavior over an extended period of time.
  • Detects bots never seen before – as it doesn’t use known bot labels, it can detect bots and anomalies from the normal behavior on specific customer’s website.
  • Harder to evade – anomalous behavior is often a direct result of the bot’s specific goal.

Please stay tuned for a more detailed blog about behavioral analysis models and the platform powering this incredible detection mechanism, protecting many of our customers from unseen attacks.

Verified bots

So far we’ve discussed how to detect bad bots and humans. What about good bots, some of which are extremely useful for the customer’s website? Is there a need for a dedicated detection mechanism, or is there something we could reuse from the previously described mechanisms? While the majority of good bot requests (e.g. Googlebot, Bingbot, LinkedInbot) already have a low score produced by other detection mechanisms, we also need a way to avoid accidental blocks of useful bots. That’s how the Firewall field cf.bot_management.verified_bot came into existence in 2019, allowing customers to decide for themselves whether they want to let all of the good bots through or restrict access to certain parts of the website.

The actual platform calculating the Verified Bot flag deserves a detailed blog of its own, but in a nutshell it has the following properties:

  • Validator based approach – we support multiple validation mechanisms, each of them allowing us to reliably confirm good bot identity by clustering a set of IPs.
  • Reverse DNS validator – performs a reverse DNS check to determine whether or not a bot’s IP address matches its alleged hostname.
  • ASN Block validator – similar to rDNS check, but performed on ASN block.
  • Downloader validator – collects good bot IPs from either text files or HTML pages hosted on bot owner sites.
  • Machine learning validator – uses an unsupervised learning algorithm, clustering good bot IPs which are not possible to validate through other means.
  • Bots Directory – a database with UI that stores and manages bots that pass through the Cloudflare network.
Cloudflare Bot Management: machine learning and more
Bots directory UI sample‌‌

Using multiple validation methods listed above, the Verified Bots detection mechanism identifies hundreds of unique good bot identities, belonging to different companies and categories.
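To illustrate just the reverse DNS validator, here is a minimal sketch of a forward-confirmed reverse DNS check using only Python’s standard library (the hostname suffixes are the ones Google documents for Googlebot; the real validator clusters IPs and does considerably more):

import socket

def looks_like_googlebot(ip):
    # Forward-confirmed reverse DNS: the PTR record must point into a
    # Google-owned domain, and that hostname must resolve back to the IP.
    try:
        host = socket.gethostbyaddr(ip)[0]                # reverse lookup
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]     # forward confirmation
    except OSError:
        return False

print(looks_like_googlebot("66.249.66.1"))  # an address Googlebot has crawled from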

JS fingerprinting

When it comes to Bot Management detection quality it’s all about the signal quality and quantity. All previously described detections use request attributes sent over the network and analyzed on the server side using different techniques. Are there more signals available, which can be extracted from the client to improve our detections?

As a matter of fact there are plenty, as every browser has unique implementation quirks. Every web browser’s graphics output, such as canvas rendering, depends on multiple layers: hardware (GPU) and software (drivers, operating system rendering). This highly unique output allows precise differentiation between different browser/device types. Moreover, this is achievable without sacrificing website visitor privacy: it is not a supercookie, and it cannot be used to track and identify individual users, but only to confirm that a request’s user agent matches the other telemetry gathered through the browser canvas API.

This detection mechanism is implemented as a challenge-response system, with the challenge injected into the webpage at Cloudflare’s edge. The challenge is then rendered in the background using the provided graphics instructions, and the result is sent back to Cloudflare for validation and further action, such as producing the score. There is a lot going on behind the scenes to make sure we get reliable results without sacrificing users’ privacy, while remaining tamper resistant to replay attacks. The system is currently in private beta, being evaluated for its effectiveness, and we already see very promising results. Stay tuned for this new detection mechanism becoming widely available and the blog on how we’ve built it.

This concludes an overview of the five detection mechanisms we’ve built so far. It’s time to sum it all up!

Summary

Cloudflare has the unique ability to collect data from trillions of requests flowing through its network every week. With this data, Cloudflare is able to identify likely bot activity with Machine Learning, Heuristics, Behavioral Analysis, and other detection mechanisms. Cloudflare Bot Management integrates seamlessly with other Cloudflare products, such as WAF and Workers.

All this would not be possible without hard work across multiple teams! First of all, thanks to everybody on the Bots Team for their tremendous efforts to make this platform come to life. Other Cloudflare teams, most notably Firewall, Data, Solutions Engineering, Performance, and SRE, helped us a lot to design, build and support this incredible platform.

Bots team during Austin team summit 2019 hunting bots with axes 🙂

Lastly, there are more blogs from the Bots series coming soon, diving into internals of our detection mechanisms, so stay tuned for more exciting stories about Cloudflare Bot Management!

How Netflix brings safer and faster streaming experience to the living room on crowded networks…

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/how-netflix-brings-safer-and-faster-streaming-experience-to-the-living-room-on-crowded-networks-78b8de7f758c

How Netflix brings safer and faster streaming experience to the living room on crowded networks using TLS 1.3

By Sekwon Choi

At Netflix, we are obsessed with the best streaming experiences. We want playback to start instantly and to never stop unexpectedly in any network environment. We are also committed to protecting users’ privacy and service security without sacrificing any part of the playback experience.

To achieve that, we make efficient use of ABR (adaptive bitrate streaming) for a better playback experience, DRM (Digital Rights Management) to protect our service, and TLS (Transport Layer Security) to protect customer privacy and create a safer streaming experience.

Until recently, Netflix on consumer electronics devices such as TVs, set-top boxes and streaming sticks used TLS 1.2 for streaming traffic. Now we support TLS 1.3 for safer and faster experiences.

What is TLS?

For two parties to communicate securely, a secure channel is necessary. Such a channel needs to have the following three properties.

  • Authentication: Identity of the communicating party is verified.
  • Confidentiality: Data sent over the channel is only visible to the endpoints.
  • Integrity: Data sent over the channel cannot be modified by attackers without detection.

The TLS protocol is designed to provide a secure channel between two peers by providing tools and methods to achieve the above properties.

TLS 1.3

TLS 1.3 is the latest version of the Transport Layer Security protocol. It is simpler, more secure and more efficient than its predecessor.

Perfect Forward Secrecy

One thing we believe is very important at Netflix is providing PFS (Perfect Forward Secrecy).

PFS is a feature of the key exchange algorithm that assures that session keys will not be compromised, even if the server’s private key is compromised. By generating new keys for each session, PFS protects past sessions against the future compromise of secret keys.

TLS 1.2 supports key exchange algorithms with PFS, but it also allows key exchange algorithms that do not support PFS. Even with TLS 1.2, Netflix has always selected a key exchange algorithm that provides PFS, such as ECDHE (Elliptic Curve Diffie-Hellman Ephemeral). TLS 1.3, however, enforces this concept further by removing all key exchange algorithms that do not provide PFS, such as static RSA.
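
As a small illustration (a sketch, not Netflix’s client code), this is how a Python client can insist on TLS 1.3, and therefore on an ephemeral key exchange; the hostname is just an example:

```python
import socket
import ssl

# Refuse anything below TLS 1.3, so every handshake uses an ephemeral
# (EC)DHE key exchange and therefore provides forward secrecy.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3

with socket.create_connection(("www.netflix.com", 443)) as raw:
    with ctx.wrap_socket(raw, server_hostname="www.netflix.com") as tls:
        print(tls.version())  # 'TLSv1.3'
        print(tls.cipher())   # e.g. ('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256)
```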

Authenticated Encryption

For encryption, TLS 1.3 removes all weak ciphers and uses only Authenticated Encryption with Associated Data (AEAD). This assures the confidentiality, integrity, and authenticity of the data. We use AES in Galois/Counter Mode (AES-GCM), as it also provides good performance and high throughput.
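
To illustrate what AEAD provides, here is a short sketch using the third-party cryptography package (an illustration of AES-GCM in general, not Netflix’s implementation): one operation encrypts and authenticates the payload, and also binds associated data that stays visible on the wire, which in TLS 1.3 is the record header:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aead = AESGCM(key)
nonce = os.urandom(12)  # 96-bit nonce, never reused under the same key

header = b"tls13-style record header"  # authenticated, but sent in the clear
ciphertext = aead.encrypt(nonce, b"application data", header)

# decrypt() verifies the authentication tag; flipping any bit of the
# ciphertext or the header raises cryptography.exceptions.InvalidTag.
assert aead.decrypt(nonce, ciphertext, header) == b"application data"
```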

Secure Handshake

While the above changes are important, the most important change in TLS 1.3 is perhaps its redesign of the handshake protocol.

The TLS 1.2 handshake was not designed to protect the integrity of the entire handshake. It protected only the part of the handshake after the cipher suite negotiation, and this opened up the possibility of downgrade attacks, which may allow an attacker to force the use of insecure cipher suites.

With TLS 1.3, the server signs the entire handshake including the cipher suite negotiation and thus prevents the attacker from downgrading the cipher suite.

Also, in TLS 1.2, extensions were sent in the clear in the ServerHello. With TLS 1.3, all handshake messages after the ServerHello, including extensions, are encrypted.

Reduced Handshake

TLS 1.2 supports numerous key exchange algorithms, cipher suites, and digital signatures, including weak and vulnerable ones. Therefore, it requires more messages to perform a handshake, taking two network round trips.

In contrast, the handshake in TLS 1.3 now requires only one round trip, with a simplified design and with all weak and vulnerable algorithms removed.

In addition, it has a new feature called 0-RTT, or TLS early data, for the resumed handshake. This allows an application to include application data with its initial handshake message, instead of having to wait until the handshake completes.

At Netflix, by efficiently resuming TLS sessions and carefully using 0-RTT for streaming data, we can reduce play delay.

A/B Testing Result

From the analysis of its protocol composition, we were pretty confident that TLS 1.3 would bring us better security, but we did not know how it would perform in the context of streaming.

Since TLS 1.3’s main performance-related feature is the 0-RTT mode with the resumed handshake, our hypothesis was that TLS 1.3 would reduce play delay: we no longer need to wait for the handshake to finish, and can instead issue the HTTP request for media data and receive the response earlier.

To see the actual performance of TLS 1.3 in the field, we performed an experiment with:

  • User accounts: half-million user accounts per cell.
  • Device type: a mid-performance device with a quad-core ARM CPU @ 1.7 GHz.
  • Control cell: TLS 1.2
  • Treatment cell: TLS 1.3

Play Delay

Play delay is defined as how long it takes for playback to start. Below are the results of the play delay measured in the experiment. They imply that TLS 1.3 brings improvements across all network conditions, with the largest gains on slower or congested networks, represented by the quantiles of 0.75 and above.

Below is the time series median play delay graph for this mid-performance device in the field. It also shows that playback starts earlier with TLS 1.3.

Media Rebuffer

At Netflix, we define a media rebuffer as a non-network-originated rebuffer. It typically occurs when media data is not processed quickly enough by the device due to high CPU load. Compared with the TLS 1.2 control cell, the TLS 1.3 treatment cell showed about a 7.4% improvement in media rebuffers. This result implies that using TLS 1.3 with 0-RTT is more efficient and can reduce the CPU load.

Conclusion

From the security analysis, we are confident that TLS 1.3 improves communication security over TLS 1.2. From the field test, we are confident that TLS 1.3 provides us a better streaming experience.

At the time of writing this article, the Internet is experiencing higher than usual traffic and congestion. We believe saving even small amounts of data and round trips can be meaningful, and it is even better when it also comes with a more secure and efficient streaming experience.

Therefore, we have started deploying TLS 1.3 on newer consumer electronics devices and we are expecting even more devices to be deployed with TLS 1.3 capability in the near future.


Is BGP Safe Yet? No. But we are tracking it carefully

Post Syndicated from Louis Poinsignon original https://blog.cloudflare.com/is-bgp-safe-yet-rpki-routing-security-initiative/

Is BGP Safe Yet? No. But we are tracking it carefully

BGP leaks and hijacks have been accepted as an unavoidable part of the Internet for far too long. We relied on protection at the upper layers, like TLS and DNSSEC, to ensure untampered delivery of packets, but a hijacked route often results in an unreachable IP address, which in turn means an Internet outage.

The Internet is too vital to allow this known problem to continue any longer. It’s time networks prevented leaks and hijacks from having any impact. It’s time to make BGP safe. No more excuses.

Border Gateway Protocol (BGP), the protocol used to exchange routes between networks, has existed and evolved since the 1980s. Over the years it has gained security features. The most notable security addition is the Resource Public Key Infrastructure (RPKI), a security framework for routing. It has been the subject of a few blog posts following our deployment in mid-2018.

Today, the industry considers RPKI mature enough for widespread use, with a sufficient ecosystem of software and tools, including tools we’ve written and open sourced. We have fully deployed Origin Validation on all our BGP sessions with our peers and signed our prefixes.

However, the Internet can only be safe if the major network operators deploy RPKI. Those networks have the ability to spread a leak or hijack far and wide, and it’s vital that they take part in stamping out the scourge of BGP problems, whether inadvertent or deliberate.

Many networks, like AT&T and Telia, pioneered global deployments of RPKI in 2019. They were successfully followed by Cogent and NTT in 2020. Hundreds of networks of all sizes have done a tremendous job over the last few years, but there is still work to be done.

If we observe the customer cones of the networks that have deployed RPKI, we see that around 50% of the Internet is now better protected against route leaks. That’s great, but it’s nothing like enough.

Today, we are releasing isBGPSafeYet.com, a website to track deployments and filtering of invalid routes by the major networks.

We are hoping this will help the community, and we will crowdsource the information on the website. The source code is available on GitHub; we welcome suggestions and contributions.

We expect this initiative will make RPKI more accessible to everyone and ultimately reduce the impact of route leaks. Share the message with your Internet Service Providers (ISPs), hosting providers, and transit networks to build a safer Internet.

Additionally, to monitor and test deployments, we decided to announce two bad prefixes from our 200+ data centers and via the 233+ Internet Exchange Points (IXPs) we are connected to:

  • 103.21.244.0/24
  • 2606:4700:7000::/48

Both these prefixes should be considered invalid and should not be routed by your provider if RPKI is implemented within their network. This makes it easy to demonstrate how far a bad route can go, and test whether RPKI is working in the real world.

A Route Origin Authorization for 103.21.244.0/24 on rpki.cloudflare.com

In the test you can run on isBGPSafeYet.com, your browser will attempt to fetch two pages: the first one, valid.rpki.cloudflare.com, is behind an RPKI-valid prefix, and the second one, invalid.rpki.cloudflare.com, is behind the RPKI-invalid prefix.

The test has two outcomes:

  • If both pages were fetched correctly, your ISP accepted the invalid route. It does not implement RPKI.
  • If only valid.rpki.cloudflare.com was fetched, your ISP implements RPKI. You will be less sensitive to route leaks.
A simple test of RPKI invalid reachability
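
For a rough command-line equivalent of the browser test (a sketch, not the official tooling), you can attempt to fetch both pages yourself:

```python
import urllib.request

def reachable(url: str, timeout: float = 5.0) -> bool:
    """True if the URL responds at all within the timeout."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except OSError:
        return False

valid = reachable("https://valid.rpki.cloudflare.com/")
invalid = reachable("https://invalid.rpki.cloudflare.com/")

if valid and not invalid:
    print("Your ISP appears to drop RPKI-invalid routes.")
elif valid and invalid:
    print("Your ISP accepted the invalid route: RPKI is not enforced.")
else:
    print("Inconclusive: the valid test page could not be reached.")
```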

We will be performing tests using those prefixes to check for propagation. Traceroutes and probing helped us in the past by creating visualizations of deployment.

A simple indicator is the number of networks sending the accepted route to their peers and collectors:

Is BGP Safe Yet? No. But we are tracking it carefully
Routing status from online route collection tool RIPE Stat

In December 2019, we released a Hilbert curve map of the IPv4 address space. Every pixel represents a /20 prefix. If a pixel is yellow, the prefix responded only to the probe from RPKI-valid IP space. If it is blue, the prefix responded to probes from both RPKI-valid and RPKI-invalid IP space.

To summarize, the yellow areas are IP space behind networks that drop RPKI invalid prefixes. The Internet isn’t safe until the blue becomes yellow.

Hilbert curve map of IP address space behind networks filtering RPKI invalid prefixes

Last but not least, we would like to thank every network that has already deployed RPKI and every developer that contributed to validator-software code bases. The last two years have shown that the Internet can become safer, and we are looking forward to the day when we can call route leaks and hijacks incidents of the past.

Time-Based One-Time Passwords for Phone Support

Post Syndicated from Junade Ali original https://blog.cloudflare.com/time-based-one-time-passwords-for-phone-support/

Time-Based One-Time Passwords for Phone Support

As part of Cloudflare’s support offering, we provide phone support to Enterprise customers who are experiencing critical business issues.

For account security, specific account settings and sensitive details are not discussed via phone. Starting today, we are providing Enterprise customers with the ability to configure phone authentication, allowing greater support to be offered over the phone without the need to perform validation through support tickets.

After providing your email address to a Cloudflare Support representative, you can now provide a token generated from the Cloudflare dashboard or via a 2FA app like Google Authenticator. This way, a customer is able to prove over the phone that they are who they say they are.

Configuring Phone Authentication

If you are an existing Enterprise customer interested in phone support, please contact your Customer Success Manager for eligibility information and set-up. If you are interested in our Enterprise offering, please get in contact via our Enterprise plan page.

If you already have phone support eligibility, you can generate single-use tokens from the Cloudflare dashboard or configure an authenticator app to do the same remotely.

On the support page, you will see a card called “Emergency Phone Support Hotline – Authentication”. From here you can generate a Single-Use Token for authenticating a single call or configure an Authenticator App to generate tokens from a 2FA app.

For more detailed instructions, please see the “Emergency Phone” section of the Contacting Cloudflare Support article on the Cloudflare Knowledge Base.

How it Works

A standardised approach for generating TOTPs (Time-Based One-Time Passwords) is described in RFC 6238 – this is the approach that is often used for setting up Two Factor Authentication on websites.

When configuring a TOTP authenticator app, you are usually asked to scan a QR code or input a long alphanumeric string. This is a randomly generated secret, shared between your local authenticator app and the web service where you are configuring TOTP. After TOTP is configured, this secret is stored by both the web server and your local device.

TOTP password generation relies on two key inputs: the shared secret and the number of seconds since the Unix epoch (Unix time). The timestamp is integer-divided by a validity period (often 30 seconds), and this value is put into a cryptographic hash function alongside the secret to generate an output. The hexadecimal output is then truncated to provide the decimal digits shown to the user. The avalanche effect means that whenever the inputs to the hash function change slightly (e.g. the timestamp increments), a completely different hash output is generated.
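
For the curious, here is a minimal, self-contained sketch of RFC 6238 generation with the most common parameters (SHA-1, 30-second steps, six digits):

```python
import base64
import hashlib
import hmac
import struct
import time
from typing import Optional

def totp(secret_b32: str, period: int = 30, digits: int = 6,
         at: Optional[float] = None) -> str:
    """HMAC the time-step counter with the shared secret, apply RFC 4226
    dynamic truncation, and keep the last `digits` decimal digits."""
    counter = int((time.time() if at is None else at) // period)
    key = base64.b32decode(secret_b32, casefold=True)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

print(totp("JBSWY3DPEHPK3PXP"))  # matches what an authenticator app shows
```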

This approach is fairly widely used and is available in a number of libraries for your preferred programming language. However, as our phone validation functionality offers both authenticator app support and generation of a single-use token from the dashboard (where no shared secret exists), some deviation was required.

We generate a single-use token by creating a hash of an internal user ID combined with a Cloudflare-internal secret, which in turn is used to generate RFC 6238-compliant time-based one-time passwords. This way, the service can generate passwords for any user without needing to store additional secrets. The token is then surfaced to the user every 30 seconds via a JavaScript request, without exposing the secret used to generate it.
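
The exact derivation scheme is not published; a hypothetical version of the idea, reusing the totp() helper from the sketch above (the names and master secret here are made up), might look like this:

```python
import base64
import hashlib
import hmac

MASTER_SECRET = b"cloudflare-internal-secret"  # held server-side only

def derive_user_secret(user_id: str) -> str:
    """Derive a per-user TOTP secret on demand, so no per-user secret
    ever needs to be stored."""
    raw = hmac.new(MASTER_SECRET, user_id.encode(), hashlib.sha256).digest()
    return base64.b32encode(raw).decode()

# The same derivation on each request yields the same secret, so the token
# generator and verifier independently agree on the current code.
# (Assumes totp() from the earlier sketch is in scope.)
print(totp(derive_user_secret("user-12345")))
```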

One question you may be asking yourself after all of this is: why don’t we simply use the 2FA mechanism users log in with for phone validation too? Firstly, we don’t want to accustom users to providing their 2FA tokens to anyone else (they should be used purely for logging in). Secondly, as you may have noticed, we recently began supporting WebAuthn keys for logging in; as these are physical tokens used for website authentication, they aren’t suited to usage on a mobile device.

To improve the user experience during a phone call, we also validate tokens from the previous time step, in case the token has expired by the time the user has read it out (indeed, RFC 6238 provides that “at most one time step is allowed as the network delay”). This means a token can be valid for up to one minute.
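
Continuing the earlier sketch, accepting the previous time step is one extra comparison (again assuming the totp() helper defined above):

```python
import hmac
import time

def verify_token(secret_b32: str, candidate: str, period: int = 30) -> bool:
    """Accept the current code or the one from the previous time step,
    mirroring RFC 6238's allowance of one step of delay."""
    now = time.time()
    return any(
        hmac.compare_digest(totp(secret_b32, at=now - step * period), candidate)
        for step in (0, 1)
    )
```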

The APIs powering this service are then wrapped with API gateways that offer audit logging both for customer actions and actions completed by staff members. This provides a clear audit trail for customer authentication.

Future Work

Authentication is a critical component of securing customer support interactions. Authentication tooling must develop alongside support contact channels: from web forms behind logins, to JWT tokens for validating live chat sessions, and now to TOTP phone authentication. This is complemented by technical support engineers, who manage risk by routing certain issues into traditional support tickets and referring some cases to named customer success managers for approval.

We are constantly advancing our support experience; for example, we plan to further improve our Enterprise Phone Support by giving users the ability to request a callback from a support agent within our dashboard. As always, right here on our blog we’ll keep you up to date with improvements in our service.