Tag Archives: security

New – VPC Ingress Routing – Simplifying Integration of Third-Party Appliances

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-vpc-ingress-routing-simplifying-integration-of-third-party-appliances/

When I was delivering the Architecting on AWS class, customers often asked me how to configure an Amazon Virtual Private Cloud to enforce the same network security policies in the cloud as they have on-premises. For example, to scan all ingress traffic with an Intrusion Detection System (IDS) appliance or to use the same firewall in the cloud as on-premises. Until today, the only answer I could provide was to route all traffic back from their VPC to an on-premises appliance or firewall in order to inspect the traffic with their usual networking gear before routing it back to the cloud. This is obviously not an ideal configuration: it adds latency and complexity.

Today, we are announcing new VPC routing primitives that allow you to route all incoming and outgoing traffic to or from an Internet Gateway (IGW) or Virtual Private Gateway (VGW) to a specific Amazon Elastic Compute Cloud (EC2) instance’s Elastic Network Interface. This means you can now configure your Virtual Private Cloud to send all traffic to an EC2 instance before the traffic reaches your business workloads. The instance typically runs network security tools, such as an IDS/IPS or firewall, to inspect or block suspicious network traffic, or to perform any other network traffic inspection before relaying the traffic to other EC2 instances.

How Does it Work?
To learn how it works, I wrote this CDK script to create a VPC with two public subnets: one subnet for the appliance and one subnet for a business application. The script launches two EC2 instances with public IP addresses, one in each subnet. The script creates the following architecture:

This is a regular VPC: the subnets have routing tables to the Internet Gateway, and traffic flows in and out as expected. The application instance hosts a static web site that is accessible from any browser. You can retrieve the application’s public DNS name from the EC2 Console (for your convenience, I also included the CLI version in the comments of the CDK script).

AWS_REGION=us-west-2
APPLICATION_IP=$(aws ec2 describe-instances                           \
                     --region $AWS_REGION                             \
                     --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].Association.PublicDnsName"  \
                     --output text)
				   
curl -I $APPLICATION_IP

Configure Routing
To configure routing, you need to know the VPC ID, the ENI ID of the ENI attached to the appliance instance, and the Internet Gateway ID. Assuming you created the infrastructure using the CDK script I provided, here are the commands I use to find these three IDs (be sure to adjust to the AWS Region you use):

AWS_REGION=us-west-2
VPC_ID=$(aws cloudformation describe-stacks                              \
             --region $AWS_REGION                                        \
             --stack-name VpcIngressRoutingStack                         \
             --query "Stacks[].Outputs[?OutputKey=='VPCID'].OutputValue" \
             --output text)

ENI_ID=$(aws ec2 describe-instances                                       \
             --region $AWS_REGION                                         \
             --query "Reservations[].Instances[] | [?Tags[?Key=='Name' &&  Value=='appliance']].NetworkInterfaces[].NetworkInterfaceId" \
             --output text)

IGW_ID=$(aws ec2 describe-internet-gateways                               \
             --region $AWS_REGION                                         \
             --query "InternetGateways[] | [?Attachments[?VpcId=='${VPC_ID}']].InternetGatewayId" \
             --output text)

To route all incoming traffic through my appliance, I create a routing table for the Internet Gateway and I attach a rule to direct all traffic to the EC2 instance Elastic Network Interface (ENI):

# create a new routing table for the Internet Gateway
ROUTE_TABLE_ID=$(aws ec2 create-route-table                      \
                     --region $AWS_REGION                        \
                     --vpc-id $VPC_ID                            \
                     --query "RouteTable.RouteTableId"           \
                     --output text)

# create a route for 10.0.1.0/24 pointing to the appliance ENI
aws ec2 create-route                             \
    --region $AWS_REGION                         \
    --route-table-id $ROUTE_TABLE_ID             \
    --destination-cidr-block 10.0.1.0/24         \
    --network-interface-id $ENI_ID

# associate the routing table to the Internet Gateway
aws ec2 associate-route-table                      \
    --region $AWS_REGION                           \
    --route-table-id $ROUTE_TABLE_ID               \
    --gateway-id $IGW_ID

Alternatively, I can use the VPC Console under the new Edge Associations tab.

To route all application outgoing traffic through the appliance, I replace the default route for the application subnet to point to the appliance’s ENI:

SUBNET_ID=$(aws ec2 describe-instances                                  \
                --region $AWS_REGION                                    \
                --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].SubnetId"    \
                --output text)
ROUTING_TABLE=$(aws ec2 describe-route-tables                           \
                    --region $AWS_REGION                                \
                    --query "RouteTables[?VpcId=='${VPC_ID}'] | [?Associations[?SubnetId=='${SUBNET_ID}']].RouteTableId" \
                    --output text)

# delete the existing default route (the one pointing to the internet gateway)
aws ec2 delete-route                       \
    --region $AWS_REGION                   \
    --route-table-id $ROUTING_TABLE        \
    --destination-cidr-block 0.0.0.0/0
	
# create a default route pointing to the appliance's ENI
aws ec2 create-route                          \
    --region $AWS_REGION                      \
    --route-table-id $ROUTING_TABLE           \
    --destination-cidr-block 0.0.0.0/0        \
    --network-interface-id $ENI_ID
	
aws ec2 associate-route-table       \
    --region $AWS_REGION            \
    --route-table-id $ROUTING_TABLE \
    --subnet-id $SUBNET_ID

Alternatively, I can use the VPC Console. Within the correct routing table, I select the Routes tab and click Edit routes to replace the default route (the one pointing to 0.0.0.0/0) with one targeting the appliance’s ENI.

Now I have the routing configuration in place. The new routing looks like this:

Configure the Appliance Instance
Finally, I configure the appliance instance to forward all traffic it receives. Your software appliance usually does that for you; no extra step is required when you use AWS Marketplace appliances. When using a plain Linux instance, two extra steps are required:

1. Connect to the EC2 appliance instance and configure IP traffic forwarding in the kernel:

APPLIANCE_ID=$(aws ec2 describe-instances  \
                   --region $AWS_REGION    \
                   --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='appliance']].InstanceId" \
                   --output text)
aws ssm start-session --region $AWS_REGION --target $APPLIANCE_ID	

##
## once connected (you see the 'sh-4.2$' prompt), type:
##

sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1
exit

2. Configure the EC2 instance to accept traffic addressed to destinations other than itself by disabling the source/destination check:

aws ec2 modify-instance-attribute --region $AWS_REGION \
                         --no-source-dest-check        \
                         --instance-id $APPLIANCE_ID

Now, the appliance is ready to forward traffic to the other EC2 instances. You can test this by pointing your browser (or using `cURL`) to the application instance.

APPLICATION_IP=$(aws ec2 describe-instances --region $AWS_REGION                          \
                     --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].Association.PublicDnsName"  \
                     --output text)
				   
curl -I $APPLICATION_IP

To verify the traffic is really flowing through the appliance, you can enable the source/destination check on the instance again (use the --source-dest-check parameter with the modify-instance-attribute CLI command above). The traffic is blocked when the source/destination check is enabled.
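
For example, here is a minimal sketch that toggles the check both ways, reusing the $AWS_REGION and $APPLIANCE_ID variables from the steps above:

# re-enable the source/destination check: traffic through the appliance is now dropped
aws ec2 modify-instance-attribute --region $AWS_REGION \
    --source-dest-check                                \
    --instance-id $APPLIANCE_ID

# the curl above now times out; disable the check again to restore traffic
aws ec2 modify-instance-attribute --region $AWS_REGION \
    --no-source-dest-check                             \
    --instance-id $APPLIANCE_ID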

Cleanup
Should you use the CDK script I provided for this article, be sure to run cdk destroy when finished. This ensures you are not billed for the two EC2 instances used in this demo. As I modified the routing tables behind the back of AWS CloudFormation, I need to manually delete the routing tables, the subnet, and the VPC. The easiest way is to navigate to the VPC Console, select the VPC, and click Actions => Delete VPC. The console deletes all components in the correct order. You might need to wait 5-10 minutes after cdk destroy finishes before the console is able to delete the VPC.

From our Partners
During the beta test of these new routing capabilities, we granted early access to a collection of AWS partners. They provided us with tons of helpful feedback. Here are some of the blog posts that they wrote in order to share their experiences (I am updating this article with links as they are published):

  • 128 Technology
  • Aviatrix
  • Checkpoint
  • Cisco
  • Citrix
  • FireEye
  • Fortinet
  • HashiCorp
  • IBM Security
  • Lastline
  • Netscout
  • Palo Alto Networks
  • ShieldX Networks
  • Sophos
  • Trend Micro
  • Valtix
  • Vectra AI
  • Versa Networks

Availability
There is no additional cost to use Virtual Private Cloud ingress routing. It is available in all regions (including AWS GovCloud (US-West)) and you can start using it today.

You can learn more about gateway route tables in the updated VPC documentation.

What are the appliances you are going to use with this new VPC routing capability?

— seb

Identify Unintended Resource Access with AWS Identity and Access Management (IAM) Access Analyzer

Post Syndicated from Brandon West original https://aws.amazon.com/blogs/aws/identify-unintended-resource-access-with-aws-identity-and-access-management-iam-access-analyzer/

Today I get to share my favorite kind of announcement. It’s the sort of thing that will improve security for just about everyone that builds on AWS, it can be turned on with almost no configuration, and it costs nothing to use. We’re launching a new, first-of-its-kind capability called AWS Identity and Access Management (IAM) Access Analyzer. IAM Access Analyzer mathematically analyzes access control policies attached to resources and determines which resources can be accessed publicly or from other accounts. It continuously monitors all policies for Amazon Simple Storage Service (S3) buckets, IAM roles, AWS Key Management Service (KMS) keys, AWS Lambda functions, and Amazon Simple Queue Service (SQS) queues. With IAM Access Analyzer, you have visibility into the aggregate impact of your access controls, so you can be confident your resources are protected from unintended access from outside of your account.

Let’s look at a couple of examples. An IAM Access Analyzer finding might indicate that an S3 bucket named my-bucket-1 is accessible to an AWS account with the ID 123456789012 when originating from the source IP range 11.0.0.0/15. Or IAM Access Analyzer may detect a KMS key policy that allows users from another account to delete the key, identifying a data loss risk you can fix by adjusting the policy. If the findings show intentional access paths, they can be archived.

So how does it work? Using the kind of math that shows up on unexpected final exams in my nightmares, IAM Access Analyzer evaluates your policies to determine how a given resource can be accessed. Critically, this analysis is not based on historical events or pattern matching or brute force tests. Instead, IAM Access Analyzer understands your policies semantically. All possible access paths are verified by mathematical proofs, and thousands of policies can be analyzed in a few seconds. This is done using a branch of computer science called automated reasoning. IAM Access Analyzer is the first service powered by automated reasoning available to builders everywhere, offering functionality unique to AWS. To start learning about automated reasoning, I highly recommend this short video explainer. If you are interested in diving a bit deeper, check out this re:Invent talk on automated reasoning from Byron Cook, Director of the AWS Automated Reasoning Group. And if you’re really interested in understanding the methodology, make yourself a nice cup of chamomile tea, grab a blanket, and get cozy with a copy of Semantic-based Automated Reasoning for AWS Access Policies using SMT.

Turning on IAM Access Analyzer is way less stressful than an unexpected nightmare final exam. There’s just one step. From the IAM Console, select Access analyzer from the menu on the left, then click Create analyzer.

Creating an Access Analyzer

Analyzers generate findings in the account from which they are created. Analyzers also work within the region defined when they are created, so create one in each region for which you’d like to see findings.
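
If you prefer the command line, you can also create an analyzer with the AWS CLI. Here is a minimal sketch (the analyzer name is an arbitrary placeholder; repeat per region as needed):

aws accessanalyzer create-analyzer      \
    --region us-east-1                  \
    --analyzer-name my-account-analyzer \
    --type ACCOUNT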

Once our analyzer is created, findings that show accessible resources appear in the Console. My account has a few findings that are worth looking into, such as KMS keys and IAM roles that are accessible by other accounts and federated users.

Viewing Access Analyzer Findings

I’m going to click on the first finding and take a look at the access policy for this KMS key.

An Access Analyzer Finding

From here we can see the open access paths and details about the resources and principals involved. I went over to the KMS console and confirmed that this is intended access, so I archived this particular finding.

All IAM Access Analyzer findings are visible in the IAM Console, and can also be accessed using the IAM Access Analyzer API. Findings related to S3 buckets can be viewed directly in the S3 Console. Bucket policies can then be updated right in the S3 Console, closing the open access pathway.
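
For example, here is a minimal sketch of retrieving findings with the CLI, assuming a single analyzer exists in the region:

# look up the analyzer ARN, then list its findings
ANALYZER_ARN=$(aws accessanalyzer list-analyzers \
                   --query "analyzers[0].arn"    \
                   --output text)

aws accessanalyzer list-findings --analyzer-arn $ANALYZER_ARN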

An Access Analyzer finding in S3

You can also see high-priority findings generated by IAM Access Analyzer in AWS Security Hub, ensuring a comprehensive, single source of truth for your compliance and security-focused team members. IAM Access Analyzer also integrates with CloudWatch Events, making it easy to automatically respond to or send alerts regarding findings through the use of custom rules.
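
As a sketch of such a custom rule (the rule name is a placeholder, and the source and detail-type strings below are my assumption of how Access Analyzer findings appear on the default event bus; verify them against the EventBridge documentation before relying on them):

# match IAM Access Analyzer findings on the default event bus
aws events put-rule                     \
    --name access-analyzer-findings     \
    --event-pattern '{
      "source": ["aws.access-analyzer"],
      "detail-type": ["Access Analyzer Finding"]
    }'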

Now that you’ve seen how IAM Access Analyzer provides a comprehensive overview of cloud resource access, you should probably head over to IAM and turn it on. One of the great advantages of building in the cloud is that the infrastructure and tools continue to get stronger over time and IAM Access Analyzer is a great example. Did I mention that it’s free? Fire it up, then send me a tweet sharing some of the interesting things you find. As always, happy building!

— Brandon

Harnessing the Power of the People: Cloudflare’s First Security Awareness Month Design Challenge Winners

Post Syndicated from Jacqueline Keith original https://blog.cloudflare.com/cloudflare-security-awareness-month-design-challenge/

Harnessing the Power of the People: Cloudflare’s First Security Awareness Month Design Challenge Winners

Grabbing the attention of employees at a security and privacy-focused company on security awareness presents a unique challenge: how do you get people who are already thinking about security all day to think about it some more? October marked Cloudflare’s first Security Awareness Month as a public company, and to celebrate, the security team challenged our entire company population to create graphics, slogans, and memes to encourage us all to think and act more securely every day.

Employees approached this challenge with gusto; global participation meant plenty of high-quality submissions to vote on. In addition to being featured here, the winning designs will be displayed in Cloudflare offices throughout 2020, and the creators will be on the decision panel for next year’s winners. Three rose to the top, highlighting creativity and style that is uniquely Cloudflarian. I sat down with the winners to talk through their thoughts on security and what all companies can do to drive awareness.

Eugene Wang, Design Team, First Place

Harnessing the Power of the People: Cloudflare’s First Security Awareness Month Design Challenge Winners


Sílvia Flores, Executive Assistant, Second Place

Harnessing the Power of the People: Cloudflare’s First Security Awareness Month Design Challenge Winners


Scott Jones, e-Learning Developer, Third Place

Security Haiku

Wipe that whiteboard clean
Visitors may come and see
Secrets not for them

No tailgating please
You may be a nice person
But I don’t know that

1. What inspired your design?

Eugene: The friendly “Welcome” cloud seen in our all company slides was a jumping off point. It seemed like a great character that embodied being a Cloudflarian and had tons of potential to get into adventures. I also wanted security to be a bit fun, where appropriate. Instead of a serious breach (though it could be), here it was more a minor annoyance personified by a wannabe-sneaky alligator. Add a pun, and there you go—poster design!

Sílvia: What inspired my design was the cute Cloudflare mascot the otter since there are so many otters in SF. Also, security can be fun and I added a pun for all the employees to remember the security system in an entertaining and respectful way. This design is very much my style and I believe making things cute and bright can really grab attention from people who are so busy in their work. A bright, orange, leopard print poster cannot be missed!

Scott: I have always loved the haiku form and poems were allowed!

2. What’s the number one thing security teams can do to get non-security people excited about security?

Eugene: Make them realize and identify the threats that can happen everyday, and their role in keeping things secure. Cute characters and puns help.

Sílvia: Make it more accessible for people to engage and understand it, possibly making more activities, content, and creating a fun environment for people to be aware but also be mindful.

Scott: Use whatever means available to keep the idea of being security conscious in everyone’s active awareness. This can and should be done in a variety of different ways so as to engage everyone in one way or another, visually with posters and signs, mentally by having contests, multi-sensory through B.E.E.R. meeting presentations and yes, even through a careful use of fear by periodically giving examples of what can happen if security is not followed…I believe that people like working here and believe in what we are doing and how we are doing it, so awareness mixed in with a little fear can reach people on a more visceral and personal level.

3. What’s your favorite security tip?

Eugene: Look at the destination of the return email.

Sílvia: LastPass. Oh my lord. I cannot remember one single password since we need to make them so difficult! With numbers, caps, symbols, emojis (ahaha). LastPass makes it easier for me to be secure and still be myself and not remembering any password without freaking out.

Scott: “See something, say something” because it both reflects our basic responsibility to each other and exhibits a pride that we have as being part of a company we believe in and want to protect.

For security practitioners and engagement professionals, it’s easy to try to boil the ocean when Security Awareness Month comes around. The list of potential topics and guidance is endless. Focusing on two or three key messages, gauging the maturity of your organization, and encouraging participation makes it a company-wide effort. Extra recognition and glory for those who go over and above never hurts either.

Want to run a security awareness design contest at your company? Reach out to us at [email protected] for tips and best practices for getting started, garnering support, and encouraging participation.


How to get started with security response automation on AWS

Post Syndicated from Cameron Worrell original https://aws.amazon.com/blogs/security/how-get-started-security-response-automation-aws/

At AWS, we encourage you to use automation to help quickly detect and respond to security events within your AWS environments. In addition to increasing the speed of detection and response, automation also helps you scale your security operations as you expand your workloads running on AWS. For these reasons, security automation is a key principle outlined in both the Well-Architected and Cloud Adoption frameworks as well as in the AWS Security Incident Response Guide.

In this blog post, you’ll learn to implement automated security response mechanisms within your AWS environments. This post will include common patterns, implementation considerations, and an example solution. Security response automation is a broad topic that spans many areas. The goal of this blog post is to introduce you to core concepts and help you get started.

A word from our lawyers: Please note that you are responsible for making your own independent assessment of the information in this post. This post: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers, or licensors.

What is security response automation?

Security response automation is a planned and programmed action taken to achieve a desired state for an application or resource based on a condition or event. When you implement security response automation, you should adopt an approach that draws from existing security frameworks. Frameworks are published materials consisting of standards, guidelines, and best practices that help organizations manage cybersecurity-related risk. Using frameworks helps you achieve consistency and scalability and enables you to focus more on the strategic aspects of your security program. You should work with compliance professionals within your organization to understand any specific security frameworks that may also be relevant for your AWS environment.

Our example solution is based on the NIST Cybersecurity Framework (CSF), which is designed to help organizations assess and improve their ability to prevent, detect, and respond to security events. According to the CSF, “cybersecurity incident response” supports your ability to contain the impact of potential cybersecurity incidents. Although automation is not a CSF requirement, automating responses to events enables you to create repeatable, predictable approaches to monitoring and responding to threats.

The five main steps in the CSF are identify, protect, detect, respond and recover. We’ve expanded the detect and respond steps to include automation and investigation activities.
 

Figure 1: The five steps in the CSF

The following definitions for each step in the diagram above are based on the CSF but have been adapted for our example in this blog post. Although we will focus on the detect, automate and respond steps, it’s important to understand the entire process flow.

  • Identify: Identify and understand the resources, applications, and data within your AWS environment.
  • Protect: Develop and implement appropriate controls and safeguards to ensure delivery of services.
  • Detect: Develop and implement appropriate activities to identify the occurrence of a cybersecurity event. This step includes the implementation of monitoring capabilities which will be discussed further in the next section.
  • Automate: Develop and implement planned, programmed actions that will achieve a desired state for an application or resource based on a condition or event.
  • Investigate: Perform a systematic examination of the security event to establish the root cause.
  • Respond: Develop and implement appropriate activities to take automated or manual actions regarding a detected security event.
  • Recover: Develop and implement appropriate activities to maintain plans for resilience and to restore any capabilities or services that were impaired due to a security event.

Security response automation on AWS

AWS CloudTrail, AWS Config, and Amazon EventBridge continuously record details about the resources and configuration changes in your AWS account. You can use this information to automatically detect resource changes and to react to deviations from your desired state.
 

Figure 2: Automated remediation flow

As shown in the diagram above, an automated remediation flow on AWS has three stages:

  • Monitor: Your automated monitoring tools collect information about resources and applications running in your AWS environment. For example, they might collect AWS CloudTrail information about activities performed in your AWS account, usage metrics from your Amazon EC2 instances, or flow log information about the traffic going to and from network interfaces in your Amazon Virtual Private Cloud (VPC).
  • Detect: When a monitoring tool detects a predefined condition—such as a breached threshold, anomalous activity, or configuration deviation—it raises a flag within the system. A triggering condition might be an anomalous activity detected by Amazon GuardDuty, a resource becoming out of compliance with an AWS Config Rule, or a high rate of blocked requests on an Amazon VPC security group or AWS WAF web access control list.
  • Respond: When a condition is flagged, an automated response is triggered that performs an action you’ve predefined—something intended to remediate or mitigate the flagged condition. Examples of automated response actions might include modifying a VPC security group, patching an Amazon EC2 instance, or rotating credentials.

You can use the event-driven flow described above to achieve many automated response patterns with varying degrees of complexity. Your response pattern could be as simple as invoking a single AWS Lambda function, or it could be a complex series of AWS Step Functions tasks with advanced logic. In this blog post, we’ll use two simple Lambda functions in our example solution.

How to define your response automation

Now that we’ve introduced the concept of security response automation, start thinking about security requirements within your environment that you’d like to enforce through automation. These design requirements might come from general best practices you’d like to follow, or they might be specific controls from compliance frameworks relevant for your business. Regardless, your objectives should be quantitative, not qualitative. Here are some examples of quantitative objectives:

  • Remote administrative network access to servers should be limited.
  • Server storage volumes should be encrypted.
  • AWS console logins should be protected by multi-factor authentication.

As an optional step, you can expand these objectives into user stories that define the conditions and remediation actions when there is an event. User stories are informal descriptions that briefly document a feature within a software system. User stories may be global and span across multiple applications or they may be specific to a single application. For example:

“Remote administrative network access to servers should be limited. Remote access ports include SSH TCP port 22 and RDP TCP port 3389. If open remote access ports are detected within the environment, they should be automatically closed and the owner will be notified.”

Once you’ve completed your user story, you can determine how to use automated remediation to help achieve these objectives in your AWS environment. User stories should be stored in a location that provides versioning support and can reference the associated automation code.
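
As a concrete illustration of the user story above, here is a minimal sketch of the kind of remediation command such automation might run. The security group ID is a placeholder; a real automation would receive it from the triggering event and would notify the owner as well:

# close SSH (TCP port 22) that is open to the world on a flagged security group
aws ec2 revoke-security-group-ingress   \
    --group-id sg-0123456789abcdef0     \
    --protocol tcp                      \
    --port 22                           \
    --cidr 0.0.0.0/0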

You should carefully consider the effect of your remediation mechanisms in order to prevent unintended impact on your resources and applications. Remediation actions such as instance termination, credential revocation, and security group modification can adversely affect application availability. Depending on the level of risk that’s acceptable to your organization, your automated mechanism might only provide a notification which can then be manually investigated prior to remediation. Once you’ve identified an automated remediation mechanism, you can build out the required components and test them in a non-production environment.

Sample response automation walkthrough

In the following section, we’ll walk you through an automated remediation for a simulated event that indicates potential unauthorized activity—the unintended disabling of CloudTrail logging. Outside parties might want to disable logging to prevent detection and recording of their unauthorized activity. Our response is to re-enable the CloudTrail logging and immediately notify the security contact. Here’s the user story for this scenario:

“CloudTrail logging should be enabled for all AWS accounts and regions. If CloudTrail logging is disabled, it will automatically be enabled and the security operations team will be notified.”

Note: The sample response automation below references Amazon EventBridge, which extends and builds upon CloudWatch Events. Amazon EventBridge uses the same Amazon CloudWatch Events API, so the event structure and rules configuration are the same. This blog post uses base functionality that is identical in both EventBridge and CloudWatch Events.

Prerequisites

In order to use our sample remediation, you will need to enable Amazon GuardDuty and AWS Security Hub in the AWS Region you have selected. Both of these services include a 30-day free trial. See the AWS Security Hub pricing page and the Amazon GuardDuty pricing page for additional details.

Important: You’ll use AWS CloudTrail to test the sample remediation. Running more than one CloudTrail trail in your AWS account will result in charges based on the number of events processed while the trail is running. Charges for additional copies of management events recorded in a Region are applied based on the published pricing plan. To minimize the charges, follow the clean-up steps that we provide later in this post to remove the sample automation and delete the trail.

Deploy the sample response automation

In this section, we’ll show you how to deploy and test the CloudTrail logging remediation sample. Amazon GuardDuty generates the finding Stealth:IAMUser/CloudTrailLoggingDisabled when CloudTrail logging is disabled, and AWS Security Hub collects findings from GuardDuty using the standardized finding format mentioned earlier. We recommend that you deploy this sample into a non-production AWS account.

Select the Launch Stack button below to deploy a CloudFormation template with an automation sample in the us-east-1 Region. You can also download the template and implement it in another Region. The template consists of an Amazon EventBridge rule, an AWS Lambda function and the IAM permissions necessary for both components to execute. It takes several minutes for the CloudFormation stack build to complete.

Select this image to open a link that starts building the CloudFormation stack

  1. In the CloudFormation console, choose the Select Template form, and then select Next.
  2. On the Specify Details page, provide the email address for a security contact. (For the purpose of this walkthrough, it should be an email address you have access to.) Then select Next.
  3. On the Options page, accept the defaults, then select Next.
  4. On the Review page, confirm the details, then select Create.
  5. While the stack is being created, check the inbox of the email address you provided in step 2. Look for an email message with the subject AWS Notification – Subscription Confirmation. Select the link in the body of the email to confirm your subscription to the Amazon Simple Notification Service (Amazon SNS) topic. You should see a success message similar to the screenshot below:
     
    Figure 3: SNS subscription confirmation

  6. Return to the CloudFormation console. Once the Status field for the CloudFormation stack changes to CREATE_COMPLETE (as shown in figure 4), the solution is implemented and is ready for testing.
     
    Figure 4: CREATE_COMPLETE status

Test the sample automation

You’re now ready to test the automated response by creating a test trail in CloudTrail, then trying to stop it.

  1. From the AWS Management Console, choose Services > CloudTrail.
  2. Select Trails, then select Create Trail.
  3. On the Create Trail form:
    1. Enter a value for Trail name. We use test-trail in our example below.
    2. Under Management events, select Write-only (to minimize event volume).
       
      Figure 5: Create a CloudTrail trail

    3. Under Storage location, choose an existing S3 bucket or create a new one. Note that since S3 bucket names must be globally unique, you must add characters (such as a random string) to the name. For example: my-test-trail-bucket-<random-string>.
  4. On the Trails page of the CloudTrail console, verify that the new trail has started. You should see a green checkmark in the Status column, as shown in figure 6.
     
    Figure 6: Verify new trail has started

  5. You’re now ready to act like an unauthorized user trying to cover their tracks! Stop the logging for the trail you just created:
    1. Select the new trail name to display its configuration page.
    2. Toggle the Logging switch in the top-right corner to OFF.
    3. When prompted with a warning dialog box, select Continue.
    4. Verify that the Logging switch is now off, as shown below.
       
      Figure 7: Verify logging switch is off

      You have now simulated a security event by disabling logging for one of the trails in the CloudTrail service. Within the next few seconds, the near real-time automated response will detect the stopped trail, restart it, and send an email notification. You can refresh the Trails page of the CloudTrail console to verify that the trail’s status is ON again.

      Within the next several minutes, the investigatory automated response will also begin. GuardDuty will detect the action that stopped the trail and enrich the data about the source of unexpected behavior. Security Hub will then ingest that information and optionally correlate with other security events.

      Following the steps below, you can watch for Security Hub to generate a finding with the type TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled:

  6. In the AWS Management Console, choose Services > Security Hub
    1. Select Findings in the left pane.
    2. Select the Add filters field, then select Type.
    3. Select EQUALS, paste TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled into the field, then select Apply.
    4. Refresh your browser periodically until the finding is generated.
      Figure 8: Monitor Security Hub for your finding

While you wait on that detection, let’s dig into the components of automation.

How the sample automation works

This example incorporates two automated responses: a near real-time workflow and an investigatory workflow. The near real-time workflow provides a rapid response to an individual event, in this case the stopping of a trail. The goal is to restore the trail to a functioning state and alert security responders as quickly as possible. The investigatory workflow still includes a response to provide defense in depth and also uses services that support a more in-depth investigation of the incident.

Figure 9: Sample automation workflow

In the near real-time workflow, Amazon EventBridge monitors for the undesired activity. When a trail is stopped, AWS CloudTrail publishes an event on the EventBridge bus. An EventBridge rule detects the trail-stopping event and invokes a Lambda function to respond to the event by restarting the trail and notifying the security contact via an Amazon Simple Notification Service (SNS) topic.
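
As a sketch of how such a rule might be wired up by hand (the rule name and Lambda function ARN are placeholders; the CloudFormation template in this post creates the equivalent resources, including the permission EventBridge needs to invoke the function):

# match the StopLogging API call recorded by CloudTrail
aws events put-rule                  \
    --name cloudtrail-stopped-rule   \
    --event-pattern '{
      "source": ["aws.cloudtrail"],
      "detail-type": ["AWS API Call via CloudTrail"],
      "detail": {"eventName": ["StopLogging"]}
    }'

# send matching events to the remediation Lambda function
aws events put-targets               \
    --rule cloudtrail-stopped-rule   \
    --targets Id=1,Arn=arn:aws:lambda:us-east-1:111122223333:function:restart-trail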

In the investigative workflow, CloudTrail logs are monitored for undesired activities. For example, if a trail is stopped, there will be a corresponding log record. GuardDuty detects this activity and retrieves additional data points about the source IP that executed the API call. Two common examples of those additional data points in GuardDuty findings include whether the API call came from an IP address on a threat list, or whether it came from a network not commonly used in your AWS account. An AWS Lambda function responds by restarting the trail and notifying the security contact. Finally, the finding is imported into AWS Security Hub for additional investigation.

AWS Security Hub imports findings from AWS security services such as GuardDuty, Amazon Macie and Amazon Inspector, plus from any third-party product integrations you’ve enabled. All findings are provided to Security Hub in AWS Security Finding Format, which eliminates the need for data conversion. Security Hub correlates these findings to help you identify related security events and determine a root cause. Security Hub also publishes its findings to Amazon EventBridge to enable further processing by other AWS services such as AWS Lambda.

Respond step deep dive

Amazon EventBridge and AWS Lambda work together to respond to a security finding. Amazon EventBridge is a service that provides real-time access to changes in data in AWS services, your own applications, and Software-as-a-Service (SaaS) applications without writing code. In this example, EventBridge identifies a Security Hub finding that requires action and invokes a Lambda function that performs remediation. As shown in figure 10, the Lambda function both notifies the security operator via SNS and restarts the stopped CloudTrail.

Figure 10: Sample "respond" workflow

To set this response up, we looked for an event to indicate that a trail had stopped or was disabled. We knew that the GuardDuty finding Stealth:IAMUser/CloudTrailLoggingDisabled is raised when CloudTrail logging is disabled. Therefore, we configured the default event bus to look for this event. You can learn more about all of the available GuardDuty findings in the user guide.

How the code works

When Security Hub publishes a finding to EventBridge, it includes full details of the incident as discovered by GuardDuty. The finding is published in JSON format. If you review the details of the sample finding, note that it has several fields helping you identify the specific events that you’re looking for. Here are some of the relevant details:


{
   …
   "source": "aws.securityhub",
   …
   "detail": {
      "findings": [{
         …
         "Types": [
            "TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"
         ],
         …
      }]
   }
}

You can build an event pattern using these fields, which an EventBridge filtering rule can then use to identify events and to invoke the remediation Lambda function. Below is a snippet from the CloudFormation template we provided earlier that defines that event pattern for the EventBridge filtering rule:


# pattern matches the nested JSON format of a specific Security Hub finding
      EventPattern:
        source:
          - aws.securityhub
        detail-type:
          - "Security Hub Findings - Imported"
        detail:
          findings:
            Types:
              - "TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"

Once the rule is in place, EventBridge continuously scans the event bus for this pattern. When EventBridge finds a match, it invokes the remediating Lambda function and passes the full details of the event to the function. The Lambda function then parses the JSON fields in the event so that it can act as shown in this Python code snippet:


# extract trail ARN by parsing the incoming Security Hub finding (in JSON format)
trailARN = event['detail']['findings'][0]['ProductFields']['action/awsApiCallAction/affectedResources/AWS::CloudTrail::Trail']   

# description contains useful details to be sent to security operations
description = event['detail']['findings'][0]['Description']

The code also issues a notification to security operators so they can review the findings and insights in Security Hub and other services to better understand the incident and to decide whether further manual actions are warranted. Here’s the code snippet that uses SNS to send out a note to security operators:


#Sending the notification that the AWS CloudTrail has been disabled.
snspublish = snsclient.publish(
	TargetArn = snsARN,
	Message="Automatically restarting CloudTrail logging.  Event description: \"%s\" " %description
	)

While notifications to human operators are important, the Lambda function will not wait to take action. It immediately remediates the condition by restarting the stopped trail in CloudTrail. Here’s a code snippet that restarts the trail to reenable logging:


#Enabling the AWS CloudTrail logging
try:
	client = boto3.client('cloudtrail')
	enablelogging = client.start_logging(Name=trailARN)
	logger.debug("Response on enable CloudTrail logging- %s" %enablelogging)
except ClientError as e:
	logger.error("An error occurred: %s" %e)

After the trail has been restarted, API activity is once again logged and can be audited. This can help provide relevant data for the remaining steps in the incident response process. The data is especially important for the post-incident phase, when your team analyzes lessons learned to prevent future incidents. You can also use this phase to identify additional steps to automate in your incident response.

Clean up

After you’ve completed the sample security response automation, we recommend that you remove the resources created in this walkthrough example from your account in order to minimize any charges associated with the trail in CloudTrail and data stored in S3.

Important: Deleting resources in your account can negatively impact the applications running in your AWS account. Verify that applications and AWS account security do not depend on the resources you’re about to delete.

Here are the clean-up steps:

  1. Delete the CloudFormation stack.
  2. Delete the trail you created in CloudTrail.
  3. If you created an S3 bucket for CloudTrail logs, you can also delete that S3 bucket.
  4. New accounts can try GuardDuty at no cost for 30 days. You can suspend or disable GuardDuty before the free trial period ends to avoid charges.
  5. Security Hub comes with a 30-day free trial. You can avoid charges by disabling the service before the trial period is over.

Summary

You’ve learned the basic concepts and considerations behind security response automation on AWS and how to use Amazon EventBridge, Amazon GuardDuty and AWS Security Hub to automatically re-enable AWS CloudTrail when it becomes disabled unexpectedly. As a next step, you may want to start building your own response automations and dive deeper into the AWS Security Incident Response Guide, NIST Cybersecurity Framework (CSF) or the AWS Cloud Adoption Framework (CAF) Security Perspective. You can explore additional automatic remediation solutions on the AWS Security Blog. You can find the code used in this example on GitHub.

If you have feedback about this blog post, submit it in the Comments section below. If you have questions about using this solution, start a thread in the EventBridge, GuardDuty, or Security Hub forums, or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Cameron Worrell

Cameron is a Solutions Architect with a passion for security and enterprise transformation. He joined AWS in 2015.

Alex Tomic

Alex is an AWS Enterprise Solutions Architect focused on security and compliance. He joined AWS in 2014.

Nathan Case

Nathan is a Senior Security Strategist, and joined AWS in 2016. He is always interested to see where our customers plan to go and how we can help them get there. He is also interested in intel, combined data lake sharing opportunities, and open source collaboration. In the end Nathan loves technology and that we can change the world to make it a better place.

Introducing Flan Scan: Cloudflare’s Lightweight Network Vulnerability Scanner

Post Syndicated from Nadin El-Yabroudi original https://blog.cloudflare.com/introducing-flan-scan/

Introducing Flan Scan: Cloudflare’s Lightweight Network Vulnerability Scanner

Today, we’re excited to open source Flan Scan, Cloudflare’s in-house lightweight network vulnerability scanner. Flan Scan is a thin wrapper around Nmap that converts this popular open source tool into a vulnerability scanner with the added benefit of easy deployment.

We created Flan Scan after two unsuccessful attempts at using “industry standard” scanners for our compliance scans. A little over a year ago, we were paying a big vendor for their scanner until we realized it was one of our highest security costs and many of its features were not relevant to our setup. It became clear we were not getting our money’s worth. Soon after, we switched to an open source scanner and took on the task of managing its complicated setup. That made it difficult to deploy to our entire fleet of more than 190 data centers.

We had a deadline at the end of Q3 to complete an internal scan for our compliance requirements but no tool that met our needs. Given our history with existing scanners, we decided to set off on our own and build a scanner that worked for our setup. To design Flan Scan, we worked closely with our auditors to understand the requirements of such a tool. We needed a scanner that could accurately detect the services on our network and then lookup those services in a database of CVEs to find vulnerabilities relevant to our services. Additionally, unlike other scanners we had tried, our tool had to be easy to deploy across our entire network.

We chose Nmap as our base scanner because, unlike other network scanners which sacrifice accuracy for speed, it prioritizes detecting services, thereby reducing false positives. We also liked Nmap because of the Nmap Scripting Engine (NSE), which allows scripts to be run against the scan results. We found that the “vulners” script, available on NSE, mapped the detected services to relevant CVEs from a database, which is exactly what we needed.

The next step was to make the scanner easy to deploy while ensuring it outputted actionable and valuable results. We added three features to Flan Scan which helped package up Nmap into a user-friendly scanner that can be deployed across a large network.

  • Easy Deployment and Configuration – To create a lightweight scanner with easy configuration, we chose to run Flan Scan inside a Docker container. As a result, Flan Scan can be built and pushed to a Docker registry (see the sketch after this list) and maintains the flexibility to be configured at runtime. Flan Scan also includes sample Kubernetes configuration and deployment files with a few placeholders so you can get up and scanning quickly.
  • Pushing Results to the Cloud – Flan Scan adds support for pushing results to a Google Cloud Storage Bucket or an S3 bucket. All you need to do is set a few environment variables and Flan Scan will do the rest. This makes it possible to run many scans across a large network and collect the results in one central location for processing.
  • Actionable Reports – Flan Scan generates actionable reports from Nmap’s output so you can quickly identify vulnerable services on your network, the applicable CVEs, and the IP addresses and ports where these services were found. The reports are useful for engineers following up on the results of the scan as well as auditors looking for evidence of compliance scans.
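
As a sketch, building and pushing the container follows the standard Docker workflow (the registry URL is a placeholder; see the README for Flan Scan’s actual configuration options):

docker build -t registry.example.com/flan-scan:latest .
docker push registry.example.com/flan-scan:latest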

Introducing Flan Scan: Cloudflare’s Lightweight Network Vulnerability Scanner
Sample run of Flan Scan from start to finish. 

How has Flan Scan improved Cloudflare’s network security?

By the end of Q3, not only had we completed our compliance scans, we also used Flan Scan to tangibly improve the security of our network. At Cloudflare, we pin the software version of some services in production because it allows us to prioritize upgrades by weighing the operational cost of upgrading against the improvements of the latest version. Flan Scan’s results revealed that our FreeIPA nodes, used to manage Linux users and hosts, were running an outdated version of Apache with several medium severity vulnerabilities. As a result, we prioritized their update. Flan Scan also found a vulnerable instance of PostgreSQL leftover from a performance dashboard that no longer exists.

Flan Scan is part of a larger effort to expand our vulnerability management program. We recently deployed osquery to our entire network to perform host-based vulnerability tracking. By complementing osquery’s findings with Flan Scan’s network scans, we are working towards comprehensive visibility of the services running at our edge and their vulnerabilities. With two vulnerability trackers in place, we decided to build a tool to manage the increasing number of vulnerability sources. Our tool sends alerts on new vulnerabilities, filters out false positives, and tracks remediated vulnerabilities. Flan Scan’s valuable security insights were a major impetus for creating this vulnerability tracking tool.

How does Flan Scan work?

Introducing Flan Scan: Cloudflare’s Lightweight Network Vulnerability Scanner

The first step of Flan Scan is running an Nmap scan with service detection. Flan Scan’s default Nmap scan runs the following scans:

  1. ICMP ping scan – Nmap determines which of the IP addresses given are online.
  2. SYN scan – Nmap scans the 1000 most common ports of the IP addresses that responded to the ICMP ping. Nmap marks ports as open, closed, or filtered.
  3. Service detection scan – To detect which services are running on open ports, Nmap performs TCP handshake and banner grabbing scans.

Other types of scanning, such as UDP scans and scans of IPv6 addresses, are also possible with Nmap. Flan Scan allows users to run these and any other extended features of Nmap by passing in Nmap flags at runtime.
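
As a rough sketch of the kind of command Flan Scan wraps (the target range is a placeholder, and this is an approximation rather than Flan Scan’s exact invocation):

# SYN scan of the 1000 most common ports with service detection;
# hosts are ping-scanned first by default
nmap -sS -sV 10.0.0.0/24

Flan Scan’s default command also adds the “vulners” script, described below.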

Introducing Flan Scan: Cloudflare’s Lightweight Network Vulnerability Scanner
Sample Nmap output

Flan Scan adds the “vulners” script in its default Nmap command to include in the output a list of vulnerabilities applicable to the services detected. The vulners script works by making API calls to a service run by vulners.com, which returns any known vulnerabilities for the given service.

Introducing Flan Scan: Cloudflare’s Lightweight Network Vulnerability Scanner
Sample Nmap output with Vulners script

The next step of Flan Scan uses a Python script to convert the structured XML of Nmap’s output into an actionable report. The reports of the previous scanner we used listed each of the IP addresses scanned and presented the vulnerabilities applicable to that location. Since we had multiple IP addresses running the same service, the report would repeat the same list of vulnerabilities under each of these IP addresses. This meant scrolling back and forth on documents hundreds of pages long to obtain a list of all IP addresses with the same vulnerabilities. The results were impossible to digest.

Flan Scan’s results are structured around services. The report enumerates all vulnerable services, with a list beneath each one of the relevant vulnerabilities and all IP addresses running this service. This structure makes the report shorter and actionable, since the services that need to be remediated can be clearly identified. Flan Scan reports are made using LaTeX because who doesn’t like nicely formatted reports that can be generated with a script? The raw LaTeX file that Flan Scan outputs can be converted to a beautiful PDF by using tools like pdflatex or TeXShop.
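
For example, a one-line sketch (the report filename is a hypothetical placeholder):

pdflatex report.tex    # writes report.pdf to the current directory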

Introducing Flan Scan: Cloudflare’s Lightweight Network Vulnerability Scanner
Sample Flan Scan report

What’s next?

Cloudflare’s mission is to help build a better Internet for everyone, not just Internet giants who can afford to buy expensive tools. We’re open sourcing Flan Scan because we believe it shouldn’t cost tons of money to have strong network security.

You can get started running a vulnerability scan on your network in a few minutes by following the instructions on the README. We welcome contributions and suggestions from the community.

Even faster connection establishment with QUIC 0-RTT resumption

Post Syndicated from Alessandro Ghedini original https://blog.cloudflare.com/even-faster-connection-establishment-with-quic-0-rtt-resumption/

Even faster connection establishment with QUIC 0-RTT resumption

One of the more interesting features introduced by TLS 1.3, the latest revision of the TLS protocol, was the so-called “zero roundtrip time connection resumption”, a mode of operation that allows a client to start sending application data, such as HTTP requests, without having to wait for the TLS handshake to complete, thus reducing the latency penalty incurred in establishing a new connection.

The basic idea behind 0-RTT connection resumption is that if the client and server had previously established a TLS connection between each other, they can use information cached from that session to establish a new one without having to negotiate the connection’s parameters from scratch. Notably this allows the client to compute the private encryption keys required to protect application data before even talking to the server.

However, in the case of TLS, “zero roundtrip” only refers to the TLS handshake itself: the client and server are still required to first establish a TCP connection in order to be able to exchange TLS data.

Even faster connection establishment with QUIC 0-RTT resumption

Zero means zero

QUIC goes a step further, and allows clients to send application data in the very first roundtrip of the connection, without requiring any other handshake to be completed beforehand.

Even faster connection establishment with QUIC 0-RTT resumption

After all, QUIC already shaved a full round-trip off of a typical connection’s handshake by merging the transport and cryptographic handshakes into one. By reducing the handshake by an additional roundtrip, QUIC achieves real 0-RTT connection establishment.

It literally can’t get any faster!

Attack of the clones

Unfortunately, 0-RTT connection resumption is not all smooth sailing, and it comes with caveats and risks, which is why Cloudflare does not enable 0-RTT connection resumption by default. Users should consider the risks involved and decide whether to use this feature or not.

For starters, 0-RTT connection resumption does not provide forward secrecy, meaning that a compromise of the secret parameters of a connection will trivially allow compromising the application data sent during the 0-RTT phase of new connections resumed from it. Data sent after the 0-RTT phase, meaning after the handshake has been completed, would still be safe though, as TLS 1.3 (and QUIC) will still perform the normal, forward-secret key exchange for that data.

More worryingly, application data sent during 0-RTT can be captured by an on-path attacker and then replayed multiple times to the same server. In many cases this is not a problem, as the attacker wouldn’t be able to decrypt the data, which is why 0-RTT connection resumption is useful, but in some cases this can be dangerous.

For example, imagine a bank that allows an authenticated user (e.g. using HTTP cookies, or other HTTP authentication mechanisms) to send money from their account to another user by making an HTTP request to a specific API endpoint. If an attacker was able to capture that request when 0-RTT connection resumption was used, they wouldn’t be able to see the plaintext and get the user’s credentials, because they wouldn’t know the secret key used to encrypt the data; however they could still potentially drain that user’s bank account by replaying the same request over and over:

[Diagram: an on-path attacker replaying a captured 0-RTT money-transfer request]

Of course this problem is not specific to banking APIs: any non-idempotent request has the potential to cause undesired side effects, ranging from slight malfunctions to serious security breaches.

In order to help mitigate this risk, Cloudflare will always reject 0-RTT requests that are obviously not idempotent (like POST or PUT requests), but in the end it’s up to the application sitting behind Cloudflare to decide which requests can and cannot be allowed with 0-RTT connection resumption, as even innocuous-looking ones can have side effects on the origin server.

To help origins detect and potentially disallow specific requests, Cloudflare also follows the techniques described in RFC 8470. Notably, Cloudflare adds the Early-Data: 1 HTTP header to requests received during 0-RTT resumption that are forwarded to origins.

Origins able to understand this header can then decide to answer the request with the 425 (Too Early) HTTP status code, which instructs the client that originated the request to retry sending the same request, but only after the TLS or QUIC handshake has fully completed, at which point there is no longer any risk of replay attacks. This could even be implemented as part of a Cloudflare Worker, as sketched below.
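
As a rough illustration, here is a minimal sketch of such a Worker in TypeScript (types from @cloudflare/workers-types). The endpoint path and the blocking rule are hypothetical examples for this post, not Cloudflare defaults:

addEventListener('fetch', (event: FetchEvent) => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request: Request): Promise<Response> {
  // Cloudflare sets Early-Data: 1 on requests it received during 0-RTT.
  const isEarlyData = request.headers.get('Early-Data') === '1';
  const { pathname } = new URL(request.url);

  // Hypothetical rule: never let 0-RTT data reach a money-transfer API.
  if (isEarlyData && pathname.startsWith('/api/transfer')) {
    // 425 (Too Early) tells the client to retry after the handshake completes.
    return new Response('Too Early', { status: 425 });
  }

  // Everything else proceeds to the origin as usual.
  return fetch(request);
}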


This makes it possible for origins to allow 0-RTT requests for endpoints that are safe, such as a website’s index page, which is where 0-RTT is most useful, as that is typically the first request a browser makes after establishing a connection, while still protecting other endpoints such as APIs and form submissions. But if an origin does not provide any such non-idempotent endpoints, no action is required.

One stop shop for all your 0-RTT needs

Just like we previously did for TLS 1.3, we now support 0-RTT resumption for QUIC as well. In honor of this event, we have dusted off the user-interface controls that allow Cloudflare users to enable this feature for their websites, and introduced a dedicated toggle to control whether 0-RTT connection resumption is enabled or not, which can be found under the “Network” tab on the Cloudflare dashboard:

[Screenshot: the 0-RTT Connection Resumption toggle under the “Network” tab of the Cloudflare dashboard]

When TLS 1.3 and/or QUIC (via the HTTP/3 toggle) are enabled, 0-RTT connection resumption will be automatically offered to clients that support it, and the replay mitigation mentioned above will also be applied to the connections making use of this feature.

In addition, if you are a user of our open-source HTTP/3 patch for NGINX, after updating the patch to the latest version, you’ll be able to enable support for 0-RTT connection resumption in your own NGINX-based HTTP/3 deployment by using the built-in “ssl_early_data” option, which will work for both TLS 1.3 and QUIC+HTTP/3.

Log every request to corporate apps, no code changes required

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/log-every-request-to-corporate-apps-no-code-changes-required/


When a user connects to a corporate network through an enterprise VPN client, this is what the VPN appliance logs:

[Screenshot: a VPN appliance log capturing only the user’s initial connection event]

The administrator of that private network knows the user opened the door at 12:15:05, but, in most cases, has no visibility into what they did next. Once inside that private network, users can reach internal tools, sensitive data, and production environments. Preventing this requires complicated network segmentation, and often server-side application changes. Logging the steps that an individual takes inside that network is even more difficult.

Cloudflare Access does not improve VPN logging; it replaces this model. Cloudflare Access secures internal sites by evaluating every request, not just the initial login, for identity and permission. Instead of a private network, administrators deploy corporate applications behind Cloudflare using our authoritative DNS. Administrators can then integrate their team’s SSO and build user and group-specific rules to control who can reach applications behind the Access Gateway.

When a request is made to a site behind Access, Cloudflare prompts the visitor to log in with an identity provider. Access then checks that user’s identity against the configured rules and, if permitted, allows the request to proceed. Access performs these checks on each request a user makes in a way that is transparent and seamless for the end user.

However, since the day we launched Access, our logging has resembled the screenshot above. We captured when a user first authenticated through the gateway, but that’s where it stopped. Starting today, we can give your team the full picture of every request made to every application.

We’re excited to announce that you can now capture logs of every request a user makes to a resource behind Cloudflare Access. In the event of an emergency, like a stolen laptop, you can now audit every URL requested during a session. Logs are standardized in one place, regardless of whether you use multiple SSO providers or secure multiple applications, and the Cloudflare Logpush platform can send them to your SIEM for retention and analysis.

Auditing every login

Cloudflare Access brings the speed and security improvements Cloudflare provides to public-facing sites and applies those lessons to the internal applications your team uses. For most teams, these were applications that traditionally lived behind a corporate VPN. Once a user joined that VPN, they were inside that private network, and administrators had to take additional steps to prevent users from reaching things they should not have access to.

Access flips this model by assuming that no user should be able to reach anything by default, applying a zero-trust approach to the internal tools your team uses. With Access, when any user requests the hostname of an application, the request hits Cloudflare first. We check to see if the user is authenticated and, if not, send them to your identity provider like Okta or Azure Active Directory. The user is prompted to log in, and Cloudflare then evaluates whether they are allowed to reach the requested application. All of this happens at the edge of our network before a request touches your origin, and for the user, it feels like the seamless SSO flow they’ve become accustomed to for SaaS apps.


When a user authenticates with your identity provider, we audit that event as a login and make those records available in our API. We capture the user’s email, their IP address, the time they authenticated, the method (for example, a Google SSO flow), and the application they were able to reach.


These logs can help you track every user who connected to an internal application, including contractors and partners who might use different identity providers. However, this logging stopped at authentication; Access did not capture the next steps a given user took.

Auditing every request

Cloudflare secures both external-facing sites and internal resources by triaging each request in our network before we ever send it to your origin. Products like our WAF enforce rules to protect your site from attacks like SQL injection or cross-site scripting. Likewise, Access identifies the principal behind each request by evaluating each connection that passes through the gateway.

Once a member of your team authenticates to reach a resource behind Access, we generate a token for that user that contains their SSO identity. The token is structured as a JSON Web Token (JWT), an open standard for signing and encrypting sensitive information. These tokens provide a secure and information-dense mechanism that Access can use to verify individual users. Cloudflare signs the JWT using a public and private key pair that we control, relying on RSA Signature with SHA-256 (RS256), an asymmetric algorithm, to perform that signature. We make the public key available so that you can validate the tokens’ authenticity as well.
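
As a rough sketch of what that validation could look like at an origin, here is a Node.js example using the open-source jose package; the team domain and audience tag are placeholder assumptions you would replace with your own values:

import { createRemoteJWKSet, jwtVerify } from 'jose';

const TEAM_DOMAIN = 'https://example.cloudflareaccess.com'; // assumption
const AUDIENCE = 'your-access-application-audience-tag';    // assumption

// Access publishes its public signing keys at a well-known certs endpoint.
const JWKS = createRemoteJWKSet(new URL(`${TEAM_DOMAIN}/cdn-cgi/access/certs`));

export async function verifyAccessJwt(token: string) {
  // jwtVerify checks the RS256 signature against Cloudflare's public keys
  // and validates the issuer and audience claims.
  const { payload } = await jwtVerify(token, JWKS, {
    issuer: TEAM_DOMAIN,
    audience: AUDIENCE,
  });
  return payload; // contains the authenticated user's SSO identity
}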

When a user requests a given URL, Access appends the user identity from that token as a request header, which we then log as the request passes through our network. Your team can collect these logs in your preferred third-party SIEM or storage destination by using the Cloudflare Logpush platform.

Cloudflare Logpush can be used to gather and send specific request headers from the requests made to sites behind Access. Once enabled, you can then configure the destination where Cloudflare should send these logs; a hedged sketch of creating such a job through the API follows. When enabled with the Access user identity field, the logs will export to your systems as JSON similar to the samples shown after the sketch.
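
As a sketch only (the job name and destination are invented, the field list mirrors the sample logs below, and details such as destination ownership validation are omitted; treat the exact options as assumptions to verify against the Logpush documentation), creating the job could look roughly like this:

async function createAccessLogpushJob(zoneId: string, apiToken: string) {
  const fields = [
    'ClientIP',
    'ClientRequestHost',
    'ClientRequestMethod',
    'ClientRequestURI',
    'ClientRequestUserAgent',
    'EdgeEndTimestamp',
    'EdgeResponseBytes',
    'EdgeResponseStatus',
    'EdgeStartTimestamp',
    'RayID',
    'RequestHeaders', // carries the cf-access-user identity header
  ].join(',');

  const response = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${zoneId}/logpush/jobs`,
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        name: 'access-request-logs',                                      // assumption
        dataset: 'http_requests',
        destination_conf: 's3://example-bucket/access?region=us-east-1',  // assumption
        logpull_options: `fields=${fields}&timestamps=rfc3339`,
        enabled: true,
      }),
    },
  );
  return response.json();
}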

{
   "ClientIP": "198.51.100.206",
   "ClientRequestHost": "jira.widgetcorp.tech",
   "ClientRequestMethod": "GET",
   "ClientRequestURI": "/secure/Dashboard/jspa",
   "ClientRequestUserAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
   "EdgeEndTimestamp": "2019-11-10T09:51:07Z",
   "EdgeResponseBytes": 4600,
   "EdgeResponseStatus": 200,
   "EdgeStartTimestamp": "2019-11-10T09:51:07Z",
   "RayID": "5y1250bcjd621y99",
   "RequestHeaders": {"cf-access-user": "srhea"}
}

{
   "ClientIP": "198.51.100.206",
   "ClientRequestHost": "jira.widgetcorp.tech",
   "ClientRequestMethod": "GET",
   "ClientRequestURI": "/browse/EXP-12",
   "ClientRequestUserAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
   "EdgeEndTimestamp": "2019-11-10T09:51:27Z",
   "EdgeResponseBytes": 4570,
   "EdgeResponseStatus": 200,
   "EdgeStartTimestamp": "2019-11-10T09:51:27Z",
   "RayID": "yzrCqUhRd6DVz72a",
   "RequestHeaders": {"cf-access-user": "srhea"}
}

In the example above, the user initially visited the splash page for a sample Jira instance. The next request was made to a specific Jira ticket, EXP-12, about one minute after the first request. With per-request logging, Access administrators can review each request a user made once authenticated in the event that an account is compromised or a device stolen.

The logs are consistent across all applications and identity providers. The same standard fields are captured when contractors log in with their Azure AD instance to your supply chain tool as when your internal users authenticate with Okta to your Jira. You can also augment the data above with other request details, like the TLS cipher used and WAF results.

How can this data be used?

The native logging capabilities of hosted applications vary wildly. Some tools provide more robust records of user activity, but others would require server-side code changes or workarounds to add this level of logging. Cloudflare Access can give your team the ability to skip that work and introduce logging in a single gateway that applies to all resources protected behind it.

The audit logs can be exported to third-party SIEM tools or S3 buckets for analysis and anomaly detection. The data can also be used for audit purposes in the event that a corporate device is lost or stolen. Security teams can then use these logs to recreate user sessions as they investigate.

What’s next?

Any enterprise customer with Logpush enabled can now use this feature at no additional cost. Instructions are available here to configure Logpush and additional documentation here to enable Access per-request logs.

How to enable encryption in a browser with the AWS Encryption SDK for JavaScript and Node.js

Post Syndicated from Spencer Janyk original https://aws.amazon.com/blogs/security/how-to-enable-encryption-browser-aws-encryption-sdk-javascript-node-js/

In this post, we’ll show you how to use the AWS Encryption SDK (“ESDK”) for JavaScript to handle an in-browser encryption workload for a hypothetical application. First, we’ll review some of the security and privacy properties of encryption, including the names AWS uses for the different components of a typical application. Then, we’ll discuss some of the reasons you might want to encrypt each of those components, with a focus on in-browser encryption, and we’ll describe how to perform that encryption using the ESDK. Lastly, we’ll talk about some of the security properties to be mindful of when designing an application, and where to find additional resources.

An overview of the security and privacy properties of encryption

Encryption is a technique that can restrict access to sensitive data by making it unreadable without a key. An encryption process takes data that is plainly readable or processable (“plaintext”) and uses principles of mathematics to obscure the contents so that it can’t be read without the use of a secret key. To preserve user privacy and prevent unauthorized disclosure of sensitive business data, developers need ways to protect sensitive data during the entire data lifecycle. Data needs to be protected from risks associated with unintentional disclosure as data flows between collection, storage, processing, and sharing components of an application. In this context, encryption is typically divided into two separate techniques: encryption at rest for storing data, and encryption in transit for moving data between entities or systems.

Many applications use encryption in transit to secure connections between their users and the services they provide, and then encrypt the data before it’s stored. However, as applications become more complex and data must be moved between more nodes and stored in more diverse places, there are more opportunities for data to be accidentally leaked or unintentionally disclosed. When a user enters their data in a browser, Transport Layer Security (TLS) can protect that data in transit between the user’s browser and a service endpoint. But in a distributed system, intermediary services between that endpoint and the service that processes that sensitive data might log or cache the data before transporting it. Encrypting sensitive data at the point of collection in the browser is a form of encryption at rest that minimizes the risk of unauthorized access and protects the data if it’s lost, stolen, or accidentally exposed. Encrypting data in the browser means that even if it’s completely exposed elsewhere, it’s unreadable and worthless to anyone without access to the key.

A typical web application

A typical web application will accept some data as input, process it, and then store it. When the user needs to access stored data, the data often follows the same path used when it was input. In our example, the path consists of the following components:

Figure 1: A hypothetical web application where the application is composed of an end-user interacting with a browser front-end, a third party which processes data received from the browser, processing is performed in Amazon EC2, and storage happens in Amazon S3

  1. An end-user interacts with the application using an interface in the browser.
  2. As data is sent to Amazon EC2, it passes through the infrastructure of a third party which could be an Internet Service Provider, an appliance in the user’s environment, or an application running in the cloud.
  3. The application on Amazon EC2 processes the data once it has been received.
  4. Once the application is done processing data, it is stored in Amazon S3 until it is needed again.

As data moves between components, TLS is used to prevent inadvertent disclosure. But what if one or more of these components is a third-party service that doesn’t need access to sensitive data? That’s where encryption at rest comes in.

Encryption at rest is available as a server-side, client-side, and client-side in-browser protection. Server-side encryption (SSE) is the most commonly used form of encryption with AWS customers, and for good reason: it’s easy to use because it’s natively supported by many services, such as Amazon S3. When SSE is used, the service that’s storing data will encrypt each piece of data with a key (a “data key”) when it’s received, and then decrypt it transparently when it’s requested by an authorized user. This has the benefit of being seamless for application developers because they only need to check a box in Amazon S3 to enable encryption, and it also adds an additional level of access control by having separate permissions to download an object and perform a decryption operation. However, there is a security/convenience tradeoff to consider, because the service will allow any role with the appropriate permissions to perform a decryption. For additional control, many AWS services—including S3—support the use of customer-managed AWS Key Management Service (AWS KMS) customer master keys (CMKs) that allow you to specify key policies or use grants or AWS Identity and Access Management (IAM) policies to control which roles or users have access to decryption, and when. Configuring permission to decrypt using customer-managed CMKs is often sufficient to satisfy compliance regimes that require “application-level encryption.”
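
As a minimal sketch of what this looks like from application code, here is an upload that requests SSE under a customer-managed CMK using the AWS SDK for JavaScript v3; the bucket, object key, and key ARN are placeholders:

import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-west-2' });

export async function putEncryptedReport(body: string) {
  // S3 generates a data key under the named CMK and encrypts the object
  // server-side; reads are transparent for principals allowed kms:Decrypt.
  await s3.send(new PutObjectCommand({
    Bucket: 'example-bucket',                                       // placeholder
    Key: 'reports/sensitive.json',                                  // placeholder
    Body: body,
    ServerSideEncryption: 'aws:kms',
    SSEKMSKeyId: 'arn:aws:kms:us-west-2:111122223333:key/EXAMPLE',  // placeholder CMK
  }));
}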

Some threat models or compliance regimes may require client-side encryption (CSE), which can add a powerful additional level of access control at the expense of additional complexity. As noted above, services perform server-side encryption on data after it has left the boundary of your application. TLS is used to secure the data in transit to the service, but some customers might want to only manage encrypt/decrypt operations within their application on EC2 or in the browser. Applications can use the AWS Encryption SDK to encrypt data within the application trust boundary before it’s sent to a storage service.

But what about a use case where customers don’t even want plaintext data to leave the browser? Or what if end-users input data that is passed through or logged by intermediate systems that belong to a third-party? It’s possible to create a separate application that only manages encryption to ensure that your environment is segregated, but using the AWS Encryption SDK for JavaScript allows you to encrypt data in an end-user browser before it’s ever sent to your application, so only your end-user will be able to view their plaintext data. As you can see in Figure 2 below, in-browser encryption can allow data to be safely handled by untrusted intermediate systems while ensuring its confidentiality and integrity.

Figure 2: A hypothetical web application with encryption where the application is composed of an end-user interacting with a browser front-end, a third party which processes data received from the browser, processing is performed in Amazon EC2, and storage happens in Amazon S3

  1. The application in the browser requests a data key to encrypt sensitive data entered by the user before it is passed to a third party.
  2. Because the sensitive data has been encrypted, the third party cannot read it. The third party may be an Internet Service Provider, an appliance in the user’s environment, an application running in the cloud, or a variety of other actors.
  3. The application on Amazon EC2 can make a request to KMS to decrypt the data key so the data can be decrypted, processed, and re-encrypted.
  4. The encrypted object is stored in Amazon S3, where a second encryption request is made so the object can also be encrypted when it is stored server-side.

How to encrypt in the browser

The first step of in-browser encryption is including a copy of the AWS Encryption SDK for JavaScript with the scripts you’re already sending to the user when they access your application. Once it’s present in the end-user environment, it’s available for your application to make calls. To perform the encryption, the ESDK requests from the cryptographic materials provider a data key that is used to encrypt, along with an encrypted copy of the data key that is stored with the object being encrypted. After a piece of data is encrypted within the browser, the ciphertext can be uploaded to your application backend for processing or storage. When a user needs to retrieve the plaintext, the ESDK reads the metadata attached to the ciphertext to determine the appropriate method to decrypt the data key and, if the user has access to the CMK, decrypts the data key and then uses it to decrypt the data.
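
As a sketch of that flow using the browser client’s v1-era API from the project README (the CMK ARN and credentials are placeholders; a real application would typically obtain scoped, temporary credentials, for example through Amazon Cognito):

import {
  KmsKeyringBrowser,
  KMS,
  getClient,
  encrypt,
  decrypt,
} from '@aws-crypto/client-browser';

// Placeholder CMK and credentials: supply your own.
const generatorKeyId = 'arn:aws:kms:us-west-2:111122223333:key/EXAMPLE';
const clientProvider = getClient(KMS, {
  credentials: { accessKeyId: 'EXAMPLE', secretAccessKey: 'EXAMPLE' },
});
const keyring = new KmsKeyringBrowser({ clientProvider, generatorKeyId });

export async function protectField(value: string): Promise<Uint8Array> {
  // encrypt() obtains a data key via the keyring, encrypts the plaintext,
  // and embeds an encrypted copy of the data key in the returned message.
  const { result } = await encrypt(keyring, new TextEncoder().encode(value), {
    encryptionContext: { field: 'demo' }, // authenticated metadata
  });
  return result; // ciphertext, safe to hand to untrusted intermediaries
}

export async function recoverField(ciphertext: Uint8Array): Promise<string> {
  // decrypt() reads the message header, asks KMS to unwrap the data key
  // (requires kms:Decrypt on the CMK), then decrypts the payload.
  const { plaintext } = await decrypt(keyring, ciphertext);
  return new TextDecoder().decode(plaintext);
}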

Important considerations

One common issue with browser-based applications is inconsistent feature support across different browser vendors and versions. For example, how will the application respond to browsers that lack native support for the strongest recommended cryptographic algorithm suites? Or, will there be a message or alternative mode if a user accesses the application using a browser that has JavaScript disabled? The ESDK for JavaScript natively supports a fallback mode, but it may not be appropriate for all use cases. Be sure to understand what kind of browser environments you will need to support to determine whether in-browser encryption is appropriate, and include support for graceful degradation if you expect limited browser support. Developers should also consider the ways that unauthorized users might monitor user actions via a browser extension, make unauthorized browser requests without user knowledge, or request a “downgraded” (less mathematically intensive) cryptographic operation.
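
One small building block for that decision is explicit feature detection before attempting any in-browser cryptography; the sketch below checks for WebCrypto, and what the fallback does is an application-level choice:

// Minimal feature detection: only attempt in-browser encryption when the
// WebCrypto API is actually available in this browser.
function supportsWebCrypto(): boolean {
  return typeof window !== 'undefined'
    && typeof window.crypto !== 'undefined'
    && typeof window.crypto.subtle !== 'undefined';
}

if (!supportsWebCrypto()) {
  // Graceful degradation is application-specific: for example, disable the
  // sensitive form and prompt for a supported browser, or fall back to TLS
  // plus server-side encryption if your threat model permits it.
}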

It’s always a good idea to have your application designs reviewed by security professionals. If you have an AWS Account Manager or Technical Account Manager, you can ask them to connect you with a Solutions Architect to review your design. If you’re an AWS customer but don’t have an account manager, consider visiting an AWS Loft to participate in our “Ask an Expert” program.

Where to learn more

If you have questions about this post, let us know in the Comments section below, or consult the AWS Encryption SDK Developer Forum. Because the Encryption SDK is open source, you can always contribute, open an issue, or ask questions on GitHub.

The AWS Encryption SDK for JavaScript is available at: https://github.com/awslabs/aws-encryption-sdk-javascript
Documentation is available at: https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/javascript.html

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Spencer Janyk

Spencer is a Senior Product Manager at Amazon Web Services working on data encryption and privacy. He has previously worked on vulnerability management and monitoring for enterprises and applying machine learning to challenges in ad tech, social media, diversity in recruiting, and talent management. Spencer holds a Master of Arts in Performance Studies from New York University and a Bachelor of Arts in Gender Studies from Whitman College.


Amanda Gray

Amanda is a Senior Security Engineer at Amazon Web Services on the Crypto Tools team. Previously, Amanda worked on application security and privacy by design, and she continues to promote these goals every day. Amanda holds Bachelors’ degrees in Physics and Computer Science from the University of Washington and Smith College respectively, and a Master’s degree in Physical Oceanography from the University of Washington.

AWS Security Profiles: Maritza Mills, Senior Product Manager, Perimeter Protection

Post Syndicated from Becca Crockett original https://aws.amazon.com/blogs/security/aws-security-profiles-maritza-mills-senior-product-manager-perimeter-protection/

In the weeks leading up to re:Invent 2019, we’ll share conversations we’ve had with people at AWS who will be presenting at the event so you can learn more about them and some of the interesting work that they’re doing.


How long have you been at AWS, and what do you do in your current role?

I’ve been at AWS almost two years. I’m a product manager for our Perimeter Protection team, which includes products like AWS Web Application Firewall (WAF), AWS Shield and AWS Firewall Manager. I spend a lot of my time talking with customers—primarily security specialists and network engineers—about how they can protect their web applications and how they can defend against Distributed Denial of Service (DDoS) attacks. My work is about deeply understanding the technical challenges customers are facing. I then use that information to inform what we need to build next, and then I work with our engineering team to figure out how we deliver it.

What’s the most challenging part of your job?

Deciding how to prioritize what we work on next. We have AWS customers with a lot of different needs, but we only have so much time in a day. My team has to balance the most pressing customer challenges along with the challenges we anticipate customers will face in the future, plus how quickly we’ll be able to deliver solutions to those challenges. I wish that we could do everything, all the time, but we have to make difficult choices about which things we’re going to do first.

What’s your favorite part of your job?

Constantly learning something new from our customers. A big part of what I do involves listening to customers to understand their most difficult technical challenges, and every customer is different. A customer in healthcare will have different needs from a customer in finance versus one in gaming. It’s exciting to learn about the different problems each customer faces. Even at the same company, different teams may have different goals and approaches to security. Often, I might educate customers on the tools currently available to fit their needs, but there are also times when the solution a customer needs has not been invented yet, and that’s when things really get interesting.

What does cloud security mean to you on a personal level?

When I think about security in the cloud, it’s about security for individual people. If you store data in the cloud, part of “security” is protecting access to your personal information, like your messages and photos, or credit card numbers, or personal healthcare data.

But it’s not just about preventing unauthorized access. It’s also about making sure that peoples’ data are available for them when they need it. One of the big things that we focus on in Perimeter Protection—particularly in AWS Shield—is protecting applications from denial of service attacks so that the applications are always available. This means that when you need to access the money in your bank account, or say, when a hospital needs to access vital information about a patient, the apps are always up and available. When I think about security and what we’re doing at scale here at AWS, that’s what’s most important to me on a personal level.

What’s the most common misperception you encounter about cloud security?

Sometimes, customers might be tempted to use blanket protections without thinking about why their particular application or business is unique, and what different protections they should put in place as a result.

Cloud security is an ongoing discipline that requires continuously monitoring your applications and updating your controls as your applications change. At AWS, we have this concept of the shared responsibility model, where AWS handles security of the cloud itself and customers are responsible for securing the applications which they run on the cloud. We’ve designed several tools to help customers manage that responsibility and adapt and scale as quickly as their applications do. In Perimeter Protection specifically, services like AWS Firewall Manager are designed to give our customers central visibility of their security controls, such as Amazon VPC security groups, AWS WAF rules, and AWS Shield Advanced protections. Services like Firewall Manager also constantly monitor these configurations so that customers can be alerted when changes have occurred.

I encourage customers to think carefully about how their applications will change over time, and how to best monitor and adjust to those changes as they occur.

What challenges do you currently see in the application security space, and how do you think the field will evolve to meet those challenges?

One challenge that I currently see is the pace of change, and the fact that customers need ways to keep up with these changes.

In the past, many security controls have been static—you set them up, and they don’t change. But as our customers have migrated into AWS, they’re able to operate in a more dynamic way and to scale up or down more quickly than they could before. At the same time, we’ve seen the techniques used to gain unauthorized access or to launch DDoS attacks scale and become more sophisticated. Here at AWS, we’re constantly looking ahead to anticipate how customers will need to actively monitor and secure their applications, and then we build those capabilities into our services.

Today, services like AWS Shield can automatically detect and mitigate DDoS attacks and provide you with alarms and the ability to continuously monitor your network flows. AWS WAF gives you the ability to write custom rules so you can create granular protections for your specific environment. We also provide you with information regarding security best practices so you can proactively architect your applications in a way that allows you to quickly react to new and unique attack vectors. That’s part of what we’ll be addressing in our upcoming re:Invent talk, as well.

You and Paul Oremland are leading a re:Invent session called A defense-in-depth approach to building web applications. What can you tell us about the session that’s not described in the catalog?

In this session, we’ll start by reviewing common security vulnerabilities, and then provide detailed examples of how to mitigate them at each layer of their application. I expect attendees will gain a better sense of how those layers fit together and how to think creatively about their individual security needs based on how they’ve architected their system, or based on their specific business case. Finally, I want all customers, from startups to enterprise, to understand how those challenges change as they scale. We’ll be touching on all of that.

It’s a 400-level session, so it’s a technical deep dive. It’s going to have a lot of good information for security specialists and engineers who want to have hands-on examples that they can go back and use. But I also want to encourage people who are exploring or are newer to this space to join us because even if the hands-on portion is a little too advanced, I think the strategy and philosophy of how to think about application security is going to be very relevant even to those less familiar with the subject matter, and to the work that they might do in the future.

What are you hoping that your audience will do differently as a result of attending?

I want to motivate attendees to perform a review of their current architecture and consider the current controls that they have in place. Then, I’d like them to ask themselves, “Why did I put this control here?” and “Do I know exactly what risk each control is mitigating?” I’d also like them to consider whether there are protections they’ve opted not to use in the past, and whether that decision is still an acceptable risk.

How did you choose your topic?

We developed it based on numerous conversations we’ve had with customers when they’re exploring how to protect their applications at the edge. But, we usually find that the conversation expands into other parts of the stack that need protection as well. One goal of this session is to talk about these needs up front, so that customers can come into conversations with us already knowing how they’d like to protect their entire application.

Any advice for first-time attendees coming to re:Invent?

Make sure you have enough time to get to your next session. There’s a lot of different things going on at re:Invent, and they take place in a lot of different buildings. While I think we do a great job with the schedule and spacing, first-time attendees should be aware that they might have a session in one building and then need to immediately be in another building for their next session. Factor that into your commute plans.

You enjoy discussing song lyrics. Whose lyrics have you enjoyed the most?

Rush is one of my favorite bands when it comes to lyricism. As a kid, the music was just interesting. But as I’ve gotten older, certain lines hit me differently.

In the song “Dreamline,” there’s a particular verse that says:

When we are young
Wandering the face of the earth
Wondering what our dreams might be worth
Learning that we’re only immortal
For a limited time

When I was younger, I really could relate to that feeling of immortality in a way, as if I was going to be around forever. But as I’ve gotten older, I’ve realized that life is very short and very precious, and I want to make the most of it. So I enjoy going back to that song every single time. It’s changed for me as I’ve grown.

And what song has created the lengthiest discussion for you?

I’ve had some great conversations about “Fast Car” by Tracy Chapman. The themes in that song are relatable to people in so many different ways, and at different times in their lives. One of the great things about song lyrics is that the way people interpret a song is influenced by their personal experiences in life, and this song in particular has always opened up meaningful conversations for me.

Want more AWS Security news? Follow us on Twitter.

The AWS Security team is hiring! Want to find out more? Check out our career page.


Maritza Mills

Maritza is a Senior Product Manager for AWS WAF, Shield and Firewall Manager.

Going Keyless Everywhere

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/going-keyless-everywhere/


Time flies. The Heartbleed vulnerability was discovered just over five and a half years ago. Heartbleed became a household name not only because it was one of the first bugs with its own web page and logo, but because of what it revealed about the fragility of the Internet as a whole. With Heartbleed, one tiny bug in a cryptography library exposed the personal data of the users of almost every website online.

Heartbleed is an example of an underappreciated class of bugs: remote memory disclosure vulnerabilities. High-profile examples other than Heartbleed include Cloudbleed and, most recently, NetSpectre. These vulnerabilities allow attackers to extract secrets from servers by simply sending them specially-crafted packets. Cloudflare recently completed a multi-year project to make our platform more resilient against this category of bug.

For the last five years, the industry has been dealing with the consequences of the design that led to Heartbleed being so impactful. In this blog post we’ll dig into memory safety, and how we re-designed Cloudflare’s main product to protect private keys from the next Heartbleed.

Memory Disclosure

Perfect security is not possible for businesses with an online component. History has shown us that no matter how robust their security program, an unexpected exploit can leave a company exposed. One of the more famous recent incidents of this sort is Heartbleed, a vulnerability in a commonly used cryptography library called OpenSSL that exposed the inner details of millions of web servers to anyone with a connection to the Internet. Heartbleed made international news, caused millions of dollars of damage, and still hasn’t been fully resolved.

Typical web services only return data via well-defined public-facing interfaces called APIs. Clients don’t typically get to see what’s going on under the hood inside the server; that would be a huge privacy and security risk. Heartbleed broke that paradigm: it enabled anyone on the Internet to take a peek at the operating memory used by web servers, revealing privileged data usually not exposed via the API. Heartbleed could be used to extract data previously sent to the server, including passwords and credit cards. It could also reveal the inner workings and cryptographic secrets used inside the server, including TLS certificate private keys.

Heartbleed let attackers peek behind the curtain, but not too far. Sensitive data could be extracted, but not everything on the server was at risk. For example, Heartbleed did not enable attackers to steal the content of databases held on the server. You may ask: why was some data at risk but not others? The reason has to do with how modern operating systems are built.

A simplified view of process isolation

Most modern operating systems are split into multiple layers. These layers are analogous to security clearance levels. So-called user-space applications (like your browser) typically live in a low-security layer called user space. They only have access to computing resources (memory, CPU, networking) if the lower, more credentialed layers let them.

User-space applications need resources to function. For example, they need memory to store their code and working memory to do computations. However, it would be risky to give an application direct access to the physical RAM of the computer they’re running on. Instead, the raw computing elements are restricted to a lower layer called the operating system kernel. The kernel only runs specially-designed applications designed to safely manage these resources and mediate access to them for user-space applications.

When a new user space application process is launched, the kernel gives it a virtual memory space. This virtual memory space acts like real memory to the application but is actually a safely guarded translation layer the kernel uses to protect the real memory. Each application’s virtual memory space is like a parallel universe dedicated to that application. This makes it impossible for one process to view or modify another’s; the other applications are simply not addressable.


Heartbleed, Cloudbleed and the process boundary

Heartbleed was a vulnerability in the OpenSSL library, which was part of many web server applications. These web servers run in user space, like any common applications. This vulnerability caused the web server to return up to 2 kilobytes of its memory in response to a specially-crafted inbound request.

Cloudbleed was also a memory disclosure bug, albeit one specific to Cloudflare, that got its name because it was so similar to Heartbleed. With Cloudbleed, the vulnerability was not in OpenSSL, but instead in a secondary web server application used for HTML parsing. When this code parsed a certain sequence of HTML, it ended up inserting some process memory into the web page it was serving.


It’s important to note that both of these bugs occurred in applications running in user space, not kernel space. This means that the memory exposed by the bug was necessarily part of the virtual memory of the application. Even if the bug were to expose megabytes of data, it would only expose data specific to that application, not other applications on the system.

In order for a web server to serve traffic over the encrypted HTTPS protocol, it needs access to the certificate’s private key, which is typically kept in the application’s memory. These keys were exposed to the Internet by Heartbleed. The Cloudbleed vulnerability affected a different process, the HTML parser, which doesn’t do HTTPS and therefore doesn’t keep the private key in memory. This meant that HTTPS keys were safe, even if other data in the HTML parser’s memory space wasn’t.


The fact that the HTML parser and the web server were different applications saved us from having to revoke and re-issue our customers’ TLS certificates. However, if another memory disclosure vulnerability is discovered in the web server, these keys are again at risk.

Moving keys out of Internet-facing processes

Not all web servers keep private keys in memory. In some deployments, private keys are held in a separate machine called a Hardware Security Module (HSM). HSMs are built to withstand physical intrusion and tampering and are often built to comply with stringent compliance requirements. They can often be bulky and expensive. Web servers designed to take advantage of keys in an HSM connect to them over a physical cable and communicate with a specialized protocol called PKCS#11. This allows the web server to serve encrypted content while being physically separated from the private key.


At Cloudflare, we built our own way to separate a web server from a private key: Keyless SSL. Rather than keeping the keys in a separate physical machine connected to the server with a cable, the keys are kept in a key server operated by the customer in their own infrastructure (this can also be backed by an HSM).


More recently, we launched Geo Key Manager, a service that allows users to store private keys in only select Cloudflare locations. Connections to locations that do not have access to the private key use Keyless SSL with a key server hosted in a datacenter that does have access.

In both Keyless SSL and Geo Key Manager, private keys are not only not part of the web server’s memory space, they’re often not even in the same country! This extreme degree of separation is not necessary to protect against the next Heartbleed. All that is needed is for the web server and the key server to not be part of the same application. So that’s what we did. We call this Keyless Everywhere.


Keyless SSL is coming from inside the house

Repurposing Keyless SSL for Cloudflare-held private keys was easy to conceptualize, but the path from ideation to live in production wasn’t so straightforward. The core functionality of Keyless SSL comes from the open-source gokeyless project, which customers run on their infrastructure; internally, we use it as a library and have replaced the main package with an implementation suited to our requirements (we’ve creatively dubbed it gokeyless-internal).

As with all major architecture changes, it’s prudent to start by testing the model with something new and low risk. In our case, the test bed was our experimental TLS 1.3 implementation. In order to quickly iterate through draft versions of the TLS specification and push releases without affecting the majority of Cloudflare customers, we re-wrote our custom nginx web server in Go and deployed it in parallel to our existing infrastructure. This server was designed from the start to never hold private keys and to leverage only gokeyless-internal. At the time there was only a small amount of TLS 1.3 traffic, all of it coming from the beta versions of browsers, which allowed us to work through the initial kinks of gokeyless-internal without exposing the majority of visitors to security risks or outages.

The first step towards making TLS 1.3 fully keyless was identifying and implementing the new functionality we needed to add to gokeyless-internal. Keyless SSL was designed to run on customer infrastructure, with the expectation of supporting only a handful of private keys. But our edge must simultaneously support millions of private keys, so we implemented the same lazy loading logic we use in our web server, nginx. Furthermore, a typical customer deployment would put key servers behind a network load balancer, so they could be taken out of service for upgrades or other maintenance. Contrast this with our edge, where it’s important to maximize our resources by serving traffic during software upgrades. This problem is solved by the excellent tableflip package we use elsewhere at Cloudflare.
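
As an illustration of that lazy loading idea (a sketch of the pattern in TypeScript, not Cloudflare’s actual Go code), keys are fetched from storage on first use and cached, and concurrent requests for the same key share a single lookup:

type PrivateKey = unknown; // stand-in type for the sketch

const keyCache = new Map<string, Promise<PrivateKey>>();

function getKey(
  keyId: string,
  loadFromStorage: (id: string) => Promise<PrivateKey>,
): Promise<PrivateKey> {
  let pending = keyCache.get(keyId);
  if (!pending) {
    // Cache the promise itself so concurrent requests for the same key
    // trigger only one storage lookup instead of a thundering herd.
    pending = loadFromStorage(keyId);
    keyCache.set(keyId, pending);
  }
  return pending;
}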

The next project to go Keyless was Spectrum, which launched with default support for gokeyless-internal. With these small victories in hand, we had the confidence necessary to attempt the big challenge, which was porting our existing nginx infrastructure to a fully keyless model. After implementing the new functionality, and being satisfied with our integration tests, all that’s left is to turn this on in production and call it a day, right? Anyone with experience with large distributed systems knows how far “working in dev” is from “done,” and this story is no different. Thankfully we were anticipating problems, and built a fallback into nginx to complete the handshake itself if any problems were encountered with the gokeyless-internal path. This allowed us to expose gokeyless-internal to production traffic without risking downtime in the event that our reimplementation of the nginx logic was not 100% bug-free.

When rolling back the code doesn’t roll back the problem

Our deployment plan was to enable Keyless Everywhere, find the most common causes of fallbacks, and then fix them. We could then repeat this process until all sources of fallbacks had been eliminated, after which we could remove access to private keys (and therefore the fallback) from nginx. One of the early causes of fallbacks was gokeyless-internal returning ErrKeyNotFound, indicating that it couldn’t find the requested private key in storage. This should not have been possible, since nginx only makes a request to gokeyless-internal after first finding the certificate and key pair in storage, and we always write the private key and certificate together. It turned out that in addition to returning the error for the intended case of the key truly not found, we were also returning it when transient errors like timeouts were encountered. To resolve this, we updated those transient error conditions to return ErrInternal, and deployed to our canary datacenters. Strangely, we found that a handful of instances in a single datacenter started encountering high rates of fallbacks, and the logs from nginx indicated it was due to a timeout between nginx and gokeyless-internal. The timeouts didn’t occur right away, but once a system started logging some timeouts it never stopped. Even after we rolled back the release, the fallbacks continued with the old version of the software! Furthermore, while nginx was complaining about timeouts, gokeyless-internal seemed perfectly healthy and was reporting reasonable performance metrics (sub-millisecond median request latency).


To debug the issue, we added detailed logging to both nginx and gokeyless, and followed the chain of events backwards once timeouts were encountered.

➜ ~ grep 'timed out' nginx.log | grep Keyless | head -5
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015157 Keyless SSL request/response timed out while reading Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015231 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015271 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015280 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:50.000 29m41 2018/07/25 05:30:50 [error] 4525#0: *1015289 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1

You can see the first request to log a timeout had id 1015157. Also interesting that the first log line was “timed out while reading,” but all the others are “timed out while waiting,” and this latter message is the one that continues forever. Here is the matching request in the gokeyless log:

➜ ~ grep 'id=1015157 ' gokeyless.log | head -1
2018-07-25T05:30:39.000 29m41 2018/07/25 05:30:39 [DEBUG] connection 127.0.0.1:30520: worker=ecdsa-29 opcode=OpECDSASignSHA256 id=1015157 sni=announce.php?info_hash=%a8%9e%9dc%cc%3b1%c8%23%e4%93%21r%0f%92mc%0c%15%89&peer_id=-ut353s-%ce%ad%5e%b1%99%06%24e%d5d%9a%08&port=42596&uploaded=65536&downloaded=0&left=0&corrupt=0&key=04a184b7&event=started&numwant=200&compact=1&no_peer_id=1 ip=104.20.33.147

Aha! That SNI value is clearly invalid (SNIs are like Host headers, i.e. they are domains, not URL paths), and it’s also quite long. Our storage system indexes certificates based on two indices: which SNI they correspond to, and which IP addresses they correspond to (for older clients that don’t support SNI). Our storage interface uses the memcached protocol, and the client library that gokeyless-internal uses rejects requests for keys longer than 250 characters (memcached’s maximum key length), whereas the nginx logic is to simply ignore the invalid SNI and treat the request as if it only had an IP. The change in our new release had shifted this condition from ErrKeyNotFound to ErrInternal, which triggered cascading problems in nginx. The “timeouts” it encountered were actually a result of throwing away all in-flight requests multiplexed on a connection when ErrInternal was returned for a single request. These requests were retried, but once this condition triggered, nginx became overloaded by the number of retried requests plus the continuous stream of new requests coming in with bad SNI, and was unable to recover. This explains why rolling back gokeyless-internal didn’t fix the problem.

This discovery finally brought our attention to nginx, which thus far had escaped blame since it had been working reliably with customer key servers for years. However, communicating over localhost to a multitenant key server is fundamentally different than reaching out over the public Internet to communicate with a customer’s key server, and we had to make the following changes:

  • Instead of a long connection timeout and a relatively short response timeout for customer key servers, extremely short connection timeouts and longer request timeouts are appropriate for a localhost key server.
  • Similarly, it’s reasonable to retry (with backoff) if we time out waiting on a customer key server response, since we can’t trust the network. But over localhost, a timeout would only occur if gokeyless-internal were overloaded and the request were still queued for processing. In this case a retry would only lead to more total work being requested of gokeyless-internal, making the situation worse.
  • Most significantly, nginx must not throw away all requests multiplexed on a connection if any single one of them encounters an error, since a single connection no longer represents a single customer.
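
To illustrate the first two points (the numbers are invented for this sketch, not Cloudflare’s actual settings), the contrast between the two deployments might be expressed like this:

interface KeyServerPolicy {
  connectTimeoutMs: number;
  requestTimeoutMs: number;
  retryOnTimeout: boolean; // retrying only makes sense across a lossy network
}

// Remote customer key server: the network is slow and can lose packets.
const remoteCustomerKeyServer: KeyServerPolicy = {
  connectTimeoutMs: 5_000, // connections may take a while to establish
  requestTimeoutMs: 1_000, // a slow answer is probably lost; give up quickly
  retryOnTimeout: true,    // transient packet loss is plausible, so retry
};

// Localhost multitenant key server: connects instantly and never loses packets.
const localhostKeyServer: KeyServerPolicy = {
  connectTimeoutMs: 50,    // localhost connects almost immediately
  requestTimeoutMs: 5_000, // a slow reply means queueing, so wait it out
  retryOnTimeout: false,   // retries only add load to an overloaded server
};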

Implementations matter

CPU at the edge is one of our most precious assets, and it’s closely guarded by our performance team (aka CPU police). Soon after turning on Keyless Everywhere in one of our canary datacenters, they noticed gokeyless using ~50% of a core per instance. We were shifting the sign operations from nginx to gokeyless, so of course it would be using more CPU now. But nginx should have seen a commensurate reduction in CPU usage, right?


Wrong. Elliptic curve operations are very fast in Go, but it’s known that Go’s RSA operations are much slower than their BoringSSL counterparts.

Although Go 1.11 includes optimizations for RSA math operations, we needed more speed. Well-tuned assembly code is required to match the performance of BoringSSL, so Armando Faz from our Crypto team helped claw back some of the lost CPU by reimplementing parts of the math/big package with platform-dependent assembly in an internal fork of Go. The recent assembly policy of Go prefers the use of Go portable code instead of assembly, so these optimizations were not upstreamed. There is still room for more optimizations, and for that reason we’re still evaluating moving to cgo + BoringSSL for sign operations, despite cgo’s many downsides.

Changing our tooling

Process isolation is a powerful tool for protecting secrets in memory. Our move to Keyless Everywhere demonstrates that this is not a simple tool to leverage. Re-architecting an existing system such as nginx to use process isolation to protect secrets was time-consuming and difficult. Another approach to memory safety is to use a memory-safe language such as Rust.

Rust was originally developed by Mozilla but is starting to be used much more widely. The main advantage that Rust has over C/C++ is that it has memory safety features without a garbage collector.

Re-writing an existing application in a new language such as Rust is a daunting task. That said, many new Cloudflare features, from the powerful Firewall Rules feature to our 1.1.1.1 with WARP app, have been written in Rust to take advantage of its powerful memory-safety properties. We’re really happy with Rust so far and plan on using it even more in the future.

Conclusion

The harrowing aftermath of Heartbleed taught the industry a lesson that should have been obvious in retrospect: keeping important secrets in applications that can be accessed remotely via the Internet is a risky security practice. In the following years, with a lot of work, we leveraged process separation and Keyless SSL to ensure that the next Heartbleed wouldn’t put customer keys at risk.

However, this is not the end of the road. Recently, memory disclosure vulnerabilities such as NetSpectre have been discovered that can bypass application process boundaries, so we continue to actively explore new ways to keep keys secure.


Delegated Credentials for TLS

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/keyless-delegation/


Today we’re happy to announce support for a new cryptographic protocol that helps make it possible to deploy encrypted services in a global network while still maintaining fast performance and tight control of private keys: Delegated Credentials for TLS. We have been working with partners from Facebook, Mozilla, and the broader IETF community to define this emerging standard. We’re excited to share the gory details today in this blog post.

Deploying TLS globally

Many of the technical problems we face at Cloudflare are widely shared problems across the Internet industry. As gratifying as it can be to solve a problem for ourselves and our customers, it can be even more gratifying to solve a problem for the entire Internet. For the past three years, we have been working with peers in the industry to solve a specific shared problem in the TLS infrastructure space: How do you terminate TLS connections while storing keys remotely and maintaining performance and availability? Today we’re announcing that Cloudflare now supports Delegated Credentials, the result of this work.

Cloudflare’s TLS/SSL features are among the top reasons customers use our service. Configuring TLS is hard to do without internal expertise. By automating TLS, web site and web service operators gain the latest TLS features and the most secure configurations by default. It also reduces the risk of outages or bad press due to misconfigured or insecure encryption settings. Customers also gain early access to unique features like TLS 1.3, post-quantum cryptography, and OCSP stapling as they become available.

Unfortunately, for web services to authorize a service to terminate TLS for them, they have to trust the service with their private keys, which demands a high level of trust. For services with a global footprint, there is an additional level of nuance. They may operate multiple data centers located in places with varying levels of physical security, and each of these needs to be trusted to terminate TLS.

To tackle these problems of trust, Cloudflare has invested in two technologies: Keyless SSL, which allows customers to use Cloudflare without sharing their private key with Cloudflare; and Geo Key Manager, which allows customers to choose the datacenters in which Cloudflare should keep their keys. Both of these technologies are able to be deployed without any changes to browsers or other clients. They also come with some downsides in the form of availability and performance degradation.

Keyless SSL introduces extra latency at the start of a connection. In order for a server without access to a private key to establish a connection with a client, that server needs to reach out to a key server, or a remote point of presence, and ask it to perform a private key operation. This not only adds latency to the connection, causing content to load more slowly, but it also introduces some troublesome operational constraints for the customer. Specifically, the server with access to the key needs to be highly available or the connection can fail. Sites often use Cloudflare to improve their site’s availability, so having to run a high-availability key server is an unwelcome requirement.

Turning a pull into a push

The reason services like Keyless SSL that rely on remote keys are so brittle is their architecture: they are pull-based rather than push-based. Every time a client attempts a handshake with a server that doesn’t have the key, it needs to pull the authorization from the key server. An alternative way to build this sort of system is to periodically push a short-lived authorization key to the server and use that for handshakes. Switching from a pull-based model to a push-based model eliminates the additional latency, but it comes with additional requirements, including the need to change the client.

Enter the new TLS feature of Delegated Credentials (DCs). A delegated credential is a short-lived key that the certificate’s owner has delegated for use in TLS. They work like a power of attorney: your server authorizes our server to terminate TLS for a limited time. When a browser that supports this protocol connects to our edge servers, we can show it this “power of attorney” instead of needing to reach back to a customer’s server to get it to authorize the TLS connection. This reduces latency and improves performance and reliability.

[Figure: The pull model]

[Figure: The push model]

A fresh delegated credential can be created and pushed out to TLS servers long before the previous credential expires. Momentary blips in availability will not lead to broken handshakes for clients that support delegated credentials. Furthermore, a Delegated Credentials-enabled TLS connection is just as fast as a standard TLS connection: there’s no need to connect to the key server for every handshake. This removes the main drawback of Keyless SSL for DC-enabled clients.

Delegated credentials are intended to be an Internet Standard RFC that anyone can implement and use, not a replacement for Keyless SSL. Since browsers will need to be updated to support the standard, proprietary mechanisms like Keyless SSL and Geo Key Manager will continue to be useful. Delegated credentials aren’t just useful in our context, which is why we’ve developed them openly and with contributions from across industry and academia. Facebook has integrated them into their own TLS implementation, and you can read more about how they view the security benefits here. When it comes to improving the security of the Internet, we’re all on the same team.

"We believe delegated credentials provide an effective way to boost security by reducing certificate lifetimes without sacrificing reliability. This will soon become an Internet standard and we hope others in the industry adopt delegated credentials to help make the Internet ecosystem more secure."

Subodh Iyengar, software engineer at Facebook

Extensibility beyond the PKI

At Cloudflare, we’re interested in pushing the state of the art forward by experimenting with new algorithms. In TLS, there are three main areas of experimentation: ciphers, key exchange algorithms, and authentication algorithms. Ciphers and key exchange algorithms are only dependent on two parties: the client and the server. This freedom allows us to deploy exciting new choices like ChaCha20-Poly1305 or post-quantum key agreement in lockstep with browsers. On the other hand, the authentication algorithms used in TLS are dependent on certificates, which introduces certificate authorities and the entire public key infrastructure into the mix.

Unfortunately, the public key infrastructure is very conservative in its choice of algorithms, making it harder to adopt newer cryptography for authentication algorithms in TLS. For instance, EdDSA, a highly-regarded signature scheme, is not supported by certificate authorities, and root programs limit the certificates that will be signed. With the emergence of quantum computing, experimenting with new algorithms is essential to determine which solutions are deployable and functional on the Internet.

Since delegated credentials introduce the ability to use new authentication key types without requiring changes to certificates themselves, this opens up a new area of experimentation. Delegated credentials can be used to provide a level of flexibility in the transition to post-quantum cryptography, by enabling new algorithms and modes of operation to coexist with the existing PKI. It also enables tiny victories, like the ability to use smaller, faster Ed25519 signatures in TLS.

Inside DCs

A delegated credential contains a public key and an expiry time. This bundle is then signed with the certificate’s private key, binding the delegated credential to the certificate for which it acts as “power of attorney”. A supporting client indicates its support for delegated credentials by including an extension in its Client Hello.

A server that supports delegated credentials composes the TLS Certificate Verify and Certificate messages as usual, but instead of signing with the certificate’s private key, it includes the certificate along with the DC, and signs with the DC’s private key. Therefore, the private key of the certificate only needs to be used for the signing of the DC.
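
The draft encodes this bundle in the TLS presentation language. As a rough illustration (field names follow the draft; the Rust types below sketch the wire format and are not a real implementation):

struct Credential {
    valid_time: u32,                     // seconds, relative to the certificate's notBefore
    expected_cert_verify_algorithm: u16, // TLS SignatureScheme the DC key will use
    subject_public_key_info: Vec<u8>,    // DER-encoded ASN.1 SubjectPublicKeyInfo
}

struct DelegatedCredential {
    cred: Credential,
    algorithm: u16,     // SignatureScheme used by the certificate key to sign `cred`
    signature: Vec<u8>, // signature over `cred` by the certificate's private key
}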

Certificates used for signing delegated credentials require a special X.509 certificate extension. This requirement exists to avoid breaking assumptions people may have about the impact of temporary access to their keys on security, particularly in cases involving HSMs and the still-unfixed Bleichenbacher oracles in older TLS versions. Temporary access to a key can enable the signing of lots of delegated credentials that start far in the future, so support was made opt-in. Early versions of QUIC had similar issues, and ended up adopting TLS to fix them. Protocol evolution on the Internet requires working well with existing protocols and their flaws.

Delegated Credentials at Cloudflare and Beyond

Currently we use delegated credentials as a performance optimization for Geo Key Manager and Keyless SSL. Customers can update their certificates to include the special extension for delegated credentials, and we will automatically create delegated credentials and distribute them to the edge through Keyless SSL or Geo Key Manager. For more information, see the documentation. This also enables us to be more conservative about where we keep keys for customers, improving our security posture.

Delegated credentials would be useless if they weren’t also supported by browsers and other HTTP clients. Christopher Patton, a former intern at Cloudflare, implemented support in Firefox and its underlying NSS security library. This feature is now in the Nightly versions of Firefox. You can turn it on by activating the configuration option security.tls.enable_delegated_credentials at about:config. Studies are ongoing on how effective this will be in a wider deployment. There is also support for Delegated Credentials in BoringSSL.

"At Mozilla we welcome ideas that help to make the Web PKI more robust. The Delegated Credentials feature can help to provide secure and performant TLS connections for our users, and we’re happy to work with Cloudflare to help validate this feature."

Thyla van der Merwe, Cryptography Engineering Manager at Mozilla

One open issue is the question of client clock accuracy. Until we have a wide-scale study, we won’t know how many connections using delegated credentials will break because of the 24-hour time limit that is imposed. Some clients, in particular mobile clients, may have inaccurately set clocks, which are the root cause of one third of all certificate errors in Chrome. Part of the way that we’re aiming to solve this problem is through standardizing and improving Roughtime, so web browsers and other services that need to validate certificates can do so independently of the client clock.

Cloudflare’s global scale means that we see connections from every corner of the world, and from many different kinds of connection and device. That reach enables us to find rare problems with the deployability of protocols. For example, our early deployment helped inform the development of the TLS 1.3 standard. As we enable developing protocols like delegated credentials, we learn about obstacles that inform and affect their future development.

Conclusion

As new protocols emerge, we’ll continue to play a role in their development and bring their benefits to our customers. Today’s announcement of a technology that overcomes some limitations of Keyless SSL is just one example of how Cloudflare takes part in improving the Internet not just for our customers, but for everyone. During the standardization process of turning the draft into an RFC, we’ll continue to maintain our implementation and come up with new ways to apply delegated credentials.

Announcing cfnts: Cloudflare’s implementation of NTS in Rust

Post Syndicated from Watson Ladd original https://blog.cloudflare.com/announcing-cfnts/

Several months ago we announced that we were providing a new public time service. Part of what we were providing was the first major deployment of the new Network Time Security (NTS) protocol, with a newly written implementation of NTS in Rust. In the process, we received helpful advice from the NTP community, especially from the NTPSec and Chrony projects. We’ve also participated in several interoperability events. Now we are returning something to the community: Our implementation, cfnts, is now open source and we welcome your pull requests and issues.

The journey from a blank source file to a working, deployed service was a lengthy one, and it involved many people across multiple teams.


"Correct time is a necessity for most security protocols in use on the Internet. Despite this, secure time transfer over the Internet has previously required complicated configuration on a case by case basis. With the introduction of NTS, secure time synchronization will finally be available for everyone. It is a small, but important, step towards increasing security in all systems that depend on accurate time. I am happy that Cloudflare are sharing their NTS implementation. A diversity of software with NTS support is important for quick adoption of the new protocol."

Marcus Dansarie, coauthor of the NTS specification


How NTS works

NTS is structured as a suite of two sub-protocols as shown in the figure below. The first is the Network Time Security Key Exchange (NTS-KE), which is always conducted over Transport Layer Security (TLS) and handles the creation of key material and parameter negotiation for the second protocol. The second is NTPv4, the current version of the NTP protocol, which allows the client to synchronize their time from the remote server.

In order to maintain the scalability of NTPv4, it was important that the server not keep per-client state; a very small server can serve millions of NTP clients. Maintaining this property while adding security is achieved with cookies: the server hands the client opaque cookies that carry the server’s state.
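
A rough sketch of what such a cookie can carry, modeled on the format suggested in the NTS specification (the exact layout is deliberately implementation-defined):

struct NtsCookie {
    key_id: u32,         // identifies the server master key that sealed this cookie
    nonce: Vec<u8>,      // fresh nonce for the AEAD seal
    ciphertext: Vec<u8>, // AEAD-encrypted (aead_algorithm, C2S key, S2C key)
}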

In the first stage, the client sends a request to the NTS-KE server and gets a response via TLS. This exchange carries out a number of functions:

  • Negotiates the AEAD algorithm to be used in the second stage.
  • Negotiates the second protocol. Currently, the standard only defines how NTS works with NTPv4.
  • Negotiates the NTP server IP address and port.
  • Creates cookies for use in the second stage.
  • Creates two symmetric keys (C2S and S2C) from the TLS session via exporters.
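
As a concrete sketch of that last step, here is how the C2S and S2C keys could be derived with a rustls-style export_keying_material API (a rustls 0.21-style signature is assumed). The label and context bytes follow the final RFC 8915 values; the drafts current at the time used slightly different labels:

use rustls::ClientConnection;

const LABEL: &[u8] = b"EXPORTER-network-time-security";

fn nts_keys(
    conn: &ClientConnection,
    aead_id: u16,
    key_len: usize,
) -> Result<(Vec<u8>, Vec<u8>), rustls::Error> {
    let derive = |dir: u8| {
        // Context: 2-byte protocol ID (0 = NTPv4), 2-byte AEAD ID,
        // 1 direction byte (0x00 = C2S, 0x01 = S2C).
        let mut ctx = vec![0x00, 0x00];
        ctx.extend_from_slice(&aead_id.to_be_bytes());
        ctx.push(dir);
        conn.export_keying_material(vec![0u8; key_len], LABEL, Some(&ctx))
    };
    Ok((derive(0x00)?, derive(0x01)?))
}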

In the second stage, the client securely synchronizes the clock with the negotiated NTP server. To synchronize securely, the client sends NTPv4 packets with four special extensions:

  • Unique Identifier Extension contains a random nonce used to prevent replay attacks.
  • NTS Cookie Extension contains one of the cookies that the client stores. Since currently only the client remembers the two AEAD keys (C2S and S2C), the server needs to use the cookie from this extension to extract the keys. Each cookie contains the keys encrypted under a secret key the server has.
  • NTS Cookie Placeholder Extension is a signal from the client to request additional cookies from the server. This extension is needed to make sure that the response is not much longer than the request to prevent amplification attacks.
  • NTS Authenticator and Encrypted Extension Fields Extension contains a ciphertext from the AEAD algorithm with C2S as a key and with the NTP header, timestamps, and all the previously mentioned extensions as associated data. Other possible extensions can be included as encrypted data within this field. Without this extension, the timestamp can be spoofed.

After getting a request, the server sends a response back to the client echoing the Unique Identifier Extension to prevent replay attacks, the NTS Cookie Extension to provide the client with more cookies, and the NTS Authenticator and Encrypted Extension Fields Extension with an AEAD ciphertext using S2C as the key. In the server’s response, however, the NTS Cookie Extension is not sent in plaintext: it is encrypted with the AEAD so that successive NTP requests cannot be linked to one another.

The second stage can be repeated many times without going back to the first, since each request and response gives the client a new cookie. The expensive public key operations in TLS are thus amortized over a large number of requests. Furthermore, specialized timekeeping devices like FPGA implementations only need to implement a few symmetric cryptographic functions and can delegate the complex TLS stack to a different device.

Why Rust?

While many of our services are written in Go, and the Crypto team has considerable experience with Go, a garbage collection pause in the middle of responding to an NTP packet would negatively impact accuracy. We picked Rust for its lack of runtime overhead and its safety features.

  • Memory safety: After Heartbleed, Cloudbleed, and the steady drip of vulnerabilities caused by C’s lack of memory safety, it’s clear that C is not a good choice for new software dealing with untrusted inputs. The obvious route to memory safety is garbage collection, but garbage collection has a substantial runtime overhead that Rust avoids.
  • Non-nullability: Null pointers are an edge case that is frequently not handled properly. Rust marks optionality explicitly, so every plain reference in Rust can be safely dereferenced, and the type system ensures that option types are properly handled.
  • Thread safety: Data-race prevention is another key feature of Rust. Rust’s ownership model ensures that all cross-thread accesses are synchronized by default. While not a panacea, this eliminates a major class of bugs.
  • Immutability: Separating types into mutable and immutable is very important for reducing bugs. In Java, for example, when you pass an object into a function as a parameter, you never know afterwards whether the object has been mutated. Rust allows you to pass an object reference into a function and still be assured that the object was not mutated.
  • Error handling: Rust’s Result types help ensure that operations which can produce errors are identified, and that a decision is made about each error, even if that decision is just passing it on. A short example follows this list.
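
A ten-line illustration of the non-nullability and error-handling points: optionality and failure are visible in the types, so the compiler forces both to be handled.

fn parse_port(input: Option<&str>) -> Result<u16, std::num::ParseIntError> {
    match input {
        None => Ok(123),             // no null to forget: absence is an explicit case
        Some(s) => s.parse::<u16>(), // a failed parse must be handled or passed on
    }
}

fn main() {
    assert_eq!(parse_port(None), Ok(123));
    assert!(parse_port(Some("not-a-port")).is_err());
}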

While Rust provides safety with zero overhead, coding in Rust meant understanding linear types and, for us, learning a new language. In this case, the importance of security and performance meant we chose Rust over the potentially easier path of writing the service in Go.

Dependencies we use

Because of our scale, and for DDoS protection, we needed a highly scalable server. For UDP protocols without the concept of a connection, the server can easily respond to one packet at a time, but for TCP this is more complex. Originally we thought about using Tokio, but at the time Tokio suffered from scheduler problems that had caused other teams some issues. As a result we decided to use Mio directly, basing our work on the examples in Rustls.

We decided to use Rustls over OpenSSL or BoringSSL because of the crate’s consistent error codes and default support for authentication that is difficult to disable accidentally. While there are some features that are not yet supported, it got the job done for our service.

Other engineering choices

More important than our choice of programming language was our implementation strategy. A working, fully featured NTP implementation is a complicated program involving a phase-locked loop. These have a difficult reputation due to their nonlinear nature, beyond the usual complexities of closed-loop control. The response of a phase-locked loop to a disturbance can be estimated if the loop is locked and the disturbance is small. However, lock acquisition, large disturbances, and the filtering required in NTP are all hard to analyze mathematically, since they are not captured in the linear models used for small-scale analysis. While NTP works with the total phase, unlike the phase-locked loops of electrical engineering, there are still nonlinear elements. As a result, changes to this loop require weeks of operation to evaluate, because the loop responds very slowly.

Computer clocks are generally accurate over short periods, while networks are plagued with inconsistent delays. This demands a slow response. Changes we make to our service have taken hours to have an effect, as the clients slowly adapt to the new conditions. While RFC 5905 provides lots of details on an algorithm to adjust the clock, later implementations such as chrony have improved upon the algorithm through much more sophisticated nonlinear filters.

Rather than implement these more sophisticated algorithms, we let chrony adjust the clock of our servers, and copy the state variables in the header from chrony and adjust the dispersion and root delay according to the formulas given in the RFC. This strategy let us focus on the new protocols.

Prague

Part of what the Internet Engineering Task Force (IETF) does is organize events like hackathons where implementers of a new standard can get together and try to make their stuff work with one another. This exposes bugs and infelicities of language in the standard and the implementations. We attended the IETF 104 hackathon to develop our server and make it work with other implementations. The NTP working group members were extremely generous with their time, and during the process we uncovered a few issues relating to the exact way one has to handle ALPN with older OpenSSL versions.

At the IETF 104 in Prague we had a working client and server for NTS-KE by the end of the hackathon. This was a good amount of progress considering we started with nothing. However, without implementing NTP we didn’t actually know that our server and client were computing the right thing. That would have to wait for later rounds of testing.

[Figure: Wireshark during some NTS debugging]

Crypto Week

As Crypto Week 2019 approached we were busily writing code. All of the NTP protocol had to be implemented, together with the connection between the NTP and NTS-KE parts of the server. We also had to deploy processes to synchronize the ticket-encrypting keys around the world, and work on reconfiguring our own timing infrastructure to support this new service.

With a few weeks to go we had a working implementation, but we needed servers and clients out there to test with. Because our server only supports TLS 1.3, which had only just landed in OpenSSL, there were some compatibility problems.

We ended up compiling a chrony branch with NTS support and NTPsec ourselves and testing against time.cloudflare.com. We also tested our client against test servers set up by the chrony and NTPsec projects, in the hopes that this would expose bugs and have our implementations work nicely together. After a few lengthy days of debugging, we found out that our nonce length wasn’t exactly in accordance with the spec, which was quickly fixed. The NTPsec project was extremely helpful in this effort. Of course, this was the day that our office had a blackout, so the testing happened outside in Yerba Buena Gardens.

[Photo: Yerba Buena commons. Taken by Wikipedia user Beyond My Ken. CC-BY-SA]

During the deployment of time.cloudflare.com, we had to open up our firewall to incoming NTP packets. Because of NTP reflection attacks, UDP port 123 had been closed on our routers since the early days of Cloudflare’s network. Since clients sometimes send NTP packets from source port 123 as well, it’s impossible for NTP servers to filter reflection attacks without parsing the contents of the NTP packet, which routers have difficulty doing. In order to protect Cloudflare infrastructure we got an entire subnet just for the time service, so it could be aggressively throttled and rerouted in case of massive DDoS attacks. This is an exceptional case: most edge services at Cloudflare run on every available IP.

Bug fixes

Shortly after the public launch, we discovered that older Windows versions shipped with NTP version 3, while our server only spoke version 4. This was easy to fix since the timestamp fields have not moved between NTP versions: we simply echo the client’s version back, and most surviving NTPv3 clients understand the response.
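
A hypothetical sketch of that fix. In the NTP header, the version number occupies bits 3-5 of the first octet, so the server can mirror the client’s version while answering in server mode:

fn response_first_octet(request_first_octet: u8) -> u8 {
    let version = (request_first_octet >> 3) & 0x07; // client's version (3 or 4)
    let mode = 4u8;                                  // 4 = server response
    (version << 3) | mode                            // leap indicator bits left at 0
}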

Also tricky was the failure of Network Time Foundation ntpd clients to expand their polling interval. It turns out that the server has to echo back the client’s polling interval before the client will expand it. Chrony does not use the polling interval from the server, and so was not affected by this incompatibility.

Both of these issues were fixed in ways suggested by other NTP implementers who had run into these problems themselves. We thank Miroslav Lichter tremendously for telling us exactly what the problem was, and the members of the Cloudflare community who posted packet captures demonstrating these issues.

Continued improvement

The original production version of cfnts was not particularly object-oriented, and several contributors were just learning Rust. As a result there was quite a bit of unwrap and unnecessary mutability flying around. Much of the code lived in free functions even when it could profitably be attached to structures. All of this had to be restructured. Keep in mind that some of the best code running in the real world has been written, rewritten, and sometimes rewritten again. This is actually a good thing.

As an internal project we relied on Cloudflare’s internal tooling for building, testing, and deploying code. These tools were replaced with ones available to everyone, like Docker, to ensure anyone can contribute. Our repository is integrated with Circle CI, ensuring that all contributions are automatically tested. In addition to unit tests, we test the entire end-to-end functionality of getting a time measurement from a server.

The Future

NTPsec has already released support for NTS, but we see very little usage. Please try turning on NTS if you use NTPsec and see how it works with time.cloudflare.com. As the draft advances through the standards process, the protocol will undergo an incompatible change when the identifiers are assigned out of the IANA registry instead of the current experimental ones, so this is very much an experiment. Note that your daemon will need TLS 1.3 support, which could require manually compiling OpenSSL and linking against it.

We’ve also added our time service to the public NTP pool. The NTP pool is a widely used volunteer-maintained service that provides NTP servers geographically spread across the world. Unfortunately, NTS doesn’t currently work well with the pool model, so for the best security, we recommend enabling NTS and using time.cloudflare.com and other NTS supporting servers.

In the future, we’re hoping that more clients support NTS, and have licensed our code liberally to enable this. We would love to hear if you incorporate it into a product and welcome contributions to make it more useful.

We’re also encouraged to see that Netnod has a production NTS service at nts.ntp.se. The more time services and clients that adopt NTS, the more secure the Internet will be.

Acknowledgements

Tanya Verma and Gabbi Fisher were major contributors to the code, especially the configuration system and the client code. We’d also like to thank Gary Miller, Miroslav Lichter, and all the people at Cloudflare who set up their laptops and home machines to point to time.cloudflare.com for early feedback.

The TLS Post-Quantum Experiment

Post Syndicated from Kris Kwiatkowski original https://blog.cloudflare.com/the-tls-post-quantum-experiment/

In June, we announced a wide-scale post-quantum experiment with Google. We implemented two post-quantum (i.e., not yet known to be broken by quantum computers) key exchanges, integrated them into our TLS stack, and deployed the implementation on our edge servers and in Chrome Canary clients. The goal of the experiment was to evaluate the performance of two post-quantum key-agreement ciphers and the feasibility of deploying them in TLS.

In our previous blog post on post-quantum cryptography, we described differences between those two ciphers in detail. In case you didn’t have a chance to read it, we include a quick recap here. One characteristic of post-quantum key exchange algorithms is that the public keys are much larger than those used by “classical” algorithms. This will have an impact on the duration of the TLS handshake. For our experiment, we chose two algorithms: isogeny-based SIKE and lattice-based HRSS. The former has short key sizes (~330 bytes) but has a high computational cost; the latter has larger key sizes (~1100 bytes), but is a few orders of magnitude faster.

During NIST’s Second PQC Standardization Conference, Nick Sullivan presented our approach to this experiment and some initial results. Quite accurately, he compared NTRU-HRSS to an ostrich and SIKE to a turkey—one is big and fast and the other is small and slow.

Setup & Execution

We based our experiment on TLS 1.3. Cloudflare operated the server-side TLS connections and Google Chrome (Canary and Dev builds) represented the client side of the experiment. We enabled both CECPQ2 (HRSS + X25519) and CECPQ2b (SIKE/p434 + X25519) key-agreement algorithms on all TLS-terminating edge servers. Since the post-quantum algorithms are considered experimental, the X25519 key exchange serves as a fallback to ensure the classical security of the connection.

Clients participating in the experiment were split into 3 groups—those who initiated a TLS handshake with post-quantum CECPQ2, CECPQ2b, or non-post-quantum X25519 public keys. Each group represented approximately one third of the Chrome Canary population participating in the experiment.

In order to distinguish between clients participating in or excluded from the experiment, we added a custom extension to the TLS handshake. It worked as a simple flag sent by clients and echoed back by Cloudflare edge servers. This allowed us to measure the duration of TLS handshakes only for clients participating in the experiment.

For each connection, we collected telemetry metrics. The most important metric was a TLS server-side handshake duration defined as the time between receiving the Client Hello and Client Finished messages. The diagram below shows details of what was measured and how post-quantum key exchange was integrated with TLS 1.3.

[Figure: server-side handshake duration measurement and post-quantum key exchange in TLS 1.3]

The experiment ran for 53 days in total, between August and October. During this time we collected millions of data samples, representing 5% of (anonymized) TLS connections that contained the extension signaling that the client was part of the experiment. We carried out the experiment in two phases.

In the first phase of the experiment, each client was assigned to use one of the three key exchange groups, and each client offered the same key exchange group for every connection. We collected over 10 million records over 40 days.

In the second phase of the experiment, client behavior was modified so that each client randomly chose which key exchange group to offer for each new connection, allowing us to directly compare the performance of each algorithm on a per-client basis. Data collection for this phase lasted 13 days and we collected 270 thousand records.

Results

We now describe our server-side measurement results. Client-side results are described at https://www.imperialviolet.org/2019/10/30/pqsivssl.html.

What did we find?

The primary metric we collected for each connection was the server-side handshake duration. The below histograms show handshake duration timings for all client measurements gathered in the first phase of the experiment, as well as breakdowns into the top five operating systems. The operating system breakdowns shown are restricted to only desktop/laptop devices except for Android, which consists of only mobile devices.

[Figure: handshake duration histograms, overall and by operating system]

It’s clear from the above plots that for most clients, CECPQ2b performs worse than CECPQ2 and CONTROL. Thus, the small key size of CECPQ2b does not make up for its large computational cost—the ostrich outpaces the turkey.

Digging a little deeper

This means we’re done, right? Not quite. We are interested in determining whether there are any populations of TLS clients for which CECPQ2b consistently outperforms CECPQ2. This requires taking a closer look at the long tail of handshake durations. The below plots show cumulative distribution functions (CDFs) of handshake timings zoomed in on the 80th percentile (i.e., showing the slowest 20% of handshakes).

[Figure: CDFs of handshake durations above the 80th percentile]

Here, we start to see something interesting. For Android, Linux, and Windows devices, there is a crossover point where CECPQ2b actually starts to outperform CECPQ2 (Android: ~94th percentile, Linux: ~92nd percentile, Windows: ~95th percentile). macOS and ChromeOS do not appear to have these crossover points.

These effects are small but statistically significant in some cases. The below table shows approximate 95% confidence intervals for the 50th (median), 95th, and 99th percentiles of handshake durations for each key exchange group and device type, calculated using Maritz-Jarrett estimators. The numbers within square brackets give the lower and upper bounds on our estimates for each percentile of the “true” distribution of handshake durations based on the samples collected in the experiment. For example, with a 95% confidence level we can say that the 99th percentile of handshake durations for CECPQ2 on Android devices lies between 4057ms and 4478ms, while the 99th percentile for CECPQ2b lies between 3276ms and 3646ms. Since the intervals do not overlap, we say that with statistical significance, the experiment indicates that CECPQ2b performs better than CECPQ2 for the slowest 1% of Android connections. Configurations where CECPQ2 or CECPQ2b outperforms the other with statistical significance are marked with green in the table.

[Table: 95% confidence intervals for the 50th, 95th, and 99th percentile handshake durations]

Per-client comparison

A second phase of the experiment directly examined the performance of each key exchange algorithm for individual clients, where a client is defined to be a unique (anonymized) IP address and user agent pair. Instead of choosing a single key exchange algorithm for the duration of the experiment, clients randomly selected one of the experiment configurations for each new connection. Although the duration and sample size were limited for this phase of the experiment, we collected at least three handshake measurements for each group configuration from 3900 unique clients.

The plot below shows, for each of these clients, the difference in latency between CECPQ2 and CECPQ2b, taking the minimum latency sample for each key exchange group as the representative value. The CDF plot shows that for 80% of clients, CECPQ2 outperformed or matched CECPQ2b, and for 99% of clients, the latency gap remained within 70ms. At a high level, this indicates that very few clients performed significantly worse with CECPQ2 than with CECPQ2b.

[Figure: CDF of per-client latency difference between CECPQ2 and CECPQ2b]

Do other factors impact the latency gap?

We looked at a number of other factors—including session resumption, IP version, and network location—to see if they impacted the latency gap between CECPQ2 and CECPQ2b. These factors impacted the overall handshake latency, but we did not find that any made a significant impact on the latency gap between post-quantum ciphers. We share some interesting observations from this analysis below.

Session resumption

Approximately 53% of all connections in the experiment were completed with TLS handshake resumption. However, the percentage of resumed connections varied significantly based on the device configuration. Connections from mobile devices were only resumed ~25% of the time, while between 40% and 70% of connections from laptop/desktop devices were resumed. Additionally, resumption provided between a 30% and 50% speedup for all device types.

IP version

We also examined the impact of IP version on handshake latency. Only 12.5% of the connections in the experiment used IPv6. These connections were 20-40% faster than IPv4 connections for desktop/laptop devices, but ~15% slower for mobile devices. This could be an artifact of IPv6 being generally deployed on newer devices with faster processors. For Android, the experiment was only run on devices with more modern processors, which perhaps eliminated the bias.

Network location

The slow connections making up the long tail of handshake durations were not isolated to a few countries, Autonomous Systems (ASes), or subnets, but originated from a globally diverse set of clients. We did not find a correlation between the relative performance of the two post-quantum key exchange algorithms based on these factors.

Discussion

We found that CECPQ2 (the ostrich) outperformed CECPQ2b (the turkey) for the majority of connections in the experiment, indicating that fast algorithms with large keys may be more suitable for TLS than slow algorithms with small keys. However, we observed the opposite—that CECPQ2b outperformed CECPQ2—for the slowest connections on some devices, including Windows computers and Android mobile devices. One possible explanation for this is packet fragmentation and packet loss. The maximum size of TCP packets that can be sent across a network is limited by the maximum transmission unit (MTU) of the network path, which is often ~1400 bytes. During the TLS handshake the server responds to the client with its public key and ciphertext, the combined size of which exceeds the MTU, so it is likely that handshake messages must be split across multiple TCP packets. This increases the risk of lost packets and delays due to retransmission. A repeat of this experiment that includes collection of fine-grained TCP telemetry could confirm this hypothesis.

A somewhat surprising result of this experiment is just how fast HRSS performs for the majority of connections. Recall that the CECPQ2 cipher performs key exchange operations for both X25519 and HRSS, but the additional overhead of HRSS is barely noticeable. Comparing benchmark results, we can see that HRSS will be faster than X25519 on the server side and slower on the client side.

[Figure: X25519 vs. HRSS benchmark results]

In our design, the client side performs two operations—key generation and KEM decapsulation. Looking at those two operations we can see that the key generation is a bottleneck here.

Key generation: 	3553.5 [ops/sec]
KEM decapsulation: 	17186.7 [ops/sec]

In algorithms with quotient-style keys (like NTRU), the key generation algorithm performs an inversion in the quotient ring—an operation that is quite computationally expensive. Alternatively, a TLS implementation could generate ephemeral keys ahead of time in order to speed up key exchange. There are several other lattice-based key exchange candidates that may be worth experimenting with in the context of TLS key exchange, which are based on different underlying principles than the HRSS construction. These candidates have similar key sizes and faster key generation algorithms, but have their own drawbacks. For now, HRSS looks like the more promising algorithm for use in TLS.
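
One way to realize the “generate ephemeral keys ahead of time” idea is a small pool refilled by a background thread, so the handshake path only pays for decapsulation. A minimal sketch: KeyPair and generate() are hypothetical placeholders for an HRSS implementation.

use std::sync::mpsc;
use std::thread;

struct KeyPair; // placeholder for a real HRSS key pair

fn generate() -> KeyPair {
    KeyPair // placeholder for the slow key-generation operation
}

fn main() {
    let (tx, rx) = mpsc::sync_channel::<KeyPair>(32); // pool of up to 32 fresh keys
    thread::spawn(move || {
        // Blocks whenever the pool is full; refills as keys are consumed.
        while tx.send(generate()).is_ok() {}
    });
    // Handshake path: take a ready-made key instead of generating one.
    let _key_pair = rx.recv().expect("key generator thread died");
}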

In the case of SIKE, we implemented the most recent version of the algorithm and instantiated it with the most performance-efficient parameter set for our experiment. The algorithm is computationally expensive, so we had to use assembly to optimize it. To ensure the best performance on Intel, most performance-critical operations have two different implementations: the library detects CPU capabilities and uses faster instructions when available, but otherwise falls back to a slightly slower generic implementation. We also developed our own optimizations for 64-bit ARM CPUs. Nevertheless, our results show that SIKE incurred a significant overhead for every connection, especially on devices with weaker processors. It must be noted that high-performance isogeny-based public-key cryptography is arguably much less developed than its lattice-based counterparts. Some ideas for speeding it up are circulating, and we hope to see performance improvements in the future.

DNS Encryption Explained

Post Syndicated from Peter Wu original https://blog.cloudflare.com/dns-encryption-explained/

The Domain Name System (DNS) is the address book of the Internet. When you visit cloudflare.com or any other site, your browser will ask a DNS resolver for the IP address where the website can be found. Unfortunately, these DNS queries and answers are typically unprotected. Encrypting DNS would improve user privacy and security. In this post, we will look at two mechanisms for encrypting DNS, known as DNS over TLS (DoT) and DNS over HTTPS (DoH), and explain how they work.

Applications that want to resolve a domain name to an IP address typically use DNS. This is usually not done explicitly by the programmer who wrote the application. Instead, the programmer writes something such as fetch("https://example.com/news") and expects a software library to handle the translation of “example.com” to an IP address.
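
The same implicit resolution happens in any language. In Rust, for instance, to_socket_addrs quietly asks the system resolver, and the application never learns which resolver answered or whether the lookup was encrypted:

use std::net::ToSocketAddrs;

fn main() -> std::io::Result<()> {
    // The standard library resolves the name through the OS resolver.
    for addr in "example.com:443".to_socket_addrs()? {
        println!("{}", addr);
    }
    Ok(())
}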

Behind the scenes, the software library is responsible for discovering and connecting to the external recursive DNS resolver and speaking the DNS protocol (see the figure below) in order to resolve the name requested by the application. The choice of the external DNS resolver and whether any privacy and security is provided at all is outside the control of the application. It depends on the software library in use, and the policies provided by the operating system of the device that runs the software.

[Figure: Overview of DNS query and response]

The external DNS resolver

The operating system usually learns the resolver address from the local network using Dynamic Host Configuration Protocol (DHCP). In home and mobile networks, it typically ends up using the resolver from the Internet Service Provider (ISP). In corporate networks, the selected resolver is typically controlled by the network administrator. If desired, users with control over their devices can override the resolver with a specific address, such as the address of a public resolver like Google’s 8.8.8.8 or Cloudflare’s 1.1.1.1, but most users will likely not bother changing it when connecting to a public Wi-Fi hotspot at a coffee shop or airport.

The choice of external resolver has a direct impact on the end-user experience. Most users do not change their resolver settings and will likely end up using the DNS resolver from their network provider. The most obvious observable property is the speed and accuracy of name resolution. Features that improve privacy or security might not be immediately visible, but will help to prevent others from profiling or interfering with your browsing activity. This is especially important on public Wi-Fi networks where anyone in physical proximity can capture and decrypt wireless network traffic.

Unencrypted DNS

Ever since DNS was created in 1987, it has been largely unencrypted. Everyone between your device and the resolver is able to snoop on or even modify your DNS queries and responses. This includes anyone on your local Wi-Fi network, your Internet Service Provider (ISP), and transit providers. This may affect your privacy by revealing the domain names that you are visiting.

What can they see? Well, consider this network packet capture taken from a laptop connected to a home network:

[Figure: packet capture of an unencrypted DNS answer on a home network]

The following observations can be made:

  • The UDP source port is 53 which is the standard port number for unencrypted DNS. The UDP payload is therefore likely to be a DNS answer.
  • That suggests that the source IP address 192.168.2.254 is a DNS resolver while the destination IP 192.168.2.14 is the DNS client.
  • The UDP payload could indeed be parsed as a DNS answer, and reveals that the user was trying to visit twitter.com.
  • If there are any future connections to 104.244.42.129 or 104.244.42.1, then it is most likely traffic that is directed at “twitter.com”.
  • If there is some further encrypted HTTPS traffic to this IP, followed by more DNS queries, it could indicate that a web browser loaded additional resources from that page. That could potentially reveal the pages that a user was looking at while visiting twitter.com.

Since the DNS messages are unprotected, other attacks are possible:

  • Queries could be directed to a resolver that performs DNS hijacking. For example, in the UK, Virgin Media and BT return a fake response for domains that do not exist, redirecting users to a search page. This redirection is possible because the computer/phone blindly trusts the DNS resolver that was advertised using DHCP by the ISP-provided gateway router.
  • Firewalls can easily intercept, block or modify any unencrypted DNS traffic based on the port number alone. It is worth noting that plaintext inspection is not a silver bullet for achieving visibility goals, because the DNS resolver can be bypassed.

Encrypting DNS

Encrypting DNS makes it much harder for snoopers to look into your DNS messages, or to corrupt them in transit. Just as the web moved from unencrypted HTTP to encrypted HTTPS, there are now upgrades to the DNS protocol that encrypt DNS itself. Encrypting the web has made it possible for private and secure communications and commerce to flourish. Encrypting DNS will further enhance user privacy.

Two standardized mechanisms exist to secure the DNS transport between you and the resolver: DNS over TLS (2016) and DNS Queries over HTTPS (2018). Both are based on Transport Layer Security (TLS) which is also used to secure communication between you and a website using HTTPS. In TLS, the server (be it a web server or DNS resolver) authenticates itself to the client (your device) using a certificate. This ensures that no other party can impersonate the server (the resolver).

With DNS over TLS (DoT), the original DNS message is directly embedded into the secure TLS channel. From the outside, one can neither learn the name that was being queried nor modify it. Only the intended client application is able to decrypt the response. On the wire, it looks like this:

[Figure: packet capture of DNS over TLS traffic]

In the packet trace for unencrypted DNS, it was clear that a DNS request can be sent directly by the client, followed by a DNS answer from the resolver. In the encrypted DoT case however, some TLS handshake messages are exchanged prior to sending encrypted DNS messages:

  • The client sends a Client Hello, advertising its supported TLS capabilities.
  • The server responds with a Server Hello, agreeing on TLS parameters that will be used to secure the connection. The Certificate message contains the identity of the server while the Certificate Verify message will contain a digital signature which can be verified by the client using the server Certificate. The client typically checks this certificate against its local list of trusted Certificate Authorities, but the DoT specification mentions alternative trust mechanisms such as public key pinning.
  • Once the TLS handshake is Finished by both the client and server, they can finally start exchanging encrypted messages.
  • While the above picture contains one DNS query and answer, in practice the secure TLS connection will remain open and will be reused for future DNS queries.
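
At the transport level, DoT frames each DNS message exactly as DNS over plain TCP does: a two-byte, big-endian length prefix followed by the message, all inside the TLS channel. A minimal sketch, where tls_stream stands for any established TLS connection:

use std::io::Write;

fn send_dns_query<W: Write>(tls_stream: &mut W, msg: &[u8]) -> std::io::Result<()> {
    let len = (msg.len() as u16).to_be_bytes();
    tls_stream.write_all(&len)?; // 2-byte length prefix, as in DNS over TCP
    tls_stream.write_all(msg)    // the DNS message itself
}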

Securing unencrypted protocols by slapping TLS on top of a new port has been done before:

  • Web traffic: HTTP (tcp/80) -> HTTPS (tcp/443)
  • Sending email: SMTP (tcp/25) -> SMTPS (tcp/465)
  • Receiving email: IMAP (tcp/143) -> IMAPS (tcp/993)
  • Now: DNS (tcp/53 or udp/53) -> DoT (tcp/853)

A problem with introducing a new port is that existing firewalls may block it, either because they employ a whitelist approach where new services have to be explicitly enabled, or a blocklist approach where a network administrator explicitly blocks a service. If the secure option (DoT) is less likely to be available than the insecure one, then users and applications might be tempted to fall back to unencrypted DNS. This in turn could allow attackers to force users onto the insecure version.

Such fallback attacks are not theoretical. SSL stripping has previously been used to downgrade HTTPS websites to HTTP, allowing attackers to steal passwords or hijack accounts.

Another approach, DNS Queries over HTTPS (DoH), was designed to support two primary use cases:

  • Prevent the above problem where on-path devices interfere with DNS. This includes the port blocking problem above.
  • Enable web applications to access DNS through existing browser APIs.
DoH is essentially HTTPS, the same encrypted standard the web uses, and reuses the same port number (tcp/443). Web browsers have already deprecated non-secure HTTP in favor of HTTPS. That makes HTTPS a great choice for securely transporting DNS messages. An example of such a DoH request can be found here.
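
Below is a minimal sketch of the GET form of a DoH request, in which the raw DNS message is base64url-encoded into the dns query parameter. It assumes the reqwest crate (with its blocking feature) and the base64 0.13 crate; build_query is a simplified helper for a single A-record question:

fn build_query(name: &str) -> Vec<u8> {
    let mut msg = vec![
        0x00, 0x00, // ID = 0, recommended for DoH so responses are cacheable
        0x01, 0x00, // flags: recursion desired
        0x00, 0x01, // QDCOUNT = 1
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // answer/authority/additional counts = 0
    ];
    for label in name.split('.') {
        msg.push(label.len() as u8);
        msg.extend_from_slice(label.as_bytes());
    }
    msg.push(0); // root label terminates the name
    msg.extend_from_slice(&[0x00, 0x01, 0x00, 0x01]); // QTYPE = A, QCLASS = IN
    msg
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let q = base64::encode_config(build_query("example.com"), base64::URL_SAFE_NO_PAD);
    let url = format!("https://cloudflare-dns.com/dns-query?dns={}", q);
    let resp = reqwest::blocking::Client::new()
        .get(url.as_str())
        .header("accept", "application/dns-message")
        .send()?;
    let status = resp.status();
    let body = resp.bytes()?;
    println!("HTTP {}, {} bytes of DNS answer", status, body.len());
    Ok(())
}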

[Figure: DoH: DNS query and response transported over a secure HTTPS stream]

Some users have been concerned that the use of HTTPS could weaken privacy due to the potential use of cookies for tracking purposes. The DoH protocol designers considered various privacy aspects and explicitly discouraged use of HTTP cookies to prevent tracking, a recommendation that is widely respected. TLS session resumption improves TLS 1.2 handshake performance, but can potentially be used to correlate TLS connections. Luckily, use of TLS 1.3 obviates the need for TLS session resumption by reducing the number of round trips by default, effectively addressing its associated privacy concern.

Using HTTPS means that HTTP protocol improvements can also benefit DoH. For example, the in-development HTTP/3 protocol, built on top of QUIC, could offer additional performance improvements in the presence of packet loss due to lack of head-of-line blocking. This means that multiple DNS queries could be sent simultaneously over the secure channel without blocking each other when one packet is lost.

A draft for DNS over QUIC (DNS/QUIC) also exists and is similar to DoT, but without the head-of-line blocking problem due to the use of QUIC. Both HTTP/3 and DNS/QUIC, however, require a UDP port to be accessible. In theory, both could fall back to DoH over HTTP/2 and DoT respectively.

Deployment of DoT and DoH

As both DoT and DoH are relatively new, they are not universally deployed yet. On the server side, major public resolvers, including Cloudflare’s 1.1.1.1 and Google DNS, support them. Many ISP resolvers, however, still lack support. A small list of public resolvers supporting DoH can be found at DNS server sources; another list of public resolvers supporting DoT and DoH can be found on DNS Privacy Public Resolvers.

There are two methods to enable DoT or DoH on end-user devices:

  • Add support to applications, bypassing the resolver service from the operating system.
  • Add support to the operating system, transparently providing support to applications.

There are generally three configuration modes for DoT or DoH on the client side:

  • Off: DNS will not be encrypted.
  • Opportunistic mode: try to use a secure transport for DNS, but fall back to unencrypted DNS if the former is unavailable. This mode is vulnerable to downgrade attacks where an attacker can force a device to use unencrypted DNS. It aims to offer privacy when there are no on-path active attackers.
  • Strict mode: try to use DNS over a secure transport. If unavailable, fail hard and show an error to the user.

The current state for system-wide configuration of DNS over a secure transport:

  • Android 9: supports DoT through its “Private DNS” feature. Modes:
    • Opportunistic mode (“Automatic”) is used by default. The resolver from network settings (typically DHCP) will be used.
    • Strict mode can be configured by setting an explicit hostname. No IP address is allowed; the hostname is resolved using the default resolver and is also used for validating the certificate. (Relevant source code)
  • iOS and Android users can also install the 1.1.1.1 app to enable either DoH or DoT support in strict mode. Internally it uses the VPN programming interfaces to enable interception of unencrypted DNS traffic before it is forwarded over a secure channel.
  • Linux with systemd-resolved from systemd 239: DoT through the DNSOverTLS option (an example configuration follows this list).

    • Off is the default.
    • Opportunistic mode can be configured, but no certificate validation is performed.
    • Strict mode is available since systemd 243. Any certificate signed by a trusted certificate authority is accepted. However, there is no hostname validation with the GnuTLS backend while the OpenSSL backend expects an IP address.
    • In any case, no Server Name Indication (SNI) is sent, and the certificate name is not validated, making a man-in-the-middle attack rather trivial.
  • Linux, macOS, and Windows can use a DoH client in strict mode. The cloudflared proxy-dns command uses the Cloudflare DNS resolver by default, but users can override it through the proxy-dns-upstream option.
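
For instance, a strict-mode DoT setup with systemd-resolved (v243 or later) might look like the snippet below; the #hostname suffix supplies the name used for certificate validation. A sketch, assuming the 1.1.1.1 resolver:

# /etc/systemd/resolved.conf
[Resolve]
DNS=1.1.1.1#cloudflare-dns.com
DNSOverTLS=yes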

Web browsers support DoH instead of DoT:

  • Firefox 62 supports DoH and provides several Trusted Recursive Resolver (TRR) settings. By default DoH is disabled, but Mozilla is running an experiment to enable DoH for some users in the USA. This experiment currently uses Cloudflare’s 1.1.1.1 resolver, since we are the only provider that currently satisfies the strict resolver policy required by Mozilla. Since many DNS resolvers still do not support an encrypted DNS transport, Mozilla’s approach will ensure that more users are protected using DoH.
    • When enabled through the experiment, or through the “Enable DNS over HTTPS” option at Network Settings, Firefox will use opportunistic mode (network.trr.mode=2 at about:config).
    • Strict mode can be enabled with network.trr.mode=3, but requires an explicit resolver IP to be specified (for example, network.trr.bootstrapAddress=1.1.1.1).
  • While Firefox ignores the default resolver from the system, it can be configured with alternative resolvers. Additionally, enterprise deployments that use a resolver without DoH support have the option to disable DoH.
  • Chrome 78 enables opportunistic DoH if the system resolver address matches one of the hard-coded DoH providers (source code change). This experiment is enabled for all platforms except Linux and iOS, and excludes enterprise deployments by default.
  • Opera 65 adds an option to enable DoH through Cloudflare’s 1.1.1.1 resolver. This feature is off by default. Once enabled, it appears to use opportunistic mode: if 1.1.1.1:443 (without SNI) is reachable, it will be used. Otherwise it falls back to the default resolver, unencrypted.

The DNS over HTTPS page from the curl project has a comprehensive list of DoH providers and additional implementations.

As an alternative to encrypting the full network path between the device and the external DNS resolver, one can take a middle ground: use unencrypted DNS between devices and the gateway of the local network, but encrypt all DNS traffic between the gateway router and the external DNS resolver. Assuming a secure wired or wireless network, this would protect all devices in the local network against a snooping ISP, or other adversaries on the Internet. As public Wi-Fi hotspots are not considered secure, this approach would not be safe on open Wi-Fi networks. Even if it is password-protected with WPA2-PSK, others will still be able to snoop and modify unencrypted DNS.

Other security considerations

The previous sections described the secure DNS transports DoH and DoT. These only ensure that your client receives the untampered answer from the DNS resolver. They do not, however, protect the client against the resolver returning the wrong answer (through DNS hijacking or DNS cache poisoning attacks). The “true” answer is determined by the owner of a domain or zone as reported by the authoritative name server. DNSSEC allows clients to verify the integrity of the returned DNS answer and catch any unauthorized tampering along the path between the client and authoritative name server.

However, deployment of DNSSEC is hindered by middleboxes that incorrectly forward DNS messages, and even when the information is available, stub resolvers used by applications might not validate the results. A report from 2016 found that only 26% of users use DNSSEC-validating resolvers.

DoH and DoT protect the transport between the client and the public resolver. The public resolver may have to reach out to additional authoritative name servers in order to resolve a name. Traditionally, the path between any resolver and the authoritative name server uses unencrypted DNS. To protect these DNS messages as well, we did an experiment with Facebook, using DoT between 1.1.1.1 and Facebook’s authoritative name servers. While setting up a secure channel using TLS increases latency, it can be amortized over many queries.

Transport encryption ensures that resolver results and metadata are protected. For example, the EDNS Client Subnet (ECS) information included with DNS queries could reveal the original client address that started the DNS query. Hiding that information along the path improves privacy. It will also prevent broken middle-boxes from breaking DNSSEC due to issues in forwarding DNS.

Operational issues with DNS encryption

DNS encryption may bring challenges to individuals or organizations that rely on monitoring or modifying DNS traffic. Security appliances that rely on passive monitoring watch all incoming and outgoing network traffic on a machine or on the edge of a network. Based on unencrypted DNS queries, they could, for example, identify machines infected with malware. If the DNS query is encrypted, then passive monitoring solutions will not be able to monitor domain names.

Some parties expect DNS resolvers to apply content filtering for purposes such as:

  • Blocking domains used for malware distribution.
  • Blocking advertisements.
  • Performing parental-control filtering, blocking domains associated with adult content.
  • Blocking access to domains serving illegal content according to local regulations.
  • Offering split-horizon DNS to provide different answers depending on the source network.

An advantage of blocking access to domains via the DNS resolver is that it can be centrally done, without reimplementing it in every single application. Unfortunately, it is also quite coarse. Suppose that a website hosts content for multiple users at example.com/videos/for-kids/ and example.com/videos/for-adults/. The DNS resolver will only be able to see “example.com” and can either choose to block it or not. In this case, application-specific controls such as browser extensions would be more effective since they can actually look into the URLs and selectively prevent content from being accessible.

DNS monitoring is not comprehensive. Malware could skip DNS and hardcode IP addresses, or use alternative methods to query an IP address. However, not all malware is that complicated, so DNS monitoring can still serve as a defence-in-depth tool.

All of these non-passive monitoring or DNS blocking use cases require support from the DNS resolver. Deployments that rely on opportunistic DoH/DoT upgrades of the current resolver will maintain the same feature set as usually provided over unencrypted DNS. Unfortunately, this is vulnerable to downgrades, as mentioned before. To solve this, system administrators can point endpoints to a DoH/DoT resolver in strict mode. Ideally, this is done through secure device management solutions (MDM, group policy on Windows, etc.).

Conclusion

One of the cornerstones of the Internet is mapping names to an address using DNS. DNS has traditionally used insecure, unencrypted transports. This has been abused by ISPs in the past for injecting advertisements, but also causes a privacy leak. Nosey visitors in the coffee shop can use unencrypted DNS to follow your activity. All of these issues can be solved by using DNS over TLS (DoT) or DNS over HTTPS (DoH). These techniques to protect the user are relatively new and are seeing increasing adoption.

From a technical perspective, DoH is very similar to HTTPS and follows the general industry trend to deprecate non-secure options. DoT is a simpler transport mode than DoH, as the HTTP layer is removed, but that also makes it easier to block, either deliberately or by accident.

Secondary to enabling a secure transport is the choice of a DNS resolver. Some vendors will use the locally configured DNS resolver, but try to opportunistically upgrade the unencrypted transport to a more secure transport (either DoT or DoH). Unfortunately, the DNS resolver usually defaults to one provided by the ISP which may not support secure transports.

Mozilla has adopted a different approach. Rather than relying on local resolvers that may not even support DoH, they allow the user to explicitly select a resolver. Resolvers recommended by Mozilla have to satisfy high standards to protect user privacy. To ensure that parental control features based on DNS remain functional, and to support the split-horizon use case, Mozilla has added a mechanism that allows private resolvers to disable DoH.

The DoT and DoH transport protocols are ready for us to move to a more secure Internet. As can be seen in previous packet traces, these protocols are similar to existing mechanisms to secure application traffic. Once this security and privacy hole is closed, there will be many more to tackle.


Griffin, an anti-fraud risk rule engine making billions of predictions daily

Post Syndicated from Grab Tech original https://engineering.grab.com/griffin

Introduction

At Grab, the scale and fast-moving nature of our business means we need to be vigilant about potential risks to our customers and to our business. Some of the things we watch for include promotion abuse and passenger safety on late-night ride allocations. To overcome these issues, the TIS (Trust/Identity/Safety) taskforce was formed with a group of AI developers dedicated to fraud detection and prevention.

The team’s mission is:

  • keep fraudulent users away from our app or services,
  • ensure our customers’ safety, and
  • manage user identities for secure login to the Grab app.

The TIS team’s scope covers not just transport, but also our food, delivery and other Grab verticals.

How we prevented fraudulent transactions in the earlier days

In our early days when Grab was smaller, we used a rules-based approach to block potentially fraudulent transactions. Rules are essentially boolean conditions that determine whether the result will be true or false. These rules were very effective in mitigating fraud risk, and we used to create them manually in the code.

We started with very simple rules. For example:

Rule 1:

 IF a credit card has been declined today

 THEN this card cannot be used for booking

To quickly incorporate rules in our app or service, we integrated them into our backend service code and deployed our service frequently to use the latest rules.
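
A hardcoded rule of this kind could look like the following sketch (the function and field names are illustrative, not Grab's actual code):

def allow_booking(booking: dict) -> bool:
    # IF a credit card has been declined today
    # THEN this card cannot be used for booking
    return not booking["card_declined_today"]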

It worked really well in the beginning. Our logic was relatively simple, and only one developer managed the changes regularly. Triggering rule deployments and enforcing the rules was lightweight.

However, as the business rapidly expanded, we had to exponentially increase the rule complexity. For example, consider these two new rules:

Rule 2:

IF a credit card has been declined today but this passenger has good booking history

THEN we would still allow this booking to go through, but precharge X amount

Rule 3:

IF a credit card has been declined (but paid off) more than twice in the last 3 months

THEN we would still not allow this booking

The system scans through the rules one by one, and even if one rule is tripped, it still checks the remaining rules. In the example above, if a credit card has been declined more than twice in the last 3 months, the passenger will not be allowed to book, even though he has a good booking history.

Though all rules follow a similar pattern, there are subtle differences in the logic and they enable different decisions. Maintaining these complex rules was getting harder and harder.

Now imagine we added more rules, as shown in the example below. We first check if the device used by the passenger is a high-risk one, e.g. an emulator used for booking. If not, we then check the payment method to evaluate the risk (e.g. any declined booking from the credit card), and then make a decision on whether this booking should be precharged or not. If a passenger is using a low-risk device but is in some risky location where we traditionally see a lot of fraudulent bookings, we would then run some further checks on the passenger's booking history to decide if a precharge is also needed.

Now consider that instead of a single passenger, we have thousands of passengers. Each of these passengers can have a large number of rules for review. While not impossible to do, it can be difficult and time-consuming, and it gets exponentially more difficult the more rules you have to take into consideration. Time has to be spent carefully curating these rules.

Rules flow

The more rules you add to increase accuracy, the more difficult it becomes to take them all into consideration.

Our rules were getting 10X more complicated than the example shown above. Consequently, developers had to spend long hours understanding the logic of our rules, and also be very careful to avoid any interference with new rules.

In the beginning, we implemented rules through a three-step process:

  1. Data Scientists and Analysts dived deep into our transaction data and discovered patterns.
  2. They abstracted these patterns and wrote rules in English (e.g. promotion-based bookings should be limited to 5 and total finished bookings should be greater than 6; otherwise, unallocate the current ride).
  3. Developers implemented these rules and deployed the changes to production.

Sometimes, the use of English between steps 2 and 3 caused inaccurate rule implementations (e.g. for “X should be limited to 5”, should the implementation be X < 5 or X <= 5?).

Once a new rule was deployed, we monitored its performance. For example:

  • How often does the rule fire (per minute, hour, or day)?
  • Is it over-firing?
  • Does it conflict with other rules?

Depending on the implementation, each rule had dependencies on other rules. For example, if Rule 1 fired, we would not continue with Rule 2 and Rule 3.

As a result, we couldn’t keep each rule evaluation independent. We had no way to observe the performance of a rule without other rules interfering. Consider an example where we change Rule 1:

From IF a credit card has been declined today

To   IF a credit card has been declined this week

As Rules 2 and 3 depend on Rule 1, their trigger rates would drop significantly. This means we would have unstable performance metrics for Rules 2 and 3 even though their logic has not changed. It is very hard for a rule owner to monitor the performance of Rules 2 and 3.
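
To see why, consider a hypothetical sketch of this evaluation order (the field names are illustrative): because the rules run in sequence and short-circuit, widening Rule 1's time window changes which bookings ever reach Rules 2 and 3, skewing their trigger rates even though their own logic is untouched.

def evaluate(booking: dict) -> str:
    if booking["card_declined_today"]:           # Rule 1: widen "today" to
        if booking["good_booking_history"]:      # "this week" here, and Rules
            return "precharge"                   # 2 and 3 see a very different
        return "block"                           # population of bookings
    if booking["declined_twice_in_3_months"]:    # Rule 3
        return "block"
    return "allow"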

When it comes to A/B testing a new rule, Data Scientists need to put a lot of effort into cleaning up the noise from other rules, but most of the time it is mission impossible.

After several misfiring events (wrong implementations of rules) and ever-longer rule development times (weekly), we realized: “No one can handle this manually.”

Birth of Griffin Rule Engine

We decided to take a step back, sit down and closely review our daily patterns. We realized that our daily patterns fall into two categories:

  1. Fetching new data: e.g. “what is the credit card risk score”, or “how many food bookings has this user ordered in the last 7 days”, and transforming this data for easier consumption.
  2. Updating/creating rules: e.g. if a credit card risk score is high, decline a booking.

These two categories are essentially divided into two independent components:

  1. Data orchestration – collecting/transforming the data from different data sources.
  2. Rule-based prediction

Based on these findings, we got started with our Data Orchestrator (open sourced at https://github.com/grab/symphony) and Griffin projects.

The intent of Griffin is to provide data scientists and analysts with a way to add new rules to monitor, prevent, and detect fraud across Grab.

Griffin allows technical novices to apply their fraud expertise to add very complex rules, and it automates the review of those rules without manual intervention.

Griffin now predicts billions of events every day, with 100K+ queries per second (QPS) at peak time (on only 6 regular EC2 instances).

Data scientists and analysts can self-service rule changes on the web portal directly, deploy rules with just a few clicks, experiment and monitor performance in real time.

Why we came up with Griffin instead of using third-party tools in the market

Before we decided to create our own in-built tool, we did some research on common business rule engines available in the market, such as Drools, and checked if we should use them. In that process, we found:

  1. Drools has its own Java-based DSL with a non-trivial learning curve (whereas most of our users come from a Python background).
  2. Limited [expressive power](https://en.wikipedia.org/wiki/Expressive_power_(computer_science)).
  3. Limited support for some common math functions (e.g. factorial/Greatest Common Divisor).
  4. The nature of our business needed dynamic datasets for predictions (for example, a rule may need only passenger booking history on Day 1, but it may use passenger booking history, passenger credit balance, and passenger favorite places on Day 2). Drools, on the other hand, usually works well with a static list of datasets rather than a dynamic one.

Given the above constraints, we decided to build our own rule engine which can better fit our needs.

Griffin Architecture

The diagram depicts the high-level flow of making a prediction through Griffin.

High-level flow of making a prediction through Griffin

Components

  • Data Orchestration: a service that collects all data needed for predictions
  • Rule Engine: a service that makes prediction based on rules
  • Rule Editor: the portal through which users can create/update rules

Workflow

  1. Users create/update rules in the Rule Editor web portal and save them in the database.
  2. The Griffin Rule Engine reloads the rules immediately whenever it detects any rule changes.
  3. The Data Orchestrator sends all the data (features) needed for a prediction (e.g. whether to block a ride based on the passenger's past ride pattern and credit card risk) to the Rule Engine.
  4. The Griffin Rule Engine makes a prediction.

How you can create rules using Griffin

In an abstract view, a rule inside Griffin is defined as:

Rule:

Input: JSON => Result: Boolean

We allow users (analysts, data scientists) to write Python-based rules on WebUI to accommodate some very complicated rules like:

len(list(filter(lambda x: x > 7, map(lambda x: math.factorial(x), [1, 2, 3, 4, 5, 6])))) > 2

This significantly improves the expressive power of rules.
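
To sketch how such an expression could be evaluated against the features supplied for a prediction (the names below are illustrative, not Griffin's actual internals, and restricting __builtins__ like this is only a guard rail, not a real security sandbox):

import math

SAFE_GLOBALS = {
    "__builtins__": {},  # guard rail only, not a true sandbox
    "math": math, "len": len, "list": list,
    "filter": filter, "map": map,
}

def compile_rule(expression: str):
    # Compile once at rule-load time, evaluate many times per second.
    return compile(expression, "<rule>", "eval")

def evaluate_rule(code, features: dict) -> bool:
    # features is the JSON input sent by the Data Orchestrator.
    return bool(eval(code, SAFE_GLOBALS, features))

rule = compile_rule(
    "len(list(filter(lambda x: x > 7,"
    " map(lambda x: math.factorial(x), [1, 2, 3, 4, 5, 6])))) > 2"
)
print(evaluate_rule(rule, {}))  # True: three of the factorials exceed 7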

To match and evaluate a rule more efficiently, we also have other key components associated:

Scenarios

  • Here are some examples: PreBooking, PostBookingCompletion, PostFoodDelivery

Actions

  • Actions such as NotAllowBooking, AuthCapture, SendNotification
  • If a rule result is True, it returns a list of treatments as selected by users, e.g. AuthCapture and SendNotification. The example below shows the treatments for a checkpoint that detects credit-card risk.
Treatments: AuthCapture
  • Each checkpoint has a default treatment. If no rule inside this checkpoint is hit, the rule engine would return the default one (in most cases, it is just “do nothing”).
  • A treatment can only belong to one checkpoint, but one checkpoint can have multiple treatments.

For example, the graph below demonstrates a checkpoint PaxPreRide associated with three treatments: Pass, Decline, Hold

Treatments: Adding

Segments

  • The scope/dimension of a rule. Based on the sample segments below, a rule can be applied only to countries=[MY,PH] and verticals=[GrabBus, GrabCar].
  • It can be changed at any time on WebUI as well.
Segments

Values of a rule

When a rule is hit, users want more than just treatments; they also want some dynamic values returned, e.g. the maximum ride distance allowed if we believe this booking is medium risk.

Does Python make Griffin run slow?

We picked Python to enjoy its great expressive power and neatness of syntax, but some people ask: Python is slow, would this cause a latency bottleneck?

Our answer is No.

The graph below shows the P99 latency of prediction requests, measured from the load balancer side (the real latency for each prediction is < 6 ms; the metric peaks at 30 ms because some batch requests contain 50 predictions in a single call).

Prediction Request Latency P99

What did we do to achieve this?

  • The key idea is to do all computation in CPU and memory only (in other words, no extra I/O).
  • We do not fetch the rules from the database for each prediction. Instead, we keep a record called dirty_key, which holds the latest rule update timestamp. The rule engine actively checks this timestamp and triggers a rule reload only when the dirty_key timestamp in the DB is newer than the latest rule reload time (see the sketch after this list).
  • The rule engine does not fetch any additional new data; instead, all data comes from the Data Orchestrator.
  • So the whole prediction flow stays within CPU and memory (and if the data size is small, it could fit in the CPU cache).
  • The Python GIL essentially restricts a process to at most one active thread at a time, no matter how many cores a CPU has. We use Gunicorn to wrap our service, so on the production machines we have (2 x $num_cores) + 1 processes (see http://docs.gunicorn.org/en/latest/design.html#how-many-workers). The formula is based on the assumption that, for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
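
As referenced in the list above, here is an illustrative sketch of the dirty_key reload check (the class and method names are hypothetical, not Griffin's actual code):

import time

class RuleCache:
    # Keeps compiled rules in memory; reloads only when dirty_key moves.
    def __init__(self, db):
        self.db = db
        self.rules = db.load_all_rules()
        self.loaded_at = time.time()

    def maybe_reload(self):
        # dirty_key stores the timestamp of the latest rule update; a
        # reload is triggered only when it is newer than our local copy.
        if self.db.get_dirty_key_timestamp() > self.loaded_at:
            self.rules = self.db.load_all_rules()
            self.loaded_at = time.time()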

The screenshot below shows a process snapshot on a C5.large machine with 2 vCPUs. Note that only the green processes are active.

Process snapshot on C5.large machine

A lot of trial and error performance tuning:

  • We used to have python-jsonpath-rw for JSONPath queries, but the performance was not strong enough. We switched to jmespath and observed about a 10 ms latency reduction.
  • We use sqlalchemy for DB queries and ORM. We enabled caching for some use cases, but it turned out to be over-optimized, serving stale data. We ended up turning off some caching points to ensure data consistency.
  • For new dict/list creation, we prefer the native syntax (e.g. {}/[]) over a function call (see the comparison below).
Native call and Function call
  • Use built-in functions (https://docs.python.org/3/library/functions.html); they are written in C and hard to beat.
  • Add randomness to rule reloads so that the machines do not all reload at the same time and cause latency spikes.
  • Cache atomic feature units as they are used so that we don't have to re-query them each time a checkpoint uses them.
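
As a rough illustration of the native-call versus function-call comparison above, here is a micro-benchmark sketch (exact numbers vary by machine and interpreter version):

import timeit

# The literal syntax avoids a global name lookup and a function call,
# so it is consistently faster than calling dict() or list().
print(timeit.timeit("{}", number=10_000_000))      # literal: faster
print(timeit.timeit("dict()", number=10_000_000))  # lookup + call: slower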

How Griffin makes on-call engineers relax

One of the most popular aspects of Griffin is the WebUI. It opens a door for non-developers to make production changes in real time, which significantly boosts organisational productivity. In the past, a rule change needed 1 week for code change/test/deployment; now it takes just 1 minute.

But this also introduces extra risks. Anyone could take a whole checkpoint down, whether unintentionally or maliciously.

Hence, we implemented Shadow Mode and percentage-based rollout for each rule. Users can put a rule into Shadow Mode to verify its performance without any production impact, and if needed, a rule can be rolled out from 1% all the way to 100%.

We implemented version control for every rule change, so that in case anything unexpected happens, we can quickly roll back to the previous version.

Version control
Rollback button

We also built an RBAC-based permission system, along with a Change Approval flow, to make sure any production change involves at least two people (and the approver role has higher permissions).

Closing thoughts

Griffin evolved from a fraud-based rule engine into a generic rule engine that can apply to any rule at Grab. For example, Grab just launched Appeal automation several days ago to cut by 50% the human effort it typically takes to review straightforward appeals from our passengers and drivers. It was an unplanned use case, but we are very excited about it.

This was possible because, from the very beginning, we designed Griffin with minimal business context, so that it could be generic enough.

After this launch, we observed an amazing adoption rate for various fraud/safety/identity use cases. More interestingly, people now treat Griffin as an automation layer for various integrations.

Supporting the latest version of the Privacy Pass Protocol

Post Syndicated from Alex Davidson original https://blog.cloudflare.com/supporting-the-latest-version-of-the-privacy-pass-protocol/

Supporting the latest version of the Privacy Pass Protocol

At Cloudflare, we are committed to supporting and developing new privacy-preserving technologies that benefit all Internet users. In November 2017, we announced server-side support for the Privacy Pass protocol, a piece of work developed in collaboration with the academic community. Privacy Pass, in a nutshell, allows clients to provide proof of trust without revealing where and when the trust was provided. The aim of the protocol is then to allow anyone to prove they are trusted by a server, without that server being able to track the user via the trust that was assigned.

On a technical level, Privacy Pass clients receive attestation tokens from a server, that can then be redeemed in the future. These tokens are provided when a server deems the client to be trusted; for example, after they have logged into a service or if they prove certain characteristics. The redeemed tokens are cryptographically unlinkable to the attestation originally provided by the server, and so they do not reveal anything about the client.

Supporting the latest version of the Privacy Pass Protocol

To use Privacy Pass, clients can install an open-source browser extension available in Chrome & Firefox. There have been over 150,000 individual downloads of Privacy Pass worldwide; approximately 130,000 in Chrome and more than 20,000 in Firefox. The extension is supported by Cloudflare to make websites more accessible for users. This complements previous work, including the launch of Cloudflare onion services to help improve accessibility for users of the Tor Browser.

The initial release was almost two years ago, and it was followed up with a research publication that was presented at the Privacy Enhancing Technologies Symposium 2018 (winning a Best Student Paper award). Since then, Cloudflare has been working with the wider community to build on the initial design and improve Privacy Pass. We’ll be talking about the work that we have done to develop the existing implementations, alongside the protocol itself.

What’s new?

Support for Privacy Pass v2.0 browser extension:

  • Easier configuration of workflow.
  • Integration with new service provider (hCaptcha).
  • Compliance with hash-to-curve draft.
  • Possible to rotate keys outside of extension release.
  • Available in Chrome and Firefox (works best with up-to-date browser versions).

Rolling out a new server backend using Cloudflare Workers platform:

  • Cryptographic operations performed using internal V8 engine.
  • Provides public redemption API for Cloudflare Privacy Pass v2.0 tokens.
  • Available by making POST requests to https://privacypass.cloudflare.com/api/redeem. See the documentation for example usage.
  • Only compatible with extension v2.0 (check that you have updated!).

Standardization:

  • Continued development of the oblivious pseudorandom functions (OPRFs) in prime-order groups draft with the Crypto Forum Research Group (CFRG) at the IRTF.
  • New draft specifying Privacy Pass protocol.

Extension v2.0

In the time since the release, we’ve been working on a number of new features. Today we’re excited to announce support for version 2.0 of the extension, the first update since the original release. The extension continues to be available for Chrome and Firefox. You may need to download v2.0 manually from the store if you have auto-updates disabled in your browser.

The extension remains under active development and we still regard our support as in the beta phase. This will continue to be the case as the draft specification of the protocol continues to be written in collaboration with the wider community.

Supporting the latest version of the Privacy Pass Protocol

New Integrations

The client implementation uses the WebRequest API to look for certain types of HTTP requests. When these requests are spotted, they are rewritten to include some cryptographic data required for the Privacy Pass protocol. This allows Privacy Pass providers receiving this data to authorize access for the user.

For example, a user may receive Privacy Pass tokens for completing some server security checks. These tokens are stored by the browser extension, and any future request that needs similar security clearance can be modified to add a stored token as an extra HTTP header. The server can then check the client token and verify that the client has the correct authorization to proceed.

Supporting the latest version of the Privacy Pass Protocol

While Cloudflare supports a particular type of request flow, it would be impossible to expect different service providers to all abide by the same exact interaction characteristics. One of the major changes in the v2.0 extension has been a technical rewrite to instead use a central configuration file. The config is specified in the source code of the extension and allows easier modification of the browsing characteristics that initiate Privacy Pass actions. This makes adding new, completely different request flows possible by simply cloning and adapting the configuration for new providers.

To demonstrate that such integrations are now possible with other services beyond Cloudflare, a new version of the extension will soon be rolling out that is supported by the CAPTCHA provider hCaptcha. Users that solve ephemeral challenges provided by hCaptcha will receive privacy-preserving tokens that will be redeemable at other hCaptcha customer sites.

Supporting the latest version of the Privacy Pass Protocol

“hCaptcha is focused on user privacy, and supporting Privacy Pass is a natural extension of our work in this area. We look forward to working with Cloudflare and others to make this a common and widely adopted standard, and are currently exploring other applications. Implementing Privacy Pass into our globally distributed service was relatively straightforward, and we have enjoyed working with the Cloudflare team to improve the open source Chrome browser extension in order to deliver the best experience for our users.” – Eli-Shaoul Khedouri, founder of hCaptcha

This hCaptcha integration with the Privacy Pass browser extension acts as a proof-of-concept in establishing support for new services. Any new providers that would like to integrate with the Privacy Pass browser extension can do so simply by making a PR to the open-source repository.

Improved cryptographic functionality

After the release of v1.0 of the extension, there were features that were still unimplemented. These included proper zero-knowledge proof validation for checking that the server was always using the same committed key. In v2.0 this functionality has been completed, verifiably preventing a malicious server from attempting to deanonymize users by using a different key for each request.

The cryptographic operations required for Privacy Pass are performed using elliptic curve cryptography (ECC). The extension currently uses the NIST P-256 curve, for which we have included some optimisations. Firstly, this makes it possible to store elliptic curve points in compressed and uncompressed data formats. This means that browser storage can be reduced by 50%, and that server responses can be made smaller too.
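
The size difference is easy to reproduce with any ECC library; for instance, this short sketch uses the Python cryptography package (a recent version assumed) purely for illustration, since the extension itself is written in JavaScript:

from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

# A P-256 point serializes to 65 bytes uncompressed (0x04 || X || Y)
# but only 33 bytes compressed (sign byte || X): roughly a 50% saving.
point = ec.generate_private_key(ec.SECP256R1()).public_key()
uncompressed = point.public_bytes(Encoding.X962, PublicFormat.UncompressedPoint)
compressed = point.public_bytes(Encoding.X962, PublicFormat.CompressedPoint)
print(len(uncompressed), len(compressed))  # 65 33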

Supporting the latest version of the Privacy Pass Protocol

Secondly, support has been added for hashing to the P-256 curve using the “Simplified Shallue-van de Woestijne-Ulas” (SSWU) method specified in an ongoing draft (https://tools.ietf.org/html/draft-irtf-cfrg-hash-to-curve-03) for standardizing encodings for hashing to elliptic curves. The implementation is compliant with the specification of the “P256-SHA256-SSWU-” ciphersuite in this draft.

These changes have a dual advantage. First, they ensure that the P-256 hash-to-curve implementation is compliant with the draft specification. Second, this ciphersuite removes the need for probabilistic methods, such as hash-and-increment. The hash-and-increment method has a non-negligible chance of failure, and its running time is highly dependent on the hidden client input. While it is not clear how these timing attack vectors could be abused currently, using the SSWU method should reduce the potential for attacking the implementation and learning the client input.

Key rotation

As we mentioned above, verifying that the server is always using the same key is an important part of ensuring the client’s privacy. This ensures that the server cannot segregate the user base and reduce client privacy by using different secret keys for each client that it interacts with. The server guarantees that it’s always using the same key by publishing a commitment to its public key somewhere that the client can access.

Every time the server issues Privacy Pass tokens to the client, it also produces a zero-knowledge proof that it has produced these tokens using the correct key.

Supporting the latest version of the Privacy Pass Protocol

Before the extension stores any tokens, it first verifies the proof against the commitments it knows. Previously, these commitments were stored directly in the source code of the extension. This meant that if the server wanted to rotate its key, then it required releasing a new version of the extension, which was unnecessarily difficult. The extension has been modified so that the commitments are stored in a trusted location that the client can access when it needs to verify the server response. Currently this location is a separate Privacy Pass GitHub repository. For those that are interested, we have provided a more detailed description of the new commitment format in Appendix A at the end of this post.

Supporting the latest version of the Privacy Pass Protocol

Implementing server-side support in Workers

So far we have focused on client-side updates. As part of supporting v2.0 of the extension, we are rolling out some major changes to the server-side support that Cloudflare uses. For version 1.0, we used a Go implementation of the server. In v2.0 we are introducing a new server implementation that runs in the Cloudflare Workers platform. This server implementation is only compatible with v2.0 releases of Privacy Pass, so you may need to update your extension if you have auto-updates turned off in your browser.

Our server will run at https://privacypass.cloudflare.com, and all Privacy Pass requests sent to the Cloudflare edge are handled by Worker scripts that run on this domain. Our implementation has been rewritten in JavaScript, with cryptographic operations running in the V8 engine that powers Cloudflare Workers. This means that we are able to run highly efficient and constant-time cryptographic operations. On top of this, we benefit from the enhanced performance provided by running our code in the Workers platform, as close to the user as possible.

WebCrypto support

Firstly, you may be asking: how do we manage to implement cryptographic operations in Cloudflare Workers? Currently, support for performing cryptographic operations is provided in the Workers platform via the WebCrypto API. This API gives users access to functionality such as cryptographic hashing, alongside more complicated operations like ECDSA signatures.

In the Privacy Pass protocol, as we’ll discuss a bit later, the main cryptographic operations are performed by a protocol known as a verifiable oblivious pseudorandom function (VOPRF). Such a protocol allows a client to learn function outputs computed by a server, without revealing to the server what their actual input was. The verifiable aspect means that the server must also prove (in a publicly verifiable way) that the evaluation they pass to the user is correct. Such a function is pseudorandom because the server output is indistinguishable from a random sequence of bytes.
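
To give a feel for the blinding at the heart of this primitive, here is a toy, deliberately insecure sketch of one OPRF round trip over a tiny Schnorr group (Python 3.8+ for the modular inverse; real deployments use elliptic curve groups, and the verifiable variant additionally attaches a zero-knowledge proof):

import hashlib
import secrets

p, q, g = 2039, 1019, 4   # toy safe-prime group (p = 2q + 1); g generates
                          # the order-q subgroup -- never use in practice

def hash_to_group(data: bytes) -> int:
    # Toy stand-in for a proper hash-to-group method such as SSWU.
    e = int.from_bytes(hashlib.sha256(data).digest(), "big") % q
    return pow(g, e, p)

k = secrets.randbelow(q - 1) + 1   # server's secret OPRF key

# Client: blind the input so the server never sees it.
P = hash_to_group(b"client input")
r = secrets.randbelow(q - 1) + 1
A = pow(P, r, p)                   # blinded element, sent to the server

# Server: evaluate on the blinded element.
B = pow(A, k, p)                   # returned to the client

# Client: unblind. B^(1/r) = P^(r*k*(1/r)) = P^k, the OPRF output.
output = pow(B, pow(r, -1, q), p)
assert output == pow(P, k, p)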

Supporting the latest version of the Privacy Pass Protocol

The VOPRF functionality requires a server to perform low-level ECC operations that are not currently exposed in the WebCrypto API. We weighed the possible ways of getting around this requirement. First, we trialled using the WebCrypto API in a non-standard manner, using EC Diffie-Hellman key exchange as a method for performing the scalar multiplication that we needed. We also tried to implement all operations in pure JavaScript. Unfortunately, both methods were unsatisfactory: they would either mean integrating with large external cryptographic libraries, or they would be far too slow to be used in a performant Internet setting.

In the end, we settled on a solution that adds the functions necessary for Privacy Pass to the internal WebCrypto interface in the Cloudflare V8 JavaScript engine. This algorithm mimics the sign/verify interface provided by signature algorithms like ECDSA. In short, we use the sign() function to issue Privacy Pass tokens to the client, while verify() can be used by the server to verify data that is redeemed by the client. These functions are implemented directly in the V8 layer, and so they are much more performant and secure (running in constant time, for example) than pure JS alternatives.

The Privacy Pass WebCrypto interface is not currently available for public usage. If it turns out there is enough interest in using this additional algorithm in the Workers platform, then we will consider making it public.

Applications

In recent times, VOPRFs have been shown to be a highly useful primitive in establishing many cryptographic tools. Aside from Privacy Pass, they are also essential for constructing password-authenticated key exchange protocols such as OPAQUE. They have also been used in designs of private set intersection, password-protected secret-sharing protocols, and privacy-preserving access-control for private data storage.

Public redemption API

Writing the server in Cloudflare Workers means that we will be providing server-side support for Privacy Pass on a public domain! While we only issue tokens to clients after we are sure that we can trust them, anyone will be able to redeem the tokens using our public redemption API at https://privacypass.cloudflare.com/api/redeem. As we roll out the server-side component worldwide, you will be able to interact with this API and verify Cloudflare Privacy Pass tokens independently of the browser extension.

Supporting the latest version of the Privacy Pass Protocol

This means that any service can accept Privacy Pass tokens from a client that were issued by Cloudflare, and then verify them with the Cloudflare redemption API. Using the result provided by the API, external services can check whether Cloudflare has authorized the user in the past.

We think that this will benefit other service providers because they can use the attestation of authorization from Cloudflare in their own decision-making processes, without sacrificing the privacy of the client at any stage. We hope that this ecosystem can grow further, with potentially more services providing public redemption APIs of their own. With a more diverse set of issuers, these attestations will become more useful.

By running our server on a public domain, we are effectively a customer of the Cloudflare Workers product. This means that we are also able to make use of Workers KV for protecting against malicious clients. In particular, servers must check that clients are not re-using tokens during the redemption phase. The performance of Workers KV in handling reads makes it an obvious choice for providing double-spend protection globally.

If you would like to use the public redemption API, we provide documentation for using it at https://privacypass.github.io/api-redeem. We also provide some example requests and responses in Appendix B at the end of the post.

Standardization & new applications

In tandem with the recent engineering work that we have been doing on supporting Privacy Pass, we have been collaborating with the wider community in an attempt to standardize both the underlying VOPRF functionality, and the protocol itself. While the process of standardization for oblivious pseudorandom functions (OPRFs) has been running for over a year, the recent efforts to standardize the Privacy Pass protocol have been driven by very recent applications that have come about in the last few months.

Standardizing protocols and functionality is an important way of providing interoperable, secure, and performant interfaces for running protocols on the Internet. This makes it easier for developers to write their own implementations of this complex functionality. The process also provides helpful peer reviews from experts in the community, which can lead to better surfacing of potential security risks that should be mitigated in any implementation. Other benefits include coming to a consensus on the most reliable, scalable and performant protocol designs for all possible applications.

Oblivious pseudorandom functions

Oblivious pseudorandom functions (OPRFs) are a generalization of VOPRFs that do not require the server to prove that they have evaluated the functionality properly. Since July 2019, we have been collaborating on a draft with the Crypto Forum Research Group (CFRG) at the Internet Research Task Force (IRTF) to standardize an OPRF protocol that operates in prime-order groups. This is a generalisation of the setting that is provided by elliptic curves. This is the same VOPRF construction that was originally specified by the Privacy Pass protocol and is based heavily on the original protocol design from the paper of Jarecki, Kiayias and Krawczyk.

One of the recent changes that we’ve made in the draft is to increase the size of the key that we consider for performing OPRF operations on the server side. Existing research suggests that it is possible to create specific queries that can lead to small amounts of the key being leaked. For keys that provide only 128 bits of security this can be a problem, as leaking too many bits would reduce security below currently accepted levels. To counter this, we have effectively increased the minimum key size to 196 bits. This prevents this leakage from becoming an attack vector using any practical methods. We discuss these attacks in more detail later on, when discussing our future plans for VOPRF development.

Recent applications and standardizing the protocol

The application that we demonstrated when originally supporting Privacy Pass was always intended as a proof-of-concept for the protocol. Over the past few months, a number of new possibilities have arisen in areas that go far beyond what was previously envisaged.

Supporting the latest version of the Privacy Pass Protocol

For example, the trust token API, developed by the Web Incubator Community Group, has been proposed as an interface for using Privacy Pass. This application allows third-party vendors to check that a user has received a trust attestation from a set of central issuers. This allows the vendor to make decisions about the honesty of a client without having to associate a behaviour profile with the identity of the user. The objective is to protect against fraudulent activity from users who are not trusted by the central issuer set. Checking trust attestations with central issuers would be possible using redemption APIs similar to the one that we have introduced.

A separate piece of work from Facebook details a similar application for preventing fraudulent behavior that may also be compatible with the Privacy Pass protocol. Finally, other applications have arisen in the areas of providing access to private storage and establishing security and privacy models in advertisement confirmations.

A new draft

With the applications above in mind, we have recently started collaborative work on a new IETF draft that specifically lays out the required functionality provided by the Privacy Pass protocol as a whole. Our aim is to develop, alongside wider industrial partners and the academic community, a functioning specification of the Privacy Pass protocol. We hope that by doing this we will be able to design a base-layer protocol, that can then be used as a cryptographic primitive in wider applications that require some form of lightweight authorization. Our plan is to present the first version of this draft at the upcoming IETF 106 meeting in Singapore next month.

The draft is still in the early stages of development and we are actively looking for people who are interested in helping to shape the protocol specification. We would be grateful for any help that contributes to this process. See the GitHub repository for the current version of the document.

Future avenues

Finally, while we are actively working on a number of different pathways in the present, the future directions for the project are still open. We believe that there are many applications out there that we have not considered yet and we are excited to see where the protocol is used in the future. Here are some other ideas we have for novel applications and security properties that we think might be worth pursuing in future.

Publicly verifiable tokens

One of the disadvantages of using a VOPRF is that redemption tokens are only verifiable by the original issuing server. If we used an underlying primitive that allowed public verification of redemption tokens, then anyone could verify that the issuing server had issued the particular token. Such a protocol could be constructed on top of so-called blind signature schemes, such as Blind RSA. Unfortunately, there are performance and security concerns arising from the usage of blind signature schemes in a browser environment. Existing schemes (especially RSA-based variants) require cryptographic computations that are much heavier than the construction used in our VOPRF protocol.

Post-quantum VOPRF alternatives

The only known constructions of VOPRFs exist in pre-quantum settings, usually based on the hardness of well-known problems in group settings such as the discrete-log assumption. No constructions of VOPRFs are known to provide security against adversaries that can run quantum computational algorithms. This means that the Privacy Pass protocol is only believed to be secure against adversaries running on classical hardware.

Recent developments suggest that quantum computing may arrive sooner than previously thought. As such, we believe that investigating the possibility of constructing practical post-quantum alternatives for our current cryptographic toolkit is a task of great importance for ourselves and the wider community. In this case, devising performant post-quantum alternatives for VOPRF constructions would be an important theoretical advancement. Eventually this would lead to a Privacy Pass protocol that still provides privacy-preserving authorization in a post-quantum world.

VOPRF security and larger ciphersuites

We mentioned previously that VOPRFs (or simply OPRFs) are susceptible to small amounts of possible leakage in the key. Here we will give a brief description of the actual attacks themselves, along with further details on our plans for implementing higher security ciphersuites to mitigate the leakage.

Specifically, malicious clients can interact with a VOPRF to create something known as a q-Strong-Diffie-Hellman (q-sDH) sample. Such samples are created in mathematical groups (usually in the elliptic curve setting). For any group there is a public element g that is central to all Diffie-Hellman type operations, along with the server key K, which is usually just interpreted as a randomly generated number from this group. A q-sDH sample takes the form:

( g, g^K, g^(K^2), … , g^(K^q) )

and asks the malicious adversary to create a pair of elements satisfying (g^(1/(s+K)), s). It is possible for a client in the VOPRF protocol to create a q-sDH sample by just submitting the result of the previous VOPRF evaluation back to the server.

While this problem is believed to be hard to break, there are a number of past works that show that the problem is somewhat easier than the size of the group suggests (for example, see here and here). Concretely speaking, the bit security implied by the group can be reduced by up to log2(q) bits. While this is not immediately fatal, even to groups that should provide 128 bits of security, it can lead to a loss of security that means that the setting is no longer future-proof. As a result, any group providing VOPRF functionality that is instantiated using an elliptic curve such as P-256 or Curve25519 provides weaker than advised security guarantees.

With this in mind, we have taken the recent decision to upgrade the ciphersuites that we recommend for OPRF usage to only those that provide > 128 bits of security, as standard. For example, Curve448 provides 196 bits of security. To launch an attack that reduced security to an amount lower than 128 bits would require making 2^(68) client OPRF queries. This is a significant barrier to entry for any attacker, and so we regard these ciphersuites as safe for instantiating the OPRF functionality.

In the near future, it will be necessary to upgrade the ciphersuites that are used in our support of the Privacy Pass browser extension to the recommendations made in the current VOPRF draft. In general, with a more iterative release process, we hope that the Privacy Pass implementation will be able to follow the current draft standard more closely as it evolves during the standardization process.

Get in touch!

You can now install v2.0 of the Privacy Pass extension in Chrome or Firefox.

If you would like to help contribute to the development of this extension then you can do so on GitHub. Are you a service provider that would like to integrate server-side support for the extension? Then we would be very interested in hearing from you!

We will continue to work with the wider community in developing the standardization of the protocol; taking our motivation from the available applications that have been developed. We are always looking for new applications that can help to expand the Privacy Pass ecosystem beyond its current boundaries.

Supporting the latest version of the Privacy Pass Protocol

Appendix

Here are some extra details related to the topics that we covered above.

A. Commitment format for key rotations

Key commitments are necessary for the server to prove that they’re acting honestly during the Privacy Pass protocol. The commitments that Privacy Pass uses for the v2.0 release have a slightly different format from the previous release.

"2.00": {
  "H": "BPivZ+bqrAZzBHZtROY72/E4UGVKAanNoHL1Oteg25oTPRUkrYeVcYGfkOr425NzWOTLRfmB8cgnlUfAeN2Ikmg=",
  "expiry": "2020-01-11T10:29:10.658286752Z",
  "sig": "MEUCIQDu9xeF1q89bQuIMtGm0g8KS2srOPv+4hHjMWNVzJ92kAIgYrDKNkg3GRs9Jq5bkE/4mM7/QZInAVvwmIyg6lQZGE0="
}

First, the version of the server key is 2.00; the server must inform the client which version it intends to use in the response containing the issued tokens. This is so that the client can always use the correct commitments when verifying the zero-knowledge proof that the server sends.

The value of the member H is the public key commitment to the secret key used by the server. This is a base64-encoded elliptic curve point of the form H = kG, where G is the fixed generator of the curve and k is the secret key of the server. Since the discrete-log problem is believed to be hard to solve, deriving k from H is believed to be difficult. The value of the member expiry is an expiry date for the commitment that is used. The value of the member sig is an ECDSA signature evaluated using a long-term signing key associated with the server, over the values of H and expiry.

When a client retrieves the commitment, it checks that it hasn’t expired and that the signature verifies using the corresponding verification key that is embedded into the configuration of the extension. If these checks pass, it retrieves H and verifies the issuance response sent by the server. Previous versions of these commitments did not include signatures, but these signatures will be validated from v2.0 onwards.
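
A client-side check along these lines could be sketched as follows (illustrative only: the exact byte layout of the signed message is an assumption here, and the extension's real implementation is in JavaScript):

import base64
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

def commitment_is_valid(verify_key: ec.EllipticCurvePublicKey,
                        commitment: dict) -> bool:
    # Assumed layout: the H and expiry values concatenated; a real
    # client also checks that expiry lies in the future.
    message = (commitment["H"] + commitment["expiry"]).encode()
    signature = base64.b64decode(commitment["sig"])
    try:
        verify_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
        return True
    except InvalidSignature:
        return False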

When a server wants to rotate the key, it simply generates a new key k2 and appends a new commitment to k2 with a new identifier such as 2.01. It can then use k2 as the secret for the VOPRF operations that it needs to compute.

B. Example Redemption API request

The redemption API is available over HTTPS by sending POST requests to https://privacypass.cloudflare.com/api/redeem. Requests to this endpoint must specify Privacy Pass data using JSON-RPC 2.0 syntax in the body of the request. Let’s look at an example request:

{
  "jsonrpc": "2.0",
  "method": "redeem",
  "params": {
    "data": [
      "lB2ZEtHOK/2auhOySKoxqiHWXYaFlAIbuoHQnlFz57A=",
      "EoSetsN0eVt6ztbLcqp4Gt634aV73SDPzezpku6ky5w=",
      "eyJjdXJ2ZSI6InAyNTYiLCJoYXNoIjoic2hhMjU2IiwibWV0aG9kIjoic3d1In0="
    ],
    "bindings": [
      "string1",
      "string2"
    ],
    "compressed":"false"
  },
  "id": 1
}

In the above: params.data[0] is the client input data used to generate a token in the issuance phase; params.data[1] is the HMAC tag that the server uses to verify a redemption; and params.data[2] is a stringified, base64-encoded JSON object that specifies the hash-to-curve parameters used by the client. For example, the last element in the array corresponds to the object:

{
    curve: "p256",
    hash: "sha256",
    method: "swu",
}

This specifies that the client has used the curve P-256, with the hash function SHA-256, and the SSWU method for hashing to the curve. This allows the server to verify the transaction with the correct ciphersuite. The client must bind the redemption request to some fixed information, which it stores as multiple strings in the array params.bindings. For example, it could send the Host header of the HTTP request and the HTTP path that was used (this is what the Privacy Pass browser extension uses). Finally, params.compressed is an optional boolean value (defaulting to false) that indicates whether the HMAC tag was computed over compressed or uncompressed point encodings.

Currently the only supported ciphersuites are the example above, or the same except with method equal to increment for the hash-and-increment method of hashing to a curve. This is the original method used in v1.0 of Privacy Pass, and is supported for backwards-compatibility only. See the provided documentation for more details.
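
As an illustration, a redemption request could be sent from Python with the requests library as sketched below; the data values are the placeholders from the example above, and a real client must supply values produced during issuance:

import requests

payload = {
    "jsonrpc": "2.0",
    "method": "redeem",
    "params": {
        "data": [
            "lB2ZEtHOK/2auhOySKoxqiHWXYaFlAIbuoHQnlFz57A=",
            "EoSetsN0eVt6ztbLcqp4Gt634aV73SDPzezpku6ky5w=",
            "eyJjdXJ2ZSI6InAyNTYiLCJoYXNoIjoic2hhMjU2IiwibWV0aG9kIjoic3d1In0=",
        ],
        # Bind the redemption to fixed request context (e.g. host, path).
        "bindings": ["string1", "string2"],
        "compressed": "false",
    },
    "id": 1,
}
response = requests.post("https://privacypass.cloudflare.com/api/redeem",
                         json=payload, timeout=10)
print(response.json())  # {"jsonrpc": "2.0", "result": "success", "id": 1}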

Example response

If a request is sent to the redemption API and it is successfully verified, then the following response will be returned.

{
  "jsonrpc": "2.0",
  "result": "success",
  "id": 1
}

When an error occurs something similar to the following will be returned.

{
  "jsonrpc": "2.0",
  "error": {
    "message": <error-message>,
    "code": <error-code>,
  },
  "id": 1
}

The error codes that we provide are specified as JSON-RPC 2.0 codes; we document the types of errors in the API documentation.

Public keys are not enough for SSH security

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/public-keys-are-not-enough-for-ssh-security/

Public keys are not enough for SSH security

If your organization uses SSH public keys, it’s entirely possible you have already mislaid one. There is a file sitting in a backup or on a former employee’s computer which grants the holder access to your infrastructure. If you share SSH keys between employees, it’s likely only a few keys are enough to give an attacker access to your entire system. If you don’t share them, it’s likely your team has generated so many keys that you long ago lost track of at least one.

If an attacker can breach a single one of your client devices it’s likely there is a known_hosts file which lists every target which can be trivially reached with the keys the machine already contains. If someone is able to compromise a team member’s laptop, they could use keys on the device that lack password protection to reach sensitive destinations.

Should that happen, how would you respond and revoke the lost SSH key? Do you have an accounting of the keys which have been generated? Do you rotate SSH keys? How do you manage that across an entire organization so consumed with serving customers that security has to be effortless to be adopted?

Cloudflare Access launched support for SSH connections last year to bring zero-trust security to how teams connect to infrastructure. Access integrates with your IdP to bring SSO security to SSH connections by enforcing identity-based rules each time a user attempts to connect to a target resource.

However, once Access connected users to the server they still had to rely on legacy SSH keys to authorize their account. Starting today, we’re excited to help teams remove that requirement and replace static SSH keys with short-lived certificates.

Replacing a private network with Cloudflare Access

In traditional network perimeter models, teams secure their infrastructure with two gates: a private network and SSH keys.

The private network requires that any user attempting to connect to a server must be on the same network, or a peered equivalent (such as a VPN). However, that introduces some risk. Private networks default to trust that a user on the network can reach a machine. Administrators must proactively segment the network or secure each piece of the infrastructure with control lists to work backwards from that default.

Cloudflare Access secures infrastructure by starting from the other direction: no user should be trusted. Instead, users must prove they should be able to access any unique machine or destination by default.

We released support for SSH connections in Cloudflare Access last year to help teams leave that network perimeter model and replace it with one that evaluates every request to a server for user identity. Through integration with popular identity providers, that solution also gives teams the ability to bring their SSO pipeline into their SSH flow.

Replacing static SSH keys with short-lived certificates

Once a user is connected to a server over SSH, they typically need to authorize their session. The machine they are attempting to reach will have a set of profiles which consists of user or role identities. Those profiles define what actions the user is able to take.

SSH processes make a few options available for the user to login to a profile. In some cases, users can login with a username and password combination. However, most teams rely on public-private key certificates to handle that login. To use that flow, administrators and users need to take prerequisite steps.

Prior to the connection, the user will generate a certificate and provide the public key to an administrator, who will then configure the server to trust the certificate and associate it with a certain user and set of permissions. The user stores that certificate on their device and presents it during that last mile. However, this leaves open all of the problems that SSO attempts to solve:

  • Most teams never force users to rotate certificates. If they do, it might be required once a year at most. This leaves static credentials to core infrastructure lingering on hundreds or thousands of devices.
  • Users are responsible for securing their certificates on their devices. Users are also responsible for passwords, but organizations can enforce requirements and revocation centrally.
  • Revocation is difficult. Teams must administer a CRL or OCSP platform to ensure that lost or stolen certificates are not used.

With Cloudflare Access, you can bring your SSO accounts to user authentication within your infrastructure. No static keys required.

How does it work?

To build this we turned to three tools we already had: Cloudflare Access, Argo Tunnel and Workers.

Access is a policy engine which combines the employee data in your identity provider (like Okta or AzureAD) with policies you craft. Based on those policies Access is able to limit access to your internal applications to the users you choose. It’s not a far leap to see how the same policy concept could be used to control access to a server over SSH. You write a policy and we use it to decide which of your employees should be able to access which resources. Then we generate a short-lived certificate allowing them to access that resource for only the briefest period of time. If you remove a user from your IdP, their access to your infrastructure is similarly removed, seamlessly.

To actually funnel the traffic through our network we use another existing Cloudflare tool: Argo Tunnel. Argo Tunnel flips the traditional model of connecting a server to the Internet. When you spin up our daemon on a machine it makes outbound connections to Cloudflare, and all of your traffic then flows over those connections. This allows the machine to be a part of Cloudflare’s network without you having to expose the machine to the Internet directly.

For HTTP use cases, Argo Tunnel only needs to run on the server. In the case of the Access SSH flow, we proxy SSH traffic through Cloudflare by running the Argo Tunnel client, cloudflared, on both the server and the end user’s laptop.

When users connect over SSH to a resource secured by Access for Infrastructure, they use the command-line tool cloudflared. cloudflared takes the SSH traffic bound for that hostname and forwards it through Cloudflare based on SSH config settings. No piping or command wrapping required. cloudflared launches a browser window and prompts the user to authenticate with their SSO credentials.

Once authenticated, Access checks the user’s identity against the policy you have configured for that application. If the user is permitted to reach the resource, Access generates a JSON Web Token (JWT), signed by Cloudflare and scoped to the user and application. Access distributes that token to the user’s device through cloudflared and the tool stores it locally.

Like the core Access authentication flow, the token validation is built with a Cloudflare Worker running in every one of our data centers, making it both fast and highly available. Workers made it possible for us to deploy this SSH proxying to all 194 of Cloudflare’s data centers, meaning Access for Infrastructure often speeds up SSH sessions rather than slowing them down.

Public keys are not enough for SSH security

With short-lived certificates enabled, the instance of cloudflared running on the client takes one additional step. cloudflared sends that token to a Cloudflare certificate signing endpoint that creates an ephemeral certificate. The user’s SSH flow then sends both the token, which is used to authenticate through Access, and the short-lived certificate, which is used to authenticate to the server.

When the server receives the request, it validates the short-lived certificate against the Cloudflare public key it has been configured to trust and, if the certificate is authentic, authorizes the user identity to a matching Unix user. The certificate, once issued, is valid for 2 minutes, but the SSH connection can last longer once the session has started.

What is the end user experience?

Cloudflare Access’ SSH feature is entirely transparent to the end user and does not require any unique SSH commands, wrappers, or flags. Instead, Access requires that your team members take a couple of one-time steps to get started:

1. Install the cloudflared daemon

The same lightweight software that runs on the target server is used to proxy SSH connections from your team members’ devices through Cloudflare. Users can install it with popular package managers like brew or download it directly from Cloudflare. Alternatively, the software is open source and can be built and distributed by your administrators.

2. Print the SSH configuration update and save it

Once an end user has installed cloudflared, they need to run one command to generate new lines to add to their SSH config file:

cloudflared access ssh-config --hostname vm.example.com --short-lived-cert

The --hostname field will contain the hostname or wildcard subdomain of the resource protected behind Access. Once run, cloudflared will print the following configuration details:

Host vm.example.com
    ProxyCommand bash -c '/usr/local/bin/cloudflared access ssh-gen --hostname %h; ssh -tt %r@cfpipe-vm.example.com >&2 <&1'
    
Host cfpipe-vm.example.com
    HostName vm.example.com
    ProxyCommand /usr/local/bin/cloudflared access ssh --hostname %h
    IdentityFile ~/.cloudflared/vm.example.com-cf_key
    CertificateFile ~/.cloudflared/vm.example.com-cf_key-cert.pub

Users need to append that output to their SSH config file. Once saved, they can connect over SSH to the protected resource. Access will prompt them to authenticate with their SSO credentials in the browser, in the same way they login to any other browser-based tool. If they already have an active browser session with their credentials, they’ll just see a success page.

In their terminal, cloudflared will establish the session and issue the client certificate that corresponds to their identity.


What’s next?

With short-lived certificates, Access can become a single SSO-integrated gateway for your team and infrastructure in any environment. Users can SSH directly to a given machine, and administrators can replace their jumphosts altogether, removing that overhead. The feature is available today for all Access customers. You can get started by following the Cloudflare Access documentation.

Tales from the Crypt(o team)

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/tales-from-the-crypt-o-team/

Halloween season is upon us. This week we’re sharing a series of blog posts about work being done at Cloudflare involving cryptography, one of the spookiest technologies around. So bookmark this page and come back every day for tricks, treats, and deep technical content.

A long-term mission

Cryptography is one of the most powerful technological tools we have, and Cloudflare has been at the forefront of using cryptography to help build a better Internet. Of course, we haven’t been alone on this journey. Making meaningful changes to the way the Internet works requires time, effort, experimentation, momentum, and willing partners. Cloudflare has been involved with several multi-year efforts to leverage cryptography to help make the Internet better.

Here are some highlights to expect this week:

  • We’re renewing Cloudflare’s commitment to privacy-enhancing technologies by sharing some of the recent work being done on Privacy Pass
  • We’re helping forge a path to a quantum-safe Internet by sharing some of the results of the Post-quantum Cryptography experiment
  • We’re sharing the Rust-based software we use to power time.cloudflare.com
  • We’re doing a deep dive into the technical details of Encrypted DNS
  • We’re announcing support for a new technique we developed with industry partners to help keep TLS private keys more secure

The milestones we’re sharing this week would not be possible without partnerships with companies, universities, and individuals working in good faith to help build a better Internet together. Hopefully, this week provides a fun peek into the future of the Internet.

AWS Security Profiles: Greg McConnel, Senior Manager, Security Specialists Team

Post Syndicated from Becca Crockett original https://aws.amazon.com/blogs/security/aws-security-profiles-greg-mcconnel-senior-manager-security-specialists-team/

Greg McConnel, Senior Manager
In the weeks leading up to re:Invent 2019, we’ll share conversations we’ve had with people at AWS who will be presenting at the event so you can learn more about them and some of the interesting work that they’re doing.


How long have you been at AWS, and what do you do in your current role?

I’ve been with AWS nearly seven years. I was previously a Solutions Architect on the Security Specialists Team for the Americas, and I recently took on an interim management position for the team. Once we find a permanent team lead, I’ll become the team’s West Coast manager.

How do you explain your job?

The Security Specialist role is a customer-facing one: we get to talk to customers about security, identity, and compliance on AWS, and we help them adopt the AWS Cloud securely and compliantly. Some of what we do is one-to-one with customers, but we also engage in a lot of one-to-many activities. My team talks to customers about their security concerns, how best to secure their workloads on AWS, and how to use our services alongside partner solutions to improve their security posture. A big part of my role is building content and delivering it to customers. This includes things like white papers, blog posts, hands-on workshops, Well-Architected reviews, and presentations.

What are you currently working on that you’re most excited about?

In the Security Specialist team, the one-to-one engagements we have with customers are primarily short-term and centered around particular questions or topics. Although this can benefit customers a great deal, it’s a reactive model. In order to move to a more proactive stance, we plan to start working with individual customers on a longer-term basis to help them solve particular security challenges or opportunities. This will remove any barriers to working directly with a Security Specialist and allow us to dive deep into the security concerns of our customers.

What’s the most challenging part of your job?

Keeping up with the pace of innovation. Just about everyone who works with AWS experiences this, whether you’re a customer or working in the AWS field. AWS is launching so many amazing things all the time that it can be challenging to keep up. The specialist SA role is attractive because the area we cover is somewhat limited. It’s a carved-out space within the larger, continuously expanding body of AWS knowledge. So it’s a little easier, and it allows me to dive deeper into particular security topics. But even within this carved-out area of security, it takes some dedication to stay current and informed for my customers. I find things like Jeff Barr’s AWS News Blog and the AWS Security Blog really valuable. Also, I’m a big fan of hands-on learning, so just playing around with new services, especially by following along with examples in new blog posts, can be an interesting way to keep up.

What’s your favorite part of your job?

Creating content, particularly workshops. For me, hands-on learning is not only an effective way to learn, it’s simply more fun. The process of creating workshops with a team can be rewarding and educational. It’s a creative, interactive process. Getting to see others try out your workshop and provide feedback is so satisfying, especially when you can tell that people now understand new concepts because of the work you did. Watching other people deliver a workshop that you built is also rewarding.

In your opinion, what’s the biggest challenge facing cloud security right now?

AWS allows you to move very fast. One of the primary advantages of cloud computing is the speed and agility it provides. When you’re moving fast though, security can be overlooked a bit. It’s so easy to start building on AWS that customers might not invest the necessary cycles on security, or they might think that doing so will slow them down. The reality, though, is that AWS provides the tools to allow you to stay agile while maintaining—and in many cases improving—your security. Automation and continuous monitoring are the keys here. AWS makes it possible to automate many basic security tasks, like patching. With the right tooling, you also gain the visibility needed to accurately identify and monitor critical assets and data. We provide great solutions for logging and monitoring that are highly integrated with our other services. So it’s quite possible now to move fast and stay secure, and it’s important that you do both.

What security best practices do you recommend to customers?

There are certain services we recommend that customers look into enabling because they’re an easy win when it comes to security. These aren’t requirements, because there are many great partner solutions that customers can also use. But these are solutions that I’d encourage customers to at least consider:

  • Turn on Amazon GuardDuty, which is a threat detection service that continuously monitors for malicious and unauthorized activity. The benefits it provides versus the effort involved to enable it make it a pretty easy choice. You literally click a button (or make a single API call; see the sketch after this list) to turn on GuardDuty, and the service starts analyzing tens of billions of events across a number of AWS data sources.
  • Work with your account team to get a Well-Architected review of your key workloads, especially with a focus on the Security pillar. You can even run well-architected reviews yourself with the AWS Well-Architected Tool. A well-architected review can help you build secure, high-performing, resilient, and efficient infrastructure for your application.
  • Use IAM access advisor to help ensure that the permissions granted in your AWS accounts are not overly broad by examining your service last accessed information.
  • Use AWS Organizations for centralized administration and governance of your accounts, and move to all-features mode to take advantage of everything the service offers.
  • Practice the principle of least privilege.
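To give a sense of how small the effort for that first recommendation is, here is a minimal sketch (my own illustration, using the AWS SDK for Go) of enabling GuardDuty in the current region; it assumes credentials and a region are available through the default credential chain.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/guardduty"
)

func main() {
	// Credentials and region come from the default chain
	// (environment variables, shared config, or an instance profile).
	sess := session.Must(session.NewSessionWithOptions(session.Options{
		SharedConfigState: session.SharedConfigEnable,
	}))
	svc := guardduty.New(sess)

	// Creating an enabled detector is the API equivalent of the
	// console's one-click "Enable GuardDuty".
	out, err := svc.CreateDetector(&guardduty.CreateDetectorInput{
		Enable: aws.Bool(true),
	})
	if err != nil {
		log.Fatalf("failed to enable GuardDuty: %v", err)
	}
	fmt.Println("GuardDuty detector created:", aws.StringValue(out.DetectorId))
}

Note that GuardDuty is regional, so in practice you would repeat this (or use the console) for each region you operate in.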

What does cloud security mean to you, personally?

This is the first time in a while that I’ve worked for a company where I actually like the product. I’m constantly surprised by the new services and features we release and what our customers are building on AWS. My background is in security, especially identity, so that’s the area I am most passionate about within AWS. I want people to use AWS because they can build amazing things at a pace that was just not possible before the cloud—but I want people to do all of this securely. Being able to help customers use our services and do so securely keeps me motivated.

You have a passion for building great workshops. What are you and Jesse Fuchs doing to help make workshops at re:Invent better?

Jesse and I have been working on a central site for AWS security workshops. We post a curated list of workshops, and everything on the site has gone through an internal bar raiser review. This is a formal process for reviewing the workshops and other content to make sure they meet certain standards, that they’re up to date, and that they follow a certain format so customers will find them familiar no matter which workshop they go through.

We’re now implementing a version of this bar raiser review for every workshop at re:Invent this year. In the past, there were people who helped review workshops, but there was no formal, independent bar-raising process. Now, every re:Invent workshop will go through it. Eventually, we want to extend this review to other AWS events.

Over time, you’ll see a lot more workshops added to the site. All of these workshops can either be delivered by AWS to the customer, or by customers to their own internal teams. It’s all Open Source too. Finally, we’ll keep the workshops up to date as new features are added.

The workshop you and Jesse Fuchs will host at re:Invent promises that attendees will build an end-to-end functional app with a secure identity provider. How can you build such an app in such a relatively short time?

I know it sounds like a tall order, but the workshop is designed to be completed in the allotted time. And since the content will be up on GitHub, you can also work on it afterwards on your own. It’s a level 400 workshop, so we’re expecting people to come into the session with a certain level of familiarity with some of these services. The session is two and a half hours, and a big part of this workshop is the one-on-one interaction with the facilitators. So if attendees feel like this is a lot to take in, they should keep in mind that they’ll get one-on-one interaction with experts in these fields. Our facilitators will sit down with attendees and address their questions as they go through the hands-on work. A lot of the learning actually comes out of that facilitation. The session starts with an initial presentation and detailed instructions for doing the workshop. Then the majority of the time will be spent hands-on, with that added one-on-one interaction. I don’t think we stress enough the value that customers get from the facilitation that occurs in workshops and how much learning occurs during that process.

What else should we know about your workshop?

One of the things I concentrate on is identity, and that’s not just Identity and Access Management. It’s identity across the board. If you saw Quint Van Deman’s session last year at re:Invent, he used an analogy about identity being a cake with three layers. There’s the platform layer on the bottom, which includes access to the console. There’s the infrastructure layer, which is identity for services like EC2 and RDS, for example. And then there’s the application identity layer on the top. That’s sometimes the layer that customers are most interested in because it’s related to the applications they’re building. Our workshop is primarily about that top layer. It’s one of the more difficult topics to cover, but again, an interesting one for customers.

What is your advice to first-time attendees at re:Invent?

Take advantage of your time at re:Invent and plan your week because there’s really no fluff. It’s not about marketing, it’s about learning. re:Invent is an educational event first and foremost. Also, take advantage of meeting people because it’s also a great networking event. Having conversations with people from companies that are doing similar things to what you’re doing on AWS can be very enlightening.

You like obscure restaurants and even more obscure movies. Can you share some favorites?

From a movie standpoint, I like thrillers and sci-fi, especially unusual sci-fi movies that are somewhat out of the mainstream. I do like blockbusters, but I definitely like to find great independent movies. I recently saw a movie called Coherence that was really interesting. It’s about what would happen if a group of friends having a dinner party could move between parallel or alternate universes. A few others I’d recommend to people who share my taste for the obscure, and that are maybe even a little thought-provoking, are Perfect Sense and The Quiet Earth.

From a restaurant standpoint, I am always looking for new Asian food restaurants, especially Vietnamese and Thai food. There are a lot of great ones hidden here in Los Angeles.
 
Want more AWS Security news? Follow us on Twitter.

The AWS Security team is hiring! Want to find out more? Check out our career page.

Greg McConnel

In his nearly 7 years at AWS, Greg has been in a number of different roles, which gives him a unique viewpoint on this company. He started out as a TAM in Enterprise Support, moved to a Gaming Specialist SA role, and then found a fitting home in AWS Security as a Security Specialist SA focusing on identity. Greg will continue to contribute to the team in his new role as a manager. Greg currently lives in LA where he loves to kayak, go on road trips, and in general, enjoy the sunny climate.

Using Grab’s Trust Counter Service to Detect Fraud Successfully

Post Syndicated from Grab Tech original https://engineering.grab.com/using-grabs-trust-counter-service-to-detect-fraud-successfully

Background

Fraud is not a new phenomenon, but with the rise of the digital economy it has taken different and aggressive forms. Over the last decade, novel ways to exploit technology have appeared, and as a result, millions of people have been impacted and millions of dollars in revenue have been lost. According to an ACFE survey, companies lost USD 6.3 billion due to fraud, and organizations lose 5% of their revenue annually to fraud.

In this blog, we take a closer look at how we developed an anti-fraud solution using the Counter service, which can be an indispensable tool in the highly complex world of fraud detection.

Anti-fraud solution using counters

At Grab, we detect fraud by deploying data science, analytics, and engineering tools to search for anomalous and suspicious transactions, or to identify high-risk individuals who are likely to commit fraud. Grab’s Trust Platform team provides a common anti-fraud solution across a variety of business verticals, such as transportation, payment, food, and safety. The team builds tools for managing data feeds, creates SDKs for engineering integration, and builds rules engines and consoles for fraud detection.

One example of fraudulent behavior is an individual who masquerades as both driver and passenger and makes cashless payments to get promotions, for example, to earn a one dollar rebate on the next transaction. In our system, we analyze real-time booking and payment signals, compare them with the historical data of the driver and passenger pair, and create rules using the rule engine. We count how many times a given driver and passenger pair transacts within a time frame, and this counter is provided as an input to the rule. If the counter value exceeds a predefined threshold, the rule evaluates the transaction as fraud, and we send this verdict back to the booking service.
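To make that flow concrete, here is a rough sketch in Go of what such a rule evaluation could look like. The interface, key format, window, and threshold are hypothetical illustrations, not Grab’s actual code.

package fraudrule

import (
	"context"
	"fmt"
	"time"
)

// CounterClient is a hypothetical interface over the Counter service:
// it returns how many times a key was seen within a time window.
type CounterClient interface {
	Get(ctx context.Context, counterKey string, window time.Duration) (int64, error)
}

// Verdict is sent back to the booking service.
type Verdict struct {
	Fraud  bool
	Reason string
}

// EvaluatePairRule flags a booking when the same driver-passenger pair
// has made more than `threshold` cashless trips within the window.
func EvaluatePairRule(ctx context.Context, c CounterClient, driverID, passengerID string, threshold int64) (Verdict, error) {
	key := fmt.Sprintf("fraud:counter:cashless_pair:%s:%s", driverID, passengerID)
	count, err := c.Get(ctx, key, 24*time.Hour)
	if err != nil {
		return Verdict{}, err
	}
	if count > threshold {
		return Verdict{Fraud: true, Reason: "driver-passenger pair exceeded cashless trip threshold"}, nil
	}
	return Verdict{Fraud: false}, nil
}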

The conventional method

Fraud detection is a job that requires cross-functional teams of data scientists, data analysts, data engineers, and backend engineers working together. Usually data scientists or data analysts come up with an idea offline and then want to apply it to real-time traffic. For example, a rule gets invented after brainstorming sessions by data scientists and data analysts. In the conventional method, the rule then needs to be communicated to engineers, who implement and deploy it.

Automated solution using the Counter service

To overcome the challenges of the conventional method, the Trust Platform team decided to build the Counter service: a self-service platform which provides management tools for users and a computing engine for integrating with backend services. This service provides an interface, such as a UI-based rule editor and data feed, so that analysts can experiment and create rules without interacting with engineers. The platform team also decided to provide different data contracts, APIs, and SDKs to engineers so that the business verticals can adopt it quickly and easily.

The major engineering challenges faced in designing the Counter service

There are millions of transactions happening at Grab every day, which means we need to perform billions of fraud and safety detections. As seen from the example shared earlier, most predictions require a group of counters. In the above use case, we need to know how many cashless payments happened for a driver and passenger pair. Due to the scale of Grab’s business, the potential combinations of drivers and passengers are enormous. And this is only one use case: there could be hundreds of counters for different use cases. Hence it’s important that we provide a platform for stakeholders to manage counters.

Some of the common challenges we faced were:

Scalability

As mentioned above, a single counter could cover an enormous number of passenger and driver combinations, so it’s a great challenge to store the counters in the database and to read and query them in real time. With billions of counter keys spanning a long period of time, the Trust team had to find a scalable way to write and fetch keys effectively and meet the client’s SLA.

Self-serving

A counter is usually invented by data scientists or analysts and used by engineers. For example, every time data scientists need a new type of counter, developers have to manually make code changes, such as adding a new stream, capturing the related data sets for the counter, and storing it on the fraud service, then doing a deployment to make the counter ready. The whole iteration usually takes two or more weeks, and if there are any changes from the data analysts’ side, which happens often, the cycle starts again. The team had to prevent this long loop of manual tasks by building a self-service interface.

Manageable and extendable

Due to a lack of connection between real-time and offline data, data analysts and data scientists did not have a clear picture of what is written in the counters. That’s because conventional counter data were stored in a Redis database to satisfy the query SLA, so they could not track the correctness of a counter’s value or its history. With the new solution, the stakeholders get a real-time picture of what is stored in the counters using the data engineering tools.

The Machine Learning challenges solved by the Counter service

The Counter service plays an important role in our Machine Learning (ML) workflow.

Data consistency

Most machine learning workflows need dedicated input data. However, when an anti-fraud model is trained using offline data from the data lake, it is difficult to use the same model in real time, because the model lacks a data contract and consistency with the data source. In this case, the Counter service becomes a type of data source by providing the value of counters to the file system.

ML features

Counters are important features for ML models. Imagine data scientists invent a new counter that they need to evaluate: a historical data set is required for it to work. The Counter service provides a counter replay feature, which allows data scientists to simulate the counters via historical payloads.

In general, the Counter service is a bridge between online and offline datasets, data scientists, and engineers. There was technical debt with regards to data consistency and automation on the ML pipeline, and the Counter service closed this loop.

How we designed the Counter service

We followed the principles of asynchronous data ingestion and synchronous transactions in designing the Counter service.

The diagram below shows how the counters are generated and saved to the database.

How the counters are generated and saved to the database

Counter creation workflow

  1. A user opens the Counter Creation UI and creates a new key, “fraud:counter:counter_name”.
  2. The user configures the required fields (a sketch of what such a definition might hold follows this list).
  3. The Counter service monitors for new counter creation, puts the new counter into load script storage, and starts processing new counter events (see Counter write below).
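As referenced in step 2, here is a hypothetical sketch of what a counter definition might carry. The field names are illustrative only; the post does not publish the actual schema.

package counter

import "time"

// Definition sketches what a counter created through the UI might hold.
type Definition struct {
	// Key follows the convention shown above, e.g. "fraud:counter:counter_name".
	Key string

	// Stream identifies the real-time event stream that feeds this counter.
	Stream string

	// Expression is the user-configured logic evaluated against each event
	// to decide what to count.
	Expression string

	// Granularities are the pre-aggregation buckets to maintain
	// (see the multi-bucket strategy later in this post).
	Granularities []time.Duration

	// Retention bounds how far back the counter can be queried.
	Retention time.Duration
}

// Example: a counter of cashless payments per driver-passenger pair.
var Example = Definition{
	Key:           "fraud:counter:cashless_pair",
	Stream:        "booking-payments",
	Expression:    `event.payment_type == "cashless"`,
	Granularities: []time.Duration{15 * time.Minute, time.Hour, 24 * time.Hour},
	Retention:     90 * 24 * time.Hour,
}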

Counter write workflow

  1. The Counter service monitors multiple streams and assembles extra data from online data services (e.g., the Common Data Service (CDS), the passenger service, and the hydra service), so that a rich dataset is also available to editors on each stream resource.
  2. The Counter Processor evaluates the user-configured expression and writes the evaluated values to the dedicated Grab-Stats stream using the GrabPlugin tool.

Counter read workflow

Counter read workflow

We use Grab-Stats as our storage service. Grab-Stats runs on top of ScyllaDB, a distributed NoSQL data store. We use ScyllaDB because of its good in-memory aggregation performance for time series datasets. Compared with in-memory storage like AWS ElastiCache, it is 10 times cheaper and just as reliable in terms of stability. The p99 read latency from ScyllaDB is less than 150 ms, which satisfies our SLA.

How we improved the Counter service performance

We used the multi-buckets strategy to improve the Counter service performance.

Background

Different queries use different time windows. Some counters are time-sensitive and need to know what happened in the last 30 or 60 minutes, while other counters focus on the long term and need the events of the last 30 or 90 days.

From a transactional database perspective, it’s not possible to serve both short-range and long-term queries at the same time. The higher the accuracy needed and the longer the time range, the more aggregation has to happen in the database, which means we would not be able to satisfy the SLA; otherwise, we would need to block other processes, which degrades the service.

Solution for improving the query

We resolved this problem by using tables of different granularities. We pre-aggregated the signals into different time buckets, such as 15 minutes, 1 hour, and 1 day.

When a request comes in, its time range is divided across these buckets and the partial results are combined. For example, for a request from 9/10 23:15:20 to 9/12 17:20:18, the handler queries 15-minute buckets for the partial hours at the edges, hourly buckets for the partial days, and daily buckets for the full days in between. This way, we avoid heavy aggregations but still keep accuracy at the 15-minute level within a scalable response time.
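A simplified sketch of this divide-and-conquer lookup in Go follows; it is my own illustration of the strategy described above (greedily covering the range with the coarsest aligned bucket), not Grab’s implementation.

package buckets

import "time"

// Granularities of the pre-aggregated tables, coarsest first.
var granularities = []time.Duration{24 * time.Hour, time.Hour, 15 * time.Minute}

// Bucket identifies one row to fetch from the pre-aggregated tables.
type Bucket struct {
	Start time.Time
	Width time.Duration
}

// Decompose splits [start, end) into the fewest pre-aggregated buckets,
// greedily taking the coarsest bucket that is aligned and fully contained.
// Results are accurate to the 15-minute level, as described above.
// Note: time.Truncate aligns relative to the zero time in UTC, so the
// daily buckets here assume UTC day boundaries.
func Decompose(start, end time.Time) []Bucket {
	var out []Bucket
	cur := start.Truncate(15 * time.Minute)
	for cur.Before(end) {
		matched := false
		for _, g := range granularities {
			if cur.Equal(cur.Truncate(g)) && !cur.Add(g).After(end) {
				out = append(out, Bucket{Start: cur, Width: g})
				cur = cur.Add(g)
				matched = true
				break
			}
		}
		if !matched {
			// The tail is shorter than 15 minutes: round it up to one
			// final fine-grained bucket.
			out = append(out, Bucket{Start: cur, Width: 15 * time.Minute})
			break
		}
	}
	return out
}

With this sketch, the example request above decomposes into three 15-minute buckets up to midnight, one daily bucket for 9/11, seventeen hourly buckets, and two 15-minute buckets at the tail, rounding the final seconds up to the 15-minute accuracy level.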

Counter service UI

We allow data analysts and data scientists to onboard counters by themselves from a dedicated web portal. After a counter is submitted, the Counter service takes care of integrating it and parsing the logic at runtime.

Counter service UI

Backend integration

We provide an SDK for quicker and better integration. Engineers only need to provide the counter identifier ID (shown in the UI) and the time duration in the query. Under the hood we use a gRPC protocol to communicate across services. We divide the query time window into smaller granularities, fetch from the different time series tables, and then combine the results. We also provide a short-TTL cache layer to absorb uncommon client traffic such as network retries or traffic throttling. We designed the service to target 100K QPS.
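The short-TTL cache layer is worth sketching: a small TTL map in front of the storage fetch means that a burst of retries for the same counter key reaches the storage tier only once per TTL window. The sketch below is an illustration of the idea, not Grab’s code; a production version would also bound memory and evict stale entries.

package countercache

import (
	"sync"
	"time"
)

// entry is one cached counter value.
type entry struct {
	value   int64
	expires time.Time
}

// TTLCache absorbs bursts of repeated reads (e.g. client network retries)
// so they do not all reach the storage tier.
type TTLCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	data  map[string]entry
	fetch func(key string) (int64, error) // falls through to the storage tier
}

func New(ttl time.Duration, fetch func(string) (int64, error)) *TTLCache {
	return &TTLCache{ttl: ttl, data: make(map[string]entry), fetch: fetch}
}

// Get returns the cached value when fresh, otherwise fetches and caches it.
func (c *TTLCache) Get(key string) (int64, error) {
	c.mu.Lock()
	if e, ok := c.data[key]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.value, nil
	}
	c.mu.Unlock()

	v, err := c.fetch(key)
	if err != nil {
		return 0, err
	}

	c.mu.Lock()
	c.data[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}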

Monitoring the Counter service

The Counter service dashboard helps track human errors made while editing counters in real time. The Counter service sends alerts to a Slack channel to notify users if there is any error.

Counter service dashboard

We set up Datadog to monitor multiple system metrics. The figure below shows a portion of stream processing and counter writing. In this example, the total stream QPS reaches 5k at peak hour, and counters are written to the storage tier at about 4k per second. These numbers will keep climbing, with no fixed upper limit, as more counters are onboarded.

Counter service dashboard with multiple metrics

The Counter service UI portal also helps users to fetch real-time counter results for verification purposes.

Counter service UI

Future plans

Here’s what we plan to do in the near future to improve the Counter service.

Close the ML workflow loop

As mentioned above, we plan to send the resource payload of the Counter service to the offline data lake in order to complete the counter replay function for data scientists. We are working on a project called “time traveler”. As the name indicates, it will not only serve online transactional data but also support historical data analytics, providing more flexibility for counter invention and experimentation.

There are more automation steps we plan to take, such as adding a replay button on the web portal and hooking into the offline big data engine to trigger analytics jobs. The performance metrics will be collected and displayed on the web portal. A single platform will be able to manage both the online and offline data.

Integration to Griffin

Griffin is our rule engine. Counters are sometimes an input to a particular rule, and one rule usually needs many counters to work together. We need to provide better integration with Griffin on the backend, and we plan to minimize the current engineering effort needed to use counters on Griffin: a counter will become an automated input variable on Griffin, which can be configured on the web portal by any user.