Tag Archives: security

Introducing Cloudflare for Campaigns

Post Syndicated from Alissa Starzak original https://blog.cloudflare.com/introducing-cloudflare-for-campaigns/

During the past year, we saw nearly 2 billion global citizens go to the polls to vote in democratic elections. There were major elections in more than 50 countries, including India, Nigeria, and the United Kingdom, as well as elections for the European Parliament. In 2020, we will see a similar number of elections in countries from Peru to Myanmar. In November, U.S. citizens will cast their votes for the 46th President, 435 seats in the U.S. House of Representatives, 35 of the 100 seats in the U.S. Senate, and many state and local elections.

Recognizing the importance of maintaining public access to election information, Cloudflare launched the Athenian Project in 2017, providing U.S. state and local government entities with the tools needed to secure their election websites for free. As we’ve seen, however, political parties and candidates for office all over the world are also frequent targets for cyberattack. Cybersecurity needs for campaign websites and internal tools are at an all-time high.

Although Cloudflare has helped improve the security and performance of political parties and candidates for office all over the world for years, we’ve long felt that we could do more. So today, we’re announcing Cloudflare for Campaigns, a suite of Cloudflare services tailored to campaign needs. Cloudflare for Campaigns is designed to make it easier for all political campaigns and parties, especially those with small teams and limited resources, to get access to cybersecurity services.

Risks faced by political campaigns

Since Russians attempted to use cyberattacks to interfere in the U.S. Presidential election in 2016, the news has been filled with reports of cyber threats against political campaigns, both in the United States and around the world. Hackers targeted the campaigns of Emmanuel Macron in France and Angela Merkel in Germany with phishing attacks, the main political parties in the UK with DDoS attacks, and congressional campaigns in California with a combination of malware, DDoS attacks, and brute force login attempts.

Both because of our services to state and local government election websites through the Athenian Project and because a significant number of political parties and candidates for office use our services, Cloudflare has seen many attacks on election infrastructure and political campaigns firsthand.

During the 2020 U.S. election cycle, Cloudflare has provided services to 18 major presidential campaigns, as well as a range of congressional campaigns. On a typical day, Cloudflare blocks 400,000 attacks against political campaigns, and, on a busy day, Cloudflare blocks more than 40 million attacks against campaigns.

What is Cloudflare for Campaigns?

Cloudflare for Campaigns is a suite of Cloudflare products focused on the needs of political campaigns, particularly smaller campaigns that don’t have the resources to bring significant cybersecurity resources in house. To ensure the security of a campaign website, the Cloudflare for Campaigns package includes Business-level service, as well as security tools particularly helpful for political campaign websites, such as the web application firewall, rate limiting, load balancing, Enterprise-level “I am Under Attack Support”, bot management, and multi-user account enablement.

To ensure the security of internal campaign teams, Cloudflare for Campaigns will also include Cloudflare Access, which allows campaigns to secure, authenticate, and monitor user access to any domain, application, or path on Cloudflare without using a VPN. Along with Access, we will provide Cloudflare Gateway, which protects campaign staff as they navigate the Internet by using DNS-based filtering at multiple locations to keep malicious content off the campaign’s network, helping prevent users from running into phishing scams or malware sites. Campaigns can use Gateway after the product’s public release.

Cloudflare for Campaigns also includes Cloudflare’s reliability and security guide, a set of best practices for political campaigns to maintain their campaign sites and secure their internal teams.

Regulatory Challenges

Although there is widespread agreement that campaigns and political parties face threats of cyberattack, there is less consensus on how best to get political campaigns the help they need. Many political campaigns and political parties operate under resource constraints, without the technological capability and financial resources to dedicate to cybersecurity. At the same time, campaigns around the world are the subject of a variety of different regulations intended to prevent corruption of democratic processes. As a practical matter, that means that, although campaigns may not have the resources needed to access cybersecurity services, donation of cybersecurity services to campaigns may not always be allowed.

In the U.S., campaign finance regulations prohibit corporations from providing any contributions of either money or services to federal candidates or political party organizations. These rules prevent companies from offering free or discounted services if those services are not provided on the same terms and conditions to similarly situated members of the general public. The Federal Elections Commission (FEC), which enforces U.S. campaign finance laws, has struggled with the issue of how best to apply those rules to the provision of free or discounted cybersecurity services to campaigns. In a number of advisory opinions, it has publicly wrestled with the competing priorities of securing campaigns from cyberattack while not opening a backdoor to the donation of goods and services intended to curry favor with particular candidates.

The FEC has issued two advisory opinions to tech companies seeking to provide free or discounted cybersecurity services to campaigns. In 2018, the FEC approved a request by Microsoft to offer a package of enhanced online account security protections for “election-sensitive” users. The FEC reasoned that Microsoft was offering the services to its paid users “based on commercial rather than political considerations, in the ordinary course of its business and not merely for promotional consideration or to generate goodwill.” In July 2019, the FEC approved a request by a cybersecurity company to provide low-cost anti-phishing services to campaigns because those services would be provided in the ordinary course of business and on the same terms and conditions as offered to similarly situated non-political clients.

In September 2018, a month after Microsoft submitted its request, Defending Digital Campaigns (DDC), a nonprofit established with the mission to “secure our democratic campaign process by providing eligible campaigns and political parties, committees, and related organizations with knowledge, training, and resources to defend themselves from cyber threats,” submitted a request to the FEC to offer free or reduced-cost cybersecurity services, including from technology corporations, to federal candidates and parties. Over the following months, the FEC issued and requested comment on multiple draft opinions on whether the donation was permissible and, if so, on what basis. As described by the FEC, to support its position, DDC represented that “federal candidates and parties are singularly ill-equipped to counteract these threats.” The FEC’s advisory opinion to DDC noted:

“You [DDC] state that presidential campaign committees and national party committees require expert guidance on cybersecurity and you contend that the ‘vast majority of campaigns’ cannot afford full-time cybersecurity staff and that ‘even basic cybersecurity consulting software and services’ can overextend the budgets of most congressional campaigns. AOR004. For instance, you note that a congressional candidate in California reported a breach to the Federal Bureau of Investigation (FBI) in March of this year but did not have the resources to hire a professional cybersecurity firm to investigate the attack, or to replace infected computers. AOR003.”

In May 2019, the FEC approved DDC’s request to partner with technology companies to provide free and discounted cybersecurity services “[u]nder the unusual and exigent circumstances” presented by the request and “in light of the demonstrated, currently enhanced threat of foreign cyberattacks against party and candidate committees.”

All of these opinions demonstrate the FEC’s desire to allow campaigns to access affordable cybersecurity services because of the heightened threat of cyberattack, while still being cautious to ensure that those services are offered transparently and consistent with the goals of campaign finance laws.

Partnering with DDC to Provide Free Services to US Candidates

We share the view of both DDC and the FEC that political campaigns — which are central to our democracy — must have the tools to protect themselves against foreign cyberattack. Cloudflare is therefore excited to announce a new partnership with DDC to provide Cloudflare for Campaigns for free to candidates and parties that meet DDC’s criteria.

To receive free services under DDC, political campaigns must meet the following criteria, as DDC laid out to the FEC (a toy version of these rules in code follows the list):

  • A House candidate’s committee that has at least $50,000 in receipts for the current election cycle, and a Senate candidate’s committee that has at least $100,000 in receipts for the current election cycle;
  • A House or Senate candidate’s committee for candidates who have qualified for the general election ballot in their respective elections; or
  • Any presidential candidate’s committee whose candidate is polling above five percent in national polls.
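
To make the criteria concrete, here is a toy Python check of the rules above; a minimal sketch, assuming a committee record with the fields shown. The dollar and polling thresholds come straight from DDC’s list, while the data structure and function name are invented for illustration.

    # Toy sketch of DDC's eligibility rules as listed above. Thresholds
    # come from the list; the structure and names are illustrative only.
    def is_eligible(committee: dict) -> bool:
        office = committee.get("office")              # "house", "senate", or "president"
        receipts = committee.get("receipts", 0)       # receipts this election cycle, USD
        on_ballot = committee.get("qualified_for_general", False)
        polling = committee.get("national_polling_pct", 0.0)

        if office == "house" and (receipts >= 50_000 or on_ballot):
            return True
        if office == "senate" and (receipts >= 100_000 or on_ballot):
            return True
        if office == "president" and polling > 5.0:
            return True
        return False

    print(is_eligible({"office": "senate", "receipts": 120_000}))  # True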

For more information on eligibility for these services under DDC and the next steps, please visit cloudflare.com/campaigns/usa.

Election package

Although political campaigns are regulated differently all around the world, Cloudflare believes that the integrity of all political campaigns should be protected against powerful adversaries. With this in mind, Cloudflare will also offer Cloudflare for Campaigns as a paid service, designed to help campaigns all around the world as we work to address regulatory hurdles. For more information on how to sign up for the Cloudflare election package, please visit cloudflare.com/campaigns.

Introducing Cloudflare for Teams

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/introducing-cloudflare-for-teams/

Ten years ago, when Cloudflare was created, the Internet was a place that people visited. People still talked about ‘surfing the web’ and the iPhone was less than two years old, but on July 4, 2009, large-scale DDoS attacks were launched against websites in the US and South Korea.

Those attacks highlighted how fragile the Internet was and how all of us were becoming dependent on access to the web as part of our daily lives.

Fast forward ten years and the speed, reliability, and safety of the Internet are paramount as our private and work lives depend on it.

We started Cloudflare to solve one half of every IT organization’s challenge: how do you ensure that the resources and infrastructure you expose to the Internet are safe from attack, fast, and reliable? We saw that the world was moving away from hardware and software to solve these problems and instead wanted a scalable service that would work around the world.

To deliver that, we built one of the world’s largest networks. Today our network spans more than 200 cities worldwide and is within milliseconds of nearly everyone connected to the Internet. We have built the capacity to stand up to nation-state scale cyberattacks and a threat intelligence system powered by the immense amount of Internet traffic that we see.

Today we’re expanding Cloudflare’s product offerings to solve the other half of every IT organization’s challenge: ensuring the people and teams within an organization can access the tools they need to do their job and are safe from malware and other online threats.

The speed, reliability, and protection we’ve brought to public infrastructure are extended today to everything your team does on the Internet.

In addition to protecting an organization’s infrastructure, IT organizations are charged with ensuring that employees of an organization can access the tools they need safely. Traditionally, these problems would be solved by hardware products like VPNs and Firewalls. VPNs let authorized users access the tools they needed and Firewalls kept malware out.

Castle and Moat

The dominant model was the idea of a castle and a moat. You put all your valuable assets inside the castle. Your Firewall created the moat around the castle to keep anything malicious out. When you needed to let someone in, a VPN acted as the drawbridge over the moat.

This is still the model most businesses use today, but it’s showing its age. The first challenge is that if attackers find their way over the moat and into the castle, they can cause significant damage. Unfortunately, hardly a week goes by without a news story about an organization that had significant data compromised because an employee fell for a phishing email, a contractor was compromised, or someone snuck into an office and plugged in a rogue device.

The second challenge of the model is the rise of cloud and SaaS. Increasingly, an organization’s resources aren’t just in one castle anymore, but spread across different public cloud and SaaS vendors.

Services like Box, for instance, provide better storage and collaboration tools than most organizations could ever hope to build and manage themselves. But there’s literally nowhere you can ship a hardware box to Box in order to build your own moat around their SaaS castle. Box provides some great security tools themselves, but they are different from the tools provided by every other SaaS and public cloud vendor. Where IT organizations used to try to have a single pane of glass with a complex mess of hardware to see who was getting stopped by their moats and who was crossing their drawbridges, SaaS and cloud make that visibility increasingly difficult.

The third challenge to the traditional castle and moat strategy of IT is the rise of mobile. Where once upon a time your employees would all show up to work in your castle, now people are working around the world. Requiring everyone to log in to a limited number of central VPNs becomes obviously absurd when you picture it as villagers having to sprint back from wherever they are across a drawbridge whenever they want to get work done. It’s no wonder VPN support is one of the top IT organization tickets and likely always will be for organizations that maintain a castle and moat approach.

But it’s worse than that. Mobile has also introduced a culture where employees bring their own devices to work. Or, even if on a company-managed device, work from the road or home — beyond the protected walls of the castle and without the security provided by a moat.

If you’d looked at how we managed our own IT systems at Cloudflare four years ago, you’d have seen us following this same model. We used firewalls to keep threats out and required every employee to log in through our VPN to get their work done. For me, as someone who travels extensively for my job, it was especially painful.

Regularly, someone would send me a link to an internal wiki article asking for my input. I’d almost certainly be working from my mobile phone in the back of a cab running between meetings. I’d try and access the link and be prompted to login to our VPN in San Francisco. That’s when the frustration would start.

Corporate mobile VPN clients, in my experience, all seem to be powered by some 100-sided die that will only allow you to connect if the number of miles you are from your home office is less than 25 times whatever number is rolled. After much frustration and several IT tickets, with a little luck I might be able to connect. And, even then, the experience was horribly slow and unreliable.

When we audited our own system, we found that the frustration with the process had caused multiple teams to create workarounds that were, effectively, unauthorized drawbridges over our carefully constructed moat. And, as we increasingly adopted SaaS tools like Salesforce and Workday, we lost much of our visibility into how these tools were being used.

Around the same time we were realizing the traditional approach to IT security was untenable for an organization like Cloudflare, Google published their paper titled “BeyondCorp: A New Approach to Enterprise Security.” The core idea was that a company’s intranet should be no more trusted than the Internet. And, rather than the perimeter being enforced by a singular moat, instead each application and data source should authenticate the individual and device each time it is accessed.

The BeyondCorp idea, which has come to be known as the Zero Trust model for IT security, was influential for how we thought about our own systems. Powerfully, because Cloudflare had a flexible global network, we were able to use it both to enforce policies as our team accessed tools and to protect ourselves from malware as we did our jobs.
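
A minimal sketch of that per-request model in Python, assuming invented policies and hostnames; real deployments also evaluate device posture, location, and other signals on every request.

    # Toy Zero Trust check: every request is evaluated on identity and
    # device state, never on network location. All names are invented.
    from dataclasses import dataclass

    @dataclass
    class Request:
        user_email: str
        device_managed: bool
        resource: str

    POLICIES = {
        "wiki.internal.example.com":    {"domains": {"example.com"}, "managed_only": False},
        "billing.internal.example.com": {"domains": {"example.com"}, "managed_only": True},
    }

    def authorize(req: Request) -> bool:
        policy = POLICIES.get(req.resource)
        if policy is None:
            return False  # default deny: unknown resources are unreachable
        if req.user_email.split("@")[-1] not in policy["domains"]:
            return False
        return req.device_managed or not policy["managed_only"]

    print(authorize(Request("sam@example.com", False, "wiki.internal.example.com")))  # True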

Cloudflare for Teams

Today, we’re excited to announce Cloudflare for Teams™: the suite of tools we built to protect ourselves, now available to help any IT organization, from the smallest to the largest.

Cloudflare for Teams is built around two complementary products: Access and Gateway. Cloudflare Access™ is the modern VPN — a way to ensure your team members get fast access to the resources they need to do their job while keeping threats out. Cloudflare Gateway™ is the modern Next Generation Firewall — a way to ensure that your team members are protected from malware and follow your organization’s policies wherever they go online.

Powerfully, both Cloudflare Access and Cloudflare Gateway are built atop the existing Cloudflare network. That means they are fast, reliable, scalable to the largest organizations, DDoS resistant, and located everywhere your team members are today and wherever they may travel. Have a senior executive going on a photo safari to see giraffes in Kenya, gorillas in Rwanda, and lemurs in Madagascar — don’t worry, we have Cloudflare data centers in all those countries (and many more) and they all support Cloudflare for Teams.

All Cloudflare for Teams products are informed by the threat intelligence we see across all of Cloudflare’s products. We see such a large diversity of Internet traffic that we often see new threats and malware before anyone else. We’ve supplemented our own proprietary data with additional data sources from leading security vendors, ensuring Cloudflare for Teams provides a broad set of protections against malware and other online threats.

Moreover, because Cloudflare for Teams runs atop the same network we built for our infrastructure protection products, we can deliver them very efficiently. That means that we can offer these products to our customers at extremely competitive prices. Our goal is to make the return on investment (ROI) for all Cloudflare for Teams customers nothing short of a no-brainer. If you’re considering another solution, contact us before you decide.

Both Cloudflare Access and Cloudflare Gateway also build off products we’ve launched and battle tested already. For example, Gateway builds, in part, off our 1.1.1.1 Public DNS resolver. Today, more than 40 million people trust 1.1.1.1 as the fastest public DNS resolver globally. By adding malware scanning, we were able to create our entry-level Cloudflare Gateway product.
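
To get a feel for the resolver Gateway builds on, here is a short Python sketch that queries 1.1.1.1 over DNS-over-HTTPS using its public JSON API. The endpoint and header are 1.1.1.1’s documented DoH interface; the malware scanning described above is a Gateway layer on top and is not exercised by this call.

    # Query Cloudflare's 1.1.1.1 resolver over DNS-over-HTTPS (JSON API).
    # This shows the public resolver Gateway builds on; Gateway's malware
    # scanning is a separate layer and is not exercised here.
    import requests

    def resolve(name, record_type="A"):
        resp = requests.get(
            "https://cloudflare-dns.com/dns-query",
            params={"name": name, "type": record_type},
            headers={"Accept": "application/dns-json"},
            timeout=5,
        )
        resp.raise_for_status()
        return [answer["data"] for answer in resp.json().get("Answer", [])]

    print(resolve("cloudflare.com"))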

Cloudflare Access and Cloudflare Gateway build off our WARP and WARP+ products. We intentionally built a consumer mobile VPN service because we knew it would be hard. The millions of WARP and WARP+ users who have put the product through its paces have ensured that it’s ready for the enterprise. That we have 4.5 stars across more than 200,000 ratings, just on iOS, is a testament to how reliable the underlying WARP and WARP+ engines have become. Compare that with the ratings of any corporate mobile VPN client, which are unsurprisingly abysmal.

We’ve partnered with some incredible organizations to create the ecosystem around Cloudflare for Teams. These include endpoint security solutions such as VMware Carbon Black, Malwarebytes, and Tanium; SIEM and analytics solutions such as Datadog, Sumo Logic, and Splunk; and identity platforms such as Okta, OneLogin, and Ping Identity. Feedback from these partners and more is at the end of this post.

If you’re curious about more of the technical details about Cloudflare for Teams, I encourage you to read Sam Rhea’s post.

Serving Everyone

Cloudflare has always believed in the power of serving everyone. That’s why we’ve offered a free version of Cloudflare for Infrastructure since we launched in 2010. That belief doesn’t change with our launch of Cloudflare for Teams. For both Cloudflare Access and Cloudflare Gateway, there will be free versions to protect individuals, home networks, and small businesses. We remember what it was like to be a startup and believe that everyone deserves to be safe online, regardless of their budget.

With both Cloudflare Access and Gateway, the products are segmented along a Good, Better, Best framework. That breaks out into Access Basic, Access Pro, and Access Enterprise. You can see the features available with each tier in the table below, including Access Enterprise features that will roll out over the coming months.

[Table: features available with each tier: Access Basic, Access Pro, and Access Enterprise]

We wanted a similar Good, Better, Best framework for Cloudflare Gateway. Gateway Basic can be provisioned in minutes through a simple change to your network’s recursive DNS settings. Once in place, network administrators can set rules on what domains should be allowed and filtered on the network. Cloudflare Gateway is informed both by the malware data gathered from our global sensor network as well as a rich corpus of domain categorization, allowing network operators to set whatever policy makes sense for them. Gateway Basic leverages the speed of 1.1.1.1 with granular network controls.

Gateway Pro, which we’re announcing today and which you can sign up to beta test as its features roll out over the coming months, extends the DNS-provisioned protection to a full proxy. Gateway Pro can be provisioned via the WARP client — which we are extending beyond iOS and Android mobile devices to also support Windows, macOS, and Linux — or network policies including MDM-provisioned proxy settings or GRE tunnels from office routers. This allows a network operator to filter on policies not merely by the domain but by the specific URL.

Building the Best-in-Class Network Gateway

While Gateway Basic (provisioned via DNS) and Gateway Pro (provisioned as a proxy) made sense, we wanted to imagine what the best-in-class network gateway would be for Enterprises that valued the highest level of performance and security. As we talked to these organizations we heard an ever-present concern: just surfing the Internet created risk of unauthorized code compromising devices. With every page that every user visited, third party code (JavaScript, etc.) was being downloaded and executed on their devices.

The solution, they suggested, was to isolate the local browser from third party code and have websites render in the network. This technology is known as browser isolation. And, in theory, it’s a great idea. Unfortunately, in practice with current technology, it doesn’t perform well. The most common way browser isolation technology works is to render the page on a server and then push a bitmap of the page down to the browser. This is known as pixel pushing. The challenge is that this approach can be slow and bandwidth intensive, and it breaks many sophisticated web applications.

We were hopeful that we could solve some of these problems by moving the rendering of the pages to Cloudflare’s network, which would be closer to end users. So we talked with many of the leading browser isolation companies about potentially partnering. Unfortunately, as we experimented with their technologies, even with our vast network, we couldn’t overcome the sluggish feel that plagues existing browser isolation solutions.

Enter S2 Systems

That’s when we were introduced to S2 Systems. I clearly remember first trying the S2 demo because my first reaction was: “This can’t be working correctly, it’s too fast.” The S2 team had taken a different approach to browser isolation. Rather than pushing down a bitmap of what the screen looked like, they pushed down the vectors to draw what’s on the screen. The result was an experience that was typically at least as fast as browsing locally and without broken pages.
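
A rough back-of-the-envelope comparison in Python shows why shipping draw commands is so much cheaper than shipping pixels. The frame size and command format below are illustrative assumptions, not measurements of S2’s actual protocol.

    # Back-of-the-envelope: pixels vs. draw commands for one frame.
    # Illustrative only; not a measurement of S2's protocol.
    import json

    WIDTH, HEIGHT = 1920, 1080

    # Pixel pushing: one uncompressed RGBA frame of the rendered page.
    bitmap_bytes = WIDTH * HEIGHT * 4  # roughly 8.3 MB before compression

    # Vector approach: a few drawing commands describing the same region.
    commands = [
        {"op": "fill_rect", "x": 0, "y": 0, "w": 1920, "h": 1080, "color": "#ffffff"},
        {"op": "draw_text", "x": 40, "y": 60, "size": 14, "text": "Hello, web"},
    ]
    vector_bytes = len(json.dumps(commands).encode())

    print(f"bitmap frame:  {bitmap_bytes:,} bytes")   # 8,294,400 bytes
    print(f"draw commands: {vector_bytes:,} bytes")   # a few hundred bytes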

The best, albeit imperfect, analogy I’ve come up with to describe the difference between S2’s technology and other browser isolation companies is the difference between Windows XP and Mac OS X when they were both launched in 2001. Windows XP’s original graphics were based on bitmapped images; Mac OS X’s were based on vectors. Remember the magic of watching an application “genie” in and out of the Mac OS X Dock? Check it out in a video from the launch…

At the time watching a window slide in and out of the dock seemed like magic compared with what you could do with bitmapped user interfaces. You can hear the awe in the reaction from the audience. That awe that we’ve all gotten used to in UIs today comes from the power of vector images. And, if you’ve been underwhelmed by the pixel-pushed bitmaps of existing browser isolation technologies, just wait until you see what is possible with S2’s technology.

We were so impressed with the team and the technology that we acquired the company. We will be integrating the S2 technology into Cloudflare Gateway Enterprise. The browser isolation technology will run across Cloudflare’s entire global network, bringing it within milliseconds of virtually every Internet user. You can learn more about this approach in Darren Remington’s blog post.

Once the rollout is complete in the second half of 2020 we expect we will be able to offer the first full browser isolation technology that doesn’t force you to sacrifice performance. In the meantime, if you’d like a demo of the S2 technology in action, let us know.

The Promise of a Faster Internet for Everyone

Cloudflare’s mission is to help build a better Internet. With Cloudflare for Teams, we’ve extended that network to protect the people and organizations that use the Internet to do their jobs. We’re excited to help a more modern, mobile, and cloud-enabled Internet be safer and faster than it ever was with traditional hardware appliances.

But the same technology we’re deploying now to improve enterprise security holds further promise. The most interesting Internet applications keep getting more complicated and, in turn, requiring more bandwidth and processing power to use.

For those of us fortunate enough to be able to afford the latest iPhone, we continue to reap the benefits of an increasingly powerful set of Internet-enabled tools. But try and use the Internet on a mobile phone from a few generations back, and you can see how quickly the latest Internet applications leave legacy devices behind. That’s a problem if we want to bring the next 4 billion Internet users online.

We need a paradigm shift if the sophistication of applications and the complexity of interfaces are to continue to keep pace with the latest generation of devices. To make the best of the Internet available to everyone, we may need to shift the work of the Internet off the end devices we all carry around in our pockets and let the network — where power, bandwidth, and CPU are relatively plentiful — carry more of the load.

That’s the long term promise of what S2’s technology combined with Cloudflare’s network may someday power. If we can make it so a less expensive device can run the latest Internet applications — using less battery, bandwidth, and CPU than ever before possible — then we can make the Internet more affordable and accessible for everyone.

We started with Cloudflare for Infrastructure. Today we’re announcing Cloudflare for Teams. But our ambition is nothing short of Cloudflare for Everyone.

Early Feedback on Cloudflare for Teams from Customers and Partners

“Cloudflare Access has enabled Ziff Media Group to seamlessly and securely deliver our suite of internal tools to employees around the world on any device, without the need for complicated network configurations,” said Josh Butts, SVP Product & Technology, Ziff Media Group.

“VPNs are frustrating and lead to countless wasted cycles for employees and the IT staff supporting them,” said Amod Malviya, Cofounder and CTO, Udaan. “Furthermore, conventional VPNs can lull people into a false sense of security. With Cloudflare Access, we have a far more reliable, intuitive, secure solution that operates on a per user, per access basis. I think of it as Authentication 2.0 — even 3.0.”

“Roman makes healthcare accessible and convenient,” said Ricky Lindenhovius, Engineering Director, Roman Health. “Part of that mission includes connecting patients to physicians, and Cloudflare helps Roman securely and conveniently connect doctors to internally managed tools. With Cloudflare, Roman can evaluate every request made to internal applications for permission and identity, while also improving speed and user experience.”

“We’re excited to partner with Cloudflare to provide our customers an innovative approach to enterprise security that combines the benefits of endpoint protection and network security,” said Tom Barsi, VP Business Development, VMware. “VMware Carbon Black is a leading endpoint protection platform (EPP) and offers visibility and control of laptops, servers, virtual machines, and cloud infrastructure at scale. In partnering with Cloudflare, customers will have the ability to use VMware Carbon Black’s device health as a signal in enforcing granular authentication to a team’s internally managed application via Access, Cloudflare’s Zero Trust solution. Our joint solution combines the benefits of endpoint protection and a zero trust authentication solution to keep teams working on the Internet more secure.”

“Rackspace is a leading global technology services company accelerating the value of the cloud during every phase of our customers’ digital transformation,” said Lisa McLin, vice president of alliances and channel chief at Rackspace. “Our partnership with Cloudflare enables us to deliver cutting edge networking performance to our customers and helps them leverage a software defined networking architecture in their journey to the cloud.”

“Employees are increasingly working outside of the traditional corporate headquarters. Distributed and remote users need to connect to the Internet, but today’s security solutions often require they backhaul those connections through headquarters to have the same level of security,” said Michael Kenney, head of strategy and business development for Ingram Micro Cloud. “We’re excited to work with Cloudflare whose global network helps teams of any size reach internally managed applications and securely use the Internet, protecting the data, devices, and team members that power a business.”

“At Okta, we’re on a mission to enable any organization to securely use any technology. As a leading provider of identity for the enterprise, Okta helps organizations remove the friction of managing their corporate identity for every connection and request that their users make to applications. We’re excited about our partnership with Cloudflare and bringing seamless authentication and connection to teams of any size,” said Chuck Fontana, VP, Corporate & Business Development, Okta.

“Organizations need one unified place to see, secure, and manage their endpoints,” said Matt Hastings, Senior Director of Product Management at Tanium. “We are excited to partner with Cloudflare to help teams secure their data, off-network devices, and applications. Tanium’s platform provides customers with a risk-based approach to operations and security with instant visibility and control into their endpoints. Cloudflare helps extend that protection by incorporating device data to enforce security for every connection made to protected resources.”

“OneLogin is happy to partner with Cloudflare to advance security teams’ identity control in any environment, whether on-premise or in the cloud, without compromising user performance,” said Gary Gwin, Senior Director of Product at OneLogin. “OneLogin’s identity and access management platform securely connects people and technology for every user, every app, and every device. The OneLogin and Cloudflare for Teams integration provides a comprehensive identity and network control solution for teams of all sizes.”

“Ping Identity helps enterprises improve security and user experience across their digital businesses,” said Loren Russon, Vice President of Product Management, Ping Identity. “Cloudflare for Teams integrates with Ping Identity to provide a comprehensive identity and network control solution to teams of any size, and ensures that only the right people get the right access to applications, seamlessly and securely.”

“Our customers increasingly leverage deep observability data to address both operational and security use cases, which is why we launched Datadog Security Monitoring,” said Marc Tremsal, Director of Product Management at Datadog. “Our integration with Cloudflare already provides our customers with visibility into their web and DNS traffic; we’re excited to work together as Cloudflare for Teams expands this visibility to corporate environments.”

“As more companies support employees who work on corporate applications from outside of the office, it is vital that they understand each request users are making. They need real-time insights and intelligence to react to incidents and audit secure connections,” said John Coyle, VP of Business Development, Sumo Logic. “With our partnership with Cloudflare, customers can now log every request made to internal applications and automatically push them directly to Sumo Logic for retention and analysis.”

“CloudGenix is excited to partner with Cloudflare to provide an end-to-end security solution from the branch to the cloud. As enterprises move off of expensive legacy MPLS networks and adopt branch-to-Internet breakout policies, the CloudGenix CloudBlade platform and Cloudflare for Teams together can make this transition seamless and secure. We’re looking forward to Cloudflare’s roadmap with this announcement and partnership opportunities in the near term,” said Aaron Edwards, Field CTO, CloudGenix.

“In the face of limited cybersecurity resources, organizations are looking for highly automated solutions that work together to reduce the likelihood and impact of today’s cyber risks,” said Akshay Bhargava, Chief Product Officer, Malwarebytes. “With Malwarebytes and Cloudflare together, organizations are deploying more than twenty layers of security defense-in-depth. Using just two solutions, teams can secure their entire enterprise from device, to the network, to their internal and external applications.”

“Organizations’ sensitive data is vulnerable in-transit over the Internet and when it’s stored at its destination in public cloud, SaaS applications and endpoints,” said Pravin Kothari, CEO of CipherCloud. “CipherCloud is excited to partner with Cloudflare to secure data in all stages, wherever it goes. Cloudflare’s global network secures data in-transit without slowing down performance. CipherCloud CASB+ provides a powerful cloud security platform with end-to-end data protection and adaptive controls for cloud environments, SaaS applications and BYOD endpoints. Working together, teams can rely on the integrated Cloudflare and CipherCloud solution to keep data protected at all times without compromising user experience.”

Security on the Internet with Cloudflare for Teams

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/cloudflare-for-teams-products/

Your experience using the Internet has continued to improve over time. It’s gotten faster, safer, and more reliable. However, you probably have to use a different, worse equivalent of it when you do your work. While the Internet kept getting better, businesses and their employees were stuck using their own private networks.

In those networks, teams hosted their own applications, stored their own data, and protected all of it by building a castle and moat around that private world. This model hid internally managed resources behind VPN appliances and on-premise firewall hardware. The experience was awful, for users and administrators alike. While the rest of the Internet became more performant and more reliable, business users were stuck in an alternate universe.

That legacy approach was less secure and slower than teams wanted, but the corporate perimeter mostly worked for a time. However, that began to fall apart with the rise of cloud-delivered applications. Businesses migrated to SaaS versions of software that previously lived in that castle and behind that moat. Users needed to connect to the public Internet to do their jobs, and attackers made the Internet unsafe in sophisticated, unpredictable ways – which opened up every business to a new world of never-ending risks.

How did enterprise security respond? By trying to solve a new problem with a legacy solution, and forcing the Internet into equipment that was only designed for private, corporate networks. Instead of benefitting from the speed and availability of SaaS applications, users had to backhaul Internet-bound traffic through the same legacy boxes that made their private network miserable.

Teams then watched as their bandwidth bills increased. More traffic to the Internet from branch offices forced more traffic over expensive, dedicated links. Administrators now had to manage a private network and the connections to the entire Internet for their users, all with the same hardware. More traffic required more hardware and the cycle became unsustainable.

Cloudflare’s first wave of products secured and improved the speed of those sites by letting customers, from free users to some of the largest properties on the Internet, replace that hardware stack with Cloudflare’s network. We could deliver capacity at a scale that would be impossible for nearly any company to build themselves. We deployed data centers in over 200 cities around the world that help us reach users wherever they are.

We built a unique network to let sites scale how they secured infrastructure on the Internet with their own growth. But internally, businesses and their employees were stuck using their own private networks.

Just as we helped organizations secure their infrastructure by replacing boxes, we can do the same for their teams and their data. Today, we’re announcing a new platform that applies our network, and everything we’ve learned, to make the Internet faster and safer for teams.

Cloudflare for Teams protects enterprises, devices, and data by securing every connection without compromising user performance. The speed, reliability, and protection we brought to securing infrastructure are extended to everything your team does on the Internet.

The legacy world of corporate security

Organizations all share three problems they need to solve at the network level:

  1. Secure team member access to internally managed applications
  2. Secure team members from threats on the Internet
  3. Secure the corporate data that lives in both environments

Each of these challenges poses a real risk to any team. If any component is compromised, the entire business becomes vulnerable.

Internally managed applications

Solving the first bucket, internally managed applications, started by building a perimeter around those internal resources. Administrators deployed applications on a private network and users outside of the office connected to them with client VPN agents through VPN appliances that lived back on-site.

Users hated it, and they still do, because it made it harder to get their jobs done. A sales team member traveling to a customer visit in the back of a taxi had to start a VPN client on their phone just to review details about the meeting. An engineer working remotely had to sit and wait as every connection they made to developer tools was backhauled through a central VPN appliance.

Administrators and security teams also had issues with this model. Once a user connects to the private network, they’re typically able to reach multiple resources without having to prove they’re authorized to do so. Just because I can enter the front door of an apartment building doesn’t mean I should be able to walk into any individual apartment. However, on private networks, enforcing additional security within the bounds of the private network required complicated microsegmentation, if it was done at all.

Threats on the Internet

The second challenge, securing users connecting to SaaS tools on the public Internet and applications in the public cloud, required security teams to protect against known threats and potential zero-day attacks as their users left the castle and moat.

How did most companies respond? By forcing all traffic leaving branch offices or remote users back through headquarters and using the same hardware that secured their private network to try and build a perimeter around the Internet, at least the Internet their users accessed. All of the Internet-bound traffic leaving a branch office in Asia, for example, would be sent back through a central location in Europe, even if the destination was just down the street.

Organizations needed those connections to be stable, and to prioritize certain functions like voice and video, so they paid carriers to support dedicated multi-protocol label switching (MPLS) links. MPLS delivered improved performance by applying label switching to traffic which downstream routers can forward without needing to perform an IP lookup, but was eye-wateringly expensive.

Securing data

The third challenge, keeping data safe, became a moving target. Organizations had to keep data secure in a consistent way as it lived and moved between private tools on corporate networks and SaaS applications like Salesforce or Office 365.

The answer? More of the same. Teams backhauled traffic over MPLS links to a place where data could be inspected, adding more latency and introducing more hardware that had to be maintained.

What changed?

The balance of internal versus external traffic began to shift as SaaS applications became the new default for small businesses and Fortune 500s alike. Users now do most of their work on the Internet, with tools like Office 365 continuing to gain adoption. As those tools become more popular, more data leaves the moat and lives on the public Internet.

User behavior also changed. Users left the office and worked from multiple devices, both managed and unmanaged. Teams became more distributed and the perimeter was stretched to its limit.

This caused legacy approaches to fail

Legacy approaches to corporate security pushed the castle and moat model further out. However, that model simply cannot scale with how users do work on the Internet today.

Internally managed applications

Private networks give users headaches, but they’re also a constant and complex chore to maintain. VPNs require expensive equipment that must be upgraded or expanded and, as more users leave the office, that equipment must try and scale up.

The result is a backlog of IT help desk tickets as users struggle with their VPN and, on the other side of the house, administrators and security teams try to put band-aids on the approach.

Threats on the Internet

Organizations initially saved money by moving to SaaS tools, but wound up spending more money over time as their traffic increased and bandwidth bills climbed.

Additionally, threats evolve. The traffic sent back to headquarters was secured with static models of scanning and filtering using hardware gateways. Users were still vulnerable to new types of threats that these on-premise boxes did not block yet.

Securing data

The cost of keeping data secure in both environments also grew. Security teams attempted to inspect Internet-bound traffic for threats and data loss by backhauling branch office traffic through on-premise hardware, degrading speed and increasing bandwidth fees.

Even more dangerous, data now lived permanently outside of that castle and moat model. Organizations were now vulnerable to attacks that bypassed their perimeter and targeted SaaS applications directly.

How will Cloudflare solve these problems?

Cloudflare for Teams consists of two products, Cloudflare Access and Cloudflare Gateway.

We launched Access last year and are excited to bring it into Cloudflare for Teams. We built Cloudflare Access to solve the first challenge that corporate security teams face: protecting internally managed applications.

Cloudflare Access replaces corporate VPNs with Cloudflare’s network. Instead of placing internal tools on a private network, teams deploy them in any environment, including hybrid or multi-cloud models, and secure them consistently with Cloudflare’s network.

Deploying Access does not require exposing new holes in corporate firewalls. Teams connect their resources through a secure outbound connection, Argo Tunnel, which runs in your infrastructure to connect the applications and machines to Cloudflare. That tunnel makes outbound-only calls to the Cloudflare network and organizations can replace complex firewall rules with just one: disable all inbound connections.
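
To illustrate the outbound-only pattern, here is a toy Python sketch in which an origin dials out to a relay and serves requests over that single connection, so the firewall never needs an open inbound port. The relay endpoint and framing are invented; this is a sketch of the general idea, not how Argo Tunnel is implemented.

    # Toy outbound-only tunnel: the origin dials out and serves requests
    # over that connection, so the firewall can deny all inbound traffic.
    # The relay endpoint and framing are invented for illustration.
    import socket

    RELAY = ("relay.example.com", 443)  # hypothetical relay endpoint

    def serve_via_outbound_tunnel(handle):
        with socket.create_connection(RELAY) as conn:  # outbound-only connection
            while True:
                request = conn.recv(4096)              # relay forwards client requests
                if not request:
                    break
                conn.sendall(handle(request))          # respond over the same socket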

Administrators then build rules to decide who should authenticate to and reach the tools protected by Access. Whether those resources are virtual machines powering business operations or internal web applications, like Jira or iManage, when a user needs to connect, they pass through Cloudflare first.

When users need to connect to the tools behind Access, they are prompted to authenticate with their team’s SSO and, if valid, are instantly connected to the application without being slowed down. Internally managed apps suddenly feel like SaaS products, and the login experience is seamless and familiar.

Behind the scenes, every request made to those internal tools hits Cloudflare first where we enforce identity-based policies. Access evaluates and logs every request to those apps for identity, to give administrators more visibility and to offer more security than a traditional VPN.
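
From the application’s side, “evaluate and log every request” can look like the sketch below: a Python/Flask middleware that verifies the signed token Access attaches to incoming requests. The header name and certs URL follow Cloudflare’s documented pattern, but treat the team domain and audience tag as placeholders to confirm against current documentation.

    # Sketch: verify the identity token Access adds to each request.
    # TEAM_DOMAIN and AUDIENCE are placeholders; assumes PyJWT >= 2.0.
    import jwt  # PyJWT
    from flask import Flask, request, abort

    app = Flask(__name__)
    TEAM_DOMAIN = "https://yourteam.cloudflareaccess.com"  # placeholder
    AUDIENCE = "your-access-application-audience-tag"      # placeholder
    jwks = jwt.PyJWKClient(f"{TEAM_DOMAIN}/cdn-cgi/access/certs")

    @app.before_request
    def require_access_token():
        token = request.headers.get("Cf-Access-Jwt-Assertion")
        if not token:
            abort(403)
        try:
            key = jwks.get_signing_key_from_jwt(token).key
            jwt.decode(token, key, algorithms=["RS256"], audience=AUDIENCE)
        except jwt.PyJWTError:
            abort(403)

    @app.route("/")
    def home():
        return "only reachable with a valid Access token"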

Every Cloudflare data center, in 200 cities around the world, performs the entire authentication check. Users connect faster, wherever they are working, versus having to backhaul traffic to a home office.

Access also saves time for administrators. Instead of configuring complex and error-prone network policies, IT teams build policies that enforce authentication using their identity provider. Security leaders can control who can reach internal applications in a single pane of glass and audit comprehensive logs from one source.

In the last year, we’ve released features that expand how teams can use Access so they can fully eliminate their VPN. We’ve added support for RDP, SSH, and released support for short-lived certificates that replace static keys. However, teams also use applications that do not run in infrastructure they control, such as SaaS applications like Box and Office 365. To solve that challenge, we’re releasing a new product, Cloudflare Gateway.

Cloudflare Gateway secures teams by making a nearby Cloudflare data center the first destination for all of their outbound traffic. The product places Cloudflare’s global network between users and the Internet, rather than forcing the Internet through legacy hardware on-site.

Cloudflare Gateway’s first feature prevents users from running into phishing scams or malware sites by combining the world’s fastest DNS resolver with Cloudflare’s threat intelligence. The Gateway resolver can be deployed to office networks and user devices in a matter of minutes. Once configured, Gateway actively blocks potential malware and phishing sites while also applying content filtering based on policies administrators configure.
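
A toy model of the per-query decision such a resolver makes is sketched below; the domain categories and policy are invented for illustration.

    # Toy model of a filtering resolver's per-query decision. The domain
    # categories and policy below are invented for illustration only.
    THREATS = {"malware", "phishing"}      # always blocked
    POLICY_BLOCKED = {"gambling"}          # blocked by administrator policy

    CATEGORIES = {
        "evil-download.example":     {"malware"},
        "totally-your-bank.example": {"phishing"},
        "casino.example":            {"gambling"},
        "news.example":              {"news"},
    }

    def filter_query(domain):
        cats = CATEGORIES.get(domain, set())
        if cats & THREATS:
            return "blocked: threat"       # e.g. answer with a block-page address
        if cats & POLICY_BLOCKED:
            return "blocked: policy"
        return "resolved"                  # normal resolution proceeds

    for d in CATEGORIES:
        print(d, "->", filter_query(d))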

However, threats can be hidden in otherwise healthy hostnames. To protect users from more advanced threats, Gateway will audit URLs and, if enabled, inspect packets to find potential attacks before they compromise a device or office network. That same deep packet inspection can then be applied to prevent the accidental or malicious export of data.

Organizations can add Gateway’s advanced threat prevention in two models:

  1. by connecting office networks to the Cloudflare security fabric through GRE tunnels and
  2. by distributing forward proxy clients to mobile devices.

The first model, delivered through Cloudflare Magic Transit, will give enterprises a way to migrate to Gateway without disrupting their current workflow. Instead of backhauling office traffic to centralized on-premise hardware, teams will point traffic to Cloudflare over GRE tunnels. Once the outbound traffic arrives at Cloudflare, Gateway can apply file type controls, in-line inspection, and data loss protection without impacting connection performance. Simultaneously, Magic Transit protects a corporate IP network from inbound attacks.

When users leave the office, Gateway’s client application will deliver the same level of Internet security. Every connection from the device will pass through Cloudflare first, where Gateway can apply threat prevention policies. Cloudflare can also deliver that security without compromising user experience, building on new technologies like the WireGuard protocol and integrating features from Cloudflare WARP, our popular individual forward proxy.

In both environments, one of the most common vectors for attacks is still the browser. Zero-day threats can compromise devices by using the browser as a vehicle to execute code.

Existing browser isolation solutions attempt to solve this challenge with one of two approaches: 1) pixel pushing and 2) DOM reconstruction. Both approaches lead to tradeoffs in performance and security. Pixel pushing degrades speed while also driving up the cost to stream sessions to users. DOM reconstruction attempts to strip potentially harmful content before sending it to the user. That tactic relies on known vulnerabilities and is still exposed to the zero-day threats that isolation tools were meant to solve.

Cloudflare Gateway will feature always-on browser isolation that not only protects users from zero day threats, but can also make browsing the Internet faster. The solution will apply a patented approach to send vector commands that a browser can render without the need for an agent on the device. A user’s browser session will instead run in a Cloudflare data center where Gateway destroys the instance at the end of each session, keeping malware away from user devices without compromising performance.

When deployed, remote browser sessions will run in one of Cloudflare’s 200 data centers, connecting users to a faster, safer model of navigating the Internet without the compromises of legacy approaches. If you would like to learn more about this approach to browser isolation, I’d encourage you to read Darren Remington’s blog post on the topic.

Why Cloudflare?

To make infrastructure safer, and web properties faster, Cloudflare built out one of the world’s largest and most sophisticated networks. Cloudflare for Teams builds on that same platform, and all of its unique advantages.

Fast

Security should always be bundled with performance. Cloudflare’s infrastructure products delivered better protection while also improving speed. That’s possible because of the network we’ve built: both its distribution and the data we gather about it allow Cloudflare to optimize requests and connections.

Cloudflare for Teams brings that same speed to end users by using that same network and route optimization. Additionally, Cloudflare has built industry-leading components that will become features of this new platform. All of these components leverage Cloudflare’s network and scale to improve user performance.

Gateway’s DNS-filtering features build on Cloudflare’s 1.1.1.1 public DNS resolver, the world’s fastest resolver according to DNSPerf. To protect entire connections, Cloudflare for Teams will deploy the same technology that underpins Warp, a new type of VPN with consistently better reviews than competitors.

Massive scalability

Cloudflare’s 30 TBps of network capacity can scale to meet the needs of nearly any enterprise. Customers can stop worrying about buying enough hardware to meet their organization’s needs and, instead, replace it with Cloudflare.

Near users, wherever they are — literally

Cloudflare’s network operates in 200 cities and more than 90 countries around the world, putting Cloudflare’s security and performance close to users, wherever they work.

That network includes presence in global headquarters, like London and New York, but also in traditionally underserved regions around the world.

Cloudflare data centers operate within 100 milliseconds of 99% of the Internet-connected population in the developed world, and within 100 milliseconds of 94% of the Internet-connected population globally. All of your end users should feel like they have the performance traditionally only available to those in headquarters.

Easier for administrators

When security products are confusing, teams make mistakes that become incidents. Cloudflare’s solution is straightforward and easy to deploy. Most security providers in this market built features first and never considered usability or implementation.

Cloudflare Access can be deployed in less than an hour; Gateway features will build on top of that dashboard and workflow. Cloudflare for Teams brings the same ease of use from our tools that protect infrastructure to the products that now secure users, devices, and data.

Better threat intelligence

Cloudflare’s network already secures more than 20 million Internet properties and blocks 72 billion cyber threats each day. We build products using the threat data we gather from protecting 11 million HTTP requests per second on average.

What’s next?

Cloudflare Access is available right now. You can start replacing your team’s VPN with Cloudflare’s network today. Certain features of Cloudflare Gateway are available in beta now, and others will be added in beta over time. You can sign up to be notified about Gateway now.

Cloudflare + Remote Browser Isolation

Post Syndicated from Darren Remington original https://blog.cloudflare.com/cloudflare-and-remote-browser-isolation/

Cloudflare announced today that it has purchased S2 Systems Corporation, a Seattle-area startup that has built an innovative remote browser isolation solution unlike any other currently in the market. The majority of endpoint compromises involve web browsers — by putting space between users’ devices and where web code executes, browser isolation makes endpoints substantially more secure. In this blog post, I’ll discuss what browser isolation is, why it is important, how the S2 Systems cloud browser works, and how it fits with Cloudflare’s mission to help build a better Internet.

What’s wrong with web browsing?

It’s been more than 30 years since Tim Berners-Lee wrote the project proposal defining the technology underlying what we now call the world wide web. What Berners-Lee envisioned as being useful for “several thousand people, many of them very creative, all working toward common goals”[1] has grown to become a fundamental part of commerce, business, the global economy, and an integral part of society used by more than 58% of the world’s population[2].

The world wide web and web browsers have unequivocally become the platform for much of the productive work (and play) people do every day. However, as the pervasiveness of the web grew, so did opportunities for bad actors. Hardly a day passes without a major new cybersecurity breach in the news. Several contributing factors have helped propel cybercrime to unprecedented levels: the commercialization of hacking tools, the emergence of malware-as-a-service, the presence of well-financed nation states and organized crime, and the development of cryptocurrencies which enable malicious actors of all stripes to anonymously monetize their activities.

The vast majority of security breaches originate from the web. Gartner calls the public Internet a “cesspool of attacks” and identifies web browsers as the primary culprit responsible for 70% of endpoint compromises.[3] This should not be surprising. Although modern web browsers are remarkable, many fundamental architectural decisions were made in the 1990s, before concepts like security, privacy, corporate oversight, and compliance were issues or even considerations. Core web browsing functionality (including the entire underlying WWW architecture) was designed and built for a different era and circumstances.

In today’s world, several web browsing assumptions are outdated or even dangerous. Web browsers and the underlying server technologies encompass an extensive – and growing – list of complex interrelated technologies. These technologies are constantly in flux, driven by vibrant open source communities, content publishers, search engines, advertisers, and competition between browser companies. As a result of this underlying complexity, web browsers have become primary attack vectors. According to Gartner, “the very act of users browsing the internet and clicking on URL links opens the enterprise to significant risk. […] Attacking thru the browser is too easy, and the targets too rich.”[4] Even “ostensibly ‘good’ websites are easily compromised and can be used to attack visitors” (Gartner[5]), with more than 40% of malicious URLs found on good domains (Webroot[6]). (A complete list of vulnerabilities is beyond the scope of this post.)

The very structure and underlying technologies that power the web are inherently difficult to secure. Some browser vulnerabilities result from illegitimate use of legitimate functionality: enabling browsers to download files and documents is good, but allowing downloading of files infected with malware is bad; dynamic loading of content across multiple sites within a single webpage is good, but cross-site scripting is bad; enabling an extensive advertising ecosystem is good, but the inability to detect hijacked links or malicious redirects to malware or phishing sites is bad; etc.

Enterprise Browsing Issues

Enterprises have additional challenges with traditional browsers.

Paradoxically, IT departments have the least amount of control over the most ubiquitous app in the enterprise – the web browser. The most common complaints about web browsers from enterprise security and IT professionals are:

  1. Security (obviously). The public internet is a constant source of security breaches and the problem is growing given an 11x escalation in attacks since 2016 (Meeker[7]). Costs of detection and remediation are escalating and the reputational damage and financial losses for breaches can be substantial.
  2. Control. IT departments have little visibility into user activity and limited ability to leverage content disarm and reconstruction (CDR) and data loss prevention (DLP) mechanisms, including visibility into when, where, or by whom files are downloaded or uploaded.
  3. Compliance. IT departments cannot control data and activity across geographies or capture the audit telemetry required to meet increasingly strict regulatory requirements, resulting in significant exposure to penalties and fines.

Given vulnerabilities exposed through everyday user activities such as email and web browsing, some organizations attempt to restrict these activities. As both are legitimate and critical business functions, efforts to limit or curtail web browser use inevitably fail or have a substantive negative impact on business productivity and employee morale.

Current approaches to mitigating security issues inherent in browsing the web are largely based on signature technology for data files and executables, and lists of known good/bad URLs and DNS addresses. The challenge with these approaches is the difficulty of keeping current with known attacks (file signatures, URLs and DNS addresses) and their inherent vulnerability to zero-day attacks. Hackers have devised automated tools to defeat signature-based approaches (e.g. generating hordes of files with unknown signatures) and create millions of transient websites in order to defeat URL/DNS blacklists.
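
As a toy illustration of why signature matching struggles, consider hash-based detection: any single-byte change to a file produces a new digest that no longer matches the known-bad list. The sketch below is illustrative only and not drawn from any particular product.

// Toy sketch: hash-based signatures only catch files that have been seen before.
const crypto = require('crypto');

const knownBadSignatures = new Set([
  // SHA-256 of the empty file, used here as a stand-in for a real malware hash
  'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855',
]);

function isKnownBad(fileBytes) {
  const digest = crypto.createHash('sha256').update(fileBytes).digest('hex');
  return knownBadSignatures.has(digest);
}

console.log(isKnownBad(Buffer.from('')));        // true: matches the list
console.log(isKnownBad(Buffer.from('variant'))); // false: new bytes, new hash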

While these approaches certainly prevent some attacks, the growing number of incidents and severity of security breaches clearly indicate more effective alternatives are needed.

What is browser isolation?

The core concept behind browser isolation is security-through-physical-isolation to create a “gap” between a user’s web browser and the endpoint device thereby protecting the device (and the enterprise network) from exploits and attacks. Unlike secure web gateways, antivirus software, or firewalls which rely on known threat patterns or signatures, this is a zero-trust approach.

There are two primary browser isolation architectures: (1) client-based local isolation and (2) remote isolation.

Local browser isolation attempts to isolate a browser running on a local endpoint using app-level or OS-level sandboxing. In addition to leaving the endpoint at risk when there is an isolation failure, these systems require significant endpoint resources (memory + compute), tend to be brittle, and are difficult for IT to manage as they depend on support from specific hardware and software components.

Further, local browser isolation does nothing to address the control and compliance issues mentioned above.

Remote browser isolation (RBI) protects the endpoint by moving the browser to a remote service in the cloud or to a separate on-premises server within the enterprise network:

  • On-premises isolation simply relocates the risk from the endpoint to another location within the enterprise without actually eliminating the risk.
  • Cloud-based remote browsing isolates the end-user device and the enterprise’s network while fully enabling IT control and compliance solutions.

Given the inherent advantages, most browser isolation solutions – including S2 Systems – leverage cloud-based remote isolation. Properly implemented, remote browser isolation can protect the organization from browser exploits, plug-ins, zero-day vulnerabilities, malware and other attacks embedded in web content.

How does Remote Browser Isolation (RBI) work?

In a typical cloud-based RBI system, individual remote browsers are run in the cloud as disposable containerized instances – typically, one instance per user. The remote browser sends the rendered contents of a web page to the user endpoint device using a specific protocol and data format. Actions by the user, such as keystrokes, mouse and scroll commands, are sent back to the isolation service over a secure encrypted channel, where they are processed by the remote browser, and any resulting changes to the remote browser webpage are sent back to the endpoint device.
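
To make the control channel concrete, a clientless endpoint might forward input events to its remote browser instance roughly as follows. This is a hypothetical sketch: the endpoint URL and message schema are invented for illustration and do not describe any specific RBI product’s protocol.

// Hypothetical sketch: forwarding user input from the endpoint's local browser
// to a remote browser instance over an encrypted WebSocket channel.
const channel = new WebSocket('wss://rbi.example.com/session/1234');

function forward(type, detail) {
  channel.send(JSON.stringify({ type, detail, ts: Date.now() }));
}

document.addEventListener('keydown', (e) => forward('key', { key: e.key }));
document.addEventListener('mousemove', (e) => forward('mouse', { x: e.clientX, y: e.clientY }));
document.addEventListener('wheel', (e) => forward('scroll', { dy: e.deltaY }));

// The remote browser applies these inputs and streams page updates back.
channel.addEventListener('message', (msg) => applyRemoteUpdate(JSON.parse(msg.data)));

function applyRemoteUpdate(update) {
  // Redraw the local view from the remote browser's output (format-specific).
}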


In effect, the endpoint device is “remote controlling” the cloud browser. Some RBI systems use proprietary clients installed on the local endpoint while others leverage existing HTML5-compatible browsers on the endpoint and are considered ‘clientless.’

Data breaches that occur in the remote browser are isolated from the local endpoint and enterprise network. Every remote browser instance is treated as if compromised and is terminated after each session. New browser sessions start with a fresh instance. Obviously, the RBI service must prevent browser breaches from leaking outside the browser containers to the service itself. Most RBI systems provide remote file viewers, negating the need to download files, but also have the ability to inspect files for malware before allowing them to be downloaded.

A critical component in the above architecture is the specific remoting technology employed by the cloud RBI service. The remoting technology has a significant impact on the operating cost and scalability of the RBI service, website fidelity and compatibility, bandwidth requirements, endpoint hardware/software requirements and even the user experience. Remoting technology also determines the effective level of security provided by the RBI system.

All current cloud RBI systems employ one of two remoting technologies:

(1)    Pixel pushing is a video-based approach which captures pixel images of the remote browser ‘window’ and transmits a sequence of images to the client endpoint browser or proprietary client. This is similar to how remote desktop and VNC systems work. Although considered to be relatively secure, there are several inherent challenges with this approach:

  • Continuously encoding and transmitting video streams of remote webpages to user endpoint devices is very costly. Scaling this approach to millions of users is financially prohibitive and logistically complex.
  • Requires significant bandwidth. Even when highly optimized, pushing pixels is bandwidth intensive.
  • Unavoidable latency results in an unsatisfactory user experience. These systems tend to be slow and generate a lot of user complaints.
  • Mobile support is degraded by high bandwidth requirements compounded by inconsistent connectivity.
  • HiDPI displays may render at lower resolutions. The number of pixels grows quadratically with resolution, which means remote browser sessions (particularly fonts) on HiDPI devices can appear fuzzy or out of focus.

(2) DOM reconstruction emerged as a response to the shortcomings of pixel pushing. DOM reconstruction attempts to clean webpage HTML, CSS, etc. before forwarding the content to the local endpoint browser. The underlying HTML, CSS, etc., are reconstructed in an attempt to eliminate active code, known exploits, and other potentially malicious content. While addressing the latency, operational cost, and user experience issues of pixel pushing, it introduces two significant new issues:

  • Security. The underlying technologies – HTML, CSS, web fonts, etc. – are the attack vectors hackers leverage to breach endpoints. Attempting to remove malicious content or code is like washing mosquitos: you can attempt to clean them, but they remain inherent carriers of dangerous and malicious material. It is impossible to identify, in advance, all the means of exploiting these technologies even through an RBI system.
  • Website fidelity. Inevitably, removing malicious active code and reconstructing HTML, CSS, and other aspects of modern websites results in broken pages that don’t render properly or don’t render at all. Websites that work today may not work tomorrow, as site publishers make daily changes that may break DOM reconstruction functionality. The result is an infinite tail of issues requiring significant resources in an endless game of whack-a-mole. Some RBI solutions struggle to support common enterprise-wide services like Google G Suite or Microsoft Office 365, even as malware-laden web email continues to be a significant source of breaches.


Customers are left to choose between a secure solution with a bad user experience and high operating costs, or a faster, much less secure solution that breaks websites. These tradeoffs have driven some RBI providers to implement both remoting technologies into their products. However, this leaves customers to pick their poison without addressing the fundamental issues.

Given the significant tradeoffs in RBI systems today, one common optimization for current customers is to deploy remote browsing capabilities to only the most vulnerable users in an organization such as high-risk executives, finance, business development, or HR employees. Like vaccinating half the pupils in a classroom, this results in a false sense of security that does little to protect the larger organization.

Unfortunately, the largest “gap” created by current remote browser isolation systems is the void between the potential of the underlying isolation concept and the implementation reality of currently available RBI systems.

S2 Systems Remote Browser Isolation

S2 Systems remote browser isolation is a fundamentally different approach based on S2-patented technology called Network Vector Rendering (NVR).

The S2 remote browser is based on the open-source Chromium engine on which Google Chrome is built. In addition to powering Google Chrome which has a ~70% market share[8], Chromium powers twenty-one other web browsers including the new Microsoft Edge browser.[9] As a result, significant ongoing investment in the Chromium engine ensures the highest levels of website support, compatibility and a continuous stream of improvements.

A key architectural feature of the Chromium browser is its use of the Skia graphics library. Skia is a widely-used cross-platform graphics engine for Android, Google Chrome, Chrome OS, Mozilla Firefox, Firefox OS, FitbitOS, Flutter, the Electron application framework and many other products. Like Chromium, the pervasiveness of Skia ensures ongoing broad hardware and platform support.

Skia code fragment

Everything visible in a Chromium browser window is rendered through the Skia rendering layer. This includes application window UI such as menus, but more importantly, the entire contents of the webpage window are rendered through Skia. Chromium compositing, layout and rendering are extremely complex with multiple parallel paths optimized for different content types, device contexts, etc. The following figure is an egregious simplification for illustration purposes of how S2 works (apologies to Chromium experts):

Simplified view of S2 and the Chromium rendering pipeline

S2 Systems NVR technology intercepts the remote Chromium browser’s Skia draw commands, tokenizes and compresses them, then encrypts and transmits them across the wire to any HTML5-compliant web browser (Chrome, Firefox, Safari, etc.) running locally on the user endpoint desktop or mobile device. The Skia API commands captured by NVR are pre-rasterization, which means they are highly compact.

On first use, the S2 RBI service transparently pushes an NVR WebAssembly (Wasm) library to the local HTML5 web browser on the endpoint device, where it is cached for subsequent use. The NVR Wasm code contains an embedded Skia library and the necessary code to unpack, decrypt and “replay” the Skia draw commands from the remote RBI server to the local browser window. WebAssembly’s ability to “execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms”[10] results in near-native drawing performance.
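
The replay half of that pipeline can be pictured with a toy example: the local page receives compact vector draw commands and redraws them on an HTML5 canvas. The command schema below is invented for illustration and is far simpler than Skia’s actual API.

// Toy sketch of draw-command replay on an HTML5 canvas. The command schema
// here is hypothetical and much simpler than Skia's real draw commands.
const ctx = document.getElementById('view').getContext('2d');

const handlers = {
  clear: () => ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height),
  rect: (c) => { ctx.fillStyle = c.color; ctx.fillRect(c.x, c.y, c.w, c.h); },
  text: (c) => { ctx.fillStyle = c.color; ctx.font = c.font; ctx.fillText(c.s, c.x, c.y); },
};

function replay(commands) {
  for (const cmd of commands) handlers[cmd.op](cmd);
}

// An example frame, as it might arrive (decrypted and decompressed) off the wire:
replay([
  { op: 'clear' },
  { op: 'rect', x: 10, y: 10, w: 220, h: 40, color: '#f6821f' },
  { op: 'text', s: 'Hello from the remote browser', x: 16, y: 35, font: '14px sans-serif', color: '#fff' },
]);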

The S2 remote browser isolation service uses headless Chromium-based browsers in the cloud, transparently intercepts draw layer output, transmits the draw commands efficiently and securely over the web, and redraws them in the windows of local HTML5 browsers. This architecture has a number of technical advantages:

(1)    Security: the underlying data transport is not an existing attack vector and customers aren’t forced to make a tradeoff between security and performance.

(2)    Website compatibility: there are no website compatibility issues, and no long tail of chasing evolving web technologies or emerging vulnerabilities.

(3)    Performance: the system is very fast, typically faster than local browsing (subject of a future blog post).

(4)    Transparent user experience: S2 remote browsing feels like native browsing; users are generally unaware when they are browsing remotely.

(5)    Bandwidth: requires less bandwidth than local browsing for most websites, and enables advanced caching and other proprietary optimizations unique to web browsers and the nature of web content and technologies.

(6)    Clientless: leverages existing HTML5 compatible browsers already installed on user endpoint desktop and mobile devices.

(7)    Cost-effective scalability: although the details are beyond the scope of this post, the S2 backend and NVR technology have substantially lower operating costs than existing RBI technologies. Operating costs translate directly to customer costs. The S2 system was designed to make deployment to an entire enterprise and not just targeted users (aka: vaccinating half the class) both feasible and attractive for customers.

(8)    RBI-as-a-platform: enables implementation of related/adjacent services such as DLP, content disarm & reconstruction (CDR), phishing detection and prevention, etc.

S2 Systems Remote Browser Isolation Service and the underlying NVR technology eliminate the disconnect between the conceptual potential and promise of browser isolation and the unsatisfying reality of current RBI technologies.

Cloudflare + S2 Systems Remote Browser Isolation

Cloudflare’s global cloud platform is uniquely suited to remote browser isolation. Seamless integration with our cloud-native performance, reliability and advanced security products and services provides powerful capabilities for our customers.

Our Cloudflare Workers architecture enables edge computing in 200 cities in more than 90 countries and will put a remote browser within 100 milliseconds of 99% of the Internet-connected population in the developed world. With more than 20 million Internet properties directly connected to our network, Cloudflare remote browser isolation will benefit from locally cached data and builds on the impressive connectivity and performance of our network. Our Argo Smart Routing capability leverages our communications backbone to route traffic across faster and more reliable network paths resulting in an average 30% faster access to web assets.

Once it has been integrated with our Cloudflare for Teams suite of advanced security products, remote browser isolation will provide protection from browser exploits, zero-day vulnerabilities, malware and other attacks embedded in web content. Enterprises will be able to secure the browsers of all employees without having to make trade-offs between security and user experience. The service will enable IT control of browser-conveyed enterprise data and compliance oversight. Seamless integration across our products and services will enable users and enterprises to browse the web without fear or consequence.

Cloudflare’s mission is to help build a better Internet. This means protecting users and enterprises as they work and play on the Internet; it means making Internet access fast, reliable and transparent. Reimagining and modernizing how web browsing works is an important part of helping build a better Internet.


[1] https://www.w3.org/History/1989/proposal.html

[2] “Internet World Stats,” https://www.internetworldstats.com/, retrieved December 21, 2019.

[3] Gartner, Inc., Neil MacDonald, “Innovation Insight for Remote Browser Isolation” (report ID: G00350577), 8 March 2018

[4] Gartner, Inc., Neil MacDonald, “Innovation Insight for Remote Browser Isolation”, 8 March 2018

[5] Gartner, Inc., Neil MacDonald, “Innovation Insight for Remote Browser Isolation”, 8 March 2018

[6] “2019 Webroot Threat Report: Forty Percent of Malicious URLs Found on Good Domains”, February 28, 2019

[7] Mary Meeker, “2018 Internet Trends”, Kleiner Perkins.

[8] https://www.statista.com/statistics/544400/market-share-of-internet-browsers-desktop/, retrieved December 21, 2019

[9] https://en.wikipedia.org/wiki/Chromium_(web_browser), retrieved December 29, 2019

[10] https://webassembly.org/, retrieved December 30, 2019

Orchestrating a security incident response with AWS Step Functions

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/orchestrating-a-security-incident-response-with-aws-step-functions/

In this post I will show how to implement the callback pattern of an AWS Step Functions Standard Workflow. This is used to add a manual approval step into an automated security incident response framework. The framework could be extended to remediate automatically, according to the individual policy actions defined: for example, applying alternative actions, or restricting actions to specific ARNs.

The application uses Amazon EventBridge to trigger a Step Functions Standard Workflow on an IAM policy creation event. The workflow compares the policy action against a customizable list of restricted actions. It uses AWS Lambda and Step Functions to roll back the policy temporarily, then notify an administrator and wait for them to approve or deny.

Figure 1: High-level architecture diagram.

Important: the application uses various AWS services, and there are costs associated with these services after the Free Tier usage. Please see the AWS pricing page for details.

You can deploy this application from the AWS Serverless Application Repository. You then create a new IAM Policy to trigger the rule and run the application.

Deploy the application from the Serverless Application Repository

  1. Find the “Automated-IAM-policy-alerts-and-approvals” app in the Serverless Application Repository.
  2. Complete the required application settings
    • Application name: an identifiable name for the application.
    • EmailAddress: an administrator’s email address for receiving approval requests.
    • restrictedActions: the IAM Policy actions you want to restrict.

      Figure 2: Deployment fields

  3. Choose Deploy.

Once the deployment process is complete, 21 new resources are created, including:

  • Five Lambda functions that contain the business logic.
  • An Amazon EventBridge rule.
  • An Amazon SNS topic and subscription.
  • An Amazon API Gateway REST API with two resources.
  • An AWS Step Functions state machine.

To receive Amazon SNS notifications as the application administrator, you must confirm the subscription to the SNS topic. To do this, choose the Confirm subscription link in the verification email that was sent to you when deploying the application.

EventBridge receives new events in the default event bus. Here, the event is compared with associated rules. Each rule has an event pattern defined, which acts as a filter to match inbound events to their corresponding rules. In this application, a matching event rule triggers an AWS Step Functions execution, passing in the event payload from the policy creation event.
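
For illustration, the rule’s event pattern might match IAM CreatePolicy calls recorded by AWS CloudTrail, along the lines of the sketch below. This is a plausible pattern for such a rule, not necessarily the exact one the application deploys:

{
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["iam.amazonaws.com"],
        "eventName": ["CreatePolicy"]
    }
}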

Running the application

Trigger the application by creating a policy either via the AWS Management Console or with the AWS Command Line Interface.

Using the AWS CLI

First install and configure the AWS CLI, then run the following command:

aws iam create-policy --policy-name my-bad-policy1234 --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketObjectLockConfiguration",
                "s3:DeleteObjectVersion",
                "s3:DeleteBucket"
            ],
            "Resource": "*"
        }
    ]
}'

Using the AWS Management Console

  1. Go to Services > Identity and Access Management (IAM) dashboard.
  2. Choose Create policy.
  3. Choose the JSON tab.
  4. Paste the following JSON:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketObjectLockConfiguration",
                    "s3:DeleteObjectVersion",
                    "s3:DeleteBucket"
                ],
                "Resource": "*"
            }
        ]
    }
  5. Choose Review policy.
  6. In the Name field, enter my-bad-policy.
  7. Choose Create policy.

Either of these methods creates a policy with the permissions required to delete Amazon S3 buckets. Deleting an S3 bucket is one of the restricted actions set when the application is deployed:

Figure 3: Default restricted actions

This sends the event to EventBridge, which then triggers the Step Functions state machine. The Step Functions state machine holds each state object in the workflow. Some of the state objects use the Lambda functions created during deployment to process data.

Others use Amazon States Language (ASL) directly, enabling the application to conditionally branch, wait, and transition to the next state. Using a state machine decouples the business logic from the compute functionality.
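
For example, a Choice state needs no Lambda function at all: the branching logic is declared directly in ASL. A plausible sketch of this application’s ChooseAction state might look like the following (the $.action field is an assumption about the workflow’s data shape):

"ChooseAction": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.action",
      "StringEquals": "remedy",
      "Next": "TempRemove"
    }
  ],
  "Default": "AllowWithNotification"
}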

After triggering the application, go to the Step Functions dashboard and choose the newly created state machine. Choose the current running state machine from the executions table.

Figure 4: State machine executions.

You see a visual representation of the current execution, with the workflow paused at the AskUser state.

Figure 5: Workflow paused

These are the states in the workflow:

ModifyData
State Type: Pass
Re-structures the input data into an object that is passed throughout the workflow.

ValidatePolicy
State type: Task. Service: AWS Lambda
Invokes the ValidatePolicy Lambda function, which checks the new policy document against the restricted actions (a sketch of this check follows the state list).

ChooseAction
State type: Choice
Branches depending on input from the ValidatePolicy step.

TempRemove
State type: Task. Service: AWS Lambda
Creates a new default version of the policy with only permissions for Amazon CloudWatch Logs and deletes the previously created policy version.

AskUser
State type: Task. Service: AWS Lambda
Sends an approval email to the user via SNS, with the task token that initiates the callback pattern.

UsersChoice
State type: Choice
Branches based on the user’s action to approve or deny.

Denied
State type: Pass
Ends the execution with no further action.

Approved
State type: Task. Service: AWS Lambda
Restores the initial policy document by creating it as a new version.

AllowWithNotification
State type: Task. Service: AWS Lambda
With no restricted actions detected, the user is still notified of the change (via an email from SNS) before execution ends.
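
As a rough sketch of the ValidatePolicy check referenced above, a handler along these lines could flag restricted actions. The event shape and the RESTRICTED_ACTIONS environment variable are assumptions made for illustration; the function deployed by the application may differ.

// Hypothetical sketch of the ValidatePolicy check: compare the actions in a
// new IAM policy document against a comma-separated restricted-actions list.
exports.handler = async (event) => {
  const restricted = (process.env.RESTRICTED_ACTIONS || '').split(',');
  // CloudTrail delivers policy documents URL-encoded; the event shape is assumed.
  const policy = JSON.parse(decodeURIComponent(event.policyDocument));

  const actions = policy.Statement
    .filter((s) => s.Effect === 'Allow')
    .flatMap((s) => [].concat(s.Action));

  const matches = actions.filter((a) => restricted.includes(a));
  return { action: matches.length > 0 ? 'remedy' : 'alert', matches };
};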

The callback pattern

An important feature of this application is the ability for an administrator to approve or deny a new policy. The Step Functions callback pattern makes this possible.

The callback pattern allows a workflow to pause during a task and wait for an external process to return a task token. The task token is generated when the task starts. When the AskUser function is invoked, it is passed a task token. The task token is published to the SNS topic along with the API resources for approval and denial. These API resources are created when the application is first deployed.

When the administrator clicks the approve or deny link, the request passes the token to the receiveUser Lambda function. This Lambda function uses the incoming task token to resume the AskUser state.

The lifecycle of the task token as it transitions through each service is shown below:

Figure 6: Task token lifecycle

  1. To invoke this callback pattern, the AskUser state definition is declared using the .waitForTaskToken identifier, with the task token passed into the Lambda function as a payload parameter:
    "AskUser":{
     "Type": "Task",
     "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
     "Parameters":{  
     "FunctionName": "${AskUser}",
     "Payload":{  
     "token.$":"$$.Task.Token"
      }
     },
      "ResultPath":"$.taskresult",
      "Next": "usersChoice"
      },
  2. The askUser Lambda function can then access this token within the event object:
    exports.handler = async (event, context) => {
        // Build the approve and deny URLs, appending the task token as a query string
        let approveLink = `${process.env.APIAllowEndpoint}?token=${JSON.stringify(event.token)}`
        let denyLink = `${process.env.APIDenyEndpoint}?token=${JSON.stringify(event.token)}`
    //code continues
  3. The task token is published to an SNS topic along with the message text parameter:
        // sns is an AWS.SNS client instantiated outside the handler
        let params = {
            TopicArn: process.env.Topic,
            Message: `A restricted Policy change has been detected Approve:${approveLink} Or Deny:${denyLink}`
        }
        let res = await sns.publish(params).promise()
    //code continues
  4. The administrator receives an email with two links, one to approve and one to deny. The task token is appended to these links as a request query string parameter named token:

    Figure 7: Approve/deny email.

  5. Using the Amazon API Gateway proxy integration, the task token is passed directly to the receiveUser Lambda function from the API resource, and is accessible within the function code as part of the event’s queryStringParameters object:
    exports.handler = async (event, context) => {
    //some code
        let taskToken = event.queryStringParameters.token
    //more code
    
  6. The token is then sent back to the AskUser state via an API call from within the receiveUser Lambda function. This API call also defines the next course of action for the workflow to take.
    //some code
    let params = {
        output: JSON.stringify({ "action": NextAction }),
        taskToken: taskTokenClean
    }
    let res = await stepfunctions.sendTaskSuccess(params).promise()
    //code continues
    

Each Step Functions execution can last for up to a year, allowing for long wait periods for the administrator to take action. There is no extra cost for a longer wait time as you pay for the number of state transitions, and not for the idle wait time.

Conclusion

Using EventBridge to route IAM policy creation events directly to AWS Step Functions reduces the need for unnecessary communication layers. It helps promote good use of compute resources, ensuring Lambda is used to transform data, not to transport or orchestrate it.

Using Step Functions to invoke services sequentially has two important benefits for this application. First, you can identify the use of restricted policies quickly and automatically. Also, these policies can be removed and held in a ‘pending’ state until approved.

Step Functions Standard Workflow’s callback pattern can create a robust orchestration layer that allows administrators to review each change before approving or denying.

For the full code base see the GitHub repository https://github.com/bls20AWS/AutomatedPolicyOrchestrator.

For more information on other Step Functions patterns, see our documentation on integration patterns.

Behind the scenes: GitHub security alerts

Post Syndicated from Justin Hutchings original https://github.blog/2019-12-11-behind-the-scenes-github-vulnerability-alerts/

If you have code on GitHub, chances are that you’ve had a security vulnerability alert at some point. Since the feature launched, GitHub has sent more than 62 million security alerts for vulnerable dependencies.

How does it work?

Vulnerability alerts rely on two pieces of data: an inventory of all the software that your code depends on, and a curated list of known vulnerabilities in open-source code. 

Any time you push a change to a dependency manifest file, GitHub has a job that parses those manifest files, and stores your dependency on those packages in the dependency graph. If you’re dependent on something that hasn’t been seen before, a background task runs to get more information about the package from the package registries themselves and adds it. We use the information from the package registries to establish the canonical repository that the package came from, and to help populate metadata like readmes, known versions, and the published licenses. 
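
As a toy illustration of the inventory half of this pipeline, extracting npm dependencies amounts to parsing package.json. The sketch below is illustrative only and bears no relation to GitHub’s actual ingestion code.

// Toy sketch: build a dependency inventory from an npm manifest.
const fs = require('fs');

function dependenciesFromManifest(path) {
  const manifest = JSON.parse(fs.readFileSync(path, 'utf8'));
  return ['dependencies', 'devDependencies'].flatMap((section) =>
    Object.entries(manifest[section] || {}).map(([name, range]) => ({
      ecosystem: 'npm',
      name,
      range, // a semver range, matched against advisory data later
    }))
  );
}

console.log(dependenciesFromManifest('./package.json'));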

On GitHub Enterprise Server, this process works identically, except we don’t get any information from the public package registries in order to protect the privacy of the server and its code. 

The dependency graph supports manifests for JavaScript (npm, Yarn), .NET (NuGet), Java (Maven), PHP (Composer), Python (PyPI), and Ruby (RubyGems). This data powers our vulnerability alerts, but also dependency insights, the “used by” badge, and the community contributors experience.

Beyond the dependency graph, we aggregate data from a number of sources and curate it to bring you actionable security alerts. GitHub brings in security vulnerability data from a number of sources, including the National Vulnerability Database (a service of the United States National Institute of Standards and Technology), maintainer security advisories from open-source maintainers, community data sources, and our partner WhiteSource.

Once we learn about a vulnerability, it passes through an advanced machine learning model that’s trained to recognize vulnerabilities which impact developers. This model rejects anything that isn’t related to an open-source toolchain. If the model accepts the vulnerability, a bot creates a pull request in a GitHub private repository for our team of curation experts to manually review.

GitHub curates vulnerabilities because CVEs (Common Vulnerabilities and Exposures entries) are often ambiguous about which open-source projects are impacted. This can be particularly challenging when multiple libraries with similar names exist, or when they’re part of a larger toolkit. Depending on the kind of vulnerability, our curation team may follow up with outside security researchers or maintainers about the impact assessment. This follow-up helps to confirm that an alert is warranted and to identify the exact packages that are impacted.

Once the curation team completes the mappings, we merge the pull request and it starts a background job that notifies users about any affected repositories. Depending on the vulnerability, this can cause a lot of alerts. In a recent incident, more than two million repositories were alerted about a vulnerable version of lodash, a popular JavaScript utility library.

GitHub Enterprise Server customers get a slightly different experience. If an admin has enabled security vulnerability alerts through GitHub Connect, the server will download the latest curated list of vulnerabilities from GitHub.com over the private GitHub Connect channel on its next scheduled sync (about once per hour). If a new vulnerability exists, the server determines the impacted users and repositories before generating alerts directly. 

Security vulnerabilities are a matter of public good. High-profile breaches impact the trustworthiness of the entire tech industry, so we publish a curated set of vulnerabilities on our GraphQL APIs for community projects and enterprise tools to use in custom workflows as necessary. Users can also browse the known vulnerabilities from public sources on the GitHub Advisory Database.

Engineers behind the feature

Despite advanced technology, security alerting is a human process driven by dedicated GitHubbers. Meet Rob (@rschultheis), one of the core members of our security team, and learn about his experiences at GitHub through a friendly Q&A:

Humphrey Dogart (German Shepherd) and Rob Schultheis (Software Engineer on the GitHub Security team)

How long have you been with GitHub? 

Two years

How did you get into software security? 

I’ve worked with open source software for most of my 20-year career in tech, and honestly for much of that time I didn’t pay much attention to security. When I started at GitHub I was given the opportunity to work on the first iteration of security alerts. It quickly became clear that having a high-quality, open dataset was going to be a critical factor in the success of the feature. I dove into the task of curating that advisory dataset and found a whole side to the industry that was open for exploration, and I’ve stayed with it ever since!

What are the trickiest parts of vulnerability curation? 

The hardest problem is probably confirming that our advisory data correctly identifies which version(s) of a package are vulnerable to a given advisory, and which version(s) first address it.

What was the most difficult security vulnerability you’ve had to publish? 

One memorable vulnerability was CVE-2015-9284. This one was tough in several ways because it was a part of a popular library, it was also unpatched when it became fully public, and finally, it was published four years after the initial disclosure to maintainers. Even worse, all attempts to fix it had stalled.

We ended up proceeding to publish it and the community quickly responded and finally got the security issue patched.

What’s your favorite feel-good moment working in security? 

Seeing tweets and other feedback thanking us is always wonderful. We do read them! And that goes the same for those critical of the feature or the way certain advisories were disclosed or published. Please keep them coming—they’re really valuable to us as we keep evolving our security offerings.

Since you work at home, can you introduce us to your furry officemate? 

I live with a seven-month-old shepherd named Humphrey Dogart. His primary responsibilities are making sure I don’t spend all day on the computer, and he does a great job of that. I think we make a great team!


Learn more about GitHub security alerts

The post Behind the scenes: GitHub security alerts appeared first on The GitHub Blog.

New – VPC Ingress Routing – Simplifying Integration of Third-Party Appliances

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/new-vpc-ingress-routing-simplifying-integration-of-third-party-appliances/

When I was delivering the Architecting on AWS class, customers often asked me how to configure an Amazon Virtual Private Cloud to enforce the same network security policies in the cloud as they have on-premises. For example, to scan all ingress traffic with an Intrusion Detection System (IDS) appliance or to use the same firewall in the cloud as on-premises. Until today, the only answer I could provide was to route all traffic back from their VPC to an on-premises appliance or firewall in order to inspect the traffic with their usual networking gear before routing it back to the cloud. This is obviously not an ideal configuration: it adds latency and complexity.

Today, we announce new VPC networking routing primitives that allow you to route all incoming and outgoing traffic to/from an Internet Gateway (IGW) or Virtual Private Gateway (VGW) to a specific Amazon Elastic Compute Cloud (EC2) instance’s Elastic Network Interface. It means you can now configure your Virtual Private Cloud to send all traffic to an EC2 instance before the traffic reaches your business workloads. The instance typically runs network security tools to inspect or to block suspicious network traffic (such as an IDS/IPS or firewall) or to perform any other network traffic inspection before relaying the traffic to other EC2 instances.

How Does it Work?
To learn how it works, I wrote this CDK script to create a VPC with two public subnets: one subnet for the appliance and one subnet for a business application. The script launches two EC2 instances with public IP addresses, one in each subnet. The script creates the architecture below:

This is a regular VPC: the subnets have routing tables to the Internet Gateway and the traffic flows in and out as expected. The application instance hosts a static web site that is accessible from any browser. You can retrieve the application public DNS name from the EC2 Console (for your convenience, I also included the CLI version in the comments of the CDK script).

AWS_REGION=us-west-2
APPLICATION_IP=$(aws ec2 describe-instances                           \
                     --region $AWS_REGION                             \
                     --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].Association.PublicDnsName"  \
                     --output text)
				   
curl -I $APPLICATION_IP

Configure Routing
To configure routing, you need to know the VPC ID, the ENI ID of the ENI attached to the appliance instance, and the Internet Gateway ID. Assuming you created the infrastructure using the CDK script I provided, here are the commands I use to find these three IDs (be sure to adjust to the AWS Region you use):

AWS_REGION=us-west-2
VPC_ID=$(aws cloudformation describe-stacks                              \
             --region $AWS_REGION                                        \
             --stack-name VpcIngressRoutingStack                         \
             --query "Stacks[].Outputs[?OutputKey=='VPCID'].OutputValue" \
             --output text)

ENI_ID=$(aws ec2 describe-instances                                       \
             --region $AWS_REGION                                         \
             --query "Reservations[].Instances[] | [?Tags[?Key=='Name' &&  Value=='appliance']].NetworkInterfaces[].NetworkInterfaceId" \
             --output text)

IGW_ID=$(aws ec2 describe-internet-gateways                               \
             --region $AWS_REGION                                         \
             --query "InternetGateways[] | [?Attachments[?VpcId=='${VPC_ID}']].InternetGatewayId" \
             --output text)

To route all incoming traffic through my appliance, I create a routing table for the Internet Gateway and I attach a rule to direct all traffic to the EC2 instance Elastic Network Interface (ENI):

# create a new routing table for the Internet Gateway
ROUTE_TABLE_ID=$(aws ec2 create-route-table                      \
                     --region $AWS_REGION                        \
                     --vpc-id $VPC_ID                            \
                     --query "RouteTable.RouteTableId"           \
                     --output text)

# create a route for 10.0.1.0/24 pointing to the appliance ENI
aws ec2 create-route                             \
    --region $AWS_REGION                         \
    --route-table-id $ROUTE_TABLE_ID             \
    --destination-cidr-block 10.0.1.0/24         \
    --network-interface-id $ENI_ID

# associate the routing table to the Internet Gateway
aws ec2 associate-route-table                      \
    --region $AWS_REGION                           \
    --route-table-id $ROUTE_TABLE_ID               \
    --gateway-id $IGW_ID

Alternatively, I can use the VPC Console under the new Edge Associations tab.

To route all application outgoing traffic through the appliance, I replace the default route for the application subnet to point to the appliance’s ENI:

SUBNET_ID=$(aws ec2 describe-instances                                  \
                --region $AWS_REGION                                    \
                --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].SubnetId"    \
                --output text)
ROUTING_TABLE=$(aws ec2 describe-route-tables                           \
                    --region $AWS_REGION                                \
                    --query "RouteTables[?VpcId=='${VPC_ID}'] | [?Associations[?SubnetId=='${SUBNET_ID}']].RouteTableId" \
                    --output text)

# delete the existing default route (the one pointing to the internet gateway)
aws ec2 delete-route                       \
    --region $AWS_REGION                   \
    --route-table-id $ROUTING_TABLE        \
    --destination-cidr-block 0.0.0.0/0
	
# create a default route pointing to the appliance's ENI
aws ec2 create-route                          \
    --region $AWS_REGION                      \
    --route-table-id $ROUTING_TABLE           \
    --destination-cidr-block 0.0.0.0/0        \
    --network-interface-id $ENI_ID
	
aws ec2 associate-route-table       \
    --region $AWS_REGION            \
    --route-table-id $ROUTING_TABLE \
    --subnet-id $SUBNET_ID

Alternatively, I can use the VPC Console. Within the correct routing table, I select the Routes tab and click Edit routes to replace the default route (the one pointing to 0.0.0.0/0) so that it targets the appliance’s ENI.

Now I have the routing configuration in place. The new routing looks like:

Configure the Appliance Instance
Finally, I configure the appliance instance to forward all traffic it receives. Your software appliance usually does that for you; no extra step is required when you use AWS Marketplace appliances. When using a plain Linux instance, two extra steps are required:

1. Connect to the EC2 appliance instance and configure IP traffic forwarding in the kernel:

APPLIANCE_ID=$(aws ec2 describe-instances  \
                   --region $AWS_REGION    \
                   --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='appliance']].InstanceId" \
                   --output text)
aws ssm start-session --region $AWS_REGION --target $APPLIANCE_ID	

##
## once connected (you see the 'sh-4.2$' prompt), type:
##

sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1
exit

2. Configure the EC2 instance to accept traffic destined to addresses other than its own by disabling the source/destination check:

aws ec2 modify-instance-attribute --region $AWS_REGION \
                         --no-source-dest-check        \
                         --instance-id $APPLIANCE_ID

Now, the appliance is ready to forward traffic to the other EC2 instances. You can test this by pointing your browser (or using `cURL`) to the application instance.

APPLICATION_IP=$(aws ec2 describe-instances --region $AWS_REGION                          \
                     --query "Reservations[].Instances[] | [?Tags[?Key=='Name' && Value=='application']].NetworkInterfaces[].Association.PublicDnsName"  \
                     --output text)
				   
curl -I $APPLICATION_IP

To verify the traffic is really flowing through the appliance, you can enable the source/destination check on the instance again (use the --source-dest-check parameter with the modify-instance-attribute CLI command above). The traffic is blocked when the source/destination check is enabled.

Cleanup
Should you use the CDK script I provided for this article, be sure to run cdk destroy when finished. This ensures you are not billed for the two EC2 instances I use for this demo. As I modified routing tables behind the back of AWS CloudFormation, I need to manually delete the routing tables, the subnet and the VPC. The easiest way is to navigate to the VPC Console, select the VPC, and click Actions => Delete VPC. The console deletes all components in the correct order. You might need to wait 5-10 minutes after the end of cdk destroy before the console is able to delete the VPC.

From our Partners
During the beta test of these new routing capabilities, we granted early access to a collection of AWS partners. They provided us with tons of helpful feedback. Here are some of the blog posts that they wrote in order to share their experiences (I am updating this article with links as they are published):

  • 128 Technology
  • Aviatrix
  • Checkpoint
  • Cisco
  • Citrix
  • FireEye
  • Fortinet
  • HashiCorp
  • IBM Security
  • Lastline
  • Netscout
  • Palo Alto Networks
  • ShieldX Networks
  • Sophos
  • Trend Micro
  • Valtix
  • Vectra AI
  • Versa Networks

Availability
There is no additional cost to use Virtual Private Cloud ingress routing. It is available in all regions (including AWS GovCloud (US-West)) and you can start using it today.

You can learn more about gateway route tables in the updated VPC documentation.

What are the appliances you are going to use with this new VPC routing capability?

— seb

Identify Unintended Resource Access with AWS Identity and Access Management (IAM) Access Analyzer

Post Syndicated from Brandon West original https://aws.amazon.com/blogs/aws/identify-unintended-resource-access-with-aws-identity-and-access-management-iam-access-analyzer/

Today I get to share my favorite kind of announcement. It’s the sort of thing that will improve security for just about everyone that builds on AWS, it can be turned on with almost no configuration, and it costs nothing to use. We’re launching a new, first-of-its-kind capability called AWS Identity and Access Management (IAM) Access Analyzer. IAM Access Analyzer mathematically analyzes access control policies attached to resources and determines which resources can be accessed publicly or from other accounts. It continuously monitors all policies for Amazon Simple Storage Service (S3) buckets, IAM roles, AWS Key Management Service (KMS) keys, AWS Lambda functions, and Amazon Simple Queue Service (SQS) queues. With IAM Access Analyzer, you have visibility into the aggregate impact of your access controls, so you can be confident your resources are protected from unintended access from outside of your account.

Let’s look at a couple of examples. An IAM Access Analyzer finding might indicate an S3 bucket named my-bucket-1 is accessible to an AWS account with the id 123456789012 when originating from the source IP 11.0.0.0/15. Or IAM Access Analyzer may detect a KMS key policy that allows users from another account to delete the key, identifying a data loss risk you can fix by adjusting the policy. If the findings show intentional access paths, they can be archived.

So how does it work? Using the kind of math that shows up on unexpected final exams in my nightmares, IAM Access Analyzer evaluates your policies to determine how a given resource can be accessed. Critically, this analysis is not based on historical events or pattern matching or brute force tests. Instead, IAM Access Analyzer understands your policies semantically. All possible access paths are verified by mathematical proofs, and thousands of policies can be analyzed in a few seconds. This is done using a branch of computer science called automated reasoning. IAM Access Analyzer is the first service powered by automated reasoning available to builders everywhere, offering functionality unique to AWS. To start learning about automated reasoning, I highly recommend this short video explainer. If you are interested in diving a bit deeper, check out this re:Invent talk on automated reasoning from Byron Cook, Director of the AWS Automated Reasoning Group. And if you’re really interested in understanding the methodology, make yourself a nice cup of chamomile tea, grab a blanket, and get cozy with a copy of Semantic-based Automated Reasoning for AWS Access Policies using SMT.

Turning on IAM Access Analyzer is way less stressful than an unexpected nightmare final exam. There’s just one step. From the IAM Console, select Access analyzer from the menu on the left, then click Create analyzer.

Creating an Access Analyzer

Analyzers generate findings in the account from which they are created. Analyzers also work within the region defined when they are created, so create one in each region for which you’d like to see findings.
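
If you prefer to script analyzer creation per region, the same operation is exposed through the AWS SDKs and CLI. The sketch below uses the AWS SDK for JavaScript (assuming an aws-sdk version recent enough to include Access Analyzer); the analyzer name is an arbitrary example.

// Minimal sketch: create an account-level analyzer in one region.
const AWS = require('aws-sdk');
const accessanalyzer = new AWS.AccessAnalyzer({ region: 'us-west-2' });

accessanalyzer
  .createAnalyzer({ analyzerName: 'example-account-analyzer', type: 'ACCOUNT' })
  .promise()
  .then((res) => console.log('Created analyzer:', res.arn))
  .catch(console.error);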

Once our analyzer is created, findings that show accessible resources appear in the Console. My account has a few findings that are worth looking into, such as KMS keys and IAM roles that are accessible by other accounts and federated users.

Viewing Access Analyzer Findings

I’m going to click on the first finding and take a look at the access policy for this KMS key.

An Access Analyzer Finding

From here we can see the open access paths and details about the resources and principals involved. I went over to the KMS console and confirmed that this is intended access, so I archived this particular finding.

All IAM Access Analyzer findings are visible in the IAM Console, and can also be accessed using the IAM Access Analyzer API. Findings related to S3 buckets can be viewed directly in the S3 Console. Bucket policies can then be updated right in the S3 Console, closing the open access pathway.

An Access Analyzer finding in S3

You can also see high-priority findings generated by IAM Access Analyzer in AWS Security Hub, ensuring a comprehensive, single source of truth for your compliance and security-focused team members. IAM Access Analyzer also integrates with CloudWatch Events, making it easy to automatically respond to or send alerts regarding findings through the use of custom rules.

Now that you’ve seen how IAM Access Analyzer provides a comprehensive overview of cloud resource access, you should probably head over to IAM and turn it on. One of the great advantages of building in the cloud is that the infrastructure and tools continue to get stronger over time and IAM Access Analyzer is a great example. Did I mention that it’s free? Fire it up, then send me a tweet sharing some of the interesting things you find. As always, happy building!

— Brandon

Harnessing the Power of the People: Cloudflare’s First Security Awareness Month Design Challenge Winners

Post Syndicated from Jacqueline Keith original https://blog.cloudflare.com/cloudflare-security-awareness-month-design-challenge/

Harnessing the Power of the People: Cloudflare’s First Security Awareness Month Design Challenge Winners

Grabbing the attention of employees at a security and privacy-focused company on security awareness presents a unique challenge: how do you get people who are already thinking about security all day to think about it some more? October marked Cloudflare’s first Security Awareness Month as a public company and, to celebrate, the security team challenged our entire company population to create graphics, slogans, and memes to encourage us all to think and act more securely every day.

Employees approached this challenge with gusto; global participation meant plenty of high-quality submissions to vote on. In addition to being featured here, the winning designs will be displayed in Cloudflare offices throughout 2020 and the creators will be on the decision panel for next year’s winners. Three rose to the top, highlighting creativity and style that is uniquely Cloudflarian. I sat down with the winners to talk through their thoughts on security and what all companies can do to drive awareness.

Eugene Wang, Design Team, First Place



Sílvia Flores, Executive Assistant, Second Place



Scott Jones, e-Learning Developer, Third Place

Security Haiku

Wipe that whiteboard clean
Visitors may come and see
Secrets not for them

No tailgating please
You may be a nice person
But I don’t know that

1. What inspired your design?

Eugene: The friendly “Welcome” cloud seen in our all company slides was a jumping off point. It seemed like a great character that embodied being a Cloudflarian and had tons of potential to get into adventures. I also wanted security to be a bit fun, where appropriate. Instead of a serious breach (though it could be), here it was more a minor annoyance personified by a wannabe-sneaky alligator. Add a pun, and there you go—poster design!

Sílvia: What inspired my design was the cute Cloudflare mascot the otter since there are so many otters in SF. Also, security can be fun and I added a pun for all the employees to remember the security system in an entertaining and respectful way. This design is very much my style and I believe making things cute and bright can really grab attention from people who are so busy in their work. A bright, orange, leopard print poster cannot be missed!

Scott: I have always loved the haiku form and poems were allowed!

2. What’s the number one thing security teams can do to get non-security people excited about security?

Eugene: Make them realize and identify the threats that can happen everyday, and their role in keeping things secure. Cute characters and puns help.

Sílvia: Make it more accessible for people to engage and understand it, possibly making more activities, content, and creating a fun environment for people to be aware but also be mindful.

Scott: Use whatever means available to keep the idea of being security conscious in everyone’s active awareness. This can and should be done in a variety of different ways so as to engage everyone in one way or another, visually with posters and signs, mentally by having contests, multi-sensory through B.E.E.R. meeting presentations and yes, even through a careful use of fear by periodically giving examples of what can happen if security is not followed…I believe that people like working here and believe in what we are doing and how we are doing it, so awareness mixed in with a little fear can reach people on a more visceral and personal level.

3. What’s your favorite security tip?

Eugene: Look at the destination of the return email.

Sílvia: LastPass. Oh my lord. I cannot remember a single password since we need to make them so difficult! With numbers, caps, symbols, emojis (ahaha). LastPass makes it easier for me to be secure and still be myself, without freaking out over remembering any passwords.

Scott: “See something, say something” because it both reflects our basic responsibility to each other and exhibits a pride that we have as being part of a company we believe in and want to protect.

For security practitioners and engagement professionals, it’s easy to try to boil the ocean when Security Awareness Month comes around; the list of potential topics and guidance is endless. Focusing on two or three key messages, gauging the maturity of your organization, and encouraging participation from every team makes it a true company-wide effort. Extra recognition and glory for those who go over and above never hurts either.

Want to run a security awareness design contest at your company? Reach out to us at [email protected] for tips and best practices for getting started, garnering support, and encouraging participation.


How to get started with security response automation on AWS

Post Syndicated from Cameron Worrell original https://aws.amazon.com/blogs/security/how-get-started-security-response-automation-aws/

At AWS, we encourage you to use automation to help quickly detect and respond to security events within your AWS environments. In addition to increasing the speed of detection and response, automation also helps you scale your security operations as you expand your workloads running on AWS. For these reasons, security automation is a key principle outlined in both the Well-Architected and Cloud Adoption frameworks as well as in the AWS Security Incident Response Guide.

In this blog post, you’ll learn how to implement automated security response mechanisms within your AWS environments, including common patterns, implementation considerations, and an example solution. Security response automation is a broad topic that spans many areas; the goal of this post is to introduce you to the core concepts and help you get started.

A word from our lawyers: Please note that you are responsible for making your own independent assessment of the information in this post. This post: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers, or licensors.

What is security response automation?

Security response automation is a planned and programmed action taken to achieve a desired state for an application or resource based on a condition or event. When you implement security response automation, you should adopt an approach that draws from existing security frameworks. Frameworks are published materials consisting of standards, guidelines, and best practices that help organizations manage cybersecurity-related risk. Using frameworks helps you achieve consistency and scalability and enables you to focus more on the strategic aspects of your security program. You should work with compliance professionals within your organization to understand any specific security frameworks that may also be relevant for your AWS environment.

Our example solution is based on the NIST Cybersecurity Framework (CSF), which is designed to help organizations assess and improve their ability to prevent, detect, and respond to security events. According to the CSF, “cybersecurity incident response” supports your ability to contain the impact of potential cybersecurity incidents. Although automation is not a CSF requirement, automating responses to events enables you to create repeatable, predictable approaches to monitoring and responding to threats.

The five main steps in the CSF are identify, protect, detect, respond, and recover. We’ve expanded the detect and respond steps to include automation and investigation activities.
Figure 1: The five steps in the CSF

The following definitions for each step in the diagram above are based on the CSF but have been adapted for our example in this blog post. Although we will focus on the detect, automate and respond steps, it’s important to understand the entire process flow.

  • Identify: Identify and understand the resources, applications, and data within your AWS environment.
  • Protect: Develop and implement appropriate controls and safeguards to ensure delivery of services.
  • Detect: Develop and implement appropriate activities to identify the occurrence of a cybersecurity event. This step includes the implementation of monitoring capabilities which will be discussed further in the next section.
  • Automate: Develop and implement planned, programmed actions that will achieve a desired state for an application or resource based on a condition or event.
  • Investigate: Perform a systematic examination of the security event to establish the root cause.
  • Respond: Develop and implement appropriate activities to take automated or manual actions regarding a detected security event.
  • Recover: Develop and implement appropriate activities to maintain plans for resilience and to restore any capabilities or services that were impaired due to a security event.

Security response automation on AWS

AWS CloudTrail, AWS Config, and Amazon EventBridge continuously record details about the resources and configuration changes in your AWS account. You can use this information to automatically detect resource changes and to react to deviations from your desired state.
Figure 2: Automated remediation flow

As shown in the diagram above, an automated remediation flow on AWS has three stages:

  • Monitor: Your automated monitoring tools collect information about resources and applications running in your AWS environment. For example, they might collect AWS CloudTrail information about activities performed in your AWS account, usage metrics from your Amazon EC2 instances, or flow log information about the traffic going to and from network interfaces in your Amazon Virtual Private Cloud (VPC).
  • Detect: When a monitoring tool detects a predefined condition—such as a breached threshold, anomalous activity, or configuration deviation—it raises a flag within the system. A triggering condition might be an anomalous activity detected by Amazon GuardDuty, a resource becoming out of compliance with an AWS Config Rule, or a high rate of blocked requests on an Amazon VPC security group or AWS WAF web access control list.
  • Respond: When a condition is flagged, an automated response is triggered that performs an action you’ve predefined—something intended to remediate or mitigate the flagged condition. Examples of automated response actions might include modifying a VPC security group, patching an Amazon EC2 instance, or rotating credentials.

You can use the event-driven flow described above to achieve many automated response patterns with varying degrees of complexity. Your response pattern could be as simple as invoking a single AWS Lambda function, or it could be a complex series of AWS Step Functions tasks with advanced logic. In this blog post, we’ll use two simple Lambda functions in our example solution.
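To make the single-function pattern concrete, here is a minimal sketch of a Lambda responder in Python. This is not part of the sample solution later in this post; the event shape, security group ID, and the specific remediation call are hypothetical placeholders chosen for illustration.

import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Invoked by an EventBridge rule when a flagged condition is detected."""
    logger.info("Received event: %s", json.dumps(event))

    # Example remediation: revoke a hypothetical overly permissive SSH rule.
    ec2 = boto3.client("ec2")
    ec2.revoke_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # hypothetical placeholder
        IpProtocol="tcp",
        FromPort=22,
        ToPort=22,
        CidrIp="0.0.0.0/0",
    )
    return {"status": "remediated"}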

How to define your response automation

Now that we’ve introduced the concept of security response automation, you can start thinking about the security requirements in your environment that you’d like to enforce through automation. These requirements might come from general best practices you’d like to follow, or they might be specific controls from compliance frameworks relevant to your business. Either way, your objectives should be quantitative, not qualitative. Here are some examples of quantitative objectives:

  • Remote administrative network access to servers should be limited.
  • Server storage volumes should be encrypted.
  • AWS console logins should be protected by multi-factor authentication.

As an optional step, you can expand these objectives into user stories that define the conditions and remediation actions when there is an event. User stories are informal descriptions that briefly document a feature within a software system. User stories may be global and span multiple applications, or they may be specific to a single application. For example:

“Remote administrative network access to servers should be limited. Remote access ports include SSH TCP port 22 and RDP TCP port 3389. If open remote access ports are detected within the environment, they should be automatically closed and the owner will be notified.”

Once you’ve completed your user story, you can determine how to use automated remediation to help achieve these objectives in your AWS environment. User stories should be stored in a location that provides versioning support and can reference the associated automation code.

You should carefully consider the effect of your remediation mechanisms in order to prevent unintended impact on your resources and applications. Remediation actions such as instance termination, credential revocation, and security group modification can adversely affect application availability. Depending on the level of risk that’s acceptable to your organization, your automated mechanism might only provide a notification which can then be manually investigated prior to remediation. Once you’ve identified an automated remediation mechanism, you can build out the required components and test them in a non-production environment.

Sample response automation walkthrough

In the following section, we’ll walk you through an automated remediation for a simulated event that indicates potential unauthorized activity—the unintended disabling of CloudTrail logging. Outside parties might want to disable logging to prevent detection and recording of their unauthorized activity. Our response is to re-enable the CloudTrail logging and immediately notify the security contact. Here’s the user story for this scenario:

“CloudTrail logging should be enabled for all AWS accounts and regions. If CloudTrail logging is disabled, it will automatically be enabled and the security operations team will be notified.”

Note: The sample response automation below references Amazon EventBridge, which extends and builds upon CloudWatch Events. Amazon EventBridge uses the same Amazon CloudWatch Events API, so the event structure and rules configuration are the same. This blog post uses base functionality that is identical in EventBridge and CloudWatch Events.

Prerequisites

In order to use our sample remediation, you will need to enable Amazon GuardDuty and AWS Security Hub in the AWS Region you have selected. Both of these services include a 30-day free trial. See the AWS Security Hub pricing page and the Amazon GuardDuty pricing page for additional details.

Important: You’ll use AWS CloudTrail to test the sample remediation. Running more than one CloudTrail trail in your AWS account will result in charges based on the number of events processed while the trail is running. Charges for additional copies of management events recorded in a Region are applied based on the published pricing plan. To minimize the charges, follow the clean-up steps that we provide later in this post to remove the sample automation and delete the trail.

Deploy the sample response automation

In this section, we’ll show you how to deploy and test the CloudTrail logging remediation sample. Amazon GuardDuty generates the finding Stealth:IAMUser/CloudTrailLoggingDisabled when CloudTrail logging is disabled, and AWS Security Hub collects findings from GuardDuty using a standardized finding format (described later in this post). We recommend that you deploy this sample into a non-production AWS account.

Select the Launch Stack button below to deploy a CloudFormation template with an automation sample in the us-east-1 Region. You can also download the template and implement it in another Region. The template consists of an Amazon EventBridge rule, an AWS Lambda function, and the IAM permissions necessary for both components to execute. It takes several minutes for the CloudFormation stack build to complete.


  1. In the CloudFormation console, choose the Select Template form, and then select Next.
  2. On the Specify Details page, provide the email address for a security contact. (For the purpose of this walkthrough, it should be an email address you have access to.) Then select Next.
  3. On the Options page, accept the defaults, then select Next.
  4. On the Review page, confirm the details, then select Create.
  5. While the stack is being created, check the inbox of the email address you provided in step 2. Look for an email message with the subject AWS Notification – Subscription Confirmation. Select the link in the body of the email to confirm your subscription to the Amazon Simple Notification Service (Amazon SNS) topic. You should see a success message similar to the screenshot below:
    Figure 3: SNS subscription confirmation

  6. Return to the CloudFormation console. Once the Status field for the CloudFormation stack changes to CREATE_COMPLETE (as shown in figure 4), the solution is implemented and ready for testing.

    Figure 4: CREATE_COMPLETE status

Test the sample automation

You’re now ready to test the automated response by creating a test trail in CloudTrail, then trying to stop it.

  1. From the AWS Management Console, choose Services > CloudTrail.
  2. Select Trails, then select Create Trail.
  3. On the Create Trail form:
    1. Enter a value for Trail name. We use test-trail in our example below.
    2. Under Management events, select Write-only (to minimize event volume).
      Figure 5: Create a CloudTrail trail

    3. Under Storage location, choose an existing S3 bucket or create a new one. Note that since S3 bucket names are globally unique, you must add characters (such as a random string) to the name. For example: my-test-trail-bucket-<random-string>.
  4. On the Trails page of the CloudTrail console, verify that the new trail has started. You should see a green checkmark in the Status column, as shown in figure 6.
    Figure 6: Verify new trail has started

  5. You’re now ready to act like an unauthorized user trying to cover their tracks! Stop the logging for the trail you just created:
    1. Select the new trail name to display its configuration page.
    2. Toggle the Logging switch in the top-right corner to OFF.
    3. When prompted with a warning dialog box, select Continue.
    4. Verify that the Logging switch is now off, as shown below.
      Figure 7: Verify logging switch is off

      You have now simulated a security event by disabling logging for one of the trails in the CloudTrail service. Within the next few seconds, the near real-time automated response will detect the stopped trail, restart it, and send an email notification. You can refresh the Trails page of the CloudTrail console to verify that the trail’s status is ON again.

      Within the next several minutes, the investigatory automated response will also begin. GuardDuty will detect the action that stopped the trail and enrich the data about the source of unexpected behavior. Security Hub will then ingest that information and optionally correlate with other security events.

      Follow the steps below to monitor Security Hub for the finding type TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled:

  6. In the AWS Management Console, choose Services > Security Hub.
    1. Select Findings in the left pane.
    2. Select the Add filters field, then select Type.
    3. Select EQUALS, paste TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled into the field, then select Apply.
    4. Refresh your browser periodically until the finding is generated.
      Figure 8: Monitor Security Hub for your finding

While you wait on that detection, let’s dig into the components of automation.

How the sample automation works

This example incorporates two automated responses: a near real-time workflow and an investigatory workflow. The near real-time workflow provides a rapid response to an individual event, in this case the stopping of a trail. The goal is to restore the trail to a functioning state and alert security responders as quickly as possible. The investigatory workflow still includes a response to provide defense in depth and also uses services that support a more in-depth investigation of the incident.

Figure 9: Sample automation workflow

In the near real-time workflow, Amazon EventBridge monitors for the undesired activity. When a trail is stopped, AWS CloudTrail publishes an event on the EventBridge bus. An EventBridge rule detects the trail-stopping event and invokes a Lambda function to respond to the event by restarting the trail and notifying the security contact via an Amazon Simple Notification Service (SNS) topic.
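If you wanted to wire up this near real-time rule by hand instead of through the template, a sketch with boto3 might look like the following. The rule name and Lambda ARN are hypothetical placeholders; the CloudFormation template in this post creates the equivalent resources for you.

import json

import boto3

events = boto3.client("events")

# Match the control-plane event CloudTrail emits when a trail is stopped.
pattern = {
    "source": ["aws.cloudtrail"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["cloudtrail.amazonaws.com"],
        "eventName": ["StopLogging"],
    },
}

events.put_rule(
    Name="detect-stopped-trail",  # hypothetical name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Point the rule at the remediation Lambda function (hypothetical ARN).
events.put_targets(
    Rule="detect-stopped-trail",
    Targets=[{
        "Id": "restart-trail-function",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:RestartTrail",
    }],
)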

In the investigative workflow, CloudTrail logs are monitored for undesired activities. For example, if a trail is stopped, there will be a corresponding log record. GuardDuty detects this activity and retrieves additional data points about the source IP that executed the API call. Two common examples of those additional data points in GuardDuty findings include whether the API call came from an IP address on a threat list, or whether it came from a network not commonly used in your AWS account. An AWS Lambda function responds by restarting the trail and notifying the security contact. Finally, the finding is imported into AWS Security Hub for additional investigation.

AWS Security Hub imports findings from AWS security services such as GuardDuty, Amazon Macie and Amazon Inspector, plus from any third-party product integrations you’ve enabled. All findings are provided to Security Hub in AWS Security Finding Format, which eliminates the need for data conversion. Security Hub correlates these findings to help you identify related security events and determine a root cause. Security Hub also publishes its findings to Amazon EventBridge to enable further processing by other AWS services such as AWS Lambda.

Respond step deep dive

Amazon EventBridge and AWS Lambda work together to respond to a security finding. Amazon EventBridge is a service that provides real-time access to changes in data in AWS services, your own applications, and Software-as-a-Service (SaaS) applications without writing code. In this example, EventBridge identifies a Security Hub finding that requires action and invokes a Lambda function that performs remediation. As shown in figure 10, the Lambda function both notifies the security operator via SNS and restarts the stopped CloudTrail.

Figure 10: Sample “respond” workflow

To set this response up, we looked for an event indicating that a trail had been stopped or disabled. We knew that the GuardDuty finding Stealth:IAMUser/CloudTrailLoggingDisabled is raised when CloudTrail logging is disabled, so we configured a rule on the default event bus to look for this event. You can learn more about all of the available GuardDuty findings in the user guide.

How the code works

When Security Hub publishes a finding to EventBridge, it includes the full details of the incident as discovered by GuardDuty. The finding is published in JSON format. If you review the details of the sample finding, note that it has several fields that help you identify the specific events you’re looking for. Here are some of the relevant details:


{
   …
   "source":"aws.securityhub",
   …
   "detail":{
      "findings": [{
         …
         "Types": [
            "TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"
         ],
         …
      }]
   }
}

You can build an event pattern using these fields, which an EventBridge filtering rule can then use to identify events and to invoke the remediation Lambda function. Below is a snippet from the CloudFormation template we provided earlier that defines that event pattern for the EventBridge filtering rule:


# pattern matches the nested JSON format of a specific Security Hub finding
      EventPattern:
        source:
        - aws.securityhub
        detail-type:
          - "Security Hub Findings - Imported"
        detail:
          findings:
            Types:
              - "TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"

Once the rule is in place, EventBridge continuously scans the event bus for this pattern. When EventBridge finds a match, it invokes the remediating Lambda function and passes the full details of the event to the function. The Lambda function then parses the JSON fields in the event so that it can act as shown in this Python code snippet:


# extract trail ARN by parsing the incoming Security Hub finding (in JSON format)
trailARN = event['detail']['findings'][0]['ProductFields']['action/awsApiCallAction/affectedResources/AWS::CloudTrail::Trail']   

# description contains useful details to be sent to security operations
description = event['detail']['findings'][0]['Description']

The code also issues a notification to security operators so they can review the findings and insights in Security Hub and other services to better understand the incident and to decide whether further manual actions are warranted. Here’s the code snippet that uses SNS to send out a note to security operators:


#Sending the notification that AWS CloudTrail logging has been disabled.
snspublish = snsclient.publish(
	TargetArn = snsARN,
	Message="Automatically restarting CloudTrail logging.  Event description: \"%s\" " %description
	)

While notifications to human operators are important, the Lambda function does not wait before taking action. It immediately remediates the condition by restarting the stopped trail in CloudTrail. Here’s a code snippet that restarts the trail to re-enable logging:


#Enabling the AWS CloudTrail logging
try:
	client = boto3.client('cloudtrail')
	enablelogging = client.start_logging(Name=trailARN)
	logger.debug("Response on enable CloudTrail logging- %s" %enablelogging)
except ClientError as e:
	logger.error("An error occurred: %s" %e)
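Assembled into a single handler, the pieces above might look like this minimal sketch. The logger and SNS client setup are added for completeness, and reading the topic ARN from an environment variable is an assumption about how the function is configured, not a detail taken from the template.

import logging
import os

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

snsclient = boto3.client("sns")
snsARN = os.environ["SNS_TOPIC_ARN"]  # assumed to be set on the function

def lambda_handler(event, context):
    # Parse the incoming Security Hub finding (JSON) for the trail ARN.
    finding = event["detail"]["findings"][0]
    trailARN = finding["ProductFields"][
        "action/awsApiCallAction/affectedResources/AWS::CloudTrail::Trail"
    ]
    description = finding["Description"]

    # Notify security operations.
    snsclient.publish(
        TargetArn=snsARN,
        Message='Automatically restarting CloudTrail logging. Event description: "%s"' % description,
    )

    # Remediate immediately: restart the stopped trail.
    try:
        client = boto3.client("cloudtrail")
        enablelogging = client.start_logging(Name=trailARN)
        logger.debug("Response on enable CloudTrail logging - %s", enablelogging)
    except ClientError as e:
        logger.error("An error occurred: %s", e)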

After the trail has been restarted, API activity is once again logged and can be audited. This can help provide relevant data for the remaining steps in the incident response process. The data is especially important for the post-incident phase, when your team analyzes lessons learned to prevent future incidents. You can also use this phase to identify additional steps to automate in your incident response.

Clean up

After you’ve completed the sample security response automation, we recommend that you remove the resources created in this walkthrough from your account to minimize charges for the trail in CloudTrail and the data stored in S3.

Important: Deleting resources in your account can negatively impact the applications running in your AWS account. Verify that applications and AWS account security do not depend on the resources you’re about to delete.

Here are the clean-up steps:

  1. Delete the CloudFormation stack.
  2. Delete the trail you created in CloudTrail.
  3. If you created an S3 bucket for CloudTrail logs, you can also delete that S3 bucket.
  4. New accounts can try GuardDuty at no cost for 30 days. You can suspend or disable GuardDuty before the free trial period ends to avoid charges.
  5. Security Hub comes with a 30-day free trial. You can avoid charges by disabling the service before the trial period is over.

Summary

You’ve learned the basic concepts and considerations behind security response automation on AWS and how to use Amazon EventBridge, Amazon GuardDuty and AWS Security Hub to automatically re-enable AWS CloudTrail when it becomes disabled unexpectedly. As a next step, you may want to start building your own response automations and dive deeper into the AWS Security Incident Response Guide, NIST Cybersecurity Framework (CSF) or the AWS Cloud Adoption Framework (CAF) Security Perspective. You can explore additional automatic remediation solutions on the AWS Security Blog. You can find the code used in this example on GitHub.

If you have feedback about this blog post, submit it in the Comments section below. If you have questions about using this solution, start a thread in the EventBridge, GuardDuty, or Security Hub forums, or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Cameron Worrell

Cameron is a Solutions Architect with a passion for security and enterprise transformation. He joined AWS in 2015.

Alex Tomic

Alex is an AWS Enterprise Solutions Architect focused on security and compliance. He joined AWS in 2014.

Nathan Case

Nathan is a Senior Security Strategist who joined AWS in 2016. He is always interested to see where our customers plan to go and how we can help them get there. He is also interested in intel, combined data lake sharing opportunities, and open source collaboration. In the end, Nathan loves technology and the fact that we can change the world to make it a better place.

Introducing Flan Scan: Cloudflare’s Lightweight Network Vulnerability Scanner

Post Syndicated from Nadin El-Yabroudi original https://blog.cloudflare.com/introducing-flan-scan/


Today, we’re excited to open source Flan Scan, Cloudflare’s in-house lightweight network vulnerability scanner. Flan Scan is a thin wrapper around Nmap that converts this popular open source tool into a vulnerability scanner with the added benefit of easy deployment.

We created Flan Scan after two unsuccessful attempts at using “industry standard” scanners for our compliance scans. A little over a year ago, we were paying a big vendor for their scanner until we realized it was one of our highest security costs and many of its features were not relevant to our setup. It became clear we were not getting our money’s worth. Soon after, we switched to an open source scanner and took on the task of managing its complicated setup. That made it difficult to deploy to our entire fleet of more than 190 data centers.

We had a deadline at the end of Q3 to complete an internal scan for our compliance requirements but no tool that met our needs. Given our history with existing scanners, we decided to set off on our own and build a scanner that worked for our setup. To design Flan Scan, we worked closely with our auditors to understand the requirements of such a tool. We needed a scanner that could accurately detect the services on our network and then look up those services in a database of CVEs to find vulnerabilities relevant to our services. Additionally, unlike other scanners we had tried, our tool had to be easy to deploy across our entire network.

We chose Nmap as our base scanner because, unlike other network scanners that sacrifice accuracy for speed, it prioritizes detecting services, thereby reducing false positives. We also liked Nmap because of the Nmap Scripting Engine (NSE), which allows scripts to be run against the scan results. We found that the “vulners” script, available on NSE, mapped the detected services to relevant CVEs from a database, which is exactly what we needed.

The next step was to make the scanner easy to deploy while ensuring it outputted actionable and valuable results. We added three features to Flan Scan which helped package up Nmap into a user-friendly scanner that can be deployed across a large network.

  • Easy Deployment and Configuration – To create a lightweight scanner with easy configuration, we chose to run Flan Scan inside a Docker container. As a result, Flan Scan can be built and pushed to a Docker registry and maintains the flexibility to be configured at runtime. Flan Scan also includes sample Kubernetes configuration and deployment files with a few placeholders so you can get up and scanning quickly.
  • Pushing Results to the Cloud – Flan Scan adds support for pushing results to a Google Cloud Storage Bucket or an S3 bucket. All you need to do is set a few environment variables and Flan Scan will do the rest. This makes it possible to run many scans across a large network and collect the results in one central location for processing.
  • Actionable Reports – Flan Scan generates actionable reports from Nmap’s output so you can quickly identify vulnerable services on your network, the applicable CVEs, and the IP addresses and ports where these services were found. The reports are useful for engineers following up on the results of the scan as well as auditors looking for evidence of compliance scans.

Sample run of Flan Scan from start to finish. 

How has Flan Scan improved Cloudflare’s network security?

By the end of Q3, not only had we completed our compliance scans, but we had also used Flan Scan to tangibly improve the security of our network. At Cloudflare, we pin the software version of some services in production because it allows us to prioritize upgrades by weighing the operational cost of upgrading against the improvements of the latest version. Flan Scan’s results revealed that our FreeIPA nodes, used to manage Linux users and hosts, were running an outdated version of Apache with several medium severity vulnerabilities. As a result, we prioritized their update. Flan Scan also found a vulnerable instance of PostgreSQL left over from a performance dashboard that no longer exists.

Flan Scan is part of a larger effort to expand our vulnerability management program. We recently deployed osquery to our entire network to perform host-based vulnerability tracking. By complementing osquery’s findings with Flan Scan’s network scans, we are working towards comprehensive visibility of the services running at our edge and their vulnerabilities. With two vulnerability trackers in place, we decided to build a tool to manage the increasing number of vulnerability sources. Our tool sends alerts on new vulnerabilities, filters out false positives, and tracks remediated vulnerabilities. Flan Scan’s valuable security insights were a major impetus for creating this vulnerability tracking tool.

How does Flan Scan work?


The first step of Flan Scan is running an Nmap scan with service detection. Flan Scan’s default Nmap scan runs the following scans:

  1. ICMP ping scan – Nmap determines which of the IP addresses given are online.
  2. SYN scan – Nmap scans the 1000 most common ports of the IP addresses which responded to the ICMP ping. Nmap marks ports as open, closed, or filtered.
  3. Service detection scan – To detect which services are running on open ports, Nmap performs TCP handshake and banner grabbing scans.

Other types of scanning, such as UDP scanning and scanning IPv6 addresses, are also possible with Nmap. Flan Scan allows users to run these and any other extended features of Nmap by passing in Nmap flags at runtime.
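As a rough illustration (this is an assumption about the assembled command, not Flan Scan’s exact internals), the default scan corresponds to invoking Nmap roughly like this:

import subprocess

# SYN scan (-sS), service detection (-sV), the vulners NSE script, and
# XML output (-oX) for the report generator. The target CIDR is a placeholder.
subprocess.run([
    "nmap", "-sS", "-sV",
    "--script=vulners",
    "-oX", "report.xml",
    "203.0.113.0/24",
], check=True)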

Sample Nmap output

Flan Scan adds the “vulners” script to its default Nmap command so that the output includes a list of vulnerabilities applicable to the services detected. The vulners script works by making API calls to a service run by vulners.com, which returns any known vulnerabilities for the given service.

Sample Nmap output with Vulners script

The next step of Flan Scan uses a Python script to convert the structured XML of Nmap’s output into an actionable report. The reports of the previous scanner we used listed each of the IP addresses scanned and presented the vulnerabilities applicable to that location. Since we had multiple IP addresses running the same service, the report would repeat the same list of vulnerabilities under each of these IP addresses. This meant scrolling back and forth through documents hundreds of pages long to obtain a list of all IP addresses with the same vulnerabilities. The results were impossible to digest.

Flan Scan’s results are structured around services. The report enumerates each vulnerable service, with a list beneath it of the relevant vulnerabilities and all IP addresses running that service. This structure makes the report shorter and more actionable, since the services that need to be remediated can be clearly identified. Flan Scan reports are made using LaTeX because who doesn’t like nicely formatted reports that can be generated with a script? The raw LaTeX file that Flan Scan outputs can be converted to a beautiful PDF by using tools like pdflatex or TeXShop.

Sample Flan Scan report

What’s next?

Cloudflare’s mission is to help build a better Internet for everyone, not just Internet giants who can afford to buy expensive tools. We’re open sourcing Flan Scan because we believe it shouldn’t cost tons of money to have strong network security.

You can get started running a vulnerability scan on your network in a few minutes by following the instructions on the README. We welcome contributions and suggestions from the community.

Even faster connection establishment with QUIC 0-RTT resumption

Post Syndicated from Alessandro Ghedini original https://blog.cloudflare.com/even-faster-connection-establishment-with-quic-0-rtt-resumption/

Even faster connection establishment with QUIC 0-RTT resumption

One of the more interesting features introduced by TLS 1.3, the latest revision of the TLS protocol, was the so-called “zero round-trip time connection resumption”, a mode of operation that allows a client to start sending application data, such as HTTP requests, without having to wait for the TLS handshake to complete, thus reducing the latency penalty incurred in establishing a new connection.

The basic idea behind 0-RTT connection resumption is that if the client and server have previously established a TLS connection, they can use information cached from that session to establish a new one without having to negotiate the connection’s parameters from scratch. Notably, this allows the client to compute the private encryption keys required to protect application data before even talking to the server.

However, in the case of TLS, “zero roundtrip” only refers to the TLS handshake itself: the client and server are still required to first establish a TCP connection in order to be able to exchange TLS data.


Zero means zero

QUIC goes a step further and allows clients to send application data in the very first round trip of the connection, without requiring any other handshake to be completed beforehand.


After all, QUIC already shaved a full round trip off a typical connection’s handshake by merging the transport and cryptographic handshakes into one. By reducing the handshake by an additional round trip, QUIC achieves real 0-RTT connection establishment.

It literally can’t get any faster!

Attack of the clones

Unfortunately, 0-RTT connection resumption is not all smooth sailing, and it comes with caveats and risks, which is why Cloudflare does not enable 0-RTT connection resumption by default. Users should consider the risks involved and decide whether to use this feature or not.

For starters, 0-RTT connection resumption does not provide forward secrecy, meaning that a compromise of the secret parameters of a connection will trivially allow compromising the application data sent during the 0-RTT phase of new connections resumed from it. Data sent after the 0-RTT phase, meaning after the handshake has been completed, would still be safe though, as TLS 1.3 (and QUIC) will still perform the normal key exchange algorithm (which is forward secret) for data sent after the handshake completion.

More worryingly, application data sent during 0-RTT can be captured by an on-path attacker and then replayed multiple times to the same server. In many cases this is not a problem, as the attacker wouldn’t be able to decrypt the data, which is why 0-RTT connection resumption is useful, but in some cases this can be dangerous.

For example, imagine a bank that allows an authenticated user (e.g. using HTTP cookies, or other HTTP authentication mechanisms) to send money from their account to another user by making an HTTP request to a specific API endpoint. If an attacker was able to capture that request when 0-RTT connection resumption was used, they wouldn’t be able to see the plaintext and get the user’s credentials, because they wouldn’t know the secret key used to encrypt the data; however, they could still potentially drain that user’s bank account by replaying the same request over and over.

Of course, this problem is not specific to banking APIs: any non-idempotent request has the potential to cause undesired side effects, ranging from slight malfunctions to serious security breaches.

In order to help mitigate this risk, Cloudflare will always reject 0-RTT requests that are obviously not idempotent (like POST or PUT requests), but in the end it’s up to the application sitting behind Cloudflare to decide which requests can and cannot be allowed with 0-RTT connection resumption, as even innocuous-looking ones can have side effects on the origin server.

To help origins detect and potentially disallow specific requests, Cloudflare also follows the techniques described in RFC 8470. Notably, Cloudflare will add the Early-Data: 1 HTTP header to requests received during 0-RTT resumption that are forwarded to origins.

Origins able to understand this header can then decide to answer the request with the 425 (Too Early) HTTP status code, which instructs the client that originated the request to retry sending the same request, but only after the TLS or QUIC handshake has fully completed, at which point there is no longer any risk of replay attacks. This could even be implemented as part of a Cloudflare Worker.
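As one possible origin-side illustration, here is a minimal sketch in Python using only the standard library. The split between safe GETs and unsafe POSTs is illustrative, not a prescription for your application:

from http.server import BaseHTTPRequestHandler, HTTPServer

class EarlyDataAwareHandler(BaseHTTPRequestHandler):
    def sent_as_early_data(self):
        # Cloudflare marks forwarded 0-RTT requests with "Early-Data: 1".
        return self.headers.get("Early-Data") == "1"

    def do_GET(self):
        # Idempotent requests are safe to serve even during 0-RTT.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"index page\n")

    def do_POST(self):
        # A non-idempotent request replayed during 0-RTT could have side
        # effects, so ask the client to retry after the full handshake.
        if self.sent_as_early_data():
            self.send_response(425, "Too Early")
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"transfer accepted\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), EarlyDataAwareHandler).serve_forever()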


This makes it possible for origins to allow 0-RTT requests for endpoints that are safe, such as a website’s index page. That is where 0-RTT is most useful, since the index page is typically the first request a browser makes after establishing a connection, while other endpoints such as APIs and form submissions stay protected. And if an origin does not provide any non-idempotent endpoints, no action is required.

One stop shop for all your 0-RTT needs

Just like we previously did for TLS 1.3, we now support 0-RTT resumption for QUIC as well. In honor of this event, we have dusted off the user-interface controls that allow Cloudflare users to enable this feature for their websites, and introduced a dedicated toggle to control whether 0-RTT connection resumption is enabled or not. It can be found under the “Network” tab on the Cloudflare dashboard.

When TLS 1.3 and/or QUIC (via the HTTP/3 toggle) are enabled, 0-RTT connection resumption will be automatically offered to clients that support it, and the replay mitigation mentioned above will also be applied to the connections making use of this feature.

In addition, if you are a user of our open-source HTTP/3 patch for NGINX, after updating the patch to the latest version, you’ll be able to enable support for 0-RTT connection resumption in your own NGINX-based HTTP/3 deployment by using the built-in “ssl_early_data” option, which will work for both TLS 1.3 and QUIC+HTTP/3.
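Enabling it is a single directive in the TLS server configuration:

ssl_early_data on;

As with the dashboard toggle, make sure your application can tolerate replayed early data, or reject it with a 425, before turning this on.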

Log every request to corporate apps, no code changes required

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/log-every-request-to-corporate-apps-no-code-changes-required/


When a user connects to a corporate network through an enterprise VPN client, the VPN appliance logs a single event: the login.


The administrator of that private network knows the user opened the door at 12:15:05 but, in most cases, has no visibility into what they did next. Once inside that private network, users can reach internal tools, sensitive data, and production environments. Preventing this requires complicated network segmentation and often server-side application changes. Logging the steps that an individual takes inside that network is even more difficult.

Cloudflare Access does not improve VPN logging; it replaces this model. Cloudflare Access secures internal sites by evaluating every request, not just the initial login, for identity and permission. Instead of a private network, administrators deploy corporate applications behind Cloudflare using our authoritative DNS. Administrators can then integrate their team’s SSO and build user and group-specific rules to control who can reach applications behind the Access Gateway.

When a request is made to a site behind Access, Cloudflare prompts the visitor to login with an identity provider. Access then checks that user’s identity against the configured rules and, if permitted, allows the request to proceed. Access performs these checks on each request a user makes in a way that is transparent and seamless for the end user.

However, since the day we launched Access, our logging has resembled the VPN model described above: we captured when a user first authenticated through the gateway, but that’s where it stopped. Starting today, we can give your team the full picture of every request made to every application.

We’re excited to announce that you can now capture logs of every request a user makes to a resource behind Cloudflare Access. In the event of an emergency, like a stolen laptop, you can now audit every URL requested during a session. Logs are standardized in one place, regardless of whether you use multiple SSO providers or secure multiple applications, and the Cloudflare Logpush platform can send them to your SIEM for retention and analysis.

Auditing every login

Cloudflare Access brings the speed and security improvements Cloudflare provides to public-facing sites and applies those lessons to the internal applications your team uses. For most teams, these were applications that traditionally lived behind a corporate VPN. Once a user joined that VPN, they were inside that private network, and administrators had to take additional steps to prevent users from reaching things they should not have access to.

Access flips this model by assuming no user should be able to reach anything by default, applying a zero-trust solution to the internal tools your team uses. With Access, when any user requests the hostname of that application, the request hits Cloudflare first. We check to see if the user is authenticated and, if not, send them to your identity provider like Okta or Azure Active Directory. The user is prompted to log in, and Cloudflare then evaluates whether they are allowed to reach the requested application. All of this happens at the edge of our network before a request touches your origin, and for the user, it feels like the seamless SSO flow they’ve become accustomed to for SaaS apps.


When a user authenticates with your identity provider, we audit that event as a login and make those logs available in our API. We capture the user’s email, their IP address, the time they authenticated, the method (in this case, a Google SSO flow), and the application they were able to reach.


These logs can help you track every user who connected to an internal application, including contractors and partners who might use different identity providers. However, this logging stopped at the authentication. Access did not capture the next steps of a given user.

Auditing every request

Cloudflare secures both external-facing sites and internal resources by triaging each request in our network before we ever send it to your origin. Products like our WAF enforce rules to protect your site from attacks like SQL injection or cross-site scripting. Likewise, Access identifies the principal behind each request by evaluating each connection that passes through the gateway.

Once a member of your team authenticates to reach a resource behind Access, we generate a token for that user that contains their SSO identity. The token is structured as a JSON Web Token (JWT). JWTs are an open standard for signing and encrypting sensitive information, and they provide a secure and information-dense mechanism that Access can use to verify individual users. Cloudflare signs the JWT using a public and private key pair that we control. We rely on RSA Signature with SHA-256, or RS256, an asymmetric algorithm, to perform that signature. We make the public key available so that you can validate the tokens’ authenticity as well.
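For example, an origin that wants to double-check the identity Access asserts can verify the JWT itself. The sketch below uses the PyJWT library; the team domain and audience (AUD) tag are hypothetical placeholders for your own Access configuration:

import jwt
from jwt import PyJWKClient

# Hypothetical placeholders: substitute your own team domain and the
# audience (AUD) tag of your Access application.
CERTS_URL = "https://yourteam.cloudflareaccess.com/cdn-cgi/access/certs"
AUDIENCE = "your-application-aud-tag"

jwks_client = PyJWKClient(CERTS_URL)

def verify_access_token(token: str) -> dict:
    # Fetch the public key matching the token's key ID, then verify the
    # RS256 signature and audience claim before trusting the identity.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(token, signing_key.key, algorithms=["RS256"], audience=AUDIENCE)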

When a user requests a given URL, Access appends the user identity from that token as a request header, which we then log as the request passes through our network. Your team can collect these logs in your preferred third-party SIEM or storage destination by using the Cloudflare Logpush platform.

Cloudflare Logpush can be used to gather and send specific request headers from the requests made to sites behind Access. Once enabled, you can then configure the destination where Cloudflare should send these logs. When enabled with the Access user identity field, the logs will export to your systems as JSON similar to the logs below.

{
   "ClientIP": "198.51.100.206",
   "ClientRequestHost": "jira.widgetcorp.tech",
   "ClientRequestMethod": "GET",
   "ClientRequestURI": "/secure/Dashboard/jspa",
   "ClientRequestUserAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
   "EdgeEndTimestamp": "2019-11-10T09:51:07Z",
   "EdgeResponseBytes": 4600,
   "EdgeResponseStatus": 200,
   "EdgeStartTimestamp": "2019-11-10T09:51:07Z",
   "RayID": "5y1250bcjd621y99",
   "RequestHeaders":{"cf-access-user":"srhea"}
}

{
   "ClientIP": "198.51.100.206",
   "ClientRequestHost": "jira.widgetcorp.tech",
   "ClientRequestMethod": "GET",
   "ClientRequestURI": "/browse/EXP-12",
   "ClientRequestUserAgent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
   "EdgeEndTimestamp": "2019-11-10T09:51:27Z",
   "EdgeResponseBytes": 4570,
   "EdgeResponseStatus": 200,
   "EdgeStartTimestamp": "2019-11-10T09:51:27Z",
   "RayID": "yzrCqUhRd6DVz72a",
   "RequestHeaders":{"cf-access-user":"srhea"}
}

In the example above, the user initially visited the splash page for a sample Jira instance. The next request was made to a specific Jira ticket, EXP-12, about 20 seconds after the first request. With per-request logging, Access administrators can review each request a user made once authenticated, in the event that an account is compromised or a device stolen.

The logs are consistent across all applications and identity providers. The same standard fields are captured when contractors log in with their AzureAD instance to your supply chain tool as when your internal users authenticate with Okta to your Jira. You can also augment the data above with other request details like the TLS cipher used and WAF results.

How can this data be used?

The native logging capabilities of hosted applications vary wildly. Some tools provide more robust records of user activity, but others would require server-side code changes or workarounds to add this level of logging. Cloudflare Access can give your team the ability to skip that work and introduce logging in a single gateway that applies to all resources protected behind it.

The audit logs can be exported to third-party SIEM tools or S3 buckets for analysis and anomaly detection. The data can also be used for audit purposes in the event that a corporate device is lost or stolen. Security teams can then use this to recreate user sessions from logs as they investigate.
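Because the fields are consistent, session reconstruction can be scripted. The sketch below assumes exported logs arrive as newline-delimited JSON records shaped like the examples above:

import json
import sys

def print_session(user, log_path):
    # Collect every request the given Access user made, then order by time.
    requests = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("RequestHeaders", {}).get("cf-access-user") == user:
                requests.append(record)
    requests.sort(key=lambda r: r["EdgeStartTimestamp"])
    for r in requests:
        print(r["EdgeStartTimestamp"], r["ClientRequestMethod"],
              r["ClientRequestHost"] + r["ClientRequestURI"])

if __name__ == "__main__":
    print_session(sys.argv[1], sys.argv[2])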

What’s next?

Any enterprise customer with Logpush enabled can now use this feature at no additional cost. Instructions are available here to configure Logpush and additional documentation here to enable Access per-request logs.

How to enable encryption in a browser with the AWS Encryption SDK for JavaScript and Node.js

Post Syndicated from Spencer Janyk original https://aws.amazon.com/blogs/security/how-to-enable-encryption-browser-aws-encryption-sdk-javascript-node-js/

In this post, we’ll show you how to use the AWS Encryption SDK (“ESDK”) for JavaScript to handle an in-browser encryption workload for a hypothetical application. First, we’ll review some of the security and privacy properties of encryption, including the names AWS uses for the different components of a typical application. Then, we’ll discuss some of the reasons you might want to encrypt each of those components, with a focus on in-browser encryption, and we’ll describe how to perform that encryption using the ESDK. Lastly, we’ll talk about some of the security properties to be mindful of when designing an application, and where to find additional resources.

An overview of the security and privacy properties of encryption

Encryption is a technique that can restrict access to sensitive data by making it unreadable without a key. An encryption process takes data that is plainly readable or processable (“plaintext”) and uses principles of mathematics to obscure the contents so that it can’t be read without the use of a secret key. To preserve user privacy and prevent unauthorized disclosure of sensitive business data, developers need ways to protect sensitive data during the entire data lifecycle. Data needs to be protected from risks associated with unintentional disclosure as data flows between collection, storage, processing, and sharing components of an application. In this context, encryption is typically divided into two separate techniques: encryption at rest for storing data; and encryption in transit for moving data between entities or systems.

Many applications use encryption in transit to secure connections between their users and the services they provide, and then encrypt the data before it’s stored. However, as applications become more complex and data must be moved between more nodes and stored in more diverse places, there are more opportunities for data to be accidentally leaked or unintentionally disclosed. When a user enters their data in a browser, Transport Layer Security (TLS) can protect that data in transit between the user’s browser and a service endpoint. But in a distributed system, intermediary services between that endpoint and the service that processes that sensitive data might log or cache the data before transporting it. Encrypting sensitive data at the point of collection in the browser is a form of encryption at rest that minimizes the risk of unauthorized access and protects the data if it’s lost, stolen, or accidentally exposed. Encrypting data in the browser means that even if it’s completely exposed elsewhere, it’s unreadable and worthless to anyone without access to the key.

A typical web application

A typical web application will accept some data as input, process it, and then store it. When the user needs to access stored data, the data often follows the same path used when it was input. In our example, there are three primary components on that path (the browser, processing on Amazon EC2, and storage in Amazon S3), and the data flows like this:

Figure 1: A hypothetical web application where the application is composed of an end-user interacting with a browser front-end, a third party which processes data received from the browser, processing is performed in Amazon EC2, and storage happens in Amazon S3

  1. An end-user interacts with the application using an interface in the browser.
  2. As data is sent to Amazon EC2, it passes through the infrastructure of a third party which could be an Internet Service Provider, an appliance in the user’s environment, or an application running in the cloud.
  3. The application on Amazon EC2 processes the data once it has been received.
  4. Once the application is done processing data, it is stored in Amazon S3 until it is needed again.

As data moves between components, TLS is used to prevent inadvertent disclosure. But what if one or more of these components is a third-party service that doesn’t need access to sensitive data? That’s where encryption at rest comes in.

Encryption at rest is available as a server-side, client-side, and client-side in-browser protection. Server-side encryption (SSE) is the most commonly used form of encryption with AWS customers, and for good reason: it’s easy to use because it’s natively supported by many services, such as Amazon S3. When SSE is used, the service that’s storing data will encrypt each piece of data with a key (a “data key”) when it’s received, and then decrypt it transparently when it’s requested by an authorized user. This has the benefit of being seamless for application developers because they only need to check a box in Amazon S3 to enable encryption, and it also adds an additional level of access control by having separate permissions to download an object and perform a decryption operation. However, there is a security/convenience tradeoff to consider, because the service will allow any role with the appropriate permissions to perform a decryption. For additional control, many AWS services—including S3—support the use of customer-managed AWS Key Management Service (AWS KMS) customer master keys (CMKs) that allow you to specify key policies or use grants or AWS Identity and Access Management (IAM) policies to control which roles or users have access to decryption, and when. Configuring permission to decrypt using customer-managed CMKs is often sufficient to satisfy compliance regimes that require “application-level encryption.”
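For instance, requesting SSE with a customer-managed CMK is a single parameter on an S3 upload. In this sketch the bucket name and key ARN are hypothetical placeholders:

import boto3

s3 = boto3.client("s3")

# Store an object with SSE-KMS under a customer-managed CMK.
s3.put_object(
    Bucket="example-app-data",  # hypothetical bucket
    Key="records/user-123.json",
    Body=b'{"example": true}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
)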

Some threat models or compliance regimes may require client-side encryption (CSE), which can add a powerful additional level of access control at the expense of additional complexity. As noted above, services perform server-side encryption on data after it has left the boundary of your application. TLS is used to secure the data in transit to the service, but some customers might want to only manage encrypt/decrypt operations within their application on EC2 or in the browser. Applications can use the AWS Encryption SDK to encrypt data within the application trust boundary before it’s sent to a storage service.

But what about a use case where customers don’t even want plaintext data to leave the browser? Or what if end-users input data that is passed through or logged by intermediate systems that belong to a third-party? It’s possible to create a separate application that only manages encryption to ensure that your environment is segregated, but using the AWS Encryption SDK for JavaScript allows you to encrypt data in an end-user browser before it’s ever sent to your application, so only your end-user will be able to view their plaintext data. As you can see in Figure 2 below, in-browser encryption can allow data to be safely handled by untrusted intermediate systems while ensuring its confidentiality and integrity.

Figure 2: A hypothetical web application with encryption where the application is composed of an end-user interacting with a browser front-end, a third party which processes data received from the browser, processing is performed in Amazon EC2, and storage happens in Amazon S3

  1. The application in the browser requests a data key to encrypt sensitive data entered by the user before it is passed to a third party.
  2. Because the sensitive data has been encrypted, the third party cannot read it. The third party may be an Internet Service Provider, an appliance in the user’s environment, an application running in the cloud, or a variety of other actors.
  3. The application on Amazon EC2 can make a request to KMS to decrypt the data key so the data can be decrypted, processed, and re-encrypted.
  4. The encrypted object is stored in Amazon S3, where a second encryption request is made so the object can also be encrypted server side.

How to encrypt in the browser

The first step of in-browser encryption is including a copy of the AWS Encryption SDK for JavaScript with the scripts you’re already sending to the user when they access your application. Once it’s present in the end-user environment, it’s available for your application to make calls. To perform the encryption, the ESDK requests from the cryptographic materials provider a data key to encrypt with, plus an encrypted copy of that data key to store alongside the object being encrypted. After a piece of data is encrypted within the browser, the ciphertext can be uploaded to your application backend for processing or storage. When a user needs to retrieve the plaintext, the ESDK reads the metadata attached to the ciphertext to determine the appropriate method to decrypt the data key and, if the user has access to the CMK, decrypts the data key and then uses it to decrypt the data.
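
Here’s a minimal sketch of that round trip, assuming the v1 API of the AWS Encryption SDK for JavaScript; the credentials, key ARN, and encryption context are placeholders, so check the SDK documentation for the exact interface before relying on it.

import { KmsKeyringBrowser, KMS, getClient, encrypt, decrypt } from '@aws-crypto/client-browser';

// Placeholder credentials and CMK ARN; in a real application, scope browser
// credentials narrowly (for example, through Amazon Cognito).
const credentials = { accessKeyId: 'EXAMPLE', secretAccessKey: 'EXAMPLE' };
const generatorKeyId = 'arn:aws:kms:us-west-2:111122223333:key/example-key-id';

const clientProvider = getClient(KMS, { credentials });
const keyring = new KmsKeyringBrowser({ clientProvider, generatorKeyId });

async function roundTrip(plaintextBytes: Uint8Array) {
  // Encrypt in the browser, before the data ever leaves it. The returned
  // message bundles the ciphertext with the encrypted data key and metadata.
  const { result } = await encrypt(keyring, plaintextBytes, {
    encryptionContext: { purpose: 'in-browser-demo' },
  });

  // Decryption reads the message metadata to find the encrypted data key,
  // and succeeds only for callers with access to the CMK.
  const { plaintext, messageHeader } = await decrypt(keyring, result);
  return { ciphertext: result, plaintext, messageHeader };
}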

Important considerations

One common issue with browser-based applications is inconsistent feature support across different browser vendors and versions. For example, how will the application respond to browsers that lack native support for the strongest recommended cryptographic algorithm suites? Or, will there be a message or alternative mode if a user accesses the application using a browser that has JavaScript disabled? The ESDK for JavaScript natively supports a fallback mode, but it may not be appropriate for all use cases. Be sure to understand what kind of browser environments you will need to support to determine whether in-browser encryption is appropriate, and include support for graceful degradation if you expect limited browser support. Developers should also consider the ways that unauthorized users might monitor user actions via a browser extension, make unauthorized browser requests without user knowledge, or request a “downgraded” (less mathematically intensive) cryptographic operation.
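
As a starting point, here’s a hedged sketch of the kind of capability check this implies; the notice helper is a hypothetical UI function, and the right fallback behavior is application-specific.

// Capability check before offering in-browser encryption.
function supportsInBrowserCrypto(): boolean {
  return (
    typeof window !== 'undefined' &&
    window.crypto !== undefined &&
    window.crypto.subtle !== undefined &&
    typeof window.crypto.getRandomValues === 'function'
  );
}

// Hypothetical UI helper: explain the limitation instead of silently
// falling back to weaker cryptography.
function showUnsupportedBrowserNotice(): void {
  console.warn('In-browser encryption is not supported in this browser.');
}

if (!supportsInBrowserCrypto()) {
  showUnsupportedBrowserNotice();
}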

It’s always a good idea to have your application designs reviewed by security professionals. If you have an AWS Account Manager or Technical Account Manager, you can ask them to connect you with a Solutions Architect to review your design. If you’re an AWS customer but don’t have an account manager, consider visiting an AWS Loft to participate in our “Ask an Expert” program.

Where to learn more

If you have questions about this post, let us know in the Comments section below, or consult the AWS Encryption SDK Developer Forum. Because the Encryption SDK is open source, you can always contribute, open an issue, or ask questions on GitHub.

The AWS Encryption SDK for JavaScript is available at: https://github.com/awslabs/aws-encryption-sdk-javascript
Documentation is available at: https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/javascript.html

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Spencer Janyk

Spencer is a Senior Product Manager at Amazon Web Services working on data encryption and privacy. He has previously worked on vulnerability management and monitoring for enterprises and applying machine learning to challenges in ad tech, social media, diversity in recruiting, and talent management. Spencer holds a Master of Arts in Performance Studies from New York University and a Bachelor of Arts in Gender Studies from Whitman College.


Amanda Gray

Amanda is a Senior Security Engineer at Amazon Web Services on the Crypto Tools team. Previously, Amanda worked on application security and privacy by design, and she continues to promote these goals every day. Amanda holds Bachelor’s degrees in Physics and Computer Science from the University of Washington and Smith College respectively, and a Master’s degree in Physical Oceanography from the University of Washington.

AWS Security Profiles: Maritza Mills, Senior Product Manager, Perimeter Protection

Post Syndicated from Becca Crockett original https://aws.amazon.com/blogs/security/aws-security-profiles-maritza-mills-senior-product-manager-perimeter-protection/

Maritza Mills, Senior Product Manager
In the weeks leading up to re:Invent 2019, we’ll share conversations we’ve had with people at AWS who will be presenting at the event so you can learn more about them and some of the interesting work that they’re doing.


How long have you been at AWS, and what do you do in your current role?

I’ve been at AWS almost two years. I’m a product manager for our Perimeter Protection team, which includes products like AWS Web Application Firewall (WAF), AWS Shield and AWS Firewall Manager. I spend a lot of my time talking with customers—primarily security specialists and network engineers—about how they can protect their web applications and how they can defend against Distributed Denial of Service (DDoS) attacks. My work is about deeply understanding the technical challenges customers are facing. I then use that information to inform what we need to build next, and then I work with our engineering team to figure out how we deliver it.

What’s the most challenging part of your job?

Deciding how to prioritize what we work on next. We have AWS customers with a lot of different needs, but we only have so much time in a day. My team has to balance the most pressing customer challenges along with the challenges we anticipate customers will face in the future, plus how quickly we’ll be able to deliver solutions to those challenges. I wish that we could do everything, all the time, but we have to make difficult choices about which things we’re going to do first.

What’s your favorite part of your job?

Constantly learning something new from our customers. A big part of what I do involves listening to customers to understand their most difficult technical challenges, and every customer is different. A customer in healthcare will have different needs from a customer in finance versus one in gaming. It’s exciting to learn about the different problems each customer faces. Even at the same company, different teams may have different goals and approaches to security. Often, I might educate customers on the tools currently available to fit their needs, but there are also times when the solution a customer needs has not been invented yet, and that’s when things really get interesting.

What does cloud security mean to you on a personal level?

When I think about security in the cloud, it’s about security for individual people. If you store data in the cloud, part of “security” is protecting access to your personal information, like your messages and photos, or credit card numbers, or personal healthcare data.

But it’s not just about preventing unauthorized access. It’s also about making sure that peoples’ data are available for them when they need it. One of the big things that we focus on in Perimeter Protection—particularly in AWS Shield—is protecting applications from denial of service attacks so that the applications are always available. This means that when you need to access the money in your bank account, or say, when a hospital needs to access vital information about a patient, the apps are always up and available. When I think about security and what we’re doing at scale here at AWS, that’s what’s most important to me on a personal level.

What’s the most common misperception you encounter about cloud security?

Sometimes, customers might be tempted to use blanket protections without thinking about why their particular application or business is unique, and what different protections they should put in place as a result.

Cloud security is an ongoing discipline that requires continuously monitoring your applications and updating your controls as your applications change. At AWS, we have this concept of the shared responsibility model, where AWS handles security of the cloud itself and customers are responsible for securing the applications that they run in the cloud. We’ve designed several tools to help customers manage that responsibility and adapt and scale as quickly as their applications do. In Perimeter Protection specifically, services like AWS Firewall Manager are designed to give our customers central visibility of their security controls, such as Amazon VPC security groups, AWS WAF rules, and AWS Shield Advanced protections. Services like Firewall Manager also constantly monitor these configurations so that customers can be alerted when changes have occurred.

I encourage customers to think carefully about how their applications will change over time, and how to best monitor and adjust to those changes as they occur.

What challenges do you currently see in the application security space, and how do you think the field will evolve to meet those challenges?

One challenge that I currently see is the pace of change, and the fact that customers need ways to keep up with these changes.

In the past, many security controls have been static—you set them up, and they don’t change. But as our customers have migrated into AWS, they’re able to operate in a more dynamic way and to scale up or down more quickly than they could before. At the same time, we’ve seen the techniques used to gain unauthorized access or to launch DDoS attacks scale and become more sophisticated. Here at AWS, we’re constantly looking ahead to anticipate how customers will need to actively monitor and secure their applications, and then we build those capabilities into our services.

Today, services like AWS Shield can automatically detect and mitigate DDoS attacks and provide you with alarms and the ability to continuously monitor your network flows. AWS WAF gives you the ability to write custom rules so you can create granular protections for your specific environment. We also provide you with information regarding security best practices so you can proactively architect your applications in a way that allows you to quickly react to new and unique attack vectors. That’s part of what we’ll be addressing in our upcoming re:Invent talk, as well.

You and Paul Oremland are leading a re:Invent session called A defense-in-depth approach to building web applications. What can you tell us about the session that’s not described in the catalog?

In this session, we’ll start by reviewing common security vulnerabilities, and then provide detailed examples of how to mitigate them at each layer of the application. I expect attendees will gain a better sense of how those layers fit together and how to think creatively about their individual security needs based on how they’ve architected their system, or based on their specific business case. Finally, I want all customers, from startups to enterprises, to understand how those challenges change as they scale. We’ll be touching on all of that.

It’s a 400-level session, so it’s a technical deep dive. It’s going to have a lot of good information for security specialists and engineers who want to have hands-on examples that they can go back and use. But I also want to encourage people who are exploring or are newer to this space to join us because even if the hands-on portion is a little too advanced, I think the strategy and philosophy of how to think about application security is going to be very relevant even to those less familiar with the subject matter, and to the work that they might do in the future.

What are you hoping that your audience will do differently as a result of attending?

I want to motivate attendees to perform a review of their current architecture and consider the current controls that they have in place. Then, I’d like them to ask themselves, “Why did I put this control here?” and “Do I know exactly what risk each control is mitigating?” I’d also like them to consider whether there are protections they’ve opted not to use in the past, and whether that decision is still an acceptable risk.

How did you choose your topic?

We developed it based on numerous conversations we’ve had with customers when they’re exploring how to protect their applications at the edge. But, we usually find that the conversation expands into other parts of the stack that need protection as well. One goal of this session is to talk about these needs up front, so that customers can come into conversations with us already knowing how they’d like to protect their entire application.

Any advice for first-time attendees coming to re:Invent?

Make sure you have enough time to get to your next session. There’s a lot of different things going on at re:Invent, and they take place in a lot of different buildings. While I think we do a great job with the schedule and spacing, first-time attendees should be aware that they might have a session in one building and then need to immediately be in another building for their next session. Factor that into your commute plans.

You enjoy discussing song lyrics. Whose lyrics have you enjoyed the most?

Rush is one of my favorite bands when it comes to lyricism. As a kid, I just found the music interesting. But as I’ve gotten older, certain lines hit me differently.

In the song “Dreamline,” there’s a particular verse that says:

When we are young
Wandering the face of the earth
Wondering what our dreams might be worth
Learning that we’re only immortal
For a limited time

When I was younger, I really could relate to that feeling of immortality in a way, as if I was going to be around forever. But as I’ve gotten older, I’ve realized that life is very short and very precious, and I want to make the most of it. So I enjoy going back to that song every single time. It’s changed for me as I’ve grown.

And what song has created the lengthiest discussion for you?

I’ve had some great conversations about “Fast Car” by Tracy Chapman. The themes in that song are relatable to people in so many different ways, and at different times in their lives. One of the great things about song lyrics is that the way people interpret a song is influenced by their personal experiences in life, and this song in particular has always opened up meaningful conversations for me.

Want more AWS Security news? Follow us on Twitter.

The AWS Security team is hiring! Want to find out more? Check out our career page.


Maritza Mills

Maritza is a Senior Product Manager for AWS WAF, Shield and Firewall Manager.

Going Keyless Everywhere

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/going-keyless-everywhere/


Time flies. The Heartbleed vulnerability was discovered just over five and a half years ago. Heartbleed became a household name not only because it was one of the first bugs with its own web page and logo, but because of what it revealed about the fragility of the Internet as a whole. With Heartbleed, one tiny bug in a cryptography library exposed the personal data of the users of almost every website online.

Heartbleed is an example of an underappreciated class of bugs: remote memory disclosure vulnerabilities. High profile examples other than Heartbleed include Cloudbleed and most recently NetSpectre. These vulnerabilities allow attackers to extract secrets from servers by simply sending them specially-crafted packets. Cloudflare recently completed a multi-year project to make our platform more resilient against this category of bug.

For the last five years, the industry has been dealing with the consequences of the design that led to Heartbleed being so impactful. In this blog post we’ll dig into memory safety, and how we re-designed Cloudflare’s main product to protect private keys from the next Heartbleed.

Memory Disclosure

Perfect security is not possible for businesses with an online component. History has shown us that no matter how robust their security program, an unexpected exploit can leave a company exposed. One of the more famous recent incidents of this sort is Heartbleed, a vulnerability in a commonly used cryptography library called OpenSSL that exposed the inner details of millions of web servers to anyone with a connection to the Internet. Heartbleed made international news, caused millions of dollars of damage, and still hasn’t been fully resolved.

Typical web services only return data via well-defined public-facing interfaces called APIs. Clients don’t typically get to see what’s going on under the hood inside the server; that would be a huge privacy and security risk. Heartbleed broke that paradigm: it enabled anyone on the Internet to take a peek at the operating memory used by web servers, revealing privileged data usually not exposed via the API. Heartbleed could be used to extract data previously sent to the server, including passwords and credit cards. It could also reveal the inner workings and cryptographic secrets used inside the server, including TLS certificate private keys.

Heartbleed let attackers peek behind the curtain, but not too far. Sensitive data could be extracted, but not everything on the server was at risk. For example, Heartbleed did not enable attackers to steal the content of databases held on the server. You may ask: why was some data at risk but not others? The reason has to do with how modern operating systems are built.

A simplified view of process isolation

Most modern operating systems are split into multiple layers. These layers are analogous to security clearance levels. So-called user-space applications (like your browser) typically live in a low-security layer called user space. They only have access to computing resources (memory, CPU, networking) if the lower, more credentialed layers let them.

User-space applications need resources to function. For example, they need memory to store their code and working memory to do computations. However, it would be risky to give an application direct access to the physical RAM of the computer it’s running on. Instead, the raw computing elements are restricted to a lower layer called the operating system kernel. The kernel only runs specially designed programs that safely manage these resources and mediate access to them for user-space applications.

When a new user-space application process is launched, the kernel gives it a virtual memory space. This virtual memory space acts like real memory to the application but is actually a safely guarded translation layer the kernel uses to protect the real memory. Each application’s virtual memory space is like a parallel universe dedicated to that application. This makes it impossible for one process to view or modify another’s; the other applications’ memory is simply not addressable.


Heartbleed, Cloudbleed and the process boundary

Heartbleed was a vulnerability in the OpenSSL library, which was part of many web server applications. These web servers run in user space, like any common applications. This vulnerability caused the web server to return up to 2 kilobytes of its memory in response to a specially-crafted inbound request.

Cloudbleed was also a memory disclosure bug, albeit one specific to Cloudflare, that got its name because it was so similar to Heartbleed. With Cloudbleed, the vulnerability was not in OpenSSL, but instead in a secondary web server application used for HTML parsing. When this code parsed a certain sequence of HTML, it ended up inserting some process memory into the web page it was serving.


It’s important to note that both of these bugs occurred in applications running in user space, not kernel space. This means that the memory exposed by the bug was necessarily part of the virtual memory of the application. Even if the bug were to expose megabytes of data, it would only expose data specific to that application, not other applications on the system.

In order for a web server to serve traffic over the encrypted HTTPS protocol, it needs access to the certificate’s private key, which is typically kept in the application’s memory. These keys were exposed to the Internet by Heartbleed. The Cloudbleed vulnerability affected a different process, the HTML parser, which doesn’t do HTTPS and therefore doesn’t keep the private key in memory. This meant that HTTPS keys were safe, even if other data in the HTML parser’s memory space wasn’t.


The fact that the HTML parser and the web server were different applications saved us from having to revoke and re-issue our customers’ TLS certificates. However, if another memory disclosure vulnerability is discovered in the web server, these keys are again at risk.

Moving keys out of Internet-facing processes

Not all web servers keep private keys in memory. In some deployments, private keys are held in a separate machine called a Hardware Security Module (HSM). HSMs are built to withstand physical intrusion and tampering, and are often certified against stringent compliance requirements. They can be bulky and expensive. Web servers designed to take advantage of keys in an HSM connect to them over a physical cable and communicate using a specialized protocol called PKCS#11. This allows the web server to serve encrypted content while being physically separated from the private key.


At Cloudflare, we built our own way to separate a web server from a private key: Keyless SSL. Rather than keeping the keys in a separate physical machine connected to the server with a cable, the keys are kept in a key server operated by the customer in their own infrastructure (this can also be backed by an HSM).


More recently, we launched Geo Key Manager, a service that allows users to store private keys in only select Cloudflare locations. Connections to locations that do not have access to the private key use Keyless SSL with a key server hosted in a datacenter that does have access.

In both Keyless SSL and Geo Key Manager, private keys are not only not part of the web server’s memory space, they’re often not even in the same country! This extreme degree of separation is not necessary to protect against the next Heartbleed. All that is needed is for the web server and the key server to not be part of the same application. So that’s what we did. We call this Keyless Everywhere.


Keyless SSL is coming from inside the house

Repurposing Keyless SSL for Cloudflare-held private keys was easy to conceptualize, but the path from ideation to live in production wasn’t so straightforward. The core functionality of Keyless SSL comes from the open-source gokeyless project, which customers run on their infrastructure; internally, we use it as a library and have replaced the main package with an implementation suited to our requirements (we’ve creatively dubbed it gokeyless-internal).

As with all major architecture changes, it’s prudent to start by testing the model with something new and low risk. In our case, the test bed was our experimental TLS 1.3 implementation. In order to quickly iterate through draft versions of the TLS specification and push releases without affecting the majority of Cloudflare customers, we re-wrote our custom nginx web server in Go and deployed it in parallel to our existing infrastructure. This server was designed from the start to never hold private keys and to leverage only gokeyless-internal. At this time there was only a small amount of TLS 1.3 traffic, and it was all coming from the beta versions of browsers, which allowed us to work through the initial kinks of gokeyless-internal without exposing the majority of visitors to security risks or outages.

The first step towards making TLS 1.3 fully keyless was identifying and implementing the new functionality we needed to add to gokeyless-internal. Keyless SSL was designed to run on customer infrastructure, with the expectation of supporting only a handful of private keys. But our edge must simultaneously support millions of private keys, so we implemented the same lazy loading logic we use in our web server, nginx. Furthermore, a typical customer deployment would put key servers behind a network load balancer, so they could be taken out of service for upgrades or other maintenance. Contrast this with our edge, where it’s important to maximize our resources by serving traffic during software upgrades. This problem is solved by the excellent tableflip package we use elsewhere at Cloudflare.
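
As an illustration of that lazy-loading pattern, here’s a sketch in TypeScript rather than Go, with a hypothetical storage lookup; the actual gokeyless-internal logic differs.

// Load each private key the first time it is requested, and deduplicate
// concurrent loads, so millions of keys never need to be resident at once.
const keyCache = new Map<string, Promise<string>>();

// Hypothetical storage lookup standing in for the real key fetch.
async function loadKeyFromStorage(keyId: string): Promise<string> {
  return `pem-encoded-key-for-${keyId}`;
}

function getPrivateKey(keyId: string): Promise<string> {
  let pending = keyCache.get(keyId);
  if (pending === undefined) {
    pending = loadKeyFromStorage(keyId);
    // Caching the promise itself means concurrent requests for the same key
    // share one load instead of issuing duplicate fetches.
    keyCache.set(keyId, pending);
  }
  return pending;
}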

The next project to go Keyless was Spectrum, which launched with default support for gokeyless-internal. With these small victories in hand, we had the confidence necessary to attempt the big challenge, which was porting our existing nginx infrastructure to a fully keyless model. After implementing the new functionality, and being satisfied with our integration tests, all that was left was to turn it on in production and call it a day, right? Anyone with experience with large distributed systems knows how far “working in dev” is from “done,” and this story is no different. Thankfully we were anticipating problems, and built a fallback into nginx to complete the handshake itself if any problems were encountered with the gokeyless-internal path. This allowed us to expose gokeyless-internal to production traffic without risking downtime in the event that our reimplementation of the nginx logic was not 100% bug-free.

When rolling back the code doesn’t roll back the problem

Our deployment plan was to enable Keyless Everywhere, find the most common causes of fallbacks, and then fix them. We could then repeat this process until all sources of fallbacks had been eliminated, after which we could remove access to private keys (and therefore the fallback) from nginx. One of the early causes of fallbacks was gokeyless-internal returning ErrKeyNotFound, indicating that it couldn’t find the requested private key in storage. This should not have been possible, since nginx only makes a request to gokeyless-internal after first finding the certificate and key pair in storage, and we always write the private key and certificate together. It turned out that in addition to returning the error for the intended case of the key truly not found, we were also returning it when transient errors like timeouts were encountered. To resolve this, we updated those transient error conditions to return ErrInternal, and deployed to our canary datacenters. Strangely, we found that a handful of instances in a single datacenter started encountering high rates of fallbacks, and the logs from nginx indicated it was due to a timeout between nginx and gokeyless-internal. The timeouts didn’t occur right away, but once a system started logging some timeouts it never stopped. Even after we rolled back the release, the fallbacks continued with the old version of the software! Furthermore, while nginx was complaining about timeouts, gokeyless-internal seemed perfectly healthy and was reporting reasonable performance metrics (sub-millisecond median request latency).


To debug the issue, we added detailed logging to both nginx and gokeyless, and followed the chain of events backwards once timeouts were encountered.

➜ ~ grep 'timed out' nginx.log | grep Keyless | head -5
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015157 Keyless SSL request/response timed out while reading Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015231 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015271 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:49.000 29m41 2018/07/25 05:30:49 [error] 4525#0: *1015280 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1
2018-07-25T05:30:50.000 29m41 2018/07/25 05:30:50 [error] 4525#0: *1015289 Keyless SSL request/response timed out while waiting for Keyless SSL response, keyserver: 127.0.0.1

You can see that the first request to log a timeout had id 1015157. It’s also interesting that the first log line is “timed out while reading,” but all the others are “timed out while waiting,” and this latter message is the one that continues forever. Here is the matching request in the gokeyless log:

➜ ~ grep 'id=1015157 ' gokeyless.log | head -1
2018-07-25T05:30:39.000 29m41 2018/07/25 05:30:39 [DEBUG] connection 127.0.0.1:30520: worker=ecdsa-29 opcode=OpECDSASignSHA256 id=1015157 sni=announce.php?info_hash=%a8%9e%9dc%cc%3b1%c8%23%e4%93%21r%0f%92mc%0c%15%89&peer_id=-ut353s-%ce%ad%5e%b1%99%06%24e%d5d%9a%08&port=42596&uploaded=65536&downloaded=0&left=0&corrupt=0&key=04a184b7&event=started&numwant=200&compact=1&no_peer_id=1 ip=104.20.33.147

Aha! That SNI value is clearly invalid (SNIs are like Host headers, i.e. they are domains, not URL paths), and it’s also quite long. Our storage system indexes certificates based on two indices: which SNI they correspond to, and which IP addresses they correspond to (for older clients that don’t support SNI). Our storage interface uses the memcached protocol, and the client library that gokeyless-internal uses rejects requests for keys longer than 250 characters (memcached’s maximum key length), whereas the nginx logic is to simply ignore the invalid SNI and treat the request as if it only had an IP. The change in our new release had shifted this condition from ErrKeyNotFound to ErrInternal, which triggered cascading problems in nginx. The “timeouts” it encountered were actually a result of throwing away all in-flight requests multiplexed on a connection that happened to return ErrInternal for a single request. These requests were retried, but once this condition triggered, nginx became overloaded by the number of retried requests plus the continuous stream of new requests coming in with bad SNI, and was unable to recover. This explains why rolling back gokeyless-internal didn’t fix the problem.

This discovery finally brought our attention to nginx, which thus far had escaped blame since it had been working reliably with customer key servers for years. However, communicating over localhost to a multitenant key server is fundamentally different than reaching out over the public Internet to communicate with a customer’s key server, and we had to make the following changes:

  • Instead of a long connection timeout and a relatively short response timeout for customer key servers, extremely short connection timeouts and longer request timeouts are appropriate for a localhost key server.
  • Similarly, it’s reasonable to retry (with backoff) if we time out waiting on a customer key server response, since we can’t trust the network. But over localhost, a timeout would only occur if gokeyless-internal were overloaded and the request were still queued for processing. In this case a retry would only lead to more total work being requested of gokeyless-internal, making the situation worse (see the sketch after this list).
  • Most significantly, nginx must not throw away all requests multiplexed on a connection if any single one of them encounters an error, since a single connection no longer represents a single customer.
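
The sketch below illustrates the first two points. The shape and the specific timeout values are hypothetical, not our actual configuration.

// Hypothetical policy shapes; the values are illustrative only.
interface KeyServerPolicy {
  connectTimeoutMs: number;
  requestTimeoutMs: number;
  retries: number;
}

function policyFor(keyServerIsLocalhost: boolean): KeyServerPolicy {
  if (keyServerIsLocalhost) {
    // Loopback connections should be nearly instant, and a slow response
    // means the key server is overloaded, so retrying only adds load.
    return { connectTimeoutMs: 10, requestTimeoutMs: 3000, retries: 0 };
  }
  // A customer key server sits across the public Internet: tolerate slow
  // connection establishment and retry (with backoff) on lost responses.
  return { connectTimeoutMs: 5000, requestTimeoutMs: 1000, retries: 2 };
}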

Implementations matter

CPU at the edge is one of our most precious assets, and it’s closely guarded by our performance team (aka CPU police). Soon after turning on Keyless Everywhere in one of our canary datacenters, they noticed gokeyless using ~50% of a core per instance. We were shifting the sign operations from nginx to gokeyless, so of course it would be using more CPU now. But nginx should have seen a commensurate reduction in CPU usage, right?


Wrong. Elliptic curve operations are very fast in Go, but Go’s RSA operations are known to be much slower than their BoringSSL counterparts.

Although Go 1.11 includes optimizations for RSA math operations, we needed more speed. Well-tuned assembly code is required to match the performance of BoringSSL, so Armando Faz from our Crypto team helped claw back some of the lost CPU by reimplementing parts of the math/big package with platform-dependent assembly in an internal fork of Go. Go’s recent assembly policy prefers portable Go code over assembly, so these optimizations were not upstreamed. There is still room for more optimizations, and for that reason we’re still evaluating a move to cgo + BoringSSL for sign operations, despite cgo’s many downsides.

Changing our tooling

Process isolation is a powerful tool for protecting secrets in memory. Our move to Keyless Everywhere demonstrates that this is not a simple tool to leverage. Re-architecting an existing system such as nginx to use process isolation to protect secrets was time-consuming and difficult. Another approach to memory safety is to use a memory-safe language such as Rust.

Rust was originally developed by Mozilla but is starting to be used much more widely. The main advantage that Rust has over C/C++ is that it has memory safety features without a garbage collector.

Re-writing an existing application in a new language such as Rust is a daunting task. That said, many new Cloudflare features, from the powerful Firewall Rules feature to our 1.1.1.1 with WARP app, have been written in Rust to take advantage of its powerful memory-safety properties. We’re really happy with Rust so far and plan on using it even more in the future.

Conclusion

The harrowing aftermath of Heartbleed taught the industry a lesson that should have been obvious in retrospect: keeping important secrets in applications that can be accessed remotely via the Internet is a risky security practice. In the following years, with a lot of work, we leveraged process separation and Keyless SSL to ensure that the next Heartbleed wouldn’t put customer keys at risk.

However, this is not the end of the road. Memory disclosure vulnerabilities such as NetSpectre have recently been discovered that can bypass application process boundaries, so we continue to actively explore new ways to keep keys secure.


Delegated Credentials for TLS

Post Syndicated from Nick Sullivan original https://blog.cloudflare.com/keyless-delegation/


Today we’re happy to announce support for a new cryptographic protocol that helps make it possible to deploy encrypted services in a global network while still maintaining fast performance and tight control of private keys: Delegated Credentials for TLS. We have been working with partners from Facebook, Mozilla, and the broader IETF community to define this emerging standard. We’re excited to share the gory details today in this blog post.

Deploying TLS globally

Many of the technical problems we face at Cloudflare are widely shared problems across the Internet industry. As gratifying as it can be to solve a problem for ourselves and our customers, it can be even more gratifying to solve a problem for the entire Internet. For the past three years, we have been working with peers in the industry to solve a specific shared problem in the TLS infrastructure space: How do you terminate TLS connections while storing keys remotely and maintaining performance and availability? Today we’re announcing that Cloudflare now supports Delegated Credentials, the result of this work.

Cloudflare’s TLS/SSL features are among the top reasons customers use our service. Configuring TLS is hard to do without internal expertise. By automating TLS, web site and web service operators gain the latest TLS features and the most secure configurations by default. It also reduces the risk of outages or bad press due to misconfigured or insecure encryption settings. Customers also gain early access to unique features like TLS 1.3, post-quantum cryptography, and OCSP stapling as they become available.

Unfortunately, for web services to authorize a service to terminate TLS for them, they have to trust the service with their private keys, which demands a high level of trust. For services with a global footprint, there is an additional level of nuance. They may operate multiple data centers located in places with varying levels of physical security, and each of these needs to be trusted to terminate TLS.

To tackle these problems of trust, Cloudflare has invested in two technologies: Keyless SSL, which allows customers to use Cloudflare without sharing their private key with Cloudflare; and Geo Key Manager, which allows customers to choose the datacenters in which Cloudflare should keep their keys. Both of these technologies can be deployed without any changes to browsers or other clients. They also come with some downsides in the form of availability and performance degradation.

Keyless SSL introduces extra latency at the start of a connection. In order for a server without access to a private key to establish a connection with a client, that server needs to reach out to a key server, or a remote point of presence, and ask it to perform a private key operation. This not only adds additional latency to the connection, causing the content to load slower, but it also introduces some troublesome operational constraints on the customer. Specifically, the server with access to the key needs to be highly available or the connection can fail. Sites often use Cloudflare to improve their site’s availability, so having to run a high-availability key server is an unwelcome requirement.

Turning a pull into a push

The reason services like Keyless SSL that rely on remote keys are so brittle is their architecture: they are pull-based rather than push-based. Every time a client attempts a handshake with a server that doesn’t have the key, it needs to pull the authorization from the key server. An alternative way to build this sort of system is to periodically push a short-lived authorization key to the server and use that for handshakes. Switching from a pull-based model to a push-based model eliminates the additional latency, but it comes with additional requirements, including the need to change the client.

Enter the new TLS feature of Delegated Credentials (DCs). A delegated credential is a short-lasting key that the certificate’s owner has delegated for use in TLS. They work like a power of attorney: your server authorizes our server to terminate TLS for a limited time. When a browser that supports this protocol connects to our edge servers, we can show it this “power of attorney” instead of needing to reach back to a customer’s server to get it to authorize the TLS connection. This reduces latency and improves performance and reliability.

Figure: the pull model

Figure: the push model

A fresh delegated credential can be created and pushed out to TLS servers long before the previous credential expires. Momentary blips in availability will not lead to broken handshakes for clients that support delegated credentials. Furthermore, a Delegated Credentials-enabled TLS connection is just as fast as a standard TLS connection: there’s no need to connect to the key server for every handshake. This removes the main drawback of Keyless SSL for DC-enabled clients.

Delegated credentials are intended to be an Internet Standard RFC that anyone can implement and use, not a replacement for Keyless SSL. Since browsers will need to be updated to support the standard, proprietary mechanisms like Keyless SSL and Geo Key Manager will continue to be useful. Delegated credentials aren’t just useful in our context, which is why we’ve developed the standard openly and with contributions from across industry and academia. Facebook has integrated them into their own TLS implementation, and you can read more about how they view the security benefits here. When it comes to improving the security of the Internet, we’re all on the same team.

"We believe delegated credentials provide an effective way to boost security by reducing certificate lifetimes without sacrificing reliability. This will soon become an Internet standard and we hope others in the industry adopt delegated credentials to help make the Internet ecosystem more secure."

Subodh Iyengar, software engineer at Facebook

Extensibility beyond the PKI

At Cloudflare, we’re interested in pushing the state of the art forward by experimenting with new algorithms. In TLS, there are three main areas of experimentation: ciphers, key exchange algorithms, and authentication algorithms. Ciphers and key exchange algorithms are only dependent on two parties: the client and the server. This freedom allows us to deploy exciting new choices like ChaCha20-Poly1305 or post-quantum key agreement in lockstep with browsers. On the other hand, the authentication algorithms used in TLS are dependent on certificates, which introduces certificate authorities and the entire public key infrastructure into the mix.

Unfortunately, the public key infrastructure is very conservative in its choice of algorithms, making it harder to adopt newer cryptography for authentication algorithms in TLS. For instance, EdDSA, a highly-regarded signature scheme, is not supported by certificate authorities, and root programs limit the certificates that will be signed. With the emergence of quantum computing, experimenting with new algorithms is essential to determine which solutions are deployable and functional on the Internet.

Since delegated credentials introduce the ability to use new authentication key types without requiring changes to certificates themselves, this opens up a new area of experimentation. Delegated credentials can be used to provide a level of flexibility in the transition to post-quantum cryptography, by enabling new algorithms and modes of operation to coexist with the existing PKI infrastructure. It also enables tiny victories, like the ability to use smaller, faster Ed25519 signatures in TLS.

Inside DCs

A delegated credential contains a public key and an expiry time. This bundle is then signed with the certificate’s private key and presented along with the certificate itself, binding the delegated credential to the certificate for which it is acting as “power of attorney”. A supporting client indicates its support for delegated credentials by including an extension in its Client Hello.
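
For reference, here’s the credential structure rendered as illustrative types rather than the TLS presentation language. Field names follow the draft at the time of writing; treat the details as an assumption and consult the specification.

// Illustrative rendering of the delegated credential structure; this is not
// an executable TLS codec, and the authoritative definition is the draft.
interface Credential {
  validTime: number; // seconds, relative to the certificate's notBefore
  expectedCertVerifyAlgorithm: number; // SignatureScheme of the DC's key
  subjectPublicKeyInfo: Uint8Array; // DER-encoded public key
}

interface DelegatedCredential {
  cred: Credential;
  algorithm: number; // SignatureScheme the certificate's key signed with
  signature: Uint8Array; // signature by the certificate's key over the credential
}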

A server that supports delegated credentials composes the TLS Certificate Verify and Certificate messages as usual, but instead of signing with the certificate’s private key, it includes the certificate along with the DC, and signs with the DC’s private key. Therefore, the private key of the certificate only needs to be used for the signing of the DC.

Certificates used for signing delegated credentials require a special X.509 certificate extension. This requirement exists to avoid breaking assumptions people may have about the impact of temporary access to their keys on security, particularly in cases involving HSMs and the still-unfixed Bleichenbacher oracles in older TLS versions. Temporary access to a key can enable signing lots of delegated credentials that start far in the future, and as a result support was made opt-in. Early versions of QUIC had similar issues, and ended up adopting TLS to fix them. Protocol evolution on the Internet requires working well with already existing protocols and their flaws.

Delegated Credentials at Cloudflare and Beyond

Currently we use delegated credentials as a performance optimization for Geo Key Manager and Keyless SSL. Customers can update their certificates to include the special extension for delegated credentials, and we will automatically create delegated credentials and distribute them to the edge through Keyless SSL or Geo Key Manager. For more information, see the documentation. Delegated credentials also enable us to be more conservative about where we keep keys for customers, improving our security posture.

Delegated Credentials would be useless if they weren’t also supported by browsers and other HTTP clients. Christopher Patton, a former intern at Cloudflare, implemented support in Firefox and its underlying NSS security library. This feature is now in the Nightly versions of Firefox. You can turn it on by activating the configuration option security.tls.enable_delegated_credentials at about:config. Studies are ongoing on how effective this will be in a wider deployment. There is also support for Delegated Credentials in BoringSSL.

"At Mozilla we welcome ideas that help to make the Web PKI more robust. The Delegated Credentials feature can help to provide secure and performant TLS connections for our users, and we’re happy to work with Cloudflare to help validate this feature."

Thyla van der Merwe, Cryptography Engineering Manager at Mozilla

One open issue is the question of client clock accuracy. Until we have a wide-scale study we won’t know how many connections using delegated credentials will break because of the 24-hour time limit that is imposed. Some clients, in particular mobile clients, may have inaccurately set clocks, the root cause of one third of all certificate errors in Chrome. Part of the way that we’re aiming to solve this problem is through standardizing and improving Roughtime, so web browsers and other services that need to validate certificates can do so independent of the client clock.

Cloudflare’s global scale means that we see connections from every corner of the world, and from many different kinds of connection and device. That reach enables us to find rare problems with the deployability of protocols. For example, our early deployment helped inform the development of the TLS 1.3 standard. As we enable developing protocols like delegated credentials, we learn about obstacles that inform and affect their future development.

Conclusion

As new protocols emerge, we’ll continue to play a role in their development and bring their benefits to our customers. Today’s announcement of a technology that overcomes some limitations of Keyless SSL is just one example of how Cloudflare takes part in improving the Internet not just for our customers, but for everyone. During the standardization process of turning the draft into an RFC, we’ll continue to maintain our implementation and come up with new ways to apply delegated credentials.

Announcing cfnts: Cloudflare’s implementation of NTS in Rust

Post Syndicated from Watson Ladd original https://blog.cloudflare.com/announcing-cfnts/


Several months ago we announced that we were providing a new public time service. Part of what we were providing was the first major deployment of the new Network Time Security (NTS) protocol, with a newly written implementation of NTS in Rust. In the process, we received helpful advice from the NTP community, especially from the NTPSec and Chrony projects. We’ve also participated in several interoperability events. Now we are returning something to the community: Our implementation, cfnts, is now open source and we welcome your pull requests and issues.

The journey from a blank source file to a working, deployed service was a lengthy one, and it involved many people across multiple teams.


"Correct time is a necessity for most security protocols in use on the Internet. Despite this, secure time transfer over the Internet has previously required complicated configuration on a case by case basis. With the introduction of NTS, secure time synchronization will finally be available for everyone. It is a small, but important, step towards increasing security in all systems that depend on accurate time. I am happy that Cloudflare are sharing their NTS implementation. A diversity of software with NTS support is important for quick adoption of the new protocol."

Marcus Dansarie, coauthor of the NTS specification


How NTS works

NTS is structured as a suite of two sub-protocols. The first is the Network Time Security Key Exchange (NTS-KE), which is always conducted over Transport Layer Security (TLS) and handles the creation of key material and parameter negotiation for the second protocol. The second is NTPv4, the current version of the NTP protocol, which allows the client to synchronize their time from the remote server.

In order to maintain the scalability of NTPv4, it was important that the server not maintain per-client state. A very small server can serve millions of NTP clients. Maintaining this property while providing security is achieved with cookies that the server provides to the client that contain the server state.

In the first stage, the client sends a request to the NTS-KE server and gets a response via TLS. This exchange carries out a number of functions (a sketch of the record encoding follows the list):

  • Negotiates the AEAD algorithm to be used in the second stage.
  • Negotiates the second protocol. Currently, the standard only defines how NTS works with NTPv4.
  • Negotiates the NTP server IP address and port.
  • Creates cookies for use in the second stage.
  • Creates two symmetric keys (C2S and S2C) from the TLS session via exporters.
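
Concretely, each NTS-KE message is a sequence of type-length-value records carried over the TLS channel. The sketch below encodes a minimal client request; the record and algorithm numbers follow the published NTS specification, but verify them against the RFC before depending on them.

// Encode one NTS-KE record: a 16-bit type (high bit marks it critical),
// a 16-bit body length, then the body, all in network byte order.
function record(critical: boolean, recordType: number, body: Uint8Array): Uint8Array {
  const out = new Uint8Array(4 + body.length);
  const view = new DataView(out.buffer);
  view.setUint16(0, (critical ? 0x8000 : 0) | recordType);
  view.setUint16(2, body.length);
  out.set(body, 4);
  return out;
}

const u16 = (v: number) => new Uint8Array([v >> 8, v & 0xff]);

// A minimal client request: negotiate NTPv4 (next protocol 0) with
// AEAD_AES_SIV_CMAC_256 (AEAD algorithm 15), then End of Message.
const clientRequest = [
  record(true, 1, u16(0)), // NTS Next Protocol Negotiation
  record(false, 4, u16(15)), // AEAD Algorithm Negotiation
  record(true, 0, new Uint8Array(0)), // End of Message
];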


In the second stage, the client securely synchronizes the clock with the negotiated NTP server. To synchronize securely, the client sends NTPv4 packets with four special extensions:

  • Unique Identifier Extension contains a random nonce used to prevent replay attacks.
  • NTS Cookie Extension contains one of the cookies that the client stores. Since currently only the client remembers the two AEAD keys (C2S and S2C), the server needs to use the cookie from this extension to extract the keys. Each cookie contains the keys encrypted under a secret key the server has.
  • NTS Cookie Placeholder Extension is a signal from the client to request additional cookies from the server. This extension is needed to make sure that the response is not much longer than the request to prevent amplification attacks.
  • NTS Authenticator and Encrypted Extension Fields Extension contains a ciphertext from the AEAD algorithm with C2S as a key and with the NTP header, timestamps, and all the previously mentioned extensions as associated data. Other possible extensions can be included as encrypted data within this field. Without this extension, the timestamp can be spoofed.

After getting a request, the server sends a response back to the client echoing the Unique Identifier Extension to prevent replay attacks, the NTS Cookie Extension to provide the client with more cookies, and the NTS Authenticator and Encrypted Extension Fields Extension with an AEAD ciphertext with S2C as a key. In the server response, however, the NTS Cookie Extension is encrypted within the AEAD rather than sent in plaintext, to provide unlinkability of the NTP requests.

The second handshake can be repeated many times without going back to the first stage since each request and response gives the client a new cookie. The expensive public key operations in TLS are thus amortized over a large number of requests. Furthermore, specialized timekeeping devices like FPGA implementations only need to implement a few symmetric cryptographic functions and can delegate the complex TLS stack to a different device.

Why Rust?

While many of our services are written in Go, and we have considerable experience with Go on the Crypto team, a garbage collection pause in the middle of responding to an NTP packet would negatively impact accuracy. We picked Rust because of its low runtime overhead and useful language features.

  • Memory safety After Heartbleed, Cloudbleed, and the steady drip of vulnerabilities caused by C’s lack of memory safety, it’s clear that C is not a good choice for new software dealing with untrusted inputs. The obvious solution for memory safety is to use garbage collection, but garbage collection has a substantial runtime overhead, while Rust has less runtime overhead.
  • Non-nullability Null pointers are an edge case that is frequently not handled properly. Rust explicitly marks optionality, so all references in Rust can be safely dereferenced. The type system ensures that option types are properly handled.
  • Thread safety Data-race prevention is another key feature of Rust. Rust’s ownership model ensures that all cross-thread accesses are synchronized by default. While not a panacea, this eliminates a major class of bugs.
  • Immutability Separating types into mutable and immutable is very important for reducing bugs. For example, in Java, when you pass an object into a function as a parameter, after the function is finished, you will never know whether the object has been mutated or not. Rust allows you to pass the object reference into the function and still be assured that the object is not mutated.
  • Error handling Rust result types help with ensuring that operations that can produce errors are identified and a choice made about the error, even if that choice is passing it on.

While Rust provides safety with zero overhead, coding in Rust involves understanding linear types and for us a new language. In this case the importance of security and performance meant we chose Rust over a potentially easier task in Go.

Dependencies we use

Because of our scale and for DDoS protection, we needed a highly scalable server. For UDP protocols without the concept of a connection, the server can easily respond to one packet at a time, but for TCP this is more complex. Originally we thought about using Tokio. However, at the time Tokio suffered from scheduler problems that had caused other teams some issues. As a result we decided to use Mio directly, basing our work on the examples in Rustls.

We decided to use Rustls over OpenSSL or BoringSSL because of the crate’s consistent error codes and default support for authentication that is difficult to disable accidentally. While there are some features that are not yet supported, it got the job done for our service.

Other engineering choices

More important than our choice of programming language was our implementation strategy. A working, fully featured NTP implementation is a complicated program involving a phase-locked loop. These have a difficult reputation due to their nonlinear nature, beyond the usual complexities of closed-loop control. The response of a phase-locked loop to a disturbance can be estimated if the loop is locked and the disturbance small. However, lock acquisition, large disturbances, and the necessary filtering in NTP are all hard to analyze mathematically since they are not captured in the linear models applied for small-scale analysis. While NTP works with the total phase, unlike the phase-locked loops of electrical engineering, there are still nonlinear elements. For NTP testing, changes to this loop require weeks of operation to determine their performance, as the loop responds very slowly.

Computer clocks are generally accurate over short periods, while networks are plagued with inconsistent delays. This demands a slow response. Changes we make to our service have taken hours to have an effect, as the clients slowly adapt to the new conditions. While RFC 5905 provides lots of details on an algorithm to adjust the clock, later implementations such as chrony have improved upon the algorithm through much more sophisticated nonlinear filters.

Rather than implement these more sophisticated algorithms, we let chrony adjust the clock of our servers, and copy the state variables in the header from chrony and adjust the dispersion and root delay according to the formulas given in the RFC. This strategy let us focus on the new protocols.

Prague

Part of what the Internet Engineering Task Force (IETF) does is organize events like hackathons where implementers of a new standard can get together and try to make their stuff work with one another. This exposes bugs and infelicities of language in the standard and the implementations. We attended the IETF 104 hackathon to develop our server and make it work with other implementations. The NTP working group members were extremely generous with their time, and during the process we uncovered a few issues relating to the exact way one has to handle ALPN with older OpenSSL versions.

At the IETF 104 in Prague we had a working client and server for NTS-KE by the end of the hackathon. This was a good amount of progress considering we started with nothing. However, without implementing NTP we didn’t actually know that our server and client were computing the right thing. That would have to wait for later rounds of testing.

[Image: Wireshark during some NTS debugging]

Crypto Week

As Crypto Week 2019 approached we were busily writing code. All of the NTP protocol had to be implemented, together with the connection between the NTP and NTS-KE parts of the server. We also had to deploy processes to synchronize the ticket-encrypting keys around the world, and work on reconfiguring our own timing infrastructure to support this new service.

With a few weeks to go we had a working implementation, but we needed servers and clients in the wild to test against. Because our server supports only TLS 1.3, which had only just landed in OpenSSL, there were some compatibility problems.

We ended up compiling a chrony branch with NTS support, as well as NTPsec, ourselves, and testing them against time.cloudflare.com. We also tested our client against test servers set up by the chrony and NTPsec projects, in the hope that this would expose bugs and help our implementations work nicely together. After a few lengthy days of debugging, we found that our nonce length wasn't exactly in accordance with the spec, which was quickly fixed. The NTPsec project was extremely helpful in this effort. Of course, this was the day that our office had a blackout, so the testing happened outside in Yerba Buena Gardens.

[Photo: Yerba Buena Gardens. Taken by Wikipedia user Beyond My Ken. CC-BY-SA]

During the deployment of time.cloudflare.com, we had to open up our firewall to incoming NTP packets. Because of NTP reflection attacks, UDP port 123 had been closed on our routers since the early days of Cloudflare's network. Since clients sometimes send NTP packets from source port 123 as well, it's impossible for NTP servers to filter reflection attacks without parsing the contents of the NTP packet, which routers have difficulty doing. To protect Cloudflare infrastructure we got an entire subnet just for the time service, so it could be aggressively throttled and rerouted in case of massive DDoS attacks. This is an exceptional case: most edge services at Cloudflare run on every available IP.

Bug fixes

Shortly after the public launch, we discovered that older Windows versions shipped with NTP version 3, while our server only spoke version 4. This was easy to fix, since the timestamps have not moved between NTP versions: we echo the client's version back, and most surviving NTP version 3 clients will understand the response.

Trickier was the failure of Network Time Foundation ntpd clients to expand their polling interval. It turns out that the server has to echo back the client's polling interval for the polling interval to expand. Chrony does not use the polling interval from the server, and so was not affected by this incompatibility. Both fixes are sketched below.
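
A sketch of the two fixes, assuming a simple representation of the NTP header; the helper names are ours, not the cfnts source.

    // The first byte of an NTP packet packs the leap indicator (2 bits),
    // version (3 bits) and mode (3 bits).
    fn response_first_byte(request_first_byte: u8, leap: u8, mode: u8) -> u8 {
        // Echo the client's version (e.g. 3 for old Windows clients)
        // rather than always answering with version 4.
        let client_version = (request_first_byte >> 3) & 0x07;
        (leap << 6) | (client_version << 3) | (mode & 0x07)
    }

    // ntpd only widens its polling interval when the server echoes it back.
    fn response_poll(request_poll: i8) -> i8 {
        request_poll
    }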

Both of these issues were fixed in ways suggested by other NTP implementers who had run into these problems themselves. We thank Miroslav Lichter tremendously for telling us exactly what the problem was, and the members of the Cloudflare community who posted packet captures demonstrating these issues.

Continued improvement

The original production version of cfnts was not particularly object oriented, and several contributors were just learning Rust. As a result there was quite a bit of unwrap() and unnecessary mutability flying around. Much of the code lived in free functions even when it could profitably have been attached to structures. All of this had to be restructured. Keep in mind that some of the best code running in the real world has been written, rewritten, and sometimes rewritten again. This is actually a good thing.

As an internal project we relied on Cloudflare's internal tooling for building, testing, and deploying code. This tooling was replaced with tools available to everyone, like Docker, to ensure anyone can contribute. Our repository is integrated with Circle CI, ensuring that all contributions are automatically tested. In addition to unit tests, we test the entire end-to-end functionality of getting a time measurement from a server.

The Future

NTPsec has already released support for NTS, but we see very little usage. Please try turning on NTS if you use NTPsec and see how it works with time.cloudflare.com. As the draft advances through the standards process, the protocol will undergo an incompatible change when the identifiers are updated and assigned from the IANA registry instead of the current experimental ones, so this is very much an experiment. Note that your daemon will need TLS 1.3 support, and so could require manually compiling OpenSSL and then linking against it.

We've also added our time service to the public NTP pool. The NTP pool is a widely used, volunteer-maintained service that provides NTP servers spread geographically across the world. Unfortunately, NTS doesn't currently work well with the pool model, so for the best security we recommend enabling NTS and using time.cloudflare.com and other NTS-supporting servers.

In the future, we’re hoping that more clients support NTS, and have licensed our code liberally to enable this. We would love to hear if you incorporate it into a product and welcome contributions to make it more useful.

We’re also encouraged to see that Netnod has a production NTS service at nts.ntp.se. The more time services and clients that adopt NTS, the more secure the Internet will be.

Acknowledgements

Tanya Verma and Gabbi Fisher were major contributors to the code, especially the configuration system and the client code. We’d also like to thank Gary Miller, Miroslav Lichter, and all the people at Cloudflare who set up their laptops and home machines to point to time.cloudflare.com for early feedback.


The TLS Post-Quantum Experiment

Post Syndicated from Kris Kwiatkowski original https://blog.cloudflare.com/the-tls-post-quantum-experiment/


In June, we announced a wide-scale post-quantum experiment with Google. We implemented two post-quantum (i.e., not yet known to be broken by quantum computers) key exchanges, integrated them into our TLS stack and deployed the implementation on our edge servers and in Chrome Canary clients. The goal of the experiment was to evaluate the performance and feasibility of deployment in TLS of two post-quantum key agreement ciphers.

In our previous blog post on post-quantum cryptography, we described differences between those two ciphers in detail. In case you didn’t have a chance to read it, we include a quick recap here. One characteristic of post-quantum key exchange algorithms is that the public keys are much larger than those used by “classical” algorithms. This will have an impact on the duration of the TLS handshake. For our experiment, we chose two algorithms: isogeny-based SIKE and lattice-based HRSS. The former has short key sizes (~330 bytes) but has a high computational cost; the latter has larger key sizes (~1100 bytes), but is a few orders of magnitude faster.

During NIST’s Second PQC Standardization Conference, Nick Sullivan presented our approach to this experiment and some initial results. Quite accurately, he compared NTRU-HRSS to an ostrich and SIKE to a turkey—one is big and fast and the other is small and slow.


Setup & Execution

We based our experiment on TLS 1.3. Cloudflare operated the server-side TLS connections and Google Chrome (Canary and Dev builds) represented the client side of the experiment. We enabled both CECPQ2 (HRSS + X25519) and CECPQ2b (SIKE/p434 + X25519) key-agreement algorithms on all TLS-terminating edge servers. Since the post-quantum algorithms are considered experimental, the X25519 key exchange serves as a fallback to ensure the classical security of the connection.
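
The combining step can be sketched as follows — an illustrative sketch, not the actual TLS stack code: the two shared secrets are concatenated and fed into the TLS 1.3 key schedule, so recovering the session keys requires breaking both X25519 and the post-quantum KEM.

    // Illustrative sketch of hybrid key agreement, not the real CECPQ2 code:
    // both component secrets feed the key schedule, so the connection stays
    // secure as long as either X25519 or the post-quantum KEM is unbroken.
    fn hybrid_shared_secret(x25519_shared: &[u8], pq_shared: &[u8]) -> Vec<u8> {
        let mut combined = Vec::with_capacity(x25519_shared.len() + pq_shared.len());
        combined.extend_from_slice(x25519_shared);
        combined.extend_from_slice(pq_shared);
        combined // consumed by the TLS 1.3 key schedule in place of a single secret
    }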

Clients participating in the experiment were split into three groups: those that initiated the TLS handshake with post-quantum CECPQ2 public keys, with CECPQ2b public keys, or with non-post-quantum X25519 public keys. Each group represented approximately one third of the Chrome Canary population participating in the experiment.

In order to distinguish between clients participating in or excluded from the experiment, we added a custom extension to the TLS handshake. It worked as a simple flag sent by clients and echoed back by Cloudflare edge servers. This allowed us to measure the duration of TLS handshakes only for clients participating in the experiment.

For each connection, we collected telemetry metrics. The most important was the server-side TLS handshake duration, defined as the time between receiving the Client Hello and the Client Finished messages. The diagram below shows what was measured and how the post-quantum key exchange was integrated with TLS 1.3.

[Diagram: server-side handshake duration measurement and post-quantum key exchange integration in TLS 1.3]

The experiment ran for 53 days in total, between August and October. During this time we collected millions of data samples, representing 5% of (anonymized) TLS connections that contained the extension signaling that the client was part of the experiment. We carried out the experiment in two phases.

In the first phase of the experiment, each client was assigned to use one of the three key exchange groups, and each client offered the same key exchange group for every connection. We collected over 10 million records over 40 days.

In the second phase of the experiment, client behavior was modified so that each client randomly chose which key exchange group to offer for each new connection, allowing us to directly compare the performance of each algorithm on a per-client basis. Data collection for this phase lasted 13 days and we collected 270 thousand records.

Results

We now describe our server-side measurement results. Client-side results are described at https://www.imperialviolet.org/2019/10/30/pqsivssl.html.

What did we find?

The primary metric we collected for each connection was the server-side handshake duration. The below histograms show handshake duration timings for all client measurements gathered in the first phase of the experiment, as well as breakdowns into the top five operating systems. The operating system breakdowns shown are restricted to only desktop/laptop devices except for Android, which consists of only mobile devices.

[Figure: handshake duration histograms, overall and for the top five operating systems]

It’s clear from the above plots that for most clients, CECPQ2b performs worse than CECPQ2 and CONTROL. Thus, the small key size of CECPQ2b does not make up for its large computational cost—the ostrich outpaces the turkey.

Digging a little deeper

This means we're done, right? Not quite. We are interested in determining whether there are any populations of TLS clients for which CECPQ2b consistently outperforms CECPQ2. This requires taking a closer look at the long tail of handshake durations. The plots below show cumulative distribution functions (CDFs) of handshake timings zoomed in on the 80th percentile (i.e., the slowest 20% of handshakes).

[Figure: CDFs of handshake durations, zoomed in on the 80th percentile and above]

Here, we start to see something interesting. For Android, Linux, and Windows devices, there is a crossover point where CECPQ2b actually starts to outperform CECPQ2 (Android: ~94th percentile, Linux: ~92nd percentile, Windows: ~95th percentile). macOS and ChromeOS do not appear to have these crossover points.

These effects are small but statistically significant in some cases. The below table shows approximate 95% confidence intervals for the 50th (median), 95th, and 99th percentiles of handshake durations for each key exchange group and device type, calculated using Maritz-Jarrett estimators. The numbers within square brackets give the lower and upper bounds on our estimates for each percentile of the “true” distribution of handshake durations based on the samples collected in the experiment. For example, with a 95% confidence level we can say that the 99th percentile of handshake durations for CECPQ2 on Android devices lies between 4057ms and 4478ms, while the 99th percentile for CECPQ2b lies between 3276ms and 3646ms. Since the intervals do not overlap, we say that with statistical significance, the experiment indicates that CECPQ2b performs better than CECPQ2 for the slowest 1% of Android connections. Configurations where CECPQ2 or CECPQ2b outperforms the other with statistical significance are marked with green in the table.

[Table: approximate 95% confidence intervals for the 50th, 95th, and 99th percentiles of handshake durations, by key exchange group and device type]

Per-client comparison

A second phase of the experiment directly examined the performance of each key exchange algorithm for individual clients, where a client is defined to be a unique (anonymized) IP address and user agent pair. Instead of choosing a single key exchange algorithm for the duration of the experiment, clients randomly selected one of the experiment configurations for each new connection. Although the duration and sample size were limited for this phase of the experiment, we collected at least three handshake measurements for each group configuration from 3900 unique clients.

The plot below shows, for each of these clients, the difference in latency between CECPQ2 and CECPQ2b, taking the minimum latency sample for each key exchange group as the representative value. The CDF plot shows that for 80% of clients, CECPQ2 outperformed or matched CECPQ2b, and for 99% of clients, the latency gap remained within 70ms. At a high level, this indicates that very few clients performed significantly worse with CECPQ2 than with CECPQ2b.

[Figure: per-client CDF of the latency difference between CECPQ2 and CECPQ2b]

Do other factors impact the latency gap?

We looked at a number of other factors—including session resumption, IP version, and network location—to see if they impacted the latency gap between CECPQ2 and CECPQ2b. These factors impacted the overall handshake latency, but we did not find that any made a significant impact on the latency gap between post-quantum ciphers. We share some interesting observations from this analysis below.

Session resumption

Approximately 53% of all connections in the experiment were completed with TLS handshake resumption. However, the percentage of resumed connections varied significantly based on the device configuration. Connections from mobile devices were only resumed ~25% of the time, while between 40% and 70% of connections from laptop/desktop devices were resumed. Additionally, resumption provided between a 30% and 50% speedup for all device types.

IP version

We also examined the impact of IP version on handshake latency. Only 12.5% of the connections in the experiment used IPv6. These connections were 20-40% faster than IPv4 connections for desktop/laptop devices, but ~15% slower for mobile devices. This could be an artifact of IPv6 being generally deployed on newer devices with faster processors. For Android, the experiment was only run on devices with more modern processors, which perhaps eliminated the bias.

Network location

The slow connections making up the long tail of handshake durations were not isolated to a few countries, Autonomous Systems (ASes), or subnets, but originated from a globally diverse set of clients. We did not find a correlation between the relative performance of the two post-quantum key exchange algorithms based on these factors.

Discussion

We found that CECPQ2 (the ostrich) outperformed CECPQ2b (the turkey) for the majority of connections in the experiment, indicating that fast algorithms with large keys may be more suitable for TLS than slow algorithms with small keys. However, we observed the opposite—that CECPQ2b outperformed CECPQ2—for the slowest connections on some devices, including Windows computers and Android mobile devices. One possible explanation for this is packet fragmentation and packet loss. The maximum size of TCP packets that can be sent across a network is limited by the maximum transmission unit (MTU) of the network path, which is often around 1400 bytes. During the TLS handshake the server responds to the client with its public key and ciphertext, whose combined size exceeds the MTU, so handshake messages likely must be split across multiple TCP packets. This increases the risk of lost packets and of delays due to retransmission. A repeat of this experiment that includes collection of fine-grained TCP telemetry could confirm this hypothesis.

A somewhat surprising result of this experiment is just how fast HRSS performs for the majority of connections. Recall that the CECPQ2 cipher performs key exchange operations for both X25519 and HRSS, but the additional overhead of HRSS is barely noticeable. Comparing benchmark results, we can see that HRSS will be faster than X25519 on the server side and slower on the client side.

[Figure: benchmark comparison of X25519 and HRSS operations]

In our design, the client side performs two operations—key generation and KEM decapsulation. Looking at those two operations, we can see that key generation is the bottleneck: at the measured rates below, each key generation costs roughly 0.28 ms, versus about 0.06 ms per decapsulation.

Key generation: 	3553.5 [ops/sec]
KEM decapsulation: 	17186.7 [ops/sec]

In algorithms with quotient-style keys (like NTRU), the key generation algorithm performs an inversion in the quotient ring—an operation that is quite computationally expensive. Alternatively, a TLS implementation could generate ephemeral keys ahead of time in order to speed up key exchange. There are several other lattice-based key exchange candidates that may be worth experimenting with in the context of TLS key exchange, which are based on different underlying principles than the HRSS construction. These candidates have similar key sizes and faster key generation algorithms, but have their own drawbacks. For now, HRSS looks like the more promising algorithm for use in TLS.

In the case of SIKE, we implemented the most recent version of the algorithm, and instantiated it with the most performance-efficient parameter set for our experiment. The algorithm is computationally expensive, so we were required to use assembly to optimize it. In order to ensure best performance on Intel, most performance-critical operations have two different implementations; the library detects CPU capabilities and uses faster instructions if available, but otherwise falls back to a slightly slower generic implementation. We developed our own optimizations for 64-bit ARM CPUs. Nevertheless, our results show that SIKE incurred a significant overhead for every connection, especially on devices with weaker processors. It must be noted that high-performance isogeny-based public key cryptography is arguably much less developed than its lattice-based counterparts. Some ideas to develop this are floating around, and we hope to see performance improvements in the future.


DNS Encryption Explained

Post Syndicated from Peter Wu original https://blog.cloudflare.com/dns-encryption-explained/


The Domain Name System (DNS) is the address book of the Internet. When you visit cloudflare.com or any other site, your browser will ask a DNS resolver for the IP address where the website can be found. Unfortunately, these DNS queries and answers are typically unprotected. Encrypting DNS would improve user privacy and security. In this post, we will look at two mechanisms for encrypting DNS, known as DNS over TLS (DoT) and DNS over HTTPS (DoH), and explain how they work.

Applications that want to resolve a domain name to an IP address typically use DNS. This is usually not done explicitly by the programmer who wrote the application. Instead, the programmer writes something such as fetch("https://example.com/news") and expects a software library to handle the translation of “example.com” to an IP address.

Behind the scenes, the software library is responsible for discovering and connecting to the external recursive DNS resolver and speaking the DNS protocol (see the figure below) in order to resolve the name requested by the application. The choice of the external DNS resolver and whether any privacy and security is provided at all is outside the control of the application. It depends on the software library in use, and the policies provided by the operating system of the device that runs the software.
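
In Rust, for instance, the standard library exposes exactly this kind of opaque resolution. A minimal sketch: the application asks for socket addresses and never learns which resolver answered, or whether the lookup was protected.

    use std::net::ToSocketAddrs;

    fn main() -> std::io::Result<()> {
        // The standard library hands the name to the platform's stub
        // resolver; the application never learns which DNS server answered,
        // or whether the lookup was protected in transit.
        for addr in "example.com:443".to_socket_addrs()? {
            println!("resolved to {}", addr);
        }
        Ok(())
    }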

[Diagram: overview of DNS query and response]

The external DNS resolver

The operating system usually learns the resolver address from the local network using Dynamic Host Configuration Protocol (DHCP). In home and mobile networks, it typically ends up using the resolver from the Internet Service Provider (ISP). In corporate networks, the selected resolver is typically controlled by the network administrator. If desired, users with control over their devices can override the resolver with a specific address, such as the address of a public resolver like Google’s 8.8.8.8 or Cloudflare’s 1.1.1.1, but most users will likely not bother changing it when connecting to a public Wi-Fi hotspot at a coffee shop or airport.

The choice of external resolver has a direct impact on the end-user experience. Most users do not change their resolver settings and will likely end up using the DNS resolver from their network provider. The most obvious observable property is the speed and accuracy of name resolution. Features that improve privacy or security might not be immediately visible, but will help to prevent others from profiling or interfering with your browsing activity. This is especially important on public Wi-Fi networks where anyone in physical proximity can capture and decrypt wireless network traffic.

Unencrypted DNS

Ever since DNS was created in 1987, it has been largely unencrypted. Everyone between your device and the resolver is able to snoop on or even modify your DNS queries and responses. This includes anyone in your local Wi-Fi network, your Internet Service Provider (ISP), and transit providers. This may affect your privacy by revealing the domain names that you are visiting.

What can they see? Well, consider this network packet capture taken from a laptop connected to a home network:

[Screenshot: packet capture of unencrypted DNS traffic on a home network]

The following observations can be made:

  • The UDP source port is 53 which is the standard port number for unencrypted DNS. The UDP payload is therefore likely to be a DNS answer.
  • That suggests that the source IP address 192.168.2.254 is a DNS resolver while the destination IP 192.168.2.14 is the DNS client.
  • The UDP payload could indeed be parsed as a DNS answer, and reveals that the user was trying to visit twitter.com.
  • If there are any future connections to 104.244.42.129 or 104.244.42.1, then it is most likely traffic that is directed at “twitter.com”.
  • If there is further encrypted HTTPS traffic to this IP, followed by more DNS queries, it could indicate that a web browser loaded additional resources from that page. That could potentially reveal the pages that a user was looking at while visiting twitter.com.

Since the DNS messages are unprotected, other attacks are possible:

  • Queries could be directed to a resolver that performs DNS hijacking. For example, in the UK, Virgin Media and BT return a fake response for domains that do not exist, redirecting users to a search page. This redirection is possible because the computer/phone blindly trusts the DNS resolver that was advertised using DHCP by the ISP-provided gateway router.
  • Firewalls can easily intercept, block or modify any unencrypted DNS traffic based on the port number alone. It is worth noting that plaintext inspection is not a silver bullet for achieving visibility goals, because the DNS resolver can be bypassed.

Encrypting DNS

Encrypting DNS makes it much harder for snoopers to look into your DNS messages, or to corrupt them in transit. Just as the web moved from unencrypted HTTP to encrypted HTTPS, there are now upgrades to the DNS protocol that encrypt DNS itself. Encrypting the web has made it possible for private and secure communications and commerce to flourish. Encrypting DNS will further enhance user privacy.

Two standardized mechanisms exist to secure the DNS transport between you and the resolver: DNS over TLS (2016) and DNS Queries over HTTPS (2018). Both are based on Transport Layer Security (TLS), which is also used to secure communication between you and a website using HTTPS. In TLS, the server (be it a web server or DNS resolver) authenticates itself to the client (your device) using a certificate. This ensures that no other party can impersonate the server (the resolver).

With DNS over TLS (DoT), the original DNS message is directly embedded into the secure TLS channel. From the outside, one can neither learn the name that was being queried nor modify it. Only the intended client application is able to decrypt the messages. In a packet capture, it looks like this:

[Screenshot: packet capture of a DNS-over-TLS session]

In the packet trace for unencrypted DNS, it was clear that a DNS request can be sent directly by the client, followed by a DNS answer from the resolver. In the encrypted DoT case, however, some TLS handshake messages are exchanged before any encrypted DNS messages are sent:

  • The client sends a Client Hello, advertising its supported TLS capabilities.
  • The server responds with a Server Hello, agreeing on TLS parameters that will be used to secure the connection. The Certificate message contains the identity of the server while the Certificate Verify message will contain a digital signature which can be verified by the client using the server Certificate. The client typically checks this certificate against its local list of trusted Certificate Authorities, but the DoT specification mentions alternative trust mechanisms such as public key pinning.
  • Once the TLS handshake is Finished by both the client and server, they can finally start exchanging encrypted messages.
  • While the above picture contains one DNS query and answer, in practice the secure TLS connection will remain open and will be reused for future DNS queries.
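
Concretely, DoT reuses the framing of DNS over TCP: each DNS message inside the TLS stream is preceded by a two-byte length field. The sketch below, illustrative rather than taken from any client, builds a minimal A-record query for example.com and frames it for a TLS connection on tcp/853.

    // Build a minimal DNS query for the A record of example.com and frame
    // it for DoT: a two-byte, big-endian length prefix ahead of the
    // message, all carried inside the TLS stream.
    fn dot_frame_query() -> Vec<u8> {
        let mut msg = Vec::new();
        msg.extend_from_slice(&[0x00, 0x01]); // ID (arbitrary)
        msg.extend_from_slice(&[0x01, 0x00]); // flags: standard query, RD=1
        msg.extend_from_slice(&[0x00, 0x01]); // QDCOUNT = 1
        msg.extend_from_slice(&[0x00, 0x00, 0x00, 0x00, 0x00, 0x00]); // AN/NS/AR = 0
        for label in ["example", "com"] {
            msg.push(label.len() as u8); // each label is length-prefixed
            msg.extend_from_slice(label.as_bytes());
        }
        msg.push(0x00); // root label terminates the name
        msg.extend_from_slice(&[0x00, 0x01]); // QTYPE  = A
        msg.extend_from_slice(&[0x00, 0x01]); // QCLASS = IN
        let mut framed = (msg.len() as u16).to_be_bytes().to_vec();
        framed.extend_from_slice(&msg);
        framed // written to a TLS connection established to tcp/853
    }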

Securing unencrypted protocols by slapping TLS on top of a new port has been done before:

  • Web traffic: HTTP (tcp/80) -> HTTPS (tcp/443)
  • Sending email: SMTP (tcp/25) -> SMTPS (tcp/465)
  • Receiving email: IMAP (tcp/143) -> IMAPS (tcp/993)
  • Now: DNS (tcp/53 or udp/53) -> DoT (tcp/853)

A problem with introducing a new port is that existing firewalls may block it, either because they employ a whitelist approach where new services have to be explicitly enabled, or a blocklist approach where a network administrator explicitly blocks a service. If the secure option (DoT) is less likely to be available than the insecure one, then users and applications might be tempted to fall back to unencrypted DNS. This could in turn allow attackers to force users onto an insecure version.

Such fallback attacks are not theoretical. SSL stripping has previously been used to downgrade HTTPS websites to HTTP, allowing attackers to steal passwords or hijack accounts.

Another approach, DNS Queries over HTTPS (DoH), was designed to support two primary use cases:

  • Prevent the above problem where on-path devices interfere with DNS. This includes the port blocking problem above.
  • Enable web applications to access DNS through existing browser APIs.

DoH is essentially HTTPS, the same encrypted standard the web uses, and reuses the same port number (tcp/443). Web browsers have already deprecated non-secure HTTP in favor of HTTPS. That makes HTTPS a great choice for securely transporting DNS messages. An example of such a DoH request can be found here, and a sketch of constructing one follows.
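
For a GET request, RFC 8484 specifies that the wire-format DNS message is base64url-encoded, without padding, into a dns query parameter. The sketch below assumes a wire-format query like the one built in the DoT example; the encoder is hand-rolled only to keep the sketch dependency-free.

    // Sketch of forming a DoH GET URL (RFC 8484). `dns_query` is assumed to
    // be a wire-format DNS message, e.g. the one built in the DoT example.
    fn doh_url(dns_query: &[u8]) -> String {
        format!(
            "https://cloudflare-dns.com/dns-query?dns={}",
            base64url_nopad(dns_query)
        )
    }

    // Hand-rolled base64url (no padding) so the sketch stands alone;
    // a real client would use a library.
    fn base64url_nopad(data: &[u8]) -> String {
        const ALPHABET: &[u8; 64] =
            b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
        let mut out = String::new();
        for chunk in data.chunks(3) {
            let mut buf = [0u8; 3];
            buf[..chunk.len()].copy_from_slice(chunk);
            let n = (u32::from(buf[0]) << 16) | (u32::from(buf[1]) << 8) | u32::from(buf[2]);
            for i in 0..=chunk.len() {
                out.push(ALPHABET[((n >> (18 - 6 * i)) & 0x3f) as usize] as char);
            }
        }
        out
    }

The request is sent with an accept: application/dns-message header; a POST variant instead carries the raw wire-format message in the request body with content-type: application/dns-message.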

[Diagram: DoH: DNS query and response transported over a secure HTTPS stream]

Some users have been concerned that the use of HTTPS could weaken privacy due to the potential use of cookies for tracking purposes. The DoH protocol designers considered various privacy aspects and explicitly discouraged use of HTTP cookies to prevent tracking, a recommendation that is widely respected. TLS session resumption improves TLS 1.2 handshake performance, but can potentially be used to correlate TLS connections. Luckily, use of TLS 1.3 obviates the need for TLS session resumption by reducing the number of round trips by default, effectively addressing its associated privacy concern.

Using HTTPS means that HTTP protocol improvements can also benefit DoH. For example, the in-development HTTP/3 protocol, built on top of QUIC, could offer additional performance improvements in the presence of packet loss due to lack of head-of-line blocking. This means that multiple DNS queries could be sent simultaneously over the secure channel without blocking each other when one packet is lost.

A draft for DNS over QUIC (DNS/QUIC) also exists and is similar to DoT, but without the head-of-line blocking problem due to the use of QUIC. Both HTTP/3 and DNS/QUIC, however, require a UDP port to be accessible. In theory, both could fall back to DoH over HTTP/2 and DoT respectively.

Deployment of DoT and DoH

As both DoT and DoH are relatively new, they are not universally deployed yet. On the server side, major public resolvers, including Cloudflare's 1.1.1.1 and Google DNS, support them. Many ISP resolvers, however, still lack support. A small list of public resolvers supporting DoH can be found at DNS server sources; another list of public resolvers supporting DoT and DoH can be found on DNS Privacy Public Resolvers.

There are two methods to enable DoT or DoH on end-user devices:

  • Add support to applications, bypassing the resolver service from the operating system.
  • Add support to the operating system, transparently providing support to applications.

There are generally three configuration modes for DoT or DoH on the client side:

  • Off: DNS will not be encrypted.
  • Opportunistic mode: try to use a secure transport for DNS, but fall back to unencrypted DNS if the former is unavailable. This mode is vulnerable to downgrade attacks, where an attacker can force a device to use unencrypted DNS. It aims to offer privacy when there are no on-path active attackers.
  • Strict mode: try to use DNS over a secure transport. If unavailable, fail hard and show an error to the user.

The current state for system-wide configuration of DNS over a secure transport:

  • Android 9: supports DoT through its “Private DNS” feature. Modes:
    • Opportunistic mode (“Automatic”) is used by default. The resolver from network settings (typically DHCP) will be used.
    • Strict mode can be configured by setting an explicit hostname. No IP address is allowed; the hostname is resolved using the default resolver and is also used for validating the certificate. (Relevant source code)
  • iOS and Android users can also install the 1.1.1.1 app to enable either DoH or DoT support in strict mode. Internally it uses the VPN programming interfaces to enable interception of unencrypted DNS traffic before it is forwarded over a secure channel.
  • Linux with systemd-resolved from systemd 239: DoT through the DNSOverTLS option (a minimal configuration sketch follows this list).

    • Off is the default.
    • Opportunistic mode can be configured, but no certificate validation is performed.
    • Strict mode is available since systemd 243. Any certificate signed by a trusted certificate authority is accepted. However, there is no hostname validation with the GnuTLS backend while the OpenSSL backend expects an IP address.
    • In any case, no Server Name Indication (SNI) is sent. The certificate name is not validated, making a man-in-the-middle attack rather trivial.
  • Linux, macOS, and Windows can use a DoH client in strict mode. The cloudflared proxy-dns command uses the Cloudflare DNS resolver by default, but users can override it through the proxy-dns-upstream option.
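
As an illustration, a minimal /etc/systemd/resolved.conf enabling strict mode on systemd 243 or later might look like this (assuming Cloudflare's resolver, and bearing in mind the certificate validation caveats above):

    [Resolve]
    DNS=1.1.1.1 1.0.0.1
    DNSOverTLS=yes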

Web browsers support DoH instead of DoT:

  • Firefox 62 supports DoH and provides several Trusted Recursive Resolver (TRR) settings. By default DoH is disabled, but Mozilla is running an experiment to enable DoH for some users in the USA. This experiment currently uses Cloudflare’s 1.1.1.1 resolver, since we are the only provider that currently satisfies the strict resolver policy required by Mozilla. Since many DNS resolvers still do not support an encrypted DNS transport, Mozilla’s approach will ensure that more users are protected using DoH.
    • When enabled through the experiment, or through the “Enable DNS over HTTPS” option at Network Settings, Firefox will use opportunistic mode (network.trr.mode=2 at about:config).
    • Strict mode can be enabled with network.trr.mode=3, but requires an explicit resolver IP to be specified (for example, network.trr.bootstrapAddress=1.1.1.1).
    • While Firefox ignores the default resolver from the system, it can be configured with alternative resolvers. Additionally, enterprise deployments that use a resolver that does not support DoH have the option to disable DoH.
  • Chrome 78 enables opportunistic DoH if the system resolver address matches one of the hard-coded DoH providers (source code change). This experiment is enabled for all platforms except Linux and iOS, and excludes enterprise deployments by default.
  • Opera 65 adds an option to enable DoH through Cloudflare’s 1.1.1.1 resolver. This feature is off by default. Once enabled, it appears to use opportunistic mode: if 1.1.1.1:443 (without SNI) is reachable, it will be used. Otherwise it falls back to the default resolver, unencrypted.

The DNS over HTTPS page from the curl project has a comprehensive list of DoH providers and additional implementations.

As an alternative to encrypting the full network path between the device and the external DNS resolver, one can take a middle ground: use unencrypted DNS between devices and the gateway of the local network, but encrypt all DNS traffic between the gateway router and the external DNS resolver. Assuming a secure wired or wireless network, this would protect all devices in the local network against a snooping ISP, or other adversaries on the Internet. As public Wi-Fi hotspots are not considered secure, this approach would not be safe on open Wi-Fi networks. Even if it is password-protected with WPA2-PSK, others will still be able to snoop and modify unencrypted DNS.

Other security considerations

The previous sections described the secure DNS transports, DoH and DoT. These will only ensure that your client receives the untampered answer from the DNS resolver. They do not, however, protect the client against the resolver returning a wrong answer (through DNS hijacking or DNS cache poisoning attacks). The "true" answer is determined by the owner of a domain or zone as reported by the authoritative name server. DNSSEC allows clients to verify the integrity of the returned DNS answer and catch any unauthorized tampering along the path between the client and the authoritative name server.

However, deployment of DNSSEC is hindered by middleboxes that incorrectly forward DNS messages, and even where the information is available, stub resolvers used by applications might not validate the results. A report from 2016 found that only 26% of users use DNSSEC-validating resolvers.

DoH and DoT protect the transport between the client and the public resolver. The public resolver may have to reach out to additional authoritative name servers in order to resolve a name. Traditionally, the path between any resolver and the authoritative name server uses unencrypted DNS. To protect these DNS messages as well, we did an experiment with Facebook, using DoT between 1.1.1.1 and Facebook’s authoritative name servers. While setting up a secure channel using TLS increases latency, it can be amortized over many queries.

Transport encryption ensures that resolver results and metadata are protected. For example, the EDNS Client Subnet (ECS) information included with DNS queries could reveal the original client address that started the DNS query. Hiding that information along the path improves privacy. It will also prevent broken middle-boxes from breaking DNSSEC due to issues in forwarding DNS.

Operational issues with DNS encryption

DNS encryption may bring challenges to individuals or organizations that rely on monitoring or modifying DNS traffic. Security appliances that rely on passive monitoring watch all incoming and outgoing network traffic on a machine or on the edge of a network. Based on unencrypted DNS queries, they could, for example, potentially identify machines that are infected with malware. If the DNS query is encrypted, then passive monitoring solutions will not be able to monitor domain names.

Some parties expect DNS resolvers to apply content filtering for purposes such as:

  • Blocking domains used for malware distribution.
  • Blocking advertisements.
  • Performing parental control filtering, blocking domains associated with adult content.
  • Blocking access to domains serving illegal content according to local regulations.
  • Offering a split-horizon DNS to provide different answers depending on the source network.

An advantage of blocking access to domains via the DNS resolver is that it can be centrally done, without reimplementing it in every single application. Unfortunately, it is also quite coarse. Suppose that a website hosts content for multiple users at example.com/videos/for-kids/ and example.com/videos/for-adults/. The DNS resolver will only be able to see “example.com” and can either choose to block it or not. In this case, application-specific controls such as browser extensions would be more effective since they can actually look into the URLs and selectively prevent content from being accessible.

DNS monitoring is not comprehensive. Malware could skip DNS and hardcode IP addresses, or use alternative methods to query an IP address. However, not all malware is that complicated, so DNS monitoring can still serve as a defence-in-depth tool.

All of these non-passive monitoring or DNS blocking use cases require support from the DNS resolver. Deployments that rely on opportunistic DoH/DoT upgrades of the current resolver will maintain the same feature set as usually provided over unencrypted DNS. Unfortunately this is vulnerable to downgrades, as mentioned before. To solve this, system administrators can point endpoints to a DoH/DoT resolver in strict mode. Ideally this is done through secure device management solutions (MDM, group policy on Windows, etc.).

Conclusion

One of the cornerstones of the Internet is mapping names to an address using DNS. DNS has traditionally used insecure, unencrypted transports. This has been abused by ISPs in the past for injecting advertisements, but also causes a privacy leak. Nosey visitors in the coffee shop can use unencrypted DNS to follow your activity. All of these issues can be solved by using DNS over TLS (DoT) or DNS over HTTPS (DoH). These techniques to protect the user are relatively new and are seeing increasing adoption.

From a technical perspective, DoH is very similar to HTTPS and follows the general industry trend to deprecate non-secure options. DoT is a simpler transport mode than DoH as the HTTP layer is removed, but that also makes it easier to be blocked, either deliberately or by accident.

Secondary to enabling a secure transport is the choice of a DNS resolver. Some vendors will use the locally configured DNS resolver, but try to opportunistically upgrade the unencrypted transport to a more secure transport (either DoT or DoH). Unfortunately, the DNS resolver usually defaults to one provided by the ISP which may not support secure transports.

Mozilla has adopted a different approach. Rather than relying on local resolvers that may not even support DoH, they allow the user to explicitly select a resolver. Resolvers recommended by Mozilla have to satisfy high standards to protect user privacy. To ensure that parental control features based on DNS remain functional, and to support the split-horizon use case, Mozilla has added a mechanism that allows private resolvers to disable DoH.

The DoT and DoH transport protocols are ready for us to move to a more secure Internet. As can be seen in previous packet traces, these protocols are similar to existing mechanisms to secure application traffic. Once this security and privacy hole is closed, there will be many more to tackle.
