All posts by Mark Ryland

How AWS threat intelligence deters threat actors

Post Syndicated from Mark Ryland original https://aws.amazon.com/blogs/security/how-aws-threat-intelligence-deters-threat-actors/

Every day across the Amazon Web Services (AWS) cloud infrastructure, we detect and successfully thwart hundreds of cyberattacks that might otherwise be disruptive and costly. These important but mostly unseen victories are achieved with a global network of sensors and an associated set of disruption tools. Using these capabilities, we make it more difficult and expensive for cyberattacks to be carried out against our network, our infrastructure, and our customers. But we also help make the internet as a whole a safer place by working with other responsible providers to take action against threat actors operating within their infrastructure. Turning our global-scale threat intelligence into swift action is just one of the many steps that we take as part of our commitment to security as our top priority. Although this is a never-ending endeavor and our capabilities are constantly improving, we’ve reached a point where we believe customers and other stakeholders can benefit from learning more about what we’re doing today, and where we want to go in the future.

Global-scale threat intelligence using the AWS Cloud

With the largest public network footprint of any cloud provider, our scale at AWS gives us unparalleled insight into certain activities on the internet, in real time. Some years ago, leveraging that scale, AWS Principal Security Engineer Nima Sharifi Mehr started looking for novel approaches for gathering intelligence to counter threats. Our teams began building an internal suite of tools, given the moniker MadPot, and before long, Amazon security researchers were successfully finding, studying, and stopping thousands of digital threats that might have affected its customers.

MadPot was built to accomplish two things: first, discover and monitor threat activities and second, disrupt harmful activities whenever possible to protect AWS customers and others. MadPot has grown to become a sophisticated system of monitoring sensors and automated response capabilities. The sensors observe more than 100 million potential threat interactions and probes every day around the world, with approximately 500,000 of those observed activities advancing to the point where they can be classified as malicious. That enormous amount of threat intelligence data is ingested, correlated, and analyzed to deliver actionable insights about potentially harmful activity happening across the internet. The response capabilities automatically protect the AWS network from identified threats, and generate outbound communications to other companies whose infrastructure is being used for malicious activities.

Systems of this sort are known as honeypots—decoys set up to capture threat actor behavior—and have long served as valuable observation and threat intelligence tools. However, the approach we take through MadPot produces unique insights resulting from our scale at AWS and the automation behind the system. To attract threat actors whose behaviors we can then observe and act on, we designed the system so that it looks like it’s composed of a huge number of plausible innocent targets. Mimicking real systems in a controlled and safe environment provides observations and insights that we can often immediately use to help stop harmful activity and help protect customers.

Of course, threat actors know that systems like this are in place, so they frequently change their techniques—and so do we. We invest heavily in making sure that MadPot constantly changes and evolves its behavior, continuing to have visibility into activities that reveal the tactics, techniques, and procedures (TTPs) of threat actors. We put this intelligence to use quickly in AWS tools, such as AWS Shield and AWS WAF, so that many threats are mitigated early by initiating automated responses. When appropriate, we also provide the threat data to customers through Amazon GuardDuty so that their own tooling and automation can respond.

Three minutes to exploit attempt, no time to waste

Within approximately 90 seconds of launching a new sensor within our MadPot simulated workload, we can observe that the workload has been discovered by probes scanning the internet. From there, it takes only three minutes on average before attempts are made to penetrate and exploit it. This is an astonishingly short amount of time, considering that these workloads aren’t advertised or part of other visible systems that would be obvious to threat actors. This clearly demonstrates the voracity of scanning taking place and the high degree of automation that threat actors employ to find their next target.

As these attempts run their course, the MadPot system analyzes the telemetry, code, attempted network connections, and other key data points of the threat actor’s behavior. This information becomes even more valuable as we aggregate threat actor activities to generate a more complete picture of available intelligence.

Disrupting attacks to maintain business as usual

In-depth threat intelligence analysis also happens in MadPot. The system launches the malware it captures in a sandboxed environment, connects information from disparate techniques into threat patterns, and more. When the gathered signals provide high enough confidence in a finding, the system acts to disrupt threats whenever possible, such as disconnecting a threat actor’s resources from the AWS network. Or, it could entail preparing that information to be shared with the wider community, such as a computer emergency response team (CERT), internet service provider (ISP), a domain registrar, or government agency so that they can help disrupt the identified threat.

As a major internet presence, AWS takes on the responsibility to help and collaborate with the security community when possible. Information sharing within the security community is a long-standing tradition and an area where we’ve been an active participant for years.

In the first quarter of 2023:

  • We used 5.5B signals from our internet threat sensors and 1.5B signals from our active network probes in our anti-botnet security efforts.
  • We stopped over 1.3M outbound botnet-driven DDoS attacks.
  • We shared our security intelligence findings, including nearly a thousand botnet C2 hosts, with relevant hosting providers and domain registrars.
  • We traced back and worked with external parties to dismantle the sources of 230k L7/HTTP(S) DDoS attacks.

Three examples of MadPot’s effectiveness: Botnets, Sandworm, and Volt Typhoon

Recently, MadPot detected, collected, and analyzed suspicious signals that uncovered a distributed denial of service (DDoS) botnet that was using the domain free.bigbots.[tld] (the top-level domain is omitted) as a command and control (C2) domain. A botnet is made up of compromised systems that belong to innocent parties—such as computers, home routers, and Internet of Things (IoT) devices—that have been previously compromised, with malware installed that awaits commands to flood a target with network packets. Bots under this C2 domain were launching 15–20 DDoS attacks per hour at a rate of about 800 million packets per second.

As MadPot mapped out this threat, our intelligence revealed a list of IP addresses used by the C2 servers corresponding to an extremely high number of requests from the bots. Our systems blocked those IP addresses from access to AWS networks so that a compromised customer compute node on AWS couldn’t participate in the attacks. AWS automation then used the intelligence gathered to contact the company that was hosting the C2 systems and the registrar responsible for the DNS name. The company whose infrastructure was hosting the C2s took them offline in less than 48 hours, and the domain registrar decommissioned the DNS name in less than 72 hours. Without the ability to control DNS records, the threat actor could not easily resuscitate the network by moving the C2s to a different network location. In less than three days, this widely distributed malware and the C2 infrastructure required to operate it was rendered inoperable, and the DDoS attacks impacting systems throughout the internet ground to a halt.

MadPot is effective in detecting and understanding the threat actors that target many different kinds of infrastructure, not just cloud infrastructure, including the malware, ports, and techniques that they may be using. Thus, through MadPot we identified the threat group called Sandworm—the cluster associated with Cyclops Blink, a piece of malware used to manage a botnet of compromised routers. Sandworm was attempting to exploit a vulnerability affecting WatchGuard network security appliances. With close investigation of the payload, we identified not only IP addresses but also other unique attributes associated with the Sandworm threat that were involved in an attempted compromise of an AWS customer. MadPot’s unique ability to mimic a variety of services and engage in high levels of interaction helped us capture additional details about Sandworm campaigns, such as services that the actor was targeting and post-exploitation commands initiated by that actor. Using this intelligence, we notified the customer, who promptly acted to mitigate the vulnerability. Without this swift action, the actor might have been able to gain a foothold in the customer’s network and gain access to other organizations that the customer served.

For our final example, the MadPot system was used to help government cyber and law enforcement authorities identify and ultimately disrupt Volt Typhoon, the widely-reported state-sponsored threat actor that focused on stealthy and targeted cyber espionage campaigns against critical infrastructure organizations. Through our investigation inside MadPot, we identified a payload submitted by the threat actor that contained a unique signature, which allowed identification and attribution of activities by Volt Typhoon that would otherwise appear to be unrelated. By using the data lake that stores a complete history of MadPot interactions, we were able to search years of data very quickly and ultimately identify other examples of this unique signature, which was being sent in payloads to MadPot as far back as August 2021. The previous request was seemingly benign in nature, so we believed that it was associated with a reconnaissance tool. We were then able to identify other IP addresses that the threat actor was using in recent months. We shared our findings with government authorities, and those hard-to-make connections helped inform the research and conclusions of the Cybersecurity and Infrastructure Security Agency (CISA) of the U.S. government. Our work and the work of other cooperating parties resulted in their May 2023 Cybersecurity advisory. To this day, we continue to observe the actor probing U.S. network infrastructure, and we continue to share details with appropriate government cyber and law enforcement organizations.

Putting global-scale threat intelligence to work for AWS customers and beyond

At AWS, security is our top priority, and we work hard to help prevent security issues from causing disruption to your business. As we work to defend our infrastructure and your data, we use our global-scale insights to gather a high volume of security intelligence—at scale and in real time—to help protect you automatically. Whenever possible, AWS Security and its systems disrupt threats where that action will be most impactful; often, this work happens largely behind the scenes. As demonstrated in the botnet case described earlier, we neutralize threats by using our global-scale threat intelligence and by collaborating with entities that are directly impacted by malicious activities. We incorporate findings from MadPot into AWS security tools, including preventative services, such as AWS WAF, AWS Shield, AWS Network Firewall, and Amazon Route 53 Resolver DNS Firewall, and detective and reactive services, such as Amazon GuardDuty, AWS Security Hub, and Amazon Inspector, putting security intelligence when appropriate directly into the hands of our customers, so that they can build their own response procedures and automations.

But our work extends security protections and improvements far beyond the bounds of AWS itself. We work closely with the security community and collaborating businesses around the world to isolate and take down threat actors. In the first half of this year, we shared intelligence of nearly 2,000 botnet C2 hosts with relevant hosting providers and domain registrars to take down the botnets’ control infrastructure. We also traced back and worked with external parties to dismantle the sources of approximately 230,000 Layer 7 DDoS attacks. The effectiveness of our mitigation strategies relies heavily on our ability to quickly capture, analyze, and act on threat intelligence. By taking these steps, AWS is going beyond just typical DDoS defense, and moving our protection beyond our borders.

We’re glad to be able to share information about MadPot and some of the capabilities that we’re operating today. For more information, see this presentation from our most recent re:Inforce conference: How AWS threat intelligence becomes managed firewall rules, as well as an overview post published today, Meet MadPot, a threat intelligence tool Amazon uses to protect customers from cybercrime, which includes some good information about the AWS security engineer behind the original creation of MadPot. Going forward, you can expect to hear more from us as we develop and enhance our threat intelligence and response systems, making both AWS and the internet as a whole a safer place.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry, and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization, and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Removing header remapping from Amazon API Gateway, and notes about our work with security researchers

Post Syndicated from Mark Ryland original https://aws.amazon.com/blogs/security/removing-header-remapping-from-amazon-api-gateway-and-notes-about-our-work-with-security-researchers/

At Amazon Web Services (AWS), our APIs and service functionality are a promise to our customers, so we very rarely make breaking changes or remove functionality from production services. Customers use the AWS Cloud to build solutions for their customers, and when disruptive changes are made or functionality is removed, the downstream impacts can be significant. As builders, we’ve felt the impact of these types of changes ourselves, and we work hard to avoid these situations whenever possible.

When we do need to make breaking changes, we try to provide a smooth path forward for customers who were using the old functionality. Often that means changing the behavior for new users or new deployments, and then allowing a transition window for existing customers to migrate from the old to the new behavior. There are many examples of this pattern, such as an update to IAM role trust policy behavior that we made last year.

This post explains one such recent change that we’ve made in Amazon API Gateway. We also discuss how we work with the security research community to improve things for customers.

Summary and customer impact

Recently, researchers at Omegapoint disclosed an edge case issue with how API Gateway handled HTTP header remapping with custom authorizers based on AWS Lambda. As is often the case with security research, this work generated a second, tangentially related authorization-caching issue that the Omegapoint team also reported.

After analyzing these reports, the API Gateway team decided to remove a documented feature from the service and to adjust another behavior to improve service behavior. We’ve made the appropriate changes to the API Gateway documentation.

As of June 14, 2023, the header remapping feature is no longer available in API Gateway. Customers can still use Velocity Template Language-based (VTL) transformations for header remapping, because this approach wasn’t impacted by the reported issue. If you’re using this design pattern in API Gateway and have questions about this change, reach out to your AWS support team.

The authorization-caching behavior was working as originally designed; but based on the report, we’ve adjusted it to better meet customer expectations.

The team at Omegapoint has published their findings in the blog post Writeup: AWS API Gateway header smuggling and cache confusion.

Before we removed the feature, we contacted customers who were using the direct HTTP header remapping feature through email and the AWS Health Dashboard. If you haven’t been contacted, no action is required on your part.

More details

The main issue that Omegapoint reported was related to a documented, client-controlled HTTP header remapping feature in API Gateway. This feature allowed customers to use one set of header values in the interaction between their clients and API Gateway, and a different set of header values from API Gateway to the backend. The client could send two sets of header values: one for API Gateway and one for the backend. API Gateway would process both sets, but then remap (overwrite) one set of values with another set. This feature was especially useful when allowing newly created API Gateway clients to continue to work with legacy servers whose header-handling logic couldn’t be modified.

The report from Omegapoint highlighted that customers who relied on Lambda authorizers for request-based authorization could be surprised when the remapping feature was used to overwrite header values that were used for further authorization on the backend, which could potentially lead to unintended access. The Lambda authorizer itself worked as expected on unmapped headers, but if there was additional authorization logic in the backend, it could be impacted by a misbehaving client.

The second issue that Omegapoint reported was related to the caching behavior in API Gateway for authorization policies. Previously, the caching method might reuse a cached authorization with a different value when the <method.request.multivalueheader.*> value was used in the request header within the time-to-live (TTL) of the cached value. This was the expected behavior of the wildcard value.

However, after reviewing the report, we agreed that it could surprise customers, and potentially allow misbehaving clients to bypass expected authorization. We were able to change this behavior without customer impact, because there is no evidence of customers relying on this behavior. So now, cached authorizations are no longer used in the <multivalueheader> case.

How we work with researchers

Security researchers regularly submit vulnerability reports to AWS Security. Some researchers are independent, some work in academic institutions, and others work in AWS partner or customer organizations. Our Outreach team triages submissions rapidly. Upon receipt, we start a conversation and work closely with researchers to understand their concerns, give our perspective, and agree on the best path forward.

If technical changes are required, our services and security teams work together to determine and implement the appropriate remediations based on the potential impact. They work with affected customers to reduce or eliminate impacts, and they work with the researchers to coordinate the publication of their findings.

Often these reports highlight situations where the designed and documented behavior might result in a surprising outcome for some customers. In those cases, we work with the researcher to make the appropriate updates to the documentation, if needed, and help ensure that the researcher’s finding is published with customer education as the primary goal.

In other cases, where warranted, we communicate about security issues to the broader customer and security community by using a security bulletin. Finally, we publish security blog posts in cases where providing more context makes sense, such as the current issue.

Security is our top priority, and working with the community to make our customers and the AWS Cloud safer is a key part of that. Clear communication helps build understanding and trust.

Working together

We removed the direct remapping feature because not many customers were using it, and we felt that documentation warning against the impacted design choices provided insufficient visibility and protection for customers. We designed and released the feature in an era when it was reasonable to assume that an API Gateway client would be well-behaved, but as times change, it now makes sense that an API client could be potentially negligent or even hostile. There are multiple alternative approaches that can provide the same outcome for customers, but in a more expected and controlled manner, which made this a simpler process to work through.

When researchers report potential security findings, we work through our process to determine the best outcome for our customers. In most cases, we can adjust designs to address the issue, while maintaining the affected features.

In rare cases, such as this one, the more effective path forward is to sunset a feature in favor of a more expected and secure approach. This is a core principle of evolving architectures and building resilient systems. It’s something that we practice regularly at AWS and a key principle that we share with our customers and the community through the AWS Well-Architected Framework.

Our thanks to the team at Omegapoint for reporting these issues, and to all of the researchers who continue to work with us to help make the AWS Cloud safer for our customers.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry, and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization, and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Our commitment to shared cybersecurity goals

Post Syndicated from Mark Ryland original https://aws.amazon.com/blogs/security/our-commitment-to-shared-cybersecurity-goals/

The United States Government recently launched its National Cybersecurity Strategy. The Strategy outlines the administration’s ambitious vision for building a more resilient future, both in the United States and around the world, and it affirms the key role cloud computing plays in realizing this vision.

Amazon Web Services (AWS) is broadly committed to working with customers, partners, and governments such as the United States to improve cybersecurity. That longstanding commitment aligns with the goals of the National Cybersecurity Strategy. In this blog post, we will summarize the Strategy and explain how AWS will help to realize its vision.

The Strategy identifies two fundamental shifts in how the United States allocates cybersecurity roles, responsibilities, and resources. First, the Strategy calls for a shift in cybersecurity responsibility away from individuals and organizations with fewer resources, toward larger technology providers that are the most capable and best-positioned to be successful. At AWS, we recognize that our success and scale bring broad responsibility. We are committed to improving cybersecurity outcomes for our customers, our partners, and the world at large.

Second, the Strategy calls for realigning incentives to favor long-term investments in a resilient future. As part of our culture of ownership, we are long-term thinkers, and we don’t sacrifice long-term value for short-term results. For more than fifteen years, AWS has delivered security, identity, and compliance services for millions of active customers around the world. We recognize that we operate in a complicated global landscape and dynamic threat environment that necessitates a dynamic approach to security. Innovation and long-term investments have been and will continue to be at the core of our approach, and we continue to innovate to build and improve on our security and the services we offer customers.

AWS is working to enhance cybersecurity outcomes in ways that align with each of the Strategy’s five pillars:

  1. Defend Critical Infrastructure — Customers, partners, and governments need confidence that they are migrating to and building on a secure cloud foundation. AWS is architected to have the most flexible and secure cloud infrastructure available today, and our customers benefit from the data centers, networks, custom hardware, and secure software layers that we have built to satisfy the requirements of the most security-sensitive organizations. Our cloud infrastructure is secure by design and secure by default, and our infrastructure and services meet the high bar that the United States Government and other customers set for security.
  2. Disrupt and Dismantle Threat Actors — At AWS, our paramount focus on security leads us to implement important measures to prevent abuse of our services and products. Some of the measures we undertake to deter, detect, mitigate, and prevent abuse of AWS products include examining new registrations for potential fraud or identity falsification, employing service-level containment strategies when we detect unusual behavior, and helping protect critical systems and sensitive data against ransomware. Amazon is also working with government to address these threats, including by serving as one of the first members of the Joint Cyber Defense Collaborative (JCDC). Amazon is also co-leading a study with the President’s National Security Telecommunications Advisory Committee on addressing the abuse of domestic infrastructure by foreign malicious actors.
  3. Shape Market Forces to Drive Security and Resilience — At AWS, security is our top priority. We continuously innovate based on customer feedback, which helps customer organizations to accelerate their pace of innovation while integrating top-tier security architecture into the core of their business and operations. For example, AWS co-founded the Open Cybersecurity Schema Framework (OCSF) project, which facilitates interoperability and data normalization between security products. We are contributing to the quality and safety of open-source software both by direct contributions to open-source projects and also by an initial commitment of $10 million in a variety of open-source security improvement projects in and through the Open Source Security Foundation (OpenSSF).
  4. Invest in a Resilient Future — Cybersecurity skills training, workforce development, and education on next-generation technologies are essential to addressing cybersecurity challenges. That’s why we are making significant investments to help make it simpler for people to gain the skills they need to grow their careers, including in cybersecurity. Amazon is committing more than $1.2 billion to provide no-cost education and skills training opportunities to more than 300,000 of our own employees in the United States, to help them secure new, high-growth jobs. Amazon is also investing hundreds of millions of dollars to provide no-cost cloud computing skills training to 29 million people around the world. We will continue to partner with the Cybersecurity and Infrastructure Security Agency (CISA) and others in government to develop the cybersecurity workforce.
  5. Forge International Partnerships to Pursue Shared Goals — AWS is working with governments around the world to provide innovative solutions that advance shared goals such as bolstering cyberdefenses and combating security risks. For example, we are supporting international forums such as the Organization of American States to build security capacity across the hemisphere. We encourage the administration to look toward internationally recognized, risk-based cybersecurity frameworks and standards to strengthen consistency and continuity of security among interconnected sectors and throughout global supply chains. Increased adoption of these standards and frameworks, both domestically and internationally, will mitigate cyber risk while facilitating economic growth.

AWS shares the Biden administration’s cybersecurity goals and is committed to partnering with regulators and customers to achieve them. Collaboration between the public sector and industry has been central to US cybersecurity efforts, fueled by the recognition that the owners and operators of technology must play a key role. As the United States Government moves forward with implementation of the National Cybersecurity Strategy, we look forward to redoubling our efforts and welcome continued engagement with all stakeholders—including the United States Government, our international partners, and industry collaborators. Together, we can address the most difficult cybersecurity challenges, enhance security outcomes, and build toward a more secure and resilient future.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry, and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization, and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Announcing an update to IAM role trust policy behavior

Post Syndicated from Mark Ryland original https://aws.amazon.com/blogs/security/announcing-an-update-to-iam-role-trust-policy-behavior/

AWS Identity and Access Management (IAM) is changing an aspect of how role trust policy evaluation behaves when a role assumes itself. Previously, roles implicitly trusted themselves from a role trust policy perspective if they had identity-based permissions to assume themselves. After receiving and considering feedback from customers on this topic, AWS is changing role assumption behavior to always require self-referential role trust policy grants. This change improves consistency and visibility with regard to role behavior and privileges. This change allows customers to create and understand role assumption permissions in a single place (the role trust policy) rather than two places (the role trust policy and the role identity policy). It increases the simplicity of role trust permission management: “What you see [in the trust policy] is what you get.”

Therefore, beginning today, for any role that has not used the identity-based behavior since June 30, 2022, a role trust policy must explicitly grant permission to all principals, including the role itself, that need to assume it under the specified conditions. Removal of the role’s implicit self-trust improves consistency and increases visibility into role assumption behavior.

Most AWS customers will not be impacted by the change at all. Only a tiny percentage (approximately 0.0001%) of all roles are involved. Customers whose roles have recently used the previous implicit trust behavior are being notified, beginning today, about those roles, and may continue to use this behavior with those roles until February 15, 2023, to allow time for making the necessary updates to code or configuration. Or, if these customers are confident that the change will not impact them, they can opt out immediately by substituting in new roles, as discussed later in this post.

The first part of this post briefly explains the change in behavior. The middle sections answer practical questions like: “why is this happening?,” “how might this change impact me?,” “which usage scenarios are likely to be impacted?,” and “what should I do next?” The usage scenario section is important because it shows that, based on our analysis, the self-assuming role behavior exhibited by code or human users is very likely to be unnecessary and counterproductive. Finally, for security professionals interested in better understanding the reasons for the old behavior, the rationale for the change, as well as its possible implications, the last section reviews a number of core IAM concepts and digs in to additional details.

What is changing?

Until today, an IAM role implicitly trusted itself. Consider the following role trust policy attached to the role named RoleA in AWS account 123456789012.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/RoleB"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

This role trust policy grants role assumption access to the role named RoleB in the same account. However, if the corresponding identity-based policy for RoleA grants the sts:AssumeRole action with respect to itself, then RoleA could also assume itself. Therefore, there were actually two roles that could assume RoleA: the explicitly permissioned RoleB, and RoleA, which implicitly trusted itself as a byproduct of the IAM ownership model (explained in detail in the final section). Note that the identity-based permission that RoleA must have to assume itself is not required in the case of RoleB, and indeed an identity-based policy associated with RoleB that references other roles is not sufficient to allow RoleB to assume them. The resource-based permission granted by RoleA’s trust policy is both necessary and sufficient to allow RoleB to assume RoleA.

Although earlier we summarized this behavior as “implicit self-trust,” the key point here is that the ability of Role A to assume itself is not actually implicit behavior. The role’s self-referential permission had to be explicit in one place or the other (or both): either in the role’s identity-based policy (perhaps based on broad wildcard permissions), or its trust policy. But unlike the case with other principals and role trust, an IAM administrator would have to look in two different policies to determine whether a role could assume itself.

As of today, for any new role, or any role that has not recently assumed itself while relying on the old behavior, IAM administrators must modify the previously shown role trust policy as follows to allow RoleA to assume itself, regardless of the privileges granted by its identity-based policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789012:role/RoleB",
                    "arn:aws:iam::123456789012:role/RoleA"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

This change makes role trust behavior clearer and more consistent to understand and manage, whether directly by humans or as embodied in code.

How might this change impact me?

As previously noted, most customers will not be impacted by the change at all. For those customers who do use the prior implicit trust grant behavior, AWS will work with you to eliminate your usage prior to February 15, 2023. Here are more details for the two cases of customers who have not used the behavior, and those who have.

If you haven’t used the implicit trust behavior since June 30, 2022

Beginning today, if you have not used the old behavior for a given role at any time since June 30, 2022, you will now experience the new behavior. Those existing roles, as well as any new roles, will need an explicit reference in their own trust policy in order to assume themselves. If you have roles that are used only very occasionally, such as once per quarter for a seldom-run batch process, you should identify those roles and if necessary either remove the dependency on the old behavior or update their role trust policies to include the role itself prior to their next usage (see the second sample policy above for an example).

If you have used the implicit trust behavior since June 30, 2022

If you have a role that has used the implicit trust behavior since June 30, 2022, then you will continue to be able to do so with that role until February 15, 2023. AWS will provide you with notice referencing those roles beginning today through your AWS Health Dashboard and will also send an email with the relevant information to the account owner and security contact. We are allowing time for you to make any necessary changes to your existing processes, code, or configurations to prepare for removal of the implicit trust behavior. If you can’t change your processes or code, you can continue to use the behavior by making a configuration change—namely, by updating the relevant role trust policies to reference the role itself. On the other hand, you can opt out of the old behavior at any time by creating a new role with a different Amazon Resource Name (ARN) with the desired identity-based and trust-policy-based permissions and substituting it for any older role that was identified as using the implicit trust behavior. (The new role will not be allow-listed, because the allow list is based on role ARNs.) You can also modify an existing allow-listed role’s trust policy to explicitly deny access to itself. See the “What should I do next?” section for more information.

Notifications and retirement

As we previously noted, starting today, accounts with existing roles that use the implicit self-assume role assumption behavior will be notified of this change by email and through their AWS Health Dashboard. Those roles have been allow-listed, and so for now their behavior will continue as before. After February 15, 2023, the old behavior will be retired for all roles and all accounts. IAM Documentation has been updated to make clear the new behavior.

After the old behavior is retired from the allow-listed roles and accounts, role sessions that make self-referential role assumption calls will fail with an Access Denied error unless the role’s trust policy explicitly grants the permission directly through a role ARN. Another option is to grant permission indirectly through an ARN to the root principal in the trust policy that acts as a delegation of privilege management, after which permission grants in identity-based policies determine access, similar to the typical cross-account case.

Which usage scenarios are likely to be impacted?

Users often attach an IAM role to an Amazon Elastic Compute Cloud (Amazon EC2) instance, an Amazon Elastic Container Service (Amazon ECS) task, or AWS Lambda function. Attaching a role to one of these runtime environments enables workloads to use short-term session credentials based on that role. For example, when an EC2 instance is launched, AWS automatically creates a role session and assigns it to the instance. An AWS best practice is for the workload to use these credentials to issue AWS API calls without explicitly requesting short-term credentials through sts:AssumeRole calls.

However, examples and code snippets commonly available on internet forums and community knowledge sharing sites might incorrectly suggest that workloads need to call sts:AssumeRole to establish short-term sessions credentials for operation within those environments.

We analyzed AWS Security Token Service (AWS STS) service metadata about role self-assumption in order to understand the use cases and possible impact of the change. What the data shows is that in almost all cases this behavior is occurring due to unnecessarily reassuming the role in an Amazon EC2, Amazon ECS, Amazon Elastic Kubernetes Services (EKS), or Lambda runtime environment already provided by the environment. There are two exceptions, discussed at the end of this section under the headings, “self-assumption with a scoped-down policy” and “assuming a target compute role during development.”

There are many variations on this theme, but overall, most role self-assumption occurs in scenarios where the person or code is unnecessarily reassuming the role that the code was already running as. Although this practice and code style can still work with a configuration change (by adding an explicit self-reference to the role trust policy), the better practice will almost always be to remove this unnecessary behavior or code from your AWS environment going forward. By removing this unnecessary behavior, you save CPU, memory, and network resources.

Common mistakes when using Amazon EKS

Some users of the Amazon EKS service (or possibly their shell scripts) use the command line interface (CLI) command aws eks get-token to obtain an authentication token for use in managing a Kubernetes cluster. The command takes as an optional parameter a role ARN. That parameter allows a user to assume another role other than the one they are currently using before they call get-token. However, the CLI cannot call that API without already having an IAM identity. Some users might believe that they need to specify the role ARN of the role they are already using. We have updated the Amazon EKS documentation to make clear that this is not necessary.

Common mistakes when using AWS Lambda

Another example is the use of an sts:AssumeRole API call from a Lambda function. The function is already running in a preassigned role provided by user configuration within the Lambda service, or else it couldn’t successfully call any authenticated API action, including sts:AssumeRole. However, some Lambda functions call sts:AssumeRole with the target role being the very same role that the Lambda function has already been provided as part of its configuration. This call is unnecessary.

AWS Software Development Kits (SDKs) all have support for running in AWS Lambda environments and automatically using the credentials provided in that environment. We have updated the Lambda documentation to make clear that such STS calls are unnecessary.

Common mistakes when using Amazon ECS

Customers can associate an IAM role with an Amazon ECS task to give the task AWS credentials to interact with other AWS resources.

We detected ECS tasks that call sts:AssumeRole on the same role that was provided to the ECS task. Amazon ECS makes the role’s credentials available inside the compute resources of the ECS task, whether on Amazon EC2 or AWS Fargate, and these credentials can be used to access AWS services or resources as the IAM role associated with the ECS talk, without being called through sts:AssumeRole. AWS handles renewing the credentials available on ECS tasks before the credentials expire. AWS STS role assumption calls are unnecessary, because they simply create a new set of the same temporary role session credentials.

AWS SDKs all have support for running in Amazon ECS environments and automatically using the credentials provided in that ECS environment. We have updated the Amazon ECS documentation to make clear that calling sts:AssumeRole for an ECS task is unnecessary.

Common mistakes when using Amazon EC2

Users can configure an Amazon EC2 instance to contain an instance profile. This instance profile defines the IAM role that Amazon EC2 assigns the compute instance when it is launched and begins to run. The role attached to the EC2 instance enables your code to send signed requests to AWS services. Without this attached role, your code would not be able to access your AWS resources (nor would it be able to call sts:AssumeRole). The Amazon EC2 service handles renewing these temporary role session credentials that are assigned to the instance before they expire.

We have observed that workloads running on EC2 instances call sts:AssumeRole to assume the same role that is already associated with the EC2 instance and use the resulting role-session for communication with AWS services. These role assumption calls are unnecessary, because they simply create a new set of the same temporary role session credentials.

AWS SDKs all have support for running in Amazon EC2 environments and automatically using the credentials provided in that EC2 environment. We have updated the Amazon EC2 documentation to make clear that calling sts:AssumeRole for an EC2 instance with a role assigned is unnecessary.

For information on creating an IAM role, attaching that role to an EC2 instance, and launching an instance with an attached role, see “IAM roles for Amazon EC2” in the Amazon EC2 User Guide.

Other common mistakes

If your use case does not use any of these AWS execution environments, you might still experience an impact from this change. We recommend that you examine the roles in your account and identify scenarios where your code (or human use through the AWS CLI) results in a role assuming itself. We provide Amazon Athena and AWS CloudTrail Lake queries later in this post to help you locate instances where a role assumed itself. For each instance, you can evaluate whether a role assuming itself is the right operation for your needs.

Self-assumption with a scoped-down policy

The first pattern we have observed that is not a mistake is the use of self-assumption combined with a scoped-down policy. Some systems use this approach to provide different privileges for different use cases, all using the same underlying role. Customers who choose to continue with this approach can do so by adding the role to its own trust policy. While the use of scoped-down policies and the associated least-privilege approach to permissions is a good idea, we recommend that customers switch to using a second generic role and assume that role along with the scoped-down policy rather than using role self-assumption. This approach provides more clarity in CloudTrail about what is happening, and limits the possible iterations of role assumption to one round, since the second role should not be able to assume the first. Another possible approach in some cases is to limit subsequent assumptions is by using an IAM condition in the role trust policy that is no longer satisfied after the first role assumption. For example, for Lambda functions, this would be done by a condition checking for the presence of the “lambda:SourceFunctionArn” property; for EC2, by checking for presence of “ec2:SourceInstanceARN.”

Assuming an expected target compute role during development

Another possible reason for role self-assumption may result from a development practice in which developers attempt to normalize the roles that their code is running in between scenarios in which role credentials are not automatically provided by the environment, and scenarios where they are. For example, imagine a developer is working on code that she expects to run as a Lambda function, but during development is using her laptop to do some initial testing of the code. In order to provide the same execution role as is expected later in product, the developer might configure the role trust policy to allow assumption by a principal readily available on the laptop (an IAM Identity Center role, for example), and then assume the expected Lambda function execution role when the code is initializing. The same approach could be used on a build and test server. Later, when the code is deployed to Lambda, the actual role is already available and in use, but the code need not be modified in order to provide the same post-role-assumption behavior that existing outside of Lambda: the unmodified code can automatically assume what is in this case the same role, and proceed. While this approach is not illogical, as with the scope-down policy case we recommend that customers configure distinct roles for assumption both in development and test environments as well as later production environments. Again, this approach provides more clarity in CloudTrail about what is happening, and limits the possible iterations of role assumption to one round, since the second role should not be able to assume the first.

What should I do next?

If you receive an email or AWS Health Dashboard notification for an account, we recommend that you review your existing role trust policies and corresponding code. For those roles, you should remove the dependency on the old behavior, or if you can’t, update those role trust policies with an explicit self-referential permission grant. After the grace period expires on February 15, 2023, you will no longer be able to use the implicit self-referential permission grant behavior.

If you currently use the old behavior and need to continue to do so for a short period of time in the context of existing infrastructure as code or other automated processes that create new roles, you can do so by adding the role’s ARN to its own trust policy. We strongly encourage you to treat this as a temporary stop-gap measure, because in almost all cases it should not be necessary for a role to be able to assume itself, and the correct solution is to change the code that results in the unnecessary self-assumption. If for some reason that self-service solution is not sufficient, you can reach out to AWS Support to seek an accommodation of your use case for new roles or accounts.

If you make any necessary code or configuration changes and want to remove roles that are currently allow-listed, you can also ask AWS Support to remove those roles from the allow list so that their behavior follows the new model. Or, as previously noted, you can opt out of the old behavior at any time by creating a new role with a different ARN that has the desired identity-based and trust-policy–based permissions and substituting it for the allow-listed role. Another stop-gap type of option is to add an explicit deny that references the role to its own trust policy.

If you would like to understand better the history of your usage of role self-assumption in a given account or organization, you can follow these instructions on querying CloudTrail data with Athena and then use the following Athena query against your account or organization CloudTrail data, as stored in Amazon Simple Storage Services (Amazon S3). The results of the query can help you understand the scenarios and conditions and code involved. Depending on the size of your CloudTrail logs, you may need to follow the partitioning instructions to query subsets of your CloudTrail logs sequentially. If this query yields no results, the role self-assumption scenario described in this blog post has never occurred within the analyzed CloudTrail dataset.

SELECT eventid, eventtime, userIdentity.sessioncontext.sessionissuer.arn as RoleARN, split_part(userIdentity.principalId, ':', 2) as RoleSessionName from cloudtrail_logs t CROSS JOIN UNNEST(t.resources) unnested (resources_entry) where eventSource = 'sts.amazonaws.com' and eventName = 'AssumeRole' and userIdentity.type = 'AssumedRole' and errorcode IS NULL and substr(userIdentity.sessioncontext.sessionissuer.arn,12) = substr(unnested.resources_entry.ARN,12)

As another option, you can follow these instructions to set up CloudTrail Lake to perform a similar analysis. CloudTrail Lake allows richer, faster queries without the need to partition the data. As of September 20, 2022, CloudTrail Lake now supports import of CloudTrail logs from Amazon S3. This allows you to perform a historical analysis even if you haven’t previously enabled CloudTrail Lake. If this query yields no results, the scenario described in this blog post has never occurred within the analyzed CloudTrail dataset.

SELECT eventid, eventtime, userIdentity.sessioncontext.sessionissuer.arn as RoleARN, userIdentity.principalId as RoleIdColonRoleSessionName from $EDS_ID where eventSource = 'sts.amazonaws.com' and eventName = 'AssumeRole' and userIdentity.type = 'AssumedRole' and errorcode IS NULL and userIdentity.sessioncontext.sessionissuer.arn = element_at(resources,1).arn

Understanding the change: more details

To better understand the background of this change, we need to review the IAM basics of identity-based policies and resource-based policies, and then explain some subtleties and exceptions. You can find additional overview material in the IAM documentation.

The structure of each IAM policy follows the same basic model: one or more statements with an effect (allow or deny), along with principals, actions, resources, and conditions. Although the identity-based and resource-based policies share the same basic syntax and semantics, the former is associated with a principal, the latter with a resource. The main difference between the two is that identity-based policies do not specify the principal, because that information is supplied implicitly by associating the policy with a given principal. On the other hand, resource policies do not specify an arbitrary resource, because at least the primary identifier of the resource (for example, the bucket identifier of an S3 bucket) is supplied implicitly by associating the policy with that resource. Note that an IAM role is the only kind of AWS object that is both a principal and a resource.

In most cases, access to a resource within the same AWS account can be granted by either an identity-based policy or a resource-based policy. Consider an Amazon S3 example. An identity-based policy attached to an IAM principal that allows the s3:GetObject action does not require an equivalent grant in the S3 bucket resource policy. Conversely, an s3:GetObject permission grant in a bucket’s resource policy is all that is needed to allow a principal in the same account to call the API with respect to that bucket; an equivalent identity-based permission is not required. Either the identity-based policy or the resource-based policy can grant the necessary permission. For more information, see IAM policy types: How and when to use them.

However, in order to more tightly govern access to certain security-sensitive resources, such as AWS Key Management Service (AWS KMS) keys and IAM roles, those resource policies need to grant access to the IAM principal explicitly, even within the same AWS account. A role trust policy is the resource policy associated with a role that specifies which IAM principals can assume the role by using one of the sts:AssumeRole* API calls. For example, in order for RoleB to assume RoleA in the same account, whether or not RoleB’s identity-based policy explicitly allows it to assume RoleA, RoleA’s role trust policy must grant access to RoleB. Within the same account, an identity-based permission by itself is not sufficient to allow assumption of a role. On the other hand, a resource-based permission—a grant of access in the role trust policy—is sufficient. (Note that it’s possible to construct a kind of hybrid permission to a role by using both its resource policy and other identity-based policies. In that case, the role trust policy grants permission to the root principal ARN; after that, the identity-based policy of a principal in that account would need to explicitly grant permission to assume that role. This is analogous to the typical cross-account role trust scenario.)

Until now, there has been a nonintuitive exception to these rules for situations where a role assumes itself. Since a role is both a principal (potentially with an identity-based policy) and a resource (with a resource-based policy), it is in the unique position of being both a subject and an object within the IAM system, as well as being an object owned by itself rather than its containing account. Due to this ownership model, roles with identity-based permission to assume themselves implicitly trusted themselves as resources, and vice versa. That is to say, roles that had the privilege as principals to assume themselves implicitly trusted themselves as resources, without an explicit self-referential Allow in the role trust policy. Conversely, a grant of permission in the role trust policy was sufficient regardless of whether there was a grant in the same role’s identity-based policy. Thus, in the self-assumption case, roles behaved like most other resources in the same account: only a single permission was required to allow role self-assumption, either on the identity side or the resource side of their dual-sided nature. Because of a role’s implicit trust of itself as a resource, the role’s trust policy—which might otherwise limit assumption of the role with properties such as actions and conditions—was not applied, unless it contained an explicit deny of itself.

The following example is a role trust policy attached to the role named RoleA in account 123456789012. It grants explicit access only to the role named RoleB.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/RoleB"
            },
            "Action": ["sts:AssumeRole", "sts:TagSession"],
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalTag/project": "BlueSkyProject"
                }
            }
        }
    ]
}

Assuming that the corresponding identity-based policy for RoleA granted the sts:AssumeRole action with regard to RoleA, this role trust policy provided that there were two roles that could assume RoleA: RoleB (explicitly referenced in the trust policy) and RoleA (assuming it was explicitly referenced in its identity policy). RoleB could assume RoleA only if it had the principal tag project:BlueSkyProject because of the trust policy condition. (The sts:TagSession permission is needed here in case tags need to be added by the caller as parted of the RoleAssumption call.) RoleA, on the other hand, did not need to meet that condition because it relied on a different explicit permission—the one granted in the identity-based policy. RoleA would have needed the principal tag project:BlueSkyProject to meet the trust policy condition if and only if it was relying on the trust policy to gain access through the sts:AssumeRole action; that is, in the case where its identity-based policy did not provide the needed privilege.

As we previously noted, after considering feedback from customers on this topic, AWS has decided that requiring self-referential role trust policy grants even in the case where the identity-based policy also grants access is the better approach to delivering consistency and visibility with regard to role behavior and privileges. Therefore, as of today, r­ole assumption behavior requires an explicit self-referential permission in the role trust policy, and the actions and conditions within that policy must also be satisfied, regardless of the permissions expressed in the role’s identity-based policy. (If permissions in the identity-based policy are present, they must also be satisfied.)

Requiring self-reference in the trust policy makes role trust policy evaluation consistent regardless of which role is seeking to assume the role. Improved consistency makes role permissions easier to understand and manage, whether through human inspection or security tooling. This change also eliminates the possibility of continuing the lifetime of an otherwise temporary credential without explicit, trackable grants of permission in trust policies. It also means that trust policy constraints and conditions are enforced consistently, regardless of which principal is assuming the role. Finally, as previously noted, this change allows customers to create and understand role assumption permissions in a single place (the role trust policy) rather than two places (the role trust policy and the role identity policy). It increases the simplicity of role trust permission management: “what you see [in the trust policy] is what you get.”

Continuing with the preceding example, if you need to allow a role to assume itself, you now must update the role trust policy to explicitly allow both RoleB and RoleA. The RoleA trust policy now looks like the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::123456789012:role/RoleB",
                    "arn:aws:iam::123456789012:role/RoleA"
                ]
            },
            "Action": ["sts:AssumeRole", "sts:TagSession"],
            "Condition": {
                "StringEquals": {
					"aws:PrincipalTag/project": "BlueSkyProject"
				}
            }
        }
    ]
}

Without this new principal grant, the role can no longer assume itself. The trust policy conditions are also applied, even if the role still has unconditioned access to itself in its identity-based policy.

Conclusion

In this blog post we’ve reviewed the old and new behavior of role assumption in the case where a role seeks to assume itself. We’ve seen that, according to our analysis of service metadata, the vast majority of role self-assumption behavior that relies solely on identity-based privileges is totally unnecessary, because the code (or human) who calls sts:AssumeRole is already, without realizing it, using the role’s credentials to call the AWS STS API. Eliminating that mistake will improve performance and decrease resource consumption. We’ve also explained in more depth the reasons for the old behavior and the reasons for making the change, and provided Athena and CloudTrail Lake queries that you can use to examine past or (in the case of allow-listed roles) current self-assumption behavior in your own environments. You can reach out to AWS Support or your customer account team if you need help in this effort.

If you currently use the old behavior and need to continue to do so, your primary option is to create an explicit allow for the role in its own trust policy. If that option doesn’t work due to operational constraints, you can reach out to AWS Support to seek an accommodation of your use case for new roles or new accounts. You can also ask AWS Support to remove roles from the allow-list if you want their behavior to follow the new model.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new IAM-tagged discussion on AWS re:Post or contact AWS Support.

AWS would like to thank several customers and partners who highlighted this behavior as something they found surprising and unhelpful, and asked us to consider making this change. We would also like to thank independent security researcher Ryan Gerstenkorn who engaged with AWS on this topic and worked with us prior to this update.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Stephen Whinston

Stephen Whinston

Stephen is a Senior Product Manager with the AWS Identity and Access Management organization. Prior to Amazon, Stephen worked in product management for cloud service and identity management providers. Stephen holds degrees in computer science and an MBA from the University of Colorado Leeds School of Business. Outside of work, Stephen enjoys his family time and the Pacific Northwest.

AWS co-announces release of the Open Cybersecurity Schema Framework (OCSF) project

Post Syndicated from Mark Ryland original https://aws.amazon.com/blogs/security/aws-co-announces-release-of-the-open-cybersecurity-schema-framework-ocsf-project/

In today’s fast-changing security environment, security professionals must continuously monitor, detect, respond to, and mitigate new and existing security issues. To do so, security teams must be able to analyze security-relevant telemetry and log data by using multiple tools, technologies, and vendors. The complex and heterogeneous nature of this task drives up costs and may slow down detection and response times. Our mission is to innovate on behalf of our customers so they can more quickly analyze and protect their environment when the need arises.

With that goal in mind, alongside a number of partner organizations, we’re pleased to announce the release of the Open Cybersecurity Schema Framework (OCSF) project, which includes an open specification for the normalization of security telemetry across a wide range of security products and services, as well as open-source tools that support and accelerate the use of the OCSF schema. As a co-founder of the OCSF effort, we’ve helped create the specifications and tools that are available to all industry vendors, partners, customers, and practitioners. Joining us in this announcement is an array of key security vendors, beginning with Splunk, the co-founder with AWS of the OCSF project, and also including Broadcom, Salesforce, Rapid7, Tanium, Cloudflare, Palo Alto Networks, DTEX, CrowdStrike, IBM Security, JupiterOne, Zscaler, Sumo Logic, IronNet, Securonix, and Trend Micro. Going forward, anyone can participate in the evolution of the specification and tooling at https://github.com/ocsf.

Our customers have told us that interoperability and data normalization between security products is a challenge for them. Security teams have to correlate and unify data across multiple products from different vendors in a range of proprietary formats; that work has a growing cost associated with it. Instead of focusing primarily on detecting and responding to events, security teams spend time normalizing this data as a prerequisite to understanding and response. We believe that use of the OCSF schema will make it easier for security teams to ingest and correlate security log data from different sources, allowing for greater detection accuracy and faster response to security events. We see value in contributing our engineering efforts and also projects, tools, training, and guidelines to help standardize security telemetry across the industry. These efforts benefit our customers and the broader security community.

Although we as an industry can’t directly control the behavior of threat actors, we can improve our collective defenses by making it easier for security teams to do their jobs more efficiently. At AWS, we are excited to see the industry come together to use the OCSF project to make it easier for security professionals to focus on the things that are important to their business: identifying and responding to events, then using that data to proactively improve their security posture.

To learn more about the OCSF project, visit https://github.com/ocsf.

Want more AWS Security news? Follow us on Twitter.

Mark Ryland

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 30 years of experience in the technology industry and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Zero Trust architectures: An AWS perspective

Post Syndicated from Mark Ryland original https://aws.amazon.com/blogs/security/zero-trust-architectures-an-aws-perspective/

Our mission at Amazon Web Services (AWS) is to innovate on behalf of our customers so they have less and less work to do when building, deploying, and rapidly iterating on secure systems. From a security perspective, our customers seek answers to the ongoing question What are the optimal patterns to ensure the right level of confidentiality, integrity, and availability of my systems and data while increasing speed and agility? Increasingly, customers are asking specifically about how security architectural patterns that fall under the banner of Zero Trust architecture or Zero Trust networking might help answer this question.

Given the surge in interest in technology that uses the Zero Trust label, as well as the variety of concepts and models that come under the Zero Trust umbrella, we’d like to provide our perspective. We’ll share our definition and guiding principles for Zero Trust, and then explore the larger subdomains that have emerged under that banner. We’ll also talk about how AWS has woven these principles into the fabric of the AWS cloud since its earliest days, as well as into many recent developments. Finally, we’ll review how AWS can help you on your own Zero Trust journey, focusing on the underlying security objectives that matter most to our customers. Technological approaches rise and fall, but underlying security objectives tend to be relatively stable over time. (A good summary of some of those can be found in the Design Principles of the AWS Well-Architected Framework.)

Definition and guiding principles for Zero Trust

Let’s start out with a general definition. Zero Trust is a conceptual model and an associated set of mechanisms that focus on providing security controls around digital assets that do not solely or fundamentally depend on traditional network controls or network perimeters. The zero in Zero Trust fundamentally refers to diminishing—possibly to zero!—the trust historically created by an actor’s location within a traditional network, whether we think of the actor as a person or a software component. In a Zero Trust world, network-centric trust models are augmented or replaced by other techniques—which we can describe generally as identity-centric controls—to provide equal or better security mechanisms than we had in place previously. Better security mechanisms should be understood broadly to include attributes such as greater usability and flexibility, even if the overall security posture remains the same. Let’s consider more details and possible approaches along the two dimensions.

One dimension is the network. Do we achieve Zero Trust by allowing all network packets to flow between all hosts or endpoints, but implement all security controls above the network layer? Or do we break our systems down into smaller logical components and implement much tighter network segments or packet-level controls—so-called micro-segments or micro-perimeters? Do we add some kind of gateway or proxy technology that enforces a new kind of trust boundary? Do we still use VPN technology for network isolation but make it more dynamic and hidden from the user experience, so that users don’t even notice that network boundaries are being created and torn down as needed? Or some combination of these techniques?

The other dimension is identity and access management. Are we talking about human actors with their PCs, tablets, and phones trying to access web applications? Or are we talking about machine-to-machine, software-to-software communication, where all requests are authenticated and authorized using other kinds of techniques? Or perhaps we’re thinking of some combination of the two. For example, certain security-relevant properties or attributes of the user’s situation—strength of authentication, device type, ownership, posture assessment, health, network location, and others—are propagated to and through the software systems with which the user is interacting, and alter their access dynamically.

Thus, as we start to look more closely at Zero Trust, we can immediately see the possibility of confusion—because many different topics and concepts are implicated—but also a clear indication of opportunities to build better, more flexible, and more secure software systems. What are some of the principles that can help guide us through both the confusion and the opportunities?

Our first guiding principle for Zero Trust is that while the conceptual model decreases reliance on network location, the role of network controls and perimeters remains important to the overall security architecture. In other words, the best security doesn’t come from making a binary choice between identity-centric and network-centric tools, but rather by using both effectively in combination with each other. Identity-centric controls, such as the AWS SigV4 request signing process, which is used to interact with AWS API endpoints, uniquely authenticate and authorize each and every signed API request, and provide very fine-grained access controls. However, network-centric tools such as Amazon Virtual Private Cloud (Amazon VPC), security groups, AWS PrivateLink, and VPC endpoints are straightforward to understand and use, filter unnecessary noise out of the system, and provide excellent guardrails within which identity-centric controls can operate. Ideally, these two kinds of controls should not only coexist, they should be aware of and augment one another. For example, VPC endpoints provide the ability to attach a policy that allows you to write and enforce identity-centric rules at a logical network boundary—in that case, the private network exit from your Amazon VPC on the way to a nearby AWS service endpoint.

Our second guiding principle for Zero Trust is that it can mean different things in different contexts. Arguably one of the key reasons for the ambiguity surrounding Zero Trust is that the term encompasses many different use cases which share only the fundamental technical concept of diminishing the security relevance of a network location or boundary. Yet those use cases differ substantially in what they’re trying to achieve for the organization. As we noted above, common examples of Zero Trust goals range from ensuring workforce agility and mobility—using browsers and mobile apps and the internet to access business systems and applications—to the creation of carefully segmented micro-service architectures inside of new cloud-based applications. By focusing on a specific problem that we’re trying to solve, and approaching it with fresh eyes and new tools, we can avoid getting mired in low-value discussions around whether a new approach to a security challenge is really—or to what degree it is—an application of the Zero Trust concept.

Our third guiding principle is that Zero Trust concepts must be applied in accordance with the organizational value of the system and data being protected. Over time, the application of the Zero Trust conceptual model and associated mechanisms will continue to improve defense in depth, and continue to make security controls we already have work better through the increased visibility and software-defined nature of the cloud. Applied well, the tenets of Zero Trust can significantly raise the security bar, especially for critical workloads. However, if applied in strict orthodoxy, Zero Trust methods can limit the incorporation of more traditional technologies into upgraded or new systems, and stifle innovation by overly taxing organizations where the benefits aren’t commensurate with the effort. For many business systems, network controls and network perimeters will continue to be important and usually adequate controls for a long time, perhaps forever. We believe it’s best to think of Zero Trust concepts as additive to existing security controls and concepts, rather than as replacements.

Examples of Zero Trust principles and capabilities at work today within the AWS cloud

The most prominent example of Zero Trust in AWS is how millions of customers typically interact with AWS every day using the AWS Management Console or securely calling AWS APIs over a diverse set of public and private networks. Whether called via the console, the AWS Command Line Interface (AWS CLI), or software written to the AWS APIs, ultimately all of these methods of interaction reach a set of web services with endpoints that are reachable from the internet. There is absolutely nothing about the security of the AWS API infrastructure that depends on network reachability. Each one of these signed API requests is authenticated and authorized every single time at rates of millions upon millions of requests per second globally. Our customers do so confidently; knowing that the cryptographic strength of the underlying Transport Layer Security (TLS) protocol—augmented by the AWS Signature v4 signing process—properly secures these requests without any regard to the trustworthiness of the underlying network. Interestingly, the use of cloud-based APIs is rarely—if ever—mentioned in Zero Trust discussions. Perhaps this is because AWS led the way with this approach to securing APIs from the start, such that it is now assumed to be a basic part of every cloud security story.

Similarly, but perhaps not as well understood, when individual AWS services need to call each other to operate and deliver their service capabilities, they rely on the same mechanisms that you use as a customer. You can see this in action in the form of service-linked roles. For example, when AWS Auto Scaling determines that it needs to call the Amazon Elastic Compute Cloud (Amazon EC2) API to create or terminate an EC2 instance in your account, the AWS Auto Scaling service assumes the service-linked role you’ve provided in your account, receives the resulting AWS short-term credentials, and uses these credentials to sign requests using the SigV4 process to the appropriate EC2 APIs. On the receiving end, AWS Identity and Access Management (IAM) authenticates and authorizes the incoming calls for EC2. In other words, even though they’re both AWS services, AWS Auto Scaling and EC2 have no inherent trust, network or otherwise, of one another and use strong identity-centric controls as the basis of the security model between the two services as they operate on your behalf. You, the customer, have full visibility into both the privileges that you’re granting to one service, as well as an AWS CloudTrail record of the use of those privileges.

Other great examples of Zero Trust capabilities in the AWS portfolio can be found in the IoT Service. When we launched AWS IoT Core we made a strategic decision—against the prevailing industry norms at the time—to always require TLS network encryption and modern client authentication, including certificate-based mutual TLS, when connecting IoT devices to service endpoints. We subsequently added TLS support to FreeRTOS, enabling modern, secure communication to an entire class of small CPU and small memory devices that were previously assumed to not be capable of it. With AWS IoT Greengrass, we pioneered a way of working with existing no-security devices using a remote gateway that relied on local network presence but also was able to run AWS Lambda functions to validate security and provide a secure proxy to the cloud. These examples highlight where adherence to AWS security standards brought key foundational components of Zero Trust to a technology domain where vast amounts of unauthenticated, unencrypted network messaging over the open internet was previously the norm.

How AWS can help you on your Zero Trust journey

To help you on your own Zero Trust journey, there are a number of AWS cloud-specific identity and networking capabilities that provide core Zero Trust building blocks as standard features. AWS services provide this functionality via simple API calls, without you needing to build, maintain, or operate any infrastructure or additional software components. To help best frame the conversation, we’ll consider these capabilities against the backdrop of three distinct use cases:

  1. Authorizing specific flows between components to eliminate unneeded lateral network mobility.
  2. Enabling friction-free access to internal applications for your workforce.
  3. Securing digital transformation projects such as IoT.

Our first use case focuses mainly on machine-to-machine communications—authorizing specific flows between components to help eliminate lateral network mobility risk. Otherwise put, if two components don’t need to talk to one another across the network, they shouldn’t be able to, even if these systems happen to exist within the same network or network segment. This greatly reduces the overall surface area of the connected systems and eliminates unneeded pathways, particularly those that lead to sensitive data. Within this use case, our discussion should begin with security groups, which have been a part of Amazon EC2 since its earliest days. Security groups provide highly dynamic, software-defined network micro-perimeters for both north-south and east-west traffic. Security group assignments occur automatically as resources come and go, and rules in one security group can reference one another by ID, either within the same Amazon VPC or across larger peered networks in the same or different regions. These properties allow security groups to act as a kind of identity system in which group membership becomes a relevant property for determining whether or not to permit particular network flows. This helps enable you to author extremely granular rules without the associated operational burden of keeping them up-to-date as membership in a group ebbs and flows. Similarly, PrivateLink provides an extremely useful building block in the general space of micro-perimeters and micro-segmentation. Using PrivateLink, a load-balanced endpoint can be exposed as a narrow, one-way gateway between two VPCs, with tight identity-based controls determining who can access the gateway and where incoming packets can land. Initiating network connections in the other direction isn’t allowed at all, and the VPCs don’t even need to have routes between one another. Thousands of customers use PrivateLink today as a fundamental building block of a secure micro-services architecture, as well as secure and private access to PaaS and SaaS services from their suppliers.

Going back to our discussion about AWS APIs, the AWS SigV4 signature process for authenticating and authorizing API requests is no longer just for AWS services. You can achieve the same kind of hardened interface approach using the Amazon API Gateway service, which allows software interfaces to be securely available on the open internet. API Gateway provides distributed denial of service (DDoS) protection, rate limiting, and AWS IAM support as one of several authorization options. When you choose AWS IAM authorization, you author standard IAM policies that define who can call your API and where they can call it from, using the full expressiveness of the IAM policy language. Callers sign their requests using their AWS credentials, typically delivered in the form of IAM roles attached to compute resources, and IAM uniquely authenticates and authorizes every single call to your API according to those policies. With one step, your API is protected behind the massively scaled, super performant, globally available IAM service that protects AWS APIs—with nothing for you to manage or maintain. Calls from the API Gateway front-end to your back-end implementation are secured by mutual TLS, so you’re assured that only API Gateway is able to invoke the back-end implementation. With this strong identity-centric control in place, you have two choices. You can safely place your back-end implementation on the public network, or add the VPC integration model such that the API Gateway call to your back-end implementation running inside of your VPC is protected by an identity-centric control (mutual TLS) and a network-centric control (private connectivity from API Gateway to your code). The security achieved by these feature combinations, arguably only possible in the cloud, makes discussions of east-west concerns seem underwhelming and rooted in constraints of the past.

Our second use case, enabling friction-free access to internal applications for your workforce, is all about improving workforce mobility without compromising security. Traditionally these applications have existed behind a strong VPN front door. However, VPNs can be expensive to scale and aren’t necessarily compatible with the full array of mobile devices that the modern workforce demands. The objective in this case is to make the locks on the individual applications so good that you can eliminate the VPN-based front door. To achieve this, our customers have told us that they want a range of technical solutions to choose from according to their industry, risk tolerance, developer maturity, and other factors. At one end of the spectrum, we have many customers who prefer to use desktop as a serviceAmazon Workspaces—or application as a serviceAmazon AppStream 2.0—models to provide a powerful and flexible pixel proxy approach to Zero Trust. Traditional security controls are applied to those intermediary virtual devices, and then any user with a PC, tablet, or HTML5 client can reach those virtualized desktops or applications over the internet—or behind additional network controls and perimeters, if they so desire—to provide a rich, desktop-like experience without having to worry about the security of the final device in the hands of the user. Similarly, customers have asked for a better way to access their enterprise applications securely from mobile phones without deploying mobile device management or other such often cumbersome and expensive technologies. To meet that requirement, we launched Amazon WorkLink, providing a secure proxy service that renders complex web applications in the AWS cloud. Amazon WorkLink streams only pixels—and a very minimal amount of JavaScript for interactivity—to mobile phones. No sensitive enterprise data is ever stored or cached on the mobile device.

At the other end of the spectrum, we have customers who want to connect their internal web applications directly to the internet. For these customers, the combination of AWS Shield, AWS WAF, and Application Load Balancer with OpenID Connect (OIDC) authentication provides a fully managed identity-aware network protection stack. Shield provides managed DDoS protection services that provide always-on detection and automatic inline mitigations that minimize application downtime and latency. AWS WAF is a web application firewall that lets you monitor and protect web requests before they reach your infrastructure using your desired combination of rule groups provided by AWS, the AWS Marketplace, or your own custom ones. By enabling authentication in Application Load Balancer—beyond the normal load balancing capabilities—you can directly integrate with your existing identity provider (IdP) to offload the work of authenticating users, and to leverage the existing capabilities within your IdP—such as strong authentication, device posture assessment, conditional access, and policy enforcement. Using this combination, your internal custom applications quickly become just as flexible as SaaS applications, allowing your workforce to enjoy the same work-anywhere flexibility as SaaS while unifying your application portfolio under a common security model powered by modern identity standards.

Our third use case—securing digital transformation projects such as IoT—is markedly different from the first two. Consider a connected vehicle, relaying a critical stream of instrumentation over mobile networks and the internet into a cloud based analytics environment for processing and insights. These workloads have always existed entirely outside the traditional enterprise network, and require a security model that accounts for that situation. The family of AWS IoT services provides scalable solutions for issuing unique device identities to every device in your fleet, and then using those identities and their associated access control policies to securely control how they communicate and interact with the cloud. The security of these devices can be easily monitored and maintained with AWS IoT Device Defender, over-the-air software updates, and even entire operating system upgrades—now built in to FreeRTOS—to keep devices safe and secure over time. Moving forward, as more and more IT workloads move closer to the edge to minimize latency and improve user experiences, the prevalence of this use case will continue to expand, even if it isn’t applicable to your business today.

It’s still Day 1

We hope this post has helped communicate our vision for Zero Trust, and highlighted how we believe that our underlying security principles and advancing capabilities represent a bar-raising security model both for the AWS cloud and for the environments that our customers build on top of our services.

At Amazon we obsess over customers and their needs, so our job is never done. We have lots more capabilities we want to build, and lots more guidance still to offer. We look forward to your feedback and to continuing the journey together—reflecting the words and core vision of our founder, Jeff Bezos: “It’s still Day 1.”

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Mark Ryland

Mark is the director of the Office of the CISO for AWS. He has over 29 years of experience in the technology industry and has served in leadership roles in cybersecurity, software engineering, distributed systems, technology standardization and public policy. Previously, he served as the Director of Solution Architecture and Professional Services for the AWS World Public Sector team.

Author

Quint Van Deman

Quint is a Principal Specialist for AWS Identity. In this role, he leads the go-to-market creation and execution for AWS Identity services, field enablement, and strategic customer advisement, and is a company wide subject matter expert on identity, access management, and federation. Before joining the Specialist team, Quint was an early member of the AWS Professional Services team, where he led AWS teams directing several of AWS’ most prominent enterprise customers along their journey to the cloud. Prior to joining AWS, Quint held enterprise architect style roles within a number of mid size organizations and consulting firms, mostly specializing in large scale open source infrastructure.