Tag Archives: AWS Well-Architected

Choosing a Well-Architected CI/CD approach: Open Source on AWS

Post Syndicated from Mikhail Vasilyev original https://aws.amazon.com/blogs/devops/choosing-a-well-architected-ci-cd-approach-open-source-on-aws/


When building a CI/CD platform, it is important to make an informed decision regarding every underlying tool. This post explores evaluating the criteria for selecting each tool focusing on a balance between meeting functional and non-functional requirements, and maximizing value.

Your first decision: source code management.

Source code is potentially your most valuable asset, and so we start by choosing a source code management tool. These tools normally have high non-functional requirements in order to protect your assets and to ensure they are available to the organization when needed. The requirements usually include demand for high durability, high availability (HA), consistently high throughput, and strong security with role-based access controls.

At the same time, source code management tools normally have many specific functional requirements as well. For example, the ability to provide collaborative code review in the UI, flexible and tunable merge policies including both automated and manual gates (code checks), and out-of-box UI-level integrations with numerous other tools. These kinds of integrations can include enabling monitoring, CI, chats, and agile project management.

Many teams also treat source code management tools as their portal into other CI/CD tools. They make them shareable between teams, and might prefer to stay within one single context and user interface throughout the entire DevOps cycle. Many source code management tools are actually a stack of services that support multiple steps of your CI/CD workflows from within a single UI. This makes them an excellent starting point for building your CI/CD platforms.

The first decision your need to make is whether to go with an open source solution for managing code or with AWS-managed solutions, such as AWS CodeCommit. Open source solutions include (but are not limited to) the following: Gerrit, Gitlab, Gogs, and Phabricator.

You decision will be influenced by the amount of benefit your team can gain from the flexibility provided through open source, and how well your team can support deploying and managing these solutions. You will also need to consider the infrastructure and management overhead cost.

Engineering teams that have the capacity to develop their own plugins for their CI/CD platforms, or whom even contribute directly to open source projects, will often prefer open source solutions for the flexibility they provide. This will be especially true if they are fluent in designing and supporting their own cloud infrastructure. If the team gets more value by trading the flexibility of open source for not having to worry about managing infrastructure (especially if High Availability, Scalability, Durability, and Security are more critical) an AWS-managed solution would be a better choice.

Source Code Management Solution

When the choice is made in favor of an open-source code management solution (such as Gitlab), the next decision will be how to architect the deployment. Will the team deploy to a single instance, or design for high availability, durability, and scalability? Teams that want to design Gitlab for HA can use the following guide to proceed: Installing GitLab on Amazon Web Services (AWS)

By adopting AWS services (such as Amazon RDS, Amazon ElastiCache for Redis, and Autoscaling Groups), you can lower the management burden of supporting the underlying infrastructure in this self-managed HA scenario.

High level overview of self-managed HA Gitlab deployment

Your second decision: Continuous Integration engine

Selecting your CI engine, you might be able to benefit from additional features of previously selected solutions. Gitlab provides both source control services, as well as built-in CI tools, called Gitlab CI. Gitlab Runners are responsible for running CI jobs, and the actual jobs are described as YML files stored in Gitlab’s git repository along with product code. For security and performance reasons, GitLab Runners should be on resources separate from your GitLab instance.

You could manage those resources or you could use one of the AWS services that can support deploying and managing Runners. The use of an on-demand service removes the expense of implementing and managing a capability that is undifferentiated heavy lifting for you. This provides cost optimization and enables operational excellence. You pay for what you use and the service team manages the underlying service.

Continuous Integration engine Solution

In an architecture example (below), Gitlab Runners are deployed in containers running on Amazon EKS. The team has less infrastructure to manage, can start focusing on development faster by not having to implement the capability, and can provision resources in an optimal way for their on-demand needs.

To further optimize costs, you can use EC2 Spot Instances for your EKS nodes. CI jobs are normally compute intensive and limited in run time. The runner jobs can easily be restarted on a different resource with little impact. This makes them tolerant of failure and the use of EC2 Spot instances very appealing. Amazon EKS and Spot Instances are supported out-of-box in Gitlab. As a result there is no integration to develop, only configuration is required.

To support infrastructure as code best practices, Runners are deployed with Helm and are stored and versioned as Helm charts. All of the infrastructure as code information used to implement the CI/CD platform itself is stored in templates such as Terraform.

High level overview of Infrastructure as Code on Gitlab and Gitlab CI

High level overview of Infrastructure as Code on Gitlab and Gitlab CI

Your third decision: Container Registry

You will be unable to deploy Runners if the container images are not available. As a result, the primary non-functional requirements for your production container registry are likely to include high availability, durability, transparent scalability, and security. At the same time, your functional requirements for a container registry might be lower. It might be sufficient to have a simple UI, and simple APIs supporting basic flows. Customers looking for a managed solution can use Amazon ECR, which is OCI compliant and supports Helm Charts.

Container Registry Solution

For this set of requirements, the flexibility and feature velocity of open source tools does not provide an advantage. Self-supporting high availability and strengthened security could be costly in implementation time and long-term management. Based on [Blog post 1 Diagram 1], an AWS-managed solution provides cost advantages and has no management overhead. In this case, an AWS-managed solution is a better choice for your container registry than an open-source solution hosted on AWS. In this example, Amazon ECR is selected. Customers who prefer to go with open-source container registries might consider solutions like Harbor.

High level overview of Gitlab CI with Amazon ECR

High level overview of Gitlab CI with Amazon ECR

Additional Considerations

Now that the main services for the CI/CD platform are selected, we will take a high level look at additional important considerations. You need to make sure you have observability into both infrastructure and applications, that backup tools and policies are in place, and that security needs are addressed.

There are many mechanisms to strengthen security including the use of security groups. Use IAM for granular permission control. Robust policies can limit the exposure of your resources and control the flow of traffic. Implement policies to prevent your assets leaving your CI environment inappropriately. To protect sensitive data, such as worker secrets, encrypt these assets while in transit and at rest. Select a key management solution to reduce your operational burden and to support these activities such as AWS Key Management Service (AWS KMS). To deliver secure and compliant application changes rapidly while running operations consistently with automation, implement DevSecOps.

Amazon S3 is durable, secure, and highly available by design making it the preferred choice to store EBS-level backups by many customers. Amazon S3 satisfies the non-functional requirements for a backup store. It also supports versioning and tiered storage classes, making it a cost-effective as well.

Your observability requirements may emphasize versatility and flexibility for application-level monitoring. Using Amazon CloudWatch to monitor your infrastructure and then extending your capabilities through an open-source solutions such as Prometheus may be advantageous. You can get many of the benefits of both open-source Prometheus and AWS services with Amazon Managed Service for Prometheus (AMP). For interactive visualization of metrics, many customers choose solutions such as open-source Grafana, available as an AWS service Amazon Managed Service for Grafana (AMG).

CI/CD Platform with Gitlab and AWS

CI/CD Platform with Gitlab and AWS


We have covered how making informed decisions can maximize value and synergy between open-source solutions on AWS, such as Gitlab, and AWS-managed services, such as Amazon EKS and Amazon ECR. You can find the right balance of open-source tools and AWS services that will meet your functional and non-functional requirements, and help maximizing the value you get from those resources.

Pete Goldberg, Director of Partnerships at GitLab: “When aligning your development process to AWS Well Architected Framework, GitLab allows customers to build and automate processes to achieve Operational Excellence. As a single tool designed to facilitate collaboration across the organization, GitLab simplifies the process to follow the Fully Separated Operating Model where Engineering and Operations come together via automated processes that remove the historical barriers between the groups. This gives organizations the ability to efficiently and rapidly deploy new features and applications that drive the business while providing the risk mitigation and compliance they require. By allowing operations teams to define infrastructure as code in the same tool that the engineering teams are storing application code, and allowing your automation bring those together for your CI/CD workflows companies can move faster while having compliance and controls built-in, providing the entire organization greater transparency. With GitLab’s integrations with different AWS compute options (EC2, Lambda, Fargate, ECS or EKS), customers can choose the best type of compute for the job without sacrificing the controls required to maintain Operational Excellence.”


Author bio

Mikhail is a Solutions Architect for RUS-CIS. Mikhail supports customers on their cloud journeys with Well-architected best practices and adoption of DevOps techniques on AWS. Mikhail is a fan of ChatOps, Open Source on AWS and Operational Excellence design principles.

Building well-architected serverless applications: Regulating inbound request rates – part 1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-regulating-inbound-request-rates-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Reliability question REL1: How do you regulate inbound request rates?

Defining, analyzing, and enforcing inbound request rates helps achieve better throughput. Regulation helps you adapt different scaling mechanisms based on customer demand. By regulating inbound request rates, you can achieve better throughput, and adapt client request submissions to a request rate that your workload can support.

Required practice: Control inbound request rates using throttling

Throttle inbound request rates using steady-rate and burst rate requests

Throttling requests limits the number of requests a client can make during a certain period of time. Throttling allows you to control your API traffic. This helps your backend services maintain their performance and availability levels by limiting the number of requests to actual system throughput.

To prevent your API from being overwhelmed by too many requests, Amazon API Gateway throttles requests to your API. These limits are applied across all clients using the token bucket algorithm. API Gateway sets a limit on a steady-state rate and a burst of request submissions. The algorithm is based on an analogy of filling and emptying a bucket of tokens representing the number of available requests that can be processed.

Each API request removes a token from the bucket. The throttle rate then determines how many requests are allowed per second. The throttle burst determines how many concurrent requests are allowed. I explain the token bucket algorithm in more detail in “Building well-architected serverless applications: Controlling serverless API access – part 2

Token bucket algorithm

Token bucket algorithm

API Gateway limits the steady-state rate and burst requests per second. These are shared across all APIs per Region in an account. For further information on account-level throttling per Region, see the documentation. You can request account-level rate limit increases using the AWS Support Center. For more information, see Amazon API Gateway quotas and important notes.

You can configure your own throttling levels, within the account and Region limits to improve overall performance across all APIs in your account. This restricts the overall request submissions so that they don’t exceed the account-level throttling limits.

You can also configure per-client throttling limits. Usage plans restrict client request submissions to within specified request rates and quotas. These are applied to clients using API keys that are associated with your usage policy as a client identifier. You can add throttling levels per API route, stage, or method that are applied in a specific order.

For more information on API Gateway throttling, see the AWS re:Invent presentation “I didn’t know Amazon API Gateway could do that”.

API Gateway throttling

API Gateway throttling

You can also throttle requests by introducing a buffering layer using Amazon Kinesis Data Stream or Amazon SQS. Kinesis can limit the number of requests at the shard level while SQS can limit at the consumer level. For more information on using SQS as a buffer with Amazon Simple Notification Service (SNS), read “How To: Use SNS and SQS to Distribute and Throttle Events”.

Identify steady-rate and burst rate requests that your workload can sustain at any point in time before performance degraded

Load testing your serverless application allows you to monitor the performance of an application before it is deployed to production. Serverless applications can be simpler to load test, thanks to the automatic scaling built into many of the services. During a load test, you can identify quotas that may act as a limiting factor for the traffic you expect and take action.

Perform load testing for a sustained period of time. Gradually increase the traffic to your API to determine your steady-state rate of requests. Also use a burst strategy with no ramp up to determine the burst rates that your workload can serve without errors or performance degradation. There are a number of AWS Marketplace and AWS Partner Network (APN) solutions available for performance testing, Gatling Frontline, BlazeMeter, and Apica.

In the serverless airline example used in this series, you can run a performance test suite using Gatling, an open source tool.

To deploy the test suite, follow the instructions in the GitHub repository perf-tests directory. Uncomment the deploy.perftest line in the repository Makefile.

Perf-test makefile

Perf-test makefile

Once the file is pushed to GitHub, AWS Amplify Console rebuilds the application, and deploys an AWS CloudFormation stack. You can run the load tests locally, or use an AWS Step Functions state machine to run the setup and Gatling load test simulation.

Performance test using Step Functions

Performance test using Step Functions

The Gatling simulation script uses constantUsersPerSec and rampUsersPerSec to add users for a number of test scenarios. You can use the test to simulate load on the application. Once the tests run, it generates a downloadable report.

Gatling performance results

Gatling performance results

Artillery Community Edition is another open-source tool for testing serverless APIs. You configure the number of requests per second and overall test duration, and it uses a headless Chromium browser to run its test flows. For Artillery, the maximum number of concurrent tests is constrained by your local computing resources and network. To achieve higher throughput, you can use Serverless Artillery, which runs the Artillery package on Lambda functions. As a result, this tool can scale up to a significantly higher number of tests.

For more information on how to use Artillery, see “Load testing a web application’s serverless backend”. This runs tests against APIs in a demo application. For example, one of the tests fetches 50,000 questions per hour. This calls an API Gateway endpoint and tests whether the AWS Lambda function, which queries an Amazon DynamoDB table, can handle the load.

Artillery performance test

Artillery performance test

This is a synchronous API so the performance directly impacts the user’s experience of the application. This test shows that the median response time is 165 ms with a p95 time of 201 ms.

Performance test API results

Performance test API results

Another consideration for API load testing is whether the authentication and authorization service can handle the load. For more information on load testing Amazon Cognito and API Gateway using Step Functions, see “Using serverless to load test Amazon API Gateway with authorization”.

API load testing with authentication and authorization

API load testing with authentication and authorization


Regulating inbound requests helps you adapt different scaling mechanisms based on customer demand. You can achieve better throughput for your workloads and make them more reliable by controlling requests to a rate that your workload can support.

In this post, I cover controlling inbound request rates using throttling. I show how to use throttling to control steady-rate and burst rate requests. I show some solutions for performance testing to identify the request rates that your workload can sustain before performance degradation.

This well-architected question will be continued where I look at using, analyzing, and enforcing API quotas. I cover mechanisms to protect non-scalable resources.

For more serverless learning resources, visit Serverless Land.

Top 5: Featured Architecture Content for June

Post Syndicated from Elyse Lopez original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-for-june/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our top picks from the new and newly updated content we released this month.

1. Taco Bell: Aurora as The Heart of the Menu Middleware and Data Integration Platform for Taco Bell (YouTube)

This episode of the This is My Architecture video series explores how Taco Bell built a serverless data integration platform to create unique menus for more than 7,000 locations. Find out how this menu middleware uses Amazon Aurora combined with several other services, including AWS Amplify, AWS Lambda, Amazon API Gateway, and AWS Step Functions to create a cost-effective, scalable data pipeline.

Image still from This is My Architecture video series

2. Well-Architected IoT Lens Checklist

How do you effectively implement AWS IoT workloads? This IoT Lens Checklist provides insights that we have gathered from real-world case studies. This will help you quickly learn the key design elements of Well-Architected IoT workloads. The checklist also provides recommendations for improvement.

3. Derive Insights from AWS Lake House

This whitepaper provides insights and design patterns for cloud architects, data scientists, and developers. It shows you how a lake house architecture allows you to query data across your data warehouse, data lake, and operational databases. Learn how you can store data in a data lake and use a ring of purpose-built data services to quickly make decisions.

4. AWS Limit Monitor Solution

This AWS Solutions Implementation helps you automatically track resource use and avoid overspending. Managed from a centralized location, this ready-to-deploy solution provides a cost-effective way to stay within service quotas by receiving notifications — even via Slack! — before you reach the limit.

5. Cloud Automation for 5G Networks

Digital services providers (DSPs) around the world are focusing on 5G development as part of upgrading their digital infrastructure. This whitepaper explains how DSPs can use different AWS tools and services to fully automate their 5G network deployment and testing and allow orchestration, closed loop use cases, edge analytics, and more.

Choosing a Well-Architected CI/CD approach: Open-source software and AWS Services

Post Syndicated from Brian Carlson original https://aws.amazon.com/blogs/devops/choosing-well-architected-ci-cd-open-source-software-aws-services/

This series of posts discusses making informed decisions when choosing to implement open-source tools on AWS services, adopt managed AWS services to satisfy the same needs, or use a combination of both.

We look at key considerations for evaluating open-source software and AWS services using the perspectives of a startup company and a mature company as examples. You can use these two different points of view to compare to your own organization. To make this investigation easier we will use Continuous Integration (CI) and Continuous Delivery (CD) capabilities as the target of our investigation.

Startup Company rocket and Mature Company rocket

In two related posts, we follow two AWS customers, Iponweb and BigHat Biosciences, as they share their CI/CD journeys, their perspectives, the decisions they made, and why. To end the series, we explore an example reference architecture showing the benefits AWS provides regardless of your emphasis on open-source tools or managed AWS services.

Why CI/CD?

Until your creations are in the hands of your customers, investment in development has provided no return. The faster valuable changes enter production, the greater positive impact you can have on your customer. In today’s highly competitive world, the ability to frequently and consistently deliver value is a competitive advantage. The Operational Excellence (OE) pillar of the AWS Well-Architected Framework recognizes this impact and focuses on the capabilities of CI/CD in two dedicated sections.

The concepts in CI/CD originate from software engineering but apply equally to any form of content. The goal is to support development, integration, testing, deployment, and delivery to production. For example, making changes to an application, updating your machine learning (ML) models, changing your multimedia assets, or referring to the AWS Well-Architected Framework.

Adopting CI/CD and the best practices from the Operational Excellence pillar can help you address risks in your environment, and limit errors from manual processes. More importantly, they help free your teams from the related manual processes, so they can focus on satisfying customer needs, differentiating your organization, and accelerating the flow of valuable changes into production.

A red question mark sits on a field of chaotically arranged black question marks.

How do you decide what you need?

The first question in the Operational Excellence pillar is about understanding needs and making informed decisions. To help you frame your own decision-making process, we explore key considerations from the perspective of a fictional startup company and a fictional mature company. In our two related posts, we explore these same considerations with Iponweb and BigHat.

The key considerations include:

  • Functional requirements – Providing specific features and capabilities that deliver value to your customers.
  • Non-functional requirements – Enabling the safe, effective, and efficient delivery of the functional requirements. Non-functional requirements include security, reliability, performance, and cost requirements.
    • Without security, you can’t earn customer trust. If your customers can’t trust you, you won’t have customers.
    • Without reliability you aren’t available to serve your customers. If you can’t serve your customers, you won’t have customers.
    • Performance is focused on timely and efficient delivery of value, not delivering as fast as possible.
    • Cost is focused on optimizing the value received for the resources spent (for example, money, time, or effort), not minimizing expense.
  • Operational requirements – Enabling you to effectively and efficiently support, maintain, sustain, and improve the delivery of value to your customers. When you “Design with Ops in Mind,” you’re enabling effective and efficient support for your business outcomes.

These non-feature-related key considerations are why Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization are the five pillars of the AWS Well-Architected Framework.

The startup company

Any startup begins as a small team of inspired people working together to realize the unique solution they believe solves an unsolved problem.

For our fictional small team, everyone knows each other personally and all speak frequently. We share processes and procedures in discussions, and everyone know what needs to be done. Our team members bring their expertise and dedicate it, and the majority of their work time, to delivering our solution. The results of our efforts inform changes we make to support our next iteration.

However, our manual activities are error-prone and inconsistencies exist in the way we do them. Performing these tasks takes time away from delivering our solution. When errors occur, they have the potential to disrupt everyone’s progress.

We have capital available to make some investments. We would prefer to bring in more team members who can contribute directly to developing our solution. We need to iterate faster if we are to achieve a broadly viable product in time to qualify for our next round of funding. We need to decide what investments to make.

  • Goals – Reach the next milestone and secure funding to continue development
  • Needs – Reduce or eliminate the manual processes and associated errors
  • Priority – Rapid iteration
  • CI/CD emphasis – Baseline CI/CD capabilities and non-functional requirements are emphasized over a rich feature set

The mature company

Our second fictional company is a large and mature organization operating in a mature market segment. We’re focused on consistent, quality customer experiences to serve and retain our customers.

Our size limits the personal relationships between our service and development teams. The process to make requests, and the interfaces between teams and their systems, are well documented and understood.

However, the systems we have implemented over time, as needs were identified and addressed, aren’t well documented. Our existing tool chain includes some in-house scripting and both supported and unsupported versions of open-source tools. There are limited opportunities for us to acquire new customers.

When conditions change and new features are desired, we want to be able to rapidly implement and deploy those features as fast as possible. If we can differentiate our services, however briefly, we may be able to win customers away from our competitors. Our other path to improved profitability is to evolve our processes, maximizing integration and efficiencies, and capturing cost reductions.

  • Goals – Differentiate ourselves in the marketplace with desired new features
  • Needs – Address the risks of poorly documented systems and unsupported software
  • Priority – Evolve efficiency
  • CI/CD emphasis – Rich feature set and integrations are emphasized over improving the existing non-functional capabilities

Open-source tools on AWS vs. AWS services

The choice of open-source tools or AWS service is not binary. You can select the combination of solutions that provides the greatest value. You can implement open-source tools for their specific benefits where they outweigh the costs and operational burden, using underlying AWS services like Amazon Elastic Compute Cloud (Amazon EC2) to host them. You can then use AWS managed services, like AWS CodeBuild, for the undifferentiated features you need, without additional cost or operational burden.

A group of people sit around a table discussing the pieces of a puzzle and their ideas.

Feature Set

Our fictional organizations both want to accelerate the flow of beneficial changes into production and are evaluating CI/CD alternatives to support that outcome. Our startup company wants a working solution—basic capabilities, author/code, build, and deploy, so that they can focus on development. Our mature company is seeking every advantage—a rich feature set, extensive opportunities for customization, integration capabilities, and fine-grained control.

Open-source tools

Open-source tools often excel at meeting functional requirements. When a new functionality, capability, or integration is desired, any developer can implement it for themselves, and then contribute their code back to the project. As the user community for an open-source project expands the number of use cases and the features identified grows, so does the number of potential solutions and potential contributors. Developers are using these tools to support their efforts and implement new features that provide value to them.

However, features may be released in unsupported versions and then later added to the supported feature set. Non-functional requirements take time and are less appealing because they don’t typically bring immediate value to the product. Non-functional capabilities may lag behind the feature set.

Consider the following:

  • Open-source tools may have more features and existing integrations to other tools
  • The pace of feature set delivery may be extremely rapid
  • The features delivered are those desired and created by the active members of the community
  • You are free to implement the features your company desires
  • There is no commitment to long-term support for the project or any given feature
  • You can implement open-source tools on multiple cloud providers or on premises
  • If the project is abandoned, you’re responsible for maintaining your implementation

AWS services

AWS services are driven by customer needs. Services and features are supported by dedicated teams. These customer-obsessed teams focus on all customer needs, with security being their top priority. Both functional and non-functional requirements are addressed with an emphasis on enabling customer outcomes while minimizing the effort they expend to achieve them.

Consider the following:

  • The pace of delivery of feature sets is consistent
  • The feature roadmap is driven by customer need and customer requests
  • The AWS service team is dedicated to support of the service
  • AWS services are available on the AWS Cloud and on premises through AWS Outposts

Picture showing symbol of dollar

Cost Optimization

Why are we discussing cost after the feature set? Security and reliability are fundamentally more important. Leadership naturally gravitates to following the operational excellence best practice of evaluating trade-offs. Having looked at the potential benefits from the feature set, the next question is typically, “What is this going to cost?” Leadership defines the priorities and allocates the resources necessary (capital, time, effort). We review cost optimization second so that leadership can make a comparison of the expected benefits between CI/CD investments, and investments in other efforts, so they can make an informed decision.

Our organizations are both cost conscious. Our startup is working with finite capital and time. In contrast, our mature company can plan to make investments over time and budget for the needed capital. Early investment in a robust and feature-rich CI/CD tool chain could provide significant advantages towards the startup’s long-term success, but if the startup fails early, the value of that investment will never be realized. The mature company can afford to realize the value of their investment over time and can make targeted investments to address specific short-term needs.

Open-source tools

Open-source software doesn’t have to be purchased, but there are costs to adopt. Open-source tools require appropriate skills in order to be implemented, and to perform management and maintenance activities. Those skills must be gained through dedicated training of team members, team member self-study, or by hiring new team members with the existing skills. The availability of skilled practitioners of open-source tools varies with how popular a tool is and how long it has had an active community. Loss of skilled team members includes the loss of their institutional knowledge and intimacy with the implementation. Skills must be maintained with changes to the tools and as team members join or leave. Time is required from skilled team members to support management and maintenance activities. If commercial support for the tool is desired, it may be available through third-parties at an additional cost.

The time to value of an open-source implementation includes the time to implement and configure the resources and software. Additional value may be realized through investment of time configuring or implementing desired integrations and capabilities. There may be existing community-supported integrations or capabilities that reduce the level of effort to achieve these.

Consider the following:

  • No cost to acquire the software.
  • The availability of skill practitioners of open-source tools may be lower. Cost (capital and time) to acquire, establish, or maintain skill set may be higher.
  • There is an ongoing cost to maintain the team member skills necessary to support the open-source tools.
  • There is an ongoing cost of time for team members to perform management and maintenance activities.
  • Additional commercial support for open-source tools may be available at additional cost
  • Time to value includes implementation and configuration of resources and the open-source software. There may be more predefined community integrations.

AWS services

AWS services are provided pay-as-you-go with no required upfront costs. As of August 2020, more than 400,000 individuals hold active AWS Certifications, a number that grew more than 85% between August 2019 and August 2020.

Time to value for AWS services is extremely short and limited to the time to instantiate or configure the service for your use. Additional value may be realized through the investment of time configuring or implementing desired integrations. Predefined integrations for AWS services are added as part of the service development roadmap. However, there may be fewer existing integrations to reduce your level of effort.

Consider the following:

  • No cost to acquire the software; AWS services are pay-as-you-go for use.
  • AWS skill sets are broadly available. Cost (capital and time) to acquire, establish, or maintain skill sets may be lower.
  • AWS services are fully managed, and service teams are responsible for the operation of the services.
  • Time to value is limited to the time to instantiate or configure the service. There may be fewer predefined integrations.
  • Additional support for AWS services is available through AWS Support. Cost for support varies based on level of support and your AWS utilization.

Open-source tools on AWS services

Open-source tools on AWS services don’t impact these cost considerations. Migration off of either of these solutions is similarly not differentiated. In either case, you have to invest time in replacing the integrations and customizations you wish to maintain.

Picture showing a checkmark put on security


Both organizations are concerned about reputation and customer trust. They both want to act to protect their information systems and are focusing on confidentiality and integrity of data. They both take security very seriously. Our startup wants to be secure by default and wants to trust the vendor to address vulnerabilities within the service. Our mature company has dedicated resources that focus on security, and the company practices defense in depth across internal organizations.

The startup and the mature company both want to know whether a choice is safe, secure, and can validate the security of their choice. They also want to understand their responsibilities and the shared responsibility model that applies.

Open-source tools

Open-source tools are the product of the contributors and may contain flaws or vulnerabilities. The entire community has access to the code to test and validate. There are frequently many eyes evaluating the security of the tools. A company or individual may perform a validation for themselves. However, there may be limited guidance on secure configurations. Controls in the implementer’s environment may reduce potential risk.

Consider the following:

  • You’re responsible for the security of the open-source software you implement
  • You control the security of your data within your open-source implementation
  • You can validate the security of the code and act as desired

AWS services

AWS service teams make security their highest priority and are able to respond rapidly when flaws are identified. There is robust guidance provided to support configuring AWS services securely.

Consider the following:

  • AWS is responsible for the security of the cloud and the underlying services
  • You are responsible for the security of your data in the cloud and how you configure AWS services
  • You must rely on the AWS service team to validate the security of the code

Open-source tools on AWS services

Open-source tools on AWS services combine these considerations; the customer is responsible for the open-source implementation and the configuration of the AWS services it consumes. AWS is responsible for the security of the AWS Cloud and the managed AWS services.

Picture showing global distribution for redundancy to depict reliability


Everyone wants reliable capabilities. What varies between companies is their appetite for risk, and how much they can tolerate the impact of non-availability. The startup emphasized the need for their systems to be available to support their rapid iterations. The mature company is operating with some existing reliability risks, including unsupported open-source tools and in-house scripts.

The startup and the mature company both want to understand the expected reliability of a choice, meaning what percentage of the time it is expected to be available. They both want to know if a choice is designed for high availability and will remain available even if a portion of the systems fails or is in a degraded state. They both want to understand the durability of their data, how to perform backups of their data, and how to perform recovery in the event of a failure.

Both companies need to determine what is an acceptable outage duration, commonly referred to as a Recovery Time Objective (RTO), and for what quantity of elapsed time it is acceptable to lose transactions (including committing changes), commonly referred to as Recovery Point Objective (RPO). They need to evaluate if they can achieve their RTO and RPO objectives with each of the choices they are considering.

Open-source tools

Open-source reliability is dependent upon the effectiveness of the company’s implementation, the underlying resources supporting the implementation, and the reliability of the open-source software. Open-source tools are the product of the contributors and may or may not incorporate high availability features. Depending on the implementation and tool, there may be a requirement for downtime for specific management or maintenance activities. The ability to support RTO and RPO depends on the teams supporting the company system, the implementation, and the mechanisms implemented for backup and recovery.

Consider the following:

  • You are responsible for implementing your open-source software to satisfy your reliability needs and high availability needs
  • Open-source tools may have downtime requirements to support specific management or maintenance activities
  • You are responsible for defining, implementing, and testing the backup and recovery mechanisms and procedures
  • You are responsible for the satisfaction of your RTO and RPO in the event of a failure of your open-source system

AWS services

AWS services are designed to support customer availability needs. As managed services, the service teams are responsible for maintaining the health of the services.

Consider the following:

Open-source tools on AWS services

Open-source tools on AWS services combine these considerations; the customer is responsible for the open-source implementation (including data durability, backup, and recovery) and the configuration of the AWS services it consumes. AWS is responsible for the health of the AWS Cloud and the managed services.

Picture showing a graph depicting performance measurement


What defines timely and efficient delivery of value varies between our two companies. Each is looking for results before an engineer becomes idled by having to wait for results. The startup iterates rapidly based on the results of each prior iteration. There is limited other activity for our startup engineer to perform before they have to wait on actionable results. Our mature company is more likely to have an outstanding backlog or improvements that can be acted upon while changes moves through the pipeline.

Open-source tools

Open-source performance is defined by the resources upon which it is deployed. Open-source tools that can scale out can dynamically improve their performance when resource constrained. Performance can also be improved by scaling up, which is required when performance is constrained by resources and scaling out isn’t supported. The performance of open-source tools may be constrained by characteristics of how they were implemented in code or the libraries they use. If this is the case, the code is available for community or implementer-created improvements to address the limitation.

Consider the following:

  • You are responsible for managing the performance of your open-source tools
  • The performance of open-source tools may be constrained by the resources they are implemented upon; the code and libraries used; their system, resource, and software configuration; and the code and libraries present within the tools

AWS services

AWS services are designed to be highly scalable. CodeCommit has a highly scalable architecture, and CodeBuild scales up and down dynamically to meet your build volume. CodePipeline allows you to run actions in parallel in order to increase your workflow speeds.

Consider the following:

  • AWS services are fully managed, and service teams are responsible for the performance of the services.
  • AWS services are designed to scale automatically.
  • Your configuration of the services you consume can affect the performance of those services.
  • AWS services quotas exist to prevent unexpected costs. You can make changes to service quotas that may affect performance and costs.

Open-source tools on AWS services

Open-source tools on AWS services combine these considerations; the customer is responsible for the open-source implementation (including the selection and configuration of the AWS Cloud resources) and the configuration of the AWS services it consumes. AWS is responsible for the performance of the AWS Cloud and the managed AWS services.

Picture showing cart-wheels in motion, depicting operations


Our startup company wants to limit its operations burden as much as possible in order to focus on development efforts. Our mature company has an established and robust operations capability. In both cases, they perform the management and maintenance activities necessary to support their needs.

Open-source tools

Open-source tools are supported by their volunteer communities. That support is voluntary, without any obligation or commitment from the users. If either company adopts open-source tools, they’re responsible for the management and maintenance of the system. If they want additional support with an obligation and commitment to support their implementation, third parties may provide commercial support at additional cost.

Consider the following:

  • You are responsible for supporting your implementation.
  • The open-source community may provide volunteer support for the software.
  • There is no commitment to support the software by the open-source community.
  • There may be less documentation, or accepted best practices, available to support open-source tools.
  • Early adoption of open-source tools, or the use of development builds, includes the chance of encountering unidentified edge cases and unanticipated issues.
  • The complexity of an implementation and its integrations may increase the difficulty to support open-source tools. The time to identify contributing factors may be extended by the complexity during an incident. Maintaining a set of skilled team members with deep understanding of your implementation may help mitigate this risk.
  • You may be able to acquire commercial support through a third party.

AWS services

AWS services are committed to providing long-term support for their customers.

Consider the following:

  • There is long-term commitment from AWS to support the service
  • As a managed service, the service team maintains current documentation
  • Additional levels of support are available through AWS Support
  • Support for AWS is available through partners and third parties

Open-source tools on AWS services

Open-source tools on AWS services combine these considerations. The company is responsible for operating the open-source tools (for example, software configuration changes, updates, patching, and responding to faults). AWS is responsible for the operation of the AWS Cloud and the managed AWS services.


In this post, we discussed how to make informed decisions when choosing to implement open-source tools on AWS services, adopt managed AWS services, or use a combination of both. To do so, you must examine your organization and evaluate the benefits and risks.

A magnifying glass is focused on the single red figure in a group of otherwise blue paper figures standing on a white surface.

Examine your organization

You can make an informed decision about the capabilities you adopt. The insight you need can be gained by examining your organization to identify your goals, needs, and priorities, and discovering what your current emphasis is. Ask the following questions:

  • What is your organization trying to accomplish and why?
  • How large is your organization and how is it structured?
  • How are roles and responsibilities distributed across teams?
  • How well defined and understood are your processes and procedures?
  • How do you manage development, testing, delivery, and deployment today?
  • What are the major challenges your organization faces?
  • What are the challenges you face managing development?
  • What problems are you trying to solve with CI/CD tools?
  • What do you want to achieve with CI/CD tools?

Evaluate benefits and risk

Armed with that knowledge, the next step is to explore the trade-offs between open-source options and managed AWS services. Then evaluate the benefits and risks in terms of the key considerations:

  • Features
  • Cost
  • Security
  • Reliability
  • Performance
  • Operations

When asked “What is the correct answer?” the answer should never be “It depends.” We need to change the question to “What is our use case and what are our needs?” The answer will emerge from there.

Make an informed decision

A Well-Architected solution can include open-source tools, AWS Services, or any combination of both! A Well-Architected choice is an informed decision that evaluates trade-offs, balances benefits and risks, satisfies your requirements, and most importantly supports the achievement of your business outcomes.

Read the other posts in this series and take this journey with BigHat Biosciences and Iponweb as they share their perspectives, the decisions they made, and why.


Want to learn more? Check out the following CI/CD and developer tools on AWS:

Continuous integration (CI)
Continuous delivery (CD)
AWS Developer Tools

For more information about the AWS Well-Architected Framework, refer to the following whitepapers:

AWS Well-Architected Framework
AWS Well-Architected Operational Excellence pillar
AWS Well-Architected Security pillar
AWS Well-Architected Reliability pillar
AWS Well-Architected Performance Efficiency pillar
AWS Well-Architected Cost Optimization pillar

The 3 hexagons of the well architected logo appear to the right of the words AWS Well-Architected.

Author bio

portrait photo of Brian Carlson Brian is the global Operational Excellence lead for the AWS Well-Architected program. Formerly the technical lead for an international network, Brian works with customers and partners researching the operations best practices with the greatest positive impact and produces guidance to help you achieve your goals.


Choosing a CI/CD approach: Open Source on AWS, an Iponweb story

Post Syndicated from Mikhail Vasilyev original https://aws.amazon.com/blogs/devops/choosing-a-ci-cd-approach-open-source-on-aws-an-iponweb-story/

Iponweb is a global leader in building programmatic and real-time advertising technology and infrastructure for some of the world’s biggest digital media buyers and sellers. The company develops client-facing products and internal development tools that must be platform agnostic to support spanning across multiple cloud services.

In this post, we explore how Iponweb applied key considerations when choosing a continuous integration, continuous deployment (CI/CD), what they determined to be the right CI/CD approach for them, and review some considerations that may apply to your own business needs. And in the next post, we will dive even deeper into these key considerations.

How did Iponweb decide what they needed?

The first and most important question in designing a Well-Architected approach is: “How do you determine your priorities?” AWS Well-Architected defines the first two best practices to do that as: ”evaluate external customer needs” (Iponweb’s clients) and “evaluate internal customer needs” (Iponweb’s team).

Iponweb started with these two considerations while selecting the strategic toolset. After evaluating their customers’ requirements, the next step was to look at the needs of the Iponweb team. Their priorities included the products and features required, the cost, and the ability to build multi-cloud solutions.

Iponweb is dedicated to operating securely with the reliability and performance to support their customers. Solutions had to satisfy their fundamental requirements in these areas to be considered in their evaluation.

Feature set

Iponweb evaluated available options for the CI tool chain and found that, for their needs, GitLab was the clear winner, differentiated by delivering the greatest number of required features at the best price while being platform agnostic.

AWS had the complete set of tools, services, and best practices to support Iponweb’s goal to establish an open-source, self-hosted CI environment using GitLab. Upon completing their thorough evaluation process, Iponweb selected AWS to implement its CI environment.


Iponweb understood the investment they would be making within their team to leverage and support all the desired features of GitLab. Iponweb evaluated the expertise of its internal teams and factored in ease of integration with supporting services.

They adopted several AWS services that satisfied their undifferentiated needs, which allowed them to remove the operational burden and cost of maintaining their own implementations of various capabilities and features.

Furthermore, the availability of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances provided the opportunity to further manage costs for their CI resource needs and usage patterns.


Iponweb leveraged their existing security control implementations and integration with AWS to support adopting additional AWS services. AWS was responsible for the security of the cloud, including the underlying AWS services. Iponweb was able to focus on secure and effective configurations of those services and secure and effective configuration of their GitLab implementation. This ensured the security of their open-source, self-hosted CI environment.

When setting priorities for the design of a Well-Architected approach, it’s imperative to “manage benefits and risks,” which emphasizes making informed decisions when adopting open source or any tools. Iponweb achieved their best value solution by applying Well-Architected practices in Operational Excellence, Cost Optimization, and Security pillars by leveraging AWS products and services.

Overview of solution

Continuous integration consists of three key processes, each of which AWS supports:

  • Code stage – Iponweb built the centralized Git repository on the GitLab platform on EC2 servers, providing the UI and API to store and manage the code.
  • Test and build stage – They used GitLab as the application layer to manage build and test flows through GitLab Runners (compute workers for CI jobs). This layer is implanted via GitLab in containers, and is deployed and managed by Amazon Elastic Kubernetes Service (Amazon EKS).
  • Publish stageAmazon Elastic Container Registry (Amazon ECR) stores the infrastructure containers for the runners and product containers.

The following diagram illustrates this architecture:

At the core of Iponweb’s CI platform architecture is the open-source GitLab Community Edition.

Implementing the solution

CI jobs are either run regularly or triggered by events such as merge requests. The jobs are described as code in YAML files and are stored and versioned along with the product code itself. Runner versions are published into Amazon ECR and launched as Docker containers in Amazon EKS.

Runner code is stored as Helm charts that help Iponweb package up and manage their large-scale Kubernetes deployments. In addition, Amazon EKS has support for Helm and many other plugins for Kubernetes.

Iponweb developers innovate at a very fast pace, and customize Iponweb’s client solutions in rapid iterations. To address uncertain container registry requirements, Iponweb decided to use Amazon ECR. As a managed service, Amazon ECR eliminates concerns about scaling capacity and management. Integration of GitLab with Amazon EKS and Amazon ECR is provided out of the box through a UI and predefined scripts, with no additional overhead to develop and deploy code or plugins.

Iponweb was able to implement the Well-Architected design principle: “stop continuously estimating its capacity needs.” Enabling them to focus on more strategic development activities. They performed a thorough analysis of each component, looking at the total cost of ownership, including operations and management. In doing so, they implemented the best practice from the Cost Optimization pillar: “How do you evaluate cost when you select services?”

In the Cost Optimization pillar, a key question is “How do you use pricing models to reduce costs?” Iponweb deployed runners in Amazon EKS for precise, granular, and on-demand compute scaling for each CI job. These tasks have short-term capacity needs, so Iponweb benefited from configuring Amazon EKS on Spot Instances, achieving factor price reduction. The EC2 Spot pricing model is most appropriate for their CI resource needs and usage patterns.

To protect their data at rest, Iponweb followed a best practice from the Security pillar: “Implement secure key management.” They used AWS Key Management Service (AWS KMS) to manage secrets for the runners.

To protect the code and artifacts, and to ensure these valuable assets don’t leave the CI environment inappropriately, Iponweb followed best practices in Infrastructure Protection from the Security pillar question, “How do you protect your networks?” Iponweb scrupulously defined the network protection requirements, limiting their exposure by controlling traffic at all layers, and implementing security groups to prevent inappropriate access into and out of their VPC.

Michael Benuhis, CTO at Iponweb, says:

“Iponweb was able to get the best of open-source software and public cloud services by building the continuous integration platform on Amazon Web Services. Open-source tools provided Iponweb platform agnosticism for serving our diverse customer base, while managed Amazon EKS on EC2 Spot Instances eliminated the operational burden of managing our own Kubernetes infrastructure, and with greater cost efficiency.”


Iponweb has satisfied their current needs and aren’t looking for improvement in the short term. They will stay on the free version of GitLab, satisfied for the moment with what they have achieved. They have custom automations in place to synchronize with GitLab and integrate with their existing tools. They like the features provided by the paid version of GitLab, but there isn’t a business case to support an informed decision to upgrade at this time.

They have achieved their goal of using Amazon EKS and Spot under GitLab CI/CD integrated with their existing systems and satisfying their needs.

How to approach threat modeling

Post Syndicated from Darran Boyd original https://aws.amazon.com/blogs/security/how-to-approach-threat-modeling/

In this post, I’ll provide my tips on how to integrate threat modeling into your organization’s application development lifecycle. There are many great guides on how to perform the procedural parts of threat modeling, and I’ll briefly touch on these and their methodologies. However, the main aim of this post is to augment the existing guidance with some additional tips on how to handle the people and process components of your threat modeling approach, which in my experience goes a long way to improving the security outcomes, security ownership, speed to market, and general happiness of all involved. Furthermore, I’ll also provide some guidance specific to when you’re using Amazon Web Services (AWS).

Let’s start with a primer on threat modeling.

Why use threat modeling

IT systems are complex, and are becoming increasingly more complex and capable over time, delivering more business value and increased customer satisfaction and engagement. This means that IT design decisions need to account for an ever-increasing number of use cases, and be made in a way that mitigates potential security threats that may lead to business-impacting outcomes, including unauthorized access to data, denial of service, and resource misuse.

This complexity and number of use-case permutations typically makes it ineffective to use unstructured approaches to find and mitigate threats. Instead, you need a systematic approach to enumerate the potential threats to the workload, and to devise mitigations and prioritize these mitigations to make sure that the limited resources of your organization have the maximum impact in improving the overall security posture of the workload. Threat modeling is designed to provide this systematic approach, with the aim of finding and addressing issues early in the design process, when the mitigations have a low relative cost compared to later in the lifecycle.

The AWS Well-Architected Framework calls out threat modeling as a specific best practice within the Security Pillar, under the area of foundational security, under the question SEC 1: How do you securely operate your workload?:

“Identify and prioritize risks using a threat model: Use a threat model to identify and maintain an up-to-date register of potential threats. Prioritize your threats and adapt your security controls to prevent, detect, and respond. Revisit and maintain this in the context of the evolving security landscape.”

Threat modeling is most effective when done at the workload (or workload feature) level, in order to ensure that all context is available for assessment. AWS Well-Architected defines a workload as:

“A set of components that together deliver business value. The workload is usually the level of detail that business and technology leaders communicate about. Examples of workloads are marketing websites, e-commerce websites, the back-ends for a mobile app, analytic platforms, etc. Workloads vary in levels of architectural complexity, from static websites to architectures with multiple data stores and many components.”

The core steps of threat modeling

In my experience, all threat modeling approaches are similar; at a high level, they follow these broad steps:

  1. Identify assets, actors, entry points, components, use cases, and trust levels, and include these in a design diagram.
  2. Identify a list of threats.
  3. Per threat, identify mitigations, which may include security control implementations.
  4. Create and review a risk matrix to determine if the threat is adequately mitigated.

To go deeper into the general practices associated with these steps, I would suggest that you read the SAFECode Tactical Threat Modeling whitepaper and the Open Web Application Security Project (OWASP) Threat Modeling Cheat Sheet. These guides are great resources for you to consider when adopting a particular approach. They also reference a number of tools and methodologies that are helpful to accelerate the threat modeling process, including creating threat model diagrams with the OWASP Threat Dragon project and determining possible threats with the OWASP Top 10, OWASP Application Security Verification Standard (ASVS) and STRIDE. You may choose to adopt some combination of these, or create your own.

When to do threat modeling

Threat modeling is a design-time activity. It’s typical that during the design phase you would go beyond creating a diagram of your architecture, and that you may also be building in a non-production environment—and these activities are performed to inform and develop your production design. Because threat modeling is a design-time activity, it occurs before code review, code analysis (static or dynamic), and penetration testing; these all come later in the security lifecycle.

Always consider potential threats when designing your workload from the earliest phases—typically when people are still on the whiteboard (whether physical or virtual). Threat modeling should be performed during the design phase of a given workload feature or feature change, as these changes may introduce new threats that require you to update your threat model.

Threat modeling tips

Ultimately, threat modeling requires thought, brainstorming, collaboration, and communication. The aim is to bridge the gap between application development, operations, business, and security. There is no shortcut to success. However, there are things I’ve observed that have meaningful impacts on the adoption and success of threat modeling—I’ll be covering these areas in the following sections.

1. Assemble the right team

Threat modeling is a “team sport,” because it requires the knowledge and skill set of a diverse team where all inputs can be viewed as equal in value. For all listed personas in this section, the suggested mindset is to start from your end-customers’ expectations, and work backwards. Think about what your customers expect from this workload or workload feature, both in terms of its security properties and maintaining a balance of functionality and usability.

I recommend that the following perspectives be covered by the team, noting that a single individual can bring more than one of these perspectives to the table:

The Business persona – First, to keep things grounded, you’ll want someone who represents the business outcomes of the workload or feature that is part of the threat modeling process. This person should have an intimate understanding of the functional and non-functional requirements of the workload—and their job is to make sure that these requirements aren’t unduly impacted by any proposed mitigations to address threats. Meaning that if a proposed security control (that is, mitigation) renders an application requirement unusable or overly degraded, then further work is required to come to the right balance of security and functionality.

The Developer persona – This is someone who understands the current proposed design for the workload feature, and has had the most depth of involvement in the design decisions made to date. They were involved in design brainstorming or whiteboarding sessions leading up to this point, when they would typically have been thinking about threats to the design and possible mitigations to include. In cases where you are not developing your own in-house application (e.g. COTS applications) you would bring in the internal application owner.

The Adversary persona – Next, you need someone to play the role of the adversary. The aim of this persona is to put themselves in the shoes of an attacker, and to critically review the workload design and look for ways to take advantage of a design flaw in the workload to achieve a particular objective (for example, unauthorized sharing of data). The “attacks” they perform are a mental exercise, not actual hands-on-keyboard exploitation. If your organization has a so-called Red Team, then they could be a great fit for this role; if not, you may want to have one or more members of your security operations or engineering team play this role. Or alternately, bring in a third party who is specialized in this area.

The Defender persona – Then, you need someone to play the role of the defender. The aim of this persona is to see the possible “attacks” designed by the adversary persona as potential threats, and to devise security controls that mitigate the threats. This persona also evaluates whether the possible mitigations are reasonably manageable in terms of on-going operational support, monitoring, and incident response.

The AppSec SME persona – The Application Security (AppSec) subject matter expert (SME) persona should be the most familiar with the threat modeling process and discussion moderation methods, and should have a depth of IT security knowledge and experience. Discussion moderation is crucial for the overall exercise process to make sure that the overall objectives of the process are kept on-track, and that the appropriate balance between security and delivery of the customer outcome is maintained. Ultimately, it’s this persona who endorses the threat model and advises the scope of the actions beyond the threat modeling exercise—for example, penetration testing scope.

2. Have a consistent approach

In the earlier section, I listed some of the popular threat modeling approaches, and which one you select is not as important as using it consistently both within and across your teams.

By using a consistent approach and format, teams can move faster and estimate effort more accurately. Individuals can learn from examples, by looking at threat models developed by other team members or other teams—saving them from having to start from scratch.

When your team estimates the effort and time required to create a threat model, the experience and time taken from previous threat models can be used to provide more accurate estimations of delivery timelines.

Beyond using a consistent approach and format, consistency in the granularity and relevance of the threats being modeled is key. Later in this post I describe a recommendation for creating a catalog of threats for reuse across your organization.

Finally, and importantly, this approach allows for scalability: if a given workload feature that’s undergoing a threat modeling exercise is using components that have an existing threat model, then the threat model (or individual security controls) of those components can be reused. With this approach, you can effectively take a dependency on a component’s existing threat model, and build on that model, eliminating re-work.

3. Align to the software delivery methodology

Your application development teams already have a particular workflow and delivery style. These days, Agile-style delivery is most popular. Ensure that the approach you take for threat modeling integrates well with both your delivery methodology and your tools.

Just like for any other deliverable, capture the user stories related to threat modeling as part of the workload feature’s sprint, epic, or backlog.

4. Use existing workflow tooling

Your application development teams are already using a suite of tools to support their delivery methodology. This would typically include collaboration tools for documentation (for example, a team wiki), and an issue-tracking tool to track work products through the software development lifecycle. Aim to use these same tools as part of your security review and threat modeling workflow.

Existing workflow tools can provide a single place to provide and view feedback, assign actions, and view the overall status of the threat modeling deliverables of the workload feature. Being part of the workflow reduces the friction of getting the project done and allows threat modeling to become as commonplace as unit testing, QA testing, or other typical steps of the workflow.

By using typical workflow tools, team members working on creating and reviewing the threat model can work asynchronously. For example, when the threat model reviewer adds feedback, the author is notified, and then the author can address the feedback when they have time, without having to set aside dedicated time for a meeting. Also, this allows the AppSec SME to more effectively work across multiple threat model reviews that they may be engaged in.

Having a consistent approach and language as described earlier is an important prerequisite to make this asynchronous process feasible, so that each participant can read and understand the threat model without having to re-learn the correct interpretation each time.

5. Break the workload down into smaller parts

It’s advisable to decompose (break down) the workload into features and perform the threat modeling exercise at the feature level, rather than create a single threat model for an entire workload. This approach has a number of key benefits:

  1. Having smaller chunks of work allows more granular tracking of progress, which aligns well with development teams that are following Agile-style delivery, and gives leadership a constant view of progress.
  2. This approach tends to create threat models that are more detailed, which results in more findings being identified.
  3. Decomposing also opens up the opportunity for the threat model to be reused as a dependency for other workload features that use the same components.
  4. By considering threat mitigations for each component, at the overall workload level this means that a single threat may have multiple mitigations, resulting in an improved resilience against those threats.
  5. Issues with a single threat model, for example a critical threat which is not yet mitigated, does not become launch blocking for the entire workload, but rather just for the individual feature.

The question then becomes, how far should you decompose the workload?

As a general rule, in order to create a threat model, the following context is required, at a minimum:

  • One asset. For example, credentials, customer records, and so on.
  • One entry point. For example, Amazon API Gateway REST API deployment.
  • Two components. For example, a web browser and an API Gateway REST API; or API Gateway and an AWS Lambda function.

Creating a threat model for a given AWS service (for example, API Gateway) in isolation wouldn’t fully meet this criteria—given that the service is a single component, there is no movement of the data from one component to another. Furthermore, the context of all the possible use cases of the service within a workload isn’t known, so you can’t comprehensively derive the threats and mitigations. AWS performs threat modeling of the multiple features that make up a given AWS service. Therefore, for your workload feature that leverages a given AWS service, you wouldn’t need to threat model the AWS service, but instead consider the various AWS service configuration options and your own workload-specific mitigations when you look to mitigate the threats you’ve identified. I go into more depth on this in the “Identify and evaluate mitigations” section, where I go into the concept of baseline security controls.

6. Distribute ownership

Having a central person or department responsible for creation of threat models doesn’t work in the long run. These central entities become bottlenecks and can only scale up with additional head count. Furthermore, centralized ownership doesn’t empower those who are actually designing and implementing your workload features.

Instead, what scales well is distributed ownership of threat model creation by the team that is responsible for designing and implementing each workload feature. Distributed ownership scales and drives behavior change, because now the application teams are in control, and importantly they’re taking security learnings from the threat modeling process and putting those learnings into their next feature release, and therefore constantly improving the security of their workload and features.

This creates the opportunity for the AppSec SME (or team) to effectively play the moderator and security advisor role to all the various application teams in your organization. The AppSec SME will be in a position to drive consistency, adoption, and communication, and to set and raise the security bar among teams.

7. Identify entry points

When you look to identify entry points for AWS services that are components within your overall threat model, it’s important to understand that, depending on the type of AWS service, the entry points may vary based on the architecture of the workload feature included in the scope of the threat model.

For example, with Amazon Simple Storage Service (Amazon S3), the possible types of entry-points to an S3 bucket are limited to what is exposed through the Amazon S3 API, and the service doesn’t offer the capability for you, as a customer, to create additional types of entry points. In this Amazon S3 example, as a customer you make choices about how these existing types of endpoints are exposed—including whether the bucket is private or publicly accessible.

On the other end of the spectrum, Amazon Elastic Compute Cloud (Amazon EC2) allows customers to create additional types of entry-points to EC2 instances (for example, your application API), besides the entry-point types that are provided by the Amazon EC2 API and those native to the operating system running on the EC2 instance (for example, SSH or RDP).

Therefore, make sure that you’re applying the entry points that are specific to the workload feature, in additional to the native endpoints for AWS services, as part of your threat model.

8. Come up with threats

Your aim here is to try to come up with answers to the question “What can go wrong?” There isn’t any canonical list that lists all the possible threats, because determining threats depends on the context of the workload feature that’s under assessment, and the types of threats that are unique to a given industry, geographical area, and so on.

Coming up with threats requires brainstorming. The brainstorming exercise can be facilitated by using a mnemonic like STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege), or by looking through threat lists like the OWASP Top 10 or HiTrust Threat Catalog to get the ideas flowing.

Through this process, it’s recommended that you develop and contribute to a threat catalog that is contextual to your organization and will accelerate the brainstorming process going forward, as well as drive consistency in the granularity of threats that you model.

9. Identify and evaluate mitigations

Here, your aim is to identify the mitigations (security controls) within the workload design and evaluate whether threats have been adequately addressed. Keep in mind that there are typically multiple layers of controls and multiple responsibilities at play.

For your own in-house applications and code, you would want to review the mitigations you’ve included in your design—including, but not limited to, input validation, authentication, session handling, and bounds handling.

Consider all other components of your workload (for example, software as a service (SaaS), infrastructure supporting your COTS applications, or components hosted within your on-premises data centers) and determine the security controls that are part of the workload design.

When you use AWS services, Security and Compliance is a shared responsibility between AWS and you as our customer. This is described on the AWS Shared Responsibility Model page.

This means, for the portions of the AWS services that you’re using that are the responsibility of AWS (Security of the Cloud), the security controls are managed by AWS, along with threat identification and mitigation. The distribution of responsibility between AWS (Security of the Cloud) and you (Security in the Cloud) depends on which AWS service you use. Below, I provide examples of infrastructure, container, and abstracted AWS services to show how your responsibility for identifying and mitigating threats can vary:

  • Amazon EC2 is a good example of an infrastructure service, where you are able to access a virtual server in the cloud, you get to choose the operating system, and you have control of the service and all aspects you run on it—so you would be responsible for mitigating the identified threats.
  • Amazon Relational Database Service (Amazon RDS) is a representative example of a container service, where there is no operating system exposed for you, and instead AWS exposes the selected database engine to you (for example, MySQL). AWS is responsible for the security of the operating system in this example, and you don’t need to devise mitigations. However, the database engine is under your control as well as all aspects above it, so you would need to consider mitigations for these areas. Here, AWS is taking on a larger portion of the responsibility compared to infrastructure services.
  • Amazon S3, AWS Key Management Service (AWS KMS), and Amazon DynamoDB are examples of an abstracted service where AWS exposes the entire service control plane and data plane to you through the service API. Again, here there are no operating systems, database engines, or platforms exposed to you—these are an AWS responsibility. However, the API actions and associated policies are under your control and so are all aspects above the API level, so you should be considering mitigations for these. For this type of service, AWS takes a larger portion of responsibility compared to container and infrastructure types of services.

While these examples do not encompass all types of AWS services that may be in your workload, they demonstrate how your Security and Compliance responsibilities under the Shared Responsibility Model will vary in this context. Understanding the balance of responsibilities between AWS and yourself for the types of AWS services in your workload helps you scope your threat modeling exercise to the mitigations that are under your control, which are typically a combination of AWS service configuration options and your own workload-specific mitigations. For the AWS portion of the responsibility, you will find that AWS services are in-scope of many compliance programs, and the audit reports are available for download for AWS customers (at no cost) from AWS Artifact.

Regardless of which AWS services you’re using, there’s always an element of customer responsibility, and this should be included in your workload threat model.

Specifically, for security control mitigations for the AWS services themselves, you’d want to consider security controls across domains, including these domains: Identity and Access Management (Authentication/Authorization), Data Protection (At-Rest, In-Transit), Infrastructure Security, and Logging and Monitoring. AWS services each have a dedicated security chapter in the documentation, which provides guidance on the security controls to consider as mitigations. When capturing these security controls and mitigations in your threat model, you should aim to include references to the actual code, IAM policies, and AWS CloudFormation templates located in the workload’s infrastructure-as-code repository, and so on. This helps the reviewer or approver of your threat model to get an unambiguous view of the intended mitigation.

As in the case for threat identification, there’s no canonical list enumerating all the possible security controls. Through the process described here, you should consciously develop baseline security controls that align to your organization’s control objectives, and where possible, implement these baseline security controls as platform-level controls, including AWS service-level configurations (for example, encryption at rest) or guardrails (for example, through service control policies). By doing this, you can drive consistency and scale, so that these implemented baseline security controls are automatically inherited and enforced for other workload features that you design and deploy.

When you come up with the baseline security controls, it’s important to note that the context of a given workload feature isn’t known. Therefore, it’s advisable to consider these controls as a negotiable baseline that you can deviate from, provided that when you perform the workload threat modeling exercise, you find that the threat that the baseline control was designed to mitigate isn’t applicable, or there are other mitigations or compensating controls that adequately mitigate the threat. Compensating controls and mitigating factors could include: reduced data asset classification, non-human access, or ephemeral data/workload.

To learn more about how to start thinking about baseline security controls as part of your overall cloud security governance, have a look at the How to think about cloud security governance blog post.

10. Decide when enough is enough

There’s no perfect answer to this question. However, it’s important to have a risk-based perspective on the threat modeling process to create a balanced approach, so that the likelihood and impact of a risk are appropriately considered. Over-emphasis on “let’s build and ship it” could lead to significant costs and delays later. Conversely, over-emphasis on “let’s mitigate every conceivable threat” could lead to the workload feature shipping late (or never), and your customers might move on. In the recommendation I made earlier in the “Assemble the right team” section, the selection of personas is deliberate to make sure that there’s a natural tension between shipping the feature, and mitigating threats. Embrace this healthy tension.

11. Don’t let paralysis stop you before you start

Earlier in the “Break the workload down into smaller parts” section, I gave the recommendation that you should scope your threat models down to a workload feature. You may be thinking to yourself, “We’ve already shipped <X number> of features, how do we threat model those?” This is a completely reasonable question.

My view is that rather than go back to threat model features that are already live, aim to threat model any new features that you are working on now and improve the security properties of the code you ship next, and for each feature you ship after that. During this process you, your team, and your organization will learn—not just about threat modeling—but how to communicate effectively with one another. Make adjustments, iterate, improve. Sometime in the future, when you’re routinely providing high quality, consistent and reusable threat models for your new features, you can start putting activities to perform threat modeling for existing features into your backlog.


Threat modeling is an investment—in my view, it’s a good one, because finding and mitigating threats in the design phase of your workload feature can reduce the relative cost of mitigation, compared to finding the threats later. Consistently implementing threat modeling will likely also improve your security posture over time.

I’ve shared my observations and tips for practical ways to incorporate threat modeling into your organization, which center around communication, collaboration, and human-led expertise to find and address threats that your end customer expects. Armed with these tips, I encourage you to look across the workload features you’re working on now (or have in your backlog) and decide which ones will be the first you’ll threat model.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Boyd author photo

Darran Boyd

Darran is a Principal Security Solutions Architect at AWS, responsible for helping customers make good security choices and accelerating their journey to the AWS Cloud. Darran’s focus and passion is to deliver strategic security initiatives that unlock and enable our customers at scale.

New – SaaS Lens in AWS Well-Architected Tool

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-saas-lens-in-aws-well-architected-tool/

To help you build secure, high-performing, resilient, and efficient solutions on AWS, in 2015 we publicly launched the AWS Well-Architected Framework. It started as a single whitepaper but has expanded to include domain-specific lenses, hands-on labs, and the AWS Well-Architected Tool (available at no cost in the AWS Management Console) that provides a mechanism for regularly evaluating your workloads, identifying high risk issues, and recording your improvements.

To offer more workload-specific advice, in 2017 we extended the framework with the concept of “lens” to go beyond a general perspective and enter specific technology domains. Now, to help accelerate building Software-as-a-Service (SaaS) solutions, the AWS SaaS Factory team has led an effort to build a new AWS Well-Architected SaaS Lens.

SaaS is a licensing and delivery model by which software is centrally managed and hosted by a provider and available to customers on a subscription basis. In this way, software providers can innovate rapidly, optimize their costs, and gain operational efficiencies. At the same time, customers benefit from simplified IT management, speed, and a pay-for-what-you-use business model.

The Well-Architected SaaS Lens adds questions to the tool that are tailored to SaaS workloads and intended to drive critical thinking for developing and operating SaaS workloads. Each question has a list of best practices, and each best practice has a list of improvement plans to help guide you in implementing them. AWS Solution Architects from the AWS SaaS Factory Program, having worked with thousands of software developers and AWS Partners, view these well-architected patterns as a key component of building and operating a SaaS architecture on AWS.

Using the SaaS Lens in the Well-Architected Tool
In the Well-Architected Tool console, I start by defining my workload. Today, I’m reviewing a pre-production environment of a SaaS application. It’s just a minimum viable product (MVP) version of what I want to build, with just enough features to be usable and get a first feedback.

Now, I can choose which lenses to apply. The AWS Well-Architected Framework is there by default. I select the SaaS Lens. This is adding a set of additional questions that help me understand how to design, deploy, and architect my SaaS workload following the framework best practices. Other lenses are available in the tool, for example the Serverless Lens described here.

Now, I start my review. Many questions in the SaaS Lens are focused on how you are managing a multi-tenant application. This is the first question for the Operational Excellence pillar. I can also add some notes to explain my answer better or take note of what I want to improve.

I don’t need to answer all questions to start improving my SaaS application. For example, this is the improvement plan based on my answer to the previous question. For each point here, I can click and get more information on how to implement that on AWS.

Moving to the Reliability pillar, I feel more confident because of the techniques I used to separate individual tenants of my SaaS application in their own “sandbox” environment.

As I expect, no risks are detected this time!

When I finish reviewing the SaaS Lens for my workload, I get an overview of the detected risks. Here, I can also save a milestone that I can use later to compare my status and estimate my improvements.

Just below that, I get a suggestion on what to focus on next. Again, I can click and get in-depth suggestion on how to mitigate the risk.

As often happens in IT services, this is an iterative process. The AWS Well-Architected Tool helps quantify the risks and gives me a path to follow to continuously improve my SaaS application.

Available Now
The SaaS Lens is available today in all regions where the AWS Well-Architected Tool is offered, as described in the AWS Regional Services List. It can be applied to existing workloads, or used for new workloads you define in the tool.

There are no costs in using the AWS Well-Architected Tool; you can use it to improve the application you are working on, or to get visibility into multiple workloads used by the department or area you are working with.

Learn more about the new SaaS Lens and get started today with the AWS Well-Architected Tool!


Techniques for writing least privilege IAM policies

Post Syndicated from Ben Potter original https://aws.amazon.com/blogs/security/techniques-for-writing-least-privilege-iam-policies/

In this post, I’m going to share two techniques I’ve used to write least privilege AWS Identity and Access Management (IAM) policies. If you’re not familiar with IAM policy structure, I highly recommend you read understanding how IAM works and policies and permissions.

Least privilege is a principle of granting only the permissions required to complete a task. Least privilege is also one of many Amazon Web Services (AWS) Well-Architected best practices that can help you build securely in the cloud. For example, if you have an Amazon Elastic Compute Cloud (Amazon EC2) instance that needs to access an Amazon Simple Storage Service (Amazon S3) bucket to get configuration data, you should only allow read access to the specific S3 bucket that contains the relevant data.

There are a number of ways to grant access to different types of resources, as some resources support both resource-based policies and IAM policies. This blog post will focus on demonstrating how you can use IAM policies to grant restrictive permissions to IAM principals to meet least privilege standards.

In AWS, an IAM principal can be a user, role, or group. These identities start with no permissions and you add permissions using a policy. In AWS, there are different types of policies that are used for different reasons. In this blog, I only give examples for identity-based policies that attach to IAM principals to grant permissions to an identity. You can create and attach multiple identity-based policies to your IAM principals, and you can reuse them across your AWS accounts. There are two types of managed policies. Customer managed policies are created and managed by you, the customer. AWS managed policies are provided as examples, cannot be modified, but can be copied, enhanced, and saved as Customer managed policies. The main elements of a policy statement are:

  • Effect: Specifies whether the statement will Allow or Deny an action.
  • Action: Describes a specific action or actions that will either be allowed or denied to run based on the Effect entered. API actions are unique to each service. For example, s3:ListBuckets is an Amazon S3 service API action that enables an IAM Principal to list all S3 buckets in the same account.
  • NotAction: Can be used as an alternative to using Action. This element will allow an IAM principal to invoke all API actions to a specific AWS service except those actions specified in this list.
  • Resource: Specifies the resources—for example, an S3 bucket or objects—that the policy applies to in Amazon Resource Name (ARN) format.
  • NotResource: Can be used instead of the Resource element to explicitly match every AWS resource except those specified.
  • Condition: Allows you to build expressions to match the condition keys and values in the policy against keys and values in the request context sent by the IAM principal. Condition keys can be service-specific or global. A global condition key can be used with any service. For example, a key of aws:CurrentTime can be used to allow access based on date and time.

Starting with the visual editor

The visual editor is my default starting place for building policies as I like the wizard and seeing all available services, actions, and conditions without looking at the documentation. If there is a complex policy with many services, I often look at the AWS managed policies as a starting place for the actions that are required, then use the visual editor to fine tune and check the resources and conditions.

The policy I’m going to walk you through creating is to grant an AWS Lambda function permission to get specific objects from Amazon S3, and put items in a specific table in Amazon DynamoDB. You can access the visual editor when you choose Create policy under policies in the IAM console, or add policies when viewing a role, group, or user as shown in Figure 1. If you’re not familiar with creating policies, you can follow the full instructions in the IAM documentation.

Figure 1: Use the visual editor to create a policy

Figure 1: Use the visual editor to create a policy

Begin by choosing the first service—S3—to grant access to as shown in Figure 2. You can only choose one service at a time, so you’ll need to add DynamoDB after.

Figure 2: Select S3 service

Figure 2: Select S3 service

Now you will see a list of access levels with the option to manually add actions. Expand the read access level to show all read actions that are supported by the Amazon S3 service. You can now see all read access level actions. For getting an object, check the box for GetObject. Selecting the ? next to an action expands information including a description, supported resource types, and supported condition keys as shown in Figure 3.

Figure 3: Expand Read in Access level, select GetObject, and select the ? next to GetObject

Figure 3: Expand Read in Access level, select GetObject, and select the ? next to GetObject

Expand Resources, you will see that the visual editor has listed object as that is the only resource supported by the GetObject action as shown in Figure 4.

Figure 4: Expand Resources

Figure 4: Expand Resources

Select Add ARN, which opens a dialogue to help you specify the ARN for the objects. Enter a bucket name—such as doc-example-bucket—and then the object name. For the object name you can use a wildcard (*) as a suffix. For example, to allow objects beginning with alpha you would enter alpha*. This is an important step. For this least privileged policy, you are restricting to a specific bucket, and an object prefix. You could even specify an individual object depending on your use case.

Figure 5: Enter bucket name and object name

Figure 5: Enter bucket name and object name

If you have multiple ARNs (bucket and objects) to allow, you can repeat the step.

Figure 6: ARN added for S3 object

Figure 6: ARN added for S3 object

The final step is to expand the request conditions, and choose Add condition. The Add request condition dialogue will open. Select the drop down next to Condition key to list the global condition keys, then the service level condition keys are listed after. You’ll see that there’s an s3:ExistingObjectTag condition that—as the name suggests—matches an existing object tag. You can use this condition key to allow the GetObject request only when the object tag meets your condition. That means you can tag your objects with a specific tag key and value pair, and your policy condition must match this key-value pair to allow the action to execute. When you’re using condition keys with multiple keys or values, you can use condition operators and evaluation logic. As shown in Figure 7, tag-key is entered directly below the condition key. This is the key of the tag to match. For the Operator, select StringEquals to match the tag exactly. Checking If exists tests at least one member of the set of request values, and at least one member of the set of condition key values. The Value to enter is the actual tag value: tag-value as shown in figure 7.

Figure 7: ARN added for S3 object

Figure 7: ARN added for S3 object

That’s it for adding the S3 action, as shown in figure 8.

Figure 8: S3 GetObject action with resource and conditions configured

Figure 8: S3 GetObject action with resource and conditions configured

Now you need to add the DynamoDB permissions by selecting Add additional permissions. Select Choose a service and then select DynamoDB. For actions, expand the Write access level, then choose PutItem.

Figure 9: Choose write access level

Figure 9: Choose write access level

Expand Resources and then select Add ARN. The dialogue that appears will help you build the ARN just like it did for the Amazon S3 service. Enter the Region, for example the ap-southeast-2 (Sydney) Region, the account ID, and the table name. Choosing Add will add the resource ARN to your policy.

Figure 10: Enter Region, account, and table name

Figure 10: Enter Region, account, and table name

Now it’s time to add conditions. Expand Request conditions and then choose Add condition.

There are many DynamoDB conditions that you could use, however you can choose dynamodb:LeadingKeys to represent the first key, or partition keys in a table. You can see from the documentation that a qualifier of For all values in request is recommend. For the Operator you can use StringEquals as your string is going to exactly match, then a Value can use a prefix with wildcard, such as alpha* as shown in figure 11.

Figure 11: Add request conditions

Figure 11: Add request conditions

Choosing Add will take you back to the main visual editor where you can choose Review policy to continue. Enter a name and description for the policy, and then choose Create policy.

You can now attach it to a role to test.

You can see in this example that a policy can use least privilege by using specific resources and conditions. Note that sometimes when you use the AWS Management Console, it requires additional permissions to provide information for the console experience.

Starting with AWS managed policies

AWS managed policies can be a good starting place to see the actions typically associated with a particular service or job function. For example, you can attach the AmazonS3ReadOnlyAccess policy to a role used by an Amazon EC2 instance that allows read-only access to all Amazon S3 buckets. It has an effect of Allow to allow access, and there are two actions that use wildcards (*) to allow all Get and List actions for S3—for example, s3:GetObject and s3:ListBuckets. The resource is a wildcard to allow all S3 buckets the account has access to. A useful feature of this policy is that it only allows read and list access to S3, but not to any other services or types of actions.

Let’s make our own custom IAM policy to make it least privilege. Starting with the action element, you can use the reference for Amazon S3 to see all actions, a description of what each action does, the resource type for each action, and condition keys for each action. Now let’s imagine this policy is used by an Amazon EC2 instance to fetch an application configuration object from within an S3 bucket. Looking at the descriptions for actions starting with Get you can see that the only action that we really need is GetObject. You can then use the resource element to restrict an action to a set of objects prefixed with config within a specific bucket.

         "Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3::: <doc-example-bucket>/<config*>"

Now that you’ve reduced the scope of what this policy can do for service actions and resources, you can add a condition element that uses attribute based access control (ABAC) to define conditions based on attributes—in this case, a resource tag. In this example, when you’re reading objects from a single bucket, you can set specific conditions to further reduce the scope of permissions given to an IAM principal. There’s an s3:ExistingObjectTag condition that you can use to allow the GetObject request only when the object tag meets your condition. That means you can tag your objects with a specific tag key and value pair, and your IAM policy condition must match this key-value pair to allow the API action to successfully run. When you’re using condition keys with multiple keys or values, you can use condition operators and evaluation logic. You can see that ForAnyValue tests at least one member of the set of request values, and at least one member of the set of condition key values. Alternatively, you can use global condition keys that apply to all services:

         "Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::<doc-example-bucket>/<config*>",
         "Condition": {
                "ForAnyValue:StringEquals": {
                    "s3:ExistingObjectTag/<tag-key>": "<tag-value>"

In the preceding policy example, the condition element only allows s3:GetObject permissions if the object is tagged with a key of tag-key and a value of tag-value. While you’re experimenting, you can identify errors in your custom policies by using the IAM policy simulator or reviewing the errors messages recorded in AWS CloudTrail logs.


In this post, I’ve shown two different techniques that you can use to create least privilege policies for IAM. You can adapt these methods to create AWS Single Sign-On permission sets and AWS Organizations service control policies (SCPs). Starting with managed policies is a useful strategy when an AWS supplied managed policy already exists for your use case, and then to reduce the scope of what it can do through permissions. I tend to use the visual editor the most for editing policies because it saves looking up the resource and conditions for each action. I suggest that you start by reviewing the policies you’re already using. Start with policies that grant excessive permissions—like the example Administrator policy—and tie them back to the use case of the users or things that need the access. Use the last accessed information, IAM best practices, and look at the AWS Well-Architected best practices and AWS Well-Architected tool.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Ben Potter

Ben is the global security leader for the AWS Well-Architected Framework and is responsible for sharing best practices in security with customers and partners. Ben is also an ambassador for the No More Ransom initiative helping fight cyber crime with Europol, McAfee, and law enforcement across the globe. You can learn more about him in this interview.

How to think about cloud security governance

Post Syndicated from Paul Hawkins original https://aws.amazon.com/blogs/security/how-to-think-about-cloud-security-governance/

When customers first move to the cloud, their instinct might be to build a cloud security governance model based on one or more regulatory frameworks that are relevant to their industry. Although this can be a helpful first step, it’s also critically important that organizations understand what the control objectives for their workloads should be.

In this post, we discuss what you need to do both organizationally and technically with Amazon Web Services (AWS) to build an efficient and effective governance model. People who are taking their first steps in cloud can use this post to guide their thinking. It can also act as useful context for folks who have been running in the cloud for a while to evaluate their current governance approach.

But before you can build that model, it’s important to understand what governance is and to consider why you need it. Governance is how an organization ensures the consistent application of policies across all teams. The best way to implement consistent governance is by codifying as much of the process as possible. Security governance in particular is used to support business objectives by defining policies and controls to manage risk.

Moving to the cloud provides you with an opportunity to deliver features faster, react to the changing world in a more agile way, and return some decision making to the hands of the people closest to the business. In this fast-paced environment, it’s important to have a way to maintain consistency, scaleability, and security. This is where a strong governance model helps.

Creating the right governance model for your organization may seem like a complex task, but it doesn’t have to be.


Many customers use a standard framework that’s relevant to their industry to inform their decision-making process. Some frameworks that are commonly used to develop a security governance model include: NIST Cybersecurity Framework (CSF), Information Security Registered Assessors Program (IRAP), Payment Card Industry Data Security Standard (PCI DSS), or ISO/IEC 27001:2013

Some of these standards provide requirements that are specific to a particular regulator, or region and others are more widely applicable—you should choose one that fits the needs of your organization.

While frameworks are useful to set the context for a security program and give guidance on governance models, you shouldn’t build either one only to check boxes on a particular standard. It’s critical that you should build for security first and then use the compliance standards as a way to demonstrate that you’re doing the right things.

Control objectives

After you’ve selected a framework to use, the next considerations are controls. A control is a technical- or process-based implementation that’s designed to ensure that the likelihood or consequences of an identified risk are reduced to a level that’s acceptable to the organization’s risk appetite. Examples of controls include firewalls, logging mechanisms, access management tools, and many more.

Controls will evolve over time; sometimes they do so very quickly in the early stages of cloud adoption. During this rapid evolution, it’s easy to focus purely on the implementation of a control rather than the objective of it. However, if you want to build a robust and useful governance model, you must not lose sight of control objectives.

Consider the example of the firewall. When you use a firewall, you implement a control. The objective is to make sure that only traffic that should reach your environment is able to reach it. Although a firewall is one way to meet this objective, you can achieve the same outcome with a layered approach using Amazon Virtual Private Cloud (Amazon VPC) Security Groups, AWS WAF and Amazon VPC network access control lists (ACLs). Splitting the control implementation into multiple places can enable workload owners to have greater flexibility in how they configure resources while the baseline posture is delivered automatically.

Not all areas of a business necessarily have the same cloud maturity level, or use the same methods to deploy or run workloads. As a security architect, your job is to help those different parts of the business deliver outcomes in the way that is appropriate for their maturity or particular workload.

The best way to help drive this goal is for the security part of your organization to clearly communicate the necessary control objectives. As a security architect, it’s easier to have a discussion about the things that need tweaking in an application if the objectives are well communicated. It is much harder if the workload owner doesn’t know they have to meet certain security expectations.

What is the job of security?

At AWS, we talk to customers across a range of industries. One thing that consistently comes up in conversation is how to help customers understand the role of their security team in a distributed cloud-aware environment. The answer is always the same: we as security people are here to help the business deploy and run applications securely. Our job is to guide and educate the rest of the organization on the best way to meet the business objectives while meeting the security, risk, and compliance requirements.

So how do you do this?

Technology and culture are both important to an organization’s security posture, and they enable each other. AWS is a good example of an organization that has a strong culture of security ownership. One thing that all customers can take away from AWS: security is everyone’s job. When you understand that, it becomes easier to build the mechanisms that make the configuration and operation of appropriate security control objectives a reality.

The cloud environment that you build goes a long way to achieving this goal in two key ways. First, it provides guardrails and automated guidance for people building on the platform. Second, it allows solutions to scale.

One of the challenges organizations encounter is that there are more developers than there are security people. The traditional approach of point-in-time risk and control assessments performed by a human looking at an architecture diagram doesn’t scale. You need a way to scale that knowledge and capability without increasing the number of people. The best way to achieve this is to codify as much as possible, early in the build and release process.

One way to do this is to run the AWS platform as a product in its own right. Team members should be able to submit feature requests, and there should be metrics on the features that are enabled through the platform. The more security capability that teams building workloads can inherit from the platform, the less they have to implement at the workload level and the more time they can spend on product features. There will always be some security control objectives that can only be delivered by specific configuration at the workload level; this should build on top of what’s inherited from the cloud platform. Your security team and the other teams need to work together to make sure that the capabilities provided by the cloud platform are available to help people build and release securely.

One part of the governance model that we like to highlight is the concept of platform onboarding. The idea of this part of the governance model is to quickly and consistently get to a baseline set of controls that enable you to use a service safely in a particular environment. A good example here is to give developers access to evaluate a service in an experimentation account. To support this process, you don’t want to spend a long time building controls for every possible outcome. The best approach is to take advantage of the foundational controls that are delivered by the cloud platform as the starting point. Things like federation, logging, and service control policies can be used to provide guard rails that enable you to use services quickly. When the services are being evaluated, your security team can work together with your business to define more specific controls that make sense for the actual use cases.

AWS Well-Architected Framework

The cloud platform you use is the foundation of many of the security controls. These guard rails of federation, logging, service control polices, and automated response apply to workloads of all types. The security pillar in the AWS Well-Architected Framework builds on other risk management and compliance frameworks, provides you with best practices, and helps you to evaluate your architectures. These best practices are a great place to look for what you should do when building in the cloud. The categories—identity and access management, detection, infrastructure protection, data protection, and incident response—align with the most important areas to focus on when you build in AWS.

For example, identity is a foundational control in a cloud environment. One of the AWS Well-Architected security best practices is “Rely on a centralized identity provider.” You can use AWS Single Sign-On (AWS SSO) for this purpose or an equivalent centralized mechanism. If you centralize your identity provider, you can perform identity lifecycle management on users, provide them with access to only the resources that are required, and support users who move between teams. This can apply across the multiple AWS accounts in your AWS environment. AWS Organizations uses service control policies to enable you to use a subset of AWS services in particular environments; this is an identity-centric way of providing guard rails.

In addition to federating users, it’s important to enable logging and monitoring services across your environment. This allows you to generate an event when something unexpected happens, such as a user trying to call AWS Key Management Service (AWS KMS) to decrypt data that they should have access to. Securely storing logs means that you can perform investigations to determine the causes of any issues you might encounter. AWS customers who use Amazon GuardDuty and AWS CloudTrail, and have a set of AWS Config rules enabled, have access to security monitoring and logging capabilities as they build their applications.

The layer cake model

When you think about cloud security, we find it useful to use the layer cake as a good mental model. The base of the cake is the understanding of the below-the-line capability that AWS provides. This includes self-serving the compliance documentation from AWS Artifact and understanding the AWS shared responsibility model.

The middle of the cake is the foundational controls, including those described previously in this post. This is the most important layer, because it’s where the most controls are and therefore where the most value is for the security team. You could describe it as the “solve it once, consume it many times” layer.

The top of the cake is the application-specific layer. This layer includes things that are more context dependent, such as the correct control objectives for a certain type of application or data classification. The work in the middle layer helps support this layer, because the middle layer provides the mechanisms that make it easier to automatically deliver the top layer capability.

The middle and top layers are not just technology layers. They also include the people and process parts of the equation. The technology is just there to support the processes.

One thing to be aware of is that you shouldn’t try to define every possible control for a service before you allow your business to use the service. Make use of the various environments in your organization—experimenting, development, testing, and production—to get the services in the hands of developers as quickly as possible with the minimum guardrails to avoid accidental misconfiguration. Then, use the time when the services are being assessed to collaborate with the developers on control implementation. Control implementations can then be rolled into the middle layer of the cake, and the services can be adopted by other parts of the business.

This is also the ideal time to apply practical threat modelling techniques so you can understand what threats and risks you must address. Working with your business to define recommended implementation patterns also helps provide context for how services are typically used. This means you can focus on the controls that are most relevant.

The architecture, platform, or cloud center of excellence (CoE) teams can help at this stage. They can likely make a quick determination of whether an AWS service fits in with your organization’s architectural direction. This quick triage helps the security team focus their efforts in helping get services safely in the hands of the business without being seen as blocking adoption. A good mechanism for streamlining the use of new services is to make sure the backlog is well communicated, typically on a platform team wiki. This helps the security and non-security parts of your organization prioritize their time on services that deliver the most business value. A consistent development approach means that the services that are used are probably being used in more places across the organization. This helps your organization get the benefits of scale as consistent approaches to control implementation are replicated between teams.

Simplicity, metrics, and culture

The world moves fast. You can’t just define a security posture and control objectives, and then walk away. New services are launched that make it easier to do more complex things, business priorities change, and the threat landscape evolves. How do you keep up with all of it?

The answer is a combination of simplicity, metrics, and culture.

Simplicity is hard, but useful. For example, if you have 100 application teams all building in a different way, you have a large number of different configurations that you must ensure are sensibly defined. Ideally, you do this programmatically, which means that the work to define and maintain that set of security controls is significant. If you have 100 application teams using only 10 main patterns, it’s easier to build controls. This has the added benefit of reducing the complexity at the operations end, which applies to both the day-to-day operations and to incident responses. Simplification of your control environment means that your monitoring is less complex, troubleshooting is easier, and people have time to focus on the development of new controls or processes.

Metrics are important because you can make informed decisions based on data. A good example of the usefulness of metrics is patching. Patching is one of the easiest ways to improve your security posture. Having metrics on patch age, presented where this information is most important in your environment, enables you to focus on the most valuable areas. For example, infrastructure on your edge is more important to keep patched than infrastructure that is behind multiple layers of controls. You should patch everything, but you need to make it easy for application teams to do so as part of their build and release cycles. Exposing metrics to teams and leadership helps your organization learn from high performing areas in the business. These could be teams that are regularly meeting the patching expectations or have low instances of needing to remediate penetration testing findings. Metrics and data about your control effectiveness enables you to provide assurance internally and externally that you’re meeting your control objectives.

This brings us to culture. Security as an enabler is something that we think is the most important concept to take away from this post. You must build capabilities that enable people in your organization to have the secure configuration or design choice be the easiest option. This is the role of security. You should also make sure that, when there are problems, your security team works with the business to help everyone learn the cause and improve for next time.

AWS has a culture that uses trouble ticketing for everything. If our employees think they have a security problem, we tell them to open a ticket; if they’re not sure that they have a security problem, we tell them to open a ticket anyway to get guidance. This kind of culture encourages people to communicate and help means so we can identify and fix issues early. Issues that aren’t as severe thought can be downgraded quickly. This culture of ticketing gives us data to inform what we build, which helps people be more secure. You can get started with a system like this in your own environment, or look to extend the capability if you’ve already started.

Take our recommendation to turn on GuardDuty across all your accounts. We recommend that the resulting high and medium alerts are sent to a ticketing system. Look at how you resolve those issues and use that to prioritize the next two weeks of work. Now you can build automation to fix the issues and, more importantly, build to prevent the issues from happening in the first place. Ask yourself, “What information did I need to diagnose the problem?” Then, build automation to enrich the findings so your tickets have that context. Iterate on the automation to understand the context. For example, you may want to include information to show whether the environment is production or non-production.

Note that having production-like controls in non-production environments means that you reduce the chance of deployment failures. It also gets teams used to working within the security guardrails. This increased rigor earlier on in the process, and helps your change management team, too.


It doesn’t matter what security frameworks or standards you use to inform your business, and you might not even align with a particular industry standard. What does matter is building a governance model that empowers the people in your organization to consistently make good security decisions and provides the capability for your security team to enable this to happen. To get started or continue to evolve your governance model, follow the AWS Well-Architected security best practices. Then, make sure that the platform you implement helps you deliver the foundational security control objectives so that your business can spend more of its time on the business logic and security configuration that is specific to its workloads.

The technology and governance choices you make are the first step in building a positive security culture. Security is everyone’s job, and it’s key to make sure that your platform, automation, and metrics support making that job easy.

The areas of focus we’ve talked about in this post are what allow security to be an enabler for business and to ultimately help you better help your customers and earn their trust with everything you do.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Paul Hawkins

Paul helps customers of all sizes understand how to think about cloud security so they can build the technology and culture where security is a business enabler. He takes an optimistic approach to security and believes that getting the foundations right is the key to improving your security posture.


Maddie Bacon

Maddie (she/her) is a technical writer for AWS Security with a passion for creating meaningful content. She previously worked as a security reporter and editor at TechTarget and has a BA in Mathematics. In her spare time, she enjoys reading, traveling, and all things Harry Potter.