Tag Archives: AWS Well-Architected

Architecting for sustainability: a re:Invent 2021 recap

2022-04-29 Margaret O'Toole

Post Syndicated from Margaret O'Toole original https://aws.amazon.com/blogs/architecture/architecting-for-sustainability-a-reinvent-2021-recap/

At AWS re:Invent 2021, we announced the AWS Well-Architected Sustainability Pillar, which marks sustainability as a key pillar of building workloads to best deliver on business need. In session ARC325 – Architecting for Sustainability, Adrian Cockcroft, Steffen Grunwald, and Drew Engelson (Director of Engineering at Starbucks) gave a detailed explanation of what to expect from the Sustainability Pillar whitepaper and how Starbucks has applied AWS best practices to their workloads.

AWS is committed to building a sustainable business for our customers and the planet. Amazon co-founded The Climate Pledge in 2019, committing to reach net-zero carbon by the Year 2040 — 10 years ahead of the Paris Agreement. As part of reaching our commitment, Amazon is on a path to powering our operations with 100% renewable energy by 2025 — 5 years ahead of our original target of 2030.

In 2020, we became the world’s largest corporate purchaser of renewable energy, reaching 65% renewable energy across our business. Amazon also continues to invest in renewable energy projects paired with energy storage: the energy storage systems allow Amazon to store clean energy produced by its solar projects and deploy it when solar energy is unavailable, such as in the evening hours, or during periods of high demand. This strengthens the climate impact of Amazon’s clean energy portfolio by enabling carbon-free electricity throughout more parts of the day.

As customers architect solutions with AWS services to fulfill their business needs, they also influence the sustainability of a workload. AWS had published five pillars in the AWS Well-Architected Framework that capture best practices for operational excellence, security, reliability, performance, and cost of workloads. These pillars have supported the business need for faster time-to-value for features and the delivery of products.

In light of the climate crisis, we have a new challenge to help businesses optimize their application architectures and workloads for sustainability. With sustainability as the sixth pillar in the AWS Well-Architected Framework, builders have guidance to optimize their workloads for energy reduction and efficiency improvement across all components of a workload.

How Starbucks is reducing their footprint

Starbucks has set very ambitious sustainability goals to reduce the environmental impact of their business. As Drew Engelson mentioned in his presentation, there was a gap between these major goals and how members of the technology teams could participate in this mission. Drew decided to evaluate the systems his team was responsible for and initiated an internal framework for understanding the environmental impact of the systems.

Sustainability proxy metrics based on service usage, as demonstrated in the AWS Well-Architected Lab for Sustainability, complemented AWS customer carbon footprint tool data to help the team drive reductions. The goal was to snap a baseline and identify areas for further optimization. The Starbucks team applied the Well-Architected Sustainability Pillar best practices after performing an AWS Well-Architected review early in the process.

Through working with AWS, Starbucks saw that from 2019 to 2020, the actual carbon footprint of their AWS workloads was reduced by approximately 32%, despite tremendous business growth during that same period. The customer carbon footprint tool indicates that these systems’ carbon footprint was further reduced by 50% in subsequent quarters.

Optimizing beyond cost

The Starbucks team already optimized their workloads for cost efficiency, leading to high energy and resource efficiency. Starbucks relies heavily on Kubernetes, which allows them to densely pack their services onto the underlying infrastructure, yielding very high utilization. Binary protocols, like gRPC, are used for efficient communication between microservices. This also cuts down on the data transfer between different networks. Many of their services are written in efficient Scala code, which adds another layer of optimization to the workload.

By taking a data-driven approach, Drew and his team at Starbucks also were able to review scaling thresholds. Based on data, they were able to adjust the auto scaling curve much more closely and smoothly, matching the actual demand curve and reducing the resources they needed to provision.

Drew’s team went beyond the initial optimization cost-benefits to sustainability of the whole stack. They identified workloads suitable for Amazon EC2 Spot Instances to leverage unused, on-demand capacity and increase the overall resource utilization of the cloud. The Starbucks team is starting to examine the impact of their mobile application to end-user devices, and they are taking steps to reduce the downstream impacts: for example, considering the size of client-side scripts, devices’ CPU usage, and color scheme (dark mode reduces the energy required for certain display types).

Starbucks takes optimization for sustainability seriously—so seriously that they coined the term “TCO₂”, which highlights the importance climate impact when measuring Total Costs of Ownership (Figure). Drew raised an important question for application teams and architects that frequently make tradeoffs for cost benefits:

If carbon was the thing we’re optimizing for, would we make different choices?

Drew Engelson, Director of Engineering at Starbucks, discussing "TCO2"

Figure. Drew Engelson, Director of Engineering at Starbucks, discussing “TCO₂“

Get started, the sustainable way

At AWS, we encourage architects to build solutions with sustainability in mind. If your team wants to get started with the concepts from the AWS Well-Architected Sustainability Pillar, conduct a review of your workload in the Well-Architected Tool, or check out the AWS Well-Architected Framework and AWS Well-Architected Labs to learn more!

Optimize AI/ML workloads for sustainability: Part 3, deployment and monitoring

2022-04-27 Benoit de Chateauvieux

Post Syndicated from Benoit de Chateauvieux original https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-3-deployment-and-monitoring/

We’re celebrating Earth Day 2022 from 4/22 through 4/29 with posts that highlight how to build, maintain, and refine your workloads for sustainability.

AWS estimates that inference (the process of using a trained machine learning [ML] algorithm to make a prediction) makes up 90 percent of the cost of an ML model. Given with AWS you pay for what you use, we estimate that inference also generally equates to most of the resource usage within an ML lifecycle.

In this series, we’re following the phases of the Well-Architected machine learning lifecycle (Figure 1) to optimize your artificial intelligence (AI)/ML workloads. In Part 3, our final piece in the series, we show you how to reduce the environmental impact of your ML workload once your model is in production.

If you missed the first parts of this series, in Part 1, we showed you how to examine your workload to help you 1) evaluate the impact of your workload, 2) identify alternatives to training your own model, and 3) optimize data processing. In Part 2, we identified ways to reduce the environmental impact of developing, training, and tuning ML models.

Figure 1. ML lifecycle

Deployment

Select sustainable AWS Regions

As mentioned in Part 1, select an AWS Region with sustainable energy sources. When regulations and legal aspects allow, choose Regions near Amazon renewable energy projects and Regions where the grid has low published carbon intensity to deploy your model.

Align SLAs with sustainability goals

Define SLAs that support your sustainability goals while meeting your business requirements:

If your users can tolerate some latency, deploy your model on asynchronous endpoints to reduce resources that are idle between tasks and minimize the impact of load spikes. Asynchronous endpoints will automatically scale the instance count to zero when there are no requests to process, so you only maintain an inference infrastructure when your endpoint is processing requests.
If your workload doesn’t require high availability, deploy it to a single Availability Zone to reduce the cloud resources you consume. Adjusting availability is an example of a conscious trade off you can make to meet your sustainability targets.
When you don’t need real-time inference, use Amazon SageMaker batch transform. Unlike persistent endpoints, clusters are decommissioned when batch transform jobs finish so you don’t continuously maintain an inference infrastructure.

Use efficient silicon

For CPU-based ML inference, use AWS Graviton3. These processors offer the best performance per watt in Amazon Elastic Compute Cloud (Amazon EC2). They use up to 60% less energy than comparable EC2 instances. Graviton3 processors deliver up to three times better performance compared to Graviton2 processors for ML workloads, and they support bfloat16.

For deep learning workloads, the Amazon EC2 Inf1 instances (based on custom designed AWS Inferentia chips) deliver 2.3 times higher throughput and 80% lower cost compared to g4dn instances. Inf1 has 50% higher performance per watt than g4dn, which makes it the most sustainable ML accelerator Amazon EC2 offers.

Make efficient use of GPU

Use Amazon Elastic Inference to attach just the right amount of GPU-powered inference acceleration to any EC2 or SageMaker instance type or Amazon Elastic Container Service (Amazon ECS) task.

While training jobs batch process hundreds of data samples in parallel, inference jobs usually process a single input in real time, and thus consume a small amount of GPU compute. Elastic Inference allows you to reduce the cost and environmental impact of your inference by using GPU resources more efficiently.

Optimize models for inference

Improve efficiency of your models by compiling them into optimized forms with the following:

Various open-source libraries (like Treelite for decision tree ensembles)
Third-party tools like Hugging Face Infinity, which allows you to speed up transformer models and run inference not only on GPU but also on CPU.
SageMaker Neo’s runtime consumes as little as one-tenth the footprint of a deep learning framework and optimizes models to perform up to 25 time faster with no loss in accuracy (example with XGBoost).

Deploying more efficient models means you need fewer resources for inference.

Deploy multiple models behind a single endpoint

SageMaker provides three methods to deploy multiple models to a single endpoint to improve endpoint utilization:

Host multiple models in one container behind one endpoint. Multi-model endpoints are served using a single container. This can help you cut up to 90 percent of your inference costs and carbon emissions.
Host multiple models that use different containers behind one endpoint.
Host a linear sequence of containers in an inference pipeline behind a single endpoint.

Sharing endpoint resources is more sustainable and less expensive than deploying a single model behind one endpoint.

Right-size your inference environment

Right-size your endpoints by using metrics from Amazon CloudWatch or by using the Amazon SageMaker Inference Recommender. This tool can run load testing jobs and recommend the proper instance type to host your model. When you use the appropriate instance type, you limit the carbon emission associated with over-provisioning.

If your workload has intermittent or unpredictable traffic, configure autoscaling inference endpoints in SageMaker to optimize your endpoints. Autoscaling monitors your endpoints and dynamically adjusts their capacity to maintain steady and predictable performance using as few resources as possible. You can also try Serverless Inference (in preview), which automatically launches compute resources and scales them in and out depending on traffic, which eliminates idle resources.

Consider inference at the edge

When working on Internet of Things (IoT) use cases, evaluate if ML inference at the edge can reduce the carbon footprint of your workload. To do this, consider factors like the compute capacity of your devices, their energy consumption, or the emissions related to data transfer to the cloud. When deploying ML models to edge devices, consider using SageMaker Edge Manager, which integrates with SageMaker Neo and AWS IoT Greengrass (Figure 2).

Figure 2. Run inference at the edge with SageMaker Edge

Device manufacturing represents 32-57 percent of the global Information Communication Technology carbon footprint. If your ML model is optimized, it requires less compute resources. You can then perform inference on lower specification machines, which minimizes the environmental impact of the device manufacturing and uses less energy.

The following techniques compress the size of models for deployment, which speeds up inference and saves energy without significant loss of accuracy:

Pruning removes weights (learnable parameters) that don’t contribute much to the model.
Quantization represents numbers with the low-bit integers without incurring significant loss in accuracy. Specifically, you can reduce resource usage by replacing the parameters in an inference model with half-precision (16 bit), bfloat16 (16 bit, but the same dynamic range as 32 bit), or 8-bit integers instead of the usual single-precision floating-point (32 bit) values.

Archive or delete unnecessary artifacts

Compress and reduce the volume of logs you keep during the inference phase. By default, CloudWatch retains logs indefinitely. By setting limited retention time for your inference logs, you’ll avoid the carbon footprint of unnecessary log storage. Also delete unused versions of your models and custom container images from your repositories.

Monitoring

Retrain only when necessary

Monitor your ML model in production and only retrain if it’s required. Because of model drift, robustness, or new ground truth data being available, models usually need to be retrained. Instead of retraining arbitrarily, monitor your ML model in production, automate your model drift detection and only retrain when your model’s predictive performance has fallen below defined KPIs.

Consider SageMaker Pipelines, AWS Step Functions Data Science SDK for Amazon SageMaker, or third-party tools to automate your retraining pipelines.

Measure results and improve

To monitor and quantify improvements during the inference phase, track the following metrics:

Resources provisioned for your endpoints (InstanceType and AcceleratorType)
Efficient use of these resources (CPUUtilization, GPUUtilization, GPUMemoryUtilization, MemoryUtilization, and DiskUtilization) in the CloudWatch Console

For storage:

The total size of the data captured by Amazon SageMaker Model Monitor using Amazon S3 Storage Lens
The size of your CloudWatch log groups

Conclusion

AI/ML workloads can be energy intensive, but as called out by UN and mentioned in the last IPCC report, AI can contribute to mitigation of climate change and the achievement of several Sustainable Development Goals. As technology builders, it’s our responsibility to make sustainable use of AI and ML.

In this blog post series, we presented best practices you can use to make sustainability-conscious architectural decisions and reduce the environmental impact for your AI/ML workloads.

About the Well-Architected Framework

These practices are part of the Sustainability Pillar of the AWS Well-Architected Framework. AWS Well-Architected is a set of guiding design principles developed by AWS to help organizations build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Use the AWS Well-Architected Tool to review your workloads periodically to address important design considerations and ensure that they follow the best practices and guidance of the AWS Well-Architected Framework. For follow up questions or comments, join our growing community on AWS re:Post.

Optimize AI/ML workloads for sustainability: Part 2, model development

2022-03-21 Benoit de Chateauvieux

Post Syndicated from Benoit de Chateauvieux original https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-2-model-development/

More complexity often means using more energy, and machine learning (ML) models are becoming bigger and more complex. And though ML hardware is getting more efficient, the energy required to train these ML models is increasing sharply.

In this series, we’re following the phases of the Well-Architected machine learning lifecycle (Figure 1) to optimize your artificial intelligence (AI)/ML workloads. In Part 2, we examine the model development phase and show you how to train, tune, and evaluate your ML model to help you reduce your carbon footprint.

If you missed the first part of this series, we showed you how to examine your workload to help you 1) evaluate the impact of your workload, 2) identify alternatives to training your own model, and 3) optimize data processing.

Figure 1. ML lifecycle

Model building

Define acceptable performance criteria

When you build an ML model, you’ll likely need to make trade-offs between your model’s accuracy and its carbon footprint. When we focus only on the model’s accuracy, we “ignore the economic, environmental, or social cost of reaching the reported accuracy.” Because the relationship between model accuracy and complexity is at best logarithmic, training a model longer or looking for better hyperparameters only leads to a small increase in performance.

Establish performance criteria that support your sustainability goals while meeting your business requirements, not exceeding them.

Select energy-efficient algorithms

Begin with a simple algorithm to establish a baseline. Then, test different algorithms with increasing complexity to observe whether performance has improved. If so, compare the performance gain against the difference in resources required.

Try to find simplified versions of algorithms. This will help you use less resources to achieve a similar outcome. For example, DistilBERT, a distilled version of BERT, has 40% fewer parameters, runs 60% faster, and preserves 97% of BERT’s performance.

Use pre-trained or partially pre-trained models

Consider techniques to avoid training a model from scratch:

Transfer Learning: Use a pre-trained source model and reuse it as the starting point for a second task. For example, a model trained on ImageNet (14 million images) can generalize with other datasets.
Incremental Training: Use artifacts from an existing model on an expanded dataset to train a new model.

Optimize your deep learning models to accelerate training

Compile your DL models from their high-level language representation to hardware-optimized instructions to reduce training time. You can achieve this with open-source compilers or Amazon SageMaker Training Compiler, which can speed up training of DL models by up to 50% by more efficiently using SageMaker GPU instances.

Start with small experiments, datasets, and compute resources

Experiment with smaller datasets in your development notebook. This allows you to iterate quickly with limited carbon emission.

Automate the ML environment

When building your model, use Lifecycle Configuration Scripts to automatically stop idle SageMaker Notebook instances. If you are using SageMaker Studio, install the auto-shutdown Jupyter extension to detect and stop idle resources.

Use the fully managed training process provided by SageMaker to automatically launch training instances and shut them down as soon as the training job is complete. This minimizes idle compute resources and thus limits the environmental impact of your training job.

Adopt a serverless architecture for your MLOps pipelines. For example, orchestration tools like AWS Step Functions or SageMaker Pipelines only provision resources when work needs to be done. This way, you’re not maintaining compute infrastructure 24/7.

Model training

Select sustainable AWS Regions

Use a debugger

A debugger like SageMaker Debugger can identify training problems like system bottlenecks, overfitting, saturated activation functions, and under-utilization of system resources. It also provides built-in rules like LowGPUUtilization or Overfit. These rules monitor your workload and will automatically stop a training job as soon as it detects a bug (Figure 2), which helps you avoid unnecessary carbon emissions.

Figure 2. Automatically stop buggy training jobs with SageMaker Debugger

Optimize the resources of your training environment

Reference the recommended instance types for the algorithm you’ve selected in the SageMaker documentation. For example, for DeepAR, you should start with a single CPU instance and only switch to GPU and multiple instances when necessary.

Right size your training jobs with Amazon CloudWatch metrics that monitor the utilization of resources like CPU, GPU, memory, and disk utilization.

Consider Managed Spot Training, which takes advantage of unused Amazon Elastic Compute Cloud (Amazon EC2) capacity and can save you up to 90% in cost compared to On-Demand instances. By shaping your demand for the existing supply of EC2 instance capacity, you will improve your overall resource efficiency and reduce idle capacity of the overall AWS Cloud.

Use efficient silicon

Use AWS Trainium for optimized for DL training workloads. It is expected to be our most energy efficient processor for this purpose.

Archive or delete unnecessary training artifacts

Organize your ML experiments with SageMaker Experiments to clean up training resources you no longer need.

Reduce the volume of logs you keep. By default, CloudWatch retains logs indefinitely. By setting limited retention time for your notebooks and training logs, you’ll avoid the carbon footprint of unnecessary log storage.

Model tuning and evaluation

Use efficient cross-validation techniques for hyperparameter optimization

Prefer Bayesian search over random search (and avoid grid search). Bayesian search makes intelligent guesses about the next set of parameters to pick based on the prior set of trials. It typically requires 10 times fewer jobs than random search, and thus 10 times less compute resources, to find the best hyperparameters.

Limit the maximum number of concurrent training jobs. Running hyperparameter tuning jobs concurrently gets more work done quickly. However, a tuning job improves only through successive rounds of experiments. Typically, running one training job at a time achieves the best results with the least amount of compute resources.

Carefully choose the number of hyperparameters and their ranges. You get better results and use less compute resources by limiting your search to a few parameters and small ranges of values. If you know that a hyperparameter is log-scaled, convert it to further improve the optimization.

Use warm-start hyperparameter tuning

Use warm-start to leverage the learning gathered in previous tuning jobs to inform which combinations of hyperparameters to search over in the new tuning job. This technique avoids restarting hyperparameter optimization jobs from scratch and thus reduces the compute resources needed.

Measure results and improve

To monitor and quantify improvements of your training jobs, track the following metrics:

Resources provisioned for your training jobs (InstanceCount, InstanceType, and VolumeSizeInGB)
Efficient use of these resources (CPUUtilization, GPUUtilization, GPUMemoryUtilization, MemoryUtilization, and DiskUtilization) in the SageMaker Console, the CloudWatch Console or your SageMaker Debugger Profiling Report

For storage:

The total size of your Amazon Simple Storage Service (Amazon S3) buckets and storage class distribution, using Amazon S3 Storage Lens
The size of your CloudWatch log groups

Conclusion

In this blog post, we discussed techniques and best practices to reduce the energy required to build, train, and evaluate your ML models.

We also provided recommendations for the tuning process as it makes up a large part of the carbon impact of building an ML model. During hyperparameter and neural design search, hundreds of versions of a given model are created, trained, and evaluated before identifying an optimal design.

In the next post, we’ll continue our sustainability journey through the ML lifecycle and discuss the best practices you can follow when deploying and monitoring your model in production.

Want to learn more? Check out the Sustainability Pillar of the AWS Well-Architected Framework, the Architecting for sustainability session at re:Invent 2021, and other blog posts on architecting for sustainability.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

New and Updated AWS Well-Architected Lenses

2022-03-17 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-and-updated-aws-well-architected-lenses/

Since 2015, the AWS Well-Architected Framework has been helping AWS customers and partners improve their cloud architectures. The framework consists of design principles, questions, and best practices across multiple pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization. At AWS re:Invent 2021, we introduced a new Sustainability Pillar to help organizations learn, measure, and improve their workloads using environmental best practices for cloud computing.

In 2017, we introduced AWS Well-Architected Lenses and extended the best practice guidance to specific industry and technology domains, such as serverless, high performance computing (HPC), internet of things (IoT), software as a service (SaaS), foundational technical review (FTR), and financial services. Use the applicable Lenses together with the pillars of the AWS Well-Architected Framework to fully evaluate your workloads.

In 2021, we added four new lenses for various technologies and industries at the request of our customers. If you are planning a new workload for the new year, check out the new and updated Lenses to help guide you through the implementation of AWS best practices.

New AWS Well-Architected Lenses

Streaming Media Lens (September 29, 2021)
The Streaming Media Lens helps customers apply best practices in the design, delivery, and maintenance of their cloud-based streaming media workloads. Whether you’ve just started designing a greenﬁeld video application on AWS or are looking to migrate an existing workload, this Lens provides perspective on best practices and can spark new ideas. To learn more about best practices for architecting and improving your streaming media workloads on AWS, see the Streaming Media Lens documentation.

SAP Lens (October 29, 2021)
The SAP Lens is a collection of customer-proven design principles and best practices for ensuring SAP workloads on AWS are well-architected. The SAP Lens is based on insights that AWS has gathered from customers, AWS Partners, and the SAP Specialist Architect community. The Lens is designed to help you adopt a cloud-native approach to running SAP. To learn more, see the SAP Lens documentation.

Games Industry Lens (November 19, 2021)
The Games Industry Lens helps customers review and improve cloud-based architecture for game development, deployment, operations of gaming platforms, and to support massive player scale. The Lens presents common games deployment scenarios and identifies key elements to ensure your platforms are in accordance with the best practices of AWS Well-Architected Framework. Learn the best practices for designing, architecting, and deploying your games workloads on AWS in the Games Industry Lens documentation.

Hybrid Networking Lens (November 22, 2021)
The Hybrid Networking Lens provides best practices and strategies to use when designing hybrid networking architectures. This Lens supports a broad spectrum of use cases and helps set you up for success in building hybrid networking architectures and integrating your on-premises data center with AWS operations. It outlines three areas to consider when designing hybrid network connectivity for your workload: data layer, monitoring and configuration management, and security. To learn more, see the Hybrid Networking Lens documentation.

Updated AWS Well-Architected Lens

Machine Learning Lens (October 13, 2021)
The Machine Learning (ML) Lens introduces a set of established and repeatable best practices across the ML lifecycle phases. You can apply this guidance and architectural principles when designing your ML workloads or after your workloads have entered production as part of continuous improvement. The Lens includes guidance and resources on implementing the best practices on AWS. To learn more, see the ML Lens documentation.

Data Analytics Lens (October 29, 2021)
The Data Analytics Lens is a collection of customer-proven best practices for designing well-architected analytics workloads. It contains insights that AWS has gathered from real-world case studies and helps you learn the key design elements of well-architected analytics workloads, along with recommendations for improvement. For more information about building your own data analytics workload, see the Data Analytics Lens whitepaper.

Management and Governance Lens (December 17, 2021)
The Management and Governance Lens (M&G Lens) provides clear guidance to help you prepare your environment, regardless of your stage of cloud adoption, with a focus on eight different functions. Those functions are controls and guardrails, network connectivity, identity management, security management, monitoring and observability, cloud financial management, service management, and sourcing and distribution. To learn more, see the M&G Lens documentation.

To get started with your favorite lenses, visit the AWS Well-Architected page. You can learn, measure, and build using architectural best practices and tools.

To review your workloads using the AWS Well-Architected Framework, we recommend using the AWS Well-Architected Tool, a self-service tool designed to help you review AWS workloads at any time, without the need for an AWS Solutions Architect.

It provides a mechanism for regularly evaluating your workloads, identifying high-risk issues, and recording your improvements applying your favorite Lenses. You can also leverage Custom Lenses to record and track progress towards your organization’s internal best practices.

If you want to train these best practices, AWS Well-Architected Labs provides codes and documentation in the format of hands-on labs to help you learn, measure, and build using architectural best practices categorized into levels. Also, you can access an ecosystem of hundreds of members in the AWS Well-Architected Partner Program in your area to help analyze and review your applications.

You can refer to the AWS Architecture Center, a collection of reference architecture patterns, vetted architecture solutions, and best practices. If you’re new to AWS, use the Architect Learning Plan to learn how to design applications and systems on AWS. Build technical skills as you progress along the path toward AWS Certification.

This is My Architecture is a video series that showcases innovative architectural solutions on AWS by customers and partners. We would love to hear more from you, especially about your success stories in building your applications on AWS Well-Architected Framework. Please share with your account team to introduce your stories.

– Channy

Using DevOps Automation to Deploy Lambda APIs across Accounts and Environments

2022-02-25 Subrahmanyam Madduru

Post Syndicated from Subrahmanyam Madduru original https://aws.amazon.com/blogs/architecture/using-devops-automation-to-deploy-lambda-apis-across-accounts-and-environments/

by Subrahmanyam Madduru – Global Partner Solutions Architect Leader, AWS, Sandipan Chakraborti – Senior AWS Architect, Wipro Limited, Abhishek Gautam – AWS Developer and Solutions Architect, Wipro Limited, Arati Deshmukh – AWS Architect, Infosys

As more and more enterprises adopt serverless technologies to deliver their business capabilities in a more agile manner, it is imperative to automate release processes. Multiple AWS Accounts are needed to separate and isolate workloads in production versus non-production environments. Release automation becomes critical when you have multiple business units within an enterprise, each consisting of a number of AWS accounts that are continuously deploying to production and non-production environments.

As a DevOps best practice, the DevOps engineering team responsible for build-test-deploy in a non-production environment should not release the application and infrastructure code on to both non-production and production environments. This risks introducing errors in application and infrastructure deployments in production environments. This in turn results in significant rework and delays in delivering functionalities and go-to-market initiatives. Deploying the code in a repeatable fashion while reducing manual error requires automating the entire release process. In this blog, we show how you can build a cross-account code pipeline that automates the releases across different environments using AWS CloudFormation templates and AWS cross-account access.

Cross-account code pipeline enables an AWS Identity & Access Management (IAM) user to assume an IAM Production role using AWS Secure Token Service (Managing AWS STS in an AWS Region – AWS Identity and Access Management) to switch between non-production and production deployments based as required. An automated release pipeline goes through all the release stages from source, to build, to deploy, on non-production AWS Account and then calls STS Assume Role API (cross-account access) to get temporary token and access to AWS Production Account for deployment. This follow the least privilege model for granting role-based access through IAM policies, which ensures the secure automation of the production pipeline release.

Solution Overview

In this blog post, we will show how a cross-account IAM assume role can be used to deploy AWS Lambda Serverless API code into pre-production and production environments. We are building on the process outlined in this blog post: Building a CI/CD pipeline for cross-account deployment of an AWS Lambda API with the Serverless Framework by programmatically automating the deployment of Amazon API Gateway using CloudFormation templates. For this use case, we are assuming a single tenant customer with separate AWS Accounts to isolate pre-production and production workloads. In Figure 1, we have represented the code pipeline workflow diagramatically for our use case.

Figure 1. AWS cross-account CodePipeline for production and non-production workloads

Figure 1. AWS cross-account AWS CodePipeline for production and non-production workloads

Let us describe the code pipeline workflow in detail for each step noted in the preceding diagram:

An IAM user belonging to the DevOps engineering team logs in to AWS Command-line Interface (AWS CLI) from a local machine using an IAM secret and access key.
Next, the IAM user assumes the IAM role to the corresponding activities – AWS Code Commit, AWS CodeBuild, AWS CodeDeploy, AWS CodePipeline Execution and deploys the code for pre-production.
A typical AWS CodePipeline comprises of build, test and deploy stages. In the build stage, the AWS CodeBuild service generates the Cloudformation template stack (template-export.yaml) into Amazon S3.
In the deploy stage, AWS CodePipeline uses a CloudFormation template (a yaml file) to deploy the code from an S3 bucket containing the application API endpoints via Amazon API Gateway in the pre-production environment.
The final step in the pipeline workflow is to deploy the application code changes onto the Production environment by assuming STS production IAM role.

Since the AWS CodePipeline is fully automated, we can use the same pipeline by switching between pre-production and production accounts. These accounts assume the IAM role appropriate to the target environment and deploy the validated build to that environment using CloudFormation templates.

Prerequisites

Here are the pre-requisites before you get started with implementation.

A user with appropriate privileges (for example: Project Admin) in a production AWS account
A user with appropriate privileges (for example: Developer Lead) in a pre-production AWS account such as development
A CloudFormation template for deploying infrastructure in the pre-production account
Ensure your local machine has AWS CLI installed and configured

Implementation Steps

In this section, we show how you can use AWS CodePipeline to release a serverless API in a secure manner to pre-production and production environments. AWS CloudWatch logging will be used to monitor the events on the AWS CodePipeline.

1. Create Resources in a pre-production account

In this step, we create the required resources such as a code repository, an S3 bucket, and a KMS key in a pre-production environment.

Clone the code repository into your CodeCommit. Make necessary changes to index.js and ensure the buildspec.yaml is there to build the artifacts.
- Using codebase (lambda APIs) as input, you output a CloudFormation template, and environmental configuration JSON files (used for configuring Production and other non-Production environments such as dev, test). The build artifacts are packaged using AWS Serverless Application Model into a zip file and uploads it to an S3 bucket created for storing artifacts. Make note of the repository name as it will be required later.
Create an S3 bucket in a Region (Example: us-east-2). This bucket will be used by the pipeline for get and put artifacts. Make a note of the bucket name.
- Make sure you edit the bucket policy to have your production account ID and the bucket name. Refer to AWS S3 Bucket Policy documentation to make changes to Amazon S3 bucket policies and permissions.
Navigate to AWS Key Management Service (KMS) and create a symmetric key.
Then create a new secret, configure the KMS key and provide access to development and production account. Make a note of the ARN for the key.

2. Create IAM Roles in the Production Account and required policies

In this step, we create roles and policies required to deploy the code.

Create a cross account access IAM role (CodePipelineCrossAccountRole) in the Production account.
Create a read/write policy to access S3 bucket containing the resource artifacts
Create a policy is to access KMS keys to AWS Production account.

{
"Version": "2012-10-17",
"Statement": [
{
    "Effect": "Allow",
      "Action": [
      "kms:DescribeKey",
    "kms:GenerateDataKey*",
      "kms:Encrypt",
      "kms:ReEncrypt*",
      "kms:Decrypt"
],
"Resource": [
      "Your KMS Key ARN you created in Development Account"
]
    }
]
}

Once you’ve created both policies, attach them to the previously created cross-account role.

3. Create a CloudFormation Deployment role

In this step, you need to create another IAM role, “CloudFormationDeploymentRole” for Application deployment. Then attach the following four policies to it.

Policy 1: For Cloudformation to deploy the application in the Production account

{
"Version": "2012-10-17",
"Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "cloudformation:DetectStackDrift",
        "cloudformation:CancelUpdateStack",
        "cloudformation:DescribeStackResource",
        "cloudformation:CreateChangeSet",
        "cloudformation:ContinueUpdateRollback",
        "cloudformation:DetectStackResourceDrift",
    "cloudformation:DescribeStackEvents",
    "cloudformation:UpdateStack",
    "cloudformation:DescribeChangeSet",
    "cloudformation:ExecuteChangeSet",
    "cloudformation:ListStackResources",
    "cloudformation:SetStackPolicy",
    "cloudformation:ListStacks",
        "cloudformation:DescribeStackResources",
        "cloudformation:DescribePublisher",
        "cloudformation:GetTemplateSummary",
    "cloudformation:DescribeStacks",
    "cloudformation:DescribeStackResourceDrifts",
    "cloudformation:CreateStack",
      "cloudformation:GetTemplate",
      "cloudformation:DeleteStack",
      "cloudformation:TagResource",
    "cloudformation:UntagResource",
    "cloudformation:ListChangeSets",
        "cloudformation:ValidateTemplate"
],
      "Resource": "arn:aws:cloudformation:us-east-2:940679525002:stack/DevOps-Automation-API*/*"        }
]
}

Policy 2: For Cloudformation to perform required IAM actions

{
"Version": "2012-10-17",
"Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "iam:GetRole",
        "iam:GetPolicy",
        "iam:TagRole",
        "iam:DeletePolicy",
        "iam:CreateRole",
        "iam:DeleteRole",
        "iam:AttachRolePolicy",
        "iam:PutRolePolicy",
        "iam:TagPolicy",
        "iam:CreatePolicy",
        "iam:PassRole",
        "iam:DetachRolePolicy",
        "iam:DeleteRolePolicy"
      ],
      "Resource": "*"
    }
]
}

Policy 3: Lambda function service invocation policy

{
"Version": "2012-10-17",
"Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "lambda:CreateFunction",
        "lambda:UpdateFunctionCode",
        "lambda:AddPermission",
        "lambda:InvokeFunction",
      "lambda:GetFunction",
        "lambda:DeleteFunction",
        "lambda:PublishVersion",
        "lambda:CreateAlias"
      ],
      "Resource": "arn:aws:lambda:us-east-2:Your_Production_AccountID:function:SampleApplication*"
    }
]
}

Policy 4: API Gateway service invocation policy

{
"Version": "2012-10-17",
"Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "apigateway:DELETE",
        "apigateway:PATCH",
        "apigateway:POST",
        "apigateway:GET"
      ],
      "Resource": [
        "arn:aws:apigateway:*::/restapis/*/deployments/*",
        "arn:aws:apigateway:*::/restapis/*/stages/*",
        "arn:aws:apigateway:*::/clientcertificates",
        "arn:aws:apigateway:*::/restapis/*/models",
        "arn:aws:apigateway:*::/restapis/*/resources/*",
        "arn:aws:apigateway:*::/restapis/*/models/*",
        "arn:aws:apigateway:*::/restapis/*/gatewayresponses/*",
        "arn:aws:apigateway:*::/restapis/*/stages",
        "arn:aws:apigateway:*::/restapis/*/resources",
        "arn:aws:apigateway:*::/restapis/*/gatewayresponses",
        "arn:aws:apigateway:*::/clientcertificates/*",
        "arn:aws:apigateway:*::/account",
        "arn:aws:apigateway:*::/restapis/*/deployments",
        "arn:aws:apigateway:*::/restapis"
      ]
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "apigateway:DELETE",
        "apigateway:PATCH",
        "apigateway:POST",
        "apigateway:GET"
      ],
      "Resource": "arn:aws:apigateway:*::/restapis/*/resources/*/methods/*/responses/*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": [
    "apigateway:DELETE",
    "apigateway:PATCH",
    "apigateway:GET"
      ],
      "Resource": "arn:aws:apigateway:*::/restapis/*"
    },
    {
      "Sid": "VisualEditor3",
      "Effect": "Allow",
      "Action": [
        "apigateway:DELETE",
        "apigateway:PATCH",
        "apigateway:GET"
      ],
      "Resource": "arn:aws:apigateway:*::/restapis/*/resources/*/methods/*"
}
]
}

Make sure you also attach the S3 read/write access and KMS policies created in Step-2, to the CloudFormationDeploymentRole.

4. Setup and launch CodePipeline

You can launch the CodePipeline either manually in the AWS console using “Launch Stack” or programmatically via command-line in CLI.

On your local machine go to terminal/ command prompt and launch this command:

aws cloudformation deploy –template-file <Path to pipeline.yaml> –region us-east-2 –stack-name <Name_Of_Your_Stack> –capabilities CAPABILITY_IAM –parameter-overrides ArtifactBucketName=<Your_Artifact_Bucket_Name> ArtifactEncryptionKeyArn=<Your_KMS_Key_ARN> ProductionAccountId=<Your_Production_Account_ID> ApplicationRepositoryName=<Your_Repository_Name> RepositoryBranch=master

If you have configured a profile in AWS CLI, mention that profile while executing the command:

–profile <your_profile_name>

After launching the pipeline, your serverless API gets deployed in pre-production as well as in the production Accounts. You can check the deployment of your API in production or pre-production Account, by navigating to the API Gateway in the AWS console and looking for your API in the Region where it was deployed.

Figure 2. Check your deployment in pre-production/production environment

Then select your API and navigate to stages, to view the published API with an endpoint. Then validate your API response by selecting the API link.

Figure 3. Check whether your API is being published in pre-production/production environment

Alternatively you can also navigate to your APIs by navigating through your deployed application CloudFormation stack and selecting the link for API in the Resources tab.

Cleanup

If you are trying this out in your AWS accounts, make sure to delete all the resources created during this exercise to avoid incurring any AWS charges.

Conclusion

In this blog, we showed how to build a cross-account code pipeline to automate releases across different environments using AWS CloudFormation templates and AWS Cross Account Access. You also learned how serveless APIs can be securely deployed across pre-production and production accounts. This helps enterprises automate release deployments in a repeatable and agile manner, reduce manual errors and deliver business cababilities more quickly.

Optimize AI/ML workloads for sustainability: Part 1, identify business goals, validate ML use, and process data

2022-02-10 Benoit de Chateauvieux

Post Syndicated from Benoit de Chateauvieux original https://aws.amazon.com/blogs/architecture/optimize-ai-ml-workloads-for-sustainability-part-1-identify-business-goals-validate-ml-use-and-process-data/

Training artificial intelligence (AI) services and machine learning (ML) workloads uses a lot of energy—and they are becoming bigger and more complex. As an example, the Carbontracker: Tracking and Predicting the Carbon Footprint of Training Deep Learning Models study estimates that a single training session for a language model like GPT-3 can have a carbon footprint similar to traveling 703,808 kilometers by car.

Although ML uses a lot of energy, it is also one of the best tools we have to fight the effects of climate change. For example, we’ve used ML to help deliver food and pharmaceuticals safely and with much less waste, reduce the cost and risk involved in maintaining wind farms, restore at-risk ecosystems, and predict and understand extreme weather.

In this series of three blog posts, we’ll provide guidance from the Sustainability Pillar of the AWS Well-Architected Framework to reduce the carbon footprint of your AI/ML workloads.

This first post follows the first three phases provided in the Well-Architected machine learning lifecycle (Figure 1):

Business goal identification
ML problem framing
Data processing (data collection, data preprocessing, feature engineering)

You’ll learn best practices for each phase to help you review and refine your workloads to maximize utilization and minimize waste and the total resources deployed and powered to support your workload.

Figure 1. ML lifecycle

Business goal identification

Define the overall environmental impact or benefit

Measure your workload’s impact and its contribution to the overall sustainability goals of the organization. Questions you should ask:

How does this workload support our overall sustainability mission?
How much data will we have to store and process? What is the impact of training the model? How often will we have to re-train?
What are the impacts resulting from customer use of this workload?
What will be the productive output compared with this total impact?

Asking these questions will help you establish specific sustainability objectives and success criteria to measure against in the future.

ML problem framing

Identify if ML is the right solution

Always ask if AI/ML is right for your workload. There is no need to use computationally intensive AI when a simpler, more sustainable approach might succeed just as well.

For example, using ML to route Internet of Things (IoT) messages may be unwarranted; you can express the logic with a Rules Engine.

Consider AI services and pre-trained models

Once you decide if AI/ML is the right tool, consider whether the workload needs to be developed as a custom model.

Many workloads can use the managed AWS AI services shown in Figure 2. Using these services means that you won’t need the associated resources to collect/store/process data and to prepare/train/tune/deploy an ML model.

Figure 2. Managed AWS AI services

If adopting a fully managed AI service is not appropriate, evaluate if you can use pre-existing datasets, algorithms, or models. AWS Marketplace offers over 1,400 ML-related assets that customers can subscribe to. You can also fine-tune an existing model starting from a pre-trained model, like those available on Hugging Face. Using pre-trained models from third parties can reduce the resources you need for data preparation and model training.

Select sustainable Regions

Select an AWS Region with sustainable energy sources. When regulations and legal aspects allow, choose Regions near Amazon renewable energy projects and Regions where the grid has low published carbon intensity to host your data and workloads.

Data processing (data collection, data preprocessing, feature engineering)

Avoid datasets and processing duplication

Evaluate if you can avoid data processing by using existing publicly available datasets like AWS Data Exchange and Open Data on AWS (which includes the Amazon Sustainability Data Initiative). They offer weather and climate datasets, satellite imagery, air quality or energy data, among others. When you use these curated datasets, it avoids duplicating the compute and storage resources needed to download the data from the providers, store it in the cloud, organize, and clean it.

For internal data, you can also reduce duplication and rerun of feature engineering code across teams and projects by using a feature storage, such as Amazon SageMaker Feature Store.

Once your data is ready for training, use pipe input mode to stream it from Amazon Simple Storage Service (Amazon S3) instead of copying it to Amazon Elastic Block Store (Amazon EBS). This way, you can reduce the size of your EBS volumes.

Minimize idle resources with serverless data pipelines

Adopt a serverless architecture for your data pipeline so it only provisions resources when work needs to be done. For example, when you use AWS Glue and AWS Step Functions for data ingestion and preprocessing, you are not maintaining compute infrastructure 24/7. As shown in Figure 3, Step Functions can orchestrate AWS Glue jobs to create event-based serverless ETL/ELT pipelines.

Figure 3. Orchestrating data preparation with AWS Glue and Step Functions

Implement data lifecycle policies aligned with your sustainability goals

Classify data to understand its significance to your workload and your business outcomes. Use this information to determine when you can move data to more energy-efficient storage or safely delete it.

Manage the lifecycle of all your data and automatically enforce deletion timelines to minimize the total storage requirements of your workload using Amazon S3 Lifecycle policies. The Amazon S3 Intelligent-Tiering storage class will automatically move your data to the most sustainable access tier when access patterns change.

Define data retention periods that support your sustainability goals while meeting your business requirements, not exceeding them.

Adopt sustainable storage options

Use the appropriate storage tier to reduce the carbon impact of your workload. On Amazon S3, for example, you can use energy-efficient, archival-class storage for infrequently accessed data, as shown in Figure 4. And if you can easily recreate an infrequently accessed dataset, use the Amazon S3 One Zone-IA class to reduce by 3x or more its carbon footprint.

Figure 4. Data access patterns for Amazon S3

Don’t over-provision block storage for notebooks and use object storage services like Amazon S3 for common datasets.

Tip: You can check the free disk space on your SageMaker Notebooks using !df -h.

Select efficient file formats and compression algorithms

Use efficient file formats such as Parquet or ORC to train your models. Compared to CSV, they can help you reduce your storage by up to 87%.

Migrating to a more efficient compression algorithm can also greatly contribute to your storage reduction efforts. For example, Zstandard produces 10–15% smaller files than Gzip at the same compression speed. Some SageMaker built-in algorithms accept x-recordio-protobuf input, which can be streamed directly from Amazon S3 instead of being copied to a notebook instance.

Minimize data movement across networks

Compress your data before moving it over the network.

Minimize data movement across networks when selecting a Region; store your data close to your producers and train your models close to your data.

Measure results and improve

To monitor and quantify improvements, track the following metrics:

Total size of your S3 buckets and storage class distribution, using Amazon S3 Storage Lens
DiskUtilization metric of your SageMaker processing jobs
StorageBytes metric of your SageMaker Studio shared storage volume

Conclusion

In this blog post, we discussed the importance of defining the overall environmental impact or benefit of your ML workload and why managed AI services or pre-trained ML models are sustainable alternatives to custom models. You also learned best practices to reduce the carbon footprint of your ML workload in the data processing phase.

In the next post, we will continue our sustainability journey through the ML lifecycle and discuss the best practices you can follow in the model development phase.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Insights for CTOs: Part 2 – Enable Good Decisions at Scale with Robust Security

2021-12-17 Syed Jaffry

Post Syndicated from Syed Jaffry original https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-2-enable-good-decisions-at-scale-with-robust-security/

In my role as a Senior Solutions Architect, I have spoken to chief technology officers (CTOs) and executive leadership of large enterprises like big banks, software as a service (SaaS) businesses, mid-sized enterprises, and startups.

In this 6-part series, I share insights gained from various CTOs during their cloud adoption journeys at their respective organizations. I have taken these lessons and summarized architecture best practices to help you build and operate applications successfully in the cloud. This series will also cover topics on building and operating cloud applications, security, cloud financial management, modern data and artificial intelligence (AI), cloud operating models, and strategies for cloud migration.

In part 2, my colleague Paul Hawkins and I will show you how to effectively communicate organization-wide security processes. This will ensure you can make informed decisions to scale effectively. We also describe how to establish robust security controls using best practices from the Security Pillar of the Well-Architected Framework.

Effectively establish and communicate security processes

To ensure your employees, customers, contractors, etc., understand your organization’s security goals, make sure that people know the what, how, and why behind your security objectives:

What are the overall objectives they need to meet?
How do you intend for the organization and your customers to work together to meet these goals?
Why are meeting these goals important to your organization and customers?

Having well communicated security principles gives a common understanding of overall objectives. Once you communicate these goals, you can get more specific in terms of how those objectives can be achieved.

The next sections discuss best practices to establish your organization’s security processes.

Create a “path to production” process

A “path to production” process is a set of consistent and reusable engineering standards and steps that each new cloud workload must adhere to prior to production deployment. Using this process will increase delivery velocity while reducing business risk by ensuring strong compliance to standards.

Classify your data for better access control

Understanding the type of data that you are handling and where it is being handled is critical to understanding what you need to do to appropriately protect it. For example, the requirements for a public website are different than a payment processing workload. By knowing where and when sensitive data is being accessed or used, you can more easily assess and establish the appropriate controls.

Figure 1 shows a scale that will help you determine when and how to protect sensitive data. It shows that you would apply stricter access controls for more sensitive data to reduce the risk of inappropriate access. Detective controls allow you to audit and respond to unexpected access.

By simplifying the baseline control posture across all environments and layering on stricter controls where appropriate, you will make it easier to deliver change more swiftly while maintaining the right level of security.

Figure 1. Data classification and control scale

Identify and prioritize how to address risks using a threat model

As shown in the How to approach threat modeling blog post, threat modeling helps workload teams identify potential threats and develop or implement security controls to address those threats.

Threat modeling is most effective when it’s done at the workload (or workload feature) level. We recommend creating reusable threat modeling templates. This will help ensure quicker time to production and a consistent security control posture for your systems.

Create feedback cycles

Security, like other areas of architecture and design, is not static. You don’t implement security processes and walk away, just like you wouldn’t ship an application and never improve its availability, performance, or ease of operation.

Implementation of feedback cycles will vary depending on your organizational structure and processes. However, one common way we have seen feedback cycles being implemented is with a collaborative, blame-free root cause analysis (RCA) process. It allows you to understand how many issues you have been able to prevent or effectively respond to and apply that knowledge to make your systems more secure. It also demonstrates organizational support for an objective discussion where people are not penalized for asking questions.

Security controls

Protect your applications and infrastructure

To secure your organization, build automation that delivers robust answers to the following questions:

Preventative controls – how well can you block unauthorized access?
Detective controls – how well can you identify unexpected activity or unwanted configuration?
Incident response – how quickly and effectively can you respond and recover from issues?
Data protection – how well is the data protected while being used and stored?

Preventative controls

Start with robust identity and access management (IAM). For human access, avoid having to maintain separate credentials between cloud and on-premises systems. It does not scale and creates threat vectors such as long-lived credentials and credential leaks.

Instead, use federated authentication within a centralized system for provisioning and deprovisioning organization-wide access to all your systems, including the cloud. For AWS access, you can do this with AWS Single Sign-On (AWS SSO), direct federation to IAM, or integration with partner solutions, such as Okta or Active Directory.

Enhance your trust boundary with the principles of “zero trust.” Traditionally, organizations tend to rely on the network as the primary point of control. This can create a “hard shell, soft core” model, which doesn’t consider context for access decisions. Zero trust is about increasing your use of identity as a means to grant access in addition to traditional controls that rely on network being private.

Apply “defense in depth” to your application infrastructure with a layered security architecture. The sequence in which you layer the controls together can depend on your use case. For example, you can apply IAM controls either at the database layer or at the start of user activity—or both. Figure 2 shows a conceptual view of layering controls to help secure access to your data. Figure 3 shows the implementation view for a web-facing application.

Figure 2. Defense in depth

Figure 3. Defense in depth applied to a web application

Detective controls

Detective controls allow you to get the information you need to respond to unexpected changes and incidents. Tools like Amazon GuardDuty and AWS Config can integrate with your security information and event monitoring (SIEM) system so you can respond to incidents using human and automated intervention.

Incident response

When security incidents are detected, timely and appropriate response is critical to minimize business impact. A robust incident response process is a combination of human intervention steps and automation. The AWS Security Hub Automated Response and Remediation solution provides an example of how you can build incident response automation.

Protect data with robust controls

Restrict access to your databases with private networking and strong identity and access control. Apply data encryption in transit (TLS) and at rest. A common mistake that organizations make is not enabling encryption at rest in databases at the time of initial deployment.

It is difficult to enable database encryption after the fact without time-consuming data migration. Therefore, enable database encryption from the start and minimize direct human access to data by applying principles of least privilege. This reduces the likelihood of accidental disclosure of information or misconfiguration of systems.

Ready to get started?

As a CTO, understanding the overall posture of your security processes against the foundational security controls is beneficial. Tracking key metrics on the effectiveness of the decision-making process, overall security objectives, and the improvement in posture over time should be regularly evaluated by the CTO and CISO organizations.

Embedding the principles of robust security processes and controls into the way your organization designs, develops, and operates workloads makes it easier to consistently make good decisions quickly.

To get started, look at workloads where engineering and security are already working together or bootstrap an initiative for this. Use the Well Architected Tool’s Security Pillar to create and communicate a set of objectives that demonstrate value.

Other blogs in this series

Insights for CTOs: Part 1 – Building and Operating Cloud Applications

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Creating a Multi-Region Application with AWS Services – Part 1, Compute and Security

2021-12-08 Joe Chapman

Post Syndicated from Joe Chapman original https://aws.amazon.com/blogs/architecture/creating-a-multi-region-application-with-aws-services-part-1-compute-and-security/

Building a multi-Region application requires lots of preparation and work. Many AWS services have features to help you build and manage a multi-Region architecture, but identifying those capabilities across 200+ services can be overwhelming.

In this 3-part blog series, we’ll explore AWS services with features to assist you in building multi-Region applications. In Part 1, we’ll build a foundation with AWS security, networking, and compute services. In Part 2, we’ll add in data and replication strategies. Finally, in Part 3, we’ll look at the application and management layers.

Considerations before getting started

AWS Regions are built with multiple isolated and physically separate Availability Zones (AZs). This approach allows you to create highly available Well-Architected workloads that span AZs to achieve greater fault tolerance. There are three general reasons that you may need to expand beyond a single Region:

Expansion to a global audience as an application grows and its user base becomes more geographically dispersed, there can be a need to reduce latencies for different parts of the world.
Reducing Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) as part of disaster recovery (DR) plan.
Local laws and regulations may have strict data residency and privacy requirements that must be followed.

Ensuring security, identity, and compliance

Creating a security foundation starts with proper authentication, authorization, and accounting to implement the principle of least privilege. AWS Identity and Access Management (IAM) operates in a global context by default. With IAM, you specify who can access which AWS resources and under what conditions. For workloads that use directory services, the AWS Directory Service for Microsoft Active Directory Enterprise Edition can be set up to automatically replicate directory data across Regions. This allows applications to reduce lookup latencies by using the closest directory and creates durability by spanning multiple Regions.

Applications that need to securely store, rotate, and audit secrets, such as database passwords, should use AWS Secrets Manager. It encrypts secrets with AWS Key Management Service (AWS KMS) keys and can replicate secrets to secondary Regions to ensure applications are able to obtain a secret in the closest Region.

Encrypt everything all the time

AWS KMS can be used to encrypt data at rest, and is used extensively for encryption across AWS services. By default, keys are confined to a single Region. AWS KMS multi-Region keys can be created to replicate keys to a second Region, which eliminates the need to decrypt and re-encrypt data with a different key in each Region.

AWS CloudTrail logs user activity and API usage. Logs are created in each Region, but they can be centralized from multiple Regions and multiple accounts into a single Amazon Simple Storage Service (Amazon S3) bucket. As a best practice, these logs should be aggregated to an account that is only accessible to required security personnel to prevent misuse.

As your application expands to new Regions, AWS Security Hub can aggregate and link findings to a single Region to create a centralized view across accounts and Regions. These findings are continuously synced between Regions to keep you updated on global findings.

We put these features together in Figure 1.

Figure 1. Multi-Region security, identity, and compliance services

Building a global network

For resources launched into virtual networks in different Regions, Amazon Virtual Private Cloud (Amazon VPC) allows private routing between Regions and accounts with VPC peering. These resources can communicate using private IP addresses and do not require an internet gateway, VPN, or separate network appliances. This works well for smaller networks that only require a few peering connections. However, as the number of peered connections increases, the mesh of peered connections can become difficult to manage and troubleshoot.

AWS Transit Gateway can help reduce these difficulties by creating a central transitive hub to act as a cloud router. A Transit Gateway’s routing capabilities can expand to additional Regions with Transit Gateway inter-Region peering to create a globally distributed private network.

Building a reliable, cost-effective way to route users to distributed Internet applications requires highly available and scalable Domain Name System (DNS) records. Amazon Route 53 does exactly that.

Route 53 routing policies can route traffic to a record with the lowest latency, or automatically fail over a record. If a larger failure occurs, the Route 53 Application Recovery Controller can simplify the monitoring and failover process for application failures across Regions, AZs, and on-premises.

Amazon CloudFront’s content delivery network is truly global, built across 300+ points of presence (PoP) spread throughout the world. Applications that have multiple possible origins, such as across Regions, can use CloudFront origin failover to automatically fail over the origin. CloudFront’s capabilities expand beyond serving content, with the ability to run compute at the edge. CloudFront functions make it easy to run lightweight JavaScript functions, and AWS Lambda@Edge makes it easy to run Node.js and Python functions across these 300+ PoPs.

AWS Global Accelerator uses the AWS global network infrastructure to provide two static anycast IPs for your application. It automatically routes traffic to the closest Region deployment, and if a failure is detected it will automatically redirect traffic to a healthy endpoint within seconds.

Figure 2 brings these features together to create a global network across two Regions.

Figure 2. AWS VPC connectivity and content delivery

Building the compute layer

An Amazon Elastic Compute Cloud (Amazon EC2) instance is based on an Amazon Machine Image (AMI). An AMI specifies instance configurations such as the instance’s storage, launch permissions, and device mappings. When a new standard image needs to be created, EC2 Image Builder can be used to streamline copying AMIs to selected Regions.

Although EC2 instances and their associated Amazon Elastic Block Store (Amazon EBS) volumes live in a single AZ, Amazon Data Lifecycle Manager can automate the process of taking and copying EBS snapshots across Regions. This can enhance DR strategies by providing a relatively easy cold backup-and-restore option for EBS volumes.

As an architecture expands into multiple Regions, it can become difficult to track where instances are provisioned. Amazon EC2 Global View helps solve this by providing a centralized dashboard to see Amazon EC2 resources such as instances, VPCs, subnets, security groups, and volumes in all active Regions.

Microservice-based applications that use containers benefit from quicker start-up times. Amazon Elastic Container Registry (Amazon ECR) can help ensure this happens consistently across Regions with private image replication at the registry level. An ECR private registry can be configured for either cross-Region or cross-account replication to ensure your images are ready in secondary Regions when needed.

We bring these compute layer features together in Figure 3.

Figure 3. AMI and EBS snapshot copy across Regions

Summary

It’s important to create a solid foundation when architecting a multi-Region application. These foundations pave the way for you to move fast in a secure, reliable, and elastic way as you build out your application. In this post, we covered options across AWS security, networking, and compute services that have built-in functionality to take away some of the undifferentiated heavy lifting. We’ll cover data, application, and management services in future posts.

Ready to get started? We’ve chosen some AWS Solutions and AWS Blogs to help you!

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

New – Sustainability Pillar for AWS Well-Architected Framework

2021-12-02 Alex Casalboni

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/sustainability-pillar-well-architected-framework/

The AWS Well-Architected Framework has been helping AWS customers improve their cloud architectures since 2015. The framework consists of design principles, questions, and best practices across multiple pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization.

Today we are introducing a new Sustainability Pillar to help organizations learn, measure, and improve their workloads using environmental best practices for cloud computing.

Similar to the other pillars, the Sustainability Pillar contains questions aimed at evaluating the design, architecture, and implementation of your workloads to reduce their energy consumption and improve their efficiency. The pillar is designed as a tool to track your progress toward policies and best practices that support a more sustainable future, not just a simple checklist.

The Shared Responsibility Model of Cloud Sustainability
The shared responsibility model also applies to sustainability. AWS is responsible for the sustainability of the cloud, while AWS customers are responsible for sustainability in the cloud.

The sustainability of the cloud allows AWS customers to reduce associated energy usage by nearly 80% with respect to a typical on-premises deployment. This is possible because of the much higher server utilization, power and cooling efficiency, custom data center design, and continued progress on the path to powering AWS operations with 100% renewable energy by 2025. But we can achieve much more by collectively designing sustainable architectures.

We are introducing the new Sustainability Pillar to help organizations improve their sustainability in the cloud. This is a continuous effort focused on energy reduction and efficiency of all types of workloads. In practice, the pillar helps developers and cloud architects surface the trade-offs, highlight patterns and best practices, and avoid anti-patterns. For example, selecting an efficient programming language, adopting modern algorithms, using efficient data storage techniques, and deploying correctly sized and efficient infrastructure.

Specifically, the pillar is designed to support organizations in developing a better understanding of the state of their workloads, as well as the impact related to defined sustainability targets, how to measure against these targets, and how to model where they cannot directly measure.

In addition to building sustainable workloads in the cloud, you can use AWS technology to solve broader sustainability challenges. For example, reducing the environmental incidents caused by industrial equipment failure using Amazon Monitron to detect abnormal behavior and conduct preventative maintenance. We call this sustainability through the cloud.

Well-Architected Design Principles for Sustainability in the Cloud
The Sustainability Pillar includes design principles and operational guidance, as well as architectural and software patterns.

The design principles will facilitate good design for sustainability:

Understand your impact – Measure business outcomes and the related sustainability impact to establish performance indicators, evaluate improvements, and estimate the impact of proposed changes over time.
Establish sustainability goals – Set long-term goals for each workload, model return on investment (ROI) and give owners the resources to invest in sustainability goals. Plan for growth and design your architecture to reduce the impact per unit of work such as per user or per operation.
Maximize utilization – Right size each workload to maximize the energy efficiency of the underlying hardware, and minimize idle resources.
Anticipate and adopt new, more efficient hardware and software offerings – Support upstream improvements by your partners, continually evaluate hardware and software choices for efficiencies, and design for flexibility to adopt new technologies over time.
Use managed services – Shared services reduce the amount of infrastructure needed to support a broad range of workloads. Leverage managed services to help minimize your impact and automate sustainability best practices such as moving infrequent accessed data to cold storage and adjusting compute capacity.
Reduce the downstream impact of your cloud workloads – Reduce the amount of energy or resources required to use your services and reduce the need for your customers to upgrade their devices; test using device farms to measure impact and test directly with customers to understand the actual impact on them.

Well-Architected Best Practices for Sustainability
The design principles summarized above correspond to concrete architectural best practices that development teams can apply every day.

Some examples of architectural best practices for sustainability:

Optimize geographic placement of workloads for user locations
Optimize areas of code that consume the most time or resources
Optimize impact on customer devices and equipment
Implement a data classification policy
Use lifecycle policies to delete unnecessary data
Minimize data movement across networks
Optimize your use of GPUs
Adopt development and testing methods that allow rapid introduction of potential sustainability improvements
Increase the utilization of your build environments

Many of these best practices are generic and apply to all workloads, while others are specific to some use cases, verticals, and compute platforms. I’d highly encourage you to dive into these practices and identify the areas where you can achieve the most impact immediately.

Transforming sustainability into a non-functional requirement can result in cost effective solutions and directly translate to cost savings on AWS, as you only pay for what you use. In some cases, meeting these non-functional targets might involve tradeoffs in terms of uptime, availability, or response time. Where minor tradeoffs are required, the sustainability improvements are likely to outweigh the change in quality of service. It’s important to encourage teams to continuously experiment with sustainability improvements and embed proxy metrics in their team goals.

Available Now
The AWS Well-Architected Sustainability Pillar is a new addition to the existing framework. By using the design principles and best practices defined in the Sustainability Pillar Whitepaper, you can make informed decisions balancing security, cost, performance, reliability, and operational excellence with sustainability outcomes for your workloads on AWS.

Learn more about the new Sustainability Pillar.

— Alex

Announcing AWS Well-Architected Custom Lenses: Extend the Well-Architected Framework with Your Internal Best Practices

2021-11-29 Alex Casalboni

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/well-architected-custom-lenses-internal-best-practices/

We launched the AWS Well-Architected Framework back in 2015 to help you review workloads against architectural best practices, and across pillars such as operational excellence, security, reliability, performance efficiency, and cost optimization. In 2017, we extended the framework with the concept of “lenses” to optimize specific workload types such as the Serverless Lens, the SaaS Lens, and the Foundational Technical Review (FTR) Lens for APN Partners. In 2018, we launched the AWS Well-Architected Tool, a self-service tool designed to help you review AWS workloads at any time, without the need for an AWS Solutions Architect.

Today, I’m happy to announce the general availability of AWS Well-Architected Custom Lenses, a new feature of the AWS Well-Architected Tool that lets you bring your own best practices to complement the existing framework based on your industry, operational plans, and internal processes. Custom Lenses provide a consolidated view and a consistent way to measure and improve your workloads on AWS without relying on external spreadsheets or third-party systems.

In addition to AWS Well-Architected Lenses, now you can create and share custom lenses and include them in your workload reviews, ultimately tailoring the review to your organizational needs. For example, you could define a custom lens to review your workloads against PCI compliance, SOC 2 compliance, or other national or industry regulations. As an AWS Partner, you might include ad-hoc best practices in your custom lenses when reviewing workloads with customers from different industries and segments, ultimately making the review process easier, faster, and more comprehensive.

How to Define a new Custom Lens
You author a new custom lens by editing a JSON preset template, where you define questions, choices, helpful resources, improvement plans, and risk rules.

Here’s how it works: download the template from the AWS Well-Architected Tool, work on it locally, and then re-upload it.

The JSON structure is composed of multiple pillars. Each pillar might contain multiple questions, each with its own choices and risk rules.

Your JSON file will look like this:

{
    "schemaVersion": "2021-11-01",
    "name": "My Test Lens",
    "description": "This is a description of my test lens.",
    "pillars": [
        {
            "id": "pillar_red",
            "name": "Red Pillar",
            "questions": [
                {
                    "id": "pillar_1_q1",
                    "title": "How do you get started with this pillar?",
                    "description": "Optional description.",
                    "choices": [
                        {
                            "id": "choice1",
                            "title": "Best practice #1",
                            "helpfulResource": {
                                "displayText": "This is helpful text for the first choice.",
                                "url": "https://aws.amazon.com"
                            },
                            "improvementPlan": {
                                "displayText": "This is text that will be shown for improvement of this choice."
                            }
                        },
                        {
                            "id": "choice2",
                            "title": "Best practice #2",
                            ...
                        }
                    ],
                    "riskRules": [
                        {
                            "condition": "choice1 && choice2",
                            "risk": "NO_RISK"
                        },
                        {
                            "condition": "choice1 && !choice2",
                            "risk": "MEDIUM_RISK"
                        },
                        {
                            "condition": "default",
                            "risk": "HIGH_RISK"
                        }
                    ]
                }
            ]
            ...
        },
        ...
    ]
}

Once you’re ready to submit your JSON file, proceed with the upload.

And don’t worry about making it perfect on the first try. You’ll be able to improve it and add new versions.

AWS Well-Architected Custom Lenses in Action
You find the list of custom lenses and their latest version in the new Custom Lenses section.

Each custom lens has an owner and can be shared with multiple AWS accounts too.

Before using this new custom lens in a workload review, you’ll need to publish it and assign it a version.

Select Publish lens and provide a version name such as 1.0.

Now you can create a new workload review and apply both AWS-owned lenses and your own custom lenses, in addition to the main framework.

During the workload review, you will go through each pillar and questions of the custom lens, using the same user interface provided by the AWS Well-Architected Tool.

Last but not least, you can share your custom lens with other AWS Identity and Access Management (IAM) principals such as AWS accounts, IAM users, and IAM roles.

Available Today at No Charge
Custom Lenses are available today in all AWS Regions where the AWS Well-Architected Tool is available, at no cost. You can define up to five custom lenses and share them across AWS Accounts, in addition to the existing Well-Architected Framework and AWS-owned Lenses.

Check out the technical documentation here.

We’re looking forward to hearing your feedback and iterating quickly to improve the authoring and sharing experience based on your needs.

— Alex

Top 5: Featured Architecture Content for September

2021-10-04 Elyse Lopez

Post Syndicated from Elyse Lopez original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-for-september/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our best picks from the new and newly updated content we released in the past month.

1. AWS Best Practices for DDoS Resiliency

Prioritizing the availability and responsiveness of your application helps you maintain customer trust. That’s why it’s crucial to protect your business from the impact of distributed denial of service (DDoS) and other cyberattacks. This whitepaper provides you prescriptive guidance to improve the resiliency of your applications and best practices for how to manage different attack types.

2. Predictive Modeling for Automotive Retail

Automotive retailers use data to better understand how their incentives are helping to sell cars. This new reference architecture diagram shows you how to design a modeling system that provides granular return on investment (ROI) predictions for automotive sales incentives.

3. AWS Graviton Performance Testing – Tips for Independent Software Vendors

If you’re deciding whether to phase in AWS Graviton processors for your workload, this whitepaper covers best practices and common pitfalls for defining test approaches to evaluate Amazon Elastic Compute Cloud (Amazon EC2) instance performance and how to set success factors and compare different test methods and their implementation.

4. Text Analysis with Amazon OpenSearch Service and Amazon Comprehend

This AWS Solutions Implementation was recently updated with new guidance related to Amazon OpenSearch Service, the successor to Amazon Elasticsearch Service. Learn how Amazon OpenSearch Service and Amazon Comprehend work together to deploy a cost-effective, end-to-end solution to extract meaningful insights from unstructured text-based data such as customer calls, support tickets, and online customer feedback.

5. Back to Basics: Hosting a Static Website on AWS

In this episode of Back to Basics, join SA Readiness Specialist Even Zhang as he breaks down the AWS services you can use to host and scale your static website without a single server. You’ll also learn how to use additional functionalities to enhance your observability and security posture or run A/B tests.

Figure 1. CloudFront Edge Locations and Caches from Back to Basics video

Choosing a Well-Architected CI/CD approach: Open Source on AWS

2021-07-27 Mikhail Vasilyev

Post Syndicated from Mikhail Vasilyev original https://aws.amazon.com/blogs/devops/choosing-a-well-architected-ci-cd-approach-open-source-on-aws/

Introduction

When building a CI/CD platform, it is important to make an informed decision regarding every underlying tool. This post explores evaluating the criteria for selecting each tool focusing on a balance between meeting functional and non-functional requirements, and maximizing value.

Your first decision: source code management.

Source code is potentially your most valuable asset, and so we start by choosing a source code management tool. These tools normally have high non-functional requirements in order to protect your assets and to ensure they are available to the organization when needed. The requirements usually include demand for high durability, high availability (HA), consistently high throughput, and strong security with role-based access controls.

At the same time, source code management tools normally have many specific functional requirements as well. For example, the ability to provide collaborative code review in the UI, flexible and tunable merge policies including both automated and manual gates (code checks), and out-of-box UI-level integrations with numerous other tools. These kinds of integrations can include enabling monitoring, CI, chats, and agile project management.

Many teams also treat source code management tools as their portal into other CI/CD tools. They make them shareable between teams, and might prefer to stay within one single context and user interface throughout the entire DevOps cycle. Many source code management tools are actually a stack of services that support multiple steps of your CI/CD workflows from within a single UI. This makes them an excellent starting point for building your CI/CD platforms.

The first decision your need to make is whether to go with an open source solution for managing code or with AWS-managed solutions, such as AWS CodeCommit. Open source solutions include (but are not limited to) the following: Gerrit, Gitlab, Gogs, and Phabricator.

You decision will be influenced by the amount of benefit your team can gain from the flexibility provided through open source, and how well your team can support deploying and managing these solutions. You will also need to consider the infrastructure and management overhead cost.

Engineering teams that have the capacity to develop their own plugins for their CI/CD platforms, or whom even contribute directly to open source projects, will often prefer open source solutions for the flexibility they provide. This will be especially true if they are fluent in designing and supporting their own cloud infrastructure. If the team gets more value by trading the flexibility of open source for not having to worry about managing infrastructure (especially if High Availability, Scalability, Durability, and Security are more critical) an AWS-managed solution would be a better choice.

Source Code Management Solution

When the choice is made in favor of an open-source code management solution (such as Gitlab), the next decision will be how to architect the deployment. Will the team deploy to a single instance, or design for high availability, durability, and scalability? Teams that want to design Gitlab for HA can use the following guide to proceed: Installing GitLab on Amazon Web Services (AWS)

By adopting AWS services (such as Amazon RDS, Amazon ElastiCache for Redis, and Autoscaling Groups), you can lower the management burden of supporting the underlying infrastructure in this self-managed HA scenario.

High level overview of self-managed HA Gitlab deployment

Your second decision: Continuous Integration engine

Selecting your CI engine, you might be able to benefit from additional features of previously selected solutions. Gitlab provides both source control services, as well as built-in CI tools, called Gitlab CI. Gitlab Runners are responsible for running CI jobs, and the actual jobs are described as YML files stored in Gitlab’s git repository along with product code. For security and performance reasons, GitLab Runners should be on resources separate from your GitLab instance.

You could manage those resources or you could use one of the AWS services that can support deploying and managing Runners. The use of an on-demand service removes the expense of implementing and managing a capability that is undifferentiated heavy lifting for you. This provides cost optimization and enables operational excellence. You pay for what you use and the service team manages the underlying service.

Continuous Integration engine Solution

In an architecture example (below), Gitlab Runners are deployed in containers running on Amazon EKS. The team has less infrastructure to manage, can start focusing on development faster by not having to implement the capability, and can provision resources in an optimal way for their on-demand needs.

To further optimize costs, you can use EC2 Spot Instances for your EKS nodes. CI jobs are normally compute intensive and limited in run time. The runner jobs can easily be restarted on a different resource with little impact. This makes them tolerant of failure and the use of EC2 Spot instances very appealing. Amazon EKS and Spot Instances are supported out-of-box in Gitlab. As a result there is no integration to develop, only configuration is required.

To support infrastructure as code best practices, Runners are deployed with Helm and are stored and versioned as Helm charts. All of the infrastructure as code information used to implement the CI/CD platform itself is stored in templates such as Terraform.

High level overview of Infrastructure as Code on Gitlab and Gitlab CI

Your third decision: Container Registry

You will be unable to deploy Runners if the container images are not available. As a result, the primary non-functional requirements for your production container registry are likely to include high availability, durability, transparent scalability, and security. At the same time, your functional requirements for a container registry might be lower. It might be sufficient to have a simple UI, and simple APIs supporting basic flows. Customers looking for a managed solution can use Amazon ECR, which is OCI compliant and supports Helm Charts.

Container Registry Solution

For this set of requirements, the flexibility and feature velocity of open source tools does not provide an advantage. Self-supporting high availability and strengthened security could be costly in implementation time and long-term management. Based on [Blog post 1 Diagram 1], an AWS-managed solution provides cost advantages and has no management overhead. In this case, an AWS-managed solution is a better choice for your container registry than an open-source solution hosted on AWS. In this example, Amazon ECR is selected. Customers who prefer to go with open-source container registries might consider solutions like Harbor.

High level overview of Gitlab CI with Amazon ECR

Additional Considerations

Now that the main services for the CI/CD platform are selected, we will take a high level look at additional important considerations. You need to make sure you have observability into both infrastructure and applications, that backup tools and policies are in place, and that security needs are addressed.

There are many mechanisms to strengthen security including the use of security groups. Use IAM for granular permission control. Robust policies can limit the exposure of your resources and control the flow of traffic. Implement policies to prevent your assets leaving your CI environment inappropriately. To protect sensitive data, such as worker secrets, encrypt these assets while in transit and at rest. Select a key management solution to reduce your operational burden and to support these activities such as AWS Key Management Service (AWS KMS). To deliver secure and compliant application changes rapidly while running operations consistently with automation, implement DevSecOps.

Amazon S3 is durable, secure, and highly available by design making it the preferred choice to store EBS-level backups by many customers. Amazon S3 satisfies the non-functional requirements for a backup store. It also supports versioning and tiered storage classes, making it a cost-effective as well.

Your observability requirements may emphasize versatility and flexibility for application-level monitoring. Using Amazon CloudWatch to monitor your infrastructure and then extending your capabilities through an open-source solutions such as Prometheus may be advantageous. You can get many of the benefits of both open-source Prometheus and AWS services with Amazon Managed Service for Prometheus (AMP). For interactive visualization of metrics, many customers choose solutions such as open-source Grafana, available as an AWS service Amazon Managed Service for Grafana (AMG).

CI/CD Platform with Gitlab and AWS

Conclusion

We have covered how making informed decisions can maximize value and synergy between open-source solutions on AWS, such as Gitlab, and AWS-managed services, such as Amazon EKS and Amazon ECR. You can find the right balance of open-source tools and AWS services that will meet your functional and non-functional requirements, and help maximizing the value you get from those resources.

Pete Goldberg, Director of Partnerships at GitLab: “When aligning your development process to AWS Well Architected Framework, GitLab allows customers to build and automate processes to achieve Operational Excellence. As a single tool designed to facilitate collaboration across the organization, GitLab simplifies the process to follow the Fully Separated Operating Model where Engineering and Operations come together via automated processes that remove the historical barriers between the groups. This gives organizations the ability to efficiently and rapidly deploy new features and applications that drive the business while providing the risk mitigation and compliance they require. By allowing operations teams to define infrastructure as code in the same tool that the engineering teams are storing application code, and allowing your automation bring those together for your CI/CD workflows companies can move faster while having compliance and controls built-in, providing the entire organization greater transparency. With GitLab’s integrations with different AWS compute options (EC2, Lambda, Fargate, ECS or EKS), customers can choose the best type of compute for the job without sacrificing the controls required to maintain Operational Excellence.”

Author bio

Mikhail is a Solutions Architect for RUS-CIS. Mikhail supports customers on their cloud journeys with Well-architected best practices and adoption of DevOps techniques on AWS. Mikhail is a fan of ChatOps, Open Source on AWS and Operational Excellence design principles.

Building well-architected serverless applications: Regulating inbound request rates – part 1

2021-07-22 Julian Wood

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-regulating-inbound-request-rates-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Reliability question REL1: How do you regulate inbound request rates?

Defining, analyzing, and enforcing inbound request rates helps achieve better throughput. Regulation helps you adapt different scaling mechanisms based on customer demand. By regulating inbound request rates, you can achieve better throughput, and adapt client request submissions to a request rate that your workload can support.

Required practice: Control inbound request rates using throttling

Throttle inbound request rates using steady-rate and burst rate requests

Throttling requests limits the number of requests a client can make during a certain period of time. Throttling allows you to control your API traffic. This helps your backend services maintain their performance and availability levels by limiting the number of requests to actual system throughput.

To prevent your API from being overwhelmed by too many requests, Amazon API Gateway throttles requests to your API. These limits are applied across all clients using the token bucket algorithm. API Gateway sets a limit on a steady-state rate and a burst of request submissions. The algorithm is based on an analogy of filling and emptying a bucket of tokens representing the number of available requests that can be processed.

Each API request removes a token from the bucket. The throttle rate then determines how many requests are allowed per second. The throttle burst determines how many concurrent requests are allowed. I explain the token bucket algorithm in more detail in “Building well-architected serverless applications: Controlling serverless API access – part 2”

Token bucket algorithm

API Gateway limits the steady-state rate and burst requests per second. These are shared across all APIs per Region in an account. For further information on account-level throttling per Region, see the documentation. You can request account-level rate limit increases using the AWS Support Center. For more information, see Amazon API Gateway quotas and important notes.

You can configure your own throttling levels, within the account and Region limits to improve overall performance across all APIs in your account. This restricts the overall request submissions so that they don’t exceed the account-level throttling limits.

You can also configure per-client throttling limits. Usage plans restrict client request submissions to within specified request rates and quotas. These are applied to clients using API keys that are associated with your usage policy as a client identifier. You can add throttling levels per API route, stage, or method that are applied in a specific order.

For more information on API Gateway throttling, see the AWS re:Invent presentation “I didn’t know Amazon API Gateway could do that”.

API Gateway throttling

You can also throttle requests by introducing a buffering layer using Amazon Kinesis Data Stream or Amazon SQS. Kinesis can limit the number of requests at the shard level while SQS can limit at the consumer level. For more information on using SQS as a buffer with Amazon Simple Notification Service (SNS), read “How To: Use SNS and SQS to Distribute and Throttle Events”.

Identify steady-rate and burst rate requests that your workload can sustain at any point in time before performance degraded

Load testing your serverless application allows you to monitor the performance of an application before it is deployed to production. Serverless applications can be simpler to load test, thanks to the automatic scaling built into many of the services. During a load test, you can identify quotas that may act as a limiting factor for the traffic you expect and take action.

Perform load testing for a sustained period of time. Gradually increase the traffic to your API to determine your steady-state rate of requests. Also use a burst strategy with no ramp up to determine the burst rates that your workload can serve without errors or performance degradation. There are a number of AWS Marketplace and AWS Partner Network (APN) solutions available for performance testing, Gatling Frontline, BlazeMeter, and Apica.

In the serverless airline example used in this series, you can run a performance test suite using Gatling, an open source tool.

To deploy the test suite, follow the instructions in the GitHub repository perf-tests directory. Uncomment the deploy.perftest line in the repository Makefile.

Perf-test makefile

Once the file is pushed to GitHub, AWS Amplify Console rebuilds the application, and deploys an AWS CloudFormation stack. You can run the load tests locally, or use an AWS Step Functions state machine to run the setup and Gatling load test simulation.

Performance test using Step Functions

The Gatling simulation script uses constantUsersPerSec and rampUsersPerSec to add users for a number of test scenarios. You can use the test to simulate load on the application. Once the tests run, it generates a downloadable report.

Gatling performance results

Artillery Community Edition is another open-source tool for testing serverless APIs. You configure the number of requests per second and overall test duration, and it uses a headless Chromium browser to run its test flows. For Artillery, the maximum number of concurrent tests is constrained by your local computing resources and network. To achieve higher throughput, you can use Serverless Artillery, which runs the Artillery package on Lambda functions. As a result, this tool can scale up to a significantly higher number of tests.

For more information on how to use Artillery, see “Load testing a web application’s serverless backend”. This runs tests against APIs in a demo application. For example, one of the tests fetches 50,000 questions per hour. This calls an API Gateway endpoint and tests whether the AWS Lambda function, which queries an Amazon DynamoDB table, can handle the load.

Artillery performance test

This is a synchronous API so the performance directly impacts the user’s experience of the application. This test shows that the median response time is 165 ms with a p95 time of 201 ms.

Performance test API results

Another consideration for API load testing is whether the authentication and authorization service can handle the load. For more information on load testing Amazon Cognito and API Gateway using Step Functions, see “Using serverless to load test Amazon API Gateway with authorization”.

API load testing with authentication and authorization

Conclusion

Regulating inbound requests helps you adapt different scaling mechanisms based on customer demand. You can achieve better throughput for your workloads and make them more reliable by controlling requests to a rate that your workload can support.

In this post, I cover controlling inbound request rates using throttling. I show how to use throttling to control steady-rate and burst rate requests. I show some solutions for performance testing to identify the request rates that your workload can sustain before performance degradation.

This well-architected question will be continued where I look at using, analyzing, and enforcing API quotas. I cover mechanisms to protect non-scalable resources.

For more serverless learning resources, visit Serverless Land.

Top 5: Featured Architecture Content for June

2021-06-30 Elyse Lopez

Post Syndicated from Elyse Lopez original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-for-june/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our top picks from the new and newly updated content we released this month.

1. Taco Bell: Aurora as The Heart of the Menu Middleware and Data Integration Platform for Taco Bell (YouTube)

This episode of the This is My Architecture video series explores how Taco Bell built a serverless data integration platform to create unique menus for more than 7,000 locations. Find out how this menu middleware uses Amazon Aurora combined with several other services, including AWS Amplify, AWS Lambda, Amazon API Gateway, and AWS Step Functions to create a cost-effective, scalable data pipeline.

2. Well-Architected IoT Lens Checklist

How do you effectively implement AWS IoT workloads? This IoT Lens Checklist provides insights that we have gathered from real-world case studies. This will help you quickly learn the key design elements of Well-Architected IoT workloads. The checklist also provides recommendations for improvement.

3. Derive Insights from AWS Lake House

This whitepaper provides insights and design patterns for cloud architects, data scientists, and developers. It shows you how a lake house architecture allows you to query data across your data warehouse, data lake, and operational databases. Learn how you can store data in a data lake and use a ring of purpose-built data services to quickly make decisions.

4. AWS Limit Monitor Solution

This AWS Solutions Implementation helps you automatically track resource use and avoid overspending. Managed from a centralized location, this ready-to-deploy solution provides a cost-effective way to stay within service quotas by receiving notifications — even via Slack! — before you reach the limit.

5. Cloud Automation for 5G Networks

Digital services providers (DSPs) around the world are focusing on 5G development as part of upgrading their digital infrastructure. This whitepaper explains how DSPs can use different AWS tools and services to fully automate their 5G network deployment and testing and allow orchestration, closed loop use cases, edge analytics, and more.

Choosing a Well-Architected CI/CD approach: Open-source software and AWS Services

2021-06-10 Brian Carlson

Post Syndicated from Brian Carlson original https://aws.amazon.com/blogs/devops/choosing-well-architected-ci-cd-open-source-software-aws-services/

This series of posts discusses making informed decisions when choosing to implement open-source tools on AWS services, adopt managed AWS services to satisfy the same needs, or use a combination of both.

We look at key considerations for evaluating open-source software and AWS services using the perspectives of a startup company and a mature company as examples. You can use these two different points of view to compare to your own organization. To make this investigation easier we will use Continuous Integration (CI) and Continuous Delivery (CD) capabilities as the target of our investigation.

Startup Company rocket and Mature Company rocket

In two related posts, we follow two AWS customers, Iponweb and BigHat Biosciences, as they share their CI/CD journeys, their perspectives, the decisions they made, and why. To end the series, we explore an example reference architecture showing the benefits AWS provides regardless of your emphasis on open-source tools or managed AWS services.

Why CI/CD?

Until your creations are in the hands of your customers, investment in development has provided no return. The faster valuable changes enter production, the greater positive impact you can have on your customer. In today’s highly competitive world, the ability to frequently and consistently deliver value is a competitive advantage. The Operational Excellence (OE) pillar of the AWS Well-Architected Framework recognizes this impact and focuses on the capabilities of CI/CD in two dedicated sections.

The concepts in CI/CD originate from software engineering but apply equally to any form of content. The goal is to support development, integration, testing, deployment, and delivery to production. For example, making changes to an application, updating your machine learning (ML) models, changing your multimedia assets, or referring to the AWS Well-Architected Framework.

Adopting CI/CD and the best practices from the Operational Excellence pillar can help you address risks in your environment, and limit errors from manual processes. More importantly, they help free your teams from the related manual processes, so they can focus on satisfying customer needs, differentiating your organization, and accelerating the flow of valuable changes into production.

A red question mark sits on a field of chaotically arranged black question marks.

How do you decide what you need?

The first question in the Operational Excellence pillar is about understanding needs and making informed decisions. To help you frame your own decision-making process, we explore key considerations from the perspective of a fictional startup company and a fictional mature company. In our two related posts, we explore these same considerations with Iponweb and BigHat.

The key considerations include:

Functional requirements – Providing specific features and capabilities that deliver value to your customers.
Non-functional requirements – Enabling the safe, effective, and efficient delivery of the functional requirements. Non-functional requirements include security, reliability, performance, and cost requirements.
- Without security, you can’t earn customer trust. If your customers can’t trust you, you won’t have customers.
- Without reliability you aren’t available to serve your customers. If you can’t serve your customers, you won’t have customers.
- Performance is focused on timely and efficient delivery of value, not delivering as fast as possible.
- Cost is focused on optimizing the value received for the resources spent (for example, money, time, or effort), not minimizing expense.
Operational requirements – Enabling you to effectively and efficiently support, maintain, sustain, and improve the delivery of value to your customers. When you “Design with Ops in Mind,” you’re enabling effective and efficient support for your business outcomes.

These non-feature-related key considerations are why Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization are the five pillars of the AWS Well-Architected Framework.

The startup company

Any startup begins as a small team of inspired people working together to realize the unique solution they believe solves an unsolved problem.

For our fictional small team, everyone knows each other personally and all speak frequently. We share processes and procedures in discussions, and everyone know what needs to be done. Our team members bring their expertise and dedicate it, and the majority of their work time, to delivering our solution. The results of our efforts inform changes we make to support our next iteration.

However, our manual activities are error-prone and inconsistencies exist in the way we do them. Performing these tasks takes time away from delivering our solution. When errors occur, they have the potential to disrupt everyone’s progress.

We have capital available to make some investments. We would prefer to bring in more team members who can contribute directly to developing our solution. We need to iterate faster if we are to achieve a broadly viable product in time to qualify for our next round of funding. We need to decide what investments to make.

Goals – Reach the next milestone and secure funding to continue development
Needs – Reduce or eliminate the manual processes and associated errors
Priority – Rapid iteration
CI/CD emphasis – Baseline CI/CD capabilities and non-functional requirements are emphasized over a rich feature set

The mature company

Our second fictional company is a large and mature organization operating in a mature market segment. We’re focused on consistent, quality customer experiences to serve and retain our customers.

Our size limits the personal relationships between our service and development teams. The process to make requests, and the interfaces between teams and their systems, are well documented and understood.

However, the systems we have implemented over time, as needs were identified and addressed, aren’t well documented. Our existing tool chain includes some in-house scripting and both supported and unsupported versions of open-source tools. There are limited opportunities for us to acquire new customers.

When conditions change and new features are desired, we want to be able to rapidly implement and deploy those features as fast as possible. If we can differentiate our services, however briefly, we may be able to win customers away from our competitors. Our other path to improved profitability is to evolve our processes, maximizing integration and efficiencies, and capturing cost reductions.

Goals – Differentiate ourselves in the marketplace with desired new features
Needs – Address the risks of poorly documented systems and unsupported software
Priority – Evolve efficiency
CI/CD emphasis – Rich feature set and integrations are emphasized over improving the existing non-functional capabilities

Open-source tools on AWS vs. AWS services

The choice of open-source tools or AWS service is not binary. You can select the combination of solutions that provides the greatest value. You can implement open-source tools for their specific benefits where they outweigh the costs and operational burden, using underlying AWS services like Amazon Elastic Compute Cloud (Amazon EC2) to host them. You can then use AWS managed services, like AWS CodeBuild, for the undifferentiated features you need, without additional cost or operational burden.

A group of people sit around a table discussing the pieces of a puzzle and their ideas.

Feature Set

Our fictional organizations both want to accelerate the flow of beneficial changes into production and are evaluating CI/CD alternatives to support that outcome. Our startup company wants a working solution—basic capabilities, author/code, build, and deploy, so that they can focus on development. Our mature company is seeking every advantage—a rich feature set, extensive opportunities for customization, integration capabilities, and fine-grained control.

Open-source tools

Open-source tools often excel at meeting functional requirements. When a new functionality, capability, or integration is desired, any developer can implement it for themselves, and then contribute their code back to the project. As the user community for an open-source project expands the number of use cases and the features identified grows, so does the number of potential solutions and potential contributors. Developers are using these tools to support their efforts and implement new features that provide value to them.

However, features may be released in unsupported versions and then later added to the supported feature set. Non-functional requirements take time and are less appealing because they don’t typically bring immediate value to the product. Non-functional capabilities may lag behind the feature set.

Consider the following:

Open-source tools may have more features and existing integrations to other tools
The pace of feature set delivery may be extremely rapid
The features delivered are those desired and created by the active members of the community
You are free to implement the features your company desires
There is no commitment to long-term support for the project or any given feature
You can implement open-source tools on multiple cloud providers or on premises
If the project is abandoned, you’re responsible for maintaining your implementation

AWS services

AWS services are driven by customer needs. Services and features are supported by dedicated teams. These customer-obsessed teams focus on all customer needs, with security being their top priority. Both functional and non-functional requirements are addressed with an emphasis on enabling customer outcomes while minimizing the effort they expend to achieve them.

Consider the following:

The pace of delivery of feature sets is consistent
The feature roadmap is driven by customer need and customer requests
The AWS service team is dedicated to support of the service
AWS services are available on the AWS Cloud and on premises through AWS Outposts

Picture showing symbol of dollar

Cost Optimization

Why are we discussing cost after the feature set? Security and reliability are fundamentally more important. Leadership naturally gravitates to following the operational excellence best practice of evaluating trade-offs. Having looked at the potential benefits from the feature set, the next question is typically, “What is this going to cost?” Leadership defines the priorities and allocates the resources necessary (capital, time, effort). We review cost optimization second so that leadership can make a comparison of the expected benefits between CI/CD investments, and investments in other efforts, so they can make an informed decision.

Our organizations are both cost conscious. Our startup is working with finite capital and time. In contrast, our mature company can plan to make investments over time and budget for the needed capital. Early investment in a robust and feature-rich CI/CD tool chain could provide significant advantages towards the startup’s long-term success, but if the startup fails early, the value of that investment will never be realized. The mature company can afford to realize the value of their investment over time and can make targeted investments to address specific short-term needs.

Open-source tools

Open-source software doesn’t have to be purchased, but there are costs to adopt. Open-source tools require appropriate skills in order to be implemented, and to perform management and maintenance activities. Those skills must be gained through dedicated training of team members, team member self-study, or by hiring new team members with the existing skills. The availability of skilled practitioners of open-source tools varies with how popular a tool is and how long it has had an active community. Loss of skilled team members includes the loss of their institutional knowledge and intimacy with the implementation. Skills must be maintained with changes to the tools and as team members join or leave. Time is required from skilled team members to support management and maintenance activities. If commercial support for the tool is desired, it may be available through third-parties at an additional cost.

The time to value of an open-source implementation includes the time to implement and configure the resources and software. Additional value may be realized through investment of time configuring or implementing desired integrations and capabilities. There may be existing community-supported integrations or capabilities that reduce the level of effort to achieve these.

Consider the following:

No cost to acquire the software.
The availability of skill practitioners of open-source tools may be lower. Cost (capital and time) to acquire, establish, or maintain skill set may be higher.
There is an ongoing cost to maintain the team member skills necessary to support the open-source tools.
There is an ongoing cost of time for team members to perform management and maintenance activities.
Additional commercial support for open-source tools may be available at additional cost
Time to value includes implementation and configuration of resources and the open-source software. There may be more predefined community integrations.

AWS services

AWS services are provided pay-as-you-go with no required upfront costs. As of August 2020, more than 400,000 individuals hold active AWS Certifications, a number that grew more than 85% between August 2019 and August 2020.

Time to value for AWS services is extremely short and limited to the time to instantiate or configure the service for your use. Additional value may be realized through the investment of time configuring or implementing desired integrations. Predefined integrations for AWS services are added as part of the service development roadmap. However, there may be fewer existing integrations to reduce your level of effort.

Consider the following:

No cost to acquire the software; AWS services are pay-as-you-go for use.
AWS skill sets are broadly available. Cost (capital and time) to acquire, establish, or maintain skill sets may be lower.
AWS services are fully managed, and service teams are responsible for the operation of the services.
Time to value is limited to the time to instantiate or configure the service. There may be fewer predefined integrations.
Additional support for AWS services is available through AWS Support. Cost for support varies based on level of support and your AWS utilization.

Open-source tools on AWS services

Open-source tools on AWS services don’t impact these cost considerations. Migration off of either of these solutions is similarly not differentiated. In either case, you have to invest time in replacing the integrations and customizations you wish to maintain.

Picture showing a checkmark put on security

Security

Both organizations are concerned about reputation and customer trust. They both want to act to protect their information systems and are focusing on confidentiality and integrity of data. They both take security very seriously. Our startup wants to be secure by default and wants to trust the vendor to address vulnerabilities within the service. Our mature company has dedicated resources that focus on security, and the company practices defense in depth across internal organizations.

The startup and the mature company both want to know whether a choice is safe, secure, and can validate the security of their choice. They also want to understand their responsibilities and the shared responsibility model that applies.

Open-source tools

Open-source tools are the product of the contributors and may contain flaws or vulnerabilities. The entire community has access to the code to test and validate. There are frequently many eyes evaluating the security of the tools. A company or individual may perform a validation for themselves. However, there may be limited guidance on secure configurations. Controls in the implementer’s environment may reduce potential risk.

Consider the following:

You’re responsible for the security of the open-source software you implement
You control the security of your data within your open-source implementation
You can validate the security of the code and act as desired

AWS services

AWS service teams make security their highest priority and are able to respond rapidly when flaws are identified. There is robust guidance provided to support configuring AWS services securely.

Consider the following:

AWS is responsible for the security of the cloud and the underlying services
You are responsible for the security of your data in the cloud and how you configure AWS services
You must rely on the AWS service team to validate the security of the code

Open-source tools on AWS services

Open-source tools on AWS services combine these considerations; the customer is responsible for the open-source implementation and the configuration of the AWS services it consumes. AWS is responsible for the security of the AWS Cloud and the managed AWS services.

Picture showing global distribution for redundancy to depict reliability

Reliability

Everyone wants reliable capabilities. What varies between companies is their appetite for risk, and how much they can tolerate the impact of non-availability. The startup emphasized the need for their systems to be available to support their rapid iterations. The mature company is operating with some existing reliability risks, including unsupported open-source tools and in-house scripts.

The startup and the mature company both want to understand the expected reliability of a choice, meaning what percentage of the time it is expected to be available. They both want to know if a choice is designed for high availability and will remain available even if a portion of the systems fails or is in a degraded state. They both want to understand the durability of their data, how to perform backups of their data, and how to perform recovery in the event of a failure.

Both companies need to determine what is an acceptable outage duration, commonly referred to as a Recovery Time Objective (RTO), and for what quantity of elapsed time it is acceptable to lose transactions (including committing changes), commonly referred to as Recovery Point Objective (RPO). They need to evaluate if they can achieve their RTO and RPO objectives with each of the choices they are considering.

Open-source tools

Open-source reliability is dependent upon the effectiveness of the company’s implementation, the underlying resources supporting the implementation, and the reliability of the open-source software. Open-source tools are the product of the contributors and may or may not incorporate high availability features. Depending on the implementation and tool, there may be a requirement for downtime for specific management or maintenance activities. The ability to support RTO and RPO depends on the teams supporting the company system, the implementation, and the mechanisms implemented for backup and recovery.

Consider the following:

You are responsible for implementing your open-source software to satisfy your reliability needs and high availability needs
Open-source tools may have downtime requirements to support specific management or maintenance activities
You are responsible for defining, implementing, and testing the backup and recovery mechanisms and procedures
You are responsible for the satisfaction of your RTO and RPO in the event of a failure of your open-source system

AWS services

AWS services are designed to support customer availability needs. As managed services, the service teams are responsible for maintaining the health of the services.

Consider the following:

AWS services are fully managed and service teams are responsible for the health of the services.
AWS services are designed and implemented to support customer reliability requirements. AWS CodeCommit is specifically designed for high availability. AWS CodeCommit, AWS CodeBuild, AWS CodePipeline, and AWS CodeDeploy are all engineered to exceed 99.9% availability.
CodeCommit, CodeBuild, CodePipeline, and CodeDeploy use highly durable services, including Amazon S3 and Amazon DynamoDB, to store customer data redundantly across multiple facilities.

Open-source tools on AWS services

Open-source tools on AWS services combine these considerations; the customer is responsible for the open-source implementation (including data durability, backup, and recovery) and the configuration of the AWS services it consumes. AWS is responsible for the health of the AWS Cloud and the managed services.

Picture showing a graph depicting performance measurement

Performance

What defines timely and efficient delivery of value varies between our two companies. Each is looking for results before an engineer becomes idled by having to wait for results. The startup iterates rapidly based on the results of each prior iteration. There is limited other activity for our startup engineer to perform before they have to wait on actionable results. Our mature company is more likely to have an outstanding backlog or improvements that can be acted upon while changes moves through the pipeline.

Open-source tools

Open-source performance is defined by the resources upon which it is deployed. Open-source tools that can scale out can dynamically improve their performance when resource constrained. Performance can also be improved by scaling up, which is required when performance is constrained by resources and scaling out isn’t supported. The performance of open-source tools may be constrained by characteristics of how they were implemented in code or the libraries they use. If this is the case, the code is available for community or implementer-created improvements to address the limitation.

Consider the following:

You are responsible for managing the performance of your open-source tools
The performance of open-source tools may be constrained by the resources they are implemented upon; the code and libraries used; their system, resource, and software configuration; and the code and libraries present within the tools

AWS services

AWS services are designed to be highly scalable. CodeCommit has a highly scalable architecture, and CodeBuild scales up and down dynamically to meet your build volume. CodePipeline allows you to run actions in parallel in order to increase your workflow speeds.

Consider the following:

AWS services are fully managed, and service teams are responsible for the performance of the services.
AWS services are designed to scale automatically.
Your configuration of the services you consume can affect the performance of those services.
AWS services quotas exist to prevent unexpected costs. You can make changes to service quotas that may affect performance and costs.

Open-source tools on AWS services

Open-source tools on AWS services combine these considerations; the customer is responsible for the open-source implementation (including the selection and configuration of the AWS Cloud resources) and the configuration of the AWS services it consumes. AWS is responsible for the performance of the AWS Cloud and the managed AWS services.

Picture showing cart-wheels in motion, depicting operations

Operations

Our startup company wants to limit its operations burden as much as possible in order to focus on development efforts. Our mature company has an established and robust operations capability. In both cases, they perform the management and maintenance activities necessary to support their needs.

Open-source tools

Open-source tools are supported by their volunteer communities. That support is voluntary, without any obligation or commitment from the users. If either company adopts open-source tools, they’re responsible for the management and maintenance of the system. If they want additional support with an obligation and commitment to support their implementation, third parties may provide commercial support at additional cost.

Consider the following:

You are responsible for supporting your implementation.
The open-source community may provide volunteer support for the software.
There is no commitment to support the software by the open-source community.
There may be less documentation, or accepted best practices, available to support open-source tools.
Early adoption of open-source tools, or the use of development builds, includes the chance of encountering unidentified edge cases and unanticipated issues.
The complexity of an implementation and its integrations may increase the difficulty to support open-source tools. The time to identify contributing factors may be extended by the complexity during an incident. Maintaining a set of skilled team members with deep understanding of your implementation may help mitigate this risk.
You may be able to acquire commercial support through a third party.

AWS services

AWS services are committed to providing long-term support for their customers.

Consider the following:

There is long-term commitment from AWS to support the service
As a managed service, the service team maintains current documentation
Additional levels of support are available through AWS Support
Support for AWS is available through partners and third parties

Open-source tools on AWS services

Open-source tools on AWS services combine these considerations. The company is responsible for operating the open-source tools (for example, software configuration changes, updates, patching, and responding to faults). AWS is responsible for the operation of the AWS Cloud and the managed AWS services.

Conclusion

In this post, we discussed how to make informed decisions when choosing to implement open-source tools on AWS services, adopt managed AWS services, or use a combination of both. To do so, you must examine your organization and evaluate the benefits and risks.

A magnifying glass is focused on the single red figure in a group of otherwise blue paper figures standing on a white surface.

Examine your organization

You can make an informed decision about the capabilities you adopt. The insight you need can be gained by examining your organization to identify your goals, needs, and priorities, and discovering what your current emphasis is. Ask the following questions:

What is your organization trying to accomplish and why?
How large is your organization and how is it structured?
How are roles and responsibilities distributed across teams?
How well defined and understood are your processes and procedures?
How do you manage development, testing, delivery, and deployment today?
What are the major challenges your organization faces?
What are the challenges you face managing development?
What problems are you trying to solve with CI/CD tools?
What do you want to achieve with CI/CD tools?

Evaluate benefits and risk

Armed with that knowledge, the next step is to explore the trade-offs between open-source options and managed AWS services. Then evaluate the benefits and risks in terms of the key considerations:

Features
Cost
Security
Reliability
Performance
Operations

When asked “What is the correct answer?” the answer should never be “It depends.” We need to change the question to “What is our use case and what are our needs?” The answer will emerge from there.

Make an informed decision

A Well-Architected solution can include open-source tools, AWS Services, or any combination of both! A Well-Architected choice is an informed decision that evaluates trade-offs, balances benefits and risks, satisfies your requirements, and most importantly supports the achievement of your business outcomes.

Read the other posts in this series and take this journey with BigHat Biosciences and Iponweb as they share their perspectives, the decisions they made, and why.

Resources

Want to learn more? Check out the following CI/CD and developer tools on AWS:

Continuous integration (CI)
Continuous delivery (CD)
AWS Developer Tools

For more information about the AWS Well-Architected Framework, refer to the following whitepapers:

AWS Well-Architected Framework
AWS Well-Architected Operational Excellence pillar
AWS Well-Architected Security pillar
AWS Well-Architected Reliability pillar
AWS Well-Architected Performance Efficiency pillar
AWS Well-Architected Cost Optimization pillar

The 3 hexagons of the well architected logo appear to the right of the words AWS Well-Architected.

Author bio

Brian is the global Operational Excellence lead for the AWS Well-Architected program. Formerly the technical lead for an international network, Brian works with customers and partners researching the operations best practices with the greatest positive impact and produces guidance to help you achieve your goals.

Choosing a CI/CD approach: Open Source on AWS, an Iponweb story

2021-05-11 Mikhail Vasilyev

Post Syndicated from Mikhail Vasilyev original https://aws.amazon.com/blogs/devops/choosing-a-ci-cd-approach-open-source-on-aws-an-iponweb-story/

Iponweb is a global leader in building programmatic and real-time advertising technology and infrastructure for some of the world’s biggest digital media buyers and sellers. The company develops client-facing products and internal development tools that must be platform agnostic to support spanning across multiple cloud services.

In this post, we explore how Iponweb applied key considerations when choosing a continuous integration, continuous deployment (CI/CD), what they determined to be the right CI/CD approach for them, and review some considerations that may apply to your own business needs. And in the next post, we will dive even deeper into these key considerations.

How did Iponweb decide what they needed?

The first and most important question in designing a Well-Architected approach is: “How do you determine your priorities?” AWS Well-Architected defines the first two best practices to do that as: ”evaluate external customer needs” (Iponweb’s clients) and “evaluate internal customer needs” (Iponweb’s team).

Iponweb started with these two considerations while selecting the strategic toolset. After evaluating their customers’ requirements, the next step was to look at the needs of the Iponweb team. Their priorities included the products and features required, the cost, and the ability to build multi-cloud solutions.

Iponweb is dedicated to operating securely with the reliability and performance to support their customers. Solutions had to satisfy their fundamental requirements in these areas to be considered in their evaluation.

Feature set

Iponweb evaluated available options for the CI tool chain and found that, for their needs, GitLab was the clear winner, differentiated by delivering the greatest number of required features at the best price while being platform agnostic.

AWS had the complete set of tools, services, and best practices to support Iponweb’s goal to establish an open-source, self-hosted CI environment using GitLab. Upon completing their thorough evaluation process, Iponweb selected AWS to implement its CI environment.

Cost

Iponweb understood the investment they would be making within their team to leverage and support all the desired features of GitLab. Iponweb evaluated the expertise of its internal teams and factored in ease of integration with supporting services.

They adopted several AWS services that satisfied their undifferentiated needs, which allowed them to remove the operational burden and cost of maintaining their own implementations of various capabilities and features.

Furthermore, the availability of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances provided the opportunity to further manage costs for their CI resource needs and usage patterns.

Security

Iponweb leveraged their existing security control implementations and integration with AWS to support adopting additional AWS services. AWS was responsible for the security of the cloud, including the underlying AWS services. Iponweb was able to focus on secure and effective configurations of those services and secure and effective configuration of their GitLab implementation. This ensured the security of their open-source, self-hosted CI environment.

When setting priorities for the design of a Well-Architected approach, it’s imperative to “manage benefits and risks,” which emphasizes making informed decisions when adopting open source or any tools. Iponweb achieved their best value solution by applying Well-Architected practices in Operational Excellence, Cost Optimization, and Security pillars by leveraging AWS products and services.

Overview of solution

Continuous integration consists of three key processes, each of which AWS supports:

Code stage – Iponweb built the centralized Git repository on the GitLab platform on EC2 servers, providing the UI and API to store and manage the code.
Test and build stage – They used GitLab as the application layer to manage build and test flows through GitLab Runners (compute workers for CI jobs). This layer is implanted via GitLab in containers, and is deployed and managed by Amazon Elastic Kubernetes Service (Amazon EKS).
Publish stage – Amazon Elastic Container Registry (Amazon ECR) stores the infrastructure containers for the runners and product containers.

The following diagram illustrates this architecture:

At the core of Iponweb’s CI platform architecture is the open-source GitLab Community Edition.

Implementing the solution

CI jobs are either run regularly or triggered by events such as merge requests. The jobs are described as code in YAML files and are stored and versioned along with the product code itself. Runner versions are published into Amazon ECR and launched as Docker containers in Amazon EKS.

Runner code is stored as Helm charts that help Iponweb package up and manage their large-scale Kubernetes deployments. In addition, Amazon EKS has support for Helm and many other plugins for Kubernetes.

Iponweb developers innovate at a very fast pace, and customize Iponweb’s client solutions in rapid iterations. To address uncertain container registry requirements, Iponweb decided to use Amazon ECR. As a managed service, Amazon ECR eliminates concerns about scaling capacity and management. Integration of GitLab with Amazon EKS and Amazon ECR is provided out of the box through a UI and predefined scripts, with no additional overhead to develop and deploy code or plugins.

Iponweb was able to implement the Well-Architected design principle: “stop continuously estimating its capacity needs.” Enabling them to focus on more strategic development activities. They performed a thorough analysis of each component, looking at the total cost of ownership, including operations and management. In doing so, they implemented the best practice from the Cost Optimization pillar: “How do you evaluate cost when you select services?”

In the Cost Optimization pillar, a key question is “How do you use pricing models to reduce costs?” Iponweb deployed runners in Amazon EKS for precise, granular, and on-demand compute scaling for each CI job. These tasks have short-term capacity needs, so Iponweb benefited from configuring Amazon EKS on Spot Instances, achieving factor price reduction. The EC2 Spot pricing model is most appropriate for their CI resource needs and usage patterns.

To protect their data at rest, Iponweb followed a best practice from the Security pillar: “Implement secure key management.” They used AWS Key Management Service (AWS KMS) to manage secrets for the runners.

To protect the code and artifacts, and to ensure these valuable assets don’t leave the CI environment inappropriately, Iponweb followed best practices in Infrastructure Protection from the Security pillar question, “How do you protect your networks?” Iponweb scrupulously defined the network protection requirements, limiting their exposure by controlling traffic at all layers, and implementing security groups to prevent inappropriate access into and out of their VPC.

Michael Benuhis, CTO at Iponweb, says:

“Iponweb was able to get the best of open-source software and public cloud services by building the continuous integration platform on Amazon Web Services. Open-source tools provided Iponweb platform agnosticism for serving our diverse customer base, while managed Amazon EKS on EC2 Spot Instances eliminated the operational burden of managing our own Kubernetes infrastructure, and with greater cost efficiency.”

Conclusion

Iponweb has satisfied their current needs and aren’t looking for improvement in the short term. They will stay on the free version of GitLab, satisfied for the moment with what they have achieved. They have custom automations in place to synchronize with GitLab and integrate with their existing tools. They like the features provided by the paid version of GitLab, but there isn’t a business case to support an informed decision to upgrade at this time.

They have achieved their goal of using Amazon EKS and Spot under GitLab CI/CD integrated with their existing systems and satisfying their needs.

How to approach threat modeling

2021-01-12 Darran Boyd

Post Syndicated from Darran Boyd original https://aws.amazon.com/blogs/security/how-to-approach-threat-modeling/

In this post, I’ll provide my tips on how to integrate threat modeling into your organization’s application development lifecycle. There are many great guides on how to perform the procedural parts of threat modeling, and I’ll briefly touch on these and their methodologies. However, the main aim of this post is to augment the existing guidance with some additional tips on how to handle the people and process components of your threat modeling approach, which in my experience goes a long way to improving the security outcomes, security ownership, speed to market, and general happiness of all involved. Furthermore, I’ll also provide some guidance specific to when you’re using Amazon Web Services (AWS).

Let’s start with a primer on threat modeling.

Why use threat modeling

IT systems are complex, and are becoming increasingly more complex and capable over time, delivering more business value and increased customer satisfaction and engagement. This means that IT design decisions need to account for an ever-increasing number of use cases, and be made in a way that mitigates potential security threats that may lead to business-impacting outcomes, including unauthorized access to data, denial of service, and resource misuse.

This complexity and number of use-case permutations typically makes it ineffective to use unstructured approaches to find and mitigate threats. Instead, you need a systematic approach to enumerate the potential threats to the workload, and to devise mitigations and prioritize these mitigations to make sure that the limited resources of your organization have the maximum impact in improving the overall security posture of the workload. Threat modeling is designed to provide this systematic approach, with the aim of finding and addressing issues early in the design process, when the mitigations have a low relative cost compared to later in the lifecycle.

The AWS Well-Architected Framework calls out threat modeling as a specific best practice within the Security Pillar, under the area of foundational security, under the question SEC 1: How do you securely operate your workload?:

“Identify and prioritize risks using a threat model: Use a threat model to identify and maintain an up-to-date register of potential threats. Prioritize your threats and adapt your security controls to prevent, detect, and respond. Revisit and maintain this in the context of the evolving security landscape.”

Threat modeling is most effective when done at the workload (or workload feature) level, in order to ensure that all context is available for assessment. AWS Well-Architected defines a workload as:

“A set of components that together deliver business value. The workload is usually the level of detail that business and technology leaders communicate about. Examples of workloads are marketing websites, e-commerce websites, the back-ends for a mobile app, analytic platforms, etc. Workloads vary in levels of architectural complexity, from static websites to architectures with multiple data stores and many components.”

The core steps of threat modeling

In my experience, all threat modeling approaches are similar; at a high level, they follow these broad steps:

Identify assets, actors, entry points, components, use cases, and trust levels, and include these in a design diagram.
Identify a list of threats.
Per threat, identify mitigations, which may include security control implementations.
Create and review a risk matrix to determine if the threat is adequately mitigated.

To go deeper into the general practices associated with these steps, I would suggest that you read the SAFECode Tactical Threat Modeling whitepaper and the Open Web Application Security Project (OWASP) Threat Modeling Cheat Sheet. These guides are great resources for you to consider when adopting a particular approach. They also reference a number of tools and methodologies that are helpful to accelerate the threat modeling process, including creating threat model diagrams with the OWASP Threat Dragon project and determining possible threats with the OWASP Top 10, OWASP Application Security Verification Standard (ASVS) and STRIDE. You may choose to adopt some combination of these, or create your own.

When to do threat modeling

Threat modeling is a design-time activity. It’s typical that during the design phase you would go beyond creating a diagram of your architecture, and that you may also be building in a non-production environment—and these activities are performed to inform and develop your production design. Because threat modeling is a design-time activity, it occurs before code review, code analysis (static or dynamic), and penetration testing; these all come later in the security lifecycle.

Always consider potential threats when designing your workload from the earliest phases—typically when people are still on the whiteboard (whether physical or virtual). Threat modeling should be performed during the design phase of a given workload feature or feature change, as these changes may introduce new threats that require you to update your threat model.

Threat modeling tips

Ultimately, threat modeling requires thought, brainstorming, collaboration, and communication. The aim is to bridge the gap between application development, operations, business, and security. There is no shortcut to success. However, there are things I’ve observed that have meaningful impacts on the adoption and success of threat modeling—I’ll be covering these areas in the following sections.

1. Assemble the right team

Threat modeling is a “team sport,” because it requires the knowledge and skill set of a diverse team where all inputs can be viewed as equal in value. For all listed personas in this section, the suggested mindset is to start from your end-customers’ expectations, and work backwards. Think about what your customers expect from this workload or workload feature, both in terms of its security properties and maintaining a balance of functionality and usability.

I recommend that the following perspectives be covered by the team, noting that a single individual can bring more than one of these perspectives to the table:

The Business persona – First, to keep things grounded, you’ll want someone who represents the business outcomes of the workload or feature that is part of the threat modeling process. This person should have an intimate understanding of the functional and non-functional requirements of the workload—and their job is to make sure that these requirements aren’t unduly impacted by any proposed mitigations to address threats. Meaning that if a proposed security control (that is, mitigation) renders an application requirement unusable or overly degraded, then further work is required to come to the right balance of security and functionality.

The Developer persona – This is someone who understands the current proposed design for the workload feature, and has had the most depth of involvement in the design decisions made to date. They were involved in design brainstorming or whiteboarding sessions leading up to this point, when they would typically have been thinking about threats to the design and possible mitigations to include. In cases where you are not developing your own in-house application (e.g. COTS applications) you would bring in the internal application owner.

The Adversary persona – Next, you need someone to play the role of the adversary. The aim of this persona is to put themselves in the shoes of an attacker, and to critically review the workload design and look for ways to take advantage of a design flaw in the workload to achieve a particular objective (for example, unauthorized sharing of data). The “attacks” they perform are a mental exercise, not actual hands-on-keyboard exploitation. If your organization has a so-called Red Team, then they could be a great fit for this role; if not, you may want to have one or more members of your security operations or engineering team play this role. Or alternately, bring in a third party who is specialized in this area.

The Defender persona – Then, you need someone to play the role of the defender. The aim of this persona is to see the possible “attacks” designed by the adversary persona as potential threats, and to devise security controls that mitigate the threats. This persona also evaluates whether the possible mitigations are reasonably manageable in terms of on-going operational support, monitoring, and incident response.

The AppSec SME persona – The Application Security (AppSec) subject matter expert (SME) persona should be the most familiar with the threat modeling process and discussion moderation methods, and should have a depth of IT security knowledge and experience. Discussion moderation is crucial for the overall exercise process to make sure that the overall objectives of the process are kept on-track, and that the appropriate balance between security and delivery of the customer outcome is maintained. Ultimately, it’s this persona who endorses the threat model and advises the scope of the actions beyond the threat modeling exercise—for example, penetration testing scope.

2. Have a consistent approach

In the earlier section, I listed some of the popular threat modeling approaches, and which one you select is not as important as using it consistently both within and across your teams.

By using a consistent approach and format, teams can move faster and estimate effort more accurately. Individuals can learn from examples, by looking at threat models developed by other team members or other teams—saving them from having to start from scratch.

When your team estimates the effort and time required to create a threat model, the experience and time taken from previous threat models can be used to provide more accurate estimations of delivery timelines.

Beyond using a consistent approach and format, consistency in the granularity and relevance of the threats being modeled is key. Later in this post I describe a recommendation for creating a catalog of threats for reuse across your organization.

Finally, and importantly, this approach allows for scalability: if a given workload feature that’s undergoing a threat modeling exercise is using components that have an existing threat model, then the threat model (or individual security controls) of those components can be reused. With this approach, you can effectively take a dependency on a component’s existing threat model, and build on that model, eliminating re-work.

3. Align to the software delivery methodology

Your application development teams already have a particular workflow and delivery style. These days, Agile-style delivery is most popular. Ensure that the approach you take for threat modeling integrates well with both your delivery methodology and your tools.

Just like for any other deliverable, capture the user stories related to threat modeling as part of the workload feature’s sprint, epic, or backlog.

4. Use existing workflow tooling

Your application development teams are already using a suite of tools to support their delivery methodology. This would typically include collaboration tools for documentation (for example, a team wiki), and an issue-tracking tool to track work products through the software development lifecycle. Aim to use these same tools as part of your security review and threat modeling workflow.

Existing workflow tools can provide a single place to provide and view feedback, assign actions, and view the overall status of the threat modeling deliverables of the workload feature. Being part of the workflow reduces the friction of getting the project done and allows threat modeling to become as commonplace as unit testing, QA testing, or other typical steps of the workflow.

By using typical workflow tools, team members working on creating and reviewing the threat model can work asynchronously. For example, when the threat model reviewer adds feedback, the author is notified, and then the author can address the feedback when they have time, without having to set aside dedicated time for a meeting. Also, this allows the AppSec SME to more effectively work across multiple threat model reviews that they may be engaged in.

Having a consistent approach and language as described earlier is an important prerequisite to make this asynchronous process feasible, so that each participant can read and understand the threat model without having to re-learn the correct interpretation each time.

5. Break the workload down into smaller parts

It’s advisable to decompose (break down) the workload into features and perform the threat modeling exercise at the feature level, rather than create a single threat model for an entire workload. This approach has a number of key benefits:

Having smaller chunks of work allows more granular tracking of progress, which aligns well with development teams that are following Agile-style delivery, and gives leadership a constant view of progress.
This approach tends to create threat models that are more detailed, which results in more findings being identified.
Decomposing also opens up the opportunity for the threat model to be reused as a dependency for other workload features that use the same components.
By considering threat mitigations for each component, at the overall workload level this means that a single threat may have multiple mitigations, resulting in an improved resilience against those threats.
Issues with a single threat model, for example a critical threat which is not yet mitigated, does not become launch blocking for the entire workload, but rather just for the individual feature.

The question then becomes, how far should you decompose the workload?

As a general rule, in order to create a threat model, the following context is required, at a minimum:

One asset. For example, credentials, customer records, and so on.
One entry point. For example, Amazon API Gateway REST API deployment.
Two components. For example, a web browser and an API Gateway REST API; or API Gateway and an AWS Lambda function.

Creating a threat model for a given AWS service (for example, API Gateway) in isolation wouldn’t fully meet this criteria—given that the service is a single component, there is no movement of the data from one component to another. Furthermore, the context of all the possible use cases of the service within a workload isn’t known, so you can’t comprehensively derive the threats and mitigations. AWS performs threat modeling of the multiple features that make up a given AWS service. Therefore, for your workload feature that leverages a given AWS service, you wouldn’t need to threat model the AWS service, but instead consider the various AWS service configuration options and your own workload-specific mitigations when you look to mitigate the threats you’ve identified. I go into more depth on this in the “Identify and evaluate mitigations” section, where I go into the concept of baseline security controls.

6. Distribute ownership

Having a central person or department responsible for creation of threat models doesn’t work in the long run. These central entities become bottlenecks and can only scale up with additional head count. Furthermore, centralized ownership doesn’t empower those who are actually designing and implementing your workload features.

Instead, what scales well is distributed ownership of threat model creation by the team that is responsible for designing and implementing each workload feature. Distributed ownership scales and drives behavior change, because now the application teams are in control, and importantly they’re taking security learnings from the threat modeling process and putting those learnings into their next feature release, and therefore constantly improving the security of their workload and features.

This creates the opportunity for the AppSec SME (or team) to effectively play the moderator and security advisor role to all the various application teams in your organization. The AppSec SME will be in a position to drive consistency, adoption, and communication, and to set and raise the security bar among teams.

7. Identify entry points

When you look to identify entry points for AWS services that are components within your overall threat model, it’s important to understand that, depending on the type of AWS service, the entry points may vary based on the architecture of the workload feature included in the scope of the threat model.

For example, with Amazon Simple Storage Service (Amazon S3), the possible types of entry-points to an S3 bucket are limited to what is exposed through the Amazon S3 API, and the service doesn’t offer the capability for you, as a customer, to create additional types of entry points. In this Amazon S3 example, as a customer you make choices about how these existing types of endpoints are exposed—including whether the bucket is private or publicly accessible.

On the other end of the spectrum, Amazon Elastic Compute Cloud (Amazon EC2) allows customers to create additional types of entry-points to EC2 instances (for example, your application API), besides the entry-point types that are provided by the Amazon EC2 API and those native to the operating system running on the EC2 instance (for example, SSH or RDP).

Therefore, make sure that you’re applying the entry points that are specific to the workload feature, in additional to the native endpoints for AWS services, as part of your threat model.

8. Come up with threats

Your aim here is to try to come up with answers to the question “What can go wrong?” There isn’t any canonical list that lists all the possible threats, because determining threats depends on the context of the workload feature that’s under assessment, and the types of threats that are unique to a given industry, geographical area, and so on.

Coming up with threats requires brainstorming. The brainstorming exercise can be facilitated by using a mnemonic like STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege), or by looking through threat lists like the OWASP Top 10 or HiTrust Threat Catalog to get the ideas flowing.

Through this process, it’s recommended that you develop and contribute to a threat catalog that is contextual to your organization and will accelerate the brainstorming process going forward, as well as drive consistency in the granularity of threats that you model.

9. Identify and evaluate mitigations

Here, your aim is to identify the mitigations (security controls) within the workload design and evaluate whether threats have been adequately addressed. Keep in mind that there are typically multiple layers of controls and multiple responsibilities at play.

For your own in-house applications and code, you would want to review the mitigations you’ve included in your design—including, but not limited to, input validation, authentication, session handling, and bounds handling.

Consider all other components of your workload (for example, software as a service (SaaS), infrastructure supporting your COTS applications, or components hosted within your on-premises data centers) and determine the security controls that are part of the workload design.

When you use AWS services, Security and Compliance is a shared responsibility between AWS and you as our customer. This is described on the AWS Shared Responsibility Model page.

This means, for the portions of the AWS services that you’re using that are the responsibility of AWS (Security of the Cloud), the security controls are managed by AWS, along with threat identification and mitigation. The distribution of responsibility between AWS (Security of the Cloud) and you (Security in the Cloud) depends on which AWS service you use. Below, I provide examples of infrastructure, container, and abstracted AWS services to show how your responsibility for identifying and mitigating threats can vary:

Amazon EC2 is a good example of an infrastructure service, where you are able to access a virtual server in the cloud, you get to choose the operating system, and you have control of the service and all aspects you run on it—so you would be responsible for mitigating the identified threats.
Amazon Relational Database Service (Amazon RDS) is a representative example of a container service, where there is no operating system exposed for you, and instead AWS exposes the selected database engine to you (for example, MySQL). AWS is responsible for the security of the operating system in this example, and you don’t need to devise mitigations. However, the database engine is under your control as well as all aspects above it, so you would need to consider mitigations for these areas. Here, AWS is taking on a larger portion of the responsibility compared to infrastructure services.
Amazon S3, AWS Key Management Service (AWS KMS), and Amazon DynamoDB are examples of an abstracted service where AWS exposes the entire service control plane and data plane to you through the service API. Again, here there are no operating systems, database engines, or platforms exposed to you—these are an AWS responsibility. However, the API actions and associated policies are under your control and so are all aspects above the API level, so you should be considering mitigations for these. For this type of service, AWS takes a larger portion of responsibility compared to container and infrastructure types of services.

While these examples do not encompass all types of AWS services that may be in your workload, they demonstrate how your Security and Compliance responsibilities under the Shared Responsibility Model will vary in this context. Understanding the balance of responsibilities between AWS and yourself for the types of AWS services in your workload helps you scope your threat modeling exercise to the mitigations that are under your control, which are typically a combination of AWS service configuration options and your own workload-specific mitigations. For the AWS portion of the responsibility, you will find that AWS services are in-scope of many compliance programs, and the audit reports are available for download for AWS customers (at no cost) from AWS Artifact.

Regardless of which AWS services you’re using, there’s always an element of customer responsibility, and this should be included in your workload threat model.

Specifically, for security control mitigations for the AWS services themselves, you’d want to consider security controls across domains, including these domains: Identity and Access Management (Authentication/Authorization), Data Protection (At-Rest, In-Transit), Infrastructure Security, and Logging and Monitoring. AWS services each have a dedicated security chapter in the documentation, which provides guidance on the security controls to consider as mitigations. When capturing these security controls and mitigations in your threat model, you should aim to include references to the actual code, IAM policies, and AWS CloudFormation templates located in the workload’s infrastructure-as-code repository, and so on. This helps the reviewer or approver of your threat model to get an unambiguous view of the intended mitigation.

As in the case for threat identification, there’s no canonical list enumerating all the possible security controls. Through the process described here, you should consciously develop baseline security controls that align to your organization’s control objectives, and where possible, implement these baseline security controls as platform-level controls, including AWS service-level configurations (for example, encryption at rest) or guardrails (for example, through service control policies). By doing this, you can drive consistency and scale, so that these implemented baseline security controls are automatically inherited and enforced for other workload features that you design and deploy.

When you come up with the baseline security controls, it’s important to note that the context of a given workload feature isn’t known. Therefore, it’s advisable to consider these controls as a negotiable baseline that you can deviate from, provided that when you perform the workload threat modeling exercise, you find that the threat that the baseline control was designed to mitigate isn’t applicable, or there are other mitigations or compensating controls that adequately mitigate the threat. Compensating controls and mitigating factors could include: reduced data asset classification, non-human access, or ephemeral data/workload.

To learn more about how to start thinking about baseline security controls as part of your overall cloud security governance, have a look at the How to think about cloud security governance blog post.

10. Decide when enough is enough

There’s no perfect answer to this question. However, it’s important to have a risk-based perspective on the threat modeling process to create a balanced approach, so that the likelihood and impact of a risk are appropriately considered. Over-emphasis on “let’s build and ship it” could lead to significant costs and delays later. Conversely, over-emphasis on “let’s mitigate every conceivable threat” could lead to the workload feature shipping late (or never), and your customers might move on. In the recommendation I made earlier in the “Assemble the right team” section, the selection of personas is deliberate to make sure that there’s a natural tension between shipping the feature, and mitigating threats. Embrace this healthy tension.

11. Don’t let paralysis stop you before you start

Earlier in the “Break the workload down into smaller parts” section, I gave the recommendation that you should scope your threat models down to a workload feature. You may be thinking to yourself, “We’ve already shipped <X number> of features, how do we threat model those?” This is a completely reasonable question.

My view is that rather than go back to threat model features that are already live, aim to threat model any new features that you are working on now and improve the security properties of the code you ship next, and for each feature you ship after that. During this process you, your team, and your organization will learn—not just about threat modeling—but how to communicate effectively with one another. Make adjustments, iterate, improve. Sometime in the future, when you’re routinely providing high quality, consistent and reusable threat models for your new features, you can start putting activities to perform threat modeling for existing features into your backlog.

Conclusion

Threat modeling is an investment—in my view, it’s a good one, because finding and mitigating threats in the design phase of your workload feature can reduce the relative cost of mitigation, compared to finding the threats later. Consistently implementing threat modeling will likely also improve your security posture over time.

I’ve shared my observations and tips for practical ways to incorporate threat modeling into your organization, which center around communication, collaboration, and human-led expertise to find and address threats that your end customer expects. Armed with these tips, I encourage you to look across the workload features you’re working on now (or have in your backlog) and decide which ones will be the first you’ll threat model.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

New – SaaS Lens in AWS Well-Architected Tool

2020-12-03 Danilo Poccia

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-saas-lens-in-aws-well-architected-tool/

To help you build secure, high-performing, resilient, and efficient solutions on AWS, in 2015 we publicly launched the AWS Well-Architected Framework. It started as a single whitepaper but has expanded to include domain-specific lenses, hands-on labs, and the AWS Well-Architected Tool (available at no cost in the AWS Management Console) that provides a mechanism for regularly evaluating your workloads, identifying high risk issues, and recording your improvements.

To offer more workload-specific advice, in 2017 we extended the framework with the concept of “lens” to go beyond a general perspective and enter specific technology domains. Now, to help accelerate building Software-as-a-Service (SaaS) solutions, the AWS SaaS Factory team has led an effort to build a new AWS Well-Architected SaaS Lens.

SaaS is a licensing and delivery model by which software is centrally managed and hosted by a provider and available to customers on a subscription basis. In this way, software providers can innovate rapidly, optimize their costs, and gain operational efficiencies. At the same time, customers benefit from simplified IT management, speed, and a pay-for-what-you-use business model.

The Well-Architected SaaS Lens adds questions to the tool that are tailored to SaaS workloads and intended to drive critical thinking for developing and operating SaaS workloads. Each question has a list of best practices, and each best practice has a list of improvement plans to help guide you in implementing them. AWS Solution Architects from the AWS SaaS Factory Program, having worked with thousands of software developers and AWS Partners, view these well-architected patterns as a key component of building and operating a SaaS architecture on AWS.

Using the SaaS Lens in the Well-Architected Tool
In the Well-Architected Tool console, I start by defining my workload. Today, I’m reviewing a pre-production environment of a SaaS application. It’s just a minimum viable product (MVP) version of what I want to build, with just enough features to be usable and get a first feedback.

Now, I can choose which lenses to apply. The AWS Well-Architected Framework is there by default. I select the SaaS Lens. This is adding a set of additional questions that help me understand how to design, deploy, and architect my SaaS workload following the framework best practices. Other lenses are available in the tool, for example the Serverless Lens described here.

Now, I start my review. Many questions in the SaaS Lens are focused on how you are managing a multi-tenant application. This is the first question for the Operational Excellence pillar. I can also add some notes to explain my answer better or take note of what I want to improve.

I don’t need to answer all questions to start improving my SaaS application. For example, this is the improvement plan based on my answer to the previous question. For each point here, I can click and get more information on how to implement that on AWS.

Moving to the Reliability pillar, I feel more confident because of the techniques I used to separate individual tenants of my SaaS application in their own “sandbox” environment.

As I expect, no risks are detected this time!

When I finish reviewing the SaaS Lens for my workload, I get an overview of the detected risks. Here, I can also save a milestone that I can use later to compare my status and estimate my improvements.

Just below that, I get a suggestion on what to focus on next. Again, I can click and get in-depth suggestion on how to mitigate the risk.

As often happens in IT services, this is an iterative process. The AWS Well-Architected Tool helps quantify the risks and gives me a path to follow to continuously improve my SaaS application.

Available Now
The SaaS Lens is available today in all regions where the AWS Well-Architected Tool is offered, as described in the AWS Regional Services List. It can be applied to existing workloads, or used for new workloads you define in the tool.

There are no costs in using the AWS Well-Architected Tool; you can use it to improve the application you are working on, or to get visibility into multiple workloads used by the department or area you are working with.

Learn more about the new SaaS Lens and get started today with the AWS Well-Architected Tool!

— Danilo

Techniques for writing least privilege IAM policies

2020-12-02 Ben Potter

Post Syndicated from Ben Potter original https://aws.amazon.com/blogs/security/techniques-for-writing-least-privilege-iam-policies/

In this post, I’m going to share two techniques I’ve used to write least privilege AWS Identity and Access Management (IAM) policies. If you’re not familiar with IAM policy structure, I highly recommend you read understanding how IAM works and policies and permissions.

Least privilege is a principle of granting only the permissions required to complete a task. Least privilege is also one of many Amazon Web Services (AWS) Well-Architected best practices that can help you build securely in the cloud. For example, if you have an Amazon Elastic Compute Cloud (Amazon EC2) instance that needs to access an Amazon Simple Storage Service (Amazon S3) bucket to get configuration data, you should only allow read access to the specific S3 bucket that contains the relevant data.

There are a number of ways to grant access to different types of resources, as some resources support both resource-based policies and IAM policies. This blog post will focus on demonstrating how you can use IAM policies to grant restrictive permissions to IAM principals to meet least privilege standards.

In AWS, an IAM principal can be a user, role, or group. These identities start with no permissions and you add permissions using a policy. In AWS, there are different types of policies that are used for different reasons. In this blog, I only give examples for identity-based policies that attach to IAM principals to grant permissions to an identity. You can create and attach multiple identity-based policies to your IAM principals, and you can reuse them across your AWS accounts. There are two types of managed policies. Customer managed policies are created and managed by you, the customer. AWS managed policies are provided as examples, cannot be modified, but can be copied, enhanced, and saved as Customer managed policies. The main elements of a policy statement are:

Effect: Specifies whether the statement will Allow or Deny an action.
Action: Describes a specific action or actions that will either be allowed or denied to run based on the Effect entered. API actions are unique to each service. For example, s3:ListBuckets is an Amazon S3 service API action that enables an IAM Principal to list all S3 buckets in the same account.
NotAction: Can be used as an alternative to using Action. This element will allow an IAM principal to invoke all API actions to a specific AWS service except those actions specified in this list.
Resource: Specifies the resources—for example, an S3 bucket or objects—that the policy applies to in Amazon Resource Name (ARN) format.
NotResource: Can be used instead of the Resource element to explicitly match every AWS resource except those specified.
Condition: Allows you to build expressions to match the condition keys and values in the policy against keys and values in the request context sent by the IAM principal. Condition keys can be service-specific or global. A global condition key can be used with any service. For example, a key of aws:CurrentTime can be used to allow access based on date and time.

Starting with the visual editor

The visual editor is my default starting place for building policies as I like the wizard and seeing all available services, actions, and conditions without looking at the documentation. If there is a complex policy with many services, I often look at the AWS managed policies as a starting place for the actions that are required, then use the visual editor to fine tune and check the resources and conditions.

The policy I’m going to walk you through creating is to grant an AWS Lambda function permission to get specific objects from Amazon S3, and put items in a specific table in Amazon DynamoDB. You can access the visual editor when you choose Create policy under policies in the IAM console, or add policies when viewing a role, group, or user as shown in Figure 1. If you’re not familiar with creating policies, you can follow the full instructions in the IAM documentation.

Figure 1: Use the visual editor to create a policy

Begin by choosing the first service—S3—to grant access to as shown in Figure 2. You can only choose one service at a time, so you’ll need to add DynamoDB after.

Figure 2: Select S3 service

Now you will see a list of access levels with the option to manually add actions. Expand the read access level to show all read actions that are supported by the Amazon S3 service. You can now see all read access level actions. For getting an object, check the box for GetObject. Selecting the ? next to an action expands information including a description, supported resource types, and supported condition keys as shown in Figure 3.

Figure 3: Expand Read in Access level, select GetObject, and select the ? next to GetObject

Expand Resources, you will see that the visual editor has listed object as that is the only resource supported by the GetObject action as shown in Figure 4.

Figure 4: Expand Resources

Select Add ARN, which opens a dialogue to help you specify the ARN for the objects. Enter a bucket name—such as doc-example-bucket—and then the object name. For the object name you can use a wildcard (*) as a suffix. For example, to allow objects beginning with alpha you would enter alpha*. This is an important step. For this least privileged policy, you are restricting to a specific bucket, and an object prefix. You could even specify an individual object depending on your use case.

Figure 5: Enter bucket name and object name

If you have multiple ARNs (bucket and objects) to allow, you can repeat the step.

Figure 6: ARN added for S3 object

The final step is to expand the request conditions, and choose Add condition. The Add request condition dialogue will open. Select the drop down next to Condition key to list the global condition keys, then the service level condition keys are listed after. You’ll see that there’s an s3:ExistingObjectTag condition that—as the name suggests—matches an existing object tag. You can use this condition key to allow the GetObject request only when the object tag meets your condition. That means you can tag your objects with a specific tag key and value pair, and your policy condition must match this key-value pair to allow the action to execute. When you’re using condition keys with multiple keys or values, you can use condition operators and evaluation logic. As shown in Figure 7, tag-key is entered directly below the condition key. This is the key of the tag to match. For the Operator, select StringEquals to match the tag exactly. Checking If exists tests at least one member of the set of request values, and at least one member of the set of condition key values. The Value to enter is the actual tag value: tag-value as shown in figure 7.

Figure 7: ARN added for S3 object

That’s it for adding the S3 action, as shown in figure 8.

Figure 8: S3 GetObject action with resource and conditions configured

Now you need to add the DynamoDB permissions by selecting Add additional permissions. Select Choose a service and then select DynamoDB. For actions, expand the Write access level, then choose PutItem.

Figure 9: Choose write access level

Expand Resources and then select Add ARN. The dialogue that appears will help you build the ARN just like it did for the Amazon S3 service. Enter the Region, for example the ap-southeast-2 (Sydney) Region, the account ID, and the table name. Choosing Add will add the resource ARN to your policy.

Figure 10: Enter Region, account, and table name

Now it’s time to add conditions. Expand Request conditions and then choose Add condition.

There are many DynamoDB conditions that you could use, however you can choose dynamodb:LeadingKeys to represent the first key, or partition keys in a table. You can see from the documentation that a qualifier of For all values in request is recommend. For the Operator you can use StringEquals as your string is going to exactly match, then a Value can use a prefix with wildcard, such as alpha* as shown in figure 11.

Figure 11: Add request conditions

Choosing Add will take you back to the main visual editor where you can choose Review policy to continue. Enter a name and description for the policy, and then choose Create policy.

You can now attach it to a role to test.

You can see in this example that a policy can use least privilege by using specific resources and conditions. Note that sometimes when you use the AWS Management Console, it requires additional permissions to provide information for the console experience.

Starting with AWS managed policies

AWS managed policies can be a good starting place to see the actions typically associated with a particular service or job function. For example, you can attach the AmazonS3ReadOnlyAccess policy to a role used by an Amazon EC2 instance that allows read-only access to all Amazon S3 buckets. It has an effect of Allow to allow access, and there are two actions that use wildcards (*) to allow all Get and List actions for S3—for example, s3:GetObject and s3:ListBuckets. The resource is a wildcard to allow all S3 buckets the account has access to. A useful feature of this policy is that it only allows read and list access to S3, but not to any other services or types of actions.

Let’s make our own custom IAM policy to make it least privilege. Starting with the action element, you can use the reference for Amazon S3 to see all actions, a description of what each action does, the resource type for each action, and condition keys for each action. Now let’s imagine this policy is used by an Amazon EC2 instance to fetch an application configuration object from within an S3 bucket. Looking at the descriptions for actions starting with Get you can see that the only action that we really need is GetObject. You can then use the resource element to restrict an action to a set of objects prefixed with config within a specific bucket.

         "Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3::: <doc-example-bucket>/<config*>"

Now that you’ve reduced the scope of what this policy can do for service actions and resources, you can add a condition element that uses attribute based access control (ABAC) to define conditions based on attributes—in this case, a resource tag. In this example, when you’re reading objects from a single bucket, you can set specific conditions to further reduce the scope of permissions given to an IAM principal. There’s an s3:ExistingObjectTag condition that you can use to allow the GetObject request only when the object tag meets your condition. That means you can tag your objects with a specific tag key and value pair, and your IAM policy condition must match this key-value pair to allow the API action to successfully run. When you’re using condition keys with multiple keys or values, you can use condition operators and evaluation logic. You can see that ForAnyValue tests at least one member of the set of request values, and at least one member of the set of condition key values. Alternatively, you can use global condition keys that apply to all services:

         "Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::<doc-example-bucket>/<config*>",
         "Condition": {
                "ForAnyValue:StringEquals": {
                    "s3:ExistingObjectTag/<tag-key>": "<tag-value>"
            }

In the preceding policy example, the condition element only allows s3:GetObject permissions if the object is tagged with a key of tag-key and a value of tag-value. While you’re experimenting, you can identify errors in your custom policies by using the IAM policy simulator or reviewing the errors messages recorded in AWS CloudTrail logs.

Conclusion

In this post, I’ve shown two different techniques that you can use to create least privilege policies for IAM. You can adapt these methods to create AWS Single Sign-On permission sets and AWS Organizations service control policies (SCPs). Starting with managed policies is a useful strategy when an AWS supplied managed policy already exists for your use case, and then to reduce the scope of what it can do through permissions. I tend to use the visual editor the most for editing policies because it saves looking up the resource and conditions for each action. I suggest that you start by reviewing the policies you’re already using. Start with policies that grant excessive permissions—like the example Administrator policy—and tie them back to the use case of the users or things that need the access. Use the last accessed information, IAM best practices, and look at the AWS Well-Architected best practices and AWS Well-Architected tool.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

How to think about cloud security governance

2020-08-25 Paul Hawkins

Post Syndicated from Paul Hawkins original https://aws.amazon.com/blogs/security/how-to-think-about-cloud-security-governance/

When customers first move to the cloud, their instinct might be to build a cloud security governance model based on one or more regulatory frameworks that are relevant to their industry. Although this can be a helpful first step, it’s also critically important that organizations understand what the control objectives for their workloads should be.

In this post, we discuss what you need to do both organizationally and technically with Amazon Web Services (AWS) to build an efficient and effective governance model. People who are taking their first steps in cloud can use this post to guide their thinking. It can also act as useful context for folks who have been running in the cloud for a while to evaluate their current governance approach.

But before you can build that model, it’s important to understand what governance is and to consider why you need it. Governance is how an organization ensures the consistent application of policies across all teams. The best way to implement consistent governance is by codifying as much of the process as possible. Security governance in particular is used to support business objectives by defining policies and controls to manage risk.

Moving to the cloud provides you with an opportunity to deliver features faster, react to the changing world in a more agile way, and return some decision making to the hands of the people closest to the business. In this fast-paced environment, it’s important to have a way to maintain consistency, scaleability, and security. This is where a strong governance model helps.

Creating the right governance model for your organization may seem like a complex task, but it doesn’t have to be.

Frameworks

Many customers use a standard framework that’s relevant to their industry to inform their decision-making process. Some frameworks that are commonly used to develop a security governance model include: NIST Cybersecurity Framework (CSF), Information Security Registered Assessors Program (IRAP), Payment Card Industry Data Security Standard (PCI DSS), or ISO/IEC 27001:2013

Some of these standards provide requirements that are specific to a particular regulator, or region and others are more widely applicable—you should choose one that fits the needs of your organization.

While frameworks are useful to set the context for a security program and give guidance on governance models, you shouldn’t build either one only to check boxes on a particular standard. It’s critical that you should build for security first and then use the compliance standards as a way to demonstrate that you’re doing the right things.

Control objectives

After you’ve selected a framework to use, the next considerations are controls. A control is a technical- or process-based implementation that’s designed to ensure that the likelihood or consequences of an identified risk are reduced to a level that’s acceptable to the organization’s risk appetite. Examples of controls include firewalls, logging mechanisms, access management tools, and many more.

Controls will evolve over time; sometimes they do so very quickly in the early stages of cloud adoption. During this rapid evolution, it’s easy to focus purely on the implementation of a control rather than the objective of it. However, if you want to build a robust and useful governance model, you must not lose sight of control objectives.

Consider the example of the firewall. When you use a firewall, you implement a control. The objective is to make sure that only traffic that should reach your environment is able to reach it. Although a firewall is one way to meet this objective, you can achieve the same outcome with a layered approach using Amazon Virtual Private Cloud (Amazon VPC) Security Groups, AWS WAF and Amazon VPC network access control lists (ACLs). Splitting the control implementation into multiple places can enable workload owners to have greater flexibility in how they configure resources while the baseline posture is delivered automatically.

Not all areas of a business necessarily have the same cloud maturity level, or use the same methods to deploy or run workloads. As a security architect, your job is to help those different parts of the business deliver outcomes in the way that is appropriate for their maturity or particular workload.

The best way to help drive this goal is for the security part of your organization to clearly communicate the necessary control objectives. As a security architect, it’s easier to have a discussion about the things that need tweaking in an application if the objectives are well communicated. It is much harder if the workload owner doesn’t know they have to meet certain security expectations.

What is the job of security?

At AWS, we talk to customers across a range of industries. One thing that consistently comes up in conversation is how to help customers understand the role of their security team in a distributed cloud-aware environment. The answer is always the same: we as security people are here to help the business deploy and run applications securely. Our job is to guide and educate the rest of the organization on the best way to meet the business objectives while meeting the security, risk, and compliance requirements.

So how do you do this?

Technology and culture are both important to an organization’s security posture, and they enable each other. AWS is a good example of an organization that has a strong culture of security ownership. One thing that all customers can take away from AWS: security is everyone’s job. When you understand that, it becomes easier to build the mechanisms that make the configuration and operation of appropriate security control objectives a reality.

The cloud environment that you build goes a long way to achieving this goal in two key ways. First, it provides guardrails and automated guidance for people building on the platform. Second, it allows solutions to scale.

One of the challenges organizations encounter is that there are more developers than there are security people. The traditional approach of point-in-time risk and control assessments performed by a human looking at an architecture diagram doesn’t scale. You need a way to scale that knowledge and capability without increasing the number of people. The best way to achieve this is to codify as much as possible, early in the build and release process.

One way to do this is to run the AWS platform as a product in its own right. Team members should be able to submit feature requests, and there should be metrics on the features that are enabled through the platform. The more security capability that teams building workloads can inherit from the platform, the less they have to implement at the workload level and the more time they can spend on product features. There will always be some security control objectives that can only be delivered by specific configuration at the workload level; this should build on top of what’s inherited from the cloud platform. Your security team and the other teams need to work together to make sure that the capabilities provided by the cloud platform are available to help people build and release securely.

One part of the governance model that we like to highlight is the concept of platform onboarding. The idea of this part of the governance model is to quickly and consistently get to a baseline set of controls that enable you to use a service safely in a particular environment. A good example here is to give developers access to evaluate a service in an experimentation account. To support this process, you don’t want to spend a long time building controls for every possible outcome. The best approach is to take advantage of the foundational controls that are delivered by the cloud platform as the starting point. Things like federation, logging, and service control policies can be used to provide guard rails that enable you to use services quickly. When the services are being evaluated, your security team can work together with your business to define more specific controls that make sense for the actual use cases.

AWS Well-Architected Framework

The cloud platform you use is the foundation of many of the security controls. These guard rails of federation, logging, service control polices, and automated response apply to workloads of all types. The security pillar in the AWS Well-Architected Framework builds on other risk management and compliance frameworks, provides you with best practices, and helps you to evaluate your architectures. These best practices are a great place to look for what you should do when building in the cloud. The categories—identity and access management, detection, infrastructure protection, data protection, and incident response—align with the most important areas to focus on when you build in AWS.

For example, identity is a foundational control in a cloud environment. One of the AWS Well-Architected security best practices is “Rely on a centralized identity provider.” You can use AWS Single Sign-On (AWS SSO) for this purpose or an equivalent centralized mechanism. If you centralize your identity provider, you can perform identity lifecycle management on users, provide them with access to only the resources that are required, and support users who move between teams. This can apply across the multiple AWS accounts in your AWS environment. AWS Organizations uses service control policies to enable you to use a subset of AWS services in particular environments; this is an identity-centric way of providing guard rails.

In addition to federating users, it’s important to enable logging and monitoring services across your environment. This allows you to generate an event when something unexpected happens, such as a user trying to call AWS Key Management Service (AWS KMS) to decrypt data that they should have access to. Securely storing logs means that you can perform investigations to determine the causes of any issues you might encounter. AWS customers who use Amazon GuardDuty and AWS CloudTrail, and have a set of AWS Config rules enabled, have access to security monitoring and logging capabilities as they build their applications.

The layer cake model

When you think about cloud security, we find it useful to use the layer cake as a good mental model. The base of the cake is the understanding of the below-the-line capability that AWS provides. This includes self-serving the compliance documentation from AWS Artifact and understanding the AWS shared responsibility model.

The middle of the cake is the foundational controls, including those described previously in this post. This is the most important layer, because it’s where the most controls are and therefore where the most value is for the security team. You could describe it as the “solve it once, consume it many times” layer.

The top of the cake is the application-specific layer. This layer includes things that are more context dependent, such as the correct control objectives for a certain type of application or data classification. The work in the middle layer helps support this layer, because the middle layer provides the mechanisms that make it easier to automatically deliver the top layer capability.

The middle and top layers are not just technology layers. They also include the people and process parts of the equation. The technology is just there to support the processes.

One thing to be aware of is that you shouldn’t try to define every possible control for a service before you allow your business to use the service. Make use of the various environments in your organization—experimenting, development, testing, and production—to get the services in the hands of developers as quickly as possible with the minimum guardrails to avoid accidental misconfiguration. Then, use the time when the services are being assessed to collaborate with the developers on control implementation. Control implementations can then be rolled into the middle layer of the cake, and the services can be adopted by other parts of the business.

This is also the ideal time to apply practical threat modelling techniques so you can understand what threats and risks you must address. Working with your business to define recommended implementation patterns also helps provide context for how services are typically used. This means you can focus on the controls that are most relevant.

The architecture, platform, or cloud center of excellence (CoE) teams can help at this stage. They can likely make a quick determination of whether an AWS service fits in with your organization’s architectural direction. This quick triage helps the security team focus their efforts in helping get services safely in the hands of the business without being seen as blocking adoption. A good mechanism for streamlining the use of new services is to make sure the backlog is well communicated, typically on a platform team wiki. This helps the security and non-security parts of your organization prioritize their time on services that deliver the most business value. A consistent development approach means that the services that are used are probably being used in more places across the organization. This helps your organization get the benefits of scale as consistent approaches to control implementation are replicated between teams.

Simplicity, metrics, and culture

The world moves fast. You can’t just define a security posture and control objectives, and then walk away. New services are launched that make it easier to do more complex things, business priorities change, and the threat landscape evolves. How do you keep up with all of it?

The answer is a combination of simplicity, metrics, and culture.

Simplicity is hard, but useful. For example, if you have 100 application teams all building in a different way, you have a large number of different configurations that you must ensure are sensibly defined. Ideally, you do this programmatically, which means that the work to define and maintain that set of security controls is significant. If you have 100 application teams using only 10 main patterns, it’s easier to build controls. This has the added benefit of reducing the complexity at the operations end, which applies to both the day-to-day operations and to incident responses. Simplification of your control environment means that your monitoring is less complex, troubleshooting is easier, and people have time to focus on the development of new controls or processes.

Metrics are important because you can make informed decisions based on data. A good example of the usefulness of metrics is patching. Patching is one of the easiest ways to improve your security posture. Having metrics on patch age, presented where this information is most important in your environment, enables you to focus on the most valuable areas. For example, infrastructure on your edge is more important to keep patched than infrastructure that is behind multiple layers of controls. You should patch everything, but you need to make it easy for application teams to do so as part of their build and release cycles. Exposing metrics to teams and leadership helps your organization learn from high performing areas in the business. These could be teams that are regularly meeting the patching expectations or have low instances of needing to remediate penetration testing findings. Metrics and data about your control effectiveness enables you to provide assurance internally and externally that you’re meeting your control objectives.

This brings us to culture. Security as an enabler is something that we think is the most important concept to take away from this post. You must build capabilities that enable people in your organization to have the secure configuration or design choice be the easiest option. This is the role of security. You should also make sure that, when there are problems, your security team works with the business to help everyone learn the cause and improve for next time.

AWS has a culture that uses trouble ticketing for everything. If our employees think they have a security problem, we tell them to open a ticket; if they’re not sure that they have a security problem, we tell them to open a ticket anyway to get guidance. This kind of culture encourages people to communicate and help means so we can identify and fix issues early. Issues that aren’t as severe thought can be downgraded quickly. This culture of ticketing gives us data to inform what we build, which helps people be more secure. You can get started with a system like this in your own environment, or look to extend the capability if you’ve already started.

Take our recommendation to turn on GuardDuty across all your accounts. We recommend that the resulting high and medium alerts are sent to a ticketing system. Look at how you resolve those issues and use that to prioritize the next two weeks of work. Now you can build automation to fix the issues and, more importantly, build to prevent the issues from happening in the first place. Ask yourself, “What information did I need to diagnose the problem?” Then, build automation to enrich the findings so your tickets have that context. Iterate on the automation to understand the context. For example, you may want to include information to show whether the environment is production or non-production.

Note that having production-like controls in non-production environments means that you reduce the chance of deployment failures. It also gets teams used to working within the security guardrails. This increased rigor earlier on in the process, and helps your change management team, too.

Summary

It doesn’t matter what security frameworks or standards you use to inform your business, and you might not even align with a particular industry standard. What does matter is building a governance model that empowers the people in your organization to consistently make good security decisions and provides the capability for your security team to enable this to happen. To get started or continue to evolve your governance model, follow the AWS Well-Architected security best practices. Then, make sure that the platform you implement helps you deliver the foundational security control objectives so that your business can spend more of its time on the business logic and security configuration that is specific to its workloads.

The technology and governance choices you make are the first step in building a positive security culture. Security is everyone’s job, and it’s key to make sure that your platform, automation, and metrics support making that job easy.

The areas of focus we’ve talked about in this post are what allow security to be an enabler for business and to ultimately help you better help your customers and earn their trust with everything you do.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

How Starbucks is reducing their footprint

Optimizing beyond cost

Get started, the sustainable way

Deployment

Select sustainable AWS Regions

Align SLAs with sustainability goals

Use efficient silicon

Make efficient use of GPU

Optimize models for inference

Deploy multiple models behind a single endpoint

Right-size your inference environment

Consider inference at the edge

Archive or delete unnecessary artifacts

Monitoring

Retrain only when necessary

Measure results and improve

Conclusion

Other posts in this series

About the Well-Architected Framework

Model building

Define acceptable performance criteria

Select energy-efficient algorithms

Use pre-trained or partially pre-trained models

Optimize your deep learning models to accelerate training

Start with small experiments, datasets, and compute resources

Automate the ML environment

Model training

Select sustainable AWS Regions

Use a debugger

Optimize the resources of your training environment

Use efficient silicon

Archive or delete unnecessary training artifacts

Model tuning and evaluation

Use efficient cross-validation techniques for hyperparameter optimization

Use warm-start hyperparameter tuning

Measure results and improve

Conclusion

Other posts in this series

New AWS Well-Architected Lenses

Updated AWS Well-Architected Lens

Solution Overview

Prerequisites

Implementation Steps

Cleanup

Conclusion

Business goal identification

Define the overall environmental impact or benefit

ML problem framing

Identify if ML is the right solution

Consider AI services and pre-trained models

Select sustainable Regions

Data processing (data collection, data preprocessing, feature engineering)

Avoid datasets and processing duplication

Minimize idle resources with serverless data pipelines

Implement data lifecycle policies aligned with your sustainability goals

Adopt sustainable storage options

Select efficient file formats and compression algorithms

Minimize data movement across networks

Measure results and improve

Conclusion

Effectively establish and communicate security processes

Create a “path to production” process

Classify your data for better access control

Identify and prioritize how to address risks using a threat model

Create feedback cycles

Security controls

Protect your applications and infrastructure

Preventative controls

Detective controls

Incident response

Protect data with robust controls

Ready to get started?

Other blogs in this series

Considerations before getting started

Ensuring security, identity, and compliance

Encrypt everything all the time

Building a global network

Building the compute layer

Summary

1. AWS Best Practices for DDoS Resiliency