Tag Archives: resilience

London Stock Exchange Group uses chaos engineering on AWS to improve resilience

2024-04-01 Elias Bedmar

Post Syndicated from Elias Bedmar original https://aws.amazon.com/blogs/architecture/london-stock-exchange-group-uses-chaos-engineering-on-aws-to-improve-resilience/

This post was co-written with Luke Sudgen, Lead DevOps Engineer Post Trade, and Padraig Murphy, Solutions Architect Post Trade, from London Stock Exchange Group.

In this post, we’ll discuss some failure scenarios that were tested by London Stock Exchange Group (LSEG) Post Trade Technology teams during a chaos engineering event supported by AWS. Chaos engineering allows LSEG to simulate real-world failures in their cloud systems as part of controlled experiments. This methodology improves resilience and observability, which reduces risk and helps achieve compliance with regulators before deploying to production.

Introduction, tooling, and methodology

As a heavily regulated provider of global financial markets infrastructure, LSEG is always looking for opportunities to enhance workload resilience. LSEG and AWS teamed up to organize and run a 3-day AWS Experience-Based Acceleration (EBA) event to perform chaos engineering experiments against key workloads. The event was sponsored and led by the architecture function and included cross-functional Post Trade technical teams across various workstreams. The experiments were run using AWS Fault Injection Service (FIS) following the experiment methodology described in the Verify the resilience of your workloads using Chaos Engineering blog post.

Resilience of modern distributed cloud systems can be continuously improved through reviewing workload architectures and recovery, assessing standard operating procedures (SOPs), and building SOP alerts and recovery automations. AWS Resilience Hub provides a comprehensive tooling suite to get started on these activities.

Another key activity to validate and enhance your resilience posture is chaos engineering, a methodology that introduces failures into customer systems through controlled, real-world experiments. Chaos engineering helps customers create real-world failure conditions that can uncover hidden bugs, monitoring blind spots, and bottlenecks that are difficult to find in distributed systems. This makes it a very useful tool in regulated industries such as financial services.

Architectural overview

The architectural diagram in Figure 1 comprises a three-tier application deployed in virtual private clouds (VPCs) with a multi-AZ setup.

Figure 1. Chaos engineering pattern for hybrid architecture (3-tier application)

Operating within a public subnet, the web application creates a hybrid architecture by using an Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling group and connecting to an Amazon Relational Database Service (Amazon RDS) database that’s located in a private subnet and connected with on-premises services. Additionally, a number of internal services are hosted in a separate VPC, housed within containers. FIS provides a controlled environment to validate the robustness of the architecture against various failure scenarios, such as:

Amazon EC2 instance failure that causes the application or container pod on the machine to also fail
Amazon RDS database instance reboot or failover
Severe network latency degradation
Network connectivity disruption
Amazon Elastic Block Store (Amazon EBS) volume failure (IOPS pause, disk full)

Amazon EC2 instance and container failure

The objective of this use case is to evaluate the resilience of the application or container pod running on Amazon EC2 instances and identify how the system can adapt itself and continue functioning during unexpected disruptions or instability of an instance. You can use aws:ec2:stop-instances or aws:ec2:terminate-instances FIS actions to mimic different EC2 instance failure modes. The response of running containers to the different instance failures was also assessed. If you’re running containers within a managed AWS service such as Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS), you can use FIS failure scenarios for ECS tasks and EKS pods.
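As a rough illustration of the instance-failure scenario, the following sketch prints an FIS experiment template that stops one tagged instance and starts it again after five minutes. It is not LSEG's template: the function name, the Env=chaos-test tag, and the role ARN in the usage note are hypothetical placeholders, and a production template would normally use a CloudWatch alarm as a stop condition rather than "none".

```shell
#!/bin/sh
# Sketch: print an FIS experiment template for aws:ec2:stop-instances.
# The tag, description, and function name are illustrative assumptions.

ec2_stop_template() {
  role_arn=$1   # IAM role that FIS assumes to run the experiment
  cat <<EOF
{
  "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
  "targets": {
    "Instances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "Env": "chaos-test" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "StopInstance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "Instances" }
    }
  },
  "stopConditions": [ { "source": "none" } ],
  "roleArn": "${role_arn}"
}
EOF
}

# Usage (not run here; requires AWS credentials and an existing FIS role):
#   ec2_stop_template "arn:aws:iam::111122223333:role/my-fis-role" > template.json
#   aws fis create-experiment-template --cli-input-json file://template.json
```

Selecting a single instance with COUNT(1) keeps the blast radius small for a first run; the selection mode can be widened once the Auto Scaling group is seen to recover as expected.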

Amazon RDS failure

RDS failure is another common scenario you can use to identify and troubleshoot database managed service failures from failovers and node reboots at a large scale. FIS can be used to inject reboot/failover failure conditions into the managed RDS instances to understand the bottlenecks and issues from disaster failovers, sync failures, and other database-related problems.

Severe network latency degradation

Network latency degradation injects latency in the network interface that connects two systems. This helps you understand how these systems handle a data transfer delay and your operational response readiness (alerts, metrics, and correction). This FIS action (aws:ssm:send-command/AWSFIS-Run-Network-Latency) uses the Linux traffic control (tc) utility.
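To make the mechanism more concrete, here is a minimal sketch of the kind of tc/netem commands that add and remove a fixed delay on a network interface. This is not the content of the AWSFIS-Run-Network-Latency document, only an illustration of the underlying Linux facility; the function names, the interface, and the APPLY switch are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: add a fixed egress delay with tc/netem, then remove it again.
# Dry run by default; set APPLY=1 and run as root to change the interface.

# Print the command that adds a delay (in milliseconds) on an interface.
add_latency_cmd() {
  echo "tc qdisc add dev $1 root netem delay ${2}ms"
}

# Print the command that removes the root qdisc again.
clear_latency_cmd() {
  echo "tc qdisc del dev $1 root"
}

# inject_latency IFACE DELAY_MS SECONDS
inject_latency() {
  iface=$1; delay_ms=$2; seconds=$3
  add=$(add_latency_cmd "$iface" "$delay_ms")
  clear=$(clear_latency_cmd "$iface")
  if [ "${APPLY:-0}" != "1" ]; then
    # Dry run: show what would be executed, one step per line.
    printf '%s\nsleep %s\n%s\n' "$add" "$seconds" "$clear"
    return 0
  fi
  sh -c "$add" || return 1
  sleep "$seconds"
  sh -c "$clear"
}

# Usage: inject_latency eth0 200 60
```

Running the dry run first shows exactly which commands would touch the interface, which is useful when agreeing on the blast radius before an experiment.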

Network connectivity disruption

Connectivity issues like traffic disruption or other network issues can be simulated with FIS network actions. FIS supports the aws:network:disrupt-connectivity action to test your application’s resilience in the event of total or partial connectivity loss within its subnet, as well as disruption (including cross-Region) with other AWS networking components such as route tables or AWS Transit Gateway.
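A connectivity experiment can be sketched as an action fragment for an FIS experiment template. The helper below prints such a fragment and rejects scope values that the action does not list; the helper name, the action name DisruptSubnet, and the target name TargetSubnets (which would have to be defined in the template's targets section with resource type aws:ec2:subnet) are assumptions for this example.

```shell
#!/bin/sh
# Sketch: print an "actions" entry for aws:network:disrupt-connectivity.
# disrupt_connectivity_action SCOPE DURATION   (DURATION is ISO 8601, e.g. PT5M)

disrupt_connectivity_action() {
  scope=$1
  duration=$2
  # Scope values as listed in the FIS actions reference at the time of writing.
  case "$scope" in
    all|availability-zone|dynamodb|prefix-list|s3|vpc) ;;
    *) echo "unsupported scope: $scope" >&2; return 1 ;;
  esac
  cat <<EOF
{
  "DisruptSubnet": {
    "actionId": "aws:network:disrupt-connectivity",
    "parameters": { "scope": "${scope}", "duration": "${duration}" },
    "targets": { "Subnets": "TargetSubnets" }
  }
}
EOF
}

# Usage: disrupt_connectivity_action availability-zone PT5M
```

Starting with a narrow scope such as s3 or dynamodb isolates one dependency at a time; scope all removes all traffic to and from the target subnets.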

Amazon EBS volume failure (IOPS pause)

Disk failure is a problematic issue in real-time operations-based systems. It can lead to transactions failing due to I/O failures or storage failure during peak activity in heavy workloads. The EBS volume failure actions test system performance under different disk failure scenarios. FIS supports the aws:ebs:pause-volume-io action to pause I/O operations on target EBS volumes, as well as other failure modes. The target volumes must be in the same Availability Zone and must be attached to instances built on the AWS Nitro System.

Outcomes and conclusion

Following the experiment, the teams from LSEG successfully identified a series of architectural improvements to reduce application recovery time and enhance metric granularity and alerting. As a second tangible output, the teams now have a reusable chaos engineering methodology and toolset. Running regular in-person cross-functional events is a great way to implement a chaos engineering practice in your organization.

You can start your resilience journey on AWS today with AWS Resilience Hub.

How AWS is helping customers achieve their digital sovereignty and resilience goals

2024-03-26 Max Peterson

Post Syndicated from Max Peterson original https://aws.amazon.com/blogs/security/how-aws-is-helping-customers-achieve-their-digital-sovereignty-and-resilience-goals/

As we’ve innovated and expanded the Amazon Web Services (AWS) Cloud, we continue to prioritize making sure customers are in control and able to meet regulatory requirements anywhere they operate. With the AWS Digital Sovereignty Pledge, which is our commitment to offering all AWS customers the most advanced set of sovereignty controls and features available in the cloud, we are investing in an ambitious roadmap of capabilities for data residency, granular access restriction, encryption, and resilience. Today, I’ll focus on the resilience pillar of our pledge and share how customers are able to improve their resilience posture while meeting their digital sovereignty and resilience goals with AWS.

Resilience is the ability for any organization or government agency to respond to and recover from crises, disasters, or other disruptive events while maintaining its core functions and services. Resilience is a core component of sovereignty and it’s not possible to achieve digital sovereignty without it. Customers need to know that their workloads in the cloud will continue to operate in the face of natural disasters, network disruptions, and disruptions due to geopolitical crises. Public sector organizations and customers in highly regulated industries rely on AWS to provide the highest level of resilience and security to help meet their needs. AWS protects millions of active customers worldwide across diverse industries and use cases, including large enterprises, startups, schools, and government agencies. For example, the Swiss public transport organization BERNMOBIL improved its ability to protect data against ransomware attacks by using AWS.

Building resilience into everything we do

AWS has made significant investments in building and running the world’s most resilient cloud by building safeguards into our service design and deployment mechanisms and instilling resilience into our operational culture. We build to guard against outages and incidents, and account for them in the design of AWS services—so when disruptions do occur, their impact on customers and the continuity of services is as minimal as possible. To avoid single points of failure, we minimize interconnectedness within our global infrastructure. The AWS global infrastructure is geographically dispersed, spanning 105 Availability Zones (AZs) within 33 AWS Regions around the world. Each Region is comprised of multiple Availability Zones, and each AZ includes one or more discrete data centers with independent and redundant power infrastructure, networking, and connectivity. Availability Zones in a Region are meaningfully distant from each other, up to 60 miles (approximately 100 km) to help prevent correlated failures, but close enough to use synchronous replication with single-digit millisecond latency. AWS is the only cloud provider to offer three or more Availability Zones within each of its Regions, providing more redundancy and better isolation to contain issues. Common points of failure, such as generators and cooling equipment, aren’t shared across Availability Zones and are designed to be supplied by independent power substations. To better isolate issues and achieve high availability, customers can partition applications across multiple Availability Zones in the same Region. Learn more about how AWS maintains operational resilience and continuity of service.

Resilience is deeply ingrained in how we design services. At AWS, the services we build must meet extremely high availability targets. We think carefully about the dependencies that our systems take. Our systems are designed to stay resilient even when those dependencies are impaired; we use what is called static stability to achieve this level of resilience. This means that systems operate in a static state and continue to operate as normal without needing to make changes during a failure or when dependencies are unavailable. For example, in Amazon Elastic Compute Cloud (Amazon EC2), after an instance is launched, it’s just as available as a physical server in a data center. The same property holds for other AWS resources such as virtual private clouds (VPCs), Amazon Simple Storage Service (Amazon S3) buckets and objects, and Amazon Elastic Block Store (Amazon EBS) volumes. Learn more in our Fault Isolation Boundaries whitepaper.

Information Services Group (ISG) cited strengthened resilience when naming AWS a Leader in their recent report, Provider Lens for Multi Public Cloud Services – Sovereign Cloud Infrastructure Services (EU), “AWS delivers its services through multiple Availability Zones (AZs). Clients can partition applications across multiple AZs in the same AWS region to enhance the range of sovereign and resilient options. AWS enables its customers to seamlessly transport their encrypted data between regions. This ensures data sovereignty even during geopolitical instabilities.”

AWS empowers governments of all sizes to safeguard digital assets in the face of disruptions. We proudly worked with the Ukrainian government to securely migrate data and workloads to the cloud immediately following Russia’s invasion, preserving vital government services that will be critical as the country rebuilds. We supported the migration of over 10 petabytes of data. For context, that means we migrated data from 42 Ukraine government authorities, 24 Ukrainian universities, a remote learning K–12 school serving hundreds of thousands of displaced children, and dozens of other private sector companies.

For customers who are running workloads on-premises or for remote use cases, we offer solutions such as AWS Local Zones, AWS Dedicated Local Zones, and AWS Outposts. Customers deploy these solutions to help meet their needs in highly regulated industries. For example, to help meet the rigorous performance, resilience, and regulatory demands for the capital markets, Nasdaq used AWS Outposts to provide market operators and participants with added agility to rapidly adjust operational systems and strategies to keep pace with evolving industry dynamics.

Enabling you to build resilience into everything you do

Millions of customers trust that AWS is the right place to build and run their business-critical and mission-critical applications. We provide a comprehensive set of purpose-built resilience services, strategies, and architectural best practices that you can use to improve your resilience posture and meet your sovereignty goals. These services, strategies, and best practices are outlined in the AWS Resilience Lifecycle Framework across five stages—Set Objectives, Design and Implement, Evaluate and Test, Operate, and Respond and Learn. The Resilience Lifecycle Framework is modeled after a standard software development lifecycle, so you can easily incorporate resilience into your existing processes.

For example, you can use the AWS Resilience Hub to set your resilience objectives, evaluate your resilience posture against those objectives, and implement recommendations for improvement based on the AWS Well-Architected Framework and AWS Trusted Advisor. Within Resilience Hub, you can create and run AWS Fault Injection Service experiments, which allow you to test how your application will respond to certain types of disruptions. Recently, Pearson, a global provider of educational content, assessment, and digital services to learners and enterprises, used Resilience Hub to improve their application resilience.

Other AWS resilience services such as AWS Backup, AWS Elastic Disaster Recovery (AWS DRS), and Amazon Route 53 Application Recovery Controller (Route 53 ARC) can help you quickly respond and recover from disruptions. When Thomson Reuters, an international media company that provides solutions for tax, law, media, and government to clients in over 100 countries, wanted to improve data protection and application recovery for one of its business units, they adopted AWS DRS. AWS DRS provides Thomson Reuters continuous replication, so changes they made in the source environment were updated in the disaster recovery site within seconds.

Achieve your resilience goals with AWS and our AWS Partners

AWS offers multiple ways for you to achieve your resilience goals, including assistance from AWS Partners and AWS Professional Services. AWS Resilience Competency Partners specialize in improving customers’ critical workloads’ availability and resilience in the cloud. AWS Professional Services offers Resilience Architecture Readiness Assessments, which assess customer capabilities in eight critical domains—change management, disaster recovery, durability, observability, operations, redundancy, scalability, and testing—to identify gaps and areas for improvement.

We remain committed to continuing to enhance our range of sovereign and resilient options, allowing customers to sustain operations through disruption or disconnection. AWS will continue to innovate based on customer needs to help you build and run resilient applications in the cloud to keep up with the changing world.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Let’s Architect! Resiliency in architectures

2023-08-02 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-resiliency-in-architectures/

What is “resiliency”, and why does it matter? When we discussed this topic in an early 2022 edition of Let’s Architect!, we referenced the AWS Well-Architected Framework, which defines resilience as having “the capability to recover when stressed by load, accidental or intentional attacks, and failure of any part in the workload’s components.” Businesses rely heavily on the availability and performance of their digital services. Resilience has emerged as critical for any efficiently architected system, which is why it plays a fundamental role in ensuring the reliability and availability of workloads hosted on the AWS Cloud platform.

In this newer edition of Let’s Architect!, we share some best practices for putting together resilient architectures, focusing on providing continuous service and avoiding disruptions. Ensuring uninterrupted operations is likely a primary objective when it comes to building a resilient architecture.

Understand resiliency patterns and trade-offs to architect efficiently in the cloud

In this AWS Architecture Blog post, the authors introduce five resilience patterns. Each of these patterns comes with specific strengths and trade-offs, allowing architects to personalize their resilience strategies according to the unique requirements of their applications and business needs. By understanding these patterns and their implications, organizations can design resilient cloud architectures that deliver high availability and efficient recovery from potential disruptions.

Take me to this Architecture Blog post!

Resilience patterns and tradeoffs

Timeouts, retries, and backoff with jitter

Marc Brooker discusses the inevitability of failures and the importance of designing systems to withstand them. He highlights three essential tools for building resilience: timeouts, retries, and backoff. By embracing these three techniques, we can create robust systems that maintain high availability in the face of failures. Timeouts, backoff, and jitter are fundamental for spreading out the traffic coming from clients and avoiding overload of your systems. Building resilience is a fundamental aspect of ensuring the reliability and performance of AWS services in the ever-changing and dynamic technological landscape.
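A minimal sketch of retries with capped exponential backoff and jitter is shown below. It uses the variant often called "full jitter", where each wait is a random value between zero and min(cap, base * 2^attempt). The function names and the default values are assumptions chosen for illustration, not values taken from the article, and fractional sleep is assumed to be supported (true for GNU and BSD sleep).

```shell
#!/bin/sh
# Sketch: retry a command with capped exponential backoff and full jitter.
# BASE_MS, CAP_MS, and MAX_ATTEMPTS are illustrative defaults.

BASE_MS=${BASE_MS:-100}
CAP_MS=${CAP_MS:-5000}
MAX_ATTEMPTS=${MAX_ATTEMPTS:-5}

# Print min(CAP_MS, BASE_MS * 2^attempt) for a zero-based attempt number.
backoff_ceiling_ms() {
  attempt=$1
  ceiling=$BASE_MS
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$ceiling" -lt "$CAP_MS" ]; do
    ceiling=$((ceiling * 2))
    i=$((i + 1))
  done
  if [ "$ceiling" -gt "$CAP_MS" ]; then ceiling=$CAP_MS; fi
  echo "$ceiling"
}

# Print a random delay in [0, ceiling] milliseconds (full jitter).
jittered_delay_ms() {
  max_ms=$(backoff_ceiling_ms "$1")
  rand=$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')
  echo $((rand % (max_ms + 1)))
}

# Run "$@" until it succeeds or MAX_ATTEMPTS attempts have been made.
retry_with_jitter() {
  tries=0
  while [ "$tries" -lt "$MAX_ATTEMPTS" ]; do
    "$@" && return 0
    tries=$((tries + 1))
    [ "$tries" -lt "$MAX_ATTEMPTS" ] || break
    delay_ms=$(jittered_delay_ms "$((tries - 1))")
    # sleep takes seconds; awk converts milliseconds to a decimal value.
    sleep "$(awk -v ms="$delay_ms" 'BEGIN { printf "%.3f", ms / 1000 }')"
  done
  return 1
}

# Usage: retry_with_jitter curl --fail --max-time 2 https://example.com/health
```

The command's own timeout (here curl's --max-time) bounds each attempt, the cap bounds the longest wait, and the random factor keeps many clients from retrying at the same instant.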

Take me to the Amazon Builders’ Library!

The Amazon Builders’ Library is a collection of technical resources produced by engineers at Amazon

Prepare & Protect Your Applications From Disruption With AWS Resilience Hub

The AWS Resilience Hub not only protects businesses from potential downtime risks but also helps them build a robust foundation for their applications, ensuring uninterrupted service delivery to customers and users.

In this AWS Online Tech Talk, the Principal Product Manager of AWS Resilience Hub emphasizes the importance of protecting mission-critical applications from downtime risks. AWS Resilience Hub is showcased as a centralized platform to define, validate, and track application resilience. The talk covers strategies to avoid disruptions caused by software, infrastructure, or operational issues, and includes a demo showing how to apply these techniques effectively.

If you are interested in delving deeper into the services discussed in the session, AWS Resilience Hub is a valuable resource for monitoring and implementing resilient architectures.

Take me to this AWS Online Tech Talk!

AWS Resilience Hub recommendations

Data resiliency design patterns with AWS

This re:Invent 2022 session discusses in depth what data resiliency is, why it matters to customers, and how you can incorporate it into your application architecture. The session kicks off with a comprehensive overview of data resiliency, breaking down its core components and illustrating its critical role in modern application development. It then covers application data resiliency and protection designs, extending from the native data resiliency capabilities of AWS storage through to DR solutions using AWS Elastic Disaster Recovery.

Take me to this re:Invent 2022 video!

Asynchronous cross-region replication

See you next time!

Thanks for joining our discussion on architecture resiliency! See you in two weeks when we’ll talk about security on AWS.

To find all the blogs from this series, visit the Let’s Architect! list of content on the AWS Architecture Blog.

Securely validate business application resilience with AWS FIS and IAM

2023-02-24 Dr. Rudolf Potucek

Post Syndicated from Dr. Rudolf Potucek original https://aws.amazon.com/blogs/devops/securely-validate-business-application-resilience-with-aws-fis-and-iam/

To avoid high costs of downtime, mission critical applications in the cloud need to achieve resilience against degradation of cloud provider APIs and services.

In 2021, AWS launched AWS Fault Injection Simulator (FIS), a fully managed service to perform fault injection experiments on workloads in AWS to improve their reliability and resilience. At the time of writing, FIS lets you simulate degradation of Amazon Elastic Compute Cloud (EC2) APIs using API fault injection actions and thus explore the resilience of workflows where EC2 APIs act as a fault boundary.

In this post we show you how to explore additional fault boundaries in your applications by selectively denying access to any AWS API. This technique is particularly useful for fully managed, “black box” services like Amazon Simple Storage Service (S3) or Amazon Simple Queue Service (SQS) where a failure of read or write operations is sufficient to simulate problems in the service. This technique is also useful for injecting failures in serverless applications without needing to modify code. While similar results could be achieved with network disruption or modifying code with feature flags, this approach provides fine-grained degradation of an AWS API without the need to re-deploy and re-validate code.

Overview

We will explore a common application pattern: a user uploads a file, S3 triggers an AWS Lambda function, and Lambda writes the transformed file to a new location and deletes the original:

Figure 1. S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

We will simulate the user upload with an Amazon EventBridge rate expression triggering an AWS Lambda function which creates a file in S3:

Figure 2. S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Using this architecture we can explore the effect of S3 API degradation during file creation and deletion. As shown, the API call to delete a file from S3 is an application fault boundary. The failure could occur, with identical effect, because of S3 degradation or because the AWS IAM role of the Lambda function denies access to the API.

To inject failures we use AWS Systems Manager (AWS SSM) automation documents to attach and detach IAM policies at the API fault boundary and FIS to orchestrate the workflow.

Each Lambda function has an IAM execution role that allows S3 write and delete access, respectively. If the processor Lambda fails, the S3 file will remain in the bucket, indicating a failure. Similarly, if the IAM execution role for the processor function is denied the ability to delete a file after processing, that file will remain in the S3 bucket.
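The mechanism relies on IAM policy evaluation: an explicit Deny always overrides any Allow, so attaching a deny policy to the execution role blocks the API call even though the role still carries its original permissions. The sketch below prints what such a deny policy could look like. The actual FISAPI-DenyS3DeleteObject policy is defined in the sample template and may differ; the function name and the bucket argument are placeholders.

```shell
#!/bin/sh
# Sketch: print an IAM policy document that denies s3:DeleteObject on one bucket.
# This is an illustrative guess at the shape of the policy, not the template's text.

deny_delete_policy() {
  bucket=$1
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyS3DeleteObject",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}

# Usage (not run here):
#   deny_delete_policy my-bucket > deny.json
#   aws iam create-policy --policy-name MyDenyS3DeleteObject --policy-document file://deny.json
```

Scoping the Resource to a single bucket keeps the injected fault from affecting other workloads that happen to share the role.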

Prerequisites

Following this blog post will incur some costs for AWS services. To explore this test application you will need an AWS account. We will also assume that you are using AWS CloudShell or have the AWS CLI installed and have configured a profile with administrator permissions. With that in place you can create the demo application in your AWS account by cloning the sample repository and deploying its template as an AWS CloudFormation stack:

git clone https://github.com/aws-samples/fis-api-failure-injection-using-iam.git
cd fis-api-failure-injection-using-iam
aws cloudformation deploy --stack-name test-fis-api-faults --template-file template.yaml --capabilities CAPABILITY_NAMED_IAM

Fault injection using IAM

Once the stack has been created, navigate to the Amazon CloudWatch Logs console and filter for /aws/lambda/test-fis-api-faults. Under the EventBridgeTimerHandler log group you should find log events once a minute writing a timestamped file to an S3 bucket named fis-api-failure-ACCOUNT_ID. Under the S3TriggerHandler log group you should find matching deletion events for those files.

Once you have confirmed object creation/deletion, let’s take away the permission of the S3 trigger handler Lambda function to delete files. To do this you will attach the FISAPI-DenyS3DeleteObject policy that was created with the template:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam attach-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-arn "${POLICY_ARN}"

With the deny policy in place you should now see object deletion fail and objects should start showing up in the S3 bucket. Navigate to the S3 console and find the bucket starting with fis-api-failure. You should see a new object appearing in this bucket once a minute:

Figure 3. S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

If you would like to graph the results you can navigate to Amazon CloudWatch, select “Logs Insights”, select the log group starting with /aws/lambda/test-fis-api-faults-S3CountObjectsHandler, and run this query:

fields @timestamp, @message
| filter NumObjects >= 0
| sort @timestamp desc
| stats max(NumObjects) by bin(1m)
| limit 20

This will show the number of files in the S3 bucket over time:

Figure 4. AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

You can now detach the policy:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam detach-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-arn "${POLICY_ARN}"

We see that newly written files will once again be deleted but the unprocessed files will remain in the S3 bucket. From the fault injection we learned that our system does not tolerate request failures when deleting files from S3. To address this, we should add a dead letter queue or some other retry mechanism.

Note: if the Lambda function does not return a success state on invocation, EventBridge will retry. In our Lambda functions we are cost conscious and explicitly capture the failure states to avoid excessive retries.

Fault injection using SSM

To use this approach from FIS and to always remove the policy at the end of the experiment, we first create an SSM document to automate adding a policy to a role. To inspect this document, open the SSM console, navigate to the “Documents” section, find the FISAPI-IamAttachDetach document under “Owned by me”, and examine the “Content” tab (make sure to select the correct region). This document takes the name of the Role you want to impact and the Policy you want to attach as parameters. It also requires an IAM execution role that grants it the power to list, attach, and detach specific policies to specific roles.
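The attach, wait, and detach sequence that the document automates can be sketched in shell as follows. This is not the content of FISAPI-IamAttachDetach; it only illustrates the idea that the policy is removed again whatever happens, here by using an EXIT trap inside a subshell. The function name is an assumption.

```shell
#!/bin/sh
# Sketch: attach a deny policy for a fixed time and always detach it afterwards.
# run_with_deny_policy ROLE_NAME POLICY_ARN SECONDS

run_with_deny_policy() (
  role_name=$1
  policy_arn=$2
  seconds=$3

  detach() {
    aws iam detach-role-policy --role-name "$role_name" --policy-arn "$policy_arn"
  }

  aws iam attach-role-policy --role-name "$role_name" --policy-arn "$policy_arn" || exit 1

  # The function body is a subshell, so this EXIT trap fires when it ends,
  # including after an interrupt that is turned into a normal exit below.
  trap detach EXIT
  trap 'exit 1' INT TERM

  sleep "$seconds"
)

# Usage (requires AWS credentials):
#   run_with_deny_policy FISAPI-TARGET-S3TriggerHandlerRole "$POLICY_ARN" 120
```

A trap covers interrupts on the machine running the script, but not a lost session or a killed process, which is one reason to hand the cleanup to a managed service such as SSM instead.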

Let’s run the SSM automation document from the console by selecting “Execute Automation”. Determine the ARN of the FISAPI-DenyS3DeleteObject deny policy from CloudFormation or by running:

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

Use FISAPI-SSM-Automation-Role, a duration of 2 minutes expressed in ISO 8601 format as PT2M, the ARN of the deny policy, and the name of the target role FISAPI-TARGET-S3TriggerHandlerRole:

Figure 5. Image of parameter input field reflecting the instructions in blog text.

Alternatively execute this from a shell:

ASSUME_ROLE_NAME=FISAPI-SSM-Automation-Role
ASSUME_ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ASSUME_ROLE_NAME}'].Arn" --output text )
echo Assume Role ARN: $ASSUME_ROLE_ARN

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws ssm start-automation-execution \
  --document-name FISAPI-IamAttachDetach \
  --parameters "{
      \"AutomationAssumeRole\": [ \"${ASSUME_ROLE_ARN}\" ],
      \"Duration\": [ \"PT2M\" ],
      \"TargetResourceDenyPolicyArn\": [ \"${POLICY_ARN}\" ],
      \"TargetApplicationRoleName\": [ \"${ROLE_NAME}\" ]
    }"

Wait two minutes and then examine the content of the S3 bucket starting with fis-api-failure again. You should now see two additional files in the bucket, showing that the policy was attached for 2 minutes during which files could not be deleted, and confirming that our application is not resilient to S3 API degradation.

Permissions for injecting failures with SSM

Fault injection with SSM is controlled by IAM, which is why you had to specify the FISAPI-SSM-Automation-Role:

Visual representation of IAM permission used for fault injections with SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the SSM user needing to have a pass-role permission to grant the SSM execution role to the SSM service.

Figure 6. Visual representation of IAM permission used for fault injections with SSM.

This role needs to contain an assume role policy statement for SSM to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "ssm.amazonaws.com"

The role also needs to contain permissions to describe roles and their attached policies with an optional constraint on which roles and policies are visible:

          - Sid: GetRoleAndPolicyDetails
            Effect: Allow
            Action:
              - 'iam:GetRole'
              - 'iam:GetPolicy'
              - 'iam:ListAttachedRolePolicies'
            Resource:
              # Roles
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn
              # Policies
              - !Ref AwsFisApiPolicyDenyS3DeleteObject

Finally, the SSM role needs to allow attaching and detaching a policy document. This requires:

an ALLOW statement
a constraint on the policies that can be attached
a constraint on the roles that can be attached to

In the role we collapse the first two requirements into an ALLOW statement with a condition constraint for the Policy ARN. We then express the third requirement in a DENY statement that will limit the '*' resource to only the explicit role ARNs we want to modify:

          - Sid: AllowOnlyTargetResourcePolicies
            Effect: Allow
            Action:  
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            Resource: '*'
            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Policies that can be attached
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject
          - Sid: DenyAttachDetachAllRolesExceptApplicationRole
            Effect: Deny
            Action: 
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            NotResource: 
              # Roles that can be attached to
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn
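
To make the interplay of the two statements easier to follow, here is a minimal Java sketch of the decision they encode. It is a simplified model and not the IAM evaluation engine: the account ID, the ARNs, and the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.Set;

// Simplified model of the Allow and Deny statements above; NOT the real IAM evaluation logic.
final class AttachGuardrailSketch {
    // Hypothetical ARNs, for illustration only.
    static final Set<String> ALLOWED_POLICIES = Set.of(
            "arn:aws:iam::111122223333:policy/FISAPI-DenyS3DeleteObject");
    static final Set<String> ALLOWED_ROLES = Set.of(
            "arn:aws:iam::111122223333:role/EventBridgeTimerHandlerRole",
            "arn:aws:iam::111122223333:role/S3TriggerHandlerRole");

    static boolean mayAttach(String roleArn, String policyArn) {
        // Allow statement: Resource '*', constrained by the policy ARN condition.
        boolean allowed = ALLOWED_POLICIES.contains(policyArn);
        // Deny statement with NotResource: applies to every role except the listed ones.
        boolean denied = !ALLOWED_ROLES.contains(roleArn);
        // An explicit deny overrides an allow.
        return allowed && !denied;
    }

    public static void main(String[] args) {
        String policy = "arn:aws:iam::111122223333:policy/FISAPI-DenyS3DeleteObject";
        System.out.println(mayAttach("arn:aws:iam::111122223333:role/S3TriggerHandlerRole", policy));
        System.out.println(mayAttach("arn:aws:iam::111122223333:role/SomeOtherRole", policy));
    }
}
```

In this model an attach or detach call succeeds only when both constraints hold: the policy is on the allow list and the role is one of the explicitly listed targets.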

We will discuss security considerations in more detail at the end of this post.

Fault injection using FIS

With the SSM document in place you can now create an FIS template that calls the SSM document. Navigate to the FIS console and filter for FISAPI-DENY-S3PutObject. You should see that the experiment template passes the same parameters that you previously used with SSM:


Figure 7. Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

You can now run the FIS experiment and, after a couple of minutes, once again see new files in the S3 bucket.

Permissions for injecting failures with FIS and SSM

Fault injection with FIS is controlled by IAM, which is why you had to specify the FISAPI-FIS-Injection-EperimentRole:


Figure 8. Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

This role needs to contain an assume role policy statement for FIS to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "fis.amazonaws.com"

The role also needs permissions to list and execute SSM documents:

            - Sid: RequiredReadActionsforAWSFIS
              Effect: Allow
              Action:
                - 'cloudwatch:DescribeAlarms'
                - 'ssm:GetAutomationExecution'
                - 'ssm:ListCommands'
                - 'iam:ListRoles'
              Resource: '*'
            - Sid: RequiredSSMStopActionforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:CancelCommand'
              Resource: '*'
            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

Finally, remember that the SSM document needs to use a Role of its own to execute the fault injection actions. Because that Role is different from the Role under which we started the FIS experiment, we need to explicitly allow SSM to assume that role with a PassRole statement which will expand to FISAPI-SSM-Automation-Role:

            - Sid: RequiredIAMPassRoleforSSMADocuments
              Effect: Allow
              Action: 'iam:PassRole'
              Resource: !Sub 'arn:aws:iam::${AWS::AccountId}:role/${SsmAutomationRole}'

Secure and flexible permissions

So far, we have used explicit ARNs for our guardrails. To expand flexibility, we can use wildcards in our resource matching. For example, we might change the Policy matching from:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject

or the equivalent:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${FullPolicyName}'

to a wildcard notation like this:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Wildcard policies - secure and flexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${PolicyNamePrefix}*'

If we set PolicyNamePrefix to FISAPI-DenyS3 this would now allow invoking FISAPI-DenyS3PutObject and FISAPI-DenyS3DeleteObject but would not allow using a policy named FISAPI-DenyEc2DescribeInstances.
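
The effect of the prefix wildcard can be illustrated with a small Java sketch. It treats the whole ARN as one string and translates `*` and `?` into a regular expression, which is a simplification of how IAM compares ARNs; the account ID and the class name are hypothetical.

```java
import java.util.regex.Pattern;

// Simplified wildcard matcher for illustration; it does not reproduce IAM's ARN comparison rules.
final class ArnWildcardSketch {
    // Translates '*' (any run of characters) and '?' (any single character) into a regex;
    // every other character is matched literally.
    static boolean matches(String pattern, String value) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), value);
    }

    public static void main(String[] args) {
        // Hypothetical account ID; the prefix is the PolicyNamePrefix value from the text.
        String pattern = "arn:aws:iam::111122223333:policy/FISAPI-DenyS3*";
        String base = "arn:aws:iam::111122223333:policy/";
        for (String name : new String[] {
                "FISAPI-DenyS3PutObject", "FISAPI-DenyS3DeleteObject", "FISAPI-DenyEc2DescribeInstances"}) {
            System.out.println(name + " -> " + matches(pattern, base + name));
        }
    }
}
```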

Similarly, we could change the Resource matching from:

            NotResource: 
              # Explicitly listed roles - secure but inflexible
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

to a wildcard equivalent like this:

            NotResource: 
              # Wildcard roles - secure and flexible
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixEventBridge}*'
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixS3}*'

and setting RoleNamePrefixEventBridge to FISAPI-TARGET-EventBridge and RoleNamePrefixS3 to FISAPI-TARGET-S3.

Finally, we would also change the FIS experiment role to allow SSM documents based on a name prefix by changing the constraint on automation execution from:

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Explicitly listed resource - secure but inflexible
                # Note: the $DEFAULT at the end could also be an explicit version number
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

to:

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Wildcard resources - secure and flexible
                # 
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationDocumentPrefix}*'

and setting SsmAutomationDocumentPrefix to FISAPI-. Test this by updating the CloudFormation stack with a modified template:

aws cloudformation deploy --stack-name test-fis-api-faults --template-file template2.yaml --capabilities CAPABILITY_NAMED_IAM

Permissions governing users

In production you should not be using administrator access to use FIS. Instead we create two roles FISAPI-AssumableRoleWithCreation and FISAPI-AssumableRoleWithoutCreation for you (see this template). These roles require all FIS and SSM resources to have a Name tag that starts with FISAPI-. Try assuming the role without creation privileges and running an experiment. You will notice that you can only start an experiment if you add a Name tag, e.g. FISAPI-secure-1, and you will only be able to get details of experiments and templates that have proper Name tags.

If you are working with AWS Organizations, you can add further guardrails by defining SCPs that control the use of the FISAPI-* tags similar to this blog post.

Caveats

For this solution we are choosing to attach policies instead of permission boundaries. The benefit of this is that you can attach multiple independent policies and thus simulate multi-step service degradation. However, this means that it is possible to increase the permission level of a role. While there are situations where this might be of interest, e.g. to simulate security breaches, please implement a thorough security review of any fault injection IAM policies you create. Note that modifying IAM Roles may trigger events in your security monitoring tools.

The AttachRolePolicy and DetachRolePolicy calls from AWS IAM are eventually consistent, meaning that in some cases permission propagation when starting and stopping fault injection may take up to 5 minutes each.

Cleanup

To avoid additional cost, delete the content of the S3 bucket and delete the CloudFormation stack:

# Clean up policy attachments just in case
CLEANUP_ROLES=$(aws iam list-roles --query "Roles[?starts_with(RoleName,'FISAPI-')].RoleName" --output text)
for role in $CLEANUP_ROLES; do
  # detach-role-policy expects policy ARNs, so query PolicyArn rather than PolicyName
  CLEANUP_POLICY_ARNS=$(aws iam list-attached-role-policies --role-name "$role" --query "AttachedPolicies[?starts_with(PolicyName,'FISAPI-')].PolicyArn" --output text)
  for policy_arn in $CLEANUP_POLICY_ARNS; do
    echo "Detaching policy $policy_arn from role $role"
    aws iam detach-role-policy --role-name "$role" --policy-arn "$policy_arn"
  done
done
# Delete S3 bucket content
ACCOUNT_ID=$( aws sts get-caller-identity --query Account --output text )
S3_BUCKET_NAME=fis-api-failure-${ACCOUNT_ID}
aws s3 rm --recursive s3://${S3_BUCKET_NAME}
aws s3 rb s3://${S3_BUCKET_NAME}
# Delete cloudformation stack
aws cloudformation delete-stack --stack-name test-fis-api-faults
aws cloudformation wait stack-delete-complete --stack-name test-fis-api-faults

Conclusion

AWS Fault Injection Simulator provides the ability to simulate various external impacts to your application to validate and improve resilience. We’ve shown how combining FIS with IAM to selectively deny access to AWS APIs provides a generic path to explore fault boundaries across all AWS services. We’ve shown how this can be used to identify and improve a resilience problem in a common S3 upload workflow. To learn about more ways to use FIS, see this workshop.


How Vanguard made their technology platform resilient and efficient by building cross-Region replication for Amazon Kinesis Data Streams

2023-02-23 Raghu Boppanna

Post Syndicated from Raghu Boppanna original https://aws.amazon.com/blogs/big-data/how-vanguard-made-their-technology-platform-resilient-and-efficient-by-building-cross-region-replication-for-amazon-kinesis-data-streams/

This is a guest post co-written with Raghu Boppanna from Vanguard.

At Vanguard, the Enterprise Advice line of business improves investor outcomes through digital access to superior, personalized, and affordable financial advice. They made it possible, in part, by driving economies of scale across the globe for investors with a highly resilient and efficient technical platform. Vanguard opted for a multi-Region architecture for this workload to help protect against impairments of Regional services. For high availability purposes, there is a need to make the data used by the workload available not just in the primary Region, but also in the secondary Region with minimal replication lag. In the event of a service impairment in the primary Region, the solution should be able to fail over to the secondary Region with as little data loss as possible and the ability to resume data ingestion.

Vanguard Cloud Technology Office and AWS partnered to build an infrastructure solution on AWS that met their resilience requirements. The multi-Region solution enables a robust fail-over mechanism, with built-in observability and recovery. The solution also supports streaming data from multiple sources to different Kinesis data streams. The solution is currently being rolled out to the different lines of business teams to improve the resilience posture of their workloads.

The use case discussed here requires Change Data Capture (CDC) to stream data from a remote data source (mainframe DB2) to Amazon Kinesis Data Streams, because the business capability depends on this data. Kinesis Data Streams is a fully managed, massively scalable, durable, and low-cost streaming service that can continuously capture and stream large amounts of data from multiple sources, and makes the data available for consumption within milliseconds. The service is built to be highly resilient and uses multiple Availability Zones to process and store data.

The solution discussed in this post explains how AWS and Vanguard innovated to build a resilient architecture to meet their high availability goals.

Solution overview

The solution uses AWS Lambda to replicate data from Kinesis data streams in the primary Region to a secondary Region. In the event of any service impairment impacting the CDC pipeline, the failover process promotes the secondary Region to primary for the producers and consumers. We use Amazon DynamoDB global tables to store replication checkpoints, which allow data streaming to resume from the checkpoint, and to maintain a primary Region configuration flag that prevents an infinite replication loop of the same data back and forth.

The solution also provides the flexibility for Kinesis Data Streams consumers to use the primary or any secondary Region within the same AWS account.

The following diagram illustrates the reference architecture.

Let’s look at each component in detail:

CDC processor (producer) – In this reference architecture, the producer is deployed on Amazon Elastic Compute Cloud (Amazon EC2) in both the primary and secondary Regions, and is active in the primary Region and on standby in the secondary Region. It captures CDC data from the external data source (like a DB2 database as shown in the architecture above), and streams it to Kinesis Data Streams in the primary Region. Vanguard uses a third-party tool, Qlik Replicate, as their CDC processor. It produces a well-formed payload to the Kinesis data stream (example-stream-1 in this example) that includes the DB2 commit timestamp in addition to the actual row data from the remote data source. The following code is a sample payload containing only the primary key of the record that changed and the commit timestamp (for simplicity, the rest of the table row data is not shown below):
```
{
    "eventSource": "aws:kinesis",
    "kinesis": 
    {
         "ApproximateArrivalTimestamp": "Mon July 18 20:00:00 UTC 2022",
         "SequenceNumber": "49544985256907370027570885864065577703022652638596431874",
         "PartitionKey": "12349999",
         "KinesisSchemaVersion": "1.0",
         "Data": "eyJLZXkiOiAxMjM0OTk5OSwiQ29tbWl0VGltZXN0YW1wIjogIjIwMjItMDctMThUMjA6MDA6MDAifQ=="
    },
    "eventId": "shardId-000000000000:49629136582982516722891309362785181370337771525377097730",
    "invokeIdentityArn": "arn:aws:iam::6243876582:role/kds-crr-LambdaRole-1GZWP67437SD",
    "eventName": "aws:kinesis:record",
    "eventVersion": "1.0",
    "eventSourceARN": "arn:aws:kinesis:us-east-1:6243876582:stream/kds-stream-1/consumer/kds-crr:6243876582",
    "awsRegion": "us-east-1"
}
```
The Base64 decoded value of Data is as follows. The actual Kinesis record would contain the entire row data of the table row that changed, in addition to the primary key and the commit timestamp.

{"Key": 12349999,"CommitTimestamp": "2022-07-18T20:00:00"}

The CommitTimestamp in the Data field is used in the replication checkpoint and is critical to accurately track how much of the stream data has been replicated to the secondary Region. The checkpoint can then be used to facilitate a CDC processor (producer) failover and accurately resume producing data from the replication checkpoint timestamp onwards.

The alternative to using a remote data source CommitTimestamp (if unavailable) is to use the ApproximateArrivalTimestamp (which is the timestamp when the record is actually written to the data stream).
Cross-Region replication Lambda function – The function is deployed to both primary and secondary Regions. It’s set up with an event source mapping to the data stream containing CDC data. The same function can be used to replicate data of multiple streams. It’s invoked with a batch of records from Kinesis Data Streams and replicates the batch to a target replication Region (which is provided via the Lambda configuration environment). For cost considerations, if the CDC data is actively produced into the primary Region only, the reserved concurrency of the function in the secondary Region can be set to zero, and modified during regional failover. The function has AWS Identity and Access Management (IAM) role permissions to do the following:
- Read and write to the DynamoDB global tables used in this solution, within the same account.
- Read and write to Kinesis Data Streams in both Regions within the same account.
- Publish custom metrics to Amazon CloudWatch in both Regions within the same account.
Replication checkpoint – The replication checkpoint uses the DynamoDB global table in both the primary and secondary Regions. It’s used by the cross-Region replication Lambda function to persist the commit timestamp of the last replication record as the replication checkpoint for every stream that is configured for replication. For this post, we create and use a global table called kdsReplicationCheckpoint.
Active Region config – The active Region uses the DynamoDB global table in both primary and secondary Regions. It uses the native cross-Region replication capability of the global table to replicate the configuration. It’s pre-populated with data about which is the primary Region for a stream, to prevent replication back to the primary Region by the Lambda function in the standby Region. This configuration may not be required if the Lambda function in the standby Region has a reserved concurrency set to zero, but can serve as a safety check to avoid an infinite replication loop of the data. For this post, we create a global table called kdsActiveRegionConfig and put an item with the following data:
```
{
 "stream-name": "example-stream-1",
 "active-region" : "us-east-1"
}
```
Kinesis Data Streams – The stream to which the CDC processor produces the data. For this post, we use a stream called example-stream-1 in both the Regions, with the same shard configuration and access policies.
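
As a rough illustration of how a consumer might read the sample payload shown for the CDC processor, the following Java sketch Base64-decodes the Data field and pulls out the CommitTimestamp. The class name is hypothetical, and the regular expression stands in for a proper JSON parser to keep the sketch dependency-free; a real function would more likely use a JSON library.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDateTime;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: decodes the sample "Data" value and extracts the commit timestamp.
final class PayloadDecodeSketch {
    // Regex-based extraction is a simplification; it assumes a flat JSON object.
    private static final Pattern COMMIT_TS =
            Pattern.compile("\"CommitTimestamp\"\\s*:\\s*\"([^\"]+)\"");

    static String decode(String base64Data) {
        return new String(Base64.getDecoder().decode(base64Data), StandardCharsets.UTF_8);
    }

    static LocalDateTime commitTimestamp(String json) {
        Matcher m = COMMIT_TS.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("CommitTimestamp not found in payload");
        }
        return LocalDateTime.parse(m.group(1)); // ISO-8601 local date-time, as in the sample
    }

    public static void main(String[] args) {
        String data = "eyJLZXkiOiAxMjM0OTk5OSwiQ29tbWl0VGltZXN0YW1wIjogIjIwMjItMDctMThUMjA6MDA6MDAifQ==";
        String json = decode(data);
        System.out.println(json);
        System.out.println(commitTimestamp(json));
    }
}
```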

Sequence of steps in cross-Region replication

Let’s briefly look at how the architecture is exercised using the following sequence diagram.

The sequence consists of the following steps:

The CDC processor (in us-east-1) reads the CDC data from the remote data source.
The CDC processor (in us-east-1) streams the CDC data to Kinesis Data Streams (in us-east-1).
The cross-Region replication Lambda function (in us-east-1) consumes the data from the data stream (in us-east-1). The enhanced fan-out pattern is recommended for dedicated and increased throughput for cross-Region replication.

The replicator Lambda function (in us-east-1) validates its current Region against the active Region configuration for the stream being consumed, with the help of the kdsActiveRegionConfig DynamoDB global table. The following sample code (in Java) illustrates the condition being evaluated:

// Note: singletonMap (java.util.Collections) and EQ (ComparisonOperator) are assumed to be static imports
// Fetch the current AWS Region from the Lambda function's environment
String currentAWSRegion = System.getenv("AWS_REGION");
// Read the stream name from the first Kinesis Record once for the entire batch being processed. This is done because we are reusing the same Lambda function for replicating multiple streams.
String currentStreamNameConsumed = kinesisRecord.getEventSourceARN().split(":")[5].split("/")[1];
// Build the DynamoDB query condition using the stream name
Map<String, Condition> keyConditions = singletonMap("streamName", Condition.builder().comparisonOperator(EQ).attributeValueList(AttributeValue.builder().s(currentStreamNameConsumed).build()).build());
// Query the DynamoDB Global Table
QueryResponse queryResponse = ddbClient.query(QueryRequest.builder().tableName("kdsActiveRegionConfig").keyConditions(keyConditions).attributesToGet("ActiveRegion").build());

The function evaluates the response from DynamoDB with the following code:

// Evaluate the response
if (queryResponse.hasItems()) {
       AttributeValue activeRegionForStream = queryResponse.items().get(0).get("ActiveRegion");
       return currentAWSRegion.equalsIgnoreCase(activeRegionForStream.s());
}

Depending on the response, the function takes the following actions:
1. If the response is true, the replicator function produces the records to Kinesis Data Streams in us-east-2 in a sequential manner.
  - If there is a failure, the sequence number of the record is tracked and the iteration is broken. The function returns the list of failed sequence numbers. By returning the failed sequence number, the solution uses the feature of Lambda checkpointing to be able to resume processing of a batch of records with partial failures. This is useful when handling any service impairments, where the function tries to replicate the data across Regions to ensure stream parity and no data loss.
  - If there are no failures, an empty list is returned, which indicates the batch was successful.
2. If the response is false, the replicator function returns without performing any replication. To reduce the cost of the Lambda invocations, you can set the reserved concurrency of the function in the DR Region (us-east-2) to zero. This will prevent the function from being invoked. When you fail over, you can update this value to an appropriate number based on the CDC throughput and set the reserved concurrency of the function in us-east-1 to zero to prevent it from executing unnecessarily.
After all the records are produced to Kinesis Data Streams in us-east-2, the replicator function checkpoints to the kdsReplicationCheckpoint DynamoDB global table (in us-east-1) with the following data:
```
{ "streamName": "example-stream-1", "lastReplicatedTimestamp": "2022-07-18T20:00:00" }
```
The function returns after successfully processing the batch of records.
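
The sequential replication and partial-failure handling described in the steps above can be sketched in plain Java as follows. This is a minimal model under stated assumptions, not the actual Vanguard function: StreamRecord is a stand-in for the Kinesis record type, and putToTargetRegion is a hypothetical hook that represents the write to the stream in the other Region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative model of the replication loop; no AWS SDK calls are made here.
final class ReplicationBatchSketch {
    // Minimal stand-in for a Kinesis record; the real event type carries more fields.
    static final class StreamRecord {
        final String sequenceNumber;
        final String data;

        StreamRecord(String sequenceNumber, String data) {
            this.sequenceNumber = sequenceNumber;
            this.data = data;
        }
    }

    // Same parsing as the sample above: ARN field 5 is "stream/<name>/...".
    static String streamNameFromArn(String eventSourceArn) {
        return eventSourceArn.split(":")[5].split("/")[1];
    }

    // Replicates records in order and stops at the first failure.
    // Returns the failed sequence numbers; an empty list means the whole batch succeeded.
    static List<String> replicate(List<StreamRecord> batch, Predicate<StreamRecord> putToTargetRegion) {
        List<String> failedSequenceNumbers = new ArrayList<>();
        for (StreamRecord record : batch) {
            if (!putToTargetRegion.test(record)) {
                failedSequenceNumbers.add(record.sequenceNumber);
                break; // keep ordering intact: do not replicate past a failed record
            }
        }
        return failedSequenceNumbers;
    }

    public static void main(String[] args) {
        List<StreamRecord> batch = List.of(
                new StreamRecord("1", "a"), new StreamRecord("2", "b"), new StreamRecord("3", "c"));
        // Simulate a failure on the second record.
        System.out.println(replicate(batch, r -> !r.sequenceNumber.equals("2")));
    }
}
```

Stopping at the first failure means the records after it are retried together with the failed one, which is one reason downstream consumers need to tolerate duplicates.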

Performance considerations

The performance expectations of the solution should be understood with respect to the following factors:

Region selection – Replication latency grows with the distance the data travels, so factor the distance between Regions into your Region selection.
Velocity – The incoming velocity of the data or the volume of data being replicated
Payload size – The size of the payload being replicated

Monitor the Cross-Region replication

It’s recommended to track and observe the replication as it happens. You can tailor the Lambda function to publish the following custom metrics to CloudWatch at the end of every invocation. Publishing these metrics to both the primary and secondary Regions helps protect you from impairments affecting observability in the primary Region.

Throughput – The current Lambda invocation batch size
ReplicationLagSeconds – The difference between the current timestamp (after processing all the records) and the ApproximateArrivalTimestamp of the last record that was replicated
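
As a small illustration of the ReplicationLagSeconds definition above, the following Java sketch computes the lag from two timestamps. The class and method names are hypothetical, and publishing the value to CloudWatch is left out.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative calculation only; it does not publish anything to CloudWatch.
final class ReplicationLagSketch {
    // Whole seconds between the arrival of the last replicated record and "now".
    static long replicationLagSeconds(Instant lastRecordArrival, Instant now) {
        return Duration.between(lastRecordArrival, now).getSeconds();
    }

    public static void main(String[] args) {
        Instant arrival = Instant.parse("2022-07-18T20:00:00Z");
        Instant afterBatch = Instant.parse("2022-07-18T20:00:02Z");
        System.out.println(replicationLagSeconds(arrival, afterBatch));
    }
}
```

Fractions of a second are dropped in this sketch; if sub-second lag matters for your workload, `Duration.toMillis()` would be a natural alternative.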

The following example CloudWatch metric graph shows the average replication lag was 2 seconds with a throughput of 100 records replicated from us-east-1 to us-east-2.

Common failover strategy

During any impairments impacting the CDC pipeline in the primary Region, business continuity or disaster recovery needs may dictate a pipeline failover to the secondary (standby) Region. This means several things need to be done as part of this failover process:

If possible, stop all the CDC tasks in the CDC processor tool in us-east-1.
The CDC processor must be failed over to the secondary Region, so that it can read the CDC data from the remote data source while operating out of the standby Region.
The kdsActiveRegionConfig DynamoDB global table needs to be updated. For instance, for the stream example-stream-1 used in our example, the active Region is changed to us-east-2:

{
"stream-name": "example-stream-1",
"active-region" : "us-east-2"
}

All the stream checkpoints need to be read from the kdsReplicationCheckpoint DynamoDB global table (in us-east-2), and the timestamps from each of the checkpoints are used to start the CDC tasks in the producer tool in us-east-2 Region. This minimizes the chances of data loss and accurately resumes streaming the CDC data from the remote data source from the checkpoint timestamp onwards.
If using reserved concurrency to control Lambda invocations, set the value to zero in the primary Region (us-east-1) and to a suitable non-zero value in the secondary Region (us-east-2).

Vanguard’s multi-step failover strategy

Some of the third-party tools that Vanguard uses have a two-step CDC process of streaming data from a remote data source to a destination. Vanguard’s tool of choice for their CDC processor follows this two-step approach:

The first step involves setting up a log stream task that reads the data from the remote data source and persists in a staging location.
The second step involves setting up individual consumer tasks that read data from the staging location—which could be on Amazon Elastic File System (Amazon EFS) or Amazon FSx, for example—and stream it to the destination. The flexibility here is that each of these consumer tasks can be triggered to stream from different commit timestamps. The log stream task usually starts reading data from the minimum of all the commit timestamps used by the consumer tasks.

Let’s look at an example to explain the scenario:

Consumer task A is streaming data from a commit timestamp 2022-07-19T20:00:00 onwards to example-stream-1.
Consumer task B is streaming data from a commit timestamp 2022-07-19T21:00:00 onwards to example-stream-2.
In this situation, the log stream should read data from the remote data source from the minimum of the timestamps used by the consumer tasks, which is 2022-07-19T20:00:00.

The following sequence diagram demonstrates the exact steps to run during a failover to us-east-2 (the standby Region).

The steps are as follows:

The failover process is triggered in the standby Region (us-east-2 in this example) when required. Note that the trigger can be automated using comprehensive health checks of the pipeline in the primary Region.
The failover process updates the kdsActiveRegionConfig DynamoDB global table with the new value for the Region as us-east-2 for all the stream names.
The next step is to fetch all the stream checkpoints from the kdsReplicationCheckpoint DynamoDB global table (in us-east-2).
After the checkpoint information is read, the failover process finds the minimum of all the lastReplicatedTimestamp.
The log stream task in the CDC processor tool is started in us-east-2 with the timestamp found in Step 4. It begins reading CDC data from the remote data source from this timestamp onwards and persists them in the staging location on AWS.
The next step is to start all the consumer tasks to read data from the staging location and stream to the destination data stream. This is where each consumer task is supplied with the appropriate timestamp from the kdsReplicationCheckpoint table according to the streamName to which the task streams the data.
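
The timestamp selection in the steps above can be sketched as follows, assuming the checkpoints have already been read from the kdsReplicationCheckpoint table into a map keyed by stream name. The class and method names are hypothetical, and the sample values reuse the consumer task example from earlier in this section.

```java
import java.time.LocalDateTime;
import java.util.Collections;
import java.util.Map;

// Illustrative only: picks the start point for the log stream task from per-stream checkpoints.
final class FailoverStartPointSketch {
    // The log stream task must start from the earliest checkpoint across all streams.
    static LocalDateTime logStreamStart(Map<String, LocalDateTime> checkpointsByStream) {
        if (checkpointsByStream.isEmpty()) {
            throw new IllegalArgumentException("no checkpoints available");
        }
        return Collections.min(checkpointsByStream.values());
    }

    public static void main(String[] args) {
        Map<String, LocalDateTime> checkpoints = Map.of(
                "example-stream-1", LocalDateTime.parse("2022-07-19T20:00:00"),
                "example-stream-2", LocalDateTime.parse("2022-07-19T21:00:00"));
        // Each consumer task would still start from its own stream's checkpoint.
        System.out.println(logStreamStart(checkpoints));
    }
}
```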

After all the consumer tasks are started, data is produced to the Kinesis data streams in us-east-2. From there on, the process of cross-Region replication is the same as described earlier – the replication Lambda function in us-east-2 starts replicating data to the data stream in us-east-1.

The consumer applications reading data from the streams are expected to be idempotent to be able to handle duplicates. Duplicates can be introduced in the stream due to many reasons, some of which are called out below.

The producer (CDC processor) introduces duplicates into the stream while replaying the CDC data during a failover.
DynamoDB global tables replicate data across Regions asynchronously, so if the kdsReplicationCheckpoint table data has a replication lag, the failover process may use an older checkpoint timestamp to replay the CDC data.

Also, consumer applications should checkpoint the CommitTimestamp of the last record that was consumed. This is to facilitate better monitoring and recovery.
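
One possible way to make a consumer idempotent is sketched below. It is an in-memory illustration under simplifying assumptions: changes are identified by primary key and CommitTimestamp, timestamps use the fixed ISO-8601 format from the sample payload (so string comparison matches chronological order), and several changes to the same key within one commit are not distinguished. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory consumer; a real one would persist this state durably.
final class IdempotentConsumerSketch {
    // Latest commit timestamp applied per primary key.
    private final Map<Long, String> lastAppliedByKey = new HashMap<>();
    // Consumer-side checkpoint: the highest CommitTimestamp consumed so far.
    private String checkpoint;

    // Returns true if the change was applied, false if it was a duplicate or a stale replay.
    boolean consume(long key, String commitTimestamp) {
        if (checkpoint == null || commitTimestamp.compareTo(checkpoint) > 0) {
            checkpoint = commitTimestamp; // only ever move the checkpoint forward
        }
        String previous = lastAppliedByKey.get(key);
        if (previous != null && previous.compareTo(commitTimestamp) >= 0) {
            return false; // already applied this change or a newer one
        }
        lastAppliedByKey.put(key, commitTimestamp);
        return true;
    }

    String checkpoint() {
        return checkpoint;
    }

    public static void main(String[] args) {
        IdempotentConsumerSketch consumer = new IdempotentConsumerSketch();
        System.out.println(consumer.consume(12349999L, "2022-07-18T20:00:00"));
        System.out.println(consumer.consume(12349999L, "2022-07-18T20:00:00")); // replayed duplicate
        System.out.println(consumer.checkpoint());
    }
}
```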

Path to maturity: Automated recovery

The ideal state is to fully automate the failover process, reducing time to recover and meeting the resilience Service Level Objective (SLO). However, in most organizations, the decision to fail over, fail back, and trigger the failover requires manual intervention in assessing the situation and deciding the outcome. Creating scripted automation to perform the failover that can be run by a human is a good place to start.

Vanguard has automated all of the failover steps, but still has humans decide when to invoke them. You can customize the solution to meet your needs, depending on the CDC processor tool you use in your environment.

Conclusion

In this post, we described how Vanguard innovated and built a solution for replicating data across Regions in Kinesis Data Streams to make the data highly available. We also demonstrated a robust checkpoint strategy to facilitate a Regional failover of the replication process when needed. The solution also illustrated how to use DynamoDB global tables for tracking the replication checkpoints and configuration. With this architecture, Vanguard was able to deploy workloads depending on the CDC data to multiple Regions to meet business needs of high availability in the face of service impairments impacting CDC pipelines in the primary Region.


About the authors

Raghu Boppanna works as an Enterprise Architect at Vanguard’s Chief Technology Office. Raghu specializes in Data Analytics, Data Migration/Replication including CDC Pipelines, Disaster Recovery and Databases. He has earned several AWS Certifications including AWS Certified Security – Specialty & AWS Certified Data Analytics – Specialty.

Parameswaran V Vaidyanathan is a Senior Cloud Resilience Architect with Amazon Web Services. He helps large enterprises achieve their business goals by architecting and building scalable and resilient solutions on the AWS Cloud.

Richa Kaul is a Senior Leader in Customer Solutions serving Financial Services customers. She is based out of New York. She has extensive experience in large scale cloud transformation, employee excellence, and next generation digital solutions. She and her team focus on optimizing value of cloud by building performant, resilient and agile solutions. Richa enjoys multi sports like triathlons, music, and learning about new technologies.

Mithil Prasad is a Principal Customer Solutions Manager with Amazon Web Services. In his role, Mithil works with customers to drive cloud value realization and provides thought leadership to help businesses achieve speed, agility, and innovation.

Let’s Architect! Architecting with Amazon DynamoDB

2022-10-19 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecting-with-amazon-dynamodb/

NoSQL databases are an essential part of the technology industry in today’s world. Why are we talking about NoSQL databases? NoSQL databases often allow developers to be in control of the structure of the data, and they are a good fit for big data scenarios and offer fast performance.

In this issue of Let’s Architect!, we explore Amazon DynamoDB capabilities and potential solutions to apply in your architectures. A key strength of DynamoDB is the capability of operating at scale globally; for instance, multiple products built by Amazon are powered by DynamoDB. During Prime Day 2022, the service also maintained high availability while delivering single-digit millisecond responses, peaking at 105.2 million requests-per-second. Let’s start!

Data modeling with DynamoDB

Working with a new database technology means understanding exactly how it works and the best design practices for taking full advantage of its features.

This video discusses the key principles for modeling DynamoDB tables, explores practical patterns to use while defining your data models, and explains how data modeling for NoSQL databases (like DynamoDB) differs from modeling for traditional relational databases.

With this video, you can learn about the main components of DynamoDB, some design considerations that led to its creation, and all the best practices for efficiently using primary keys, secondary keys, and indexes. Peruse the original paper to learn more about DynamoDB in Dynamo: Amazon’s Highly Available Key-value Store.

Amazon DynamoDB uses partitioning to provide horizontal scalability

Single-table vs. multi-table in Amazon DynamoDB

When considering single-table versus multi-table in DynamoDB, it is all about your application’s needs. It is possible to avoid naively lifting and shifting your relational data model into DynamoDB tables. In this post, you will discover use cases for single-table compared with multi-table designs, and learn certain data-modeling principles for DynamoDB.

Use a single-table design to provide materialized joins in Amazon DynamoDB

Optimizing costs on DynamoDB tables

Infrastructure cost is an important dimension for every customer. Regardless of your role inside an organization, you should look for opportunities to optimize costs when possible. For this reason, we have created a guide on DynamoDB tables cost-optimization that provides several suggestions for reducing your bill at the end of the month.

Build resilient applications with Amazon DynamoDB global tables: Part 1

When you operate global systems that are spread across multiple AWS Regions, dealing with data replication and writes across Regions can be a challenge. DynamoDB global tables help by providing the performance of DynamoDB across multiple Regions with data synchronization and a multi-active database where each replica can be used for both writing and reading data.

Another use case for global tables is resilient applications with the lowest possible recovery time objective (RTO) and recovery point objective (RPO). In this blog series, we show you how to approach such a scenario.

Amazon DynamoDB active-active architecture

See you next time!

Thanks for joining our discussion on DynamoDB. See you in a few weeks, when we explore cost optimization!

Looking for more architecture content?

AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Continually assessing application resilience with AWS Resilience Hub and AWS CodePipeline

2022-06-22 Scott Bryen

Post Syndicated from Scott Bryen original https://aws.amazon.com/blogs/architecture/continually-assessing-application-resilience-with-aws-resilience-hub-and-aws-codepipeline/

As customers commit to a DevOps mindset and embrace a nearly continuous integration/continuous delivery model to implement change with a higher velocity, assessing the impact of every change on an application’s resilience is key. This blog shows an architecture pattern for automating resiliency assessments as part of your CI/CD pipeline. By automatically running a resiliency assessment within CI/CD pipelines, development teams can fail fast and understand quickly if a change negatively impacts an application’s resilience. The pipeline can stop the deployment into further environments, such as QA/UAT and Production, until the resilience issues have been resolved.

AWS Resilience Hub is a managed service that gives you a central place to define, validate, and track the resiliency of your AWS applications. It is integrated with AWS Fault Injection Simulator (FIS), a chaos engineering service, to provide fault-injection simulations of real-world failures. Using AWS Resilience Hub, you can assess your applications to uncover potential resilience enhancements. This will allow you to validate your application’s recovery time objective (RTO) and recovery point objective (RPO), and optimize business continuity while reducing recovery costs. Resilience Hub also provides APIs for you to integrate its assessment and testing into your CI/CD pipelines for ongoing resilience validation.

AWS CodePipeline is a fully managed continuous delivery service for fast and reliable application and infrastructure updates. You can use AWS CodePipeline to model and automate your software release processes. This enables you to increase the speed and quality of your software updates by running all new changes through a consistent set of quality checks.

Continuous resilience assessments

Figure 1 shows the resilience assessments automation architecture in a multi-account setup. AWS CodePipeline, AWS Step Functions, and AWS Resilience Hub are defined in your deployment account while the application AWS CloudFormation stacks are imported from your workload account. This pattern relies on the ability of AWS Resilience Hub to import CloudFormation stacks from different accounts, Regions, or both, when discovering an application structure.

Figure 1. High-level architecture pattern for automating resilience assessments

Add application to AWS Resilience Hub

Begin by adding your application to AWS Resilience Hub and assigning a resilience policy. This can be done via the AWS Management Console or using CloudFormation. In this instance, the application has been created through the AWS Management Console. Sebastien Stormacq’s post, Measure and Improve Your Application Resilience with AWS Resilience Hub, walks you through how to add your application to AWS Resilience Hub.

In a multi-account environment, customers typically have a dedicated AWS workload account per environment, and we recommend you separate CI/CD capabilities into another account. In this post, the AWS Resilience Hub application has been created in the deployment account and the resources have been discovered using a CloudFormation stack from the workload account. Proper permissions are required to use AWS Resilience Hub to manage applications in multiple accounts.

Figure 2. Adding application to AWS Resilience Hub

Create AWS Step Function to run resilience assessment

Whenever you make a change to your application’s CloudFormation template, you need to update and publish the latest version in AWS Resilience Hub to ensure you are assessing the latest changes. Now that AWS Step Functions SDK integrations support AWS Resilience Hub, you can build a state machine to coordinate the process, which will be triggered from AWS CodePipeline.

AWS Step Functions is a low-code, visual workflow service that developers use to build distributed applications, automate IT and business processes, and build data and machine learning pipelines using AWS services. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.

Figure 3. AWS Step Function for orchestrating AWS SDK calls

The first step in the workflow is to update the resources associated with the application defined in AWS Resilience Hub by calling ImportResourcesToDraftAppVersion.
Check for the import process to complete using a wait state, a call to DescribeDraftAppVersionResourcesImportStatus and then a choice state to decide whether to progress or continue waiting.
Once complete, publish the draft application by calling PublishAppVersion to ensure we are assessing the latest version.
Once published, call StartAppAssessment to kick off a resilience assessment.
Check for the assessment to complete using a wait state, a call to DescribeAppAssessment and then a choice state to decide whether to progress or continue waiting.
In the choice state, use assessment status from the response to determine if the assessment is pending, in progress or successful.
If successful, use the compliance status from the response to determine whether to progress to success or fail.
- Compliance status will be either “PolicyMet” or “PolicyBreached”.
If policy breached, publish onto SNS to alert the development team before moving to fail.
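The choice-state logic in the steps above can be sketched as a small pure function. This is only an illustration of the branching, not the state machine definition from the post. The function name, the returned action labels, and the exact assessment status strings ("Pending", "InProgress", "Success") are assumptions; only "PolicyMet" and "PolicyBreached" are taken from the text.

```python
from typing import Optional

# Assumed status strings for an assessment that has not finished yet.
WAITING_STATUSES = {"Pending", "InProgress"}


def next_step(assessment_status: str, compliance_status: Optional[str] = None) -> str:
    """Map a polled assessment result to the next workflow action.

    Returns one of the hypothetical labels "wait", "succeed",
    "notify_and_fail", or "fail".
    """
    if assessment_status in WAITING_STATUSES:
        # Assessment still running: go back to the wait state.
        return "wait"
    if assessment_status != "Success":
        # The assessment itself did not complete successfully.
        return "fail"
    if compliance_status == "PolicyMet":
        return "succeed"
    if compliance_status == "PolicyBreached":
        # Publish to SNS to alert the team, then fail the pipeline stage.
        return "notify_and_fail"
    # Unknown compliance value: fail rather than let the change through.
    return "fail"
```

Keeping the decision in one place makes it straightforward to check each branch before encoding the same rules in a Step Functions choice state.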

Create stage within code pipeline

Now that we have the AWS Step Function created, we need to integrate it into our pipeline. The post Fine-grained Continuous Delivery With CodePipeline and AWS Step Functions demonstrates how you can trigger a step function from AWS CodePipeline.

When adding the stage, you need to pass the ARN of the stack that was deployed in the previous stage as well as the ARN of the application in AWS Resilience Hub. Both are required by the AWS SDK calls, and you can pass them in as a literal.
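As a rough sketch of what such a literal input might look like, the snippet below builds a JSON document carrying both ARNs. The key names "StackArn" and "AppArn" and the function name are hypothetical; use whatever names your state machine actually reads from its input.

```python
import json


def build_stage_input(stack_arn: str, app_arn: str) -> str:
    """Serialize the two ARNs as a JSON literal for the pipeline stage."""
    # "StackArn" and "AppArn" are illustrative key names, not ones
    # defined by AWS CodePipeline or AWS Step Functions.
    return json.dumps({"StackArn": stack_arn, "AppArn": app_arn})
```

The state machine can then reference the two values from its input when it makes the AWS SDK calls.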

Figure 4. AWS CodePipeline stage step function input

Figure 5. Example state using the input from AWS CodePipeline stage

For more information about these AWS SDK calls, please refer to the AWS Resilience Hub API Reference documents.

Customers often run their workloads in lower environments in a less resilient way to save on cost. It’s important to add the assessment stage at the appropriate point of your pipeline. We recommend adding this to your pipeline after the deployment to a test environment which mirrors production but before deploying to production. By doing this you can fail fast and halt changes which will lower resilience in production.

A note on service quotas: AWS Resilience Hub allows you to run 20 assessments per month per application. If you need to increase this quota, please raise a ticket with AWS Support.

Conclusion

In this post, we have seen an approach to continuously assessing resilience as part of your CI/CD pipeline using AWS Resilience Hub, AWS CodePipeline, and AWS Step Functions. This approach will enable you to quickly understand whether a change will weaken resilience.

AWS Resilience Hub also generates recommended AWS FIS Experiments that you can deploy and use to test the resilience of your application. As well as assessing the resilience, we also recommend you integrate running these tests into your pipeline. The post Chaos Testing with AWS Fault Injection Simulator and AWS CodePipeline demonstrates how you can achieve this.

Understand resiliency patterns and trade-offs to architect efficiently in the cloud

2022-06-03 Haresh Nandwani

Post Syndicated from Haresh Nandwani original https://aws.amazon.com/blogs/architecture/understand-resiliency-patterns-and-trade-offs-to-architect-efficiently-in-the-cloud/

This post was originally published in June 2022 and is now updated with more information on efficiently architecting resilient patterns in the cloud.

Customers architecting workloads for resilience on the cloud often need to evaluate multiple factors before they can decide on the most suitable architecture for their workloads.

Example Corp has multiple applications with varying criticality, and each of their applications has different needs in terms of resiliency, complexity, and cost. They have many choices to architect their workloads for resiliency and cost, but which option suits their needs best? What should they consider when choosing the patterns most appropriate for the needs of their applications?

To help answer these questions, we’ll discuss the five resilience patterns in Figure 1 and the trade-offs to consider when implementing them: 1) design complexity, 2) cost to implement, 3) operational effort, 4) effort to secure, and 5) environmental impact. This will help you achieve varying levels of resiliency and make decisions about the most appropriate architecture for your needs. Our intent is to provide a high-level approach to structure conversations on trade-offs associated with each of these patterns. For a deeper dive on each pattern, please navigate to the Further reading section at the end of this post.

Note: these patterns are not mutually exclusive; you may decide to implement a combination of one or more patterns.

Figure 1. Resilience patterns and trade-offs

What is resiliency? Why does it matter?

The AWS Well-Architected Framework defines resilience as having “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components.”

To meet your business’ resilience requirements, consider the following core factors as you design your workloads:

Design complexity – An increase in system complexity typically increases the emergent behaviors of that system. Each individual workload component has to be resilient, and you’ll need to eliminate single points of failure across people, process, and technology elements. Customers should consider their resilience requirements and decide if increasing system complexity is an effective approach, or if keeping the system simple and using a disaster recovery (DR) plan is more appropriate.
Cost to implement – Costs often significantly increase when you implement higher resilience because there are new software and infrastructure components to operate. It’s important for such costs to be offset by the potential costs of future loss.
Operational effort – Deploying and supporting highly resilient systems requires complex operational processes and advanced technical skills. For example, customers might need to improve their operational processes using the Operational Readiness Review (ORR) approach. Before you decide to implement higher resilience, evaluate your operational competency to confirm you have the required level of process maturity and skillsets.
Effort to secure – Security complexity is less directly correlated with resilience. However, there are generally more components to secure for highly resilient systems. Using security best practices for cloud deployments can achieve security objectives without adding significant complexity even with a higher deployment footprint.
Environmental impact – An increased deployment footprint for resilient systems may increase your consumption of cloud resources. However, you can use trade-offs, like approximate computing and deliberately implementing slower response times, to reduce resource consumption. The AWS Well-Architected Sustainability Pillar describes these patterns and provides guidance on sustainability best practices.

Pattern 1 (P1): Multi-AZ

P1 is a cloud-based architecture pattern (Figure 2) that introduces Availability Zones (AZs) into your architecture to increase your system’s resilience. The P1 pattern uses a Multi-AZ architecture where applications operate in multiple AZs within a single AWS Region. This allows your application to withstand AZ-level impacts.

As shown in Figure 2, Example Corp deploys their internal employee applications using the P1 pattern. These applications are low business impact and therefore have lower requirements for resiliency.

Example Corp deploys their low-business-impact applications as a single Amazon Elastic Compute Cloud (Amazon EC2) instance managed by an Auto Scaling group. Amazon EC2 uses health checks to automatically detect faults. If an AZ fails, Amazon EC2 prompts an Amazon EC2 Auto Scaling group to recreate their application in another unaffected AZ.

Figure 2. Multi-AZ deployment pattern (P1)

Trade-offs

P1 is low in several categories and mitigates a disruption to the AZ hosting the application, but this comes at the expense of application recovery. If an AZ is down, it will disrupt end users’ access to the application while the new resources are being re-provisioned in a new AZ. This is known as bimodal behavior.

Pattern 2 (P2): Multi-AZ with static stability

P2 uses multiple instances across multiple AZs within a Region to increase resilience. The pattern uses static stability to prevent bimodal behavior. Statically stable systems remain stable and operate in one mode, irrespective of changes to their operating environment. A key benefit of a statically stable system on AWS is that it reduces the complexity of recovery during a disruption thanks to pre-provisioned resource capacity. Any resources needed to maintain operations during a disruption, such as the loss of resources in an AZ, already exist, and AWS service control planes do not need to be available for recovery to be successful. To learn more about static stability, data planes, and control planes, read The Amazon Builders’ Library article Static stability using Availability Zones.

As shown in Figure 3, Example Corp has a customer-facing website that has a lower tolerance for downtime. Any time the website is down, it could result in lost revenue. Because of this, the website requires two EC2 instances that are provisioned within two AZs. When an AZ becomes impaired, the website continues to operate because the Elastic Load Balancer uses health checks to divert traffic away from the impacted AZ. For more on using health checks, see the Implementing health checks article in The Amazon Builders’ Library.

Figure 3. Multi-AZ with static stability pattern (P2)

Trade-offs

P2 mitigates an AZ disruption without downtime to application clients but must be weighed against cost concerns. P1 is less expensive from an infrastructure cost perspective, as it provisions less compute capacity and relies on launching new instances in case of a failure. However, P1’s bimodal behavior can affect your customers during large-scale events.

Implementing P2 requires your application to support distributed operation across multiple instances. If your application can support this pattern, you can deploy your workload to all available AZs (usually 3 or more) across the Region. This will reduce costs associated with over-provisioning because you only have to provision 150% of your capacity across three AZs compared with the 200% in two AZs (as mentioned in our earlier example).
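The 150% and 200% figures follow from simple arithmetic. The sketch below assumes load is spread evenly across AZs and that the AZs remaining after a failure must still carry 100% of peak load; the function name and parameters are illustrative.

```python
from fractions import Fraction


def provisioned_capacity_percent(az_count: int, az_failures_tolerated: int = 1) -> Fraction:
    """Total capacity to provision, as a percentage of peak load.

    Assumes an even spread across AZs, so the surviving AZs
    (az_count - az_failures_tolerated) must together hold 100% of peak.
    """
    surviving = az_count - az_failures_tolerated
    if surviving <= 0:
        raise ValueError("need more AZs than tolerated failures")
    # Each AZ holds 100/surviving percent of peak; multiply by all AZs.
    return Fraction(100 * az_count, surviving)


print(provisioned_capacity_percent(2))  # 200
print(provisioned_capacity_percent(3))  # 150
```

Under the same assumptions, adding more AZs continues to shrink the over-provisioning needed to tolerate the loss of a single AZ.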

Pattern 3 (P3): Application portfolio distribution

P3 uses a Multi-Region pattern to increase functional resilience, as demonstrated in Figure 4. It distributes different critical applications in multiple Regions.

Example Corp provides banking services, like credit balance checks, to consumers on multiple digital channels. These services are available to consumers via a mobile application, contact center, and web-based applications. Each digital channel is deployed to a separate Region, which mitigates against a regional service disruption.

For example, a Region with the customers’ mobile application may have a disruption that causes the mobile app to be unavailable, but customers can still access banking services via online banking deployed in an alternate Region. Regional service disruptions are rare, but implementing a pattern like this ensures your users retain access to business-critical services during disruptions.

Figure 4. Application portfolio distribution pattern (P3)

Trade-offs

P3 mitigates the possibility of a regional service disruption impacting a multitude of systems at the same time. Operating an application portfolio that spans multiple Regions requires significant operational planning and management. Isolated functional elements may depend on common downstream systems and data sources that are deployed in a single Region. Therefore, Region-wide events may still cause disruption, but the impact surface area should be reduced.

Pattern 4 (P4): Multi-AZ deployment (multi-Region DR)

Example Corp operates several business-critical services that have a very low tolerance for disruption, such as the ability for consumers to make bank payments. Example Corp reviewed the four common patterns for DR (as defined in Disaster Recovery of Workloads on AWS: Recovery in the Cloud) and decided to use the following sub-patterns for their multi-Region applications:

Pilot Light – This pattern works for applications that require RTO/RPO of 10s of minutes. Data is actively replicated and application infrastructure is pre-provisioned in the DR Region. Cost optimization is a key driver here, as the application infrastructure is kept switched-off and only switched-on during the restore event.
Warm Standby – This pattern improves restore times significantly compared with pilot light by keeping your applications running in the DR Region but with a reduced capacity. Application infrastructure will be scaled up during a DR event, but this can typically be automated with minimal manual effort. This pattern can achieve RTO/RPO of minutes if implemented correctly.

Trade-offs

P4 mitigates a disruption to a regional service while reducing mitigation costs. Regional DR patterns increase deployment complexity as infrastructure changes need to be synchronized across Regions. Testing resilience is also significantly more complex and includes simulating regional disruptions. Using Infrastructure as Code to automate deployments can help alleviate these issues.

Pattern 5 (P5): Multi-Region active-active

Example Corp’s core banking and Customer Relationship Management applications have zero tolerance for disruption. They use the P5 pattern for deploying these applications because it has an RTO of real-time and an RPO of near-zero data loss. They run their workload simultaneously in multiple Regions, allowing them to serve traffic from all Regions simultaneously. This pattern not only mitigates against regional disruptions but also addresses their zero tolerance requirements (Figure 5).

Figure 5. Multi-Region active-active pattern (P5)

Trade-offs

P5 mitigates the disruption of a regional service, at the price of additional cost and complexity, to deliver an RTO of near zero. Multi-active deployments are generally complex, as they include multiple applications that collaborate to deliver required business services. If you implement this pattern, you’ll need to consider the fact that you’re introducing asynchronous replication for data across Regions and the impact that has on data consistency.

Operating this pattern requires a very high level of process maturity, so we recommend customers gradually build towards this pattern by starting with the deployment patterns described earlier.

Conclusion

In this blog post, we introduced five resilience patterns and trade-offs to consider when implementing them. In an effort to help you find the most efficient architecture for your use case, we demonstrated how Example Corp evaluated these options and how they applied them to their business needs.

How to use regional SAML endpoints for failover

2022-05-31 Jonathan VanKim

Post Syndicated from Jonathan VanKim original https://aws.amazon.com/blogs/security/how-to-use-regional-saml-endpoints-for-failover/

Many Amazon Web Services (AWS) customers choose to use federation with SAML 2.0 in order to use their existing identity provider (IdP) and avoid managing multiple sources of identities. Some customers have previously configured federation by using AWS Identity and Access Management (IAM) with the endpoint signin.aws.amazon.com. Although this endpoint is highly available, it is hosted in a single AWS Region, us-east-1. This blog post provides recommendations that can improve resiliency for customers that use IAM federation, in the unlikely event of disrupted availability of one of the regional endpoints. We will show you how to use multiple SAML sign-in endpoints in your configuration and how to switch between these endpoints for failover.

How to configure federation with multi-Region SAML endpoints

AWS Sign-In allows users to log in to the AWS Management Console. With SAML 2.0 federation, your IdP portal generates a SAML assertion and redirects the client browser to an AWS sign-in endpoint, by default signin.aws.amazon.com/saml. To improve federation resiliency, we recommend that you configure your IdP and AWS federation to support multiple SAML sign-in endpoints, which requires configuration changes for both your IdP and AWS. If you have only one endpoint configured, you won’t be able to log in to AWS by using federation in the unlikely event that the endpoint becomes unavailable.

Let’s take a look at the Region code SAML sign-in endpoints in the AWS General Reference. The table in the documentation shows AWS regional endpoints globally. The format of the endpoint URL is as follows, where <region-code> is the AWS Region of the endpoint: https://<region-code>.signin.aws.amazon.com/saml

All regional endpoints have a region-code value in the DNS name, except for us-east-1. The endpoint for us-east-1 is signin.aws.amazon.com—this endpoint does not contain a Region code and is not a global endpoint. AWS documentation has been updated to reference SAML sign-in endpoints.
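The naming rule described above, including the us-east-1 exception, can be sketched as a small helper. The function name is illustrative, and the sketch assumes the public AWS Regions only; the GovCloud and China Regions are not handled here.

```python
def saml_sign_in_endpoint(region_code: str) -> str:
    """Return the SAML sign-in endpoint URL for a public AWS Region."""
    if region_code == "us-east-1":
        # us-east-1 is the one endpoint without a Region code in its DNS name.
        return "https://signin.aws.amazon.com/saml"
    return f"https://{region_code}.signin.aws.amazon.com/saml"


print(saml_sign_in_endpoint("us-west-2"))
# https://us-west-2.signin.aws.amazon.com/saml
```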

In the next two sections of this post, Configure your IdP and Configure IAM roles, I’ll walk through the steps that are required to configure additional resilience for your federation setup.

Important: You must do these steps before an unexpected unavailability of a SAML sign-in endpoint.

Configure your IdP

You will need to configure your IdP and specify which AWS SAML sign-in endpoint to connect to.

To configure your IdP

If you are setting up a new configuration for AWS federation, your IdP will generate a metadata XML configuration file. Keep track of this file, because you will need it when you configure the AWS portion later.
Register the AWS service provider (SP) with your IdP by using a regional SAML sign-in endpoint. If your IdP allows you to import the AWS metadata XML configuration file, you can find these files available for the public, GovCloud, and China Regions.
If you are manually setting the Assertion Consumer Service (ACS) URL, we recommend that you pick the endpoint in the same Region where you have AWS operations.
In SAML 2.0, RelayState is an optional parameter that identifies a specified destination URL that your users will access after signing in. When you set the ACS value, configure the corresponding RelayState to be in the same Region as the ACS. This keeps the Region configurations consistent for both ACS and RelayState. Following is the format of a Region-specific console URL.
https://<region-code>.console.aws.amazon.com/

For more information, refer to your IdP’s documentation on setting up the ACS and RelayState.
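One way to guard against a mismatch between the ACS and RelayState is a small consistency check run before you save the IdP configuration. The following is a minimal sketch with hypothetical function names; it assumes the URL formats shown in this post for the public AWS Regions and does not cover the GovCloud or China Regions.

```python
from urllib.parse import urlparse

_SUFFIXES = (".signin.aws.amazon.com", ".console.aws.amazon.com")


def region_of(url: str) -> str:
    """Extract the Region code from a sign-in or console URL."""
    host = urlparse(url).hostname or ""
    if host == "signin.aws.amazon.com":
        # The us-east-1 sign-in endpoint carries no Region code.
        return "us-east-1"
    for suffix in _SUFFIXES:
        if host.endswith(suffix):
            return host[: -len(suffix)]
    raise ValueError(f"unrecognized AWS sign-in or console URL: {url}")


def same_region(acs_url: str, relay_state_url: str) -> bool:
    """True when the ACS and RelayState URLs point at the same Region."""
    return region_of(acs_url) == region_of(relay_state_url)
```

For example, an ACS of https://us-west-2.signin.aws.amazon.com/saml paired with a RelayState of https://us-west-2.console.aws.amazon.com/ passes the check, while pairing it with a us-east-2 console URL does not.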

Configure IAM roles

Next, you will need to configure IAM roles’ trust policies for all federated human access roles with a list of all the regional AWS Sign-In endpoints that are necessary for federation resiliency. We recommend that your trust policy contains all Regions where you operate. If you operate in only one Region, you can get the same resiliency benefits by configuring an additional endpoint. For example, if you operate only in us-east-1, configure a second endpoint, such as us-west-2. Even if you have no workloads in that Region, you can switch your IdP to us-west-2 for failover. You can log in through AWS federation by using the us-west-2 SAML sign-in endpoint and access your us-east-1 AWS resources.

To configure IAM roles

Log in to the AWS Management Console with credentials to administer IAM. If this is your first time creating the identity provider trust in AWS, follow the steps in Creating IAM SAML identity providers to create the identity providers.

Next, create or update IAM roles for federated access. For each IAM role, update the trust policy that lists the regional SAML sign-in endpoints. Include at least two for increased resiliency.

The following example is a role trust policy that allows the role to be assumed by a SAML provider coming from any of the four US Regions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam:::saml-provider/IdP"
            },
            "Action": "sts:AssumeRoleWithSAML",
            "Condition": {
                "StringEquals": {
                    "SAML:aud": [
                        "https://us-east-2.signin.aws.amazon.com/saml",
                        "https://us-west-1.signin.aws.amazon.com/saml",
                        "https://us-west-2.signin.aws.amazon.com/saml",
                        "https://signin.aws.amazon.com/saml"
                    ]
                }
            }
        }
    ]
}
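If you maintain many federated roles, you may prefer to generate this trust policy rather than edit it by hand. The sketch below builds the same document structure from a list of Region codes. The function names are illustrative, the provider ARN is supplied by the caller, and only the public AWS Regions are assumed.

```python
import json
from typing import Iterable


def _endpoint(region_code: str) -> str:
    # us-east-1 uses the endpoint without a Region code in its DNS name.
    if region_code == "us-east-1":
        return "https://signin.aws.amazon.com/saml"
    return f"https://{region_code}.signin.aws.amazon.com/saml"


def saml_trust_policy(provider_arn: str, region_codes: Iterable[str]) -> dict:
    """Build a role trust policy that accepts several sign-in endpoints."""
    audiences = [_endpoint(code) for code in region_codes]
    if len(audiences) < 2:
        # The post recommends at least two endpoints for resiliency.
        raise ValueError("include at least two Regions")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithSAML",
                "Condition": {"StringEquals": {"SAML:aud": audiences}},
            }
        ],
    }


def to_json(policy: dict) -> str:
    """Render the policy as indented JSON, ready to paste or upload."""
    return json.dumps(policy, indent=4)
```

Calling saml_trust_policy with your SAML provider ARN and the Region codes us-east-2, us-west-1, us-west-2, and us-east-1 yields the same audience list as the example above.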

When you use a regional SAML sign-in endpoint, the corresponding regional AWS Security Token Service (AWS STS) endpoint is also used when you assume an IAM role. If you are using service control policies (SCPs) in AWS Organizations, check that there are no SCPs denying the regional AWS STS service. Such an SCP would prevent the federated principal from obtaining an AWS STS token.

Switch regional SAML sign-in endpoints

In the event that the regional SAML sign-in endpoint your ACS is configured to use becomes unavailable, you can reconfigure your IdP to point to another regional SAML sign-in endpoint. After you’ve configured your IdP and IAM role trust policies as described in the previous two sections, you’re ready to change to a different regional SAML sign-in endpoint. The following high-level steps provide guidance on switching the regional SAML sign-in endpoint.

To switch regional SAML sign-in endpoints

Change the configuration in the IdP to point to a different endpoint by changing the value for the ACS.
Change the configuration for the RelayState value to match the Region of the ACS.
Log in with your federated identity. In the browser, you should see the new ACS URL when you are prompted to choose an IAM role.

Figure 1: New ACS URL

The steps to reconfigure the ACS and RelayState will be different for each IdP. Refer to the vendor’s IdP documentation for more information.

Conclusion

In this post, you learned how to configure multiple regional SAML sign-in endpoints as a best practice to further increase resiliency for federated access into your AWS environment. Check out the updates to the documentation for AWS Sign-In endpoints to help you choose the right configuration for your use case. Additionally, AWS has updated the metadata XML configuration for the public, GovCloud, and China AWS Regions to include all sign-in endpoints.

The simplest way to get started with SAML federation is to use AWS Single Sign-On (AWS SSO). AWS SSO helps manage your permissions across all of your AWS accounts in AWS Organizations.

If you have any questions, please post them in the Security Identity and Compliance re:Post topic or reach out to AWS Support.

Want more AWS Security news? Follow us on Twitter.

How to automate AWS Managed Microsoft AD scaling based on utilization metrics

2021-12-02 Dennis Rothmel

Post Syndicated from Dennis Rothmel original https://aws.amazon.com/blogs/security/how-to-automate-aws-managed-microsoft-ad-scaling-based-on-utilization-metrics/

AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD) provides a fully managed service for Microsoft Active Directory (AD) in the AWS cloud. When you create your directory, AWS deploys two domain controllers in separate Availability Zones that are exclusively yours for high availability. For use cases requiring even higher resilience and performance, in a specific Region or during specific hours, AWS Managed Microsoft AD allows you to scale by deploying additional domain controllers to meet your needs. These domain controllers can help load-balance, increase overall performance, or simply provide additional nodes to protect against temporary availability issues. AWS Managed Microsoft AD allows you to define the correct number of domain controllers for your directory based on your individual use case.

This post will walk you through how to automate scaling in AWS Managed Microsoft AD using utilization metrics from your directory. You’ll do this using Amazon CloudWatch Alarms, SNS notifications, and a Lambda function to increase the number of domain controllers in your directory based on utilization peaks.

Simplified directory scaling

AWS Managed Microsoft AD has now simplified this directory scaling process by integrating with Amazon CloudWatch metrics. This new integration enables you to:

Analyze your directory to identify expected average and peak directory utilization
Scale your directory based on utilization data to adequately address the expected load
Automate the addition of domain controllers to handle unexpected load.

Integration is available for both domain controller utilization metrics such as CPU, Memory, Disk and Network, and for AD-specific metrics, such as LDAP searches, binds, DNS queries, and Directory reads/writes. Analyzing this data over time to identify expected average and peak utilization on your directory can help you deploy additional domain controllers in Regions that need them. Once you’ve established this utilization baseline, you can deploy additional domain controllers to service this load, and configure alarms for anything exceeding this baseline.

Solution overview

In this example, our AWS Managed Microsoft AD has the default two domain controllers; once your utilization threshold is reached, you’ll add one additional domain controller (domain controller 3 in the diagram) to cover this additional load.

Figure 1: Solution overview

To create a CloudWatch Alarm with SNS topic notifications

In the AWS Console, navigate to CloudWatch
Choose Metrics to see the Browse Metrics panel
Choose the Directory Service namespace, then choose AWS Managed Microsoft AD.
In the Directory ID column, select your directory and check search for this only.
In the Metric Category column, select Processor and check add to search. This view will show the processor utilization for your directory.

Figure 2. Processor utilization metrics

To see the average utilization across all domain controllers, choose Add Math, then All Functions, then AVG to create a metric math expression for average CPU utilization across all domain controllers.

Figure 3. Adding a math function to compute average

Next, choose the Graphed Metrics tab in the CloudWatch metrics console, select the newly created expression, then select the bell icon from the Actions column to create a CloudWatch alarm based on this metric.

Figure 4. Create a CloudWatch Alarm using Metric Math Expression

Configure the threshold alarm to trigger when CPU utilization exceeds 70%.

In the Metrics section, under Period, choose 1 Hour.
In the Conditions section, under Threshold Type, choose Static. Under Define the alarm condition, choose Greater than threshold. Under Define the threshold value, enter 70. See Figure 5 for an image of how alarm parameters should look on your screen. Choose Next to Configure actions.

Figure 5. Configure the alarm parameters

On the Configure actions screen, configure the actions using the parameters listed below to send an email notification when the alarm state is triggered. See Figure 6 for an image of how email notifications are configured.

In the Notification section, set Alarm state trigger to In alarm. Set Select an SNS topic to Create topic. Fill in the name of the alarm in the Create a new topic field, and add the email where notifications should be sent to the Email endpoints that will receive notification field. An email address is required to create the SNS topic and you should use an email address that’s accessible by your operations team. This SNS topic will be used to trigger the Lambda automation described in a later section. Note: make a note of the SNS topic name you chose; you will use it later when creating the Lambda function in the To create an AWS Lambda function to automate scale out procedure below.

Figure 6. Create SNS topic and email notification

In the Alarm name field, provide a name for the alarm. You can optionally also add an Alarm description. Choose Next.
Review your configuration, and choose Create alarm to create the alarm.

Once you’ve completed these steps, you will have an alarm that fires when average CPU utilization across both domain controllers exceeds 70%. When your directory is experiencing a heavy load, the alarm publishes to the SNS topic, which starts the Lambda automation and sends an informational email notification. In the next section, we’ll configure an AWS Lambda function to automate the addition of a domain controller based on this SNS topic.
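The console steps above can also be expressed as parameters for the CloudWatch PutMetricAlarm API. The sketch below only builds the parameter dictionary; it assumes you look up the exact namespace, metric name, and dimensions for each domain controller in the CloudWatch console and pass them in, because those identifiers are not given in this post. The function name and the sample metric values are hypothetical placeholders.

```python
def build_cpu_alarm_params(alarm_name, topic_arn, dc_metrics,
                           threshold=70.0, period_seconds=3600):
    """Build put_metric_alarm parameters for an average-CPU alarm.

    dc_metrics: one dict per domain controller, each with the keys
    Namespace, MetricName, and Dimensions as shown in the console.
    """
    queries = []
    for index, metric in enumerate(dc_metrics, start=1):
        queries.append({
            "Id": f"m{index}",
            "MetricStat": {
                "Metric": metric,
                "Period": period_seconds,
                "Stat": "Average",
            },
            "ReturnData": False,  # only the math expression drives the alarm
        })
    queries.append({
        "Id": "avg_cpu",
        "Expression": "AVG(METRICS())",  # average across all domain controllers
        "Label": "Average CPU across domain controllers",
        "ReturnData": True,
    })
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": 1,
        "Metrics": queries,
        "AlarmActions": [topic_arn],
    }
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("cloudwatch").put_metric_alarm(**params). The 1 hour period and the threshold of 70 mirror the values chosen in the console procedure.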

For additional details on CloudWatch Alarms, please see the Amazon CloudWatch documentation.

To create an AWS Lambda function to automate scale out

The sample Lambda function shown below checks the number of domain controllers in this Region, and increases that by adding one additional domain controller. This procedure describes how to configure the IAM role required for this Lambda function, then how to deploy the Lambda function to execute when the alarm is triggered to automatically add a domain controller when your load exceeds your typical usage baseline.

Note: For additional details on Lambda creation, please see the AWS Lambda documentation.

To automate scale-out using AWS Lambda

In the AWS Console, navigate to IAM and choose Policies, then choose Create Policy.
Choose the JSON tab, and create a new IAM policy using the JSON provided below.

For more details on this configuration, see the AWS Directory Service documentation.

Sample policy

{
	"Version":"2012-10-17",
	"Statement":[
	{
		"Effect":"Allow",
		"Action":[
			"ds:DescribeDomainControllers",
			"ds:UpdateNumberOfDomainControllers",
			"ec2:DescribeSubnets",
			"ec2:DescribeVpcs",
			"ec2:CreateNetworkInterface",
			"ec2:DescribeNetworkInterfaces",
			"ec2:DeleteNetworkInterface"
		],
		"Resource":"*"
	}
	]
}

Choose Next:Tags to add tags (optional) before choosing Next:Review.
On the Create Policy screen, provide a name in the Name field. You can optionally also add a description. Choose Create policy to complete creating the new policy.

Note: make a note of the policy name you chose; you will use it later when updating the execution role for the Lambda function.

Figure 7. Provide a name to create the IAM policy

In the AWS Console, navigate to Lambda and choose Create Function
On the Create Function screen, select Author from Scratch and provide a Name, then choose Create Function.

Figure 8. Create a Lambda function

Once created, on the Lambda function’s page, choose the Configuration tab, then choose Permissions from the sidebar and choose the execution role name linked under Role name. This will open the IAM console in another tab, preloaded to your Lambda execution role.

Figure 9: Select the Execution Role

On the execution role screen, choose Attach policies and select the IAM policy you’ve just created (e.g. DirectoryService-DCNumber Update). On the Attach Permissions screen, choose Attach policy to complete updating the execution role. Once you’ve completed this step, you may close this tab and return to the previous browser tab.

Figure 10. Select and attach the IAM policy

On the Lambda function screen, choose the Configuration tab, then choose Triggers from the sidebar.
On the Add Trigger screen, choose the pulldown under Trigger configuration and select SNS. On the SNS topic box, select the SNS topic you created in Step 9 of the To create a CloudWatch Alarm with SNS topic notifications procedure above. Then choose Add to complete the trigger configuration.
On the Lambda function screen, choose the Configuration tab, then choose Environment variables from the sidebar.
On the Environment variables card, click Edit.
On the Edit environment variables screen, choose Add environment variables, enter “DIRECTORY_ID” as the Key, and enter the directory ID of your AWS Managed Microsoft AD as the Value.

Figure 11. The “Edit environment variables” screen

On the Lambda function screen, choose the Code tab to open the in-browser code editor experience inside the Code source card. Paste in the sample Lambda function code given below to complete the implementation.

Figure 12. Paste sample code to complete the Lambda function setup

Sample Lambda function code

The sample Lambda function given below automates adding another domain controller to your directory. When your CloudWatch alarm triggers, you will receive a notification email, and an additional domain controller will be deployed to provide the added capacity to support the increase in directory usage.

Note: The example code contains a variable for the maximum number of domain controllers (maxDcNum), to prevent you from over provisioning in the event of a missed configuration. This value is set to 3 for this blog post’s example and can be increased to suit your use case.

import json
import os

import boto3

# Upper bound on domain controllers, to prevent over-provisioning
# in the event of a misconfiguration.
maxDcNum = 3
# Default number of domain controllers; not used by the scale-out logic.
minDcNum = 2
# Change this to the Region of your directory.
region = "us-east-1"
# Directory ID, read from the DIRECTORY_ID environment variable configured earlier.
dsId = os.environ["DIRECTORY_ID"]

ds = boto3.client('ds', region_name=region)

def lambda_handler(event, context):

    # Get the current number of domain controllers
    dcs = ds.describe_domain_controllers(DirectoryId=dsId)
    DCcount = len(dcs["DomainControllers"])
    print(">>> Current number of DCs: " + str(DCcount))

    # Increase the number of DCs, up to the configured maximum
    if DCcount < maxDcNum:
        NewDCnumber = DCcount + 1
        ds.update_number_of_domain_controllers(DirectoryId=dsId, DesiredNumber=NewDCnumber)

        return {
            'statusCode': 200,
            'body': json.dumps("New DC number will be " + str(NewDCnumber))
        }
    else:
        return {
            'statusCode': 200,
            'body': json.dumps("Max number of DCs reached. The number of DCs is " + str(DCcount))
        }

Note: When testing this Lambda function, remember that this will increase the number of domain controllers for your directory in that Region. If the additional domain controller is not needed, please reduce the count after the test to avoid costs for an additional domain controller. The same principles used in this article to automate the addition of domain controllers can be applied to automate the reduction of domain controllers and you should consider automating the reduction to optimize for resilience, performance and cost.
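As a starting point for automating the reduction as well, the decision logic can be separated from the API calls. The sketch below is an assumption-laden illustration, not part of the original sample: it assumes the alarm also publishes to the same SNS topic when it returns to the OK state, it reads the NewStateValue field from the CloudWatch alarm notification carried in the SNS message, and the function names are hypothetical.

```python
import json


def alarm_state(event):
    """Extract the alarm state (ALARM, OK, INSUFFICIENT_DATA) from an SNS event."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message.get("NewStateValue")


def desired_dc_count(current, state, min_dc=2, max_dc=3):
    """Return the target number of domain controllers for an alarm state.

    Scale out by one on ALARM, scale in by one on OK, and always stay
    within the [min_dc, max_dc] range.
    """
    if state == "ALARM":
        target = current + 1
    elif state == "OK":
        target = current - 1
    else:
        target = current  # unknown or insufficient data: change nothing
    return max(min_dc, min(max_dc, target))
```

The handler would call update_number_of_domain_controllers only when the returned value differs from the current count. Scaling in immediately on OK can cause the count to oscillate under fluctuating load, so a longer evaluation period or a separate low-utilization alarm may be a better trigger for reduction.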

Conclusion

In this post, you’ve implemented alarms based on domain controller utilization thresholds using Amazon CloudWatch, and automation to increase the number of domain controllers using AWS Lambda functions. This solution helps to cost-effectively improve resilience and performance of your directory, by scaling your directory based on historical load patterns.

To learn more about using AWS Managed Microsoft AD, visit the AWS Directory Service documentation. For general information and pricing, see the AWS Directory Service home page. If you have comments about this blog post, submit a comment in the Comments section below. If you have implementation or troubleshooting questions, start a new thread on the Directory Service forum or contact AWS Support.


Keeping Netflix Reliable Using Prioritized Load Shedding

2020-11-02 Netflix Technology Blog

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure

By Manuel Correa, Arthur Gonigberg, and Daniel West

Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Everyone slows to a crawl, sometimes for a minor issue or sometimes for no reason at all. As engineers at Netflix, we are constantly reevaluating how to redesign traffic management. What if we knew the urgency of each traveler and could selectively route cars through, rather than making everyone wait?

In Netflix engineering, we’re driven by ensuring Netflix is there when you need it to be. Yet, as recently as last year, our systems were susceptible to metaphorical traffic jams; we had on/off circuit breakers, but no progressive way to shed load. Motivated by improving the lives of our members, we’ve introduced priority-based progressive load shedding.

The animation below shows the behavior of the Netflix viewer experience when the backend is throttling traffic based on priority. While the lower priority requests are throttled, the playback experience remains uninterrupted and the viewer is able to enjoy their title. Let’s dig into how we accomplished this.

Failure can occur for a myriad of reasons: misbehaving clients that trigger a retry storm, an under-scaled service in the backend, a bad deployment, a network blip, or issues with the cloud provider. All such failures can put a system under unexpected load, and at some point in the past, every single one of these examples has prevented our members from playing. With these incidents in mind, we set out to make Netflix more resilient with these goals:

Consistently prioritize requests across device types (Mobile, Browser, and TV)
Progressively throttle requests based on priority
Validate assumptions by using Chaos Testing (deliberate fault injection) for requests of specific priorities

The resulting architecture that we envisioned with priority throttling and chaos testing included is captured below.

High level playback architecture with priority throttling and chaos testing

Building a request taxonomy

We decided to focus on three dimensions in order to categorize request traffic: throughput, functionality, and criticality. Based on these characteristics, traffic was classified into the following:

NON_CRITICAL: This traffic does not affect playback or members’ experience. Logs and background requests are examples of this type of traffic. These requests are usually high throughput which contributes to a large percentage of load in the system.
DEGRADED_EXPERIENCE: This traffic affects members’ experience, but not the ability to play. The traffic in this bucket is used for features like: stop and pause markers, language selection in the player, viewing history, and others.
CRITICAL: This traffic affects the ability to play. Members will see an error message when they hit play if the request fails.

Using attributes of the request, the API gateway service (Zuul) categorizes the requests into NON_CRITICAL, DEGRADED_EXPERIENCE and CRITICAL buckets, and computes a priority score between 1 and 100 for each request given its individual characteristics. The computation is done as a first step so that it is available for the rest of the request lifecycle.

Most of the time, the request workflow proceeds normally without taking the request priority into account. However, as with any service, sometimes we reach a point when either one of our backends is in trouble or Zuul itself is in trouble. When that happens requests with higher priority get preferential treatment. The higher priority requests will get served, while the lower priority ones will not. The implementation is analogous to a priority queue with a dynamic priority threshold. This allows Zuul to drop requests with a priority lower than the current threshold.

Finding the best place to throttle traffic

Zuul can apply load shedding in two moments during the request lifecycle: when it routes requests to a specific back-end service (service throttling) or at the time of initial request processing, which affects all back-end services (global throttling).

Service throttling

Zuul can sense when a back-end service is in trouble by monitoring the error rates and concurrent requests to that service. Those two metrics are approximate indicators of failures and latency. When the threshold percentage for one of these two metrics is crossed, we reduce load on the service by throttling traffic.

Global throttling

Another case is when Zuul itself is in trouble. As opposed to the scenario above, global throttling will affect all back-end services behind Zuul, rather than a single back-end service. The impact of this global throttling can cause much bigger problems for members. The key metrics used to trigger global throttling are CPU utilization, concurrent requests, and connection count. When any of the thresholds for those metrics are crossed, Zuul will aggressively throttle traffic to keep itself up and healthy while the system recovers. This functionality is critical: if Zuul goes down, no traffic can get through to our backend services, resulting in a total outage.

Introducing priority-based progressive load shedding

Once we had the prioritization piece in place, we were able to combine it with our load shedding mechanism to dramatically improve streaming reliability. When we’re in a bad situation (i.e. any of the thresholds above are exceeded), we progressively drop traffic, starting with the lowest priority. A cubic function is used to manage the level of throttling. If things get really, really bad the level will hit the sharp side of the curve, throttling everything.

The graph above is an example of how the cubic function is applied. As the overload percentage increases (i.e. the range between the throttling threshold and the max capacity), the priority threshold trails it very slowly: at 35%, it’s still in the mid-90s. If the system continues to degrade, we hit priority 50 at 80% exceeded and then eventually 10 at 95%, and so on.
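The relationship between overload and the priority threshold can be sketched in a few lines. This is an illustration under stated assumptions, not Zuul’s actual implementation: the exact cubic is not given in the post, so the sketch uses threshold = 100 · (1 − overload³), which roughly reproduces the quoted points (mid-90s at 35%, about 50 at 80%, low teens at 95%). It also assumes that a lower score means a more critical request, which is what the falling threshold implies. All function names are hypothetical.

```python
def overload_fraction(load, throttle_start, max_capacity):
    """Map a load reading to 0.0..1.0 between the throttling threshold and max capacity."""
    if max_capacity <= throttle_start:
        raise ValueError("max_capacity must exceed throttle_start")
    fraction = (load - throttle_start) / (max_capacity - throttle_start)
    return min(1.0, max(0.0, fraction))


def priority_threshold(overload):
    """Assumed cubic curve: 100 with no overload, falling steeply to 0 at full overload."""
    return 100.0 * (1.0 - overload ** 3)


def should_shed(priority, overload):
    """Shed requests whose score is above the threshold (lower score = more critical)."""
    return priority > priority_threshold(overload)


# Example: thresholds at the overload levels mentioned in the text.
for level in (0.35, 0.80, 0.95):
    print(level, round(priority_threshold(level), 1))
```

With this shape, mild overload sheds only the least critical requests, while the steep end of the curve sheds nearly everything once the system approaches maximum capacity.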

Given that a relatively small number of requests impact streaming availability, throttling low priority traffic may affect certain product features but will not prevent members from pressing “play” and watching their favorite show. By adding progressive priority-based load shedding, Zuul can shed enough traffic to stabilize services without members noticing.

Handling retry storms

When Zuul decides to drop traffic, it sends a signal to devices to let them know that we need them to back off. It does this by indicating how many retries they can perform and what kind of time window they can perform them in. For example:

{ "maxRetries" : <max-retries>, "retryAfterSeconds": <seconds> }

Using this backpressure mechanism, we can stop retry storms much faster than we could in the past. We automatically adjust these two dials based on the priority of the request. Requests with higher priority will retry more aggressively than lower ones, also increasing streaming availability.
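One way to picture how the two dials could vary with priority is a simple linear mapping. The numbers below (up to 3 retries, waits between 1 and 30 seconds) are hypothetical placeholders, as is the function name; the post does not state the real values or formula. The sketch also assumes that a lower score means a more critical request.

```python
def retry_hint(priority, max_retries_cap=3, min_wait=1, max_wait=30):
    """Build a backpressure hint for a shed request.

    More critical requests (lower score) get more retries and a shorter
    wait; less critical requests get fewer retries and a longer wait.
    """
    clamped = min(max(priority, 1), 100)
    position = (clamped - 1) / 99  # 0.0 = most critical, 1.0 = least critical
    return {
        "maxRetries": round(max_retries_cap * (1 - position)),
        "retryAfterSeconds": round(min_wait + (max_wait - min_wait) * position),
    }


# Example: hints for a critical and a non-critical request.
print(retry_hint(1))
print(retry_hint(100))
```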

Validating which requests are right for the job

To validate our request taxonomy assumptions on whether a specific request fell into the NON_CRITICAL, DEGRADED_EXPERIENCE, or CRITICAL bucket, we needed a way to test the user’s experience when that request was shed. To accomplish this, we leveraged our internal failure injection tool (FIT) and created a failure injection point in Zuul that allowed us to shed any request based on a supplied priority. This enabled us to manually simulate a load-shed experience by blocking ranges of priorities for a specific device or member, giving us an idea of which requests could be safely shed without impacting the user.

Continually ensuring those requests are still right for the job

One of the goals here is to reduce members’ pain by shedding requests that are not expected to affect the user’s streaming experience. However, Netflix changes quickly and requests that were thought to be noncritical can unexpectedly become critical. In addition, Netflix has a wide variety of client devices, client versions, and ways to interact with the system. To make sure we weren’t causing members pain when throttling NON_CRITICAL requests in any of these scenarios, we leveraged our infrastructure experimentation platform ChAP.

This platform allows us to stage an A/B experiment that will allocate a small number of production users to either a control or treatment group for 45 minutes while throttling a range of priorities for the treatment group. This lets us capture a variety of live use cases and measure the impact to their playback experience. ChAP analyzes the members’ KPIs per device to determine if there is a deviation between the control and the treatment groups.

In our first experiment, we detected a race condition in both Android and iOS devices for a low priority request that caused sporadic playback errors. Since we practice continuous experimentation, once the initial experiments were run and the bugs were fixed, we scheduled them to run on a periodic basis. This allows us to detect regressions early and keep users streaming.

Experiment regression detection before and after fix (SPS indicates streaming availability)

Reaping the benefits

In 2019, before progressive load shedding was in place, the Netflix streaming services experienced an outage that left a sizable percentage of members unable to play for a period of time. In 2020, days after the implementation was deployed, the team started seeing the benefit of the solution. Netflix experienced a similar issue with the same potential impact as the outage seen in 2019. Unlike then, Zuul’s progressive load shedding kicked in and started shedding traffic until the service was in a healthy state without impacting members’ ability to play at all.

The graph below shows a stable streaming availability metric stream per second (SPS) while Zuul is performing progressive load shedding based on request priority during the incident. The different colors in the graph represent requests with different priority being throttled.

Members were happily watching their favorite show on Netflix while the infrastructure was self-recovering from a system failure.

We are not done yet

For future work, the team is looking into expanding the use of request priority for other use cases like better retry policies between devices and back-ends, dynamically changing load shedding thresholds, tuning the request priorities using Chaos Testing as a guiding principle, and other areas that will make Netflix even more resilient.

If you’re interested in helping Netflix stay up in the face of shifting systems and unexpected failures, reach out to us. We’re hiring!

Keeping Netflix Reliable Using Prioritized Load Shedding was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
