Tag Archives: fault injection

Top Architecture Blog Posts of 2023

Post Syndicated from Andrea Courtright original https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2023/

2023 was a rollercoaster year in tech, and we at the AWS Architecture Blog feel so fortunate to have shared in the excitement. As we move into 2024 and all of the new technologies we could see, we want to take a moment to highlight the brightest stars from 2023.

As always, thanks to our readers and to the many talented and hardworking Solutions Architects and other contributors to our blog.

I give you our 2023 cream of the crop!

#10: Build a serverless retail solution for endless aisle on AWS

In this post, Sandeep and Shashank help retailers and their customers alike in this guided approach to finding inventory that doesn’t live on shelves.

Building endless aisle architecture for order processing

Figure 1. Building endless aisle architecture for order processing

Check it out!

#9: Optimizing data with automated intelligent document processing solutions

Who else dreads wading through large amounts of data in multiple formats? Just me? I didn’t think so. Using Amazon AI/ML and content-reading services, Deependra, Anirudha, Bhajandeep, and Senaka have created a solution that is scalable and cost-effective to help you extract the data you need and store it in a format that works for you.

AI-based intelligent document processing engine

Figure 2: AI-based intelligent document processing engine

Check it out!

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

Disaster recovery posts are always popular, and this post by Brent and Dhruv is no exception. Their creative approach in part 3 of this series is most helpful for customers who have business-critical workloads with higher availability requirements.

Warm standby with managed services

Figure 3. Warm standby with managed services

Check it out!

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

Continuing with the theme of “when bad things happen,” we have Siva, Elamaran, and Re’s post about preparing for workload failures. If resiliency is a concern (and it really should be), the secret is test, test, TEST.

Architecture flow for Microservices to simulate a realistic failure scenario

Figure 4. Architecture flow for Microservices to simulate a realistic failure scenario

Check it out!

#6: Let’s Architect! Designing event-driven architectures

Luca, Laura, Vittorio, and Zamira weren’t content with their four top-10 spots last year – they’re back with some things you definitely need to know about event-driven architectures.

Let's Architect

Figure 5. Let’s Architect artwork

Check it out!

#5: Use a reusable ETL framework in your AWS lake house architecture

As your lake house increases in size and complexity, you could find yourself facing maintenance challenges, and Ashutosh and Prantik have a solution: frameworks! The reusable ETL template with AWS Glue templates might just save you a headache or three.

Reusable ETL framework architecture

Figure 6. Reusable ETL framework architecture

Check it out!

#4: Invoking asynchronous external APIs with AWS Step Functions

It’s possible that AWS’ menagerie of services doesn’t have everything you need to run your organization. (Possible, but not likely; we have a lot of amazing services.) If you are using third-party APIs, then Jorge, Hossam, and Shirisha’s architecture can help you maintain a secure, reliable, and cost-effective relationship among all involved.

Invoking Asynchronous External APIs architecture

Figure 7. Invoking Asynchronous External APIs architecture

Check it out!

#3: Announcing updates to the AWS Well-Architected Framework

The Well-Architected Framework continues to help AWS customers evaluate their architectures against its six pillars. They are constantly striving for improvement, and Haleh’s diligence in keeping us up to date has not gone unnoticed. Thank you, Haleh!

Well-Architected logo

Figure 8. Well-Architected logo

Check it out!

#2: Let’s Architect! Designing architectures for multi-tenancy

The practically award-winning Let’s Architect! series strikes again! This time, Luca, Laura, Vittorio, and Zamira were joined by Federica to discuss multi-tenancy and why that concept is so crucial for SaaS providers.

Let's Architect

Figure 9. Let’s Architect

Check it out!

And finally…

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Haresh, Lewis, and Bonnie revamped this 2022 post into a masterpiece that completely stole our readers’ hearts and is among the top posts we’ve ever made!

Resilience patterns and trade-offs

Figure 10. Resilience patterns and trade-offs

Check it out!

Bonus! Three older special mentions

These three posts were published before 2023, but we think they deserve another round of applause because you, our readers, keep coming back to them.

Thanks again to everyone for their contributions during a wild year. We hope you’re looking forward to the rest of 2024 as much as we are!

Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

Post Syndicated from Siva Guruvareddiar original https://aws.amazon.com/blogs/architecture/simulating-kubernetes-workload-az-failures-with-aws-fault-injection-simulator/

In highly distributed systems, it is crucial to ensure that applications function correctly even during infrastructure failures. One common infrastructure failure scenario is when an entire Availability Zone (AZ) becomes unavailable. Applications are often deployed across multiple AZs to ensure high availability and fault tolerance in cloud environments such as Amazon Web Services (AWS).

Kubernetes helps manage and deploy applications across multiple nodes and AZs, though it can be difficult to test how your applications will behave during an AZ failure. This is where fault injection simulators come in. The AWS Fault Injection Simulator (AWS FIS) service can intentionally inject faults or failures into a system to test its resilience. In this blog post, we will explore how to use an AWS FIS to simulate an AZ failure for Kubernetes workloads.

Solution overview

To ensure that Kubernetes cluster workloads are architected to handle failures, you must test their resilience by simulating real-world failure scenarios. Kubernetes allows you to deploy workloads across multiple AZs to handle failures, but it’s still important to test how your system behaves during AZ failures. To do this, we use a microservice for product details with the aim of running this microservice using auto-scaling with both Cluster Autoscaler (CA, from Kubernetes community) and Karpenter and test how the system responds to varying traffic levels.

This blog post explores a load test to mimic the behavior of hundreds of users accessing the service concurrently to simulate a realistic failure scenario. This test uses AWS FIS to disrupt network connectivity, and simulate AZ failure in a controlled manner. This allows us to measure how users are impacted when using CA and then with Karpenter.

Both CA and Karpenter automatically adjust the size of a cluster based on the resource requirements of the running workloads. By comparing the performance of the microservice under these two autoscaling tools, we can determine which tool is better-suited to handle such scenarios.

Figure 1 demonstrates the solution’s architecture.

Architecture flow for Microservices to simulate a realistic failure scenario

Figure 1. Architecture flow for microservices to simulate a realistic failure scenario


Install the following utilities on a Linux-based host machine, which can be an Amazon Elastic Compute Cloud (Amazon EC2) instance, AWS Cloud9 instance, or a local machine with access to your AWS account:

Setting up a microservice environment

This blog post consists of two major parts: Bootstrap and experiment. The bootstrap section provides step-by-step instructions for:

  • Creating and deploying a sample microservice
  • Creating an AWS IAM role for the FIS service
  • Creating an FIS experiment template

By following these bootstrap instructions, you can set up your own environment to test the different autoscaling tools’ performance in Kubernetes.

In the experiment section, we showcase how the system behaves with CA, then Karpenter.

Let’s start by setting a few environment variables using the following code:

export FIS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export FIS_AWS_REGION=us-west-2
export FIS_CLUSTER_NAME="fis-simulation-cluster"

Next, clone the sample repository which contains the code for our solution:

git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd ./containers-blog-maelstrom/fis-simulation-blog

Step 1. Bootstrap the environment

This solution uses Amazon EKS for AWS Cloud Development Kit (AWS CDK) Blueprints to provision our Amazon EKS cluster.

The first step to any AWS CDK deployment is bootstrapping the environment. cdk bootstrap is an AWS Command Line Interface (AWS CLI) tool that prepares the environment with resources required by AWS CDK to perform deployments into that environment (for example, a combination of AWS account and AWS Region).

Let’s run the below commands to bootstrap your environment and install all node dependencies required for deploying the solution:

npm install
cdk bootstrap aws://$FIS_ACCOUNT_ID/$FIS_AWS_REGION

We’ll use Amazon EKS Blueprints for CDK to create an Amazon EKS cluster and deploy add-ons. This stack deploys the following add-ons into the cluster:

  • AWS Load Balancer Controller
  • Core DNS
  • Kube-proxy

Step 2. Create an Amazon EKS cluster

Run the below command to deploy the Amazon EKS cluster:

npm install
cdk deploy "*" --require-approval never

Deployment takes approximately 20-30 minutes; then you will have a fully functioning Amazon EKS cluster in your account.


Deployment time: 1378.09s

Copy and run the aws eks update-kubeconfig ... command from the output section to gain access to your Amazon EKS cluster using kubectl.

Step 3. Deploy a microservice to Amazon EKS

Use the code from the following Github repository and deploy using Helm.

git clone https://github.com/aws-containers/eks-app-mesh-polyglot-demo.git
helm install workshop eks-app-mesh-polyglot-demo/workshop/helm-chart/

Note: You are not restricted to this one as a mandate. If you have other microservices, use the same. This command deploys the following three microservices:

  1. Frontend-node as the UI to the product catalog application
  2. Catalog detail backend
  3. Product catalog backend

To test the resiliency, let’s take one of the microservice viz productdetail, the backend microservice as an example. When checking the status of a service like the following, you will see that proddetail is of type ClusterIP, which is accessible only within the cluster. To access this outside of the cluster, perform the following steps.

kubectl get service proddetail -n workshop
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE                                                                   5000/TCP       11h
proddetail    ClusterIP   <none>        3000/TCP    11m

Create ingress class

cat <<EOF | kubectl create -f -
apiVersion: networking.k8s.io/v1
kind: IngressClass
  name: aws-alb
  controller: ingress.k8s.aws/alb  

Create ingress resource

cat <<EOF | kubectl create -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
  namespace: workshop
  name: proddtl-ingress
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
  ingressClassName: aws-alb
  - http:
      - path: /
        pathType: Prefix
            name: proddetail
               number: 3000

After this, your web URL is ready:

kubectl get ingress -n workshop
NAME              CLASS     HOSTS   ADDRESS                                                                  PORTS   AGE
proddtl-ingress   aws-alb   *       k8s-workshop-proddtli-166014b35f-354421654.us-west-1.elb.amazonaws.com   80      14s

Test the connectivity from your browser:

Testing the connectivity from your browser

Figure 2. Testing the connectivity from your browser

Step 4. Create an IAM Role for AWS FIS

Before an AWS FIS experiment, create an IAM role. Let’s create a trust policy and attach as shown here:

cat > fis-trust-policy.json << EOF
    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Principal": {
                "Service": [
            "Action": "sts:AssumeRole"
aws iam create-role --role-name my-fis-role --assume-role-policy-document file://permissons/fis-trust-policy.json

Create an AWS FIS policy and attach

aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorNetworkAccess
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEKSAccess
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorSSMAccess

Step 5: Create an AWS FIS experiment

Use AWS FIS to create an experiment to disrupt the network connectivity as below. Use the following experiment template with the IAM role created from the previous step:

Experiment template

Figure 3. Experiment template

Step 6. Failure simulation with AWS FIS on CA and Karpenter

We’ve completed microservice setup, made it internet-accessible, and created an AWS FIS template to simulate failures. Now let’s experiment with how the system behaves with different autoscalers: CA and Karpenter.

With the microservice available within Amazon EKS cluster, we’ll use Locust to simulate user behavior with a total of 100 users trying to access the URLs concurrently.

For the following experiments, Run 1 shows 100 users trying to access the service without any system disruptions. We’ll then move to AWS FIS and disrupt network connectivity in Run 2. Measuring the user impact and comparing it to the results from the first run provides insights on how the system responds to failures and can be improved for greater reliability and performance.

Simulating failures with Cluster Autoscaler (CA)

To perform this experiment, select your experiment template and click Start experiment, then enter start in the field. Currently 12 replicas of the proddetail microservice are running.

As the following Locust charts detail, Run 1 completed without failures. Run 2 simulated network connectivity disruption, resulting in a visible failure rate of 4 percent, with a peak of 7 failures at one time.

For this experiment, we used a total of seven nodes of type t3.small. Use eks-node-viewer to visualize dynamic node usage within a cluster.

CA experiment results

Figure 4. CA experiment results

Dynamic node usage within cluster

Figure 5. Dynamic node usage within cluster

Simulating failures with Karpenter

Continuing the same experiment with 12 replicas of the proddetail microservice, this time we are using Karpenter. As in the following figures, the cluster uses a combination of t3.small and “C” and “M” instances provided in Karpenter’s provisioner configuration.

In Run 1, we observe 0 failures. In Run 2, when network connectivity was disrupted by AWS FIS, Karpenter was able to maintain user requests with almost 0 percent failure. This outcome highlights the effectiveness of Karpenter as an autoscaler for maintaining high availability by carefully configuring the provisioner.

Karpenter experiment results

Figure 6. Karpenter experiment results

Dynamic node usage within cluster

Figure 7. Dynamic node usage within cluster


Use the following commands to clean up your experiment environment.

#delete Ingress resources
kubectl delete ingress proddtl-ingresss -n workshop
kubectl delete ingressclass aws-alb

#delete IAM resources
aws iam delete-role --role-name my-fis-role

#delete FIS resources
fis_template_id=`aws fis list-experiment-templates --region $FIS_AWS_REGION |jq ".experimentTemplates[0].id"`
aws fis delete-experiment-template --id $fis_template_id --region $FIS_AWS_REGION

#delete application resources and cluster
helm uninstall workshop
cdk destroy


This experiment results show that Karpenter performs better and recovers quicker from network disrupt connectivity than Cluster Autoscaler. The figures in this blog post highlight Karpenter’s resiliency and ability to scale and recover from failures quickly.

While this experiment provides valuable insights into the performance and reliability of Kubernetes workloads in the face of failures, it’s important to acknowledge that this is not a true test of an AZ unavailable situation. In a real-world scenario, an AZ failure can have a cascading effect, potentially impacting other services that workloads depend upon. But simulating an AZ failure in a controlled environment helps you better understand how your Kubernetes cluster and applications will behave in an actual failure scenario. This knowledge can help you identify and address any issues before they occur in production, ensuring that your applications remain highly available and resilient.

In summary, this experiment provides good insights into the performance and resilience of Kubernetes workloads. It is not a perfect representation of a real-world AZ failure, but by leveraging tools such as AWS FIS and carefully configuring autoscaling policies, you can take proactive steps to optimize performance and ensure high availability for critical applications.

Perform Chaos Testing on your Amazon Aurora Cluster

Post Syndicated from Anthony Pasquariello original https://aws.amazon.com/blogs/architecture/perform-chaos-testing-on-your-amazon-aurora-cluster/

“Everything fails all the time” Werner Vogels, AWS CTO

In 2010, Netflix introduced a tool called “Chaos Monkey”, that was used for introducing faults in a production environment. Chaos Monkey led to the birth of Chaos engineering where teams test their live applications by purposefully injecting faults. Observations are then used to take corrective action and increase resiliency of applications.

In this blog, you will learn about the fault injection capabilities available in Amazon Aurora for simulating various database faults.

Chaos Experiments

Chaos experiments consist of:

  • Understanding the application baseline: The application’s steady-state behavior
  • Designing an experiment: Ask “What can go wrong?” to identify failure scenarios
  • Run the experiment: Introduce faults in the application environment
  • Observe and correct: Redesign apps or infrastructure for fault tolerance

Chaos experiments require fault simulation across distributed components of the application. Amazon Aurora provides a set of fault simulation capabilities that may be used by teams to exercise chaos experiments against their applications.

Amazon Aurora fault injection

Amazon Aurora is a fully managed database service that is compatible with MySQL and PostgreSQL. Aurora is highly fault tolerant due to its six-way replicated storage architecture. In order to test the resiliency of an application built with Aurora, developers can leverage the native fault injection features to design chaos experiments. The outcome of the experiments gives a better understanding of the blast radius, depth of monitoring required, and the need to evaluate event response playbooks.

In this section, we will describe the various fault injection scenarios that you can use for designing your own experiments. We’ll show you how to conduct the experiment and use the results. This will make your application more resilient and prepared for an actual event.

Note that availability of the fault injection feature is dependent on the version of MySQL and PostgreSQL.

Figure 1. Fault injection overview

Figure 1. Fault injection overview

1. Testing an instance crash

An Aurora cluster can have one primary and up to 15 read replicas. If the primary instance fails, one of the replicas becomes the primary. Applications must be designed to recover from these instance failures as soon as possible to have minimal impact on the end-user experience.

The instance crash fault injection simulates failure of the instance/dispatcher/node in the Aurora database cluster. Fault injection may be carried out on the primary or replicas by running the API against the target instance.

Example: Aurora PostgreSQL for instance crash simulation

The query following will simulate a database instance crash:

SELECT aurora_inject_crash ('instance' );

Since this is a simulation, it does not lead to a failover to the replica. As an alternative to using this API, you can carry out an actual failover by using the AWS Management Console or AWS CLI.

The team should observe the change in the application’s behavior to understand the impact of the instance failure. Take corrective actions to reduce the impact of such failures on the application.

A long recovery time on the application would require the team to reduce the Domain Name Service (DNS) time-to-live (TTL) for the DB connections. As a general best practice, the Aurora Database cluster should have at least one replica.

2. Testing the replica failure

Aurora manages asynchronous replication between cluster nodes within a cluster. The typical replication lag is under 100 milliseconds. Network slowness or issues on the nodes may lead to an increase in replication lag between writer and replica nodes.

The replica failure fault injection allows you to simulate replication failure across one or more replicas. Note that this type of fault injection applies only to a DB cluster that has at least one read replica.

Replica failure manifests itself as stale data read by the application that is connecting to the replicas. The specific functional impact on the application depends on the sensitivity to the freshness of data. Note that this fault injection mechanism does not apply to the native replication supported mechanisms in PostgreSQL and MySQL databases.

Example: Aurora PostgreSQL for replica failure

The statement following will simulate 100% failure of replica named ‘my-replica’ for 20 seconds.

SELECT aurora_inject_replica_failure(100, 20, ‘my-replica’)

The team must observe the behavior of the application from the data sensitivity perspective. If the observed lag is unacceptable, the team must evaluate corrective actions such as vertical scaling of database instances and query optimization. As a best practice, the team should monitor the replication lag and take proactive actions to address it.

3. Testing the disk failure

Aurora’s storage volume consists of six copies of data across three Availability Zones (refer the diagram preceding). Aurora has an inherent ability to repair itself for failures in the storage components. This high reliability is achieved by way of a quorum model. Reads require only 3/6 nodes and writes require 4/6 nodes to be available. However, there may still be transient impact on application depending on how widespread the issue.

The disk failure injection capability allows you to simulate failures of storage nodes and partial failure of disks. The severity of failure can be set as a percentage value. The simulation continues only for the specified amount of time. There is no impact on the actual data on the storage nodes and the disk.

Example: Aurora PostgreSQL for disk failure simulation

You may get the number of disks (for index) on your cluster using the query:

SELECT disks FROM aurora_show_volume_status()

The query following will simulate 75% failure on disk with index 15. The simulation will end in 20 seconds.

SELECT aurora_inject_disk_failure(75, 15, true, 20)

Applications may experience temporary failures due to this fault injection and should be able to gracefully recover from it. If the recovery time is higher than a threshold, or the application has a complete failure, the team can redesign their application.

4. Disk congestion fault

Disk congestion usually happens because of heavy I/O traffic against the storage devices. The impact may range from degraded application performance, to complete application failures.

Aurora provides the capability to simulate disk congestion without synthetic SQL load against the database. With this fault injection mechanism, you can gain a better understanding of the performance characteristics of the application under heavy I/O spikes.

Example: Aurora PostgreSQL for disk congestion simulation

You may get the number of disks (for index) on your cluster using the query:

SELECT disks FROM aurora_show_volume_status()

The query following will simulate a 100% disk failure for 20 seconds. The failure will be simulated on disk with index 15. Simulated delay will be between 30 and 40 milliseconds.

SELECT aurora_inject_disk_congestion(100, 15, true, 20, 30, 40)

If the observed behavior is unacceptable, then the team must carefully consider the load characteristics of their application. Depending on the observations, corrective action may include query optimization, indexing, vertical scaling of the database instances, and adding more replicas.


A chaos experiment involves injecting a fault in a production environment and then observing the application behavior. The outcome of the experiment helps the team identify application weaknesses and evaluate event response processes. Amazon Aurora natively provides fault-injection capabilities that can be used by teams to conduct chaos experiments for database failure scenarios. Aurora can be used for simulating instance failure, replication failure, disk failures, and disk congestion. Try out these capabilities in Aurora to make your applications more robust and resilient from database failures.