Microservices discovery using Amazon EC2 and HashiCorp Consul

Post Syndicated from Marine Haddad original https://aws.amazon.com/blogs/architecture/microservices-discovery-using-amazon-ec2-and-hashicorp-consul/

These days, large organizations typically have microservices environments that span across cloud platforms, on-premises data centers, and colocation facilities. The reasons for this vary but frequently include latency, local support structures, and historic architectural decisions. However, due to the complex nature of these environments, efficient mechanisms for service discovery and configuration management must be implemented to support operations at scale. This is an issue also faced by Nomura.

Nomura is a global financial services group with an integrated network spanning over 30 countries and regions. By connecting markets East & West, Nomura services the needs of individuals, institutions, corporates, and governments through its three business divisions: Retail, Investment Management, and Wholesale (Global Markets and Investment Banking). E-Trading Strategy Foreign Exchange sits within Global Markets and focuses on all quantitative analysis and technical aspects of electronic FX flows. The team builds a number of innovative solutions for clients, all of which need to operate in an ultra-low-latency environment to be competitive. The focus is on building high-quality, well-engineered platforms that can handle all aspects of Nomura's growing 24-hours-a-day, five-and-a-half-days-a-week FX business.

In this blog post, we share the solution we developed for Nomura and show how you can build a service discovery mechanism that uses a hierarchical rule-based algorithm. We use the flexibility of Amazon Elastic Compute Cloud (Amazon EC2) and third-party software, such as SpringBoot and Consul. The algorithm supports features such as service discovery by service name, Domain Name System (DNS) latency, and custom provided tags. This enables automated deployments, since services can auto-discover and connect with other services. Based on the provided tags, customers can implement environment boundaries so that a service doesn't connect to an unintended service. Finally, we built a failover mechanism, so that if a service becomes unavailable, an alternative service is provided (based on given criteria).

After reading this post, you can use the provided assets in the open source repository to deploy the solution in your own sandbox environment. The Terraform and Java code that accompanies this post can be amended as needed to suit individual requirements.

Overview of solution

The solution is composed of a microservices platform that is spread over two different data centers, and a Consul cluster per data center. We use two Amazon Virtual Private Clouds (VPCs) to model geographically distributed Consul "data centers". These VPCs are then connected via an AWS Transit Gateway. By permitting communication across the different data centers, the Consul clusters can form a wide-area network (WAN) and have visibility of service instances deployed to either. The SpringBoot microservices use the Spring Cloud Consul plugin to connect to the Consul cluster. We have built a custom configuration provider that uses the Amazon EC2 Instance Metadata Service (IMDS) to retrieve the configuration. The configuration provider mechanism is highly extensible, so anyone can build their own configuration provider.

The major components of this solution are:

    • Sample microservices built using Java and SpringBoot, and deployed in Amazon EC2 with one microservice instance per EC2 instance
    • A Consul cluster per Region with one Consul agent per EC2 instance
    • A custom service discovery algorithm
Figure 1. Multi-VPC infrastructure architecture

A typical flow for a microservice is to: (1) boot up, (2) retrieve relevant information, such as tags, from the EC2 Instance Metadata Service, and (3) use it to register itself with Consul. Once a service is registered with Consul, it can discover services to integrate with, and it can be discovered by other services.
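The repository implements this flow in Java with SpringBoot and the Spring Cloud Consul plugin. As a hedged, language-agnostic illustration of the same idea, the short Python sketch below reads metadata and tags from IMDSv2 and registers the service, with its tags, against the local Consul agent's HTTP API. The tag names, port, and health check path are hypothetical, and reading tags from IMDS assumes instance metadata tags are enabled on the instance.

import json
import urllib.request

IMDS = "http://169.254.169.254/latest"
CONSUL_AGENT = "http://localhost:8500"

def imds_token():
    # IMDSv2: request a session token before reading any metadata
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

def imds_get(path, token):
    req = urllib.request.Request(f"{IMDS}{path}", headers={"X-aws-ec2-metadata-token": token})
    return urllib.request.urlopen(req, timeout=2).read().decode()

def register_with_consul():
    token = imds_token()
    instance_id = imds_get("/meta-data/instance-id", token)
    az = imds_get("/meta-data/placement/availability-zone", token)
    # Tag paths below require "instance metadata tags" to be enabled; names are illustrative
    customer = imds_get("/meta-data/tags/instance/Customer", token)
    cluster = imds_get("/meta-data/tags/instance/Cluster", token)

    registration = {
        "ID": f"pricer-{instance_id}",
        "Name": "pricer",
        "Port": 9090,  # hypothetical service port
        "Tags": [f"customer={customer}", f"cluster={cluster}", f"az={az}", "environment=DEV"],
        "Check": {"HTTP": "http://localhost:9090/health", "Interval": "10s"},
    }
    req = urllib.request.Request(
        f"{CONSUL_AGENT}/v1/agent/service/register",
        data=json.dumps(registration).encode(),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    register_with_consul()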

An important component of this service discovery mechanism is a custom algorithm that performs service discovery based on the tags created when registering the service with Consul.

Figure 2. Service discovery flow

The service flow shown in Figure 2 is as follows:

  1. The Consul agent deployed on the instance registers to the local Consul cluster, and the service registers to its Consul agent.
  2. The Trading service looks up available Pricer services via API calls.
  3. The Consul agent returns the list of available Pricer services, so that the Trading service can query a Pricer service (a minimal lookup example follows this list).
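Step 2 maps to Consul's health API. The SpringBoot services use the Java Consul client for this, but as a rough sketch (assuming the agent listens on its default local port), a lookup of healthy Pricer instances could look like the following.

import json
import urllib.request

CONSUL_AGENT = "http://localhost:8500"

def healthy_instances(service_name):
    # Only instances whose health checks pass are returned
    url = f"{CONSUL_AGENT}/v1/health/service/{service_name}?passing"
    with urllib.request.urlopen(url, timeout=2) as resp:
        entries = json.load(resp)
    return [
        {
            "address": entry["Service"]["Address"] or entry["Node"]["Address"],
            "port": entry["Service"]["Port"],
            "tags": entry["Service"]["Tags"],
        }
        for entry in entries
    ]

print(healthy_instances("pricer"))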

Walkthrough

Following are the steps required to deploy this solution:

  • Provision the infrastructure using Terraform. The application .jar file and the Consul configuration are deployed as part of it.
  • Test the solution.
  • Clean up AWS resources.

The steps are detailed in the next section, and the code can be found in this GitHub repository.

Prerequisites

Deployment steps

Note: The default AWS Region used in this deployment is ap-southeast-1. If you’re working in a different AWS Region, make sure to update it.

Clone the repository

First, clone the repository that contains all the deployment assets:

git clone https://github.com/aws-samples/geographical-hierarchical-service-lookup-with-consul-on-aws

Build Amazon Machine Images (AMIs)

1. Build the Consul Server AMI in AWS

Go to the ~/deployment/scripts/amis/consul-server/ directory and build the AMI by running:

packer build .

The output should look like this:

==>  Builds finished. The artifacts of successful builds are:

-->  amazon-ebs.ubuntu20-ami: AMIs were created:

ap-southeast-1: ami-12345678910

Make a note of the AMI ID. This will be used as part of the Terraform deployment.

2. Build the Consul Client AMI in AWS

Go to ~/deployment/scripts/amis/consul-client/ directory and build the AMI by running:

packer build .

The output should look like this:

==> Builds finished. The artifacts of successful builds are:

--> amazon-ebs.ubuntu20-ami: AMIs were created:

ap-southeast-1: ami-12345678910

Make a note of the AMI ID. This will be used as part of the Terraform deployment.

Prepare the deployment

There are a few steps that must be accomplished before applying the Terraform configuration.

1. Update deployment variables

    • In a text editor, go to the ~/deployment/ directory
    • Edit the variable file template.var.tfvars.json by adding the variable values, including the AMI IDs previously built for the Consul server and client

Note: The key pair name should be entered without the “.pem” extension.

2. Place the application .jar file in the root folder ~/deployment/

Deploy the solution

To deploy the solution, run the following commands from the terminal:

export VAR_FILE=template.var.tfvars.json

terraform init && terraform plan --var-file=$VAR_FILE -out plan.out

terraform apply plan.out

Validate the deployment

All the EC2 instances have been deployed with AWS Systems Manager access, so you can connect privately to the terminal using the AWS Systems Manager Session Manager feature.

To connect to an instance:

1. Select an instance

2. Click Connect

3. Go to the Session Manager tab

Using Session Manager, connect to one of the Consul servers and run the following commands:

consul members

This command shows you the list of all Consul servers and clients connected to this cluster.

consul members -wan

This command shows you the list of all Consul servers connected to this WAN environment.

To see the Consul User Interface:

1. Open your terminal and run:

aws ssm start-session --target <instanceID> --document-name AWS-StartPortForwardingSession --parameters '{"portNumber":["8500"],"localPortNumber":["8500"]}' --region <region>

Where <instanceID> is the instance ID of one of the Consul servers, and <region> is the AWS Region.

Using Systems Manager port forwarding allows you to connect privately to the instance via a browser.

2. Open a browser and go to http://localhost:8500/ui

3. Find the Management Token ID in AWS Secrets Manager in the AWS Management Console

4. Log in to the Consul UI using the Management Token ID

Test the solution

Connect to the trading instance and query the different services:

curl http://localhost:9090/v1/discover/service/pricer

curl http://localhost:9090/v1/discover/service/static-data

This deployment assumes that the Trading service queries the Pricer and Static-Data services, and that services are returned based on an order of precedence (see Table 1 following):

Service       Precedence   Customer   Cluster   Location   Environment
TRADING       1            ACME       ALPHA     DC1        DEV
PRICER        1            ACME       ALPHA     DC1        DEV
PRICER        2            ACME       ALPHA     DC2        DEV
PRICER        3            ACME       BETA      DC1        DEV
PRICER        4            ACME       BETA      DC2        DEV
PRICER        5            SHARED     ALPHA     DC1        DEV
PRICER        6            SHARED     ALPHA     DC2        DEV
STATIC-DATA   1            SHARED     SHARED    DC1        DEV
STATIC-DATA   2            SHARED     SHARED    DC2        DEV
STATIC-DATA   2            SHARED     BETA      DC2        DEV
STATIC-DATA   2            SHARED     GAMMA     DC2        DEV
STATIC-DATA   -1           STARK      ALPHA     DC1        DEV
STATIC-DATA   -1           ACME       BETA      DC2        PROD

Table 1. Service order of precedence
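The Java code in the repository encodes this table. As a minimal, hedged sketch of the idea (assuming each instance carries customer, cluster, location, and environment tags), the selection can be expressed as a hierarchical sort: enforce the environment boundary, exclude other customers' dedicated instances, then prefer the consumer's own customer over SHARED, its own cluster over others, and the local data center over the remote one. Because unavailable instances never appear in the candidate list, the next-best match is picked automatically, which is the failover behavior described earlier.

def rank(candidate, consumer):
    """Sort key for a candidate service instance; None means 'never use' (precedence -1)."""
    if candidate["environment"] != consumer["environment"]:
        return None  # hard environment boundary
    if candidate["customer"] not in (consumer["customer"], "SHARED"):
        return None  # another customer's dedicated instance
    return (
        candidate["customer"] != consumer["customer"],  # own customer before SHARED
        candidate["cluster"] != consumer["cluster"],    # own cluster before other clusters
        candidate["location"] != consumer["location"],  # local data center before remote
    )

def choose(candidates, consumer):
    eligible = [c for c in candidates if rank(c, consumer) is not None]
    return min(eligible, key=lambda c: rank(c, consumer)) if eligible else None

# Example: the Trading service (ACME/ALPHA/DC1/DEV) picks the highest-precedence Pricer
trading = {"customer": "ACME", "cluster": "ALPHA", "location": "DC1", "environment": "DEV"}
pricers = [
    {"customer": "SHARED", "cluster": "ALPHA", "location": "DC2", "environment": "DEV"},
    {"customer": "ACME", "cluster": "ALPHA", "location": "DC2", "environment": "DEV"},
]
print(choose(pricers, trading))  # the ACME/ALPHA/DC2 instance (precedence 2 in Table 1)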

To test the solution, switch on and off services in the AWS Management Console and repeat Trading queries to look at where the traffic is being redirected.

Cleaning up

To avoid incurring future charges, delete the solution from ~/deployment/ in the terminal:

terraform destroy --var-file=$VAR_FILE

Conclusion

In this post, we outlined the prevalent challenge of complex, globally distributed microservice architectures. We demonstrated how you can build a hierarchical service discovery mechanism to support such an environment using a combination of Amazon EC2 and third-party software such as SpringBoot and Consul. Use the provided assets to test this solution in your sandbox environment and see whether it answers your current challenge.

Additional resources:

Let’s Architect! Open-source technologies on AWS

Post Syndicated from Vittorio Denti original https://aws.amazon.com/blogs/architecture/lets-architect-open-source-technologies-on-aws/

We previously brought you a Let's Architect! blog post about open source on AWS that covered technologies with development led by AWS/Amazon, as well as well-known solutions available on managed AWS services. Today, we're following the same approach to share more insights about the process of developing open-source software itself. That's why the first topic we discuss in this post is a re:Invent talk from Heitor Lessa, Principal Solutions Architect at AWS, explaining some interesting approaches for developing and scaling successful open-source projects.

This edition of Let's Architect! also touches on observability with OpenTelemetry, Apache Kafka on AWS, and Infrastructure as Code with a hands-on workshop on the AWS Cloud Development Kit (AWS CDK).

Powertools for AWS Lambda: Lessons from the road to 10 million downloads

Powertools for AWS Lambda is an open-source library that helps engineering teams implement serverless best practices. In two years, Powertools went from an initial prototype to a fast-growing project in the open-source world. Rapid growth, along with support from a wide community, led to challenges: balancing new features with operational excellence, triaging bug reports and RFCs, and scaling and redesigning documentation.

In this session, you can learn what Powertools for AWS Lambda is and the problems it solves. Moreover, there are many valuable lessons about how to create and scale a successful open-source project. From managing the trade-off between releasing new features and achieving operational stability to measuring the impact of the project, there are many challenges in open-source projects that require careful thought.
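To give a flavor of what the library does (this is a generic illustration, not code from the session; the service and metric names are made up), Powertools for AWS Lambda in Python adds structured logging, tracing, and metrics to a handler with a few decorators:

from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger(service="payment")          # structured, JSON-formatted logs
tracer = Tracer(service="payment")          # X-Ray tracing of the handler and its calls
metrics = Metrics(namespace="PaymentApp")   # CloudWatch metrics via Embedded Metric Format

@logger.inject_lambda_context               # adds request ID, cold start flag, and more
@tracer.capture_lambda_handler
@metrics.log_metrics
def lambda_handler(event, context):
    logger.info("processing payment")
    metrics.add_metric(name="SuccessfulPayment", unit=MetricUnit.Count, value=1)
    return {"statusCode": 200}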

Take me to this video!

Heitor Lessa describing one of the key lessons: developing and releasing new features should be as important as the other activities (governance, operational excellence, and more).

Observability the open-source way

The recent blog post Let’s Architect! Monitoring production systems at scale talks about the importance of monitoring. Setting up observability is critical to maintain application and infrastructure health, but instrumenting applications to collect monitoring signals such as metrics and logs can be challenging when using vendor-specific SDKs.

This video introduces you to OpenTelemetry, an open-source observability framework. OpenTelemetry provides a flexible, vendor-agnostic SDK based on open-source specifications that developers can use to instrument applications and collect signals from them. This resource explains how it works in practice and how to monitor microservice-based applications with the OpenTelemetry SDK.
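As a quick, hedged illustration of what that instrumentation looks like in code (using the OpenTelemetry Python SDK with a console exporter; in practice you would swap in an OTLP exporter pointed at a collector such as AWS Distro for OpenTelemetry, and the names here are made up), a manually created span could be emitted like this:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up the SDK: finished spans are batched and printed to the console
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_order(order_id):
    # Each unit of work becomes a span; attributes carry business context
    with tracer.start_as_current_span("handle-order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic ...

handle_order("o-123")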

Take me to this video!

With AWS Distro for OpenTelemetry, you can collect data from your AWS resources.

Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost

Apache Kafka is an open-source streaming data store that decouples the applications producing streaming data into it (producers) from the applications consuming streaming data from it (consumers). Amazon Managed Streaming for Apache Kafka (Amazon MSK) allows you to use the open-source version of Apache Kafka, with the service managing infrastructure and operations for you.

This blog post explains how the underlying infrastructure configuration can affect Apache Kafka performance. You can learn strategies on how to size the clusters to meet the desired throughput, availability, and latency requirements. This resource helps you discover strategies to find the optimal sizing for your resources, and learn the mental models adopted to conduct the investigation and derive the conclusions.

Take me to this blog!

Comparisons of put latencies for three clusters with different broker sizes

AWS Cloud Development Kit workshop

AWS Cloud Development Kit (AWS CDK) is an open-source software development framework that allows you to provision cloud resources programmatically (Infrastructure as Code, or IaC) using familiar programming languages such as Python, TypeScript, JavaScript, Java, Go, and C#/.NET.

CDK allows you to create reusable templates and assets, test your infrastructure, make deployments repeatable, and make your cloud environment more stable by removing manual (and error-prone) operations. This workshop introduces you to CDK, where you can learn how to provision an initial simple application as well as become familiar with more advanced concepts like CDK constructs.
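To give a sense of what writing CDK code feels like, here is a generic CDK v2 sketch in Python (not the workshop's actual code): a small stack with an inline Lambda function fronted by an API Gateway REST API.

from aws_cdk import App, Duration, Stack
from aws_cdk import aws_apigateway as apigw
from aws_cdk import aws_lambda as _lambda
from constructs import Construct

class HelloApiStack(Stack):
    """A tiny stack: one inline Lambda function behind an API Gateway REST API."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        handler = _lambda.Function(
            self, "Handler",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="index.handler",
            timeout=Duration.seconds(10),
            code=_lambda.Code.from_inline(
                "def handler(event, context):\n"
                "    return {'statusCode': 200, 'body': 'hello from CDK'}\n"
            ),
        )

        # Routes every path of the REST API to the Lambda function
        apigw.LambdaRestApi(self, "Endpoint", handler=handler)

app = App()
HelloApiStack(app, "HelloApiStack")
app.synth()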

Take me to this workshop!

This construct can be attached to any Lambda function that is used as an API Gateway backend. It counts how many requests were issued to each URL.

See you next time!

Thanks for joining our conversation! To find all the blogs from this series, you can check out the Let’s Architect! list of content on the AWS Architecture Blog.

Disaster Recovery for Oracle Database on Amazon EC2 with Fast-Start Failover

Post Syndicated from Harshad Gohil original https://aws.amazon.com/blogs/architecture/disaster-recovery-for-oracle-database-on-amazon-ec2-with-fast-start-failover/

High availability is non-negotiable for organizations today to prevent business-critical application disruptions. Enterprises must prioritize database scalability and availability to avoid downtime in their databases, network, servers, or storage environments.

For organizations that want to avoid required application changes, Oracle Real Application Clusters (RAC) is an option for providing high availability and scalability to the Oracle database. While the RAC feature is not supported for Oracle databases on Amazon Elastic Compute Cloud (Amazon EC2), Oracle Active Data Guard helps achieve high availability on the AWS Cloud.

The Oracle Data Guard feature helps customers survive disasters and data corruption by creating, maintaining, and managing one or more synchronized standby databases. Going further, configuring Oracle Data Guard Fast-Start Failover (FSFO) helps achieve high availability.

In this blog post, we provide an architectural solution to achieve database high availability when running Oracle Database on Amazon EC2 with Oracle Data Guard and Fast-Start Failover, to address Availability Zone (AZ) or Amazon EC2 instance failures. We also introduce the steps you can take to make database failover happen without manual intervention, and offer recommendations for cross-Region disaster recovery.

Solution overview

Let’s explore this solution by discussing the architecture and two alternate options for securing high availability using Oracle Data Guard, along with the advantages and limitations of each. We will then offer a walkthrough of steps to make database failover happen without manual intervention.

Oracle high availability using Oracle Data Guard with multi-AZ and multi-Region with multi-AZ setup

This architecture is recommended to maintain high availability for Oracle databases on Amazon EC2, with protection against Amazon EC2 service outages in an AWS Region. If such an outage occurs, a disaster recovery environment is available, and the multi-AZ setup in the secondary Region maintains resiliency.

In this architecture, the primary database runs in AZ 1 of Region A, with Oracle Data Guard standbys in AZ 2 of Region A (Fast Sync), AZ 1 of Region B (ASYNC), and AZ 2 of Region B (ASYNC). There is an asynchronous cascading replication setup between the standby databases to avoid network latency issues across Regions.

Should Region A experience an Amazon EC2 service outage, the Oracle observer (client software that monitors Oracle Data Guard) initiates failover to the standby database in Region B. Applications can continue to connect to the database, resulting in high availability with limited or minimal data loss depending on the data change rate, as shown in Figure 1.

Figure 1. Oracle with cascading standby databases across regions

Using Oracle RedoRoutes, the default redo transport behavior of Data Guard can be controlled; it can be set during setup as in the following example.

Oracle RedoRoutes setup example:

dgmgrl > edit database DB_1A set property RedoRoutes='(LOCAL: (DB_1B FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2, DB_2B ASYNC PRIORITY=3)) (DB_1B: (DB_2A ASYNC PRIORITY=1, DB_2B ASYNC PRIORITY=2)) (DB_2A: DB_1B ASYNC) (DB_2B: DB_1B ASYNC)'

dgmgrl > edit database DB_1B set property RedoRoutes='(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2, DB_2B ASYNC PRIORITY=3)) (DB_1A: (DB_2A ASYNC PRIORITY=1, DB_2B ASYNC PRIORITY=2))'

dgmgrl > edit database DB_2A set property RedoRoutes='(LOCAL: (DB_2B FASTSYNC PRIORITY=1, DB_1A ASYNC PRIORITY=2, DB_1B ASYNC PRIORITY=3)) (DB_2B: (DB_1A ASYNC PRIORITY=1, DB_1B ASYNC PRIORITY=2)) (DB_1A: DB_2B ASYNC) (DB_1B: DB_2B ASYNC)'

dgmgrl > edit database DB_2B set property RedoRoutes='(LOCAL: (DB_2A FASTSYNC PRIORITY=1, DB_1A ASYNC PRIORITY=2, DB_1B ASYNC PRIORITY=3)) (DB_2A: (DB_1A ASYNC PRIORITY=1, DB_1B ASYNC PRIORITY=2))'

For more information on Oracle RedoRoutes setup for Oracle Cascading Standby, refer to this step-by-step configuration documentation.

Database failover with Amazon Route 53 and Oracle Data Guard

The following walkthrough defines the steps you can take to make database failover happen without manual intervention using Amazon Route 53 and Oracle Data Guard.

Prerequisites

Before getting started, review the following prerequisites for this solution:

Walkthrough

Step 1. Create Oracle Database Service

For applications to connect without manual intervention in the event of a failure, we recommend creating an Oracle database service using the Oracle-supplied package DBMS_SERVICE.

exec dbms_service.CREATE_SERVICE(SERVICE_NAME=>'DB_SERVICE_FOR_APP', NETWORK_NAME=>'DB_SERVICE_FOR_APP');

exec dbms_service.START_SERVICE('DB_SERVICE_FOR_APP');

Step 2. Network configuration

Applications can connect to the database seamlessly, without manual intervention, in the event of a failover from the primary database to the standby by using the Oracle Transparent Application Failover (TAF) approach. However, TAF requires updating application connection strings if a host IP changes.

The following approach using Amazon Route 53 is recommended for added flexibility and scalability. Route 53 has DNS A records that map to the database instance IPs, and CNAME records that redirect DNS queries to those A records. The following tnsnames.ora entry depicts the DNS mapping; the CNAME, along with the database service name, can be used by the application in its network configuration.

Database_Name =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = <db_cname>)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = <db_service_name>)
    )
  )

To update the CNAME in Route 53 to map to the Primary host automatically in the event of failure, follow these steps.

Step 3. Route 53 setup

Create a script named route53update.sh and place it on the database hosts using the following code.

#!/bin/bash
export ORACLE_HOME="<<change>>"
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
export PATH=$ORACLE_HOME/bin:$PATH:/usr/local/bin:/usr/bin

LOG_FILE="/tmp/switch_dns_$$.log"
DNS_DOMAIN="<<change>>"
ACTIVE_DB_CNAME="<<change>>"
HOSTED_ZONE_ID="<<change>>"
TTL="<<change>>"

# UPSERT the CNAME record ($1) so that it points at the host passed in $2
update_dns () {
  TMPFILE="/tmp/route53_dns_$$.log"
  cat > ${TMPFILE} << EOF
{
  "Comment":"Updating DNS of record ${1}.${DNS_DOMAIN}",
  "Changes":[
    {
      "Action":"UPSERT",
      "ResourceRecordSet":{
        "ResourceRecords":[
          {
            "Value":"$2"
          }
        ],
        "Name":"${1}.${DNS_DOMAIN}.",
        "Type":"CNAME",
        "TTL":$TTL
      }
    }
  ]
}
EOF
  /usr/local/bin/aws route53 change-resource-record-sets \
        --hosted-zone-id $HOSTED_ZONE_ID \
        --change-batch file://"$TMPFILE" >> "$LOG_FILE"
}

# Identify the current primary database from the Data Guard configuration
prim_uniq_sid=`$ORACLE_HOME/bin/sqlplus -s / as sysdba <<EOF
set feedback off echo off lines 2000 head off
select upper(db_unique_name) from v\\$dataguard_config where DEST_ROLE='PRIMARY DATABASE';
EOF`
prim_uniq_sid=`echo $prim_uniq_sid | sed 's/^[ \t]*//;s/[ \t]*$//'`

# Resolve the host of the current primary from its TNS entry
host_current=`$ORACLE_HOME/bin/tnsping ${prim_uniq_sid} | sed -n 's/\(.*Host\)\([^)]*\)\(.*\)/\2/pi' | sed 's/=//g' | sed 's/^[ \t]*//;s/[ \t]*$//'`

# Host the Route 53 CNAME currently points at
dns_current_host=`/usr/local/bin/aws route53 list-resource-record-sets --hosted-zone-id $HOSTED_ZONE_ID --query "ResourceRecordSets[?Name == '${ACTIVE_DB_CNAME}.${DNS_DOMAIN}.'].ResourceRecords" --output text`

# Update the CNAME only when it no longer points at the primary
if [ "$host_current" != "$dns_current_host" ]; then
  update_dns ${ACTIVE_DB_CNAME} $host_current
fi

Step 4. Database job setup

Create a job in the Oracle primary database that executes the shell script introduced previously in the event of a failover, using the following code.

begin
  dbms_scheduler.create_job
  (
    job_name             => 'route53update',
    job_type             => 'executable',
    number_of_arguments  => 0,
    job_action           => '/<<location of script>>/route53update.sh',
    auto_drop            => false
  );

  dbms_scheduler.enable('route53update');
end;
/

Step 5. Database trigger setup

In the event of a failure, the primary fails over and the standby starts up as the new primary. A trigger needs to be created on the primary database to execute the job on any failover and update the Route 53 CNAME, using the following code.

create or replace trigger SYS.Update_Route53_Record
AFTER STARTUP ON DATABASE
DECLARE
  db_role varchar2(16);
  db_mode varchar2(20);
BEGIN
  select database_role, open_mode into db_role, db_mode from v$database;
  if db_role = 'PRIMARY' then
    dbms_scheduler.run_job('route53update');
  end if;
END;
/

Alternate Option 1: Single Region with multi-AZ

This option is a minimum recommended configuration to maintain high availability for Oracle databases on Amazon EC2 for customers who do not have a multi-region setup.

  • Advantage: Protects against Amazon EC2 service outage in a single AZ.
  • Limitation: Does not protect against Amazon EC2 service outages in a single Region.

In this architecture, Oracle Data Guard Fast Sync replication exists between the Oracle database instances in a multi-AZ setup, with the primary database (read-write) in AZ 1 and the standby database (read-only) in AZ 2.

If the primary database is unreachable due to any failure, the observer fails over to the standby database in a different AZ. Applications can continue to connect to the database with zero data loss, thanks to synchronous replication between AZs using the Maximum Availability/Maximum Protection mode setup in Oracle Data Guard. If the primary database is in us-east-1a and the standby in us-east-1b, the RedoRoutes property can be defined as follows.

Oracle RedoRoutes setup example:

dgmgrl> edit database DB_1A set property RedoRoutes='(LOCAL: (DB_1B FASTSYNC))'

dgmgrl> edit database DB_1B set property RedoRoutes='(LOCAL: (DB_1A FASTSYNC))'

For more information on how disaster recovery works in the AWS Cloud, visit the Disaster recovery is different in the cloud section of the AWS Well-Architected Framework. For more on Oracle RedoRoutes setup, refer to the Oracle Redo Routing Rules documentation.

Alternate Option 2: Multi-AZ with multi-Region with single AZ

This option is recommended to maintain high availability for an Oracle database on Amazon EC2 for customers who need multi-region availability. It provides protection against the rare unavailability of Amazon EC2 instances in the primary Region, in which case a disaster recovery environment is provided.

  • Advantage: Protects against Amazon EC2 service outages in two AZs or in an entire AWS Region.
  • Limitation: Decreased resiliency (no high availability) after an Amazon EC2 service outage affecting an entire Region, because the secondary Region runs in a single AZ.

In this architecture, Oracle Data Guard Fast Sync replication exists between the Oracle database instances in a multi-AZ setup within a single Region, with the primary database in AZ 1 of Region A and the standby database in AZ 2 of Region A. There is an asynchronous replication setup from the standby database to the cross-Region standby.

Asynchronous replication is recommended for cross-Region replication to avoid network latency issues. A cascading standby setup ensures there is no additional performance impact on the primary database from sending data to multiple standbys.

If the primary database is unreachable, failover happens between AZs in Region A. In the event of an Amazon EC2 service outage in a Region, failover occurs to Region B, resulting in high availability with minimal data loss depending on the data change rate. If the primary database is in us-east-1a and the standbys are in us-east-1b (Fast Sync) and us-east-2a (ASYNC), the RedoRoutes property can be defined as follows.

Oracle RedoRoutes setup example:

dgmgrl > edit database DB_1A set property RedoRoutes= '(LOCAL: (DB_1B FASTSYNC PRIORITY=1, DB_2A ASYNC PRIORITY=2))(DB_1B: DB_2A ASYNC)(DB_2A: DB_1B ASYNC)'

dgmgrl > edit database DB_1B set property RedoRoutes= '(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_2A ASYNC  PRIORITY=2)) (DB_1A: DB_2A ASYNC)'

dgmgrl > edit database DB_2A set property RedoRoutes= '(LOCAL: (DB_1A FASTSYNC PRIORITY=1, DB_1B ASYNC PRIORITY=2))'

Cleaning up

The services involved in this solution incur costs. When you’re done using this solution, clean up the following resources:

  • Amazon EC2 instances – Stop or delete (terminate) the Amazon EC2 instances that you provisioned.
  • Route 53 – Delete the hosted zone and the A/CNAME records created.

Conclusion

This blog post demonstrates how high availability and disaster recovery can be achieved for an Oracle database on an Amazon EC2 instance using Oracle Data Guard. Using the architectures in this post, you can achieve zero data loss with the Oracle Fast-Start Failover option within the same Region or cross-Region on Amazon EC2.

You can also use this architecture to replicate data from an Oracle database on Amazon EC2 to an Oracle database hosted outside of the AWS Cloud. With Oracle Cascading Standby and Oracle RedoRoutes, you can reduce the dependency on the primary database and improve overall performance.

Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

Post Syndicated from Siva Guruvareddiar original https://aws.amazon.com/blogs/architecture/simulating-kubernetes-workload-az-failures-with-aws-fault-injection-simulator/

In highly distributed systems, it is crucial to ensure that applications function correctly even during infrastructure failures. One common infrastructure failure scenario is when an entire Availability Zone (AZ) becomes unavailable. Applications are often deployed across multiple AZs to ensure high availability and fault tolerance in cloud environments such as Amazon Web Services (AWS).

Kubernetes helps manage and deploy applications across multiple nodes and AZs, though it can be difficult to test how your applications will behave during an AZ failure. This is where fault injection comes in. The AWS Fault Injection Simulator (AWS FIS) service can intentionally inject faults or failures into a system to test its resilience. In this blog post, we will explore how to use AWS FIS to simulate an AZ failure for Kubernetes workloads.

Solution overview

To ensure that Kubernetes cluster workloads are architected to handle failures, you must test their resilience by simulating real-world failure scenarios. Kubernetes allows you to deploy workloads across multiple AZs to handle failures, but it's still important to test how your system behaves during AZ failures. To do this, we use a product-details microservice, run it with auto scaling using both Cluster Autoscaler (CA, from the Kubernetes community) and Karpenter, and test how the system responds to varying traffic levels.

This blog post explores a load test that mimics the behavior of hundreds of users accessing the service concurrently to simulate a realistic failure scenario. The test uses AWS FIS to disrupt network connectivity and simulate an AZ failure in a controlled manner, which allows us to measure how users are impacted when using CA and then Karpenter.

Both CA and Karpenter automatically adjust the size of a cluster based on the resource requirements of the running workloads. By comparing the performance of the microservice under these two autoscaling tools, we can determine which tool is better-suited to handle such scenarios.

Figure 1 demonstrates the solution’s architecture.

Figure 1. Architecture flow for microservices to simulate a realistic failure scenario

Prerequisites

Install the following utilities on a Linux-based host machine, which can be an Amazon Elastic Compute Cloud (Amazon EC2) instance, AWS Cloud9 instance, or a local machine with access to your AWS account:

Setting up a microservice environment

This blog post consists of two major parts: Bootstrap and experiment. The bootstrap section provides step-by-step instructions for:

  • Creating and deploying a sample microservice
  • Creating an AWS IAM role for the FIS service
  • Creating an FIS experiment template

By following these bootstrap instructions, you can set up your own environment to test the different autoscaling tools’ performance in Kubernetes.

In the experiment section, we showcase how the system behaves with CA, then Karpenter.

Let’s start by setting a few environment variables using the following code:

export FIS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export FIS_AWS_REGION=us-west-2
export FIS_CLUSTER_NAME="fis-simulation-cluster"

Next, clone the sample repository which contains the code for our solution:

git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd ./containers-blog-maelstrom/fis-simulation-blog

Step 1. Bootstrap the environment

This solution uses Amazon EKS Blueprints for AWS Cloud Development Kit (AWS CDK) to provision our Amazon EKS cluster.

The first step in any AWS CDK deployment is bootstrapping the environment. cdk bootstrap is an AWS CDK CLI command that prepares the environment with the resources required by AWS CDK to perform deployments into that environment (that is, a combination of AWS account and AWS Region).

Let’s run the below commands to bootstrap your environment and install all node dependencies required for deploying the solution:

npm install
cdk bootstrap aws://$FIS_ACCOUNT_ID/$FIS_AWS_REGION

We’ll use Amazon EKS Blueprints for CDK to create an Amazon EKS cluster and deploy add-ons. This stack deploys the following add-ons into the cluster:

  • AWS Load Balancer Controller
  • AWS VPC CNI
  • Core DNS
  • Kube-proxy

Step 2. Create an Amazon EKS cluster

Run the below command to deploy the Amazon EKS cluster:

npm install
cdk deploy "*" --require-approval never

Deployment takes approximately 20-30 minutes; then you will have a fully functioning Amazon EKS cluster in your account.

fis-simulation-cluster

Deployment time: 1378.09s

Copy and run the aws eks update-kubeconfig ... command from the output section to gain access to your Amazon EKS cluster using kubectl.

Step 3. Deploy a microservice to Amazon EKS

Use the code from the following GitHub repository and deploy it using Helm.

git clone https://github.com/aws-containers/eks-app-mesh-polyglot-demo.git
helm install workshop eks-app-mesh-polyglot-demo/workshop/helm-chart/

Note: You are not restricted to this sample. If you have other microservices, you can use those instead. This command deploys the following three microservices:

  1. Frontend-node as the UI to the product catalog application
  2. Catalog detail backend
  3. Product catalog backend

To test resiliency, let's take one of the microservices, proddetail (the backend microservice), as an example. When checking the status of the service as follows, you will see that proddetail is of type ClusterIP, which is accessible only within the cluster. To access it from outside the cluster, perform the following steps.

kubectl get service proddetail -n workshop
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
proddetail   ClusterIP   10.100.168.219   <none>        3000/TCP   11m

Create ingress class

cat <<EOF | kubectl create -f -
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: aws-alb
spec:
  controller: ingress.k8s.aws/alb  
EOF

Create ingress resource

cat <<EOF | kubectl create -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: workshop
  name: proddtl-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: aws-alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: proddetail
            port:
               number: 3000
EOF          

After this, your web URL is ready:

kubectl get ingress -n workshop
NAME              CLASS     HOSTS   ADDRESS                                                                  PORTS   AGE
proddtl-ingress   aws-alb   *       k8s-workshop-proddtli-166014b35f-354421654.us-west-1.elb.amazonaws.com   80      14s

Test the connectivity from your browser:

Figure 2. Testing the connectivity from your browser

Step 4. Create an IAM Role for AWS FIS

Before running an AWS FIS experiment, create an IAM role. Let's create a trust policy and attach it as shown here:

cat > fis-trust-policy.json << EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                  "fis.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
aws iam create-role --role-name my-fis-role --assume-role-policy-document file://fis-trust-policy.json

Attach the AWS managed policies required by AWS FIS to the role:

aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorNetworkAccess
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEKSAccess
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorSSMAccess

Step 5. Create an AWS FIS experiment

Use AWS FIS to create an experiment that disrupts network connectivity. Use the following experiment template with the IAM role created in the previous step:

Figure 3. Experiment template

Step 6. Failure simulation with AWS FIS on CA and Karpenter

We’ve completed microservice setup, made it internet-accessible, and created an AWS FIS template to simulate failures. Now let’s experiment with how the system behaves with different autoscalers: CA and Karpenter.

With the microservice available within the Amazon EKS cluster, we'll use Locust to simulate user behavior, with a total of 100 users trying to access the URLs concurrently.
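The exact load profile used for this post isn't published here, but a Locust test file for this kind of run can be as small as the following sketch (the endpoint path and think times are illustrative). It would then be started with something like: locust -f locustfile.py --headless --host http://<ingress-address> --users 100 --spawn-rate 10

from locust import HttpUser, between, task

class ProductDetailUser(HttpUser):
    """Simulated shopper repeatedly hitting the proddetail ingress."""
    wait_time = between(1, 3)  # seconds of think time between requests

    @task
    def get_product_detail(self):
        # The ALB ingress created earlier routes "/" to the proddetail service
        self.client.get("/")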

For the following experiments, Run 1 shows 100 users trying to access the service without any system disruptions. We’ll then move to AWS FIS and disrupt network connectivity in Run 2. Measuring the user impact and comparing it to the results from the first run provides insights on how the system responds to failures and can be improved for greater reliability and performance.

Simulating failures with Cluster Autoscaler (CA)

To perform this experiment, select your experiment template and click Start experiment, then enter start in the field. Currently 12 replicas of the proddetail microservice are running.

As the following Locust charts detail, Run 1 completed without failures. Run 2 simulated network connectivity disruption, resulting in a visible failure rate of 4 percent, with a peak of 7 failures at one time.

For this experiment, we used a total of seven nodes of type t3.small. Use eks-node-viewer to visualize dynamic node usage within a cluster.

Figure 4. CA experiment results

Figure 5. Dynamic node usage within cluster

Simulating failures with Karpenter

Continuing the same experiment with 12 replicas of the proddetail microservice, this time we use Karpenter. As shown in the following figures, the cluster uses a combination of t3.small instances and the "C" and "M" instance families provided in Karpenter's provisioner configuration.

In Run 1, we observe 0 failures. In Run 2, when network connectivity was disrupted by AWS FIS, Karpenter was able to maintain user requests with almost 0 percent failure. This outcome highlights the effectiveness of Karpenter as an autoscaler for maintaining high availability by carefully configuring the provisioner.

Figure 6. Karpenter experiment results

Figure 7. Dynamic node usage within cluster

Cleanup

Use the following commands to clean up your experiment environment.

#delete Ingress resources
kubectl delete ingress proddtl-ingress -n workshop
kubectl delete ingressclass aws-alb

#delete IAM resources (detach the managed policies before deleting the role)
aws iam detach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorNetworkAccess
aws iam detach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam detach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEKSAccess
aws iam detach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access
aws iam detach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorSSMAccess
aws iam delete-role --role-name my-fis-role

#delete FIS resources
fis_template_id=`aws fis list-experiment-templates --region $FIS_AWS_REGION | jq -r ".experimentTemplates[0].id"`
aws fis delete-experiment-template --id $fis_template_id --region $FIS_AWS_REGION

#delete application resources and cluster
helm uninstall workshop
cdk destroy

Conclusion

The experiment results show that Karpenter performs better and recovers more quickly from network connectivity disruption than Cluster Autoscaler. The figures in this blog post highlight Karpenter's resiliency and its ability to scale and recover from failures quickly.

While this experiment provides valuable insights into the performance and reliability of Kubernetes workloads in the face of failures, it's important to acknowledge that it is not a true test of an unavailable AZ. In a real-world scenario, an AZ failure can have a cascading effect, potentially impacting other services that workloads depend upon. But simulating an AZ failure in a controlled environment helps you better understand how your Kubernetes cluster and applications will behave in an actual failure scenario. This knowledge can help you identify and address issues before they occur in production, ensuring that your applications remain highly available and resilient.

In summary, this experiment provides good insights into the performance and resilience of Kubernetes workloads. It is not a perfect representation of a real-world AZ failure, but by leveraging tools such as AWS FIS and carefully configuring autoscaling policies, you can take proactive steps to optimize performance and ensure high availability for critical applications.

How organizations are modernizing for cloud operations

Post Syndicated from Adam Keller original https://aws.amazon.com/blogs/devops/how-organizations-are-modernizing-for-cloud-operations/

Over the past decade, we've seen a rapid evolution in how IT operations teams and application developers work together. In the early days, there was a clear division of responsibilities between the two teams, with one team focused on providing and maintaining the servers and various components (storage, DNS, networking, and so on) for the application to run, while the other primarily focused on developing the application's features, fixing bugs, and packaging up their artifacts for the operations team to deploy. Ultimately, this division led to a siloed approach, which presented glaring challenges. These silos hindered communication between the teams, which would often result in developers being ready to ship code and passing it over to the operations teams with little to no prior collaboration. In turn, operations teams were often left scrambling to deliver on the requirements at the last minute. This would lead to bottlenecks in software delivery, delaying features and bug fixes from being shipped. Aside from software delivery, operations teams were primarily responsible for handling on-call duties, which encompassed addressing issues arising from both applications and infrastructure. Consequently, when incidents occurred, the operations teams were the ones receiving alerts, irrespective of the source of the problem. This raised the question: what motivates software developers to create resilient and dependable software? Terms such as "throw it over the wall" and "it works on my laptop" were coined because of this and are still commonly referenced in discussions today.

The DevOps movement emerged in response to these challenges, aiming to build a bridge between developers and operations teams. DevOps focuses on collaboration between the two teams through communication and integration by fostering a culture of shared responsibility. This approach promotes the automation of infrastructure and application code, leveraging continuous integration (CI) and continuous delivery (CD), microservices architectures, and visibility through monitoring, logging, and tracing. The end result of operating in a DevOps model is quicker and more reliable release cycles. While the ideology is well intentioned, implementing a DevOps practice is not easy, as organizations struggle to adapt and adhere to the cultural expectations. In addition, teams can struggle to find the right balance between speed and stability, which oftentimes results in reverting back to old behaviors due to fear of downtime and instability of their environments. While DevOps is very focused on culture through collaboration and automation, not all developers want to be involved in operations and vice versa. This poses the question: how do organizations centralize a frictionless developer experience, with guardrails and best practices baked in, while providing a golden path for developers to self-serve? This is where platform engineering comes in.

Platform engineering has emerged as a critical discipline for organizations, driving the next evolution of infrastructure and operations while simultaneously empowering developers to create and deliver robust, scalable applications. It aims to improve the developer experience by providing self-service mechanisms that offer some level of abstraction for provisioning resources, with good practices baked in. This builds on top of DevOps practices by enabling developers to have full control of their resources through self-service, without having to throw anything over the wall. There are various ways that platform engineering teams implement these self-service interfaces, from leveraging a GitOps-focused strategy to building Internal Developer Platforms with a UI and/or API. With the increasing demand for faster and more agile development, many organizations are adopting this model to streamline their operations, gain visibility, reduce costs, and lower the friction of onboarding new applications.

In this blog post, we will explore the common operational models used within organizations today, where platform engineering fits within these models, the common patterns used to build and develop these self-service platforms, and what lies ahead for this emerging field.

Operational Models

It's important for us to start by understanding how technology teams operate today and the various ways they support development teams, from instantiating infrastructure to defining pipelines and deploying application code. In the diagram below, we highlight four common operational models and discuss each to understand the benefits and challenges they bring. This is also critical in understanding where platform teams fit, and where they don't.

This image shows a sliding scale of the various provisioning models. For each model it shows the interaction between developers and the platform team.

Centralized Provisioning

In a centralized provisioning model, the responsibility for architecting, deploying, and managing infrastructure falls primarily on a centralized team. Organizations assign enforcement of controls to specific roles with narrow scope, including release management, process management, and segmentation into siloed teams (networking, compute, pipelines, and so on). The request model generally requires a ticket or request to be sent to the central or dedicated siloed team; the ticket enters a backlog, and the developers wait until resources can be provisioned on their behalf. In an ideal world, the central teams can quickly provision the resources and pipelines to get the developers up and running; but in reality these teams are busy with work and have to prioritize accordingly, which oftentimes leaves development teams waiting or having to predict what they need well in advance.

While this model provides central control over resource provisioning, it introduces bottlenecks into the delivery process and generally results in slower deployment cycles and feedback loops. This model becomes especially challenging when supporting a large number of development teams with varying requirements and use cases. Ultimately it can lead to frustration and friction between teams, which is why organizations, after some time, look to move away from operating in this model. This leads us to the next model: the platform-enabled golden path.

Platform-enabled Golden Path

The platform-enabled golden path model is an approach that allows developers to have some form of customization while still maintaining consistency by following a set of standards. In this model, platform engineers clearly lay out "preferred" standards with sane defaults, guardrails, and good practices based on common architectures that development teams can use as-is. Sophisticated platform teams may implement their own customizations on top of this framework.

The platform engineering team is responsible for creating and updating the templates, with maintenance responsibilities typically being shared. This approach strikes a balance between consistency and flexibility, allowing for some customization while still maintaining standards. However, it can be challenging to maintain visibility across the organization, as development teams have more freedom to customize their infrastructure. This becomes especially challenging when platform teams want a change to propagate across resources deployed by the various development teams building on top of these patterns.

Embedded DevOps

Embedded DevOps is a model in which DevOps engineers are directly aligned with development teams to define, provision, and maintain their infrastructure. There are a couple of common patterns around how organizations use this model.

  • Floating model: A central DevOps team can leverage a floating model where a DevOps engineer will be directly embedded onto a development team early in the development process to help build out the required pipelines and infrastructure resources, and jump to another team once everything is up and running.
  • Permanent embedded model: Alternatively, a development team can have a permanent DevOps engineer on the team to help support early iterations as well as maintenance as the application evolves. The DevOps engineer is ideally there from the beginning of the project and continues to support and improve the infrastructure and automation based on feature requests and bug fixes.

A central platform and/or architecture team may define the acceptable configurations and resources, while DevOps engineers decide how to best use them to meet the needs of their development team. Individual teams are responsible for maintaining and updating the templates and pipelines. This model offers greater agility and flexibility, but also requires the funding to hire DevOps engineers per development team, which can become costly as development teams scale. When operating in this model, it's important to maintain collaboration between members of the DevOps team to ensure that best practices can be shared.

Decentralized DevOps

Lastly, the decentralized DevOps model gives development teams full end-to-end ownership and responsibility for defining and managing their infrastructure and pipelines. A central team may be focused on building out guardrails and boundaries to limit the blast radius of changes. They can also create a process to ensure that deployed infrastructure meets company standards, while development teams remain free to make design decisions and stay autonomous. This approach offers the greatest agility and flexibility, but also the highest risk of inconsistency, errors, and security vulnerabilities. Additionally, this model requires a cultural shift in the organization because the development teams now own the entire stack, which results in more responsibility. This model can be a deterrent to developers, especially if they are unfamiliar with building resources in the cloud and/or don't want to do it.

Overall, each model has its strengths and weaknesses, and the purpose of this blog is to educate on the patterns that are emerging. Ultimately the right approach depends on the organization’s specific needs and goals as well as their willingness to shift culturally. Of the above patterns, the two that are emerging as the most common are Platform-enabled Golden Path and Decentralized DevOps. Furthermore, we’re seeing that more often than not platform teams are finding themselves going back and forth between the two patterns within the same organization. This is in part due to technology making infrastructure creation in the cloud more accessible through abstraction and automation (think of tools like the AWS Cloud Development Kit (CDK), AWS Serverless Application Model (SAM) CLI, AWS Copilot, Serverless framework, etc). Let’s now look at the technology patterns that are emerging to support these use cases.

Emerging patterns

Of the trends that are on the rise, Internal Developer Platforms and GitOps practices are becoming increasingly popular in the industry due to their ability to streamline the software development process and improve collaboration between development and platform teams. Internal Developer Platforms provide a centralized platform for developers to access resources and tools needed to build, test, deploy, and monitor applications and associated infrastructure resources. By providing a self-service interface with pre-approved patterns (via UI, API, or Git), internal developer platforms empower development teams to work independently and collaborate with one another more effectively. This reduces the burden on IT and operations teams while also increasing the agility and speed of development as developers aren’t required to wait in line to get resources provisioned. The paradigm shifts with Internal Developer Platforms because the platform teams are focused on building the blueprints and defining the standards for backend resources that development teams centrally consume via the provided interfaces. The platform team should view the internal developer platform as a product and look at developers as their customer.

While internal developer platforms provide a lot of value and abstraction through a UI and API’s, some organizations prefer to use Git as the center of deployment orchestration, and this is where leveraging GitOps can help. GitOps is a methodology that leverages Git as the source of orchestrating and managing the deployment of infrastructure and applications. With GitOps, infrastructure is defined declaratively as code, and changes are tracked in Git, allowing for a more standardized and automated deployment process. Using git for deployment orchestration is not new, but there are some concepts with GitOps that take Git orchestration to a new level.

Let’s look at the principles of GitOps, as defined by OpenGitOps:

  • Declarative
    • A system managed by GitOps must have its desired state expressed declaratively.
  • Versioned and Immutable
    • Desired state is stored in a way that enforces immutability, versioning and retains a complete version history.
  • Pulled Automatically
    • Software agents automatically pull the desired state declarations from the source.
  • Continuously Reconciled
    • Software agents continuously observe actual system state and attempt to apply the desired state.

GitOps helps reduce the risk of errors and improve consistency across the organization, because all change is tracked centrally. Additionally, it provides developers with a familiar interface in Git, as well as the ability to store the desired state of their infrastructure and applications in one place. Lastly, GitOps is focused on ensuring that the desired state in Git is always maintained; if drift occurs, an external process reconciles the state of the resources, as in the simplified sketch that follows. GitOps was born in the Kubernetes ecosystem using tools like Flux and Argo CD.
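To make the principles concrete, here is a deliberately simplified reconciliation-loop sketch; the repository path, manifest location, and interval are made up, and real agents such as Flux or Argo CD do far more, including actually applying the manifests to the cluster.

import subprocess
import time

_last_applied = None  # stands in for querying the platform's live state

def desired_state(repo_path):
    """Pull the declaratively expressed, versioned desired state from Git."""
    subprocess.run(["git", "-C", repo_path, "pull", "--ff-only"], check=True)
    with open(f"{repo_path}/manifests/deployment.yaml") as f:
        return f.read()

def apply(manifest):
    """Converge the platform toward the desired state (e.g. kubectl apply or API calls)."""
    global _last_applied
    print("drift detected, applying desired state")
    _last_applied = manifest

def reconcile_forever(repo_path, interval_seconds=30):
    # Continuously reconciled: the agent pulls and converges in a loop
    while True:
        if (desired := desired_state(repo_path)) != _last_applied:
            apply(desired)
        time.sleep(interval_seconds)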

The final emerging trend to discuss is particularly relevant to teams functioning within a decentralized DevOps model, possessing end-to-end responsibility for the stack, encompassing infrastructure and application delivery. The cognitive load required to connect the underlying cloud resources together while also being an expert in building out business logic for the application is extremely high, which is why teams look to harness the power of abstraction and automation for infrastructure provisioning. While this may appear analogous to previously mentioned practices, the key distinction lies in the utilization of tools specifically designed to enhance the developer experience. By abstracting various components (such as networking, identity, and stitching everything together), these tools eliminate the necessity for interaction with centralized teams, empowering developers to operate autonomously and assume complete ownership of the infrastructure. This trend is exemplified by the adoption of innovative tools such as AWS Application Composer, AWS CodeCatalyst, the SAM CLI, the AWS Copilot CLI, and the AWS Cloud Development Kit (CDK).

Looking ahead

If there is one thing we can ascertain, it’s that the journey to successful developer enablement is ongoing, and it’s clear that finding the balance of speed, security, and flexibility can be difficult to achieve. Throughout all of these evolutionary trends in technology, Git has remained the nucleus of infrastructure and application deployment automation. This is not new; however, the processes being built around Git, such as GitOps, are. The industry continues to gravitate towards this model, and at AWS we are looking at ways to enable builders to leverage Git as the source of truth with simple integrations. For example, AWS Proton has built integrations with Git for central template storage with a feature called template sync, and recently released a feature called service sync, which allows developers to configure and deploy their Proton services using Git. These features empower the platform team and developers to seamlessly store their templates and desired infrastructure resource states within Git, requiring no additional effort beyond the initial setup.

We also see that interest in building internal developer platforms is on a sharp incline, and it’s still early days. With tools like AWS Proton, AWS Service Catalog, Backstage, and other SaaS providers, platform teams are able to define patterns centrally for developers to self-serve via a library or “shopping cart”. As mentioned earlier, it’s vital that the teams building out internal developer platforms think of ways to enable developers to deploy supplemental resources that aren’t defined in the central templates. While the developer platform can solve the majority of use cases, it’s nearly impossible to solve them all. If you can’t enable developers to deploy resources on top of their platform-deployed services, you’ll find that you’re back to the original problem statement outlined at the beginning of this blog, which can ultimately result in a failed implementation. AWS Proton solves this through a feature called components, which enables developers to bring their own IaC templates to deploy on top of their services deployed through Proton.

The rising popularity of the aforementioned patterns reveals an unmet need for developers who seek to tailor their cloud resources according to the specific requirements of their applications and the demands of platform/central teams that require governance. This is particularly prevalent in serverless workloads, where developers often integrate their application and infrastructure code, utilizing services such as AWS Step Functions to transfer varying degrees of logic from the application layer to the managed service itself. Centralizing these resources becomes increasingly challenging due to their dynamic nature, which adapts to the evolving requirements of business logic. Consequently, it is nearly impossible to consolidate these patterns into a universally applicable blueprint for reuse across diverse business scenarios.

As the distinction between cloud resources and application code becomes increasingly blurred, developers are compelled to employ tools that streamline the underlying logic, enabling them to achieve their desired outcomes swiftly and securely. In this context, it is crucial for platform teams to identify and incorporate these tools, ensuring that organizational safeguards and expectations are upheld. By doing so, they can effectively bridge the gap between developers’ preferences and the essential governance required by the platform or central team.

Wrapping up

We’ve explored the various operating models and emerging trends designed to facilitate these models. Platform Engineering represents the ongoing evolution of DevOps, aiming to enhance the developer experience for rapid and secure deployments. It is crucial to recognize that developers possess varying skill sets and preferences, even within the same organization. As previously discussed, some developers prefer complete ownership of the entire stack, while others concentrate solely on writing code without concerning themselves with infrastructure. Consequently, the platform engineering practice must continuously adapt to accommodate these patterns in a manner that fosters enablement rather than posing as obstacles. To achieve this, the platform must be treated as a product, with developers as its customers, ensuring that their needs and preferences are prioritized and addressed effectively.

To determine where your organization fits within the discussed operational models, we encourage you to initiate a self-assessment and have internal discussions. Evaluate your current infrastructure provisioning, deployment processes, and development team support. Consider the benefits and challenges of each model and how they align with your organization’s specific needs, goals, and cultural willingness to shift.

To facilitate this process, gather key stakeholders from various teams, including leadership, platform engineering, development, and DevOps, for a collaborative workshop. During this workshop, review the four operational models (Centralized Provisioning, Platform-enabled Golden Path, Embedded DevOps, and Decentralized DevOps) and discuss the following:

  • How closely does each model align with your current organizational structure and processes?
  • What are the potential benefits and challenges of adopting or transitioning to each model within your organization?
  • What challenges are you currently facing with the model that you operate under?
  • How can technology be leveraged to optimize infrastructure creation and deployment automation?

By conducting this self-assessment and engaging in open dialogue, your organization can identify the most suitable operational model and develop a strategic plan to optimize collaboration, efficiency, and agility within your technology teams. If a more guided approach is preferred, reach out to our solutions architects and/or AWS partners to assist.

Adam Keller

Adam is a Senior Developer Advocate @ AWS working on all things related to IaC, Platform Engineering, DevOps, and modernization. Reach out to him on Twitter @realadamjkeller.

Let’s Architect! Multi-tenant SaaS architectures

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-multi-tenant-saas-architectures/

In a multi-tenant architecture, multiple instances of an application run on shared infrastructure. With this approach, each tenant is isolated from others, typically through logical separation, while utilizing the same underlying infrastructure. This allows multiple tenants to use the same application while maintaining their data security, privacy, and customization requirements.

Understanding architectural patterns for multi-tenancy has become crucial for architects and developers aiming to deliver scalable, secure, and cost-effective solutions. Isolating tenant data is a fundamental responsibility for Software as a Service (SaaS) providers. In this edition of Let’s Architect!, we offer a comprehensive exploration of multi-tenant architectures, covering various aspects, such as SaaS microservices, SaaS serverless, SaaS on Amazon EKS, and an insightful whitepaper.

SaaS microservices deep dive: Simplifying multi-tenant development

In this session, Michael Beardsley, Principal Solutions Architect at AWS, takes a deep dive into the realm of multi-tenant microservices, exploring various patterns and strategies that enable the seamless implementation of multi-tenant microservices, all while ensuring that additional complexity is not imposed upon the SaaS builders. He shares practical patterns to simplify the development process by addressing crucial aspects, such as authorization, data access, tenant isolation, metrics, billing, logging, and a plethora of other considerations; this is irrespective of the chosen compute platform (like Amazon Elastic Container Service, Amazon Elastic Kubernetes Service [Amazon EKS], or AWS Lambda) or database solution.

There is another session available that highlights specific techniques and architecture strategies that can directly impact the success of a SaaS business. If you’re interested in learning more about optimizing multi-tenant SaaS architecture, this session is a great opportunity.

Take me to this video!

SaaS multi-tenant microservices

Building a Multi-Tenant SaaS Solution Using AWS Serverless Services

In this AWS Partner Network (APN) Blog post, you will explore a reference solution that presents a comprehensive perspective on a functional multi-tenant serverless SaaS environment. This solution effectively showcases various essential components required to construct a multi-tenant SaaS solution using serverless services, including onboarding processes, tenant isolation mechanisms, data partitioning techniques, a tenant deployment pipeline, and robust observability measures.

By delving into these aspects, you can gain valuable insights into the architecture and design considerations involved in creating a successful multi-tenant SaaS solution.

Take me to this AWS APN blogpost!

Tenant registration flow

Amazon EKS SaaS deep dive: A multi-tenant EKS SaaS solution

In this re:Invent 2021 presentation, Tod Golding, Principal Partner Solutions Architect, chats about a SaaS reference solution that addresses fundamental multi-tenant considerations, examining its approach to core SaaS topics, including tenant isolation, identity, onboarding, tenant administration, and data partitioning. The goal is to explore an Amazon EKS SaaS architecture through the lens of working code and highlight the key architectural strategies that were used in this reference environment.

There is also valuable information available on GitHub regarding EKS multi-tenancy. Exploring the related GitHub repositories can provide further insights, resources, and practical examples for implementing multi-tenant architectures on Amazon EKS. This presentation is an engaging way to dive deeper into this topic and gain a more comprehensive understanding of best practices and real-world implementations.

Take me to this video!

Tenant deployment model

SaaS Storage Strategies

Storage represents a challenging aspect of building and delivering multi-tenant software solutions. There are different strategies that can be used to partition tenant data, each with a unique set of trade-offs for implementing separation between tenants. This whitepaper covers different storage models for multi-tenancy; in particular, you can learn about the:

  • Silo model (data from the tenant is fully isolated)
  • Pool model (all the tenants use the same database and table)
  • Bridge model (single database but a different table for each tenant)

For each of these models, the whitepaper describes in detail how they can be implemented, as well as the different trade-offs in terms of isolation and agility. You can also discover how these tenancy models can be implemented specifically on databases, such as Amazon DynamoDB and Amazon Relational Database Service, thus covering both NoSQL and SQL scenarios.
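
As a hedged illustration of the pool model on Amazon DynamoDB (not taken from the whitepaper itself), the sketch below keys every request by a tenant identifier so reads are scoped to a single tenant even though the table is shared; the table and attribute names are assumptions:

# Hypothetical pool-model access on DynamoDB: all tenants share one table and
# every query is keyed by tenant_id, so isolation is enforced at the data layer.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SaasOrders")  # shared ("pooled") table, assumed name


def list_orders(tenant_id: str) -> list:
    """Return only the calling tenant's items; the key condition prevents
    cross-tenant reads even though the table is shared."""
    response = table.query(KeyConditionExpression=Key("tenant_id").eq(tenant_id))
    return response["Items"]

In the silo model, each tenant would instead get its own table or database, trading higher operational overhead for stronger isolation.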

Take me to this whitepaper!

Partitioning model tradeoffs

See you next time!

Thanks for joining our conversation on multi-tenant SaaS architectures! Next time, we’ll talk about open-source technologies.

To find all the blogs from this series, you can check out the Let’s Architect! list of content on the AWS Architecture Blog.

Large-scale digital biomarker computation with AWS serverless services

Post Syndicated from Avni Patel original https://aws.amazon.com/blogs/architecture/large-scale-digital-biomarker-computation-with-aws-serverless-services/

Digital biomarkers are quantitative, objective measures of physiological and behavioral data. They are collected and measured using digital devices that better represent free-living activity in contrast to a highly structured in-clinic setting. This approach generates large amounts of data that requires processing.

Digital biomarker data is typically saved in different formats, and various sources require different (and often multiple) processing steps. In 2020, the Digital Sciences and Translational Imaging group at Pfizer developed a specialized pipeline using AWS services for analyzing incoming sensor data from wrist devices for sleep and scratch activity.

In time, the original pipeline needed updating to compute digital biomarkers at scale while maintaining reproducibility and data provenance. The key component in re-designing the pipeline is flexibility so the platform can:

  • Handle various incoming data sources and different algorithms
  • Handle distinct sets of algorithm parameters

These goals were accomplished with a framework for handling file analysis that uses a two-part approach:

  1. A Python package using a custom common architecture
  2. An AWS-based pipeline to handle processing mapping and computing distribution

This blog post introduces the custom Python package and the data processing pipeline built around it using AWS services. The AWS architecture maintains data provenance while enabling fast, efficient, and scalable data processing. These large data sets would otherwise consume significant local resources.

Let’s explore each part of this framework in detail.

Custom Python package for data processing

Computation of digital biomarkers requires specific algorithms. Typically, these algorithms—for example, gait, activity, and more—had separate code repositories. Little thought was given to interaction and reproducibility.

Pfizer addressed this issue by creating the SciKit-Digital-Health Python package (SKDH) to implement these algorithms and data ingestion methods. As the SKDH framework is designed with reproducibility and integration in mind, modules can be chained together in a pipeline structure and run sequentially. Pipelines can be saved and loaded later, a key feature that enables dynamic pipeline selection and reproducibility.
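
The exact SKDH interface is documented in the open-source package itself; the sketch below only illustrates the chaining and save pattern described above, and its class and method names are illustrative assumptions rather than the real SKDH API:

# Illustrative only (not the actual SKDH API): a pipeline chains processing
# modules, runs them sequentially, and serializes its definition so the same
# analysis can be reproduced later.
import json


class LoadAccelerometer:
    name = "load_accelerometer"

    def run(self, data: dict) -> dict:
        data["accel"] = [[0.0, 0.0, 1.0]]  # stubbed sensor ingestion
        return data


class SleepMetrics:
    name = "sleep_metrics"

    def run(self, data: dict) -> dict:
        data["total_sleep_minutes"] = 431  # stubbed biomarker computation
        return data


class Pipeline:
    def __init__(self):
        self.steps = []

    def add(self, step) -> "Pipeline":
        self.steps.append(step)
        return self

    def run(self, data: dict) -> dict:
        for step in self.steps:          # modules execute sequentially
            data = step.run(data)
        return data

    def save(self, path: str) -> None:   # persisting the definition enables
        with open(path, "w") as f:       # dynamic selection and reproducibility
            json.dump([s.name for s in self.steps], f)


pipeline = Pipeline().add(LoadAccelerometer()).add(SleepMetrics())
result = pipeline.run({})
pipeline.save("sleep_pipeline.json")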

AWS framework for file analysis

While SKDH is the core that computes the digital biomarkers, it requires an operational framework for handling all the data files. To that end, Pfizer and AWS designed the following architecture to handle file analysis on a study-by-study basis, as shown in Figure 1:

Figure 1. Digital biomarker catalog and file processing workflow

The overall platform can be broken into two components: the Catalog and File Processing AWS Step Functions workflows. Let’s explore each of these components, along with the necessary configuration file to process a study.

Catalog Step Functions workflow

The Catalog workflow searches study Amazon Simple Storage Service (Amazon S3) buckets for files that need processing. These files are defined in a study configuration document. The workflow consists of the following steps (a simplified sketch of the cataloging logic follows the list):

  1. The Catalog Step Function is manually triggered.
  2. If the study configuration file is found and the S3 Study buckets specified in the configuration file exist, search the S3 Study buckets for any files matching those enumerated in the study configuration file.
  3. If the configuration file or S3 Study bucket are not found, send a message to an Amazon Simple Queue Service (Amazon SQS) queue.
  4. For any cataloged files, start a new Processing workflow for each of these files.
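
Below is a simplified, hypothetical sketch of that cataloging logic using Boto3; the configuration fields, queue, and state machine names are assumptions rather than the production implementation:

# Hypothetical cataloging logic: read the study configuration, find matching
# files in the study bucket, and start one processing execution per file.
import fnmatch
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
sfn = boto3.client("stepfunctions")


def catalog_study(config_bucket: str, config_key: str,
                  processing_state_machine_arn: str, error_queue_url: str) -> None:
    try:
        config = json.loads(
            s3.get_object(Bucket=config_bucket, Key=config_key)["Body"].read()
        )
    except s3.exceptions.NoSuchKey:
        # Missing configuration: notify via SQS instead of failing silently.
        sqs.send_message(QueueUrl=error_queue_url,
                         MessageBody=f"Missing study configuration: {config_key}")
        return

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=config["study_bucket"],
                                   Prefix=config.get("prefix", "")):
        for item in page.get("Contents", []):
            if any(fnmatch.fnmatch(item["Key"], p) for p in config["file_patterns"]):
                # Start one Processing workflow execution per cataloged file.
                sfn.start_execution(
                    stateMachineArn=processing_state_machine_arn,
                    input=json.dumps({"bucket": config["study_bucket"],
                                      "key": item["Key"]}),
                )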

File processing Step Functions workflow

The Processing workflow verifies processing requirements and processes both study data files and metadata files through the following steps (a sketch of the job submission step appears after the list):

  1. From the study configuration file, the workflow gets relevant details for the file. This includes metadata processing requirements as needed, and the specified set of SKDH algorithms to run.
  2. If the file contains metadata, a special AWS Lambda function cleans the data and saves it to the metadata Amazon DynamoDB table. Otherwise, check whether all requirements are met, including processing requirements (for example, height for gait algorithms) and logistics requirements (for example, file name information extraction functions).
  3. For missing requirements, the workflow will send a message to an Amazon SQS queue and try waiting for them to be met.
  4. If all requirements are met, it will start a batch job for the file.
  5. The batch job runs on AWS Fargate with a custom Docker image with a specified version of SKDH. This allows for easy versioning and reproducibility.
  6. The Docker image loads the specified SKDH pipeline and computes the digital biomarkers. Any results generated are then uploaded back to the S3 Study bucket at the end of processing.
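
As a hedged sketch of steps 4 and 5, the snippet below submits an AWS Batch job (running on Fargate with the pinned SKDH image) for a single file; the queue, job definition, and environment variable names are assumptions:

# Hypothetical job submission for one file once all requirements are met.
import boto3

batch = boto3.client("batch")


def start_processing_job(bucket: str, key: str, skdh_version: str,
                         job_queue: str = "biomarker-fargate-queue",
                         job_definition: str = "skdh-processing") -> str:
    # Job names may only contain letters, numbers, hyphens, and underscores.
    job_name = "".join(c if c.isalnum() else "-" for c in key)[:128]
    response = batch.submit_job(
        jobName=job_name,
        jobQueue=job_queue,            # Fargate-backed job queue (assumed name)
        jobDefinition=job_definition,  # references the custom SKDH Docker image
        containerOverrides={
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
                {"name": "SKDH_VERSION", "value": skdh_version},  # pins the version
            ]
        },
    )
    return response["jobId"]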

Study configuration file

The configuration file defines everything necessary for processing a study. This includes which S3 bucket(s) the study data is located in, under which prefixes, and the list of file patterns for files to be processed. Figure 2 offers a sample configuration file:

Figure 2. Sample study configuration file
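
Because the sample in Figure 2 is shown as an image, here is a hypothetical configuration expressed as a Python dictionary purely for illustration; the real schema and field names may differ:

# Hypothetical study configuration (illustrative only).
study_config = {
    "study_id": "STUDY-001",
    "study_bucket": "example-study-bucket",
    "prefix": "raw/wrist-device/",
    "file_patterns": ["*.bin", "*.csv"],
    "metadata_patterns": ["*subject_info*.csv"],
    "skdh_version": "0.9.10",
    "pipelines": {"sleep": "sleep_pipeline.json", "scratch": "scratch_pipeline.json"},
    "requirements": {"gait": ["height"]},
}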

Building a stack with AWS CloudFormation

AWS CloudFormation provisions and configures all of the resources for the pipeline, allowing AWS services to be easily added and removed. CloudFormation integration with GitHub makes it easy to keep track of service changes. AWS Step Functions can also be integrated into CloudFormation template files, which gives an AWS specification in essentially one master file.

To iterate and update the stack, a combination of Python, the AWS SDK for Python (Boto3), and AWS Command Line Interface (AWS CLI) tools are used. Python allows for a more intricate command line script with positional and keyword arguments depending on the process. For example, updating the pipeline is a simple command, where <env> is the dev/stage/prod environment that needs to be updated:

Figure 3. Updating the pipeline with a simple command

Meanwhile, the Python function would be something like the following:

Figure 4. Sample Python function

While it might seem unconventional to use both the SDK for Python and a call to the AWS CLI from Python in the same script, the AWS CLI has some useful features that make the CloudFormation packaging and stack creation much easier. The end of the example file shown in Figure 3 uses the Python argparse package, which makes it easy to write a user-friendly CLI.
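
Since Figures 3 and 4 are shown as images, here is a hedged sketch of what such an update script might look like: argparse provides the CLI, the AWS CLI handles packaging and deployment, and Boto3 reports the resulting stack status. The stack and bucket naming is an assumption:

# Hedged sketch of a stack update script in the spirit of Figures 3 and 4.
import argparse
import subprocess
import boto3


def update(env: str) -> None:
    stack_name = f"biomarker-pipeline-{env}"       # assumed naming convention
    artifact_bucket = f"biomarker-artifacts-{env}"  # assumed artifact bucket

    # Package local templates/assets and upload them to S3.
    subprocess.run(
        ["aws", "cloudformation", "package",
         "--template-file", "template.yaml",
         "--s3-bucket", artifact_bucket,
         "--output-template-file", "packaged.yaml"],
        check=True,
    )
    # Create or update the stack from the packaged template.
    subprocess.run(
        ["aws", "cloudformation", "deploy",
         "--template-file", "packaged.yaml",
         "--stack-name", stack_name,
         "--capabilities", "CAPABILITY_IAM"],
        check=True,
    )
    # Use Boto3 to report the resulting stack status.
    stacks = boto3.client("cloudformation").describe_stacks(StackName=stack_name)
    print(stacks["Stacks"][0]["StackStatus"])


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Update the biomarker pipeline stack")
    parser.add_argument("env", choices=["dev", "stage", "prod"])
    args = parser.parse_args()
    update(args.env)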

This could also be achieved using the AWS Cloud Development Kit (AWS CDK), an open-source software development framework that allows you to define infrastructure in familiar programming languages. AWS CDK uses constructs that can be customized and reused.

Conclusion

This blog post shared a custom data processing pipeline using AWS services that, through intentional integration with a Python package, can run arbitrary algorithms to process data. By using AWS as the framework for this pipeline, data provenance is maintained while enabling fast, efficient, and scalable processing of large amounts of data that would typically consume significant company resources.

Managing data confidentiality for Scope 3 emissions using AWS Clean Rooms

Post Syndicated from Sundeep Ramachandran original https://aws.amazon.com/blogs/architecture/managing-data-confidentiality-for-scope-3-emissions-using-aws-clean-rooms/

Scope 3 emissions are indirect greenhouse gas emissions that result from a company’s activities, but occur outside the company’s direct control or ownership. Measuring these emissions requires collecting data from a wide range of external sources, like raw material suppliers, transportation providers, and other third parties. One of the main challenges with Scope 3 data collection is ensuring data confidentiality when sharing proprietary information between third-party suppliers. Organizations are hesitant to share information that could potentially be used by competitors. This can make it difficult for companies to accurately measure and report on their Scope 3 emissions, which in turn limits their ability to manage climate-related impacts and risks.

In this blog, we show how to use AWS Clean Rooms to share Scope 3 emissions data between a reporting company and two of their value chain partners (a raw material purchased goods supplier and a transportation provider). Data confidentiality requirements are specified by each organization before participating in the AWS Clean Rooms collaboration (see Figure 1).

Figure 1. Data confidentiality requirements of reporting company and value chain partners

Each account has confidential data described as follows:

  • Column 1 lists the raw material Region of origin. This is business confidential information for the supplier.
  • Column 2 lists the emission factors at the raw material level. This is sensitive information for the supplier.
  • Column 3 lists the mode of transportation. This is business confidential information for the transportation provider.
  • Column 4 lists the emissions in transporting individual items. This is sensitive information for the transportation provider.
  • Rows in column 5 list the product recipe at the ingredient level. This is trade secret information for the reporting company.

Overview of solution

In this architecture, AWS Clean Rooms is used to analyze and collaborate on emission datasets without sharing, moving, or revealing underlying data to collaborators (shown in Figure 2).

Figure 2. Architecture for AWS Clean Rooms Scope 3 collaboration

Three AWS accounts are used to demonstrate this approach. The Reporting Account creates a collaboration in AWS Clean Rooms and invites the Purchased Goods Account and Transportation Account to join as members. All accounts can protect their underlying data with privacy-enhancing controls to contribute data directly from Amazon Simple Storage Service (S3) using AWS Glue tables.

The Purchased Goods Account includes users who can update the purchased goods bucket. Similarly, the Transportation Account has users who can update the transportation bucket. The Reporting Account can run SQL queries on the configured tables. AWS Clean Rooms only returns results complying with the analysis rules set by all participating accounts.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Although Amazon S3 and AWS Clean Rooms are free-tier eligible, a low fee applies to AWS Glue. Clean-up actions are provided later in this blog post to minimize costs.

Configuration

We configured the S3 buckets for each AWS account as follows:

  • Reporting Account: reportingcompany.csv
  • Purchased Goods Account: purchasedgoods.csv
  • Transportation Account: transportation.csv

Create an AWS Glue Data Catalog for each S3 data source following the method in the Glue Data Catalog Developer Guide. The AWS Glue tables should match the schema detailed previously in Figure 1, for each respective account (see Figure 3).

Figure 3. Configured AWS Glue table for ‘Purchased Goods’

Data consumers can be configured to ingest, analyze, and visualize queries (refer back to Figure 2). We will tag the Reporting Account Glue Database as “reporting-db” and the Glue Table as “reporting.” Likewise, the Purchased Goods Account will have “purchase-db” and “purchase” tags.

Security

Additional actions are recommended to secure each account in a production environment, such as configuring encryption, AWS Identity and Access Management (IAM) roles, and Amazon CloudWatch monitoring. To configure encryption, review the Further Reading section at the end of this post.

Walkthrough

This walkthrough consists of four steps:

  1. The Reporting Account creates the AWS Clean Rooms collaboration and invites the Purchased Goods Account and Transportation Account to share data.
  2. The Purchased Goods Account and Transportation Account accepts this invitation.
  3. Rules are applied for each collaboration account restricting how data is shared between AWS Clean Rooms collaboration accounts.
  4. The SQL query is created and run in the Reporting Account.

1. Create the AWS Clean Rooms collaboration in the Reporting Account

(The steps covered in this section require you to be logged into the Reporting Account.)

  • Navigate to the AWS Clean Rooms console and click Create collaboration.
  • In the Details section, type “Scope 3 Clean Room Collaboration” in the Name field.
  • Scroll to the Member 1 section. Enter “Reporting Account” in the Member display name field.
  • In Member 2 section, enter “Purchased Goods Account” for your first collaboration member name, with their account number in the Member AWS account ID box.
  • Click Add another member and add “Transportation Account” as the third collaborator with their AWS account number.
  • Choose the “Reporting Account” as the Member who can query and receive results in the Member abilities section. Click Next.
  • Select Yes, join by creating membership now. Click Next.
  • Verify the collaboration settings on the Review and Create page, then select Create and join collaboration and create membership.

Both accounts will then receive an invitation to accept the collaboration (see Figure 4). The console reveals each member status as “Invited” until accepted. Next, we will show how the invited members apply query restrictions on their data.

Figure 4. New collaboration created in AWS Clean Rooms

2. Accept invitations and configure table collaboration rules

Steps in this section are applied to the Purchased Goods Account and Transportation Account following collaboration environment setup. For brevity, we will demonstrate steps using the Purchased Goods Account. Differences for the Transportation Account are noted.

  • Log in to the AWS account owning the Purchased Goods Account and accept the collaboration invitation.
  • Open the AWS Clean Rooms console and select Collaborations on the left-hand navigation pane, then click Available to join.
  • You will see an invitation from the Scope 3 Clean Room Collaboration. Click on Scope 3 Clean Room Collaboration and then Create membership.
  • Select Tables, then Associate table. Click Configure new table.

The next action is to associate the Glue table created from the purchasedgoods.csv file. This sequence restricts access to the origin_region column (transportation_mode for the Transportation Account table) in the collaboration.

  • In the Scope 3 Clean Room Collaboration, select Configured tables in the left-hand pane, then Configure new table. Select the AWS Glue table associated with purchasedgoods.csv (shown in Figure 5).
  • Select the AWS Glue Database (purchase-db) and AWS Glue Table (purchase).
  • Verify the correct table section by toggling View schema from the AWS Glue slider bar.
  • In the Columns allowed in collaboration section, select all fields except for origin_region. This action prevents the origin_region column being accessed and viewed in the collaboration.
  • Complete this step by selecting Configure new table.

Figure 5. Purchased Goods account table configuration

  • Select Configure analysis rule (see Figure 6).
  • Select Aggregation type then Next.
  • Select SUM as the Aggregate function and s3_upstream_purchased_good for the column.
  • Under Join controls, select Specify Join column. Select “item” from the list of options. This permits SQL join queries to execute on the “item” column. Click Next.

Figure 6. Table rules for the Purchased Goods account

  • The next page specifies the minimum number of unique rows to aggregate for the “join” command. Select “item” for Column name and “2” for the Minimum number of distinct values. Click Next.
  • To confirm the table configuration query rules, click Configure analysis rule.
  • The final step is to click Associate to collaboration and select Scope 3 Clean Room Collaboration in the pulldown menu. Select Associate table after page refresh.

The procedure in this section is repeated for the Transportation Account, with the following exceptions:

  1. The columns shared in this collaboration are item, s3_upstream_transportation, and unit.
  2. The Aggregation function is a SUM applied on the s3_upstream_transportation column.
  3. The item column has an Aggregation constraint minimum of two distinct values.

3. Configure table collaboration rules inside the Reporting Account

At this stage, member account tables are created and shared in the collaboration. The next step is to configure the Reporting Account tables in the Reporting Account’s AWS account.

  • Navigate to AWS Clean Rooms. Select Configured tables, then Configure new table.
  • Select the Glue database and table associated with the file reportingcompany.csv.
  • Under Columns allowed in collaboration, select All columns, then Configure new table.
  • Configure collaboration rules by clicking Configure analysis rule using the Guided workflow.
  • Select Aggregation type, then Next.
  • Select SUM as the Aggregate function and ingredient for the column (see Figure 7).
  • Only SQL join queries can be executed on the ingredient column by selecting it in the Specify join columns section.
  • In the Dimension controls, select product. This option permits grouping by product name in the SQL query. Select Next.
  • Select None in the Scalar functions section. Click Next. Read more about scalar functions in the AWS Clean Rooms User Guide.

Figure 7. Table rules for the Reporting account

  • On the next page, select ingredient for Column name and 2 for the Minimum number of distinct values. Click Next. To confirm query control submission, select Configure analysis rule on the next page.
  • Validate the setting in the Review and Configure window, then select Next.
  • Inside the Configured tables tab, select Associate to collaboration. Assign the table to the Scope 3 Clean Rooms Collaboration.
  • Select the Scope 3 Clean Room Collaboration in the dropdown menu. Select Choose collaboration.
  • On the Scope 3 Clean Room Collaboration page, select reporting, then Associate table.

4. Create and run the SQL query

Queries can now be run inside the Reporting Account (shown in Figure 8).

Figure 8. Query results in the Clean Rooms Reporting Account

  • Select an S3 destination to output the query results. Select Action, then Set results settings.
  • Enter the S3 bucket name, then click Save changes.
  • Paste this SQL snippet inside the query text editor (see Figure 8):

SELECT
  r.product AS "Product",
  SUM(p.s3_upstream_purchased_good) AS "Scope_3_Purchased_Goods_Emissions",
  SUM(t.s3_upstream_transportation) AS "Scope_3_Transportation_Emissions"
FROM
  reporting r
  INNER JOIN purchase p ON r.ingredient = p.item
  INNER JOIN transportation t ON p.item = t.item
GROUP BY
  r.product

  • Click Run query. The query results should appear after a few minutes on the initial query, but will take less time for subsequent queries.
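
The walkthrough uses the console, but the same protected query could also be started programmatically. The sketch below is an illustrative assumption using the AWS Clean Rooms API via Boto3; the membership identifier, bucket, and exact parameter shapes should be verified against the current API reference before use:

# Assumption-laden sketch: starting the same aggregation as a protected query
# through the Clean Rooms API instead of the console. Placeholder values in
# angle brackets must be replaced with your own identifiers.
import boto3

cleanrooms = boto3.client("cleanrooms")

QUERY = """
SELECT r.product AS "Product",
       SUM(p.s3_upstream_purchased_good) AS "Scope_3_Purchased_Goods_Emissions",
       SUM(t.s3_upstream_transportation) AS "Scope_3_Transportation_Emissions"
FROM reporting r
  INNER JOIN purchase p ON r.ingredient = p.item
  INNER JOIN transportation t ON p.item = t.item
GROUP BY r.product
"""

response = cleanrooms.start_protected_query(
    type="SQL",
    membershipIdentifier="<reporting-account-membership-id>",
    sqlParameters={"queryString": QUERY},
    resultConfiguration={
        "outputConfiguration": {
            "s3": {"resultFormat": "CSV", "bucket": "<results-bucket>", "keyPrefix": "scope3/"}
        }
    },
)
print(response["protectedQuery"]["id"])  # track the query by its identifier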

Conclusion

This example shows how AWS Clean Rooms can aggregate data across collaborators to produce total Scope 3 emissions for each product from purchased goods and transportation. This query was performed between three organizations without revealing underlying emission factors or proprietary product recipes to one another. This alleviates data confidentiality concerns and improves sustainability reporting transparency.

Clean Up

The following steps are taken to clean up all resources created in this walkthrough:

  • Member and Collaboration Accounts:
    1. AWS Clean Rooms: Disassociate and delete collaboration tables
    2. AWS Clean Rooms: Remove member account in the collaboration
    3. AWS Glue: Delete the crawler, database, and tables
    4. AWS IAM: Delete the AWS Clean Rooms service policy
    5. Amazon S3: Delete the CSV file storage buckets
  • Collaboration Account only:
    1. Amazon S3: delete the SQL query bucket
    2. AWS Clean Rooms: delete the Scope 3 Clean Room Collaboration

Further Reading:

Security Practices

Author Spotlight: Vittorio Denti, Machine Learning Engineer at Amazon

Post Syndicated from Elise Chahine original https://aws.amazon.com/blogs/architecture/author-spotlight-vittorio-denti-machine-learning-engineer-at-amazon/

The Author Spotlight series pulls back the curtain on some of AWS and Amazon’s most prolific authors. Read on to find out more about our very own Vittorio Denti’s journey, in his own words!


I’m Vittorio, and I work as a Machine Learning Engineer at Amazon. My journey in the engineering world began when I started studying Computer Engineering at Politecnico di Milano during my bachelor’s degree. It continued during the master’s degree, when I started focusing on data and artificial intelligence. During my master’s, I spent one year in Stockholm, at the KTH Royal Institute of Technology, where I studied machine learning and deep learning. I joined Amazon Web Services (AWS) in 2020 through Tech U, a graduate program for technical roles in AWS, where I started working in the area of cloud architecture. During my time in AWS, I have had the opportunity to become familiar with software architectures, the AWS cloud, and develop system design skills for distributed systems.

At AWS, I was able to grow from a technical standpoint by learning new technologies and architectural patterns for distributed systems. I’ve also had the opportunity to grow my personal perspective. I found amazing mentors: people with years of experience that were always available to talk with me, compare ideas, and explain complex topics. When I was studying at university, I developed a vertical set of technical skills, which I’m still using every day at work, and I’m now able to boost those skills by taking advantage of the pragmatism and mindset acquired through mentoring.

With AWS, I was mainly working on cloud architectures for customers but also on internal projects and blog posts. I had the opportunity to support customers in designing their machine learning architectures for model training and for serving real-time traffic, as well as working on different challenges like query optimization for databases or asynchronous communication patterns for microservices. This breadth of scenarios was fundamental to build horizontal skills outside of the machine learning domain and learning how to navigate unexplored areas.

At the end of 2022, I moved from AWS to Amazon, where I started working as a Machine Learning Engineer on Amazon Sponsored Display, a self-service solution from Amazon Ads. In this new role, I work as an engineer developing and productionizing some of the machine learning algorithms used in advertising. I really like this role because there are many opportunities to explore and build. Also, working on a product and understanding the long-term impact of architectural and engineering decisions is extremely valuable to grow as a professional.

By working at Amazon’s scale, there are many factors to consider before coming up with a solution. All of this is an amazing opportunity to learn more, experiment, and satisfy my curiosity. As you have understood by now, I always like facing new challenges, pushing myself out of the comfort zone, and being able to learn about new areas and technologies.

Some of my favorite moments so far are related to public and internal conferences that I have attended both with AWS and Amazon. In 2021, I had the opportunity to go to re:Invent 2021 in Las Vegas—the annual AWS cloud conference. In 2023, I went to Madrid for DevCon, the internal engineering conference for Amazon employees. These experiences have been extremely valuable to discover more about new technical content as well as for networking and learning from peers.

Vittorio (on the right) with his colleague Darren (on the left) at re:Invent 2021, supporting AWS Game Day

What’s on my mind

Machine learning is a growing area: it’s still fairly new and there is a lot of innovation happening. Engineers work to find the “optimal” way for running machine learning in production and balance all the different trade-offs that consistently arise. Machine learning engineering faces the complexities of software engineering and adds extra challenges deriving from working with data and models. It’s not only about developing software! We also have to work to explore, monitor, and (in some cases) build datasets and training models, plus productionize them. There are two main entities to consider (datasets and models) that do not usually appear in software engineering, and this brings extra complexity.

Some stimulating challenges are about increasing model performance without impacting latency, working with unbalanced datasets, and detecting model or data drift automatically and solving for it. This is just a set of fascinating engineering areas to explore for machine learning. While for some of these problems, the industry already has some answers or directions, the question is: how can we do this at scale and create mature and reliable solutions? In the next few years I would like to answer these questions. For instance, car manufacturers have well-established supply chains and mature manufacturing processes. To the best of my knowledge, we don’t have something comparable in the area of machine learning engineering… yet! We have a set of excellent artisans working at scale, good processes to ensure quality (like code reviews and testing), but the risk of failures or inefficiencies is still high because of the stochastic nature of the domain.

Other topics I’m passionate about are machine learning science and software architecture. I like reading research papers, understanding the latest innovations, and learning not only from their contributions, but also from the processes that the authors followed to achieve those results. There are many opportunities to transfer knowledge from other domains to machine learning engineering, and I’m always eager to learn more about science and engineering so I can run the latest innovations in production.

Favorite blog posts and resources

Design a data mesh with event streaming for real-time recommendations on AWS

In the past, data architectures were mainly designed around technologies rather than business domains. This changed in 2019, when Zhamak Dehghani introduced the data mesh. Data mesh is an application of Domain-Driven Design (DDD) principles to data architectures: data is organized into data domains, and the data is the product that the team owns and offers for consumption.

Designing a data architecture for working with machine learning at scale requires careful thought. You may have different data sources, consumers, and data coming from different areas of the organization. Also, your organization may grow and evolve throughout time, so you may have some domains coming up and others disappearing. In 2022, I published an AWS Big Data blog on designing a data mesh architecture with event streaming for machine learning.

In the blog, we design a data mesh with event streaming for a scenario that requires real-time music recommendations on AWS. Because ML applications may have multiple types of input data, we propose a solution that works both for data at rest as well as real-time streaming.

The complete data mesh with event streaming architecture uses two different data planes: one is dedicated for sharing data at rest (blue); the other one is for data in motion (red). Check the original blog to see how it was built step-by-step.

The Let’s Architect! series on the Architecture Blog

Let’s Architect! is an editorial initiative that has published a bi-weekly post on a specific topic since January 2022. Some of the most recent topics include event-driven architectures, microservices, data mesh, and machine learning. We gather four or five different AWS content pieces and provide a perspective on why that content is relevant. Content can be an AWS blog post, a whitepaper, a hands-on workshop with architectural patterns, or a video. We focus on architectural content that can be valuable for software architects as well as software engineers willing to learn more about system design and distributed systems.

This is not only a way to learn and stay familiar with different technical areas but also to contribute and share resources with the community. The initiative has had a strong influence, and we now have customers and even many of our colleagues waiting for the upcoming posts. I have been working on this initiative with Luca Mezzalira, Zamira Jaupaj, Federica Ciuffo and Laura Hyatt, and with the support of the AWS Architecture Blog.

The Let’s Architect! series

Amazon Builders’ Library

One of the most inspiring resources to learn from is the Amazon Builders’ Library. Its mission is to share with customers our experience of building secure and highly available services, and successful durable businesses, at Amazon and AWS.

I think it’s extremely valuable because we can learn directly from Amazon’s experience of running distributed systems at scale. It’s not only about the content itself or how they solved a specific challenge, but also about learning how they were able to formalize a problem and the journey they followed to achieve the solution. Also, if you wonder how some AWS services are designed under the hood… you can learn about typical patterns used to implement them by navigating through the resources in the library.

What you’ll find on the Amazon Builder’s Library

Let’s Architect! Designing microservices architectures

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-designing-microservices-architectures/

In 2022, we published Let’s Architect! Architecting microservices with containers. We covered integration patterns and some approaches for implementing microservices using containers. In this Let’s Architect! post, we want to drill down into microservices only, by focusing on the main challenges that software architects and engineers face while working on large distributed systems structured as a set of independent services.

There are many considerations to cover in detail within a broad topic like microservices. We should reflect on the organizational structure, automation pipelines, multi-account strategy, testing, communication, and many other areas. With this post we dive deep into the topic by analyzing the options for discoverability and connectivity available through Amazon VPC Lattice; then, we focus on architectural patterns for communication, mainly on asynchronous communication, as it fits very well into the paradigm. Finally, we explore how to work with serverless microservices and analyze a case study from Amazon, coming directly from the Amazon Builder’s Library.

AWS Container Day featuring Kubernetes

Modern applications are often built using a distributed microservices approach, which involves dividing the application into smaller, specialized services. Each of these services implements its own subset of functionality or business logic. To facilitate communication between these services, it is essential to have a method to authorize, route, and monitor network traffic. It is also important to be able to identify the root cause of an issue, whether it originates at the application, service, or network level.

Amazon VPC Lattice can offer a consistent way to connect, secure, and monitor communication between instances, containers, and serverless functions. With Amazon VPC Lattice, you can define policies for traffic management, network access, advanced routing, implement discoverability, and, at the same time, monitor how the traffic is flowing inside complex applications in near real time.

Take me to this video!

Amazon VPC Lattice service gives you a consistent way to connect, secure, and monitor communication between your services

Application integration patterns for microservices

Loosely coupled integration can help you design independent systems that can be developed and operated individually, plus increase the availability and reliability of the overall system landscape—particularly by using asynchronous communication. While there are many approaches for integration and conversation scenarios, it’s not always clear which approach is best for a given situation.

Join this re:Invent 2022 session to learn about foundational patterns for integration and conversation scenarios with an emphasis on loose coupling and asynchronous communication. Explore real-world use cases architected with cloud-native and serverless services, and receive guidance on choosing integration technology.

Take me to this re:Invent 2022 video!

Loosely coupled integration can help you design independent systems that can be developed and operated individually and can also increase the availability and reliability of the overall system

Design patterns for success in serverless microservices

Software engineers love patterns—proven approaches to well-known problems that make software development easier and set our projects up for success. In complex, distributed systems, such as microservices, patterns like CQRS and Event Sourcing help decouple and scale systems.

The first part of the video is all about introducing architectural patterns and their applications, while the second part contains a set of demos and examples from the AWS console.
In this session, we examine some typical patterns for building robust and performant serverless microservices, and how data access patterns can drive polyglot persistence.

Take me to this AWS Summit video!

With Event Sourcing data is stored as a series of events, instead of direct updates to data stores. Microservices replay events from an event store to compute the appropriate state of their own data stores
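
To make the event sourcing idea concrete, here is a minimal, generic Python sketch (not tied to any AWS service or to the session’s demos): state is never updated in place, it is recomputed by replaying the event log:

# Minimal event-sourcing sketch: the event log is the source of truth and the
# current balance is derived by replaying events, never updated in place.
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "deposited" or "withdrawn"
    amount: int


def apply(balance: int, event: Event) -> int:
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    return balance


def replay(events: list) -> int:
    balance = 0
    for event in events:   # a read model replays the store to rebuild its state
        balance = apply(balance, event)
    return balance


events = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
print(replay(events))  # 75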

Avoiding overload in distributed systems by putting the smaller service in control

If we don’t pay attention to the relative scale of a service and its clients, distributed systems with microservices can be at risk of overload. A common architecture pattern adopted by many AWS services consists of splitting the system in a control plane and a data plane.

This article drills down into this scenario to understand what could happen if the data plane fleet exceeds the scale of the control plane fleet by a factor of 100 or more. This can happen in a microservices-based architecture when service X recovers from an outage and starts sending a large number of requests to service Y. Without careful fine-tuning, this shift in behavior can overwhelm the smaller callee. With this resource, we want to share some mental models and design strategies that are beneficial for distributed systems and teams working on microservices architectures.

Take me to the Amazon Builders’ Library!

To stay up to date on the data plane’s operational state, the control plane can poll an Amazon S3 bucket into which data plane servers periodically write that information

See you next time!

Thanks for stopping by! Join us in two weeks when we’ll discuss multi-tenancy and patterns for SaaS on AWS.

To find all the blogs from this series, you can check out the Let’s Architect! list of content on the AWS Architecture Blog.

dApp authentication with Amazon Cognito and Web3 proxy with Amazon API Gateway

Post Syndicated from Nicolas Menciere original https://aws.amazon.com/blogs/architecture/dapp-authentication-with-amazon-cognito-and-web3-proxy-with-amazon-api-gateway/

If your decentralized application (dApp) must interact directly with AWS services like Amazon S3 or Amazon API Gateway, you must authorize your users by granting them temporary AWS credentials. This solution uses Amazon Cognito in combination with your users’ digital wallet to obtain valid Amazon Cognito identities and temporary AWS credentials for your users. It also demonstrates how to use Amazon API Gateway to secure and proxy API calls to third-party Web3 APIs.

In this blog, you will build a fully serverless decentralized application (dApp) called “NFT Gallery”. This dApp permits users to look up their own non-fungible tokens (NFTs) or any other NFT collection on the Ethereum blockchain using one of two Web3 providers’ HTTP APIs: Alchemy or Moralis. These APIs help integrate Web3 components into any web application without requiring blockchain technical knowledge or direct blockchain access.

Solution overview

The user interface (UI) of your dApp is a single-page application (SPA) written in JavaScript using ReactJS, NextJS, and Tailwind CSS.

The dApp interacts with Amazon Cognito for authentication and authorization, and with Amazon API Gateway to proxy data from the backend Web3 providers’ APIs.

Architecture diagram

Figure 1. Architecture diagram showing authentication and API request proxy solution for Web3

Prerequisites

Using the AWS SAM framework

You’ll use AWS SAM as your framework to define, build, and deploy your backend resources. AWS SAM is built on top of AWS CloudFormation and enables developers to define serverless components using a simpler syntax.

Walkthrough

Clone this GitHub repository.

Build and deploy the backend

The source code has two top level folders:

  • backend: contains the AWS SAM Template template.yaml. Examine the template.yaml file for more information about the resources deployed in this project.
  • dapp: contains the code for the dApp

1. Go to the backend folder and copy the prod.parameters.example file to a new file called prod.parameters. Edit it to add your Alchemy and Moralis API keys.

2. Run the following command to process the SAM template (review the sam build Developer Guide).

sam build

3. You can now deploy the SAM Template by running the following command (review the sam deploy Developer Guide).

sam deploy --parameter-overrides $(cat prod.parameters) --capabilities CAPABILITY_NAMED_IAM --guided --confirm-changeset

4. SAM will ask you some questions and will generate a samconfig.toml containing your answers.

You can edit this file afterwards as desired. Future deployments will use the .toml file and can be run using sam deploy. Don’t commit the samconfig.toml file to your code repository as it contains private information.

Your CloudFormation stack should be deployed after a few minutes. The Outputs should show the resources that you must reference in your web application located in the dapp folder.

Run the dApp

You can now run your dApp locally.

1. Go to the dapp folder and copy the .env.example file to a new file named .env. Edit this file to add the backend resources values needed by the dApp. Follow the instructions in the .env.example file.

2. Run the following command to install the JavaScript dependencies:

yarn

3. Start the development web server locally by running:

yarn dev

Your dApp should now be accessible at http://localhost:3000.

Deploy the dApp

The SAM template creates an Amazon S3 bucket and an Amazon CloudFront distribution, ready to serve your Single Page Application (SPA) on the internet.

You can access your dApp from the internet with the URL of the CloudFront distribution. It is visible in your CloudFormation stack Output tab in the AWS Management Console, or as output of the sam deploy command.

For now, your S3 bucket is empty. Build the dApp for production and upload the code to the S3 bucket by running these commands:

cd dapp
yarn build
cd out
aws s3 sync . s3://${BUCKET_NAME}

Replace ${BUCKET_NAME} by the name of your S3 bucket.

Automate deployment using SAM Pipelines

SAM Pipelines automatically generates deployment pipelines for serverless applications. If changes are committed to your Git repository, it automates the deployment of your CloudFormation stack and dApp code.

With SAM Pipeline, you can choose a Git provider like AWS CodeCommit, and a build environment like AWS CodePipeline to automatically provision and manage your deployment pipeline. It also supports GitHub Actions.

Read more about the sam pipeline bootstrap command to get started.

Host your dApp using Interplanetary File System (IPFS)

IPFS is a good solution to host dApps in a decentralized way. An IPFS gateway can serve as the origin for your CloudFront distribution and serve IPFS content over HTTP.

dApps are often hosted on IPFS to increase trust and transparency. With IPFS, your web application source code and assets are not tied to a DNS name and a specific HTTP host. They will live independently on the IPFS network.

Read more about hosting a single-page website on IPFS, and how to run your own IPFS cluster on AWS.

Secure authentication and authorization

In this section, we’ll demonstrate how to:

  • Authenticate users via their digital wallet using Amazon Cognito user pool
  • Protect your API Gateway from the public internet by authorizing access to both authenticated and unauthenticated users
  • Call Alchemy and Moralis third party APIs securely using API Gateway HTTP passthrough and AWS Lambda proxy integrations
  • Use the JavaScript Amplify Libraries to interact with Amazon Cognito and API Gateway from your web application

Authentication

Your dApp is usable by both authenticated and unauthenticated users. Unauthenticated users can look up NFT collections while authenticated users can also look up their own NFTs.

In your dApp, there is no login/password combination or Identity Provider (IdP) in place to authenticate your users. Instead, users connect their digital wallet to the web application.

To capture users’ wallet addresses and grant them temporary AWS credentials, you can use Amazon Cognito user pool and Amazon Cognito identity pool.

You can create a custom authentication flow by implementing an Amazon Cognito custom authentication challenge, which uses AWS Lambda triggers. This challenge requires your users to sign a generated message using their digital wallet. If the signature is valid, it confirms that the user owns this wallet address. The wallet address is then used as a user identifier in the Amazon Cognito user pool.

Figure 2 details the Amazon Cognito authentication process. Three Lambda functions are used to perform the different authentication steps (a simplified sketch of the first trigger follows the numbered steps).

Figure 2. Amazon Cognito authentication process

  1. To define the authentication success conditions, the Amazon Cognito user pool calls the “Define auth challenge” Lambda function (defineAuthChallenge.js).
  2. To generate the challenge, Amazon Cognito calls the “Create auth challenge” Lambda function (createAuthChallenge.js). In this case, it generates a random message for the user to sign. Amazon Cognito forwards the challenge to the dApp, which prompts the user to sign the message using their digital wallet and private key. The dApp then returns the signature to Amazon Cognito as a response.
  3. To verify if the user’s wallet effectively signed the message, Amazon Cognito forwards the user’s response to the “Verify auth challenge response” Lambda function (verifyAuthChallengeResponse.js). If True, then Amazon Cognito authenticates the user and creates a new identity in the user pool with the wallet address as username.
  4. Finally, Amazon Cognito returns a JWT Token to the dApp containing multiple claims, one of them being cognito:username, which contains the user’s wallet address. These claims will be passed to your AWS Lambda event and Amazon API Gateway mapping templates allowing your backend to securely identify the user making those API requests.
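
As a hedged illustration of the first trigger, here is a minimal Python sketch of a “Define auth challenge” handler; the repository’s actual implementation is JavaScript (defineAuthChallenge.js), so this only mirrors the decision logic Cognito expects:

# Simplified "Define auth challenge" Lambda trigger for the custom flow.
def handler(event, context):
    session = event["request"]["session"]  # previous challenge attempts

    if session and session[-1].get("challengeResult") is True:
        # The wallet signature was verified: issue tokens.
        event["response"]["issueTokens"] = True
        event["response"]["failAuthentication"] = False
    elif len(session) >= 3:
        # Too many failed signature attempts: stop the flow.
        event["response"]["issueTokens"] = False
        event["response"]["failAuthentication"] = True
    else:
        # Ask the client to answer a custom challenge (sign a generated message).
        event["response"]["issueTokens"] = False
        event["response"]["failAuthentication"] = False
        event["response"]["challengeName"] = "CUSTOM_CHALLENGE"
    return event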

Authorization

Amazon API Gateway offers multiple ways of authorizing access to an API route. This example showcases three different authorization methods:

  • AWS_IAM: Authorization with IAM Roles. IAM roles grant access to specific API routes or any other AWS resources. The IAM Role assumed by the user is granted by Amazon Cognito identity pool.
  • COGNITO_USER_POOLS: Authorization with Amazon Cognito user pool. API routes are protected by validating the user’s Amazon Cognito token.
  • NONE: No authorization. API routes are open to the public internet.

API Gateway backend integrations

HTTP proxy integration

The HTTP proxy integration method allows you to proxy HTTP requests to another API. The requests and responses can pass through as-is, or you can modify them on the fly using Mapping Templates.

This method is a cost-effective way to secure access to any third-party API. This is because your third-party API keys are stored in your API Gateway and not on the frontend application.

You can also activate caching on API Gateway to reduce the amount of API calls made to the backend APIs. This will increase performance, reduce cost, and control usage.

Inspect the GetNFTsMoralisGETMethod and GetNFTsAlchemyGETMethod resources in the SAM template to understand how you can use Mapping Templates to modify the headers, path, or query string of your incoming requests.

Lambda proxy integration

API Gateway can use AWS Lambda as backend integration. Lambda functions enable you to implement custom code and logic before returning a response to your dApp.

In the backend/src folder, you will find two Lambda functions:

  • getNFTsMoralisLambda.js: Calls Moralis API and returns raw response
  • getNFTsAlchemyLambda.js: Calls Alchemy API and returns raw response

To access your authenticated user’s wallet address from your Lambda function code, access the cognito:username claim as follows:

var wallet_address = event.requestContext.authorizer.claims["cognito:username"];
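
As a minimal sketch (not the exact code of the provided functions), a Lambda proxy handler can read that claim and return a response using the standard API Gateway Lambda proxy contract:

// Minimal sketch of a Lambda proxy handler reading the wallet address claim
exports.handler = async (event) => {
  const walletAddress = event.requestContext.authorizer.claims["cognito:username"];

  // Call a third-party NFT API with walletAddress here, then return the result
  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ wallet: walletAddress }),
  };
};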

Using Amplify Libraries in the dApp

The dApp uses the AWS Amplify Javascript Libraries to interact with Amazon Cognito user pool, Amazon Cognito identity pool, and Amazon API Gateway.

With Amplify Libraries, you can interact with the Amazon Cognito custom authentication flow, get AWS credentials for your frontend, and make HTTP API calls to your API Gateway endpoint.

The Amplify Auth library is used to perform the authentication flow: signing up, signing in, and responding to the Amazon Cognito custom challenge. Examine the ConnectButton.js and user.js files in the dapp folder.
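
The following is a minimal sketch of that flow, assuming the Amplify authentication flow type is set to CUSTOM_AUTH and that the challenge message is exposed as a public challenge parameter named message; the actual helpers in ConnectButton.js and user.js may differ.

// Minimal sketch of the custom authentication flow with the Amplify Auth library
import { Auth } from "aws-amplify";

export async function signInWithWallet(walletAddress, signMessage) {
  // Start the custom auth flow; Cognito responds with the challenge created by the Lambda trigger
  const cognitoUser = await Auth.signIn(walletAddress);

  // Ask the wallet to sign the challenge message (parameter name assumed to be "message")
  const signature = await signMessage(cognitoUser.challengeParam.message);

  // Send the signature back to Cognito as the custom challenge answer
  return Auth.sendCustomChallengeAnswer(cognitoUser, signature);
}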

To make API calls to your API Gateway, you can use the Amplify API library. Examine the api.js file in the dApp to understand how you can make API calls to different API routes. Note that some are protected by AWS_IAM authorization and others by COGNITO_USER_POOL.
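
As a rough sketch, an IAM-protected call with the Amplify API library looks like the following; the API name and route are placeholders, and the actual values are defined in api.js.

// Minimal sketch of calling an API Gateway route with the Amplify API library
import { API } from "aws-amplify";

export async function getMyNfts() {
  // Amplify signs the request with the Cognito identity pool credentials
  // when the route is protected by AWS_IAM authorization
  return API.get("dappApi", "/nfts", {});
}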

Based on the current authentication status, your users will automatically assume the CognitoAuthorizedRole or CognitoUnAuthorizedRole IAM Roles referenced in the Amazon Cognito identity pool. AWS Amplify will automatically use the credentials associated with your AWS IAM Role when calling an API route protected by the AWS_IAM authorization method.

Amazon Cognito identity pool allows anonymous users to assume the CognitoUnAuthorizedRole IAM Role. This allows secure access to your API routes or any other AWS services you configured, even for your anonymous users. Your API routes will then not be publicly available to the internet.

Cleaning up

To avoid incurring future charges, delete the CloudFormation stack created by SAM. Run the sam delete command or delete the CloudFormation stack in the AWS Management Console directly.

Conclusion

In this blog, we’ve demonstrated how to use different AWS managed services to run and deploy a decentralized web application (dApp) on AWS. We’ve also shown how to integrate securely with Web3 providers’ APIs, like Alchemy or Moralis.

You can use an Amazon Cognito user pool to create a custom authentication challenge and authenticate users with a cryptographically signed message. You can also secure access to third-party APIs using API Gateway, keeping your secrets safe on the backend.

Finally, you’ve seen how to host a single-page application (SPA) using Amazon S3 and Amazon CloudFront as your content delivery network (CDN).

Enable transparent connectivity to Oracle Data Guard environments using Amazon Route 53 CNAME records

Post Syndicated from Sudip Acharya original https://aws.amazon.com/blogs/architecture/enable-transparent-connectivity-to-oracle-data-guard-environments-using-amazon-route-53-cname-records/

Customers choose AWS for running their Oracle database workload to help increase resiliency, performance, and scalability of the database layer. A high availability (HA) solution for the database stack is an important aspect to consider when migrating or deploying Oracle databases in AWS to help ensure that the architecture can meet the service level agreement (SLA) of the application. Customers who run their Oracle databases on Amazon Elastic Compute Cloud (Amazon EC2) commonly choose Oracle Data Guard physical standby databases to help meet the HA and disaster recovery (DR) for their Oracle database workloads.

As discussed in this Oracle documentation, using role-based services with multiple listener endpoints in the connection URL or tnsnames.ora entry is the preferred way to transparently connect to the database layer that is part of a Data Guard configuration. However, some application components and driver configurations don’t support multiple hostnames in the connection URL. Those applications require a single hostname or IP for the clients to connect to the Data Guard environment.

This post talks about the concept of using an Amazon Route 53 CNAME record in a Data Guard environment on EC2 and lists the artifacts to automatically route the connection between primary and standby environments in a Data Guard configuration based on the database role.

Solution overview

To help avoid manual effort to update DNS entries or the tnsnames.ora file after a failover or switchover operation in a Data Guard environment, the solution uses an AFTER DB_ROLE_CHANGE trigger to automate the DNS failover process. This trigger runs a shell script on the database host, which in turn updates the CNAME records in Route 53 to reflect the role transition. The following diagram illustrates the solution architecture (Figure 1).

Figure 1. Solution architecture

The solution discussed in this post covers routing new database connection requests to the correct database after a Data Guard switchover activity. However, other factors, such as application/client TTL settings and how the connection pool invalidates connection handles created prior to the switchover, can cause the application to connect to a database with a different role (for example, read-write workloads connecting to the standby after a switchover) and generate errors such as ORA-16000: database or pluggable database open for read-only access. It is a best practice to verify the database role before using a connection handle for transactions, to confirm that the application is connected to the database with the expected role.
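
As a minimal sketch, assuming the node-oracledb driver and the orclrw alias defined later in this post, an application can verify the role before issuing read-write transactions:

// Minimal sketch (assumes the node-oracledb driver): verify the role before using a connection
const oracledb = require("oracledb");

async function assertPrimary(connectString) {
  const connection = await oracledb.getConnection({
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
    connectString, // for example "orclrw", resolved through tnsnames.ora
  });
  try {
    const result = await connection.execute("SELECT database_role FROM v$database");
    if (result.rows[0][0] !== "PRIMARY") {
      throw new Error("Connected database is not in the PRIMARY role");
    }
  } finally {
    await connection.close();
  }
}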

The following workflow depicts the sequence of events that happens during a failover or switchover activity in a Data Guard environment to enable seamless connectivity for the application:

  1. A role transition event occurs in the Data Guard environment.
  2. The event triggers the AFTER DB_ROLE_CHANGE trigger.
  3. The trigger runs the shell script on the EC2 instance using a scheduler job.
  4. The shell script updates the CNAME records in Route 53 to reflect the role transition.

Prerequisites

This post assumes the following prerequisites:

  • You should have an existing Data Guard configuration with one primary and one standby DB instance within a single VPC. Refer to the Oracle quick start template to deploy a Data Guard environment on Amazon EC2.
  • The steps discussed here are for self-managed Data Guard configuration on Amazon EC2 with Red Hat Linux AMI.
  • The scenario discussed in the post involves one primary and one standby database in the Data Guard configuration. For any other configurations, the scripts shown in this example require additional changes.
  • A private or public Route 53 hosted zone should be configured in the VPC where the DB environment exists.
  • The shell script uses the instance profile of the EC2 instance to run the AWS Command Line Interface (AWS CLI) commands. Make sure that the instance profile of the EC2 instances hosting the primary and standby databases has a policy attached that allows changing the record set in the hosted zone such as the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DBCnameFlipPolicy",
      "Effect": "Allow",
      "Action": [
        "route53:ChangeResourceRecordSets",
        "route53:ListResourceRecordSets"
      ],
      "Resource": "arn:aws:route53:::hostedzone/<<YourHostedZoneId>>"
    }
  ]
}
  • Nslookup, jq, and curl utilities must be installed on all of the DB hosts. If not installed, you can install the utility on RHEL Linux using the following command:
yum install -y bind-utils
yum install -y curl
yum install -y jq

Environment details

This post assumes a Data Guard configuration with two instances within a single VPC, one primary and one standby, with the following details and naming conventions:

  • Oracle database version – 19.10 configured in maximum performance mode with Active Data Guard
  • Route 53 domain name – mydbdomain
  • Database name – orcl
  • DB_UNIQUE_NAME – orcl_a and orcl_b
  • Instance names – orcl
  • Route 53 A record for the host in AZ1 – orcl-a-db.mydbdomain
  • Route 53 A record for the host in AZ2 – orcl-b-db.mydbdomain

Route 53 configuration

Two A records are created in Route 53 to point to the IPs of the primary and standby hosts. Two CNAME records are also created in Route 53, which are automatically updated during the Data Guard switchover and failover scenarios. The CNAME record orcl-rw.mydbdomain points to the instance in the primary role that can accept read/write transactions, and orcl-ro.mydbdomain points to the instance in the standby role that accepts read-only queries.

The A records configuration is as follows:

  • DB host IP in AZ1 (10.0.0.5 in this example) – orcl-a-db.mydbdomain
  • DB host IP in AZ2 (10.0.32.5 in this example) – orcl-b-db.mydbdomain

The CNAME records configuration is as follows:

  • orcl-a-db.mydbdomain – orcl-rw.mydbdomain
  • orcl-b-db.mydbdomain – orcl-ro.mydbdomain

The following screenshot shows the Route 53 console view of the domain mydbdomain.

The Route 53 console view of the domain mydbdomain

Figure 2. The Route 53 console view of the domain mydbdomain

TNS configuration

The following tnsnames.ora file entries show how connections can be made to primary and standby databases using the CNAME records without a dependency on the actual IP address of the EC2 instances that host primary and standby databases. The entry orcl_a always points to the instance on orcl-a-db.mydbdomain, and orcl_b always points to the instance on orcl-b-db.mydbdomain, regardless of their roles. The entries orclrw and orclro direct the connection to the databases playing primary and standby roles, respectively.

orcl_a =
  (description =
    (address = (protocol = tcp)(host = orcl-a-db.mydbdomain)(port = 1525))
    (connect_data =
      (server = dedicated)
      (service_name = orcl_a)
    )
  )

orcl_b =
  (description =
    (address = (protocol = tcp)(host = orcl-b-db.mydbdomain)(port = 1525))
    (connect_data =
      (server = dedicated)
      (service_name = orcl_b)
    )
  )

orclrw =
  (description =
    (address = (protocol = tcp)(host = orcl-rw.mydbdomain)(port = 1525))
    (connect_data =
      (server = dedicated)
      (service_name = orcl)
    )
  )

orclro =
  (description =
    (address = (protocol = tcp)(host = orcl-ro.mydbdomain)(port = 1525))
    (connect_data =
      (server = dedicated)
      (service_name = orcl)
    )
  )

To enable connectivity using orclrw and orclro TNS entries, you can use either a role-based service or a static listener registration entry in both the primary and standby listener, as shown in the following code:

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = orcl)
      (ORACLE_HOME = /opt/oracle/product/19c/dbhome_1)
      (SID_NAME = orcl)
    )
  )

Implement the solution

To implement an automated DNS update during an Oracle switchover or failover, we use an Oracle database trigger and a shell script. The following are the high-level steps for the entire workflow:

  1. Create a DB_ROLE_CHANGE ON DATABASE trigger on the primary database.
  2. The trigger in turn creates a DBMS Scheduler job that calls the shell script cname_switch.sh.
  3. The shell script updates the Route 53 CNAME entries.

Database trigger

Use the following code for the database trigger:

CREATE OR REPLACE TRIGGER sys.cname_flip_post_role_change 
AFTER DB_ROLE_CHANGE ON DATABASE
DECLARE
  v_db_name VARCHAR2(9);
  v_db_role VARCHAR2(16);
BEGIN
  SELECT DATABASE_ROLE  INTO v_db_role FROM V$DATABASE;
  SELECT DB_UNIQUE_NAME INTO v_db_name FROM V$DATABASE;

  IF v_db_role = 'PRIMARY' THEN
    BEGIN
      dbms_scheduler.drop_job('RW_CNAME_FLIP');
    EXCEPTION
      WHEN OTHERS THEN NULL;
    END;

    dbms_scheduler.create_job(
      job_name   => 'RW_CNAME_FLIP',
      job_type   => 'EXECUTABLE',
      number_of_arguments => 1,
      job_action => '/home/oracle/admin/bin/cname_switch.sh',
      enabled    => false,
      auto_drop  => true);

    dbms_scheduler.set_job_argument_value(
      job_name          => 'RW_CNAME_FLIP',
      argument_position => 1,
      argument_value    => v_db_name);

    BEGIN
      dbms_scheduler.run_job('RW_CNAME_FLIP');
    EXCEPTION
    WHEN OTHERS THEN
      raise_application_error(-20101, 'CNAME flip failed, check script error');
    END;

  END IF;

EXCEPTION
  WHEN OTHERS THEN
    raise_application_error(-20102, 'CNAME flip failed due to error: ' || SQLERRM);
END;
/

Shell script

This script determines the current CNAME, identifies the dependent A records, and maps the CNAME to the correct A records accordingly. The shell script is provided for reference and assumes the naming conventions for db_name and db_unique_name used in the sample configuration. You should review and modify the script to meet your specific requirements and organizational standards.

As per the example shown earlier, the shell script is placed in the location /home/oracle/admin/bin/cname_switch.sh.

Note: It’s common for production databases to be restored or cloned to lower environments. If the script runs in such an environment, it could change the CNAME entries unexpectedly. To mitigate this, the shell script includes a restore_safeguard function. This function checks that the IP assigned to the EC2 instance matches one of the A records configured for this database in Route 53. If no match is found, the script does not perform the CNAME failover.

#! /bin/bash
#set -x
​
# Variables may need to be changed to suit your environment
​
DB_NAME=$1
DB_IN=$1
echo "Orginal Input : ${DB_NAME}"
DB_NAME=`echo "${DB_NAME::-2}"`  # removing last 2 characters from DB_UNIQUE_NAME
DB_NAME=`echo "${DB_NAME}" | tr '[:upper:]' '[:lower:]'`
echo "Modified Input : ${DB_NAME}"
​
DB_DOMAIN=<<YOUR_AWS_ROUTE53_DOMAIN_NAME>>    # Update as per your AWS Route53 domian name
ZONE_ID=<<YOUR_AWS_ROUTE53_HOSTED_ZONE_ID>>   # Update as per your AWS Route53 hosted zone ID
EC2_METADATA='http://169.254.169.254/latest/dynamic/instance-identity/document'
​
# CNAME and A-record related variables :
​
RW_CNAME=`echo "${DB_NAME}-rw.${DB_DOMAIN}"`
RO_CNAME=`echo "${DB_NAME}-ro.${DB_DOMAIN}"`
A_CNAME=`echo "${DB_NAME}-a-db.${DB_DOMAIN}"`
B_CNAME=`echo "${DB_NAME}-b-db.${DB_DOMAIN}"`
​
REGION=`curl -s ${EC2_METADATA}|grep region|awk -F\" '{print $4}'`
​
# Log file configuration and file initialization
​
TS=`date +%Y%m%d_%H%M%S`
LOG_DIR=/tmp
CHANGE_SET_FILE=`echo "${LOG_DIR}/${DB_NAME}-CnameFlip-${TS}.json"`
LOG_FILE=`echo "${LOG_DIR}/${DB_NAME}-CnameFlip-${TS}.log"`
CONF_FILE=`echo "file://${CHANGE_SET_FILE}"`
​
# Function to check if current host IP matching with Route 53 configuration
​
IS_SAFE='Unsafe'
​
function restore_safeguard()
{
    AWS_TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
    LOCAL_IPV4=`curl -sH "X-aws-ec2-metadata-token: $AWS_TOKEN" -v http://169.254.169.254/latest/meta-data/local-ipv4`
    PUBLIC_IPV4=`curl -sH "X-aws-ec2-metadata-token: $AWS_TOKEN" -v http://169.254.169.254/latest/meta-data/public-ipv4`
    NOT_FOUND=`echo ${PUBLIC_IPV4} | grep '404 - Not Found' | wc -l`
​
    if [ ${NOT_FOUND} == 1 ]; then
       PUBLIC_IPV4='No Public IP Assigned'
    fi
​
    A_IP=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`A`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-a" '.[] | select( .Name | contains($DB_NAME)).Value')
​
    B_IP=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`A`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-b" '.[] | select( .Name | contains($DB_NAME)).Value')
​
    PREVIOUS_RW_ID=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`CNAME`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-rw" '.[] | select( .Name | contains($DB_NAME)).Value' | cut -d'-' -f2)
​
    if [ ${PREVIOUS_RW_ID} == 'a' ]; then
       RW_NODE_IP=${A_IP}
       RO_NODE_IP=${B_IP}
    else
       RW_NODE_IP=${B_IP}
       RO_NODE_IP=${A_IP}
    fi
​
    # Logging input values

    echo "Original Input  : ${DB_IN}"          | tee -a ${LOG_FILE}
    echo "Modified Input  : ${DB_NAME}"        | tee -a ${LOG_FILE}
    echo "Current RW ID   : ${PREVIOUS_RW_ID}" | tee -a ${LOG_FILE}
    echo "Host Private IP : ${LOCAL_IPV4}"     | tee -a ${LOG_FILE}
    echo "Host Public IP  : ${PUBLIC_IPV4}"    | tee -a ${LOG_FILE}
    echo "A Node IP       : ${A_IP}"           | tee -a ${LOG_FILE}
    echo "B Node IP       : ${B_IP}"           | tee -a ${LOG_FILE}
    echo "RW Node IP      : ${RW_NODE_IP}"     | tee -a ${LOG_FILE}
    echo "RO Node IP      : ${RO_NODE_IP}"     | tee -a ${LOG_FILE}
​
    if [ "${LOCAL_IPV4}" == "${RO_NODE_IP}" -o "${PUBLIC_IPV4}" == "${RO_NODE_IP}" ]; then
       IS_SAFE='Safe'
    else
       IS_SAFE='Unsafe'
    fi
}
​
restore_safeguard
​
if [ ${IS_SAFE} == 'Safe' ]; then
   echo "Safe for CNAME faliover..." | tee -a ${LOG_FILE}
else
   echo "Unsafe for CNAME faliover..." | tee -a ${LOG_FILE}
   echo "Aborting..."
   exit 1
fi
​
PRI_DB_ID=`nslookup ${RW_CNAME}|grep "canonical name"|cut -d'=' -f2|cut -d'-' -f2`
​
# Logging input values :
echo "Original Input     : ${DB_IN}"     | tee    ${LOG_FILE}
echo "Modified Input     : ${DB_NAME}"   | tee -a ${LOG_FILE}
echo "Current RW host ID : ${PRI_DB_ID}" | tee -a ${LOG_FILE}
​
echo -e "\nChange to be done : \n" | tee -a ${LOG_FILE}
​
if [ ${PRI_DB_ID} == 'a' ]; then
   echo "Changing ${RW_CNAME} from ${A_CNAME} to ${B_CNAME}" | tee -a ${LOG_FILE}
   echo "Changing ${RO_CNAME} from ${B_CNAME} to ${A_CNAME}" | tee -a ${LOG_FILE}
   TO_BE_RW_CNAME=${B_CNAME}
   TO_BE_RO_CNAME=${A_CNAME}
else
   echo "Changing ${RW_CNAME} from ${B_CNAME} to ${A_CNAME}" | tee -a ${LOG_FILE}
   echo "Changing ${RO_CNAME} from ${A_CNAME} to ${B_CNAME}" | tee -a ${LOG_FILE}
   TO_BE_RW_CNAME=${A_CNAME}
   TO_BE_RO_CNAME=${B_CNAME}
fi
​
R53_CHANGE=`echo -e "
{
  \"Comment\": \"Flip CNAMEs\",
  \"Changes\": [
    {
      \"Action\" : \"UPSERT\",
      \"ResourceRecordSet\" : {
        \"Name\" : \"${RW_CNAME}.\",
        \"Type\" : \"CNAME\",
        \"TTL\"  : 60,
        \"ResourceRecords\" : [{ \"Value\": \"${TO_BE_RW_CNAME}.\" }]
      }
    },
    {
      \"Action\" : \"UPSERT\",
      \"ResourceRecordSet\" : {
        \"Name\" : \"${RO_CNAME}\",
        \"Type\" : \"CNAME\",
        \"TTL\"  : 60,
        \"ResourceRecords\" : [{ \"Value\": \"${TO_BE_RO_CNAME}.\" }]
      }
    }
  ]
}
"`
​
echo -e "\nRoute53 Change Set :\n" | tee -a ${LOG_FILE}
echo ${R53_CHANGE} | tee -a ${LOG_FILE}
echo ${R53_CHANGE} > ${CHANGE_SET_FILE}
​
echo -e "\nCommand to Execute : " | tee -a ${LOG_FILE}
echo -e "\naws route53 change-resource-record-sets --hosted-zone-id ${ZONE_ID} \
         --change-batch ${CONF_FILE} \n" | tee -a ${LOG_FILE}
​
echo -e "\nExecution Result :\n"
aws route53 change-resource-record-sets --hosted-zone-id ${ZONE_ID} \
--change-batch ${CONF_FILE} | tee -a ${LOG_FILE}
​
echo -e "\nAfter Change :\n "
aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} | tee -a ${LOG_FILE}

Test the solution

The following screenshot shows the Route 53 console view of the domain mydbdomain before the switchover. The primary database is running on orcl-a-db.mydbdomain because orcl-rw.mydbdomain points to it.

Figure 3. Route 53 console view of the domain mydbdomain before the switchover

The following SQL displays the current role of both the primary and standby databases and the host names they are currently running on.

[oracle@ip-10-0-0-5 sql]$ cat db_info.sql

ALTER SESSION SET NLS_DATE_FORMAT='YYYY-MM-DD:HH24:MI';
set lines 150 pages 200
col HOST_NAME for a30 trunc

select d.NAME, d.db_unique_name, d.DATABASE_ROLE, d.OPEN_MODE, i.INSTANCE_NAME, 
i.HOST_NAME, i.STARTUP_TIME
from v$instance i, v$database d;

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclrw

SQL> @db_info

NAME   DB_UNIQUE_NAME  DATABASE_ROLE     OPEN_MODE             INSTANCE_NAME  HOST_NAME                       STARTUP_TIME
------ --------------- ----------------- --------------------- -------------- ------------------------------- ----------------
ORCL   orcl_a          PRIMARY           READ WRITE            orcl           ip-10-0-0-5.us-west-2.compute.  2020-05-24:01:47

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclro

SQL> @db_info

NAME   DB_UNIQUE_NAME  DATABASE_ROLE     OPEN_MODE             INSTANCE_NAME  HOST_NAME                       STARTUP_TIME
------ --------------- ----------------- --------------------- -------------- ------------------------------- ----------------
ORCL   orcl_b          PHYSICAL STANDBY  READ ONLY WITH APPLY  orcl           ip-10-0-32-5.us-west-2.compute. 2020-05-24:05:50

Let’s initiate the switchover:

[oracle@ip-10-0-0-5 sql]$ dgmgrl /
DGMGRL for Linux: Release 12.2.0.1.0 - Production on Wed May 27 06:42:51 2020

Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "orcl_a"
Connected as SYSDG.
DGMGRL> show configuration;

Configuration - awsguard

  Protection Mode: MaxPerformance
  Members:
  orcl_a - Primary database
    orcl_b - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 39 seconds ago)

DGMGRL> switchover to orcl_b;
Performing switchover NOW, please wait...
Operation requires a connection to database "orcl_b"
Connecting ...
Connected to "orcl_b"
Connected as SYSDBA.
New primary database "orcl_b" is opening...
Oracle Clusterware is restarting database "orcl_a" ...
Switchover succeeded, new primary is "orcl_b"
DGMGRL>
DGMGRL> show configuration;

Configuration - awsguard

  Protection Mode: MaxPerformance
  Members:
  orcl_b - Primary database
    orcl_a - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 67 seconds ago)

DGMGRL>

Now that the switchover is complete, let’s connect to the database using the orclrw and orclro TNS entries using the following code:

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclrw

SQL> @db_info

NAME   DB_UNIQUE_NAME  DATABASE_ROLE     OPEN_MODE             INSTANCE_NAME  HOST_NAME                       STARTUP_TIME
------ --------------- ----------------- --------------------- -------------- ------------------------------- ----------------
ORCL   orcl_b          PRIMARY           READ WRITE            orcl           ip-10-0-32-5.us-west-2.compute  2020-05-24:05:50


[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclro

SQL> @db_info

NAME   DB_UNIQUE_NAME  DATABASE_ROLE     OPEN_MODE             INSTANCE_NAME  HOST_NAME                       STARTUP_TIME
------ --------------- ----------------- --------------------- -------------- ------------------------------- ----------------
ORCL   orcl_a          PHYSICAL STANDBY  READ ONLY WITH APPLY  orcl           ip-10-0-0-5.us-west-2.compute.  2020-05-27:06:43

The following screenshot shows the Route 53 console view of the domain mydbdomain after the switchover. The primary database is now running on orcl-b-db.mydbdomain because orcl-rw.mydbdomain points to it.

Figure 4. Route 53 console view of the domain mydbdomain after the switchover

Conclusion

Application connectivity to a Data Guard environment can be challenging, especially when the application configuration doesn’t support multiple hostnames or listener endpoints. In this post, we discussed step-by-step details to enable seamless connectivity to Data Guard environments using Route 53 CNAME records, a database trigger, and a shell script. You can use these artifacts to direct the DB connections to the database with the right role seamlessly without application changes. If you are using Data Guard Observer for automated failover, another blog, Setup a high availability design for Oracle Data Guard (Fast-Start Failover) using Amazon Route 53 discusses an alternate mechanism to achieve the same result.

Let’s Architect! Designing serverless solutions

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-designing-serverless-solutions/

During his re:Invent 2022 keynote, Werner Vogels, AWS Vice President and Chief Technology Officer, emphasized the asynchronous nature of our world and the challenges associated with incorporating asynchronicity into our architectures. AWS serverless services can help users concentrate on the asynchronous aspects of their workloads, easing the execution of event-driven architectures and enabling the adoption of effective integration patterns for communication both within and beyond a bounded context.

In this edition of Let’s Architect!, we offer an in-depth exploration of the architecture of serverless AWS services, such as AWS Lambda. We also present a new workshop centered on design patterns employing serverless AWS services, which ultimately delivers valuable insights on implementing event-driven architectures within systems.

A closer look at AWS Lambda

This video is the perfect companion for those seeking to learn and master a Lambda architecture, empowering you to effectively leverage its capabilities in your workloads.

With the knowledge gained from this video, you will be well-equipped to design your functions’ code in a highly optimized manner, ensuring efficient performance and resource utilization. Furthermore, a comprehensive understanding of Lambda functions can help identify and apply the most suitable approach to cloud workloads, resulting in an agile and robust cloud infrastructure that meets a project’s unique requirements.

Take me to this video!

Discover how AWS Lambda functions work under the hood

Implementing an event-driven serverless story generation application with ChatGPT and DALL-E

This example of an event-driven serverless architecture showcases the power of leveraging AWS services and AI technologies to develop innovative solutions. Built upon a foundation of serverless services, including Amazon EventBridge, Amazon DynamoDB, Lambda, Amazon Simple Storage Service, and managed artificial intelligence (AI) services like Amazon Polly, this architecture demonstrates the capacity to create daily stories on a schedule. Using the EventBridge scheduler, a Lambda function is initiated every night to generate new content. The integration of AI services like ChatGPT and DALL-E further elevates the solution, as their compatibility with the serverless model enables efficient and dynamic content creation. This case serves as a testament to the potential of combining event-driven serverless architectures with cutting-edge AI technologies for inventive and impactful applications.

Take me to this Compute Blog post!

How to build an event-driven architecture with serverless AWS services integrating ChatGPT and DALL-E

AWS Workshop Studio: Serverless Patterns

The AWS Serverless Patterns workshop offers a comprehensive learning experience to enhance your understanding of architectural patterns applicable to serverless projects. Throughout the workshop, participants will delve into various patterns, such as synchronous and asynchronous implementations, tailored to meet the demands of modern serverless applications. This hands-on approach ensures a production-ready understanding, encompassing crucial topics like testing serverless workloads, establishing automation pipelines, and more. Take this workshop to elevate your serverless architecture knowledge!

Take me to the serverless workshop!

The high-level architecture of the workshop’s modules

Building Serverlesspresso: Creating event-driven architectures

Serverlesspresso is an event-driven, serverless workload that uses EventBridge and AWS Step Functions to coordinate events across microservices and support thousands of orders per day. This comprehensive session delves into design considerations, development processes, and valuable lessons learned from creating a production-ready solution. Discover practical patterns and extensibility options that contribute to a robust, scalable, and cost-effective application. Gain insights into combining EventBridge and Step Functions to address complex architectural challenges in larger applications.

Take me to this video!

How to leverage AWS Step Functions for orchestrating your workflows

See you next time!

Thanks for joining our conversation on serverless solutions! We’ll see you next time when we talk about AWS microservices.

Can’t get enough of the Let’s Architect! series? Visit the Let’s Architect! page of the AWS Architecture Blog!

Optimizing data with automated intelligent document processing solutions

Post Syndicated from Deependra Shekhawat original https://aws.amazon.com/blogs/architecture/optimizing-data-with-automated-intelligent-document-processing-solutions/

Many organizations struggle to effectively manage and derive insights from the large amount of unstructured data locked in emails, PDFs, images, scanned documents, and more. The variety of formats, document layouts, and text makes it difficult for any standard Optical Character Recognition (OCR) to extract key insights from these data sources.

To help organizations overcome these document management and information extraction challenges, AWS offers connected, pre-trained artificial intelligence (AI) service APIs that help drive business outcomes from these document-based rich data sources.

This blog post describes a cost-effective, scalable automated intelligent document processing solution that leverages a Natural Language Processing (NLP) engine built on Amazon Textract and Amazon Comprehend. This solution helps customers take advantage of industry-leading machine learning (ML) technology in their document workflows without the need for in-house ML expertise.

Customer document management challenges

Customers across industry verticals experience the following document management challenges:

  • Extraction process accuracy varies significantly when applied to diverse sources; specifically handwritten text, images, and scanned documents.
  • Existing scripting and rule-based solutions cannot provide customer domain or problem-specific classifiers.
  • Traditional document management systems cannot consider feedback from domain experts to improve the learning process.
  • Personally Identifiable Information (PII) data handling is not robust or customizable, causing data privacy leakage concerns.
  • Many manual interventions are required to complete the entire process.

Automated intelligent document processing solution

We introduced an automated intelligent document processing implementation to address key document management challenges. At the heart of the solution is an NLP engine that combines Amazon Textract for text extraction with Amazon Comprehend for entity recognition, classification, and PII detection.

The full solution also leverages other AWS services, as described in the following diagram (Figure 1) and the steps that follow, to develop and operate a cost-effective and scalable architecture for document processing. It effectively extracts text from document types including PDFs, images, scanned documents, Microsoft Excel workbooks, and more.

Figure 1: AI-based intelligent document processing engine

Solution overview

Let’s explore the automated intelligent document processing solution step by step.

  1. The document upload engine or business users upload the respective files or documents through a custom web application to the designated Amazon Simple Storage Service (Amazon S3) bucket.
  2. The event-based architecture signals an Amazon S3 push event to invoke the respective AWS Lambda function to start document pre-processing.
  3. The Lambda function evaluates the document payload, leverages Amazon Simple Queue Service (Amazon SQS) for async processing, prepares document metadata, stores it in Amazon DynamoDB, and calls the NLP engine to perform the information extraction process.
  4. The NLP engine leverages Amazon Textract for text extraction from a variety of sources and leverages document metadata to optimize the appropriate API calls (for example, form, tabular, or PDF).
    • Amazon Textract output is fed into Amazon Comprehend, which consumes the extracted text and performs entity parsing, line/paragraph-based sentiment analysis, and document/paragraph classification (a minimal sketch of these calls follows this list). For better accuracy, we leverage a custom classifier within Amazon Comprehend.
    • Amazon Comprehend also provides key APIs to mask PII data before it is used for any further consumption. The solution offers the ability to configure masking rules for each PII entity per masking requirements.
    • To ensure the solution has capability to handle data from Microsoft Excel workbooks, we developed a custom parser using Python running inside an AWS Lambda function. Depending on the document metadata, this function can be invoked.
  5. Output of Amazon Comprehend is then fed to ML models deployed using Amazon SageMaker depending on additional use cases configured by the customer to complement the overall process with ML-based recommendations, predictions, and personalization.
  6. Once the NLP engine completes its processing, the job completion notification event signals another AWS Lambda function and updates the status in the respective Amazon SQS queue.
  7. The Lambda post-processing function parses the resultant content generated by the NLP engine and stores it in the Amazon DynamoDB and Amazon S3 bucket. This step is responsible for the required data augmentation, key entities validation, and default value assignment to create a data structure that could be consumed by the presentation/visualization layer.
  8. Users get the flexibility to see the extracted information and compare it with the original document extract in the custom user interface (UI). They can provide their feedback on extraction and entity parsing accuracy. From a user access management perspective, Amazon Cognito provides authorization and authentication.
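
The following is a minimal sketch of the Amazon Textract and Amazon Comprehend calls described in step 4, using the AWS SDK for JavaScript v3; the bucket, key, and language code are placeholders, and the synchronous DetectDocumentText call is used here only for brevity.

// Minimal sketch of the NLP engine calls: text extraction, entity parsing, and PII detection
const { TextractClient, DetectDocumentTextCommand } = require("@aws-sdk/client-textract");
const { ComprehendClient, DetectEntitiesCommand, DetectPiiEntitiesCommand } = require("@aws-sdk/client-comprehend");

const textract = new TextractClient({});
const comprehend = new ComprehendClient({});

async function extractAndAnalyze(bucket, key) {
  // 1. Extract raw text from the document stored in Amazon S3
  const ocr = await textract.send(new DetectDocumentTextCommand({
    Document: { S3Object: { Bucket: bucket, Name: key } },
  }));
  const text = ocr.Blocks
    .filter((block) => block.BlockType === "LINE")
    .map((block) => block.Text)
    .join("\n");

  // 2. Parse entities and detect PII so it can be masked before further consumption
  const entities = await comprehend.send(new DetectEntitiesCommand({ Text: text, LanguageCode: "en" }));
  const pii = await comprehend.send(new DetectPiiEntitiesCommand({ Text: text, LanguageCode: "en" }));

  return { text, entities: entities.Entities, piiEntities: pii.Entities };
}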

Customer benefits

The automated intelligent document processing solution helps customers:

  • Increase overall document management efficiency by 50-60%, leveraging automation and nullifying manual interventions
  • Reduce in-house team involvement in administrative activities by up to 70% using integrated and connected processing workflows
  • Gain better visibility into key contractual obligations with features such as Document Classification (helps properly route documents to the respective process/team) and Obligation Extraction
  • Utilize a UI-based feedback mechanism for in-house domain experts/reviewers to see and validate the extracted information and offer feedback to inform further model training

From a cost-optimization perspective, depending on the document type and the required information, only the respective Amazon Textract API calls are submitted. (For example, it is not worth using form/table-based Textract API calls for a Know Your Customer (KYC) document such as a driver’s license or passport when the AnalyzeID API is the most efficient solution.)
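
For example, a minimal sketch of an AnalyzeID call with the AWS SDK for JavaScript v3 looks like the following; the bucket and key are placeholders.

// Minimal sketch of calling the Amazon Textract AnalyzeID API for identity documents
const { TextractClient, AnalyzeIDCommand } = require("@aws-sdk/client-textract");

const textract = new TextractClient({});

async function analyzeIdentityDocument(bucket, key) {
  const response = await textract.send(new AnalyzeIDCommand({
    DocumentPages: [{ S3Object: { Bucket: bucket, Name: key } }],
  }));

  // Each field comes back as a normalized type (for example FIRST_NAME, EXPIRATION_DATE) with a detected value
  return response.IdentityDocuments[0].IdentityDocumentFields.map((field) => ({
    type: field.Type.Text,
    value: field.ValueDetection.Text,
  }));
}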

To maximize solution benefits, customers should invest time in building well-defined taxonomies before using the document processing solution to accommodate their own use cases or industry domain-specific requirements. The taxonomy input highlights only the relevant keys and drives the appropriate actions when required keys are not extracted.

Vertical industry use cases

As mentioned, this document processing solution can be used across industry segments. Let’s explore some practical use cases. For example, it can help insurance industry professionals to accelerate claim processing and customer KYC-related processes. By extracting the key entities from the claim documents, mapping them against the customer defined taxonomy, and integrating with Amazon SageMaker models for anomaly detection (anomalous claims), insurance providers can improve claim management and customer satisfaction.

In the healthcare industry, the solution can help with medical records and report processing, key medical entity extraction, and customer data masking.

The document processing solution can help the banking industry by automating check processing and delivering the ability to extract key entities like payer, payee, date, and amount from the checks.

Conclusion

Manual document processing is resource-intensive, time-consuming, and costly. Customers need to allocate resources to process large volumes of documents, lowering business agility. Their employees perform manual “stare and compare” tasks, potentially reducing worker morale and preventing them from focusing where their efforts are better placed.

Intelligent document processing helps businesses overcome these challenges by automating the classification, extraction, and analysis of data. This expedites decision cycles, allocates resources to high-value tasks, and reduces costs.

Pre-trained APIs of AWS AI services allow for quick classification, extraction, and analysis of data from large numbers of documents. This solution also has industry-specific features that can quickly process specialized documents. This blog discussed the foundational architecture that helps accelerate the implementation of any specific document processing use case.

How Huron built an Amazon QuickSight Asset Catalogue with AWS CDK Based Deployment Pipeline

Post Syndicated from Corey Johnson original https://aws.amazon.com/blogs/big-data/how-huron-built-an-amazon-quicksight-asset-catalogue-with-aws-cdk-based-deployment-pipeline/

This is a guest blog post co-written with Corey Johnson from Huron.

Having an accurate and up-to-date inventory of all technical assets helps an organization keep track of its resources with metadata such as assigned owners, last updated date, who uses them, how frequently, and more. It helps engineers, analysts, and businesses access the most up-to-date release of a software asset, which brings accuracy to the decision-making process. By keeping track of this information, organizations can identify technology gaps, refresh cycles, and assets to expire as needed for archival.

In addition, an inventory of all assets is one of the foundational elements of an organization that enables the security and compliance team to audit the assets, improve privacy and security posture, and mitigate risk so that business operations run smoothly. Organizations may have different ways of maintaining an asset inventory, whether an Excel spreadsheet or a database with a fully automated system to keep it up to date, but the common objective is keeping it accurate. Although organizations can follow manual approaches to update the inventory records, it is recommended to build automation so that the inventory is accurate at any point in time.

The DevOps practices that revolutionized software engineering in the last decade have yet to come to the world of Business Intelligence solutions. Business intelligence tools by their nature use a paradigm of UI-driven development, with code-first practices being secondary or nonexistent. As the need for applications that can leverage the organization’s internal and client data increases, the same DevOps practices (BIOps) can drive and deliver quality insights more reliably.

In this post, we walk you through the solution that Huron built to manage the lifecycle of all Amazon QuickSight resources across the organization, in collaboration with the AWS Data Lab Resident Architect and the AWS Professional Services team.

About Huron

Huron is a global professional services firm that collaborates with clients to put possible into practice by creating sound strategies, optimizing operations, accelerating digital transformation, and empowering businesses and their people to own their future. By embracing diverse perspectives, encouraging new ideas, and challenging the status quo, Huron creates sustainable results for the organizations we serve. To help address its clients’ growing cloud needs, Huron is an AWS Partner.

Use Case Overview

Huron’s Business Intelligence use case represents visualizations as a service, where Huron has a core set of visualizations and dashboards available as products for its customers. The products exist in different industry verticals (healthcare, education, commercial) with independent development teams. Huron’s consultants leverage the products to provide insights as part of consulting engagements. The insights from the products help Huron’s consultants accelerate their customers’ transformation. As part of its overall suite of offerings, there are product dashboards that are featured in a software application following a standardized development lifecycle. In addition, these product dashboards may be forked for customer-specific customization to support a consulting engagement while still consuming from Huron’s productized data assets and datasets. In the next stage of the cycle, Huron’s consultants experiment with new data sources and insights that are in turn fed back into the product dashboards.

Challenges arise when a base reference analysis gets updated because of new feature releases or bug fixes, since all the customer visualizations created from it also need to be updated. To maintain the integrity of embedded visualizations, all metadata and lineage must be available to the parent application. This access to the metadata supports the need for updating visuals based on changes as well as automating row- and column-level security, ensuring customer data is properly governed.

In addition, a few customers request customizations on top of the base visualizations, for which the Huron team needs to create a replica of the base reference and then customize it for the customer. These are maintained by Huron’s consultants in the field rather than by the product development team. These customer-specific visualizations create operational overhead because they require Huron to keep track of new customer-specific visualizations and maintain them for future releases when the product visuals change.

Huron leverages Amazon QuickSight for their Business Intelligence (BI) reporting needs, enabling them to embed visualizations at scale with higher efficiency and lower cost. A large attraction for Huron to adopt QuickSight came from the forward-looking API capabilities that enable and set the foundation for a BIOps culture and technical infrastructure. To address the above requirement, Huron Global Product team decided to build a QuickSight Asset Tracker and QuickSight Asset Deployment Pipeline.

The QuickSight Asset Tracker serves as a catalogue of all QuickSight resources (datasets, analyses, templates, dashboards, and so on) and their interdependent relationships. It helps:

  • Create an inventory of all QuickSight resources across all business units
  • Enable dynamic embedding of visualizations and dashboards based on logged in user
  • Enable dynamic row and column level security on the dashboards and visualizations based on the logged-in user
  • Meet compliance and audit requirements of the organization
  • Maintain the current state of all customer specific QuickSight resources

The solution integrates an AWS CDK based pipeline to deploy QuickSight Assets that:

  • Supports Infrastructure-as-a-code for QuickSight Asset Deployment and enables rollbacks if required.
  • Enables separation of development, staging and production environments using QuickSight folders that reduces the burden of multi-account management of QuickSight resources.
  • Enables a hub-and-spoke model for Data Access in multiple AWS accounts in a data mesh fashion.

QuickSight Asset Tracker and QuickSight Asset Management Pipeline – Architecture Overview

The QuickSight Asset Tracker was built as an independent service, which was deployed in a shared AWS service account that integrated Amazon Aurora Serverless PostgreSQL to store metadata information, AWS Lambda as the serverless compute and Amazon API Gateway to provide the REST API layer.

It also integrated AWS CDK and AWS CloudFormation to deploy the product and customer-specific QuickSight resources and keep them in a consistent and stable state. The metadata of QuickSight resources, created using either the AWS console or the AWS CDK based deployment, is maintained in the Amazon Aurora database through the QuickSight Asset Tracker REST API service.

The CDK based deployment pipeline is triggered via a CI/CD pipeline which performs the following functions:

  1. Takes the ARN of the QuickSight assets (dataset, analysis, etc.)
  2. Describes the asset and its dependent resources, if selected (see the sketch after this list)
  3. Creates a copy of the resource in another environment (in this case a QuickSight folder) using CDK
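
As a rough illustration of step 2, the pipeline can describe an analysis and its dependent datasets with the AWS SDK for JavaScript v3; the account ID and analysis ID are placeholders, and error handling is omitted.

// Minimal sketch of describing a QuickSight analysis and its dependent datasets
const { QuickSightClient, DescribeAnalysisCommand } = require("@aws-sdk/client-quicksight");

const quicksight = new QuickSightClient({});

async function describeAnalysisAsset(awsAccountId, analysisId) {
  const { Analysis } = await quicksight.send(new DescribeAnalysisCommand({
    AwsAccountId: awsAccountId,
    AnalysisId: analysisId,
  }));

  // DataSetArns lists the dependent datasets the pipeline can copy alongside the analysis
  return { arn: Analysis.Arn, name: Analysis.Name, dataSetArns: Analysis.DataSetArns };
}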

The solution architecture integrated the following AWS services.

  • Amazon Aurora Serverless integrated as the backend database to store metadata information of all QuickSight resources with customer and product information they are related to.
  • Amazon QuickSight as the BI service using which visualization and dashboards can be created and embedded into the online applications.
  • AWS Lambda as the serverless compute service that gets invoked by online applications using Amazon API Gateway service.
  • Amazon SQS to store customer request messages, so that the AWS CDK based pipeline can read from it for processing.
  • AWS CodeCommit is integrated to store the AWS CDK deployment scripts, and AWS CodeBuild and AWS CloudFormation are integrated to deploy the AWS resources using an infrastructure-as-code approach.
  • AWS CloudTrail is integrated to audit user actions and trigger Amazon EventBridge rules when a QuickSight resource is created, updated or deleted, so that the QuickSight Asset Tracker is up-to-date.
  • Amazon S3 integrated to store metadata information, which is used by AWS CDK based pipeline to deploy the QuickSight resources.
  • AWS Lake Formation enables cross-account data access in support of the QuickSight data mesh.

The following provides a high-level view of the solution architecture.

Architecture Walkthrough:

The following provides a detailed walkthrough of the above architecture.

  • QuickSight Dataset, Template, Analysis, Dashboard and visualization relationships:
    • Steps 1 to 2 represent QuickSight reference analysis reading data from different data sources that may include Amazon S3, Amazon Athena, Amazon Redshift, Amazon Aurora or any other JDBC based sources.
    • Step 3 represents QuickSight templates being created from reference analysis when a customer specific visualization needs to be created and step 4.1 to 4.2 represents customer analysis and dashboards being created from the templates.
    • Steps 7 to 8 represent QuickSight visualizations getting generated from analysis/dashboard and step 6 represents the customer analysis/dashboard/visualizations referring their own customer datasets.
    • Step 10 represents a new fork being created from the base reference analysis for a specific customer, which will create a new QuickSight template and reference analysis for that customer.
    • Step 9 represents end users accessing QuickSight visualizations.
  • Asset Tracker REST API service:
    • Step 15.2 to 15.4 represents the Asset Tracker service, which is deployed in a shared AWS service account, where Amazon API Gateway provides the REST API layer, which invokes AWS Lambda function to read from or write to backend Aurora database (Aurora Serverless v2 – PostgreSQL engine). The database captures all relationship metadata between QuickSight resources, its owners, assigned customers and products.
  • Online application – QuickSight asset discovery and creation
    • Step 15.1 represents the front-end online application reading QuickSight metadata information from the Asset Tracker service to help customers or end users discover visualizations available and be able to dynamically render based on the user login.
    • Step 11 to 12 represents the online application requesting creation of new QuickSight resources, which pushes requests to Amazon SQS and then AWS Lambda triggers AWS CodeBuild to deploy new QuickSight resources. Step 13.1 and 13.2 represents the CDK based pipeline maintaining the QuickSight resources to keep them in a consistent state. Finally, the AWS CDK stack invokes the Asset Tracker service to update its metadata as represented in step 13.3.
  • Tracking QuickSight resources created outside of the AWS CDK Stack
    • Step 14.1 represents users creating QuickSight resources using the AWS Console and step 14.2 represents that activity getting logged into AWS CloudTrail.
    • Steps 14.3 to 14.5 represent triggering an EventBridge rule for CloudTrail activities that indicate a QuickSight resource being created, updated, or deleted, and then invoking the Asset Tracker REST API to register the QuickSight resource metadata (a sketch of such a rule follows this list).
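
The following is a hedged sketch of such an EventBridge rule created with the AWS SDK for JavaScript v3; the rule name, target, and exact list of event names are assumptions rather than the production configuration.

// Minimal sketch of an EventBridge rule that forwards QuickSight CloudTrail events to the Asset Tracker
const { EventBridgeClient, PutRuleCommand, PutTargetsCommand } = require("@aws-sdk/client-eventbridge");

const eventbridge = new EventBridgeClient({});

async function createQuickSightAuditRule(assetTrackerLambdaArn) {
  // CloudTrail management events for QuickSight arrive with this source and detail-type
  await eventbridge.send(new PutRuleCommand({
    Name: "quicksight-asset-tracker-sync",
    EventPattern: JSON.stringify({
      source: ["aws.quicksight"],
      "detail-type": ["AWS API Call via CloudTrail"],
      detail: {
        eventName: ["CreateAnalysis", "UpdateAnalysis", "DeleteAnalysis",
                    "CreateDashboard", "UpdateDashboard", "DeleteDashboard"],
      },
    }),
  }));

  await eventbridge.send(new PutTargetsCommand({
    Rule: "quicksight-asset-tracker-sync",
    Targets: [{ Id: "asset-tracker-target", Arn: assetTrackerLambdaArn }],
  }));
}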

Architecture Decisions:

The following are few architecture decisions we took while designing the solution.

  • Choosing Aurora database for Asset Tracker: We evaluated Amazon Neptune for the Asset Tracker database, because most of the metadata we capture primarily maintains relationships between QuickSight resources. But when we looked at the query patterns, we found that queries are always just one level deep, finding the parent of a specific QuickSight resource, which can be solved with a relational database’s primary key/foreign key relationship and a simple self-join SQL query. Because the query pattern does not require a graph database, we decided to go with Amazon Aurora to keep it simple, avoiding the introduction of a new database technology and reducing the operational overhead of maintaining it. In the future, as the use case evolves, we can evaluate the need for a graph database and plan for integrating it. For Amazon Aurora, we chose Amazon Aurora Serverless because the usage pattern is not consistent enough to reserve server capacity, and the serverless tech stack helps reduce operational overhead.
  • Decoupling Asset Tracker as a common REST API service: The Asset Tracker has future scope to be a centralized metadata layer that keeps track of all the QuickSight resources across all business units of Huron. Instead of each business unit having its own metadata database, building it as a service and deploying it in a shared AWS service account reduces operational overhead, avoids duplicate infrastructure cost, and provides a consolidated view of all assets and their integrations. The service provides the ability for applications to consume metadata about the QuickSight assets and then apply their own mapping of security policies to the assets based on their own application data and access control policies.
  • Central QuickSight account with subfolders for environments: The choice was made to use a central account, which reduces the developer friction of managing multiple accounts and multiple identities and spares end users from managing access to resources across accounts. QuickSight folders allow for appropriate permissions for separating “environments”. Furthermore, by using folder-based sharing with QuickSight groups, users with appropriate permissions already have access to the latest versions of QuickSight assets without having to share their individual identities.

The solution included an automated Continuous Integration (CI) and Continuous Deployment (CD) pipeline to deploy the resources from development to staging and then finally to production. The following provides a high-level view of the QuickSight CI/CD deployment strategy.

Aurora Database Tables and Reference Analysis update flow

The following are the database tables integrated to capture the QuickSight resource metadata.

  • QS_Dataset: This captures metadata of all QuickSight datasets that are integrated in the reference analysis or customer analysis. This includes AWS ARN (Amazon Resource Name), data source type, ID and more.
  • QS_Template: This table captures metadata of all QuickSight templates, from which customer analysis and dashboards will be created. This includes AWS ARN, parent reference analysis ID, name, version number and more.
  • QS_Folder: This table captures metadata about QuickSight folders which logically groups different visualizations. This includes AWS ARN, name, and description.
  • QS_Analysis: This table captures metadata of all QuickSight analysis that includes AWS ARN, name, type, dataset IDs, parent template ID, tags, permissions and more.
  • QS_Dashboard: This table captures metadata information of QuickSight dashboards that includes AWS ARN, parent template ID, name, dataset IDs, tags, permissions and more.
  • QS_Folder_Asset_Mapping: This table captures folder to QuickSight asset mapping that includes folder ID, Asset ID, and asset type.

As the solution moves to the next phase of implementation, we plan to introduce additional database tables to capture metadata information about QuickSight sheets and asset mapping to customers and products. We will extend the functionality to support visual based embedding to enable truly integrated customer data experiences where embedded visuals mesh with the native content on a web page.

While explaining the use case, we highlighted the challenge that arises when a base reference analysis gets updated: we need to track the templates inherited from it and make sure the change is pushed to the linked customer analyses and dashboards. The following example scenario explains how the database tables change when a reference analysis is updated.

Example Scenario: When “reference analysis” is updated with a new release

When a base reference analysis is updated because of a new feature release, then a new QuickSight reference analysis and template needs to be created. Then we need to update all customer analysis and dashboard records to point to the new template ID to form the lineage.

The following sequential steps represent the database changes that need to happen (a sketch of these statements follows the list).

  1. Insert a new record into the “Analysis” table to represent the new reference analysis.
  2. Insert a new record into the “Template” table with the new reference analysis ID, created in step 1, as its parent.
  3. Retrieve the “Analysis” and “Dashboard” table records that point to the previous template ID, and update those records with the new template ID created in step 2.
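
The following is a minimal sketch of those statements using the node-postgres (pg) client; the table and column names (parent_template_id, parent_analysis_id, and so on) are assumptions for illustration, and the real Asset Tracker schema may differ.

// Minimal sketch of the lineage updates when a new reference analysis release is registered.
// `client` is assumed to be a connected node-postgres (pg) Client instance.
async function registerNewRelease(client, newAnalysis, newTemplate, oldTemplateId) {
  // Step 1: record the new reference analysis
  await client.query(
    "INSERT INTO qs_analysis (id, arn, name) VALUES ($1, $2, $3)",
    [newAnalysis.id, newAnalysis.arn, newAnalysis.name]
  );

  // Step 2: record the new template with the new reference analysis as its parent
  await client.query(
    "INSERT INTO qs_template (id, arn, parent_analysis_id) VALUES ($1, $2, $3)",
    [newTemplate.id, newTemplate.arn, newAnalysis.id]
  );

  // Step 3: repoint customer analyses and dashboards from the old template to the new one
  await client.query(
    "UPDATE qs_analysis SET parent_template_id = $1 WHERE parent_template_id = $2",
    [newTemplate.id, oldTemplateId]
  );
  await client.query(
    "UPDATE qs_dashboard SET parent_template_id = $1 WHERE parent_template_id = $2",
    [newTemplate.id, oldTemplateId]
  );
}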

How will it enable a more robust embedding experience

The QuickSight Asset Tracker integration with Huron’s products provides users with a personalized, secure, and modern analytics experience. When users log in through Huron’s online application, it uses the logged-in user’s information to dynamically identify the products they are mapped to and then renders the QuickSight visualizations and dashboards that the user is entitled to see. This improves user experience, enables granular permission management, and increases performance.

How AWS collaborated with Huron to help build the solution

The AWS team collaborated with the Huron team to design and implement the solution. An AWS Data Lab Resident Architect worked with Huron’s lead architect on the initial architecture design, comparing different integration options and the tradeoffs between them before finalizing the architecture. Then, with the help of an AWS Professional Services engineer, we built the base solution, which the Huron team can extend to roll it out to all business units and integrate additional reporting features on top of it.

The AWS Data Lab Resident Architect program provides AWS customers with guidance in refining and executing their data strategy and solutions roadmap. Resident Architects are dedicated to customers for 6 months, with opportunities for extension, and help customers (Chief Data Officers, VPs of Data Architecture, and Builders) make informed choices and tradeoffs about accelerating their data and analytics workloads and implementation.

The AWS Professional Services organization is a global team of experts that can help customers realize their desired business outcomes when using the AWS Cloud. The Professional Services team works together with the customer’s team and their chosen member of the AWS Partner Network (APN) to execute their enterprise cloud computing initiatives.

Next Steps

Huron has rolled out the solution for one business unit. As a next step, we plan to roll it out to all business units so that the asset tracker service is populated with assets across the entire organization and provides a consolidated view.

In addition, Huron will build a reporting layer on top of the Amazon Aurora asset tracker database so that leadership can discover assets by business unit, by owner, by creation date range, or by reports that have not been updated in a while.

Once the asset tracker is populated with all QuickSight assets, it will be integrated into the front-end online application that can help end users discover existing assets and request creation of new assets.

Newer QuickSight APIs, such as assets-as-a-bundle and assets-as-code, further accelerate the capabilities of the service by improving the development velocity and reliability of making changes.
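
For illustration, the sketch below shows how an assets-as-a-bundle export of a reference analysis and its dependencies might look; the account ID, job ID, and polling approach are assumptions rather than part of Huron’s pipeline:

```python
import time
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")
ACCOUNT_ID = "111122223333"  # placeholder AWS account ID

def export_reference_analysis(analysis_arn: str, job_id: str = "ref-analysis-export-001") -> str:
    """Export an analysis and everything it depends on as a portable asset bundle."""
    quicksight.start_asset_bundle_export_job(
        AwsAccountId=ACCOUNT_ID,
        AssetBundleExportJobId=job_id,
        ResourceArns=[analysis_arn],
        IncludeAllDependencies=True,
        ExportFormat="QUICKSIGHT_JSON",
    )
    # Poll until the export finishes; the final response includes a bundle download URL.
    while True:
        job = quicksight.describe_asset_bundle_export_job(
            AwsAccountId=ACCOUNT_ID, AssetBundleExportJobId=job_id
        )
        if job["JobStatus"] in ("SUCCESSFUL", "FAILED"):
            return job.get("DownloadUrl", job["JobStatus"])
        time.sleep(5)
```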

Conclusion

This blog explained how Huron built an Asset Tracker to keep track of all QuickSight resources across the organization. The solution can serve as a reference for other organizations that would like to build an inventory of visualization reports, ML models, or other technical assets. This solution used Amazon Aurora as the primary database; an organization that also wants to build a detailed lineage of all assets, to understand how they are interrelated, can consider integrating Amazon Neptune as an additional database.

If you have a similar use case and would like to collaborate with AWS Data Analytics Specialist Architects to brainstorm on the architecture, rapidly prototype it, and implement a production-ready solution, connect with your AWS Account Manager or AWS Solutions Architect to start an engagement with the AWS Data Lab team.


About the Authors

Corey Johnson is the Lead Data Architect at Huron, where he leads its data architecture for their Global Products Data and Analytics initiatives.

Sakti Mishra is a Principal Data Analytics Architect at AWS, where he helps customers modernize their data architecture and define end-to-end data strategies, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

Let’s Architect! Getting started with containers

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-getting-started-with-containers/

Most AWS customers building cloud-native applications or modernizing applications choose containers to run their microservices applications to accelerate innovation and time to market while lowering their total cost of ownership (TCO). Using containers on AWS comes with other benefits, such as increased portability, scalability, and flexibility.

The combination of container technologies and AWS services also provides features such as load balancing, auto scaling, and service discovery, making it easier to deploy and manage applications at scale.

In this edition of Let’s Architect! we share useful resources to help you get started with containers on AWS.

Container Build Lens

This whitepaper describes the Container Build Lens for the AWS Well-Architected Framework. It helps customers review and improve their cloud-based architectures and better understand the business impact of their design decisions. The document describes general design principles for containers, as well as specific best practices and implementation guidance using the Six Pillars of the Well-Architected Framework.

Take me to the Container Build Lens!

Follow the Container Build Lens best practices to architect your container-based workloads.

EKS Workshop

The EKS Workshop is a useful resource to familiarize yourself with Amazon Elastic Kubernetes Service (Amazon EKS) by practicing on real use-cases. It is built to help users learn about Amazon EKS features and integrations with popular open-source projects. The workshop is abstracted into high-level learning modules, including Networking, Security, DevOps Automation, and more. These are further broken down into standalone labs focusing on a particular feature, tool, or use case.

Once you’re done experimenting with EKS Workshop, start building your environments with Amazon EKS Blueprints, a collection of Infrastructure as Code (IaC) modules that helps you configure and deploy consistent, batteries-included Amazon EKS clusters across accounts and regions following AWS best practices. Amazon EKS Blueprints are available in both Terraform and CDK.

Take me to this workshop!

The workshop is abstracted into high-level learning modules, including Networking, Security, DevOps Automation, and more.

Architecting for resiliency on AWS App Runner

Learn how to architect a highly available and resilient application using AWS App Runner. With App Runner, you can start with just the source code of your application or a container image. The complexity of running containerized applications is abstracted away, including the cloud resources needed for running your web application or API. App Runner manages load balancers, TLS certificates, auto scaling, logs, metrics, traceability, and more, so you can focus on implementing your business logic in a highly scalable and elastic environment.

Take me to this blog post!

A high-level architecture for an available and resilient application with AWS App Runner

Securing Kubernetes: How to address Kubernetes attack vectors

As part of designing any modern system on AWS, it is necessary to think about the security implications and what can affect your security posture. This session introduces the fundamentals of the Kubernetes architecture and common attack vectors. It also includes security controls provided by Amazon EKS and suggestions on how to address them. With these strategies, you can learn how to reduce risk for your Kubernetes-based workloads.

Take me to this video!

Some common attack vectors that need addressing with Kubernetes

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about serverless.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Prioritizing sustainable cloud architectures: a how-to round up

Post Syndicated from Kate Brierley original https://aws.amazon.com/blogs/architecture/prioritizing-sustainable-cloud-architectures-a-how-to-round-up/

With Earth Month upon us and in celebration of Earth Day tomorrow, 4/22, sustainability is top-of-mind for individuals and organizations around the world. But it doesn’t take a particular time of year to act on the urgent need to innovate and adopt smarter, more efficient solutions!

Sustainable cloud architectures are fundamental to sustainable workloads, and we’re spotlighting content that helps you build solutions to meet and advance sustainability goals. Here’s our round-up of recent posts to make sustainable architectures meaningful and actionable for customers of all kinds:

Architecting for Sustainability at AWS re:Invent 2022

This post spotlights the AWS re:Invent 2022 sustainability track and key conversations around sustainability of, in, and through the cloud. It covers key use cases and breakout sessions, including AWS customers demonstrating best practices from the AWS Well-Architected Framework Sustainability Pillar. Hear about these and more:

  • The Amazon Prime Video experience using the AWS sustainability improvement process for Thursday Night Football streaming
  • Pinterest’s sustainability journey with AWS from Pinterest Chief Architect David Chaiken

David Chaiken, Chief Architect at Pinterest, describes Pinterest’s sustainability journey with AWS

Let’s Architect! Architecting for Sustainability

The most recent sustainability focused Let’s Architect! series post shares practical tips for making cloud applications more sustainable. It also covers the AWS customer carbon footprint tool to help organizations monitor, analyze, and reduce their AWS footprint, and details how Amazon Prime Video used these tools to establish baselines and drive significant efficiencies across their AWS usage.

Prime Video case study for understanding how the architecture can be designed for sustainability

Optimizing your Modern Data AWS Infrastructure for Sustainability Series

This two-part blog series explores more specific topics relating to the Sustainability Pillar of the AWS Well-Architected Framework as connected to the Modern Data Architecture on AWS. What’s covered includes:

  1. Integrating a data lake and purpose-built data services to efficiently build analytics workloads to provide speed and agility at scale in Part 1 – Data Ingestion and Data Lake
  2. Guidance and best practices to optimize the components within the unified data governance, data movement, and purpose-built analytics pillars in Part 2 – Unified Data Governance, Data Movement, and Purpose-built Analytics

Modern Data Analytics Reference Architecture on AWS

How to Select a Region for your Workload Based on Sustainability Goals

Did you know that workload Region selection significantly affects KPIs, including performance, cost, and carbon footprint? For example, when an AWS Region is chosen based on the market-based method, emissions are calculated using the electricity that the business purchases. Contracting and purchasing electricity produced by renewable energy sources like solar and wind is more sustainable. Region selection is another part of the Well-Architected Framework Sustainability Pillar, and this blog post covers key considerations for choosing AWS Regions per workload.

Carbon intensity of electricity for South Central Sweden

Check back soon for more earth-friendly advice from our experts!

Mitigating DDoS with data science using AWS Shield Advanced and AWS WAF

Post Syndicated from Jasmine Maheshwari original https://aws.amazon.com/blogs/architecture/mitigating-ddos-with-data-science-using-aws-shield-advanced-and-aws-waf/

This blog post helps customers mitigate distributed denial-of-service (DDoS) attacks using AWS Shield Advanced, AWS WAF, and data science. We explore how to use these services along with machine learning (ML) to detect and mitigate DDoS attacks.

Bad actors conduct DDoS attacks using botnets. Through botnets, attackers look for zero-day vulnerabilities—specifically on network devices such as routers—and make these devices a part of their network. Bots spread around the world wait for instructions from control servers.

This post examines a real-world use case. Online payment solutions company Razorpay has a business-to-business (B2B) model where merchants invoke APIs for payments. Given the complex nature of B2B payments, the traffic patterns of small-scale and large-scale merchants need to be distinguished. This business model puts the company at risk for DDoS attacks, and here is how AWS helps them address this ongoing concern.

First, we introduce an initial DDoS architecture approach, and then describe how it was modified to meet Razorpay’s specific needs, as is possible for cross-industry clients with varying requirements.

Basic DDoS recommendations

Let’s explore some out-of-the-box DDoS business solution recommendations for different resource categories:

  • Static websites: Amazon CloudFront is a content delivery network (CDN) service that is used to host static websites. The AWS WAF rate-limiting rule can be used to limit traffic on a CDN.
  • Dynamic websites: In this scenario, a standard WAF Rate Limit rule can be used on Application Load Balancer (ALB).
  • APIs with customer context: These APIs receive requests on behalf of merchants (for example, an end user places a food order from a food-delivery application and completes a payment; the request is sent to api.razorpay.com, and Razorpay understands that the end user is trying to make a payment to the food app).

Figure 1 represents the difficulty levels and solutions for each resource category.

Figure 1. DDoS business solution resource category recommendations

DDoS response phases

We’ve discussed initial DDoS architectures, so now let’s explore the various DDoS response phases:

  • Phase 1 – Automated inventory: Creating an automated inventory system that captures the list of exposed APIs and websites. The APIs are ranked by exposure and risk of an outage, producing a risk-ranked list of assets.
  • Phase 2 – Bucketing assets into groups: Bucketing APIs and websites (static or dynamic) into groups. This phase also subgroups the assets, for example into authenticated or unauthenticated users, and more. As we will discuss later in this post, most Razorpay APIs have a customer context requiring a multilayered defense mechanism.
  • Phase 3 – Testing the solution: Simulating attack traffic to test the solution under different loads, origins, and more.

AWS DDoS solutions

AWS has two major out-of-the-box solutions when protecting against DDoS attacks: AWS Shield and AWS WAF.

AWS Shield has three important features for DDoS mitigation:

  1. Alarms: Triggers an alarm when a DDoS attack is suspected, customized based on metrics such as total volume, error rates at the ALB, and response latency (a sketch of such an alarm follows this list).
  2. Visibility: Provides the top five important field values in real time, such as requesting client IPs, countries, user agents, referrer headers, and URL routes, so that you can take action.
  3. On-call support: Provides on-call support from the Shield Response Team (SRT) to understand attack vectors and create AWS WAF rules based on insights.
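
As an illustration of the alarms feature, the sketch below creates an Amazon CloudWatch alarm on the DDoSDetected metric that Shield Advanced publishes for a protected resource; the resource ARN and SNS topic are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholders: the ARN of the Shield-protected resource (for example, an ALB) and an SNS topic.
PROTECTED_RESOURCE_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/example/abc123"
ALARM_TOPIC_ARN = "arn:aws:sns:us-east-1:111122223333:ddos-alerts"

cloudwatch.put_metric_alarm(
    AlarmName="ddos-detected-payment-alb",
    Namespace="AWS/DDoSProtection",   # metrics published by Shield Advanced
    MetricName="DDoSDetected",        # non-zero while an attack is suspected on the resource
    Dimensions=[{"Name": "ResourceArn", "Value": PROTECTED_RESOURCE_ARN}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALARM_TOPIC_ARN],
)
```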

AWS WAF has two rule types:

  1. Blocking: Blocks requests matching expected variables, from an individual field to a combination of fields such as IP, URL route, body, or country.
  2. Rate limiting: Tracks the rate of requests for each originating IP address, and triggers the rule action on IPs with rates that go over the limit. A scope-down statement can also be added to the rule, to only track and rate limit requests that match the scope-down statement; a sketch of such a rule follows this list. (We will soon discuss how Razorpay relies heavily on rate limiting in AWS WAF and other microservice architecture solutions.)
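
For illustration, here is roughly what a rate-based rule with a scope-down statement looks like in the AWS WAF (wafv2) API; the path, limit, and names are placeholders, not Razorpay’s configuration:

```python
# A rate-based AWS WAF rule that limits each client IP to 1,000 requests per 5 minutes
# on a hypothetical /payments path.
rate_limit_rule = {
    "Name": "rate-limit-payments",
    "Priority": 1,
    "Action": {"Block": {}},
    "Statement": {
        "RateBasedStatement": {
            "Limit": 1000,                 # requests per 5-minute window, per IP
            "AggregateKeyType": "IP",
            "ScopeDownStatement": {        # only count requests that match this statement
                "ByteMatchStatement": {
                    "SearchString": b"/payments",
                    "FieldToMatch": {"UriPath": {}},
                    "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                    "PositionalConstraint": "STARTS_WITH",
                }
            },
        }
    },
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "rate-limit-payments",
    },
}
# This rule dict would go in the Rules list of a wafv2 create_web_acl or update_web_acl
# call for a REGIONAL web ACL associated with the ALB.
```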

Razorpay microservices protection architecture

Razorpay has two main layers protecting their internal microservices. A request to the payment API api.razorpay.com goes from Amazon Route 53 to an ALB. AWS WAF and AWS Shield Advanced are deployed on the ALB. Requests are then forwarded to microservices fronted by an API Gateway, as in Figure 2.

Figure 2. Razorpay internal microservices protection layers

Razorpay API Gateway

An API Gateway is a critical piece of infrastructure for microservice architectures. Razorpay is a B2B organization that performs actions on behalf of merchants. To achieve this—and understand context about the merchants and end-users alike—a custom API Gateway:

  • Works at a high scale with very low latency
  • Understands merchant context to provide authentication and authorization
  • Provides security features around malicious traffic

Razorpay uses a self-managed API gateway called Edge. The API Gateway is the first system on the ingress path to understand Razorpay domain context that prior layers cannot. Every request goes through multiple layers of logic at the API Gateway before being forwarded to respective services.

Razorpay Edge and AWS solution insights

As Edge performs multiple computations on each request, Razorpay generates insights for each request to build intelligence and prevent DDoS attacks. These events are fed into a time-series database in near real time, and machine learning (ML) models generate insights from them. The insights:

  1. Identify malicious patterns: Generating a confidence percentage that helps define further action, verify consumer authenticity, and serve the request. Malicious requests are also blocked at the edge.
  2. Derive pattern-based rate limits: Deriving rate limits based on a larger set of data—including consumer and IP address—by looking at weekly and monthly patterns.
    Once this information is available at Razorpay Edge, it can perform various operations to ensure only legitimate requests reach the service layer. Let’s explore the request flow to understand each stage’s contribution toward DDoS handling, as in Figure 3.

Figure 3. Razorpay Edge request flow for DDoS handling

All Razorpay requests flow through API Gateway. Here are some key recommendations:

  • Based on the feedback data available, the gateway checks if the given request pattern falls into the malicious category. If the confidence percentage is high for a malicious request, block the request; a block rule is applied at the AWS WAF layer (see the sketch after this list). If the confidence percentage is low, use CAPTCHA to check user authenticity.
  • Apply rate limit on pre-existing data in the request object using feedback from an ML model.
  • Perform authentication/identification and authorization.
  • Apply rate limit on data derived from the request object, also using feedback from an ML model.
  • Request user CAPTCHA verification if the request is identified as malicious with low confidence.
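
The following is a minimal sketch of how high-confidence malicious IPs could be pushed into a blocked-IP set referenced by an AWS WAF block rule; the IP set name and ID are placeholders, and this is an illustration rather than Razorpay’s implementation:

```python
import boto3

wafv2 = boto3.client("wafv2", region_name="us-east-1")

def block_ips(ip_set_name: str, ip_set_id: str, malicious_cidrs: list[str]) -> None:
    """Add ML-flagged addresses (as CIDRs) to an existing AWS WAF IP set."""
    # wafv2 updates are optimistic-locked: fetch the current set and its lock token first.
    current = wafv2.get_ip_set(Name=ip_set_name, Scope="REGIONAL", Id=ip_set_id)
    addresses = set(current["IPSet"]["Addresses"]) | set(malicious_cidrs)
    wafv2.update_ip_set(
        Name=ip_set_name,
        Scope="REGIONAL",
        Id=ip_set_id,
        Addresses=sorted(addresses),
        LockToken=current["LockToken"],
    )

# Example: block a single IPv4 address flagged with high confidence (the ID is a placeholder).
# block_ips("edge-blocked-ips", "a1b2c3d4-example-id", ["203.0.113.7/32"])
```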

As a best practice, publish the generated insights for each request to Apache Kafka outside of the request flow to avoid latency impact and reduce the blast radius.

Conclusion

In this post, you learned about addressing DDoS attacks with ML models.

Organizations can use the insights data generated by an API Gateway to identify request patterns, categorize the patterns into various buckets, and perform the following depending on category:

  • Send feedback to the API Gateway for immediate action
  • Update the rate limits at Edge based on historical patterns
  • Update AWS WAF rules to block requests or reduce rate limits

As discussed, using AWS WAF and AWS Shield Advanced along with ML helps detect and mitigate DDoS attacks.

Let’s Architect! Monitoring production systems at scale

Post Syndicated from Vittorio Denti original https://aws.amazon.com/blogs/architecture/lets-architect-monitoring-production-systems-at-scale/

“Everything fails, all the time” is a famous quote from Amazon’s Chief Technology Officer Werner Vogels. This means that software and distributed systems may eventually fail because something can always go wrong. We have to accept this and design our systems accordingly, test our software and services, and think about all the possible edge cases.

With this in mind, we should also set our teams up for success by providing visibility in every environment for a quick turnaround when incidents happen. When a system serves traffic in production, we need to monitor it to make sure it behaves as expected and that all components are healthy. But questions arise such as:

  • How do we monitor a system?
  • What is monitoring?
  • What are some architectural and engineering approaches to implement in order to design a successful monitoring strategy?

All of these questions require complex answers. It’s not possible to cover everything in a blog post, but let’s start exploring the topic and sharing resources to guide you through this domain.

In this edition of Let’s Architect! we share some practices for monitoring used at Amazon and AWS, as well as more resources to discover how to build monitoring solutions for the workloads running on AWS.

Observability best practices at Amazon

Observability and monitoring are engineering tasks that also require putting a suitable cultural mindset in place. At Amazon, if a service doesn’t run as expected, the team writes a CoE (Correction of Errors) document to analyze the issue and answer critical questions to learn from it. There are also weekly operations meetings to analyze operational and performance dashboards for each service.

The session introduced here covers the full range of monitoring at Amazon, from how teams assess system health at a high level to how they understand the details of a single request. Use this resource to learn some best practices for metrics, logs, and tracing, and using these signals to achieve operational excellence.

Take me to this re:Invent video!

Observability is an iterative process which requires us to establish a feedback loop and improve based on the signals coming from the system.

Build an observability solution using managed AWS services and the OpenTelemetry standard

Visibility of what’s happening in a distributed system is key to operating workloads at scale. OpenTelemetry is the standard for observability, and AWS services are fully integrated with it. The blog post introduced in this section shows you how AWS Distro for OpenTelemetry (ADOT) works under the hood and how to use it with a Kubernetes cluster. But keep in mind, this is just one of the many implementations available for AWS compute services and OpenTelemetry—so even if you’re not using Kubernetes right now, we’ve still got you covered!

Want more? Watch this re:Invent video for an understanding of how to think about logging, tracing, metrics, and monitoring with AWS services, and the possibilities to provide the observability your distributed systems need. This is a great learning resource with many demos and examples.

Take me to this blog post!

Flow of metrics and traces from Application services to the Observability Platform.

Optimizing your AWS Batch architecture for scale with observability dashboards

We’ve explored the mental models and strategies for monitoring in previous resources. Now let’s see how these principles can be applied in a scenario where we run batch and ML computing jobs at scale. In the blog post introduced in this section, you can learn how to use runtime metrics to understand an architecture designed on AWS Batch for running batch computing jobs. AWS Batch is a fully managed service enabling you to run jobs at any scale without needing to manage underlying compute resources. This blog explains how AWS Batch works and guides you through the process used to design a monitoring framework.

Since the solution is open-source, you are free to add other custom metrics you find useful. To get started with the AWS Batch open-source observability solution, visit the project page on GitHub. Several customers have used this monitoring tool to optimize their workload for scale by reshaping their jobs, refining their instance selection, and tuning their AWS Batch architecture.

Take me to this blog!

High-level structure of AWS Batch resources and interactions. This diagram depicts a user submitting jobs based on a job definition template to a job queue, which then communicates to a compute environment that resources are needed.

Observability workshop

This resource provides hands-on experience with the variety of toolsets AWS offers to set up monitoring and observability for your applications. Whether your workload is on-premises or on AWS—and whether your application is a giant monolith or based on a modern microservices architecture—the observability tools can provide deeper insights into application performance and health.

The monitoring tools covered in this workshop provide powerful capabilities that enable you to identify bottlenecks, issues, and defects without having to manually sift through various logs, metrics, and trace data.

Take me to this workshop!

The diagram illustrates the various components of the PetAdoptions architecture. In the workshop you will learn how to monitor this application.

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about containers on AWS.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Announcing updates to the AWS Well-Architected Framework

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-2/

We are excited to announce the availability of improved AWS Well-Architected Framework guidance. In this update, we have made changes across all six pillars of the framework: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.

A brief history

The AWS Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.

In 2012, the first version of the framework was published, leading to the 2015 release of the guidance whitepaper. We added the operational excellence pillar in 2016. The pillar-specific whitepapers and AWS Well-Architected Lenses were released in 2017, and, the following year, the AWS Well-Architected Tool was launched.

In 2020, the content for the Well-Architected Framework received a major update, as well as more lenses, and API integration with the AWS Well-Architected Tool. The sixth pillar, Sustainability, was added in 2021. In 2022, dedicated pages were introduced for each consolidated best practice across all six pillars, with several best practices updated with improved prescriptive guidance.

AWS Well-Architected timeline

What’s new

Well-Architected Framework content is consistently updated and improved to adapt to the constantly changing and innovating AWS environment, with new and evolving services and technologies. This ensures cloud architects can build and operate secure, high-performing, resilient, efficient, and sustainable systems in the AWS Cloud.

The content updates and improvements in this release focus on providing more complete coverage across the AWS service portfolio to help customers make more informed decisions when developing implementation plans. Services that were added or expanded in coverage include: AWS Elastic Disaster Recovery, AWS Trusted Advisor, AWS Resilience Hub, AWS Config, AWS Security Hub, Amazon GuardDuty, AWS Organizations, AWS Control Tower, AWS Compute Optimizer, AWS Budgets, Amazon CodeWhisperer, and Amazon CodeGuru.

Pillar updates

The Operational Excellence Pillar has a new best practice on enabling support plans for production workloads. This Pillar also has a major update on defining a customer communication plan for outages.

In the Security Pillar, we added a new best practice area, Application Security (AppSec). AppSec is complete with eight new best practices to guide customers as they develop, test, and release software, providing guidance on how to consider the tools, testing, and organizational approach used to develop software.

The Reliability Pillar has a new best practice on architecting workloads to meet availability targets and uptime service-level agreements (SLAs). We also added the resilience shared responsibility model to its introduction section.

The Cost Optimization Pillar has new best practices on automating operations as a part of cost-optimization efforts and enforcing data-retention policies.

In the Sustainability Pillar, we introduced a clear process for selecting Regions, as well as tools for right-sizing services and improving the overall utilization of resources in the AWS Cloud.

Best practice updates

The implementation guidance and best practices have been updated in this release to be more prescriptive, including enhanced recommendations and steps on reusable architecture patterns targeting specific business outcomes in the AWS Cloud.

As many as 113 best practices are updated with more prescriptive guidance in Operational Excellence (22), Security (18), Reliability (14), Performance Efficiency (10), Cost Optimization (22), and Sustainability (27). Fourteen new best practices have been introduced in Operational Excellence (1), Security (9), Reliability (1), Cost Optimization (2), and Sustainability (1).

From a total of 127 new/updated best practices, 78% include explicit implementation steps as part of making them more prescriptive. The remaining 22% have been updated by improving their existing implementation steps. These changes are in addition to the 51 improved best practices released in 2022 (18 in Q3 2022, and 33 in Q4 2022), resulting in more than 50% of the existing Framework best practices having been updated recently.

The content is available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Here is the list of best practices that are new or updated in this release:

  • Operational Excellence: OPS01-BP03, OPS01-BP04, OPS02-BP01, OPS02-BP06, OPS02-BP07, OPS03-BP04, OPS03-BP05, OPS04-BP01, OPS04-BP03, OPS04-BP04, OPS04-BP05, OPS05-BP02, OPS05-BP06, OPS05-BP07, OPS07-BP01, OPS07-BP05, OPS07-BP06, OPS08-BP02, OPS08-BP03, OPS08-BP04, OPS10-BP05, OPS11-BP01, OPS11-BP04
  • Security: SEC01-BP01, SEC01-BP02, SEC01-BP07, SEC02-BP01, SEC02-BP02, SEC02-BP03, SEC02-BP05, SEC03-BP02, SEC03-BP04, SEC03-BP07, SEC03-BP09, SEC04-BP01, SEC05-BP01, SEC06-BP01, SEC07-BP01, SEC08-BP04, SEC08-BP02, SEC09-BP02, SEC03-BP08, SEC11-BP01, SEC11-BP02, SEC11-BP03, SEC11-BP04, SEC11-BP05, SEC11-BP06, SEC11-BP07, SEC11-BP08
  • Reliability: REL01-BP01, REL01-BP02, REL01-BP03, REL01-BP04, REL01-BP06, REL02-BP01, REL09-BP01, REL09-BP02, REL09-BP03, REL09-BP04, REL10_BP04, REL10-BP03, REL11-BP07, REL13-BP02, REL13-BP03
  • Performance Efficiency: PERF02-BP06, PERF05_BP03, PERF05-BP02, PERF05-BP04, PERF05-BP05, PERF05-BP06, PERF05-BP07, PERF04-BP04, PERF02_BP04, PERF02_BP05
  • Cost Optimization: COST02_BP01, COST02_BP02, COST02_BP03, COST02_BP05, COST03_BP02, COST03_BP04, COST03_BP05, COST04_BP01, COST04_BP02, COST04_BP03, COST04_BP04, COST04_BP05, COST05_BP03, COST05_BP05, COST05_BP06, COST06_BP01, COST06_BP03, COST07_BP01, COST07_BP02, COST07_BP05, COST09_BP03, COST10_BP01, COST10_BP02, COST11_BP01
  • Sustainability: SUS01_BP01, SUS02_BP01, SUS02_BP02, SUS02_BP03, SUS02_BP04, SUS02_BP05, SUS02_BP06, SUS03_BP01, SUS03_BP02, SUS03_BP03, SUS03_BP04, SUS03_BP05, SUS04_BP01, SUS04_BP02, SUS04_BP03, SUS04_BP04, SUS04_BP05, SUS04_BP06, SUS04_BP07, SUS04_BP08, SUS05_BP01, SUS05_BP02, SUS05_BP03, SUS05_BP04, SUS06_BP01, SUS06_BP02, SUS06_BP03, SUS06_BP04

Updates in this release are also available in the AWS Well-Architected Tool, which can be used to review your workloads, address important design considerations, and help ensure that you follow the best practices and guidance of the AWS Well-Architected Framework.

Ready to get started? Review the updated AWS Well-Architected Framework Pillar best practices, as well as pillar-specific whitepapers.

Have questions about some of the new best practices or most recent updates? Join our growing community on AWS re:Post.