Tag Archives: Architecture

Field Notes: Launch a Fully Configured AWS Deep Learning Desktop with NICE DCV

2021-07-22 Ajay Vohra

Post Syndicated from Ajay Vohra original https://aws.amazon.com/blogs/architecture/field-notes-launch-a-fully-configured-aws-deep-learning-desktop-with-nice-dcv/

You want to start quickly when doing deep learning using GPU-activated Elastic Compute Cloud (Amazon EC2) instances in the AWS Cloud. Although AWS provides end-to-end machine learning (ML) in Amazon SageMaker, working at the deep learning frameworks level, the quickest way to start is with AWS Deep Learning AMIs (DLAMIs), which provide preconfigured Conda environments for most of the popular frameworks.

DLAMIs make it straightforward to launch Amazon EC2 instances, but these instances do not automatically provide the high-performance graphics visualization required during deep learning research. Additionally, they are not preconfigured to use AWS storage services or SageMaker. This post explains how you can launch a fully-configured deep learning desktop in the AWS Cloud. Not only is this desktop preconfigured with the popular frameworks such as TensorFlow, PyTorch, and Apache MXNet, but it is also enabled for high-performance graphics visualization. NICE DCV is the remote display protocol used for visualization. In addition, it is preconfigured to use AWS storage services and SageMaker.

Overview of the Solution

The deep learning desktop described in this solution is ready for research and development of deep neural networks (DNNs) and their visualization. You no longer need to set up low-level drivers, libraries, and frameworks, or configure secure access to AWS storage services and SageMaker. The desktop has preconfigured access to your data in a Simple Storage Service (Amazon S3) bucket, and a shared Amazon Elastic File System (Amazon EFS) is automatically attached to the desktop. It is automatically configured to access SageMaker for ML services, and provides you with the ability to prepare the data needed for deep learning, and to research, develop, build, and debug your DNNs. You can use all the advanced capabilities of SageMaker from your deep learning desktop. The following diagram shows the reference architecture for this solution.

Reference Architecture to Launch a Fully Configured AWS Deep Learning Desktop with NICE DCV

Figure 1 – Architecture overview of the solution to launch a fully configured AWS Deep Learning Desktop with NICE DCV

The deep learning desktop solution discussed in this post is contained in a single AWS CloudFormation template. To launch the solution, you create a CloudFormation stack from the template. Before we provide a detailed walkthrough for the solution, let us review the key benefits.

DNN Research and Development

During the DNN research phase, there is iterative exploration until you choose the preferred DNN architecture. During this phase, you may prefer to work in an isolated environment (for example, a dedicated desktop) with your favorite integrated development environment (IDE) (for example, Visual Studio Code or PyCharm). Developers like the ability to step through code the IDE Debugger. With the increasing support for imperative programming in modern ML frameworks, the ability to step through code in the research phase can accelerate DNN development.

The DLAMIs are preconfigured with NVIDIA GPU drivers, NVIDIA CUDA Toolkit, and low-level libraries such as Deep Neural Network library (cuDNN). Deep learning ML frameworks such as TensorFlow, PyTorch, and Apache MXNet are preconfigured.

After you launch the deep learning desktop, you need to install and open your favorite IDE, clone your GitHub repository, and you can start researching, developing, debugging, and visualizing your DNN. The acceleration of DNN research and development is the first key benefit for the solution described in this post.

Screenshot showing Developing on deep learning desktop with Visual Studio Code IDE

Figure 2 – Developing on deep learning desktop with Visual Studio Code IDE

Elasticity in number of GPUs

During the research phase, you need to debug any issues using a single GPU. However, as the DNN is stabilized, you horizontally scale across multiple GPUs in a single machine, followed by scaling across multiple machines.

Most modern deep learning frameworks support distributed training across multiple GPUs in a single machine, and also across multiple machines. However, when you use a single GPU in an on-premises desktop equipped with multiple GPUs, the idle GPUs are wasted. With the deep learning desktop solution described in this post, you can stop the deep learning desktop instance, change its Amazon EC2 instance type to another compatible type, restart the desktop, and get the exact number of GPUs you need at the moment. The elasticity in the number of GPUs in the deep learning desktop is the second key benefit for the solution described in this post.

Integrated access to storage services

Since the deep learning desktop is running in AWS Cloud, you have access to all of the AWS data storage options, including the S3 object store, the Amazon EFS, and the Amazon FSx file system for Lustre. You can build your favorite data pipeline and it will be supported by one or more data storage options. You can also easily use ML-IO library, which is a high-performance data access library for ML tasks with support for multiple data formats. The integrated access to highly durable and scalable object and file system storage services for accessing ML data is the third key benefit for the solution described in this post.

Integrated access to SageMaker

Once you have a stable version of your DNN, you need to find the right hyperparameters that lead to model convergence during training. Having tuned the hyperparameters, you need to run multiple trials over permutations of datasets and hyperparameters to fine-tune your models. Finally, you may need to prune and compile the models to optimize inference. To compress the training time, you may need to do distributed data parallel training across multiple GPUs in multiple machines. For all of these activities, the deep learning desktop is preconfigured to use SageMaker. You can use jupyter-lab notebooks running on the desktop to launch SageMaker training jobs for distributed training in infrastructure automatically managed by SageMaker.

Figure 3 – Submitting a SageMaker training job from deep learning desktop using Jupyter Lab notebook

The SageMaker training logs, TensorBoard summaries, and model checkpoints can be configured to be written to the Amazon EFS attached to the deep learning desktop. You can use the Linux command tail to monitor the logs, or start a TensorBoard server from the Conda environment on the deep learning desktop, and monitor the progress of your SageMaker training jobs. You can use a Jupyter Lab notebook running on the deep learning desktop to load a specific model checkpoint available on the Amazon EFS, and visualize the predictions from the model checkpoint, even while the SageMaker training job is still running.

Figure 4 – Locally monitoring the TensorBoard summaries from SageMaker training job

SageMaker offers many advanced capabilities, such as profiling ML training jobs using Amazon SageMaker Debugger, and these services are easily accessible from the deep learning desktop. You can manage the training input data, training model checkpoints, training logs, and TensorBoard summaries of your local iterative development, in addition to the distributed SageMaker training jobs, all from your deep learning desktop. The integrated access to SageMaker services is the fourth key use case for the solution described in this post.

Prerequisites

To get started, complete the following steps:

If you do not have an AWS Account, create a new AWS account.
You must have AWS Identity and Access Management (IAM) access consistent with administrator job function.
In the AWS Management Console, select a supported Region (review the prerequisites in the read me file within the GitHub repository for the list of supported Regions).
Subscribe to DLAMI (Ubuntu 18.04).
Create a new Amazon EC2 key pair, or use an existing key pair.
Create a new S3 bucket in the selected AWS Region, or use an existing bucket. The bucket can be empty.
Clone the GitHub repository on your laptop using the Git clone command.

Walkthrough

The complete source and reference documentation for this solution is available in the repository accompanying this post. Following is a walkthrough of the steps.

Create a CloudFormation stack

Create a stack on the CloudFormation console in your selected AWS Region using the CloudFormation template in your cloned GitHub repository. This CloudFormation stack creates IAM resources. When you are creating a CloudFormation stack using the console, you must confirm: I acknowledge that AWS CloudFormation might create IAM resources.

To create the CloudFormation stack, you must specify values for the following input parameters (for the rest of the input parameters, default values are recommended):

DesktopAccessCIDR – Use the public internet address of your laptop as the base value for the CIDR.
DesktopInstanceType – For deep leaning, the recommended value for this parameter is p3.2xlarge, or larger.
DesktopVpcId – Select an Amazon Virtual Private Cloud (VPC) with at least one public subnet.
DesktopVpcSubnetId – Select a public subnet in your VPC.
DesktopSecurityGroupId – The specified security group must allow inbound access over ports 22 (SSH) and 8443 (NICE DCV) from your DesktopAccessCIDR, and must allow inbound access from within the security group to port 2049 and all network ports required for distributed SageMaker training in your subnet.
If you leave it blank, the automatically-created security group allows inbound access for SSH, and NICE DCV from your DesktopAccessCIDR, and allows inbound access to all ports from within the security group.
KeyName – Select your SSH key pair name.
S3Bucket – Specify your S3 bucket name. The bucket can be empty.

Visit the documentation on all the input parameters.

Connect to the deep learning desktop

When the status for the stack in the CloudFormation console is CREATE_COMPLETE, find the deep learning desktop instance launched in your stack in the Amazon EC2 console,
Connect to the instance using SSH as user ubuntu, using your SSH key pair. When you connect using SSH, if you see the message, “Cloud init in progress. Machine will REBOOT after cloud init is complete!!”, disconnect and try again in about 15 minutes.
The desktop installs the NICE DCV server on first-time startup, and automatically reboots after the install is complete. If instead you see the message, “NICE DCV server is enabled!”, the desktop is ready for use.
Before you can connect to the desktop using the NICE DCV client, you need to set a new password for user ubuntu using the Bash command:
```
sudo passwd ubuntu 
```
After you successfully set the new password for user ubuntu, exit the SSH connection. You are now ready to connect to the desktop using a suitable NICE DCV client (a non–web browser client is recommended) using the user ubuntu, and the new password.
NICE DCV client asks you to specify the server host and port to connect. For the server host, use the public IPv4 DNS address of the desktop Amazon EC2 instance available in Amazon EC2 console.
You do not need to specify the port, because the desktop is configured to use the default NICE DCV server port of 8443.
When you first login to the desktop using the NICE DCV client, you will be asked if you would like to upgrade the OS version. Do not upgrade the OS version!

Develop on the deep learning desktop

When you are connected to the desktop using the NICE DCV client, use the Ubuntu Software Center to install Visual Studio Code, or your favorite IDE. To view the available Conda environments containing the popular deep learning frameworks preconfigured on the desktop, open a desktop terminal, and run the Bash command:

conda env list

The deep learning desktop instance has secure access to the S3 bucket you specified when you created the CloudFormation stack. You can verify access to the S3 bucket by running the Bash command (replace ‘your-bucket-name’ following with your S3 bucket name):

aws s3 ls your-bucket-name

If your bucket is empty, a successful initiation of the previous command will produce no output, which is normal.

An Amazon Elastic Block Store (Amazon EBS) root volume is attached to the instance. In addition, an Amazon EFS is mounted on the desktop at the value of EFSMountPath input parameter, which by default is /home/ubuntu/efs. You can use the Amazon EFS for staging deep learning input and output data.

Use SageMaker from the deep learning desktop

The deep learning desktop is preconfigured to use SageMaker. To get started with SageMaker examples in a JupyterLab notebook, launch the following Bash commands in a desktop terminal:

mkdir ~/git
cd ~/git
git clone https://github.com/aws/amazon-sagemaker-examples.git
jupyter-lab

This will start a ‘jupyter-lab’ notebook server in the terminal, and open a tab in your web browser. You can explore any of the SageMaker example notebooks. We recommend starting with the example Distributed Training of Mask-RCNN in SageMaker using Amazon EFS found at the following path in the cloned repository:

advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-scriptmode-efs.ipynb

The preceding SageMaker example requires you to specify a subnet and a security group. Use the preconfigured OS environment variables as follows:

security_group_ids = [ os.environ['desktop_sg_id'] ] 
subnets = [ os.environ['desktop_subnet_id' ] ]

Stopping and restarting the desktop

You may safely reboot, stop, and restart the desktop instance at any time. The desktop will automatically mount the Amazon EFS at restart.

Clean Up

When you no longer need the deep learning desktop, you may delete the CloudFormation stack from the CloudFormation console. Deleting the stack will shut down the desktop instance, and delete the root Amazon EBS volume attached to the desktop. The Amazon EFS is not automatically deleted when you delete the stack.

Conclusion

In this post, we showed how to launch a desktop pre-configured with the popular machine learning frameworks for research and development of deep learning neural networks. NICE-DCV was used for high performance visualization related to deep learning. AWS storage services were used for highly scalable access to deep learning data. Finally, Amazon SageMaker was used for the distributed training of deep learning data.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Data Caching Across Microservices in a Serverless Architecture

2021-07-21 Irfan Saleem

Post Syndicated from Irfan Saleem original https://aws.amazon.com/blogs/architecture/data-caching-across-microservices-in-a-serverless-architecture/

Organizations are re-architecting their traditional monolithic applications to incorporate microservices. This helps them gain agility and scalability and accelerate time-to-market for new features.

Each microservice performs a single function. However, a microservice might need to retrieve and process data from multiple disparate sources. These can include data stores, legacy systems, or other shared services deployed on premises in data centers or in the cloud. These scenarios add latency to the microservice response time because multiple real-time calls are required to the backend systems. The latency often ranges from milliseconds to a few seconds depending on size of the data, network bandwidth, and processing logic. In certain scenarios, it makes sense to maintain a cache close to the microservices layer to improve performance by reducing or eliminating the need for the real-time backend calls.

Caches reduce latency and service-to-service communication of microservice architectures. A cache is a high-speed data storage layer that stores a subset of data. When data is requested from a cache, it is delivered faster than if you accessed the data’s primary storage location.

While working with our customers, we have observed use cases where data caching helps reduce latency in the microservices layer. Caching can be implemented in several ways. In this blog post, we discuss a couple of these use cases that customers have built. In both use cases, the microservices layer is created using Serverless on AWS offerings. It requires data from multiple data sources deployed locally in the cloud or on premises. The compute layer is built using AWS Lambda. Though Lambda functions are short-lived, the cached data can be used by subsequent instances of the same microservice to avoid backend calls.

Use case 1: On-demand cache to reduce real-time calls

In this use case, the Cache-Aside design pattern is used for lazy loading of frequently accessed data. This means that an object is only cached when it is requested by a consumer, and the respective microservice decides if the object is worth saving.

This use case is typically useful when the microservices layer makes multiple real-time calls to fetch and process data. These calls can be greatly reduced by caching frequently accessed data for a short period of time.

Let’s discuss a real-world scenario. Figure 1 shows a customer portal that provides a list of car loans, their status, and the net outstanding amount for a customer:

The Billing microservice gets a request. It then tries to get required objects (for example, the list of car loans, their status, and the net outstanding balance) from the cache using an object_key. If the information is available in the cache, a response is sent back to the requester using cached data.
If requested objects are not available in the cache (a cache miss), the Billing microservice makes multiple calls to local services, applications, and data sources to retrieve data. The result is compiled and sent back to the requester. It also resides in the cache for a short period of time.
Meanwhile, if a customer makes a payment using the Payment microservice, the balance amount in the cache must be invalidated/deleted. The Payment microservice processes the payment and invokes an asynchronous event (payment_processed) with the respective object key for the downstream processes that will remove respective objects from the cache.
The events are stored in the event store.
The CacheManager microservice gets the event (payment_processed) and makes a delete request to the cache for the respective object_key. If necessary, the CacheManager can also refresh cached data. It can call a resource within the Billing service or it can refresh data directly from the source system depending on the data refresh logic.

Figure 1. Reducing latency by caching frequently accessed data on demand

Figure 2 shows AWS services for use case 1. The microservices layer (Billing, Payments, and Profile) is created using Lambda. The Amazon API Gateway is exposing Lambda functions as API operations to the internal or external consumers.

Figure 2. Suggested AWS services for implementing use case 1

All three microservices are connected with the data cache and can save and retrieve objects from the cache. The cache is maintained in-memory using Amazon ElastiCache. The data objects are kept in cache for a short period of time. Every object has an associated TTL (time to live) value assigned to it. After that time period, the object expires. The custom events (such as payment_processed) are published to Amazon EventBridge for downstream processing.

Use case 2: Proactive caching of massive volumes of data

During large modernization and migration initiatives, not all data sources are colocated for a certain period of time. Some legacy systems, such as mainframe, require a longer decommissioning period. Many legacy backend systems process data through periodic batch jobs. In such scenarios, front-end applications can use cached data for a certain period of time (ranging from a few minutes to few hours) depending on nature of data and its usage. The real-time calls to the backend systems cannot deal with the extensive call volume on the front-end application.

In such scenarios, required data/objects can be identified up front and loaded directly into the cache through an automated process as shown in Figure 3:

An automated process loads data/objects in the cache during the initial load. Subsequent changes to the data sources (either in a mainframe database or another system of record) are captured and applied to the cache through an automated CDC (change data capture) pipeline.
Unlike use case 1, the microservices layer does not make real-time calls to load data into the cache. In this use case, microservices use data already cached for their processing.
However, the microservices layer may create an event if data in the cache is stale or specific objects have been changed by another service (for example, by the Payment service when a payment is made).
The events are stored in Event Manager. Upon receiving an event, the CacheManager initiates a backend process to refresh stale data on demand.
All data changes are sent directly to the system of record.

Figure 3. Eliminating real-time calls by caching massive data volumes proactively

As shown in Figure 4, the data objects are maintained in Amazon DynamoDB, which provides low-latency data access at any scale. The data retrieval is managed through DynamoDB Accelerator (DAX), a fully managed, highly available, in-memory cache. It delivers up to a 10 times performance improvement, even at millions of requests per second.

Figure 4. Suggested AWS services for implementing use case 2

The data in DynamoDB can be loaded through different methods depending on the customer use case and technology landscape. API Gateway, Lambda, and EventBridge are providing similar functionality as described in use case 1.

Use case 2 is also beneficial in scenarios where front-end applications must cache data for an extended period of time, such as a customer’s shopping cart.

In addition to caching, the following best practices can also be used to reduce latency and to improve performance within the Lambda compute layer:

Best practices for working with AWS Lambda functions shows you how to use execution environment reuse to improve the performance of your Lambda functions
The Introducing AWS Lambda Extensions blog post shows you how to implement configuration and data cache using AWS Lambda Extensions

Conclusion

The microservices architecture allows you to build several caching layers depending on your use case. In this blog, we discussed data caching within the compute layer to reduce latency when data is retrieved from disparate sources. The information from use case 1 can help you reduce real-time calls to your back-end system by saving frequently used data to the cache. Use case 2 helps you maintain large volumes of data in caches for extended periods of time when real-time calls to the backend system are not possible.

Digitally Optimize your Factory Issue Resolution with Amazon Virtual Andon

2021-07-20 Ajay Swamy

Post Syndicated from Ajay Swamy original https://aws.amazon.com/blogs/architecture/digitally-optimize-your-factory-issue-resolution-with-amazon-virtual-andon/

As a manufacturing enterprise, maximizing your operational efficiency and optimizing output are critical in a competitive global market. Global black swan events such as COVID-19 have necessitated the ability to monitor remotely and respond to issues actively, on the factory floor.

Amazon Virtual Andon (AVA) is a digital notification system that helps factory personnel raise, route, track, and resolve factory floor issues more quickly. The solution was incubated in the Amazon Fulfillment Centers to monitor and resolve issues. It is now available as a self-deployable solution that manufacturers can use to optimize productivity.

An intuitive UI to deliver quick value

Amazon Virtual Andon comes with an intuitive user interface (UI) to set up your factory asset hierarchy, so you can model your real factory floor equipment (Figure 1). With AVA, you can create events (incidents) that are likely to occur across your factory. You can granularly tie events to the underlying stations and processes. Factory personnel can monitor manufacturing equipment for events, quickly raise issues, route the problem to specialized engineers, and ensure tracking and resolution of those issues.

Setup and configuration

Figure 1. Create your factory hierarchy

With AVA, you can set up your factory configuration quickly (see Figure 2a, 2b following). You can choose to create multiple factory sites, areas, stations, processes, and devices with the setup UI. You can also create users and assign them to Admin, Manager, Engineer, and Associate roles. You can set fine-grained controls on what users can access within the Permissions screen.

Figure 2a. Create new factory areas

Figure 2b. Create factory processes

Client screen

In Figure 3, you can see AVA’s graphical client screen, where floor associates can raise issues when breakdowns occur. Associates can raise an issue by clicking on the issue tile. Once issues are raised, they are routed via Amazon Simple Notification Service (SNS) to subscribed emails addresses or phone numbers of specialized engineers for solving those specific breakdown events.

Figure 3. Raise issues screen

Observer screen

The issues routed to the engineers are visible on the observer screen, shown in Figure 4. The engineers can confirm that the problem is being addressed. This updates the status on the client screen, and allows for seamless communication between different users on the factory floor. The engineers can then close the issues by specifying a root cause and descriptive text. This enables prescriptive issue resolution, and can assist with future problems.

Figure 4. Issue response screen

Metrics reporting and API operations

AVA also provides robust reporting seen in Figure 5. This allows managers to view issue metrics for the selected sites and areas in the last seven days. The metric reporting will allow you to observe trends across different areas of your factory floor and optimize accordingly.

Figure 5. Metrics and reporting

Lastly, AVA comes with a rich set of GraphQL APIs that you can use to create sites, issues, and monitor and resolve problems programmatically. In addition, the APIs allow for custom integration with external systems such as factory MES (Manufacturing Execution System) systems, for automated issue creation and solving.

AVA deployment architecture

Deploying this solution with the default parameters builds the following environment in the AWS Cloud:

Figure 6. Amazon Virtual Andon deployment architecture

An Amazon CloudFront web interface deploys into an Amazon Simple Storage Service (S3) bucket configured for web hosting.
An Amazon Cognito user pool activates the solution’s administrators to register users and groups using the web interface.
AWS AppSync GraphQL APIs and AWS Amplify power the web interface. Amazon DynamoDB tables store the factory data.
An AWS IoT rules engine helps users monitor manufacturing workstations or devices for events. It then routes the events to the correct engineer for resolution in real time.
Authorized users can interact with and receive notifications. Using an AWS Lambda function and Amazon SNS, you can send emails and SMS notifications.
Issues created, acknowledged, and closed in the web interface are recorded and updated using AWS AppSync and DynamoDB.

Conclusion

Amazon Virtual Andon (AVA) is a self-deployable solution that helps you resolve factory floor issues more quickly and accurately. With AVA’s intuitive UI and open APIs, you can optimize issue resolution and focus on optimizing production output. You also can holistically monitor your manufacturing site remotely via an intuitive web application. Get started with Amazon Virtual Andon today.

Perform Chaos Testing on your Amazon Aurora Cluster

2021-07-20 Anthony Pasquariello

Post Syndicated from Anthony Pasquariello original https://aws.amazon.com/blogs/architecture/perform-chaos-testing-on-your-amazon-aurora-cluster/

“Everything fails all the time” Werner Vogels, AWS CTO

In 2010, Netflix introduced a tool called “Chaos Monkey”, that was used for introducing faults in a production environment. Chaos Monkey led to the birth of Chaos engineering where teams test their live applications by purposefully injecting faults. Observations are then used to take corrective action and increase resiliency of applications.

In this blog, you will learn about the fault injection capabilities available in Amazon Aurora for simulating various database faults.

Chaos Experiments

Chaos experiments consist of:

Understanding the application baseline: The application’s steady-state behavior
Designing an experiment: Ask “What can go wrong?” to identify failure scenarios
Run the experiment: Introduce faults in the application environment
Observe and correct: Redesign apps or infrastructure for fault tolerance

Chaos experiments require fault simulation across distributed components of the application. Amazon Aurora provides a set of fault simulation capabilities that may be used by teams to exercise chaos experiments against their applications.

Amazon Aurora fault injection

Amazon Aurora is a fully managed database service that is compatible with MySQL and PostgreSQL. Aurora is highly fault tolerant due to its six-way replicated storage architecture. In order to test the resiliency of an application built with Aurora, developers can leverage the native fault injection features to design chaos experiments. The outcome of the experiments gives a better understanding of the blast radius, depth of monitoring required, and the need to evaluate event response playbooks.

In this section, we will describe the various fault injection scenarios that you can use for designing your own experiments. We’ll show you how to conduct the experiment and use the results. This will make your application more resilient and prepared for an actual event.

Note that availability of the fault injection feature is dependent on the version of MySQL and PostgreSQL.

Figure 1. Fault injection overview

1. Testing an instance crash

An Aurora cluster can have one primary and up to 15 read replicas. If the primary instance fails, one of the replicas becomes the primary. Applications must be designed to recover from these instance failures as soon as possible to have minimal impact on the end-user experience.

The instance crash fault injection simulates failure of the instance/dispatcher/node in the Aurora database cluster. Fault injection may be carried out on the primary or replicas by running the API against the target instance.

Example: Aurora PostgreSQL for instance crash simulation

The query following will simulate a database instance crash:

SELECT aurora_inject_crash ('instance' );

Since this is a simulation, it does not lead to a failover to the replica. As an alternative to using this API, you can carry out an actual failover by using the AWS Management Console or AWS CLI.

The team should observe the change in the application’s behavior to understand the impact of the instance failure. Take corrective actions to reduce the impact of such failures on the application.

A long recovery time on the application would require the team to reduce the Domain Name Service (DNS) time-to-live (TTL) for the DB connections. As a general best practice, the Aurora Database cluster should have at least one replica.

2. Testing the replica failure

Aurora manages asynchronous replication between cluster nodes within a cluster. The typical replication lag is under 100 milliseconds. Network slowness or issues on the nodes may lead to an increase in replication lag between writer and replica nodes.

The replica failure fault injection allows you to simulate replication failure across one or more replicas. Note that this type of fault injection applies only to a DB cluster that has at least one read replica.

Replica failure manifests itself as stale data read by the application that is connecting to the replicas. The specific functional impact on the application depends on the sensitivity to the freshness of data. Note that this fault injection mechanism does not apply to the native replication supported mechanisms in PostgreSQL and MySQL databases.

Example: Aurora PostgreSQL for replica failure

The statement following will simulate 100% failure of replica named ‘my-replica’ for 20 seconds.

SELECT aurora_inject_replica_failure(100, 20, ‘my-replica’)

The team must observe the behavior of the application from the data sensitivity perspective. If the observed lag is unacceptable, the team must evaluate corrective actions such as vertical scaling of database instances and query optimization. As a best practice, the team should monitor the replication lag and take proactive actions to address it.

3. Testing the disk failure

Aurora’s storage volume consists of six copies of data across three Availability Zones (refer the diagram preceding). Aurora has an inherent ability to repair itself for failures in the storage components. This high reliability is achieved by way of a quorum model. Reads require only 3/6 nodes and writes require 4/6 nodes to be available. However, there may still be transient impact on application depending on how widespread the issue.

The disk failure injection capability allows you to simulate failures of storage nodes and partial failure of disks. The severity of failure can be set as a percentage value. The simulation continues only for the specified amount of time. There is no impact on the actual data on the storage nodes and the disk.

Example: Aurora PostgreSQL for disk failure simulation

You may get the number of disks (for index) on your cluster using the query:

SELECT disks FROM aurora_show_volume_status()

The query following will simulate 75% failure on disk with index 15. The simulation will end in 20 seconds.

SELECT aurora_inject_disk_failure(75, 15, true, 20)

Applications may experience temporary failures due to this fault injection and should be able to gracefully recover from it. If the recovery time is higher than a threshold, or the application has a complete failure, the team can redesign their application.

4. Disk congestion fault

Disk congestion usually happens because of heavy I/O traffic against the storage devices. The impact may range from degraded application performance, to complete application failures.

Aurora provides the capability to simulate disk congestion without synthetic SQL load against the database. With this fault injection mechanism, you can gain a better understanding of the performance characteristics of the application under heavy I/O spikes.

Example: Aurora PostgreSQL for disk congestion simulation

You may get the number of disks (for index) on your cluster using the query:

SELECT disks FROM aurora_show_volume_status()

The query following will simulate a 100% disk failure for 20 seconds. The failure will be simulated on disk with index 15. Simulated delay will be between 30 and 40 milliseconds.

SELECT aurora_inject_disk_congestion(100, 15, true, 20, 30, 40)

If the observed behavior is unacceptable, then the team must carefully consider the load characteristics of their application. Depending on the observations, corrective action may include query optimization, indexing, vertical scaling of the database instances, and adding more replicas.

Conclusion

A chaos experiment involves injecting a fault in a production environment and then observing the application behavior. The outcome of the experiment helps the team identify application weaknesses and evaluate event response processes. Amazon Aurora natively provides fault-injection capabilities that can be used by teams to conduct chaos experiments for database failure scenarios. Aurora can be used for simulating instance failure, replication failure, disk failures, and disk congestion. Try out these capabilities in Aurora to make your applications more robust and resilient from database failures.

Field Notes: How Sportradar Accelerated Data Recovery Using AWS Services

2021-07-19 Mithil Prasad

Post Syndicated from Mithil Prasad original https://aws.amazon.com/blogs/architecture/field-notes-how-sportradar-accelerated-data-recovery-using-aws-services/

This post was co-written by Mithil Prasad, AWS Senior Customer Solutions Manager, Patrick Gryczkat, AWS Solutions Architect, Ben Burdsall, CTO at Sportradar and Justin Shreve, Director of Engineering at Sportradar.

Ransomware is a type of malware which encrypts data, effectively locking those affected by it out of their own data and requesting a payment to decrypt the data. The frequency of ransomware attacks has increased over the past year, with local governments, hospitals, and private companies experiencing cases of ransomware.

For Sportradar, providing their customers with access to high quality sports data and insights is central to their business. Ensuring that their systems are designed securely and in a way which minimizes the possibility of a ransomware attack is top priority. While ransomware attacks can occur both on premises and in the cloud, AWS services offer increased visibility and native encryption and back up capabilities. This helps prevent and minimize the likelihood and impact of a ransomware attack.

Recovery, backup, and the ability to go back to a known good state is best practice. To further expand their defense and diminish the value of ransom, the Sportradar architecture team set out to leverage their AWS Step Functions expertise to minimize recovery time. The team’s strategy centered on achieving a short deployment process. This process commoditized their production environment, allowing them to spin up interchangeable environments in new isolated AWS accounts, pulling in data from external and isolated sources, and diminishing the value of a production environment as a ransom target. This also minimized the impact of a potential data destruction event.

By partnering with AWS, Sportradar was able to build a secure and resilient infrastructure to provide timely recovery of their service in the event of data destruction by an unauthorized third party. Sportradar automated the deployment of their application to a new AWS account and established a new isolation boundary from an account with compromised resources. In this blog post, we show how the Sportradar architecture team used a combination of AWS CodePipeline and AWS Step Functions to automate and reduce their deployment time to less than two hours.

Solution Overview

Sportradar’s solution uses AWS Step Functions to orchestrate the deployment of resources, the recovery of data, and the deployment of application code, and to navigate all necessary dependencies for order of deployment. While deployment can be orchestrated through CodePipeline, Sportradar used their familiarity with Step Functions to create a quick and repeatable deployment process for their environment.

Sportradar’s solution to a ransomware Disaster Recovery scenario has also provided them with a reliable and accelerated process for deploying development and testing environments. Developers are now able to scale testing and development environments up and down as needed. This has allowed their Development and QA teams to follow the pace of feature development, versus weekly or bi-weekly feature release and testing schedules tied to a single testing environment.

Reference Architecture Showing How Sportradar Accelerated Data Recovery

Figure 1 – Reference Architecture Diagram showing Automated Deployment Flow

Prerequisites

The prerequisites for implementing this deployment strategy are:

An implemented database backup policy
Ideally data should be backed up to a data bunker AWS account outside the scope of the environment you are looking to protect. This is so that in the event of a ransomware attack, your backed up data is isolated from your affected environment and account
Application code within a GitHub repository
Separation of duties
Access and responsibility for the backups and GitHub repository should be separated to different stakeholders in order to reduce the likelihood of both being impacted by a security breach

Step 1: New Account Setup

Once data destruction is identified, the first step in Sportradar’s process is to use a pre-created runbook to create a new AWS account. A new account is created in case the malicious actors who have encrypted the application’s data have access to not just the application, but also to the AWS account the application resides in.

The runbook sets up a VPC for a selected Region, as well as spinning up the following resources:

Security Groups with network connectivity to their git repository (in this case GitLab), IAM Roles for their resources
KMS Keys
Amazon S3 buckets with CloudFormation deployment templates
CodeBuild, CodeDeploy, and CodePipeline

Step 2: Deploying Secrets

It is a security best practice to ensure that no secrets are hard coded into your application code. So, after account setup is complete, the new AWS accounts Access Keys and the selected AWS Region are passed into CodePipeline variables. The application secrets are then deployed to the AWS Parameter Store.

Step 3: Deploying Orchestrator Step Function and In-Memory Databases

To optimize deployment time, Sportradar decided to leave the deployment of their in-memory databases running on Amazon EC2 outside of their orchestrator Step Function. They deployed the database using a CloudFormation template from their CodePipeline. This was in parallel with the deployment of the Step Function, which orchestrates the rest of their deployment.

Step 4: Step Function Orchestrates the Deployment of Microservices and Alarms

The AWS Step Functions orchestrate the deployment of Sportradar’s microservices solutions, deploying 10+ Amazon RDS instances, and restoring each dataset from DB snapshots. Following that, 80+ producer Amazon SQS queues and S3 buckets for data staging were deployed. After the successful deployment of the SQS queues, the Lambda functions for data ingestion and 15+ data processing Step Functions are deployed to begin pulling in data from various sources into the solution.

Then the API Gateways and Lambda functions which provide the API layer for each of the microservices are deployed in front of the restored RDS instances. Finally, 300+ Amazon CloudWatch Alarms are created to monitor the environment and trigger necessary alerts. In total Sportradar’s deployment process brings online: 15+ Step Functions for data processing, 30+ micro-services, 10+ Amazon RDS instances with over 150GB of data, 80+ SQS Queues, 180+ Lambda functions, CDN for UI, Amazon Elasticache, and 300+ CloudWatch alarms to monitor the applications. In all, that is over 600 resources deployed with data restored consistently in less than 2 hours total.

Reference Architecture Diagram for How Sportradar Accelerated Data Recovery Using AWS Services

Figure 2 – Reference Architecture Diagram of the Recovered Application

Conclusion

In this blog, we showed how Sportradar’s team used Step Functions to accelerate their deployments, and a walk-through of an example disaster recovery scenario. Step Functions can be used to orchestrate the deployment and configuration of a new environment, allowing complex environments to be deployed in stages, and for those stages to appropriately wait on their dependencies.

For examples of Step Functions being used in different orchestration scenarios, check out how Step Functions acts as an orchestrator for ETLs in Orchestrate multiple ETL jobs using AWS Step Functions and AWS Lambda and Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy. For migrations of Amazon EC2 based workloads, read more about CloudEndure, Migrating workloads across AWS Regions with CloudEndure Migration.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Using Amazon Macie to Validate S3 Bucket Data Classification

2021-07-16 Bill Magee

Post Syndicated from Bill Magee original https://aws.amazon.com/blogs/architecture/using-amazon-macie-to-validate-s3-bucket-data-classification/

Securing sensitive information is a high priority for organizations for many reasons. At the same time, organizations are looking for ways to empower development teams to stay agile and innovative. Centralized security teams strive to create systems that align to the needs of the development teams, rather than mandating how those teams must operate.

Security teams who create automation for the discovery of sensitive data have some issues to consider. If development teams are able to self-provision data storage, how does the security team protect that data? If teams have a business need to store sensitive data, they must consider how, where, and with what safeguards that data is stored.

Let’s look at how we can set up Amazon Macie to validate data classifications provided by decentralized software development teams. Macie is a fully managed service that uses machine learning (ML) to discover sensitive data in AWS. If you are not familiar with Macie, read New – Enhanced Amazon Macie Now Available with Substantially Reduced Pricing.

Data classification is part of the security pillar of a Well-Architected application. Following the guidelines provided in the AWS Well-Architected Framework, we can develop a resource-tagging scheme that fits our needs.

Overview of decentralized data validation system

In our example, we have multiple levels of data classification that represent different levels of risk associated with each classification. When a software development team creates a new Amazon Simple Storage Service (S3) bucket, they are responsible for labeling that bucket with a tag. This tag represents the classification of data stored in that bucket. The security team must maintain a system to validate that the data in those buckets meets the classification specified by the development teams.

This separation of roles and responsibilities for development and security teams who work independently requires a validation system that’s decoupled from S3 bucket creation. It should automatically detect new buckets or data in the existing buckets, and validate the data against the assigned classification tags. It should also notify the appropriate development teams of misclassified or unclassified buckets in a timely manner. These notifications can be through standard notification channels, such as email or Slack channel notifications.

Validation and alerts with AWS services

Figure 1. Validation system for data classification

We assume that teams are permitted to create S3 buckets and we will use AWS Config to enforce the following required tags: DataClassification and SupportSNSTopic. The DataClassification tag indicates what type of data is allowed in the bucket. The SupportSNSTopic tag indicates an Amazon Simple Notification Service (SNS) topic. If there are issues found with the data in the bucket, a message is published to the topic, and Amazon SNS will deliver an alert. For example, if there is personally identifiable information (PII) data in a bucket that is classified as non-sensitive, the system will alert the owners of the bucket.

Macie is configured to scan all S3 buckets on a scheduled basis. This configuration ensures that any new bucket and data placed in the buckets is analyzed the next time the Macie job runs.

Macie provides several managed data identifiers for discovering and classifying the data. These include bank account numbers, credit card information, authentication credentials, PII, and more. You can also create custom identifiers (or rules) to gather information not covered by the managed identifiers.

Macie integrates with Amazon EventBridge to allow us to capture data classification events and route them to one or more destinations for reporting and alerting needs. In our configuration, the event initiates an AWS Lambda. The Lambda function is used to validate the data classification inferred by Macie against the classification specified in the DataClassification tag using custom business logic. If a data classification violation is found, the Lambda then sends a message to the Amazon SNS topic specified in the SupportSNSTopic tag.

The Lambda function also creates custom metrics and sends those to Amazon CloudWatch. The metrics are organized by engineering team and severity. This allows the security team to create a dashboard of metrics based on the Macie findings. The findings can also be filtered per engineering team and severity to determine which teams need to be contacted to ensure remediation.

Conclusion

This solution provides a centralized security team with the tools it needs. The team can validate the data classification of an Amazon S3 bucket that is self-provisioned by a development team. New Amazon S3 buckets are automatically included in the Macie jobs and alerts. These are only sent out if the data in the bucket does not conform to the classification specified by the development team. The data auditing process is loosely coupled with the Amazon S3 Bucket creation process, enabling self-service capabilities for development teams, while ensuring proper data classification. Your teams can stay agile and innovative, while maintaining a strong security posture.

Learn more about Amazon Macie and Data Classification.

Architecting a Highly Available Serverless, Microservices-Based Ecommerce Site

2021-07-15 Senthil Kumar

Post Syndicated from Senthil Kumar original https://aws.amazon.com/blogs/architecture/architecting-a-highly-available-serverless-microservices-based-ecommerce-site/

The number of ecommerce vendors is growing globally—they often handle large traffic at different times of the day and different days of the year. This, in addition to building, managing, and maintaining IT infrastructure on-premises data centers can present challenges to ecommerce businesses’ scalability and growth.

This blog provides you a Serverless on AWS solution that offloads the undifferentiated heavy lifting of managing resources and ensures your businesses’ architecture can handle peak traffic.

Common architecture set up versus serverless solution

The following sections describe a common monolithic architecture and our suggested alternative approach: setting up microservices-based order submission and product search modules. These modules are independently deployable and scalable.

Typical monolithic architecture

Figure 1 shows how a typical on-premises ecommerce infrastructure with different tiers is set up:

Web servers serve static assets and proxy requests to application servers
Application servers process ecommerce business logic and authentication logic
Databases store user and other dynamic data
Firewall and load balancers provide network components for load balancing and network security

Figure 1. Monolithic on-premises ecommerce infrastructure with different tiers

Monolithic architecture tightly couples different layers of the application. This prevents them from being independently deployed and scaled.

Microservices-based modules

Order submission workflow module

This three-layer architecture can be set up in the AWS Cloud using serverless components:

Static content layer (Amazon CloudFront and Amazon Simple Storage Service (Amazon S3)). This layer stores static assets on Amazon S3. By using CloudFront in front of the S3 storage cache, you can deliver assets to customers globally with low latency and high transfer speeds.
Authentication layer (Amazon Cognito or customer proprietary layer). Ecommerce sites deliver authenticated and unauthenticated content to the user. With Amazon Cognito, you can manage users’ sign-up, sign-in, and access controls, so this authentication layer ensures that only authenticated users have access to secure data.
Dynamic content layer (AWS Lambda and Amazon DynamoDB). All business logic required for the ecommerce site is handled by the dynamic content layer. Using Lambda and DynamoDB ensures that these components are scalable and can handle peak traffic.

As shown in Figure 2, the order submission workflow is split into two sections: synchronous and asynchronous.

By splitting the order submission workflow, you allow users to submit their order details and get an orderId. This makes sure that they don’t have to wait for backend processing to complete. This helps unburden your architecture during peak shopping periods when the backend process can get busy.

Figure 2. Microservices-based order submission workflow

The details of the order, such as credit card information in encrypted form, shipping information, etc., are stored in DynamoDB. This action invokes an asynchronous workflow managed by AWS Step Functions.

Figure 3 shows sample step functions from the asynchronous process. In this scenario, you are using external payment processing and shipping systems. When both systems get busy, step functions can manage long-running transactions and also the required retry logic. It uses a decision-based business workflow, so if a payment transaction fails, the order can be canceled. Or, once payment is successful, the order can proceed.

Amazon Simple Notification Service (Amazon SNS) notifies users whenever their order status changes. You can even extend Step Functions to have it react based on status of shipping.

Figure 3. Sample AWS Step Functions asynchronous workflow that uses external payment processing service and shipping system

Product search module

Our product search module is set up using the following serverless components:

Amazon Elasticsearch Service (Amazon ES) stores product data, which is updated whenever product-related data changes.
Lambda formats the data.
Amazon API Gateway allows users to search without authentication. As shown in Figure 4, searching for products on the ecommerce portal does not require users to log in. All traffic via API Gateway is unauthenticated.

Figure 4. Microservices-based product search workflow module with dynamic traffic through API Gateway

Replicating data across Regions

If your ecommerce application runs on multiple Regions, it may require the content and data to be replicated. This allows the application to handle local traffic from that Region and also act as a failover option if the application fails in another Region. The content and data are replicated using the multi-Region replication features of Amazon S3 and DynamoDB global tables.

Figure 5 shows a multi-Region ecommerce site built on AWS with serverless services. It uses the following features to make sure that data between all Regions are in sync for data/assets that do not need data residency compliance:

Amazon S3 multi-Region replication keeps static assets in sync for assets.
DynamoDB global tables keeps dynamic data in sync across Regions.

Assets that are specific to their Region are stored in Regional specific buckets.

Figure 5. Data replication for a multi-Region ecommerce website built using serverless components

Amazon Route 53 DNS web service manages traffic failover from one Region to another. Route 53 provides different routing policies, and depending on your business requirement, you can choose the failover routing policy.

Best practices

Now that we’ve shown you how to build these applications, make sure you follow these best practices to effectively build, deploy, and monitor the solution stack:

Infrastructure as Code (IaC). A well-defined, repeatable infrastructure is important for managing any solution stack. AWS CloudFormation allows you to treat your infrastructure as code and provides a relatively easy way to model a collection of related AWS and third-party resources.
AWS Serverless Application Model (AWS SAM). An open-source framework. Use it to build serverless applications on AWS.
Deployment automation. AWS CodePipeline is a fully managed continuous delivery service that automates your release pipelines for fast and reliable application and infrastructure updates.
AWS CodeStar. Allows you to quickly develop, build, and deploy applications on AWS. It provides a unified user interface, enabling you to manage all of your software development activities in one place.
AWS Well-Architected Framework. Provides a mechanism for regularly evaluating your workloads, identifying high risk issues, and recording your improvements.
Serverless Applications Lens. Documents how to design, deploy, and architect serverless application workloads.
Monitoring. AWS provides many services that help you monitor and understand your applications, including Amazon CloudWatch, AWS CloudTrail, and AWS X-Ray.

Conclusion

In this blog post, we showed you how to architect a highly available, serverless, and microservices-based ecommerce website that operates in multiple Regions.

We also showed you how to replicate data between different Regions for scaling and if your workload fails. These serverless services reduce the burden of building and managing physical IT infrastructure to help you focus more on building solutions.

Related information

10 Things Serverless Architects Should Know

Integrating Amazon Connect and Amazon Lex with Third-party Systems

2021-07-15 Steven Warwick

Post Syndicated from Steven Warwick original https://aws.amazon.com/blogs/architecture/integrating-amazon-connect-and-amazon-lex-with-third-party-systems/

AWS customers who provide software solutions that integrate with AWS often require design patterns that offer some flexibility. They must build, support, and expand products and solutions to meet their end user business requirements. These design patterns must use the underlying services and infrastructure through API operations. As we will show, third-party solutions can integrate with Amazon Connect to initiate customer-specific workflows. You don’t need specific utterances when using Amazon Lex to convert speech to text.

Introduction to Amazon Connect workflows

Platform as a service (PaaS) systems are built to handle a variety of use cases with various inputs. These inputs are provided by upstream systems and sometimes result in complex integrations. For example, when creating a call center management solution, these workflows may require opaque transcription data to be sent to downstream third-party system. This pattern allows the caller to interact with a third-party system with an Amazon Connect flow. The solution allows for communication to occur multiple times. Opaque transcription data is transferred between the Amazon Connect contact flow, through Amazon Lex, and then to the third-party system. The third-party system can modify and update workflows without affecting the Amazon Connect or Amazon Lex systems.

API workflow use case

AnyCompany Tech (our hypothetical company) is a PaaS company that allows other companies to quickly build workflows with their tools. It provides the ability to take customer calls with a request-response style interaction. AnyCompany built an API into the Amazon Connect flow to allow their end users to return various response types. Examples of the response types are “disconnect,” “speak,” and “failed.”

Using Amazon Connect, AnyCompany allows their end users to build complex workflows using a basic question-response API. Each customer of AnyCompany can build workflows that respond to a voice input. It is processed via the high-quality speech recognition and natural language understanding capabilities of Amazon Lex. The caller is prompted by “What is your question?” Amazon Lex processes the audio input then invokes a Lambda function that connects to AnyCompany Tech. They in turn initiate their customer’s unique workflow. The customer’s workflow may change over time without requiring any further effort from AnyCompany Tech.

Questions graph database use case

AnyCompany Storage is a company that has a graph database that stores documents and information correlating the business to its inventory, sales, marketing, and employees. Accessing this database will be done via a question-response API. For example, such questions might be: “What are our third quarter earnings?”, “Do we have product X in stock?”, or “When was Jane Doe hired?” The company wants the ability to have their employees call in and after proper authentication, ask any question of the system and receive a response. Using the architecture in Figure 1, the company can link their Amazon Connect implementation up to this API. The output from Amazon Lex is passed into the API, a response is received, and it is then passed to Amazon Connect.

Amazon Connect third-party system architecture

Figure 1. End-customer call flow

User calls Amazon Connect using the telephone number for the connect instance.
Amazon Connect receives the incoming call and starts an Amazon Connect contact flow. This Amazon Connect flow captures the caller’s utterance and forwards it to Amazon Lex.
Amazon Lex starts the requested bot. The Amazon Lex bot translates the caller’s utterance into text and sends it to AWS Lambda via an event.
AWS Lambda accepts the incoming data, transforms or enhances it as needed, and calls out to the external API via some transport
The external API processes the content sent to it from AWS Lambda.
The external API returns a response back to AWS Lambda.
AWS Lambda accepts the response from the external API, then forwards this response to Amazon Lex.
Amazon Lex returns the response content to Amazon Connect.
Amazon Connect processes the response.

Solution components

Amazon Connect

Amazon Connect allows customer calls to get information from the third-party system. An Amazon Connect contact flow is required to get the callers input, which is then sent to Amazon Lex. The Amazon Lex bot must be granted permission to interact with an Amazon Connect contact flow. As shown in Figure 2, the Get customer input block must call the fallback intent of the Amazon Lex bot. Figure 2 demonstrates a basic flow used to get input, check the response, and perform an action.

Figure 2. Basic Amazon Connect contact flow to integrate with Amazon Lex

Amazon Lex

Amazon Lex converts the voice given by a caller into text, which is then processed by a Lambda function. When setting up the Amazon Lex bot you will require one fallback intent (Figure 3), one unused intent (Figure 4), and one clarification prompts disabled (Figure 5).

Figure 3. Amazon Lex fallback intent

The fallback intent is created by using the pre-defined AMAZON.FallbackIntent and must be set up to call a Lambda function in the Fulfillment section. Using the fallback intent and Lambda fulfillment causes the system to ignore any utterance pattern and pass any translation directly to the Lambda function.

Figure 4. Amazon Lex unused intent

The unused intent is only created to satisfy Amazon Lex’s requirement for a bot to have at least one custom intent with a valid utterance.

Figure 5. Amazon Lex error handling clarification prompts

In the error handling section of the Amazon Lex bot, the clarification prompts must be disabled. Disabling the clarification prompts stops the Amazon Lex bot from asking the caller to clarify the input.

AWS Lambda

AWS Lambda is called by Amazon Lex, which is used to interact with the third-party system. The third-party system will return values, which the Lambda will add to its session attributes resulting object. The resulting object from AWS Lambda requires the dialog action to be set up with the type set to “Close,” fulfillmentState set to “Fulfilled,” and the message contentType set to “CustomPayload”. This will allow Amazon Lex to pass the values to Amazon Connect without speaking the results to the caller.

Conclusion

In this blog we showed how Amazon Connect, Amazon Lex, and AWS Lambda functions can be used together to create common interactions with third-party systems. This is a flexible architecture and requires few changes to the Amazon Connect contact flow. You don’t have to set up pre-defined utterances that limit the allowed inputs. Using this solution, AWS customers can provide flexible solutions that interact with third-party systems.

Related information

Intelligently Search Media Assets with Amazon Rekognition and Amazon ES

2021-07-14 Sridhar Chevendra

Post Syndicated from Sridhar Chevendra original https://aws.amazon.com/blogs/architecture/intelligently-search-media-assets-with-amazon-rekognition-and-amazon-es/

Media assets have become increasingly important to industries like media and entertainment, manufacturing, education, social media applications, and retail. This is largely due to innovations in digital marketing, mobile, and ecommerce.

Successfully locating a digital asset like a video, graphic, or image reduces costs related to reproducing or re-shooting. An efficient search engine is critical to quickly delivering something like the latest fashion trends. This in turn increases customer satisfaction, builds brand loyalty, and helps increase businesses’ online footprints, ultimately contributing towards revenue.

This blog post shows you how to build automated indexing and search functions using AWS serverless managed artificial intelligence (AI)/machine learning (ML) services. This architecture provides high scalability, reduces operational overhead, and scales out/in automatically based on the demand, with a flexible pay-as-you-go pricing model.

Automatic tagging and rich metadata with Amazon ES

Asset libraries for images and videos are growing exponentially. With Amazon Elasticsearch Service (Amazon ES), this media is indexed and organized, which is important for efficient search and quick retrieval.

Adding correct metadata to digital assets based on enterprise standard taxonomy will help you narrow down search results. This includes information like media formats, but also richer metadata like location, event details, and so forth. With Amazon Rekognition, an advanced ML service, you do not need to tag and index these media assets. This automatic tagging and organization frees you up to gain insights like sentiment analysis from social media.

Figure 1 is tagged using Amazon Rekognition. You can see how rich metadata (Apparel, T-Shirt, Person, and Pills) is extracted automatically. Without Amazon Rekognition, you would have to manually add tags and categorize the image. This means you could only do a keyword search on what’s manually tagged. If the image was not tagged, then you likely wouldn’t be able to find it in a search.

Figure 1. An image tagged automatically with Amazon Rekognition

Data ingestion, organization, and storage with Amazon S3

As shown in Figure 2, use Amazon Simple Storage Service (Amazon S3) to store your static assets. It provides high availability and scalability, along with unlimited storage. When you choose Amazon S3 as your content repository, multiple data providers are configured for data ingestion for future consumption by downstream applications. In addition to providing storage, Amazon S3 lets you organize data into prefixes based on the event type and captures S3 object mutations through S3 event notifications.

Figure 2. Solution overview diagram

S3 event notifications are invoked for a specific prefix, suffix, or combination of both. They integrate with Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and AWS Lambda as targets. (Refer to the Amazon S3 Event Notifications user guide for best practices). S3 event notification targets vary across use cases. For media assets, Amazon SQS is used to decouple the new data objects ingested into S3 buckets and downstream services. Amazon SQS provides flexibility over the data processing based on resource availability.

Data processing with Amazon Rekognition

Once media assets are ingested into Amazon S3, they are ready to be processed. Amazon Rekognition determines the entities within each asset. Amazon Rekognition then extracts the entities in JSON format and assigns a confidence score.

If the confidence score is below the defined threshold, use Amazon Augmented AI (A2I) for further review. A2I is an ML service that helps you build the workflows required for human review of ML predictions.

Amazon Rekognition also supports custom modeling to help identify entities within the images for specific business needs. For instance, a campaign may need images of products worn by a brand ambassador at a marketing event. Then they may need to further narrow their search down by the individual’s name or age demographic.

Using our solution, a Lambda function invokes Amazon Rekognition to extract the entities from the ingested assets. Lambda continuously polls the SQS queue for any new messages. Once a message is available, the Lambda function invokes the Amazon Rekognition endpoint to extract the relevant entities.

The following is a sample output from detect_labels API call in Amazon Rekognition and the transformed output that will be updated to downstream search engine:

{'Labels': [{'Name': 'Clothing', 'Confidence': 99.98137664794922, 'Instances': [], 'Parents': []}, {'Name': 'Apparel', 'Confidence': 99.98137664794922,'Instances': [], 'Parents': []}, {'Name': 'Shirt', 'Confidence': 97.00833129882812, 'Instances': [], 'Parents': [{'Name': 'Clothing'}]}, {'Name': 'T-Shirt', 'Confidence': 76.36670684814453, 'Instances': [{'BoundingBox': {'Width': 0.7963646650314331, 'Height': 0.6813027262687683, 'Left':
0.09593021124601364, 'Top': 0.1719706505537033}, 'Confidence': 53.39663314819336}], 'Parents': [{'Name': 'Clothing'}]}], 'LabelModelVersion': '2.0', 'ResponseMetadata': {'RequestId': '3a561e82-badc-4ba0-aa77-39a13f1bb3a6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1', 'date': 'Mon, 17 May 2021 18:32:27 GMT', 'x-amzn-requestid': '3a561e82-badc-4ba0-aa77-39a13f1bb3a6','content-length': '542', 'connection': 'keep-alive'}, 'RetryAttempts': 0}}

As shown, the Lambda function submits an API call to Amazon Rekognition, where a T-shirt image in .jpeg format is provided as the input. Based on your confidence score threshold preference, Amazon Rekognition will prompt you to initiate a human review using Amazon A2I. It will also prompt you to use Amazon Rekognition Custom Labels to train the custom models. Lambda then identifies and arranges the labels and updates the specified index.

Indexing with Amazon ES

Amazon ES is a managed search engine service that provides enterprise-grade search engine capability for applications. In our solution, assets are searched based on entities that are used as metadata to update the index. Amazon ES is hosted as a public endpoint or a VPC endpoint for secure access within the specified AWS account.

Labels are identified and marked as tags, which are assigned to .jpeg formatted images. The following sample output shows the query on one of the tags issued on an Amazon ES cluster.

Query:

curl-XGET https://<ElasticSearch Endpoint>/<_IndexName>/_search?q=T-Shirt

Output:

{"took":140,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.05460011,"hits":[{"_index":"movies","_type":"_doc","_id":"15","_score":0.05460011,"_source":{"fileName":"s7-1370766_lifestyle.jpg","objectTags":["Clothing","Apparel","Sailor
Suit","Sleeve","T-Shirt","Shirt","Jersey"]}}]}}

In addition to photos, Amazon Rekognition also detects the labels on videos. It can recognize labels and identify characters and entities. These are then added to Amazon ES to enhance search capability. This allows users to skip to specific parts of a video for quick searchability. For instance, a marketer may need images of cashmere sweaters from a fashion show that was streamed and recorded.

Once the raw video clip is identified, it is then converted using Amazon Elastic Transcoder to play back on mobile devices, tablets, web browsers, and connected televisions. Elastic Transcoder is a highly scalable and cost-effective media transcoding service in the cloud. Segmented output renditions are created for delivery using the multiple protocols to compatible devices.

Conclusion

This blog describes AWS services that can be applied to diverse set of use cases for tagging and efficient search of images and videos. You can build automated indexing and search using AWS serverless managed AI/ML services. They provide high scalability, reduce operational overhead, and scale out/in automatically based on the demand, with a flexible pay-as-you-go pricing model.

To get started, use these references to create your own sample architectures:

Architecture Monthly Magazine: Genomics

2021-07-13 Jane Scolieri

Post Syndicated from Jane Scolieri original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-genomics/

The field of genomics has made huge strides in the last 20 years.

Genomics organizations and researchers are rising to the many challenges we face today, and seeking improved methods for future needs. Amazon Web Services (AWS) provides an array of services that can help the genomics industry with securely handling and interpreting genomics data, assisting with regulatory compliance, and supporting complex research workloads. In this issue, we have case studies from Lifebit and Fred Hutch, blogs on genomic sequencing and the Registry of Open Data, and some reference architecture and solutions to support your work.

We include videos from the Smithsonian, AstraZeneca, Genomic Discoveries, AMP lab, Illumina, and the University of Sydney.

We hope you’ll find this edition of Architecture Monthly useful. We’d like to thank Kelli Jonakin, PhD, Global Head of Life Sciences & Genomics Marketing, AWS, as well as our Experts, Ryan Ulaszek, Worldwide Tech Leader – Genomics, and Lisa McFerrin, Worldwide Tech Leader – Bioinformatics, for their contribution.

Please give us your feedback! Include your comments on the Amazon Kindle page. You can view past issues and reach out to [email protected] anytime with your questions and comments.

In this month’s Genomics issue:

Ask an Expert: Ryan Ulaszek & Lisa McFerrin
Executive Brief: Genomics on AWS: Accelerating scientific discoveries and powering business agility
Case Study: Fred Hutch Microbiome Researchers Use AWS to Perform Seven Years of Compute Time in Seven Days
Quick Start: For rapid deployment
Blog: NIH’s Sequence Read Archive, the world’s largest genome sequence repository: Openly accessible on AWS
Solutions: Genomics Secondary Analysis Using AWS Step Functions and AWS Batch
Reference Architecture: Genomics data transfer, analytics, and machine learning reference architecture
Case Study: Lifebit Powers Collaborative Research Environment for Genomics England on AWS
Quick Start: Illumina DRAGEN on AWS
Executive Brief: Genomic data security and compliance on the AWS Cloud
Solutions: Genomics Tertiary Analysis and Data Lakes Using AWS Glue and Amazon Athena
Reference Architecture: Genomics report pipeline reference architecture
Blog: Broad Institute gnomAD data now accessible on the Registry of Open Data on AWS
Quick Start: Workflow orchestration for genomics analysis on AWS
Solutions: Genomics Tertiary Analysis and Machine Learning Using Amazon SageMaker
Reference Architecture: Research data lake ingestion pipeline reference architecture
Videos:
- The Smithsonian Institution Improves Genome Annotation for Biodiverse Species Using the AWS Cloud
- AstraZeneca Genomics on AWS: A Journey from Petabytes to New Medicines
- Accelerate Genomic Discoveries on AWS
- UC Berkeley AMP Lab Genomics Project on AWS – Customer Success Story
- Helix Uses Illumina’s BaseSpace Sequence Hub on AWS to Build Their Personal Genomics Platform
- University of Sydney Accelerate Genomics Research with AWS and Ronin

Download the Magazine

How to access the magazine

View and download past issues as PDFs on the AWS Architecture Monthly webpage.
Readers in the US, UK, Germany, and France can subscribe to the Kindle version of the magazine at Kindle Newsstand.
Visit Flipboard, a personalized mobile magazine app that you can also read on your computer.
We hope you’re enjoying Architecture Monthly, and we’d like to hear from you—leave us a star rating and comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected].

Should I Run my Containers on AWS Fargate, AWS Lambda, or Both?

2021-07-12 Rob Solomon

Post Syndicated from Rob Solomon original https://aws.amazon.com/blogs/architecture/should-i-run-my-containers-on-aws-fargate-aws-lambda-or-both/

Containers have transformed how companies build and operate software. Bundling both application code and dependencies into a single container image improves agility and reduces deployment failures. But what compute platform should you choose to be most efficient, and what factors should you consider in this decision?

With the release of container image support for AWS Lambda functions (December 2020), customers now have an additional option for building serverless applications using their existing container-oriented tooling and DevOps best practices. In addition, a single container image can be configured to run on both of these compute platforms: AWS Lambda (using serverless functions) or AWS Fargate (using containers).

Three key factors can influence the decision of what platform you use to deploy your container: startup time, task runtime, and cost. That decision may vary each time a task is initiated, as shown in the three scenarios following.

Design considerations for deploying a container

Total task duration consists of startup time and runtime. The startup time of a containerized task is the time required to provision the container compute resource and deploy the container. Task runtime is the time it takes for the application code to complete.

Startup time: Some tasks must complete quickly. For example, when a user waits for a web response, or when a series of tasks is completed in sequential order. In those situations, the total duration time must be minimal. While the application code may be optimized to run faster, startup time depends on the chosen compute platform as well. AWS Fargate container startup time typically takes from 60 to 90 seconds. AWS Lambda initial cold start can take up to 5 seconds. Following that first startup, the same containerized function has negligible startup time.

Task runtime: The amount of time it takes for a task to complete is influenced by the compute resources allocated (vCPU and memory) and application code. AWS Fargate lets you select vCPU and memory size. With AWS Lambda, you define the amount of allocated memory. Lambda then provisions a proportional quantity of vCPU. In both AWS Fargate and AWS Lambda uses, increasing the amount of compute resources may result in faster completion time. However, this will depend on the application. While the additional compute resources incur greater cost, the total duration may be shorter, so the overall cost may also be lower.

AWS Lambda has a maximum limit of 15 minutes of runtime. Lambda shouldn’t be used for these tasks to avoid the likelihood of timeout errors.

Figure 1 illustrates the proportion of startup time to total duration. The initial steepness of each line shows a rapid decrease in startup overhead. This is followed by a flattening out, showing a diminishing rate of efficiency. Startup time delay becomes less impactful as the total job duration increases. Other factors (such as cost) become more significant.

Figure 1. Ratio of startup time as a function to overall job duration for each service

Cost: When making the choice between Fargate and Lambda, it is important to understand the different pricing models. This way, you can make the appropriate selection for your needs.

AWS Fargate on Amazon ECS pricing is calculated per second based on allocated memory, vCPUs, and storage
AWS Lambda pricing is calculated per millisecond based on allocated memory and number of function invocations

Figure 2 shows a cost analysis of Lambda vs Fargate. This is for the entire range of configurations for a runtime task. For most of the range of configurable memory, AWS Lambda is more expensive per second than even the most expensive configuration of Fargate.

Figure 2. Total cost for both AWS Lambda and AWS Fargate based on task duration

From a cost perspective, AWS Fargate is more cost-effective for tasks running for several seconds or longer. If cost is the only factor at play, then Fargate would be the better choice. But the savings gained by using Fargate may be offset by the business value gained from the shorter Lambda function startup time.

Dynamically choose your compute platform

In the following scenarios, we show how a single container image can serve multiple use cases. The decision to run a given containerized application on either AWS Lambda or AWS Fargate can be determined at runtime. This decision depends on whether cost, speed, or duration are the priority.

In Figure 3, an image-processing AWS Batch job runs on a nightly schedule, processing tens of thousands of images to extract location information. When run as a batch job, image processing may take 1–2 hours. The job pulls images stored in Amazon Simple Storage Service (S3) and writes the location metadata to Amazon DynamoDB. In this case, AWS Fargate provides a good combination of compute and cost efficiency. An added benefit is that it also supports tasks that exceed 15 minutes. If a single image is submitted for real-time processing, response time is critical. In that case, the same image-processing code can be run on AWS Lambda, using the same container image. Rather than waiting for the next batch process to run, the image is processed immediately.

Figure 3. One-off invocation of a typically long-running batch job

In Figure 4, a SaaS application uses an AWS Lambda function to allow customers to submit complex text search queries for files stored in an Amazon Elastic File System (EFS) volume. The task should return results quickly, which is an ideal condition for AWS Lambda. However, a small percentage of jobs run much longer than the average, exceeding the maximum duration of 15 minutes.

A straightforward approach to avoid job failure is to initiate an Amazon CloudWatch alarm when the Lambda function times out. CloudWatch alarms can automatically retry the job using Fargate. An alternate approach is to capture historical data and use it to create a machine learning model in Amazon SageMaker. When a new job is initiated, the SageMaker model can predict the time it will take the job to complete. Lambda can use that prediction to route the job to either AWS Lambda or AWS Fargate.

Figure 4. Short duration tasks with occasional outliers running longer than 15 minutes

In Figure 5, a customer runs a containerized legacy application that encompasses many different kinds of functions, all related to a recurring data processing workflow. Each function performs a task of varying complexity and duration. These can range from processing data files, updating a database, or submitting machine learning jobs.

Using a container image, one code base can be configured to contain all of the individual functions. Longer running functions, such as data preparation and big data analytics, are routed to Fargate. Shorter duration functions like simple queries can be configured to run using the container image in AWS Lambda. By using AWS Step Functions as an orchestrator, the process can be automated. In this way, a monolithic application can be broken up into a set of “Units of Work” that operate independently.

Figure 5. Heterogeneous function orchestration

Conclusion

If your job lasts milliseconds and requires a fast response to provide a good customer experience, use AWS Lambda. If your function is not time-sensitive and runs on the scale of minutes, use AWS Fargate. For tasks that have a total duration of under 15 minutes, customers must decide based on impacts to both business and cost. Select the service that is the most effective serverless compute environment to meet your requirements. The choice can be made manually when a job is scheduled or by using retry logic to switch to the other compute platform if the first option fails. The decision can also be based on a machine learning model trained on historical data.

Field Notes: How to Set Up Your Cross-Account and Cross-Region Database for Amazon Aurora

2021-07-09 Ashutosh Pateriya

Post Syndicated from Ashutosh Pateriya original https://aws.amazon.com/blogs/architecture/field-notes-how-to-set-up-your-cross-account-and-cross-region-database-for-amazon-aurora/

This post was co-written by Ashutosh Pateriya, Solution Architect at AWS and Nirmal Tomar, Principal Consultant at Infosys Technologies Ltd.

Various organizations have stringent regulatory compliance obligations or business requirements that require an effective cross-account and cross-region database setup. We recommend to establish a Disaster Recovery (DR) environment in different AWS accounts and Regions for a system design that can handle AWS Region failures.

Account-level separation is important for isolating production environments from other environments, that is, development, test, UAT and DR environments. These are defined by external compliance requirements, such as PCI-DSS or HIPAA. Cross-account replication helps an organization recover data from a replicated account, when the primary AWS account is compromised, and access to the account is lost. Using multiple Regions gives you greater control over the recovery time, if there is a hard dependency failure on a Regional AWS service.

You can use Amazon Aurora Global Database to replicate your RDS database across Regions, within the same AWS Account. In this blog, we show how to build a custom solution for Cross-Account, Cross-Region database replication with configurable Recovery Time Objective (RTO)/ Recovery Point Objective (RPO).

Solution Overview

a custom solution for Cross-Account, Cross-Region database replication with configurable Recovery Time

Figure 1 – Architecture of a Cross-Account, Cross-Region database replication with configurable Recovery Time Objective (RTO)/ Recovery Point Objective (RPO)

Step 1: Automated Amazon Aurora snapshots cannot be shared with other AWS accounts. Hence, create a manual version of the snapshot for the same.

Step 2: Amazon Aurora cannot be restored from the snapshot created in different AWS Regions. To overcome this, you must copy the snapshot in the target AWS Region after AWS KMS encryption.

Step 3: Share the DB cluster snapshot with the target account, in order to restore the cluster.

Step 4: Restore Aurora Cluster from the shared snapshot.

Step 5: Restore the required Aurora Instances.

Step 6: Swap the cluster and instance names, to make it accessible from the previous Reader/Writer endpoint.

Step 7: Perform a sanity test and shut down the old cluster.

Prerequisites

Two AWS accounts with permission to spin up required resources.
A running Aurora instance in the source AWS account.

Walkthrough

The solution outlined is custom built, and automated using AWS Step functions, AWS Lambda, Amazon Simple Notification Service (SNS), AWS Key Management Service (KMS), AWS CloudFormation Templates, Amazon Aurora and AWS Identity and Access Management IAM by following these four steps:

Launch a one-time set up of the AWS KMS in the AWS source account to share the key with destination account. This key will be used by the destination account to restore the Aurora cluster from a KMS encrypted snapshot.
Set up the step function workflow in the AWS source account and source Region. On a scheduled interval, this workflow creates a snapshot in the source account and source Region. Copy it to the target Region and push a notification to the SNS topic and setup a Lambda function as a target for the SNS topic. Refer to the documentation to learn more about Using Lambda with Amazon SNS.
Set up the step function workflow in the AWS target account and in the target Region. This workflow is initiated from the previous workflow after sharing the snapshot. It restores the Aurora cluster from the shared snapshot in the target account and in the target Region.
Provide access to the Lambda function in the target account and target Region. Set it up to listen to the SNS topic in the source account and source Region.

Figure 2 – Architecture showing AWS Step Function in Source Account and Source Region

We can implement a Lambda function with different supported programming languages. For this solution, we selected Python to implement various Lambda functions used in both step function workflows.

Note: We are using us-west-2 as the source Region and us-west-1 as the target Region for this demo. You can choose Regions as per your design need.

Part 1: One Time AWS KMS key setup in the Source Account and Target Region for cross-account access of the Encrypted Snapshot

We can share a Customer Master Key (CMK) across multiple AWS accounts using different ways, such as using AWS CloudFormation templates, CLI, and AWS console.

Following are one-time steps that allow you to create an AWS KMS key in the source AWS account. You can then provide cross-account usage access of the same key to the destination account for restoring the Aurora cluster. You can achieve this with the AWS Console and AWS CloudFormation Template.

Note: Several AWS services being used do not fall under the AWS Free Tier. To avoid extra charges, make sure to delete resources once the demo is completed.

Using the AWS Console

You can set up cross-account AWS KMS keys using the AWS Console. For more information on the setup, refer to the guide: Allowing users in other accounts to use a CMK.

Using the AWS CloudFormation Template

You can set up cross account AWS KMS keys using CloudFormation templates by following these steps:

Launch the template:

Use the source account and destination AWS Region.
Type the appropriate stack name and destination account number and select Create stack.

Create Stack screenshot

Click on the Resources tab to confirm if the KMS key and its alias name have been created.

Resources screenshot

Part 2: Snapshot creation and sharing Target Account in Target Region using the Step Function workflow

Now, we schedule the AWS Step Function workflow with single/multiple times in a day/week/month using Amazon EventBridge or Amazon CloudWatch Events per Recovery Time Objective (RTO)/ Recovery Point Objective (RPO).

This workflow first creates a snapshot in the AWS source account and source Region> The snapshot copies it in target Region after KMS encryption, shares the snapshot with the target account and sends a notification to SNS Topic. A Lambda function is subscribed to the topic in the target account.

Diagram showing AWS Step Function in Source Account and Source Region

Figure 3 – Architecture showing the AWS Step Functions Workflow

The Import Lambda functions involved in this workflow:

a. CreateRDSSnapshotInSourceRegion: This Lambda function creates a snapshot in the source Account and Region for an existing Aurora cluster.

b. FetchSourceRDSSnapshotStatus: This Lambda function fetches the snapshot status and waits for its availability before moving to the next step.

c. EncrytAndCopySnapshotInTargetRegion: This Lambda function copies the source snapshot in the target AWS Region after encryption with AWS KMS keys.

d. FetchTargetRDSSnapshotStatus: This Lambda function fetch the snapshot status in the target Region and will wait to move to the next step once snapshot status changed to available.

e. ShareSnapshotStateWithTargetAccount: This Lambda function shares the snapshot with the target account and target Region.

f. SendSNSNotificationToTargetAccount: This option sends the notification to the SNS topic. An AWS Lambda function is subscribed to this topic for the destination AWS Account.

Using the following AWS CloudFormation Template, we set up the required services/workflow in the source account and source Region:

a. Launch the template in the source account and source Region:

Specify stack details screenshot

b. Fill out the preceding form with the following details and select next.

Stack name: Stack Name which you want to create.

DestinationAccountNumber: Destination AWS Account number where you want to restore the Aurora cluster.

KMSKey: KMS key ARN number setup in previous steps.

ScheduleExpression: Schedule a cron expression when you want to run this workflow. Sample expressions are available in this guide, Schedule expressions using rate or cron.

SourceDBClusterIdentifier: The identifier of the DB cluster, for which a snapshot is created. It must match the identifier of an existing DBCluster. This parameter isn’t case-sensitive.

SourceDBSnapshotName: Snapshot name which we will use in this flow.

TargetAWSRegion: Destination AWS Account Region where you want to restore the Aurora cluster.

Permissions screenshot

c. Select IAM role to launch this template per best practices.

Capabilities and transforms screenshot

d. Acknowledge to create various resources including IAM roles/policies and select Create Stack.

Part 3: Workflow to Restore the Aurora Cluster in the AWS Target Account and Target Region

Once AWS Lambda receives a cross-account sns notification for the shared snapshot, this workflow begins:

The Lambda function starts the AWS Step Function workflow in an asynchronous mode.
The workflow checks for the snapshot availability and starts restoring the Aurora cluster based on the configured engine, version, and other parameters.
Depending on the availability of the cluster, it adds instances based on a predefined configuration.
Once the instances are available, it swaps cluster names so that the updated database is available at the same reader and writer endpoints.
Once the databases are restored, the old Aurora cluster is shut down.

Note:

a) The database is restored in the default VPC, Availability Zone, and option group, with the default cluster version and option group. If you want to restore with the specific configurations, you can edit the restore cluster function with details available in the same section.

b) Database credentials will be the same as source account. If you want to override credential or change the authentication mechanism, you can edit the restore cluster function with details available in the same section.

Figure 4 – Architecture showing the AWS Step Functions Workflow

Import Lambda functions involved in this workflow:

a) FetchRDSSnapshotStatus: This Lambda function fetches the snapshot status and moves to RestoreCluster functionality once the snapshot is available.

b) RestoreRDSCluster: This Lambda function restores the cluster from the shared snapshot with the mandatory attribute only.

– You can edit the Lambda function as described following to add specific parameters like specific VPC, Subnets, AvailabilityZones, DBSubnetGroupName, Autoscaling for reader capacity, DBClusterParameterGroupName, BacktrackWindow and many more.

c) RestoreInstance: This Lambda function creates an instance for the restored cluster.

d) SwapClusterName: This Lambda function swaps the cluster name, so that the updated database is available at previous endpoints. Then the cluster url is swapped.

e) TerminateOldCluster: This Lambda function will shut down the previous cluster.

Using the following AWS CloudFormation Template, we can set up the required services/workflow in the target account and target Region:

a) Launch the template in target account and target Region:

Specify stack details screenshot

b) Fill out the preceding form with the following details and click the next button.

Stack name: Stack name which you want to create.

Engine: The database engine to be used for the new DB cluster, such as aurora-mysql. It must be compatible with the engine of the source.

DBInstanceClass: The compute and memory capacity of the Amazon RDS DB instance, for example, db.t3.medium.

Not all DB instance classes are available in all AWS Regions, or for all database engines. For the full list of DB instance classes, and availability for your engine, see DB Instance Class in the Amazon RDS User Guide.

ExistingDBClusterIdentifier: Name of the existing DB cluster. In the final step, if the old cluster is available with this name, it will be deleted and a new cluster will be available with this name, so the reader/writer endpoint will not change. This parameter isn’t case-sensitive. It must contain from 1 to 63 letters, numbers, or hyphens. The first character must be a letter. DBClusterIdentifier can’t end with a hyphen or contain two consecutive hyphens.

TempDBClusterIdentifier: Temporary name of the DB cluster created from the DB snapshot or DB cluster snapshot. In the final step, this cluster name will be swapped with the original cluster and the previous cluster will be deleted. This parameter isn’t case-sensitive. Must contain from 1 to 63 letters, numbers, or hyphens. The first character must be a letter. DBClusterIdentifier can’t end with a hyphen or contain two consecutive hyphens.

ExistingDBInstanceIdentifier: Name of the existing DB instance identifier. In the final step, if an previous instance is available with this name, it will be deleted and a new instance will be available with this name. This parameter is stored as a lowercase string. Must contain from 1 to 63 letters, numbers, or hyphens. The first character must be a letter, can’t end with a hyphen or contain two consecutive hyphens.

TempDBInstanceIdentifier: Temporary name of the existing DB instance. In the final step, if a previous instance is available with this name, it will be deleted and a new instance will be available. This parameter is stored as a lowercase string. Must contain from 1 to 63 letters, numbers, or hyphens. The first character must be a letter, can’t end with a hyphen or contain two consecutive hyphens.

Permissions2 screenshot

c) Select IAM role to launch this template as per best practices and select Next.

Capabilities and transforms screenshot 2

d) Acknowledge to create various resources including IAM roles/policies and select Create Stack.

After launching the CloudFormation steps, all resources like IAM roles, step function workflow and Lambda functions will be set up in the target account.

Part 4: Cross Account Role for Lambda Invocation to restore the Aurora Cluster

In this step, we provide access to Lambda function setup in the target account and target Region to listen to the SNS topic setup in the source account and source Region. An SNS notification is initiated by the first workflow after the snapshot sharing steps. The Lambda function initiates a workflow to restore the Aurora cluster in the target account and target Region.

Using the following code , we can configure a Lambda function in the target account to listen to the sns topic created in the source account and Region. More details are available in this Tutorial: Using AWS Lambda with Amazon Simple Notification Service.

a) Configure the SNS topic in source account to allow Subscriptions for target account by using the following AWS CLI command in the source account.

aws sns add-permission \
  --label lambda-access --aws-account-id <<targetAccountID>> \
  --topic-arn <<snsTopicinSourceAccount>> \
  --action-name Subscribe ListSubscriptionsByTopic

Replace << targetAccountID >> with target AWS account number and <<snsTopicinSourceAccount>> with sns topic created as an output of part 2 CloudFormation template in the source account.

b) Configure the InvokeStepFunction Lambda function in target account to allow invocation from the SNS topic in source account by using following AWS CLI command in the target account. This policy allows the specific SNS topic in the source account to invoke the Lambda function.

aws lambda add-permission \
  --function-name InvokeStepFunction \
  --source-arn <<snsTopicinSourceAccount>> \
  --statement-id sns-x-account \
  --action "lambda:InvokeFunction" \
   --principal sns.amazonaws.com

Replace <<snsTopicinSourceAccount>> with sns topic created as an output of part 2 CloudFormation template.

c) Subscribe the Lambda function in the target account to the SNS topic in the source account by using the following AWS CLI command in the target account.

aws sns subscribe \
  --topic-arn <<snsTopicinSourceAccount>> \
   --protocol "lambda" \
  --notification-endpoint << InvokeStepFunctionArninTargetAccount>>

Initiate the CLI command in the source Region as sns topic is created there. Replace <<snsTopicinSourceAccount>> with sns topic created as an output of part 2 CloudFormation template execution and << InvokeStepFunctionArninTargetAccount >>with InvokeStepFunction Lambda in the target account created as an output of part 3 CloudFormation template launch.

Security Consideration:

1. We can share the snapshots with different accounts using public and private mode.

Public mode permits all AWS accounts to restore a DB instance from the shared snapshot.
Private mode permits only AWS accounts that you specify to restore a DB instance from the shared snapshot. We recommend sharing the snapshot with the destination account in private mode, as we are using the same in our solution.

Snapshot permissions screenshot

2. Sharing a manual DB cluster snapshot (whether encrypted or unencrypted) enables authorized AWS accounts to directly restore a DB cluster from the snapshot. This is instead of taking a copy and restoring from that. We recommend sharing an encrypted snapshot with the destination account, as we are using the same in our solution.

Cleaning up

Delete the resources to avoid future incurring charges.

Conclusion

In this blog, you learned how to replicate databases across the Region and across different accounts. This helps in designing your DR environment and fulfilling compliance requirements. This solution can be customized for any RDS-based databases or Aurora Serverless, and you can achieve the desired level of RTO and RPO.

Reference:

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

ERGO Breaks New Frontiers for Insurance with AI Factory on AWS

2021-07-07 Piotr Klesta

Post Syndicated from Piotr Klesta original https://aws.amazon.com/blogs/architecture/ergo-breaks-new-frontiers-for-insurance-with-ai-factory-on-aws/

This post is co-authored with Piotr Klesta, Robert Meisner and Lukasz Luszczynski of ERGO

Artificial intelligence (AI) and related technologies are already finding applications in our homes, cars, industries, and offices. The insurance business is no exception to this. When AI is implemented correctly, it adds a major competitive advantage. It enhances the decision-making process, improves efficiency in operations, and provides hassle-free customer assistance.

At ERGO Group, we realized early on that innovation using AI required more flexibility in data integration than most of our legacy data architectures allowed. Our internal governance, data privacy processes, and IT security requirements posed additional challenges towards integration. We had to resolve these issues in order to use AI at the enterprise level, and allow for sensitive data to be used in a cloud environment.

We aimed for a central system that introduces ‘intelligence’ into other core application systems, and thus into ERGO’s business processes. This platform would support the process of development, training, and testing of complex AI models, in addition to creating more operational efficiency. The goal of the platform is to take the undifferentiated heavy lifting away from our data teams so that they focus on what they do best – harness data insights.

Building ERGO AI Factory to power AI use cases

Our quest for this central system led to the creation of AI Factory built on AWS Cloud. ERGO AI Factory is a compliant platform for running production-ready AI use cases. It also provides a flexible model development and testing environment. Let’s look at some of the capabilities and services we offer to our advanced analytics teams.

Figure 1: AI Factory imperatives

Compliance: Enforcing security measures (for example, authentication, encryption, and least privilege) was one of our top priorities for the platform. We worked closely with the security teams to meet strict domain and geo-specific compliance requirements.
Data governance: Data lineage and deep metadata extraction are important because they support proper data governance and auditability. They also allow our users to navigate a complex data landscape. Our data ingestion frameworks include a mixture of third party and AWS services to capture and catalog both technical and business metadata.
Data storage and access: AI Factory stores data in Amazon Simple Storage Service (S3) in a secure and compliant manner. Access rights are only granted to individuals working on the corresponding projects. Roles are defined in Active Directory.
Automated data pipelines: We sought to provide a flexible and robust data integration solution. An ETL pipeline using Apache Spark, Apache Airflow, and Kubernetes pods is central to our data ingestion. We use this for AI model development and subsequent data preparation for operationalization and model integration.
Monitoring and security: AI Factory relies on open-source cloud monitoring solutions like Grafana to detect security threats and anomalies. It does this by collecting service and application logs, tracking metrics, and generating alarms.
Feedback loop: We store model inputs/outputs and use BI tools, such as Amazon QuickSight, to track the behavior and performance of productive AI models. It’s important to share such information with our business partners so we can build their trust and confidence with AI.
Developer-friendly environment: Creating AI models is possible in a notebook-style or integrated development environment. Because our data teams use a variety of machine learning (ML) frameworks and libraries, we keep our platform extensible and our framework agnostic. We support Python/R, Apache Spark, PyTorch and TensorFlow, and more. All this is bolstered by CI/CD processes that accelerate delivery and reduce errors.
Business process integration: AI Factory offers services to integrate ML models into existing business processes. We focus on standardizing processes and close collaboration with business and technical stakeholders. Our overarching goal is to operationalize the AI model in the shortest possible timeframe, while preserving high quality and security standards.

AI Factory architecture

So far, we have looked at the functional building blocks of the AI Factory. Let’s take an architectural view of the platform using a five-step workflow:

Figure 2: AI Factory high-level architecture

Data ingestion environment: We use this environment to ingest data from the prominent on-premises ERGO data sources. We can schedule the batch or Delta transfer data to various cloud destinations using multiple Kubernetes-hosted microservices. Once ingested, data is persisted and cataloged as ERGO’s data lake on Amazon S3. It is prepared for processing by the upstream environments.
Model development environment: This environment is used primarily by data scientists and data engineers. We use Amazon EMR and Amazon SageMaker extensively for data preparation, data wrangling, experimentation with predictive models, and development through rapid iterations.
Model operationalization environment: Trained models with satisfactory KPIs are promoted from the model development to the operationalization environment. This is where we integrate AI models in business processes. The team focuses on launching and optimizing the operation of services and algorithms.
- Integration with ERGO business processes is achieved using Kubernetes-hosted ‘Model Service.’ This allows us to infuse AI models provided by data scientists in existing business processes.
- An essential part of model operationalization is to continuously monitor the quality of the deployed ML models using the ‘feedback loop service.’
Model insights environment: This environment is used for displaying information about platform performance, processes, and analytical data. Data scientists use its services to check for unexpected bias or performance drifts that the model could exhibit. Feedback coming from the business through the “feedback loop service’ allows them to identify problems fast and retrain the model.
Shared services: Though shown as the fifth step of the workflow, the shared services environment supports almost every step in the process. It provides common, shared components between different parts of the platform managing CI/CD and orchestration processes within the AI factory. Additional services like platform logging and monitoring, authentication, and metadata management are also delivered from the shared services environment.

A binding theme across the various subplatforms is that all provisioning and deployment activities are automated using Infrastructure as Code (IaC) practices. This reduces the potential for human error, provides architectural flexibility, and greatly speeds up software development and our infrastructure-related operations.

All components of the AI factory are run in the AWS Cloud and can be scaled and adapted as needed. The connection between model development and operationalization happens at well-defined interfaces to prevent unnecessary coupling of components.

Lessons learned

Security first

Align with security early and often
Understand all the regulatory obligations and document them as critical, non-functional requirements

Modular approach

Combine modern data science technology and professional IT with a cross-functional, agile way of working
Apply loosely coupled services with an API-first approach

Data governance

Tracking technical metadata is important but not sufficient, you need business attributes too
Determine data ownership in operational systems to map upstream data governance workflows
Establish solutions to data masking as the data moves across sub-platforms
Define access rights and permissions boundaries among various personas

FinOps strategy

Carefully track platform cost
Assign owners responsible for monitoring and cost improvements
Provide regular feedback to platform stakeholders on usage patterns and associated expenses

Working with our AWS team

Establish cadence for architecture review and new feature updates
Plan cloud training and enablement

The future for the AI factory

The creation of the AI Factory was an essential building block of ERGO’s strategy. Now we are ready to embrace the next chapter in our advanced analytics journey.

We plan to focus on important use cases that will deliver the highest business value. We want to make the AI Factory available to ERGO’s international subsidiaries. We are also enhancing and scaling its capabilities. We are creating an ‘analytical content hub’ based on automated text extraction, improving speech to text, and developing translation processes for all unstructured and semistructured data using AWS AI services.

Field Notes: Orchestrating and Monitoring Complex, Long-running Workflows Using AWS Step Functions

2021-07-06 Max Winter

Post Syndicated from Max Winter original https://aws.amazon.com/blogs/architecture/field-notes-orchestrating-and-monitoring-complex-long-running-workflows-using-aws-step-functions/

Situation:

IHS Markit’s Wall Street Office (WSO) offers financial reports to hundreds of clients worldwide. When IHS Markit completed the migration of WSO’s SaaS software to AWS, it unlocked the power and agility to deliver new product features monthly, as opposed to a multi-year release cycle. This migration also presented a great opportunity to further enhance the customer experience by automating the WSO reporting team’s own Continuous Integration and Continuous Deployment (CI/CD) workflow. WSO then offered the same migration workflow to its on-prem clients, who needed the ability to upgrade quickly in order to meet a regulatory LIBOR reporting deadline. This rapid upgrade was enabled by fully automating regression testing of new software versions.

In this blog post, I outline the architectures created in collaboration with WSO to orchestrate and monitor the complex, long-running reconciliation workflows in their environment by leveraging the power of AWS Step Functions. To enable each client’s migration to AWS, WSO needed to ensure that the new, AWS version of the reporting application produced identical outputs to the previous, on-premises version. For a single migration, the process is as follows:

spin up the old version of the SQL Server and reporting engine on Windows servers,
run reports,
repeat the process with the new version,
compare the outputs and review the differences.

The problem came with scaling this process. IHS Markit provides financial solutions and tools to numerous clients. To enable these clients to transition away from LIBOR, the WSO team was tasked with migrating over 80 instances of the application, and reconciling hundreds of reports for each migration. During upgrades, customers must manually validate custom extracts created in the WSO Reporting application against current and next version, which limits upgrade frequency and increases the resourcing cost of these validations. Without automation, upgrading all clients would have taken an entire new Operations team, and cost the firm over 700 developer-hours to meet the regulatory LIBOR cessation deadline.

WSO was able to save over 4,000 developer-hours by making this process repeatable so it can be used as an automated regression test as part of the regular Systems Development Lifecycle process The following diagram shows the reconciliation workflow steps enabled as part of this automated process.

Figure 1 – Reconciliation Workflow Steps

Complication:

The team quickly realized that a Serverless and event-driven solution would be required to make this process manageable. The initial approach was to use AWS Lambda functions to call PowerShell scripts to perform each step in the reconciliation process. They also used Amazon SNS to invoke the next Lambda function when the previous step completed.

The problem came when the Operations team tried to monitor these Lambda functions, with multiple parallel reconciliations running concurrently. The Lambda outputs became mixed together in shared Amazon Cloud Watch log groups, and there was no way to quickly see the overall progress of any given reconciliation workflow. It was also difficult to figure out how to recover from errors.

Furthermore, the team found that some steps in this process, such as database restoration, ran longer than the 15 minute Lambda timeout limit. As a result, they were forced to look for alternatives to manage these long-running steps. Following is an architecture diagram showing the serverless component used to automate and scale the process.

Figure 2 – Initial Orchestration Architecture

Solution:

Enter Step Functions and AWS Systems Manager (formerly known as SSM) Automation. To address the problem of orchestrating the many sequential and parallel steps in our workflow, AWS Solutions Architects suggested replacing Amazon SNS with AWS Step Functions.

The Step Functions state machine controls the order in which the steps are invoked, including successful and error state transitions. The service is integrated with 14 other AWS services (Lambda function, SSM Automation, Amazon ECS, and more.), and can invoke them, as well as manual actions. These calls can be synchronous or run via steps that wait for an event. A state machine instance is long-lived and can support processes that take up to a year to complete.

Figure 3 – Step Function Designer UI

This immediately gave the Development team a holistic, visual way to design our workflow, and offered Operations a graphical user interface (UI) to monitor ongoing reconciliations in real time. The Step Functions console lists out all running and past reconciliations, including their status, and allows the operator to drill down into the detailed state diagram of any given reconciliation. The operator can then see how far it’s progressed or where it encountered an error.

The UI also provides Amazon CloudWatch links for any given step, isolating the logs of that particular Lambda execution, eliminating the need to search through the CloudWatch log group manually. The screenshot below illustrates what an in-progress Step Function looks to an operator, with each step listed out with its own status and a link to its log.

Figure 4 – Step Function Execution Monitor

Figure 5 – Step Function Execution Detail Viewer

Figure 6 -Step Function Log Group in CloudWatch

Figure 6 -Step Function Log Group in Amazon CloudWatch

The team also used the Step Function state machine as a container for metadata about each particular reconciliation process instance (like the environment ID and the database and Amazon EC2 instances associated with that environment), reducing the need to pass this data between Lambda functions.

To solve the problem of long-running PowerShell scripts, AWS Solutions Architects suggested using SSM Automation. Unlike Lambda functions, SSM Automation is meant to run operational scripts, with no maximum time limit. They also have native PowerShell integration, which you can use to call the existing scripts and capture their output.

Figure 7 – SSM Automation UI for Monitoring Long-running Tasks and Manual Approval Steps

To save time running hundreds of reports, the team looked into the ‘Map State’ feature of Step Functions. Map takes an array of input data, then creates an instance of the step (in this case a Lambda call) for each item in this array. It waits for them all to complete before proceeding.

This is recommended to implement as a fan-out pattern with almost no orchestration code. The Map State step also gives Operations users the option to limit the level of parallelism, in this case letting only 5 reports run simultaneously. This prevents overloading our reporting applications and databases.

Figure 8 – Map State is used to Fan Out the “CompareReport” Step as Multiple Parallel Steps

To deal with errors in any of the workflow steps, the Development team introduced a manual review step, which you can model in Step Functions. The manual step notifies a mailing list of the error, then waits for a reply to tell it whether to retry or abort the workflow.

The only challenge the Development team found was the mechanism for re-running an individual failed step. At this time, any failure needs to have an explicit state transition within the Step Function’s state diagram. While the Step Function can auto-retry a step, the team wanted to insert a wait-for-human-investigation step before retrying the more expensive and complex steps.

This presented 2 options:

add wait-and-loop-back steps around every step we may want to retry,
route all failures to a single wait-for-investigation step.

The former added significant complexity to the state machine, so AWS Solutions Architects raised this as a product feature that should be added to the Step Functions UI.

The proposed enhancement would allow any failed step to be manually rerun or skipped via UI controls, without adding explicit steps to each state machine to model this. In the meantime, the Dev team went with the latter approach, and had the human error review step loop back to the top of the state machine to retry the entire workflow. To avoid re-running long steps, they created a check within the step Lambda function to query the Step Functions API and determine whether that step had already succeeded before the loop-back, and complete it instantly if it had.

Figure 9 – Human Intervention Steps Used to Allow Time to Review Results or Resolve Errors Before Retrying Failed Steps

Conclusion:

Within 6 weeks, WSO was able to run the first reconciliations and begin the LIBOR migration on time. The Step Function Designer instantly gave the Developer team an operator UI and workflow orchestration engine. Normally, this would have required the creation of an entire 3-tier stack, scheduler and logging infrastructure.

Instead, using Step Functions allowed the developers to spend their time on the reconciliation logic that makes their application unique. The report compare tool developed by the WSO team provides clients with automated artifacts confirming that customer report data remained identical between current version and next version of WSO. The new testing artifacts provide clients with robust and comprehensive testing of critical data extracts.

Figure 10 -The Completed Step Function which Orchestrates an Entire Reconciliation Process

We hope that this blog post provided useful insights to help determine if using AWS Step Functions are a good fit for you.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Vertical Integration Strategy Powered by Amazon EventBridge

2021-07-03 Tiago Oliveira

Post Syndicated from Tiago Oliveira original https://aws.amazon.com/blogs/architecture/vertical-integration-strategy-powered-by-amazon-eventbridge/

Over the past few years, midsize and large enterprises have adopted vertical integration as part of their strategy to optimize operations and profitability. Vertical integration consists of separating different stages of the production line from other related departments, such as marketing and logistics. Enterprises implement such strategy to gain full control of their value chain: from the raw material production to the assembly lines and end consumer.

To achieve operational efficiency, enterprises must keep a level of independence between departments. However, this can lead to unstandardized operations and communication issues. Moreover, with this kind of autonomy for independent and dynamic verticals, the enterprise may lose some measure of visibility and control. As a result, it becomes challenging to generate a basic report from multiple departments. This blog post provides a high-level solution to integrate your different business verticals, using an event-driven architecture on top of Amazon EventBridge.

Event-driven architecture

Event-driven architecture is an architectural pattern to model communication between services while decoupling applications from each other. Applications scale and fail independently, and a central event bus facilitates the communication between the services in the enterprise. Instead of a particular application sending a request directly to another, it produces an event. The central event router captures it and forwards the message to the proper destinations.

For instance, when a customer places a new order on the retail website, the application sends the event to the event bus. Following, the event bus sends the message to the ERP system and the fulfillment center for dispatch. In this scenario, we call the application sending the event, an event publisher, and the applications receiving the event, event consumers.

Because all messages are going through the central event bus, there is clear independence between the applications within the enterprise. Here are some benefits:

Application independence occurs even if they belong to the same business workflow
You can plug in more event consumers to receive the same event type
You can add a data lake to receive all new order events from the retail website
You can receive all the events from the payment system and the customer relations department

This ensures you can integrate independent departments, increase overall visibility, and make sense of specific processes happening in the organization using the right tools.

Implementing event-driven architecture with Amazon EventBridge

Each vertical organically generates lifecycle events. Enterprises can use the event-driven architecture paradigm to make the information flow between the departments by asynchronously exchanging events through the event bus. This way, each department can react to events generated by other departments and initiate processes or actions depending on its business needs.

Such an approach creates a dynamic and flexible choreography between the different participants, which is unique to the enterprise. Such choreography can be followed and monitored using analytics and fine-grained event data collected on the data lake. Read Using AWS X-Ray tracing with Amazon EventBridge to learn how to debug and analyze this kind of distributed application.

Figure 1. Architecture diagram depicting enterprise vertical integration with Amazon EventBridge

In Figure 1, Amazon EventBridge works as the central event bus, the core component of this event-driven architecture. Through Amazon EventBridge, each event publisher sends or receives lifecycle events to and from all the other participants. Amazon EventBridge has an advanced routing mechanism using the concept of rules. Each rule defines up to five targets for the event arriving on the bus. Events are selected based on the event pattern. You can set up routing rules to determine where to send your data to build application architectures. These will react in real time to your data sources, with event publisher and consumer decoupled.

In addition to initiating the heavy routing and distribution of events, Amazon EventBridge can also give real-time insights into how the business runs. Using metrics automatically sent to Amazon CloudWatch, it is possible to see which kinds of events are arriving, and at which rate. You can also see how those events are distributed across the registered targets, and any failures that occur during this distribution. Every event can also be archived using the Amazon EventBridge events archiving feature.

Amazon Simple Storage Service (S3) is the backend storage, or data lake, for all the events that have ever transited via the event bus. With Amazon S3, customers have a cost-efficient storage service at any scale, with 11 9’s of durability. To help customers manage and secure their data, S3 provides features such as Amazon S3 Lifecycle to optimize costs. S3 Object Lock allows the write-once-read-many (WORM) model. You can expand this data and transform it into information using S3. Using services like Amazon Athena, Amazon Redshift, and Amazon EMR, those events can be transformed, correlated, and aggregated to generate insights on the business. The Amazon S3 data lake can also be the input to a data warehouse, machine learning models, and real-time analytics. Learn more about how to use Amazon S3 as the data lake storage.

A critical feature of this solution is the initiation of complex queries on top of the data lake. Amazon API Gateway provides one single flexible and elastic API entry point to retrieve data from the data lake. It also can publish events directly to the event bus. For complex queries, Amazon API Gateway can be integrated with an AWS Lambda. It will coordinate the execution of standard SQL queries using Amazon Athena as the query engine. You can read about a fully functional example of such an API called athena-express.

After collecting data from multiple departments, third-party entities, and shop floors, you can use the data to derive business value using cross-organization dashboards. In this way, you can increase visibility over the different entities and make sense of the data from the distributed systems. Even though this design allows you to use your favorite BI tool, we are using Amazon QuickSight for this solution. For example, with QuickSight, you can author your interactive dashboards, which include machine learning-powered insights. Those dashboards can then connect the marketing campaigns data with the sales data. You can measure how effective those campaigns were and forecast the demand on the production lines.

Conclusion

In this blog post, we showed you how to use Amazon EventBridge as an event bus to allow event-driven architectures. This architecture pattern streamlines the adoption of vertical integration. Enterprises can decouple IT systems from each other while retaining visibility into the data they generate. Integrating those systems can happen asynchronously using a choreography approach instead of having an orchestrator as a central component. There are technical challenges to implement this kind of solution, such as maintaining consistency in distributed applications and transactions spanning multiple microservices. Refer to the saga pattern for microservices-based architecture, and how to implement it using AWS Step Functions.

With a data lake in place to collect all the data produced by IT systems, you can create BI dashboards that provide a holistic view of multiple departments. Moreover, it allows organizations to get better insights into their valuable data and explore other use cases, such as machine learning. To support the data lake creation and management, refer to AWS Lake Formation and a series of other blog posts.

To learn more about Amazon EventBridge from a hands-on perspective, refer to this EventBridge workshop.

Field Notes: Use AWS Cloud9 to Power Your Visual Studio Code IDE

2021-07-01 Nick Ragusa

Post Syndicated from Nick Ragusa original https://aws.amazon.com/blogs/architecture/field-notes-use-aws-cloud9-to-power-your-visual-studio-code-ide/

Everyone has their favorite integrated development environment, or IDE, as it’s more commonly known. For many of us, it’s a tool that we rely on for our day-to-day activities. In some instances, it’s a tool we’ve spent years getting set up just the way we want – from the theme that looks the best to the most productive plugins and extensions that help us optimize our workflows.

After many iterations of trying different IDEs, I’ve chosen Visual Studio Code as my daily driver. One of the things I appreciate most about Visual Studio Code is its vast ecosystem of extensions, allowing me to extend its core functionality to exactly how I need it. I’ve spent hours installing (and sometimes subsequently removing) extensions, figuring out the keyboard shortcut combinations but it’s the theme and syntax highlighter that I find the most appealing. Despite all of this time invested in building the perfect IDE, it can still fall somewhat short.

One of the hardest challenges to overcome, regardless of IDE, is the fact that it’s confined to the hardware that it’s installed on. Some of these challenges include running out of local disk space, RAM exhaustion, or requiring more CPU cores to process a build. In this blog post, I show you how to overcome the limitations of your laptop or desktop by using AWS Cloud9 to power your Visual Studio Code IDE.

Solution Overview

One way to overcome the limitations of your laptop or desktop is to use a remote machine. You’ve probably tried to mount some remote file system over SSH on your machine so you could continue to use your IDE, but timeouts and connection resets made that a frustrating and unusable solution.

Imagine a solution where you can:

Use SSH to connect to a remote instance, install all of your favorite extensions on the remote instance, and take advantage of the remote machine’s resources including CPU, disk, RAM, and more.
Access the instance without needing to open security groups and ACLs from your (often changing) desktop IP.
Leverage your existing AWS infrastructure including your VPC, IAM user / role and policies, and take advantage of AWS Cloud9 to power your remote instance.

With AWS Cloud9, you start with an environment pre-packaged with essential tools for popular programming languages, coupled with the power of Amazon EC2. As an added advantage, you can even take advantage of AWS Cloud9’s built-in auto-shutdown capability which powers off your instance when you’re not actively connected to it. The following diagram shows the architecture you can build to use AWS Cloud9 to power your Visual Studio Code IDE.

Visual Studio Ref Architecture

Walkthrough

I will walk you through setting up the Visual Studio Code with the Remote SSH extension. Once installed, I’ll show you how to leverage the features of AWS Cloud9 to power your Visual Studio Code IDE, including:

Automatically power on your instance when you connect to it
Configure your SSH client to use AWS Systems Manager Session Manager to connect to your AWS Cloud9 instance
Modify your AWS Cloud9 instance to shut down after you disconnect

Before you get started, launch the following AWS CloudFormation template in your AWS account. This will be the easiest way to get up and running to be able to evaluate this solution.

Once launched, a second CloudFormation stack is deployed named aws-cloud9-VS-Code-Remote-SSH-Demo-<unique id>.

Select this stack from your console.
Select the Resources tab, and take note of the instance Physical ID.
Save this value for use later on.

CloudFormation Screenshot

Finally, you may wish to encrypt your AWS Cloud9 instance’s EBS volume for an additional layer of security. Follow this procedure to encrypt your AWS Cloud9 instance’s volume.

Prerequisites

For this walkthrough, you should have the following available to you:

An AWS account
Visual Studio Code downloaded and installed
An OpenSSH compatible SSH client with a key pair that can be used for public key authentication
AWS CLI installed and configured with a named profile that provides access into the AWS account

Install Remote – SSH Extension

First, install the Remote SSH extension in Visual Studio Code. You have the option of clicking the Install button from the preceding link, or by opening the extensions panel (View -> Extensions) from within Visual Studio Code and searching for Remote SSH.

Install Remote – SSH Extension

Once installed, there are a few extension settings I suggest modifying for this solution. To access the extension settings, click on the icon in the extension browser, and go to Extension Settings.

Here, you will land on a screen similar to the following. Adjust:

Remote.SSH: Connect Timeout – I suggest putting a value such as 60 here. Doing so will prevent a timeout from happening in the event your AWS Cloud9 instance is powered off. This gives it ample time to power on and get ready.
Remote.SSH: Default Extensions – For your essential extensions, specify them on this screen. Whenever you connect to a new remote host, these extensions will be installed by default.

Screen Settings screenshot

Install the Session Manager plugin for the AWS CLI

First, install the Session Manager plugin to start and stop sessions that connect to your EC2 instances from the AWS CLI. This plugin can be installed on supported versions of Microsoft Windows, macOS, Linux, and Ubuntu Server.

Note: the plugin requires AWS CLI version 1.16.12 or later to work.

Create an SSH key pair

An SSH key pair is required to access our instance over SSH. Using public key authentication over a simple password provides cryptographic strength over even the most complex passwords. In my macOS environment, I have a utility called ssh-keygen, which I will use to create a key pair. From a terminal, run:

$ ssh-keygen -b 4096 -C 'VS Code Remote SSH user' -t rsa

To avoid any confusion with existing SSH keys, I chose to save my key to /Users/nickragusa/.ssh/id_rsa-cloud9 for this example.

Now that we have a keypair created, we need to copy the public key to the authorized_keys file on the AWS Cloud9 instance. Since I saved my key to /Users/nickragusa/.ssh/id_rsa-cloud9, the corresponding public key has been saved to /Users/nickragusa/.ssh/id_rsa-cloud9.pub.

Open the AWS Cloud9 service console in the same region you deployed the CloudFormation stack. Your environment should be named VS Code Remote SSH Demo. Select Open IDE:

AWS Cloud9 console

Now that you’re in your AWS Cloud9 IDE, edit ~/.ssh/authorized_keys and paste your public key, being sure to paste it below the lines. Following is an example of using the AWS Cloud9 editor to first reveal the file and then edit it:

AWS Cloud9 Editor

Save the file and move on to the next steps.

Modify the shutdown script

As an optional cost saving measure in AWS Cloud9, you can specify a timeout value in minutes that when reached, the instance will automatically power down. This works perfectly when you are connected to your AWS Cloud9 IDE through your browser. In our use case here, we connect to the instance strictly via SSH from our Visual Studio Code IDE. We modify the shutdown logic on the instance itself to check for connectivity from our IDE, and if we’re actively connected, this prevents the automatic shutdown.

To prevent the instance from shutting down while connected from Visual Studio Code, download this script and place it in ~/.c9/stop-if-inactive.sh. From your AWS Cloud9 instance, run:

# Save a copy of the script first
$ sudo mv ~/.c9/stop-if-inactive.sh ~/.c9/stop-if-inactive.sh-SAVE
$ curl https://raw.githubusercontent.com/aws-samples/cloud9-to-power-vscode-blog/main/scripts/stop-if-inactive.sh -o ~/.c9/stop-if-inactive.sh
$ sudo chown root:root ~/.c9/stop-if-inactive.sh
$ sudo chmod 755 ~/.c9/stop-if-inactive.sh

That’s all of the configuration changes that are needed on your AWS Cloud9 instance. The remainder of the work will be done on your laptop or desktop.

ssm-proxy.sh – This script will launch each time you SSH to the AWS Cloud9 instance. Upon invocation, the script checks the current state of your instance. If your instance is powered off, it will execute the aws ec2 start-instances CLI to power on the instance and wait for it to start. Once started, it will use the aws ssm start-session command, along with the Session Manager plugin installed earlier, to create an SSH session with the instance via AWS Systems Manager Session Manager.
~/.ssh/config – For OpenSSH clients, this file defines any user specific SSH configurations you need. In this file, you specify the EC2 instance ID of your AWS Cloud9 instance, the private SSH key to use, the user to connect as, and the location of the ssm-proxy.sh script.

Start by downloading the ssm-proxy.sh script to your machine. To keep things organized, I downloaded this script to my ~/.ssh/ folder.

$ curl https://raw.githubusercontent.com/aws-samples/cloud9-to-power-vscode blog/main/scripts/ssm-proxy.sh -o ~/.ssh/ssm-proxy.sh
$ chmod +x ~/.ssh/ssm-proxy.sh

Next, you edit this script to match your environment. Specifically, there are 4 variables at the top of ssm-proxy.sh that you may want to edit:

AWS_PROFILE – If you use named profiles, you specify the appropriate profile here. If you do not use named profiles, specify a value of default here.
AWS_REGION – The Region where you launched your AWS Cloud9 environment into.
MAX_ITERATION – The number of times this script will check if the environment has successfully powered on.
SLEEP_DURATION – The number of seconds to wait between MAX_ITERATION checks.

Once this script is set up, you need to edit your SSH configuration. You can use the example provided to get you started.

Edit ~/.ssh/config and paste the contents of the example to the bottom of the file. Edit any values that are unique to your environment, specifically:

IdentityFile – This should be the full path to the private key you created earlier.
HostName – Specify the EC2 instance ID of your AWS Cloud9 environment. You should have copied this value down from the ‘Walkthrough’ section of this post.
ProxyCommand – Make sure to specify the full path of the location where you downloaded the ssm-proxy.sh script from the previous step.

Test the solution

Your AWS Cloud9 environment is now up and you have configured your local SSH configuration to use AWS Systems Manager Session Manager for connectivity. Time to try it all out!

From the Visual Studio Code editor, open the command palette, start typing remote-ssh to bring up the correct command, and select Remote-SSH: Connect Current Window to Host.

Command Palette screenshot

A list of your SSH hosts should appear as defined in ~/.ssh/config. Select the host called cloud9 and wait for the connection to establish.

Now your Visual Studio Code environment is powered by your AWS Cloud9 instance! As some next steps, I suggest:

Open the integrated terminal so you can explore the file system and check what tools are already installed.
Browse the extension gallery and install any extensions you frequently use. Remember that extensions are installed locally – so any extensions you may have already installed on your desktop / laptop are not automatically available in this remote environment.

Cleaning up

To avoid any incurring future charges, you should delete any resources you created. I recommend deleting the CloudFormation stack you launched at the beginning of this walkthrough.

Conclusion

In this post, I covered how you can use the Remote SSH extension within Visual Studio Code to connect to your AWS Cloud9 instance. By making some adjustments to your local OpenSSH configuration, you can use AWS Systems Manager Session Manager to establish an SSH connection to your instance. This is done without the need to open an SSH specific port in the instance’s security group.

Now that you can power your Visual Studio Code IDE with AWS Cloud9, how will you take your IDE to the next level? Feel free to share some of your favorite extensions, keyboard shortcuts, or your vastly improved build times in the comments!

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Overview of Data Transfer Costs for Common Architectures

2021-07-01 Birender Pal

Post Syndicated from Birender Pal original https://aws.amazon.com/blogs/architecture/overview-of-data-transfer-costs-for-common-architectures/

Data transfer charges are often overlooked while architecting a solution in AWS. Considering data transfer charges while making architectural decisions can help save costs. This blog post will help identify potential data transfer charges you may encounter while operating your workload on AWS. Service charges are out of scope for this blog, but should be carefully considered when designing any architecture.

Data transfer between AWS and internet

There is no charge for inbound data transfer across all services in all Regions. Data transfer from AWS to the internet is charged per service, with rates specific to the originating Region. Refer to the pricing pages for each service—for example, the pricing page for Amazon Elastic Compute Cloud (Amazon EC2)—for more details.

Data transfer within AWS

Data transfer within AWS could be from your workload to other AWS services, or it could be between different components of your workload.

Data transfer between your workload and other AWS services

When your workload accesses AWS services, you may incur data transfer charges.

Accessing services within the same AWS Region

If the internet gateway is used to access the public endpoint of the AWS services in the same Region (Figure 1 – Pattern 1), there are no data transfer charges. If a NAT gateway is used to access the same services (Figure 1 – Pattern 2), there is a data processing charge (per gigabyte (GB)) for data that passes through the gateway.

Figure 1. Accessing AWS services in same Region

Accessing services across AWS Regions

If your workload accesses services in different Regions (Figure 2), there is a charge for data transfer across Regions. The charge depends on the source and destination Region (as described on the Amazon EC2 Data Transfer pricing page).

Figure 2. Accessing AWS services in different Region

Data transfer within different components of your workload

Charges may apply if there is data transfer between different components of your workload. These charges vary depending on where the components are deployed.

Workload components in same AWS Region

Data transfer within the same Availability Zone is free. One way to achieve high availability for a workload is to deploy in multiple Availability Zones.

Consider a workload with two application servers running on Amazon EC2 and a database running on Amazon Relational Database Service (Amazon RDS) for MySQL (Figure 3). For high availability, each application server is deployed into a separate Availability Zone. Here, data transfer charges apply for cross-Availability Zone communication between the EC2 instances. Data transfer charges also apply between Amazon EC2 and Amazon RDS. Consult the Amazon RDS for MySQL pricing guide for more information.

Figure 3. Workload components across Availability Zones

To minimize impact of a database instance failure, enable a multi-Availability Zone configuration within Amazon RDS to deploy a standby instance in a different Availability Zone. Replication between the primary and standby instances does not incur additional data transfer charges. However, data transfer charges will apply from any consumers outside the current primary instance Availability Zone. Refer to the Amazon RDS pricing page for more detail.

A common pattern is to deploy workloads across multiple VPCs in your AWS network. Two approaches to enabling VPC-to-VPC communication are VPC peering connections and AWS Transit Gateway. Data transfer over a VPC peering connection that stays within an Availability Zone is free. Data transfer over a VPC peering connection that crosses Availability Zones will incur a data transfer charge for ingress/egress traffic (Figure 4).

Figure 4. VPC peering connection

Transit Gateway can interconnect hundreds or thousands of VPCs (Figure 5). Cost elements for Transit Gateway include an hourly charge for each attached VPC, AWS Direct Connect, or AWS Site-to-Site VPN. Data processing charges apply for each GB sent from a VPC, Direct Connect, or VPN to Transit Gateway.

Figure 5. VPC peering using Transit Gateway in same Region

Workload components in different AWS Regions

If workload components communicate across multiple Regions using VPC peering connections or Transit Gateway, additional data transfer charges apply. If the VPCs are peered across Regions, standard inter-Region data transfer charges will apply (Figure 6).

Figure 6. VPC peering across Regions

For peered Transit Gateways, you will incur data transfer charges on only one side of the peer. Data transfer charges do not apply for data sent from a peering attachment to a Transit Gateway. The data transfer for this cross-Region peering connection is in addition to the data transfer charges for the other attachments (Figure 7).

Figure 7. Transit Gateway peering across Regions

Data transfer between AWS and on-premises data centers

Data transfer will occur when your workload needs to access resources in your on-premises data center. There are two common options to help achieve this connectivity: Site-to-Site VPN and Direct Connect.

Data transfer over AWS Site-to-Site VPN

One option to connect workloads to an on-premises network is to use one or more Site-to-Site VPN connections (Figure 8 – Pattern 1). These charges include an hourly charge for the connection and a charge for data transferred from AWS. Refer to Site-to-Site VPN pricing for more details. Another option to connect multiple VPCs to an on-premises network is to use a Site-to-Site VPN connection to a Transit Gateway (Figure 8 – Pattern 2). The Site-to-Site VPN will be considered another attachment on the Transit Gateway. Standard Transit Gateway pricing applies.

Figure 8. Site-to-Site VPN patterns

Data transfer over AWS Direct Connect

Direct Connect can be used to connect workloads in AWS to on-premises networks. Direct Connect incurs a fee for each hour the connection port is used and data transfer charges for data flowing out of AWS. Data transfer into AWS is $0.00 per GB in all locations. The data transfer charges depend on the source Region and the Direct Connect provider location. Direct Connect can also connect to the Transit Gateway if multiple VPCs need to be connected (Figure 9). Direct Connect is considered another attachment on the Transit Gateway and standard Transit Gateway pricing applies. Refer to the Direct Connect pricing page for more details.

Figure 9. Direct Connect patterns

A Direct Connect gateway can be used to share a Direct Connect across multiple Regions. When using a Direct Connect gateway, there will be outbound data charges based on the source Region and Direct Connect location (Figure 10).

Figure 10. Direct Connect gateway

General tips

Data transfer charges apply based on the source, destination, and amount of traffic. Here are some general tips for when you start planning your architecture:

Avoid routing traffic over the internet when connecting to AWS services from within AWS by using VPC endpoints:
- VPC gateway endpoints allow communication to Amazon S3 and Amazon DynamoDB without incurring data transfer charges.
- VPC interface endpoints are available for some AWS services. This type of endpoint incurs hourly service charges and data transfer charges.
Use Direct Connect instead of the internet for sending data to on-premises networks.
Traffic that crosses an Availability Zone boundary typically incurs a data transfer charge. Use resources from the local Availability Zone whenever possible.
Traffic that crosses a Regional boundary will typically incur a data transfer charge. Avoid cross-Region data transfer unless your business case requires it.
Use the AWS Free Tier. Under certain circumstances, you may be able to test your workload free of charge.
Use the AWS Pricing Calculator to help estimate the data transfer costs for your solution.
Use a dashboard to better visualize data transfer charges – this workshop will show how.

Conclusion

AWS provides the ability to deploy across multiple Availability Zones and Regions. With a few clicks, you can create a distributed workload. As you increase your footprint across AWS, it helps to understand various data transfer charges that may apply. This blog post provided information to help you make an informed decision and explore different architectural patterns to save on data transfer costs.

Top 5: Featured Architecture Content for June

2021-06-30 Elyse Lopez

Post Syndicated from Elyse Lopez original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-for-june/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our top picks from the new and newly updated content we released this month.

1. Taco Bell: Aurora as The Heart of the Menu Middleware and Data Integration Platform for Taco Bell (YouTube)

This episode of the This is My Architecture video series explores how Taco Bell built a serverless data integration platform to create unique menus for more than 7,000 locations. Find out how this menu middleware uses Amazon Aurora combined with several other services, including AWS Amplify, AWS Lambda, Amazon API Gateway, and AWS Step Functions to create a cost-effective, scalable data pipeline.

2. Well-Architected IoT Lens Checklist

How do you effectively implement AWS IoT workloads? This IoT Lens Checklist provides insights that we have gathered from real-world case studies. This will help you quickly learn the key design elements of Well-Architected IoT workloads. The checklist also provides recommendations for improvement.

3. Derive Insights from AWS Lake House

This whitepaper provides insights and design patterns for cloud architects, data scientists, and developers. It shows you how a lake house architecture allows you to query data across your data warehouse, data lake, and operational databases. Learn how you can store data in a data lake and use a ring of purpose-built data services to quickly make decisions.

4. AWS Limit Monitor Solution

This AWS Solutions Implementation helps you automatically track resource use and avoid overspending. Managed from a centralized location, this ready-to-deploy solution provides a cost-effective way to stay within service quotas by receiving notifications — even via Slack! — before you reach the limit.

5. Cloud Automation for 5G Networks

Digital services providers (DSPs) around the world are focusing on 5G development as part of upgrading their digital infrastructure. This whitepaper explains how DSPs can use different AWS tools and services to fully automate their 5G network deployment and testing and allow orchestration, closed loop use cases, edge analytics, and more.

How Banks Can Use AWS to Meet Compliance

2021-06-29 Jiwan Panjiker

Post Syndicated from Jiwan Panjiker original https://aws.amazon.com/blogs/architecture/how-banks-can-use-aws-to-meet-compliance/

Since the 2008 financial crisis, banking supervisory institutions such as the Basel Committee on Banking Supervision (BCBS) have strengthened regulations. There is now increased oversight over the financial services industry. For banks, making the necessary changes to comply with these rules is a challenging, multi-year effort.

Basel IV, a massive update to existing rules, is due for implementation in January 2023. Basel IV standardizes the approach to calculating credit risk, increases the impact of risk-weighted assets (RWAs) and emphasizes data transparency.

Given the complexity of data, modeling, and numerous assumptions that have to be made, compliance under Basel IV implementation will be challenging. Standardization omits nuances unique to your business, which can drive up costs, but violating guidelines will result in steep penalties.

This post will address these challenges by outlining a mechanism that facilitates a healthy, data-driven dialogue between banks and regulators to better achieve compliance objectives. The reference architecture will focus on enabling fast, iterative releases with the help of serverless AWS services.

There are four key actions to take in order to support this mechanism:

Automate data management
Establish a continuous integration/continuous delivery (CI/CD) pipeline
Enable fast, point-in-time audit replays
Set up proactive monitoring and notifications

Automate data management

Due to frequent merger activity, banks are typically comprised of a web of integrated systems and siloed business units, making it difficult to consolidate data. Under Basel IV guidelines, auditors want banks to provide detailed data in a presentable way.

You can tackle this first challenge by establishing a data pipeline as shown in Figure 1. Take inventory of each data source as it is incorporated into the pipeline. Identify the critical internal and external data sources that will be used to populate the initial landing area. Amazon Simple Storage Service (S3) is a great choice for this.

Figure 1. Data pipeline that cleans, processes, and segments data

Amazon S3 is a highly available, durable service that is a popular data lake solution. S3 offers WORM storage capabilities like S3 Glacier Vault and S3 Object Lock to protect the integrity of your archived data in accordance with U.S. SEC and FINRA rules.

Basel IV regulations also require banks to use many attributes to develop accurate credit risk models. The attributes can be a mix of datasets such as financial statements, internal balanced scorecards, macro-economic data, and credit ratings. The risk models themselves can also be segmented by portfolio types, industry segments, asset types and much more.

You can split data into different domains and designate data owners with separate S3 buckets. Credit risk model developers, analyst, and data scientists can then use the structure of the S3 buckets to pull together relevant datasets. They can then store the outputs into S3 buckets.

To support fast, automated data retrieval, store object metadata in a highly scalable, and queryable database. You can set up Amazon S3 so that an event can initiate a function to populate Amazon DynamoDB. Developers can use AWS Lambda to write these functions using popular languages like Python.

With AWS Glue, you can automate Extract/Load/Transform (ETL) processes to clean and move data to the different S3 buckets. AWS Glue can also support data operations by automatically cataloging your various data sources.

Taking on a structured approach will simplify data governance and transparency as the business continues to grow and operate.

Establish a CI/CD pipeline

Adopt tools that machine learning teams can use to build a streamlined CI/CD solution as demonstrated in Figure 2.

Figure 2. An end-to-end machine learning development and deployment pipeline

Using tightly integrated AWS services, your teams can minimize time spent managing tools and deployment processes, and instead, focus on tuning the models and analyzing the results.

Amazon SageMaker brings together a powerful set of machine learning capabilities on the AWS Cloud. It helps data scientists and engineers build insightful models. Figure 2 depicts the high-level architecture and shows how Amazon SageMaker Pipelines helps teams orchestrate the automation and deployment processes.

The core of the pipeline uses a set of AWS deployment services so that your teams can collaborate and review effectively. With AWS CodeCommit, your teams can set up git-based repository to store and version models for data processing, training, and evaluation. The repository can also store code and configuration files using AWS CloudFormation for deployment. You can use AWS CodePipeline and AWS CodeBuild to create and update a model endpoint based on the approved/reviewed changes.

Any updates detected in the AWS CodeCommit repository initiate a deployment whenever a new model version is added to the Model Registry. Amazon S3 can be used to store generated model artifacts, historical data, and models.

Enable fast, point-in-time audit replays

Figure 3. Containers offer a lightweight, powerful solution to run audits using historical assets

One of the main themes of Basel IV is transparency. Figure 3 illustrates a solution to build trust with regulators by allowing them to verify and understand modeling activity.

A lightweight application is hosted in AWS Fargate and enables auditors to re-run Basel credit risk models under specified conditions. With AWS Fargate, you don’t need to manually manage instances or container orchestration. Configure the CPU or memory specifications at the task level and set guidelines around scalability for your service. Your tasks then scale up and down automatically, based on demand, and will optimize cost efficiency and availability.

Figure 3 shows the following:

The application takes inputs such as date, release version, and model type.
It then queries DynamoDB with this information.
The query will return the data necessary to retrieve model artifacts from previous CI/CD deployments and relevant datasets from historical S3 buckets.
Using this information, it can spin up as many containers as needed to run the model.
It then stores the outputs in a separate S3 bucket.
Auditors will have a detailed trace of all the attributes, assumptions, and data that went into the modeling effort. To streamline this process, the app can also compare the outputs of the historical runs to the recent replay and highlight any significant deviations.

Though internal models will be de-emphasized under Basel IV, banks will continue to run internal models as a benchmark against the broader standards. Schedule AWS Fargate tasks to run these models regularly to capitalize on highly performant compute services while minimizing costs.

Set up proactive monitoring and notifications

Figure 4. Scheduled jobs can send out notifications using Amazon SNS when certain thresholds are breached

The last principle is based around establishing an early warning system, enabling banks to take on a more proactive role in maintaining compliance.

With automated monitoring and notifications, banks will be able to respond quickly to potential concerns. For instance, there can be a daily scheduled job that launches containers and runs the models against the latest data. If any thresholds are breached, alerts can be sent out via SMS or email. Operational teams can be subscribed to certain message topics using Amazon Simple Notification Service (SNS). They can then respond before actual compliance issues emerge.

Conclusion

With a Well-Architected approach, AWS helps you control your data, deploy new features, and embrace a serverless approach. This frees you to innovate quickly and address regulatory challenges.

You can iterate with new AWS services and bring machine learning to bear on various streams of data to identify high impact pools of value. You can get a clearer picture of the data to make it easier to identify areas where you can reduce RWAs. Using Amazon S3, you can turn on AWS analytics services such as Amazon QuickSight and Amazon Athena to visualize the data. You’ll be able to fulfill reporting requirements such as those found in regulatory studies like CCAR, DFAST, CECL, and IFRS9.

For more information about establishing a data pipeline, read Lake House Formation Architecture. It is a powerful pattern that combines a few concepts that will help bring your data together cohesively. To set up a robust CI/CD pipeline, explore the AWS Serverless CI/CD Reference Architecture.