Tag Archives: Customer Solutions

Adding approval notifications to EC2 Image Builder before sharing AMIs

2022-06-15 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/adding-approval-notifications-to-ec2-image-builder-before-sharing-amis/

This blog post was written by, Glenn Chia Jin Wee, Associate Cloud Architect at AWS and Randall Han, Associate Professional Services Consultant at AWS.

In some situations, you may be required to manually validate the Amazon Machine Image (AMI) built from an Amazon Elastic Compute Cloud (Amazon EC2) Image Builder pipeline before sharing this AMI to other AWS accounts or to an AWS Organization. Currently, Image Builder provides an end-to-end pipeline that automatically shares AMIs after they’ve been built.

In this post, we will walk through the steps to enable approval notifications before AMIs are shared with other AWS accounts. Having a manual approval step could be useful if you would like to verify the AMI configurations before it is shared to other AWS accounts or an AWS Organization. This reduces the possibility of incorrectly configured AMIs being shared to other teams which in turn could lead to downstream issues if applications are installed using this AMI. This solution uses serverless resources to send an email with a link that automatically shares the AMI with the specified AWS accounts. Users select this link after they’ve verified that the AMI is built according to specifications.

Overview

In this solution, an Image Builder Pipeline is run that builds a Golden AMI in Account A. After the AMI is built, Image Builder publishes data about the AMI to an Amazon Simple Notification Service (Amazon SNS) topic.
This SNS Topic passes the data to an AWS Lambda function that subscribes to it.
The Lambda function that subscribes to this topic retrieves the data, formats it, and sends a customized email to another SNS Topic.
The second SNS Topic has an email subscription with the Approver’s email. The approver will receive the customized email with a URL that interacts with the next set of Serverless resources.
Selecting the URL makes a GET request to Amazon API Gateway, thereby passing the AMI ID in the query string.
API Gateway then triggers another Lambda function and passes the AMI ID to it.
The Lambda function obtains the AMI ID from the query string parameter of the API Gateway request, and then shares it with the provided target account.

Prerequisites

For this walkthrough, you will need the following:

Two AWS accounts – one to host solution resources, and the second with which to share the built AMI.
AWS Serverless Application Model (AWS SAM) installed, and AWS credentials configured for Account A. Refer to AWS docs: Installing the AWS SAM CLI for the installation and configuration steps.
A new Amazon Virtual Private Cloud (Amazon VPC) will be created from the stack. Make sure that you have fewer than five VPCs in the selected Region.

Walkthrough

In this section, we will guide you through the steps required to deploy the Image Builder solution that utilizes Serverless resources. The solution is deployed with AWS SAM.

In this scenario, we deploy the solution within the approver’s account. The approval email will be sent to a predefined email address for manual approval, before the newly created AMI is shared to target accounts.

Once the approver selects the approval link, an email notification will be sent to the predefined target account email address, notifying that the AMI has been successfully shared.

The high-level steps we will follow are:

In Account A, deploy the provided AWS SAM template. This includes an example Image Builder Pipeline, Amazon SNS topics, API Gateway, and Lambda functions.
Approve the SNS subscription from your supplied email address.
Run the pipeline from the Amazon EC2 Image Builder Console.
[Optional] After the pipeline runs, launch an Amazon EC2 instance from the built AMI to conduct manual tests
An Amazon SNS email will be sent to you with an API Gateway URL. When clicked, an AWS Lambda function shares the AMI to the Account B.
Log in to Account B and verify that the AMI has been shared.

Step 1: Launch the AWS SAM template

Clone the SAM templates from this GitHub repository.
Run the following command to deploy the templates via SAM. Replace <approver email> with the Approver’s email and <AWS Account B ID> with the AWS Account ID of your second AWS Account.

sam deploy \

–template-file template.yaml \

–stack-name ec2-image-builder-approver-notifications \

–capabilities CAPABILITY_IAM \

–resolve-s3 \

–parameter-overrides \

ApproverEmail=<approver email> \

TargetAccountEmail=<target account email> \

TargetAccountlds=<AWS Account B ID>

Step 2: Verify your email address

After running the deployment, you will receive an email prompting you to confirm the Subscription at the approver email address. Choose Confirm subscription.

This leads to the following screen, which shows that your subscription is confirmed.

Repeat the previous 2 steps for the target email address.

Step 3: Run the pipeline from the Image Builder console

In the Image Builder console, under Image pipelines, select the checkbox next to the Pipeline created, choose Actions, and select Run pipeline.

Note that the pipeline takes approximately 20 to 30 minutes to complete.

Step 4: [Optional] Launch an Amazon EC2 instance from the built AMI

There could be a requirement to manually validate the AMI before sharing it to other AWS accounts or to the AWS organization. With this requirement, approvers will launch an Amazon EC2 instance from the built AMI and conduct manual tests on the EC2 instance to make sure that it is functional.

In the Amazon EC2 console, under Images, choose AMIs. Validate that the AMI is created.

Follow AWS docs: Launching an EC2 instances from a custom AMI for steps on how to launch an Amazon EC2 instance from the AMI.

Step 5: Select the approval URL in the email sent

When the pipeline is run successfully, you will receive another email with a URL to share the AMI.

2. Selecting this URL results in the following screen which shows that the AMI share is successful.

Step 6: Verify that the AMI is shared to Account B

Log in to Account B.
In the Amazon EC2 console, under Images, choose AMIs. Then, in the dropdown, choose Private images. Validate that the AMI is shared.

3. Verify that a success email notification was sent to the target account email address provided.

Clean up

This section provides the necessary information for deleting various resources created as part of this post.

1. Deregister the AMIs created and shared.

a. Log in to Account A and follow the steps at AWS documentation: Deregister your Linux AMI.

2. Delete the SAM stack with the following command. Replace <region> with the Region of choice.

sam delete –stack-name ec2-image-builder-approver-notifications –no-prompts –region <region>

3. Delete the CloudWatch log groups for the Lambda functions. You’ll identify it with the name `/aws/lambda/ec2-image-builder-approve*`.

4. Consider deleting the Amazon S3 bucket used to store the packaged Lambda artifact.

Conclusion

In this post, we explained how to use Serverless resources to enable approval notifications for an Image Builder pipeline before AMIs are shared to other accounts. This solution can be extended to share to more than one AWS account or even to an AWS organization. With this solution, you will be notified when new golden images are created, allowing you to verify the correctness of their configuration before sharing them to for wider use. This reduces the possibility of sharing AMIs with misconfigurations that the written tests may not have identified.

We invite you to experiment with different AMIs created using Image Builder, and with different Image Builder components. Check out this GitHub repository for various examples that use Image Builder. Also check out this blog on Image builder integrations with EC2 Auto Scaling Instance Refresh. Let us know your questions and findings in the comments, and have fun!

How GE Proficy Manufacturing Data Cloud replatformed to improve TCO, data SLA, and performance

2022-06-15 Jyothin Madari

Post Syndicated from Jyothin Madari original https://aws.amazon.com/blogs/big-data/how-ge-proficy-manufacturing-data-cloud-replatformed-to-improve-tco-data-sla-and-performance/

This is post is co-authored by Jyothin Madari, Madhusudhan Muppagowni and Ayush Srivastava from GE.

GE Proficy Manufacturing Data Cloud (MDC), part of the GE Digital’s Manufacturing Execution Systems (MES) suite of solutions, allows GED’s customers to increase the derived value easily and quickly from the MES by reliably bringing enterprise-wide manufacturing data into the cloud and transforming it into a structured dataset for advanced analytics and deeper insights into the manufacturing processes.

In this post, we share how MDC modernized the hybrid cloud strategy by replatforming. This solution improved scalability, their data availability Service Level Agreement (SLA), and performance.

Challenge

MDC v1 was built on Predix services using industrial use case-optimized Predix services such as Predix Columnar Store (Cassandra) and Predix Insights (Amazon EMR). MDC evolved in both features and the underlying platform over the past year with a goal to improve TCO, data SLA, and performance. MDC’s customer base grew and the number of sites from customers grew to over 100 in the past couple of years. The increased number of sites needed more compute and storage capacity. This increased infrastructure and operational cost significantly, while introducing increased data latency and lowering the data freshness interval from the cloud.

How we started

MDC evaluated several vendors for their storage and compute capabilities using various measurements: security, performance, scalability, ease of management and operation, reduction of overall cost and increase in ROI, partnership, and migration help (technology assistance). The MDC team saw opportunities to improve the product by using native AWS services such as Amazon Redshift, AWS Glue, and Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which made the product more performant and scalable while reducing operation costs and making it future-ready for advanced analytics and new customer use cases.

The GE Digital team, comprised of domain experts, developers, and QA, worked shoulder to shoulder with the AWS ProServe team, comprised of Solution Architects, Data Architects, and Big Data Experts, in determining the key architectural changes required and solutions to implementation challenges.

Overview of solution

The following diagram illustrates the high-level architecture of the solution.

This is a broad overview, and the specifics of networking and security between components are out of scope for this post.

The solution includes the following main steps and components:

CDC and log collector – Compressed CSV data is collected from over 100 Manufacturing Data Sources Proficy Plant Applications and sinked into an Amazon Simple Storage Service (Amazon S3) bucket.
S3 raw bucket – Our data lands in Amazon S3 without any transformation, but appropriately partitioned (tenant, site, date, and so on) for the ease of future processing.
AWS Lambda – When the file lands in the S3 raw bucket, it triggers an S3 event notification, which invokes AWS Lambda. Lambda extracts metadata (bucket name, key name, date, and so on) from the event and saves it in Amazon DynamoDB.
AWS Glue – Our goal is now to take CSV files, with varying schemas, and convert them into Apache Parquet format. An AWS Glue extract, transform, and load (ETL) job reads a list of files to be processed from the DynamoDB table and fetches them from the S3 raw bucket. We have preconfigured unified AVRO schemas in the AWS Glue Schema Registry for schema conversion. Converted data lands in the S3 raw Parquet bucket.
S3 raw Parquet bucket – Data in this bucket is still raw and unmodified; only the format was changed. This intermediary storage is required due to schema and column order mismatch in CSV files.
Amazon Redshift – The majority of transformations and data enrichment happens in this step. Amazon Redshift Spectrum consumes data from the S3 raw Parquet bucket and external PostgreSQL dimension tables (through a federated query). Transformations are performed via stored procedures, where we encapsulate logic for data transformation, data validation, and business-specific logic. The Amazon Redshift cluster is configured with concurrency scaling, auto workload management (WLM) with caching, and the latest RA3 instance types.
MDC API – These custom-built, web-based, REST API microservices talk on the backend with Amazon Redshift and expose data to external users, business intelligence (BI) tools, and partners.
Amazon Redshift data export and archival – On a scheduled basis, Amazon Redshift exports (UNLOAD command) contextualized and business-defined aggregated data. Exports are landed in the S3 bucket as Apache Parquet files.
S3 Parquet export bucket – This bucket stores the exported data (hundreds of TBs) used by external users who need to run extensive, heavy analytics and AI or machine learning (ML) with various tools (such as Amazon EMR, Amazon Athena, Apache Spark, and Dremio).
End-users – External users consume data from the API. The main use case here is reporting and visual analytics.
Amazon MWAA – The orchestrator of the solution, Amazon MWAA is used for scheduling Amazon Redshift stored procedures, AWS Glue ETL jobs, and Amazon Redshift exports at regular intervals with error handling and retries built in.

Bringing it all together

MDC replaced both Predix Columnar Store (Cassandra) and Predix Insights (Amazon EMR) with Amazon Redshift for both storage of the MDC data models and compute (ELT). Amazon MWAA is used to schedule the workloads that do the bulk of the ELT. Lambda, AWS Glue, and DynamoDB are used to normalize the schema differences between sites. It was important not to disrupt MDC customers while replatforming. To achieve this, MDC used a phased approach to migrate the data models to Amazon Redshift. They used federated queries to query existing PostgreSQL for dimensional data, which facilitated having some of the data models in Amazon Redshift, while the others were in Cassandra with no interruption to MDC customers. Redshift Spectrum facilitated querying the raw data in Amazon S3 directly both for ETL and data validation.

75% of the MDC team along with the AWS ProServe team and AWS Solution Architects collaborated with the GE Digital Security Team and Platform Team to implement the architecture with AWS native services. It took approximately 9 months to implement, secure, and performance tune the architecture and migrate data models in three phases. Each phase has gone through a GE Digital internal security review. Amazon Redshift Auto WLM, short query acceleration, and tuning the sort keys to optimize querying patterns improved the Proficy MDC API performance. Because the unload of the data from Amazon Redshift was fast, Proficy MDC is now able to export the data much more frequently to our end customers.

Conclusion

With replatforming, Proficy MDC was able to improve ETL performance by approximately 75%. Data latency and freshness improved by approximately 87%. The solution reduced TCO of the platform by approximately 50%. Proficy MDC was also able reduce the infrastructure and operational cost. Improved performance and reduced latency has allowed us to speed up the next steps in our journey to modernize the enterprise data architecture and hybrid cloud data platform.

About the Authors

Jyothin Madari leads the Manufacturing Data Cloud (MDC) engineering team; part of the manufacturing suite of products at GE Digital. He has 18 years of experience, 4 of which is with GE Digital. Most recently he has been working on data migration projects with an aim to reduce costs and improve performance. He is an AWS Certified Cloud Practitioner, a keen learner and loves solving interesting problems. Connect with him on LinkedIn.

Madhusudhan (Madhu) Muppagowni is a Technical Architect and Principal Software Developer based in Silicon Valley, Bay Area, California. He is passionate about Software Development and Architecture. He thrives on producing Well-Architected and Secure SaaS Products, Data Pipelines that can make a real impact. He loves outdoors and an avid hiker and backpacker. Connect with him on LinkedIn.

Ayush Srivastava is a Senior Staff Engineer and Technical Anchor based in Hyderabad, India. He is passionate about Software Development and Architecture. He has Demonstrated track record of successfully technical anchoring small to large Secure SaaS Products, Data Pipelines from start to finish. He loves exploring different places and he says “I’m in love with cities I have never been to and people I have never met.” Connect with him on LinkedIn.

Karen Grygoryan is Data Architect with AWS ProServe. Connect with him on LinkedIn.

Gnanasekaran Kailasam is a Data Architect at AWS. He has worked with building data warehouses and big data solutions for over 16 years. He loves to learn new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS. Connect with him on LinkedIn.

Capturing GPU Telemetry on the Amazon EC2 Accelerated Computing Instances

2022-06-15 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/capturing-gpu-telemetry-on-the-amazon-ec2-accelerated-computing-instances/

This post is written by Amr Ragab, Principal Solutions Architect EC2.

AWS is excited to announce the native integration of monitoring GPU metrics through the CloudWatch Agent. Customers can now easily monitor GPU utilization and its memory to scale their workloads more effectively without custom scripts. In this post, we’ll describe how to allow GPU monitoring and integrate it into your Amazon Machine Images (AMI). Furthermore, we’ll extend this to include the monitoring of GPU hardware events utilizing CloudWatch Log Streams. By combining this telemetry into the Amazon CloudWatch Console, customers can have a complete picture of GPU activity across their fleets.

Capturing GPU metrics

There is an extensive list of NVIDIA accelerator metrics that can be captured. Depending on the workload type, it may be unnecessary to capture all of the metrics at all times. The following table breaks down the suggested metrics to collect by workload type. This considers a balance of cost and impactful metrics at scale.

Compute (Machine Learning (ML), High Performance Computing (HPC))	Graphics/Gaming
utilization_gpu power_draw utilization_memory memory_total memory_used memory_free pcie_link_gen_current pcie_link_width_current clocks_current_smclocks_current_memory	utilization_gpu utilization_memory memory_total memory_usedmemory_free pcie_link_gen_current pcie_link_width_current encoder_stats_session_count encoder_stats_average_fps encoder_stats_average_latency clocks_current_graphics clocks_current_memory clocks_current_video

Moreover, this is supported through custom AMIs that are deployed with managed service offerings, including Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Services (Amazon ECS), and AWS ParallelCluster w/ SLURM for HPC clusters.

The following is an example screenshot from the CloudWatch Console showcasing the telemetry captured for a P4d instance. You can see that we captured the preceding metrics on a per-GPU index. Each Amazon Elastic Compute Cloud (Amazon EC2) P4d instance has 8x A100 GPUs.

Capturing GPU Xid events

Xid events are a reporting mechanism from GPU hardware vendors that emit notable events from the device to the OS in this case we are capturing the events through the NVRM kernel module. Current GPU architecture requires that the full GPU with protections are passed into the running instance. Thus, most errors that manifest inside of the customer instance aren’t directly visible to the Amazon EC2 virtualization stack. Although some of these errors are benign, others indicate problems with the customer application, the NVIDIA driver, and under rare circumstances a defect in the GPU hardware.

For NVIDIA-based Amazon EC2 instances, these errors will be logged in the system journal with an “NVRM:” regular expression.

These events can be collected and pushed to Amazon CloudWatch Logs as a stream. When an Xid event occurs on the GPU, it will parse the event and push it the log stream for that instance ID in the Region in which the instance is running. The following steps are required to get started capturing those events.

Deployment

We’ll cover the deployment in two different use-cases: 1. You have an existing instance running and you want to start to capture metrics and XID events. 2. You want to build and an AMI and use it within Amazon EC2 or additional services.

I. On a running Amazon EC2 instance

Step 1. Attach an IAM Role to the EC2 instance that has permission to CloudWatch Metrics/Logs. The following is an IAM policy that you can attach to your IAM Role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "1",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricStream",
                "logs:CreateLogDelivery",
                "logs:CreateLogStream",
                "cloudwatch:PutMetricData",
                "logs:UpdateLogDelivery",
                "logs:CreateLogGroup",
                "logs:PutLogEvents",
                "cloudwatch:ListMetrics"
            ],
            "Resource": "*"
        }
    ]
}

Step 2. Connect to a shell on the EC2 instance (through SSM or SSH). Install the CloudWatch Agent following the instructions here. There is support across architectures and distributions.

Step 3. Next, we can create our CloudWatch Agent JSON configuration file. The following JSON snippet will capture the logs from gpuerrors.log and push to CloudWatch Logs. Save the contents of the following JSON snippet to a file on the instance at /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json.

{
     "agent": {
         "run_as_user": "root"
     },
     "metrics": {
         "append_dimensions": {
             "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
             "ImageId": "${aws:ImageId}",
             "InstanceId": "${aws:InstanceId}",
             "InstanceType": "${aws:InstanceType}"
         },
         "aggregation_dimensions": [["InstanceId"]],
         "metrics_collected": {
            "nvidia_gpu": {
                "measurement": [
                    "utilization_gpu",
                    "utilization_memory",
                    "memory_total",
                    "memory_used",
                    "memory_free",
                    "clocks_current_graphics",
                    "clocks_current_sm",
                    "clocks_current_memory"
                ]
            }
         }
     },
     "logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/gpuevent.log",
                         "log_group_name": "/ec2/accelerated/accel-event-log",
                         "log_stream_name": "{instance_id}"
                     }
                 ]
             }
         }
     }
 }

Step 4. To start capturing the logs, restart the aws cloudwatch systemd service.

sudo systemctl restart amazon-cloudwatch-agent.service

At this point, if you navigate to the CloudWatch Console in the Region that the instance is running, – All metrics – CWAgent, you should see a table of metrics similar to the following screenshot.

Step 5. To capture the XID events it’s possible to use the same CloudWatch Log directive used in the preceding image were set the GPU metrics to capture. The JSON following snippet defines that we will stream the log in /var/log/gpuevent.log to CloudWatch.

"logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/gpuevent.log",
                         "log_group_name": "/ec2/accelerated/accel-event-log",
                         "log_stream_name": "{instance_id}"
                     }
                 ]
             }
         }
     }

The GitHub project is an open source reference design for capturing these errors in the CloudWatch agent.

https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline

Step 6. Save the following file as /opt/aws/aws-hwaccel-event-parser.py|.go with the following contents, which will write the Xid errors parsed to /var/log/gpuevent.log:

The code is available in either Python3 or Go (> 1.16).

Golang code of the hwaccel-event-parser: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-ami_base/cloudwatch/nvidia/aws-hwaccel-event-parser.go

Python3 code: https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-ami_base/cloudwatch/nvidia/aws-hwaccel-event-parser.py

As you can see from the code, this is a blocking thread, and it will be running during the lifetime of the instance or container.

Step 7. For ease of deployment, you can also create a systemd service (aws-hw-monitor.service), which will run at startup before the CloudWatch agent.

[Unit]
Description=HW Error Monitor
Before=amazon-cloudwatch-agent.service
After=syslog.target network-online.target

[Service]
Type=simple
ExecStart=/opt/aws/cloudwatch/aws-cloudwatch-wrapper.sh
RemainAfterExit=1
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

Where /opt/aws/cloudwatch/aws-cloudwatch-wrapper.sh is a script which contains:

#!/bin/bash
python3 /opt/aws/aws-hwaccel-event-parser.py &

Finally, enable and start the hw monitor service

sudo systemctl enable aws-hw-monitor.service –now

II. Building an AMI

For convenience, the following repo has what is needed to build the AMI for Amazon EC2, Amazon EKS, Amazon ECS, Amazon Linux 2, and Ubuntu 18.04/20.04 distributions. You must have Packer installed on your machine, and it must be authenticated to make API calls on your behalf to AWS. Generally you need to modify the variables:{} json and execute the packer build.

"variables": {
    "region": "us-east-1",
    "flag": "<flag>",
    "subnet_id": "<subnetid>",
    "security_groupids": "<security_group_id,security_group_id",
    "build_ami": "<buildami>",
    "efa_pkg": "aws-efa-installer-latest.tar.gz",
    "intel_mkl_version": "intel-mkl-2020.0-088",
    "nvidia_version": "510.47.03",
    "cuda_version": "cuda-toolkit-11-6 nvidia-gds-11-6",
    "cudnn_version": "libcudnn8",
    "nccl_version": "v2.12.7-1"
  },

After filling in the variables, check that the packer script is validated.

packer validate nvidia-efa-ml-al2.yml
packer build nvidia-efa-ml-al2.yml

The log group namespace is /ec2/accelerated/accel-event-log. However, you may change this to the namespace of your preference in the CloudWatch Agent config file created earlier.

Navigate to the CloudWatch Console – Logs – Log groups – /ec2/accelerated/accel-event-log. It’s sorted by instance ID, where the instance ID of the latest stream is on top.

We can see in the preceding screenshot that an example workload ran on instance i-03a7b66de3198977e, which was a p4d.24xlarge triggered a Xid 63 event. Capturing these events is the first step. Next, we must interpret what these events mean. With each Xid error, there is a number associated with each event. As previously mentioned, these can be hardware errors, driver, and/or application errors. If you’re running on an Amazon EC2 accelerated instance, and after code execution run into one of these errors, contact AWS Support with the instance ID and Xid error. The following is a list of the more common Xid errors that you may encounter.

Xid Error	Name	Description	Action
48	Double Bit ECC error	Hardware memory error	Contact AWS Support with Xid error and instance ID
74	GPU NVLink error	Further SXid errors should also be populated which will inform on the error seen with the NVLink fabric	Get information on which links are causing the issue by running nvidia-smi nvlink -e
63	GPU Row Remapping Event	Specific to Ampere architecture –- a row bank is pending a memory remap	Stop all CUDA processes, and reset the GPU (nvidia-smi -r), and make sure thatensure the remap is cleared in nvidia-smi -q
13	Graphics Engine Exception	User application fault , illegal instruction or register	Rerun the application with CUDA_LAUNCH_BLOCKING=1 enabled which should determine if it’s a NVIDIA driver or hardware issue
31	GPU memory page fault	Illegal memory address access error	Rerun the application with CUDA_LAUNCH_BLOCKING=1 enabled which should determine if it’s a NVIDIA driver or hardware issue

A quick way to check for row remapping failures is to run the below command on the instance.

nvidia-smi --query-remapped-
rows=gpu_name,gpu_bus_id,remapped_rows.failure,remapped_rows.pending,remapped_rows.correctable,remapped_rows.uncorrectable --format=csv
gpu_name, gpu_bus_id, remapped_rows.failure, remapped_rows.pending, remapped_rows.correctable, remapped_rows.uncorrectable
NVIDIA A100-SXM4-40GB, 00000000:10:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:10:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:20:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:20:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:90:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:90:1D.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:A0:1C.0, 0, 0, 0, 0
NVIDIA A100-SXM4-40GB, 00000000:A0:1D.0, 0, 0, 0, 0

This isn’t an exhaustive list of Xid events, but it provides some of the more common ones that you may come across as you develop your accelerated workload. You can find a more complete table of events here. Furthermore, if you have questions, you can reach out to AWS Support with the output of the tar ball created by executing the nvidia-bug-report.sh script included with the NVIDIA driver.

Conclusion

Get started with integrating this monitoring into your AMIs if you use custom AMIs specifically for key services, such as Amazon EKS, Amazon ECS, or Amazon EC2 with AWS ParallelCluster. This will help you discover utilization metrics for your accelerated computing workloads. If you have any questions about this post, then reach out to your account team.

Supercharging Dream11’s Data Highway with Amazon Redshift RA3 clusters

2022-06-01 Dhanraj Gaikwad

Post Syndicated from Dhanraj Gaikwad original https://aws.amazon.com/blogs/big-data/supercharging-dream11s-data-highway-with-amazon-redshift-ra3-clusters/

This is a guest post by Dhanraj Gaikwad, Principal Engineer on Dream11 Data Engineering team.

Dream11 is the world’s largest fantasy sports platform, with over 120 million users playing fantasy cricket, football, kabaddi, basketball, hockey, volleyball, handball, rugby, futsal, American football, and baseball. Dream11 is the flagship brand of Dream Sports, India’s leading Sports Technology company, and has partnerships with several national and international sports bodies and cricketers.

In this post, we look at how we supercharged our data highway, the backbone of our major analytics pipeline, by migrating our Amazon Redshift clusters to RA3 nodes. We also look at why we were excited about this migration, the challenges we faced during the migration and how we overcame them, as well as the benefits accrued from the migration.

Background

The Dream11 Data Engineering team runs the analytics pipelines (what we call our Data Highway) across Dream Sports. In near-real time, we analyze various aspects that directly impact the end-user experience, which can have a profound business impact for Dream11.

Initially, we were analyzing upwards of terabytes of data per day with Amazon Redshift clusters that ran mainly on dc2.8xlarge nodes. However, due to a rapid increase in our user participation over the last few years, we observed that our data volumes increased multi-fold. Because we were using dc2.8xlarge clusters, this meant adding more nodes of dc2.8xlarge instance types to the Amazon Redshift clusters. Not only was this increasing our costs, it also meant that we were adding additional compute power when what we really needed was more storage. Because we anticipated significant growth during the Indian Premier League (IPL) 2021, we actively explored various options using our AWS Enterprise Support team. Additionally, we were expecting more data volume over the next few years.

The solution

After discussions with AWS experts and the Amazon Redshift product team, we at Dream11 were recommended the most viable option of migrating our Amazon Redshift clusters from dc2.8xlarge to the newer RA3 nodes. The most obvious reason for this was the decoupled storage from compute. As a result, we could use lesser nodes and move our storage to Amazon Redshift managed storage. This allowed us to respond to data volume growth in the coming years as well as reduce our costs.

To start off, we conducted a few elementary tests using an Amazon Redshift RA3 test cluster. After we were convinced that this wouldn’t require many changes in our Amazon Redshift queries, we decided to carry out a complete head-to-head performance test between the two clusters.

Validating the solution

Because the user traffic on the Dream11 app tends to spike during big ticket tournaments like the IPL, we wanted to ensure that the RA3 clusters could handle the same traffic that we usually experience during our peak. The AWS Enterprise Support team suggested using the Simple Replay tool, an open-sourced tool released by AWS that you can use to record and replay the queries from one Amazon Redshift cluster to another. This tool allows you to capture queries on a source Amazon Redshift cluster, and then replay the same queries on a destination Amazon Redshift cluster (or clusters). We decided to use this tool to capture our performance test queries on the existing dc2.8xlarge clusters and replay them on a test Amazon Redshift cluster composed of RA3 nodes. During this time of our experimentation, the newer version of the automated AWS CloudFormation-based toolset (now on GitHub), was not available.

Challenges faced

The first challenge came up when using the Simple Replay tool because there was no easy way to compare the performance of like-to-like queries on the two types of clusters. Although Amazon Redshift provides various statistics using meta-tables about individual queries and their performance, the Simple Replay tool adds additional comments in each Amazon Redshift query on the target cluster to make it easier to know if these queries were run by the Simple Replay tool. In addition, the Simple Replay tool drops comments from the queries on the source cluster.

Comparing each query performance with the Amazon Redshift performance test suite would mean writing additional scripts for easy performance comparison. An alternative would have been to modify the Simple Replay tool code, because it’s open source on GitHub. However, with the IPL 2022 beginning in just a few days, we had to explore another option urgently.

After further discussions with the AWS Enterprise Support team, we decided to use two test clusters: one with the old dc2.8xlarge nodes, and another with the newer RA3 nodes. The idea was to use the Simple Replay tool to run the captured queries from our original cluster on both test clusters. This meant that the queries would be identical on both test clusters, making it easier to compare. Although this meant running an additional test cluster for a few days, we went ahead with this option. As a side note, the newer automated AWS CloudFormation-based toolset does exactly the same in an automated way.

After we were convinced that most of our Amazon Redshift queries performed satisfactorily, we noticed that certain queries were performing slower on the RA3-based cluster than the dc2.8xlarge cluster. We narrowed down the problem to SQL queries with full table scans. We rectified it by following proper data modelling practices in the ETL workflow. Then we were ready to migrate to the newer RA3 nodes.

The migration to RA3

The migration from the old cluster to the new cluster was smoother than we thought. We used the elastic resize approach, which meant we only had a few minutes of Amazon Redshift downtime. We completed the migration successfully with a sufficient buffer timeline for more tests. Additional tests indicated that the new cluster performed how we wanted it to.

The trial by fire

The new cluster performed satisfactorily during our peak performance loads in the IPL as well as the following ICC T20 Cricket World Cup. We’re excited that the new RA3 node-based Amazon Redshift cluster can support our data volume growth needs without needing to increase the number of instance nodes.

We migrated from dc2 to RA3 in April 2021. The data volume has grown by 50% since then. If we had continued with dc2 instances, the cluster cost would have increased by 50%. However, because of the migration to RA3 instances, even with an increase in data volume by 50% since April 2021, the cluster cost has increased by 0.7%, which is attributed to an increase in storage cost.

Conclusion

Migrating to the newer RA3-based Amazon Redshift cluster helped us decouple our computing needs from our storage needs, and now we’re prepared for our expected data volume growth for the next few years. Moreover, we don’t need to add compute nodes if we only need storage, which is expected to bring down our costs in the long run. We did need to fine-tune some of our queries on the newer cluster. With the Simple Replay tool, we could do a direct comparison between the older and the newer cluster. You can also use the newer automated AWS CloudFormation-based toolset if you want to follow a similar approach.

We highly recommend RA3 instances. They give you the flexibility to size your RA3 cluster based on the
amount of data stored without increasing your compute costs.

About the Authors

Dhanraj Gaikwad is a Principal Data Engineer at Dream11. Dhanraj has more than 15 years of experience in the field of data and analytics. In his current role, Dhanraj is responsible for building the data platform for Dream Sports and is specialized in data warehousing, including data modeling, building data pipelines, and query optimizations. He is passionate about solving large-scale data problems and taking unique approaches to deal with them.

Sanket Raut is a Principal Technical Account Manager at AWS based in Vasai ,India. Sanket has more than 16 years of industry experience, including roles in cloud architecture, systems engineering, and software design. He currently focuses on enabling large startups to streamline their cloud operations and optimize their cloud spend. His area of interest is in serverless technologies.

Making your Go workloads up to 20% faster with Go 1.18 and AWS Graviton

2022-05-31 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/making-your-go-workloads-up-to-20-faster-with-go-1-18-and-aws-graviton/

This blog post was written by Syl Taylor, Professional Services Consultant.

In March 2022, the highly anticipated Go 1.18 was released. Go 1.18 brings to the language some long-awaited features and additions, such as generics. It also brings significant performance improvements for Arm’s 64-bit architecture used in AWS Graviton server processors. In this post, we show how migrating Go workloads from Go 1.17.8 to Go 1.18 can help you run your applications up to 20% faster and more cost-effectively. To achieve this goal, we selected a series of realistic and relatable workloads to showcase how they perform when compiled with Go 1.18.

Overview

Go is an open-source programming language which can be used to create a wide range of applications. It’s developer-friendly and suitable for designing production-grade workloads in areas such as web development, distributed systems, and cloud-native software.

AWS Graviton2 processors are custom-built by AWS using 64-bit Arm Neoverse cores to deliver the best price-performance for your cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2). They provide up to 40% better price/performance over comparable x86-based instances for a wide variety of workloads and they can run numerous applications, including those written in Go.

Web service throughput

For web applications, the number of HTTP requests that a server can process in a window of time is an important measurement to determine scalability needs and reduce costs.

To demonstrate the performance improvements for a Go-based web service, we selected the popular Caddy web server. To perform the load testing, we selected the hey application, which was also written in Go. We deployed these packages in a client/server scenario on m6g Graviton instances.

The Caddy web server compiled with Go 1.18 brings a 7-8% throughput improvement as compared with the variant compiled with Go 1.17.8.

We conducted a second test where the client downloads a dynamic page on which the request handler performs some additional processing to write the HTTP response content. The performance gains were also noticeable at 10-11%.

Regular expression searches

Searching through large amounts of text is where regular expression patterns excel. They can be used for many use cases, such as:

Checking if a string has a valid format (e.g., email address, domain name, IP address),
Finding all of the occurrences of a string (e.g., date) in a text document,
Identifying a string and replacing it with another.

However, despite their efficiency in search engines, text editors, or log parsers, regular expression evaluation is an expensive operation to run. We recommend identifying optimizations to reduce search time and compute costs.

The following example uses the Go regexp package to compile a pattern and search for the presence of a standard date format in a large generated string. We observed a 13.5% increase in completed executions with a 12% reduction in execution time.

In a second example, we used the Go regexp package to find all of the occurrences of a pattern for character sequences in a string, and then replace them with a single character. We observed a 12% increase in evaluation rate with an 11% reduction in execution time.

As with most workloads, the improvements will vary depending on the input data, the hardware selected, and the software stack installed. Furthermore, with this use case, the regular expression usage will have an impact on the overall performance. Given the importance of regex patterns in modern applications, as well as the scale at which they’re used, we recommend upgrading to Go 1.18 for any software that relies heavily on regular expression operations.

Database storage engines

Many database storage engines use a key-value store design to benefit from simplicity of use, faster speed, and improved horizontal scalability. Two implementations commonly used are B-trees and LSM (log-structured merge) trees. In the age of cloud technology, building distributed applications that leverage a suitable database service is important to make sure that you maximize your business outcomes.

B-trees are seen in many database management systems (DBMS), and they’re used to efficiently perform queries using indexes. When we tested a sample program for inserting and deleting in a large B-tree structure, we observed a 10.5% throughput increase with a 10% reduction in execution time.

On the other hand, LSM trees can achieve high rates of write throughput, thus making them useful for big data or time series events, such as metrics and real-time analytics. They’re used in modern applications due to their ability to handle large write workloads in a time of rapid data growth. The following are examples of databases that use LSM trees:

InfluxDB is a powerful database used for high-speed read and writes on time series data. It’s written in Go and its storage engine uses a variation of LSM called the Time-Structured Merge Tree (TSM).
CockroachDB is a popular distributed SQL database written in Go with its own LSM tree implementation.
Badger is written in Go and is the engine behind Dgraph, a graph database. Its design leverages LSM trees.

When we tested an LSM tree sample program, we observed a 13.5% throughput increase with a 9.5% reduction in execution time.

We also tested InfluxDB using comparison benchmarks to analyze writes and reads to the database server. On the load stress test, we saw a 10% increase of insertion throughput and a 14.5% faster rate when querying at a large scale.

In summary, for databases with an engine written in Go, you’ll likely observe better performance when upgrading to a version that has been compiled with Go 1.18.

Machine learning training

A popular unsupervised machine learning (ML) algorithm is K-Means clustering. It aims to group similar data points into k clusters. We used a dataset of 2D coordinates to train K-Means and obtain the cluster distribution in a deterministic manner. The example program uses an OOP design. We noticed an 18% improvement in execution throughput and a 15% reduction in execution time.

A widely-used and supervised ML algorithm for both classification and regression is Random Forest. It’s composed of numerous individual decision trees, and it uses a voting mechanism to determine which prediction to use. It’s a powerful method for optimizing ML models.

We ran a deterministic example to train a dense Random Forest. The program uses an OOP design and we noted a 20% improvement in execution throughput and a 15% reduction in execution time.

Recursion

An efficient, general-purpose method for sorting data is the merge sort algorithm. It works by repeatedly breaking down the data into parts until it can compare single units to each other. Then, it decides their order in the intermediary steps that will merge repeatedly until the final sorted result. To implement this divide-and-conquer approach, merge sort must use recursion. We ran the program using a large dataset of numbers and observed a 7% improvement in execution throughput and a 4.5% reduction in execution time.

Depth-first search (DFS) is a fundamental recursive algorithm for traversing tree or graph data structures. Many complex applications rely on DFS variants to solve or optimize hard problems in various areas, such as path finding, scheduling, or circuit design. We implemented a standard DFS traversal in a fully-connected graph. Then we observed a 14.5% improvement in execution throughput and a 13% reduction in execution time.

Conclusion

In this post, we’ve shown that a variety of applications, not just those primarily compute-bound, can benefit from the 64-bit Arm CPU performance improvements released in Go 1.18. Programs with an object-oriented design, recursion, or that have many function calls in their implementation will likely benefit more from the new register ABI calling convention.

By using AWS Graviton EC2 instances, you can benefit from up to a 40% price/performance improvement over other instance types. Furthermore, you can save even more with Graviton through the additional performance improvements by simply recompiling your Go applications with Go 1.18.

To learn more about Graviton, see the Getting started with AWS Graviton guide.

How Paytm modernized their data pipeline using Amazon EMR

2022-05-12 Rajat Bhardwaj

Post Syndicated from Rajat Bhardwaj original https://aws.amazon.com/blogs/big-data/how-paytm-modernized-their-data-pipeline-using-amazon-emr/

This post was co-written by Rajat Bhardwaj, Senior Technical Account Manager at AWS and Kunal Upadhyay, General Manager at Paytm.

Paytm is India’s leading payment platform, pioneering the digital payment era in India with 130 million active users. Paytm operates multiple lines of business, including banking, digital payments, bill recharges, e-wallet, stocks, insurance, lending and mobile gaming. At Paytm, the Central Data Platform team is responsible for turning disparate data from multiple business units into insights and actions for their executive management and merchants, who are small, medium or large business entities accepting payments from the Paytm platforms.

The Data Platform team modernized their legacy data pipeline with AWS services. The data pipeline collects data from different sources and runs analytical jobs, generating approximately 250K reports per day, which are consumed by Paytm executives and merchants. The legacy data pipeline was set up on premises using a proprietary solution and didn’t utilize the open-source Hadoop stack components such as Spark or Hive. This legacy setup was resource-intensive, having high CPU and I/O requirements. Analytical jobs took approximately 8–10 hours to complete, which often led to Service Level Agreements (SLA) breaches. The legacy solution was also prone to outages due to higher than expected hardware resource consumption. Its hardware and software limitations impacted the ability of the system to scale during peak load. Data models used in the legacy setup processed the entire data every time, which led to an increased processing time.

In this post, we demonstrate how the Paytm Central Data Platform team migrated their data pipeline to AWS and modernized it using Amazon EMR, Amazon Simple Storage Service (Amazon S3) and underlying AWS Cloud infrastructure along with Apache Spark. We optimized the hardware usage and reduced the data analytical processing, resulting in shorter turnaround time to generate insightful reports, all while maintaining operational stability and scale irrespective of the size of daily ingested data.

Overview of solution

The key to modernizing a data pipeline is to adopt an optimal incremental approach, which helps reduce the end-to-end cycle to analyze the data and get meaningful insights from it. To achieve this state, it’s vital to ingest incremental data in the pipeline, process delta records and reduce the analytical processing time. We configured the data sources to inherit the unchanged records and tuned the Spark jobs to only analyze the newly inserted or updated records. We used temporal data columns to store the incremental datasets until they’re processed. Data intensive Spark jobs are configured in incremental on-demand deduplicating mode to process the data. This helps to eliminate redundant data tuples from the data lake and reduces the total data volume, which saves compute and storage capacity. We also optimized the scanning of raw tables to restrict the scans to only the changed record set which reduced scanning time by approximately 90%. Incremental data processing also helps to reduce the total processing time.

At the time of this writing, the existing data pipeline has been operationally stable for 2 years. Although this modernization was vital, there is a risk of an operational outage while the changes are being implemented. Data skewing needs to be handled in the new system by an appropriate scaling strategy. Zero downtime is expected from the stakeholders because the reports generated from this system are vital for Paytm’s CXO, executive management and merchants on a daily basis.

The following diagram illustrates the data pipeline architecture.

Benefits of the solution

The Paytm Central Data Office team, comprised of 10 engineers, worked with the AWS team to modernize the data pipeline. The team worked for approximately 4 months to complete this modernization and migration project.

Modernizing the data pipeline with Amazon EMR 6.3 helped efficiently scale the system at a lower cost. Amazon EMR managed scaling helped reduce the scale-in and scale-out time and increase the usage of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for running the Spark jobs. Paytm is now able to utilize a Spot to On-Demand ratio of 80:20, resulting in higher cost savings. Amazon EMR managed scaling also helped automatically scale the EMR cluster based on YARN memory usage with the desired type of EC2 instances. This approach eliminates the need to configure multiple Amazon EMR scaling policies tied to specific types of EC2 instances as per the compute requirements for running the Spark jobs.

In the following sections, we walk through the key tasks to modernize the data pipeline.

Migrate over 400 TB of data from the legacy storage to Amazon S3

Paytm team built a proprietary data migration application with the open-source AWS SDK for Java for Amazon S3 using the Scala programming language. This application can connect with multiple cloud providers , on-premises data centers and migrate the data to a central data lake built on Amazon S3.

Modernize the transformation jobs for over 40 data flows

Data flows are defined in the system for ingesting raw data, preprocessing the data and aggregating the data that is used by the analytical jobs for report generation. Data flows are developed using Scala programming language on Apache Spark. We use an Azkaban batch workflow job scheduler for ordering and tracking the Spark job runs. Workflows are created on Amazon EMR to schedule these Spark jobs multiple times during a day. We also implemented Spark optimizations to improve the operational efficiency for these jobs. We use Spark broadcast joins to handle the data skewness, which can otherwise lead to data spillage, resulting in extra storage needs. We also tuned the Spark jobs to avoid a large number of small files, which is a known problem with Spark if not handled effectively. This is mainly because Spark is a parallel processing system and data loading is done through multiple tasks where each task can load into multiple partition. Data-intensive jobs are run using Spark stages.

The following is the code snippet for the Scala jobs:

nodes:
  - name: jobC
    type: noop
    # jobC depends on jobA and jobB
    dependsOn:
      - jobA
      - jobB

  - name: jobA
    type: command
    config:
      command: echo "This is an echoed text."

  - name: jobB
    type: command
    config:
      command: pwd

Validate the data

Accuracy of the data reports is vital for the modern data pipeline. The modernized pipeline has additional data reconciliation steps to improve the correctness of data across the platform. This is achieved by having greater programmatic control over the processed data. We could only reconcile data for the legacy pipeline after the entire data processing was complete. However, the modern data pipeline enables all the transactions to be reconciled at every step of the transaction, which gives granular control for data validation. It also helps isolate the cause of any data processing errors. Automated tests were done before go-live to compare the data records generated by the legacy vs. the modern system to ensure data sanity. These steps helped ensure the overall sanity of the processed data by the new system. Deduplication of data is done frequently via on-demand queries to eliminate redundant data, thereby reducing the processing time. As an example, if there are transactions which are already consumed by the end clients but still a part of the data-set, these can be eliminated by the deduplication, resulting in processing of only the newer transactions for the end client consumption.

The following sample query uses Spark SQL for on-demand deduplication of raw data at the reporting layer:

Insert over table  <<table>>
select col1,col2,col3 ---...coln 
from (select t.*
            ,row_number() over(order by col) as rn 
      from <<table>>
     ) t
where rn = 1

What we achieved as part of the modernization

With the new data pipeline, we reduced the compute infrastructure by 400% which helps to save compute cost. The earlier legacy stack was running on over 6,000 virtual cores. Optimization techniques helped to run the same system at an improved scale, with approximately 1,500 virtual cores. We are able to reduce the compute and storage capacity for 400 TB of data and 40 data flows after migrating to Amazon EMR. We also achieved Spark optimizations, which helped to reduce the runtime of the jobs by 95% (from 8–10 hours to 20–30 minutes), CPU consumption by 95%, I/O by 98% and overall computation time by 80%. The incremental data processing approach helped to scale the system despite data skewness, which wasn’t the case with the legacy solution.

Conclusion

In this post, we showed how Paytm modernized their data lake and data pipeline using Amazon EMR, Amazon S3, underlying AWS Cloud infrastructure and Apache Spark. Choice of these cloud & big-data technologies helped to address the challenges for operating a big data pipeline because the type and volume of data from disparate sources adds complexity to the analytical processing.

By partnering with AWS, the Paytm Central Data Platform team created a modern data pipeline in a short amount of time. It provides reduced data analytical times with astute scaling capabilities, generating high-quality reports for the executive management and merchants on a daily basis.

As next steps, do a deep dive bifurcating the data collection and data processing stages for your data pipeline system. Each stage of the data pipeline should be appropriately designed and scaled to reduce the processing time while maintaining integrity of the reports generated as an output.

If you have feedback about this post, submit comments in the Comments section below.

About the Authors

Rajat Bhardwaj is a Senior Technical Manager with Amazon Web Services based in India, having 23 years of work experience with multiple roles in software development, telecom, and cloud technologies. He works along with AWS Enterprise customers, providing advocacy and strategic technical guidance to help plan and build solutions using AWS services and best practices. Rajat is an avid runner, having competed several half and full marathons in recent years.

Kunal Upadhyay is a General Manager with Paytm Central Data Platform team based out of Bengaluru, India. Kunal has 16 years of experience in big data, distributed computing, and data intelligence. When not building software, Kunal enjoys travel and exploring the world, spending time with friends and family.

Amazon EC2 DL1 instances Deep Dive

2022-05-05 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/amazon-ec2-dl1-instances-deep-dive/

This post is written by Amr Ragab, Principal Solutions Architect, Amazon EC2.

AWS is excited to announce that the new Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances are now generally available in US-East (N. Virginia) and US-West (Oregon). DL1 provides up to 40% better price performance for training deep learning models as compared to current generation GPU-based EC2 instances. The dl1.24xlarge instance type features eight Intel-Habana Gaudi accelerators, which are custom-built to train deep learning models. Each Gaudi accelerator has 32 GB of high bandwidth memory (HBM2) and a peer-to-peer bidirectional bandwidth of 100 Gbps RoCE, for a total bidirectional interconnect bandwidth of 700 Gbps per card. Further instance specifications are as follows:

Instance Size	vCPU	Instance Memory (GiB)	Gaudi Accelerators	Network Bandwidth (Gbps)	Total Accelerator Interconnect (Gbs)	Local Instance Storage	EBS Bandwidth (Gbps)
d1.24xlarge	96	768	8	4×100 Gbps	700	4x1TB NVMe	19

Instance Architecture

As the preceding instance architecture indicates, pairs of Gaudi accelerators (e.g., Gaudi0 and Gaudi1) are attached directly through a PCIe Gen3x16 link. Additionally, peer-to-peer networking via 100 Gbps RoCEv2 links – with seven active links per card – provides a torus configuration with a total of 700 Gbps of interconnect bandwidth per card. This topology is a separate interconnect outside of the two NUMA domains. Furthermore, the instance supports four EFA ENIs and 4x1TB of local NVMe SSD storage. We will provide a peer-direct driver over EFA, which will let you utilize high throughput, low latency peer-direct networking between accelerators across multiple instances to efficiently scale multi-node distributed training workloads.

Quick Start

Quickly get started with DL1 and SynapseAI SDK through with the following options:

1) Habana Deep Learning AMIs provided by AWS.

2) AWS Marketplace AMIs provided by Habana.

3) Using Packer to build a custom Amazon Machine Images (AMI) provided by this GitHub repo. This repo also provides build scripts to create Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS) AMIs.

After selecting an AMI, launch a dl1.24xlarge instance in either us-east-1 or us-west-2. To help identify in which availability zone(s) dl1.24xlarge is available, run the following command:

aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters Name=instance-type,Values=dl1.24xlarge \
--region us-west-2 \
--output table

Once launched, you can connect to the instance over SSH (with the correct security group attached).

Habana Collectives Communication Library (HCL/HCCL)

As part of the Habana SynapseAI SDK, Habana Gaudi’s use the HCCL library for handling the collectives between HPUs. Get more information on HCCL here. On DL1 through the HCL-tests, we can confirm close to 700 Gbps (689 Gbps) per card for the collectives tested as follows.

You can confirm these tests by cloning the github repo here.

Amazon EKS Quick Start

Support for DL1 on Amazon EKS is available today with Amazon EKS versions > 1.19. The following is a quick start to get up and running quickly with DL1.

The following dependencies will be needed:

eksctl – You need version 0.70.0+ of eksctl.
kubectl – You use Kubernetes version 1.20 in this post.

Create EKS cluster:

eksctl create cluster --region us-east-1 --without-nodegroup \
--vpc-public-subnets subnet-037d8e430963c2d3e,subnet-0abe898359a7d43e9

Nodegroup configuration – save the following codeblock to a file called dl1-managed-ng.yaml. Replace the AMI ID in the code block with the AMI created earlier.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: fabulous-rainbow-1635807811
  region: us-west-2

vpc:
  id: vpc-34f1894c
  subnets:
    public:
      endpoint-one:
        id: subnet-4532e73d
      endpoint-two:
        id: subnet-8f8b7dc5

managedNodeGroups:
  - name: dl1-ng-1d
    instanceType: dl1.24xlarge
    volumeSize: 200
    instancePrefix: dl1-ng-1d-worker
    ami: ami-072c632cbbc2255b3
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        fsx: true
        cloudWatch: true
    ssh:
      allow: true
      publicKeyName: amrragab-aws
    subnets:
    - endpoint-one
    minSize: 1
    desiredCapacity: 1
    maxSize: 4
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh fabulous-rainbow-1635807811

Create the managed nodegroup with the following command:

eksctl create nodegroup -f dl1-managed-ng.yaml

Once the nodegroup has been completed, you must apply the habana-k8s-device-plugin

kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml

Once completed, you should see the Gaudi devices as an allocatable resource in your EKS
cluster, presenting 8 Gaudi accelerators per DL1 node in the cluster.

Allocatable:

attachable-volumes-aws-ebs: 39
cpu:                        95690m
ephemeral-storage:          192188443124
habana.ai/gaudi:            8
hugepages-1Gi:              0
hugepages-2Mi:              30000Mi
memory:                     753055132Ki
pods:                       15

Example Distributed Machine Learning (ML) Workloads

The following tables are examples of Mixed Precision/FP32 training results comparing DL1 to the common GPU instances used for ML training.

Model: ResNet50
Framework: TensorFlow 2
Dataset: Imagenet2012
GitHub: https://github.com/HabanaAI/Model-
References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras

Instance Type	Batch Size	Mixed Precision Training Throughput (images/sec)
8x Gaudi – 32 GB (dl1.24xlarge)	256	13036
8x A100 – 40 GB (p4d.24xlarge)	256	17921
8x V100 – 32 GB (p3dn.24xlarge)	256	9685
8x V100 – 16GB (p3.16xlarge)	256	8945

Model: Bert Large – Pretraining
Framework: Pytorch 1.9
Dataset: Wikipedia/BooksCorpus
GitHub: https://github.com/HabanaAI/Model-References/tree/master/PyTorch/nlp/bert

Instance Type	Batch Size @128 Sequence Length	Mixed Precision Training Throughput (seq/sec)
8x Gaudi – 32 GB (dl1.24xlarge)	256	1318
8x A100 – 40 GB (p4d.24xlarge)	8192	2979
8x V100 – 32 GB (p3dn.24xlarge)	8192	1458
8x V100 – 16GB (p3.16xlarge)	8192	1013

You can find a more comprehensive list of ML models supported with performance data here. Support for containers with TensorFlow and Pytorch are also available. Furthermore, you can stay up-to-date with the operator support for TensorFlow and Pytorch.

CONCLUSION

We are excited to innovate on behalf of our customers and provide a diverse choice in ML accelerators with DL1 instances. The DL1 instances powered by Gaudi accelerators can provide up to 40% better price performance for training deep learning models as compared to current generation GPU-based EC2 instances. DL1 instances use the Habana SynapseAI SDK with framework support in Pytorch and TensorFlow. Additional future support for EFA with peer direct HPUs across nodes will also be supported. Now it’s time to go power up your ML workloads with Amazon EC2 DL1 instances.

Use Amazon Kinesis Data Firehose to extract data insights with Coralogix

2022-05-04 Tal Knopf

Post Syndicated from Tal Knopf original https://aws.amazon.com/blogs/big-data/use-amazon-kinesis-data-firehose-to-extract-data-insights-with-coralogix/

This is a guest blog post co-written by Tal Knopf at Coralogix.

Digital data is expanding exponentially, and the existing limitations to store and analyze it are constantly being challenged and overcome. According to Moore’s Law, digital storage becomes larger, cheaper, and faster with each successive year. The advent of cloud databases is just one example of how this is happening. Previous hard limits on storage size have become obsolete since their introduction.

In recent years, the amount of available data storage in the world has increased rapidly, reflecting this new reality. If you took all the information just from US academic research libraries and lumped it together, it would add up to 2 petabytes.

Coralogix has worked with AWS to bring you a solution to allow for the flawless integration of high volumes of data with the Coralogix platform for analysis, using Amazon Kinesis Data Firehose.

Kinesis Data Firehose and Coralogix

Kinesis Data Firehose delivers real-time streaming data to destinations like Amazon Simple Storage Service (Amazon S3), Amazon Redshift, or Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), and now supports delivering streaming data to Coralogix. There is no limit on the number of delivery streams, so you can use it to get data from multiple AWS services.

Kinesis Data Firehose provides built-in, fully managed error handling, transformation, conversion, aggregation, and compression functionality, so you don’t need to write applications to handle these complexities.

Coralogix is an AWS Partner Network (APN) Advanced Technology Partner with AWS DevOps Competency. The platform enables you to easily explore and analyze logs to gain deeper insights into the state of your applications and AWS infrastructure. You can analyze all your AWS service logs while storing only the ones you need, and generate metrics from aggregated logs to uncover and alert on trends in your AWS services.

Solution overview

Imagine a pipe flowing with data—messages, to be more specific. These messages can contain log lines, metrics, or any other type of data you want to collect.

Clearly, there must be something pushing data into the pipe; this is the provider. There must also be a mechanism for pulling data out of the pipe; this is the consumer.

Kinesis Data Firehose makes it easy to collect, process, and analyze real-time, streaming data by grouping the pipes together in the most efficient way to help with management and scaling.

It offers a few significant benefits compared to other solutions:

Keeps monitoring simple – With this solution, you can configure AWS Web Application Firewall (AWS WAF), Amazon Route 53 Resolver Query Logs, or Amazon API Gateway to deliver log events directly to Kinesis Data Firehose.
Integrates flawlessly – Most AWS services use Amazon CloudWatch by default to collect logs, metrics, and additional events data. CloudWatch logs can easily be sent using the Firehose delivery stream.
Flexible with minimum maintenance – To configure Kinesis Data Firehose with the Coralogix API as a destination, you only need to set up the authentication in one place, regardless of the amount of services or integrations providing the actual data. You can also configure an S3 bucket as a backup plan. You can back up all log events or only those exceeding a specified retry duration.
Scale, scale, scale – Kinesis Data Firehose scales up to meet your needs with no need for you to maintain it. The Coralogix platform is also built for scale and can meet all your monitoring needs as your system grows.

Prerequisites

To get started, you must have the following:

A Coralogix account. If you don’t already have an account, you can sign up for one.
A Coralogix private key.

To find your key, in your Coralogix account, choose API Keys on the Data Flow menu.

Locate the key for Send Your Data.

Set up your delivery stream

To configure your deliver stream, complete the following steps:

On the Kinesis Data Firehose console, choose Create delivery stream.
Under Choose source and destination, for Source, choose Direct PUT.
For Destination, choose Coralogix.
For Delivery stream name¸ enter a name for your stream.
Under Destination settings, for HTTP endpoint name, enter a name for your endpoint.
For HTTP endpoint URL, enter your endpoint URL based on your Region and Coralogix account configuration.
For Access key, enter your Coralogix private key.
For Content encoding¸ select GZIP.
For Retry duration, enter 60.

To override the logs applicationName, subsystemName, or computerName, complete the optional steps under Parameters.

For Key, enter the log name.
For Value, enter a new value to override the default.
For this post, leave the configuration under Buffer hints as is.
In the Backup settings section, for Source record in Amazon S3, select Failed data only (recommended).
For S3 backup bucket, choose an existing bucket or create a new one.
Leave the settings under Advanced settings as is.
Review your settings and choose Create delivery stream.

Logs subscribed to your delivery stream are immediately sent and available for analysis within Coralogix.

Conclusion

Coralogix provides you with full visibility into your logs, metrics, tracing, and security data without relying on indexing to provide analysis and insights. When you use Kinesis Data Firehose to send data to Coralogix, you can easily centralize all your AWS service data for streamlined analysis and troubleshooting.

To get the most out of the platform, check out Getting Started with Coralogix, which provides information on everything from parsing and enrichment to alerting and data clustering.

About the Authors

Tal Knopf is the Head of Customer DevOps at Coralogix. He uses his vast experience in designing and building customer-focused solutions to help users extract the full value from their observability data. Previously, he was a DevOps engineer in Akamai and other companies, where he specialized in large-scale systems and CDN solutions.

Ilya Rabinov is a Solutions Architect at AWS. He works with ISVs at late stages of their journey to help build new products, migrate existing applications, or optimize workloads on AWS. His ares of interest include machine learning, artificial intelligence, security, DevOps culture, CI/CD, and containers.

How SailPoint solved scaling issues by migrating legacy big data applications to Amazon EMR on Amazon EKS

2022-04-26 Richard Li

Post Syndicated from Richard Li original https://aws.amazon.com/blogs/big-data/how-sailpoint-solved-scaling-issues-by-migrating-legacy-big-data-applications-to-amazon-emr-on-amazon-eks/

This post is co-written with Richard Li from SailPoint.

SailPoint Technologies is an identity security company based in Austin, TX. Its software as a service (SaaS) solutions support identity governance operations in regulated industries such as healthcare, government, and higher education. SailPoint distinguishes multiple aspects of identity as individual identity security services, including cloud governance, SaaS management, access risk governance, file access management, password management, provisioning, recommendations, and separation of duties, as well as access certification, access insights, access modeling, and access requests.

In this post, we share how SailPoint updated its platform for big data operations, and solved scaling issues by migrating legacy big data applications to Amazon EMR on Amazon EKS.

The challenge with the legacy data environment

SailPoint acquired a SaaS software platform that processes and analyzes identity, resource, and usage data from multiple cloud providers, and provides access insights, usage analysis, and access risk analysis. The original design criteria of the platform was focused on serving small to medium-sized companies. To quickly process these analytics insights, many of these processing workloads were done inside many microservices through streaming connections.

After acquisition, we set a goal to expand the platform’s capability to handle customers with large cloud footprints over multiple cloud providers, sometime over hundreds or even thousands of accounts producing large amount of cloud event data.

The legacy architecture has a simplistic approach for data processing, as shown in the following diagram. We were processing the vast majority of event data in-service and directly ingested into Amazon Relational Database Service (Amazon RDS), which we then merged with a graph database to form the final view..

We needed to convert this into a scalable process that could handle customers of any size. To address this challenge, we had to quickly introduce a big data processing engine in the platform.

How migrating to Amazon EMR on EKS helped solve this challenge

When evaluating the platform for our big data operations, several factors made Amazon EMR on EKS a top choice.

The amount of event data we receive at any given time is generally unpredictable. To stay cost-effective and efficient, we need a platform that is capable of scaling up automatically when the workload increases to reduce wait time, and can scale down when the capacity is no longer needed to save cost. Because our existing application workloads are already running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with the cluster autoscaler enabled, running Amazon EMR on EKS on top of our existing EKS cluster fits this need.

Amazon EMR on EKS can safely coexist on an EKS cluster that is already hosting other workloads, be contained within a specified namespace, and have controlled access through use of Kubernetes role-based access control and AWS Identity and Access Management (IAM) roles for service accounts. Therefore, we didn’t have to build new infrastructures just for Amazon EMR. We simply linked up Amazon EMR on EKS with our existing EKS cluster running our application workloads. This reduced the amount of DevOps support needed, and significantly sped up our implementation and deployment timeline.

Unlike Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), because our EKS cluster spans over multiple Availability Zones, we can control Spark pods placements using Kubernetes’s pod scheduling and placement strategy to achieve higher fault tolerance.

With the ability to create and use custom images in Amazon EMR on EKS, we could also utilize our existing container-based application build and deployment pipeline for our Amazon EMR on EKS workload without any modifications. This also gave us additional benefit in reducing job startup time because we package all job scripts as well as all dependencies with the image, without having to fetch them at runtime.

We also utilize AWS Step Functions as our core workflow engine. The native integration of Amazon EMR on EKS with Step Functions is another bonus where we didn’t have to build custom code for job dispatch. Instead, we could utilize the Step Functions native integration to seamlessly integrate Amazon EMR jobs with our existing workflow, with very little effort.

In merely 5 months, we were able to go from design, to proof of concept, to rolling out phase 1 of the event analytics processing. This vastly improved our event analytics processing capability by extending horizontal scalability, which gave us the ability to take customers with significantly larger cloud footprints than the legacy platform was designed for.

During the development and rollout of the platform, we also found that the Spark History Server provided by Amazon EMR on EKS was very useful in terms of helping us identify performance issues and tune the performance of our jobs.

As of this writing, the phase 1 rollout, which includes the event processing component of the core analytics processing, is complete. We’re now expanding the platform to migrate additional components onto Amazon EMR on EKS. The following diagram depicts our future architecture with Amazon EMR on EKS when all phases are complete.

In addition, to improve performances and reduce costs, we’re currently testing the Spark dynamic resource allocation support of Amazon EMR on EKS. This would automatically scale up and down the job executors based on load, and therefore boost performance when needed and reduce cost when the workload is low. Furthermore, we’re investigating the possibility to reduce the overall cost and increase performance by utilizing the pod template feature that would allow us to seamlessly transition our Amazon EMR job workload to AWS Graviton based instances.

Conclusion

With Amazon EMR on EKS, we can now onboard new customers and process vast amounts of data in a cost-effective manner, which we couldn’t do with our legacy environment. We plan to expand our Amazon EMR on EKS footprint to handle all our transform and load data analytics processes.

About the Authors

Richard Li is a senior staff software engineer on the SailPoint Technologies Cloud Access Management team.

Janak Agarwal is a product manager for Amazon EMR on Amazon EKS at AWS.

Kiran Guduguntla is a WW Go-to-Market Specialist for Amazon EMR at AWS. He works with AWS customers across the globe to strategize, build, develop, and deploy modern data analytics solutions.

How MarketAxess® uses AWS Developer Tools to create scalable and secure CI/CD pipelines

2022-04-18 Aaron Lima

Post Syndicated from Aaron Lima original https://aws.amazon.com/blogs/devops/how-marketaxess-uses-aws-developer-tools-to-create-scalable-and-secure-ci-cd-pipelines/

Very often, enterprise organizations strive to adopt modern DevOps practices, tofocus on governance and security without sacrificing development velocity. In this guest post, Prashant Joshi, Senior Cloud Engineer at MarketAxess, explains how they use the AWS Cloud Development Kit (AWS CDK), AWS CodePipeline, and AWS CodeBuild to simplify the developer experience by dynamically provisioning pipelines and maintaining governance at MarketAxess.

Problem Statement

MarketAxess is a financial technology company that operates an e-trading platform, for institutional credit markets. As MarketAxess adopted DevOps firm-wide, we struggled to ensure pipeline consistency. We had developers using static code analysis and linting, but it wasn’t enforced. As more teams began to adopt DevOps practices, the importance of providing consistency over code quality, security scanning, and artifact management grew. However, we were challenged with increasing our engineering workforce and implementing best practices in the various pipelines. As a small team, we needed a way to reliably manage and scale pipelines while reducing engineering overhead. We thought about the DevOps tenets, as well as the importance of automation, and we decided to build automation that would provision pipelines for development teams. These pipelines included best practices for Continuous Integration and Continuous Deployment (CI/CD). We wanted to build this automation with self-service, so that teams can get started developing a solution to a business problem, without having to spend too much time around the CI/CD aspects of their projects.

We chose the AWS CDK to deploy AWS CodePipeline, AWS CodeBuild, and AWS Identity and Access Management (IAM) resources, and used an API webhook using AWS Lambda and Amazon API Gateway for integration. In this post, we provide an example of how these services can be used to create dynamic cross account CI/CD pipelines.

Solution

In developing our solution, we wanted to accomplish three main goals:

Standardization and Governance of Pipelines – We wanted to ensure consistent practices in each team’s pipeline to make sure of code quality and security.
Simplified Developer Interaction – We wanted developers to focus mainly on interacting with the code repository for their project.
Improve Management of Dynamically Provisioned Pipelines – Knowing that we would need to make changes, improvements, and enhancements, we wanted tools and a process that was flexible.

We achieved these goals using AWS CDK to automate the creation of CodePipeline and define mandatory actions in the pipeline. We also created a webhook using API Gateway to integrate with our Bitbucket repositories to automatically trigger the automation. The pipelines can dynamically be provisioned or updated based on the YAML manifest file submitted to the repository. We process the manifest file with Amazon Elastic Container Service (Amazon ECS) Fargate tasks, because we had containerized the processing components using Docker. However, with the release of container support in Lambda, we are now considering this as a potential replacement. These pipelines run CI stages based on the programing language defined by development teams in the manifest file, and they deploy a tested versioned artifact to the corresponding environments via standard Software Defined Lifecycle (SDLC) practices. As a part of CI stages, we semantically version our code and tag our commits accordingly. This lets us trace commit to pipeline execution. The following architecture diagram shows a CloudFormation pipeline generated via AWS CDK.

CloudFormation Pipeline Architecture Diagram

The process flow is as follows:

Developer pushes a change to the repository.
A webhook is triggered when the Pull Request is merged that creates or modifies the pipeline based on the manifest file submitted to the repository.
This triggers a Lambda function that performs the following:
1. Clones the repository from Internally hosted BitBucket repos.
2. Uploads the repository to the source Amazon Simple Storage Service (Amazon S3) bucket, which is encrypted using Customer Managed Keys (CMK) with the AWS Key Management Service (KMS).
3. An ECS Task is run, and a manifest file is passed which gives the project parameters. Pipelines are built according to these project parameters.
An ECS Task processes the metadata file and runs cdk Logic, finally it triggers the pipeline.
1. As source code is progressed through the pipeline, the build stage output to the artifact bucket. Pipeline artifacts are encrypted with a CMK. The IAM roles in the target account only have access to this bucket.

Additionally, through the power of the IAM integration with CodePipeline, the team could implement session tags with IAM roles and Okta to make sure that independent teams only approve pipelines, which are owned by respective teams. Furthermore, we use attribute-based tags to protect the production environment from unauthorized actions, so that deployment to production can only come through the pipeline.

The AWS CDK-based pipelines let MarketAxess enable teams to independently build and obtain immediate feedback, while still centrally governing CI and CD patterns. The solution took six months of two DevOps engineers working full time to build the cdk structure and support for the core languages and their corresponding CI and CD stages. We continue to iterate on the cdk code base and pipelines, incorporating feedback from our development community to ensure developer satisfaction.

Simplified Developer Interaction

Although we were enforcing standards via the automation, we still wanted to give development teams autonomy through a simple mechanism. We wanted developers to interact with our pipeline creation process through a pipeline manifest file that they submitted to their repository. An example of the manifest file schema is in the following screenshot:

Manifest File Schema

As shown above, the manifest lets developers define custom application configurations, while preserving consistent quality gates. This manifest is checked in to source control, and upon a commit to the code repository it triggers our automation. This lets our pipelines mutate on manifest file changes, and it makes sure that the latest commit goes through the latest quality gates. Each repository gets its own pipeline, and, to maintain the security of the pipeline, we used IAM Session Tags with Okta. We tag each pipeline and its associated resources with a unique attribute that is mapped to the development team so that they only have access to their pipelines, and only authorized individuals may approve production deployments.

Using AWS CDK, AWS CodePipeline, and other AWS Services, we have been able to improve the stability and quality of the code being delivered. CodePipeline and AWS CDK have helped us develop a cloud native pipeline solution that meets our governance best practices and compliance requirements. We met our three goals, and we can iterate and change easily moving forward.

Conclusion

Organizations that achieve the automation and self-service ideals of DevOps can build, release, and deploy features and apps to users faster and at higher levels of quality. In this post, we saw a real-life example of using Infrastructure as Code with AWS CDK to build a service that helps maintain governance and helps developers get work done. Here are two other posts that demonstrate using AWS Service Catalog to create secure DevOps pipelines or DevOps pipelines that deploy containerized applications.

Using AWS Step Functions and Amazon DynamoDB for business rules orchestration

2022-03-29 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-aws-step-functions-and-amazon-dynamodb-for-business-rules-orchestration/

This post is written by Vijaykumar Pannirselvam, Cloud Consultant, Sushant Patil, Cloud Consultant, and Kishore Dhamodaran, Senior Solution Architect.

A Business Rules Engine (BRE) is used in enterprises to manage business-critical decisions. The logic or rules used to make such decisions can vary in complexity. A finance department may have a basic rule to get any purchase over a certain dollar amount to get director approval. A mortgage company may need to run complex rules based on inputs (for example, credit score, debt-to-income ratio, down payment) to make an approval decision for a loan.

Decoupling these rules from application logic provides agility to your rules management, since business rules may often change while your application may not. It can also provide standardization across your enterprise, so every department can communicate with the same taxonomy.

As part of migrating their workloads, some enterprises consider replacing their commercial rules engine with cloud native and open-source alternatives. The motivation for such a move stem from several factors, such as simplifying the architecture, cost, security considerations, or vendor support.

Many of these commercial rules engines come as part of a BPMS offering that provides orchestration capabilities for rules execution. For a successful migration to cloud using an open-source rules engine management system, you need an orchestration capability to manage incoming rule requests, auditing the rules, and tracking exceptions.

This post showcases an orchestration framework that allows you to use an open-source rules engine. It uses Drools rules engine to build a set of rules for calculating insurance premiums based on the properties of Car and Person objects. This uses AWS Step Functions, AWS Lambda, Amazon API Gateway, Amazon DynamoDB, and open-source Drools rules engine to show this. You can swap the rules engine provided you can manage it in the AWS Cloud environment and expose it as an API.

Solution overview

The following diagram shows the solution architecture.

The solution comprises:

API Gateway – a fully managed service that makes it easier to create, publish, maintain, monitor, and secure APIs at any scale for API consumers. API Gateway helps you manage traffic to backend systems, in this case Step Functions, which orchestrates the execution of tasks. For the REST API use-case, you can also set up a cache with customizable keys and time-to-live in seconds for your API data to avoid hitting your backend services for each request.
Step Functions – a low code service to orchestrate multiple steps involved to accomplish tasks. Step Functions uses the finite-state machine (FSM) model, which uses given states and transitions to complete the tasks. The diagram depicts three states: Audit Request, Execute Ruleset and Audit Response. We execute them sequentially. You can add additional states and transitions, such as validating incoming payloads, and branching out parallel execution of the states.
Drools rules engine Spring Boot application – runtime component of the rule execution. You set the Drools rule engine Spring Boot application as an Apache Maven Docker project with Drools Maven dependencies. You then deploy the Drools rule engine Docker image to an Amazon Elastic Container Registry (Amazon ECR), create an AWS Fargate cluster, and an Amazon Elastic Container Service (Amazon ECS) service. The service launches Amazon ECS tasks and maintains the desired count. An Application Load Balancer distributes the traffic evenly to all running containers.
Lambda – a serverless execution environment giving you an ability to interact with the Drools Engine and a persistence layer for rule execution audit functions. The Lambda component provides the audit function required to persist the incoming requests and outgoing responses in DynamoDB. Apart from the audit function, Lambda is also used to invoke the service exposed by the Drools Spring Boot application.
DynamoDB – a fully managed and highly scalable key/value store, to persist the rule execution information, such as request and response payload information. DynamoDB provides the persistence layer for the incoming request JSON payload and for the outgoing response JSON payload. The audit Lambda function invokes the DynamoDB put_item() method when it receives the request or response event from Step Functions. The DynamoDB table rule_execution_audit has an entry for every request and response associated with the incoming request-id originated by the application (upstream).

Drools rules engine implementation

The Drools rules engine separates the business rules from the business processes. You use DRL (Drools Rule Language) by defining business rules as .drl text files. You define model objects to build the rules.

The model objects are POJO (Plain Old Java Objects) defined using Eclipse, with the Drools plugin installed. You should have some level of knowledge about building rules and executing them using the Drools rules engine. The below diagram describes the functions of this component.

You define the following rules in the .drl file as part of the GitHub repo. The purpose of these rules is to evaluate the driver premium based on the input model objects provided as input. The inputs are Car and Driver objects and output is the Policy object, which has the premium calculated based on the certain criteria defined in the rule:

rule "High Risk"
     when     
         $car : Car(style == "SPORTS", color == "RED") 
         $policy : Policy() 
         and $driver : Driver ( age < 21 )                             
     then
         System.out.println(drools.getRule().getName() +": rule fired");          
         modify ($policy) { setPremium(increasePremiumRate($policy, 20)) };
 end
 
 rule "Med Risk"
     when     
         $car : Car(style == "SPORTS", color == "RED") 
         $policy : Policy() 
         and $driver : Driver ( age > 21 )                             
     then
         System.out.println(drools.getRule().getName() +": rule fired");          
         modify ($policy) { setPremium(increasePremiumRate($policy, 10)) };
 end
 
 
 function double increasePremiumRate(Policy pol, double percentage) {
     return (pol.getPremium() + pol.getPremium() * percentage / 100);
 }

Once the rules are defined, you define a RestController that takes input parameters and evaluates the above rules. The below code snippet is a POST method defined in the controller, which handles the requests and sends the response to the caller.

@PostMapping(value ="/policy/premium", consumes = {MediaType.APPLICATION_JSON_VALUE, MediaType.APPLICATION_XML_VALUE }, produces = {MediaType.APPLICATION_JSON_VALUE, MediaType.APPLICATION_XML_VALUE})
    public ResponseEntity<Policy> getPremium(@RequestBody InsuranceRequest requestObj) {
        
        System.out.println("handling request...");
        
        Car carObj = requestObj.getCar();        
        Car carObj1 = new Car(carObj.getMake(),carObj.getModel(),carObj.getYear(), carObj.getStyle(), carObj.getColor());
        System.out.println("###########CAR##########");
        System.out.println(carObj1.toString());
        
        System.out.println("###########POLICY##########");        
        Policy policyObj = requestObj.getPolicy();
        Policy policyObj1 = new Policy(policyObj.getId(), policyObj.getPremium());
        System.out.println(policyObj1.toString());
            
        System.out.println("###########DRIVER##########");    
        Driver driverObj = requestObj.getDriver();
        Driver driverObj1 = new Driver( driverObj.getAge(), driverObj.getName());
        System.out.println(driverObj1.toString());
        
        KieSession kieSession = kieContainer.newKieSession();
        kieSession.insert(carObj1);      
        kieSession.insert(policyObj1); 
        kieSession.insert(driverObj1);         
        kieSession.fireAllRules(); 
        printFactsMessage(kieSession);
        kieSession.dispose();
    
        
        return ResponseEntity.ok(policyObj1);
    }

Prerequisites

OpenJDK 17.
Python version 3.9.
Apache Maven version 3.8.4.
Docker version 20.10.12.
Postman to test your API.
Set up AWS CLI to allow you to deploy your resources.
Set up the AWS SAM CLI to deploy your resources.
Have the appropriate AWS credentials for interacting with resources in your AWS account.

Solution walkthrough

Clone the project GitHub repository to your local machine, do a Maven build, and create a Docker image. The project contains Drools related folders needed to build the Java application.
```
git clone https://github.com/aws-samples/aws-step-functions-business-rules-orchestration
cd drools-spring-boot
mvn clean install
mvn docker:build
```

Create an Amazon ECR private repository to host your Docker image.

aws ecr create-repository —repository-name drools_private_repo —image-tag-mutability MUTABLE —image-scanning-configuration scanOnPush=false

Tag the Docker image and push it to the Amazon ECR repository.

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <<INSERT ACCOUNT NUMBER>>.dkr.ecr.us-east-1.amazonaws.com
docker tag drools-rule-app:latest <<INSERT ACCOUNT NUMBER>>.dkr.ecr.us-east-1.amazonaws.com/drools_private_repo:latest
docker push <<INSERT ACCOUNT NUMBER>>.dkr.ecr.us-east-1.amazonaws.com/drools_private_repo:latest

Deploy resources using AWS SAM:
```
cd ..
sam build
sam deploy --guided
```

Verifying the deployment

Verify the business rules execution and the orchestration components:

Navigate to the API Gateway console, and choose the rules-stack API.
Under Resources, choose POST, followed by TEST.

Enter the following JSON under the Request Body section, and choose Test.

{
  "context": {
    "request_id": "REQ-99999",
    "timestamp": "2021-03-17 03:31:51:40"
  },
  "request": {
    "driver": {
      "age": "18",
      "name": "Brian"
    },
    "car": {
      "make": "honda",
      "model": "civic",
      "year": "2015",
      "style": "SPORTS",
      "color": "RED"
    },
    "policy": {
      "id": "1231231",
      "premium": "300"
    }
  }
}

The response received shows results from the evaluation of the business rule “High Risk“, with the premium representing the percentage calculation in the rule definition. Try changing the request input to evaluate a “Medium Risk” rule by modifying the age of the driver to 22 or higher:
Optionally, you can verify the API using Postman. Get the endpoint information by navigating to the rule-stack API, followed by Stages in the navigation pane, then choosing either Dev or Stage.
Enter the payload in the request body and choose Send:
The response received is results from the evaluation of business rule “High Risk“, with the premium representing the percentage calculation in the rule definition. Try changing the request input to evaluate a “Medium Risk” rule by modifying the age of the driver to 22 or higher.
Observe the request and response audit logs. Navigate to the DynamoDB console. Under the navigation pane, choose Tables, then choose rule_execution_audit.
Under the Tables section in the navigation pane, choose Explore Items. Observe the individual audit logs by choosing the audit_id.

Cleaning up

To avoid incurring ongoing charges, clean up the infrastructure by deleting the stack using the following command:

sam delete

Delete the Amazon ECR repository, and any other resources you created as a prerequisite for this exercise.

Conclusion

In this post, you learned how to leverage an orchestration framework using Step Functions, Lambda, DynamoDB, and API Gateway to build an API backed by an open-source Drools rules engine, running on a container. Try this solution for your cloud native business rules orchestration use-case.

For more serverless learning resources, visit Serverless Land.

Insights for CTOs: Part 3 – Growing your business with modern data capabilities

2022-03-25 Syed Jaffry

Post Syndicated from Syed Jaffry original https://aws.amazon.com/blogs/architecture/insights-for-ctos-part-3-growing-your-business-with-modern-data-capabilities/

This post was co-wrtiten with Jonathan Hwang, head of Foundation Data Analytics at Zendesk.

In my role as a Senior Solutions Architect, I have spoken to chief technology officers (CTOs) and executive leadership of large enterprises like big banks, software as a service (SaaS) businesses, mid-sized enterprises, and startups.

In this 6-part series, I share insights gained from various CTOs and engineering leaders during their cloud adoption journeys at their respective organizations. I have taken these lessons and summarized architecture best practices to help you build and operate applications successfully in the cloud. This series also covers building and operating cloud applications, security, cloud financial management, modern data and artificial intelligence (AI), cloud operating models, and strategies for cloud migration.

In Part 3, I’ve collaborated with the head of Foundation Analytics at Zendesk, Jonathan Hwang, to show how Zendesk incrementally scaled their data and analytics capabilities to effectively use the insights they collect from customer interactions. Read how Zendesk built a modern data architecture using Amazon Simple Storage Service (Amazon S3) for storage, Apache Hudi for row-level data processing, and AWS Lake Formation for fine-grained access control.

Why Zendesk needed to build and scale their data platform

Zendesk is a customer service platform that connects over 100,000 brands with hundreds of millions of customers via telephone, chat, email, messaging, social channels, communities, review sites, and help centers. They use data from these channels to make informed business decisions and create new and updated products.

In 2014, Zendesk’s data team built the first version of their big data platform in their own data center using Apache Hadoop for incubating their machine learning (ML) initiative. With that, they launched Answer Bot and Zendesk Benchmark report. These products were so successful they soon overwhelmed the limited compute resources available in the data center. By the end of 2017, it was clear Zendesk needed to move to the cloud to modernize and scale their data capabilities.

Incrementally modernizing data capabilities

Zendesk built and scaled their workload to use data lakes on AWS, but soon encountered new architecture challenges:

The General Data Protection Regulation (GDPR) “right to be forgotten” rule made it difficult and costly to maintain data lakes, because deleting a small piece of data required reprocessing large datasets.
Security and governance was harder to manage when data lake scaled to a larger number of users.

The following sections show you how Zendesk is addressing GDPR rules by evolving from plain Apache Parquet files on Amazon S3 to Hudi datasets on Amazon S3 to enable row level inserts/updates/deletes. To address security and governance, Zendesk is migrating to AWS Lake Formation centralized security for fine-grained access control at scale.

Zendesk’s data platform

Figure 1 shows Zendesk’s current data platform. It consists of three data pipelines: “Data Hub,” “Data Lake,” and “Self Service.”

Figure 1. Zendesk data pipelines

Data Lake pipelines

The Data Lake and Data Hub pipelines cover the entire lifecycle of the data from ingestion to consumption.

The Data Lake pipelines consolidate the data from Zendesk’s highly distributed databases into a data lake for analysis.

Zendesk uses Amazon Database Migration Service (AWS DMS) for change data capture (CDC) from over 1,800 Amazon Aurora MySQL databases in eight AWS Regions. It detects transaction changes and applies them to the data lake using Amazon EMR and Hudi.

Zendesk ticket data consists of over 10 billion events and petabytes of data. The data lake files in Amazon S3 are transformed and stored in Apache Hudi format and registered on the AWS Glue catalog to be available as data lake tables for analytics querying and consumption via Amazon Athena.

Data Hub pipelines

The Data Hub pipelines focus on real-time events and streaming analytics use cases with Apache Kafka. Any application at Zendesk can publish events to a global Kafka message bus. Apache Flink ingests these events into Amazon S3.

The Data Hub provides high-quality business data that is highly available and scalable.

Self-managed pipeline

The self-managed pipelines empower product engineering teams to use the data lake for those use cases that don’t fit into our standard integration patterns. All internal Zendesk product engineering teams can use standard tools such as Amazon EMR, Amazon S3, Athena, and AWS Glue to publish their own analytics dataset and share them with other teams.

A notable example of this is Zendesk’s fraud detection engineering team. They publish their fraud detection data and findings through our self-manage data lake platform and use Amazon QuickSight for visualization.

You need fine-grained security and compliance

Data lakes can accelerate growth through faster decision making and product innovation. However, they can also bring new security and compliance challenges:

Visibility and auditability. Who has access to what data? What level of access do people have and how/when and who is accessing it?
Fine-grained access control. How do you define and enforce least privilege access to subsets of data at scale without creating bottlenecks or key person/team dependencies?

Lake Formation helps address these concerns by auditing data access and offering row- and column-level security and a delegated access control model to create data stewards for self-managed security and governance.

Zendesk used Lake Formation to build a fine-grained access control model that uses row-level security. It detects personally identifiable information (PII) while scaling the data lake for self-managed consumption.

Some Zendesk customers opt out of having their data included in ML or market research. Zendesk uses Lake Formation to apply row-level security to filter out records associated with a list of customer accounts who have opted out of queries. They also help data lake users understand which data lake tables contain PII by automatically detecting and tagging columns in the data catalog using AWS Glue’s PII detection algorithm.

The value of real-time data processing

When you process and consume data closer to the time of its creation, you can make faster decisions. Streaming analytics design patterns, implemented using services like Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis, create an enterprise event bus to exchange data between heterogeneous applications in near real time.

For example, it is common to use streaming to augment the traditional database CDC ingestion into the data lake with additional streaming ingestion of application events. CDC is a common data ingestion pattern, but the information can be too low level. This requires application context to be reconstructed in the data lake and business logic to be duplicated in two places, inside the application and in the data lake processing layer. This creates a risk of semantic misrepresentation of the application context.

Zendesk faced this challenge with their CDC data lake ingestion from their Aurora clusters. They created an enterprise event bus built with Apache Kafka to augment their CDC with higher-level application domain events to be exchanged directly between heterogeneous applications.

Zendesk’s streaming architecture

A CDC database ticket table schema can sometimes contain unnecessary and complex attributes that are application specific and do not capture the domain model of the ticket. This makes it hard for downstream consumers to understand and use the data. A ticket domain object may span several database tables when modeled in third normal form, which makes querying for analysts difficult downstream. This is also a brittle integration method because downstream data consumers can easily be impacted when the application logic changes, which makes it hard to derive a common data view.

To move towards event-based communication between microservices, Zendesk created the Platform Data Architecture (PDA) project, which uses a standard object model to represent a higher level, semantic view of their application data. Standard objects are domain objects designed for cross-domain communication and do not suffer from the lower level fragmented scope of database CDC. Ultimately, Zendesk aims to transition their data architecture from a collection of isolated products and data silos into a cohesive unified data platform.

Figure 2. An application view of Zendesk’s streaming architecture

Figure 3 shows how all Zendesk products and users integrate through common standard objects and standard events within the Data Hub. Applications publish and consume standard objects and events to/from the event bus.

For example, a complete ticket standard object will be published to the message bus whenever it is created, updated, or changed. On the consumption side, these events get used by product teams to enable platform capabilities such as search, data export, analytics, and reporting dashboards.

Summary

As Zendesk’s business grew, their data lake evolved from simple Parquet files on Amazon S3 to a modern Hudi-based incrementally updateable data lake. Now, their original coarse-grained IAM security policies use fine-grained access control with Lake Formation.

We have repeatedly seen this incremental architecture evolution achieve success because it reduces the business risk associated with the change and provides sufficient time for your team to learn and evaluate cloud operations and managed services.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Women write blogs: a selection of posts from AWS Solutions Architects

2022-03-10 Bonnie McClure

Post Syndicated from Bonnie McClure original https://aws.amazon.com/blogs/architecture/women-write-blogs/

This International Women’s Day, we’re featuring more than a week’s worth of posts that highlight female builders and leaders. We’re showcasing women in the industry who are building, creating, and, above all, inspiring, empowering, and encouraging everyone—especially women and girls—in tech.

A blog can be a great starting point for you in finding and implementing a particular solution; learning about new features, services, and products; keeping up with the latest trends and ideas; or even understanding and resolving a tricky problem. Today, as part of our International Women’s Day celebration, we’re showcasing blogs written by women that do just that and more.

We’ve included all kinds of posts for you to peruse:

Architecture overview posts
Best practices posts
Customer/partner (co-written/sponsored/partnered) posts that highlight architectural solutions built with AWS services
How-to tutorials that explain the steps the reader needs to take to complete a task

Architecture overviews

How a Grocer Can Deliver Personalized Experiences with Recipes

by Chara Gravani and Stefano Vozza

Chara and Stefano bring us a way to differentiate and reinvent the customer journey for a grocery retailer. Their solution uses Amazon Personalize to deliver personalized recipe recommendations to increase customer satisfaction and loyalty, and in turn, increase revenue. They consider a customer who is shopping for groceries online. As they place products in their basket, they are presented with a list of recipes that contain the same ingredients as those products added to the basket. The suggested recipes are then personalized based on the customer’s profile and historical product preferences.

Best practices posts

Best practices for migrating self-hosted Prometheus on Amazon EKS to Amazon Managed Service for Prometheus

by Elamaran Shanmugam, Deval Parikh, and Ramesh Kumar Venkatraman

With a focus on the five pillars of the AWS Well-Architected Framework, Elamaran, Deval, and Ramesh examine some of the best practices to follow if you’re moving a self-managed Prometheus workload on Amazon Elastic Kubernetes Service (Amazon EKS) to Amazon Managed Service for Prometheus.

Optimizing your AWS Infrastructure for Sustainability Series

by Katja Philipp, Aleena Yunus, Otis Antoniou, and Ceren Tahtasiz

As organizations align their business with sustainable practices, it is important to review every functional area. If you’re building, deploying, and maintaining an IT stack, improving its environmental impact requires informed decision making. In this three-part blog series, Katja, Aleena, Otis, and Cern provide strategies to optimize your AWS architecture within compute, storage, and networking.

Customer/partner posts

Scaling DLT to 1M TPS on AWS: Optimizing a Regulated Liabilities Network

by Erica Salinas and Jack Iu

Erica and Jack discuss how they partnered with SETL to jointly stand up a basic Regulated Liabilities Network (RLN) and refine the scalability of the environment to at least 1 million transactions per second. They show you how scaling characteristics were achieved while maintaining the business requirements of atomicity and finality and discuss how each RLN component was optimized for high performance.

How-to tutorials

Monitor and visualise building occupancy with AWS IoT Core, Amazon QuickSight and Raspberry Pi

by Jamila Jamilova

Occupancy monitoring in buildings is a valuable tool across different industries. For example, museums can analyze occupancy data in near real-time to understand the popularity and number of visitors to decide where a particular gallery should be located. To help with cases like this, Jamila brings you a solution that monitors how building space is being utilized. It shows how busy each area of a building gets during different times of the day based on a motion sensor’s location. This device, a Raspberry Pi with a passive infrared (PIR) sensor, senses motion in direct proximity (in other words, if a human has moved in or out of the sensor’s range) and will generate data that is stored, analyzed, and visualized to help you understand how best to use your space.

Create an iOS tracker application with Amazon Location Service and AWS Amplify

by Panna Shetty and Fernando Rocha

Emergency management teams venture into dangerous situations to rescue those in need, potentially risking their own lives. To keep themselves safe during an event where they cannot easily track each other by line of sight, a muster point is established as a designated safety zone, or a geofence. This geofence may change in response to evolving conditions. One way to improve this process is automating member tracking and response activity, so that emergency managers can quickly account for all members and ensure they are safe. Panna and Fernando bring you a solution to apply to this situation and others like it. It uses Amazon Location Service to create a serverless architecture that is capable of tracking the user’s current location and identify if they are in a safe area or not.

Optimize workforce in your store using Amazon Rekognition

by Laura Reith and Kayla Jing

Retailers often need to make decisions to improve the in-store customer experience through personnel management. Having too few or too many employees working can be detrimental to the business. When store traffic outpaces staffing, it can result in long checkout lines and limited customer interface, creating a poor customer experience. The opposite can be true as well by having too many employees during periods of low traffic, which generates wasted operating costs. In this post, Laura and Kayla show you how to use Amazon Rekognition and AWS DeepLens to detect and analyze occupancy in a retail business to optimize workforce utilization.

Adding Build MLOps workflows with Amazon SageMaker projects, GitLab, and GitLab pipelines

by Lauren Mullennex, Indrajit Ghosalkar, and Kirit Thadaka

In this post, Lauren, Indrajit, and Kirit walk you through using a custom Amazon SageMaker machine learning operations project template to automatically build and configure a continuous integration/continuous delivery (CI/CD) pipeline. This pipeline incorporates your existing CI/CD tooling with SageMaker features for data preparation, model training, model evaluation, and model deployment. In their use case, they focus on using GitLab and GitLab pipelines with SageMaker projects and pipelines.

Deploying Sample UI Forms using React, Formik, and AWS CDK

by Kevin Rivera, Mark Carlson, Shruti Arora, and Britney Tong

Many companies use UI forms to collect customer data for account registrations, online shopping, and surveys. These forms can be difficult to write, maintain, and test. To help with this, Kevin, Mark, Shruti, and Britney show you how to use the JavaScript libraries React and Formik. These third-party libraries provide front-end developers with tools to implement simple forms for a user interface.

Multi-Region Migration using AWS Application Migration Service

by Shreya Pathak and Medha Shree

Shreya and Medha demonstrate how AWS Application Migration Service simplifies, expedites, and reduces the cost of migrating Amazon Elastic Compute Cloud (Amazon EC2)-hosted workloads from one AWS Region to another. It integrates with AWS Migration Hub, which allows you to organize your servers into applications. With the migration services they discuss, you can track the progress of your migration at the server and application level, even as you move servers into multiple Regions.

Tracking Overall Equipment Effectiveness with AWS IoT Analytics and Amazon QuickSight

by Shailaja Suresh and Michael Brown

To drive process efficiencies and optimize costs, manufacturing organizations need a scalable approach to access data across disparate silos across their organization. In this post, Shailaja and Michael demonstrate how overall equipment effectiveness can be calculated, monitored, and scaled out using two key services: AWS IoT Analytics and Amazon QuickSight.

Use AnalyticsIQ with Amazon QuickSight to gain insights for your business

by Sumitha AP

Sumitha shows you how to use the AnalyticsIQ Social Determinants of Health Sample Data dataset to gain insights into society’s health and wellness and how to generate easy-to-understand visualizations using QuickSight that could improve healthcare professionals’ decision making.

We’ve got more content for International Women’s Day!

For more than a week we’re sharing content created by women. Check it out!

Deploying service-mesh-based architectures using AWS App Mesh and Amazon ECS from Kesha Williams, an AWS Hero and award-winning software engineer.
A collection of several blog posts written and co-authored by women
Curated content from the Let’s Architect! team and a live Twitter chat
A post on Women at AWS – Diverse Backgrounds, Common Goal of Becoming Solutions Architects
Another post on Building your brand as a SA

Other ways to participate

Building a serverless image catalog with AWS Step Functions Workflow Studio

2022-03-08 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/building-a-serverless-image-catalog-with-aws-step-functions-workflow-studio/

This post is written by Pascal Vogel, Associate Solutions Architect, and Benjamin Meyer, Sr. Solutions Architect.

Workflow Studio is a low-code visual workflow designer for AWS Step Functions that enables the orchestration of serverless workflows through a guided interactive interface. With the integration of Step Functions and the AWS SDK, you can now access more than 200 AWS services and over 9,000 API actions in your state machines.

This walkthrough uses Workflow Studio to implement a serverless image cataloging pipeline. It includes content moderation, automated tagging, and parallel image processing. Workflow Studio allows you to set up API integrations to other AWS services quickly with drag and drop actions, without writing custom application code.

Solution overview

Photo sharing websites often allow users to publish user-generated content such as text, images, or videos. Manual content review and categorization can be challenging. This solution enables the automation of these tasks.

In this workflow:

An image stored in Amazon S3 is checked for inappropriate content using the Amazon Rekognition DetectModerationLabels API.
Based on the result of (1), appropriate images are forwarded to image processing while inappropriate ones trigger an email notification.
Appropriate images undergo two processing steps in parallel: the detection of objects and text in the image via Amazon Rekognition’s DetectLabels and DetectText APIs. The results of both processing steps are saved in an Amazon DynamoDB table.
An inappropriate image triggers an email notification for manual content moderation via the Amazon Simple Notification Service (SNS).

Prerequisites

To follow this walkthrough, you need:

An AWS account.
An AWS user with AdministratorAccess (see the instructions on the AWS Identity and Access Management (IAM) console).
AWS CLI using the instructions here.
AWS Serverless Application Model (AWS SAM) CLI using the instructions here.

Initial project setup

Get started by cloning the project repository from GitHub:

git clone https://github.com/aws-samples/aws-step-functions-image-catalog-blog.git

The cloned repository contains two AWS SAM templates.

The starter directory contains a template. It deploys AWS resources and permissions that you use later for building the image cataloging workflow.
The solution directory contains a template that deploys the finished image cataloging pipeline. Use this template if you want to skip ahead to the finished solution.

Both templates deploy the following resources to your AWS account:

An Amazon S3 bucket that holds the image files for the catalog.
A DynamoDB table as the data store of the image catalog.
An SNS topic and subscription that allow you to send an email notification.
A Step Functions state machine that defines the processing steps in the cataloging pipeline.

To follow the walkthrough, deploy the AWS SAM template in the starter directory using the AWS SAM CLI:

cd aws-step-functions-image-catalog-blog/starter
sam build
sam deploy --guided

Configure the AWS SAM deployment as follows. Input your email address for the parameter ModeratorEmailAddress:

During deployment, you receive an email asking you to confirm the subscription to notifications generated by the Step Functions workflow. In the email, choose Confirm subscription to receive these notifications.

Confirm successful resource creation by going to the AWS CloudFormation console. Open the serverless-image-catalog-starter stack and choose the Stack info tab:

View the Outputs tab of the CloudFormation stack. You reference these items later in the walkthrough:

Implementing the image cataloging pipeline

Accessing Step Functions Workflow Studio

To access Step Functions in Workflow Studio:

Access the Step Functions console.
In the list of State machines, select image-catalog-workflow-starter.
Choose the Edit button.
Choose Workflow Studio.

Workflow Studio consists of three main areas:

The Canvas lets you modify the state machine graph via drag and drop.
The States Browser lets you browse and search more than 9,000 API Actions from over 200 AWS services.
The Inspector panel lets you configure the properties of state machine states and displays the Step Functions definition in the Amazon States Language (ASL).

For the purpose of this walkthrough, you can delete the Pass state present in the state machine graph. Right click on it and choose Delete state.

Auto-moderating content with Amazon Rekognition and the Choice State

Use Amazon Rekognition’s DetectModerationLabels API to detect inappropriate content in the images processed by the workflow:

In the States browser, search for the DetectModerationLabels API action.
Drag and drop the API action on the state machine graph on the canvas.

In the Inspector panel, select the Configuration tab and add the following API Parameters:

{
  "Image": {
    "S3Object": {
      "Bucket.$": "$.bucket",
      "Name.$": "$.key"
    }
  }
}

Switch to the Output tab and check the box next to Add original input to output using ResultPath. This allows you to pass both the original input and the task’s output on to the next state on the state machine graph.

Input the following ResultPath:

$.moderationResult

Step Functions enables you to make decisions based on the output of previous task states via the choice state. Use the result of the DetectModerationLabels API action to decide how to proceed with the image:

Access the Flow tab in the States browser. Drag and drop a Choice state to the state machine graph below the DetectModerationLabels API action.
In the States browser, choose Flow.
Select a Choice state and place it after the DetectModerationLabels state on the graph.
Select the added Choice state.
In the Inspector panel, choose Rule #1 and select Edit.
Choose Add conditions.
For Variable, enter $.moderationResult.ModerationLabels[0].
For Operator, choose is present.
Choose Save conditions.

If Amazon Rekognition detects inappropriate content, the workflow notifies content moderators to inspect the image manually:

In the States browser, find the SNS Publish API Action.
Drag the Action into the Rule #1 branch of the Choice state.
For API Parameters, select the SNS topic that is visible in the Outputs of the serverless-image-catalog-starter stack in the CloudFormation console.

Speeding up image cataloging with the Parallel state

Appropriate images should be processed and included in the image catalog. In this example, processing includes the automated generation of tags based on objects and text identified in the image.

To accelerate this, instruct Step Functions to perform these tasks concurrently via a Parallel state:

In the States browser, select the Flow tab.
Drag and drop a Parallel state onto the Default branch of the previously added Choice state.
Search the Amazon Rekognition DetectLabels API action in the States browser
Drag and drop it inside the parallel state.

Configure the following API parameters:

{
  "Image": {
    "S3Object": {
      "Bucket.$": "$.bucket",
      "Name.$": "$.key"
    }
  }
}

Switch to the Output tab and check the box next to Add original input to output using ResultPath. Set the ResultPath to $.output.

Record the results of the Amazon Rekognition DetectLabels API Action to the DynamoDB database:

Place a DynamoDB UpdateItem API Action inside the Parallel state below the Amazon Rekognition DetectLabels API action.
Configure the following API Parameters to save the tags to the DynamoDB table. Input the name of the DynamoDB table visible in the Outputs of the serverless-image-catalog-starter stack in the CloudFormation console:

{
  "TableName": "<DynamoDB table name>",
  "Key": {
    "Id": {
      "S.$": "$.key"
    }
  },
  "UpdateExpression": "set detectedObjects=:o",
  "ExpressionAttributeValues": {
    ":o": {
      "S.$": "States.JsonToString($.output.Labels)"
    }
  }
}

This API parameter definition makes use of an intrinsic function to convert the list of objects identified by Amazon Rekognition from JSON to String.

In addition to objects, you also want to identify text in images and store it in the database. To do so:

Drag and drop an Amazon Rekognition DetectText API action into the Parallel state next to the DetectLabels Action.
Configure the API Parameters and ResultPath identical to the DetectLabels API Action.
Place another DynamoDB UpdateItem API Action inside the Parallel state below the Amazon Rekognition DetectText API Action. Set the following API Parameters and input the same DynamoDB table name as before.

{
  "TableName": "<DynamoDB table name>",
  "Key": {
    "Id": {
      "S.$": "$.key"
    }
  },
  "UpdateExpression": "set detectedText=:t",
  "ExpressionAttributeValues": {
    ":t": {
      "S.$": "States.JsonToString($.output.TextDetections)"
    }
  }
}

To save the state machine:

Choose Apply and exit.
Choose Save.
Choose Save anyway.

Finishing up and testing the image cataloging workflow

To test the image cataloging workflow, upload an image to the S3 bucket created as part of the initial project setup. Find the name of the bucket in the Outputs of the serverless-image-catalog-starter stack in the CloudFormation console.

Select the image-catalog-workflow-starter state machine in the Step Functions console.
Choose Start execution.

Paste the following test event (use your S3 bucket name):

{
    "bucket": "<S3-bucket-name>",
    "key": "<Image-name>.jpeg"
}

Choose Start execution.

Once the execution has started, you can follow the state of the state machine live in the Graph inspector. For an appropriate image, the result will look as follows:

Next, repeat the test process with an image that Amazon Rekognition classifies as inappropriate. Find out more about inappropriate content categories here. This produces the following result:

You receive an email notifying you regarding the inappropriate image and its properties.

Cleaning up

To clean up the resources provisioned as part of the solution run the following command in the aws-step-functions-image-catalog-blog/starter directory:

sam delete

Conclusion

This blog post demonstrates how to implement a serverless image cataloging pipeline using Step Functions Workflow Studio. By orchestrating AWS API actions and flow states via drag and drop, you can process user-generated images. This example checks images for appropriateness and generates tags based on their content without custom application code.

You can now expand and improve this workflow by triggering it automatically each time an image is uploaded to the Amazon S3 bucket or by adding a manual approval step for submitted content. To find out more about Workflow Studio, visit the AWS Step Functions Developer Guide.

For more serverless learning resources, visit Serverless Land.

Building Blue/Green application deployment to Micro Focus Enterprise Server

2022-03-03 Kevin Yung

Post Syndicated from Kevin Yung original https://aws.amazon.com/blogs/devops/building-blue-green-application-deployment-to-micro-focus-enterprise-server/

Organizations running mainframe production workloads often follow the traditional approach of application deployment. To release new features of existing applications into production, the application is redeployed using the new version of software on the existing infrastructure. This poses the following challenges:

The cutover of the application deployment from testing to production usually takes place during a planned outage window with associated downtime.
Rollback is difficult, since the earlier version of the software must be redeployed from scratch on the existing infrastructure. This may result in applications being unavailable for longer durations owing to the rollback.
Due to differences in testing and production environments, some defects may leak into production, affecting the application code quality and thus increasing the number of production outages

Automated, robust application deployment is recognized as a prime driver for moving from a Mainframe to AWS, as service stability, security, and quality can be better managed. In this post, you will learn how to build Blue/Green (zero-downtime) deployments for mainframe applications rehosted to Micro Focus Enterprise Server with AWS Developer Tools (AWS CodeBuild, CodePipeline, and CodeDeploy).

This is a continuation of our previous post “Automate thousands of mainframe tests on AWS with the Micro Focus Enterprise Suite”. In our last post, we explained how you can implement a pattern for continuous integration and testing of mainframe applications with AWS Developer tools and Micro Focus Enterprise Suite. If you haven’t already checked it out, then we strongly recommend that you read through it before proceeding to the rest of this post.

Overview of solution

In this section, we explain the three important design “ingredients” to be implemented in the overall solution:

Implementation of Enterprise Server Performance and Availability Cluster (PAC)
End-to-end design of CI/CD pipeline for multiple teams development
Blue/green deployment process for a rehosted mainframe application

First, let’s look at the solution design for the Micro Focus Enterprise Server PAC cluster.

Overview of Micro Focus Enterprise Server Performance and Availability Cluster (PAC)

In the Blue/Green deployment solution, Micro Focus Enterprise Server is the hosting environment for mainframe applications with the software installed into Amazon EC2 instances. Application deployment in Amazon EC2 Auto Scaling is one of the critical requirements to build a Blue/Green deployment. Micro Focus Enterprise Server PAC technology is the feature that allows for the Auto Scaling of Enterprise Server instances. For details on how to build Micro Focus Enterprise PAC Cluster with Amazon EC2 Auto Scaling and Systems Manager, see our AWS Prescriptive Guidance document. An overview of the infrastructure architecture is shown in the following figure, and the following table explains the components in the architecture.

Infrastructure architecture overview for blue/green application deployment to Micro Focus Enterprise Server

Components	Description
Micro Focus Enterprise Servers	Deploy applications to Micro Focus Enterprise Servers PAC in Amazon EC2 Auto Scaling Group.
Micro Focus Enterprise Server Common Web Administration (ESCWA)	Manage Micro Focus Enterprise Server PAC with ESCWA server, e.g., Adding or Removing Enterprise Server to/from a PAC.
Relational Database for both user and system data files	Setup Amazon Aurora RDS Instance in Multi-AZ to host both user and system data files to be shared across the Enterprise server instances.
Micro Focus Enterprise Server Scale-Out Repository (SOR)	Setup an Amazon ElastiCache Redis Instance and replicas in Multi-AZ to host user data.
Application endpoint and load balancer	Setup a Network Load Balancer to provide a hostname for end users to connect the application, e.g., accessing the application through a 3270 emulator.

CI/CD Pipelines design supporting multi-streams of mainframe development

In a previous DevOps post, Automate thousands of mainframe tests on AWS with the Micro Focus Enterprise Suite, we introduced two levels of pipelines. The first level of pipeline is used by mainframe project teams to test project scope changes. The second level of the pipeline is used for system integration tests, where the pipeline will perform tests for all of the promoted changes from the project pipelines and perform extensive systems tests.

In this post, we are extending the two levels pipeline to add a production deployment pipeline. When system testing is complete and successful, the tested application artefacts are promoted to the production pipeline in preparation for live production release. The following figure depicts each stage of the three levels of CI/CD pipeline and the purpose of each stage.

Different levels of CI/CD pipeline - Project Team Pipeline, Systems Test Pipeline and Production Deployment Pipeline

Let’s look at the artifact promotion to production pipeline in greater detail. The Systems Test Pipeline promotes the tested artifacts in binary format into an Amazon S3 bucket and the S3 event triggers production pipeline to kick-off. This artifact promotion process can be gated using a manual approval action in CodePipeline. For customers who want to have a fully automated continuous deployment, the manual promotion approval step can be removed.

The following diagram shows the AWS Stages in AWS CodePipeline of the production deployment pipeline:

Stages in production deployment pipeline using AWS CodePipeline

After the production pipeline is kicked off, it downloads the new version artifact from the S3 bucket. See the details of how to setup the S3 bucket as a Source of CodePipeline in the document AWS CodePipeline Document S3 as Source.

In the following section, we explain each of these pipeline stages in detail:

It prepares and packages a new version of production configuration artifacts, for example, the Micro Focus Enterprise Server config file, blue/green deployment scripts etc.
Use in the CodeBuild Project to kick off an application blue/green deployment with AWS CodeDeploy.
Use a manual approval gate to wait for an operator to validate the new version of the application and approve to continue the production traffic switch
Continue the blue/green deployment by allowing traffic to the new version of the application and block the traffic to the old version.
After a successful Blue/Green switch and deployment, tag the production version in the code repository.

Now that you’ve seen the pipeline design, we will dive deep into the details of the blue/green deployment with AWS CodeDeploy.

Blue/green deployment with AWS CodeDeploy

In the blue/green deployment, we used the technique of swapping Auto Scaling Group behind an Elastic Load Balancer. Refer to the AWS Blue/Green deployment whitepaper for the details of the technique. As AWS CodeDeploy is a fully-managed service that automates software deployment, it is used to automate the entire Blue/Green process.

Firstly, the following best practices are applied to setup the Enterprise Server’s infrastructure:

AWS Image Builder is used to install Micro Focus Enterprise Server software and AWS CodeDeploy Agent into Amazon Machine Image (AMI). Create an EC2 Launch Template with the Enterprise Server AMI ID.
A Network Load Balancer is used to setup a TCP connection health check to validate that Micro Focus Enterprise Server is listening on the required ports, e.g., port 9270, so that connectivity is available for 3270 emulators.
A script was created to confirm application deployment validity in each EC2 instance. This is achieved by using a PowerShell script that triggers a CICS transaction from the Micro Focus Enterprise Server command line interface.

In the CodePipeline, we created a CodeBuild project to create a new deployment with CodeDeploy. We will go into the details of the CodeBuild buildspec.yaml configuration.

In the CodeBuild buildspec.yaml’s pre_build section, we used the following steps:

In the pre-build stage, the CodeBuild will perform two steps:

Create an initial Amazon EC2 Auto Scaling using Micro Focus Enterprise Server AMI and a Launch Template for the first-time deployment of the application.
Use AWS CLI to update the initial Auto Scaling Group name into a Systems Manager Parameter Store, and it will later be used by CodeDeploy to create a copy during the blue/green deployment.

In the build stage, the buildspec will perform the following steps:

Retrieve the Auto Scaling Group name of the Enterprise Servers from the Systems Manager Parameter Store.
Then, a blue/green deployment configuration is created for the deployment group of the application. In the AWS CLI command, we use the WITH_TRAFFIC_CONTROL option to let us manually verify and approve before switching the traffic to the new version of the application. The command snippet is shown here.

BlueGreenConf=\
        "terminateBlueInstancesOnDeploymentSuccess={action=TERMINATE}"\
        ",deploymentReadyOption={actionOnTimeout=STOP_DEPLOYMENT,waitTimeInMinutes=600}" \
        ",greenFleetProvisioningOption={action=COPY_AUTO_SCALING_GROUP}"

DeployType="BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL"

/usr/local/bin/aws deploy update-deployment-group \
      --application-name "${APPLICATION_NAME}" \
     --current-deployment-group-name "${DEPLOYMENT_GROUP_NAME}" \
     --auto-scaling-groups "${AsgName}" \
      --load-balancer-info targetGroupInfoList=[{name="${TARGET_GROUP_NAME}"}] \
      --deployment-style "deploymentType=$DeployType" \
      --Blue/Green-deployment-configuration "$BlueGreenConf"

Next, the new version of application binary is released from the CodeBuild source DemoBinto the production S3 bucket.

release="bankdemo-$(date '+%Y-%m-%d-%H-%M').tar.gz"
RELEASE_FILE="s3://${PRODUCTION_BUCKET}/${release}"

/usr/local/bin/aws deploy push \
    --application-name ${APPLICATION_NAME} \
    --description "version - $(date '+%Y-%m-%d %H:%M')" \
    --s3-location ${RELEASE_FILE} \
    --source ${CODEBUILD_SRC_DIR_DemoBin}/

Create a new deployment for the application to initiate the Blue/Green switch.

/usr/local/bin/aws deploy create-deployment \
    --application-name ${APPLICATION_NAME} \
    --s3-location bucket=${PRODUCTION_BUCKET},key=${release},bundleType=zip \
    --deployment-group-name "${DEPLOYMENT_GROUP_NAME}" \
    --description "Bankdemo Production Deployment ${release}"\
    --query deploymentId \
    --output text

After setting up the deployment options, the following is a snapshot of a deployment configuration from the AWS Management Console.

Snapshot of deployment configuration from AWS Management Console

In the AWS Post “Under the Hood: AWS CodeDeploy and Auto Scaling Integration”, we explain how AWS CodeDeploy sets up Auto Scaling lifecycle hooks to listen for Auto Scaling events. In the event of an EC2 instance launch and termination, AWS CodeDeploy can instruct its agent in the instance to run the prepared scripts.

In the following table, we list each stage in a blue/green deployment and the tasks that ran.

Hooks	Tasks
BeforeInstall	Create application folder structures in the newly launched Amazon EC2 and prepare for installation
AfterInstall	Enable Windows Firewall Rule for application traffic
	Activate Micro Focus License using License Server
	Prepare Production Database Connections
	Import config to create Region in Micro Focus Enterprise Server
	Deploy the latest application binaries into each of the Micro Focus Enterprise Servers
ApplicationStart	Use AWS CLI to start a Systems Manager Automation “Scale-Out” runbook with the target of ESCWA server
	The Automation runbook will add the newly launched Micro Focus Enterprise Server instance into a PAC
	The Automation runbook will start the imported region in the newly launched Micro Focus Enterprise Server
	Validate that the application is listening on a service port, for example, port 9270
	Use the Micro Focus command “castran” to run an online transaction in Micro Focus Enterprise Server to validate the service status
AfterBlockTraffic	Use AWS CLI to start a Systems Manager Automation “Scale-In” runbook with the target ESCWA server
	The Automation runbook will try stopping the Region in the terminating EC2 instance
	The Automation runbook will remove the Enterprise Server instance from the PAC

The tasks in the table are automated using PowerShell, and the scripts are used in appspec.yml config for CodeDeploy to orchestrate the deployment.

In the following appspec.yml, the locations of the binary files to be installed are defined in addition to the Micro Focus Enterprise Server Region XML config file. During the AfrerInstall stage, the XML config is imported into the Enterprise Server.

version: 0.0
os: windows
files:
  - source: scripts
    destination: C:\scripts\
  - source: online
    destination: C:\BANKDEMO\online\
  - source: common
    destination: C:\BANKDEMO\common\
  - source: batch
    destination: C:\BANKDEMO\batch\
  - source: scripts\BANKDEMO.xml
    destination: C:\BANKDEMO\
hooks:
  BeforeInstall: 
    - location: scripts\BeforeInstall.ps1
      timeout: 300
  AfterInstall: 
    - location: scripts\AfterInstall.ps1    
  ApplicationStart:
    - location: scripts\ApplicationStart.ps1
      timeout: 300
  ValidateService:
    - location: scripts\ValidateServer.cmd
      timeout: 300
  AfterBlockTraffic:
    - location: scripts\AfterBlockTraffic.ps1

Using the sample Micro Focus Bankdemo application, and the steps outlined above, we have setup a blue/green deployment process in Micro Focus Enterprise Server.

There are four important considerations when setting up blue/green deployment:

For batch applications, the blue/green deployment should be invoked only outside of the scheduled “batch window”.
For online applications, AWS CodeDeploy will deregister the Auto Scaling group from the target group of the Network Load Balancer. The deregistration may take a while as the server has to finish processing the ongoing requests before it can continue deployment of the new application instance. In this case, enabling Elastic Load Balancing connection draining feature with appropriate timeout value can minimize the risk of closing unfinished transactions. In addition, consider doing deployment in low-traffic windows to improve the deployment speeds.
For application changes that require updates to the database schema, the version roll-forward and rollback can be managed via DB migrations tools, e.g., Flyway and Fluent Migrator.
For testing in production environments, adherence to any regulatory compliance, such as full audit trail of events, must be considered.

Conclusion

In this post, we introduced the solution to use Micro Focus Enterprise Server PAC, Amazon EC2 Auto Scaling, AWS Systems Manager, and AWS CodeDeploy to automate the blue/green deployment of rehosted mainframe applications in AWS.

Through the blue/green deployment methodology, we can shift traffic between two identical clusters running different application versions in parallel. This mitigates the risks commonly associated with mainframe application deployment, namely downtime and rollback capacity, while ensure higher code quality in production through “Shift Right” testing.

A demo of the solution is available on the AWS Partner Micro Focus website [Solution-Demo]. If you’re interested in modernizing your mainframe applications, then please contact Micro Focus and AWS mainframe business development at [email protected].

Additional Information

About the authors

How the Georgia Data Analytics Center built a cloud analytics solution from scratch with the AWS Data Lab

2022-03-02 Kanti Chalasani

Post Syndicated from Kanti Chalasani original https://aws.amazon.com/blogs/big-data/how-the-georgia-data-analytics-center-built-a-cloud-analytics-solution-from-scratch-with-the-aws-data-lab/

This is a guest post by Kanti Chalasani, Division Director at Georgia Data Analytics Center (GDAC). GDAC is housed within the Georgia Office of Planning and Budget to facilitate governed data sharing between various state agencies and departments.

The Office of Planning and Budget (OPB) established the Georgia Data Analytics Center (GDAC) with the intent to provide data accountability and transparency in Georgia. GDAC strives to support the state’s government agencies, academic institutions, researchers, and taxpayers with their data needs. Georgia’s modern data analytics center will help to securely harvest, integrate, anonymize, and aggregate data.

In this post, we share how GDAC created an analytics platform from scratch using AWS services and how GDAC collaborated with the AWS Data Lab to accelerate this project from design to build in record time. The pre-planning sessions, technical immersions, pre-build sessions, and post-build sessions helped us focus on our objectives and tangible deliverables. We built a prototype with a modern data architecture and quickly ingested additional data into the data lake and the data warehouse. The purpose-built data and analytics services allowed us to quickly ingest additional data and deliver data analytics dashboards. It was extremely rewarding to officially release the GDAC public website within only 4 months.

A combination of clear direction from OPB executive stakeholders, input from the knowledgeable and driven AWS team, and the GDAC team’s drive and commitment to learning played a huge role in this success story. GDAC’s partner agencies helped tremendously through timely data delivery, data validation, and review.

We had a two-tiered engagement with the AWS Data Lab. In the first tier, we participated in a Design Lab to discuss our near-to-long-term requirements and create a best-fit architecture. We discussed the pros and cons of various services that can help us meet those requirements. We also had meaningful engagement with AWS subject matter experts from various AWS services to dive deeper into the best practices.

The Design Lab was followed by a Build Lab, where we took a smaller cross section of the bigger architecture and implemented a prototype in 4 days. During the Build Lab, we worked in GDAC AWS accounts, using GDAC data and GDAC resources. This not only helped us build the prototype, but also helped us gain hands-on experience in building it. This experience also helped us better maintain the product after we went live. We were able to continually build on this hands-on experience and share the knowledge with other agencies in Georgia.

Our Design and Build Lab experiences are detailed below.

Step 1: Design Lab

We wanted to stand up a platform that can meet the data and analytics needs for the Georgia Data Analytics Center (GDAC) and potentially serve as a gold standard for other government agencies in Georgia. Our objective with the AWS Data Design Lab was to come up with an architecture that meets initial data needs and provides ample scope for future expansion, as our user base and data volume increased. We wanted each component of the architecture to scale independently, with tighter controls on data access. Our objective was to enable easy exploration of data with faster response times using Tableau data analytics as well as build data capital for Georgia. This would allow us to empower our policymakers to make data-driven decisions in a timely manner and allow State agencies to share data and definitions within and across agencies through data governance. We also stressed on data security, classification, obfuscation, auditing, monitoring, logging, and compliance needs. We wanted to use purpose-built tools meant for specialized objectives.

Over the course of the 2-day Design Lab, we defined our overall architecture and picked a scaled-down version to explore. The following diagram illustrates the architecture of our prototype.

The architecture contains the following key components:

Amazon Simple Storage Service (Amazon S3) for raw data landing and curated data staging.
AWS Glue for extract, transform, and load (ETL) jobs to move data from the Amazon S3 landing zone to Amazon S3 curated zone in optimal format and layout. We used an AWS Glue crawler to update the AWS Glue Data Catalog.
AWS Step Functions for AWS Glue job orchestration.
Amazon Athena as a powerful tool for a quick and extensive SQL data analysis and to build a logical layer on the landing zone.
Amazon Redshift to create a federated data warehouse with conformed dimensions and star schemas for consumption by Tableau data analytics.

Step 2: Pre-Build Lab

We started with planning sessions to build foundational components of our infrastructure: AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) instances, an Amazon Redshift cluster, a virtual private cloud (VPC), route tables, security groups, encryption keys, access rules, internet gateways, a bastion host, and more. Additionally, we set up AWS Identity and Access Management (IAM) roles and policies, AWS Glue connections, dev endpoints, and notebooks. Files were ingested via secure FTP, or from a database to Amazon S3 using AWS Command Line Interface (AWS CLI). We crawled Amazon S3 via AWS Glue crawlers to build Data Catalog schemas and tables for quick SQL access in Athena.

The GDAC team participated in Immersion Days for training in AWS Glue, AWS Lake Formation, and Amazon Redshift in preparation for the Build Lab.

We defined the following as the success criteria for the Build Lab:

Create ETL pipelines from source (Amazon S3 raw) to target (Amazon Redshift). These ETL pipelines should create and load dimensions and facts in Amazon Redshift.
Have a mechanism to test the accuracy of the data loaded through our pipelines.
Set up Amazon Redshift in a private subnet of a VPC, with appropriate users and roles identified.
Connect from AWS Glue to Amazon S3 to Amazon Redshift without going over the internet.
Set up row-level filtering in Amazon Redshift based on user login.
Data pipelines orchestration using Step Functions.
Build and publish Tableau analytics with connections to our star schema in Amazon Redshift.
Automate the deployment using AWS CloudFormation.
Set up column-level security for the data in Amazon S3 using Lake Formation. This allows for differential access to data based on user roles to users using both Athena and Amazon Redshift Spectrum.

Step 3: Four-day Build Lab

Following a series of implementation sessions with our architect, we formed the GDAC data lake and organized downstream data pulls for the data warehouse with governed data access. Data was ingested in the raw data landing lake and then curated into a staging lake, where data was compressed and partitioned in Parquet format.

It was empowering for us to build PySpark Extract Transform Loads (ETL) AWS Glue jobs with our meticulous AWS Data Lab architect. We built reusable glue jobs for the data ingestion and curation using the code snippets provided. The days were rigorous and long, but we were thrilled to see our centralized data repository come into fruition so rapidly. Cataloging data and using Athena queries proved to be a fast and cost-effective way for data exploration and data wrangling.

The serverless orchestration with Step Functions allowed us to put AWS Glue jobs into a simple readable data workflow. We spent time designing for performance and partitioning data to minimize cost and increase efficiency.

Database access from Tableau and SQL Workbench/J were set up for my team. Our excitement only grew as we began building data analytics and dashboards using our dimensional data models.

Step 4: Post-Build Lab

During our post-Build Lab session, we closed several loose ends and built additional AWS Glue jobs for initial and historic loads and append vs. overwrite strategies. These strategies were picked based on the nature of the data in various tables. We returned for a second Build Lab to work on building data migration tasks from Oracle Database via VPC peering, file processing using AWS Glue DataBrew, and AWS CloudFormation for automated AWS Glue job generation. If you have a team of 4–8 builders looking for a fast and easy foundation for a complete data analytics system, I would highly recommend the AWS Data Lab.

Conclusion

All in all, with a very small team we were able to set up a sustainable framework on AWS infrastructure with elastic scaling to handle future capacity without compromising quality. With this framework in place, we are moving rapidly with new data feeds. This would not have been possible without the assistance of the AWS Data Lab team throughout the project lifecycle. With this quick win, we decided to move forward and build AWS Control Tower with multiple accounts in our landing zone. We brought in professionals to help set up infrastructure and data compliance guardrails and security policies. We are thrilled to continually improve our cloud infrastructure, services and data engineering processes. This strong initial foundation has paved the pathway to endless data projects in Georgia.

About the Author

Kanti Chalasani serves as the Division Director for the Georgia Data Analytics Center (GDAC) at the Office of Planning and Budget (OPB). Kanti is responsible for GDAC’s data management, analytics, security, compliance, and governance activities. She strives to work with state agencies to improve data sharing, data literacy, and data quality through this modern data engineering platform. With over 26 years of experience in IT management, hands-on data warehousing, and analytics experience, she thrives for excellence.

Vishal Pathak is an AWS Data Lab Solutions Architect. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey with AWS, Vishal helped customers implement BI, data warehousing, and data lake projects in the US and Australia.

Introducing AWS Virtual Waiting Room

2022-02-10 James Beswick

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/introducing-aws-virtual-waiting-room/

This post is written by Justin Pirtle, Principal Solutions Architect, Joan Morgan, Software Developer Engineer, and Jim Thario, Software Developer Engineer.

Today, AWS is introducing an official AWS Virtual Waiting Room solution. You can integrate this new, open-source solution with existing web and mobile applications. It can help buffer users during times of peak demand and sudden bursts of traffic, preventing systems from resource exhaustion.

Events commonly use virtual waiting rooms where there is either unknown demand or expected large bursts of traffic. Examples of such events include concert ticket sales, Black Friday promotions, COVID-19 vaccine registrations, and more. Virtual waiting rooms allow a quota of users to view, select, and complete their transactions directly. They shield the application’s backend environment from traffic by buffering users in a waiting room until it is their turn in line.

Like any real-life queuing system, a user enters the AWS Virtual Waiting Room and requests a number in line. After receiving a number corresponding to the unique device ID, the browser then polls regularly for updates. The update provides the current number being served and anticipated time until they are front of line.

After reaching the front of the line, the user can exchange the number and device ID for a secure session token. This is included with their downstream requests to authenticate users securely.

If a user discovers the backend endpoint and tries to send requests, they are redirected into the waiting room. The API requests are denied access until they have a valid token. This prevents the backend from needing to scale to accommodate all users at a single time.

Integrating the AWS Virtual Waiting Room into your application

Integration steps depend on the integration pattern for your application. You can decide if all users are routed through the waiting room or only during periods of excessive traffic. You can also choose to protect only the web host serving the backend webpages or one or more APIs powering backend commerce services.

There are four common patterns supported for integrating the waiting room into your application:

Upstream redirection of all traffic from the main target site to flow through AWS Virtual Waiting Room. This option sends all user traffic through the waiting room with the initial capacity of users permitted to the protected system. The traffic passes through transparently, then it buffers the remaining users. It admits new users as capacity becomes available. The target system is only accessible by users who pass through the waiting room.
Downstream redirection to the virtual waiting room from the target site. This option sends all traffic to the target site. The target site conditionally redirects requests that need to enter the waiting room. No DNS or upstream modifications are needed. The target site must be able to handle the initial user requests and redirection responses.
Direct target site API integration for buffering users from an existing website without any redirection. Your web or mobile application integrates the virtual waiting room at the API-level. This does not need any redirection to a different waiting room endpoint or site. This can offer a seamless user experience but may require more development for the integration.
OpenID Connect (OIDC) adapter. This option offers no-code native integration of the waiting room with OpenID Connect-enabled system components, such as the AWS Application Load Balancer (ALB). Users are redirected by the load balancer or similar component to the waiting room. They are buffered until issued a signed, time-limited JSON Web Token (JWT). Once the user’s JWT token is issued, the load balancer then forwards user requests to the target backend systems.

Overview of the AWS Virtual Waiting Room solution

The AWS Virtual Waiting Room solution implementation includes three main components:

Core APIs. The main resources deployed include two Amazon API Gateway deployments, a VPC, several AWS Lambda functions, an Amazon DynamoDB table, and an Amazon ElastiCache cluster. This API provides the basic mechanisms for tracking clients entering the waiting room. It requests status of the line progression and an authentication token to enter the target protected site.
Waiting room front-end website. The waiting room static site is shown to users awaiting their turn. This site dynamically updates the position being served and their place in line on a configurable interval. You customize this site’s HTML, CSS, and JavaScript to match your frontend styling and theme.
Lambda authorizer for protected target system. The Lambda authorizer wraps and protects the downstream protected target system’s APIs. This ensures that all user invocations have a validated time-limited token issued by the waiting room core API. It helps to prevent users from bypassing the waiting room.

The Virtual Waiting Room CloudFormation template deploys the following infrastructure:

An Amazon CloudFront distribution to deliver public API calls for the client.
Amazon API Gateway public API resources to process queue requests from the virtual waiting room, track the queue position, and support validation of tokens that allow access to the target website.
An Amazon Simple Queue Service (Amazon SQS) queue to regulate traffic to the AWS Lambda function that processes the queue messages. Instead of invoking the Lambda function for each request, the SQS queue batches the incoming bursts of requests.
API Gateway private API resources to support administrative functions.
Lambda functions to validate and process public and private API requests, and return the appropriate responses.
Amazon Virtual Private Cloud (VPC) to host the Lambda functions that interact directly with the Amazon ElastiCache for Redis cluster. VPC endpoints allow Lambda functions in the VPC to communicate with services within the solution.
An Amazon CloudWatch rule to invoke a Lambda function that works with a custom Amazon EventBridge bus to periodically broadcast status updates.
An Amazon DynamoDB table to store token data.
AWS Secrets Manager to store keys for token operations and other sensitive data.
(Optional) Authorizer component consisting of an AWS Identity and Access Management (IAM) role and a Lambda function to validate signatures for your API calls. The only requirement for the authorizer to protect your API is to use API Gateway.
(Optional) Amazon Simple Notification Service (Amazon SNS), CloudWatch, and Lambda functions to support two inlet strategies.
(Optional) OpenID adaptor component with API Gateway and Lambda functions to allow an OpenID provider to authenticate users to your website. CloudFront distribution with an Amazon Simple Storage Service (Amazon S3) bucket for the waiting room page for this component.
(Optional) A CloudFront distribution with Amazon S3 origin bucket for the optional sample waiting room web application.

Deploying the AWS Virtual Waiting Room

To get started with the AWS Virtual Waiting Room, deploy the Getting Started stack. This deploys the Core APIs stack, the Authorizers stack, and a sample application CloudFormation stack:

Launch the Getting Started CloudFormation stack. The template launches in the US East (N. Virginia) Region by default. To launch the solution in a different AWS Region, use the Region selector in the console navigation bar.
On the Create stack page, verify that the correct template URL is in the Amazon S3 URL text box and choose Next.
On the Specify stack details page, assign a name to your solution stack, and accept all default parameter values. For information about naming character limitations, refer to IAM and STS Limits in the AWS Identity and Access Management User Guide. Choose Next.
On the Configure stack options page, choose Next.
On the Review page, review and confirm the settings. Check the box acknowledging that the template creates AWS Identity and Access Management (IAM) resources.
Choose Create stack to deploy the stack.
You can view the status of the stack in the AWS CloudFormation Console in the Status column. You should receive a CREATE_COMPLETE status in approximately 30 minutes.
Once successfully deployed, browse to the Outputs tab.
Copy the ControlPanelURL and WaitingRoomURL to a scratch pad file for later use.

Configuring the AWS Virtual Waiting Room

After deploying the three stacks, test the waiting room using the sample application:

Navigate to the IAM console. Create a new IAM user or select an existing IAM user in the same account where you deployed the waiting room stack.
Grant the selected IAM user programmatic access. Download the key file or copy the access key ID and secret access key values to your scratch pad for later use.
Add the IAM user to the ProtectedAPIGroup IAM user group created by the getting started template:
Open the control panel in a new tab or browser window using the ControlPanelURL output you saved earlier.
In the control panel, expand the Configuration section.
Enter the access key ID and secret access key that you retrieved in Generate AWS keys to call the IAM secured APIs. The endpoints and event ID are filled in from the URL parameters.
Choose Use. The button activates after you have supplied the credentials.
You now see the status “Connected” shown following the various metrics reported:

Test the sample waiting room

Browse to the sample waiting room in a new browser tab. Use the WaitingRoomURL you captured previously from the CloudFormation stack output values.
Select Reserve to enter the waiting room. If you are unable to proceed with your transaction, your assigned number is not yet reached.
Navigate back to the browser tab with the control panel.
Under Increment Serving Counter, select Change. This manually increments the serving counter and allows 100 users to move on from the waiting room to the target site.
Navigate back to the waiting room and choose Check out now! You are redirected to the target site since your serving number is eligible to proceed beyond the waiting room.
Select Purchase now to finish your transaction at the target site. This page represents the protected system beyond the waiting room. Replace this with the actual system users you are protecting.
After the simulated purchase is complete, you can see that the transaction is successful. This transaction is authorized using the time-limited authorization token, which came from the waiting room previously. If a user bypasses the waiting room, they would not be successful in completing a transaction.

Customizing the AWS Virtual Waiting Room for your application

The sample browser client demonstrates an entire user flow frontend with the AWS Virtual Waiting Room flow for an ecommerce purchase. You can use this code as a starting point for your waiting room or reference the API communication code for integrating the waiting room into your existing website.

This sample code is built with Vue.js and Bootstrap to render the user interface. It uses the Axios and Axios-Retry packages to make API calls to the virtual waiting room stack. The sample code uses the Axios-Retry package to show how to handle throttling conditions and exponential backoff in high-traffic situations.

The control panel client is used to make requests to the private waiting room API that requires IAM-based authorization. The control panel client demonstrates how to construct and sign requests to the private API. It can be used in production or customized further. All of the sample source code room source is available in GitHub including the sample user client and control panel client.

Conclusion

The AWS Virtual Waiting Room solution is available today at no additional cost, provided as open source under the Apache 2 license. It supports customized integration with any front-end application via a variety of integration techniques. You can also customize how and when the waiting room allows users to progress into the protected target system using a variety of strategies.

To learn more about the AWS Virtual Waiting Room solution, visit the solution implementation and implementation guide.

For more serverless learning resources, visit Serverless Land.

How UnitedHealth Group Improved Disaster Recovery for Machine-to-Machine Authentication

2022-02-08 Vinodh Kumar Rathnasabapathy

Post Syndicated from Vinodh Kumar Rathnasabapathy original https://aws.amazon.com/blogs/architecture/how-unitedhealth-group-improved-disaster-recovery-for-machine-to-machine-authentication/

This blog post was co-authored by Vinodh Kumar Rathnasabapathy, Senior Manager of Software Engineering, UnitedHealth Group.

Engineers who use Amazon Cognito for machine-to-machine authentication select a primary Region where they deploy their application infrastructure and the Amazon Cognito authorization endpoint. Amazon Cognito is a highly available service in single Region deployments with a published service-level agreement (SLA) target of 99.9%. The UnitedHealth Group (UHG) team needed a solution that would enable them to build and deploy their applications in multiple Regions to achieve higher availability targets. A multi-Region application architecture would also allow UHG engineers to failover to a secondary Region in the event that their application experienced issues in the primary Region.

At UHG, Federated Data Services (FDS) is a business-critical customer-facing application, which requires 99.95% availability and disaster recovery features. The FDS engineering team needed their Amazon Cognito infrastructure to be highly available in case of any service events in AWS Regions, along with having greater flexibility of switching between Regions.

The FDS engineering team worked on a custom solution using existing AWS services to fulfill their availability and recovery requirements. This solution not only serves the purpose of their current business needs but also provides recovery in case of any future disaster.

Overview of the solution

In this solution, we select two AWS Regions which will include the primary Region and the failover Region. Amazon Cognito app clients (including client ID and client secret pair) are created in both Regions and stored in an Amazon DynamoDB table. Client applications are given the client ID and client secret of the primary Region. Optionally, an application-generated ID and secret can be provided to the client to conceal the actual Amazon Cognito client ID and client secret. The process is as follows:

The client application (machine) initiates an authentication request by sending the Amazon Cognito app client ID and client secret to an Amazon Route 53 domain record.
Route 53 routes authentication requests to the in-Region Amazon API Gateway utilizing a Simple routing policy. From there, API Gateway shuts down the TLS connection using AWS Certificate Manager (ACM), and serves as a proxy for the authentication request to AWS Lambda.
AWS Lambda verifies the client ID and client secret, and uses them to look up the in-Region client ID and client secret.
Lambda uses preconfigured environment variables to request the appropriate Region from DynamoDB.
AWS Lambda then passes the returned app client credentials to the in-Region Amazon Cognito deployment. Amazon Cognito verifies the client ID and client secret, and returns an access token to the Lambda function.
The client application (machine) can now use this token to access downstream applications.
The client authentication process in the secondary (failover) Region is the same, with one exception. In the secondary Region, the Lambda environment variables retrieve app client credentials from the DynamoDB database for the secondary Region Amazon Cognito instance.

To initiate a failover between Regions, the Route 53 domain record needs to be pointed to the secondary Region API Gateway Regional endpoint. The downstream application’s Amazon Cognito configuration files must also be updated to point to the secondary Region Amazon Cognito instance. Alerts can be enabled using Amazon CloudWatch alarms to notify system operators of issues that may warrant a failover (a manual process to help system operators decide when to failover). The entire failover process takes just a couple of seconds for DNS to switch over and for the application to start accepting tokens from the secondary Region. This failover process could be automated based on generated alerts.

This architecture is suitable for a hot standby, active-passive type of application deployment. It is important to note that independent Amazon Cognito environments are being used in each Region, so you will need to set up your application to failover to the secondary Region for authentication. For example, your backend should be able to accept and validate access tokens from both primary and secondary Amazon Cognito user pools. To learn more about disaster recovery options in AWS, visit Disaster recovery options in the cloud.

Architecture overview

Figure 1 shows you how to build a multi-Region machine-to-machine architecture using Amazon Cognito, which uses DynamoDB global tables to perform the data replication. A Lambda function is utilized to retrieve the credentials for the active Region that the application is operating in, and a Regional Amazon Cognito endpoint returns the required token.

Figure 1. Multi-Region Amazon Cognito machine-to-machine architecture

Process flow

The Route53 domain record for the authentication proxy service is given to the client application and pointed at the API Gateway Regional endpoint. The client passes primary Region app client credentials to the API Gateway.
API Gateway passes Lambda the client ID and client secret pair.
Lambda does a lookup in DynamoDB to verify the client ID and client secret. After the identity is confirmed, Lambda uses Region-based environment variables to identify if the client should be using the primary Region or secondary Region for authenticating to Amazon Cognito. Lambda retrieves the Region-based client ID and client secret from DynamoDB.
Lambda passes the Region-based app ID and secret to Amazon Cognito, which verifies the client ID and client secret, and returns an access token to the Lambda function.
Lambda passes the access token from the Regional Amazon Cognito environment back to the client to be used against Region-based backend applications.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Note: Ensure that you are following your organization’s security best practices while deploying this architecture.

Implementation

This implementation will focus on the Lambda logic that is used to retrieve user credentials based on Lambda environment variables. We will share snippets of the Lambda function code so you can create the logic necessary to enable multi-Region application architecture using Amazon Cognito app clients. In addition to the Lambda function, you will also need to create and configure the following resources using security implementation designated by your organization:

DynamoDB table with fields primaryClientID, primaryClientSecret, secondaryClientID, and secondaryClientSecret.

1. Because this table is used to store secrets, make sure encryption is enabled and you follow Security Best Practices for Amazon DynamoDB and your organization’s security best practices.
2. Enable DynamoDB global tables.

API Gateway Regional endpoint with TLS encryption using ACM.
Route53 domain record routing traffic to Regional API Gateway.
Downstream application configuration pointing at the Regional Amazon Cognito endpoint.
IAM role that grants access from Lambda to DynamoDB.

Now let’s configure our Lambda function using Node.js language. Within the Lambda console create a new Lambda function that you will author from scratch. Select the Node.js runtime and change the runtime Lambda execution role to the IAM role that you have created for Lambda. Next, we will walk through the Lambda function code and configuration.

Attach your new IAM role as a Lambda runtime role that will grant your Lambda function access to the DynamoDB table.
Within the Lambda configuration environment variables, create several key-value variables in the Lambda function. The following are the environmental variables we will add.

1. key: OAUTH_HOST_PRIMARY, value: https://${cognito-primary-region-domain-name}/oauth2/token
2. key: OAUTH_HOST_SECONDARY, value: https://${cognito-secondary-region-domain-name}/oauth2/token

Import the Node.js libraries using the following code.

"dependencies": {
    "aws-sdk": "^2.723.0",
    "aws-serverless-express": "^3.3.8",
    "axios": "^0.21.1",
    "base-64": "^1.0.0",
    "cors": "^2.8.5",
    "dotenv": "^8.2.0",
    "express": "^4.17.1",
    "jsonwebtoken": "^8.5.1",
    "jwt-decode": "^2.2.0",
    }

Parse data from the incoming client application authentication request.

router.post("/", async (request, response) => {
  const client_id = request.body["client_id"];
  const client_secret = request.body["client_secret"]
      
  const client = await dynamoDB.getClientCredentialById(client_id, client_secret);
    
  if (client === undefined) {
    return response.status(400).json({
       message:
          "Client not configured for in Cognito. Please check with support team",
        });
      }
    
   const client_credentials = getClientCredentials(client);
   let access_token = await authService.getJwtToken(client_credentials);
          
   response.send(access_token);
});

Reference the environment variables to determine the Region that the Lambda function is operating in and set the Region as primary or secondary.

getClientCredentials = client => {
   return region === "us-east-1" ? { clientId: client.primaryClientId, clientSecret: client.primaryClientSecret, oAuthHost: config.OAUTH_HOST }
   : region === "us-east-2" ? { clientId: client.secondaryClientId, clientSecret: client.secondaryClientSecret, oAuthHost: config.OAUTH_HOST_EAST_2 }
   : {};
}

Verify the client ID and client secret against the DynamoDB table and get the Region-based client ID and client secret.

const getClientCredentialById = async (client_id, client_secret) => {

  let params = {
    TableName: clientCredentialTable,
    Key: {
      primaryClientId: client_id,
      primaryClientSecret: client_secret,
    },
  };

  const clientCredential = await ddb.get(params).promise();
  return clientCredential.Item;
};

Pass the Regional client ID and client secret to Amazon Cognito. You will receive an access token from Amazon Cognito.

   const base64ClientCredentials = base64.encode(
     client_credentials.clientId.concat(":").concat(client_credentials.clientSecret)
   );
   const headers = {
     "content-type": "application/x-www-form-urlencoded",
     authorization: "Basic " + base64ClientCredentials,
    };
    const data = "grant_type=client_credentials";
    
    // Post request to Cognito OAuth URL
    const token = await Axios.post(client_credentials.oAuthHost, data, { headers });
    return token.data;
 };

Pass the access token from the Regional Amazon Cognito environment back to the client to be used against Region-based backend applications.

let access_token = await authService.getJwtToken(client_credentials);
 
  response.send(access_token);

New app client creation

You need to implement this Lambda function in both the primary and secondary Regions. Modify the environmental variables in the secondary Region with the secondary Region’s information.

To enroll a new client in this multi-Region architecture using Amazon Cognito we will go through the process as shown in the following illustration.

Figure 2. Creating a new app client

You need to create a new:

App ID and secret in Amazon Cognito in the primary Region.
App ID and secret in Amazon Cognito in the secondary Region.
Item in DynamoDB table consisting of the Amazon Cognito credentials created: primaryClientID, primaryClientSecret, secondaryClientID, and secondaryClientSecret.

Failover process

In this blog post we are creating a hot standby, active-passive type of application deployment. You will need to ensure your application is configured to be able to use either the primary Region Amazon Cognito or the secondary Region Amazon Cognito environment. The process to failover between Regions consists of the following:

Start application backend in the secondary Region.
Reconfigure the application backend Amazon Cognito identity provider YAML file to point at the secondary Region Amazon Cognito identity provider.
Modify the Route 53 domain record to point client applications at the secondary Region API Gateway Regional endpoint.

Cleaning up

In order to avoid unnecessary charges, please be sure to clean up any resources that were built as part of this architecture that are no longer in use.

Conclusion

In this blog post, we presented a solution that allows you to failover Amazon Cognito app clients from one AWS Region to another Region. The benefits of this architecture will allow you to have greater flexibility for running your Amazon Cognito app clients in the Region that is best suited for your use case. With this solution you now have the capability to failover Amazon Cognito app clients to a different AWS Region in the event of application or system errors.

Several variants of this solution can be implemented using the provided Lambda failover logic. You can store App ID and secret in AWS Secrets Manager. To learn more, see How to replicate secrets in AWS Secrets Manager to multiple Regions. You can also automate the failover process between primary and secondary Regions. To accomplish this, you will need to evaluate events which should cause a failover in your environment. Later, build the appropriate automation to failover your downstream application to the secondary Region Amazon Cognito deployment.

Amazon Cognito can be used for machine authentication as per the limits posted in Quotas in Amazon Cognito. Review the limits of number of app clients per user pool and the other applicable rate limits (for example, client credentials rate limits) and verify that these limits meet the needs of your application.

Optum’s Story

UnitedHealth Group (UHG) is the health and well-being company responsible for over 150 million lives globally. Optum, a part of UnitedHealth Group, is a health services business serving the healthcare marketplace, including payers, care providers, employers, governments, life sciences companies and consumers, through its OptumHealth, OptumInsight, OptumRx, and OptumServe businesses.

Federated Data Services (FDS) is the power behind the scenes that enables interoperability and secure transmission of personally identifiable information. It protects health information between lines of businesses, technology systems, applications, members and providers. With FDS, data is centralized, making it easier to share and retrieve by other systems. This assures businesses get a flexible and scalable solution that adapts to changes in technology and business needs.

Expanding Your EC2 Possibilities By Utilizing the CPU Launch Options

2022-02-04 limillan

Post Syndicated from limillan original https://aws.amazon.com/blogs/compute/expanding-your-ec2-possibilities-by-utilizing-the-cpu-launch-options/

This post is written by: Matthew Brunton, Senior Solutions Architect – WWPS

To ensure our customers have the appropriate machines available for their workloads, AWS offers a wide range of hardware options that include hundreds of types of instances that help customers achieve the best price performance for their workloads. In some specialized circumstances, our customers need an even wider range of options, or more flexibility. This could be driven by a desire to optimize licensing costs, or the customer wanting more hardware configuration options. Some high performance workloads can improve by turning off simultaneous multithreading. In our AWS Well Architected Framework – High Performance Computing Lens we have the following recommendation “Unless an application has been tested with hyperthreading enabled, it is recommended that hyperthreading be disabled”. With these factors in mind, AWS offers the ability to configure some options regarding the CPU configuration in launched instances.

Our larger instance types that have a higher number of cores, and offer multithreaded cores will translate to a larger combination of potential options. The valid combinations of cores and threads per core can be found here. To consider utilizing the CPU options for both cores and threads per core, you will need to consider instance types that have multiple CPU’s and/or cores.

You can specify numerous CPU options for some of our larger instances via the console, command line interface, or the API. Moreover, you can remove CPUs from the launch configuration, or deactivate threading within CPUs that have multiple threads per core. Amazon Elastic Compute Cloud (Amazon EC2) FAQ’s for the Optimize CPU’s feature can be found here. You should be aware that this feature is only available during instance launch and cannot be modified after launch. The launch options persist after you reboot, stop, or start an instance.

You can easily determine how many CPUs and threads a machine has. There are numerous ways to see this via the AWS Management Console and the AWS Command Line Interface (CLI).

Within the AWS Management Console, under the ‘Instance Details’ section, opening up the ‘Host and placement group’ item reveals the number of vCPUs that your machine has.

Figure 1: Console details showing number of vCPU’s

This information is also available using the AWS CLI as follows:

aws ec2 describe-instances --region us-east-1 --filters "Name=instance-type,Values='c6i.*large'"

...
    "Instances": [
        {
            "Monitoring": {
                "State": "disabled"
            }, 
            "PrivateDnsName": "ip-172-31-44-121.ec2.internal",
 "PrivateIpAddress": "172.31.44.121",                              
 "State": {
                "Code": 16, 
                "Name": "running"
            }, 
            "EbsOptimized": false, 
            "LaunchTime": "2021-11-22T01:46:58+00:00",
            "ProductCodes": [], 
            "VpcId": " vpc-7f7f1502", 
            "CpuOptions": { "CoreCount": 32, "ThreadsPerCore": 1 }, 
            "StateTransitionReason": "", 
            ...
        }
    ]
...

Figure 2: ec2 describe-instances cli output

A handy AWS CLI command ‘describe-instance-types’ will show the list of valid cores, the possible threads-per-core, and the default values for the vCPUs and cores. These are listed in the ‘DefaultVCpus’ and ‘DefaultCores’ items in the returned JSON listed as follows.

aws ec2 describe-instance-types --region us-east-1 --filters "Name=instance-type,Values='c6i.*'"
{
    "InstanceTypes": [
        {
            "InstanceType": "c6i.4xlarge",
            "CurrentGeneration": true,
            "FreeTierEligible": false,
            "SupportedUsageClasses": [
                "on-demand",
                "spot"
            ],
            "SupportedRootDeviceTypes": [
                "ebs"
            ],
            "SupportedVirtualizationTypes": [
                "hvm"
            ],
            "BareMetal": false,
            "Hypervisor": "nitro",
            "ProcessorInfo": {
                "SupportedArchitectures": [
                    "x86_64"
                ],
                "SustainedClockSpeedInGhz": 3.5
            },
            "VCpuInfo": { "DefaultVCpus": 16, "DefaultCores": 8, "DefaultThreadsPerCore": 2, "ValidCores": [ 2, 4, 6, 8 ], "ValidThreadsPerCore": [ 1, 2 ] },
            "MemoryInfo": {
                "SizeInMiB": 32768
            },

Figure 3: ec2 describe-instance-types cli output

To utilize the AWS CLI to launch one or multiple instances, you should use the run instances CLI command.

The following is the shorthand syntax:

aws ec2 run-instances --image-id xxx --instance-type xxx --cpu-options “CoreCount=xx,ThreadsPerCore=xx” --key-name xxx --region xxx

For example, the following command launches a 32-core machine with only 1 thread per core instead of the standard 2 threads per core:

aws ec2 run-instances --image-id ami-0c2b8ca1dad447f8a --instance-type c6i.16xlarge
--cpu-options " CoreCount =32, ThreadsPerCore =1" --key-name MyKeyPair --region xxx

If you are using the CPU options parameters in CloudFormation templates, then the following applies:

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ec2-instance-cpuoptions.html

The following is an example of the YAML syntax for specifying the CPU configuration.

Resources:
  CustomEC2Instance:
    Type: "AWS::EC2::Instance"
    Properties:
      InstanceType: xxx
      ImageId: xxx
      CpuOptions:
CoreCount: xx
ThreadsPerCore: x

As can be seen in the following table, there are a number of valid CPU core options, as well as the option to set one or two threads per core for each CPU for certain instance types. This significantly opens up the number of combinations and permutations to meet your specific workload need. In the case of the c6i instance listed in the following table, there are an additional 32 CPU core and threading combinations available to customers.

Instance type	Default vCPUs	Default CPU cores	Default threads per core	Valid CPU cores	Valid threads per core
c6i.16xlarge	64	32	2	2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32	1, 2

Figure 4: Valid Core and Thread Launch Options

Note that you still pay the same amount for the EC2 instance, even if you deactivate some of the cores or threads.

Conclusion

This approach can let customers customize the EC2 hardware options even further to ensure a wider range of CPU/memory combinations over and above the already extensive AWS instance options. This lets customers finely tune hardware for their exact application requirements, whether that is running High Performance workloads or trying to save money where license restrictions mean you can benefit from a specific CPU configuration.

We have run through various scenarios in this post, which detailed how to launch instances with alternate CPU configurations, easily check the current configuration of your running instances via our API and console, and how to configure the options in CloudFormation templates.

We love to see our customers maximizing the flexibility of the AWS platform to deliver outstanding results. Have a look at some of your High Performance workloads and give the threading options a try, or take a deep dive into any of your more expensive licenses to see if you could benefit from an alternate CPU configuration. In order to get started, check out our detailed developer documentation for the optimize CPU options.

How GE Aviation automated engine wash analytics with AWS Glue using a serverless architecture

2022-02-04 Giridhar G Jorapur

Post Syndicated from Giridhar G Jorapur original https://aws.amazon.com/blogs/big-data/how-ge-aviation-automated-engine-wash-analytics-with-aws-glue-using-a-serverless-architecture/

This post is authored by Giridhar G Jorapur, GE Aviation Digital Technology.

Maintenance and overhauling of aircraft engines are essential for GE Aviation to increase time on wing gains and reduce shop visit costs. Engine wash analytics provide visibility into the significant time on wing gains that can be achieved through effective water wash, foam wash, and other tools. This empowers GE Aviation with digital insights that help optimize water and foam wash procedures and maximize fuel savings.

This post demonstrates how we automated our engine wash analytics process to handle the complexity of ingesting data from multiple data sources and how we selected the right programming paradigm to reduce the overall time of the analytics job. Prior to automation, analytics jobs took approximately 2 days to complete and ran only on an as-needed basis. In this post, we learn how to process large-scale data using AWS Glue and by integrating with other AWS services such as AWS Lambda and Amazon EventBridge. We also discuss how to achieve optimal AWS Glue job performance by applying various techniques.

Challenges

When we considered automating and developing the engine wash analytics process, we observed the following challenges:

Multiple data sources – The analytics process requires data from different sources such as foam wash events from IoT systems, flight parameters, and engine utilization data from a data lake hosted in an AWS account.
Large dataset processing and complex calculations – We needed to run analytics for seven commercial product lines. One of the product lines has approximately 280 million records, which is growing at a rate of 30% year over year. We needed analytics to run against 1 million wash events and perform over 2,000 calculations, while processing approximately 430 million flight records.
Scalable framework to accommodate new product lines and calculations – Due to the dynamics of the use case, we needed an extensible framework to add or remove new or existing product lines without affecting the existing process.
High performance and availability – We needed to run analytics daily to reflect the latest updates in engine wash events and changes in flight parameter data.
Security and compliance – Because the analytics processes involve flight and engine-related data, the data distribution and access need to adhere to the stringent security and compliance regulations of the aviation industry.

Solution overview

The following diagram illustrates the architecture of our wash analytics solution using AWS services.

The solution includes the following components:

EventBridge (1) – We use an EventBridge (time-based) to schedule the daily process to capture the delta changes between the runs.
Lambda (2a) – Lambda orchestrates the AWS Glue jobs initiation, backup, and recovery on failure for each stage, utilizing EventBridge (event-based) for the alerting of these events.
Lambda (2b) – Foam cart events from IoT devices are loaded into staging buckets in Amazon Simple Storage Service (Amazon S3) daily.
AWS Glue (3) – The wash analytics need to handle a small subset of data daily, but the initial historical load and transformation is huge. Because AWS Glue is serverless, it’s easy to set up and run with no maintenance.
- Copy job (3a) – We use an AWS Glue copy job to copy only the required subset of data from across AWS accounts by connecting to AWS Glue Data Catalog tables using a cross-account AWS Identity and Access Management (IAM) role.
- Business transformation jobs (3b, 3c) – When the copy job is complete, Lambda triggers subsequent AWS Glue jobs. Because our jobs are both compute and memory intensive, we use G2.x worker nodes. We can use Amazon CloudWatch metrics to fine-tune our jobs to use the right worker nodes. To handle complex calculations, we split large jobs up into multiple jobs by pipelining the output of one job as input to another job.
Source S3 buckets (4a) – Flights, wash events, and other engine parameter data is available in source buckets in a different AWS account exposed via Data Catalog tables.
Stage S3 bucket (4b) – Data from another AWS account is required for calculations, and all the intermediate outputs from the AWS Glue jobs are written to the staging bucket.
Backup S3 bucket (4c) – Every day before starting the AWS Glue job, the previous day’s output from the output bucket is backed up in the backup bucket. In case of any job failure, the data is recovered from this bucket.
Output S3 bucket (4d) – The final output from the AWS Glue jobs is written to the output bucket.

Continuing our analysis of the architecture components, we also use the following:

AWS Glue Data Catalog tables (5) – We catalog flights, wash events, and other engine parameter data using Data Catalog tables, which are accessed by AWS Glue copy jobs from another AWS account.
EventBridge (6) – We use EventBridge (event-based) to monitor for AWS Glue job state changes (SUCEEDED, FAILED, TIMEOUT, and STOPPED) and orchestrate the workflow, including backup, recovery, and job status notifications.
IAM role (7) – We set up cross-account IAM roles to copy the data from one account to another from the AWS Glue Data Catalog tables.
CloudWatch metrics (8) – You can monitor many different CloudWatch metrics. The following metrics can help you decide on horizontal or vertical scaling when fine-tuning the AWS Glue jobs:
- CPU load of the driver and executors
- Memory profile of the driver
- ETL data movement
- Data shuffle across executors
- Job run metrics, including active executors, completed stages, and maximum needed executors
Amazon SNS (9) – Amazon Simple Notification Service (Amazon SNS) automatically sends notifications to the support group on the error status of jobs, so they can take corrective action upon failure.
Amazon RDS (10) – The final transformed data is stored in Amazon Relational Database Service (Amazon RDS) for PostgreSQL (in addition to Amazon S3) to support legacy reporting tools.
Web application (11) – A web application is hosted on AWS Elastic Beanstalk, and is enabled with Auto Scaling exposed for clients to access the analytics data.

Implementation strategy

Implementing our solution included the following considerations:

Security – The data required for running analytics is present in different data sources and different AWS accounts. We needed to craft well-thought-out role-based access policies for accessing the data.
Selecting the right programming paradigm – PySpark does lazy evaluation while working with data frames. For PySpark to work efficiently with AWS Glue, we created data frames with required columns upfront and performed column-wise operations.
Choosing the right persistence storage – Writing to Amazon S3 enables multiple consumption patterns, and writes are much faster due to parallelism.

If we’re writing to Amazon RDS (for supporting legacy systems), we need to watch out for database connectivity and buffer lock issues while writing from AWS Glue jobs.

Data partitioning – Identifying the right key combination is important for partitioning the data for Spark to perform optimally. Our initial runs (without data partition) with 30 workers of type G2.x took 2 hours and 4 minutes to complete.

The following screenshot shows our CloudWatch metrics.

After a few dry runs, we were able to arrive at partitioning by a specific column (df.repartition(columnKey)) and with 24 workers of type G2.x, the job completed in 2 hours and 7 minutes. The following screenshot shows our new metrics.

We can observe a difference in CPU and memory utilization—running with even fewer nodes shows a smaller CPU utilization and memory footprint.

The following table shows how we achieved the final transformation with the strategies we discussed.

Iteration	Run Time	AWS Glue Job Status	Strategy
1	~12 hours	Unsuccessful/Stopped	Initial iteration
2	~9 hours	Unsuccessful/Stopped	Changing code to PySpark methodology
3	5 hours, 11 minutes	Partial success	Splitting a complex large job into multiple jobs
4	3 hours, 33 minutes	Success	Partitioning by column key
5	2 hours, 39 minutes	Success	Changing CSV to Parquet file format while storing the copied data from another AWS account and intermediate results in the stage S3 bucket
6	2 hours, 9 minutes	Success	Infra scaling: horizontal and vertical scaling

Conclusion

In this post, we saw how to build a cost-effective, maintenance-free solution using serverless AWS services to process large-scale data. We also learned how to gain optimal AWS Glue job performance with key partitioning, using the Parquet data format while persisting in Amazon S3, splitting complex jobs into multiple jobs, and using the right programming paradigm.

As we continue to solidify our data lake solution for data from various sources, we can use Amazon Redshift Spectrum to serve various future analytical use cases.

About the Authors

Giridhar G Jorapur is a Staff Infrastructure Architect at GE Aviation. In this role, he is responsible for designing enterprise applications, migration and modernization of applications to the cloud. Apart from work, Giri enjoys investing himself in spiritual wellness. Connect him on LinkedIn.