Tag Archives: Best practices

Publish Amazon DevOps Guru Insights to Slack Channel

2022-11-08 Chetan Makvana

Post Syndicated from Chetan Makvana original https://aws.amazon.com/blogs/devops/publish-amazon-devops-guru-insights-to-slack-channel/

Customers using Amazon DevOps Guru often wants to publish operational insights to chat collaboration platforms, such as Slack and Amazon Chime. Amazon DevOps Guru offers a fully managed AIOps platform service that enables developers and operators to improve application availability and resolve operational issues faster. It minimizes manual effort by leveraging machine learning (ML) powered recommendations. DevOps Guru automatically detects operational insights, predicts impending resource exhaustion, details likely cause, and recommends remediation actions. For customers running critical applications, having access to these operational insights and real-time alerts are key aspects to improve their overall incident remediation processes and maintain operational excellence. Customers use chat collaboration platforms to monitor operational insights and respond to events, which reduces context switching between applications and provides opportunities to respond faster.

This post walks you through how to integrate DevOps Guru with Slack channel to receive notifications for new operational insights detected by DevOps Guru. It doesn’t talk about enabling Amazon DevOps Guru and generating operational insights. You can refer to Gaining operational insights with AIOps using Amazon DevOps Guru to know more about this.

Solution overview

Amazon DevOps Guru integrates with Amazon EventBridge to notify you of events relating to insights and corresponding insight updates. To receive operational insight notifications in Slack channels, you configure routing rules to determine where to send notifications and use pre-defined DevOps Guru patterns to only send notifications or trigger actions that match that pattern. You can select any of the following pre-defined patterns to filter events to trigger actions in a supported AWS resource. For this post, we will send events only for “New Insights Open”.

DevOps Guru New Insight Open
DevOps Guru New Anomaly Association
DevOps Guru Insight Severity Upgraded
DevOps Guru New Recommendation Created
DevOps Guru Insight Closed

When EventBridge receives an event from DevOps Guru, the event rule fires and the event notification is sent to Slack channel by using AWS Lambda or AWS Chatbot. Chatbot is easier to configure and deploy. However, if you want more customization, we have also written a Lambda function that allows additional formatting options.

Amazon EventBridge receives an event from Amazon DevOps Guru, and fires event rule. A rule matches incoming events and sends them to AWS Lambda or AWS Chatbot. With AWS Lambda, you write code to customize the message and send formatted message to the Slack channel. To receive event notifications in chat channels, you configure an SNS topic as a target in the Amazon EventBridge rule and then associate the topic with a chat channel in the AWS Chatbot console. AWS Chatbot then sends event to the configured Slack channel.

Figure 1: Amazon EventBridge Integration with Slack using AWS Lambda or AWS Chatbot

The goal of this tutorial is to show a technical walkthrough of integration of DevOps Guru with Slack using the following options:

Publish using AWS Lambda
Publish using AWS Chatbot

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
A Slack channel
For “Publish using AWS Lambda”, AWS Serverless Application Model (SAM) CLI
For “Publish using AWS Chatbot”, AWS Command Line Interface (CLI).
We recommend using AWS Cloud9 to create an environment to get access to the AWS Serverless Application Model (SAM) CLI or AWS Command Line Interface (CLI) from a bash terminal. AWS Cloud9 is a browser-based IDE that provides a development environment in the cloud. While creating the new environment, ensure you choose Linux2 as the operating system. Alternatively, you can use your bash terminal in your favorite IDE, configure your AWS credentials, and install AWS SAM or AWS CLI

Publish using AWS Lambda

In this tutorial, you will perform the following steps:

Create a Slack Webhook URL
Launch SAM template to deploy the solution
Test the solution

Create a Slack Webhook URL

This step configures Slack workflow and creates a Webhook URL used for API call. You will need to have access to add a new channel and app to your Slack Workspace.

Create a new channel for events (i.e. devopsguru_events).
Within Slack, click on your workspace name drop-down arrow in the upper left.
Choose Tools > Workflow Builder.
Click Create in the upper right-hand corner of the Workflow Builder and give your workflow a name.
Click Next.
Click Select next to Webhook.
Click Add variable and add the following variables one at a time in the Key section. All data types will be text.

- text
- account
- region
- startTime
- insightType
- severity
- description
- insightUrl
- numOfAnomalies

When done, you should have 9 variables, double check them as they are case sensitive and will be referenced.
Click Add Step.
On the Add a workflow step window, click Add next to send a message.
Under Send this message to select the channel you created in earlier step.
In Message text, create the following.

Final message is with placeholder as corresponding variables created in Step #7

Figure 2: Message text configuration in Slack

Click Save.
Click Publish.
For the deployment, we will need the Webhook URL. Copy it in the notepad.

Launch SAM template to deploy the solution

In this step, you will launch the SAM template. This template deploys an AWS Lambda function that is triggered by an Amazon EventBridge rule when Amazon DevOps Guru notifies event relating to “DevOps Guru New Insight Open”. It also deploys AWS Secret Manager, Amazon EventBridge Rule and required permission to invoke this specific function. The AWS Lambda function retrieves the Slack Webhook URL from AWS Secret Manager and posts a message to Slack using webhook API call.

Create a new directory, navigate to that directory in a terminal and clone the GitHub repository using the below command.

git clone https://github.com/aws-samples/devops-guru-integration-with-slack.git

Change directory to the directory where you cloned the GitHub repository.

cd devops-guru-integration-with-slack

From the command line, use AWS SAM to build the serverless application with its dependencies.

sam build

From the command line, use AWS SAM to deploy the AWS resources for the pattern as specified in the template.yml file.

sam deploy --guided

During the prompts.

- enter a stack name.
- enter the desired AWS Region.
- enter the Secret name to store Slack Channel Webhook URL.
- enter the Slack Channel Webhook URL that you copied in an earlier step.
- allow SAM CLI to create IAM roles with the required permissions.

Once you have run sam deploy --guided mode once and saved arguments to a configuration file (samconfig.toml), you can use sam deploy in future to use these defaults.

Test the solution

Follow this blog to enable DevOps Guru and generate operational insights.
When DevOps Guru detects a new insight, it generates events in EventBridge. EventBridge then triggers Lambda that sends it to a Slack channel as below.

Slack channel shows message with details like Account, Region, Start Time, Insight Type, Severity, Description, Insight URL and Number of anomalies found.

Figure 3. Message published to Slack

Cleaning up

To avoid incurring future charges, delete the resources.

Delete resources deployed from this blog.
From the command line, use AWS SAM to delete the serverless application with its dependencies.

sam delete

Publish using AWS Chatbot

In this tutorial, you will perform the following steps:

Configure Amazon Simple Notification Service (SNS) and Amazon EventBridge using the AWS Command Line Interface (CLI)
Configure AWS Chatbot to a Slack workspace
Test the solution

Configure Amazon SNS and Amazon Eventbridge

We will now configure and deploy an SNS topic and an Eventbridge rule. This EventBridge rule will be triggered by DevOps Guru when “DevOps Guru New Insight Open” events are generated. The event will then be sent to the SNS topic which we will configure as a target for the Eventbridge rule.

Using CLI, create an SNS topic running the following command in the CLI. Alternatively, you can configure and create an SNS topic in the AWS management console.

aws sns create-topic --name devops-guru-insights-chatbot-topic

Save the SNS topic ARN that is generated in the CLI for a later step in this walkthrough.
Now we will create the Eventbridge rule. Run the following command to create the Eventbridge rule. Alternatively, you can configure and create the rule in the AWS management console.

aws events put-rule --name "devops-guru-insights-chatbot-rule" -
-event-pattern "{\"source\":[\"aws.devops-guru\"],\"detail-type\":[\"DevOps
 Guru New Insight Open\"]}"

We now want to add targets to the rule we just created. Use the ARN of the SNS topic we created in step one.

aws events put-targets --rule devops-guru-insights-chatbot-rule --targets "Id"="1","Arn"=""

We now have created an SNS topic, and an Eventbridge rule to send “DevOps Guru New Insight Open” events to that SNS topic.

Create and Add AWS Chatbot to a Slack workspace

In this step, we will configure AWS Chatbot and our Slack channel to receive the SNS Notifications we configured in the previous step.

Sign into the AWS management console and open AWS Chatbot at https://console.aws.amazon.com/Chatbot/.
Under Configure a chat client, select Slack from the dropdown and click Configure Client.
You will then need to give AWS Chatbot permission to access your workspace, click Allow.

AWS Chatbot is requesting permission to access the Slack workspace

Figure 4. AWS Chatbot requesting permission

Once configured, you’ll be redirected to the AWS management console. You’ll now want to click Configure new channel.
Use the follow configurations for the setup of the Slack channel.

- Configuration Name: aws-chatbot-devops-guru
- Channel Type: Public or Private
  - If adding Chatbot to a private channel, you will need the Channel ID. One way you can get this is by going to your slack channel and copying the link, the last set of unique characters will be your Channel ID.
- Channel Role: Create an IAM role using a template
- Role name: awschatbot-devops-guru-role
- Policy templates: Notification permissions
- Guardrail Policies: AWS-Chatbot-NotificationsOnly-Policy-5f5dfd95-d198-49b0-8594-68d08aba8ba1
- SNS Topics:
  - Region: us-east-1 (Select the region you created the SNS topic in)
  - Topics: devops-guru-insights-chatbot-topic

Click Configure.
You should now have your slack channel configured for AWS Chatbot.
Finally, we just need to invite AWS Chatbot to our slack channel.

- Type /invite in your slack channel and it will show different options.
- Select Add apps to this channel and invite AWS Chatbot to the channel.

Now your solution is fully integrated and ready for testing.

Test the solution

Follow this blog to enable DevOps Guru and generate operational insights.
When DevOps Guru detects a new insight, it generates events in EventBridge, it will send those events to SNS. AWS Chatbot receives the notification from SNS and publishes the notification to your slack channel.

Slack channel shows message with “DevOps Guru New Insight Open”

Figure 5. Message published to Slack

Cleaning up

To avoid incurring future charges, delete the resources.

Delete resources deployed from this blog.
When ready, delete the EventBridge rule, SNS topic, and channel configuration on Chatbot.

Conclusion

In this post, you learned how Amazon DevOps Guru integrates with Amazon EventBridge and publishes insights into Slack channel using AWS Lambda or AWS Chatbot. “Publish using AWS Lambda” option gives more flexibility to customize the message that you want to publish to Slack channel. Using “Publish using AWS Chabot”, you can add AWS Chatbot to your Slack channel in just a few clicks. However, the message is not customizable, unlike the first option. DevOps users can now monitor all reactive and proactive insights into Slack channels. This post talked about publishing new DevOps Guru insight to Slack. However, you can expand it to publish other events like new recommendations created, new anomaly associated, insight severity upgraded or insight closed.

About the authors:

Create a Multi-Region Python Package Publishing Pipeline with AWS CDK and CodePipeline

2022-11-03 Brian Smitches

Post Syndicated from Brian Smitches original https://aws.amazon.com/blogs/devops/create-a-multi-region-python-package-publishing-pipeline-with-aws-cdk-and-codepipeline/

Customers can author and store internal software packages in AWS by leveraging the AWS CodeSuite (AWS CodePipeline, AWS CodeBuild, AWS CodeCommit, and AWS CodeArtifact). As of the publish date of this blog post, there is no native way to replicate your CodeArtifact Packages across regions. This blog addresses how a custom solution built with the AWS Cloud Development Kit and AWS CodePipeline can create a Multi-Region Python Package Publishing Pipeline.

Whether it’s for resiliency or performance improvement, many customers want to deploy their applications across multiple regions. When applications are dependent on custom software packages, the software packages should be replicated to multiple regions as well. This post will walk through how to deploy a custom package publishing pipeline in your own AWS Account. This pipeline connects a Python package source code repository to build and publish pip packages to CodeArtifact Repositories spanning three regions (the primary and two replica regions). While this sample CDK Application is built specifically for pip packages, the underlying architecture can be reused for different software package formats, such as npm, Maven, NuGet, etc.

Solution overview

The following figure demonstrates the solution workflow:

A CodePipeline pipeline orchestrates the building and publishing of the software package

1. This pipeline is triggered by commits on the main branch of the CodeCommit repository
2. A CodeBuild job builds the pip packages using twine to be distributed
3. The publish stage (third column) uses three parallel CodeBuild jobs to publish the distribution package to the two CodeArtifact repositories in separate regions

The first CodeArtifact Repository stores the package contents in the primary region.
The second and third CodeArtifact Repository act as replicas and store the package contents in other regions.

Figure 1. A figure showing the architecture diagram

Figure 1. Architecture diagram

All of these resources are defined in a single AWS CDK Application. The resources are defined in CDK Stacks that are deployed as AWS CloudFormation Stacks. AWS CDK can deploy the different stacks across separate regions.

Prerequisites

Before getting started, you will need the following:

An AWS account
An instance of the AWS Cloud9 IDE or an alternative local compute environment, such as your personal computer
The following installed on your compute environment:

1. AWS CDK
2. AWS Command Line Interface (AWS CLI)
3. npm

The AWS Accounts must be bootstrapped for CDK in the necessary regions. The default configuration uses us-east-1, us-east-2 and us-west-2 as these three regions support CodeArtifact.

A new AWS Cloud9 IDE is recommended for this tutorial to isolate these actions in this post from your normal compute environment. See the Cloud9 Documentation for Creating an Environment.

Deploy the Python Package Publishing Pipeline into your AWS Account with the CDK

The source code can be found in this GitHub Repository.

Fork the GitHub Repo into your account. This way you can experiment with changes as necessary to fit your workload.
In your local compute environment, clone the GitHub Repository and cd into the project directory:

git clone [email protected]:<YOUR_GITHUB_USERNAME>/multi-region-
python-package-publishing-pipeline.git && cd multi-region-
python-package-publishing-pipeline

Install the necessary node packages:

npm i

(Optional) Override the default configurations for the CodeArtifact domainName, repositoryName, primaryRegion, and replicaRegions.
1. navigate to ./bin/multiregion_package_publishing.ts and update the relevant fields.
2. From the project’s root directory (multi-region-python-package-publishing-pipeline), deploy the AWS CDK application. This step may take 5-10 minutes.

cdk deploy --all

When prompted “Do you wish to deploy these changes (y/n)?”, Enter y.

Viewing the deployed CloudFormation stacks

After the deployment of the AWS CDK application completes, you can view the deployed AWS CDK Stacks via CloudFormation. From the AWS Console, search “CloudFormation’ in the search bar and navigate to the service dashboard. In the primary region (us-east-1(N. Virginia)) you should see two stacks: CodeArtifactPrimaryStack-<region> and PackagePublishingPipelineStack.

Figure 2. Screenshot showing the CloudFormation Stacks in the primary region

Switch regions to one of the secondary regions us-west-2 (Oregon) or us-east-2 (Ohio) to see the remaining stacks named CodeArtifactReplicaStack-<region>. These correspond to the three AWS CDK Stacks from the architecture diagram.

Figure 3. Screenshot showing the CloudFormation stacks in a separate region

Viewing the CodePipeline Package Publishing Pipeline

From the Console, select the primary region (us-east-1) and navigate to CodePipeline by utilizing the search bar. Select the Pipeline titled packagePipeline and inspect the state of the pipeline. This pipeline triggers after every commit from the CodeCommit repository named PackageSourceCode. If the pipeline is still in process, then wait a few minutes, as this pipeline can take approximately 7–8 minutes to complete all three stages (Source, Build, and Publish). Once it’s complete, the pipeline should reflect the following screenshot:

Figure 4. A screenshot showing the CodePipeline flow

Viewing the Published Package in the CodeArtifact Repository

To view the published artifacts, go to the primary or secondary region and navigate to the CodeArtifact dashboard by utilizing the search bar in the Console. You’ll see a repository named package-artifact-repo. Select the repository and you’ll see the sample pip package named mypippackage inside the repository. This package is defined by the source code in the CodeCommit repository named PackageSourceCode in the primary region (us-east-1).

Figure 5. Screenshot of the package repository

Create a new package version in CodeCommit and monitor the pipeline release

Navigate to your CodeCommit’s PackageSourceCode (us-east-1 CodeCommit > Repositories > PackageSourceCode. Open the setup.py file and select the Edit button. Make a simple modification, change the version = '1.0.0' to version = '1.1.0' and commit the changes to the Main branch.

Figure 6. A screenshot of the source package’s code repository in CodeCommit

Now navigate back to CodePipeline and watch as the pipeline performs the release automatically. When the pipeline finishes, this new package version will live in each of the three CodeArtifact Repositories.

Install the custom pip package to your local Python Environment

For your development team to connect to this CodeArtifact Repository to download repositories, you must configure the pip tool to look in this repository. From your Cloud9 IDE (or local development environment), let’s test the installation of this package for Python3:

Copy the connection instructions for the pip tool. Navigate to the CodeArtifact repository of your choice and select View connection instructions
1. Select Copy to copy the snippet to your clipboard

Figure 7. Screenshot showing directions to connect to a code artifact repository

Paste the command from your clipboard
Run pip install mypippackage==1.0.0

Figure 8. Screenshot showing CodeArtifact login

Test the package works as expected by importing the modules
Start the Python REPL by running python3 in the terminal

Figure 9. Screenshot of the package being imported

Clean up

Destroy all of the AWS CDK Stacks by running cdk destroy --all from the root AWS CDK application directory.

Conclusion

In this post, we walked through how to deploy a CodePipeline pipeline to automate the publishing of Python packages to multiple CodeArtifact repositories in separate regions. Leveraging the AWS CDK simplifies the maintenance and configuration of this multi-region solution by using Infrastructure as Code and predefined Constructs. If you would like to customize this solution to better fit your needs, please read more about the AWS CDK and AWS Developer Tools. Some links we suggest include the CodeArtifact User Guide (with sections covering npm, Python, Maven, and NuGet), the CDK API Reference, CDK Pipelines, and the CodePipeline User Guide.

About the authors:

Building highly resilient applications with on-premises interdependencies using AWS Local Zones

2022-10-27 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/building-highly-resilient-applications-with-on-premises-interdependencies-using-aws-local-zones/

This blog post is written by Rachel Rui Liu, Senior Solutions Architect.

AWS Local Zones are a type of infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers.

Following the successful launch of the AWS Local Zone s in 16 US cities since 2019, in Feb 2022, AWS announced plans to launch new AWS Local Zones in 32 metropolitan areas in 26 countries worldwide.

With Local Zones, we’ve seen use cases in two common categories.

The first category of use cases is for workloads that require extremely low latency between end-user devices and workload servers. For example, let’s consider media content creation and real-time multiplayer gaming. For these use cases, deploying the workload to a Local Zone can help achieve down to single-digit milliseconds latency between end-user devices and the AWS infrastructure, which is ideal for a good end-user experience.

This post will focus on addressing the second category of use cases, which is commonly seen in an enterprise hybrid architecture, where customers must achieve low latency between AWS infrastructure and existing on-premises data centers. Compared to the first category of use cases, these use cases can tolerate slightly higher latency between the end-user devices and the AWS infrastructure. However, these workloads have dependencies to these on-premises systems, so the lowest possible latency between AWS infrastructure and on-premises data centers is required for better application performance. Here are a few examples of these systems:

Financial services sector mainframe workloads hosted on premises serving regional customers.
Enterprise Active Directory hosted on premise serving cloud and on-premises workloads.
Enterprise applications hosted on premises processing a high volume of locally generated data.

For workloads deployed in AWS, the time taken for each interaction with components still hosted in the on-premises data center is increased by the latency. In turn, this delays responses received by the end-user. The total latency accumulates and results in suboptimal user experiences.

By deploying modernized workloads in Local Zones, you can reduce latency while continuing to access systems hosted in on-premises data centers, thereby reducing the total latency for the end-user. At the same time, you can enjoy the benefits of agility, elasticity, and security offered by AWS, and can apply the same automation, compliance, and security best practices that you’ve been familiar with in the AWS Regions.

Enterprise workload resiliency with Local Zones

While designing hybrid architectures with Local Zones, resiliency is an important consideration. You want to route traffic to the nearest Local Zone for low latency. However, when disasters happen, it’s critical to fail over to the parent Region automatically.

Let’s look at the details of hybrid architecture design based on real world deployments from different angles to understand how the architecture achieves all of the design goals.

Hybrid architecture with resilient network connectivity

The following diagram shows a high-level overview of a resilient enterprise hybrid architecture with Local Zones, where you have redundant connections between the AWS Region, the Local Zone, and the corporate data center.

Here are a few key points with this network connectivity design:

Use AWS Direct Connect or Site-to-Site VPN to connect the corporate data center and AWS Region.
Use Direct Connect or self-hosted VPN to connect the corporate data center and the Local Zone. This connection will provide dedicated low-latency connectivity between the Local Zone and corporate data center.
Transit Gateway is a regional service. When attaching the VPC to AWS Transit Gateway, you can only add subnets provisioned in the Region. Instances on subnets in the Local Zone can still use Transit Gateway to reach resources in the Region.
For subnets provisioned in the Region, the VPC route table should be configured to route the traffic to the corporate data center via Transit Gateway.
For subnets provisioned in Local Zone, the VPC route table should be configured to route the traffic to the corporate data center via the self-hosted VPN instance or Direct Connect.

Hybrid architecture with resilient workload deployment

The next examples show a public and a private facing workload.

To simplify the diagram and focus on application layer architecture, the following diagrams assume that you are using Direct Connect to connect between AWS and the on-premises data center.

Example 1: Resilient public facing workload

With a public facing workload, end-user traffic will be routed to the Local Zone. If the Local Zone is unavailable, then the traffic will be routed to the Region automatically using an Amazon Route 53 failover policy.

Here are the key design considerations for this architecture:

Deploy the workload in the Local Zone and put the compute layer in an AWS AutoScaling Group, so that the application can scale up and down depending on volume of requests.
Deploy the workload in both the Local Zone and an AWS Region, and put the compute layer into an autoscaling group. The regional deployment will act as pilot light or warm standby with minimal footprint. But it can scale out when the Local Zone is unavailable.
Two Application Load Balancers (ALBs) are required: one in the Region and one in the Local Zone. Each ALB will dispatch the traffic to each workload cluster inside the autoscaling group local to it.
An internet gateway is required for public facing workloads. When using a Local Zone, there’s no extra configuration needed: define a single internet gateway and attach it to the VPC.

If you want to specify an Elastic IP address to be the workload’s public endpoint, the Local Zone will have a different address pool than the Region. Noting that BYOIP is unsupported for Local Zones.

Create a Route 53 DNS record with “Failover” as the routing policy.

For the primary record, point it to the alias of the ALB in the Local Zone. This will set Local Zone as the preferred destination for the application traffic which minimizes latency for end-users.
For the secondary record, point it to the alias of the ALB in the AWS Region.
Enable health check for the primary record. If health check against the primary record fails, which indicates that the workload deployed in the Local Zone has failed to respond, then Route 53 will automatically point to the secondary record, which is the workload deployed in the AWS Region.

Example 2: Resilient private workload

For a private workload that’s only accessible by internal users, a few extra considerations must be made to keep the traffic inside of the trusted private network.

The architecture for resilient private facing workload has the same steps as public facing workload, but with some key differences. These include:

Instead of using a public hosted zone, create private hosted zones in Route 53 to respond to DNS queries for the workload.
Create the primary and secondary records in Route 53 just like the public workload but referencing the private ALBs.
To allow end-users onto the corporate network (within offices or connected via VPN) to resolve the workload, use the Route 53 Resolver with an inbound endpoint. This allows end-users located on-premises to resolve the records in the private hosted zone. Route 53 Resolver is designed to be integrated with an on-premises DNS server.
No internet gateway is required for hosting the private workload. You might need an internet gateway in the Local Zone for other purposes: for example, to host a self-managed VPN solution to connect the Local Zone with the corporate data center.

Hosting multiple workloads

Customers who host multiple workloads in a single VPC generally must consider how to segregate those workloads. As with workloads in the AWS Region, segregation can be implemented at a subnet or VPC level.

If you want to segregate workloads at the subnet level, you can extend your existing VPC architecture by provisioning extra sets of subnets to the Local Zone.

Although not shown in the diagram, for those of you using a self-hosted VPN to connect the Local Zone with an on-premises data center, the VPN solution can be deployed in a centralized subnet.

You can continue to use security groups, network access control lists (NACLs) , and VPC route tables – just as you would in the Region to segregate the workloads.

If you want to segregate workloads at the VPC level, like many of our customers do, within the Region, inter-VPC routing is generally handled by Transit Gateway. However, in this case, it may be undesirable to send traffic to the Region to reach a subnet in another VPC that is also extended to the Local Zone.

Key considerations for this design are as follows:

Direct Connect is deployed to connect the Local Zone with the corporate data center. Therefore, each VPC will have a dedicated Virtual Private Gateway provisioned to allow association with the Direct Connect Gateway.
To enable inter-VPC traffic within the Local Zone, peer the two VPCs together.
Create a VPC route table in VPC A. Add a route for Subnet Y where the destination is the peering link. Assign this route table to Subnet X.
Create a VPC route table in VPC B. Add a route for Subnet X where the destination is the peering link. Assign this route table to Subnet Y.
If necessary, add routes for on-premises networks and the transit gateway to both route tables.

This design allows traffic between subnets X and Y to stay within the Local Zone, thereby avoiding any latency from the Local Zone to the AWS Region while still permitting full connectivity to all other networks.

Conclusion

In this post, we summarized the use cases for enterprise hybrid architecture with Local Zones, and showed you:

Reference architectures to host workloads in Local Zones with low-latency connectivity to corporate data centers and resiliency to enable fail over to the AWS Region automatically.
Different design considerations for public and private facing workloads utilizing this hybrid architecture.
Segregation and connectivity considerations when extending this hybrid architecture to host multiple workloads.

Hopefully you will be able to follow along with these reference architectures to build and run highly resilient applications with local system interdependencies using Local Zones.

Automate Amazon Redshift Serverless data warehouse management using AWS CloudFormation and the AWS CLI

2022-10-19 Ranjan Burman

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/automate-amazon-redshift-serverless-data-warehouse-management-using-aws-cloudformation-and-the-aws-cli/

Amazon Redshift Serverless makes it simple to run and scale analytics without having to manage the instance type, instance size, lifecycle management, pausing, resuming, and so on. It automatically provisions and intelligently scales data warehouse compute capacity to deliver fast performance for even the most demanding and unpredictable workloads, and you pay only for what you use. Just load your data and start querying right away in the Amazon Redshift Query Editor or in your favorite business intelligence (BI) tool and continue to enjoy the best price performance and familiar SQL features in an easy-to-use, zero administration environment.

Redshift Serverless separates compute and storage and introduces two abstractions:

Workgroup – A workgroup is a collection of compute resources. It groups together compute resources like RPUs, VPC subnet groups, and security groups.
Namespace – A namespace is a collection of database objects and users. It groups together data objects, such as databases, schemas, tables, users, or AWS Key Management Service (AWS KMS) keys for encrypting data.

Some organizations want to automate the creation of workgroups and namespaces for automated infrastructure management and consistent configuration across environments, and provide end-to-end self-service capabilities. You can automate the workgroup and namespace management operations using the Redshift Serverless API, the AWS Command Line Interface (AWS CLI), or AWS CloudFormation, which we demonstrate in this post.

Solution overview

In the following sections, we discuss the automation approaches for various tasks involved in Redshift Serverless data warehouse management using AWS CloudFormation (for more information, see RedshiftServerless resource type reference) and the AWS CLI (see redshift-serverless).

The following are some of the key use cases and appropriate automation approaches to use with AWS CloudFormation:

Enable end-to-end self-service from infrastructure setup to querying
Automate data consumer onboarding for data provisioned through AWS Data Exchange
Accelerate workload isolation by creating endpoints
Create a new data warehouse with consistent configuration across environments

The following are some of the main use cases and approaches for the AWS CLI:

Automate maintenance operations:
- Backup and limits
- Modify RPU configurations
- Manage limits
Automate migration from provisioned to serverless

Prerequisites

To run the operations described in this post, make sure that this user or role has AWS Identity Access and Management (IAM) arn:aws:iam::aws:policy/AWSCloudFormationFullAccess, and either the administrator permission arn:aws:iam::aws:policy/AdministratorAccess or the full Amazon Redshift permission arn:aws:iam::aws:policy/AmazonRedshiftFullAccess policy attached. Refer to Security and connections in Amazon Redshift Serverless for further details.

You should have at least three subnets, and they must span across three Availability Zones.It is not enough if just 3 subnets created in same availability zone. To create a new VPC and subnets, use the following CloudFormation template to deploy in your AWS account.

Create a Redshift Serverless namespace and workgroup using AWS CloudFormation

AWS CloudFormation helps you model and set up your AWS resources so that you can spend less time on infrastructure setup and more time focusing on your applications that run in AWS. You create a template that describes all the AWS resources that you want, and AWS CloudFormation takes care of provisioning and configuring those resources based on the given input parameters.

To create the namespace and workgroup for a Redshift Serverless data warehouse using AWS CloudFormation, complete the following steps:

Choose Launch Stack to launch AWS CloudFormation in your AWS account with a template:
For Stack name, enter a meaningful name for the stack, for example, rsserverless.
Enter the parameters detailed in the following table.

Parameters	Default	Allowed Values	Description
Namespace	.	N/A	The name of the namespace of your choice to be created.
Database Name	dev	N/A	The name of the first database in the Redshift Serverless environment.
Admin User Name	admin	N/A	The administrator’s user name for the Redshift Serverless namespace being create.
Admin User Password	.	N/A	The password associated with the admin user.
Associate IAM Role	.	Comma-delimited list of ARNs of IAM roles	Associate an IAM role to your Redshift Serverless namespace (optional).
Log Export List	userlog, connectionlog, useractivitylog	userlog, connectionlog, useractivitylog	Provide comma-separated values from the list. For example, userlog, connectionlog, useractivitylog. If left blank, LogExport is turned off.
Workgroup	.	N/A	The workgroup name of your choice to be created.
Base RPU	128	Minimum value of 32 and maximum value of 512	The base RPU for the Redshift Serverless workgroup.
Publicly accessible	false	true, false	Indicates if the Redshift Serverless instance is publicly accessible.
Subnet Ids	.	N/A	You must have at least three subnets, and they must span across three Availability Zones.
Security Group Id	.	N/A	The list of security group IDs in your VPC.
Enhanced VPC Routing	false	true, false	The value that specifies whether to enable enhanced VPC routing, which forces Redshift Serverless to route traffic through your VPC.

Pass the parameters provided to the AWS::RedshiftServerless::Namespace and AWS::RedshiftServerless::Workgroup resource types:

Resources:
  RedshiftServerlessNamespace:
    Type: 'AWS::RedshiftServerless::Namespace'
    Properties:
      AdminUsername:
        Ref: AdminUsername
      AdminUserPassword:
        Ref: AdminUserPassword
      DbName:
        Ref: DatabaseName
      NamespaceName:
        Ref: NamespaceName
      IamRoles:
        Ref: IAMRole
      LogExports:
        Ref: LogExportsList        
  RedshiftServerlessWorkgroup:
    Type: 'AWS::RedshiftServerless::Workgroup'
    Properties:
      WorkgroupName:
        Ref: WorkgroupName
      NamespaceName:
        Ref: NamespaceName
      BaseCapacity:
        Ref: BaseRPU
      PubliclyAccessible:
        Ref: PubliclyAccessible
      SubnetIds:
        Ref: SubnetId
      SecurityGroupIds:
        Ref: SecurityGroupIds
      EnhancedVpcRouting:
        Ref: EnhancedVpcRouting        
    DependsOn:
      - RedshiftServerlessNamespace

Perform namespace and workgroup management operations using the AWS CLI

The AWS CLI is a unified tool to manage your AWS services. With just one tool to download and configure, you can control multiple AWS services from the command line and automate them through scripts.

To run the Redshift Serverless CLI commands, you need to install the latest version of AWS CLI. For instructions, refer to Installing or updating the latest version of the AWS CLI.

Now you’re ready to complete the following steps:

Use the following command to create a new namespace:

aws redshift-serverless create-namespace \
    --admin-user-password '<password>' \
    --admin-username cfn-blog-admin \
    --db-name cfn-blog-db \
    --namespace-name 'cfn-blog-ns'

The following screenshot shows an example output.

create-namespace

Use the following command to create a new workgroup mapped to the namespace you just created:

aws redshift-serverless create-workgroup \
    --base-capacity 128 \
    --namespace-name 'cfn-blog-ns' \
    --no-publicly-accessible \
    --security-group-ids "sg-0269bd680e0911ce7" \
    --subnet-ids "subnet-078eedbdd99398568" "subnet-05defe25a59c0e4c2" "subnet-0f378d07e02da3e48"\
    --workgroup-name 'cfn-blog-wg'

The following is an example output.

create workgroup

To allow instances and devices outside the VPC to connect to the workgroup, use the publicly-accessible option in the create-workgroup CLI command.

To verify the workgroup has been created and is in AVAILABLE status, use the following command:

aws redshift-serverless get-workgroup \
--workgroup-name 'cfn-blog-wg' \
--output text \
--query 'workgroup.status'

The following screenshot shows our output.

Regardless of whether your snapshot was made from a provisioned cluster or serverless workgroup, it can be restored into a new serverless workgroup. Restoring a snapshot replaces the namespace and workgroup with the contents of the snapshot.

Use the following command to restore from a snapshot:

aws redshift-serverless restore-from-snapshot \
--namespace-name 'cfn-blog-ns' \
--snapshot-arn arn:aws:redshift:us-east-1:<account-id>:snapshot:<cluster-identifier>/<snapshot-identifier> \
--workgroup-name 'cfn-blog-wg'

The following is an example output.

To check the workgroup status, run the following command:

aws redshift-serverless get-workgroup \
--workgroup-name 'cfn-blog-wg' \
--output text \
--query 'workgroup.status'

To create a snapshot from an existing namespace, run the following command:

aws redshift-serverless create-snapshot \
--namespace-name cfn-blog-ns \
--snapshot-name cfn-blog-snapshot-from-ns \
--retention-period 7

The following is an example output.

Redshift Serverless creates recovery points of your namespace that are available for 24 hours. To keep your recovery point longer than 24 hours, convert it to a snapshot.

To find the recovery points associated to your namespace, run the following command:

aws redshift-serverless list-recovery-points \
--namespace-name cfn-blog-ns \
--no-paginate

The following an example output with the list of all the recovery points.

list recovery points

Let’s take the latest recoveryPointId from the list and convert to snapshot.

To create a snapshot from a recovery point, run the following command:

aws redshift-serverless convert-recovery-point-to-snapshot \
--recovery-point-id f9eaf9ac-a98d-4809-9eee-869ef03e98b4 \
--retention-period 7 \
--snapshot-name cfn-blog-snapshot-from-rp

The following is an example output.

convert-recovery-point

In addition to restoring a snapshot to a serverless namespace, you can also restore from a recovery point.

First, you need to find the recovery point identifier using the list-recovery-points command.
Then use the following command to restore from a recovery point:

aws redshift-serverless restore-from-recovery-point \
--namespace-name cfn-blog-ns \
--recovery-point-id 15c55fb4-d973-4d8a-a8fe-4741e7911137 \
--workgroup-name cfn-blog-wg

The following is an example output.

restore from recovery point

The base RPU determines the starting capacity for your serverless environment.

Use the following command to modify the base RPU based on your workload requirements:

aws redshift-serverless update-workgroup \
--base-capacity 256 \
--workgroup-name 'cfn-blog-wg'

The following is an example output.

Run the following command to verify the workgroup base RPU capacity has been modified to 256:

aws redshift-serverless get-workgroup \
--workgroup-name 'cfn-blog-wg' \
--output text \
--query 'workgroup.baseCapacity'

To keep costs predictable for Redshift Serverless, you can set the maximum RPU hours used per day, per week, or per month. In addition, you can take action when the limit is reached. Actions include: write a log entry to a system table, receive an alert, or turn off user queries.

Use the following command to first get the workgroup ARN:

aws redshift-serverless get-workgroup --workgroup-name 'cfn-blog-wg' \
--output text \
--query 'workgroup.workgroupArn'

The following screenshot shows our output.

Use the workgroupArn output from the preceding command with the following command to set the daily RPU usage limit and set the action behavior to log:

aws redshift-serverless create-usage-limit \
--amount 256 \
--breach-action log \
--period daily \
--resource-arn arn:aws:redshift-serverless:us-east-1:<aws-account-id>:workgroup/1dcdd402-8aeb-432e-8833-b1f78a112a93 \
--usage-type serverless-compute

The following is an example output.

Conclusion

You have now learned how to automate management operations on Redshift Serverless namespaces and workgroups using AWS CloudFormation and the AWS CLI. To automate creation and management of Amazon Redshift provisioned clusters, refer to Automate Amazon Redshift Cluster management operations using AWS CloudFormation.

About the Authors

Ranjan Burman is a Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 15 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Urvish Shah is a Senior Database Engineer at Amazon Redshift. He has more than a decade of experience working on databases, data warehousing and in analytics space. Outside of work, he enjoys cooking, travelling and spending time with his daughter.

Get started with Apache Hudi using AWS Glue by implementing key design concepts – Part 1

2022-10-17 Amit Maindola

Post Syndicated from Amit Maindola original https://aws.amazon.com/blogs/big-data/part-1-get-started-with-apache-hudi-using-aws-glue-by-implementing-key-design-concepts/

Many organizations build data lakes on Amazon Simple Storage Service (Amazon S3) using a modern architecture for a scalable and cost-effective solution. Open-source storage formats like Parquet and Avro are commonly used, and data is stored in these formats as immutable files. As the data lake is expanded to additional use cases, there are still some use cases that are very difficult with data lakes, such as CDC (change data capture), time travel (querying point-in-time data), privacy regulation requiring deletion of data, concurrent writes, and consistency regarding handling small file problems.

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and streaming data ingestion. However, organizations new to data lakes may struggle to adopt Apache Hudi due to unfamiliarity with the technology and lack of internal expertise.

In this post, we show how to get started with Apache Hudi, focusing on the Hudi CoW (Copy on Write) table type on AWS using AWS Glue, and implementing key design concepts for different use cases. We expect readers to have a basic understanding of data lakes, AWS Glue, and Amazon S3. We walk you through common batch data ingestion use cases with actual test results using a TPC-DS dataset to show how the design decisions can influence the outcome.

Apache Hudi key concepts

Before diving deep into the design concepts, let’s review the key concepts of Apache Hudi, which is important to understand before you make design decisions.

Hudi table and query types

Hudi supports two table types: Copy on Write (CoW) and Merge on Read (MoR). You have to choose the table type in advance, which influences the performance of read and write operations.

The difference in performance depends on the volume of data, operations, file size, and other factors. For more information, refer to Table & Query Types.

When you use the CoW table type, committed data is implicitly compacted, meaning it’s updated to columnar file format during write operation. With the MoR table type, data isn’t compacted with every commit. As a result, for the MoR table type, compacted data lives in columnar storage (Parquet) and deltas are stored in a log (Avro) raw format until compaction merges changes the data to columnar file format. Hudi supports snapshot, incremental, and read-optimized queries for Hudi tables, and the output of the result depends on the query type.

Indexing

Indexing is another key concept for the design. Hudi provides efficient upserts and deletes with fast indexing for both CoW and MoR tables. For CoW tables, indexing enables fast upsert and delete operations by avoiding the need to join against the entire dataset to determine which files to rewrite. For MoR, this design allows Hudi to bound the amount of records any given base file needs to be merged against. Specifically, a given base file needs to be merged only against updates for records that are part of that base file. In contrast, designs without an indexing component could end up having to merge all the base files against all incoming update and delete records.

Solution overview

The following diagram describes the high-level architecture for our solution. We ingest the TPC-DS (store_sales) dataset from the source S3 bucket in CSV format and write it to the target S3 bucket using AWS Glue in Hudi format. We can query the Hudi tables on Amazon S3 using Amazon Athena and AWS Glue Studio Notebooks.

The following diagram illustrates the relationships between our tables.

For our post, we use the following tables from the TPC-DS dataset: one fact table, store_sales, and the dimension tables store, item, and date_dim. The following table summarizes the table row counts.

Table	Approximate Row Counts
store_sales	2.8 billion
store	1,000
item	300,000
date_dim	73,000

Set up the environment

After you sign in to your test AWS account, launch the provided AWS CloudFormation template by choosing Launch Stack:

This template configures the following resources:

AWS Glue jobs hudi_bulk_insert, hudi_upsert_cow, and hudi_bulk_insert_dim. We use these jobs for the use cases covered in this post.
An S3 bucket to store the output of the AWS Glue job runs.
AWS Identity and Access Management (IAM) roles and policies with appropriate permissions.

Before you run the AWS Glue jobs, you need to subscribe to the AWS Glue Apache Hudi Connector (latest version: 0.10.1). The connector is available on AWS Marketplace. Follow the connector installation and activation process from the AWS Marketplace link, or refer to Process Apache Hudi, Delta Lake, Apache Iceberg datasets at scale, part 1: AWS Glue Studio Notebook to set it up.

After you create the Hudi connection, add the connector name to all the AWS Glue scripts under Advanced properties.

Bulk insert job

To run the bulk insert job, choose the job hudi_bulk_insert on the AWS Glue console.

The job parameters as shown in the following screenshot are added as part of the CloudFormation stack setup. You can use different values to create CoW partitioned tables with different bulk insert options.

The parameters are as follows:

HUDI_DB_NAME – The database in the AWS Glue Data Catalog where the catalog table is created.
HUDI_INIT_SORT_OPTION – The options for bulk_insert include GLOBAL_SORT, which is the default. Other options include NONE and PARTITION_SORT.
HUDI_TABLE_NAME – The table name prefix that you want to use to identify the table created. In the code, we append the sort option to the name you specify in this parameter.
OUTPUT_BUCKET – The S3 bucket created through the CloudFormation stack where the Hudi table datasets are written. The bucket name format is <account number><bucket name>. The bucket name is the one given while creating the CloudFormation stack.
CATEGORY_ID – The default for this parameter is ALL, which processes categories of test data in a single AWS Glue job. To test the parallel on the same table, change the parameter value to one of categories from 3, 5, or 8 for the dataset that we use for each parallel AWS Glue job.

Upsert job for the CoW table

To run the upsert job, choose the job hudi_upsert_cow on the AWS Glue console.

The following job parameters are added as part of the CloudFormation stack setup. You can run upsert and delete operations on CoW partitioned tables with different bulk insert options based on the values provided for these parameters.

OUTPUT-BUCKET – The same value as the previous job parameter.
HUDI_TABLE_NAME – The name of the table created in your AWS Glue Data Catalog.
HUDI_DB_NAME – The same value as the previous job parameter. The default value is Default.

Bulk insert job for the Dimension tables

To test the queries on the CoW tables, the fact table that is created using the bulk insert operation needs supplemental dimensional tables. This AWS Glue job has to be run before you can test the TPC queries provided later in this post. To run this job, choose hudi_bulk_insert_dim on the AWS Glue console and use the parameters shown in the following screenshot.

The parameters are as follows:

OUTPUT-BUCKET – The same value as the previous job parameter.
HUDI_INIT_SORT_OPTION – The options for bulk_insert include GLOBAL_SORT, which is the default. Other available options are NONE and PARTITION_SORT.
HUDI_DB_NAME – The Hudi database name. Default is the default value.

Hudi design considerations

In this section, we walk you through a few use cases to demonstrate the difference in the outcome for different settings and operations.

Data migration use case

In Apache Hudi, you ingest the data into CoW or MoR tables types using either insert, upsert, or bulk insert operations. Data migration initiatives often involve one-time initial loads into the target datastore, and we recommend using the bulk insert operation for initial loads.

The bulk insert option provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well for several hundred TBs of initial load. However, this just does a best-effort job at sizing files vs. guaranteeing file sizes like inserts and upserts do. Also, the primary keys aren’t sorted during the insert, therefore it’s not advised to use insert during the initial data load. By default, a Bloom index is created for the table, which enables faster lookups for upsert and delete operations.

Bulk insert has the following three sort options, which have different outcomes.

GLOAL_SORT – Sorts the record key for the entire dataset before writing.
PARTITION_SORT – Applies only to partitioned tables. In this option, the record key is sorted within each partition, and the insert time is faster than the default sort.
NONE – Doesn’t sort data before writing.

For testing the bulk insert with the three sort options, we use the following AWS Glue job configuration, which is part of the script hudi_bulk_insert:

AWS Glue version: 3.0
AWS Glue worker type: G1.X
Number of AWS Glue workers: 200
Input file: TPC-DS/2.13/1TB/store_sales
Input file format: CSV (TPC-DS)
Number of input files: 1,431
Number of rows in the input dataset: Approximately 2.8 billion

The following charts illustrate the behavior of the bulk insert operations with GLOBAL_SORT, PARTITION_SORT, and NONE as sort options for a CoW table. The statistics in the charts are created by using an average of 10 bulk insert operation runs for each sort option.

Because bulk insert does a best-effort job to pack the data in files, you see a different number of files created with different sort options.

We can observe the following:

Bulk insert with GLOBAL_SORT has the least number of files, because Hudi tried to create the optimal sized files. However, it takes the most time.
Bulk insert with NONE as the sort option has the fastest write time, but resulted in a greater number of files.
Bulk insert with PARTITION_SORT also has a faster write time compared to GLOBAL SORT, but also results in a greater number of files.

Based on these results, although GLOBAL_SORT takes more time to ingest the data, it creates a smaller number of files, which has better upsert and read performance.

The following diagrams illustrate the Spark run plans for the bulk_insert operation using various sort options.

The first shows the Spark run plan for bulk_insert when the sort option is PARTITION_SORT.

The next is the Spark run plan for bulk_insert when the sort option is NONE.

The last is the Spark run plan for bulk_insert when the sort option is GLOBAL_SORT.

The Spark run plan for bulk_insert with GLOBAL_SORT involves shuffling of data to create optimal sized files. For the other two sort options, data shuffling isn’t involved. As a result, bulk_insert with GLOBAL_SORT takes more time compared to the other sort options.

To test the bulk insert with various bulk insert sort data options on a partitioned table, modify the Hudi AWS Glue job (hudi_bulk_insert) parameter --HUDI_INIT_SORT_OPTION.

We change the parameter --HUDI_INIT_SORT_OPTION to PARTITION_SORT or NONE to test the bulk insert with different data sort options. You need to run the job hudi_bulk_insert_dim, which loads the rest of the tables needed to test the SQL queries.

Now, look at the query performance difference between these three options. For query runtime, we ran two TPC-DS queries (q52.sql and q53.sql, as shown in the following query snippets) using interactive session with AWS Glue Studio Notebook with the following notebook configuration to compare the results.

AWS Glue version: 3.0
AWS Glue worker type: G1.X
Number of AWS Glue workers: 50

Before executing the following queries, replace the table names in the queries with the tables you generate in your account.
q52

SELECT
  dt.d_year,
  item.i_brand_id brand_id,
  item.i_brand brand,
  sum(ss_ext_sales_price) ext_price
FROM date_dim dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manager_id = 1
  AND dt.d_moy = 11
  AND dt.d_year = 2000
GROUP BY dt.d_year, item.i_brand, item.i_brand_id
ORDER BY dt.d_year, ext_price DESC, brand_id
LIMIT 100

SELECT *
FROM
  (SELECT
    i_manufact_id,
    sum(ss_sales_price) sum_sales,
    avg(sum(ss_sales_price))
    OVER (PARTITION BY i_manufact_id) avg_quarterly_sales
  FROM item, store_sales, date_dim, store
  WHERE ss_item_sk = i_item_sk AND
    ss_sold_date_sk = d_date_sk AND
    ss_store_sk = s_store_sk AND
    d_month_seq IN (1200, 1200 + 1, 1200 + 2, 1200 + 3, 1200 + 4, 1200 + 5, 1200 + 6,
                          1200 + 7, 1200 + 8, 1200 + 9, 1200 + 10, 1200 + 11) AND
    ((i_category IN ('Books', 'Children', 'Electronics') AND

As you can see in the following chart, the performance of the GLOBAL_SORT table outperforms NONE and PARTITION_SORT due to a smaller number of files created in the bulk insert operation.

Ongoing replication use case

For ongoing replication, updates and deletes usually come from transactional databases. As you saw in the previous section, the bulk operation with GLOBAL_SORT took the most time and the operation with NONE took the least time. When you anticipate a higher volume of updates and deletes on an ongoing basis, the sort option is critical for your write performance.

To illustrate the ongoing replication using Apache Hudi upsert and delete operations, we tested using the following configuration:

AWS Glue version: 3.0
AWS Glue worker type: G1.X
Number of AWS Glue workers: 100

To test the upsert and delete operations, we use the store_sales CoW table, which was created using the bulk insert operation in the previous section with all three sort options. We make the following changes:

Insert data into a new partition (month 1 and year 2004) using the existing data from month 1 of year 2002 with a new primary key; total of 32,164,890 records
Update the ss_list_price column by $1 for the existing partition (month 1 and year 2003); total of 5,997,571 records
Delete month 5 data for year 2001; total of 26,997,957 records

The following chart illustrates the runtimes for the upsert operation for the CoW table with different sort options used during the bulk insert.

As you can see from the test run, the runtime of the upsert is higher for NONE and PARTITION_SORT CoW tables. The Bloom index, which is created by default during the bulk insert operation, enables faster lookup for upsert and delete operations.

To test the upsert and delete operations on a CoW table for tables with different data sort options, modify the AWS Glue job (hudi_upsert_cow) parameter HUDI_TABLE_NAME to the desired table, as shown in the following screenshot.

For workloads where updates are performed on the most recent partitions, a Bloom index works fine. For workloads where the update volume is less but the updates are spread across partitions, a simple index is more efficient. You can specify the index type while creating the Hudi table by using the parameter hoodie.index.type. Both the Bloom index and simple index enforce uniqueness of table keys within a partition. If you need uniqueness of keys for the entire table, you must create a global Bloom index or global simple index based on the update workloads.

Multi-tenant partitioned design use case

In this section, we cover Hudi optimistic concurrency using a multi-tenant table design, where each tenant data is stored in a separate table partition. In a real-world scenario, you may encounter a business need to process different tenant data simultaneously, such as a strict SLA to make the data available for downstream consumption as quickly as possible. Without Hudi optimistic concurrency, you can’t have concurrent writes to the same Hudi table. In such a scenario, you can speed up the data writes using Hudi optimistic concurrency when each job operates on a different table dataset. In our multi-tenant table design using Hudi optimistic concurrency, you can run concurrent jobs, where each job writes data to a separate table partition.

For AWS Glue, you can implement Hudi optimistic concurrency using an Amazon DynamoDB lock provider, which was introduced with Apache Hudi 0.10.0. The initial bulk insert script has all the configurations needed to allow multiple writes. The role being used for AWS Glue needs to have DynamoDB permissions added to make it work. For more information about concurrency control and alternatives for lock providers, refer to Concurrency Control.

To simulate concurrent writes, we presume your tenant is based on the category field from the TPC DC test dataset and accordingly partitioned based on the category id field (i_category_id). Let’s modify the script hudi_bulk_insert to run an initial load for different categories. You need to configure your AWS Glue job to run concurrently based on the Maximum concurrency parameter, located under the advanced properties. We describe the Hudi configuration parameters that are needed in the appendix at the end of this post.

The TPC-DS dataset includes data from years 1998–2003. We use i_catagory_id as the tenant ID. The following screenshot shows the distribution of data for multiple tenants (i_category_id). In our testing, we load the data for i_category_id values 3, 5, and 8.

The AWS Glue job hudi_bulk_insert is designed to insert data into specific partitions based on the parameter CATEGORY_ID. If bulk insert job for dimension tables is not run before you need to run the job hudi_bulk_insert_dim, which loads the rest of the tables needed to test the SQL queries.

Now we run three concurrent jobs, each with respective values 3, 5, and 8 to simulate concurrent writes for multiple tenants. The following screenshot illustrates the AWS Glue job parameter to modify for CATEGORY_ID.

We used the following AWS Glue job configuration for each of the three parallel AWS Glue jobs:

AWS Glue version: 3.0
AWS Glue worker type: G1.X
Number of AWS Glue workers: 100
Input file: TPC-DS/2.13/1TB/store_sales
Input file format: CSV (TPC-DS)

The following screenshot shows all three concurrent jobs started around the same time for three categories, which loaded 867 million rows (50.1 GB of data) into the store_sales table. We used the GLOBAL_SORT option for all three concurrent AWS Glue jobs.

The following screenshot shows the data from the Hudi table where all three concurrent writers inserted data into different partitions, which is illustrated by different colors. All the AWS Glue jobs were run in US Central Time zone (UTC -5). The _hoodie_commit_time is in UTC.

The first two results highlighted in blue corresponds to the AWS Glue job CATEGORY_ID = 3, which had the start time of 09/27/2022 21:23:39 US CST (09/28/2022 02:23:39 UTC).

The next two results highlighted in green correspond to the AWS Glue job CATEGORY_ID = 8, which had the start time of 09/27/2022 21:23:50 US CST (09/28/2022 02:23:50 UTC).

The last two results highlighted in green correspond to the AWS Glue job CATEGORY_ID = 5, which had the start time of 09/27/2022 21:23:44 US CST (09/28/2022 02:23:44 UTC).

The sample data from the Hudi table has _hoodie_commit_time values corresponding to the AWS Glue job run times.

As you can see, we were able to load data into multiple partitions of the same Hudi table concurrently using Hudi optimistic concurrency.

Key findings

As the results show, bulk_insert with GLOBAL_SORT scales well for loading TBs of data in the initial load process. This option is recommended for use cases that require frequent changes after a large migration. Also, when query performance is critical in your use case, we recommend the GLOBAL_SORT option because of the smaller number of files being created with this option.

PARTITION_SORT has better performance for data load compared to GLOBAL_SORT, but it generates a significantly larger number of files, which negatively impacts query performance. You can use this option when the query involves a lot of joins between partitioned tables on record key columns.

The NONE option doesn’t sort the data, but it’s useful when you need the fastest initial load time and requires minimal updates, with the added capability of supporting record changes.

Clean up

When you’re done with this exercise, complete the following steps to delete your resources and stop incurring costs:

On the Amazon S3 console, empty the buckets created by the CloudFormation stack.
On the CloudFormation console, select your stack and choose Delete.

This cleans up all the resources created by the stack.

Conclusion

In this post, we covered some of the Hudi concepts that are important for design decisions. We used AWS Glue and the TPC-DS dataset to collect the results of different use cases for comparison. You can learn from the use cases covered in this post to make the key design decisions, particularly when you’re at the early stage of Apache Hudi adoption. You can go through the steps in this post to start a proof of concept using AWS Glue and Apache Hudi.

References

Appendix

The following table summarizes the Hudi configuration parameters that are needed.

Configuration	Value	Description	Required
hoodie.write. concurrency.mode	`optimistic_concurrency_control`	Property to turn on optimistic concurrency control.	Yes
hoodie.cleaner. policy.failed.writes	`LAZY`	Property to turn on optimistic concurrency control.	Yes
hoodie.write. lock.provider	`org.apache. hudi.client. transaction.lock. DynamoDBBasedLockProvider`	Lock provider implementation to use.	Yes
hoodie.write. lock.dynamodb.table	`<String>`	The DynamoDB table name to use for acquiring locks. If the table doesn’t exist, it will be created. You can use the same table across all your Hudi jobs operating on the same or different tables.	Yes
hoodie.write. lock.dynamodb.partition_key	`<String>`	The string value to be used for the locks table partition key attribute. It must be a string that uniquely identifies a Hudi table, such as the Hudi table name.	Yes: ‘tablename’
hoodie.write. lock.dynamodb.region	`<String>`	The AWS Region in which the DynamoDB locks table exists, or must be created.	Yes: Default: us-east-1
hoodie.write. lock.dynamodb.billing_mode	`<String>`	The DynamoDB billing mode to be used for the locks table while creating. If the table already exists, then this doesn’t have an effect.	Yes: Default PAY_PER_REQUEST
hoodie.write. lock.dynamodb.endpoint_url	`<String>`	The DynamoDB URL for the Region where you’re creating the table.	Yes: dynamodb.us-east-1.amazonaws.com
hoodie.write. lock.dynamodb.read_capacity	`<Integer>`	The DynamoDB read capacity to be used for the locks table while creating. If the table already exists, then this doesn’t have an effect.	No: Default 20
hoodie.write. lock.dynamodb. write_capacity	`<Integer>`	The DynamoDB write capacity to be used for the locks table while creating. If the table already exists, then this doesn’t have an effect.	No: Default 10

About the Authors

Amit Maindola is a Data Architect focused on big data and analytics at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.

Srinivas Kandi is a Data Architect with focus on data lake and analytics at Amazon Web Services. He helps customers to deploy data analytics solutions in AWS to enable them with prescriptive and predictive analytics.

Mitesh Patel is a Principal Solutions Architect at AWS. His main area of depth is application and data modernization. He helps customers to build scalable, secure and cost effective solutions in AWS.

Analyze Amazon Cognito advanced security intelligence to improve visibility and protection

2022-10-17 Diana Alvarado

Post Syndicated from Diana Alvarado original https://aws.amazon.com/blogs/security/analyze-amazon-cognito-advanced-security-intelligence-to-improve-visibility-and-protection/

As your organization looks to improve your security posture and practices, early detection and prevention of unauthorized activity quickly becomes one of your main priorities. The behaviors associated with unauthorized activity commonly follow patterns that you can analyze in order to create specific mitigations or feed data into your security monitoring systems.

This post shows you how you can analyze security intelligence from Amazon Cognito advanced security features logs by using AWS native services. You can use the intelligence data provided by the logs to increase your visibility into sign-in and sign-up activities from users, this can help you with monitoring, decision making, and to feed other security services in your organization, such as a web application firewall or security information and event management (SIEM) tool. The data can also enrich available security feeds like fraud detection systems, increasing protection for the workloads that you run on AWS.

Amazon Cognito advanced security features overview

Amazon Cognito provides authentication, authorization, and user management for your web and mobile apps. Your users can sign in to apps directly with a user name and password, or through a third party such as social providers or standard enterprise providers through SAML 2.0/OpenID Connect (OIDC). Amazon Cognito includes additional protections for users that you manage in Amazon Cognito user pools. In particular, Amazon Cognito can add risk-based adaptive authentication and also flag the use of compromised credentials. For more information, see Checking for compromised credentials in the Amazon Cognito Developer Guide.

With adaptive authentication, Amazon Cognito examines each user pool sign-in attempt and generates a risk score for how likely the sign-in request is from an unauthorized user. Amazon Cognito examines a number of factors, including whether the user has used the same device before or has signed in from the same location or IP address. A detected risk is rated as low, medium, or high, and you can determine what actions should be taken at each risk level. You can choose to allow or block the request, require a second authentication factor, or notify the user of the risk by email. Security teams and administrators can also submit feedback on the risk through the API, and users can submit feedback by using a link that is sent to the user’s email. This feedback can improve the risk calculation for future attempts.

To add advanced security features to your existing Amazon Cognito configuration, you can get started by using the steps for Adding advanced security to a user pool in the Amazon Cognito Developer Guide. Note that there is an additional charge for advanced security features, as described on our pricing page. These features are applicable only to native Amazon Cognito users; they aren’t applicable to federated users who sign in with an external provider.

Solution architecture

Figure 1: Solution architecture

Figure 1 shows the high-level architecture for the advanced security solution. When an Amazon Cognito sign-in event is recorded by AWS CloudTrail, the solution uses an Amazon EventBridge rule to send the event to an Amazon Simple Queue Service (Amazon SQS) queue and batch it, to then be processed by an AWS Lambda function. The Lambda function uses the event information to pull the sign-in security information and send it as logs to an Amazon Simple Storage Service (Amazon S3) bucket and Amazon CloudWatch Logs.

Prerequisites and considerations for this solution

This solution assumes that you are using Amazon Cognito with advanced security features already enabled, the solution does not create a user pool and does not activate the advanced security features on an existing one.

The following list describes some limitations that you should be aware of for this solution:

This solution does not apply to events in the hosted UI, but the same architecture can be adapted for that environment, with some changes to the events processor.
The Amazon Cognito advanced security features support only native users. This solution is not applicable to federated users.
The admin API used in this solution has a default rate limit of 30 requests per second (RPS). If you have a higher rate of authentication attempts, this API call might be throttled and you will need to implement a re-try pattern to confirm that your requests are processed.

Implement the solution

You can deploy the solution automatically by using the following AWS CloudFormation template.

Choose the following Launch Stack button to launch a CloudFormation stack in your account and deploy the solution.

You’ll be redirected to the CloudFormation service in the US East (N. Virginia) Region, which is the default AWS Region, to deploy this solution. You can change the Region to align it to where your Cognito User Pool is running.

This template will create multiple cloud resources including, but not limited to, the following:

An EventBridge rule for sending the Amazon Cognito events
An Amazon SQS queue for sending the events to Lambda
A Lambda function for getting the advanced security information based on the authentication events from CloudTrail
An S3 bucket to store the logs

In the wizard, you’ll be asked to modify or provide one parameter, the existing Cognito user pool ID. You can get this value from the Amazon Cognito console or the Cognito API.

Now, let’s break down each component of the solution in detail.

Sending the authentication events from CloudTrail to Lambda

Cognito advanced security features supports the CloudTrail events: SignUp, ConfirmSignUp, ForgotPassword, ResendConfirmationCode, InitiateAuth and RespondToAuthChallenge. This solution will focus on the sign-in event InitiateAuth as an example.

The solution creates an EventBridge rule that will run when an event is identified in CloudTrail and send the event to an SQS queue. This is useful so that events can be batched up and decoupled for Lambda to process.

The EventBridge rule uses Amazon SQS as a target. The queue is created by the solution and uses the default settings, with the exception that Receive message wait time is set to 20 seconds for long polling. For more information about long polling and how to manually set up an SQS queue, see Consuming messages using long polling in the Amazon SQS Developer Guide.

When the SQS queue receives the messages from EventBridge, these are sent to Lambda for processing. Let’s now focus on understanding how this information is processed by the Lambda function.

Using Lambda to process Amazon Cognito advanced security features information

In order to get the advanced security features evaluation information, you need authentication details that can only be obtained by using the Amazon Cognito identity provider (IdP) API call admin_list_user_auth_events. This API call requires a username to fetch all the authentication event details for a specific user. For security reasons, the username is not logged in CloudTrail and must be obtained by using other event information.

You can use the Lambda function in the sample solution to get this information. It’s composed of three main sequential actions:

The Lambda function gets the sub identifiers from the authentication events recorded by CloudTrail.
Each sub identifier is used to get the user name through an API call to list_users.
3. The sample function retrieves the last five authentication event details from advanced security features for each of these users by using the admin_list_user_auth_events API call. You can modify the function to retrieve a different number of events, or use other criteria such as a timestamp or a specific time period.

Getting the user name information from a CloudTrail event

The following sample authentication event shows a sub identifier in the CloudTrail event information, shown as sub under additionalEventData. With this sub identifier, you can use the ListUsers API call from the Cognito IdP SDK to get the user name details.

{
"eventVersion": "1.XX",
"userIdentity": {
"type": "Unknown",
"principalId": "Anonymous"
},
"eventTime": "2022-01-01T11:11:11Z",
"eventSource": "cognito-idp.amazonaws.com",
"eventName": "InitiateAuth",
"awsRegion": "us-east-1",
"sourceIPAddress": "xx.xx.xx.xx",
"userAgent": "Mozilla/5.0 (xxxx)",
"requestParameters": {
"authFlow": "USER_SRP_AUTH",
"authParameters": "HIDDEN_DUE_TO_SECURITY_REASONS",
"clientMetadata": {},
"clientId": "iiiiiiiii"
},
"responseElements": {
"challengeName": "PASSWORD_VERIFIER",
"challengeParameters": {
"SALT": "HIDDEN_DUE_TO_SECURITY_REASONS",
"SECRET_BLOCK": "HIDDEN_DUE_TO_SECURITY_REASONS",
"USER_ID_FOR_SRP": "HIDDEN_DUE_TO_SECURITY_REASONS",
"USERNAME": "HIDDEN_DUE_TO_SECURITY_REASONS",
"SRP_B": "HIDDEN_DUE_TO_SECURITY_REASONS"
}
},
"additionalEventData": {
"sub": "11110b4c-1f4264cd111"
},
"requestID": "xxxxxxxx",
"eventID": "xxxxxxxxxx",
"readOnly": false,
"eventType": "AwsApiCall",
"managementEvent": true,
"recipientAccountId": "xxxxxxxxxxxxx",
"eventCategory": "Management"
}

Listing authentication events information

After the Lambda function obtains the username, it can then use the Cognito IdP API call admin_list_user_auth_events to get the advanced security feature risk evaluation information for each of the authentication events for that user. Let’s look into the details of that evaluation.

The authentication event information from Amazon Cognito advanced security provides information for each of the categories evaluated and logs the results. Those results can then be used to decide whether the authentication attempt information is useful for the security team to be notified or take action. It’s recommended that you limit the number of events returned, in order to keep performance optimized.

The following sample event shows some of the risk information provided by advanced security features; the options for the response syntax can be found in the CognitoIdentityProvider API documentation.

}
]
at the bottom, so
"AuthEvents": [
{
"EventId": "1111111”,
"EventType": "SignIn",
"CreationDate": 111111.111,
"EventResponse": "Pass",
"EventRisk": {
"RiskDecision": "NoRisk",
"CompromisedCredentialsDetected": false
},
"ChallengeResponses": [
{
"ChallengeName": "Password",
"ChallengeResponse": "Success"
}
],
"EventContextData": {
"IpAddress": "72.xx.xx.xx",
"DeviceName": "Firefox xx
"City": "Axxx",
"Country": "United States"
}
}
]

The event information that is returned includes the details that are highlighted in this sample event, such as CompromisedCredentialsDetected, RiskDecision, and RiskLevel, which you can evaluate to decide whether the information can be used to enrich other security monitoring services.

Logging the authentication events information

You can use a Lambda extensions layer to send logs to an S3 bucket. Lambda still sends logs to Amazon CloudWatch Logs, but you can disable this activity by removing the required permissions to CloudWatch on the Lambda execution role. For more details on how to set this up, see Using AWS Lambda extensions to send logs to custom destinations.

Figure 2 shows an example of a log sent by Lambda. It includes execution information that is logged by the extension, as well as the information returned from the authentication evaluation by advanced security features.

Figure 2: Sample log information sent to S3

Note that the detailed authentication information in the Lambda execution log is the same as the preceding sample event. You can further enhance the information provided by the Lambda function by modifying the function code and logging more information during the execution, or by filtering the logs and focusing only on high-risk or compromised login attempts.

After the logs are in the S3 bucket, different applications and tools can use this information to perform automated security actions and configuration updates or provide further visibility. You can query the data from Amazon S3 by using Amazon Athena, feed the data to other services such as Amazon Fraud Detector as described in this post, mine the data by using artificial intelligence/machine learning (AI/ML) managed tools like AWS Lookout for Metrics, or enhance visibility with AWS WAF.

Sample scenarios

You can start to gain insights into the security information provided by this solution in an existing environment by querying and visualizing the log data directly by using CloudWatch Logs Insights. For detailed information about how you can use CloudWatch Logs Insights with Lambda logs, see the blog post Operating Lambda: Using CloudWatch Logs Insights.

The CloudFormation template deploys the CloudWatch Logs Insights queries. You can view the queries for the sample solution in the Amazon CloudWatch console, under Queries.

To access the queries in the CloudWatch console

In the CloudWatch console, under Logs, choose Insights.
Choose Select log group(s). In the drop-drown list, select the Lambda log group.
The query box should show the pre-created query. Choose Run query. You should then see the query results in the bottom-right panel.
(Optional) Choose Add to dashboard to add the widget to a dashboard.

CloudWatch Logs Insights discovers the fields in the auth event log automatically. As shown in Figure 3, you can see the available fields in the right-hand side Discovered fields pane, which includes the Amazon Cognito information in the event.

Figure 3: The fields available in CloudWatch Logs Insights

The first query, shown in the following code snippet, will help you get a view of the number of requests per IP, where the advanced security features have determined the risk decision as Account Takeover and the CompromisedCredentialsDetected as true.

fields @message
| filter @message like /INFO/
| filter AuthEvents.0.EventType like 'SignIn'
| filter AuthEvents.0.EventRisk.RiskDecision like "AccountTakeover" and 
AuthEvents.0.EventRisk.CompromisedCredentialsDetected =! "false"
| stats count(*) as RequestsperIP by AuthEvents.2.EventContextData.IpAddress as IP
| sort desc

You can view the results of the query as a table or graph, as shown in Figure 4.

Figure 4: Sample query results for CompromisedCredentialsDetected

Using the same approach and the convenient access to the fields for query, you can explore another use case, using the following query, to view the number of requests per IP for each type of event (SignIn, SignUp, and forgot password) where the risk level was high.

fields @message
| filter @message like /INFO/
| filter AuthEvents.0.EventRisk.RiskLevel like "High"
| stats count(*) as RequestsperIP by AuthEvents.0.EventContextData.IpAddress as IP, 
AuthEvents.0.EventType as EventType
| sort desc

Figure 5 shows the results for this EventType query.

Figure 5: The sample results for the EventType query

In the final sample scenario, you can look at event context data and query for the source of the events for which the risk level was high.

fields @message
| filter @message like /INFO/
| filter AuthEvents.0.EventRisk.RiskLevel like 'High'
| stats count(*) as RequestsperCountry by AuthEvents.0.EventContextData.Country as Country
| sort desc

Figure 6 shows the results for this RiskLevel query.

Figure 6: Sample results for the RiskLevel query

As you can see, there are many ways to mix and match the filters to extract deep insights, depending on your specific needs. You can use these examples as a base to build your own queries.

Conclusion

In this post, you learned how to use security intelligence information provided by Amazon Cognito through its advanced security features to improve your security posture and practices. You used an advanced security solution to retrieve valuable authentication information using CloudTrail logs as a source and a Lambda function to process the events, send this evaluation information in the form of a log to CloudWatch Logs and S3 for use as an additional security feed for wider organizational monitoring and visibility. In a set of sample use cases, you explored how to use CloudWatch Logs Insights to quickly and conveniently access this information, aggregate it, gain deep insights and use it to take action.

To learn more, see the blog post How to Use New Advanced Security Features for Amazon Cognito User Pools.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

Adding approval notifications to EC2 Image Builder before sharing AMIs

2022-10-14 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/adding-approval-notifications-to-ec2-image-builder-before-sharing-amis-2/

This blog post is written by, Glenn Chia Jin Wee, Associate Cloud Architect, and Randall Han, Professional Services.

You may be required to manually validate the Amazon Machine Image (AMI) built from an Amazon Elastic Compute Cloud (Amazon EC2) Image Builder pipeline before sharing this AMI to other AWS accounts or to an AWS organization. Currently, Image Builder provides an end-to-end pipeline that automatically shares AMIs after they’ve been built.

In this post, we will walk through the steps to enable approval notifications before AMIs are shared with other AWS accounts. Image Builder supports automated image testing using test components. The recommended best practice is to automate test steps, however situations can arise where test steps become either challenging to automate or internal compliance policies mandate manual checks be conducted prior to distributing images. In such situations, having a manual approval step is useful if you would like to verify the AMI configuration before it is shared to other AWS accounts or an AWS Organization. A manual approval step reduces the potential for sharing an incorrectly configured AMI with other teams which can lead to downstream issues. This solution sends an email with a link to approve or reject the AMI. Users approve the AMI after they’ve verified that it is built according to specifications. Upon approving the AMI, the solution automatically shares it with the specified AWS accounts.

Overview

In this solution, an Image Builder Pipeline is run that builds a Golden AMI in Account A. After the AMI is built, Image Builder publishes data about the AMI to an Amazon Simple Notification Service (Amazon SNS)
The SNS Topic passes the data to an AWS Lambda function that subscribes to it.
The Lambda function that subscribes to this topic retrieves the data, formats it, and then starts an SSM Automation, passing it the AMI Name and ID.
The first step of the SSM Automation is a manual approval step. The SSM Automation first publishes to an SNS Topic that has an email subscription with the Approver’s email. The approver will receive the email with a URL that they can click to approve the step.
The approval step defines a specific AWS Identity and Access Management (IAM) Role as an approver. This role has the minimum required permissions to approve the manual approval step. After performing manual tests on the Golden AMI, the Approver principal will assume this role.
After assuming this role, the approver will click on the approval link that was sent via email. After approving the step, an AWS Lambda Function is triggered.
This Lambda Function shares the Golden AMI with Account B and sends an email notifying the Target Account Recipients that the AMI has been shared.

Prerequisites

For this walkthrough, you will need the following:

Two AWS accounts – one to host the solution resources, and the second which receives the shared Golden AMI.
- In the account that hosts the solution, prepare an AWS Identity and Access Management (IAM) principal with the sts:AssumeRole permission. This principal must assume the IAM Role that is listed as an approver in the Systems Manager approval step. The ARN of this IAM principal is used in the AWS CloudFormation Approver parameter, This ARN is added to the trust policy of approval IAM Role.
- In addition, in the account hosting the solution, ensure that the IAM principal deploying the CloudFormation template has the required permissions to create the resources in the stack.
A new Amazon Virtual Private Cloud (Amazon VPC) will be created from the stack. Make sure that you have fewer than five VPCs in the selected Region.

Walkthrough

In this section, we will guide you through the steps required to deploy the Image Builder solution. The solution is deployed with CloudFormation.

In this scenario, we deploy the solution within the approver’s account. The approval email will be sent to a predefined email address for manual approval, before the newly created AMI is shared to target accounts.

The approver first assumes the approval IAM Role and then selects the approval link. This leads to the Systems Manager approval page. Upon approval, an email notification will be sent to the predefined target account email address, notifying the relevant stakeholders that the AMI has been successfully shared.

The high-level steps we will follow are:

In Account A, deploy the provided AWS CloudFormation template. This includes an example Image Builder Pipeline, Amazon SNS topics, Lambda functions, and an SSM Automation Document.
Approve the SNS subscription from your supplied email address.
Run the pipeline from the Amazon EC2 Image Builder Console.
[Optional] To conduct manual tests, launch an Amazon EC2 instance from the built AMI after the pipeline runs.
An email will be sent to you with options to approve or reject the step. Ensure that you have assumed the IAM Role that is the approver before clicking the approval link that leads to the SSM console approval page.
Upon approving the step, an AWS Lambda function shares the AMI to the Account B and also sends an email to the target account email recipients notifying them that the AMI has been shared.
Log in to Account B and verify that the AMI has been shared.

Step 1: Deploy the AWS CloudFormation template

1. The CloudFormation template, template.yaml that deploys the solution can also found at this GitHub repository. Follow the instructions at the repository to deploy the stack.

Step 2: Verify your email address

After running the deployment, you will receive an email prompting you to confirm the Subscription at the approver email address. Choose Confirm subscription.

This leads to the following screen, which shows that your subscription is confirmed.

Repeat the previous 2 steps for the target email address.

Step 3: Run the pipeline from the Image Builder console

In the Image Builder console, under Image pipelines, select the checkbox next to the Pipeline created, choose Actions, and select Run pipeline.

Note: The pipeline takes approximately 20 – 30 minutes to complete.

Step 4: [Optional] Launch an Amazon EC2 instance from the built AMI

If you have a requirement to manually validate the AMI before sharing it with other accounts or to the AWS organization an approver will launch an Amazon EC2 instance from the built AMI and conduct manual tests on the EC2 instance to make sure it is functional.

In the Amazon EC2 console, under Images, choose AMIs. Validate that the AMI is created.

Follow AWS docs: Launching an EC2 instances from a custom AMI for steps on how to launch an Amazon EC2 instance from the AMI.

Step 5: Select the approval URL in the email sent

When the pipeline is run successfully, you will receive another email with a URL to approve the AMI.

Before clicking on the Approve link, you must assume the IAM Role that is set as an approver for the Systems Manager step.
In the CloudFormation console, choose the stack that was deployed.

4. Choose Outputs and copy the IAM Role name.

5. While logged in as the IAM Principal that has permissions to assume the approval IAM Role, follow the instructions at AWS IAM documentation for switching a role using the console to assume the approval role.
In the Switch Role page, in Role paste the name of the IAM Role that you copied in the previous step.

Note: This IAM Role was deployed with minimum permissions. Hence, seeing warning messages in the console is expected after assuming this role.

6. Now in the approval email, select the Approve URL. This leads to the Systems Manager console. Choose Submit.

7. After approving the manual step, the second step is executed, which shares the AMI to the target account.

Step 6: Verify that the AMI is shared to Account B

Log in to Account B.
In the Amazon EC2 console, under Images, choose AMIs. Then, in the dropdown, choose Private images. Validate that the AMI is shared.

Verify that a success email notification was sent to the target account email address provided.

Clean up

This section provides the necessary information for deleting various resources created as part of this post.

Deregister the AMIs that were created and shared.
1. Log in to Account A and follow the steps at AWS documentation: Deregister your Linux AMI.
Delete the CloudFormation stack. For instructions, refer to Deleting a stack on the AWS CloudFormation console.

Conclusion

In this post, we explained how to enable approval notifications for an Image Builder pipeline before AMIs are shared to other accounts. This solution can be extended to share to more than one AWS account or even to an AWS organization. With this solution, you will be notified when new golden images are created, allowing you to verify the accuracy of their configuration before sharing them to for wider use. This reduces the possibility of sharing AMIs with misconfigurations that the written tests may not have identified.

We invite you to experiment with different AMIs created using Image Builder, and with different Image Builder components. Check out this GitHub repository for various examples that use Image Builder. Also check out this blog on Image builder integrations with EC2 Auto Scaling Instance Refresh. Let us know your questions and findings in the comments, and have fun!

Fine-tuning Operations at Slice using AWS DevOps Guru

2022-10-12 Adnan Bilwani

Post Syndicated from Adnan Bilwani original https://aws.amazon.com/blogs/devops/fine-tuning-operations-at-slice-using-aws-devops-guru/

This guest post was authored by Sapan Jain, DevOps Engineer at Slice, and edited by Sobhan Archakam and Adnan Bilwani, at AWS.

Slice empowers over 18,000 independent pizzerias with the modern tools that have grown the major restaurant chains. By uniting these small businesses with specialized technology, marketing, data insights, and shared services, Slice enables them to serve their digitally-minded customers and move away from third-party apps. Using Amazon DevOps Guru, Slice is able to fine-tune their operations to better support these customers.

Serial tech entrepreneur Ilir Sela started Slice to modernize and support his family’s New York City pizzerias. Today, the company partners with restaurants in 3,000 cities and all 50 states, forming the nation’s largest pizza network. For more information, visit slicelife.com.

Slice’s challenge

At Slice, we manage a wide variety of systems, services, and platforms, all with varying levels of complexity. Observability, monitoring, and log aggregation are things we excel at, and they’re always critical for our platform engineering team. However, deriving insights from this data still requires some manual investigation, particularly when dealing with operational anomalies and/or misconfigurations.

To gain automated insights into our services and resources, Slice conducted a proof-of-concept utilizing Amazon DevOps Guru to analyze a small selection of AWS resources. Amazon DevOps Guru identified potential issues in our environment, resulting in actionable insights (ultimately leading to remediation). As a result of this analysis, we enabled Amazon DevOps Guru account-wide, thereby leading to numerous insights into our production environment.

Insights with Amazon DevOps Guru

After we configured Amazon DevOps Guru to begin its account-wide analysis, we left the tool alone to begin the process of collecting and analyzing data. We immediately began seeing some actionable insights for various production AWS resources, some of which are highlighted in the following section:

Amazon DynamoDB Point-in-time recovery

Amazon DynamoDB offers a point-in-time recovery (PITR) feature that provides continuous backups of your DynamoDB data for 35 days to help you protect against accidental write or deletes. If enabled, this lets you restore your respective table to a previous state. Amazon DevOps Guru identified several tables in our environment that had PITR disabled, along with a corresponding Recommendation.

The graphic shows proactive insights for the last 1 month. The one insight shown is 'Dynamo Table Point in Time Recovery not enabled' with a status of OnGoing and a severity of low.

Figure 1. The graphic shows proactive insights for the last 1 month. The one insight shown is ‘Dynamo Table Point in Time Recovery not enabled’ with a status of OnGoing and a severity of low.

Elasticache anomalous evictions

Amazon Elasticache for Redis is used by a handful of our services to cache any relevant application data. Amazon DevOps Guru identified that one of our instances was exhibiting anomalous behavior regarding its cache eviction rate. Essentially, due to the memory pressure of the instance, the eviction rate of cache entries began to increase. DevOps Guru recommended revisiting the sizing of this instance and scaling it vertically or horizontally, where appropriate.

The graph shows the metric: count of ElastiCache evictions plotted for the time period Jul 3, 20:35 to Jul 3, 21:35 UTC. A highlighted section shows that the evictions increased to a peak of 2500 between 21:00 and 21:08. Outside of this interval the evictions are below 500.

Figure 2. The graph shows the metric: count of ElastiCache evictions plotted for the time period Jul 3, 20:35 to Jul 3, 21:35 UTC. A highlighted section shows that the evictions increased to a peak of 2500 between 21:00 and 21:08. Outside of this interval the evictions are below 500

AWS Lambda anomalous errors

We manage a few AWS Lambda functions that all serve different purposes. During the beginning of normal work day, we began to see increased error rates for a particular function resulting in an exception being thrown. DevOps Guru was able to detect the increase in error rates and flag them as anomalous. Although retries in this case wouldn’t have solved the problem, it did increase our visibility into the issue (which was also corroborated by our APM platform).

The graph shows the metric: count of AWS/Lambda errors plotted between 11:00 and 13:30 on Jul 6. The sections between the times 11:23 and 12:15 and at 12:37 and 13:13 UTC are highlighted to show the anomalies.

Figure 3. The graph shows the metric: count of AWS/Lambda errors plotted between 11:00 and 13:30 on Jul 6. The sections between the times 11:23 and 12:15 and at 12:37 and 13:13 UTC are highlighted to show the anomalies

Figure 3. The graph shows the metric: count of AWS/Lambda errors plotted between 11:00 and 13:30 on Jul 6. The sections between the times 11:23 and 12:15 UTC are highlighted to show the anomalies

Conclusion

Amazon DevOps Guru integrated into our environment quickly, with no more additional configuration or setup aside from a few button clicks to enable the service. After reviewing several of the proactive insights that DevOps Guru provided, we could formulate plans of action regarding remediation. One specific case example of this is where DevOps Guru flagged several of our Lambda functions for not containing enough subnets. After triaging the finding, we discovered that we were lacking multi-AZ redundancy for several of those functions. As a result, we could implement a change that maximized our availability of those resources.

With the continuous analysis that DevOps Guru performs, we continue to gain new insights into the resources that we utilize and deploy in our environment. This lets us improve operationally while simultaneously maintaining production stability.

About the author:

Best Practices for Hosting Regulated Gaming Workloads in AWS Local Zones and on AWS Outposts

2022-10-07 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/best-practices-for-hosting-regulated-gaming-workloads-in-aws-local-zones-and-on-aws-outposts/

This blog post is written by Shiv Bhatt, Manthan Raval, and Pawan Matta, who are Senior Solutions Architects with AWS.

Many industries are subject to regulations that are created to protect the interests of the various stakeholders. For some industries, the specific details of the regulatory requirements influence not only the organization’s operations, but also their decisions for adopting new technology. In this post, we highlight the workload residency challenges that you may encounter when you deploy regulated gaming workloads, and how AWS Local Zones and AWS Outposts can help you address those challenges.

Regulated gaming workloads and residency requirements

A regulated gaming workload is a type of workload that’s subject to federal, state, local, or tribal laws related to the regulation of gambling and real money gaming. Examples of these workloads include sports betting, horse racing, casino, poker, lottery, bingo, and fantasy sports. The operators provide gamers with access to these workloads through online and land-based channels, and they’re required to follow various regulations required in their jurisdiction. Some regulations define specific workload residency requirements, and depending on the regulatory agency, the regulations could require that workloads be hosted within a specific city, state, province, or country. For example, in the United States, different state and tribal regulatory agencies dictate whether and where gaming operations are legal in a state, and who can operate. The agencies grant licenses to the operators of regulated gaming workloads, which then govern who can operate within the state, and sometimes, specifically where these workloads can be hosted. In addition, federal legislation can also constrain how regulated gaming workloads can be operated. For example, the United States Federal Wire Act makes it illegal to facilitate bets or wagers on sporting events across state lines. This regulation requires that operators make sure that users who place bets in a specific state are also within the borders of that state.

Benefits of using AWS edge infrastructure with regulated gaming workloads

The use of AWS edge infrastructure, specifically Local Zones and Outposts to host a regulated gaming workload, can help you meet workload residency requirements. You can manage Local Zones and Outposts by using the AWS Management Console or by using control plane API operations, which lets you seamlessly consume compute, storage, and other AWS services.

Local Zones

Local Zones are a type of AWS infrastructure deployment that place compute, storage, database, and other select services closer to large population, industry, and IT centers. Like AWS Regions, Local Zones enable you to innovate more quickly and bring new products to market sooner without having to worry about hardware and data center space procurement, capacity planning, and other forms of undifferentiated heavy-lifting. Local Zones have their own connections to the internet, and support AWS Direct Connect, so that workloads hosted in the Local Zone can serve local end-users with very low-latency communications. Local Zones are by default connected to a parent Region via Amazon’s redundant and high-bandwidth private network. This lets you extend Amazon Virtual Private Cloud (Amazon VPC) in the AWS Region to Local Zones. Furthermore, this provides applications hosted in AWS Local Zones with fast, secure, and seamless access to the broader portfolio of AWS services in the AWS Region. You can see the full list of AWS services supported in Local Zones on the AWS Local Zones features page.

You can start using Local Zones right away by enabling them in your AWS account. There are no setup fees, and as with the AWS Region, you pay only for the services that you use. There are three ways to pay for Amazon Elastic Compute Cloud (Amazon EC2) instances in Local Zones: On-Demand, Savings Plans, and Spot Instances. See the full list of cities where Local Zones are available on the Local Zones locations page.

Outposts

Outposts is a family of fully-managed solutions that deliver AWS infrastructure and services to most customer data center locations for a consistent hybrid experience. For a full list of countries and territories where Outposts is available, see the Outposts rack FAQs and Outposts servers FAQs. Outposts is available in various form factors, from 1U and 2U Outposts servers to 42U Outposts racks, and multiple rack deployments. To learn more about specific configuration options and pricing, see Outposts rack and Outposts servers.

You configure Outposts to work with a specific AWS Region using AWS Direct Connect or an internet connection, which lets you extend Amazon VPC in the AWS Region to Outposts. Like Local Zones, this provides applications hosted on Outposts with fast, secure, and seamless access to the broader portfolio of AWS services in the AWS Region. See the full list of AWS services supported on Outposts rack and on Outposts servers.

Choosing between AWS Regions, Local Zones, and Outposts

When you build and deploy a regulated gaming workload, you must assess the residency requirements carefully to make sure that your workload complies with regulations. As you make your assessment, we recommend that you consider separating your regulated gaming workload into regulated and non-regulated components. For example, for a sports betting workload, the regulated components might include sportsbook operation, and account and wallet management, while non-regulated components might include marketing, the odds engine, and responsible gaming. In describing the following scenarios, it’s assumed that regulated and non-regulated components must be fault-tolerant.

For hosting the non-regulated components of your regulated gaming workload, we recommend that you consider using an AWS Region instead of a Local Zone or Outpost. An AWS Region offers higher availability, larger scale, and a broader selection of AWS services.

For hosting regulated components, the type of AWS infrastructure that you choose will depend on which of the following scenarios applies to your situation:

Scenario one: An AWS Region is available in your jurisdiction and local regulators have approved the use of cloud services for your regulated gaming workload.
Scenario two: An AWS Region isn’t available in your jurisdiction, but a Local Zone is available, and local regulators have approved the use of cloud services for your regulated gaming workload.
Scenario three: An AWS Region or Local Zone isn’t available in your jurisdiction, or local regulators haven’t approved the use of cloud services for your regulated gaming workload, but Outposts is available.

Let’s look at each of these scenarios in detail.

Scenario one: Use an AWS Region for regulated components

When local regulators have approved the use of cloud services for regulated gaming workloads, and an AWS Region is available in your jurisdiction, consider using an AWS Region rather than a Local Zone and Outpost. For example, in the United States, the State of Ohio has announced that it will permit regulated gaming workloads to be deployed in the cloud on infrastructure located within the state when sports betting goes live in January 2023. By using the US East (Ohio) Region, operators in the state don’t need to procure and manage physical infrastructure and data center space. Instead, they can use various compute, storage, database, analytics, and artificial intelligence/machine learning (AI/ML) services that are readily available in the AWS Region. You can host a regulated gaming workload entirely in a single AWS Region, which includes Availability Zones (AZs) – multiple, isolated locations within each AWS Region. By deploying your workload redundantly across at least two AZs, you can help make sure of the high availability, as shown in the following figure.

Scenario two: Use a Local Zone for regulated components

A second scenario might be that local regulators have approved the use of cloud services for regulated gaming workloads, and an AWS Region isn’t available in your jurisdiction, but a Local Zone is available. In this scenario, consider using a Local Zone rather than Outposts. A Local Zone can support more elasticity in a more cost-effective way than Outposts can. However, you might also consider using a Local Zone and Outposts together to increase availability and scalability for regulated components. Let’s consider the State of Illinois, in the United States, which allows regulated gaming workloads to be deployed in the cloud, if workload residency requirements are met. Operators in this state can host regulated components in a Local Zone in Chicago, and they can also use Outposts in their data center in the same state, for high availability and disaster recovery, as shown in the following figure.

Scenario three: Use of Outposts for regulated components

When local regulators haven’t approved the use of cloud services for regulated gaming workloads, or when an AWS Region or Local Zone isn’t available in your jurisdiction, you can still choose to host your regulated gaming workloads on Outposts for a consistent cloud experience, if Outposts is available in your jurisdiction. If you choose to use Outposts, then note that as part of the shared responsibility model, customers are responsible for attesting to physical security and access controls around the Outpost, as well as environmental requirements for the facility, networking, and power. Use of Outposts requires you to procure and manage the data center within the city, state, province, or country boundary (as required by local regulations) that may be suitable to host regulated components, depending on the jurisdiction. Furthermore, you should procure and configure supported network connections between Outposts and the parent AWS Region. During the Outposts ordering process, you should account for the compute and network capacity required to support the peak load and availability design.

For a higher availability level, you should consider procuring and deploying two or more Outposts racks or Outposts servers in a data center. You might also consider deploying redundant network paths between Outposts and the parent AWS Region. However, depending on your business service level agreement (SLA) for regulated gaming workload, you might choose to spread Outposts racks across two or more isolated data centers within the same regulated boundary, as shown in the following figure.

Options to route ingress gaming traffic

You have two options to route ingress gaming traffic coming into your regulated and non-regulated components when you deploy the configurations that we described previously in Scenarios two and three. Your gaming traffic can come through to the AWS Region, or through the Local Zones or Outposts. Note that the benefits that we mentioned previously around selecting the AWS Region for deploying regulated and non-regulated components are the same when you select an ingress route.

Let’s discuss the benefits and trade offs for each of these options.

Option one: Route ingress gaming traffic through an AWS Region

If you choose to route ingress gaming traffic through an AWS Region, your regulated gaming workloads benefit from access to the wide range of tools, services, and capacity available in the AWS Region. For example, native AWS security services, like AWS WAF and AWS Shield, which provide protection against DDoS attacks, are currently only available in AWS Regions. Only traffic that you route into your workload through an AWS Region benefits from these services.

If you route gaming traffic through an AWS Region, and non-regulated components are hosted in an AWS Region, then traffic has a direct path to non-regulated components. In addition, gaming traffic destined to regulated components, hosted in a Local Zone and on Outposts, can be routed through your non-regulated components and a few native AWS services in the AWS Region, as shown in Figure 2.

Option two: Route ingress gaming traffic through a Local Zone or Outposts

Choosing to route ingress gaming traffic through a Local Zone or Outposts requires careful planning to make sure that tools, services, and capacity are available in that jurisdiction, as shown in the following figure. In addition, consider how choosing this route will influence the pillars of the AWS Well-Architected Framework. This route might require deploying and managing most of your non-regulated components in a Local Zone or on Outposts as well, including native AWS services that aren’t available in Local Zones or on Outposts. If you plan to implement this topology, then we recommend that you consider using AWS Partner solutions to replace the native AWS services that aren’t available in Local Zones or Outposts.

Conclusion

If you’re building regulated gaming workloads, then you might have to follow strict workload residency and availability requirements. In this post, we’ve highlighted how Local Zones and Outposts can help you meet these workload residency requirements by bringing AWS services closer to where they’re needed. We also discussed the benefits of using AWS Regions in compliment to the AWS edge infrastructure, and several reliability and cost design considerations.

Although this post provides information to consider when making choices about using AWS for regulated gaming workloads, you’re ultimately responsible for maintaining compliance with the gaming regulations and laws in your jurisdiction. You’re in the best position to determine and maintain ultimate responsibility for determining whether activities are legal, including evaluating the jurisdiction of the activities, how activities are made available, and whether specific technologies or services are required to make sure of compliance with the applicable law. You should always review these regulations and laws before you deploy regulated gaming workloads on AWS.

Common streaming data enrichment patterns in Amazon Kinesis Data Analytics for Apache Flink

2022-10-05 Ali Alemi

Post Syndicated from Ali Alemi original https://aws.amazon.com/blogs/big-data/common-streaming-data-enrichment-patterns-in-amazon-kinesis-data-analytics-for-apache-flink/

Stream data processing allows you to act on data in real time. Real-time data analytics can help you have on-time and optimized responses while improving overall customer experience.

Apache Flink is a distributed computation framework that allows for stateful real-time data processing. It provides a single set of APIs for building batch and streaming jobs, making it easy for developers to work with bounded and unbounded data. Apache Flink provides different levels of abstraction to cover a variety of event processing use cases.

Amazon Kinesis Data Analytics is an AWS service that provides a serverless infrastructure for running Apache Flink applications. This makes it easy for developers to build highly available, fault tolerant, and scalable Apache Flink applications without needing to become an expert in building, configuring, and maintaining Apache Flink clusters on AWS.

Data streaming workloads often require data in the stream to be enriched via external sources (such as databases or other data streams). For example, assume you are receiving coordinates data from a GPS device and need to understand how these coordinates map with physical geographic locations; you need to enrich it with geolocation data. You can use several approaches to enrich your real-time data in Kinesis Data Analytics depending on your use case and Apache Flink abstraction level. Each method has different effects on the throughput, network traffic, and CPU (or memory) utilization. In this post, we cover these approaches and discuss their benefits and drawbacks.

Data enrichment patterns

Data enrichment is a process that appends additional context and enhances the collected data. The additional data often is collected from a variety of sources. The format and the frequency of the data updates could range from once in a month to many times in a second. The following table shows a few examples of different sources, formats, and update frequency.

Data	Format	Update Frequency
IP address ranges by country	CSV	Once a month
Company organization chart	JSON	Twice a year
Machine names by ID	CSV	Once a day
Employee information	Table (Relational database)	A few times a day
Customer information	Table (Non-relational database)	A few times an hour
Customer orders	Table (Relational database)	Many times a second

Based on the use case, your data enrichment application may have different requirements in terms of latency, throughput, or other factors. The remainder of the post dives deeper into different patterns of data enrichment in Kinesis Data Analytics, which are listed in the following table with their key characteristics. You can choose the best pattern based on the trade-off of these characteristics.

Enrichment Pattern	Latency	Throughput	Accuracy if Reference Data Changes	Memory Utilization	Complexity
Pre-load reference data in Apache Flink Task Manager memory	Low	High	Low	High	Low
Partitioned pre-loading of reference data in Apache Flink state	Low	High	Low	Low	Low
Periodic Partitioned pre-loading of reference data in Apache Flink state	Low	High	Medium	Low	Medium
Per-record asynchronous lookup with unordered map	Medium	Medium	High	Low	Low
Per-record asynchronous lookup from an external cache system	Low or Medium (Depending on Cache storage and implementation)	Medium	High	Low	Medium
Enriching streams using the Table API	Low	High	High	Low – Medium (depending on the selected join operator)	Low

Enrich streaming data by pre-loading the reference data

When the reference data is small in size and static in nature (for example, country data including country code and country name), it’s recommended to enrich your streaming data by pre-loading the reference data, which you can do in several ways.

To see the code implementation for pre-loading reference data in various ways, refer to the GitHub repo. Follow the instructions in the GitHub repository to run the code and understand the data model.

Pre-loading of reference data in Apache Flink Task Manager memory

The simplest and also fastest enrichment method is to load the enrichment data into each of the Apache Flink task managers’ on-heap memory. To implement this method, you create a new class by extending the RichFlatMapFunction abstract class. You define a global static variable in your class definition. The variable could be of any type, the only limitation is that it should extend java.io.Serializable—for example, java.util.HashMap. Within the open() method, you define a logic that loads the static data into your defined variable. The open() method is always called first, during the initialization of each task in Apache Flink’s task managers, which makes sure the whole reference data is loaded before the processing begins. You implement your processing logic by overriding the processElement() method. You implement your processing logic and access the reference data by its key from the defined global variable.

The following architecture diagram shows the full reference data load in each task slot of the task manager.

This method has the following benefits:

Easy to implement
Low latency
Can support high throughput

However, it has the following disadvantages:

If the reference data is large in size, the Apache Flink task manager may run out of memory.
Reference data can become stale over a period of time.
Multiple copies of the same reference data are loaded in each task slot of the task manager.
Reference data should be small to fit in the memory allocated to a single task slot. In Kinesis Data Analytics, each Kinesis Processing Unit (KPU) has 4 GB of memory, out of which 3 GB can be used for heap memory. If ParallelismPerKPU in Kinesis Data Analytics is set to 1, one task slot runs in each task manager, and the task slot can use the whole 3 GB of heap memory. If ParallelismPerKPU is set to a value greater than 1, the 3 GB of heap memory is distributed across multiple task slots in the task manager. If you’re deploying Apache Flink in Amazon EMR or in a self-managed mode, you can tune taskmanager.memory.task.heap.size to increase the heap memory of a task manager.

Partitioned pre-loading of reference data in Apache Flink State

In this approach, the reference data is loaded and kept in the Apache Flink state store at the start of the Apache Flink application. To optimize the memory utilization, first the main data stream is divided by a specified field via the keyBy() operator across all task slots. Furthermore, only the portion of the reference data that corresponds to each task slot is loaded in the state store.

This is achieved in Apache Flink by creating the class PartitionPreLoadEnrichmentData, extending the RichFlatMapFunction abstract class. Within the open method, you override the ValueStateDescriptor method to create a state handle. In the referenced example, the descriptor is named locationRefData, the state key type is String, and the value type is Location. In this code, we use ValueState compared to MapState because we only hold the location reference data for a particular key. For example, when we query Amazon S3 to get the location reference data, we query for the specific role and get a particular location as a value.

In Apache Flink, ValueState is used to hold a specific value for a key, whereas MapState is used to hold a combination of key-value pairs.

This technique is useful when you have a large static dataset that is difficult to fit in memory as a whole for each partition.

The following architecture diagram shows the load of reference data for the specific key for each partition of the stream.

For example, our reference data in the sample GitHub code has roles which are mapped to each building. Because the stream is partitioned by roles, only the specific building information per role is required to be loaded for each partition as the reference data.

This method has the following benefits:

Low latency.
Can support high throughput.
Reference data for specific partition is loaded in the keyed state.
In Kinesis Data Analytics, the default state store configured is RocksDB. RocksDB can utilize a significant portion of 1 GB of managed memory and 50 GB of disk space provided by each KPU. This provides enough room for the reference data to grow.

However, it has the following disadvantages:

Reference data can become stale over a period of time

Periodic partitioned pre-loading of reference data in Apache Flink State

This approach is a fine-tune of the previous technique, where each partitioned reference data is reloaded on a periodic basis to refresh the reference data. This is useful if your reference data changes occasionally.

The following architecture diagram shows the periodic load of reference data for the specific key for each partition of the stream.

In this approach, the class PeriodicPerPartitionLoadEnrichmentData is created, extending the KeyedProcessFunction class. Similar to the previous pattern, in the context of the GitHub example, ValueState is recommended here because each partition only loads a single value for the key. In the same way as mentioned earlier, in the open method, you define the ValueStateDescriptor to handle the value state and define a runtime context to access the state.

Within the processElement method, load the value state and attach the reference data (in the referenced GitHub example, buildingNo to the customer data). Also register a timer service to be invoked when the processing time passes the given time. In the sample code, the timer service is scheduled to be invoked periodically (for example, every 60 seconds). In the onTimer method, update the state by making a call to reload the reference data for the specific role.

This method has the following benefits:

Low latency.
Can support high throughput.
Reference data for specific partitions is loaded in the keyed state.
Reference data is refreshed periodically.
In Kinesis Data Analytics, the default state store configured is RocksDB. Also, 50 GB of disk space provided by each KPU. This provides enough room for the reference data to grow.

However, it has the following disadvantages:

If the reference data changes frequently, the application still has stale data depending on how frequently the state is reloaded
The application can face load spikes during reload of reference data

Enrich streaming data using per-record lookup

Although pre-loading of reference data provides low latency and high throughput, it may not be suitable for certain types of workloads, such as the following:

Reference data updates with high frequency
Apache Flink needs to make an external call to compute the business logic
Accuracy of the output is important and the application shouldn’t use stale data

Normally, for these types of use cases, developers trade-off high throughput and low latency for data accuracy. In this section, you learn about a few of common implementations for per-record data enrichment and their benefits and disadvantages.

Per-record asynchronous lookup with unordered map

In a synchronous per-record lookup implementation, the Apache Flink application has to wait until it receives the response after sending every request. This causes the processor to stay idle for a significant period of processing time. Instead, the application can send a request for other elements in the stream while it waits for the response for the first element. This way, the wait time is amortized across multiple requests and therefore it increases the process throughput. Apache Flink provides asynchronous I/O for external data access. While using this pattern, you have to decide between unorderedWait (where it emits the result to the next operator as soon as the response is received, disregarding the order of the element on the stream) and orderedWait (where it waits until all inflight I/O operations complete, then sends the results to the next operator in the same order as original elements were placed on the stream). Usually, when downstream consumers disregard the order of the elements in the stream, unorderedWait provides better throughput and less idle time. Visit Enrich your data stream asynchronously using Kinesis Data Analytics for Apache Flink to learn more about this pattern.

The following architecture diagram shows how an Apache Flink application on Kinesis Data Analytics does asynchronous calls to an external database engine (for example Amazon DynamoDB) for every event in the main stream.

This method has the following benefits:

Still reasonably simple and easy to implement
Reads the most up-to-date reference data

However, it has the following disadvantages:

It generates a heavy read load for the external system (for example, a database engine or an external API) that hosts the reference data
Overall, it might not be suitable for systems that require high throughput with low latency

Per-record asynchronous lookup from an external cache system

A way to enhance the previous pattern is to use a cache system to enhance the read time for every lookup I/O call. You can use Amazon ElastiCache for caching, which accelerates application and database performance, or as a primary data store for use cases that don’t require durability like session stores, gaming leaderboards, streaming, and analytics. ElastiCache is compatible with Redis and Memcached.

For this pattern to work, you must implement a caching pattern for populating data in the cache storage. You can choose between a proactive or reactive approach depending your application objectives and latency requirements. For more information, refer to Caching patterns.

The following architecture diagram shows how an Apache Flink application calls to read the reference data from an external cache storage (for example, Amazon ElastiCache for Redis). Data changes must be replicated from the main database (for example, Amazon Aurora) to the cache storage by implementing one of the caching patterns.

Implementation for this data enrichment pattern is similar to the per-record asynchronous lookup pattern; the only difference is that the Apache Flink application makes a connection to the cache storage, instead of connecting to the primary database.

This method has the following benefits:

Better throughput because caching can accelerate application and database performance
Protects the primary data source from the read traffic created by the stream processing application
Can provide lower read latency for every lookup call
Overall, might not be suitable for medium to high throughput systems that want to improve data freshness

However, it has the following disadvantages:

Additional complexity of implementing a cache pattern for populating and syncing the data between the primary database and the cache storage
There is a chance for the Apache Flink stream processing application to read stale reference data depending on what caching pattern is implemented
Depending on the chosen cache pattern (proactive or reactive), the response time for each enrichment I/O may differ, therefore the overall processing time of the stream could be unpredictable

Alternatively, you can avoid these complexities by using the Apache Flink JDBC connector for Flink SQL APIs. We discuss enrichment stream data via Flink SQL APIs in more detail later in this post.

Enrich stream data via another stream

In this pattern, the data in the main stream is enriched with the reference data in another data stream. This pattern is good for use cases in which the reference data is updated frequently and it’s possible to perform change data capture (CDC) and publish the events to a data streaming service such as Apache Kafka or Amazon Kinesis Data Streams. This pattern is useful in the following use cases, for example:

Customer purchase orders are published to a Kinesis data stream, and then join with customer billing information in a DynamoDB stream
Data events captured from IoT devices should enrich with reference data in a table in Amazon Relational Database Service (Amazon RDS)
Network log events should enrich with the machine name on the source (and the destination) IP addresses

The following architecture diagram shows how an Apache Flink application on Kinesis Data Analytics joins data in the main stream with the CDC data in a DynamoDB stream.

To enrich streaming data from another stream, we use a common stream to stream join patterns, which we explain in the following sections.

Enrich streams using the Table API

Apache Flink Table APIs provide higher abstraction for working with data events. With Table APIs, you can define your data stream as a table and attach the data schema to it.

In this pattern, you define tables for each data stream and then join those tables to achieve the data enrichment goals. Apache Flink Table APIs support different types of join conditions, like inner join and outer join. However, you want to avoid those if you’re dealing with unbounded streams because those are resource intensive. To limit the resource utilization and run joins effectively, you should use either interval or temporal joins. An interval join requires one equi-join predicate and a join condition that bounds the time on both sides. To better understand how to implement an interval join, refer to Get started with Apache Flink SQL APIs in Kinesis Data Analytics Studio.

Compared to interval joins, temporal table joins don’t work with a time period within which different versions of a record are kept. Records from the main stream are always joined with the corresponding version of the reference data at the time specified by the watermark. Therefore, fewer versions of the reference data remain in the state.

Note that the reference data may or may not have a time element associated with it. If it doesn’t, you may need to add a processing time element for the join with the time-based stream.

In the following example code snippet, the update_time column is added to the currency_rates reference table from the change data capture metadata such as Debezium. Furthermore, it’s used to define a watermark strategy for the table.

CREATE TABLE currency_rates (
    currency STRING,
    conversion_rate DECIMAL(32, 2),
    update_time TIMESTAMP(3) METADATA FROM `values.source.timestamp` VIRTUAL,
        WATERMARK FOR update_time AS update_time,
    PRIMARY KEY(currency) NOT ENFORCED
) WITH (
   'connector' = 'kafka',
   'value.format' = 'debezium-json',
   /* ... */
);

This method has the following benefits:

Easy to implement
Low latency
Can support high throughput when reference data is a data stream

SQL APIs provide higher abstractions over how the data is processed. For more complex logic around how the join operator should process, we recommend you always start with SQL APIs first and use DataStream APIs if you really need to.

Conclusion

In this post, we demonstrated different data enrichment patterns in Kinesis Data Analytics. You can use these patterns and find the one that addresses your needs and quickly develop a stream processing application.

For further reading on Kinesis Data Analytics, visit the official product page.

About the Authors

Ali Alemi is a Streaming Specialist Solutions Architect at AWS. Ali advises AWS customers with architectural best practices and helps them design real-time analytics data systems that are reliable, secure, efficient, and cost-effective. He works backward from customers’ use cases and designs data solutions to solve their business problems. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journey and migration to the cloud.

Subham Rakshit is a Streaming Specialist Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build search and streaming data platforms that help them achieve their business objective. Outside of work, he enjoys spending time solving jigsaw puzzles with his daughter.

Dr. Sam Mokhtari is a Senior Solutions Architect in AWS. His main area of depth is data and analytics, and he has published more than 30 influential articles in this field. He is also a respected data and analytics advisor who led several large-scale implementation projects across different industries, including energy, health, telecom, and transport.

Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment

2022-09-21 Lotfi Mouhib

Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/design-considerations-for-amazon-emr-on-eks-in-a-multi-tenant-amazon-eks-environment/

Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) in order to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also allows you to improve resource utilization, reduce cost, and simplify infrastructure management to support different application deployments. This model is beneficial for those running Apache Spark workloads, for several reasons. For example, it allows you to have multiple Spark environments running concurrently with different configurations and dependencies that are segregated from each other through Kubernetes multi-tenancy features. In addition, the same cluster can be used for various workloads like machine learning (ML), host applications, data streaming and thereby reducing operational overhead of managing multiple clusters.

AWS offers Amazon EMR on EKS, a managed service that enables you to run your Apache Spark workloads on Amazon EKS. This service uses the Amazon EMR runtime for Apache Spark, which increases the performance of your Spark jobs so that they run faster and cost less. When you run Spark jobs on EMR on EKS and not on self-managed Apache Spark on Kubernetes, you can take advantage of automated provisioning, scaling, faster runtimes, and the development and debugging tools that Amazon EMR provides

In this post, we show how to configure and run EMR on EKS in a multi-tenant EKS cluster that can used by your various teams. We tackle multi-tenancy through four topics: network, resource management, cost management, and security.

Concepts

Throughout this post, we use terminology that is either specific to EMR on EKS, Spark, or Kubernetes:

Multi-tenancy – Multi-tenancy in Kubernetes can come in three forms: hard multi-tenancy, soft multi-tenancy and sole multi-tenancy. Hard multi-tenancy means each business unit or group of applications gets a dedicated Kubernetes; there is no sharing of the control plane. This model is out of scope for this post. Soft multi-tenancy is where pods might share the same underlying compute resource (node) and are logically separated using Kubernetes constructs through namespaces, resource quotas, or network policies. A second way to achieve multi-tenancy in Kubernetes is to assign pods to specific nodes that are pre-provisioned and allocated to a specific team. In this case, we talk about sole multi-tenancy. Unless your security posture requires you to use hard or sole multi-tenancy, you would want to consider using soft multi-tenancy for the following reasons:
- Soft multi-tenancy avoids underutilization of resources and waste of compute resources.
- There is a limited number of managed node groups that can be used by Amazon EKS, so for large deployments, this limit can quickly become a limiting factor.
- In sole multi-tenancy there is high chance of ghost nodes with no pods scheduled on them due to misconfiguration as we force pods into dedicated nodes with label, taints and tolerance and anti-affinity rules.
Namespace – Namespaces are core in Kubernetes and a pillar to implement soft multi-tenancy. With namespaces, you can divide the cluster into logical partitions. These partitions are then referenced in quotas, network policies, service accounts, and other constructs that help isolate environments in Kubernetes.
Virtual cluster – An EMR virtual cluster is mapped to a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. Multiple virtual clusters can be backed by the same physical cluster. However, each virtual cluster maps to one namespace on an EKS cluster. Virtual clusters don’t create any active resources that contribute to your bill or require lifecycle management outside the service.
Pod template – In EMR on EKS, you can provide a pod template to control pod placement, or define a sidecar container. This pod template can be defined for executor pods and driver pods, and stored in an Amazon Simple Storage Service (Amazon S3) bucket. The S3 locations are then submitted as part of the applicationConfiguration object that is part of configurationOverrides, as defined in the EMR on EKS job submission API.

Security considerations

In this section, we address security from different angles. We first discuss how to protect IAM role that is used for running the job. Then address how to protect secrets use in jobs and finally we discuss how you can protect data while it is processed by Spark.

IAM role protection

A job submitted to EMR on EKS needs an AWS Identity and Access Management (IAM) execution role to interact with AWS resources, for example with Amazon S3 to get data, with Amazon CloudWatch Logs to publish logs, or use an encryption key in AWS Key Management Service (AWS KMS). It’s a best practice in AWS to apply least privilege for IAM roles. In Amazon EKS, this is achieved through IRSA (IAM Role for Service Accounts). This mechanism allows a pod to assume an IAM role at the pod level and not at the node level, while using short-term credentials that are provided through the EKS OIDC.

IRSA creates a trust relationship between the EKS OIDC provider and the IAM role. This method allows only pods with a service account (annotated with an IAM role ARN) to assume a role that has a trust policy with the EKS OIDC provider. However, this isn’t enough, because it would allow any pod with a service account within the EKS cluster that is annotated with a role ARN to assume the execution role. This must be further scoped down using conditions on the role trust policy. This condition allows the assume role to happen only if the calling service account is the one used for running a job associated with the virtual cluster. The following code shows the structure of the condition to add to the trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": <OIDC provider ARN >
            },
            "Action": "sts:AssumeRoleWithWebIdentity"
            "Condition": { "StringLike": { “<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>”} }
        }
    ]
}

To scope down the trust policy using the service account condition, you need to run the following the command with AWS CLI:

aws emr-containers update-role-trust-policy \
–cluster-name cluster \
–namespace namespace \
–role-name iam_role_name_for_job_execution

The command will the add the service account that will be used by the spark client, Jupyter Enterprise Gateway, Spark kernel, driver or executor. The service accounts name have the following structure emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>.

In addition to the role segregation offered by IRSA, we recommend blocking access to instance metadata because a pod can still inherit the rights of the instance profile assigned to the worker node. For more information about how you can block access to metadata, refer to Restrict access to the instance profile assigned to the worker node.

Secret protection

Sometime a Spark job needs to consume data stored in a database or from APIs. Most of the time, these are protected with a password or access key. The most common way to pass these secrets is through environment variables. However, in a multi-tenant environment, this means any user with access to the Kubernetes API can potentially access the secrets in the environment variables if this access isn’t scoped well to the namespaces the user has access to.

To overcome this challenge, we recommend using a Secrets store like AWS Secrets Manager that can be mounted through the Secret Store CSI Driver. The benefit of using Secrets Manager is the ability to use IRSA and allow only the role assumed by the pod access to the given secret, thereby improving your security posture. You can refer to the best practices guide for sample code showing the use of Secrets Manager with EMR on EKS.

Spark data encryption

When a Spark application is running, the driver and executors produce intermediate data. This data is written to the node local storage. Anyone who is able to exec into the pods would be able to read this data. Spark supports encryption of this data, and it can be enabled by passing --conf spark.io.encryption.enabled=true. Because this configuration adds performance penalty, we recommend enabling data encryption only for workloads that store and access highly sensitive data and in untrusted environments.

Network considerations

In this section we discuss how to manage networking within the cluster as well as outside the cluster. We first address how Spark handle cross executors and driver communication and how to secure it. Then we discuss how to restrict network traffic between pods in the EKS cluster and allow only traffic destined to EMR on EKS. Last, we discuss how to restrict traffic of executors and driver pods to external AWS service traffic using security groups.

Network encryption

The communication between the driver and executor uses RPC protocol and is not encrypted. Starting with Spark 3 in the Kubernetes backed cluster, Spark offers a mechanism to encrypt communication using AES encryption.

The driver generates a key and shares it with executors through the environment variable. Because the key is shared through the environment variable, potentially any user with access to the Kubernetes API (kubectl) can read the key. We recommend securing access so that only authorized users can have access to the EMR virtual cluster. In addition, you should set up Kubernetes role-based access control in such a way that the pod spec in the namespace where the EMR virtual cluster runs is granted to only a few selected service accounts. This method of passing secrets through the environment variable would change in the future with a proposal to use Kubernetes secrets.

To enable encryption, RPC authentication must also be enabled in your Spark configuration. To enable encryption in-transit in Spark, you should use the following parameters in your Spark config:

--conf spark.authenticate=true

--conf spark.network.crypto.enabled=true

Note that these are the minimal parameters to set; refer to Encryption from the complete list of parameters.

Additionally, applying encryption in Spark has a negative impact on processing speed. You should only apply it when there is a compliance or regulation need.

Securing Network traffic within the cluster

In Kubernetes, by default pods can communicate over the network across different namespaces in the same cluster. This behavior is not always desirable in a multi-tenant environment. In some instances, for example in regulated industries, to be compliant you want to enforce strict control over the network and send and receive traffic only from the namespace that you’re interacting with. For EMR on EKS, it would be the namespace associated to the EMR virtual cluster. Kubernetes offers constructs that allow you to implement network policies and define fine-grained control over the pod-to-pod communication. These policies are implemented by the CNI plugin; in Amazon EKS, the default plugin would be the VPC CNI. A policy is defined as follows and is applied with kubectl:

Kind: NetworkPolicy
metadata:
  name: default-np-ns1
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          nsname: <EMR-VC-NAMESPACE>

Network traffic outside the cluster

In Amazon EKS, when you deploy pods on Amazon Elastic Compute Cloud (Amazon EC2) instances, all the pods use the security group associated with the node. This can be an issue if your pods (executor pods) are accessing a data source (namely a database) that allows traffic based on the source security group. Database servers often restrict network access only from where they are expecting it. In the case of a multi-tenant EKS cluster, this means pods from other teams that shouldn’t have access to the database servers, would be able to send traffic to it.

To overcome this challenge, you can use security groups for pods. This feature allows you to assign a specific security group to your pods, thereby controlling the network traffic to your database server or data source. You can also refer to the best practices guide for a reference implementation.

Cost management and chargeback

In a multi-tenant environment, cost management is a critical subject. You have multiple users from various business units, and you need to be able to precisely chargeback the cost of the compute resource they have used. At the beginning of the post, we introduced three models of multi-tenancy in Amazon EKS: hard multi-tenancy, soft multi-tenancy, and sole multi-tenancy. Hard multi-tenancy is out of scope because the cost tracking is trivial; all the resources are dedicated to the team using the cluster, which is not the case for sole multi-tenancy and soft multi-tenancy. In the next sections, we discuss these two methods to track the cost for each of model.

Soft multi-tenancy

In a soft multi-tenant environment, you can perform chargeback to your data engineering teams based on the resources they consumed and not the nodes allocated. In this method, you use the namespaces associated with the EMR virtual cluster to track how much resources were used for processing jobs. The following diagram illustrates an example.

Diagram -1 Soft multi-tenancy

Tracking resources based on the namespace isn’t an easy task because jobs are transient in nature and fluctuate in their duration. However, there are partner tools available that allow you to keep track of the resources used, such as Kubecost, CloudZero, Vantage, and many others. For instructions on using Kubecost on Amazon EKS, refer to this blog post on cost monitoring for EKS customers.

Sole multi-tenancy

For sole multi-tenancy, the chargeback is done at the instance (node) level. Each member on your team uses a specific set of nodes that are dedicated to it. These nodes aren’t always running, and are spun up using the Kubernetes auto scaling mechanism. The following diagram illustrates an example.

Diagram -2 Sole tenancy

With sole multi-tenancy, you use a cost allocation tag, which is an AWS mechanism that allows you to track how much each resource has consumed. Although the method of sole multi-tenancy isn’t efficient in terms of resource utilization, it provides a simplified strategy for chargebacks. With the cost allocation tag, you can chargeback a team based on all the resources they used, like Amazon S3, Amazon DynamoDB, and other AWS resources. The chargeback mechanism based on the cost allocation tag can be augmented using the recently launched AWS Billing Conductor, which allows you to issue bills internally for your team.

Resource management

In this section, we discuss considerations regarding resource management in multi-tenant clusters. We briefly discuss topics like sharing resources graciously, setting guard rails on resource consumption, techniques for ensuring resources for time sensitive and/or critical jobs, meeting quick resource scaling requirements and finally cost optimization practices with node selectors.

Sharing resources

In a multi-tenant environment, the goal is to share resources like compute and memory for better resource utilization. However, this requires careful capacity management and resource allocation to make sure each tenant gets their fair share. In Kubernetes, resource allocation is controlled and enforced by using ResourceQuota and LimitRange. ResourceQuota limits resources on the namespace level, and LimitRange allows you to make sure that all the containers are submitted with a resource requirement and a limit. In this section, we demonstrate how a data engineer or Kubernetes administrator can set up ResourceQuota as a LimitRange configuration.

The administrator creates one ResourceQuota per namespace that provides constraints for aggregate resource consumption:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: teamA
spec:
  hard:
    requests.cpu: "1000"
    requests.memory: 4000Gi
    limits.cpu: "2000"
    limits.memory: 6000Gi

For LimitRange, the administrator can review the following sample configuration. We recommend using default and defaultRequest to enforce the limit and request field on containers. Lastly, from a data engineer perspective while submitting the EMR on EKS jobs, you need to make sure the Spark parameters of resource requirements are within the range of the defined LimitRange. For example, in the following configuration, the request for spark.executor.cores=7 will fail because the max limit for CPU is 6 per container:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-min-max
  namespace: teamA
spec:
  limits:
  - max:
      cpu: "6"
    min:
      cpu: "100m"
    default:
      cpu: "500m"
    defaultRequest:
      cpu: "100m"
    type: Container

Priority-based resource allocation

Diagram – 3 Illustrates an example of resource allocation with priority.

As all the EMR virtual clusters share the same EKS computing platform with limited resources, there will be scenarios in which you need to prioritize jobs in a sensitive timeline. In this case, high-priority jobs can utilize the resources and finish the job, whereas low-priority jobs that are running gets stopped and any new pods must wait in the queue. EMR on EKS can achieve this with the help of pod templates, where you specify a priority class for the given job.

When a pod priority is enabled, the Kubernetes scheduler orders pending pods by their priority and places them in the scheduling queue. As a result, the higher-priority pod may be scheduled sooner than pods with lower priority if its scheduling requirements are met. If this pod can’t be scheduled, the scheduler continues and tries to schedule other lower-priority pods.

The preemptionPolicy field on the PriorityClass defaults to PreemptLowerPriority, and the pods of that PriorityClass can preempt lower-priority pods. If preemptionPolicy is set to Never, pods of that PriorityClass are non-preempting. In other words, they can’t preempt any other pods. When lower-priority pods are preempted, the victim pods get a grace period to finish their work and exit. If the pod doesn’t exit within that grace period, that pod is stopped by the Kubernetes scheduler. Therefore, there is usually a time gap between the point when the scheduler preempts victim pods and the time that a higher-priority pod is scheduled. If you want to minimize this gap, you can set a deletion grace period of lower-priority pods to zero or a small number. You can do this by setting the terminationGracePeriodSeconds option in the victim Pod YAML.

See the following code samples for priority class:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100
globalDefault: false
description: " High-priority Pods and for Driver Pods."

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 50
globalDefault: false
description: " Low-priority Pods."

One of the key considerations while templatizing the driver pods, especially for low-priority jobs, is to avoid the same low-priority class for both driver and executor. This will save the driver pods from getting evicted and lose the progress of all its executors in a resource congestion scenario. In this low-priority job example, we have used a high-priority class for driver pod templates and low-priority classes only for executor templates. This way, we can ensure the driver pods are safe during the eviction process of low-priority jobs. In this case, only executors will be evicted, and the driver can bring back the evicted executor pods as the resource becomes freed. See the following code:

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "high-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as Spark driver container

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "low-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
  containers:
  - name: spark-kubernetes-executors # This will be interpreted as Spark executor container

Overprovisioning with priority

Diagram – 4 Illustrates an example of overprovisioning with priority.

As pods wait in a pending state due to resource availability, additional capacity can be added to the cluster with Amazon EKS auto scaling. The time it takes to scale the cluster by adding new nodes for deployment has to be considered for time-sensitive jobs. Overprovisioning is an option to mitigate the auto scaling delay using temporary pods with negative priority. These pods occupy space in the cluster. When pods with high priority are unschedulable, the temporary pods are preempted to make the room. This causes the auto scaler to scale out new nodes due to overprovisioning. Be aware that this is a trade-off because it adds higher cost while minimizing scheduling latency. For more information about overprovisioning best practices, refer to Overprovisioning.

Node selectors

EKS clusters can span multiple Availability Zones in a VPC. A Spark application whose driver and executor pods are distributed across multiple Availability Zones can incur inter- Availability Zone data transfer costs. To minimize or eliminate the data transfer cost, you should configure the job to run on a specific Availability Zone or even specific node type with the help of node labels. Amazon EKS places a set of default labels to identify capacity type (On-Demand or Spot Instance), Availability Zone, instance type, and more. In addition, we can use custom labels to meet workload-specific node affinity.

EMR on EKS allows you to choose specific nodes in two ways:

At the job level. Refer to EKS Node Placement for more details.
In the driver and executor level using pod templates.

When using pod templates, we recommend using on demand instances for driver pods. You can also consider including spot instances for executor pods for workloads that are tolerant of occasional periods when the target capacity is not completely available. Leveraging spot instances allow you to save cost for jobs that are not critical and can be terminated. Please refer Define a NodeSelector in PodTemplates.

Conclusion

In this post, we provided guidance on how to design and deploy EMR on EKS in a multi-tenant EKS environment through different lenses: network, security, cost management, and resource management. For any deployment, we recommend the following:

Use IRSA with a condition scoped on the EMR on EKS service account
Use a secret manager to store credentials and the Secret Store CSI Driver to access them in your Spark application
Use ResourceQuota and LimitRange to specify the resources that each of your data engineering teams can use and avoid compute resource abuse and starvation
Implement a network policy to segregate network traffic between pods

Lastly, if you are considering migrating your spark workload to EMR on EKS you can further learn about design patterns to manage Apache Spark workload in EMR on EKS in this blog and about migrating your EMR transient cluster to EMR on EKS in this blog.

About the Authors

Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Ajeeb Peter is a Senior Solutions Architect with Amazon Web Services based in Charlotte, North Carolina, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience on Software Development, Architecture and Analytics from industries like finance and telecom.

Implementing long running deployments with AWS CloudFormation Custom Resources using AWS Step Functions

2022-09-17 DAMODAR SHENVI WAGLE

Post Syndicated from DAMODAR SHENVI WAGLE original https://aws.amazon.com/blogs/devops/implementing-long-running-deployments-with-aws-cloudformation-custom-resources-using-aws-step-functions/

AWS CloudFormation custom resource provides mechanisms to provision AWS resources that don’t have built-in support from CloudFormation. It lets us write custom provisioning logic for resources that aren’t supported as resource types under CloudFormation. This post focusses on the use cases where CloudFormation custom resource is used to implement a long running task/job. With custom resources, you can manage these custom tasks (which are one-off in nature) as deployment stack resources.

The routine pattern used for implementing custom resources is via AWS Lambda function. However, when using the Lambda function as the custom resource provider, you must consider its trade-offs, such as its 15 minute timeout. Tasks involved in the provisioning of certain AWS resources can be long running and could span beyond the Lambda timeout. In these scenarios, you must look beyond the conventional Lambda function-based approach for custom resources.

In this post, I’ll demonstrate how to use AWS Step Functions to implement custom resources using AWS Cloud Development Kit (AWS CDK). Step Functions allow complex deployment tasks to be orchestrated as a step-by-step workflow. It also offers direct integration with any AWS service via AWS SDK integrations. By default the CloudFormation stack waits for 1 hour before timing out. The timeout can be increased to maximum 12 hours using wait conditions. In this post, you’ll also see how to use wait conditions with custom resource to run long running deployment tasks as part of a CloudFormation stack.

Prerequisites

Before proceeding any further, you must identify and designate an AWS account required for the solution to work. You must also create an AWS account profile in ~/.aws/credentials for the designated AWS account, if you don’t already have one. The profile must have sufficient permissions to run an AWS CDK stack. It should be your private profile and only be used during the course of this post. Therefore, it should be fine if you want to use admin privileges. Don’t share the profile details, especially if it has admin privileges. I recommend removing the profile when you’re finished with this walkthrough. For more information about creating an AWS account profile, see Configuring the AWS CLI.

Services and frameworks used in the post include CloudFormation, Step Functions, Lambda, DynamoDB, Amazon S3, and AWS CDK.

Solution overview

The following architecture diagram shows the application of Step Functions to implement custom resources.

Figure 1. Architecture diagram

The user deploys a CloudFormation stack that includes a custom resource implementation.
The CloudFormation custom resource triggers a Lambda function with the appropriate event which can be CREATE/UPDATE/DELETE.
The custom resource Lambda function invokes Step Functions workflow and offloads the event handling responsibility. The CloudFormation event and context are wrapped inside the Step Function input at the time of invocation.
The custom resource Lambda function returns SUCCESS back to CloudFormation stack indicating that the custom resource provisioning has begun. CloudFormation stack then goes into waiting mode where it waits for a SUCCESS or FAILURE signal to continue.
In the interim, Step Functions workflow handles the custom resource event through one or more steps.
Step Functions workflow prepares the response to be sent back to CloudFormation stack.
Send Response Lambda function sends a success/failure response back to CloudFormation stack. This propels CloudFormation stack out of the waiting mode and into completion.

Solution deep dive

In this section I will get into the details of several key aspects of the solution

Custom Resource Definition

Following code snippet shows the custom resource definition which can be found here. Please note that we also define AWS::CloudFormation::WaitCondition and AWS::CloudFormation::WaitConditionHandle alongside the custom resource. AWS::CloudFormation::WaitConditionHandle resource sets up a pre-signed URL which is passed into the CallbackUrl property of the Custom Resource.

The final completion signal for the custom resource i.e. SUCCESS/FAILURE is received over this CallbackUrl. To learn more about wait conditions please refer to its user guide here. Note that, when updating the custom resource, you cannot use the existing WaitCondition-WaitConditionHandle resource pair. You need to create a new pair for tracking each update/delete operation on the custom resource.

/************************** Custom Resource Definition *****************************/
// When you intend to update CustomResource make sure that a new WaitCondition and 
// a new WaitConditionHandle resource is created to track CustomResource update.
// The strategy we are using here is to create a hash of Custom Resource properties.
// The resource names for WaitCondition and WaitConditionHandle carry this hash.
// Anytime there is an update to the custom resource properties, a new hash is generated,
// which automatically leads to new WaitCondition and WaitConditionHandle resources.
const resourceName: string = getNormalizedResourceName('DemoCustomResource');
const demoData = {
    pk: 'demo-sfn',
    sk: resourceName,
    ts: Date.now().toString()
};
const dataHash = hash(demoData);
const wcHandle = new CfnWaitConditionHandle(
    this, 
    'WCHandle'.concat(dataHash)
)
const customResource = new CustomResource(this, resourceName, {
    serviceToken: customResourceLambda.functionArn,
    properties: {
        DDBTable: String(demoTable.tableName),
        Data: JSON.stringify(demoData),
        CallbackUrl: wcHandle.ref
    }
});
        
// Note: AWS::CloudFormation::WaitCondition resource type does not support updates.
new CfnWaitCondition(
    this,
    'WC'.concat(dataHash),
    {
        count: 1,
        timeout: '300',
        handle: wcHandle.ref
    }
).node.addDependency(customResource)
/**************************************************************************************/

Custom Resource Lambda

Following code snippet shows how the custom resource lambda function passes the CloudFormation event as an input into the StepFunction at the time of invocation. CloudFormation event contains the CallbackUrl resource property I discussed in the previous section.

private async startExecution() {
    const input = {
        cfnEvent: this.event,
        cfnContext: this.context
    };
    const params: StartExecutionInput = {
        stateMachineArn: String(process.env.SFN_ARN),
        input: JSON.stringify(input)
    };
    let attempt = 0;
    let retry = false;
    do {
        try {
            const response = await this.sfnClient.startExecution(params).promise();
            console.debug('Response: ' + JSON.stringify(response));
            retry = false;

Custom Resource StepFunction

The StepFunction handles the CloudFormation event based on the event type. The CloudFormation event containing CallbackUrl is passed down the stages of StepFunction all the way to the final step. The last step of the StepFunction sends back the response over CallbackUrl via send-cfn-response lambda function as shown in the following code snippet.

/**
 * Send response back to cloudformation
 * @param event
 * @param context
 * @param response
 */
export async function sendResponse(event: any, context: any, response: any) {
    const responseBody = JSON.stringify({
        Status: response.Status,
        Reason: "Success",
        UniqueId: response.PhysicalResourceId,
        Data: JSON.stringify(response.Data)
    });
    console.debug("Response body:\n", responseBody);
    const parsedUrl = url.parse(event.ResourceProperties.CallbackUrl);
    const options = {
        hostname: parsedUrl.hostname,
        port: 443,
        path: parsedUrl.path,
        method: "PUT",
        headers: {
            "content-type": "",
            "content-length": responseBody.length
        }
    };
    await new Promise(() => {
        const request = https.request(options, function(response: any) {
	    console.debug("Status code: " + response.statusCode);
	    console.debug("Status message: " + response.statusMessage);
	    context.done();
    	})
	request.on("error", function(error) {
	    console.debug("send(..) failed executing https.request(..): " + error);
	    context.done();
	});
	request.write(responseBody);
	request.end();
    });
    return;
}

Demo

Clone the GitHub repo cfn-custom-resource-using-step-functions and navigate to the folder cfn-custom-resource-using-step-functions. Now, execute the script script-deploy.sh by passing the name of the AWS profile that you created in the prerequisites section above. This should deploy the solution. The commands are shown as follows for your reference. Note that if you don’t pass the AWS profile name ‘default’ the profile will be used for deployment.

git clone 
cd cfn-custom-resource-using-step-functions
./script-deploy.sh "<AWS- ACCOUNT-PROFILE-NAME>"

The deployed solution consists of 2 stacks as shown in the following screenshot

cfn-custom-resource-common-lib: Deploys common components
- DynamoDB table that custom resources write to during their lifecycle events
- Lambda layer used across the rest of the stacks
cfn-custom-resource-sfn: Deploys Step Functions backed custom resource implementation

Figure 2. CloudFormation stacks deployed

For demo purposes, I implemented a custom resource that inserts data into the DynamoDB table. When you deploy the solution for the first time, like you just did in the previous step, it initiates a CREATE event resulting in the creation of a new custom resource using Step Functions. You should see a new record with unix epoch timestamp in the DynamoDB table, indicating that the resource was created as shown in the following screenshot. You can find the DynamoDB table name/arn from the SSM Parameter Store /CUSTOM_RESOURCE_PATTERNS/DYNAMODB/ARN

Figure 3. DynamoDB record indicating custom resource creation

Now, execute the script script-deploy.sh again. This should initiate an UPDATE event, resulting in the update of custom resources. The code also automatically creates new WaitConditionHandle and WaitCondition resources required to wait for the update event to finish. Now you should see that the records in the DynamoDb table have been updated with new values for lastOperation and ts attributes as follows.

Figure 4. DynamoDB record indicating custom resource update

Cleaning up

To remove all of the stacks, run the script script-undeploy.sh as follows.

./script-undeploy.sh "<AWS- ACCOUNT-PROFILE-NAME>"

Conclusion

In this post I showed how to look beyond the conventional approach of building CloudFormation custom resources using a Lambda function. I discussed implementing custom resources using Step Functions and CloudFormation wait conditions. Try this solution in scenarios where you must execute a long running deployment task/job as part of your CloudFormation stack deployment.

About the author:

10 reasons to import a certificate into AWS Certificate Manager (ACM)

2022-09-16 Nicholas Doropoulos

Post Syndicated from Nicholas Doropoulos original https://aws.amazon.com/blogs/security/10-reasons-to-import-a-certificate-into-aws-certificate-manager-acm/

AWS Certificate Manager (ACM) is a service that lets you efficiently provision, manage, and deploy public and private SSL/TLS certificates for use with AWS services and your internal connected resources. The certificates issued by ACM can then be used to secure network communications and establish the identity of websites on the internet or resources on private networks.

So why might you want to import a certificate into ACM, rather than using a certificate issued by ACM? According to the AWS Certificate Manager User Guide topic Importing certificates into AWS Certificate Manager, “you might do this because you already have a certificate from a third-party certificate authority (CA), or because you have application-specific requirements that are not met by ACM issued certificates.”

In this blog post, I’ll list 10 reasons why you might want to import a certificate into ACM, including what specific requirements you might have, and why you might want to use a certificate signed by a third-party CA in the first place.

1. To use an ECDSA certificate for faster TLS connections

Imported Elliptic Curve Digital Signature Algorithm (ECDSA) certificates use smaller keys than ACM issued public RSA certificates, allowing for TLS connections to be established faster. For this reason, ECDSA certificates are particularly useful for systems with limited processing resources, such as Internet of Things (IoT) devices. ACM supports imported certificates with ECDSA in 256, 384, and 521 bit variations. If you want to use an ECDSA certificate for your public-facing web application, you need to get a third-party certificate and then import it into ACM. For more information about supported cryptographic algorithms for imported certificates, see Prerequisites for importing certificates in the AWS Certificate Manager User Guide.

2. To control your certificate’s renewal cycle

When you import a certificate into ACM, you have greater control over its renewal cycle simply because you can re-import it as frequently as you want. You also have control over how often your imported certificate’s private key can be rotated. As a best practice, you should rotate your certificate’s private key based on your certificate’s usage frequency.

Note: When you re-import your certificate, to maintain the existing associations during renewal, ensure that you specify the existing certificate’s Amazon Resource Name (ARN). For more information and step-by-step instructions, see Reimporting a certificate in the AWS Certificate Manager User Guide.

3. To use certificate pinning

You might have an application that requires certificate pinning, which is the practice of bypassing the typical hierarchical model of trust that is governed by certificate authorities. With certificate pinning, a host’s identity is trusted based on a specific certificate or public key. As a certificate pinning best practice, AWS recommends that public certificates issued by ACM should not be pinned because ACM will generate a new public/private key pair at the next renewal phase, which essentially replaces the pinned certificate with a new one, causing service disruption along the process. If you want to use certificate pinning, you can pin an imported certificate because imported certificates are not subject to managed renewal, thereby reducing the risk of production impact.

4. To use a higher-assurance certificate

You might want to use a higher-assurance certificate, such as an organization validation (OV) or extended validation (EV) certificate. Certificates issued by ACM currently only support domain validation (DV). If the domain you want to protect is an application that requires OV or EV, you can import OV or EV certificates into ACM by using a third-party certificate of either type. You can use the ACM API action ImportCertificate to import OV or EV certificates into ACM.

5. To use a self-signed certificate

For internal testing environments where your developers want speed and flexibility, self-signed certificates are issued faster and effortlessly. However, it’s important to know that self-signed certificates are not trusted by default, which means that self-signed certificates need to be installed inside the trust stores of the intended clients, to avoid the risk of your users getting into the habit of ignoring browser warnings. For more information, see the additional requirements for self-signed certificates in Prerequisites for importing certificates in the AWS Certificate Manager User Guide.

6. To use an IP address for the certificate’s subject

By design, the subject field of an ACM certificate can only identify a fully qualified domain name (FQDN). If you want to use an IP address for the certificate’s subject, then you can create the certificate and import it to ACM.

7. To exceed the number of domains allowed by the ACM quotas

Certificates issued by ACM are subject to the ACM service quotas. The default quota for ACM is 10 domain names for each ACM certificate, and you can request an increase to the quota up to a maximum of 100 domain names for each certificate. However, if you import certificates, they are not subject to the quotas, and you can use a public certificate with more than 100 FQDNs in its domain scope without having to go through the process of requesting any limit increases.

8. To use a private certificate issued by ACM Private CA with the IssueCertificate API action

Certificates provisioned with the IssueCertificate API action have a private status and cannot be associated directly with an AWS integrated service, such as an internal Application Load Balancer. Instead, a private certificate issued by AWS Certificate Manager Private Certificate Authority (ACM Private CA) with the IssueCertificate API action needs to be exported and then imported into ACM before the association can be made. The same is true for certificate templates as well, which are configuration templates that can be passed as parameters to the IssueCertificate API action as a means to have greater control over the private certificate’s extensions.

9. To use a private certificate issued by your on-premises CA

You might want to use a private certificate issued by your on-premises CA instead of using ACM Private CA. To administer your internal public key infrastructure (PKI), AWS generally recommends that you use ACM Private CA. However, you might still come across scenarios where a certificate signed by your on-premises CA is better suited for your specific needs. For example, you might want to have a common root of trust, for consistency and interoperability purposes across a hybrid PKI solution. Furthermore, using an external parent CA with ACM Private CA also allows you to enforce CA name constraints. For more information, see Signing private CA certificates with an external CA in the AWS Certificate Manager Private Certificate Authority User Guide.

10. To use a certificate for something other than securing a public website

In addition to securing a public website, you can use certificates for other purposes. For example, you can import client and server certificates as part of an OpenVPN setup. For more information about this example, see How can I generate server and client certificates and their respective keys on a Windows server and upload them to AWS Certificate Manager (ACM)? In addition, you can import a code-signing certificate for use with AWS IoT Device Management. For more information about how to import a code-signing certificate, see (For IoT only) Obtain and import a code-signing certificate in the AWS Signer Developer Guide.

Conclusion

In this blog post, you learned about some of the reasons you might want to import a certificate into AWS Certificate Manager (ACM). For more information about importing certificates into ACM and step-by-step instructions, see Importing certificates into AWS Certificate Manager in the AWS Certificate Manager User Guide. For the latest pricing information, see the AWS Certificate Manager Pricing page on the AWS website. You can also use the AWS pricing calculator to estimate costs.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Certificate Manager forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Hazard analysis and Chaos engineering at Vanguard Group

2022-09-16 Jason Barto

Post Syndicated from Jason Barto original https://aws.amazon.com/blogs/devops/hazard-analysis-and-chaos-engineering-at-vanguard-group/

Anticipating events that can cause a disruption to your system’s service is critical to building highly available, reliable systems. Hazard analysis gives you a method to identify such events. Chaos engineering gives you a method to confirm that a system behaves as expected in adverse conditions. By combining these methods, Vanguard is building reliability into their systems.

Vanguard engineering teams perform hazard analysis on their systems and capture the identified events as failure scenarios. They use the identified failure scenarios to create hypotheses to support chaos engineering experiments. These hypotheses predict how the system will respond to failures and each hypothesis is then confirmed through experimentation to increase the team’s confidence in the system’s reliability.

In this article we will walk you through how Vanguard uses hazard analysis and chaos engineering. We will also provide guidance on how you can employ these techniques on your applications.

Failure Mode & Effects Analysis

A hazard analysis can be performed using different methods. At Vanguard, they have adapted the failure mode & effects analysis (FMEA) method to support their important services.

FMEA is a bottom-up approach to analyse an architecture and focus on the impact to system functions when one or more components of the system are disrupted. Members of the engineering team and architects responsible for designing and building a system brainstorm possible failure scenarios or failure modes, and document the impact of these failures on the system. Combined with a quantitative method for ranking the failure modes, the analysis process produces a prioritised list of failure modes which describes how the system would respond to individual or combined failures in its component parts or dependencies.

For each failure mode the team conducting the analysis will highlight what protections exist within the system to guard against the failure mode. Sometimes, fault isolation boundaries have been put in place to prevent client impact in failure scenarios. In other scenarios, for one reason or another, there are hard dependencies in place for which the engineering team has decided not to build in fault tolerance. For example, a team responsible for a less-critical function may have architected its system to operate across multiple availability zones, but could decide not to implement other mitigations to prioritize cost over increased resilience.

The FMEA method has been in use by engineers in the automotive, aeronautical, healthcare, and military industries for more than 60 years. Over that time, FMEA has been modified to best suit the organization and the field in which it was applied. In many variations the FMEA measures each failure mode with a risk priority number (RPN), which is intended to quantitatively rank the failure mode based upon:

The failure mode’s impact to the system as a whole
The probability of the failure mode’s occurrence
How easily the failure mode can be detected

Vanguard have adapted the FMEA process to serve their own specific requirements and processes. Vanguard have decided not to adopt the RPN element of the FMEA process, as teams found they spent a lot of time debating the impact, probability, and detectability of individual failure modes. To perform an FMEA more quickly, teams instead focus on the failure modes and system impact only, documenting a mental model of system performance which can be experimented through chaos engineering.

An excerpt of a Vanguard FMEA output is provided as an example in the following table:

The “Process Step” in the table above refers to a business function of the system being analyzed, for example “Request to retrieve stored data”. As part of the analysis, the team identifies the system components needed to perform the Process Step and considers the interactions of those components Focusing on a Process Step makes it easier to anticipate the failure scenarios that would affect the system in performing this particular business function. Also, the Process Step will imply an importance or criticality which can be a factor when prioritizing mitigations.

After selecting a Process Step, you walk through the system components involved and identify how component failures or disruptions will affect the wider system. Such component failures may involve individual components or a combination of components and are captured as “Failure Mode”. This identifies the component or components that are disrupted and their behaviour; for example, “Microservice is unavailable or returns an error”.

“Expected Behaviour” describes the effect of the failure mode on the wider system, in the context of the Process Step. This captures what other system components are affected by the Failure Mode and why, and how this impacts the Process Step as a whole.

Lastly, the “Hypothesis” column forms the basis for the chaos experiments that will follow from the FMEA to confirm that the system performs as expected.

At Vanguard, all mission-critical product teams are conducting FMEAs for their production applications. The outputs of these sessions are maintained over time and serve multiple purposes:

When onboarding new team members, it is helpful to provide the FMEA document alongside an architecture diagram and narrative. It will paint a more robust picture of how the system is intended to operate in both “happy path” and “unhappy path” scenarios.
When troubleshooting incidents, an FMEA document can help on-call engineers – especially those less experienced with debugging – to match up the documented expectations to the observed system behavior.
Site Reliability Engineers (SREs) looking for opportunities to improve the resilience of a system might look to FMEA documentation to understand the existing fault isolation boundaries and introduce additional resilience mechanisms through automation and system changes.
Finally, when selecting scenarios for experimentation with Chaos Engineering, the FMEA document provides a list of conjectures that have been mapped to hypotheses, ready to be validated through experimentation. This input into the Chaos Engineering workflow is the primary use of FMEA documents for Vanguard product teams.

There are many resources available online to learn more about how FMEA is used and applied in other organisations. In Failure Modes and Continuous Resilience, Adrian Cockcroft introduces FMEA as a method for anticipating failure scenarios. The NASA Software Engineering Handbook details how FMEAs are conducted as part of their engineering process. The Automotive Industry Group has also formally documented the use of FMEA in the Automotive Industry Action Group FMEA Handbook.

Chaos Engineering

After failure modes have been identified and mitigated through system design, it’s time to understand how resilient the system’s implementation is to those failure modes. Chaos engineering can be used to explore a system and validate that a system’s implementation meets business resiliency objectives.

Chaos engineering helps to improve a team’s mental model about the system under experimentation and provides insights into how a complex system behaves under adverse conditions. It also enables an engineer to find the unknown unknowns and the known unknowns through experiments that are built on top of the hypothesis. These experiments should simulate real world events, such as network degradation and increased client requests, and the outcome of the experiment should not be known. In other words, an experiment is not an experiment if it’s known that the conditions will cause the system to fail.

Prerequisites to Chaos Experiments at Vanguard

At Vanguard, there are some necessary prerequisites to running a chaos experiment. Firstly, the system under experiment must be set up with some basic observability tooling that will allow teams to monitor the state of the application during the failure injection. This could be as simple as an Amazon CloudWatch dashboard and some associated alarms, or as elaborate as a dedicated dashboard set up in a vendor tool.

Secondly, teams must be able to drive load to the application during the experiment; depending on the experiment type, the level and type of load may vary. The load generator can be as simple as a script on someone’s machine, or a fully automated load test depending on the requirements of the hypothesis.

Finally, teams need to have a good understanding of what the application’s “steady state” looks like. I Ideally, this takes the form of some metrics such as expected error rate, expected latency, and/or a service level objective (SLO) that can be monitored throughout the duration of the experiment. For example, a service level objective for a RESTful API might be that 90% of requests should receive a response within 100 milliseconds.

With the prerequisites met and a completed FMEA, teams can then experiment with their hypothesis using various experiment templates defined by Vanguard’s Climate of Chaos tooling.

Vanguard’s Climate of Chaos

At Vanguard, ensuring its software systems are resilient to adverse events is a critical part of its ongoing mission to provide world-class service to their clients. Vanguard believes that in order to develop high quality software, one must plan for the inevitable “stormy weather” events that occur in a distributed system.

Over the past 2 years, as a response to this need, Vanguard has developed in-house tooling called “The Climate of Chaos” to give teams easy access to common experiment templates, along with a friendly UI interface. The Climate of Chaos helps developers experiment on their systems and validate the hypotheses generated from FMEAs. It also provides the tooling for them to simulate the most common failure scenarios on Vanguard’s most commonly utilized AWS infrastructure, including Amazon Elastic Container Service (Amazon ECS), AWS Fargate, Amazon DynamoDB, Amazon Relational Database Service (Amazon RDS), AWS Lambda, and others.

The Climate of Chaos was created prior to Amazon’s release of the AWS Fault Injection Simulator (FIS), and today there is a lot of overlap with the experiment capabilities available in FIS. The Climate of Chaos has also been enhanced with company-specific features and integrations that make it easier for Vanguard developers to run chaos experiments in a controlled and predictable manner.

The Climate of Chaos includes important safety features such as an “emergency stop” function. This feature enables teams to terminate the experiment immediately if unintended side effects are encountered, rolling back the events simulated to resume steady state operation. The Climate of Chaos has been coupled with other systems like an in-house load testing tooling and added features like the ability to monitor CloudWatch alarms. Vanguard also offers teams the ability to schedule experiments to run at their convenience. Soon, Vanguard hopes to make running chaos experiments even smarter, introducing tools that will help teams run bulk experiments that systematically inject failures on a group of related applications to help pinpoint more complex failure modes.

Next Steps

Failure modes and effects analysis is a hazard analysis method which can help you identify single and combined points of failure in your system so you can prioritize the failure modes. To learn more about the FMEA process, you can read the NASA Software Engineering Handbook which outlines how they perform FMEA on their software-based systems. The AWS Whitepaper Building Mission-Critical Financial Services Applications on AWS provides example forms and suggestions for severity, probability, and detectability rankings. Appendix F in the whitepaper suggests a 1 to 10 ranking for each Risk Priority Number input, and the example spreadsheets recommend performing FMEAs for the application, platform, infrastructure, and operation layers of the system. Using these examples, you can perform an analysis of your own systems and generate hypotheses.

To experiment on your systems and validate your own hypotheses, you can use the AWS Fault Injection Simulator (FIS) mentioned earlier in this article. FIS provides you with a framework for performing controlled chaos experiments on your AWS workloads. It helps you to safely manage your experiments by providing tooling to monitor, rollback, and orchestrate chaos experiments. FIS provides the fault injection mechanisms that you will need to experiment upon your system’s implementation and resilience to identified failure modes. You can start by running experiments in pre-production environments, and then step up to running them as part of your CI/CD workflow and ultimately in your production environment. To learn more about FIS, you can read the FIS User Guide and FIS tutorials.

By using FMEA to anticipate the failures and experimenting on your systems with chaos engineering, you will gain confidence in the reliability of your system.

The content and opinions in this post are those of The Vanguard Group and AWS is not responsible for the content or accuracy of this post.

About the authors:

Deploying Local Gateway Ingress Routing on AWS Outposts

2022-09-15 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/deploying-local-gateway-ingress-routing-on-aws-outposts/

This post is written by Leonardo Solano, Senior Hybrid Cloud Solution Architect and Chris Lunsford, Senior Specialist Solutions Architect, AWS Outposts.

AWS Outposts lets customers use the same Amazon Virtual Private Cloud (VPC) security mechanisms, such as security groups and network access control lists, to control traffic flows for on-premises applications running on Outposts. Some customers, desiring additional security or consistency with on-premises systems, want the ability to inspect and filter incoming application traffic as it enters the Outpost. Ideally, they would like to deploy virtual appliances in front of the workloads running on Outposts.

Today, we are announcing a new feature called Outposts local lateway (LGW) ingress routing. This lets you create LGW inbound routes to redirect incoming traffic to an Amazon Elastic Compute Cloud (EC2) Elastic Network Interface (ENI) associated with an EC2 instance running on Outposts rack. The traffic is redirected for inspection before it reaches the workloads running on Outposts rack. Moreover, it lets the EC2 virtual appliance inspect, filter, or optimize the traffic in a similar way as VPC ingress routing in the Region.

Use case

A common use case for this feature is deploying a customer-preferred third-party virtual network appliance. The appliance can inspect, modify, or monitor the incoming traffic for policy compliance and forward compliant traffic on to the workloads running on the Outpost. A typical virtual appliance could be a firewall, intrusion detection system (IDS), or intrusion prevention system (IPS). The features provided by the virtual appliances vary, and they may include deep packet inspection, traffic optimization, and flow monitoring. This new Outposts rack feature modifies the default behavior of the local gateway routing table (LGW-RTB), and it lets customers redirect traffic coming into an Outposts deployment to the virtual appliance.

The new behavior?

Now you can create static routes in the LGW-RTB that target a specific ENI on the Outpost as the next hop. These static routes are propagated toward the customer network through the Border Gateway Protocol (BGP) peering sessions with the Customer Networking Devices. The on-premises network will route traffic to the specified Classless Inter-Domain Routing (CIDR) prefixes, as defined in the static routes, toward the Outposts Network Devices.

In the preceeding diagram, the static route 198.19.33.248/29 has a longer prefix length than 198.19.33.240/28, and both routes will be propagated toward the customer network via BGP. The incoming traffic for the 198.19.33.248/29 prefix will be directed toward the ENI eni-1234example0. The architecture looks like the following diagram, where the security virtual appliance is seated between the LGW and a set of EC2 instances in Outposts.

As ingress traffic is routed through the virtual appliance for inspection and filtering, the destination addresses of packets arriving at the ENI of the virtual appliance won’t match its ENI’s private IP address (the packets are transiting the instance). By default, the ENI will drop the inbound traffic unless you disable source/destination checking on the virtual appliance instance ENI settings. The following screenshot shows how you can disable the EC2 instance source/destination checking in the AWS console.

Considerations for LGW ingress routing

Consider the following requirements when preparing to deploy LGW ingress routing:

The ENIs used as the next-hop target must be deployed in an Outposts Subnet.
The subnets must belong to a VPC associated with the LGW-RTB.
Routes with the longest matches are prioritized. If there are two with the same destination CIDR, then static routes are preferred over propagated ones.

Working with Outposts LGW ingress routing

The following output shows what the LGW route table looks like before applying the ingress routing feature:

{
    "Routes": [
        {
            "DestinationCidrBlock": "0.0.0.0/0",
            "LocalGatewayVirtualInterfaceGroupId": "lgw-vif-grp-XXX",
            "Type": "static",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:>AWS-REGION>:<account-id>:local-gateway-route-table/lgw-rtb-XXX",
            "OwnerId": "<account-id>"
        },
        {
            "DestinationCidrBlock": "198.19.33.16/28",
            "CoipPoolId": "coip-pool-0000aaaabbbbcccc1111",
            "Type": "propagated",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-XXX",
            "OwnerId": "<account-id>"
        },
        {
            "DestinationCidrBlock": "198.19.33.240/28",
            "CoipPoolId": "coip-pool-0000aaaabbbbcccc2222",
            "Type": "propagated",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-XXX",
            "OwnerId": "<account-id>"
        }
     ]
}

The relevant change under an LGW-RTB before to add a local-gateway-route is the presence of the “propagated routes”. This represents the Outposts Subnets that can’t be deleted or modified with Next-Hop as specific ENIs present in Outposts. In the following section, we will cover how it will look after the creation of a local-gateway-route.

Configuring LGW ingress routing

To configure LGW ingress routing, you must provide the LGW route table ID, the ENI ID that will be utilized as a next-hop, and the destination CIDR block. Once you have identified those three parameters, you can configure LGW ingress routing via the This is shown in the following example, where the prefix 198.19.33.248/29 is routed to an Outpost. If the route points to an ENI attached to an instance, then the route will show as active. If the route points to an ENI that isn’t attached to an EC2 instance, then the route will show a blackhole state.

$ aws ec2 create-local-gateway-route \
  --local-gateway-route-table-id <lgw-rtb-id> \
  --network-interface-id <eni-id> \
  --destination-cidr-block 198.19.33.248/29
  
{
    "Route": {
        "DestinationCidrBlock": "198.19.33.248/29",
        "NetworkInterfaceId": "eni-id",
        "Type": "static",
        "State": "active",
        "LocalGatewayRouteTableId": "lgw-rtb-id",
        "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/<lgw-rtb-id>",
        "OwnerId": "<account-id>"
    }
}

Once LGW ingress routing has been configured, the LGW will route traffic destined to the 198.19.33.248/29 prefix to the target ENI. This must be present as part of the Outposts subnets. Note that the segment 198.19.33.248/29 is part of the Outposts CIDR range of 198.19.33.240/28. This belongs, in this case, to the Outposts customer-owned IP address (CoIP) CIDRs. When traffic follows a static route to an ENI, the packet destination address is preserved and isn’t translated to the private address of the ENI.

In this case, the new LGW-RTB will look like the following:

{
    "Routes": [
        {
            "DestinationCidrBlock": "0.0.0.0/0",
            "LocalGatewayVirtualInterfaceGroupId": "lgw-vif-grp-XXX",
            "Type": "static",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-rtb-XXX",
            "OwnerId": "<account-id>"
        },
        {
            "DestinationCidrBlock": "198.19.33.16/28",
            "CoipPoolId": "coip-pool-0000aaaabbbbcccc1111",
            "Type": "propagated",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-XXX",
            "OwnerId": "<account-id>"
        },
        {
            "DestinationCidrBlock": "198.19.33.240/28",
            "CoipPoolId": "coip-pool-0000aaaabbbbcccc1111",
            "Type": "propagated",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-XXX",
            "OwnerId": "<account-id>"
        },
         {
            "DestinationCidrBlock": "198.19.33.248/29",
            "NetworkInterfaceId": "eni-XXX",
            "Type": "static",
            "State": "active",
            "LocalGatewayRouteTableId": "lgw-rtb-XXX",
            "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/lgw-rtb-XXX",
            "OwnerId": "<account-id>"
        }
     ]
}

In the AWS console, the LGW-RTB will show the new ingress routing route:

Modifying LGW ingress routing

Utilize a similar AWS CLI command to the one that we used previously to create the LGW ingress routing route to modify existing routes. In this case, the command will be aws ec2 modify-local-gateway-route, and the arguments are the same as with the create command. Use this command when you want to shift inbound traffic from one EC2 instance to another – perhaps from an active to a standby network appliance while you perform required maintenance on the primary instance.

$ aws ec2 modify-local-gateway-route \
  --local-gateway-route-table-id <lgw-rtb-id> \
  --network-interface-id <new-eni-id> \
  --destination-cidr-block 198.19.33.248/29
{
    "Route": {
        "DestinationCidrBlock": "198.19.33.248/29",
        "NetworkInterfaceId": "new-eni-id",
        "Type": "static",
        "State": "active",
        "LocalGatewayRouteTableId": "lgw-rtb-id",
        "LocalGatewayRouteTableArn": "arn:aws:ec2:<AWS-REGION>:<account-id>:local-gateway-route-table/<lgw-rtb-id>",
        "OwnerId": "<account-id>"
    }
}

Conclusion

AWS Outposts LGW ingress routing allows AWS customers and partners to deploy virtual appliances on Outposts rack and direct inbound traffic through those appliances. The virtual appliance can inspect, filter, and optimize the ingress traffic before forwarding it on to the workloads running on Outposts rack, creating fine-grained network and security policies for your workloads. To learn more about AWS Outposts rack, visit the product overview page.

Choose the k-NN algorithm for your billion-scale use case with OpenSearch

2022-09-13 Jack Mazanec

Post Syndicated from Jack Mazanec original https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/

When organizations set out to build machine learning (ML) applications such as natural language processing (NLP) systems, recommendation engines, or search-based systems, often times k-Nearest Neighbor (k-NN) search will be used at some point in the workflow. As the number of data points reaches the hundreds of millions or even billions, scaling a k-NN search system can be a major challenge. Applying Approximate Nearest Neighbor (ANN) search is a great way to overcome this challenge.

The k-NN problem is relatively simple compared to other ML techniques: given a set of points and a query, find the k nearest points in the set to the query. The naive solution is equally understandable: for each point in the set, compute its distance from the query and keep track of the top k along the way.

K-NN concept

The problem with this naive approach is that it doesn’t scale particularly well. The runtime search complexity is O(Nlogk), where N is the number of vectors and k is the number of nearest neighbors. Although this may not be noticeable when the set contains thousands of points, it becomes noticeable when the size gets into the millions. Although some exact k-NN algorithms can speed search up, they tend to perform similarly to the naive approach in higher dimensions.

Enter ANN search. We can reduce the runtime search latency if we loosen a few constraints on the k-NN problem:

Allow indexing to take longer
Allow more space to be used at query time
Allow the search to return an approximation of the k-NN in the set

Several different algorithms have been discovered to do just that.

OpenSearch is a community-driven, Apache 2.0-licensed, open-source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data. The OpenSearch k-NN plugin provides the ability to use some of these algorithms within an OpenSearch cluster. In this post, we discuss the different algorithms that are supported and run experiments to see some of the trade-offs between them.

Hierarchical Navigable Small Worlds algorithm

The Hierarchical Navigable Small Worlds algorithm (HNSW) is one of the most popular algorithms out there for ANN search. It was the first algorithm that the k-NN plugin supported, using a very efficient implementation from the nmslib similarity search library. It has one of the best query latency vs. recall trade-offs and doesn’t require any training. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.

But which vector should the traversal start from? It could just pick a random vector, but for a large index, this might be very far from the query’s actual nearest neighbors, leading to poor results. To pick a vector that is generally close to the query vector to start from, the algorithm builds not just one graph, but a hierarchy of graphs. All vectors are added to the bottom layer, and then a random subset of those are added to the layer above, and then a subset of those are added to the layer above that, and so on.

During search, we start from a random vector in the top layer, partially traverse the graph to find (approximately) the nearest point to the query vector in that layer, and then use this vector as the starting point for our traversal of the layer below. We repeat this until we get to the bottom layer. At the bottom layer, we perform the traversal, but this time, instead of just searching for the nearest neighbor, we keep track of the k-nearest neighbors that are visited along the way.

The following figure illustrates this process (inspired from the image in original paper Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs).

You can tune three parameters for HNSW:

m – The maximum number of edges a vector will get in a graph. The higher this number is, the more memory the graph will consume, but the better the search approximation may be.
ef_search – The size of the queue of the candidate nodes to visit during traversal. When a node is visited, its neighbors are added to the queue to be visited in the future. When this queue is empty, the traversal will end. A larger value will increase search latency, but may provide better search approximation.
ef_construction – Very similar to ef_search. When a node is to be inserted into the graph, the algorithm will find its m edges by querying the graph with the new node as the query vector. This parameter controls the candidate queue size for this traversal. A larger value will increase index latency, but may provide a better search approximation.

For more information on HNSW, you can read through the paper Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.

Memory consumption

Although HNSW provides very good approximate nearest neighbor search at low latencies, it can consume a large amount of memory. Each HNSW graph uses roughly 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory:

d is the dimension of the vectors
m is the algorithm parameter that controls the number of connections each node will have in a layer
num_vectors is the number of vectors in the index

To ensure durability and availability, especially when running production workloads, OpenSearch indexes are recommended to have at least one replica shard. Therefore, the memory requirement is multiplied by (1 + number of replicas). For use cases where the data size is 1 billion vectors of 128 dimensions each and m is set to the default value of 16, the estimated amount of memory required would be:

1.1 * (4 * 128 + 8 * 16) * 1,000,000,000 * 2 = 1,408 GB.

If we increase the size of vectors to 512, for example, and the m to 100, which is recommended for vectors with high intrinsic dimensionality, some use cases can require a total memory of approximately 4 TB.

With OpenSearch, we can always horizontally scale the cluster to handle this memory requirement. However, this comes at the expense of raising infrastructure costs. For cases where scaling doesn’t make sense, options to reduce the memory footprint of the k-NN system need to be explored. Fortunately, there are algorithms that we can use to do this.

Inverted File System algorithm

Consider a different approach for approximating a nearest neighbor search: separate your index vectors into a set of buckets, then, to reduce your search time, only search through a subset of these buckets. From a high level, this is what the Inverted File System (IVF) ANN algorithm does. In OpenSearch 1.2, the k-NN plugin introduced support for the implementation of IVF by Faiss. Faiss is an open-sourced library from Meta for efficient similarity search and clustering of dense vectors.

However, if we just randomly split up our vectors into different buckets, and only search a subset of them, this will be a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.

To determine what the representative vectors for the buckets are, the IVF algorithm requires a training step. In this step, k-Means clustering is run on a set of training data, and the centroids it produces become the representative vectors. The following diagram illustrates this process.

Inverted file system indexing concept

IVF has two parameters:

nlist – The number of buckets to create. More buckets will result in longer training times, but may improve the granularity of the search.
nprobes – The number of buckets to search. This parameter is fairly straightforward. The more buckets that are searched, the longer the search will take, but the better the approximation.

Memory consumption

In general, IVF requires less memory than HNSW because IVF doesn’t need to store a set of edges for each indexed vector.

We estimate that IVF will roughly require the following amount of memory:

1.1 * (((4 * dimension) * num_vectors) + (4 * nlist * dimension)) bytes

For the case explored for HNSW where there are 1,000,000,000 128-dimensional vectors with one layer of replication, an IVF algorithm with an nlist of 4096 would take roughly 1.1 * (((4 * 128) * 2,000,000,000) + (4 * 4096 * 128)) bytes = 1126 GB.

This savings does come at a cost, however, because HNSW offers a better query latency versus approximation accuracy tradeoff.

Product quantization vector compression

Although you can use HNSW and IVF to speed up nearest neighbor search, they can consume a considerable amount of memory. When we get into the billion-vector scale, we start to require thousands of GBs of memory to support their index structures. As we scale up the number of vectors or the dimension of vectors, this requirement continues to grow. Is there a way to use noticeably less space for our k-NN index?

The answer is yes! In fact, there are a lot of different ways to reduce the amount of memory vectors require. You can change your embedding model to produce smaller vectors, or you can apply techniques like Principle Component Analysis (PCA) to reduce the vector’s dimensionality. Another approach is to use quantization. The general idea of vector quantization is to map a large vector space with continuous values into a smaller space with discrete values. When a vector is mapped into a smaller space, it requires fewer bits to represent. However, this comes at a cost—when mapping to a smaller input space, some information about the vector is lost.

Product quantization (PQ) is a very popular quantization technique in the field of nearest neighbor search. It can be used together with ANN algorithms for nearest neighbor search. Along with IVF, the k-NN plugin added support for Faiss’s PQ implementation in OpenSearch 1.2.

The main idea of PQ is to break up a vector into several sub-vectors and encode the sub-vectors independently with a fixed number of bits. The number of sub-vectors that the original vector is broken up into is controlled by a parameter, m, and the number of bits to encode each sub-vector with is controlled by a parameter, code_size. After encoding finishes, a vector is compressed into roughly m * code_size bits. So, assume we have a set of 100,000 1024-dimensional vectors. With m = 8 and code_size = 8, PQ breaks each vector into 8 128-dimensional sub-vectors and encode each sub-vector with 8 bits.

The values used for encoding are produced during a training step. During training, tables are created with 2^code_size entries for each sub-vector partition. Next, k-Means clustering, with a k value of 2^code_size, is run on the corresponding partition of sub-vectors from the training data. The centroids produced here are added as the entries to the partition’s table.

After all the tables are created, we encode a vector by replacing each sub-vector with the ID of the closest vector in the partition’s table. In the example where code_size = 8, we only need 8 bits to store an ID because there are 28 elements in the table. So, with dimension = 1024 and m = 8, the total size of one vector (assuming it uses a 32-bit floating point data type) is reduced from 4,096 bytes to roughly 8 bytes!

Product quantization encoding step

When we want to decode a vector, we can reconstruct an approximated version of it by using the stored IDs to retrieve the vectors from each partition’s table. The distance from the query vector to the reconstructed vector can then be computed and used in a nearest neighbor search. (It’s worth noting that, in practice, further optimization techniques like ADC are used to speed up this process for k-NN search).

Memory consumption

As we mentioned earlier, PQ will encode each vector into roughly m * code_size bits plus some overhead for each vector.

When combining it with IVF, we can estimate the index size as follows:

1.1 * ((((code_size/8) * m + overhead_per_vector) * num_vectors) + (4 * nlist * dimension) + (2 ^code_size * 4 * dimension) bytes

Using 1 billion vectors, dimension = 128, m = 8, code_size = 8, and nlist = 4096, we get an estimated total memory consumption of 70GB: 1.1 * ((((8 / 8) * 8 + 24) * 1,000,000,000) + (4 * 4096 * 128) + (2^8 * 4 * 128)) * 2 = 70 GB.

Running k-NN with OpenSearch

First make sure you have an OpenSearch cluster up and running. For instructions, refer to Cluster formation. For a more managed solution, you can use Amazon OpenSearch Service.

Before getting into the experiments, let’s go over how to run k-NN workloads in OpenSearch. First, we need to create an index. An index stores a set of documents in a way that they can be easily searched. For k-NN, the index’s mapping tells OpenSearch what algorithms to use and what parameters to use with them. We start by creating an index that uses HNSW as its search algorithm:

PUT my-hnsw-index
{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 10,
      "number_of_replicas" 1,
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 4,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}

In the settings, we need to enable knn so that the index can be searched with the knn query type (more on this later). We also set the number of shards, and the number of replicas each shard will have. An index is made up of a collection of shards. Sharding is how OpenSearch distributes an index across multiple nodes in a cluster. For more information about shards, refer to Sizing Amazon OpenSearch Service domains.

In the mappings, we configure a field called my_vector of type knn_vector to store the vector data. We also pass nmslib as the engine to let OpenSearch know it should use nmslib’s implementation of HNSW. Additionally, we pass l2 as the space_type. The space_type determines the function used to compute the distance between two vectors. l2 refers to the Euclidean distance. OpenSearch also supports cosine similarity and the inner product distance functions.

After the index is created, we can ingest some fake data:

POST _bulk
{ "index": { "_index": "my-hnsw-index", "_id": "1" } }
{ "my_vector": [1.5, 2.5], "price": 12.2 }
{ "index": { "_index": "my-hnsw-index", "_id": "2" } }
{ "my_vector": [2.5, 3.5], "price": 7.1 }
{ "index": { "_index": "my-hnsw-index", "_id": "3" } }
{ "my_vector": [3.5, 4.5], "price": 12.9 }
{ "index": { "_index": "my-hnsw-index", "_id": "4" } }
{ "my_vector": [5.5, 6.5], "price": 1.2 }
{ "index": { "_index": "my-hnsw-index", "_id": "5" } }
{ "my_vector": [4.5, 5.5], "price": 3.7 }
{ "index": { "_index": "my-hnsw-index", "_id": "6" } }
{ "my_vector": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
{ "index": { "_index": "my-hnsw-index", "_id": "7" } }
{ "my_vector": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
{ "index": { "_index": "my-hnsw-index", "_id": "8" } }
{ "my_vector": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-hnsw-index", "_id": "9" } }
{ "my_vector": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

After adding some documents to the index, we can search it:

GET my-hnsw-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}

Creating an index that uses IVF or PQ is a little bit different because these algorithms require training. Before creating the index, we need to create a model using the training API:

POST /_plugins/_knn/models/my_ivfpq_model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 128,
  "description": "My model description",
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "parameters":{
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size": 8,
                "m": 8
            }
        }
      }
  }
}

The training_index and training_field specify where the training data is stored. The only requirement for the training data index is that it has a knn_vector field that has the same dimension as you want your model to have. The method defines the algorithm that should be used for search.

After the training request is submitted, it will run in the background. To check if the training is complete, you can use the GET model API:

GET /_plugins/_knn/models/my_ivfpq_model/filter_path=model_id,state
{
  "model_id" : "my_ivfpq_model",
  "state" : "created"
}

After the model is created, you can create an index that uses this model:

PUT /my-hnsw-index
{
  "settings" : {
    "index.knn": true
    "number_of_shards" : 10,
    "number_of_replicas" : 1,
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "model_id": "my_ivfpq_model"
      }
    }
  }
}

After the index is created, we can add documents to it and search it just like we did for HNSW.

Experiments

Let’s run a few experiments to see how these algorithms perform in practice and what tradeoffs are made. We look at an HNSW versus an IVF index using PQ. For these experiments, we’re interested in search accuracy, query latency, and memory consumption. Because these trade-offs are mainly observed at scale, we use the BIGANN dataset containing 1 billion vectors of 128 dimensions. The dataset also contains 10,000 queries of test data mapping a query to the ground truth closest 100 vectors based on the Euclidean distance.

Specifically, we compute the following search metrics:

Latency p99 (ms), Latency p90 (ms), Latency p50 (ms) – Query latency at various quantiles in milliseconds
recall@10 – The fraction of the top 10 ground truth neighbors found in the 10 results returned by the plugin
Native memory consumption (GB) – The amount of memory used by the plugin during querying

One thing to note is that the BIGANN dataset uses an unsigned integer as the data type. Because the knn_vector field doesn’t support unsigned integers, the data is automatically converted to floats.

To run the experiments, we complete the following steps:

Ingest the dataset into the cluster using the OpenSearch Benchmarks framework (the code can be found on GitHub).
When ingestion is complete, we use the warmup API to prepare the cluster for the search workload.
We run the 10,000 test queries against the cluster 10 times and collect the aggregated results.

The queries return the document ID only, and not the vector, to improve performance (code for this can be found on GitHub).

Parameter selection

One tricky aspect of running experiments is selecting the parameters. There are too many different combinations of parameters to test them all. That being said, we decided to create three configurations for HNSW and IVFPQ:

Optimize for search latency and memory
Optimize for recall
Fall somewhere in the middle

For each optimization strategy, we chose two configurations.

For HNSW, we can tune the m, ef_construction, and ef_search parameters to achieve our desired trade-off:

m – Controls the maximum number of edges a node in a graph can have. Because each node has to store all of its edges, increasing this value will increase the memory footprint, but also increase the connectivity of the graph, which will improve recall.
ef_construction – Controls the size of the candidate queue for edges when adding a node to the graph. Increasing this value will increase the number of candidates to consider, which will increase the index latency. However, because more candidates will be considered, the quality of the graph will be better, leading to better recall during search.
ef_search – Similar to ef_construction, it controls the size of the candidate queue for graph traversal during search. Increasing this value will increase the search latency, but will also improve the recall.

In general, we chose configurations that gradually increased the parameters, as detailed in the following table.

Config ID	Optimization Strategy	m	ef_construction	ef_search
hnsw1	Optimize for memory and search latency	8	32	32
hnsw2	Optimize for memory and search latency	16	32	32
hnsw3	Balance between latency, memory, and recall	16	128	128
hnsw4	Balance between latency, memory, and recall	32	256	256
hnsw5	Optimize for recall	32	512	512
hnsw6	Optimize for recall	64	512	512

For IVF, we can tune two parameters:

nlist – Controls the granularity of the partitioning. The recommended value for this parameter is a function of the number of vectors in the index. One thing to keep in mind is that there are Faiss indexes that map to Lucene segments. There are several Lucene segments per shard and several shards per OpenSearch index. For our estimates, we assumed that there would be 100 segments per shard and 24 shards, so about 420,000 vectors per Faiss index. With this value, we estimated a good value to be 4096 and kept this constant for the experiments.
nprobes – Controls the number of nlist buckets we search. Higher values generally lead to improved recalls at the expense of increased search latencies.

For PQ, we can tune two parameters:

m – Controls the number of partitions to break the vector into. The larger this value is, the better the encoding will approximate the original, at the expense of raising memory consumption.
code_size – Controls the number of bits to encode a sub-vector with. The larger this value is, the better the encoding approximates the original, at the expense of raising memory consumption. The max value is 8, so we kept it constant at 8 for all experiments.

The following table summarizes our strategies.

Config ID	Optimization Strategy	nprobes	m (num_sub_vectors)
ivfpq1	Optimize for memory and search latency	8	8
ivfpq2	Optimize for memory and search latency	16	8
ivfpq3	Balance between latency, memory, and recall	32	16
ivfpq4	Balance between latency, memory, and recall	64	32
ivfpq5	Optimize for recall	128	16
ivfpq6	Optimize for recall	128	32

Additionally, we need to figure out how much training data to use for IVFPQ. In general, Faiss recommends between 30,000 and 256,000 training vectors for components involving k-Means training. For our configurations, the maximum k for k-Means is 4096 from the nlist parameter. With this formula, the recommended training set size is between 122,880 and 1,048,576 vectors, so we settled on 1 million vectors. The training data comes from the index vector dataset.

Lastly, for the index configurations, we need to select the shard count. It is recommended to keep the shard size between 10–50 GBs for OpenSearch. Experimentally, we determined that for HNSW, a good number would be 64 shards and for IVFPQ, 42. Both index configurations were configured with one replica.

Cluster configuration

To run these experiments, we used Amazon OpenSearch Service using version 1.3 of OpenSearch to create the clusters. We decided to use the r5 instance family, which provides a good trade-off between memory size and cost.

The number of nodes will depend on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. Having more nodes and more memory will generally improve performance, but for these experiments, we want to minimize cost. The amount of memory available per node is computed as memory_available = (node_memory - jvm_size) * circuit_breaker_limit, with the following parameters:

node_memory – The total memory of the instance.
jvm_size – The OpenSearch JVM heap size. Set to 32 GB.
circuit_breaker_limit – The native memory usage threshold for the circuit breaker. Set to 0.5.

Because HNSW and IVFPQ have different memory requirements, we estimate how much memory is needed for each algorithm and determine the required number of nodes accordingly.

For HNSW, with m = 64, the total memory required using the formula from the previous sections is approximately 2,252 GB. Therefore, with r5.12xlarge (384 GB of memory), memory_available is 176 GB and the total number of nodes required is about 12, which we round up to 16 for stability purposes.

Because the IVFPQ algorithm requires less memory, we can use a smaller instance type, the r5.4xlarge instance, which has 128 GB of memory. Therefore, the memory_available for the algorithm is 48 GB. The estimated algorithm memory consumption where m = 64 is a total of 193 GB and the total number of nodes required is four, which we round up to six for stability purposes.

For both clusters, we use c5.2xlarge instance types as dedicated leader nodes. This will provide more stability for the cluster.

According to the AWS Pricing Calculator, for this particular use case, the cost per hour of the HNSW cluster is around $75 an hour, and the IVFPQ cluster costs around $11 an hour. This is important to remember when comparing the results.

Also, keep in mind that these benchmarks can be run using your custom infrastructure, using Amazon Elastic Compute Cloud (Amazon EC2), as long as the instance types and their memory size is equivalent.

Results

The following tables summarize the results from the experiments.

Test ID	p50 Query latency (ms)	p90 Query latency (ms)	p99 Query latency (ms)	Recall@10	Native memory consumption (GB)
hnsw1	9.1	11	16.9	0.84	1182
hnsw2	11	12.1	17.8	0.93	1305
hnsw3	23.1	27.1	32.2	0.99	1306
hnsw4	54.1	68.3	80.2	0.99	1555
hnsw5	83.4	100.6	114.7	0.99	1555
hnsw6	103.7	131.8	151.7	0.99	2055

Test ID	p50 Query latency (ms)	p90 Query latency (ms)	p99 Query latency (ms)	Recall@10	Native memory consumption (GB)
ivfpq1	74.9	100.5	106.4	0.17	68
ivfpq2	78.5	104.6	110.2	0.18	68
ivfpq3	87.8	107	122	0.39	83
ivfpq4	117.2	131.1	151.8	0.61	114
ivfpq5	128.3	174.1	195.7	0.40	83
ivfpq6	163	196.5	228.9	0.61	114

As you might expect, given how many more resources it uses, the HNSW cluster has lower query latencies and better recall. However, the IVFPQ indexes use significantly less memory.

For HNSW, increasing the parameters does in fact lead to better recall at the expense of latency. For IVFPQ, increasing m has the most significant impact on improving recall. Increasing nprobes improves the recall marginally, but at the expense of significant increases in latencies.

Conclusion

In this post, we covered different algorithms and techniques used to perform approximate k-NN search at scale (over 1 billion data points) within OpenSearch. As we saw in the previous benchmarks section, there isn’t one algorithm or approach that optimises for all the metrics at once. HNSW, IVF, and PQ each allow you to optimize for different metrics in your k-NN workload. When choosing the k-NN algorithm to use, first understand the requirements of your use case (How accurate does my approximate nearest neighbor search need to be? How fast should it be? What’s my budget?) and then tailor the algorithm configuration to meet them.

You can take a look at the benchmarking code base we used on GitHub. You can also get started with approximate k-NN search today following the instructions in Approximate k-NN search. If you’re looking for a managed solution for your OpenSearch cluster, check out Amazon OpenSearch Service.

About the Authors

Jack Mazanec is a software engineer working on OpenSearch plugins. His primary interests include machine learning and search engines. Outside of work, he enjoys skiing and watching sports.

Othmane Hamzaoui is a Data Scientist working at AWS. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.

DevOps with serverless Jenkins and AWS Cloud Development Kit (AWS CDK)

2022-09-12 sangusah

Post Syndicated from sangusah original https://aws.amazon.com/blogs/devops/devops-with-serverless-jenkins-and-aws-cloud-development-kit-aws-cdk/

The objective of this post is to walk you through how to set up a completely serverless Jenkins environment on AWS Fargate using AWS Cloud Development Kit (AWS CDK).

Jenkins is a popular open-source automation server that provides hundreds of plugins to support building, testing, deploying, and automation. Jenkins uses a controller-agent architecture in which the controller is responsible for serving the web UI, stores the configurations and related data on disk, and delegates the jobs to the worker agents that run these jobs as their primary responsibility.

Amazon Elastic Container Service (Amazon ECS) using Fargate is a fully-managed container orchestration service that helps you easily deploy, manage, and scale containerized applications. It deeply integrates with the rest of the AWS platform to provide a secure and easy-to-use solution for running container workloads in the cloud and now on your infrastructure. Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building applications without managing servers. Fargate is compatible with both Amazon ECS and Amazon Elastic Kubernetes Service (Amazon EKS).

Solution overview

The following diagram illustrates the solution architecture. The dashed lines indicate the AWS CDK deployment.

Figure 1 This diagram shows AWS CDK and how it deploys using AWS CloudFormation to create the Elastic Load Balancer, AWS Fargate, and Amazon EFS

You’ll be using the following:

The Jenkins controller URL backed by an Application Load Balancer (ALB).
You’ll be using your default Amazon Virtual Private Cloud (Amazon VPC) for this example.
The Jenkins controller runs as a service in Amazon ECS using Fargate as the launch type. You’ll use Amazon Elastic File System (Amazon EFS) as the persistent backing store for the Jenkins controller task. The Jenkins controller and Amazon EFS are launched in private subnets.

Prerequisites

For this post, you’ll utilize AWS CDK using TypeScript.

Follow the guide on Getting Started for AWS CDK to:

Get your local environment setup
Bootstrap your development account

Code

Let’s review the code used to define the Jenkins environment in AWS using the AWS CDK.

Setup your imports

import { Duration, IResource, RemovalPolicy, Stack, Tags } from 'aws-cdk-lib';
import { Construct } from 'constructs';

import * as cdk from 'aws-cdk-lib';

import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as efs from 'aws-cdk-lib/aws-efs';
import { Port } from 'aws-cdk-lib/aws-ec2';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';

Setup your Amazon ECS, which is a logical grouping of tasks or services and set vpc

export class AppStack extends Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const jenkinsHomeDir: string = 'jenkins-home';
    const appName: string = 'jenkins-cdk';

    const cluster = new ecs.Cluster(this, `${appName}-cluster`, {
      clusterName: appName,
    });

    const vpc = cluster.vpc;

Setup Amazon EFS to store the data

    const fileSystem = new efs.FileSystem(this, `${appName}-efs`, {
      vpc: vpc,
      fileSystemName: appName,
      removalPolicy: RemovalPolicy.DESTROY,
    });

Setup Access Point, which are application-specific entry points into an Amazon EFS file system that makes it easier to manage application access to shared datasets

const accessPoint = fileSystem.addAccessPoint(`${appName}-ap`, {
      path: `/${jenkinsHomeDir}`,
      posixUser: {
        uid: '1000',
        gid: '1000',
      },
      createAcl: {
        ownerGid: '1000',
        ownerUid: '1000',
        permissions: '755',
      },
    });

Setup Task Definition to run Docker containers in Amazon ECS

const taskDefinition = new ecs.FargateTaskDefinition(
      this,
      `${appName}-task`,
      {
        family: appName,
        cpu: 1024,
        memoryLimitMiB: 2048,
      }
    );

Setup a Volume mapping the Amazon EFS from above to the Task Definition

taskDefinition.addVolume({
      name: jenkinsHomeDir,
      efsVolumeConfiguration: {
        fileSystemId: fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
        authorizationConfig: {
          accessPointId: accessPoint.accessPointId,
          iam: 'ENABLED',
        },
      },
    });

Setup the Container using the Task Definition and the Jenkins image from the registry

const containerDefinition = taskDefinition.addContainer(appName, {
      image: ecs.ContainerImage.fromRegistry('jenkins/jenkins:lts'),
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'jenkins' }),
      portMappings: [{ containerPort: 8080 }],
    });

Setup Mount Points to bind ephemeral storage to the container

containerDefinition.addMountPoints({
      containerPath: '/var/jenkins_home',
      sourceVolume: jenkinsHomeDir,
      readOnly: false,
    });

Setup Fargate Service to run the container serverless

    const fargateService = new ecs.FargateService(this, `${appName}-service`, {
      serviceName: appName,
      cluster: cluster,
      taskDefinition: taskDefinition,
      desiredCount: 1,
      maxHealthyPercent: 100,
      minHealthyPercent: 0,
      healthCheckGracePeriod: Duration.minutes(5),
    });
    fargateService.connections.allowTo(fileSystem, Port.tcp(2049));

Setup ALB and add listener to checks for connection requests, using the protocol and port that you configure.

    const loadBalancer = new elbv2.ApplicationLoadBalancer(
      this,
      `${appName}-elb`,
      {
        loadBalancerName: appName,
        vpc: vpc,
        internetFacing: true,
      }
    );
    const lbListener = loadBalancer.addListener(`${appName}-listener`, {
      port: 80,
    });

Setup Target to route requests to Jenkins running on Amazon ECS using Fargate

const loadBalancerTarget = lbListener.addTargets(`${appName}-target`, {
      port: 8080,
      targets: [fargateService],
      deregistrationDelay: Duration.seconds(10),
      healthCheck: { path: '/login' },
    });
  }
}

Jenkins Deployment

Now that you have all the code, let’s deploy the AWS CDK definition:

Make sure that you have done the Prerequisite steps from earlier.
Install packages by running the following command in your IDE CLI:

npm i

Now you’ll deploy your AWS CDK definition to your dev account:

cdk deploy

Let’s now login to Jenkins

In your browser, use the DNS Name from the deployed Load Balancer
In Amazon CloudWatch, there will be a Log group that will be created that is associated to Cluster Service.
1. Go into that log and you’ll see it output the Password to login to Jenkins

In Jenkins, follow the wizard to continue the setup

Cleaning up

To avoid incurring future charges, delete the resources.

Let’s destroy our deploy solution

In your IDE CLI:

cdk destroy

Conclusion

With this overview we were able to cover the following:

Build an Elastic Load Balancer
Use AWS Fargate with a Jenkins AMI
All resources running serverlessly
All build using the AWS CDK

About the author:

Amazon DevOps Guru increases Operational Efficiency for 605

2022-08-31 Mohit Gadkari

Post Syndicated from Mohit Gadkari original https://aws.amazon.com/blogs/devops/amazon-devops-guru-increases-operational-efficiency-for-605/

605 is an independent TV measurement firm that offers advertising and content measurement, full-funnel attribution, media planning, optimization, and analytical solutions, all on top of their multi-source viewership data set covering over 21 million U.S. households. 605 has built their technology solutions on AWS with dozens of accounts and tens of thousands of resources to monitor.

As 605 continues to innovate and build new solutions, the size and complexity of their AWS deployment has also grown proportionally. Over time, managing their deployment has become an operational challenge for their current team. 605 has deployed different application performance monitoring (APM) tools and notification systems to help their observability staff scale and support their growing cloud environment. However, 605 realized that their continued growth on the cloud would necessitate either increasing their observability staff or assuming some risk of potential application performance issues or even outages.

Amazon DevOps Guru allowed 605 to find a third path forward. Rather than accepting the trade-off of hiring more staff or assuming more risk, 605 discovered that DevOps Guru provides an increase in operational efficiency using their existing staff resources by applying artificial intelligence (AI) to supplement their existing APM and notification platform. Layering DevOps Guru into their DevOps environment , 605 realized a 4-fold decrease in the number of alerts and notifications that proved to be false positives. In fact, 605 went from an environment where 76.2% of their alerts and notifications were false positives, to one with only 18.9% false positives simply by adding Amazon DevOps Guru. In the end, 605 can more effectively and efficiently manage their environment with existing resources and actually freeing-up DevOps brainpower to work on more strategically important initiatives than application management.

“Amazon DevOps Guru has provided insights that help us focus our infrastructure roadmap. Our current SIEM tools require building out alerting ahead of time, while DevOps Guru is constantly evolving, which prevents becoming stagnant in our monitoring. Reducing the risk of false positive alerts has saved countless engineering hours.”

Jared Williams, VP of Infrastructure and Architecture, 605

605 without DevOps Guru had their Amazon CloudWatch and Amazon Elastic Container Service for Kubernetes ( Amazon EKS) configured with different application performance monitoring and notification systems. They saw only 23.8 % legitimate alerts and notifications, where as with the integration with DevOps Guru the legitimate alerts and notifications went up to 81% for a 6-month time period.
605 are monitoring over 13+ AWS Accounts, 20+ Amazon EKS Clusters, 500+ Pods ,15000+ EC2 Instances, 500+ S3 Buckets and 55+ Application Load Balancers with DevOps Guru

Figure 1. 605 are monitoring over 13+ AWS Accounts, 20+ Amazon EKS Clusters, 500+ Pods ,15000+ EC2 Instances, 500+ S3 Buckets and 55+ Application Load Balancers with DevOps Guru.

Amazon DevOps Guru is a service powered by applying artificial intelligence (AI) that’s designed to make it easy to improve an application’s operational performance and availability. DevOps Guru helps detect behaviors that deviate from normal operating patterns so that you can identify operational issues long before they impact your applications. DevOps Guru utilizes ML models informed by years of Amazon.com and AWS operational excellence to identify anomalous application behavior (for example, increased latency, error rates, resource constraints, and others). Furthermore, it helps surface critical issues that could cause potential outages or service disruptions. When DevOps Guru identifies a critical issue, it automatically sends an alert and provides a summary of related anomalies, the likely root cause, and context for when and where the issue occurred. When possible, DevOps Guru also helps provide recommendations regarding how to remediate the issue. DevOps Guru ingests operational data from your AWS applications and provides a single dashboard to visualize issues in your operational data. DevOps Guru can be enabled for all of the resources in your AWS account, resources in your AWS CloudFormation Stacks, or resources grouped together by AWS Tags, with no manual setup or ML expertise required.

The value of DevOps Guru for 605 goes beyond providing operational efficiency and avoiding the choice of adding DevOps resources or assuming more risk. DevOps Guru also discovered issues with application performance that their existing solutions weren’t trained to inspect.

This new data allowed 605 to avoid a potential problem that they didn’t otherwise know would occur. As DevOps Guru doesn’t require any set-up beyond enabling the service and choosing resources to monitor (it’s a managed service), the service can surface issues without any prior configuration.

In the end, the value of DevOps Guru for 605 surfaces in three ways. First, it increases operational efficiency by allowing their existing DevOps team to more effectively manage its AWS applications and resources, as well as the room to grow along with their business needs. Second, DevOps Guru reduces operational fatigue and allows their DevOps teams to focus on more strategic issues by significantly reducing false positives. Lastly, DevOps Guru can find operational issues to which existing APM tools may not be configured or able to detect.

Start monitoring your AWS applications with AWS DevOps Guru today using this link

About the authors:

Easily protect your AWS CDK-defined infrastructure with AWS WAFv2

2022-08-28 Ramon Lopez Narvaez

Post Syndicated from Ramon Lopez Narvaez original https://aws.amazon.com/blogs/devops/easily-protect-your-aws-cdk-defined-infrastructure-with-aws-wafv2/

Security is a shared responsibility between AWS and the customer. When we use infrastructure as code (IaC) we want to describe workloads wholistically, and that includes the configuration of firewalls alongside the entrypoints to web applications. As we evolve the infrastructure that our application is built upon, we can adjust firewall rules in the same place.

In this post, you’ll learn how you can easily add a layer of protection to your web application that is defined in AWS Cloud Development Kit (AWS CDK) and built using Amazon CloudFront, Amazon API Gateway, Application Load Balancer, or AWS AppSync.

To accomplish this, we’ll use AWS WAFv2. Although it’s usually complex to write your own firewall rules, we can simply use AWS Managed Rules. No tedious setup required!

What is AWS WAFv2?

AWS WAFv2 is a managed web application firewall. It can be natively enabled on CloudFront, API Gateway, Application Load Balancer, or AWS AppSync and is deployed alongside these services. AWS services terminate the TCP/TLS connection, process incoming HTTP requests, and then pass the request to AWS WAF for inspection and filtering.

For example, you can use AWS WAFv2 to protect against attacks, such as cross-site request forgery (CSRF), cross-site scripting (XSS), and SQL injection (SQLi) among other threats in the OWASP Top 10.

AWS Managed Rules for AWS WAF is a set of AWS WAF rules curated and maintained by the AWS Threat Research Team that provides protection against common application vulnerabilities or other unwanted traffic, without having to write your own rules.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
An application fronted by one or more of the following services: Amazon Cloudfront, Amazon API Gateway, Application Load Balancer or AWS AppSync. From here on these are called ‘entrypoint’.
At least the above mentioned ‘entrypoint’ defined in AWS CDK.

Solution overview

When AWS WAF is applied to Amazon CloudFront, Amazon API Gateway, Application Load Balancer, or AWS AppSync, it inspects and filters requests before they’re forwarded to your compute infrastructure.

Figure 1. AWS WAFv2 can protect endpoints built by Amazon CloudFront, Amazon API Gateway, Application Load Balancer and AWS AppSync

Given that you have an existing web application defined in AWS CDK, we want to add a WAFv2 web ACL to its entrypoint. Instead of writing our own firewall rules to inspect and filter requests, we want to leverage an AWS Managed Rules rule group. Simultaneously, we must be able to disable or reconfigure some of the rules in the case that they cause undesirable behavior in the application.

A good first rule group to use is the core rule set (CRS) managed rule group, also named AWSManagedRulesCommonRuleSet. It contains rules that are generally applicable to web applications and provides protection against exploitation of various vulnerabilities, such as the ones described in the OWASP Top 10. You can later add more managed rule groups or write your own rules, which are specific to your application (e.g., for Windows, Linux, or WordPress).

Define the AWS WAFv2 web ACL

First, let’s give the AWS WAF module a nicely readable name:

import { aws_wafv2 as wafv2 } from 'aws-cdk-lib';

Then, we define the AWS WAFv2 web ACL in AWS CDK:

const cfnWebACL = new wafv2.CfnWebACL(this,'MyCDKWebAcl'
      defaultAction: {
        allow: {}
      },
      scope: 'REGIONAL',
      visibilityConfig: {
        cloudWatchMetricsEnabled: true,
        metricName:'MetricForWebACLCDK',
        sampledRequestsEnabled: true,
      },
      name:‘MyCDKWebAcl’,
      rules: [{
        name: 'CRSRule',
        priority: 0,
        statement: {
          managedRuleGroupStatement: {
            name:'AWSManagedRulesCommonRuleSet',
            vendorName:'AWS'
          }
        },
        visibilityConfig: {
          cloudWatchMetricsEnabled: true,
          metricName:'MetricForWebACLCDK-CRS',
          sampledRequestsEnabled: true,
        },
        overrideAction: {
          none: {}
        },
      }]
    });

The highlighted line references the CRS managed rule group as one Rule in the list. You could add more Rule elements, either referencing the managed rule groups or custom rules.

Note the scope attribute. If you want to attach this web ACL to an API Gateway, AWS AppSync API, or Application Load Balancer, then it will be REGIONAL. If you want to attach it to a CloudFront distribution, then make sure that your AWS WAFv2 web ACL is defined in the US East (N. Virginia) Region and the scope is CLOUDFRONT.

Attach the AWS WAFv2 web ACL to an Application Load Balancer, AWS AppSync API, or API Gateway

Now that we have a web ACL defined, we must attach it to a resource. This works exactly the same across API Gateway API’s, an AWS AppSync API, or an Application Load Balancer. We must create a CfnWebACLAssociation and point it to the previously created web ACL and the resource to protect:

const cfnWebACLAssociation = new wafv2.CfnWebACLAssociation(this,'MyCDKWebACLAssociation', {
      resourceArn:<ARN of resource to protect>,
      webAclArn:cfnWebACL.attrArn,
    });

Amazon Resource Names (ARNs) uniquely identify AWS resources. The highlighted line shows how AWS CDK lets you get the ARN of the previously defined CfnWebAcl.

Depending on what type of service you’re using, jump to one of the three following sections to learn how to retrieve the resourceArn of API Gateway, AWS AppSync, or Application Load Balancers.

Retrieving ARN for AWS AppSync API’s

To retrieve the ARN of an AWS AppSync API, call the .arn property:

const api = new appsync.GraphqlApi(…)
const cfnWebACLAssociation = new wafv2.CfnWebACLAssociation(this,'MyCDKWebACLAssociation', {
      resourceArn:api.arn,
      webAclArn: cfnWebACL.attrArn,
    });

Retrieving ARN for Amazon API Gateway REST API’s

In this case, we must specify which stage of the REST API we want to protect with the web ACL. Then, we reference the ARN of the stage:

const api = new apigateway.RestApi(…)
const deployment = new apigateway.Deployment(…)
const stage = apigateway.Stage(…)
const cfnWebACLAssociation = new wafv2.CfnWebACLAssociation(this,'MyCDKWebACLAssociation', {
      resourceArn:stage.stageArn,
      webAclArn: cfnWebACL.attrArn,
    });

Retrieving ARN for Application Load Balancers

If you’re dealing with an Application Load Balancer, then this is how you can retrieve its ARN:

const lb = new elbv2.ApplicationLoadBalancer(…)

const cfnWebACLAssociation = new wafv2.CfnWebACLAssociation(this,'MyCDKWebACLAssociation', {
      resourceArn:lb.loadBalancerArn,
      webAclArn: cfnWebACL.attrArn,
    });

Attach the AWS WAFv2 web ACL to a CloudFront distribution

Attaching a web ACL to CloudFront follows a different approach. Instead of defining a cfnWebACLAssociation, we reference the web ACL inside of the Distribution definition:

const distribution = new cloudfront.Distribution(this,'distro', {
      defaultBehavior: {
        origin: new origins.S3Origin(s3Bucket)
      },
     webAclId:cfnWebACL.attrArn
    });

Note that even though the property is called webAclId, because we’re using AWS WAFv2, we must supply the ARN of the web ACL.

Exclude rules from the web ACL

Lastly, let’s understand how we can customize the web ACL further. If a rule of the managed rule group causes undesired behavior in the application, then we can exclude it from the webACL. Assume that we want to exclude the SizeRestrictions_BODY rule, which limits the request body size to 8 KB.

Go back to the definition of the web ACL, and add the highlighted lines:

const cfnWebACL = new wafv2.CfnWebACL(this, 'MyCDKWebAcl', {
      defaultAction: {
        allow: {}
      },
      scope:'REGIONAL',
      visibilityConfig: {
        cloudWatchMetricsEnabled: true,
        metricName:'MetricForWebACLCDK',
        sampledRequestsEnabled: true,
      },
      name:'MyCDKWebAcl',
      rules: [{
        name:'CRSRule',
        priority: 0,
        statement: {
          managedRuleGroupStatement: {
            name: 'AWSManagedRulesCommonRuleSet',
            vendorName: 'AWS',
            excludedRules: [{
             ‘SizeRestrictions_BODY’ }]
          }
        },
        visibilityConfig: {
          cloudWatchMetricsEnabled: true,
          metricName:'MetricForWebACLCDK-CRS',
          sampledRequestsEnabled: true,
        },
        overrideAction: {
          none: {}
        },
      }]

    });

Other customizations you can do include pinning the version of the rule group and narrowing the scope of the request that the rule evaluates, using Scope-down statements.

Conclusion

In this post, you’ve seen how an AWS WAFv2 web ACL can be added to your existing infrastructure defined in AWS CDK. By using Managed Rules, your application benefits from a layer of protection that is curated and maintained by AWS security experts.

As a next step, you can learn how to include AWS WAFv2 metrics from Amazon CloudWatch into your application dashboards. This will give you perspective on how your web application is performing in conjunction with the AWS WAFv2 web ACL.

To learn more about AWS WAFv2 and how to manage web ACL’s, check out the official developer guide.

About the author:

Diving Deep into EC2 Spot Instance Cost and Operational Practices

2022-08-26 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/diving-deep-into-ec2-spot-instance-cost-and-operational-practices/

This blog post is written by, Sudhi Bhat, Senior Specialist SA, Flexible Compute.

Amazon EC2 Spot Instances are one of the popular choices among customers looking to cost optimize their workload running on AWS. Spot Instances let you take advantage of unused Amazon Elastic Compute Cloud (Amazon EC2) capacity in the AWS cloud and are available at up to a 90% discount compared to On-Demand EC2 instance prices. The key difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by Amazon EC2, with two minutes of notification, when Amazon EC2 needs the capacity back. Spot Instances are recommended for various stateless, fault-tolerant, or flexible applications, such as big data, containerized workloads, continuous integration/continuous development (CI/CD), web servers, high-performance computing (HPC), and test and development workloads.

Customers asked us for fast and easy ways to track and optimize usage for different services. In this post, we’ll focus on tools and techniques that can provide useful insights into the usages and behavior of workloads using Spot Instances, as well as how we can leverage those techniques for troubleshooting and cost tracking purposes.

Operational tools

Instance selection

One of the best practices while using Spot Instances is to be flexible about instance types, Regions, and Availability Zones, as this gives Spot a better cross-section of compute pools to select and allocate your desired capacity. AWS makes it easier to diversify your instance selection in Auto Scaling groups and EC2 Fleet through features like Attribute-Based Instance Type Selection, where you can select the instance requirements as a set of attributes like vCPU, memory, storage, etc. These requirements are translated into matching instance types automatically.

Considering that AWS Cloud spans across 25+ Regions and 80+ Availability Zones, finding the optimal location (either a Region or Availability Zone) to fulfil Spot capacity needs without launching a Spot can be very handy. This is especially true when AWS customers have the flexibility to run their workloads across multiple Regions or Availability Zones. This functionality can be achieved with one of the newer features called Amazon EC2 Spot placement score. Spot placement score provides a list of Regions or Availability Zones, each scored from 1 to 10, based on factors such as the requested instance types, target capacity, historical and current Spot usage trends, and the time of the request. The score reflects the likelihood of success when provisioning Spot capacity, with a 10 meaning that the request is highly likely to succeed.

If you wish to specifically select and match your instances to your workloads to leverage them, then refer to Spot Instance Advisor to determine Spot Instances that meet your computing requirements with their relative discounts and associated interruption rates. Spot Instance Advisor populates the frequency of interruption and average savings over On-Demand instances based on the last 30 days of historical data. However, note that the past interruption behavior doesn’t predict the future availability of these instances. Therefore, as a part of instance diversity, try to leverage as many instances as possible regardless of whether or not an instance has a high level of interruptions.

Spot Instance pricing history

Understanding the price history for a specific Amazon EC2 Spot Instance can be useful during instance selection. However, tracking these pricing changes can be complex. Since November 2017, AWS launched a new pricing model that simplified the Spot purchasing experience. The new model gives AWS Customers predictable prices that adjust slowly over days and weeks, as Spot Instance prices are now determined based on long-term trends in supply and demand for Spot Instance capacity. The current Spot Instance prices can be viewed on AWS website, and the Spot Instance pricing history can be viewed on the Amazon EC2 console or accessed via AWS Command Line Interface (AWS CLI). Customers can continue to access the Spot price history for the last 90 days, filtering by instance type, operating system, and Availability Zone to understand how the Spot pricing has changed.

Accessing Pricing history via AWS CLI using describe-spot-price-history or Get-EC2SpotPriceHistory (AWS Tools for Windows PowerShell).

aws ec2 describe-spot-price-history --start-time 2018-05-06T07:08:09 --end-time 2018-05-06T08:08:09 --instance-types c4.2xlarge --availability-zone eu-west-1a --product-description "Linux/UNIX (Amazon VPC)“
{
    "SpotPriceHistory": [
        {
            "Timestamp": "2018-05-06T06:30:30.000Z",
            "AvailabilityZone": "eu-west-1a",
            "InstanceType": "c4.2xlarge",
            "ProductDescription": "Linux/UNIX (Amazon VPC)",
            "SpotPrice": "0.122300"
        }
    ]
}

Spot Instance data feed

EC2 offers a mechanism to describe Spot Instance usage and pricing by providing a data feed that can be subscribed to. Therefore, the data feed is sent to an Amazon Simple Storage Service (Amazon S3) bucket on an hourly basis. Learn more about setting up the Spot Data feed and configuring the S3 bucket options in the documentation. A sample data feed would look like the following:

The above example provides more information about Spot Instance in use, like m4.large Instance being used at the time as specified by Timestamp and MyBidID=sir-11wsgc6k representing the request that generated this instance usage, Charge=0.045 USD indicating the discounted price charged compared to the MyMaxPrice, which was set to On-Demand cost. This information can be useful during troubleshooting, as you can refer to the information about Spot Instances even if that specific instance has been terminated. Moreover, you could choose to extend the use of this data for simple querying and visualization/analytics purposes using Amazon Athena.

Amazon EC2 Spot Instance Interruption dashboard

Spot Instance interruptions are an inherent part of the Spot Instance lifecycle. For example, it’s always possible that your Spot Instance might be interrupted depending on how many unused EC2 instances are available. Therefore, you must make sure that your application is prepared for a Spot Instance interruption.

There are several best practices regarding handling Spot interruptions as described in the blog “Best practices for handling EC2 Spot Instance interruptions”. Tracking Spot Instance interruptions can be useful in some scenarios, such as evaluating your workload for the tolerance for interruptions of a specific instance type, or to simply learn more about frequency of interruptions in your test environment so that you can fine-tune your instance selection. In these scenarios, you can use the EC2 Spot interruption dashboard, which is an opensource sample reference solution for logging Spot Instance interruptions. Spot Instance interruptions can fluctuate dynamically based on overall Spot Instance availability and demand for On-Demand Instances. However, it is important to note that tracking interruptions may not always represent the true Spot experience. Therefore, it’s recommended that this solution be used for those situations where Spot Instance interruptions inform a specific outcome, as it doesn’t accurately reflect system health or availability. It’s recommended to use this solution in dev/test environments to provide an educated view of how to use Spot Instances in production systems.

Cost management tools

AWS Pricing Calculator

AWS Pricing Calculator is a free tool that lets you create cost estimates for workloads that you run on AWS Services, including EC2 and Spot Instances. This calculator can greatly assist in calculating the cost of compute instances and estimating the future costs so that customers can compare the cost savings to be achieved before they even launch a Spot Instance as part of their solution. The AWS Pricing Calculator advanced estimate path offers six pricing strategies for Amazon EC2 instances. The pricing models include On-Demand, Reserved, Savings Plans, and Spot Instances. The estimates generated can also be exported to a CSV or PDF file format for quick sharing and additional analysis of the proposed architecture spend.

For Spot Instances, the calculator shows the historical average discount percentage for the instance chosen, and lets you enter a percentage discount for creating forecasts. We recommend choosing an instance type that best represents your target compute, memory, and network requirements for running your workload and generating an approximate estimate.

AWS Cost Management

One of the popular reporting tools offered by AWS is AWS Cost Explorer, which has an easy-to-use interface that lets you visualize, understand, and manage your AWS costs and usage over time, including Spot Instances. You can view data up to the last 12 months, and forecast the next three months. You can use Cost Explorer filtered by “Purchase Options” to see patterns in how much you spend on Spot Instances over time, and see trends that you can use to understand your costs. Furthermore, you can specify time ranges for the data, and view time data by day or by month. Moreover, you can leverage the Amazon EC2 Instance Usage reports to gain insights into your instance usage and patterns, along with information that you need to optimize the overall EC2 use.

AWS Billing and Cost Management offers a way to organize your resource costs on your cost allocation report by leveraging cost allocation tags, so that it’s easier to categorize and track your AWS costs using cost allocation reports, which includes all of your AWS costs for each billing period. The report includes both tagged and untagged resources, so that you can clearly organize the charges for resources. For example, if you tag resources with an application name that is deployed on Spot Instances, you can track the total cost of that single application that runs on those resources. The AWS generated tags “createdBy” is a tag that AWS defines and applies to supported AWS resources for cost allocation purposes and if opted, this tag is applied to “Spot-instance-request” resource type whenever the RequestSpotInstances API is invoked. This can be a great way to track the Spot Instance creation activities in your billing reports.

Cost and Usage Reports

AWS Customers have access to raw cost and usage data through the AWS Cost and Usage (AWS CUR) reports. These reports contain the most comprehensive information about your AWS usage and costs. Financial teams need this data so that they have an overview of their monthly, quarterly, and yearly AWS spend. But this data is equally valuable for technical teams who need detailed resource-level granularity to understand which resources are contributing to the spend, and what parts of the system to optimize. If you’re using Spot Instances for your compute needs, then AWS CUR populates the Amazon EC2 Spot usage pricing/* columns and the product/* columns. With this data, you can calculate the past savings achieved with Spot through the AWS CUR. Note that this feature was enabled in July 2021 and the AWS CUR data for Spot Usage is available only since then. The Cloud Intelligence Dashboards provide prebuilt visualizations that can help you get a detailed view of your AWS usage and costs. You can learn more about deploying Cloud Intelligence Dashboards by referring to the detailed blog “Visualize and gain insight into you AWS cost and usage with Cloud Intelligence Dashboard and CUDOS using Amazon QuickSite”

Conclusion

It’s always recommended to follow Spot Instance best practices while using Amazon EC2 Spot Instances for suitable workloads, so that you can have the best experience. In this post, we explored a few tools and techniques that can further guide you toward much deeper insights into your workloads that are using Spot Instances. This can assist you with understanding cost savings and help you with troubleshooting so that you can use Spot Instances more easily.