Tag Archives: Best practices

Externalize Amazon MSK Connect configurations with Terraform

Post Syndicated from Ramc Venkatasamy original https://aws.amazon.com/blogs/big-data/externalize-amazon-msk-connect-configurations-with-terraform/

Managing configurations for Amazon MSK Connect, a feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK), can become challenging, especially as the number of topics and configurations grows. In this post, we address this complexity by using Terraform to optimize the configuration of the Kafka topic to Amazon S3 Sink connector. By adopting this strategic approach, you can establish a robust and automated mechanism for handling MSK Connect configurations, eliminating the need for manual intervention or connector restarts. This efficient solution will save time, reduce errors, and provide better control over your Kafka data streaming processes. Let’s explore how Terraform can simplify and enhance the management of MSK Connect configurations for seamless integration with your infrastructure.

Solution overview

At a well-known AWS customer, the management of their constantly growing MSK Connect S3 Sink connector topics has become a significant challenge. The challenges lie in the overhead of managing configurations, as well as dealing with patching and upgrades. Manually handling Kubernetes (K8s) configs and restarting connectors can be cumbersome and error-prone, making it difficult to keep track of changes and updates. At the time of writing this post, MSK Connect does not offer native mechanisms to easily externalize the Kafka topic to S3 Sink configuration.

To address these challenges, we introduce Terraform, an infrastructure as code (IaC) tool. Terraform’s declarative approach and extensive ecosystem make it an ideal choice for managing MSK Connect configurations.

By externalizing Kafka topic to S3 configurations, organizations can achieve the following:

  • Scalability – Effortlessly manage a growing number of topics, ensuring the system can handle increasing data volumes without difficulty
  • Flexibility – Seamlessly integrate MSK Connect configurations with other infrastructure components and services, enabling adaptability to changing business needs
  • Automation – Automate the deployment and management of MSK Connect configurations, reducing manual intervention and streamlining operational tasks
  • Centralized management – Achieve improved governance with centralized management, version control, auditing, and change tracking, ensuring better control and visibility over the configurations

In the following sections, we provide a detailed guide on establishing Terraform for MSK Connect configuration management, defining and decentralizing Topic configurations, and deploying and updating configurations using Terraform.

Prerequisites

Before proceeding with the solution, ensure you have the following resources and access:

  • You need access to an AWS account with sufficient permissions to create and manage resources, including AWS Identity and Access Management (IAM) roles and MSK clusters.
  • To simplify the setup, use the provided AWS CloudFormation template. This template will create the necessary MSK cluster and required resources for this post.
  • For this post, we are using the latest Terraform version (1.5.6).

By ensuring you have these prerequisites in place, you will be ready to follow the instructions and streamline your MSK Connect configurations with Terraform. Let’s get started!

Setup

Setting up Terraform for MSK Connect configuration management includes the following:

  • Installation of Terraform and setting up the environment
  • Setting up the necessary authentication and permissions

Defining and decentralizing topic configurations using Terraform includes the following:

  • Understanding the structure of Terraform configuration files
  • Determining the required variables and resources
  • Utilizing Terraform’s modules and interpolation for flexibility

The decision to externalize the configuration was primarily driven by the customer’s business requirement. They anticipated the need to add topics periodically and wanted to avoid the need to bring down and write specific code each time. Given the limitations of MSK Connect (as of this writing), it’s important to note that MSK Connect can handle up to 300 workers. For this proof of concept (POC), we opted for a configuration with 100 topics directed to a single Amazon Simple Storage Service (Amazon S3) bucket. To ensure compatibility within the 300-worker limit, we set the MCU count to 1 and configured auto scaling with a maximum of 2 workers. This ensures that the configuration remains within the bounds of the 300-worker maximum.

To make the configuration more flexible, we specify the variables that can be utilized in the code.(variables.tf):

variable "aws_region" {
description = "The AWS region to deploy resources in."
type = string
}

variable "s3_bucket_name" {
description = "s3_bucket_name."
type = string
}

variable "topics" {
description = "topics"
type = string
}

variable "msk_connect_name" {
description = "Name of the MSK Connect instance."
type = string
}

variable "msk_connect_description" {
description = "Description of the MSK Connect instance."
type = string
}

# Rest of the variables...

To set up the AWS MSK Connector for the S3 Sink, we need to provide various configurations. Let’s examine the connector_configuration block in the code snippet provided in the main.tf file in more detail:

connector_configuration = {
"connector.class" = "io.confluent.connect.s3.S3SinkConnector"
"s3.region" = "us-east-1"
"flush.size" = "5"
"schema.compatibility" = "NONE"
"tasks.max" = "1"
"topics" = var.topics
"format.class" = "io.confluent.connect.s3.format.json.JsonFormat"
"partitioner.class" = "io.confluent.connect.storage.partitioner.DefaultPartitioner"
"value.converter.schemas.enable" = "false"
"value.converter" = "org.apache.kafka.connect.json.JsonConverter"
"storage.class" = "io.confluent.connect.s3.storage.S3Storage"
"key.converter" = "org.apache.kafka.connect.storage.StringConverter"
"s3.bucket.name" = var.s3_bucket_name
"topics.dir" = "cxdl-data/KairosTelemetry"
}

The kafka_cluster block in the code snippet defines the Kafka cluster details, including the bootstrap servers and VPC settings. You can reference the variables to specify the appropriate values:

kafka_cluster {
apache_kafka_cluster {
bootstrap_servers = var.bootstrap_servers

vpc {
security_groups = [var.security_groups]
subnets = [var.aws_subnet_example1_id, var.aws_subnet_example2_id, var.aws_subnet_example3_id]
}
}
}

To secure the connection between Kafka and the connector, the code snippet includes configurations for authentication and encryption:

  • The kafka_cluster_client_authentication block sets the authentication type to IAM, enabling the use of IAM for authentication
  • The kafka_cluster_encryption_in_transit block enables TLS encryption for data transfer between Kafka and the connector
  kafka_cluster_client_authentication {
    authentication_type = "IAM"
  }

  kafka_cluster_encryption_in_transit {
    encryption_type = "TLS"
  }

You can externalize the variables and provide dynamic values using a var.tfvars file. Let’s assume the content of the var.tfvars file is as follows:

aws_region = "us-east-1"
msk_connect_name = "confluentinc-MSK-connect-s3-2"
msk_connect_description = "My MSK Connect instance"
s3_bucket_name = "msk-lab-xxxxxxxxxxxx-target-bucket"
topics = "salesdb.salesdb.CUSTOMER,salesdb.salesdb.CUSTOMER_SITE,salesdb.salesdb.PRODUCT,salesdb.salesdb.PRODUCT_CATEGORY,salesdb.salesdb.SALES_ORDER,salesdb.salesdb.SALES_ORDER_ALL,salesdb.salesdb.SALES_ORDER_DETAIL,salesdb.salesdb.SALES_ORDER_DETAIL_DS,salesdb.salesdb.SUPPLIER"
bootstrap_servers = "b-2.mskclustermskconnectl.4xwlfx.c11.kafka.us-east-1.amazonaws.com:9098,b-3.mskclustermskconnectl.4xwlfx.c11.kafka.us-east-1.amazonaws.com:9098,b-1.mskclustermskconnectl.4xwlfx.c11.kafka.us-east-1.amazonaws.com:9098“
aws_subnet_example1_id = "subnet-016ef7bb5f5db5759"
aws_subnet_example2_id = "subnet-0114c390d379134fa"
aws_subnet_example3_id = "subnet-0f6352ad89a1454f2"
security_groups = "sg-07eb8f8e4559334e7"
aws_mskconnect_custom_plugin_example_arn = "arn:aws:kafkaconnect:us-east-1:xxxxxxxxxxxx:custom-plugin/confluentinc-kafka-connect-s3-10-0-3/e9aeb52e-d172-4dba-9de5-f5cf73f1cb9e-2"
aws_mskconnect_custom_plugin_example_latest_revision = "1"
aws_iam_role_example_arn = "arn:aws:iam::xxxxxxxxxxxx:role/msk-connect-lab-S3ConnectorIAMRole-3LBTU7YAV9CM"

Deploy and update configurations using Terraform

Once you’ve defined your MSK Connect infrastructure using Terraform, applying these configurations is a straightforward process for creating or updating your infrastructure. This becomes particularly convenient when a new topic needs to be added. Thanks to the externalized configuration, incorporating this change is now a seamless task. The steps are as follows:

  1. Download and install Terraform from the official website (https://www.terraform.io/downloads.html) for your operating system.
  2. Confirm the installation by running the terraform version command on your command line interface.
  3. Ensure that you have configured your AWS credentials using the AWS Command Line Interface (AWS CLI) or by setting environment variables. You can use the aws configure command to configure your credentials if you’re using the AWS CLI.
  4. Place the main.tf, variables.tf, and var.tfvars files in the same Terraform directory.
  5. Open a command line interface, navigate to the directory containing the Terraform files, and run the command terraform init to initialize Terraform and download the required providers.
  6. Run the command terraform plan -var-file="var.tfvars" to review the run plan.

This command shows the changes that Terraform will make to the infrastructure based on the provided variables. This step is optional but is often used as a preview of the changes Terraform will make.

  1. If the plan looks correct, run the command terraform apply -var-file="var.tfvars" to apply the configuration.

Terraform will create the MSK_Connect in your AWS account. This will prompt you for confirmation before proceeding.

  1. After the terraform apply command is complete, verify the infrastructure has been created or updated on the console.
  2. For any changes or updates, modify your Terraform files (main.tf, variables.tf, var.tfvars) as needed, and then rerun the terraform plan and terraform apply commands.
  3. When you no longer need the infrastructure, you can use terraform destroy -var-file="var.tfvars" to remove all resources created by your Terraform files.

Be careful with this command because it will delete all the resources defined in your Terraform files.

Conclusion

In this post, we addressed the challenges faced by a customer in managing MSK Connect configurations and described a Terraform-based solution. By externalizing Kafka topic to Amazon S3 configurations, you can streamline your configuration management processes, achieve scalability, enhance flexibility, automate deployments, and centralize management. We encourage you to use Terraform to optimize your MSK Connect configurations and explore further possibilities in managing your streaming data pipelines efficiently.

To get started with externalizing MSK Connect configurations using Terraform, refer to the provided implementation steps and the Getting Started with Terraform guide, MSK Connect documentation, Terraform documentation, and example GitHub repository.

Using Terraform to externalize the Kafka topic to Amazon S3 Sink configuration in MSK Connect offers a powerful solution for managing and scaling your streaming data pipelines. By automating the deployment, updating, and central management of configurations, you can ensure efficiency, flexibility, and scalability in your data processing workflows.


About the Author

RamC Venkatasamy is a Solutions Architect based in Bloomington, Illinois. He helps AWS Strategic customers transform their businesses in the cloud. With a fervent enthusiasm for Serverless, Event-Driven Architecture and GenAI.

Manage roles and entitlements with PBAC using Amazon Verified Permissions

Post Syndicated from Abhishek Panday original https://aws.amazon.com/blogs/devops/manage-roles-and-entitlements-with-pbac-using-amazon-verified-permissions/

Traditionally, customers have used role-based access control (RBAC) to manage entitlements within their applications. The application controls what users can do, based on the roles they are assigned. But, the drive for least privilege has led to an exponential growth in the number of roles. Customers can address this role explosion by moving authorization logic out of the application code, and implementing a policy-based access control (PBAC) model that augments RBAC with attribute-based access control (ABAC).

In this blog post, we cover roles and entitlements, how they are applicable in apps authorization decisions, how customers implement roles and authorization in their app today, and how to shift to a centralized PBAC model by using Amazon Verified Permissions.

Describing roles and entitlements, approaches and challenges of current implementations

In RBAC models, a user’s entitlements are assigned based on job role. This role could be that of a developer, which might grant permissions to affect code in the pipeline of an app. Entitlements represent the features, functions, and resources a user has permissions to access. For example, a customer might be able to place orders or view pets in a pet store application, or a store owner might be entitled to review orders made from their store.

The combination of roles assigned to a user and entitlements granted to these roles determines what a human user can do within your application. Traditionally, application access has all been handled in code by hard coding roles that users can be assigned and mapping those roles directly to a set of actions on resources. However, as the need to apply more granular access control grows (as with least privilege), so do the number of required hard-coded roles that are assigned to users to obtain this level of granularity. This problem is frequently called role explosion, where role definitions grow exponentially which requires additional overhead from your teams to manage and audit roles effectively. For example, the code to authorize request to get details of an order has multiple if/else statements, as shown in the following sample.


boolean userAuthorizedForOrder (Order order, User user){
    if (user.storeId == user.storeID) {
        if (user.roles.contains("store-owner-roles") {            // store owners can only access orders for their own stores  
            return true; 
        } else if (user.roles.contains("store-employee")) {
            if (isStoreOpen(current_time)) {                      // Only allow access for the order to store-employees when
                return true                                       // store is open 
            }
        }
    } else {
        if (user.roles("customer-service-associate") &&           // Only allow customer service associates to orders for cases 
                user.assignedShift(current_time)) &&              // they are assinged and only during their assigned shift
                user.currentCase.order.orderId == order.orderId
         return true;
    }
    return false; 
}

This problem introduces several challenges. First, figuring out why a permission was granted or denied requires a closer look at the code. Second, adding a permission requires code changes. Third, audits can be difficult because you either have to run a battery of tests or explore code across multiple files to demonstrate access controls to auditors. Though there might be additional considerations, these three challenges have led many app owners to begin looking at PBAC methods to address the granularity problem. You can read more about the foundations of PBAC models in Policy-based access control in application development with Amazon Verified Permissions. By shifting to a PBAC model, you can reduce role growth to meet your fine-grained permissions needs. You can also externalize authorization logic from code, develop granular permissions based on roles and attributes, and reduce the time that you spend refactoring code for changes to authorization decisions or reading through the code to audit authorization logic.

In this blog, we demonstrate implementing permissions in a PBAC model through a demo application. The demo application uses Cognito groups to manage role assignment, Verified Permissions to implement entitlements for the roles. The approach restricts the resources that a role can access using attribute-based conditions. This approach works well in usecases when you already have a system in place to manage role assignment and you can define resources that a user may access by matching attributes of the user with attributes of the resource.

Demo app

Let’s look at a sample pet store app. The app is used by 2 types of users – end users and store owners. The app enables end users to search and order pets. The app allows store owners to list orders for the store. This sample app is available for download and local testing on the aws-samples/avp-petstore-sample Github repository. The app is a web app built by using AWS Amplify, Amazon API-Gateway, Amazon Cognito, and Amazon Verified Permissions. The following diagram is a high-level illustration of the app’s architecture.

Architectural Diagram

Steps

  1. The user logs in to the application, and is re-directed to Amazon Cognito to sign-in and obtain a JWT token.
  2. When user take an action (eg. ListOrders) in the application, the application calls Amazon API-Gateway to process the request.
  3. Amazon API-Gateway forwards the request to a lambda function, that call Amazon Verified Permissions to authorize the action. If the authorization results in deny, the lambda returns Unauthorized back to the application.
  4. If the authorization succeed, the application continues to execute the action.

RBAC policies in action

In this section, we focus on building RBAC permissions for the sample pet store app. We will guide you through building RBAC by using Verified Permissions and by focusing on a role for store owners, who are allowed to view all orders for a store. We use Verified Permissions to manage the permissions granted to this role and Amazon Cognito to manage role assignments.

We model the store owner role in Amazon Cognito as a user group called Store-Owner-Role. When a user is assigned the store owner role, the user is added to the “Store-Owner-Role” user group. You can create the users and users groups required to follow along with the sample application by visiting managing users and groups in Amazon Cognito.

After users are assigned to the store owner role, you can enforce that they can list all orders in the store by using the following RBAC policy. The policy provides access to any user in the Store-Owner-Role to perform the ListOrders and GetStoreInventory actions on any resource.

permit (
         principal in MyApplication::Group::"Store-Owner-Role",
         action in [
              MyApplication::Action::"GetStoreInventory",
              MyApplication::Action::"ListOrders"
         ],
         resource
);

Based on the policy we reviewed – the store owner will receive a Success! when they attempt to list existing orders.

Eve is permitted to list orders

This example further demonstrates the division of responsibility between the identity provider (Amazon Cognito) and Verified Permissions. The identity provider (IdP) is responsible for managing roles and memberships in roles. Verified Permissions is responsible for managing policies that describe what those roles are permitted to do. As demonstrated above, you can use this process to add roles without needing to change code.

Using PBAC to help reduce role explosion

Up until the point of role explosion, RBAC has worked well as the sole authorization model. Unfortunately, we have heard from customers that this model does not scale well because of the challenge of role explosion. Role explosion happens when you have hundreds or thousands of roles, and managing and auditing those roles becomes challenging. In extreme cases, you might have more roles than the number of users in your organization. This happens primarily because organizations keep creating more roles, with each role granting access to a smaller set of resources in an effort to follow the principle of least privilege.

Let’s understand the problem of role explosion through our sample pet store app. The pet store app is now being sold as a SaaS product to pet stores in other locations. As a result, the app needs additional access controls to ensure that each store owner can view only the orders from their own store. The most intuitive way to implement these access controls was to create an additional role for each location, which would restrict the scope of access for a store owner to their respective store’s orders. For example, a role named petstore-austin would allow access only to resources in the Austin, Texas store. RBAC models allow developers to predefine sets of permissions that can be used in an application, and ABAC models allow developers to adapt those permissions to the context of the request (such as the client, the resource, and the method used). The adoption of both RBAC and ABAC models leads to an explosion of either roles or attribute-based rules as the number of store locations increases.

To solve this problem, you can combine RBAC and ABAC policies into a PBAC model. RBAC policies determines the actions the user can take. Augmenting these policies with ABAC policies allows you to control the resouces they can take those actions on. For example, you can scope down the resources a user can access based on identity attributes, such as department or business unit, region, and management level. This approach mitigates role explosion because you need to have only a small number of predefined roles, and access is controlled based on attributes. You can use Verified Permissions to combine RBAC and ABAC models in the form of Cedar policies to build this PBAC solution.

We can demonstrate this solution in the sample pet store app by modifying the policy we created earlier and adding ABAC conditions. The conditions specify that users can only ListOrders of the store they own. The store a store owner owns is represented in Amazon Cognito by employmentStoreCode. This policy now expands on the granularity of access of the original RBAC policy without leading to numerous RBAC policies.

permit (
         principal in MyApplication::Group::"Store-Owner-Role",
         action in [
              MyApplication::Action::"GetStoreInventory",
              MyApplication::Action::"ListOrders"
          ],
          resource
) when { 
          principal.employmentStoreCode == resource.storeId 
};

We demonstrate that our policy restricts access for store owners to the store they own, by creating a user – eve – who is assigned the Store-Owner-Role and owns petstore-london. When Eve lists orders for the petstore-london store, she gets a success response, indicating she has permissions to list orders.
Eve is permitted to list orders for petstore-london

Next, when even tries to list orders for the petstore-seattle store, she gets a Not Authorized response. She is denied access as she does not own petstore-seattle.

Eve is not permitted to list orders for petstore-seattle

Step-by-step walkthrough of trying the Demo App

If you want to go through the demo of our sample pet store app, we recommend forking it from aws-samples/avp-petstore-sample Github repo and going through this process in README.md to ensure hands-on familiarity.

We will first walk through setting up permissions using only RBAC for the sample pet store application. Next, we will see how you can use PBAC to implement least priveledge as the application scales.

Implement RBAC based Permissions

We describe setting up policies to implement entitlements for the store owner role in Verified Permissions.

    1. Navigate to the AWS Management Console, search for Verified Permissions, and select the service to go to the service page.
    2. Create new policy store to create a container for your policies. You can create an Empty Policy Store for the purpose of the walk-through.
    3. Navigate to Policies in the navigation pane and choose Create static policy.
    4. Select Next and paste in the following Cedar policy and select Save.
permit (
        principal in MyApplication::Group::"Store-Owner-Role",
        action in [
               MyApplication::Action::"GetStoreInventory",
               MyApplication::Action::"ListOrders"
         ],
         resource
);
  1. You need to get users and assign the Store-Owner-Role to them. In this case, you will use Amazon Cognito as the IdP and the role can be assigned there. You can create users and groups in Cognito by following the below steps.
    1. Navigate to Amazon Cognito from the AWS Management Console, and select the user group created for the pet store app.
    2. Creating a user by clicking create user and create a user with user name eve
    3. Navigate to the Groups section and create a group called Store-Owner-Role .
    4. Add eve to the Store-Owner-Role group by clicking Add user to Group, selecting eve and clicking the Add.
  2. Now that you have assigned the Store-Owner-Role to the user, and Verified Permissions has a permit policy granting entitlements based on role membership, you can log in to the application as the user – eve – to test functionality. When choosing List All Orders, you can see the approval result in the app’s output.

Implement PBAC based Permissions

As the company grows, you want to be able to limit GetOrders access to a specific store location so that you can follow least privilege. You can update your policy to PBAC by adding an ABAC condition to the existing permit policy. You can add a condition in the policy that restricts listing orders to only those stores the user owns.

Below is the walk-though of updating the application

    1. Navigate to the Verified Permissions console and update the policy to the below.
permit (
         principal in MyApplication::Group::"Store-Owner-Role",
         action in [
              MyApplication::Action::"GetStoreInventory",
              MyApplication::Action::"ListOrders"
          ],
          resource
) when { 
          principal.employmentStoreCode == resource.storeId 
};
  1. Navigate to the Amazon Cognito console, select the user eve and click “Edit” in the user attributes section to update the “custom:employmentStoreCode”. Set the attribute value to “petstore-london” as eve owns the petstore-london location
  2. You can demonstrate that eve can only list orders of “petstore-london” by following the below steps
    1. We want to make sure that latest changes to the user attributed are passed to the application in the identity token. We will refresh the identity token, by logging out of the app and logging in again as Eve. Navigate back to the application and logout as eve.
    2. In the application, we set the Pet Store Identifier as petstore-london and click the List All Orders. The result is success!, as Eve is authorized to list orders of the store she owns.
    3. Next, we change the Pet Store Identifier to petstore-seattle and and click the List All Orders. The result is Not Authorized, as Eve is authorized to list orders of stores she does not owns.

Clean Up section

You can cleanup the resources that were created in this blog by following these steps.

Conclusion

In this post, we reviewed what roles and entitlements are as well as how they are used to manage user authorization in your app. We’ve also covered RBAC and ABAC policy examples with respect to the demo application, avp-petstore-sample, that is available to you via AWS Samples for hands-on testing. The walk-through also covered our example architecture using Amazon Cognito as the IdP and Verified Permissions as the centralized policy store that assessed authorization results based on the policies set for the app. By leveraging Verified Permissions, we could use PBAC model to define fine-grained access while preventing role explosion. For more information about Verified Permissions, see the Amazon Verified Permissions product details page and Resources page.

Abhishek Panday

Abhishek is a product manager in the Amazon Verified Permissions team. He has been working with the AWS for more than two years, and has been at Amazon for more than five years. Abhishek enjoys working with customers to understand the customer’s challenges and building products to solve those challenges. Abhishek currently lives in Seattle and enjoys playing soccer, hiking, and cooking Indian cuisines.

Jeremy Ware

Jeremy is a Security Specialist Solutions Architect focused on Identity and Access Management. Jeremy and his team enable AWS customers to implement sophisticated, scalable, and secure IAM architecture and Authentication workflows to solve business challenges. With a background in Security Engineering, Jeremy has spent many years working to raise the Security Maturity gap at numerous global enterprises. Outside of work, Jeremy loves to explore the mountainous outdoors participate in sports such as Snowboarding, Wakeboarding, and Dirt bike riding.

Best Practices for Writing Step Functions Terraform Projects

Post Syndicated from Patrick Guha original https://aws.amazon.com/blogs/devops/best-practices-for-writing-step-functions-terraform-projects/

Terraform by HashiCorp is one of the most popular infrastructure-as-code (IaC) platforms. AWS Step Functions is a visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. In this blog, we showcase best practices for users leveraging Terraform to deploy workflows, also known as Step Functions state machines. We will create a state machine using Workflow Studio for AWS Step Functions, deploy the state machine with Terraform, and introduce best operating practices on topics such as project structure, modules, parameter substitution, and remote state.

We recommend that you have a working understanding of both Terraform and Step Functions before going through this blog. If you are brand new to Step Functions and/or Terraform, please visit the Introduction to Terraform on AWS Workshop and the Terraform option in the Managing State Machines with Infrastructure as Code section of The AWS Step Functions Workshop to learn more.

Step Functions and Terraform Project Structure

One of the most important parts of any software project is its structure. It must be clear and well-organized for yourself or any member of your team to pick up and start coding efficiently. A Step Functions project using Terraform can potentially have many moving parts and components, so it is especially important to modularize and label wherever possible. Let’s take a look at a project structure that will allow for modularization, re-usability, and extensibility:

mkdir sfn-tf-example
cd sfn-tf-example
mkdir -p -- statemachine modules functions/first-function/src
touch main.tf outputs.tf variables.tf .gitignore functions/first-function/src/lambda.py
tree

Before moving forward, let’s analyze the directory, subdirectories, and files created above:

  • /statemachine will hold our Amazon States Language (ASL) JSON code describing the Step Functions state machine definition. This is where the orchestration logic will reside, so it is prudent to keep it separated from the infrastructure code. If you are deploying multiple state machines in your project, each definition will have its own JSON file. If you prefer, you can specify separate folders for each state machine to further modularize and isolate the logic.
  • /functions subdirectory includes the actual code for AWS Lambda functions used in our state machine. Keeping this code here will be much easier to read than writing it inline in our main.tf file.
  • The last subdirectory we have is /modules. Terraform modules are higher level abstracts explaining new concepts in your architecture. However, do not fall into the trap of making a custom module for everything. Doing so will make your code harder to maintain, and AWS provider resources will often suffice. There are also very popular modules that you can use from the Terraform Registry, such as Terraform AWS modules. Whenever possible, one should re-use modules to avoid code duplication in your project.
  • The remaining files in the root of the project are common to all Terraform projects. There are going to be hidden files created by your Terraform project after running terraform init, so we will include a .gitignore. What you include in .gitignore is largely dependent on your codebase and what your tools silently create in the background. In a later section, we will explicitly call out *.tfstate files in our .gitignore, and go over best practices for managing Terraform state securely and remotely.

Initial Code and Project Setup

We are going to create a simple Step Functions state machine that will only execute a single Lambda function. However, we will need to create the Lambda function that the state machine will reference. We first need to create our Lambda function code and save it in the following the directory structure and file mentioned above: functions/first-function/src/lambda.py.

import boto3

def lambda_handler(event, context):
# Minimal function for demo purposes
	return True

In Terraform, the main configuration file is named main.tf. This is the file that the Terraform CLI will look for in the local directory. Although you can break down your template into multiple .tf files, main.tf must be one of them. In this file, we will define the required providers and their minimum version, along with the resource definition of our template. In the example below, we define the minimum resources needed for a simple state machine that only executes a Lambda function. We define the two AWS Identity and Access Management (IAM) roles that our Lambda function and state machine will use, respectively. We define a data resource that zips the Lambda function code, which is then used in the Lambda function definition. Also notice that we use the aws_iam_policy_document data source throughout. Using the official IAM policy document means both your integrated development environment (IDE) and Terraform can see if your policy is malformed before running terraform apply. Finally, we define an Amazon CloudWatch Log group that will be used by the Lambda function to store its execution logs.

Terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~>4.0"
    }
  }
}

provider "aws" {}

provider "random" {}

data "aws_caller_identity" "current_account" {}

data "aws_region" "current_region" {}

resource "random_string" "random" {
  length  = 4
  special = false
}

data "aws_iam_policy_document" "lambda_assume_role_policy" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }

    actions = [
      "sts:AssumeRole",
    ]
  }
}

resource "aws_iam_role" "function_role" {
  assume_role_policy  = data.aws_iam_policy_document.lambda_assume_role_policy.json
  managed_policy_arns = ["arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"]
}

# Create the function
data "archive_file" "lambda" {
  type        = "zip"
  source_file = "functions/first-function/src/lambda.py"
  output_path = "functions/first-function/src/lambda.zip"
}

resource "aws_kms_key" "log_group_key" {}

resource "aws_kms_key_policy" "log_group_key_policy" {
  key_id = aws_kms_key.log_group_key.id
  policy = jsonencode({
    Id = "log_group_key_policy"
    Statement = [
      {
        Action = "kms:*"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current_account.account_id}:root"
        }

        Resource = "*"
        Sid      = "Enable IAM User Permissions"
      },
      {
        Effect = "Allow",
        Principal = {
          Service : "logs.${data.aws_region.current_region.name}.amazonaws.com"
        },
        Action = [
          "kms:Encrypt*",
          "kms:Decrypt*",
          "kms:ReEncrypt*",
          "kms:GenerateDataKey*",
          "kms:Describe*"
        ],
        Resource = "*"
      }
    ]
    Version = "2012-10-17"
  })
}

resource "aws_lambda_function" "test_lambda" {
  function_name    = "HelloFunction-${random_string.random.id}"
  role             = aws_iam_role.function_role.arn
  handler          = "lambda.lambda_handler"
  runtime          = "python3.9"
  filename         = "functions/first-function/src/lambda.zip"
  source_code_hash = data.archive_file.lambda.output_base64sha256
}

# Explicitly create the function’s log group to set retention and allow auto-cleanup
resource "aws_cloudwatch_log_group" "lambda_function_log" {
  retention_in_days = 1
  name              = "/aws/lambda/${aws_lambda_function.test_lambda.function_name}"
  kms_key_id        = aws_kms_key.log_group_key.arn
}

# Create an IAM role for the Step Functions state machine
data "aws_iam_policy_document" "state_machine_assume_role_policy" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["states.amazonaws.com"]
    }

    actions = [
      "sts:AssumeRole",
    ]
  }
}

resource "aws_iam_role" "StateMachineRole" {
  name               = "StepFunctions-Terraform-Role-${random_string.random.id}"
  assume_role_policy = data.aws_iam_policy_document.state_machine_assume_role_policy.json
}

data "aws_iam_policy_document" "state_machine_role_policy" {
  statement {
    effect = "Allow"

    actions = [
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:DescribeLogGroups"
    ]

    resources = ["${aws_cloudwatch_log_group.MySFNLogGroup.arn}:*"]
  }

  statement {
    effect = "Allow"
    actions = [
      "cloudwatch:PutMetricData",
      "logs:CreateLogDelivery",
      "logs:GetLogDelivery",
      "logs:UpdateLogDelivery",
      "logs:DeleteLogDelivery",
      "logs:ListLogDeliveries",
      "logs:PutResourcePolicy",
      "logs:DescribeResourcePolicies",
    ]
    resources = ["*"]
  }

  statement {
    effect = "Allow"

    actions = [
      "lambda:InvokeFunction"
    ]

    resources = ["${aws_lambda_function.test_lambda.arn}"]
  }

}

# Create an IAM policy for the Step Functions state machine
resource "aws_iam_role_policy" "StateMachinePolicy" {
  role   = aws_iam_role.StateMachineRole.id
  policy = data.aws_iam_policy_document.state_machine_role_policy.json
}

# Create a Log group for the state machine
resource "aws_cloudwatch_log_group" "MySFNLogGroup" {
  name_prefix       = "/aws/vendedlogs/states/MyStateMachine-"
  retention_in_days = 1
  kms_key_id        = aws_kms_key.log_group_key.arn
}

Workflow Studio and Terraform Integration

It is important to understand the recommended steps given the different tools we have available for creating Step Functions state machines. You should use a combination of Workflow Studio and local development with Terraform. This workflow assumes you will define all resources for your application within the same Terraform project, and that you will be leveraging Terraform for managing your AWS resources.

Workflow for creating Step Functions state machine via Terraform

Figure 1 – Workflow for creating Step Functions state machine via Terraform

  1. You will write the Terraform definition for any resources you intend to call with your state machine, such as Lambda functions, Amazon Simple Storage Service (Amazon S3) buckets, or Amazon DynamoDB tables, and deploy them using the terraform apply command. Doing this prior to using Workflow Studio will be useful in designing the first version of the state machine. You can define additional resources after importing the state machine into your local Terraform project.
  2. You can use Workflow Studio to visually design the first version of the state machine. Given that you should have created the necessary resources already, you can drag and drop all of the actions and states, link them, and see how they look. Finally, you can execute the state machine for testing purposes.
  3. Once your initial design is ready, you will export the ASL file and save it in your Terraform project. You can use the Terraform resource type aws_sfn_state_machine and reference the saved ASL file in the definition field.
  4. You will then need to parametrize the ASL file given that Terraform will dynamically name the resources, and the Amazon Resource Name (ARN) may eventually change. You do not want to hardcode an ARN in your ASL file, as this will make updating and refactoring your code more difficult.
  5. Finally, you deploy the state machine via Terraform by running terraform apply.

Simple changes should be made directly in the parametrized ASL file in your Terraform project instead of going back to Workflow Studio. Having the ASL file versioned as part of your project ensures that no manual changes break the state machine. Even if there is a breaking change, you can easily roll back to a previous version. One caveat to this is if you are making major changes to the state machine. In this case, taking advantage of Workflow Studio in the console is preferable.

However, you will most likely want to continue seeing a visual representation of the state machine while developing locally. The good news is that you have another option directly integrated into Visual Studio Code (VS Code) that visually renders the state machine, similar to Workflow Studio. This functionality is part of the AWS Toolkit for VS Code. You can learn more about the state machine integration with the AWS Toolkit for VS Code here. Below is an example of a parametrized ASL file and its rendered visualization in VS Code.

Step Functions state machine displayed visually in VS Code

Figure 2 – Step Functions state machine displayed visually in VS Code

Parameter Substitution

In the Terraform template, when you define the Step Functions state machine, you can either include the definition in the template or in an external file. Leaving the definition in the template can cause the template to be less readable and difficult to manage. As a best practice, it is recommended to keep the definition of the state machine in a separate file. This raises the question of how to pass parameters to the state machine. In order to do this, you can use the templatefile function of Terraform. The templatefile function reads a file and renders its content with the supplied set of variables. As shown in the code snippet below, we will use the templatefile function to render the state machine definition file with the Lambda function ARN and any other parameters to pass to the state machine.

resource "aws_sfn_state_machine" "sfn_state_machine" {
  name     = "MyStateMachine-${random_string.random.id}"
  role_arn = aws_iam_role.StateMachineRole.arn
  definition = templatefile("${path.module}/statemachine/statemachine.asl.json", {
    ProcessingLambda = aws_lambda_function.test_lambda.arn
    }
  )
  logging_configuration {
    log_destination        = "${aws_cloudwatch_log_group.MySFNLogGroup.arn}:*"
    include_execution_data = true
    level                  = "ALL"
  }
}

Inside the state machine definition, you have to specify a string template using the interpolation sequences delimited with ${}. Similar to the code snippet below, you will define the state machine with the variable name that will be passed by the templatefile function.

"Lambda Invoke": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": {
        "Payload.$": "$",
        "FunctionName": "${ProcessingLambda}"
    },
    "End": true
}

After the templatefile function runs, it will replace the variable ${ProcessingLambda} with the actual Lambda function ARN generated when the template is deployed.

Remote Terraform State Management

Every time you run Terraform, it stores information about the managed infrastructure and configuration in a state file. By default, Terraform creates the state file called terraform.tfstate in the local directory. As mentioned earlier, you will want to include any .tfstate files in your .gitignore file. This will ensure you do not commit it to source control, which could potentially expose secrets and would most likely lead to errors in state. If you accidentally delete this local file, Terraform cannot track the infrastructure that was previously created. In that case, if you run terraform apply on an updated configuration, Terraform will create it from scratch, which will lead to conflicts. It is recommended that you store the Terraform state remotely in secure storage to enable versioning, encryption, and sharing. Terraform supports storing state in S3 buckets by using the backend configuration block. In order to configure Terraform to write the state file to an S3 bucket, you need to specify the bucket name, the region, and the key name.

It is also recommended that you enable versioning in the S3 bucket and MFA delete to protect the state file from accidental deletion. In addition, you need to make sure that Terraform has the right IAM permissions on the target S3 bucket. In case you have multiple developers working with the same infrastructure simultaneously, Terraform can also use state locking to prevent concurrent runs against the same state. You can use a DynamoDB table to control locking. The DynamoDB table you use must have a partition key named LockID with type String, and Terraform must have the right IAM permissions on the table.

terraform {
    backend "s3" {
        bucket         = "mybucket"
        key            = "path/to/state/file"
        region         = "us-east-1"
        attach_deny_insecure_transport_policy = true # only allow HTTPS connections 
        encrypt        = true
        dynamodb_table = "Table-Name"
    }
}

With this remote state configuration, you will maintain the state securely stored in S3. With every change you apply to your infrastructure, Terraform will automatically pull the latest state from the S3 bucket, lock it using the DynamoDB table, apply the changes, push the latest state again to the S3 bucket and then release the lock.

Cleanup

If you were following along and deployed resources such as the Lambda function, the Step Functions state machine, the S3 bucket for backend state storage, or any of the other associated resources by running terraform apply, to avoid incurring charges on your AWS account, please run terraform destroy to tear these resources down and clean up your environment.

Conclusion

In conclusion, this blog provides a comprehensive guide to leveraging Terraform for deploying AWS Step Functions state machines. We discussed the importance of a well-structured project, initial code setup, integration between Workflow Studio and Terraform, parameter substitution, and remote state management. By following these best practices, developers can create and manage their state machines more effectively while maintaining clean, modular, and reusable code. Embracing infrastructure-as-code and using the right tools, such as Workflow Studio, VS Code, and Terraform, will enable you to build scalable and maintainable distributed applications, automate processes, orchestrate microservices, and create data and ML pipelines with AWS Step Functions.

If you would like to learn more about using Step Functions with Terraform, please check out the following patterns and workflows on Serverless Land and view the Step Functions Developer Guide.

About the authors

Ahmad Aboushady

Ahmad Aboushady is a Senior Technical Account Manager at AWS based in UAE. He works with Enterprise Support customers across the region to help them optimize their workloads on AWS and make the best out of their cloud journey.

Patrick Guha

Patrick Guha is a Solutions Architect at AWS based in Austin, TX. He supports non-profit, research customers focused on genomics, healthcare, and high-performance compute workloads in the cloud. Patrick has a BS in Electrical and Computer Engineering, and is currently working towards an MS in Engineering Management.

Aryam Gutierrez

Aryam Gutierrez is a Senior Partner Solutions Architect at AWS based in Madrid. He supports strategic partners to either build highly-scalable solutions or navigate through the various partner programs to differentiate their business, with the ultimate goal of growing business with AWS.

How to secure your email account and improve email sender reputation

Post Syndicated from bajavani original https://aws.amazon.com/blogs/messaging-and-targeting/how-to-secure-your-email-account-and-improve-email-sender-reputation/

How to secure your email account and improve email sender reputation

Introduction

Amazon Simple Email Service (Amazon SES) is a cost-effective, flexible, and scalable email service that enables customers to send email from within any application. You can send email using the SES SMTP interface or via HTTP requests to the SES API. All requests to send email must be authenticated using either SMTP or IAM credentials and it is when these credentials end up in the hands of a malicious actor, that customers need to act fast to secure their SES account.

Compromised credentials with permission to send email via SES allows the malicious actor to use SES to send spam and or phishing emails, which can lead to high bounce and or complaint rates for the SES account. A consequence of high bounce and or complaint rates can result in sending for the SES account being paused.

How to identify if your SES email sending account is compromised

Start by checking the reputation metrics for the SES account from the Reputation metrics menu in the SES Console.
A sudden increase or spike in the bounce or complaint metrics should be further investigated. You can start by checking the Feedback forwarding destination, where SES will send bounce and or complaints to. Feedback on bounces and complaints will contain the From, To email addresses as well as the subject. Use these attributes to determine if unintended emails are being sent, for example if the bounce and / or complaint recipients are not known to you that is an indication of compromise. To find out what your feedback forwarding destination is, please see Feedback forwarding mechanism

If SNS notifications are already enabled, check the subscribed endpoint for the bounce and / or complaint notifications to review the notifications for unintended email sending. SNS notifications would provide additional information, such as IAM identity being used to send the emails as well as the source IP address the emails are being sent from.

If the review of the bounces or complaints leads to the conclusion that the email sending is unintended, immediately follow the steps below to secure your account.

Steps to secure your account:

You can follow the below steps in order to secure your SES account:

  1. It is recommended that to avoid any more unintended emails from being sent, to immediately pause the SES account until the root cause has been identified and steps taken to secure the SES account. You can use the below command to pause the email sending for your account:

    aws ses update-account-sending-enabled --no-enabled --region sending_region

    Note: Change the sending_region with the region you are using to send email.

  2. Rotate the credentials for the IAM identity being used to send the unintended emails. If the IAM identity was originally created from the SES Console as SMTP credentials, it is recommended to delete the IAM identity and create new SMTP credentials from the SES Console.
  3. Limit the scope of SMTP/IAM identity to send email only from the specific IP address your email sending originates from.

See controlling access to Amazon SES.

Below is an example of an IAM policy which allows emails from IP Address 1.2.3.4 and 5.6.7.8 only.

————————-

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "RestrictIP",
"Effect": "Allow",
"Action": "ses:SendRawEmail",
"Resource": "*",
"Condition": {
"IpAddress": {
"aws:SourceIp": [
"1.2.3.4/32",
"5.6.7.8/32"
]
}
}
}
]
}

———————————

When you send an email from IP address apart from the IP mentioned in the policy, then the following error will be observed and the email sending request will fail:

———-

554 Access denied: User arn:aws:iam::123456789012:user/iam-user-name’ is not authorized to perform ses:SendRawEmail’ on resource `arn:aws:ses:eu-west-1:123456789012:identity/example.com’

———-

4.  Once these steps have been taken, the sending for the account can be enabled again, using the command below:

aws ses update-account-sending-enabled --enabled --region sending_region

Conclusion

You can secure your SES email sending account by taking the necessary steps mentioned and also prevent this from happening in the future.

Access accounts with AWS Management Console Private Access

Post Syndicated from Suresh Samuel original https://aws.amazon.com/blogs/security/access-accounts-with-aws-management-console-private-access/

AWS Management Console Private Access is an advanced security feature to help you control access to the AWS Management Console. In this post, I will show you how this feature works, share current limitations, and provide AWS CloudFormation templates that you can use to automate the deployment. AWS Management Console Private Access is useful when you want to restrict users from signing in to unknown AWS accounts from within your network. With this feature, you can limit access to the console only to a specified set of known accounts when the traffic originates from within your network.

For enterprise customers, users typically access the console from devices that are connected to a corporate network, either directly or through a virtual private network (VPN). With network connectivity to the console, users can authenticate into an account with valid credentials, including third-party accounts and personal accounts. For enterprise customers with stringent network access controls, this feature provides a way to control which accounts can be accessed from on-premises networks.

How AWS Management Console Private Access works

AWS PrivateLink now supports the AWS Management Console, which means that you can create Virtual Private Cloud (VPC) endpoints in your VPC for the console. You can then use DNS forwarding to conditionally route users’ browser traffic to the VPC endpoints from on-premises and define endpoint policies that allow or deny access to specific accounts, organizations, or organizational units (OUs). To privately reach the endpoints, you must have a hybrid network connection between on-premises and AWS over AWS Direct Connect or AWS Site-to-Site VPN.

When you conditionally forward DNS queries for the zone aws.amazon.com from on-premises to an Amazon Route 53 Resolver inbound endpoint within the VPC, Route 53 will prefer the private hosted zone for aws.amazon.com to resolve the queries. The private hosted zone makes it simple to centrally manage records for the console in the AWS US East (N. Virginia) Region (us-east-1) as well as other Regions.

Configure a VPC endpoint for the console

To configure VPC endpoints for the console, you must complete the following steps:

  1. Create interface VPC endpoints in a VPC in the US East (N. Virginia) Region for the console and sign-in services. Repeat for other desired Regions. You must create VPC endpoints in the US East (N. Virginia) Region because the default DNS name for the console resolves to this Region. Specify the accounts, organizations, or OUs that should be allowed or denied in the endpoint policies. For instructions on how to create interface VPC endpoints, see Access an AWS service using an interface VPC endpoint.
  2. Create a Route 53 Resolver inbound endpoint in a VPC and note the IP addresses for the elastic network interfaces of the endpoint. Forward DNS queries for the console from on-premises to these IP addresses. For instructions on how to configure Route 53 Resolver, see Getting started with Route 53 Resolver.
  3. Create a Route 53 private hosted zone with records for the console and sign-in subdomains. For the full list of records needed, see DNS configuration for AWS Management Console and AWS Sign-In. Then associate the private hosted zone with the same VPC that has the Resolver inbound endpoint. For instructions on how to create a private hosted zone, see Creating a private hosted zone.
  4. Conditionally forward DNS queries for aws.amazon.com to the IP addresses of the Resolver inbound endpoint.

How to access Regions other than US East (N. Virginia)

To access the console for another supported Region using AWS Management Console Private Access, complete the following steps:

  1. Create the console and sign-in VPC endpoints in a VPC in that Region.
  2. Create resource records for <region>.console.aws.amazon.com and <region>.signin.aws.amazon.com in the private hosted zone, with values that target the respective VPC endpoints in that Region. Replace <region> with the region code (for example, us-west-2).

For increased resiliency, you can also configure a second Resolver inbound endpoint in a different Region other than the US East (N. Virginia) Region (us-east-1). On-premises DNS resolvers can use both endpoints for resilient DNS resolution to the private hosted zone.

Automate deployment of AWS Management Console Private Access

I created an AWS CloudFormation template that you can use to deploy the required resources in the US East (N. Virginia) Region (us-east-1). To get the template, go to console-endpoint-use1.yaml. The CloudFormation stack deploys the required VPC endpoints, Route 53 Resolver inbound endpoint, and private hosted zone with required records.

Note: The default endpoint policy allows all accounts. For sample policies with conditions to restrict access, see Allow AWS Management Console use for expected accounts and organizations only (trusted identities).

I also created a CloudFormation template that you can use to deploy the required resources in other Regions where private access to the console is required. To get the template, go to console-endpoint-non-use1.yaml.

Cost considerations

When you configure AWS Management Console Private Access, you will incur charges. You can use the following information to estimate these charges:

  • PrivateLink pricing is based on the number of hours that the VPC endpoints remain provisioned. In the US East (N. Virginia) Region, this is $0.01 per VPC endpoint per Availability Zone ($/hour).
  • Data processing charges per gigabyte (GB) of data processed through the VPC endpoints is $0.01 in the US East (N. Virginia) Region.
  • The Route 53 Resolver inbound endpoint is charged per IP (elastic network interface) per hour. In the US East (N. Virginia) Region, this is $0.125 per IP address per hour. See Route 53 pricing.
  • DNS queries to the inbound endpoint are charged at $0.40 per million queries.
  • The Route 53 hosted zone is charged at $0.50 per hosted zone per month. To allow testing, AWS won’t charge you for a hosted zone that you delete within 12 hours of creation.

Based on this pricing model, the cost of configuring AWS Management Console Private Access in the US East (N. Virginia) Region in two Availability Zones is approximately $212.20 per month for the deployed resources. DNS queries and data processing charges are additional based on actual usage. You can also apply this pricing model to help estimate the cost to configure in additional supported Regions. Route 53 is a global service, so you only have to create the private hosted zone once along with the resources in the US East (N. Virginia) Region.

Limitations and considerations

Before you get started with AWS Management Console Private Access, make sure to review the following limitations and considerations:

  • For a list of supported Regions and services, see Supported AWS Regions, service consoles, and features.
  • You can use this feature to restrict access to specific accounts from customer networks by forwarding DNS queries to the VPC endpoints. This feature doesn’t prevent users from accessing the console directly from the internet by using the console’s public endpoints from devices that aren’t on the corporate network.
  • The following subdomains aren’t currently supported by this feature and won’t be accessible through private access:
    • docs.aws.amazon.com
    • health.aws.amazon.com
    • status.aws.amazon.com
  • After a user completes authentication and accesses the console with private access, when they navigate to an individual service console, for example Amazon Elastic Compute Cloud (Amazon EC2), they must have network connectivity to the service’s API endpoint, such as ec2.amazonaws.com. This is needed for the console to make API calls such as ec2:DescribeInstances to display resource details in the service console.

Conclusion

In this blog post, I outlined how you can configure the console through AWS Management Console Private Access to restrict access to AWS accounts from on-premises, how the feature works, and how to configure it for multiple Regions. I also provided CloudFormation templates that you can use to automate the configuration of this feature. Finally, I shared information on costs and some limitations that you should consider before you configure private access to the console.

For more information about how to set up and test AWS Management Console Private Access and reference architectures, see Try AWS Management Console Private Access. For the latest CloudFormation templates, see the aws-management-console-private-access-automation GitHub repository.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread at re:Post.

Want more AWS Security news? Follow us on Twitter.

Suresh Samuel

Suresh Samuel

Suresh is a Senior Technical Account Manager at AWS. He helps customers in the financial services industry with their operations on AWS. When not working, he can be found photographing birds in Texas or hanging out with his kids.

Understanding DDoS Simulation Testing in AWS

Post Syndicated from Harith Gaddamanugu original https://aws.amazon.com/blogs/security/understanding-ddos-simulation-testing-at-aws/

Distributed denial of service (DDoS) events occur when a threat actor sends traffic floods from multiple sources to disrupt the availability of a targeted application. DDoS simulation testing uses a controlled DDoS event to allow the owner of an application to assess the application’s resilience and practice event response. DDoS simulation testing is permitted on Amazon Web Services (AWS), subject to Testing policy terms and conditions. In this blog post, we help you understand when it’s appropriate to perform a DDoS simulation test on an application running on AWS, and what options you have for running the test.

DDoS protection at AWS

Security is the top priority at AWS. AWS services include basic DDoS protection as a standard feature to help protect customers from the most common and frequently occurring infrastructure (layer 3 and 4) DDoS events, such as SYN/UDP floods, reflection attacks, and others. While this protection is designed to protect the availability of AWS infrastructure, your application might require more nuanced protections that consider your traffic patterns and integrate with your internal reporting and incident response processes. If you need more nuanced protection, then you should consider subscribing to AWS Shield Advanced in addition to the native resiliency offered by the AWS services you use.

AWS Shield Advanced is a managed service that helps you protect your application against external threats, like DDoS events, volumetric bots, and vulnerability exploitation attempts. When you subscribe to Shield Advanced and add protection to your resources, Shield Advanced provides expanded DDoS event protection for those resources. With advanced protections enabled on your resources, you get tailored detection based on the traffic patterns of your application, assistance with protecting against Layer 7 DDoS events, access to 24×7 specialized support from the Shield Response Team (SRT), access to centralized management of security policies through AWS Firewall Manager, and cost protections to help safeguard against scaling charges resulting from DDoS-related usage spikes. You can also configure AWS WAF (a web application firewall) to integrate with Shield Advanced to create custom layer 7 firewall rules and enable automatic application layer DDoS mitigation.

Acceptable DDoS simulation use cases on AWS

AWS is constantly learning and innovating by delivering new DDoS protection capabilities, which are explained in the DDoS Best Practices whitepaper. This whitepaper provides an overview of DDoS events and the choices that you can make when building on AWS to help you architect your application to absorb or mitigate volumetric events. If your application is architected according to our best practices, then a DDoS simulation test might not be necessary, because these architectures have been through rigorous internal AWS testing and verified as best practices for customers to use.

Using DDoS simulations to explore the limits of AWS infrastructure isn’t a good use case for these tests. Similarly, validating if AWS is effectively protecting its side of the shared responsibility model isn’t a good test motive. Further, using AWS resources as a source to simulate a DDoS attack on other AWS resources isn’t encouraged. Load tests are performed to gain reliable information on application performance under stress and these are different from DDoS tests. For more information, see the Amazon Elastic Compute Cloud (Amazon EC2) testing policy and penetration testing. Application owners, who have a security compliance requirement from a regulator or who want to test the effectiveness of their DDoS mitigation strategies, typically run DDoS simulation tests.

DDoS simulation tests at AWS

AWS offers two options for running DDoS simulation tests. They are:

  • A simulated DDoS attack in production traffic with an authorized pre-approved AWS Partner.
  • A synthetic simulated DDoS attack with the SRT, also referred to as a firedrill.

The motivation for DDoS testing varies from application to application and these engagements don’t offer the same value to all customers. Establishing clear motives for the test can help you choose the right option. If you want to test your incident response strategy, we recommend scheduling a firedrill with our SRT. If you want to test the Shield Advanced features or test application resiliency, we recommend that you work with an AWS approved partner.

DDoS simulation testing with an AWS Partner

AWS DDoS test partners are authorized to conduct DDoS simulation tests on customers’ behalf without prior approval from AWS. Customers can currently contact the following partners to set up these paid engagements:

Before contacting the partners, customers must agree to the terms and conditions for DDoS simulation tests. The application must be well-architected prior to DDoS simulation testing as described in AWS DDoS Best Practices whitepaper. AWS DDoS test partners that want to perform DDoS simulation tests that don’t comply with the technical restrictions set forth in our public DDoS testing policy, or other DDoS test vendors that aren’t approved, can request approval to perform DDoS simulation tests by submitting the DDoS Simulation Testing form at least 14 days before the proposed test date. For questions, please send an email to [email protected].

After choosing a test partner, customers go through various phases of testing. Typically, the first phase involves a discovery discussion, where the customer defines clear goals, assembles technical details, and defines the test schedule with the partner. In the next phase, partners run multiple simulations based on agreed attack vectors, duration, diversity of the attack vectors, and other factors. These tests are usually carried out by slowly ramping up traffic levels from low levels to desired high levels with an ability for an emergency stop. The final stage involves reporting, discussing observed gaps, identifying actionable tasks, and driving those tasks to completion.

These engagements are typically long-term, paid contracts that are planned over months and carried out over weeks, with results analyzed over time. These tests and reports are beneficial to customers who need to evaluate detection and mitigation capabilities on a large scale. If you’re an application owner and want to evaluate the DDoS resiliency of your application, practice event response with real traffic, or have a DDoS compliance or regulation requirement, we recommend this type of engagement. These tests aren’t recommended if you want to learn the volumetric breaking points of the AWS network or understand when AWS starts to throttle requests. AWS services are designed to scale, and when certain dynamic volume thresholds are exceeded, AWS detection systems will be invoked to block traffic. Lastly, it’s critical to distinguish between these tests and stress tests, in which meaningful packets are sent to the application to assess its behavior.

DDoS firedrill testing with the Shield Response Team

Shield Advanced service offers additional assistance through the SRT, this team can also help with testing incident response workflows. Customers can contact the SRT and request firedrill testing. Firedrill testing is a type of synthetic test that doesn’t generate real volumetric traffic but does post a shield event to the requesting customer’s account.

These tests are available for customers who are already on-boarded to Shield Advanced and want to test their Amazon CloudWatch alarms by invoking a DDoSDetected metric, or test their proactive engagement setup or their custom incident response strategy. Because this event isn’t based on real traffic, the customer won’t see traffic generated on their account or see logs that drive helpful reports.

These tests are intended to generate associated Shield Advanced metrics and post a DDoS event for a customer resource. For example, SRT can post a 14 Gbps UDP mock attack on a protected resource for about 15 minutes and customers can test their response capability during such an event.

Note: Not all attack vectors and AWS resource types are supported for a firedrill. Shield Advanced onboarded customers can contact AWS Support teams to request assistance with running a firedrill or understand more about them.

Conclusion

DDoS simulations and incident response testing on AWS through the SRT or an AWS Partner are useful in improving application security controls, identifying Shield Advanced misconfigurations, optimizing existing detection systems, and improving incident readiness. The goal of these engagements is to help you build a DDoS resilient architecture to protect your application’s availability. However, these engagements don’t offer the same value to all customers. Most customers can obtain similar benefits by following AWS Best Practices for DDoS Resiliency. AWS recommends architecting your application according to DDoS best practices and fine tuning AWS Shield Advanced out-of-the-box offerings to your application needs to improve security posture.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Harith Gaddamanugu

Harith Gaddamanugu

Harith works at AWS as a Sr. Edge Specialist Solutions Architect. He stays motivated by solving problems for customers across AWS Perimeter Protection and Edge services. When he is not working, he enjoys spending time outdoors with friends and family.

Managing Amazon EBS volume throughput limits in Amazon OpenSearch Service domains

Post Syndicated from Pranit Kumar original https://aws.amazon.com/blogs/big-data/managing-amazon-ebs-volume-throughput-limits-in-amazon-opensearch-service-domains/

In this blog post, we discuss the impact of Amazon Elastic Block Store (Amazon EBS) volume IOPS and throughput limits on Amazon OpenSearch Service domain and how to prevent/mitigate throughput throttling situation.

Amazon OpenSearch Service is a managed service that makes it easy for you to perform website searches, interactive log analytics, real-time application monitoring, and more. Based on the open source OpenSearch suite, Amazon OpenSearch Service allows you to search, visualize, and analyze up to petabytes of text and unstructured data.

An OpenSearch Service domain primarily contains nodes with the following set of roles.

  • Cluster manager (dedicated master): Responsible for managing the cluster and checking the health of the data nodes in the cluster.
  • Data: Responsible for serving search and indexing requests and storing the indexed data.
  • Ultrawarm: Nodes which use Amazon S3 as a backing store to provide lower-cost storage.

When creating an OpenSearch Service domain, you choose the storage for the data nodes with local Non-Volatile Memory Express (NVMe) or with Amazon EBS volumes.

If the OpenSearch Service data node storage is backed by Amazon EBS volumes, depending on your workload, EBS throughput can heavily influence performance of the OpenSearch Service domain. The EBS volume performance metric is defined by the following two key parameters.

  • IOPS defines the number of IO operations performed per second.
  • Throughput is a measure of how much data can be transferred in a given amount of time. It is usually measured in bytes per second.

Whenever IOPS or throughput of the data node breaches the maximum allowed limit of the EBS volume or the EC2 instance of the data node, then the OpenSearch Service domain experiences IOPS or throughput throttling. This can result in high search and indexing latency and in the worst scenario node crash as well.

Maximum allowed IOPS and throughput for the data node

The maximum allowed value for IOPS or the throughput for the data node in an OpenSearch Service domain is the minimum of the following two values.

Throughput throttling and its impact on an Amazon OpenSearch Service domain

Throughput throttling happens when the total EBS throughput on a data node exceeds the maximum allowed throughput value of that data node in the OpenSearch Service domain.

The ThroughputThrottle metric for the domain or node can be seen in the Amazon CloudWatch console at the following location.

  • Domain: “ES/OpenSearchService > Per-Domain, Per-Client Metrics”
  • Node: “ES/OpenSearchService > ClientId, DomainName, NodeId”

The value of 1 in the ThroughputThrottle metric signifies a throttling event for the domain or node.

If a data node in the domain experiences throughput throttling for a consistent period, it can result in the following performance degradation for the data node.

  • Slower EBS volume performance.
  • High read/write latency.

This can affect the checks performed by the cluster manager or data node. It can result in:

  • FS (file system) health check failure performed by the data node.
  • Follower check failure performed by cluster manager due to high request latency.

This will result in the cluster manager marking such data nodes unhealthy, resulting in the data node being removed from the cluster. This can lead to a yellow or red cluster status.

Throughput value calculation

Total throughput for the data node is the total bytes read and written to the EBS volume per second. The following metrics provides the read and write throughput for the data node in the Amazon Opensearch Service domain.

Total throughput for the data node in the OpenSearch Service domain is calculated as the following.

Throughput = ReadThroughputMicroBursting + WriteThroughputMicroBursting

To get total throughput for the data node, follow these steps.

  1. Go to Amazon Cloudwatch metrics.
  2. Go to ES/OpenSearchService > ClientId, DomainName, NodeId.
  3. Select ReadThroughputMicroBursting and WriteThroughputMicroBursting metric.
  4. Go to Graphed metrics.
  5. Use Add math and create formulas to sum ReadThroughputMicroBursting and WriteThroughputMicroBursting values.

Handling throughput throttle

When the maximum allowed throughput limit is breached on the data node in an OpenSearch Service domain, a disk throughput throttle notification is sent to the AWS console. Throughput throttling on the data node can happen due to various reasons, such as the following.

  • A sudden increase in the index rate or search rate to the data node of the OpenSearch Service domain.
  • A blue/green event happening on the OpenSearch Service domain during peak hours.
  • The OpenSearch Service domain is under-scaled.

We suggest the following measures to prevent throughput throttling for the OpenSearch Service domain.

  • Monitor the traffic to the OpenSearch Service domain and create alarms on the search and index traffic sent to the OpenSearch Service domain.
  • Set up off-peak hours for OpenSearch Service domain so that the updates that lead to blue/green deployments are executed when there is less demand.
  • Monitor the ThroughputThrottle cluster metrics for the OpenSearch Service domain.
  • Monitor shard skewness for the OpenSearch Service domain. Shard skewness can lead to uneven load distribution of traffic to data nodes and can lead to hot nodes in the cluster, which can experience high index and search traffic that results in throttling.
  • If you are hitting EBS Volume or EC2 instance throughput limits for the data node, you will need to scale up the OpenSearch Service domain to avoid throughput throttling. Check the limits provided by EBS volumes and  Amazon EBS optimized instances used by the data node and scale up the OpenSearch cluster accordingly.

Every scenario requires specific investigation and the appropriate measures to resolve it. Still, we suggest the following guidelines as part of a broader approach to handling throughput throttle.

  • If high throughput is seen on a specific set of data nodes most of the time, shard skewness may be causing hot nodes. In such cases, resolving shard skewness will help the situation.
  • If OpenSearch Service domain is experiencing uneven traffic patterns, check for sudden bursts resulting in throttling. In such scenarios, streamlining the traffic pattern can be helpful.
  • If throughput throttling is seen on most of the nodes on the cluster with consistent traffic patterns, scaling up of the OpenSearch Service domain should be considered.

Conclusion

In this post, we covered the Amazon EBS throughput throttling in OpenSearch Service domain, its impact, and ways to monitor and handle it. We provided suggestions that can be used to handle such throttling situations.

Related links


About the Authors

Pranit Kumar is a Sr. Software Dev Engineer working on OpenSearch at Amazon Web Services. He is interested in distributed systems and solving complex problems.

Dhrubajyoti Das is an Engineering Manager working on OpenSearch at Amazon Web Services. He is deeply interested in high scalable systems and infrastructure related challenges.

Automatically detect and block low-volume network floods

Post Syndicated from Bryan Van Hook original https://aws.amazon.com/blogs/security/automatically-detect-and-block-low-volume-network-floods/

In this blog post, I show you how to deploy a solution that uses AWS Lambda to automatically manage the lifecycle of Amazon VPC Network Access Control List (ACL) rules to mitigate network floods detected using Amazon CloudWatch Logs Insights and Amazon Timestream.

Application teams should consider the impact unexpected traffic floods can have on an application’s availability. Internet-facing applications can be susceptible to traffic that some distributed denial of service (DDoS) mitigation systems can’t detect. For example, hit-and-run events are a popular approach that use short-lived floods that reoccur at random intervals. Each burst is small enough to go unnoticed by mitigation systems, but still occur often enough and are large enough to be disruptive. Automatically detecting and blocking temporary sources of invalid traffic, combined with other best practices, can strengthen the resiliency of your applications and maintain customer trust.

Use resilient architectures

AWS customers can use prescriptive guidance to improve DDoS resiliency by reviewing the AWS Best Practices for DDoS Resiliency. It describes a DDoS-resilient reference architecture as a guide to help you protect your application’s availability.

The best practices above address the needs of most AWS customers; however, in this blog we cover a few outlier examples that fall outside normal guidance. Here are a few examples that might describe your situation:

  • You need to operate functionality that isn’t yet fully supported by an AWS managed service that takes on the responsibility of DDoS mitigation.
  • Migrating to an AWS managed service such as Amazon Route 53 isn’t immediately possible and you need an interim solution that mitigates risks.
  • Network ingress must be allowed from a wide public IP space that can’t be restricted.
  • You’re using public IP addresses assigned from the Amazon pool of public IPv4 addresses (which can’t be protected by AWS Shield) rather than Elastic IP addresses.
  • The application’s technology stack has limited or no support for horizontal scaling to absorb traffic floods.
  • Your HTTP workload sits behind a Network Load Balancer and can’t be protected by AWS WAF.
  • Network floods are disruptive but not significant enough (too infrequent or too low volume) to be detected by your managed DDoS mitigation systems.

For these situations, VPC network ACLs can be used to deny invalid traffic. Normally, the limit on rules per network ACL makes them unsuitable for handling truly distributed network floods. However, they can be effective at mitigating network floods that aren’t distributed enough or large enough to be detected by DDoS mitigation systems.

Given the dynamic nature of network traffic and the limited size of network ACLs, it helps to automate the lifecycle of network ACL rules. In the following sections, I show you a solution that uses network ACL rules to automatically detect and block infrastructure layer traffic within 2–5 minutes and automatically removes the rules when they’re no longer needed.

Detecting anomalies in network traffic

You need a way to block disruptive traffic while not impacting legitimate traffic. Anomaly detection can isolate the right traffic to block. Every workload is unique, so you need a way to automatically detect anomalies in the workload’s traffic pattern. You can determine what is normal (a baseline) and then detect statistical anomalies that deviate from the baseline. This baseline can change over time, so it needs to be calculated based on a rolling window of recent activity.

Z-scores are a common way to detect anomalies in time-series data. The process for creating a Z-score is to first calculate the average and standard deviation (a measure of how much the values are spread out) across all values over a span of time. Then for each value in the time window calculate the Z-score as follows:

Z-score = (value – average) / standard deviation

A Z-score exceeding 3.0 indicates the value is an outlier that is greater than 99.7 percent of all other values.

To calculate the Z-score for detecting network anomalies, you need to establish a time series for network traffic. This solution uses VPC flow logs to capture information about the IP traffic in your VPC. Each VPC flow log record provides a packet count that’s aggregated over a time interval. Each flow log record aggregates the number of packets over an interval of 60 seconds or less. There isn’t a consistent time boundary for each log record. This means raw flow log records aren’t a predictable way to build a time series. To address this, the solution processes flow logs into packet bins for time series values. A packet bin is the number of packets sent by a unique source IP address within a specific time window. A source IP address is considered an anomaly if any of its packet bins over the past hour exceed the Z-score threshold (default is 3.0).

When overall traffic levels are low, there might be source IP addresses with a high Z-score that aren’t a risk. To mitigate against false positives, source IP addresses are only considered to be an anomaly if the packet bin exceeds a minimum threshold (default is 12,000 packets).

Let’s review the overall solution architecture.

Solution overview

This solution, shown in Figure 1, uses VPC flow logs to capture information about the traffic reaching the network interfaces in your public subnets. CloudWatch Logs Insights queries are used to summarize the most recent IP traffic into packet bins that are stored in Timestream. The time series table is queried to identify source IP addresses responsible for traffic that meets the anomaly threshold. Anomalous source IP addresses are published to an Amazon Simple Notification Service (Amazon SNS) topic. A Lambda function receives the SNS message and decides how to update the network ACL.

Figure 1: Automating the detection and mitigation of traffic floods using network ACLs

Figure 1: Automating the detection and mitigation of traffic floods using network ACLs

How it works

The numbered steps that follow correspond to the numbers in Figure 1.

  1. Capture VPC flow logs. Your VPC is configured to stream flow logs to CloudWatch Logs. To minimize cost, the flow logs are limited to particular subnets and only include log fields required by the CloudWatch query. When protecting an endpoint that spans multiple subnets (such as a Network Load Balancer using multiple availability zones), each subnet shares the same network ACL and is configured with a flow log that shares the same CloudWatch log group.
  2. Scheduled flow log analysis. Amazon EventBridge starts an AWS Step Functions state machine on a time interval (60 seconds by default). The state machine starts a Lambda function immediately, and then again after 30 seconds. The Lambda function performs steps 3–6.
  3. Summarize recent network traffic. The Lambda function runs a CloudWatch Logs Insights query. The query scans the most recent flow logs (5-minute window) to summarize packet frequency grouped by source IP. These groupings are called packet bins, where each bin represents the number of packets sent by a source IP within a given minute of time.
  4. Update time series database. A time series database in Timestream is updated with the most recent packet bins.
  5. Use statistical analysis to detect abusive source IPs. A Timestream query is used to perform several calculations. The query calculates the average bin size over the past hour, along with the standard deviation. These two values are then used to calculate the maximum Z-score for all source IPs over the past hour. This means an abusive IP will remain flagged for one hour even if it stopped sending traffic. Z-scores are sorted so that the most abusive source IPs are prioritized. If a source IP meets these two criteria, it is considered abusive.
    1. Maximum Z-score exceeds a threshold (defaults to 3.0).
    2. Packet bin exceeds a threshold (defaults to 12,000). This avoids flagging source IPs during periods of overall low traffic when there is no need to block traffic.
  6. Publish anomalous source IPs. Publish a message to an Amazon SNS topic with a list of anomalous source IPs. The function also publishes CloudWatch metrics to help you track the number of unique and abusive source IPs over time. At this point, the flow log summarizer function has finished its job until the next time it’s invoked from EventBridge.
  7. Receive anomalous source IPs. The network ACL updater function is subscribed to the SNS topic. It receives the list of anomalous source IPs.
  8. Update the network ACL. The network ACL updater function uses two network ACLs called blue and green. This verifies that the active rules remain in place while updating the rules in the inactive network ACL. When the inactive network ACL rules are updated, the function swaps network ACLs on each subnet. By default, each network ACL has a limit of 20 rules. If the number of anomalous source IPs exceeds the network ACL limit, the source IPs with the highest Z-score are prioritized. CloudWatch metrics are provided to help you track the number of source IPs blocked, and how many source IPs couldn’t be blocked due to network ACL limits.

Prerequisites

This solution assumes you have one or more public subnets used to operate an internet-facing endpoint.

Deploy the solution

Follow these steps to deploy and validate the solution.

  1. Download the latest release from GitHub.
  2. Upload the AWS CloudFormation templates and Python code to an S3 bucket.
  3. Gather the information needed for the CloudFormation template parameters.
  4. Create the CloudFormation stack.
  5. Monitor traffic mitigation activity using the CloudWatch dashboard.

Let’s review the steps I followed in my environment.

Step 1. Download the latest release

I create a new directory on my computer named auto-nacl-deploy. I review the releases on GitHub and choose the latest version. I download auto-nacl.zip into the auto-nacl-deploy directory. Now it’s time to stage this code in Amazon Simple Storage Service (Amazon S3).

Figure 2: Save auto-nacl.zip to the auto-nacl-deploy directory

Figure 2: Save auto-nacl.zip to the auto-nacl-deploy directory

Step 2. Upload the CloudFormation templates and Python code to an S3 bucket

I extract the auto-nacl.zip file into my auto-nacl-deploy directory.

Figure 3: Expand auto-nacl.zip into the auto-nacl-deploy directory

Figure 3: Expand auto-nacl.zip into the auto-nacl-deploy directory

The template.yaml file is used to create a CloudFormation stack with four nested stacks. You copy all files to an S3 bucket prior to creating the stacks.

To stage these files in Amazon S3, use an existing bucket or create a new one. For this example, I used an existing S3 bucket named auto-nacl-us-east-1. Using the Amazon S3 console, I created a folder named artifacts and then uploaded the extracted files to it. My bucket now looks like Figure 4.

Figure 4: Upload the extracted files to Amazon S3

Figure 4: Upload the extracted files to Amazon S3

Step 3. Gather information needed for the CloudFormation template parameters

There are six parameters required by the CloudFormation template.

Template parameter Parameter description
VpcId The ID of the VPC that runs your application.
SubnetIds A comma-delimited list of public subnet IDs used by your endpoint.
ListenerPort The IP port number for your endpoint’s listener.
ListenerProtocol The Internet Protocol (TCP or UDP) used by your endpoint.
SourceCodeS3Bucket The S3 bucket that contains the files you uploaded in Step 2. This bucket must be in the same AWS Region as the CloudFormation stack.
SourceCodeS3Prefix The S3 prefix (folder) of the files you uploaded in Step 2.

For the VpcId parameter, I use the VPC console to find the VPC ID for my application.

Figure 5: Find the VPC ID

Figure 5: Find the VPC ID

For the SubnetIds parameter, I use the VPC console to find the subnet IDs for my application. My VPC has public and private subnets. For this solution, you only need the public subnets.

Figure 6: Find the subnet IDs

Figure 6: Find the subnet IDs

My application uses a Network Load Balancer that listens on port 80 to handle TCP traffic. I use 80 for ListenerPort and TCP for ListenerProtocol.

The next two parameters are based on the Amazon S3 location I used earlier. I use auto-nacl-us-east-1 for SourceCodeS3Bucket and artifacts for SourceCodeS3Prefix.

Step 4. Create the CloudFormation stack

I use the CloudFormation console to create a stack. The Amazon S3 URL format is https://<bucket>.s3.<region>.amazonaws.com/<prefix>/template.yaml. I enter the Amazon S3 URL for my environment, then choose Next.

Figure 7: Specify the CloudFormation template

Figure 7: Specify the CloudFormation template

I enter a name for my stack (for example, auto-nacl-1) along with the parameter values I gathered in Step 3. I leave all optional parameters as they are, then choose Next.

Figure 8: Provide the required parameters

Figure 8: Provide the required parameters

I review the stack options, then scroll to the bottom and choose Next.

Figure 9: Review the default stack options

Figure 9: Review the default stack options

I scroll down to the Capabilities section and acknowledge the capabilities required by CloudFormation, then choose Submit.

Figure 10: Acknowledge the capabilities required by CloudFormation

Figure 10: Acknowledge the capabilities required by CloudFormation

I wait for the stack to reach CREATE_COMPLETE status. It takes 10–15 minutes to create all of the nested stacks.

Figure 11: Wait for the stacks to complete

Figure 11: Wait for the stacks to complete

Step 5. Monitor traffic mitigation activity using the CloudWatch dashboard

After the CloudFormation stacks are complete, I navigate to the CloudWatch console to open the dashboard. In my environment, the dashboard is named auto-nacl-1-MitigationDashboard-YS697LIEHKGJ.

Figure 12: Find the CloudWatch dashboard

Figure 12: Find the CloudWatch dashboard

Initially, the dashboard, shown in Figure 13, has little information to display. After an hour, I can see the following metrics from my sample environment:

  • The Network Traffic graph shows how many packets are allowed and rejected by network ACL rules. No anomalies have been detected yet, so this only shows allowed traffic.
  • The All Source IPs graph shows how many total unique source IP addresses are sending traffic.
  • The Anomalous Source Networks graph shows how many anomalous source networks are being blocked by network ACL rules (or not blocked due to network ACL rule limit). This graph is blank unless anomalies have been detected in the last hour.
  • The Anomalous Source IPs graph shows how many anomalous source IP addresses are being blocked (or not blocked) by network ACL rules. This graph is blank unless anomalies have been detected in the last hour.
  • The Packet Statistics graph can help you determine if the sensitivity should be adjusted. This graph shows the average packets-per-minute and the associated standard deviation over the past hour. It also shows the anomaly threshold, which represents the minimum number of packets-per-minute for a source IP address to be considered an anomaly. The anomaly threshold is calculated based on the CloudFormation parameter MinZScore.

    anomaly threshold = (MinZScore * standard deviation) + average

    Increasing the MinZScore parameter raises the threshold and reduces sensitivity. You can also adjust the CloudFormation parameter MinPacketsPerBin to mitigate against blocking traffic during periods of low volume, even if a source IP address exceeds the minimum Z-score.

  • The Blocked IPs grid shows which source IP addresses are being blocked during each hour, along with the corresponding packet bin size and Z-score. This grid is blank unless anomalies have been detected in the last hour.
     
Figure 13: Observe the dashboard after one hour

Figure 13: Observe the dashboard after one hour

Let’s review a scenario to see what happens when my endpoint sees two waves of anomalous traffic.

By default, my network ACL allows a maximum of 20 inbound rules. The two default rules count toward this limit, so I only have room for 18 more inbound rules. My application sees a spike of network traffic from 20 unique source IP addresses. When the traffic spike begins, the anomaly is detected in less than five minutes. Network ACL rules are created to block the top 18 source IP addresses (sorted by Z-score). Traffic is blocked for about 5 minutes until the flood subsides. The rules remain in place for 1 hour by default. When the same 20 source IP addresses send another traffic flood a few minutes later, most traffic is immediately blocked. Some traffic is still allowed from two source IP addresses that can’t be blocked due to the limit of 18 rules.

Figure 14: Observe traffic blocked from anomalous source IP addresses

Figure 14: Observe traffic blocked from anomalous source IP addresses

Customize the solution

You can customize the behavior of this solution to fit your use case.

  • Block many IP addresses per network ACL rule. To enable blocking more source IP addresses than your network ACL rule limit, change the CloudFormation parameter NaclRuleNetworkMask (default is 32). This sets the network mask used in network ACL rules and lets you block IP address ranges instead of individual IP addresses. By default, the IP address 192.0.2.1 is blocked by a network ACL rule for 192.0.2.1/32. Setting this parameter to 24 results in a network ACL rule that blocks 192.0.2.0/24. As a reminder, address ranges that are too wide might result in blocking legitimate traffic.
  • Only block source IPs that exceed a packet volume threshold. Use the CloudFormation parameter MinPacketsPerBin (default is 12,000) to set the minimum packets per minute. This mitigates against blocking source IPs (even if their Z-score is high) during periods of overall low traffic when there is no need to block traffic.
  • Adjust the sensitivity of anomaly detection. Use the CloudFormation parameter MinZScore to set the minimum Z-score for a source IP to be considered an anomaly. The default is 3.0, which only blocks source IPs with packet volume that exceeds 99.7 percent of all other source IPs.
  • Exclude trusted source IPs from anomaly detection. Specify an allow list object in Amazon S3 that contains a list of IP addresses or CIDRs that you want to exclude from network ACL rules. The network ACL updater function reads the allow list every time it handles an SNS message.

Limitations

As covered in the preceding sections, this solution has a few limitations to be aware of:

  • CloudWatch Logs queries can only return up to 10,000 records. This means the traffic baseline can only be calculated based on the observation of 10,000 unique source IP addresses per minute.
  • The traffic baseline is based on a rolling 1-hour window. You might need to increase this if a 1-hour window results in a baseline that allows false positives. For example, you might need a longer baseline window if your service normally handles abrupt spikes that occur hourly or daily.
  • By default, a network ACL can only hold 20 inbound rules. This includes the default allow and deny rules, so there’s room for 18 deny rules. You can increase this limit from 20 to 40 with a support case; however, it means that a maximum of 18 (or 38) source IP addresses can be blocked at one time.
  • The speed of anomaly detection is dependent on how quickly VPC flow logs are delivered to CloudWatch. This usually takes 2–4 minutes but can take over 6 minutes.

Cost considerations

CloudWatch Logs Insights queries are the main element of cost for this solution. See CloudWatch pricing for more information. The cost is about 7.70 USD per GB of flow logs generated per month.

To optimize the cost of CloudWatch queries, the VPC flow log record format only includes the fields required for anomaly detection. The CloudWatch log group is configured with a retention of 1 day. You can tune your cost by adjusting the anomaly detector function to run less frequently (the default is twice per minute). The tradeoff is that the network ACL rules won’t be updated as frequently. This can lead to the solution taking longer to mitigate a traffic flood.

Conclusion

Maintaining high availability and responsiveness is important to keeping the trust of your customers. The solution described above can help you automatically mitigate a variety of network floods that can impact the availability of your application even if you’ve followed all the applicable best practices for DDoS resiliency. There are limitations to this solution, but it can quickly detect and mitigate disruptive sources of traffic in a cost-effective manner. Your feedback is important. You can share comments below and report issues on GitHub.

 
If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

Bryan Van Hook

Bryan Van Hook

Bryan is a Senior Security Solutions Architect at AWS. He has over 25 years of experience in software engineering, cloud operations, and internet security. He spends most of his time helping customers gain the most value from native AWS security services. Outside of his day job, Bryan can be found playing tabletop games and acoustic guitar.

Reduce the security and compliance risks of messaging apps with AWS Wickr

Post Syndicated from Anne Grahn original https://aws.amazon.com/blogs/security/reduce-the-security-and-compliance-risks-of-messaging-apps-with-aws-wickr/

Effective collaboration is central to business success, and employees today depend heavily on messaging tools. An estimated 3.09 billion mobile phone users access messaging applications (apps) to communicate, and this figure is projected to grow to 3.51 billion users in 2025.

This post highlights the risks associated with messaging apps and describes how you can use enterprise solutions — such as AWS Wickr — that combine end-to-end encryption with data retention to drive positive security and business outcomes.

The business risks of messaging apps

Evolving threats, flexible work models, and a growing patchwork of data protection and privacy regulations have made maintaining secure and compliant enterprise messaging a challenge.

The use of third-party apps for business-related messages on both corporate and personal devices can make it more difficult to verify that data is being adequately protected and retained. This can lead to business risk, particularly in industries with unique record-keeping requirements. Organizations in the financial services industry, for example, are subject to rules that include Securities and Exchange Commission (SEC) Rule 17a-4 and Financial Industry Regulatory Authority (FINRA) Rule 3120, which require them to preserve all pertinent electronic communications.

A recent Gartner report on the viability of mobile bring-your-own-device (BYOD) programs noted, “It is now logical to assume that most financial services organizations with mobile BYOD programs for regulated employees could be fined due to a lack of compliance with electronic communications regulations.”

In the public sector, U.S. government agencies are subject to records requests under the Freedom of Information Act (FOIA) and various state sunshine statutes. For these organizations, effectively retaining business messages is about more than supporting security and compliance—it’s about maintaining public trust.

Securing enterprise messaging

Enterprise-grade messaging apps can help you protect communications from unauthorized access and facilitate desired business outcomes.

Security — Critical security protocols protect messages and files that contain sensitive and proprietary data — such as personally identifiable information, protected health information, financial records, and intellectual property — in transit and at rest to decrease the likelihood of a security incident.

Control — Administrative controls allow you to add, remove, and invite users, and organize them into security groups with restricted access to features and content at their level. Passwords can be reset and profiles can be deleted remotely, helping you reduce the risk of data exposure stemming from a lost or stolen device.

Compliance — Information can be preserved in a customer-controlled data store to help meet requirements such as those that fall under the Federal Records Act (FRA) and National Archives and Records Administration (NARA), as well as SEC Rule 17a-4 and Sarbanes-Oxley (SOX).

Marrying encryption with data retention

Enterprise solutions bring end-to-end encryption and data retention together in support of a comprehensive approach to secure messaging that balances people, process, and technology.

End-to-end encryption

Many messaging apps offer some form of encryption, but not all of them use end-to-end encryption. End-to-end encryption is a secure communication method that protects data from unauthorized access, interception, or tampering as it travels from one endpoint to another.

In end-to-end encryption, encryption and decryption take place locally, on the device. Every call, message, and file is encrypted with unique keys and remains indecipherable in transit. Unauthorized parties cannot access communication content because they don’t have the keys required to decrypt the data.

Encryption in transit compared to end-to-end encryption

Encryption in transit encrypts data over a network from one point to another (typically between one client and one server); data might remain stored in plaintext at the source and destination storage systems. End-to-end encryption combines encryption in transit and encryption at rest to secure data at all times, from being generated and leaving the sender’s device, to arriving at the recipient’s device and being decrypted.

“Messaging is a critical tool for any organization, and end-to-end encryption is the security technology that provides organizations with the confidence they need to rely on it.” — CJ Moses, CISO and VP of Security Engineering at AWS

Data retention

While data retention is often thought of as being incompatible with end-to-end encryption, leading enterprise-grade messaging apps offer both, giving you the option to configure a data store of your choice to retain conversations without exposing them to outside parties. No one other than the intended recipients and your organization has access to the message content, giving you full control over your data.

How AWS can help

AWS Wickr is an end-to-end encrypted messaging and collaboration service that was built from the ground up with features designed to help you keep internal and external communications secure, private, and compliant. Wickr protects one-to-one and group messaging, voice and video calling, file sharing, screen sharing, and location sharing with 256-bit Advanced Encryption Standard (AES) encryption, and provides data retention capabilities.

Figure 1: How Wickr works

Figure 1: How Wickr works

With Wickr, each message gets a unique AES private encryption key, and a unique Elliptic-curve Diffie–Hellman (ECDH) public key to negotiate the key exchange with recipients. Message content — including text, files, audio, or video — is encrypted on the sending device (your iPhone, for example) using the message-specific AES key. This key is then exchanged via the ECDH key exchange mechanism, so that only intended recipients can decrypt the message.

“As former employees of federal law enforcement, the intelligence community, and the military, Qintel understands the need for enterprise-federated, secure communication messaging capabilities. When searching for our company’s messaging application we evaluated the market thoroughly and while there are some excellent capabilities available, none of them offer the enterprise security and administrative flexibility that Wickr does.”
Bill Schambura, CEO at Qintel

Wickr network administrators can configure and apply data retention to both internal and external communications in a Wickr network. This includes conversations with guest users, external teams, and other partner networks, so you can retain messages and files sent to and from the organization to help meet internal, legal, and regulatory requirements.

Figure 2: Data retention process

Figure 2: Data retention process

Data retention is implemented as an always-on recipient that is added to conversations, not unlike the blind carbon copy (BCC) feature in email. The data-retention process participates in the key exchange, allowing it to decrypt messages. The process can run anywhere: on-premises, on an Amazon Elastic Compute Cloud (Amazon EC2) instance, or at a location of your choice.

Wickr is a Health Insurance Portability and Accountability Act of 1996 (HIPAA)-eligible service, helping healthcare organizations and medical providers to conduct secure telehealth visits, send messages and files that contain protected health information, and facilitate real-time patient care.

Wickr networks can be created through the AWS Management Console, and workflows can be automated with Wickr bots. Wickr is currently available in the AWS US East (Northern Virginia), AWS GovCloud (US-West), AWS Canada (Central), and AWS Europe (London) Regions.

Keep your messages safe

Employees will continue to use messaging apps to chat with friends and family, and boost productivity at work. While many of these apps can introduce risks if not used properly in business settings, Wickr combines end-to-end encryption with data-retention capabilities to help you achieve security and compliance goals. Incorporating Wickr into a comprehensive approach to secure enterprise messaging that includes clear policies and security awareness training can help you to accelerate collaboration, while protecting your organization’s data.

To learn more and get started, visit the AWS Wickr webpage, or contact us.

Want more AWS Security news? Follow us on Twitter.

Anne Grahn

Anne Grahn

Anne is a Senior Worldwide Security GTM Specialist at AWS, based in Chicago. She has more than a decade of experience in the security industry, and focuses on effectively communicating cybersecurity risk. She maintains a Certified Information Systems Security Professional (CISSP) certification.

Tanvi Jain

Tanvi Jain

Tanvi is a Senior Technical Product Manager at AWS, based in New York. She focuses on building security-first features for customers, and is passionate about improving collaboration by building technology that is easy to use, scalable, and interoperable.

Establishing a data perimeter on AWS: Allow access to company data only from expected networks

Post Syndicated from Laura Reith original https://aws.amazon.com/blogs/security/establishing-a-data-perimeter-on-aws-allow-access-to-company-data-only-from-expected-networks/

A key part of protecting your organization’s non-public, sensitive data is to understand who can access it and from where. One of the common requirements is to restrict access to authorized users from known locations. To accomplish this, you should be familiar with the expected network access patterns and establish organization-wide guardrails to limit access to known networks. Additionally, you should verify that the credentials associated with your AWS Identity and Access Management (IAM) principals are only usable within these expected networks. On Amazon Web Services (AWS), you can use the network perimeter to apply network coarse-grained controls on your resources and principals. In this fourth blog post of the Establishing a data perimeter on AWS series, we explore the benefits and implementation considerations of defining your network perimeter.

The network perimeter is a set of coarse-grained controls that help you verify that your identities and resources can only be used from expected networks.

To achieve these security objectives, you first must define what expected networks means for your organization. Expected networks usually include approved networks your employees and applications use to access your resources, such as your corporate IP CIDR range and your VPCs. There are also scenarios where you need to permit access from networks of AWS services acting on your behalf or networks of trusted third-party partners that you integrate with. You should consider all intended data access patterns when you create the definition of expected networks. Other networks are considered unexpected and shouldn’t be allowed access.

Security risks addressed by the network perimeter

The network perimeter helps address the following security risks:

Unintended information disclosure through credential use from non-corporate networks

It’s important to consider the security implications of having developers with preconfigured access stored on their laptops. For example, let’s say that to access an application, a developer uses a command line interface (CLI) to assume a role and uses the temporary credentials to work on a new feature. The developer continues their work at a coffee shop that has great public Wi-Fi while their credentials are still valid. Accessing data through a non-corporate network means that they are potentially bypassing their company’s security controls, which might lead to the unintended disclosure of sensitive corporate data in a public space.

Unintended data access through stolen credentials

Organizations are prioritizing protection from credential theft risks, as threat actors can use stolen credentials to gain access to sensitive data. For example, a developer could mistakenly share credentials from an Amazon EC2 instance CLI access over email. After credentials are obtained, a threat actor can use them to access your resources and potentially exfiltrate your corporate data, possibly leading to reputational risk.

Figure 1 outlines an undesirable access pattern: using an employee corporate credential to access corporate resources (in this example, an Amazon Simple Storage Service (Amazon S3) bucket) from a non-corporate network.

Figure 1: Unintended access to your S3 bucket from outside the corporate network

Figure 1: Unintended access to your S3 bucket from outside the corporate network

Implementing the network perimeter

During the network perimeter implementation, you use IAM policies and global condition keys to help you control access to your resources based on which network the API request is coming from. IAM allows you to enforce the origin of a request by making an API call using both identity policies and resource policies.

The following two policies help you control both your principals and resources to verify that the request is coming from your expected network:

  • Service control policies (SCPs) are policies you can use to manage the maximum available permissions for your principals. SCPs help you verify that your accounts stay within your organization’s access control guidelines.
  • Resource based policies are policies that are attached to resources in each AWS account. With resource based policies, you can specify who has access to the resource and what actions they can perform on it. For a list of services that support resource based policies, see AWS services that work with IAM.

With the help of these two policy types, you can enforce the control objectives using the following IAM global condition keys:

  • aws:SourceIp: You can use this condition key to create a policy that only allows request from a specific IP CIDR range. For example, this key helps you define your expected networks as your corporate IP CIDR range.
  • aws:SourceVpc: This condition key helps you check whether the request comes from the list of VPCs that you specified in the policy. In a policy, this condition key is used to only allow access to an S3 bucket if the VPC where the request originated matches the VPC ID listed in your policy.
  • aws:SourceVpce: You can use this condition key to check if the request came from one of the VPC endpoints specified in your policy. Adding this key to your policy helps you restrict access to API calls that originate from VPC endpoints that belong to your organization.
  • aws:ViaAWSService: You can use this key to write a policy to allow an AWS service that uses your credentials to make calls on your behalf. For example, when you upload an object to Amazon S3 with server-side encryption with AWS Key Management Service (AWS KMS) on, S3 needs to encrypt the data on your behalf. To do this, S3 makes a subsequent request to AWS KMS to generate a data key to encrypt the object. The call that S3 makes to AWS KMS is signed with your credentials and originates outside of your network.
  • aws:PrincipalIsAWSService: This condition key helps you write a policy to allow AWS service principals to access your resources. For example, when you create an AWS CloudTrail trail with an S3 bucket as a destination, CloudTrail uses a service principal, cloudtrail.amazonaws.com, to publish logs to your S3 bucket. The API call from CloudTrail comes from the service network.

The following table summarizes the relationship between the control objectives and the capabilities used to implement the network perimeter.

Control objective Implemented by using Primary IAM capability
My resources can only be accessed from expected networks. Resource-based policies aws:SourceIp
aws:SourceVpc
aws:SourceVpce
aws:ViaAWSService
aws:PrincipalIsAWSService
My identities can access resources only from expected networks. SCPs aws:SourceIp
aws:SourceVpc
aws:SourceVpce
aws:ViaAWSService

My resources can only be accessed from expected networks

Start by implementing the network perimeter on your resources using resource based policies. The perimeter should be applied to all resources that support resource- based policies in each AWS account. With this type of policy, you can define which networks can be used to access the resources, helping prevent access to your company resources in case of valid credentials being used from non-corporate networks.

The following is an example of a resource-based policy for an S3 bucket that limits access only to expected networks using the aws:SourceIp, aws:SourceVpc, aws:PrincipalIsAWSService, and aws:ViaAWSService condition keys. Replace <my-data-bucket>, <my-corporate-cidr>, and <my-vpc> with your information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceNetworkPerimeter",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<my-data-bucket>",
        "arn:aws:s3:::<my-data-bucket>/*"
      ],
      "Condition": {
        "NotIpAddressIfExists": {
          "aws:SourceIp": "<my-corporate-cidr>"
        },
        "StringNotEqualsIfExists": {
          "aws:SourceVpc": "<my-vpc>"
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false",
          "aws:ViaAWSService": "false"
        }
      }
    }
  ]
}

The Deny statement in the preceding policy has four condition keys where all conditions must resolve to true to invoke the Deny effect. Use the IfExists condition operator to clearly state that each of these conditions will still resolve to true if the key is not present on the request context.

This policy will deny Amazon S3 actions unless requested from your corporate CIDR range (NotIpAddressIfExists with aws:SourceIp), or from your VPC (StringNotEqualsIfExists with aws:SourceVpc). Notice that aws:SourceVpc and aws:SourceVpce are only present on the request if the call was made through a VPC endpoint. So, you could also use the aws:SourceVpce condition key in the policy above, however this would mean listing every VPC endpoint in your environment. Since the number of VPC endpoints is greater than the number of VPCs, this example uses the aws:SourceVpc condition key.

This policy also creates a conditional exception for Amazon S3 actions requested by a service principal (BoolIfExists with aws:PrincipalIsAWSService), such as CloudTrail writing events to your S3 bucket, or by an AWS service on your behalf (BoolIfExists with aws:ViaAWSService), such as S3 calling AWS KMS to encrypt or decrypt an object.

Extending the network perimeter on resource

There are cases where you need to extend your perimeter to include AWS services that access your resources from outside your network. For example, if you’re replicating objects using S3 bucket replication, the calls to Amazon S3 originate from the service network outside of your VPC, using a service role. Another case where you need to extend your perimeter is if you integrate with trusted third-party partners that need access to your resources. If you’re using services with the described access pattern in your AWS environment or need to provide access to trusted partners, the policy EnforceNetworkPerimeter that you applied on your S3 bucket in the previous section will deny access to the resource.

In this section, you learn how to extend your network perimeter to include networks of AWS services using service roles to access your resources and trusted third-party partners.

AWS services that use service roles and service-linked roles to access resources on your behalf

Service roles are assumed by AWS services to perform actions on your behalf. An IAM administrator can create, change, and delete a service role from within IAM; this role exists within your AWS account and has an ARN like arn:aws:iam::<AccountNumber>:role/<RoleName>. A key difference between a service-linked role (SLR) and a service role is that the SLR is linked to a specific AWS service and you can view but not edit the permissions and trust policy of the role. An example is AWS Identity and Access Management Access Analyzer using an SLR to analyze resource metadata. To account for this access pattern, you can exempt roles on the service-linked role dedicated path arn:aws:iam::<AccountNumber>:role/aws-service-role/*, and for service roles, you can tag the role with the tag network-perimeter-exception set to true.

If you are exempting service roles in your policy based on a tag-value, you must also include a policy to enforce the identity perimeter on your resource as shown in this sample policy. This helps verify that only identities from your organization can access the resource and cannot circumvent your network perimeter controls with network-perimeter-exception tag.

Partners accessing your resources from their own networks

There might be situations where your company needs to grant access to trusted third parties. For example, providing a trusted partner access to data stored in your S3 bucket. You can account for this type of access by using the aws:PrincipalAccount condition key set to the account ID provided by your partner.

The following is an example of a resource-based policy for an S3 bucket that incorporates the two access patterns described above. Replace <my-data-bucket>, <my-corporate-cidr>, <my-vpc>, <third-party-account-a>, <third-party-account-b>, and <my-account-id> with your information.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceNetworkPerimeter",
            "Principal": "*",
            "Action": "s3:*",
            "Effect": "Deny",
            "Resource": [
              "arn:aws:s3:::<my-data-bucket>",
              "arn:aws:s3:::<my-data-bucket>/*"
            ],
            "Condition": {
                "NotIpAddressIfExists": {
                  "aws:SourceIp": "<my-corporate-cidr>"
                },
                "StringNotEqualsIfExists": {
                    "aws:SourceVpc": "<my-vpc>",
       "aws:PrincipalTag/network-perimeter-exception": "true",
                    "aws:PrincipalAccount": [
                        "<third-party-account-a>",
                        "<third-party-account-b>"
                    ]
                },
                "BoolIfExists": {
                    "aws:PrincipalIsAWSService": "false",
                    "aws:ViaAWSService": "false"
                },
                "ArnNotLikeIfExists": {
                    "aws:PrincipalArn": "arn:aws:iam::<my-account-id>:role/aws-service-role/*"
                }
            }
        }
    ]
}

There are four condition operators in the policy above, and you need all four of them to resolve to true to invoke the Deny effect. Therefore, this policy only allows access to Amazon S3 from expected networks defined as: your corporate IP CIDR range (NotIpAddressIfExists and aws:SourceIp), your VPC (StringNotEqualsIfExists and aws:SourceVpc), networks of AWS service principals (aws:PrincipalIsAWSService), or an AWS service acting on your behalf (aws:ViaAWSService). It also allows access to networks of trusted third-party accounts (StringNotEqualsIfExists and aws:PrincipalAccount: <third-party-account-a>), and AWS services using an SLR to access your resources (ArnNotLikeIfExists and aws:PrincipalArn).

My identities can access resources only from expected networks

Applying the network perimeter on identity can be more challenging because you need to consider not only calls made directly by your principals, but also calls made by AWS services acting on your behalf. As described in access pattern 3 Intermediate IAM roles for data access in this blog post, many AWS services assume an AWS service role you created to perform actions on your behalf. The complicating factor is that even if the service supports VPC-based access to your data — for example AWS Glue jobs can be deployed within your VPC to access data in your S3 buckets — the service might also use the service role to make other API calls outside of your VPC. For example, with AWS Glue jobs, the service uses the service role to deploy elastic network interfaces (ENIs) in your VPC. However, these calls to create ENIs in your VPC are made from the AWS Glue managed network and not from within your expected network. A broad network restriction in your SCP for all your identities might prevent the AWS service from performing tasks on your behalf.

Therefore, the recommended approach is to only apply the perimeter to identities that represent the highest risk of inappropriate use based on other compensating controls that might exist in your environment. These are identities whose credentials can be obtained and misused by threat actors. For example, if you allow your developers access to the Amazon Elastic Compute Cloud (Amazon EC2) CLI, a developer can obtain credentials from the Amazon EC2 instance profile and use the credentials to access your resources from their own network.

To summarize, to enforce your network perimeter based on identity, evaluate your organization’s security posture and what compensating controls are in place. Then, according to this evaluation, identify which service roles or human roles have the highest risk of inappropriate use, and enforce the network perimeter on those identities by tagging them with data-perimeter-include set to true.

The following policy shows the use of tags to enforce the network perimeter on specific identities. Replace <my-corporate-cidr>, and <my-vpc> with your own information.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceNetworkPerimeter",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        },
        "NotIpAddressIfExists": {
          "aws:SourceIp": [
            "<my-corporate-cidr>"
          ]
        },
        "StringNotEqualsIfExists": {
          "aws:SourceVpc": [
            "<my-vpc>"
          ]
        }, 
       "ArnNotLikeIfExists": {
          "aws:PrincipalArn": [
            "arn:aws:iam::*:role/aws:ec2-infrastructure"
          ]
        },
        "StringEquals": {
          "aws:PrincipalTag/data-perimeter-include": "true"
        }
      }
    }
  ]
}

The above policy statement uses the Deny effect to limit access to expected networks for identities with the tag data-perimeter-include attached to them (StringEquals and aws:PrincipalTag/data-perimeter-include set to true). This policy will deny access to those identities unless the request is done by an AWS service on your behalf (aws:ViaAWSService), is coming from your corporate CIDR range (NotIpAddressIfExists and aws:SourceIp), or is coming from your VPCs (StringNotEqualsIfExists with the aws:SourceVpc).

Amazon EC2 also uses a special service role, also known as infrastructure role, to decrypt Amazon Elastic Block Store (Amazon EBS). When you mount an encrypted Amazon EBS volume to an EC2 instance, EC2 calls AWS KMS to decrypt the data key that was used to encrypt the volume. The call to AWS KMS is signed by an IAM role, arn:aws:iam::*:role/aws:ec2-infrastructure, which is created in your account by EC2. For this use case, as you can see on the preceding policy, you can use the aws:PrincipalArn condition key to exclude this role from the perimeter.

IAM policy samples

This GitHub repository contains policy examples that illustrate how to implement network perimeter controls. The policy samples don’t represent a complete list of valid access patterns and are for reference only. They’re intended for you to tailor and extend to suit the needs of your environment. Make sure you thoroughly test the provided example policies before implementing them in your production environment.

Conclusion

In this blog post you learned about the elements needed to build the network perimeter, including policy examples and strategies on how to extend that perimeter. You now also know different access patterns used by AWS services that act on your behalf, how to evaluate those access patterns, and how to take a risk-based approach to apply the perimeter based on identities in your organization.

For additional learning opportunities, see the Data perimeters on AWS. This information resource provides additional materials such as a data perimeter workshop, blog posts, whitepapers, and webinar sessions.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Laura Reith

Laura is an Identity Solutions Architect at Amazon Web Services. Before AWS, she worked as a Solutions Architect in Taiwan focusing on physical security and retail analytics.

Accelerating JVM cryptography with Amazon Corretto Crypto Provider 2

Post Syndicated from Will Childs-Klein original https://aws.amazon.com/blogs/security/accelerating-jvm-cryptography-with-amazon-corretto-crypto-provider-2/

Earlier this year, Amazon Web Services (AWS) released Amazon Corretto Crypto Provider (ACCP) 2, a cryptography provider built by AWS for Java virtual machine (JVM) applications. ACCP 2 delivers comprehensive performance enhancements, with some algorithms (such as elliptic curve key generation) seeing a greater than 13-fold improvement over ACCP 1. The new release also brings official support for the AWS Graviton family of processors. In this post, I’ll discuss a use case for ACCP, then review performance benchmarks to illustrate the performance gains. Finally, I’ll show you how to get started using ACCP 2 in applications today.

This release changes the backing cryptography library for ACCP from OpenSSL (used in ACCP 1) to the AWS open source cryptography library, AWS libcrypto (AWS-LC). AWS-LC has extensive formal verification, as well as traditional testing, to assure the correctness of cryptography that it provides. While AWS-LC and OpenSSL are largely compatible, there are some behavioral differences that required the ACCP major version increment to 2.

The move to AWS-LC also allows ACCP to leverage performance optimizations in AWS-LC for modern processors. I’ll illustrate the ACCP 2 performance enhancements through the use case of establishing a secure communications channel with Transport Layer Security version 1.3 (TLS 1.3). Specifically, I’ll examine cryptographic components of the connection’s initial phase, known as the handshake. TLS handshake latency particularly matters for large web service providers, but reducing the time it takes to perform various cryptographic operations is an operational win for any cryptography-intensive workload.

TLS 1.3 requires ephemeral key agreement, which means that a new key pair is generated and exchanged for every connection. During the TLS handshake, each party generates an ephemeral elliptic curve key pair, exchanges public keys using Elliptic Curve Diffie-Hellman (ECDH), and agrees on a shared secret. Finally, the client authenticates the server by verifying the Elliptic Curve Digital Signature Algorithm (ECDSA) signature in the certificate presented by the server after key exchange. All of this needs to happen before you can send data securely over the connection, so these operations directly impact handshake latency and must be fast.

Figure 1 shows benchmarks for the three elliptic curve algorithms that implement the TLS 1.3 handshake: elliptic curve key generation (up to 1,298% latency improvement in ACCP 2.0 over ACCP 1.6), ECDH key agreement (up to 858% latency improvement), and ECDSA digital signature verification (up to 260% latency improvement). These algorithms were benchmarked over three common elliptic curves with different key sizes on both ACCP 1 and ACCP 2. The choice of elliptic curve determines the size of the key used or generated by the algorithm, and key size correlates to performance. The following benchmarks were measured under the Amazon Corretto 11 JDK on a c7g.large instance running Amazon Linux with a Graviton 3 processor.

Figure 1: Percentage improvement of ACCP 2.0 over 1.6 performance benchmarks on c7g.large Amazon Linux Graviton 3

Figure 1: Percentage improvement of ACCP 2.0 over 1.6 performance benchmarks on c7g.large Amazon Linux Graviton 3

The performance improvements due to the optimization of secp384r1 in AWS-LC are particularly noteworthy.

Getting started

Whether you’re introducing ACCP to your project or upgrading from ACCP 1, start the onboarding process for ACCP 2 by updating your dependency manager configuration in your development or testing environment. The Maven and Gradle examples below assume that you’re using linux on an ARM64 processor. If you’re using an x86 processor, substitute linux-x86_64 for linux-aarch64. After you’ve performed this update, sync your application’s dependencies and install ACCP in your JVM process. ACCP can be installed either by specifying our recommended security.properties file in your JVM invocation or programmatically at runtime. The following sections provide more details about all of these steps.

After ACCP has been installed, the Java Cryptography Architecture (JCA) will look for cryptographic implementations in ACCP first before moving on to other providers. So, as long as your application and dependencies obtain algorithms supported by ACCP from the JCA, your application should gain the benefits of ACCP 2 without further configuration or code changes.

Maven

If you’re using Maven to manage dependencies, add or update the following dependency configuration in your pom.xml file.

<dependency>
  <groupId>software.amazon.cryptools</groupId>
  <artifactId>AmazonCorrettoCryptoProvider</artifactId>
  <version>[2.0,3.0)</version>
  <classifier>linux-aarch64</classifier>
</dependency>

Gradle

For Gradle, add or update the following dependency in your build.gradle file.

dependencies {
    implementation 'software.amazon.cryptools:AmazonCorrettoCryptoProvider:2.+:linux-aarch64'
}

Install through security properties

After updating your dependency manager, you’ll need to install ACCP. You can install ACCP using security properties as described in our GitHub repository. This installation method is a good option for users who have control over their JVM invocation.

Install programmatically

If you don’t have control over your JVM invocation, you can install ACCP programmatically. For Java applications, add the following code to your application’s initialization logic (optionally performing a health check).

com.amazon.corretto.crypto.provider.AmazonCorrettoCryptoProvider.install();
com.amazon.corretto.crypto.provider.AmazonCorrettoCryptoProvider.INSTANCE.assertHealthy();

Migrating from ACCP 1 to ACCP 2

Although the migration path to version 2 is straightforward for most ACCP 1 users, ACCP 2 ends support for some outdated algorithms: a finite field Diffie-Hellman key agreement, finite field DSA signatures, and a National Institute of Standards and Technology (NIST)-specified random number generator. The removal of these algorithms is not backwards compatible, so you’ll need to check your code for their usage and, if you do find usage, either migrate to more modern algorithms provided by ACCP 2 or obtain implementations from a different provider, such as one of the default providers that ships with the JDK.

Check your code

Search for unsupported algorithms in your application code by their JCA names:

  • DH: Finite-field Diffie-Hellman key agreement
  • DSA: Finite-field Digital Signature Algorithm
  • NIST800-90A/AES-CTR-256: NIST-specified random number generator

Use ACCP 2 supported algorithms

Where possible, use these supported algorithms in your application code:

  • ECDH for key agreement instead of DH
  • ECDSA or RSA for signatures instead of DSA
  • Default SecureRandom instead of NIST800-90A/AES-CTR-256

If your use case requires the now-unsupported algorithms, check whether any of those algorithms are explicitly requested from ACCP.

  • If ACCP is not explicitly named as the provider, then you should be able to transparently fall back to another provider without a code change.
  • If ACCP is explicitly named as the provider, then remove that provider specification and register a different provider that offers the algorithm. This will allow the JCA to obtain an implementation from another registered provider without breaking backwards compatibility in your application.

Test your code

Some behavioral differences exist between ACCP 2 and other providers, including ACCP 1 (backed by OpenSSL). After onboarding or migrating, it’s important that you test your application code thoroughly to identify potential incompatibilities between cryptography providers.

Conclusion

Integrate ACCP 2 into your application today to benefit from AWS-LC’s security assurance and performance improvements. For a full list of changes, see the ACCP CHANGELOG on GitHub. Linux builds of ACCP 2 are now available on Maven Central for aarch64 and x86-64 processor architectures. If you encounter any issues with your integration, or have any feature suggestions, please reach out to us on GitHub by filing an issue.

 
If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

Will Childs-Klein

Will Childs-Klein

Will is a Senior Software Engineer at AWS Cryptography, where he focuses on developing cryptographic libraries, optimizing software performance, and deploying post-quantum cryptography. Previously at AWS, he worked on data storage and transfer services including Storage Gateway, Elastic File System, and DataSync.

Discover the benefits of AWS WAF advanced rate-based rules

Post Syndicated from Rodrigo Ferroni original https://aws.amazon.com/blogs/security/discover-the-benefits-of-aws-waf-advanced-rate-based-rules/

In 2017, AWS announced the release of Rate-based Rules for AWS WAF, a new rule type that helps protect websites and APIs from application-level threats such as distributed denial of service (DDoS) attacks, brute force log-in attempts, and bad bots. Rate-based rules track the rate of requests for each originating IP address and invokes a rule action on IPs with rates that exceed a set limit.

While rate-based rules are useful to detect and mitigate a broad variety of bad actors, threats have evolved to bypass request-rate limit rules. For example, one bypass technique is to send a high volumes of requests by spreading them across thousands of unique IP addresses.

In May 2023, AWS announced AWS WAF enhancements to the existing rate-based rules feature that you can use to create more dynamic and intelligent rules by using additional HTTP request attributes for request rate limiting. For example, you can now choose from the following predefined keys to configure your rules: label namespace, header, cookie, query parameter, query string, HTTP method, URI path and source IP Address or IP Address in a header. Additionally, you can combine up to five composite keys as parameters for stronger rule development. These rule definition enhancements help improve perimeter security measures against sophisticated application-layer DDoS attacks using AWS WAF. For more information about the supported request attributes, see Rate-based rule statement in the AWS WAF Developer Guide.

In this blog post, you will learn more about these new AWS WAF feature enhancements and how you can use alternative request attributes to create more robust and granular sets of rules. In addition, you’ll learn how to combine keys to create a composite aggregation key to uniquely identify a specific combination of elements to improve rate tracking.

Getting started

Configuring advanced rate-based rules is similar to configuring simple rate-based rules. The process starts with creating a new custom rule of type rate-based rule, entering the rate limit value, selecting custom keys, choosing the key from the request aggregation key dropdown menu, and adding additional composite keys by choosing Add a request aggregation key as shown in Figure 1.

Figure 1: Creating an advanced rate-based rule with two aggregation keys

Figure 1: Creating an advanced rate-based rule with two aggregation keys

For existing rules, you can update those rate-based rules to use the new functionality by editing them. For example, you can add a header to be aggregated with the source IP address, as shown in Figure 2. Note that previously created rules will not be modified.

Figure 2: Add a second key to an existing rate-based rule

Figure 2: Add a second key to an existing rate-based rule

You still can set the same rule action, such as block, count, captcha, or challenge. Optionally, you can continue applying a scope-down statement to limit rule action. For example, you can limit the scope to a certain application path or requests with a specified header. You can scope down the inspection criteria so that only certain requests are counted towards rate limiting, and use certain keys to aggregate those requests together. A technique would be to count only requests that have /api at the start of the URI, and aggregate them based on their SessionId cookie value.

Target use cases

Now that you’re familiar with the foundations of advanced rate-based rules, let’s explore how they can improve your security posture using the following use cases:

  • Enhanced Application (Layer 7) DDoS protection
  • Improved API security
  • Enriched request throttling

Use case 1: Enhance Layer 7 DDoS mitigation

The first use case that you might find beneficial is to enhance Layer 7 DDoS mitigation. An HTTP request flood is the most common vector of DDoS attacks. This attack type aims to affect application availability by exhausting available resources to run the application.

Before the release of these enhancements to AWS WAF rules, rules were limited by aggregating requests based on the IP address from the request origin or configured to use a forwarded IP address in an HTTP header such as X-Forwarded-For. Now you can create a more robust rate-based rule to help protect your web application from DDoS attacks by tracking requests based on a different key or a combination of keys. Let’s examine some examples.

To help detect pervasive bots, such as scrapers, scanners, and crawlers, or common bots that are distributed across many unique IP addresses, a rule can look for static request data like a custom header — for example, User-Agent.

Key 1: Custom header (User-Agent)

{
  "Name": "test-rbr",
  "Priority": 0,
  "Statement": {
    "RateBasedStatement": {
      "Limit": 2000,
      "AggregateKeyType": "CUSTOM_KEYS",
      "CustomKeys": [
        {
          "Header": {
            "Name": "User-Agent",
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ]
          }
        }
      ]
    }
  },
  "Action": {
    "Block": {}
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "test-rbr"
  }
}

To help you decide what unique key to use, you can analyze AWS WAF logs. For more information, review Examples 2 and 3 in the blog post Analyzing AWS WAF Logs in Amazon CloudWatch Logs.

To uniquely identity users behind a NAT gateway, you can use a cookie in addition to an IP address. Before the aggregation keys feature, it was difficult to identify users who connected from a single IP address. Now, you can use the session cookie to aggregate requests by their session identifier and IP address.

Note that for Layer 7 DDoS mitigation, tracking by session ID in cookies can be circumvented, because bots might send random values or not send any cookie at all. It’s a good idea to keep an IP-based blanket rate-limiting rule to block offending IP addresses that reach a certain high rate, regardless of their request attributes. In that case, the keys would look like:

  • Key 1: Session cookie
  • Key 2: IP address

You can reduce false positives when using AWS Managed Rules (AMR) IP reputation lists by rate limiting based on their label namespace. Labelling functionality is a powerful feature that allows you to map the requests that match a specific pattern and apply custom rules to them. In this case, you can match the label namespace provided by the AMR IP reputation list that includes AWSManagedIPDDoSList, which is a list of IP addresses that have been identified as actively engaging in DDoS activities.

You might want to be cautious about using this group list in block mode, because there’s a chance of blocking legitimate users. To mitigate this, use the list in count mode and create an advanced rate-based rule to aggregate all requests with the label namespace awswaf:managed:aws:amazon-ip-list:, targeting captcha as the rule action. This lets you reduce false positives without compromising security. Applying captcha as an action for the rule reduces serving captcha to all users and instead only applies it when the rate of requests exceeds the defined limit. The key for this rule would be:

  • Labels (AMR IP reputation lists).

Use case 2: API security

In this second use case, you learn how to use an advanced rate-based rule to improve the security of an API. Protecting an API with rate-limiting rules helps ensure that requests aren’t being sent too frequently in a short amount of time. Reducing the risk from misusing an API helps to ensure that only legitimate requests are handled and not denied due to an overload of requests.

Now, you can create advanced rate-based rules that track API requests based on two aggregation keys. For example, HTTP method to differentiate between GET, POST, and other requests in combination with a custom header like Authorization to match a JSON Web Token (JWT). JWTs are not decrypted by AWS WAF, and AWS WAF only aggregates requests with the same token. This can help to ensure that a token is not being used maliciously or to bypass rate-limiting rules. An additional benefit of this configuration is that requests with no authorization headers are being aggregated together towards the rate limiting threshold. The keys for this use case are:

  • Key 1: HTTP method
  • Key 2: Custom header (Authorization)

In addition, you can configure a rule to block and add a custom response when the requests limit is reached. For example, by returning HTTP error code 429 (too many requests) with a Retry-After header indicating the requester should wait 900 seconds (15 minutes) before making a new request.

{
  "Name": "test-rbr",
  "Priority": 0,
  "Statement": {
    "RateBasedStatement": {
      "Limit": 600,
      "AggregateKeyType": "CUSTOM_KEYS",
      "CustomKeys": [
        {
          "HTTPMethod": {}
        },
        {
          "Header": {
            "Name": "Authorization",
            "TextTransformations": [
              {
                "Priority": 0,
                "Type": "NONE"
              }
            ]
          }
        }
      ]
    }
  },
  "Action": {
    "Block": {
      "CustomResponse": {
        "ResponseCode": 429,
        "ResponseHeaders": [
          {
            "Name": "Retry-After",
            "Value": "900"
          }
        ]
      }
    }
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "test-rbr"
  }
}

Use case 3: Implement request throttling

There are many situations where throttling should be considered. For example, if you want to maintain the performance of a service API by providing fair usage for all users, you can have different rate limits based on the type or purpose of the API, such as mutable or non-mutable requests. To achieve this, you can create two advanced rate-based rules using aggregation keys like IP address, combined with an HTTP request parameter for either mutable or non-mutable that indicates the type of request. Each rule will have its own HTTP request parameter, and you can set different maximum values for the rate limit. The keys for this use case are:

  • Key 1: HTTP request parameter
  • Key 2: IP address

Another example where throttling can be helpful is for a multi-tenant application where you want to track requests made by each tenant’s users. Let’s say you have a free tier but also a paying subscription model for which you want to allow a higher request rate. For this use case, it’s recommended to use two different URI paths to verify that the two tenants are kept separated. Additionally, it is advised to still use a custom header or query string parameter to differentiate between the two tenants, such as a tenant-id header or parameter that contains a unique identifier for each tenant. To implement this type of throttling using advanced rate-based rules, you can create two rules using an IP address in combination with the custom header as aggregation keys. Each rule can have its own maximum value for rate limiting, as well as a scope-down statement that matches requests for each URI path. The keys and scope-down statement for this use case are:

  • Key 1: Custom header (tenant-id)
  • Key 2: IP address
  • Scope down statement (URI path)

As a third example, you can rate-limit web applications based on the total number of requests that can be handled. For this use case, you can use the new Count all as aggregation option. The option counts and rate-limits the requests that match the rule’s scope-down statement, which is required for this type of aggregation. One option is to scope down and inspect the URI path to target a specific functionality like a /history-search page. An option when you need to control how many requests go to a specific domain is to scope down a single header to a specific host, creating one rule for a.example.com and another rule for b.example.com.

  • Request Aggregation: Count all
  • Scope down statement (URI path | Single header)

For these examples, you can block with a custom response when the requests exceed the limit. For example, by returning the same HTTP error code and header, but adding a custom response body with a message like “You have reached the maximum number of requests allowed.”

Logging

The AWS WAF logs now include additional information about request keys used for request-rate tracking and the values of matched request keys. In addition to the existing IP or Forwarded_IP values, you can see the updated log fields limitKey and customValue, where the limitKey field now shows either CustomKeys for custom aggregate key settings or Constant for count all requests. CustomValues shows an array of keys, names, and values.

Figure 3: Example log output for the advanced rate-based rule showing updated limitKey and customValues fields

Figure 3: Example log output for the advanced rate-based rule showing updated limitKey and customValues fields

As mentioned in the first use case, to get more detailed information about the traffic that’s analyzed by the web ACL, consider enabling logging. If you choose to enable Amazon CloudWatch Logs as the log destination, you can use CloudWatch Logs Insights and advanced queries to interactively search and analyze logs.

For example, you can use the following query to get the request information that matches rate-based rules, including the updated keys and values, directly from the AWS WAF console.

| fields terminatingRuleId as RuleName
| filter terminatingRuleType ="RATE_BASED" 
| parse @message ',"customValues":[*],' as customKeys
| limit 100

Figure 4 shows the CloudWatch Log Insights query and the logs output including custom keys, names, and values fields.

Figure 4: The CloudWatch Log Insights query and the logs output

Figure 4: The CloudWatch Log Insights query and the logs output

Pricing

There is no additional cost for using advanced rate-base rules; standard AWS WAF pricing applies when you use this feature. For AWS WAF pricing information, see AWS WAF Pricing. You only need to be aware that using aggregation keys will increase AWS WAF web ACL capacity units (WCU) usage for the rule. WCU usage is calculated based on how many keys you want to use for rate limiting. The current model of 2 WCUs plus any additional WCUs for a nested statement is being updated to 2 WCUs as a base, and 30 WCUs for each custom aggregation key that you specify. For example, if you want to create aggregation keys with an IP address in combination with a session cookie, this will use 62 WCUs, and aggregation keys with an IP address, session cookie, and customer header will use 92 WCUs. For more details about the WCU-based cost structure, visit Rate-based rule statement in the AWS WAF Developer Guide.

Conclusion

In this blog post, you learned about AWS WAF enhancements to existing rate-based rules that now support request parameters in addition to IP addresses. Additionally, these enhancements allow you to create composite keys based on up to five request parameters. This new capability allows you to be either more coarse in aggregating requests (such as all the requests that have an IP reputation label associated with them) or finer (such as aggregate requests for a specific session ID, not its IP address).

For more rule examples that include JSON rule configuration, visit Rate-based rule examples in the AWS WAF Developer Guide.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Rodrigo Ferroni

Rodrigo is a senior Security Specialist at AWS Enterprise Support. He is certified in CISSP, an AWS Security Specialist, and an AWS Solutions Architect Associate. He enjoys helping customers continue adopting AWS security services to improve their security posture in the cloud. Outside of work, he loves to travel as much as he can. In winter, he enjoys snowboarding with his friends.

Maksim Akifev

Maksim Akifev

Maksim is a Senior Product Manager at AWS WAF, partnering with businesses ranging from startups to enterprises to enhance their web application security. Maksim prioritizes quality, security, and user experience. He’s enthusiastic about innovative technology that expedites digital growth for businesses.

Deploy Amazon OpenSearch Serverless with Terraform

Post Syndicated from Joshua Luo original https://aws.amazon.com/blogs/big-data/deploy-amazon-opensearch-serverless-with-terraform/

Amazon OpenSearch Serverless provides the search and analytical functionality of OpenSearch without the manual overhead of configuring, managing, and scaling OpenSearch clusters. It automatically scales the resources based on your workload, and you only pay for the resources consumed. Managing OpenSearch Serverless is simple, but with infrastructure as code (IaC) software like Terraform, you can simplify your resource management even more.

This post demonstrates how to use Terraform to create, deploy, and clean up OpenSearch Serverless infrastructure.

Solution overview

To create and deploy an OpenSearch Serverless collection with security and access policies using Terraform, you need to follow these steps:

  1. Initialize the Terraform configuration.
  2. Create an encryption policy.
  3. Create an OpenSearch Serverless collection.
  4. Create a network policy.
  5. Create a virtual private cloud (VPC) endpoint.
  6. Create a data access policy.
  7. Deploy using Terraform.

Prerequisites

This post assumes that you’re familiar with GitHub and Git commands.

For this walkthrough, you need the following:

Initialize the Terraform configuration

The sample code is available in the Terraform GitHub repo in the terraform-provider-aws/examples/opensearchserverless directory. This configuration will get you started with OpenSearch Serverless. First, clone the repository to your workstation and navigate to the directory:

$ git clone https://github.com/hashicorp/terraform-provider-aws.git && \ 
cd ./terraform-provider-aws/examples/opensearchserverless

Initialize the configuration to install the aws provider by running the following command:

$ terraform init

The Terraform configuration first defines the version of Terraform required and configures the AWS provider to launch resources in the Region defined by the aws_region variable:

# Require Terraform 0.12 or greater
terraform {
  required_version = ">= 0.12"
}

# Set AWS provider region
provider "aws" {
  region = var.aws_region
}

The variables used in this Terraform configuration are defined in the variables.tf file. This post assumes the default values are used:

variable "aws_region" {
  description = "The AWS region to create things in."
  default     = "us-east-1"
}

variable "collection_name" {
  description = "Name of the OpenSearch Serverless collection."
  default     = "example-collection"
}

Create an encryption policy

Now that the provider is installed and configured, the Terraform configuration moves on to defining OpenSearch Serverless policies for security. OpenSearch Serverless uses AWS Key Management Service (AWS KMS) to encrypt your data. The encryption is managed by an encryption policy. To create an encryption policy, use the aws_opensearchserverless_security_policy resource, which has a name parameter, a type of encryption, a JSON string that defines the policy, and an optional description:

# Creates an encryption security policy
resource "aws_opensearchserverless_security_policy" "encryption_policy" {
  name        = "example-encryption-policy"
  type        = "encryption"
  description = "encryption policy for ${var.collection_name}"
  policy = jsonencode({
    Rules = [
      {
        Resource = [
          "collection/${var.collection_name}"
        ],
        ResourceType = "collection"
      }
    ],
    AWSOwnedKey = true
  })
}

This encryption policy is named example-encryption-policy, applies to a collection named example-collection, and uses an AWS owned key to encrypt the data.

Create an OpenSearch Serverless collection

You can organize your OpenSearch indexes into a logical grouping called a collection. Create a collection using the aws_opensearchserverless_collection resource, which has a name parameter, and optionally, description, tags, and type:

# Creates a collection
resource "aws_opensearchserverless_collection" "collection" {
  name = var.collection_name

  depends_on = [aws_opensearchserverless_security_policy.encryption_policy]
}

This collection is named example-collection. If type is not specified, a time series collection is created. Supported collection types can be found in the Terraform documentation for the aws_opensearchserverless_collection resource. OpenSearch Serverless requires encryption at rest, so an applicable encryption policy is required before a collection can be created. The Terraform configuration explicitly defines this dependency using the depends_on meta-argument. Errors can arise if this dependency is not defined.

Now that a collection has been created with an AWS owned KMS key, the Terraform configuration goes on to define the network and data access policy to configure access to the collection.

Create a network policy

A network policy allows access to your collection either over the public internet or through OpenSearch Serverless-managed VPC endpoints. Similar to the encryption policy, to create a network policy, use the aws_opensearchserverless_security_policy resource, which has a name parameter, a type of network, a JSON string that defines the policy, and an optional description:

# Creates a network security policy
resource "aws_opensearchserverless_security_policy" "network_policy" {
  name        = "example-network-policy"
  type        = "network"
  description = "public access for dashboard, VPC access for collection endpoint"
  policy = jsonencode([
    {
      Description = "VPC access for collection endpoint",
      Rules = [
        {
          ResourceType = "collection",
          Resource = [
            "collection/${var.collection_name}"
          ]
        }
      ],
      AllowFromPublic = false,
      SourceVPCEs = [
        aws_opensearchserverless_vpc_endpoint.vpc_endpoint.id
      ]
    },
    {
      Description = "Public access for dashboards",
      Rules = [
        {
          ResourceType = "dashboard"
          Resource = [
            "collection/${var.collection_name}"
          ]
        }
      ],
      AllowFromPublic = true
    }
  ])
}

This network policy is named example-network-policy and applies to the collection named example-collection. This policy only allows access to the collection’s OpenSearch endpoint through a VPC endpoint, but allows public access to the OpenSearch Dashboards endpoint.

You’ll notice the VPC endpoint has not been defined yet, but it is referenced in the network policy. Terraform determines this dependency automatically and will not create the network policy until the VPC endpoint has been created.

Create a VPC endpoint

A VPC endpoint enables you to privately access your OpenSearch Serverless collection using AWS PrivateLink (for more information, refer to Access AWS services through AWS PrivateLink). Create a VPC endpoint using the aws_opensearchserverless_vpc_endpoint resource, where you define name, vpc_id, subnet_ids , and optionally, security_group_ids:

# Creates a VPC endpoint
resource "aws_opensearchserverless_vpc_endpoint" "vpc_endpoint" {
  name               = "example-vpc-endpoint"
  vpc_id             = aws_vpc.vpc.id
  subnet_ids         = [aws_subnet.subnet.id]
  security_group_ids = [aws_security_group.security_group.id]
}

Creating a VPC and all the required networking resources is out of scope for this post, but the minimum required VPC resources are created here in a separate file to demonstrate the VPC endpoint functionality. Refer to Getting Started with Amazon VPC to learn more.

Create a data access policy

The configuration defines a data source that looks up information about the context Terraform is currently running in. This data source is used when defining the data access policy. More information can be found in the Terraform documentation for aws_caller_identity.

# Gets access to the effective Account ID in which Terraform is authorized
data "aws_caller_identity" "current" {}

A data access policy allows you to define who has access to collections and indexes. The data access policy is defined using the aws_opensearchserverless_access_policy resource, which has a name parameter, a type parameter set to data, a JSON string that defines the policy, and an optional description:

# Creates a data access policy
resource "aws_opensearchserverless_access_policy" "data_access_policy" {
  name        = "example-data-access-policy"
  type        = "data"
  description = "allow index and collection access"
  policy = jsonencode([
    {
      Rules = [
        {
          ResourceType = "index",
          Resource = [
            "index/${var.collection_name}/*"
          ],
          Permission = [
            "aoss:*"
          ]
        },
        {
          ResourceType = "collection",
          Resource = [
            "collection/${var.collection_name}"
          ],
          Permission = [
            "aoss:*"
          ]
        }
      ],
      Principal = [
        data.aws_caller_identity.current.arn
      ]
    }
  ])
}

This data access policy allows the current AWS role or user to perform collection-related actions on the collection named example-collection and index-related actions on the indexes in the collection.

Deploy using Terraform

Now that you have configured the necessary resources, apply the configuration using terraform apply. Before creating the resources, Terraform will describe all the resources that will be created so you can verify your configuration:

$ terraform apply

...

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

...

Plan: 13 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + collection_enpdoint = (known after apply)
  + dashboard_endpoint  = (known after apply)

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value:

Answer yes to proceed.

If this is the first OpenSearch Serverless collection in your account, applying the configuration may take over 10 minutes because Terraform waits for the collection to become active.

Apply complete! Resources: 13 added, 0 changed, 0 destroyed.

Outputs:

collection_enpdoint = "..."
dashboard_endpoint = "..."

You have now deployed an OpenSearch Serverless time series collection with policies to configure encryption and access to the collection!

Clean up

The resources will incur costs as long as they are running, so clean up the resources when you are done using them. Use the terraform destroy command to do this:

$ terraform destroy

...

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:

  ...

Plan: 0 to add, 0 to change, 13 to destroy.

Changes to Outputs:

...

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value:

Answer yes to run this plan and destroy the infrastructure.

Destroy complete! Resources: 13 destroyed.

All resources created during this walkthrough have now been deleted.

Conclusion

In this post, you created an OpenSearch Serverless collection. Using IaC software like Terraform can make it simple to manage your resources like OpenSearch Serverless collections, encryption, network, and data access policies, and VPC endpoints.

Thank you to all the open-source contributors who help maintain OpenSearch and Terraform.

Try using OpenSearch Serverless with Terraform to simplify your resource management. Check out the Getting started with Amazon OpenSearch Serverless workshop and the Amazon OpenSearch Serverless Developer Guide to learn more about OpenSearch Serverless.


About the authors

Joshua Luo is a Software Development Engineer for Amazon OpenSearch Serverless. He works on the systems that enable customers to manage and monitor their OpenSearch Serverless resources. He enjoys bouldering, photography, and videography in his free time.

Satish Nandi is a Senior Technical Product Manager for Amazon OpenSearch Service.

Two real-life examples of why limiting permissions works: Lessons from AWS CIRT

Post Syndicated from Richard Billington original https://aws.amazon.com/blogs/security/two-real-life-examples-of-why-limiting-permissions-works-lessons-from-aws-cirt/

Welcome to another blog post from the AWS Customer Incident Response Team (CIRT)! For this post, we’re looking at two events that the team was involved in from the viewpoint of a regularly discussed but sometimes misunderstood subject, least privilege. Specifically, we consider the idea that the benefit of reducing permissions in real-life use cases does not always require using the absolute minimum set of privileges. Instead, you need to weigh the cost and effort of creating and maintaining privileges against the risk reduction that is achieved, to make sure that your permissions are appropriate for your needs.

To quote VP and Distinguished Engineer at Amazon Security, Eric Brandwine, “Least privilege equals maximum effort.” This is the idea that creating and maintaining the smallest possible set of privileges needed to perform a given task will require the largest amount of effort, especially as customer needs and service features change over time. However, the correlation between effort and permission reduction is not linear. So, the question you should be asking is: How do you balance the effort of privilege reduction with the benefits it provides?

Unfortunately, this is not an easy question to answer. You need to consider the likelihood of an unwanted issue happening, the impact if that issue did happen, and the cost and effort to prevent (or detect and recover from) that issue. You also need to factor requirements such as your business goals and regulatory requirements into your decision process. Of course, you won’t need to consider just one potential issue, but many. Often it can be useful to start with a rough set of permissions and refine them down as you develop your knowledge of what security level is required. You can also use service control policies (SCPs) to provide a set of permission guardrails if you’re using AWS Organizations. In this post, we tell two real-world stories where limiting AWS Identity and Access Management (IAM) permissions worked by limiting the impact of a security event, but where the permission set did not involve maximum effort.

Story 1: On the hunt for credentials

In this AWS CIRT story, we see how a threat actor was unable to achieve their goal because the access they obtained — a database administrator’s — did not have the IAM permissions they were after.

Background and AWS CIRT engagement

A customer came to us after they discovered unauthorized activity in their on-premises systems and in some of their AWS accounts. They had incident response capability and were looking for an additional set of hands with AWS knowledge to help them with their investigation. This helped to free up the customer’s staff to focus on the on-premises analysis.

Before our engagement, the customer had already performed initial containment activities. This included rotating credentials, revoking temporary credentials, and isolating impacted systems. They also had a good idea of which federated user accounts had been accessed by the threat actor.

The key part of every AWS CIRT engagement is the customer’s ask. Everything our team does falls on the customer side of the AWS Shared Responsibility Model, so we want to make sure that we are aligned to the customer’s desired outcome. The ask was clear—review the potential unauthorized federated users’ access, and investigate the unwanted AWS actions that were taken by those users during the known timeframe. To get a better idea of what was “unwanted,” we talked to the customer to understand at a high level what a typical day would entail for these users, to get some context around what sort of actions would be expected. The users were primarily focused on working with Amazon Relational Database Service (RDS).

Analysis and findings

For this part of the story, we’ll focus on a single federated user whose apparent actions we investigated, because the other federated users had not been impersonated by the threat actor in a meaningful way. We knew the approximate start and end dates to focus on and had discovered that the threat actor had performed a number of unwanted actions.

After reviewing the actions, it was clear that the threat actor had performed a console sign-in on three separate occasions within a 2-hour window. Each time, the threat actor attempted to perform a subset of the following actions:

Note: This list includes only the actions that are displayed as readOnly = false in AWS CloudTrail, because these actions are often (but not always) the more impactful ones, with the potential to change the AWS environment.

This is the point where the benefit of permission restriction became clear. As soon as this list was compiled, we noticed that two fields were present for all of the actions listed:

"errorCode": "Client.UnauthorizedOperation",
"errorMessage": "You are not authorized to perform this operation. [rest of message]"

As this reveals, every single non-readOnly action that was attempted by the threat actor was denied because the federated user account did not have the required IAM permissions.

Customer communication and result

After we confirmed the findings, we had a call with the customer to discuss the results. As you can imagine, they were happy that the results showed no material impact to their data, and said no further investigation or actions were required at that time.

What were the IAM permissions the federated user had, which prevented the set of actions the threat actor attempted?

The answer did not actually involve the absolute minimal set of permissions required by the user to do their job. It’s simply that the federated user had a role that didn’t have an Allow statement for the IAM actions the threat actor tried — their job did not require them. Without an explicit Allow statement, the IAM actions attempted were denied because IAM policies are Deny by default. In this instance, simply not having the desired IAM permissions meant that the threat actor wasn’t able to achieve their goal, and stopped using the access. We’ll never know what their goal actually was, but trying to create access keys, passwords, and update policies means that a fair guess would be that they were attempting to create another way to access that AWS account.

Story 2: More instances for crypto mining

In this AWS CIRT story, we see how a threat actor’s inability to create additional Amazon Elastic Compute Cloud (Amazon EC2) instances resulted in the threat actor leaving without achieving their goal.

Background and AWS CIRT engagement

Our second story involves a customer who had an AWS account they were using to test some new third-party software that uses Amazon Elastic Container Service (Amazon ECS). This customer had Amazon GuardDuty turned on, and found that they were getting GuardDuty alerts for CryptoCurrency:EC2/BitcoinTool related findings.

Because this account was new and currently only used for testing their software, the customer saw that the detection was related to the Amazon ECS cluster and decided to delete all the resources in the account and rebuild. Not too long after doing this, they received a similar GuardDuty alert for the new Amazon ECS cluster they had set up. The second finding resulted in the customer’s security team and AWS being brought in to try to understand what was causing this. The customer’s security team was focused on reviewing the tasks that were being run on the cluster, while AWS CIRT reviewed the AWS account actions and provided insight about the GuardDuty finding and what could have caused it.

Analysis and findings

Working with the customer, it wasn’t long before we discovered that the 3rd party Amazon ECS task definition that the customer was using, was unintentionally exposing a web interface to the internet. That interface allowed unauthenticated users to run commands on the system. This explained why the same alert was also received shortly after the new install had been completed.

This is where the story takes a turn for the better. The AWS CIRT analysis of the AWS CloudTrail logs found that there were a number of attempts to use the credentials of the Task IAM role that was associated with the Amazon ECS task. The majority of actions were attempting to launch multiple Amazon EC2 instances via RunInstances calls. Every one of these actions, along with the other actions attempted, failed with either a Client.UnauthorizedOperation or an AccessDenied error message.

Next, we worked with the customer to understand the permissions provided by the Task IAM role. Once again, the permissions could have been limited more tightly. However, this time the goal of the threat actor — running a number of Amazon EC2 instances (most likely for surreptitious crypto mining) — did not align with the policy given to the role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Action": "s3:*",
          "Resource": "*"
        }
    ]
}

AWS CIRT recommended creating policies to restrict the allowed actions further, providing specific resources where possible, and potentially also adding in some conditions to limit other aspects of the access (such as the two Condition keys launched recently to limit where Amazon EC2 instance credentials can be used from). However, simply having the policy limit access to Amazon Simple Storage Service (Amazon S3) meant that the threat actor decided to leave with just the one Amazon ECS task running crypto mining rather than a larger number of Amazon EC2 instances.

Customer communication and result

After reporting these findings to the customer, there were two clear next steps: First, remove the now unwanted and untrusted Amazon ECS resource from their AWS account. Second, review and re-architect the Amazon ECS task so that access to the web interface was only provided to appropriate users. As part of that re-architecting, an Amazon S3 policy similar to “Allows read and write access to objects in an S3 bucket” was recommended. This separates Amazon S3 bucket actions from Amazon S3 object actions. When applications have a need to read and write objects in Amazon S3, they don’t normally have a need to create or delete entire buckets (or versioning on those buckets).

Some tools to help

We’ve just looked at how limiting privileges helped during two different security events. Now, let’s consider what can help you decide how to reduce your IAM permissions to an appropriate level. There are a number of resources that talk about different approaches:

The first approach is to use Access Analyzer to help generate IAM policies based on access activity from log data. This can then be refined further with the addition of Condition elements as desired. We already have a couple of blog posts about that exact topic:

The second approach is similar, and that is to reduce policy scope based on the last-accessed information:

The third approach is a manual method of creating and refining policies to reduce the amount of work required. For this, you can begin with an appropriate AWS managed IAM policy or an AWS provided example policy as a starting point. Following this, you can add or remove Actions, Resources, and Conditions — using wildcards as desired — to balance your effort and permission reduction.

An example of balancing effort and permission is in the IAM tutorial Create and attach your first customer managed policy. In it, the authors create a policy that uses iam:Get* and iam:List:* in the Actions section. Although not all iam:Get* and iam:List:* Actions may be required, this is a good way to group similar Actions together while minimizing Actions that allow unwanted access — for example, iam:Create* or iam:Delete*. Another example of this balancing was mentioned earlier relating to Amazon S3, allowing access to create, delete, and read objects, but not to change the configuration of the bucket those objects are in.

In addition to limiting permissions, we also recommend that you set up appropriate detection and response capability. This will enable you to know when an issue has occurred and provide the tools to contain and recover from the issue. Further details about performing incident response in an AWS account can be found in the AWS Security Incident Response Guide.

There are also two services that were used to help in the stories we presented here — Amazon GuardDuty and AWS CloudTrail. GuardDuty is a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity. It’s a great way to monitor for unwanted activity within your AWS accounts. CloudTrail records account activity across your AWS infrastructure and provides the logs that were used for the analysis that AWS CIRT performed for both these stories. Making sure that both of these are set up correctly is a great first step towards improving your threat detection and incident response capability in AWS.

Conclusion

In this post, we looked at two examples where limiting privilege provided positive results during a security event. In the second case, the policy used should probably have restricted permissions further, but even as it stood, it was an effective preventative control in stopping the unauthorized user from achieving their assumed goal.

Hopefully these stories will provide new insight into the way your organization thinks about setting permissions, while taking into account the effort of creating the permissions. These stories are a good example of how starting a journey towards least privilege can help stop unauthorized users. Neither of the scenarios had policies that were least privilege, but the policies were restrictive enough that the unauthorized users were prevented from achieving their goals this time, resulting in minimal impact to the customers. However in both cases AWS CIRT recommended further reducing the scope of the IAM policies being used.

Finally, we provided a few ways to go about reducing permissions—first, by using tools to assist with policy creation, and second, by editing existing policies so they better fit your specific needs. You can get started by checking your existing policies against what Access Analyzer would recommend, by looking for and removing overly permissive wildcard characters (*) in some of your existing IAM policies, or by implementing and refining your SCPs.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Richard Billington

Richard Billington

Richard is the Incident Response Watch Lead for the Asia-Pacific region of the AWS Customer Incident Response Team (a team that supports AWS Customers during active security events). He also helps customers prepare for security events using event simulations. Outside of work, he loves wildlife photography and Dr Pepper (which is hard to find in meaningful quantities within Australia).

Validate IAM policies by using IAM Policy Validator for AWS CloudFormation and GitHub Actions

Post Syndicated from Mitch Beaumont original https://aws.amazon.com/blogs/security/validate-iam-policies-by-using-iam-policy-validator-for-aws-cloudformation-and-github-actions/

In this blog post, I’ll show you how to automate the validation of AWS Identity and Access Management (IAM) policies by using a combination of the IAM Policy Validator for AWS CloudFormation (cfn-policy-validator) and GitHub Actions. Policy validation is an approach that is designed to minimize the deployment of unwanted IAM identity-based and resource-based policies to your Amazon Web Services (AWS) environments.

With GitHub Actions, you can automate, customize, and run software development workflows directly within a repository. Workflows are defined using YAML and are stored alongside your code. I’ll discuss the specifics of how you can set up and use GitHub actions within a repository in the sections that follow.

The cfn-policy-validator tool is a command-line tool that takes an AWS CloudFormation template, finds and parses the IAM policies that are attached to IAM roles, users, groups, and resources, and then runs the policies through IAM Access Analyzer policy checks. Implementing IAM policy validation checks at the time of code check-in helps shift security to the left (closer to the developer) and shortens the time between when developers commit code and when they get feedback on their work.

Let’s walk through an example that checks the policies that are attached to an IAM role in a CloudFormation template. In this example, the cfn-policy-validator tool will find that the trust policy attached to the IAM role allows the role to be assumed by external principals. This configuration could lead to unintended access to your resources and data, which is a security risk.

Prerequisites

To complete this example, you will need the following:

  1. A GitHub account
  2. An AWS account, and an identity within that account that has permissions to create the IAM roles and resources used in this example

Step 1: Create a repository that will host the CloudFormation template to be validated

To begin with, you need to create a GitHub repository to host the CloudFormation template that is going to be validated by the cfn-policy-validator tool.

To create a repository:

  1. Open a browser and go to https://github.com.
  2. In the upper-right corner of the page, in the drop-down menu, choose New repository. For Repository name, enter a short, memorable name for your repository.
  3. (Optional) Add a description of your repository.
  4. Choose either the option Public (the repository is accessible to everyone on the internet) or Private (the repository is accessible only to people access is explicitly shared with).
  5. Choose Initialize this repository with: Add a README file.
  6. Choose Create repository. Make a note of the repository’s name.

Step 2: Clone the repository locally

Now that the repository has been created, clone it locally and add a CloudFormation template.

To clone the repository locally and add a CloudFormation template:

  1. Open the command-line tool of your choice.
  2. Use the following command to clone the new repository locally. Make sure to replace <GitHubOrg> and <RepositoryName> with your own values.
    git clone [email protected]:<GitHubOrg>/<RepositoryName>.git

  3. Change in to the directory that contains the locally-cloned repository.
    cd <RepositoryName>

    Now that the repository is locally cloned, populate the locally-cloned repository with the following sample CloudFormation template. This template creates a single IAM role that allows a principal to assume the role to perform the S3:GetObject action.

  4. Use the following command to create the sample CloudFormation template file.

    WARNING: This sample role and policy should not be used in production. Using a wildcard in the principal element of a role’s trust policy would allow any IAM principal in any account to assume the role.

    cat << EOF > sample-role.yaml
    
    AWSTemplateFormatVersion: "2010-09-09"
    Description: Base stack to create a simple role
    Resources:
      SampleIamRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Statement:
              - Effect: Allow
                Principal:
                  AWS: "*"
                Action: ["sts:AssumeRole"]
          Path: /      
          Policies:
            - PolicyName: root
              PolicyDocument:
                Version: 2012-10-17
                Statement:
                  - Resource: "*"
                    Effect: Allow
                    Action:
                      - s3:GetObject
    EOF

Notice that AssumeRolePolicyDocument refers to a trust policy that includes a wildcard value in the principal element. This means that the role could potentially be assumed by an external identity, and that’s a risk you want to know about.

Step 3: Vend temporary AWS credentials for GitHub Actions workflows

In order for the cfn-policy-validator tool that’s running in the GitHub Actions workflow to use the IAM Access Analyzer API, the GitHub Actions workflow needs a set of temporary AWS credentials. The AWS Credentials for GitHub Actions action helps address this requirement. This action implements the AWS SDK credential resolution chain and exports environment variables for other actions to use in a workflow. Environment variable exports are detected by the cfn-policy-validator tool.

AWS Credentials for GitHub Actions supports four methods for fetching credentials from AWS, but the recommended approach is to use GitHub’s OpenID Connect (OIDC) provider in conjunction with a configured IAM identity provider endpoint.

To configure an IAM identity provider endpoint for use in conjunction with GitHub’s OIDC provider:

  1. Open the AWS Management Console and navigate to IAM.
  2. In the left-hand menu, choose Identity providers, and then choose Add provider.
  3. For Provider type, choose OpenID Connect.
  4. For Provider URL, enter
    https://token.actions.githubusercontent.com
  5. Choose Get thumbprint.
  6. For Audiences, enter sts.amazonaws.com
  7. Choose Add provider to complete the setup.

At this point, make a note of the OIDC provider name. You’ll need this information in the next step.

After it’s configured, the IAM identity provider endpoint should look similar to the following:

Figure 1: IAM Identity provider details

Figure 1: IAM Identity provider details

Step 4: Create an IAM role with permissions to call the IAM Access Analyzer API

In this step, you will create an IAM role that can be assumed by the GitHub Actions workflow and that provides the necessary permissions to run the cfn-policy-validator tool.

To create the IAM role:

  1. In the IAM console, in the left-hand menu, choose Roles, and then choose Create role.
  2. For Trust entity type, choose Web identity.
  3. In the Provider list, choose the new GitHub OIDC provider that you created in the earlier step. For Audience, select sts.amazonaws.com from the list.
  4. Choose Next.
  5. On the Add permission page, choose Create policy.
  6. Choose JSON, and enter the following policy:
    
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                  "iam:GetPolicy",
                  "iam:GetPolicyVersion",
                  "access-analyzer:ListAnalyzers",
                  "access-analyzer:ValidatePolicy",
                  "access-analyzer:CreateAccessPreview",
                  "access-analyzer:GetAccessPreview",
                  "access-analyzer:ListAccessPreviewFindings",
                  "access-analyzer:CreateAnalyzer",
                  "s3:ListAllMyBuckets",
                  "cloudformation:ListExports",
                  "ssm:GetParameter"
                ],
                "Resource": "*"
            },
            {
              "Effect": "Allow",
              "Action": "iam:CreateServiceLinkedRole",
              "Resource": "*",
              "Condition": {
                "StringEquals": {
                  "iam:AWSServiceName": "access-analyzer.amazonaws.com"
                }
              }
            } 
        ]
    }

  7. After you’ve attached the new policy, choose Next.

    Note: For a full explanation of each of these actions and a CloudFormation template example that you can use to create this role, see the IAM Policy Validator for AWS CloudFormation GitHub project.

  8. Give the role a name, and scroll down to look at Step 1: Select trusted entities.

    The default policy you just created allows GitHub Actions from organizations or repositories outside of your control to assume the role. To align with the IAM best practice of granting least privilege, let’s scope it down further to only allow a specific GitHub organization and the repository that you created earlier to assume it.

  9. Replace the policy to look like the following, but don’t forget to replace {AWSAccountID}, {GitHubOrg} and {RepositoryName} with your own values.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": "arn:aws:iam::{AWSAccountID}:oidc-provider/token.actions.githubusercontent.com"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
                    },
                    "StringLike": {
                        "token.actions.githubusercontent.com:sub": "repo:${GitHubOrg}/${RepositoryName}:*"
                    }
                }
            }
        ]
    }

For information on best practices for configuring a role for the GitHub OIDC provider, see Creating a role for web identity or OpenID Connect Federation (console).

Checkpoint

At this point, you’ve created and configured the following resources:

  • A GitHub repository that has been locally cloned and filled with a sample CloudFormation template.
  • An IAM identity provider endpoint for use in conjunction with GitHub’s OIDC provider.
  • A role that can be assumed by GitHub actions, and a set of associated permissions that allow the role to make requests to IAM Access Analyzer to validate policies.

Step 5: Create a definition for the GitHub Actions workflow

The workflow runs steps on hosted runners. For this example, we are going to use Ubuntu as the operating system for the hosted runners. The workflow runs the following steps on the runner:

  1. The workflow checks out the CloudFormation template by using the community actions/checkout action.
  2. The workflow then uses the aws-actions/configure-aws-credentials GitHub action to request a set of credentials through the IAM identity provider endpoint and the IAM role that you created earlier.
  3. The workflow installs the cfn-policy-validator tool by using the python package manager, PIP.
  4. The workflow runs a validation against the CloudFormation template by using the cfn-policy-validator tool.

The workflow is defined in a YAML document. In order for GitHub Actions to pick up the workflow, you need to place the definition file in a specific location within the repository: .github/workflows/main.yml. Note the “.” prefix in the directory name, indicating that this is a hidden directory.

To create the workflow:

  1. Use the following command to create the folder structure within the locally cloned repository:
    mkdir -p .github/workflows

  2. Create the sample workflow definition file in the .github/workflows directory. Make sure to replace <AWSAccountID> and <AWSRegion> with your own information.
    cat << EOF > .github/workflows/main.yml
    name: cfn-policy-validator-workflow
    
    on: push
    
    permissions:
      id-token: write
      contents: read
    
    jobs: 
      cfn-iam-policy-validation: 
        name: iam-policy-validation
        runs-on: ubuntu-latest
        steps:
          - name: Checkout code
            uses: actions/checkout@v3
    
          - name: Configure AWS Credentials
            uses: aws-actions/configure-aws-credentials@v2
            with:
              role-to-assume: arn:aws:iam::<AWSAccountID>:role/github-actions-access-analyzer-role
              aws-region: <AWSRegion>
              role-session-name: GitHubSessionName
            
          - name: Install cfn-policy-validator
            run: pip install cfn-policy-validator
    
          - name: Validate templates
            run: cfn-policy-validator validate --template-path ./sample-role-test.yaml --region <AWSRegion>
    EOF
    

Step 6: Test the setup

Now that everything has been set up and configured, it’s time to test.

To test the workflow and validate the IAM policy:

  1. Add and commit the changes to the local repository.
    git add .
    git commit -m ‘added sample cloudformation template and workflow definition’

  2. Push the local changes to the remote GitHub repository.
    git push

    After the changes are pushed to the remote repository, go back to https://github.com and open the repository that you created earlier. In the top-right corner of the repository window, there is a small orange indicator, as shown in Figure 2. This shows that your GitHub Actions workflow is running.

    Figure 2: GitHub repository window with the orange workflow indicator

    Figure 2: GitHub repository window with the orange workflow indicator

    Because the sample CloudFormation template used a wildcard value “*” in the principal element of the policy as described in the section Step 2: Clone the repository locally, the orange indicator turns to a red x (shown in Figure 3), which signals that something failed in the workflow.

    Figure 3: GitHub repository window with the red cross workflow indicator

    Figure 3: GitHub repository window with the red cross workflow indicator

  3. Choose the red x to see more information about the workflow’s status, as shown in Figure 4.
    Figure 4: Pop-up displayed after choosing the workflow indicator

    Figure 4: Pop-up displayed after choosing the workflow indicator

  4. Choose Details to review the workflow logs.

    In this example, the Validate templates step in the workflow has failed. A closer inspection shows that there is a blocking finding with the CloudFormation template. As shown in Figure 5, the finding is labelled as EXTERNAL_PRINCIPAL and has a description of Trust policy allows access from external principals.

    Figure 5: Details logs from the workflow showing the blocking finding

    Figure 5: Details logs from the workflow showing the blocking finding

    To remediate this blocking finding, you need to update the principal element of the trust policy to include a principal from your AWS account (considered a zone of trust). The resources and principals within your account comprises of the zone of trust for the cfn-policy-validator tool. In the initial version of sample-role.yaml, the IAM roles trust policy used a wildcard in the Principal element. This allowed principals outside of your control to assume the associated role, which caused the cfn-policy-validator tool to generate a blocking finding.

    In this case, the intent is that principals within the current AWS account (zone of trust) should be able to assume this role. To achieve this result, replace the wildcard value with the account principal by following the remaining steps.

  5. Open sample-role.yaml by using your preferred text editor, such as nano.
    nano sample-role.yaml

    Replace the wildcard value in the principal element with the account principal arn:aws:iam::<AccountID>:root. Make sure to replace <AWSAccountID> with your own AWS account ID.

    AWSTemplateFormatVersion: "2010-09-09"
    Description: Base stack to create a simple role
    Resources:
      SampleIamRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Statement:
              - Effect: Allow
                Principal:
                  AWS: "arn:aws:iam::<AccountID>:root"
                Action: ["sts:AssumeRole"]
          Path: /      
          Policies:
            - PolicyName: root
              PolicyDocument:
                Version: 2012-10-17
                Statement:
                  - Resource: "*"
                    Effect: Allow
                    Action:
                      - s3:GetObject

  6. Add the updated file, commit the changes, and push the updates to the remote GitHub repository.
    git add sample-role.yaml
    git commit -m ‘replacing wildcard principal with account principal’
    git push

After the changes have been pushed to the remote repository, go back to https://github.com and open the repository. The orange indicator in the top right of the window should change to a green tick (check mark), as shown in Figure 6.

Figure 6: GitHub repository window with the green tick workflow indicator

Figure 6: GitHub repository window with the green tick workflow indicator

This indicates that no blocking findings were identified, as shown in Figure 7.

Figure 7: Detailed logs from the workflow showing no more blocking findings

Figure 7: Detailed logs from the workflow showing no more blocking findings

Conclusion

In this post, I showed you how to automate IAM policy validation by using GitHub Actions and the IAM Policy Validator for CloudFormation. Although the example was a simple one, it demonstrates the benefits of automating security testing at the start of the development lifecycle. This is often referred to as shifting security left. Identifying misconfigurations early and automatically supports an iterative, fail-fast model of continuous development and testing. Ultimately, this enables teams to make security an inherent part of a system’s design and architecture and can speed up product development workflows.

In addition to the example I covered today, IAM Policy Validator for CloudFormation can validate IAM policies by using a range of IAM Access Analyzer policy checks. For more information about these policy checks, see Access Analyzer reference policy checks.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mitch Beaumont

Mitch Beaumont

Mitch is a Principal Solutions Architect for Amazon Web Services, based in Sydney, Australia. Mitch works with some of Australia’s largest financial services customers, helping them to continually raise the security bar for the products and features that they build and ship. Outside of work, Mitch enjoys spending time with his family, photography, and surfing.

Generate machine learning insights for Amazon Security Lake data using Amazon SageMaker

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/security/generate-machine-learning-insights-for-amazon-security-lake-data-using-amazon-sagemaker/

Amazon Security Lake automatically centralizes the collection of security-related logs and events from integrated AWS and third-party services. With the increasing amount of security data available, it can be challenging knowing what data to focus on and which tools to use. You can use native AWS services such as Amazon QuickSight, Amazon OpenSearch, and Amazon SageMaker Studio to visualize, analyze, and interactively identify different areas of interest to focus on, and prioritize efforts to increase your AWS security posture.

In this post, we go over how to generate machine learning insights for Security Lake using SageMaker Studio. SageMaker Studio is a web integrated development environment (IDE) for machine learning that provides tools for data scientists to prepare, build, train, and deploy machine learning models. With this solution, you can quickly deploy a base set of Python notebooks focusing on AWS Security Hub findings in Security Lake, which can also be expanded to incorporate other AWS sources or custom data sources in Security Lake. After you’ve run the notebooks, you can use the results to help you identify and focus on areas of interest related to security within your AWS environment. As a result, you might implement additional guardrails or create custom detectors to alert on suspicious activity.

Prerequisites

  1. Specify a delegated administrator account to manage the Security Lake configuration for all member accounts within your organization.
  2. Security Lake has been enabled in the delegated administrator AWS account.
  3. As part of the solution in this post, we focus on Security Hub as a data source. AWS Security Hub must be enabled for your AWS Organizations. When enabling Security Lake, select All log and event sources to include AWS Security Hub findings.
  4. Configure subscriber query access to Security Lake. Security Lake uses AWS Lake Formation cross-account table sharing to support subscriber query access. Accept the resource share request in the subscriber AWS account in AWS Resource Access Manager (AWS RAM). Subscribers with query access can query the data that Security Lake collects. These subscribers query Lake Formation tables in an Amazon Simple Storage Service (Amazon S3) bucket with Security Lake data using services such as Amazon Athena.

Solution overview

Figure 1 that follows depicts the architecture of the solution.

Figure 1 SageMaker machine learning insights architecture for Security Lake

Figure 1 SageMaker machine learning insights architecture for Security Lake

The deployment builds the architecture by completing the following steps:

  1. A Security Lake is set up in an AWS account with supported log sources — such as Amazon VPC Flow Logs, AWS Security Hub, AWS CloudTrail, and Amazon Route53 — configured.
  2. Subscriber query access is created from the Security Lake AWS account to a subscriber AWS account.

    Note: See Prerequisite #4 for more information.

  3. The AWS RAM resource share request must be accepted in the subscriber AWS account where this solution is deployed.

    Note: See Prerequisite #4 for more information.

  4. A resource link database in Lake Formation is created in the subscriber AWS account and grants access for the Athena tables in the Security Lake AWS account.
  5. VPC is provisioned for SageMaker with IGW, NAT GW, and VPC endpoints for the AWS services used in the solution. IGW and NAT are required to install external open-source packages.
  6. A SageMaker Domain for SageMaker Studio is created in VPCOnly mode with a single SageMaker user profile that is tied to a dedicated AWS Identity and Access Management (IAM) role.
  7. A dedicated IAM role is created to restrict access to create and access the presigned URL for the SageMaker Domain from a specific CIDR for accessing the SageMaker notebook.
  8. An AWS CodeCommit repository containing Python notebooks is used for the AI and ML workflow by the SageMaker user-profile.
  9. An Athena workgroup is created for the Security Lake queries with an S3 bucket for output location (access logging configured for the output bucket).

Deploy the solution

You can deploy the SageMaker solution by using either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

Option 1: Deploy the solution with AWS CloudFormation using the console

Use the console to sign in to your subscriber AWS account and then choose the Launch Stack button to open the AWS CloudFormation console pre-loaded with the template for this solution. It takes approximately 10 minutes for the CloudFormation stack to complete.

Select this image to open a link that starts building the CloudFormation stack

Option 2: Deploy the solution by using the AWS CDK

You can find the latest code for the SageMaker solution in the SageMaker machine learning insights GitHub repository, where you can also contribute to the sample code. For instructions and more information on using the AWS CDK, see Get Started with AWS CDK.

To deploy the solution by using the AWS CDK

  1. To build the app when navigating to the project’s root folder, use the following commands:
    npm install -g aws-cdk-lib
    npm install

  2. Update IAM_role_assumption_for_sagemaker_presigned_url and security_lake_aws_account default values in source/lib/sagemaker_domain.ts with their respective appropriate values.
  3. Run the following commands in your terminal while authenticated in your subscriber AWS account. Be sure to replace <INSERT_AWS_ACCOUNT> with your account number and replace <INSERT_REGION> with the AWS Region that you want the solution deployed to.
    cdk bootstrap aws://<INSERT_AWS_ACCOUNT>/<INSERT_REGION>
    cdk deploy

Post deployment steps

Now that you’ve deployed the SageMaker solution, you must grant the SageMaker user profile in the subscriber AWS account query access to your Security Lake. You can Grant permission for the SageMaker user profile to Security Lake in Lake Formation in the subscriber AWS account.

Grant permission to the Security Lake database

  1. Copy the SageMaker user-profile Amazon resource name (ARN) arn:aws:iam::<account-id>:role/sagemaker-user-profile-for-security-lake
  2. Go to Lake Formation in the console.
  3. Select the amazon_security_lake_glue_db_us_east_1 database.
  4. From the Actions Dropdown, select Grant.
  5. In Grant Data Permissions, select SAML Users and Groups.
  6. Paste the SageMaker user profile ARN from Step 1.
  7. In Database Permissions, select Describe and then Grant.

Grant permission to Security Lake – Security Hub table

  1. Copy the SageMaker user-profile ARN arn:aws:iam:<account-id>:role/sagemaker-user-profile-for-security-lake
  2. Go to Lake Formation in the console.
  3. Select the amazon_security_lake_glue_db_us_east_1 database.
  4. Choose View Tables.
  5. Select the amazon_security_lake_table_us_east_1_sh_findings_1_0 table.
  6. From Actions Dropdown, select Grant.
  7. In Grant Data Permissions, select SAML Users and Groups.
  8. Paste the SageMaker user-profile ARN from Step 1.
  9. In Table Permissions, select Describe and then Grant.

Launch your SageMaker Studio application

Now that you have granted permissions for a SageMaker user-profile, we can move on to launching the SageMaker application associated to that user-profile.

  1. Navigate to the SageMaker Studio domain in the console.
  2. Select the SageMaker domain security-lake-ml-insights-<account-id>.
  3. Select the SageMaker user profile sagemaker-user-profile-for-security-lake.
  4. Select the Launch drop-down and select Studio
    Figure 2 SageMaker domain user-profile AWS console screen

    Figure 2: SageMaker domain user-profile AWS console screen

Clone Python notebooks

You’ll work primarily in the SageMaker user profile to create a data-science app to work in. As part of the solution deployment, we’ve created Python notebooks in CodeCommit that you will need to clone.

To clone the Python notebooks

  1. Navigate to CloudFormation in the console.
  2. In the Stacks section, select the SageMakerDomainStack.
  3. Select to the Outputs tab/
  4. Copy the value for sagemakernotebookmlinsightsrepositoryURL. (For example: https://git-codecommit.us-east-1.amazonaws.com/v1/repos/sagemaker_ml_insights_repo)
  5. Go back to your SageMaker app.
  6. In Studio, in the left sidebar, choose the Git icon (identified by a diamond with two branches), then choose Clone a Repository.
    Figure 3 SageMaker clone CodeCommit repository

    Figure 3: SageMaker clone CodeCommit repository

  7. Paste the CodeCommit repository link from Step 4 under the Git repository URL (git). After you paste the URL, select Clone “https://git-codecommit.us-east-1.amazonaws.com/v1/repos/sagemaker_ml_insights_repo”, then select Clone.

    NOTE: If you don’t select from the auto-populated drop-down, SageMaker won’t be able to clone the repository.

    Figure 4 SageMaker clone CodeCommit URL

    Figure 4: SageMaker clone CodeCommit URL

Generating machine learning insights using SageMaker Studio

You’ve successfully pulled the base set of Python notebooks into your SageMaker app and they can be accessed at sagemaker_ml_insights_repo/notebooks/tsat/. The notebooks provide you with a starting point for running machine learning analysis using Security Lake data. These notebooks can be expanded to existing native or custom data sources being sent to Security Lake.

Figure 5: SageMaker cloned Python notebooks

Figure 5: SageMaker cloned Python notebooks

Notebook #1 – Environment setup

The 0.0-tsat-environ-setup notebook handles the installation of the required libraries and dependencies needed for the subsequent notebooks within this blog. For our notebooks, we use an open-source Python library called Kats, which is a lightweight, generalizable framework to perform time series analysis.

  1. Select the 0.0-tsat-environ-setup.ipynb notebook for the environment setup.

    Note: If you have already provisioned a kernel, you can skip steps 2 and 3.

  2. In the right-hand corner, select No Kernel
  3. In the Set up notebook environment pop-up, leave the defaults and choose Select.
    Figure 6 SageMaker application environment settings

    Figure 6: SageMaker application environment settings

  4. After the kernel has successfully started, choose the Terminal icon to open the image terminal.
    Figure 7: SageMaker application terminal

    Figure 7: SageMaker application terminal

  5. To install open-source packages from https instead of http, you must update the sources.list file. After the terminal opens, send the following commands:
    cd /etc/apt
    sed -i 's/http:/https:/g' sources.list

  6. Go back to the 0.0-tsat-environ-setup.ipynb notebook and select the Run drop-down and select Run All Cells. Alternatively, you can run each cell independently, but it’s not required. Grab a coffee! This step will take about 10 minutes.

    IMPORTANT: If you complete the installation out of order or update the requirements.txt file, you might not be able to successfully install Kats and you will need to rebuild your environment by using a net-new SageMaker user profile.

  7. After installing all the prerequisites, check the Kats version to determine if it was successfully installed.
    Figure 8: Kats installation verification

    Figure 8: Kats installation verification

  8. Install PyAthena (Python DB API client for Amazon Athena) which is used to query your data in Security Lake.

You’ve successfully set up the SageMaker app environment! You can now load the appropriate dataset and create a time series.

Notebook #2 – Load data

The 0.1-load-data notebook establishes the Athena connection to query data in Security Lake and creates the resulting time series dataset. The time series dataset will be used for subsequent notebooks to identify trends, outliers, and change points.

  1. Select the 0.1-load-data.ipynb notebook.
  2. If you deployed the solution outside of us-east-1, update the con details to the appropriate Region. In this example, we’re focusing on Security Hub data within Security Lake. If you want to change the underlying data source, you can update the TABLE value.
    Figure 9: SageMaker notebook load Security Lake data settings

    Figure 9: SageMaker notebook load Security Lake data settings

  3. In the Query section, there’s an Athena query to pull specific data from Security Hub, this can be expanded as needed to a subset or can include all products within Security Hub. The query below pulls Security Hub information after 01:00:00 1/1/2022 from the products listed in productname.
    Figure 10: SageMaker notebook Athena query

    Figure 10: SageMaker notebook Athena query

  4. After the values have been updated, you can create your time series dataset. For this notebook, we recommend running each cell individually instead of running all cells at once so you can get a bit more familiar with the process. Select the first cell and choose the Run icon.
    Figure 11: SageMaker run Python notebook code

    Figure 11: SageMaker run Python notebook code

  5. Follow the same process as Step 4 for the subsequent cells.

    Note: If you encounter any issues with querying the table, make sure you completed the post-deployment step for Grant permission to Security Lake – Security Hub table.

You’ve successfully loaded your data and created a timeseries! You can now move on to generating machine learning insights from your timeseries.

Notebook #3 – Trend detector

The 1.1-trend-detector.ipynb notebook handles trend detection in your data. Trend represents a directional change in the level of a time series. This directional change can be either upward (increase in level) or downward (decrease in level). Trend detection helps detect a change while ignoring the noise from natural variability. Each environment is different, and trends help us identify where to look more closely to determine why a trend is positive or negative.

  1. Select 1.1-trend-detector.ipynb notebook for trend detection.
  2. Slopes are created to identify the relationship between x (time) and y (counts).
    Figure 12: SageMaker notebook slope view

    Figure 12: SageMaker notebook slope view

  3. If the counts are increasing with time, then it’s considered a positive slope and the reverse is considered a negative slope. A positive slope isn’t necessarily a good thing because in an ideal state we would expect counts of a finding type to come down with time.
    Figure 13: SageMaker notebook trend view

    Figure 13: SageMaker notebook trend view

  4. Now you can plot the top five positive and negative trends to identify the top movers.
    Figure 14: SageMaker notebook trend results view

    Figure 14: SageMaker notebook trend results view

Notebook #4 – Outlier detection

The 1.2-outlier-detection.ipynb notebook handles outlier detection. This notebook does a seasonal decomposition of the input time series, with additive or multiplicative decomposition as specified (default is additive). It uses a residual time series by either removing only trend or both trend and seasonality if the seasonality is strong. The intent is to discover useful, abnormal, and irregular patterns within data sets, allowing you to pinpoint areas of interest.

  1. To start, it detects points in the residual that are over 5 times the inter-quartile range.
  2. Inter-quartile range (IQR) is the difference between the seventy-fifth and twenty-fifth percentiles of residuals or the spread of data within the middle two quartiles of the entire dataset. IQR is useful in detecting the presence of outliers by looking at values that might lie outside of the middle two quartiles.
  3. The IQR multiplier controls the sensitivity of the range and decision of identifying outliers. By using a larger value for the iqr_mult_thresh parameter in OutlierDetector, outliers would be considered data points, while a smaller value would identify data points as outliers.

    Note: If you don’t have enough data, decrease iqr_mult_thresh to a lower value (for example iqr_mult_thresh=3).

    Figure 15: SageMaker notebook outlier setting

    Figure 15: SageMaker notebook outlier setting

  4. Along with outlier detection plots, investigation SQL will be displayed as well, which can help with further investigation of the outliers.

    In the diagram that follows, you can see that there are several outliers in the number of findings, related to failed AWS Firewall Manager policies, which have been identified by the vertical red lines within the line graph. These are outliers because they deviate from the normal behavior and number of findings on a day-to-day basis. When you see outliers, you can look at the resources that might have caused an unusual increase in Firewall Manager policy findings. Depending on the findings, it could be related to an overly permissive or noncompliant security group or a misconfigured AWS WAF rule group.

    Figure 16: SageMaker notebook outlier results view

    Figure 16: SageMaker notebook outlier results view

Notebook #5 – Change point detection

The 1.3-changepoint-detector.ipynb notebook handles the change point detection. Change point detection is a method to detect changes in a time series that persist over time, such as a change in the mean value. To detect a baseline to identify when several changes might have occurred from that point. Change points occur when there’s an increase or decrease to the average number of findings within a data set.

  1. Along with identifying change points within the data set, the investigation SQL is generated to further investigate the specific change point if applicable.

    In the following diagram, you can see there’s a change point decrease after July 27, 2022, with confidence of 99.9 percent. It’s important to note that change points differ from outliers, which are sudden changes in the data set observed. This diagram means there was some change in the environment that resulted in an overall decrease in the number of findings for S3 buckets with block public access being disabled. The change could be the result of an update to the CI/CD pipelines provisioning S3 buckets or automation to enable all S3 buckets to block public access. Conversely, if you saw a change point that resulted in an increase, it could mean that there was a change that resulted in a larger number of S3 buckets with a block public access configuration consistently being disabled.

    Figure 17: SageMaker changepoint detector view

    Figure 17: SageMaker changepoint detector view

By now, you should be familiar with the set up and deployment for SageMaker Studio and how you can use Python notebooks to generate machine learning insights for your Security Lake data. You can take what you’ve learned and start to curate specific datasets and data sources within Security Lake, create a time series, detect trends, and identify outliers and change points. By doing so, you can answer a variety of security-related questions such as:

  • CloudTrail

    Is there a large volume of Amazon S3 download or copy commands to an external resource? Are you seeing a large volume of S3 delete object commands? Is it possible there’s a ransomware event going on?

  • VPC Flow Logs

    Is there an increase in the number of requests from your VPC to external IPs? Is there an increase in the number of requests from your VPC to your on-premises CIDR? Is there a possibility of internal or external data exfiltration occurring?

  • Route53

    Which resources are making DNS requests that they haven’t typically made within the last 30–45 days? When did it start? Is there a potential command and control session occurring on an Amazon Elastic Compute Cloud (Amazon EC2) instance?

It’s important to note that this isn’t a solution to replace Amazon GuardDuty, which uses foundational data sources to detect communication with known malicious domains and IP addresses and identify anomalous behavior, or Amazon Detective, which provides customers with prebuilt data aggregations, summaries, and visualizations to help security teams conduct faster and more effective investigations. One of the main benefits of using Security Lake and SageMaker Studio is the ability to interactively create and tailor machine learning insights specific to your AWS environment and workloads.

Clean up

If you deployed the SageMaker machine learning insights solution by using the Launch Stack button in the AWS Management Console or the CloudFormation template sagemaker_ml_insights_cfn, do the following to clean up:

  1. In the CloudFormation console for the account and Region where you deployed the solution, choose the SageMakerML stack.
  2. Choose the option to Delete the stack.

If you deployed the solution by using the AWS CDK, run the command cdk destroy.

Conclusion

Amazon Security Lake gives you the ability to normalize and centrally store your security data from various log sources to help you analyze, visualize, and correlate appropriate security logs. You can then use this data to increase your overall security posture by implementing additional security guardrails or take appropriate remediation actions within your AWS environment.

In this blog post, you learned how you can use SageMaker to generate machine learning insights for your Security Hub findings in Security Lake. Although the example solution focuses on a single data source within Security Lake, you can expand the notebooks to incorporate other native or custom data sources in Security Lake.

There are many different use-cases for Security Lake that can be tailored to fit your AWS environment. Take a look at this blog post to learn how you can ingest, transform and deliver Security Lake data to Amazon OpenSearch to help your security operations team quickly analyze security data within your AWS environment. In supported Regions, new Security Lake account holders can try the service free for 15 days and gain access to its features.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Jonathan Nguyen

Jonathan Nguyen

Jonathan is a Principal Security Architect at AWS. His background is in AWS security, with a focus on threat detection and incident response. He helps enterprise customers develop a comprehensive AWS security strategy, deploy security solutions at scale, and train customers on AWS security best practices.

Madhunika Reddy Mikkili

Madhunika Reddy Mikkili

Madhunika is a Data and Machine Learning Engineer with the AWS Professional Services Shared Delivery Team. She is passionate about helping customers achieve their goals through the use of data and machine learning insights. Outside of work, she loves traveling and spending time with family and friends.

Using and Managing Security Groups on AWS Snowball Edge devices

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/using-and-managing-security-groups-on-aws-snowball-edge-devices/

This blog post is written by Jared Novotny & Tareq Rajabi, Specialist Hybrid Edge Solution Architects. 

The AWS Snow family of products are purpose-built devices that allow petabyte-scale movement of data from on-premises locations to AWS Regions. Snow devices also enable customers to run Amazon Elastic Compute Cloud (Amazon EC2) instances with Amazon Elastic Block Storage (Amazon EBS), and Amazon Simple Storage Service (Amazon S3) in edge locations.

Security groups are used to protect EC2 instances by controlling ingress and egress traffic. Once a security group is created and associated with an instance, customers can add ingress and egress rules to control data flow. Just like the default VPC in a region, there is a default security group on Snow devices. A default security group is applied when an instance is launched and no other security group is specified.  This default security group in a region allows all inbound traffic from network interfaces and instances that are assigned to the same security group, and allows and all outbound traffic. On Snowball Edge, the default security group allows all inbound and outbound traffic.

In this post, we will review the tools and commands required to create, manage and use security groups on the Snowball Edge device.

Some things to keep in mind:

  1. AWS Snowball Edge is limited to 50 security groups.
  2. An instance will only have one security group, but each group can have a total of 120 rules. This is comprised of 60 inbound and 60 outbound rules.
  3. Security groups can only have allow statements to allow network traffic.
  4. Deny statements aren’t allowed.
  5. Some commands in the Snowball Edge client (AWS CLI) don’t provide an output.
  6. AWS CLI commands can use the name or the security group ID.

Prerequisites and tools

Customers must place an order for Snowball Edge from their AWS Console to be able to run the following AWS CLI commands and configure security groups to protect their EC2 instances.

The AWS Snowball Edge client is a standalone terminal application that customers can run on their local servers and workstations to manage and operate their Snowball Edge devices. It supports Windows, Mac, and Linux systems.

AWS OpsHub is a graphical user interface that you can use to manage your AWS Snowball devices. Furthermore, it’s the easiest tool to use to unlock Snowball Edge devices. It can also be used to configure the device, launch instances, manage storage, and provide monitoring.

Customers can download and install the Snowball Edge client and AWS OpsHub from AWS Snowball resources.

Getting Started

To get started, when a Snow device arrives at a customer site, the customer must unlock the device and launch an EC2 instance. This can be done via AWS OpsHub or the AWS Snowball Edge Client. AWS Snow Family of devices support both Virtual Network Interfaces (VNI) and Direct Network interfaces (DNI), customers should review the types of interfaces before deciding which one is best for their use case. Note that security groups are only supported with VNIs, so that is what was used in this post. A post explaining how to use these interfaces should be reviewed before proceeding.

Viewing security group information

Once the AWS Snowball Edge is unlocked, configured, and has an EC2 instance running, we can dig deeper into using security groups to act as a virtual firewall and control incoming and outgoing traffic.

Although the AWS OpsHub tool provides various functionalities for compute and storage operations, it can only be used to view the name of the security group associated to an instance in a Snowball Edge device:

view the name of the security group associated to an instance in a Snowball Edge device

Every other interaction with security groups must be through the AWS CLI.

The following command shows how to easily read the outputs describing the protocols, sources, and destinations. This particular command will show information about the default security group, which allows all inbound and outbound traffic on EC2 instances running on the Snowball Edge.

In the following sections we review the most common commands with examples and outputs.

View (all) existing security groups:

aws ec2 describe-security-groups --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge
{
    "SecurityGroups": [
        {
            "Description": "default security group",
            "GroupName": "default",
            "IpPermissions": [
                {
                    "IpProtocol": "-1",
                    "IpRanges": [
                        {
                            "CidrIp": "0.0.0.0/0"
                        }
                    ]
                }
            ],
            "GroupId": "s.sg-8ec664a23666db719",
            "IpPermissionsEgress": [
                {
                    "IpProtocol": "-1",
                    "IpRanges": [
                        {
                            "CidrIp": "0.0.0.0/0"
                        }
                    ]
                }
            ]
        }
    ]
}

Create new security group:

aws ec2 create-security-group --group-name allow-ssh--description "allow only ssh inbound" --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

The output returns a GroupId:

{  "GroupId": "s.sg-8f25ee27cee870b4a" }

Add port 22 ingress to security group:

aws ec2 authorize-security-group-ingress --group-ids.sg-8f25ee27cee870b4a --protocol tcp --port 22 --cidr 10.100.10.0/24 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{    "Return": true }

Note that if you’re using the default security group, then the outbound rule is still to allow all traffic.

Revoke port 22 ingress rule from security group

aws ec2 revoke-security-group-ingress --group-ids.sg-8f25ee27cee870b4a --ip-permissions IpProtocol=tcp,FromPort=22,ToPort=22, IpRanges=[{CidrIp=10.100.10.0/24}] --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{ "Return": true }

Revoke default egress rule:

aws ec2 revoke-security-group-egress --group-ids.sg-8f25ee27cee870b4a  --ip-permissions IpProtocol="-1",IpRanges=[{CidrIp=0.0.0.0/0}] --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{ "Return": true }

Note that this rule will remove all outbound ephemeral ports.

Add default outbound rule (revoked above):

aws ec2 authorize-security-group-egress --group-id s.sg-8f25ee27cee870b4a --ip-permissions IpProtocol="-1", IpRanges=[{CidrIp=0.0.0.0/0}] --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{  "Return": true }

Changing an instance’s existing security group:

aws ec2 modify-instance-attribute --instance-id s.i-852971d05144e1d63 --groups s.sg-8f25ee27cee870b4a --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

Note that this command produces no output. We can verify that it worked with the “aws ec2 describe-instances” command. See the example as follows (command output simplified):

aws ec2 describe-instances --instance-id s.i-852971d05144e1d63 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge


    "Reservations": [{
            "Instances": [{
                    "InstanceId": "s.i-852971d05144e1d63",
                    "InstanceType": "sbe-c.2xlarge",
                    "LaunchTime": "2022-06-27T14:58:30.167000+00:00",
                    "PrivateIpAddress": "34.223.14.193",
                    "PublicIpAddress": "10.100.10.60",
                    "SecurityGroups": [
                        {
                            "GroupName": "allow-ssh",
                            "GroupId": "s.sg-8f25ee27cee870b4a"
                        }      ], }  ] }

Changing and instance’s security group back to default:

aws ec2 modify-instance-attribute --instance-ids.i-852971d05144e1d63 --groups s.sg-8ec664a23666db719 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

Note that this command produces no output. You can verify that it worked with the “aws ec2 describe-instances” command. See the example as follows:

aws ec2 describe-instances –instance-ids.i-852971d05144e1d63 –endpoint Https://MySnowIPAddress:8008 –profile SnowballEdge

{
    "Reservations": [
        {  "Instances": [ {
                    "AmiLaunchIndex": 0,
                    "ImageId": "s.ami-8b0223704ca8f08b2",
                    "InstanceId": "s.i-852971d05144e1d63",
                    "InstanceType": "sbe-c.2xlarge",
                    "LaunchTime": "2022-06-27T14:58:30.167000+00:00",
                    "PrivateIpAddress": "34.223.14.193",
                    "PublicIpAddress": "10.100.10.60",
                             "SecurityGroups": [
                        {
                            "GroupName": "default",
                            "GroupId": "s.sg-8ec664a23666db719" ] }

Delete security group:

aws ec2 delete-security-group --group-ids.sg-8f25ee27cee870b4a --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

Sample walkthrough to add a SSH Security Group

As an example, assume a single EC2 instance “A” running on a Snowball Edge device. By default, all traffic is allowed to EC2 instance “A”. As per the following diagram, we want to tighten security and allow only the management PC to SSH to the instance.

1. Create an SSH security group:

aws ec2 create-security-group --group-name MySshGroup--description “ssh access” --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

2. This will return a “GroupId” as an output:

{   "GroupId": "s.sg-8a420242d86dbbb89" }

3. After the creation of the security group, we must allow port 22 ingress from the management PC’s IP:

aws ec2 authorize-security-group-ingress --group-name MySshGroup -- protocol tcp --port 22 -- cidr 192.168.26.193/32 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

4. Verify that the security group has been created:

aws ec2 describe-security-groups ––group-name MySshGroup –endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

{
	“SecurityGroups”:   [
		{
			“Description”: “SG for web servers”,
			“GroupName”: :MySshGroup”,
			“IpPermissinos”:  [
				{ “FromPort”: 22,
			 “IpProtocol”: “tcp”,
			 “IpRanges”: [
			{
				“CidrIp”: “192.168.26.193.32/32”
						} ],
					“ToPort”:  22 }],}

5. After the security group has been created, we must associate it with the instance:

aws ec2 modify-instance-attribute –-instance-id s.i-8f7ab16867ffe23d4 –-groups s.sg-8a420242d86dbbb89 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

6. Optionally, we can delete the Security Group after it is no longer required:

aws ec2 delete-security-group --group-id s.sg-8a420242d86dbbb89 --endpoint Http://MySnowIPAddress:8008 --profile SnowballEdge

Note that for the above association, the instance ID is an output of the “aws ec2 describe-instances” command, while the security group ID is an output of the “describe-security-groups” command (or the “GroupId” returned by the console in Step 2 above).

Conclusion

This post addressed the most common commands used to create and manage security groups with the AWS Snowball Edge device. We explored the prerequisites, tools, and commands used to view, create, and modify security groups to ensure the EC2 instances deployed on AWS Snowball Edge are restricted to authorized users. We concluded with a simple walkthrough of how to restrict access to an EC2 instance over SSH from a single IP address. If you would like to learn more about the Snowball Edge product, there are several resources available on the AWS Snow Family site.

Try semantic search with the Amazon OpenSearch Service vector engine

Post Syndicated from Stavros Macrakis original https://aws.amazon.com/blogs/big-data/try-semantic-search-with-the-amazon-opensearch-service-vector-engine/

Amazon OpenSearch Service has long supported both lexical and vector search, since the introduction of its kNN plugin in 2020. With recent developments in generative AI, including AWS’s launch of Amazon Bedrock earlier in 2023, you can now use Amazon Bedrock-hosted models in conjunction with the vector database capabilities of OpenSearch Service, allowing you to implement semantic search, retrieval augmented generation (RAG), recommendation engines, and rich media search based on high-quality vector search. The recent launch of the vector engine for Amazon OpenSearch Serverless makes it even easier to deploy such solutions.

OpenSearch Service supports a variety of search and relevance ranking techniques. Lexical search looks for words in the documents that appear in the queries. Semantic search, supported by vector embeddings, embeds documents and queries into a semantic high-dimension vector space where texts with related meanings are nearby in the vector space and therefore semantically similar, so that it returns similar items even if they don’t share any words with the query.

We’ve put together two demos on the public OpenSearch Playground to show you the strengths and weaknesses of the different techniques: one comparing textual vector search to lexical search, the other comparing cross-modal textual and image search to textual vector search. With OpenSearch’s Search Comparison Tool, you can compare the different approaches. For the demo, we’re using the Amazon Titan foundation model hosted on Amazon Bedrock for embeddings, with no fine tuning. The dataset consists of a selection of Amazon clothing, jewelry, and outdoor products.

Background

A search engine is a special kind of database, allowing you to store documents and data and then run queries to retrieve the most relevant ones. End-user search queries usually consist of text entered in a search box. Two important techniques for using that text are lexical search and semantic search. In lexical search, the search engine compares the words in the search query to the words in the documents, matching word for word. Only items that have all or most of the words the user typed match the query. In semantic search, the search engine uses a machine learning (ML) model to encode text from the source documents as a dense vector in a high-dimensional vector space; this is also called embedding the text into the vector space. It similarly codes the query as a vector and then uses a distance metric to find nearby vectors in the multi-dimensional space. The algorithm for finding nearby vectors is called kNN (k Nearest Neighbors). Semantic search does not match individual query terms—it finds documents whose vector embedding is near the query’s embedding in the vector space and therefore semantically similar to the query, so the user can retrieve items that don’t have any of the words that were in the query, even though the items are highly relevant.

Textual vector search

The demo of textual vector search shows how vector embeddings can capture the context of your query beyond just the words that compose it.

In the text box at the top, enter the query tennis clothes. On the left (Query 1), there’s an OpenSearch DSL (Domain Specific Language for queries) semantic query using the amazon_products_text_embedding index, and on the right (Query 2), there’s a simple lexical query using the amazon_products_text index. You’ll see that lexical search doesn’t know that clothes can be tops, shorts, dresses, and so on, but semantic search does.

Search Comparison Tool

Compare semantic and lexical results

Similarly, in a search for warm-weather hat, the semantic results find lots of hats suitable for warm weather, whereas the lexical search returns results mentioning the words “warm” and “hat,” all of which are warm hats suitable for cold weather, not warm-weather hats. Similarly, if you’re looking for long dresses with long sleeves, you might search for long long-sleeved dress. A lexical search ends up finding some short dresses with long sleeves and even a child’s dress shirt because the word “dress” appears in the description, whereas the semantic search finds much more relevant results: mostly long dresses with long sleeves, with a couple of errors.

Cross-modal image search

The demo of cross-modal textual and image search shows searching for images using textual descriptions. This works by finding images that are related to your textual descriptions using a pre-production multi-modal embedding. We’ll compare searching for visual similarity (on the left) and textual similarity (on the right). In some cases, we get very similar results.

Search Comparison Tool

Compare image and textual embeddings

For example, sailboat shoes does a good job with both approaches, but white sailboat shoes does much better using visual similarity. The query canoe finds mostly canoes using visual similarity—which is probably what a user would expect—but a mixture of canoes and canoe accessories such as paddles using textual similarity.

If you are interested in exploring the multi-modal model, please reach out to your AWS specialist.

Building production-quality search experiences with semantic search

These demos give you an idea of the capabilities of vector-based semantic vs. word-based lexical search and what can be accomplished by utilizing the vector engine for OpenSearch Serverless to build your search experiences. Of course, production-quality search experiences use many more techniques to improve results. In particular, our experimentation shows that hybrid search, combining lexical and vector approaches, typically results in a 15% improvement in search result quality over lexical or vector search alone on industry-standard test sets, as measured by the NDCG@10 metric (Normalized Discounted Cumulative Gain in the first 10 results). The improvement is because lexical outperforms vector for very specific names of things, and semantic works better for broader queries. For example, in the semantic vs. lexical comparison, the query saranac 146, a brand of canoe, works very well in lexical search, whereas semantic search doesn’t return relevant results. This demonstrates why the combination of semantic and lexical search provides superior results.

Conclusion

OpenSearch Service includes a vector engine that supports semantic search as well as classic lexical search. The examples shown in the demo pages show the strengths and weaknesses of different techniques. You can use the Search Comparison Tool on your own data in OpenSearch 2.9 or higher.

Further information

For further information about OpenSearch’s semantic search capabilities, see the following:


About the author

Stavros Macrakis is a Senior Technical Product Manager on the OpenSearch project of Amazon Web Services. He is passionate about giving customers the tools to improve the quality of their search results.

How AWS built the Security Guardians program, a mechanism to distribute security ownership

Post Syndicated from Ana Malhotra original https://aws.amazon.com/blogs/security/how-aws-built-the-security-guardians-program-a-mechanism-to-distribute-security-ownership/

Product security teams play a critical role to help ensure that new services, products, and features are built and shipped securely to customers. However, since security teams are in the product launch path, they can form a bottleneck if organizations struggle to scale their security teams to support their growing product development teams. In this post, we will share how Amazon Web Services (AWS) developed a mechanism to scale security processes and expertise by distributing security ownership between security teams and development teams. This mechanism has many names in the industry — Security Champions, Security Advocates, and others — and it’s often part of a shift-left approach to security. At AWS, we call this mechanism Security Guardians.

In many organizations, there are fewer security professionals than product developers. Our experience is that it takes much more time to hire a security professional than other technical job roles, and research conducted by (ISC)2 shows that the cybersecurity industry is short 3.4 million workers. When product development teams continue to grow at a faster rate than security teams, the disparity between security professionals and product developers continues to increase as well. Although most businesses understand the importance of security, frustration and tensions can arise when it becomes a bottleneck for the business and its ability to serve customers.

At AWS, we require the teams that build products to undergo an independent security review with an AWS application security engineer before launching. This is a mechanism to verify that new services, features, solutions, vendor applications, and hardware meet our high security bar. This intensive process impacts how quickly product teams can ship to customers. As shown in Figure 1, we found that as the product teams scaled, so did the problem: there were more products being built than the security teams could review and approve for launch. Because security reviews are required and non-negotiable, this could potentially lead to delays in the shipping of products and features.

Figure 1: More products are being developed than can be reviewed and shipped

Figure 1: More products are being developed than can be reviewed and shipped

How AWS builds a culture of security

Because of its size and scale, many customers look to AWS to understand how we scale our own security teams. To tell our story and provide insight, let’s take a look at the culture of security at AWS.

Security is a business priority

At AWS, security is a business priority. Business leaders prioritize building products and services that are designed to be secure, and they consider security to be an enabler of the business rather than an obstacle.

Leaders also strive to create a safe environment by encouraging employees to identify and escalate potential security issues. Escalation is the process of making sure that the right people know about the problem at the right time. Escalation encompasses “Dive Deep”, which is one of our corporate values at Amazon, because it requires owners and leaders to dive into the details of the issue. If you don’t know the details, you can’t make good decisions about what’s going on and how to run your business effectively.

This aspect of the culture goes beyond intention — it’s embedded in our organizational structure:

CISOs and IT leaders play a key role in demystifying what security and compliance represent for the business. At AWS, we made an intentional choice for the security team to report directly to the CEO. The goal was to build security into the structural fabric of how AWS makes decisions, and every week our security team spends time with AWS leadership to ensure we’re making the right choices on tactical and strategic security issues.

– Stephen Schmidt, Chief Security Officer, Amazon, on Building a Culture of Security

Everyone owns security

Because our leadership supports security, it’s understood within AWS that security is everyone’s job. Security teams and product development teams work together to help ensure that products are built and shipped securely. Despite this collaboration, the product teams own the security of their product. They are responsible for making sure that security controls are built into the product and that customers have the tools they need to use the product securely.

On the other hand, central security teams are responsible for helping developers to build securely and verifying that security requirements are met before launch. They provide guidance to help developers understand what security controls to build, provide tools to make it simpler for developers to implement and test controls, provide support in threat modeling activities, use mechanisms to help ensure that customers’ security expectations are met before launch, and so on.

This responsibility model highlights how security ownership is distributed between the security and product development teams. At AWS, we learned that without this distribution, security doesn’t scale. Regardless of the number of security experts we hire, product teams always grow faster. Although the culture around security and the need to distribute ownership is now well understood, without the right mechanisms in place, this model would have collapsed.

Mechanisms compared to good intentions

Mechanisms are the final pillar of AWS culture that has allowed us to successfully distribute security across our organization. A mechanism is a complete process, or virtuous cycle, that reinforces and improves itself as it operates. As shown in Figure 2, a mechanism takes controllable inputs and transforms them into ongoing outputs to address a recurring business challenge. At AWS, the business challenge that we’re facing is that security teams create bottlenecks for the business. The culture of security at AWS provides support to help address this challenge, but we needed a mechanism to actually do it.

Figure 2: AWS sees mechanisms as a complete process, or virtuous cycle

Figure 2: AWS sees mechanisms as a complete process, or virtuous cycle

“Often, when we find a recurring problem, something that happens over and over again, we pull the team together, ask them to try harder, do better – essentially, we ask for good intentions. This rarely works… When you are asking for good intentions, you are not asking for a change… because people already had good intentions. But if good intentions don’t work, what does? Mechanisms work.

 – Jeff Bezos, February 1, 2008 All Hands.

At AWS, we’ve learned that we can help solve the challenge of scaling security by distributing security ownership with a mechanism we call the Security Guardians program. Like other mechanisms, it has inputs and outputs, and transforms over time.

AWS distributes security ownership with the Security Guardians program

At AWS, the Security Guardians program trains, develops, and empowers developers to be security ambassadors, or Guardians, within the product teams. At a high level, Guardians make sure that security considerations for a product are made earlier and more often, helping their peers build and ship their product faster. They also work closely with the central security team to help ensure that the security bar at AWS is rising and the Security Guardians program is improving over time. As shown in Figure 3, embedding security expertise within the product teams helps products with Guardian involvement move through security review faster.

Figure 3: Security expertise is embedded in the product teams by Guardians

Figure 3: Security expertise is embedded in the product teams by Guardians

Guardians are informed, security-minded product builders who volunteer to be consistent champions of security on their teams and are deeply familiar with the security processes and tools. They provide security guidance throughout the development lifecycle and are stakeholders in the security of the products being shipped, helping their teams make informed decisions that lead to more secure, on-time launches. Guardians are the security points-of-contact for their product teams.

In this distributed security ownership model, accountability for product security sits with the product development teams. However, the Guardians are responsible for performing the first evaluation of a development team’s security review submission. They confirm the quality and completeness of the new service’s resources, design documents, threat model, automated findings, and penetration test readiness. The development teams, supported by the Guardian, submit their security review to AWS Application Security (AppSec) engineers for the final pre-launch review.

In practice, as part of this development journey, Guardians help ensure that security considerations are made early, when teams are assessing customer requests and the feature or product design. This can be done by starting the threat modeling processes. Next, they work to make sure that mitigations identified during threat modeling are developed. Guardians also play an active role in software testing, including security scans such as static application security testing (SAST) and dynamic application security testing (DAST). To close out the security review, security engineers work with Guardians to make sure that findings are resolved and the product is ready to ship.

Figure 4: Expedited security review process supported by Guardians

Figure 4: Expedited security review process supported by Guardians

Guardians are, after all, Amazonians. Therefore, Guardians exemplify a number of the Amazon Leadership Principles and often have the following characteristics:

  • They are exemplary practitioners for security ownership and empower their teams to own the security of their service.
  • They hold a high security bar and exercise strong security judgement, don’t accept quick or easy answers, and drive continuous improvement.
  • They advocate for security needs in internal discussions with the product team.
  • They are thoughtful yet assertive to make customer security a top priority on their team.
  • They maintain and showcase their security knowledge to their peers, continuously building knowledge from many different sources to gain perspective and to stay up to date on the constantly evolving threat landscape.
  • They aren’t afraid to have their work independently validated by the central security team.

Expected outcomes

AWS has benefited greatly from the Security Guardians program. We’ve had 22.5 percent fewer medium and high severity security findings generated during the security review process and have taken about 26.9 percent less time to review a new service or feature. This data demonstrates that with Guardians involved we’re identifying fewer issues late in the process, reducing remediation work, and as a result securely shipping services faster for our customers. To help both builders and Guardians improve over time, our security review tool captures feedback from security engineers on their inputs. This helps ensure that our security ownership mechanism reinforces and improves itself over time.

AWS and other organizations have benefited from this mechanism because it generates specialized security resources and distributes security knowledge that scales without needing to hire additional staff.

A program such as this could help your business build and ship faster, as it has for AWS, while maintaining an appropriately high security bar that rises over time. By training builders to be security practitioners and advocates within your development cycle, you can increase the chances of identifying risks and security findings earlier. These findings, earlier in the development lifecycle, can reduce the likelihood of having to patch security bugs or even start over after the product has already been built. We also believe that a consistent security experience for your product teams is an important aspect of successfully distributing your security ownership. An experience with less confusion and friction will help build trust between the product and security teams.

To learn more about building positive security culture for your business, watch this spotlight interview with Stephen Schmidt, Chief Security Officer, Amazon.

If you’re an AWS customer and want to learn more about how AWS built the Security Guardians program, reach out to your local AWS solutions architect or account manager for more information.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ana Malhotra

Ana Malhotra

Ana is a Security Specialist Solutions Architect and the Healthcare and Life Sciences (HCLS) Security Lead for AWS Industry, based in Seattle, Washington. As a former AWS Application Security Engineer, Ana loves talking all things AppSec, including people, process, and technology. In her free time, she enjoys tapping into her creative side with music and dance.

Mitch Beaumont

Mitch Beaumont

Mitch is a Principal Solutions Architect for Amazon Web Services, based in Sydney, Australia. Mitch works with some of Australia’s largest financial services customers, helping them to continually raise the security bar for the products and features that they build and ship. Outside of work, Mitch enjoys spending time with his family, photography, and surfing.

Providing durable storage for AWS Outpost servers using AWS Snowcone

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/providing-durable-storage-for-aws-outpost-servers-using-aws-snowcone/

This blog post is written by Rob Goodwin, Specialist Solutions Architect, Secure Hybrid Edge. 

With the announcement of AWS Outposts servers, you now have a streamlined means to deploy AWS Cloud infrastructure to regional offices using the 1 rack unit (1U) or 2 rack unit (2U) Outposts servers where the 42U AWS Outposts rack wasn’t an economical or physical fit.

This post discusses how you can use AWS Snowcone to provide persistent storage for AWS Outposts servers in the case of Amazon Elastic Compute Cloud (Amazon EC2) instance termination or if the Outposts server fails. In this post, we show:

  1. How to leverage the built-in features of Snowcone to provide persistent storage to an EC2 instance.
  2. Optionally replicate the data back to an AWS Region with AWS DataSync. Replicating data back to an AWS Region with DataSync allows for a seamless way to copy data offsite to improve resiliency. Furthermore, it allows the ability to leverage regional AWS Services for machine learning (ML) training.

Background

Outposts servers ship with internal NVMe SSD instance storage. Just like in the Regions, instance storage is allocated directly to the EC2 instance and tied to the lifecycle of the instance. This means that if the EC2 instance is terminated, then the data associated with the instance is deleted. In the event you want data to persist after the instance is terminated, you must use operating system (OS) functions to save and back up to other media or save your data to an external network attached storage or file system.

Mounting an external file system to an EC2 instance is not a new concept in AWS. Using Amazon Elastic File System (Amazon EFS), you can mount the EFS file system to EC2 instance(s).

This architecture may look similar to the following diagram:

AWS VPC showing EC2 Instances mounting Amazon EFS in the Region

Figure 1: AWS VPC showing EC2 Instances mounting Amazon EFS in the Region

In this architecture, EC2 instances are using Amazon EFS for a shared file system.

A main use case for Outposts servers is to deploy applications closer to an end user for quicker response times. If we move our application to the Outposts server to improve the response time to the end user, then we could still use Amazon EFS as a shared file system. However, the latency to read the file system over the service link may affect application performance.

There are third-party network attached storage systems available that could work with Outposts servers. However, Snowcone provides the built-in service of DataSync to replicate data back to the Region and is ideal where physical space and power are limited.

By leveraging Snowcone, we can provide persistent and durable network attached storage external to the Outposts server along with a means to replicate data to and from an AWS Region. Snowcone is a small, rugged, and secure device offering edge computing, data storage, and data transfer.

Solution overview

In this solution, we combine multiple AWS services to provide a durable environment. We use Snowcone as our Network File System (NFS) mount point and leverage the built-in DataSync Agent to replicate the bucket on the Snowcone back to an Amazon Simple Storage Service (Amazon S3) bucket in-Region.

When EC2 instances are launched on the Outposts server, we map the NFS mount point from the Snowcone into the file system of a Linux host through the Outposts server’s Logical Network Interface (LNI). For a Windows system, using the NFS Client for Windows, we can map a drive letter to the NFS mount point as well. The following diagram illustrates this.

EC2 instances on Outposts server attaching to the NFS mount on Snowcone with DataSync replicating data back to Amazon S3 in the AWS Region

Figure 2: EC2 instances on Outposts server attaching to the NFS mount on Snowcone with DataSync replicating data back to Amazon S3 in the AWS Region

Prerequisites

To deploy this solution, you must:

  1. Have the Outposts server installed and authorized.
    1. The Outposts server must be fully capable of launching an EC2 instance and being able to communicate through the LNI to local network resources.
  2. Have an AWS Snowcone ordered, connected to the local network, and unlocked.
    1. To make sure that NFS is available, the job type must be either Import into Amazon S3 or Export from Amazon S3, as shown in the following figure.
    1. Figure 3: Screenshot of Job Type when ordering Snow devices
  3. Have a local client with AWS OpsHub installed.
    1. You can use an instance launched on the Outposts server to configure the Snowcone if:
      1. ·       The LNI is connected on the instance
      2. ·       The Snowcone is on the network

Steps to activate

  1. Configure NFS on the Snowcone manually.
    1. Either statically assign the IP address, or if you’re using DHCP, create an IP reservation to make sure that the NFS mount is consistent. In the following figure, we use 10.0.0.32 as a static IP assigned to the NFS Mount.
  2. (Optional) Start the DataSync Agent on the Snowcone.
    1. We assume that the Snowcone has access to the internet in the same way the Outposts server does. Configure the Agent, and then enable tasks. The Agent is used to replicate data from the Snowcone to the Region or from the Region to the Snowcone. The tasks that are created in this step enable replication.
  3. Launch the EC2 instance (either a. or b.)
    1. a.      Using a Linux OS – When launching an instance on the Outposts server to attach to the NFS mount, make sure that the LNI is configured when launching the instance. In the User data section, enter the commands shown in the following figure to mount the NFS file system from the Snowcone.Screenshot of User Data section within the Amazon EC2 Launch Wizard

Figure 5: Screenshot of User Data section within the Amazon EC2 Launch Wizard

#!/bin/bash
sudo mkdir /var/snowcone
sudo mount -t nfs SNOW-NFS-IP:/buckets /var/snowcone
sudo sh -c “echo ’ SNOW-NFS-IP:/buckets /var/snowcone nfs defaults 0 0’ >> /etc/fstab”

In this OS, we create a directory and then mount the NFS file system to that directory. The echo is used to place the mount into fstab to make sure that the mount is persistent if the instance is rebooted.

  1. b        Windows OS – The AMI being used during the launch must include the NFS client. The client is required to mount the NFS. When launching an instance on the Outposts server to attach to the NFS mount, make sure that the LNI is configured when launching the instance. In the User data section, enter the commands shown in the following figure to mount the NFS from the Snowcone as a drive letter.

A screenshot of User Data section of Amazon EC2 Launch wizard with commands to mount NFS to the Windows File System

Figure 6: A screenshot of User Data section of Amazon EC2 Launch wizard with commands to mount NFS to the Windows File System

<powershell>
NET USE Z: \\SNOW-NFS-IP\buckets -P
</powershell> 

The NET USE command maps the Z: drive to the NFS mount, and the -P makes it persistent between reboots.

This solution also works with Snowball Edge Storage Optimized. When ordering the Snowball Edge, choose NFS based data transfer for the storage type.

Screenshot of Select the storage type for the Snowball Edge

Figure 4: Screenshot of Select the storage type for the Snowball Edge

Conclusion

In this post, we examined how to mount NFS file systems in Snowcone to EC2 instances running on Outposts servers. We also covered starting DataSync Agent on Snowcone to enable data transfer from the edge to an AWS Region. By pairing these services together, you can build persistent and durable storage external to the Outposts servers and replicate your data back to the AWS Region.

If you want to learn more about how to get started with Outposts servers, my colleague Josh Coen and I have published a video series on this topic. The demo series shows you how to unbox an Outposts server, activate the Outposts server, and what you can do with your Outposts server after it is activated. Make sure to check it out!