Diving Deeper into Projen: Exploring Advanced Features

Post Syndicated from Michael Tran original https://aws.amazon.com/blogs/devops/diving-deeper-into-projen-exploring-advanced-features/

We will be highlighting Projen’s powerful features that cater to various aspects of project management and development. We’ll examine how Projen enhances polyglot programming within Amazon Web Services (AWS) Cloud Development Kit constructs. We’ll also touch on its built-in support for common development tools and practices.

In our previous blog, we introduced you to the basics of getting started with Projen. Projen is a powerful project generator that simplifies the management of complex software configurations. In our prior blog, we discussed developing a new AWS cloud development kit (CDK) construct library project. For consistency, we will continue using this construct library project as our example while exploring linting, dependency management, and test coverage. It’s important to note that these practices are equally applicable to CDK applications and other project types.

AWS CDK Polyglot Construct Library

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework that allows developers to define cloud infrastructure using familiar programming languages. In a CDK application, constructs serve as the foundational elements, allowing developers to represent either a single AWS resource or a complex combination of resources. These constructs are not only reusable but can be incorporated into other AWS CDK projects, promoting efficient and scalable development practices.

Projen and Polyglot Programming

Projen leverages the power of the JSII library, enabling developers to write constructs once and generate equivalent constructs across multiple programming languages. This feature streamlines the development process, especially when working with teams that have expertise in different languages.

Automated Publishing with Projen

With its publisher module, Projen automates the distribution of constructs to various package managers. This process can be integrated into a GitHub workflow, such as a build job, which triggers the publication of the library to the designated package managers.

Starting with Projen

Initiating an AWS CDK construct library project is straightforward through the Projen command npx projen new <project_type>. By executing the command npx projen new awscdk-construct, you initialize a new project complete with a projenrc file. This file contains the essential configuration for a CDK construct library, setting the stage for further customization and development.

import { awscdk } from 'projen';
const project = new awscdk.AwsCdkConstructLibrary({
  author: 'github username',
  authorAddress: 'github email',
  cdkVersion: '2.1.0',
  defaultReleaseBranch: 'main',
  jsiiVersion: '~5.0.0',
  name: 'cdkconstruct',
  projenrcTs: true,
  repositoryUrl: 'https://github.com/*****/cdkconstruct.git',

  // deps: [],                /* Runtime dependencies of this module. */
  // description: undefined,  /* The description is just a string that helps people understand the purpose of the package. */
  // devDeps: [],             /* Build dependencies for this module. */
  // packageName: undefined,  /* The "name" in package.json. */
});
project.synth();

A release.yml file is generated by projen under the github>workflow directory. This file has the details of the public registry where the construct needs to be published. By default, it will add the details for npm.

release_npm:
    name: Publish to npm

The construct can be developed in TypeScript under src/main.ts; our previous blog shows how to create one. If the construct needs to be published to other public registries (such as Maven for Java or PyPI for Python), the projenrc file can be updated to synthesize a new release.yml file.

For example, to publish a construct developed in TypeScript to Maven (so that it can be used in a Java application), add the publishToMaven option to the projenrc file.

const project = new awscdk.AwsCdkConstructLibrary({
  author: 'github username',
  authorAddress: 'github email',
  cdkVersion: '2.1.0',
  defaultReleaseBranch: 'main',
  jsiiVersion: '~5.0.0',
  name: 'cdkconstruct',
  projenrcTs: true,
  repositoryUrl: 'https://github.com/*****/cdkconstruct.git',
  publishToMaven: {
    javaPackage: 'com.cdk.hello',
    mavenArtifactId: 'cdk-construct-jsii',
    mavenGroupId: 'com.cdk.hello',
    mavenServerId: 'github',
    mavenRepositoryUrl: 'https://maven.pkg.github.com/example/hello-jsii',
  },
});

Run npx projen and the release.yml will be updated with Maven central details.

release_maven:
    name: Publish to Maven Central
    needs: release
    ....

Similarly, it can be published to other registries.

publishToPypi: 
publishToMaven:
publishToNuGet:
publishToGo:

This way the construct is built once and published to multiple registries with different programming languages.

Running Projen build runs a variety of processes.

Figure 1: High-level Architecture showing publication to multiple public registries

Linting, Dependency Management & Test Coverage

Projen streamlines the setup process by generating a comprehensive package.json file. This file includes pre-configured dependencies for ESLint and Jest, enabling developers to maintain coding standards and ensure robust test coverage right from the start. ESLint, a widely adopted static code analysis utility, empowers developers to enforce consistent coding practices by analyzing the source code and identifying potential errors, bugs, and stylistic issues. Additionally, Jest equips developers with a comprehensive suite of tools for writing and executing unit tests, facilitating comprehensive test coverage for their codebase. While Projen provides Jest as the default testing framework, it offers developers the flexibility to incorporate alternative testing frameworks based on their project requirements.

Continuing with the awscdk-construct project from the previous section, a default test file is created under test>main.test.ts, which you can update with your own test cases. A default package.json is generated in the root directory.

{
  "name": "projen_hello",
  "scripts": {
    "build": "npx projen build",
    "bundle": "npx projen bundle",
    "clobber": "npx projen clobber",
    "compile": "npx projen compile",
    "default": "npx projen default",
    "deploy": "npx projen deploy",
    "destroy": "npx projen destroy",
    "diff": "npx projen diff",
    "eject": "npx projen eject",
    "eslint": "npx projen eslint",
    "package": "npx projen package",
    "post-compile": "npx projen post-compile",
    "post-upgrade": "npx projen post-upgrade",
    "pre-compile": "npx projen pre-compile",
    "synth": "npx projen synth",
    "synth:silent": "npx projen synth:silent",
    "test": "npx projen test",
    "test:watch": "npx projen test:watch",
    "upgrade": "npx projen upgrade",
    "watch": "npx projen watch",
    "projen": "npx projen"
  },
  "devDependencies": {
    "@types/jest": "^29.5.4",
    "@types/node": "^16",
    "@typescript-eslint/eslint-plugin": "^6",
    "@typescript-eslint/parser": "^6",
    "aws-cdk": "^2.1.0",
    "esbuild": "^0.19.2",
    "eslint": "^8",
    "eslint-import-resolver-node": "^0.3.9",
    "eslint-import-resolver-typescript": "^3.6.0",
    "eslint-plugin-import": "^2.28.1",
    "jest": "^29.7.0",
    "jest-junit": "^15",
    "npm-check-updates": "^16",
    "projen": "^0.73.17",
    "ts-jest": "^29.1.1",
    "ts-node": "^10.9.1",
    "typescript": "^5.2.2",
    "webpack": "5.88.2"
  },
  "dependencies": {
    "aws-cdk-lib": "^2.1.0",
    "constructs": "^10.0.5"
  },
  "license": "Apache-2.0",
  "version": "0.0.0",
  "jest": {
    "testMatch": [
      "<rootDir>/src/**/__tests__/**/*.ts?(x)",
      "<rootDir>/(test|src)/**/*(*.)@(spec|test).ts?(x)"
    ],
    "clearMocks": true,
    "collectCoverage": true,
    "coverageReporters": [
      "json",
      "lcov",
      "clover",
      "cobertura",
      "text"
    ],
    "coverageDirectory": "coverage",
    "coveragePathIgnorePatterns": [
      "/node_modules/"
    ],
    "testPathIgnorePatterns": [
      "/node_modules/"
    ],
    "watchPathIgnorePatterns": [
      "/node_modules/"
    ],
    "reporters": [
      "default",
      [
        "jest-junit",
        {
          "outputDirectory": "test-reports"
        }
      ]
    ],
    "preset": "ts-jest",
    "globals": {
      "ts-jest": {
        "tsconfig": "tsconfig.dev.json"
      }
    }
  },
  "//": "~~ Generated by projen. To modify, edit .projenrc.ts and run \"npx projen\"."
}

Projen can be extensively configured. For example, if you need to configure webpack as a module bundler, then you need to add a webpack.config.js file and update the project definition in the projenrc file.

The other dependencies can be updated in package.json by adding deps in the projenrc.ts file.

const project = new awscdk.AwsCdkTypeScriptApp({
  cdkVersion: '2.1.0',
  defaultReleaseBranch: 'main',
  name: 'projen_hello',
  projenrcTs: true,
  
  deps:[
   "express",
  ],
  
  // add webpack dependencies
  devDeps:[
    "webpack",
    "webpack-cli",
    "ts-loader",
  ]
});
  
// update pre-configured build tasks and execute webpack
project.buildTask.reset(); // reset() must be called; a bare property reference does nothing
project.buildTask.exec('npx projen');
project.buildTask.exec('npx projen test');
project.buildTask.exec('npx webpack');

project.synth();

Run npx projen build to synthesize a package.json.

Continuous Integration and Continuous Delivery (CI/CD)

When you create a project using Projen, it comes equipped with an automated build process that triggers upon the submission of a pull request. This is one of the key, “out-of-the-box” features that streamlines development workflows.

Projen orchestrates this process through GitHub Actions, utilizing a sequence of tasks predefined in the project’s base ‘Project’ class.

When a build is initiated, it systematically carries out several sub-tasks:

  1. Synthesis: It starts by synthesizing all the project files, ensuring they are up-to-date and correctly configured.
  2. Bundling: Next, it bundles the necessary assets for the project.
  3. Compilation: The project’s code is then compiled.
  4. Testing: Following compilation, Projen runs the suite of tests defined for the project.
  5. Packaging: Finally, it packages everything together, preparing it for deployment or distribution.

Projen manages these steps by auto-generating a build.yml file, which it places within the workflow directory of your project’s structure. This YAML file contains all the instructions for the GitHub Actions to execute the build process.

For instance, when you run the command npx projen new awscdk-app-ts, Projen sets up a TypeScript application for AWS CDK. It automatically creates a ‘build.yml’ file through the default projenrc file, which can be found in the github/workflow folder of your project repository. This automated process is designed to save time and reduce manual errors, making it an essential feature for efficient project management.

 .github       
   workflow    
    build.yml  

A Projen build is self-mutating because files generated by Projen are part of the source directory. To ensure that a pull request branch always represents the final state of the repository, you can enable the mutableBuild option in your project configuration (currently only supported for projects derived from NodeProject).

The build process can be customized by adding any task in the project class, which can execute a shell command.

const buildproject = project.addTask('build'); 
buildproject.exec('npm run build');

You can spawn a subtask as well.

const buildproject = project.addTask('world');
buildproject.exec('echo world!');

const testproject = project.addTask('test');
testproject.exec('npm test');
testproject.spawn(buildproject);

A task also supports the condition option: a shell command that is evaluated before the task runs. The task executes only if the condition is true, that is, if the command exits with code 0.

const hello = project.addTask('hello', {
  condition: '[ -n "$CI" ]', // only execute if the CI environment variable is defined
  exec: 'echo running in a CI environment'
});

Releases and Versioning

Projen uses Conventional Commits to generate semantic versioning of the releases automatically. This means that based on the commit message format, it can create the release version automatically.

Initially, the project is released under version 0.0.0. Anything may change at any time and public APIs should not be considered stable. Commits marked as a breaking change will increase the minor version. All other commits will increase the patch version.

You need to manually promote the major version to 1 once your project is considered stable. For major versions 1 and above, if a release includes fix commits only, it will increase the patch version. If a release includes any feat commits, then the new version will be a minor version.

Commit Messages                     Release versions         

feat: <Message>                     1.0.X (Patch)            
fix: <Message>                      1.X.0 (Minor)            
BREAKING CHANGE: <Message>          X.0 (Major)              

API Documentation

One of the nice, out-of-the-box features that comes with Projen for AWS CDK constructs is the creation of API documentation for your constructs. By leveraging jsii-docgen, Projen’s build step will generate API documentation (API.md) from the comments in your code.

This feature is powerful for several reasons. Firstly, it ensures that documentation is kept up-to-date with the codebase, as the API documentation is generated directly from the source code comments. This reduces the risk of discrepancies between the code and its documentation, which can lead to misunderstandings and errors in usage.

Secondly, it streamlines the development process by automating a task that is often tedious and time-consuming. Developers can focus more on writing code and less on updating documentation manually.

Thirdly, it promotes better coding practices, as developers are encouraged to write clear and detailed comments in their code. This not only benefits the generation of documentation, but also helps any new developers who may work on the codebase in the future to understand the code more quickly and thoroughly.

Moreover, having readily available and accurate documentation can significantly enhance the developer experience. It makes it more straightforward for users of the CDK constructs to understand the functionality, parameters, return types, and the structure of the code they are working with.

In the context of team collaboration and open-source projects, this feature is especially beneficial. It ensures that anyone who contributes to the codebase is able to generate and view the latest documentation without any additional setup or configuration, facilitating smoother collaboration and integration processes.

Let’s recap all of the features that Projen can introduce into your project right out of the box:

  1. Projen’s automation for linting and testing to maintain high code quality from the beginning.
  2. Automated API documentation feature to keep your project’s documentation synchronized with the latest code changes.
  3. Polyglot capabilities to cater to a diverse development team, ensuring flexibility in language preference.
  4. The publisher module to streamline the release process across multiple package managers, saving time and reducing the scope for human error.
  5. A list of awesome projects developed with Projen for inspiration or use as a template.

Conclusion

As we wrap up our deep dive into some of the advanced features of Projen within AWS CDK, it’s clear that Projen helps alleviate a lot of the pain points of a new greenfield project. By leveraging Projen, developers can navigate the complexities of polyglot programming, automate the mundane tasks of publishing and documentation, and ensure consistent code quality through linting and testing. Projen elevates the development workflow to a level where efficiency and scalability are the norms, not the exception.

What’s more compelling is Projen’s commitment to developer empowerment. Through its automated systems, it encourages developers to adhere to best practices without the overhead of manual enforcement. Its ability to seamlessly integrate with various package managers and generate detailed API documentation from inline comments signifies a leap in developer tooling.

Contact an AWS Representative to know how we can help accelerate your business.

Alain Krok

Alain Krok is a Senior Solutions Architect with a passion for emerging technologies. His past experience includes designing and implementing IoT solutions for the oil and gas industry and working on robotics projects. He enjoys pushing the limits and indulging in extreme sports when he is not designing software.

Dinesh Sajwan

Dinesh Sajwan is a Senior Solutions Architect. His passion for emerging technologies allows him to stay on the cutting edge and identify new ways to apply the latest advancements to solve even the most complex business problems. His diverse expertise and enthusiasm for both technology and adventure position him as a uniquely creative problem-solver.

Michael Tran

Michael Tran is a Senior Solutions Architect with Prototyping Acceleration team at Amazon Web Services. He provides technical guidance and helps customers innovate by showing the art of the possible on AWS. He specializes in building prototypes in the AI/ML space.

How Amazon SES Mail Manager customers can prevent EchoSpoofing

Post Syndicated from Zip Zieper original https://aws.amazon.com/blogs/messaging-and-targeting/how-ses-mail-manager-customers-can-prevent-echospoofing/

Customers not using Amazon SES Mail Manager, or those leveraging the authenticated SMTP functionality, are not at risk of EchoSpoofing. In such cases, no further action is required.

However, customers currently using or evaluating the unauthenticated SMTP relay feature of Amazon SES Mail Manager are strongly advised to review and implement the guidance provided in this blog post.

A new type of email spoofing attack

In July 2024, the researchers at Guardio Labs disclosed a new type of email spoofing (authentication bypass) attack they called “EchoSpoofing”. The attackers successfully sent spoofed emails by redirecting them through a virtual SMTP server, Office365 Exchange Online server, and a trusted third-party SMTP relay service. This path provided the fraudulent messages a means to pass standard authentication checks. Fortunately, the Guardio Labs researchers responsibly disclosed the issue to the targeted email security provider, leading to a speedy and effective remedy.

Unfortunately, before the vulnerability was completely addressed, cybercriminals executed a series of sophisticated phishing campaigns. These campaigns involved sending millions of fraudulent emails that carried valid Sender Policy Framework (SPF) and DomainKeys Identified Mail (DKIM) authentication from well-known consumer brands.

The EchoSpoofing incident reminds both providers and customers to adopt a “trust-but-verify” approach to email security. This is especially true when mail routing functions have been, or are in the process of being, decoupled from single-tenant, on-premises (or cloud) email infrastructures.

As a leading provider of global managed email infrastructure, the Amazon SES service team went to work immediately after the Guardio Labs announcement. The Amazon SES team scrutinized the announcement and remediations undertaken by both Microsoft and the third-party SMTP relay service to fully understand the EchoSpoofing exploits and devised methods to swiftly safeguard Amazon SES customers.

Although we won’t delve into Guardio Labs’ in-depth analysis, it is crucial to grasp the attack’s main elements and examine how malicious individuals could exploit supposedly “secure” email relay systems. Amazon SES has taken steps to safeguard against EchoSpoofing and similar attacks, urging its customers to do the same.

Understanding the EchoSpoofing attack

The bad actors who implemented EchoSpoofing were able to send millions of well camouflaged malicious messages through the trusted delivery path of targeted organizations by preserving the SPF and DKIM attributes of the targeted sender’s domain. This greatly increased the likelihood of recipients trusting and acting upon the fake messages.

The attacker first set up a tenant in Microsoft Office365 and then delivered a spoofed email to that tenant, falsifying the From email headers. The attacker-controlled tenant was configured to relay the email to a security relay point linked to the forged sending identity. As the forged email came from trusted IP addresses belonging to Microsoft, the security relay point signed the message and relayed it to the final recipients.

The attackers had amassed a large inventory of high-profile domains, and spread the EchoSpoof campaigns out across them to smooth the traffic and avoid sending spikes from any single domain. They carried out this attack for several months undetected, sending as many as 14 million messages per day, targeting the users of the compromised domains’ email services. This made the attack easy to automate at scale, difficult to detect via automated means, and highly successful in delivering malicious emails to unsuspecting recipients.

Guardio Labs’ discovery highlights the risks associated with an insecure SMTP relay model in a trusted domain configuration. This is of particular concern when permissive security policies allow fraudulent emails to be injected into the email flow without raising alarms.

The AWS shared security model for SES

As an email sender, Amazon SES is one of the largest on the internet, operating a worldwide fleet of trusted mail relays. Amazon SES maintains high IP reputations for this large fleet by maintaining a tight focus on robust, evolving security practices.

At AWS we operate under the shared security model. For Amazon SES, this means AWS takes responsibility for securing the underlying email delivery infrastructure, including the email servers, networks, and physical data centers. Customers take responsibility for securing their configurations, email content, sender authentication, and email lists they use with Amazon SES.

To ensure we meet our obligations in the shared security model, Amazon SES has recently added new features to Mail Manager SMTP relays that provide an increased level of protection to help guard against exploits like EchoSpoofing. These features are live in every AWS Region where Mail Manager is accessible.

We have outlined our recommendations and updated Amazon SES Mail Manager configurations in this blog to help customers meet their obligations and strengthen their Amazon SES email infrastructure against EchoSpoofing. As noted above, authenticated SMTP relays are not subject to this exploit.

Prevent EchoSpoofing when relaying email out of MailManager

If you need to relay email to a third party system that cannot enforce SMTP authentication, our recommendation is to limit access to the IP addresses used by Mail Manager in your region.

As of this writing, Mail Manager is generally available, and has its own IP range, in six commercial regions (below). As Amazon expands SES Mail Manager availability into more regions, the IP ranges will be updated in the Amazon SES documentation.

SES Mail Manager regions

SES Mail Manager IP ranges as of 10/23/2024

When unable to enforce SMTP authentication, we recommend configuring your SMTP servers or third-party software to check the new MIME header "X-MAIL-MANAGER-ORIGINATOR-ORG". This new Mail Manager header is now automatically inserted into messages relayed by Mail Manager. The X-MAIL-MANAGER-ORIGINATOR-ORG header will be set to the customer’s unique SMTP relay ID, which can be found via the Mail Manager console or the ListSmtpRelays API.

In addition to added security, the MIME header feature can also be used in message search and filtering behaviors for a wide range of MIME header name:value pairs.

If the original email already contains an X-MAIL-MANAGER-ORIGINATOR-ORG header, it will be replaced with the last MailManager SMTP relay ID to relay the email. Here is an example of an email relayed by MailManager with the header:

MIME-Version: 1.0
From: [email protected]
To: [email protected]
Subject: Test
X-SES-REDIRECT-MESSAGE-ID: <[email protected]>
X-MAIL-MANAGER-ORIGINATOR-ORG: rl-usmoots8mgmfgfaeijckxhqx
X-SES-Outgoing: 2024.08.26-76.223.191.14

--===============1760803815732220490==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

This is a sample message. Have a nice day.
--===============1760803815732220490==--

This approach elevates your security posture because the IP access control lists on your third-party system ensure that only mail from Amazon SES is accepted, so the MIME header can be trusted and checked.

Prevent EchoSpoofing when relaying into MailManager

When relaying email from a third party into Amazon SES Mail Manager, you will similarly need to configure an IP allowlist, and if the email comes from a shared or cloud environment, you will need an additional header check to disambiguate among the multiple tenants it hosts. Those IPs and headers are provider-specific; for example, emails coming from Office365 will have a header called X-OriginatorOrg.

You can use the rule editor screen in MailManager to configure the check in Mail Manager for the IPs and 3rd party headers before executing any action.

Mail Manager Rules Detail Page

The verification of a MIME header is not necessary if the third party relaying into MailManager uses an IP dedicated for your tenant. In that case, there is no possibility that an attacker tenant injects email into MailManager using your IP.
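The outbound and inbound checks described in the two sections above reduce to the same receiver-side logic: accept only from an allowlisted peer IP, then require the tenant header to match. The following Python is an added example, not code from AWS or the original post; the CIDR ranges and the X-OriginatorOrg value are constructed placeholders, the function names are hypothetical, and the relay ID is the sample from the message shown earlier.

```python
import ipaddress
from email import message_from_string
from email.policy import default as default_policy

# Constructed placeholders (documentation ranges). Use the Mail Manager ranges
# published in the Amazon SES documentation for your Region instead.
MAIL_MANAGER_CIDRS = ["192.0.2.0/24", "2001:db8:1::/48"]
# Sample relay ID taken from the example message on this page.
RELAY_ID = "rl-usmoots8mgmfgfaeijckxhqx"
ORIGINATOR_HEADER = "X-MAIL-MANAGER-ORIGINATOR-ORG"


def ip_allowed(peer_ip, cidrs):
    """True if the connecting peer address is inside one of the allowlisted CIDRs."""
    try:
        addr = ipaddress.ip_address(peer_ip)
    except ValueError:
        return False  # unparsable peer address: fail closed
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def tenant_header_ok(raw_message, header_name, expected):
    """True only if the header appears exactly once and equals the expected value."""
    msg = message_from_string(raw_message, policy=default_policy)
    values = msg.get_all(header_name, [])  # header names match case-insensitively
    if len(values) != 1:
        return False  # missing, or duplicated to smuggle in a second value
    return str(values[0]).strip() == expected


def accept_relayed(peer_ip, raw_message, cidrs, header_name, expected):
    """Apply both checks; the header is only meaningful once the IP check passed."""
    if not ip_allowed(peer_ip, cidrs):
        return False, "peer IP not in allowlist"
    if not tenant_header_ok(raw_message, header_name, expected):
        return False, "tenant header missing, duplicated or mismatched"
    return True, "accepted"


def make_message(header_name, *values):
    """Build a constructed sample message carrying zero or more tenant headers."""
    lines = ["From: sender@example.com", "To: rcpt@example.net", "Subject: Test"]
    lines += [f"{header_name}: {value}" for value in values]
    return "\n".join(lines) + "\n\nThis is a sample message.\n"


if __name__ == "__main__":
    # All inputs below are constructed for illustration.
    cases = [
        ("relayed by our Mail Manager relay", "192.0.2.25",
         make_message(ORIGINATOR_HEADER, RELAY_ID)),
        ("forged header from outside the range", "203.0.113.9",
         make_message(ORIGINATOR_HEADER, RELAY_ID)),
        ("shared range, another tenant's relay", "192.0.2.25",
         make_message(ORIGINATOR_HEADER, "rl-constructedothertenant00")),
        ("duplicated header", "192.0.2.25",
         make_message(ORIGINATOR_HEADER, RELAY_ID, "rl-constructedothertenant00")),
    ]
    for label, peer, raw in cases:
        verdict = accept_relayed(peer, raw, MAIL_MANAGER_CIDRS, ORIGINATOR_HEADER, RELAY_ID)
        print(f"{label}: {verdict}")
    # Inbound direction: provider range plus X-OriginatorOrg (values constructed).
    inbound = make_message("X-OriginatorOrg", "example.onmicrosoft.com")
    print(accept_relayed("198.51.100.7", inbound, ["198.51.100.0/24"],
                         "X-OriginatorOrg", "example.onmicrosoft.com"))
```

The third case is the EchoSpoofing shape: the peer IP is trusted because it belongs to a shared provider, so only the tenant header separates your mail from another tenant's.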

Conclusion

While the conditions that made the EchoSpoofing exploit possible were highly specific, they reminded us all of the importance of taking a proactive approach to email security.

The chances of your Amazon SES Mail Manager unauthenticated SMTP relay being compromised are low, but we highly advise you follow the recommendations in this blog post promptly. You’ll find more information in the Amazon SES documentation.

If you need help with securing your Amazon SES Mail Manager SMTP relay actions against EchoSpoofing, contact AWS support, or leave us a comment in the community section of the blog post.

Call to Action:

If you are using SES Mail Manager’s unauthenticated SMTP relay today, follow the guidance in this blog to secure your email infrastructure by configuring the recommended ACLs and MIME header verification in AWS SES Mail Manager.

Stay ahead of emerging threats by subscribing to this AWS blog where we post the latest security updates as well as new features and interesting use cases for SES.

Join the conversation and share your best practices for email security with the AWS community.

Explore the new MIME header evaluation feature in AWS SES Mail Manager and share your creative use cases with us and the SES community via the community comments section of the blog.

About the Authors

Toby Weir-Jones

Toby is a Principal Product Manager for Amazon SES and WorkMail. He joined AWS in January 2021 and has significant experience in both business and consumer information security products and services. His focus on email solutions at SES is all about tackling a product that everyone uses and finding ways to bring innovation and improved performance to one of the most ubiquitous IT tools.

Zip

Zip is a Sr. Specialist Solutions Architect at AWS, working with Amazon Pinpoint and Simple Email Service and WorkMail. Outside of work he enjoys time with his family, cooking, mountain biking, boating, learning and beach plogging.

Leandro Batista Lameiro

Leandro is a Sr. Software Dev Engineer at AWS.

Linzhou Zhong

Linzhou is a software engineer at AWS.

How to use interface VPC endpoints to meet your security objectives

Post Syndicated from Joaquin Manuel Rinaudo original https://aws.amazon.com/blogs/security/how-to-use-interface-vpc-endpoints-to-meet-your-security-objectives/

Amazon Virtual Private Cloud (Amazon VPC) endpoints—powered by AWS PrivateLink—enable customers to establish private connectivity to supported AWS services, enterprise services, and third-party services by using private IP addresses. There are three types of VPC endpoints: interface endpoints, Gateway Load Balancer endpoints, and gateway endpoints. An interface VPC endpoint, in particular, allows customers to design applications that connect to AWS services privately, including the more than 130 AWS services that are available over AWS PrivateLink. For a complete list of services that integrate with PrivateLink, see the documentation for VPC endpoints.

The decision regarding when to use interface VPC endpoints to further secure your AWS infrastructure depends on your need for additional security controls or your preferred architecture patterns. In this blog post, we present four security objectives that VPC endpoints help you achieve. It’s important to note that other non-security benefits, such as reduced latency and cost management, are not covered in this post. For more information on those benefits, see these topics:

Background

By default, network packets that originate in the AWS network with a destination on the AWS network (for example, public endpoints for AWS services) stay in the AWS network, except traffic to and from AWS China Regions. In addition, all data flowing across the AWS global network that interconnects AWS data centers and Regions is automatically encrypted at the physical layer before it leaves AWS secured facilities.

AWS PrivateLink VPC endpoints enable customers to further enhance the security posture of their applications by establishing private connectivity to supported AWS services, enterprise services, and third-party services by using a private IP address. You can find patterns for how to use the different types of endpoints in the Securely Access Services Over AWS PrivateLink whitepaper.

An interface VPC endpoint is a collection of one or more elastic network interfaces with private IP addresses. These endpoints can serve as an entry point for traffic destined to a supported AWS service in the same AWS Region as the VPC, without requiring an internet gateway, NAT device, VPN connection, AWS Direct Connect connection, or a public IP. Customers can then use interface VPC endpoints to help meet multiple security objectives, such as the following:

  1. Implement networks that are isolated from the internet
  2. Implement a data perimeter by using VPC endpoint policies
  3. Enable private connectivity to AWS service API endpoints for on-premises environments
  4. Align with specific compliance requirements

In the rest of this post, we’ll discuss each of these objectives in detail and how interface VPC endpoints can help you implement them.

Security objective 1: Implementing networks that are isolated from the internet

If you operate sensitive workloads, you might require that they run in private subnets that are isolated from the internet. In this scenario, the subnets in the network don’t have routes to an internet gateway and won’t be able to either send packets to the internet or receive packets from it.

In this case, you can use interface VPC endpoints to connect your VPC to AWS services in the same Region as if they were in your VPC, without configuring an internet gateway, NAT instance, or route tables. For information on how to configure a cross-Region VPC interface endpoint by using VPC peering, see this guidance.

Figure 1 shows an example architecture with an Amazon Elastic Compute Cloud (Amazon EC2) instance running in an isolated network and using interface VPC endpoints to send messages to Amazon Simple Queue Service (Amazon SQS).

Figure 1: Isolated subnet for EC2 server sending messages to Amazon SQS

Security objective 2: Implement a data perimeter using VPC endpoint policies

A data perimeter is a set of guardrails to help ensure that only your trusted identities are accessing trusted resources from expected networks. Learn more about data perimeters on AWS.

You can use VPC endpoint policies to implement a data perimeter by allowing access to only trusted entities and resources from your network, helping to prevent unintended access. This enables you to take advantage of the power of AWS Identity and Access Management (IAM) policy and flexibility to control access to your resources at a granular level.

In the VPC diagram in Figure 2, EC2 instance traffic flows out through a firewall endpoint, NAT gateway, and internet gateway to reach the S3 public API endpoint, remaining within the AWS network. However, this setup does not allow the implementation of a logical data perimeter to control the specific resources that the EC2 instance can access.

Figure 2: Before implementing a data perimeter

In contrast, in the diagram in Figure 3, you can see how VPC interface endpoints enable the use of VPC endpoint policies to enforce a data perimeter, such as only allowing certain S3 buckets to be accessed by the EC2 instance.

Figure 3: After implementing a data perimeter

For example, you can attach a policy, similar to the one below, to an Amazon S3 interface or gateway VPC endpoint to restrict access from the VPC to only S3 buckets that are owned by the same AWS Organizations organization. Make sure to replace <MY-ORG-ID> with your own information.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "*",
            "Condition": { "StringEquals": { "aws:ResourceOrgID": "<MY-ORG-ID>" } }
        }
    ]
}

As a further example, the following policy shows how you can limit access to only trusted identities. You can attach this policy to an S3 interface VPC endpoint to permit access only to principals from your organization, to help mitigate the risk of unintended disclosure through non-corporate credentials. Make sure to replace <MY-ORG-ID> with your own information.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "*",
            "Condition": { "StringEquals": { "aws:PrincipalOrgID": "<MY-ORG-ID>" } }
        }
    ]
}

Finally, you can create resource policies for your resources to restrict access to only VPC interface endpoints. For example, you can use the following policy from our Amazon S3 User Guide for S3 buckets. Make sure to replace <MY-VPCE-ID> and <MY-BUCKET> with your own information.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::<MY-BUCKET>", "arn:aws:s3:::<MY-BUCKET>/*"],
            "Condition": { "StringNotEquals": { "aws:SourceVpce": "<MY-VPCE-ID>" } }
        }
    ]
}

For more information on the use of these condition keys and how to implement a data perimeter, see this blog post.
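The two endpoint policies above each cover one side of the perimeter (trusted resources or trusted identities). The following Python script is an added example, not code from the original post or from AWS: it lints an endpoint policy document and reports which of the two guardrails each Allow statement is missing, whether a placeholder was left in place, and whether the JSON parses at all. The function names, finding strings, sample policies and the organization ID are constructed for illustration; only a plain `StringEquals` block is counted as a guardrail.

```python
import json
import re

# Condition keys used by the endpoint policies in this post.
# IAM condition key names are not case-sensitive, so compare lower-cased.
PERIMETER_KEYS = {
    "aws:principalorgid": "trusted identities",
    "aws:resourceorgid": "trusted resources",
}
# Matches unreplaced template values such as <MY-ORG-ID>.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9-]*>")


def as_list(value):
    """IAM JSON allows a single value where a list is also accepted."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def pinned_keys(statement):
    """Map each perimeter key to the values a StringEquals block pins it to.

    Other operators (for example the ...IfExists variants, which pass when the
    key is absent from the request) are deliberately not counted.
    """
    pinned = {}
    for operator, block in statement.get("Condition", {}).items():
        if operator != "StringEquals":
            continue
        for key, values in block.items():
            if key.lower() in PERIMETER_KEYS:
                pinned[key.lower()] = as_list(values)
    return pinned


def lint_endpoint_policy(policy_text):
    """Return a list of findings; an empty list means both guardrails are set."""
    try:
        policy = json.loads(policy_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]

    findings = []
    for index, statement in enumerate(as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue  # Deny statements only narrow what is allowed
        pinned = pinned_keys(statement)
        for key, label in PERIMETER_KEYS.items():
            values = pinned.get(key)
            if not values:
                findings.append(f"statement {index}: no {label} guardrail ({key})")
            elif any(PLACEHOLDER.search(str(v)) for v in values):
                findings.append(f"statement {index}: {key} still holds a placeholder")
    return findings


def _policy(condition):
    """Build an S3 endpoint policy shaped like the ones in the post."""
    return json.dumps({"Version": "2012-10-17", "Statement": [{
        "Effect": "Allow", "Principal": "*", "Action": "s3:*",
        "Resource": "*", "Condition": condition}]})


ORG = "o-exampleorg1"  # constructed organization ID, not a real one

# Constructed sample inputs.
RESOURCE_ONLY = _policy({"StringEquals": {"aws:ResourceOrgID": ORG}})
IDENTITY_ONLY = _policy({"StringEquals": {"aws:PrincipalOrgID": "<MY-ORG-ID>"}})
# Keys inside one StringEquals block are ANDed: both must match.
BOTH = _policy({"StringEquals": {"aws:ResourceOrgID": ORG,
                                 "aws:PrincipalOrgID": ORG}})
# Placeholder pasted without quotes, which is not valid JSON.
UNQUOTED = ('{"Statement": [{"Effect": "Allow", "Condition": '
            '{"StringEquals": {"aws:ResourceOrgID": <MY-ORG-ID>}}}]}')


def report():
    """Lint every sample and return {sample name: findings}."""
    samples = {"resource-only": RESOURCE_ONLY, "identity-only": IDENTITY_ONLY,
               "both": BOTH, "unquoted": UNQUOTED}
    return {name: lint_endpoint_policy(text) for name, text in samples.items()}


if __name__ == "__main__":
    for name, findings in report().items():
        print(name, "->", findings or "ok")
```

A check like this fits in a CI step that runs before endpoint policies are applied, so that a policy covering only one side of the perimeter is caught in review.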

Security objective 3: Enable private connectivity to AWS service API endpoints for on-premises environments

You might be required to run private connectivity to AWS only from your on-premises environments, such as when your on-premises firewalls are configured to limit the connectivity to the internet, including AWS public endpoints.

In this case, you can use interface VPC endpoints with Direct Connect private virtual interfaces (VIFs) or Site-to-Site VPN to extend private connectivity to your on-premises networks. With this setup, you can also enforce data perimeter rules like those shown earlier in this post.

For example, customers can use interface VPC endpoints from Amazon CloudWatch agents running on on-premises servers to CloudWatch through a private connection, as demonstrated in this blog post.

In the diagram in Figure 4, we show how you can extend this approach to include other services, such as Amazon S3, in a single VPC setup. To implement this pattern, you need to set up conditional forwarding on your on-premises DNS resolver to forward queries for amazonaws.com to an Amazon Route 53 Resolver’s inbound endpoint IPs.

The flow in this scenario is as follows:

  1. The DNS query for your S3 endpoint from your on-premises host is routed to the locally configured on-premises DNS server.
  2. The on-premises DNS server performs conditional forwarding to an Amazon Route 53 inbound resolver endpoint IP address.
  3. The inbound resolver returns the IP address of the interface VPC endpoint, which allows the on-premises host to establish private connectivity through AWS VPN or AWS Direct Connect.

Figure 4: On-premises private connectivity to Amazon S3 and Amazon CloudWatch

You can extend this architecture to support a cross-Region and multi-VPC setup by using AWS Transit Gateway and Amazon Route 53 private hosted zones, as described in the Building a Scalable and Secure Multi-VPC AWS Network Infrastructure whitepaper. Keep in mind that a distributed VPC endpoint approach (one that uses one endpoint per VPC) will allow you to implement least-privilege policies in VPC endpoints. A centralized approach, while more cost-effective, can increase the complexity of maintaining least privilege in a single policy and increase the scope of impact of a security event.

Security objective 4: Align with specific compliance requirements

In certain cases, customers operating in industries such as financial services or healthcare need to maintain compliance with regulations or standards such as HIPAA, the EU-US Data Privacy Framework, and PCI DSS. Although all communication between instances and services hosted in AWS uses the AWS private network, using an interface VPC endpoint can help prove to auditors that you’re applying a defense-in-depth approach. This approach includes designing your workloads to run in networks that are isolated from the internet or implementing additional conditions such as the example VPC endpoint policies shown earlier in this post.

You can use AWS Audit Manager to get started mapping your compliance requirements to industry and geographic frameworks, such as NIST SP 800-53 Rev. 5, FedRAMP, and PCI DSS, and to automate evidence collection for controls such as the use of VPC endpoints. If you also have custom compliance requirements, you can create your own custom controls by using the Configure Amazon Virtual Private Cloud (VPC) service endpoints core control in the AWS Audit Manager control library console.

If you want to know how the use of VPC endpoints can help you align with compliance requirements for your specific workload and require assistance beyond what is provided in the public documentation on the AWS Compliance Programs webpage, you can consult with AWS Security Assurance Services (AWS SAS). AWS SAS has expert consultants and advisors who can help you design your systems to achieve, maintain, and automate compliance in the cloud.

Conclusion

In this blog post, we presented four security objectives to consider when deciding whether to use AWS interface VPC endpoints. You can use this information when you design your architecture or create a threat model to help implement secure architectures for your AWS hosted workloads. If you want to learn more about AWS PrivateLink and interface endpoints, see the AWS PrivateLink documentation. If you’re interested in learning more about implementing data perimeter concepts by using VPC endpoints, we suggest this workshop.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Jonathan Jenkyn

Jonathan (“JJ”) is a Senior Security and Compliance Consultant with AWS Privacy, Security, and Assurance. He’s an active member of the People with Disabilities affinity group and has built several Amazon initiatives supporting charities and social responsibility causes. Since 1998, he has been involved in IT Security and Compliance at many levels, from implementation of cryptographic primitives to managing enterprise security governance. He also enjoys running, cycling, swimming, fundraising for the BHF and Ipswich Hospital Charity, and spending time with his wife and five children.

Andrea Di Fabio

Andrea is a Senior Security Assurance Consultant with the AWS Professional Services Security Risk and Compliance team. In this role, Andrea uses a risk-based approach to help enterprise customers improve business agility as they operationalize the shared responsibility model throughout their technology transformation journey in AWS.

Zaur Molotnikov

Zaur is a Security Consultant at AWS Professional Services, specializing in complex application security code reviews for top global companies. With a passion for security management, he uses his expertise to help companies achieve robust protection measures. Outside work, Zaur enjoys the thrill of motorcycle riding and the creativity of working with power tools on construction projects.

Joaquin Manuel Rinaudo

Joaquin is a Principal Security Architect with AWS Professional Services. He is passionate about building solutions that help developers improve their software quality. Before joining AWS, he worked across multiple domains in the security industry, from mobile security to cloud- and compliance-related topics. In his free time, Joaquin enjoys spending time with family and reading science fiction novels.

How to build a Security Guardians program to distribute security ownership

Post Syndicated from Mitch Beaumont original https://aws.amazon.com/blogs/security/how-to-build-your-own-security-guardians-program/

Welcome to the second post in our series on Security Guardians, a mechanism to distribute security ownership at Amazon Web Services (AWS) that trains, develops, and empowers builder teams to make security decisions about the software that they create. In the previous post, you learned the importance of building a culture of security ownership to scale security within your organization, and how AWS achieves this using the Security Guardians program. Since then, many customers have asked how they can build their own, similar program.

In this post, you will learn the steps to build your own Security Guardians program for your organization, including how to:

  • Set the vision, mission, and goals of your program
  • Identify developer teams that can pilot your new program
  • Define the expected behaviors for those teams
  • Develop training and create opportunities for career development to keep your teams engaged in the program

The guidance in this post is based on what we learned at AWS. Because every organization is different, the final version of the program you build is likely to look different from the one at AWS. Your program needs to reflect the current state of your organization’s culture of security and be designed to cultivate the security-related behaviors that are most important to your organization.

Security Guardians program mechanism

As discussed in the previous post, mechanisms form a key part of our business at AWS. Figure 1 demonstrates how a mechanism is a complete process, or virtuous cycle, that reinforces and improves itself as it operates. It takes controllable inputs and transforms them into ongoing outputs to address a recurring business challenge. In this case, the business challenge AWS faced was that security findings were being identified late in the development lifecycle, making it more expensive—in terms of time, money and effort—to remediate them. This led to bottlenecks in our security review process. The culture of security at AWS, specifically our culture of ownership, provides support to solve this challenge, but we needed the Security Guardians mechanism to actually do it.

Figure 1: AWS mechanism cycle

With most mechanisms, driving adoption is difficult, especially when the mechanism requires human participation to succeed. This is also true in the case of Security Guardians, and you can use our experience to help you avoid some of the challenges and growing pains of driving adoption.

Getting everyone aligned

“If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask, for once I know the proper question, I could solve the problem in less than five minutes.” – Albert Einstein

Getting alignment for the need to distribute security expertise starts with deeply understanding what problems need to be addressed. For example:

  • Is product delivery velocity being negatively impacted by delays in the security review process?
  • What business goal or metric are these delays negatively impacting?
  • Where in the security review process are those delays occurring?
  • What factors are contributing to those delays?
  • Is it a lack of time, people, or skills?

Thoroughly understanding the specific problems and their root causes, as identified by answering those questions, allows you to evaluate whether distributing security ownership is the appropriate solution. This in turn makes it easier to gain alignment and buy-in across the organization for the chosen approach.

A component of a culture of security

Building a strong culture of security requires support from executive leadership to set the direction for the rest of the organization. Executive support makes it easier for product leaders to secure the resources and finances needed for a Security Guardians program to be successful. To align with your organization’s leaders, you can reflect on the goals of your leaders and how the Security Guardians program can be built to meet those goals.

For example, if your business goal is to ship products 25 percent faster, understand how a particular resourcing effort from Security Guardians is going to help your organization meet that goal. AWS benefited from the program with a 26.9 percent reduction in the time to review a new service or feature when a Security Guardian was involved.

Our experience is that it’s challenging to establish a Security Guardians program without executive support. If you’re struggling to identify a business leader to sponsor the program and provide insight on the business problem, your AWS account team—including your account manager or solutions architect—can help. If you’re a business leader or executive reading this post, consider becoming that sponsor yourself.

One step at a time

A step-by-step approach to implementing the Security Guardians program helps overcome organizational challenges and avoid common pitfalls that could lead to failure. These steps, shown in Figure 2, are:

  1. Set the vision
  2. Choose innovators
  3. Define behaviors
  4. Maintain interest
  5. Measure success

These steps support the activities that make a mechanism successful: adoption, inspection, and tools.

Figure 2: Steps for implementing a Security Guardians program

Set the vision

Now that you’ve identified the business problem or business goal, set the vision for the Security Guardians program by working backwards from this problem or goal to define the purpose of your program. For example, the vision of the AWS Security Guardians is “To nourish security ownership that consistently delights our customers with security-by-design throughout the development lifecycle.”

Craft an ambitious vision for your Security Guardians program. Think beyond easy wins and focus on bold, forward-thinking security outcomes for your organization. Make sure that each element of your vision aligns with a business problem or goal. The following table is an example of how the vision of the program is aligned with business goals:

| Business goals | Security outcome | Long-term goals |
| --- | --- | --- |
| Develop products faster and more efficiently. | To improve developer agility while reducing security risk. | Increase the number of threat models performed by Security Guardians (instead of by application security engineers). Over time, this goal could change to “increase the quality of threat models.” Decrease the average monthly security issue rate. Train three new Security Guardians each quarter. |
| Reduce long-term security spend. | To identify and mitigate security risk as early as possible. | |
| Increase customer trust. | To exceed customer security expectations by raising the security bar. | |

The next step is to define a clear mission that is supported with measurable goals. The mission and goals must be achievable and help to move the needle towards the long-term vision.

The final part is to name your program. We chose Security Guardians, like Marvel’s Guardians of the Galaxy. We’ve also heard customers using Security Champions, Security Advocates, Security Innovators, and Security Drivers. Have fun with it and make sure the name resonates with as many participants as possible.

After you’ve defined the vision, future state, mission, measurable goals, and name of the program, review them with your security and business leaders. It’s beneficial to include your innovators or Security Guardians who will be early adopters of the program in this review. In the next section, you’ll learn how to identify these innovators.

Choosing innovators

Just as you develop for and iterate with early adopters of the products you’re building, you should identify individuals and teams who will pilot the program with you. Before the AWS Security Guardians program, our application security engineering teams built relationships with product teams through security reviews.

This meant that they already knew which individuals within those product teams had an interest in security. This is where AWS started, but the success of your program isn’t dependent on whether you already know who these individuals are. Development teams will self-identify and nominate Security Guardians from their own teams. Figure 3 shows examples to help you get started understanding which development teams will be good early adopters for your program.

Figure 3: Example product teams for early program adopters

The examples in Figure 3 include:

Candidate A: Quick wins team

Early adopters typically share key traits, including existing security measures and a designated security role or team members with security expertise. Essentially, they already prioritize security at the team level.

Candidate B: High impact team

This is the team most impacted by the disparity between product development teams and security teams; the agility and time-related benefits of the Security Guardians program will be the highest for this team. For example, this team might be facing long delays in launching products because of the current security review process at your organization.

Candidate C: High risk team

This team owns a product that has a high security risk because of the nature of the product. This team will benefit the most from additional security scrutiny and from raising the security bar at your organization. For example, this team might be building a product that’s considered a critical asset, hosts sensitive data, or performs critical processes.

After you’ve identified one or more teams that could be good early adopters of the program, you need to identify at least one individual from each team to serve as the Security Guardian. Keep the vision and goals of your program in mind when selecting your Security Guardian. Your early Security Guardians should have at least the following characteristics:

  • Ability to exercise well-informed and decisive judgement
  • Maintain and showcase their knowledge
  • Not afraid to have their work be independently validated
  • Advocate for their security needs in internal discussions
  • Hold a high security bar
  • Thoughtful and assertive to make customer security a top priority on their team

In terms of time commitment, our experience is that each Security Guardian spends an average of 3.5 hours each month on activities such as answering general security questions, identifying security stories needed for sprints, diving deep into security-related tasks, and supporting security-related tasks. Each application security review takes approximately 4 hours of effort.

The first post of this series contains even more details on the characteristics that make a good Security Guardian.

Defining behaviors

It’s important to set expectations on what behaviors you want Security Guardians, developers, and security teams to exhibit within the context of the program. These behaviors typically relate directly to the goals of the program. For example, if one of the goals is to increase the number of threat models created, then creating threat models will be one of the defined behaviors. The behaviors need to be measurable, with some flexibility for change as you improve the program.

At AWS, our Security Guardians have access to a runbook that lists the activities each Guardian should take when engaged as part of a review. With each of these activities understood, the program team will then make sure appropriate training is provided so that the Security Guardians are able to complete each of the activities. For example, AWS Security Guardians are asked to help develop threat models. To support this, the program team has developed and released training material to teach Security Guardians how to create a threat model.

With the defined behaviors, understand how the Security Guardian and product development team will engage with the security team. Although we’re clearly defining behaviors, the behaviors aren’t typically done in a silo for the successful launch of a secure product. At AWS, the Security Guardians and product developers engage with the security teams in key partnership areas. If you’re unsure of where to start in defining the behaviors of your program, Figure 4 shows an example of how teams interact at AWS, beginning with the creation of an initial threat model and going through review, remediation, and testing. Consider creating your own version of the model to help define the behaviors and key partnership areas at your organization.

Figure 4: Example behaviors and partnership areas at AWS

In the example of a threat model review, the Guardian and the central security team will jointly create and review the threat model. Specific activity examples include reviewing threats that have no documented mitigations and discussing additional threats that haven’t yet been considered.

As part of encouraging a culture of ownership, AWS recommends allowing Security Guardians to influence the role within a set of boundaries. An example of this is allowing the Security Guardians to be a part of recurring reviews of the program growth metrics, actively collecting their feedback, and encouraging them to host their own training sessions. Active Security Guardians are key to the success of the program and allowing them to influence the program will give them a sense of ownership and inclusion.

Maintaining interest

It’s important not to lose sight of the fact that a program like the AWS Security Guardians program is supported by volunteers. Most of your Security Guardians will be product developers who already have a full-time job developing products for your organization. The time and effort to find and onboard new Security Guardians will have a low return on investment if they stop engaging because the program owners didn’t keep them engaged. Keeping Security Guardians is just as important as finding them.

At AWS, we invest time to understand how to build trust with Security Guardians and provide value by working backwards from their wants and needs. Some Security Guardians joined the program to learn new skills and for career growth opportunities. AWS built training programs that were designed for Security Guardians and provide metrics that are used to document their impact to their managers and leaders.

AWS Security Guardians constantly tell us that they value recognition of their contributions by leadership. We work to build mechanisms to continuously surface the great work of our Security Guardians. We also recognize the contributions Security Guardians make through awards, gifts, and other incentives. For example, each quarter, the AWS Security Guardians team sends out a newsletter to senior leaders of the organization. This communication identifies the Guardians within their organization and highlights their contributions, including the number and impact of reviews they’ve completed.

Another way that AWS recognizes the contributions of our Security Guardians is through the Guardians Belt Program. The Guardians Belt Program is designed to recognize Security Guardians for their contributions and support them as they work to advance their security skills and expand their scope of impact. Security Guardians earn Black, Green, Yellow, and White belts with each belt corresponding to significant accomplishments that require consistent commitment to raising the security bar.

To make sure that Security Guardians value the program, your organization should provide and actively facilitate benefits. The benefits must be accessible without requiring additional time or effort from the Security Guardians, promoting immediate and direct gains. Consider the following examples of benefits to maintain Security Guardian interest and support:

  • Specialized training: Workshops, game days, challenges and contests.
  • Impact opportunities: Ability to impact multiple products by working with other teams in the organization, ability to help define patterns, best practices, and automation for the program.
  • Community: Collaborate, connect, share and learn from experts and individuals with similar interests.
  • Ownership opportunities: Ability to accelerate certain steps in the process.
  • Leadership opportunities: Active involvement in recurring program or business reviews.

The best ways to maintain interest are determined by the culture of your organization. What does your organization value the most, and how will the program provide that to your Security Guardians? Sometimes, the best way to answer these questions is to ask your early or potential Security Guardians.

Measuring success

The final step of building a successful Security Guardians program is to measure program success. Measuring success is equivalent to the inspection step from Figure 1. This verifies that your desired outcomes are being achieved and provides a jumping off point for iteration. Measuring success also gives you the opportunity to audit the output or results of the Security Guardians program and perform corrections and improvements.

Earlier in this post, we covered identifying the business problem and creating the vision and measurable goals for your Security Guardians program. Example metrics include:

  • Average time to release features
  • Average number of security issues per team
  • Average time spent by Security Guardians and builders doing security work
  • Percentage of Security Guardians who have taken required and non-required training

Measuring success includes steps to collect feedback and tune the program over time, shown in Figure 5.

Figure 5: Feedback and tuning steps for Security Guardians program.

The cycle to gather feedback and tune the program includes:

  1. Report on metrics
  2. Communicate wins
  3. Measure outcome and cycle time
  4. Identify trends
  5. Review goals

Gathering feedback from Security Guardians is as important as providing feedback to them. One of the ways AWS collects feedback from Security Guardians is through an annual survey that collects feedback on their experiences of the program and tooling. To help both builders and Security Guardians improve over time, our security review tooling captures feedback from security engineers on the inputs from Security Guardians. Combined, the data gathered through these surveys helps our security ownership mechanism reinforce and improve itself over time.

Figure 6 summarizes the steps that you can take to develop your program.

Figure 6: Security Guardians program steps

The broad steps to develop a program include:

  • Set the vision: Set your vision for the program and metrics for success. Get sponsorship from leadership. Choose a name for your program.
  • Choose innovators: Identify innovators who have a passion for security and foster a community with continuous knowledge sharing.
  • Define behaviors: Redefine your RACI (responsible, accountable, consulted, informed) and be clear on expectations from your security advocates.
  • Maintain interest: Provide clear training and learning paths and opportunities for career advancement.
  • Measure success: Gather feedback and measure the program’s effectiveness.

Conclusion

This post and the previous post covered numerous concepts, considerations, and ideas, including:

  • The initial intention of the Security Guardians program is to focus on training developers in product teams. This improves early security-focused design thinking.
  • An alternative approach is to embed or align security engineers directly with product development teams. This can be more effective in organizations where reporting structures and accountability are key considerations.
  • Some organizations draw Security Guardians from all job types. The program can also be used to focus on uplifting developers and broad security culture.
  • You must regularly inspect the outcomes delivered by the Security Guardians program and use the information to make incremental improvements as the program matures.

For additional support building a Security Guardians program, contact your AWS account representative and they will get you in touch with a specialist who can help you develop your program.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Mitch Beaumont

Mitch is a Principal Solutions Architect for Amazon Web Services based in Sydney, Australia. Mitch works with some of Australia’s largest financial services customers, helping them to continually raise the security bar for the products and features that they build and ship. Outside of work, Mitch enjoys spending time with his family, photography, and surfing.
Ana Malhotra

Ana previously worked as a Security Specialist Solutions Architect and was the Healthcare and Life Sciences (HCLS) Security Lead for AWS Industry, based in Seattle, Washington. As a former AWS Application Security Engineer, during her time with AWS Industry, Ana loved talking all things AppSec, including people, process, and technology. In her free time, she enjoys tapping into her creative side with music and dance.

Accelerate Serverless Streamlit App Deployment with Terraform

Post Syndicated from Kevon Mayers original https://aws.amazon.com/blogs/devops/accelerate-serverless-streamlit-app-deployment-with-terraform/

Image depicting the HashiCorp Terraform and Amazon Web Services (AWS) logos. Underneath the AWS logo are AWS service logos for Amazon Elastic Container Service (ECS), AWS CodePipeline, AWS CodeBuild, and Amazon CloudFront

Graphic created by Kevon Mayers.

Introduction

As customers increasingly seek to harness the power of generative AI (GenAI) and machine learning to deliver cutting-edge applications, the need for a flexible, intuitive, and scalable development platform has never been greater. In this landscape, Streamlit has emerged as a standout tool, making it easy for developers to prototype, build, and deploy GenAI-powered apps with minimal friction. It is an open-source Python framework designed to simplify the development of custom web applications for data science, machine learning, and GenAI projects. With Streamlit, developers can quickly transform Python scripts into interactive dashboards, LLM-powered chatbots, and web apps, using just a few lines of code. Its unique combination of simplicity, interactivity, and speed is the perfect complement to the rapid advancements in AI.

When deploying Streamlit applications, customers often face the challenge of ensuring their applications are highly available and can scale to meet a variable amount of demand. To achieve these goals, customers are looking at serverless approaches to deploying their Streamlit apps. With a serverless application, you only pay for the resources required and do not have to worry about managing servers or capacity planning.

In this post, we will walk you through deploying containerized, serverless Streamlit applications automatically via HashiCorp Terraform, an Infrastructure as Code (IaC) tool that enables users to define and provision infrastructure across cloud platforms.

Solution Overview

For this solution, we have the Streamlit app running on an Amazon Elastic Container Service (ECS) cluster across multiple availability zones (AZs), using AWS Fargate to manage the compute. Fargate is a serverless, pay-as-you-go compute engine that lets you focus on building apps without managing servers. Using Fargate helps reduce the undifferentiated heavy lifting that can come with building and maintaining web applications. It is also often desirable to use a Content Delivery Network (CDN) to ensure low latency for users globally by caching the content at edge locations closer to where the users are geographically located.

Let’s zoom in on the two architectures – the Streamlit App hosting architecture, and the Streamlit App deployment pipeline.

Streamlit app hosting

Image depicting the AWS data flow architecture for the solution. The architecture shows an Amazon Elastic Container Service (ECS) cluster that spans across two availability zones. Within each availability zone are a public and private subnet. A NAT gateway is within the public subnet, and an ECS Cluster with AWS Fargate deployment type is in the private subnet. An Internet Gateway (IGW) is used to allow traffic to flow through the NAT Gateway out to the internet. An Application Load Balancer (ALB) is used to distribute the load to the ECS cluster. Amazon CloudFront is used as the content delivery network (CDN).

In the above architecture, the following flow applies:

  1. Users access the Streamlit App using the public DNS endpoint for an Amazon CloudFront distribution.
  2. Using an Internet Gateway (IGW), user requests are routed to a public-facing Application Load Balancer (ALB).
  3. This ALB has target groups which map to ECS task nodes that are part of an ECS cluster running in two AZs (us-east-1a and us-east-1b in this example).
  4. Fargate will automatically scale the underlying compute nodes in the ECS cluster based on the demand.

Streamlit app deployment pipeline

Image depicting the Streamlit app deployment pipeline architecture. Within it, a developer uploads a .zip file called streamlit-app-assets.zip to an Amazon S3 Bucket. This upload event is processed by Amazon EventBridge, which in turn invokes an AWS CodePipeline to run. Related artifacts are stored in a connected CodePipeline S3 bucket. CodePipeline orchestrates an AWS CodeBuild project that creates a new Docker image using the .zip file that was uploaded, and stores it in an Amazon Elastic Container Registry (ECR) repository. This image upload triggers a new Amazon Elastic Container Service (ECS) deployment. Terraform then creates an Amazon CloudFront invalidation to serve the new version of the application to customers.

In the above architecture, the following flow applies:

  1. User develops a local Streamlit App and defines the path of these assets in the module configuration, then runs terraform apply to generate a local .zip file comprised of the Streamlit App directory, and upload this to an Amazon S3 bucket (Streamlit Assets) with versioning enabled, which is configured to trigger the Streamlit CI/CD pipeline to run.
  2. AWS CodePipeline (Streamlit CI/CD pipeline) begins running. The pipeline copies the .zip file from the Streamlit Assets S3 Bucket, stores the contents in a connected CodePipeline Artifacts S3 bucket, and passes the asset to the AWS CodeBuild project that is also part of the pipeline.
  3. CodeBuild (Streamlit CodeBuild Project) configures a compute/build environment and fetches a Python Docker Image from a public Amazon ECR repository. CodeBuild uses Docker to build a new Streamlit App image based on what is defined in the Dockerfile within the .zip file, and pushes the new image to a private ECR repository. It tags the image with latest, an app_version (user-defined in Terraform), as well as the S3 Version ID of the .zip file and pushes the image to ECR.
  4. ECS has a task definition that references the image in ECR based on the S3 Version ID tag which will always be a unique value, as it is generated whenever a new version of the file is created. This also serves as data lineage so versions of the Streamlit App .zip files in S3 can be linked to versions of the image stored in ECR. Once a new image is pushed to ECR (with a unique image tag), the task definition is updated and the ECS service begins a new deployment using the new version of the Streamlit App.
  5. When a new image is pushed to ECR, the Terraform Module is configured to use the local-exec provisioner to run an AWS CLI command that creates a CloudFront invalidation. This enables users of the Streamlit app to use the new version without waiting for the time-to-live (TTL) of the cached file to expire on the edge locations (default is 24 hours).

Both of these pipelines are built and packaged into a Terraform module that can be reused efficiently with only a few lines of code.

Prerequisites

This solution requires the following prerequisites:

  • An AWS account. If you don’t have an account, you can sign up for one.
  • Terraform v1.0.0 or newer installed.
  • Python v3.8 or newer installed.
  • A Streamlit app. If you don’t have a Streamlit project already, you can download this app directory as a sample Streamlit app for this post and save it to a local folder.

Your folder structure will look something like this:

terraform_streamlit_folder
├── README.md
└── app                 # Streamlit app directory
    ├── home.py         # Streamlit app entry point
    ├── Dockerfile      # Dockerfile
    └── pages/          # Streamlit pages

Create and initialize a Terraform project

In the same folder where you have your Streamlit app saved (terraform_streamlit_folder in the above example), you will create and initialize a new Terraform project.

  1. In your preferred terminal, create a new file named main.tf by running the following command on Unix/Linux machines, or an equivalent command on Windows machines:
    touch main.tf
  2. Open up the main.tf file and add the following code to it:
    module "serverless-streamlit-app" {
      source          = "aws-ia/serverless-streamlit-app/aws"
      app_name        = "streamlit-app"
      app_version     = "v1.1.0" 
      path_to_app_dir = "./app" # Replace with path to your app
    }

    This code utilizes a module block with a source pointing to the Terraform module, and the appropriate input variables passed in. When Terraform encounters a module block, it loads and processes that module’s configuration files using the source. The Serverless Streamlit App Terraform module has many optional input variables. If you have existing resources, such as an existing VPC, subnets, and security groups that you’d like to reuse instead of deploying new ones, you can use the module’s input variables to reference your existing resources. However, in this post, we’re deploying all of the resources in the above architecture from scratch. Here, we simply define the source that references the module hosted in the Terraform Registry, provide an app_name that will be used as a prefix for naming your resources, the app_version that is used for tracking changes to your app, and the path_to_app_dir which is the path to the local directory where the assets for your Streamlit app are stored.

  3. Save the file.
  4. To initialize the Terraform working directory, run the following command in your terminal:
    terraform init

    The output will contain a success message like the following:

    "Terraform has been successfully initialized"

Output the CloudFront URL

To easily access the CloudFront URL of the deployed Streamlit application, you can add the URL as a Terraform output.

  1. In your terminal, create a new file named outputs.tf by running the following command on Unix/Linux machines, or an equivalent command on Windows machines:
    touch outputs.tf
  2. Open up the outputs.tf file and add the following code to it:
    output "streamlit_cloudfront_distribution_url" {
      value = module.serverless-streamlit-app.streamlit_cloudfront_distribution_url
    }
  3. Save the file.
    Now, your folder structure will look like:

    terraform_streamlit_folder
    ├── README.md
    ├── app                 # Streamlit app directory
    │   ├── home.py         # Streamlit app entry point
    │   ├── Dockerfile      # Dockerfile
    │   └── pages/          # Streamlit pages
    │     
    ├── main.tf             # Terraform Code (where you call the module) 
    └── outputs.tf          # Outputs definition

Deploy the solution

Now you can use Terraform to deploy the resources defined in your main.tf file.

  1. In your terminal, run the following command to deploy the infrastructure. This includes the hosting for your Streamlit application using ECS and CloudFront, as well as the pipeline that is used to push updates.
    terraform apply

    When the apply command finishes running, you’ll see the Terraform outputs displayed in the terminal.

  2. Navigate to the streamlit_cloudfront_distribution_url to see your Streamlit application that is hosted on AWS.
  3. When you make changes to your Streamlit codebase, you can go ahead and re-run terraform apply to push your new changes to your cloud environment.

When updating the Streamlit codebase, the CodePipeline and CodeBuild processes kick off to automatically update your new changes, which get reflected on your Streamlit application. CodePipeline automates the entire software release process, managing stages like source retrieval, building, testing, and deployment. It integrates with AWS services and third-party tools (such as GitHub and Jenkins) to enhance automation, speed, and security. CodeBuild focuses on automating code compilation, testing, and packaging, supporting multiple languages and custom Docker environments, while integrating with CodePipeline for scalable, secure builds. With this CI/CD pipeline, when you make changes to your code, all you need to run is terraform apply to update your cloud environment. For an example buildspec, see the example in the repo.

You can find full examples of deploying the infrastructure with and without existing resources in the GitHub repository.

Clean up

When you no longer need the resources deployed in this post, you can clean up the resources by using the Terraform destroy command. Simply run terraform destroy. This will remove all of the resources you have deployed in this post with Terraform.

Conclusion

Building serverless Streamlit applications with Terraform on AWS offers a powerful combination of scalability, efficiency, and automation. As you continue to build and refine your Streamlit applications, Terraform’s flexibility ensures that your infrastructure can evolve seamlessly, supporting rapid innovation and agile development. With Streamlit and Terraform, you have the tools to create dynamic, serverless applications that scale effortlessly and operate reliably in the cloud.

Authors

Image depicting Kevon Mayers, a Solutions Architect at AWS

Kevon Mayers

Kevon Mayers is a Solutions Architect at AWS. Kevon is a Terraform Contributor and has led multiple Terraform initiatives within AWS. Prior to joining AWS he was working as a DevOps Engineer and Developer, and before that was working with the GRAMMYs/The Recording Academy as a Studio Manager, Music Producer, and Audio Engineer. He also owns a professional production company, MM Productions.

Image depicting Alexa Perlov, a Prototyping Architect at AWS

Alexa Perlov

Alexa Perlov is a Prototyping Architect with the Prototyping Acceleration team at AWS. She helps customers build with emerging technologies by open sourcing repeatable projects. She is currently based out of Pittsburgh, PA.

Image depicting Shravani Malipeddi, a Solutions Architect at AWS

Shravani Malipeddi

Shravani Malipeddi is a Solutions Architect at AWS who came out of the TechU Program. She currently supports strategic accounts and is based out of San Francisco, CA.

Securing communications at the edge with AWS Wickr

Post Syndicated from Erik Iwanski original https://aws.amazon.com/blogs/messaging-and-targeting/securing-communications-at-the-edge-with-aws-wickr/

Organizations that are looking to establish secure communication networks at the edge often encounter challenges. The use of disparate collaboration tools on personal and government-issued devices can make it difficult to protect sensitive data and avoid communication gaps.

This blog post highlights four common communication issues that customers encounter when operating in disconnected (or intermittently connected) environments, and how end-to-end encrypted messaging and collaboration service AWS Wickr can help you address them.

Issue 1: Seamless communication—multiple agencies and partners need to collaborate effectively.

Federal, state, and local organizations tend to use different means and mechanisms to communicate both internally and externally with third parties, which often leads to interoperability challenges. They need to seamlessly coordinate and connect with mission partners—including government agencies, military teams, medical professionals, and first responders—even in disconnected environments in order to work together effectively.

Issue 2: Out-of-band communication—teams need a way to ensure that communication is possible when primary channels are down or compromised.

Network disruptions, security events, and system failures can impact communication channels. The use of a separate, secure, out-of-band communication tool that can be used as a backup when primary channels are unavailable or compromised is critical to protecting sensitive information, maintaining business continuity, and coordinating incident response activities.

Issue 3: Data retention—messages and files need to be retained to help meet recordkeeping requirements, and facilitate after-action reports.

Virtually all federal, state, and local government agencies must adhere to various data retention and records management policies, regulations, and laws. Many are subject to Federal Records Act (FRA) and National Archives and Records Administration (NARA) regulations that require them to collect, store, and manage federal records that are created, received, and used in daily operations. For those subject to Freedom of Information Act (FOIA) requests and U.S. Department of Defense (DOD) Instruction 8170.01—which prescribes procedures for the collection, distribution, storage, and processing of DOD information through electronic messaging services—effectively retaining messages is about more than supporting security and compliance; it’s about maintaining public trust.

Issue 4: Security and control—communications must be adequately protected and administrative control must be maintained, no matter the environment.

The transmission of sensitive and mission-critical data through messaging apps and collaboration tools that lack critical encryption and security protocols increases the likelihood of a security incident. Popular consumer messaging apps don’t provide controls that allow for individual devices or accounts to be suspended or removed, increasing the threat of data exposure stemming from a lost or stolen device. Enterprise collaboration apps lack the advanced security provided by end-to-end encryption.

How AWS Wickr can help

AWS Wickr is a secure messaging and collaboration service that protects one-to-one and group messaging, voice and video calling, file sharing, screen sharing, and location sharing with 256-bit encryption.

Wickr combines the security and privacy of end-to-end encryption with the data retention and administrative controls you need to accelerate collaboration, even in disconnected environments.

Wickr provides the following capabilities to help you address common communication challenges:

  • Seamless communication: Federation and guest access features allow you to exchange sensitive information with mission partners, without the need to connect to a virtual private network (VPN). You can assign groups of users to specific federation rules, restrict access to select agencies and partners, and allow or disable the guest user access feature for individual security groups.
  • Out-of-band communication: Wickr provides a communication channel outside of existing systems that can help you keep teams connected and protect sensitive information, even when primary channels are down or compromised. The user interface is intuitive; response teams can simply open the application on their device and start collaborating, without special software or training.
  • Data retention: Wickr network administrators can configure and apply data retention to both internal and external communications in a Wickr network. This includes conversations with guest users, external teams, and other partner networks, so you can retain messages and files sent to and from the organization to help meet requirements. Data retention is implemented as an always-on recipient that is added to conversations, similar to the blind carbon copy (BCC) feature in email. The data retention process can run anywhere Docker workloads are supported: on-premises, on an Amazon Elastic Compute Cloud (Amazon EC2) virtual machine, or at a location of your choice.
  • Security and control: With Wickr, each message gets a unique Advanced Encryption Standard (AES) private encryption key, and a unique Elliptic-curve Diffie–Hellman (ECDH) public key to negotiate the key exchange with recipients. Message content—including text, files, audio, or video—is encrypted on the sending device using the message-specific AES key. This key is then exchanged via the ECDH key exchange mechanism so that no one other than intended recipients can decrypt the content (not even AWS). Fine-grained administrative controls allow you to organize users into security groups with restricted access to features and content at their level. You can apply policies to each group that are custom-tailored to meet desired outcomes. Wickr app data can be deleted remotely by both administrators and end users.

Communicating at the edge

Wickr is available in two deployment models: cloud-native AWS Wickr and AWS WickrGov, which are available through the AWS Management Console, and self-hosted Wickr Enterprise. Wickr Enterprise offers the same secure collaboration features as AWS Wickr and AWS WickrGov, but can be self-hosted on any private on-premises infrastructure (such as an AWS Outpost or Snowball Edge device), private cloud infrastructure, or in a multi-cloud deployment. Wickr Enterprise can maintain secure communications when internet access (via broadband, mobile, 5G, or satellite) to cloud-based networks fails. You can run Wickr Enterprise without any internet connectivity and it supports architectural resiliency, such as deploying a fully managed network backhaul that is capable of federating with AWS Wickr users when internet connectivity is available.

Figure 1 illustrates a hybrid architecture that combines AWS Wickr and Wickr Enterprise. The Snowball Edge device running Wickr allows disconnected communications at the edge between Wickr Enterprise users. When internet connectivity becomes available, Wickr Enterprise users can federate with AWS Wickr users and send data retention logs to Amazon S3 or any customer-defined storage.

Figure 1: Hybrid of Wickr Enterprise self-hosted on Snowball Edge and AWS Wickr in the Cloud. A hybrid solution federates AWS Wickr in the cloud with a local deployment of Wickr Enterprise for extended resilience and redundancy.

Collaborate with confidence

Securing communications at the edge is critical to protecting sensitive data and maintaining operational resilience. AWS Wickr offers a secure, simple-to-use, reliable solution that can help you address common challenges and collaborate effectively, even in the harshest environments. By choosing the features and deployment options that meet your needs, you can facilitate secure and compliant communications everywhere, and seamlessly collaborate with mission partners.

AWS Wickr has been authorized for Department of Defense Cloud Computing Security Requirements Guide Impact Level 4 and 5 (DoD CC SRG IL4 and IL5) in the AWS GovCloud (US-West) Region. It is also Federal Risk and Authorization Management Program (FedRAMP) authorized at the Moderate impact level in the AWS US East (N. Virginia) Region, FedRAMP High authorized in the AWS GovCloud (US-West) Region, and meets compliance programs and standards such as Health Insurance Portability and Accountability Act (HIPAA) eligibility, International Organization for Standardization (ISO) 27001, and System and Organization Controls (SOC) 1, 2, and 3.

For more information, please visit the AWS Wickr webpage, or email [email protected].

About the Authors

Erik Iwanski

Erik is a Principal Worldwide Go-to-Market (GTM) Specialist for Amazon Web Services (AWS) and is based in Montana. He focuses on global customers and leads the global GTM plan for AWS Wickr. Erik has 15-plus years of experience working across industries from national security, federal/SLED sales, healthcare, and technology. He holds a master’s degree in microbiology from California State University Long Beach and a bachelor’s degree in Biological Sciences from the University of California Irvine.
Anne Grahn

Anne is a Senior Worldwide Security GTM Specialist at AWS, based in Chicago. She has more than 13 years of experience in the security industry, and focuses on effectively communicating cybersecurity risk. She maintains a Certified Information Systems Security Professional (CISSP) certification.

 

Customer compliance and security during the post-quantum cryptographic migration

Post Syndicated from Panos Kampanakis original https://aws.amazon.com/blogs/security/customer-compliance-and-security-during-the-post-quantum-cryptographic-migration/

Amazon Web Services (AWS) prioritizes the security, privacy, and performance of its services. AWS is responsible for the security of the cloud and the services it offers, and customers own the security of the hosts, applications, and services they deploy in the cloud. AWS has also been introducing quantum-resistant key exchange in common transport protocols used by our customers in order to provide long-term confidentiality. In this blog post, we elaborate how customer compliance and security configuration responsibility will operate in the post-quantum migration of secure connections to the cloud. We explain how customers are responsible for enabling quantum-resistant algorithms or having these algorithms enabled by default in their applications that connect to AWS. We also discuss how AWS will honor and choose these algorithms (if they are supported on the server side) even if that means the introduction of a small delay to the connection.

Secure connectivity

Security and compliance is a shared responsibility between AWS and the customer. This Shared Responsibility Model can help relieve the customer’s operational burden as AWS operates, manages, and controls the components from the host operating system and virtualization layer down to the physical security of the facilities in which the service operates. The customer assumes responsibility and management of the guest operating system and other associated application software, as well as the configuration of the AWS provided security group firewall. AWS has released Customer Compliance Guides (CCGs) to support customers, partners, and auditors in their understanding of how compliance requirements from leading frameworks map to AWS service security recommendations.

In the context of secure connectivity, AWS makes available secure algorithms in encryption protocols (for example, TLS, SSH, and VPN) for customers that connect to its services. That way AWS is responsible for enabling and prioritizing modern cryptography in connections to the AWS Cloud. Customers, on the other hand, use clients that enable such algorithms and negotiate cryptographic ciphers when connecting to AWS. It is the responsibility of the customer to configure or use clients that only negotiate the algorithms the customer prefers and trusts when connecting.

Prioritizing quantum-resistance or performance?

AWS has been in the process of migrating to post-quantum cryptography in network connections to AWS services. New cryptographic algorithms are designed to protect against a future cryptanalytically relevant quantum computer (CRQC) which could threaten the algorithms we use today. Post-quantum cryptography involves introducing post-quantum (PQ) hybrid key exchanges in protocols like TLS 1.3 or SSH/SFTP. Because both classical and PQ-hybrid exchanges need to be supported for backwards compatibility, AWS will prioritize PQ-hybrid exchanges for clients that support it and classical for clients that have not been upgraded yet. We don’t want to switch a client to classical if it advertises support for PQ.

PQ-hybrid key establishment leverages quantum-resistant key encapsulation mechanisms (KEMs) used in conjunction with classical key exchange. The client and server still do an ECDH key exchange, which gets combined with the KEM shared secret when deriving the symmetric key. For example, clients could perform an ECDH key exchange with curve P256 and post-quantum Kyber-768 from NIST’s PQC Project Round 3 (TLS group identifier X25519Kyber768Draft00) when connecting to AWS Certificate Manager (ACM), AWS Key Management Service (AWS KMS), and AWS Secrets Manager. This strategy combines the high assurance of a classical key exchange with the quantum-resistance of the proposed post-quantum key exchanges, to help ensure that the handshakes are protected as long as the ECDH or the post-quantum shared secret cannot be broken. The introduction of the ML-KEM algorithm adds more data (2.3 KB) to be transferred and slightly more processing overhead. The processing overhead is comparable to the existing ECDH algorithm, which has been used in most TLS connections for years. As shown in the following table, the total overhead of hybrid key exchanges has been shown to be immaterial in typical handshakes over the Internet. (Sources: Blog posts How to tune TLS for hybrid post-quantum cryptography with Kyber and The state of the post-quantum Internet)

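| Key exchange | Data transferred (bytes) | CPU processing, client (thousand ops/sec) | CPU processing, server (thousand ops/sec) |
|---|---|---|---|
| ECDH with P256 | 128 | 17 | 17 |
| X25519 | 64 | 31 | 31 |
| ML-KEM-768 | 2,272 | 13 | 25 |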

The new key exchanges introduce some unique conceptual choices that we didn’t have before, which could lead to the peers negotiating classical-only algorithms. In the past, our cryptographic protocol configurations involved algorithms that were widely trusted to be secure. The client and server configured a priority for their algorithms of choice and they picked the more appropriate ones from their negotiated prioritized order. Now, the industry has two families of algorithms, the “trusted classical” and the “trusted post-quantum” algorithms. Given that a CRQC is not available, both classical and post-quantum algorithms are considered secure. Thus, there is a paradigm shift that calls for a decision in the priority vendors should enforce on the client and server configurations regarding the “secure classical” or “secure post-quantum” algorithms.

Figure 1 shows a typical PQ-hybrid key exchange in TLS.

Figure 1: A typical TLS 1.3 handshake


In the example in Figure 1, the client advertises support for PQ-hybrid algorithms with ECDH curve P256 and quantum-resistant ML-KEM-768, ECDH curve P256 and quantum-resistant Kyber-512 Round 3, and classical ECDH with P256. The client also sends a Keyshare value for classical ECDH with P256 and for PQ-hybrid P256+MLKEM768. The Keyshare values include the client’s public keys. The client does not include a Keyshare for P256+Kyber512, because that would increase the size of the ClientHello unnecessarily and because ML-KEM-768 is the ratified version of Kyber Round 3, and so the client chose to only generate and send a P256+MLKEM768 public key. Now let’s say that the server supports ECDH curve P256 and PQ-hybrid P256+Kyber512, but not P256+MLKEM768. Given the groups and the Keyshare values the client included in the ClientHello, the server has the following two options:

  1. Use the client P256 Keyshare to negotiate a classical key exchange, as shown in Figure 1. Although one might assume that the P256+Kyber512 Keyshare could have been used for a quantum-resistant key exchange, the server can choose to negotiate only a classical ECDH key exchange with P256, which is not resistant to a CRQC.
  2. Send a Hello Retry Request (HRR) to tell the client to send a PQ-hybrid Keyshare for P256+Kyber512 in a new ClientHello (Figure 2). This introduces a round trip, but it also forces the peers to negotiate a quantum-resistant symmetric key.

Note: A round-trip could take 30-50 ms in typical Internet connections.

Previously, some servers were using the Keyshare value to pick the key exchange algorithm (option 1 above). This generally allowed for faster TLS 1.3 handshakes that did not require an extra round-trip (HRR), but in the post-quantum scenario described earlier, it would mean the server does not negotiate a quantum-resistant algorithm even though both peers support it.

Such scenarios could arise in cases where the client and server don’t deploy the same version of a new algorithm at the same time. In the example in Figure 1, the server could have been an early adopter of the post-quantum algorithm and added support for P256+Kyber512 Round 3. The client could subsequently have upgraded to the ratified post-quantum algorithm with ML-KEM (P256+MLKEM768). AWS doesn’t always control both the client and the server. Some AWS services have adopted the earlier versions of Kyber and others will deploy ML-KEM-768 from the start. Thus, such scenarios could arise while AWS is in the post-quantum migration phase.

Note: In these cases, there won’t be a connection failure; the side-effect is that the connection will use classical-only algorithms although it could have negotiated PQ-hybrid.

These intricacies are not specific to AWS. Other industry peers have been thinking about these issues, and they have been a topic of discussion in the Internet Engineering Task Force (IETF) TLS Working Group. The issue of potentially negotiating a classical key exchange although the client and server support quantum-resistant ones is discussed in the Security Considerations of the TLS Key Share Prediction draft (draft-davidben-tls-key-share-prediction). To address some of these concerns, the Transport Layer Security (TLS) Protocol Version 1.3 draft (draft-ietf-tls-rfc8446bis), which is the draft update of TLS 1.3 (RFC 8446), introduces text about client and server behavior when choosing key exchange groups and the use of Keyshare values in Section 4.2.8. The TLS Key Share Prediction draft also tries to address the issue by providing DNS as a mechanism for the client to use a proper Keyshare that the server supports.

Prioritizing quantum resistance

In a typical TLS 1.3 handshake, the ClientHello includes the client’s key exchange algorithm order of preferences. Upon receiving the ClientHello, the server responds by picking the algorithms based on its preferences.

Figure 2 shows how a server can send a HelloRetryRequest (HRR) to the client in the previous scenario (Figure 1) in order to request the negotiation of quantum-resistant keys by using P256+Kyber512. This approach introduces an extra round trip to the handshake.

Figure 2: An HRR from the server to request the negotiation of mutually supported quantum-resistant keys with the client


AWS services that terminate TLS 1.3 connections will take this approach. They will prioritize quantum resistance for clients that advertise support for it. If the AWS service has added quantum-resistant algorithms, it will honor a client-supported post-quantum key exchange even if that means that the handshake will take an extra round trip and the PQ-hybrid key exchange will include minor processing overhead (ML-KEM is almost as performant as ECDH). A typical round trip in regionalized TLS connections today is usually under 50 ms and won’t have a material impact on connection performance. In the post-quantum transition, we consider clients that advertise support for quantum-resistant key exchange to be clients that take the CRQC risk seriously. Thus, the AWS server will honor that preference if the server supports the algorithm.
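The difference between the two selection strategies can be modeled in a few lines. The following Python sketch is an added example, not s2n-tls code or code from the original post; the ClientHello contents are constructed, and the group names are treated as opaque labels taken from the text.

```python
"""Model of TLS 1.3 server-side key exchange group selection (added example)."""
from dataclasses import dataclass

# Group labels as written in the post, treated here as opaque strings.
PQ_HYBRID = frozenset({"P256+MLKEM768", "P256+Kyber512"})


@dataclass(frozen=True)
class ClientHello:
    supported_groups: tuple  # every group the client can do, in its preference order
    key_shares: tuple        # subset for which a Keyshare value was actually sent


@dataclass(frozen=True)
class Selection:
    group: str
    hello_retry_request: bool  # True means one extra round trip

    @property
    def quantum_resistant(self):
        return self.group in PQ_HYBRID


def mutual_groups(hello, server_groups):
    """Groups both peers support, in the server's preference order."""
    return [g for g in server_groups if g in hello.supported_groups]


def select_keyshare_first(hello, server_groups):
    """Older behavior: take a group the client already sent a Keyshare for."""
    mutual = mutual_groups(hello, server_groups)
    for group in mutual:
        if group in hello.key_shares:
            return Selection(group, hello_retry_request=False)
    # No usable Keyshare: an HRR is the only way to continue.
    return Selection(mutual[0], hello_retry_request=True) if mutual else None


def select_pq_priority(hello, server_groups):
    """Behavior described in the post: honor mutual PQ-hybrid support, even via HRR."""
    pq = [g for g in mutual_groups(hello, server_groups) if g in PQ_HYBRID]
    for group in pq:
        if group in hello.key_shares:
            return Selection(group, hello_retry_request=False)
    if pq:
        return Selection(pq[0], hello_retry_request=True)
    # No common PQ-hybrid group: classical selection applies.
    return select_keyshare_first(hello, server_groups)


# Constructed ClientHello for the mismatch the post describes: the client moved
# to ML-KEM and sends Keyshares for it and for P256, while still listing Kyber512.
UPGRADED_CLIENT = ClientHello(
    supported_groups=("P256+MLKEM768", "P256+Kyber512", "P256"),
    key_shares=("P256+MLKEM768", "P256"),
)
EARLY_ADOPTER_SERVER = ("P256+Kyber512", "P256")

if __name__ == "__main__":
    for name, select in (("keyshare-first", select_keyshare_first),
                         ("pq-priority", select_pq_priority)):
        s = select(UPGRADED_CLIENT, EARLY_ADOPTER_SERVER)
        print(f"{name}: group={s.group} hrr={s.hello_retry_request} pq={s.quantum_resistant}")
```

With the keyshare-first strategy the constructed handshake completes without an HRR but on classical P256; with PQ priority it costs one HRR and lands on the mutually supported hybrid group.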

Pull Request 4526 introduces this behavior in s2n-tls, the AWS open source, efficient TLS library built over other crypto libraries like OpenSSL libcrypto or AWS libcrypto (AWS-LC). When built with s2n-tls, s2n-quic handshakes will also inherit the same behavior. s2n-quic is the AWS open source Rust implementation of the QUIC protocol.

What AWS customers can do to verify post-quantum key exchanges

AWS services that have already adopted the behavior described in this post include AWS KMS, ACM, and Secrets Manager TLS endpoints, which have been supporting post-quantum hybrid key exchange for a few years already. Other endpoints that will deploy quantum-resistant algorithms will inherit the same behavior.

AWS customers that want to take advantage of new quantum-resistant algorithms introduced in AWS services are expected to enable them on the client side or the server side of a customer-managed endpoint. For example, if you are using the AWS Common Runtime (CRT) HTTP client in the AWS SDK for Java v2, you would need to enable post-quantum hybrid TLS key exchanges with the following.

SdkAsyncHttpClient awsCrtHttpClient = AwsCrtAsyncHttpClient.builder()
            .postQuantumTlsEnabled(true)
            .build();

The AWS KMS and Secrets Manager documentation includes more details for using the AWS SDK to make HTTP API calls over quantum-resistant connections to AWS endpoints that support post-quantum TLS.

To confirm that a server endpoint properly prioritizes and enforces the PQ algorithms, you can use an “old” client that sends a PQ-hybrid Keyshare value that the PQ-enabled server does not support. For example, you could use s2n-tls built with AWS-LC (which supports the quantum-resistant KEMs). You could use a client TLS policy (PQ-TLS-1-3-2023-06-01) that is newer than the server’s policy (PQ-TLS-1-0-2021-05-24). That will lead the server to send an HRR asking the client for a new ClientHello that includes P256+MLKEM768, as shown following.

./bin/s2nd -c PQ-TLS-1-0-2021-05-24 localhost 4444
sudo tcpdump port 4444 -w hrr-capture.pcap
./bin/s2nc localhost 4444 -c PQ-TLS-1-3-2023-06-01 -i

The hrr-capture.pcap packet capture will show the negotiation and the HRR from the server.

To confirm that a server endpoint properly implements the post-quantum hybrid key exchanges, you can use a modern client that supports the key exchange and connect against the endpoint. For example, using the s2n-tls client built with AWS-LC (which supports the quantum-resistant KEMs), you could try connecting to a Secrets Manager endpoint by using a post-quantum TLS policy (for example, PQ-TLS-1-2-2023-12-15) and observe the PQ hybrid key exchange used in the output, as shown following.

./bin/s2nc -c PQ-TLS-1-2-2023-12-15 secretsmanager.us-east-1.amazonaws.com 443
CONNECTED:
Handshake: NEGOTIATED|FULL_HANDSHAKE|MIDDLEBOX_COMPAT
Client hello version: 33
Client protocol version: 34
Server protocol version: 34
Actual protocol version: 34
Server name: secretsmanager.us-east-1.amazonaws.com
Curve: NONE
KEM: NONE
KEM Group: SecP256r1Kyber768Draft00
Cipher negotiated: TLS_AES_128_GCM_SHA256
Server signature negotiated: RSA-PSS-RSAE+SHA256
Early Data status: NOT REQUESTED
Wire bytes in: 6699
Wire bytes out: 1674
s2n is ready
Connected to secretsmanager.us-east-1.amazonaws.com:443

An alternative would be using the Open Quantum Safe (OQS) for OpenSSL client to do the same.

As another example, if you want to transfer a file over a quantum-resistant SFTP connection with AWS Transfer Family, you would need to configure a PQ cryptography SSH security policy on your AWS File Transfer SFTP endpoint (for example, TransferSecurityPolicy-2024-01) and enable quantum-resistant SSH key exchange in the SFTP client. Note that in SSH/SFTP, although the AWS server side will advertise the quantum-resistant schemes as higher priority, the client picks the key exchange algorithm. So, a client that supports PQ cryptography would need to have the PQ algorithms configured with higher priority (as described in the Shared Responsibility Model). For more details, refer to the AWS Transfer Family documentation.

Conclusion

Cryptographic migrations can introduce intricacies to cryptographic negotiations between clients and servers. During the migration phase, AWS services will mitigate the risks of these intricacies by prioritizing post-quantum algorithms for customers that advertise support for these algorithms—even if that means a small slowdown in the initial negotiation phase. While in the post-quantum migration phase, customers who choose to enable quantum resistance have made a choice which shows that they consider the CRQC risk as important. To mitigate this risk, AWS will honor the customer’s choice, assuming that quantum resistance is supported on the server side.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support. For more details regarding AWS PQC efforts, refer to our PQC page.
 

Panos Kampanakis

Panos is a Principal Security Engineer at AWS. He has experience with cybersecurity, applied cryptography, security automation, and vulnerability management. He has coauthored publications on cybersecurity and participated in various security standards bodies to provide common interoperable protocols and languages for security information sharing, cryptography, and public-key infrastructure. Currently, he works with engineers and industry standards partners to provide cryptographically secure tools, protocols, and standards.
Alex Weibel

Alex is a Senior Software Development Engineer on the AWS Crypto Algorithms team. He’s a contributor to Amazon’s TLS Library s2n-tls, the Amazon Corretto Crypto Provider (ACCP), and AWS LibCrypto. Previously, Alex worked on TLS termination and request proxying for S3 and the Elastic Load Balancing Service, developing new features for customers. Alex holds a Bachelor of Science degree in Computer Science from the University of Texas at Austin.

Accelerate application upgrades with Amazon Q Developer agent for code transformation

Post Syndicated from Jonathan Vogel original https://aws.amazon.com/blogs/devops/accelerate-application-upgrades-with-amazon-q-developer-agent-for-code-transformation/

In this blog, we will explore how Amazon Q Developer Agent for code transformation accelerates Java application upgrades. We will examine the benefits of this Generative AI-powered agent and outline strategies to achieve maximal acceleration, drawing from real-world success stories and best practices.

Benefits of using Amazon Q Developer to upgrade your applications

Amazon Q Developer addresses a critical challenge for organizations managing numerous Java applications, particularly as they face the approaching end of Long-Term-Support (LTS) for older Java versions. Upgrading to Java 17 enhances security, resolves vulnerabilities, and improves performance while ensuring long-term compatibility and access to modern features. Currently, Q Developer agent for code transformation supports upgrades from Java 8 and 11 to Java 17. Software developers can utilize Q Developer within their IDE (VS Code and JetBrains) to transform both single-module and multi-module applications. Q Developer will generate a plan that identifies necessary library upgrades and replacements for deprecated code in the application, proposing code changes with the goal of ensuring the transformed code compiles successfully in Java 17. Q Developer can significantly enhance the efficiency of your migration workflow, performing code transformations on applications in hours rather than weeks.

Customer success of using Q Developer to modernize legacy Java applications

Customers have used Q Developer to upgrade their Java applications successfully. Here is how two customers as well as Amazon internal teams use Q Developer to accelerate the migration process.

A large insurance company in North America strategically approached their Java upgrade initiative by identifying applications with dependencies that Q Developer could upgrade effectively. They focused on applications that rely on frameworks like Spring Boot, which can be time-consuming to upgrade manually. After leveraging Q Developer to transform 4 applications in pilot, they estimated a 36% acceleration in their upgrade process, indicating that Q Developer automatically completed over a third of the work that would have been required manually. While the remaining portion still necessitated manual intervention to ensure the code would build and run correctly, the effort acceleration was significant.

A major financial services firm’s experience with Q Developer proved equally compelling. In a focused two-day workshop, 20 developers successfully transformed 20 applications in production using the Amazon Q Developer agent. This resulted in 42% time savings using Q Developer compared to manual upgrade, saving on average 24 hours per application. They spent about 3 weeks preparing for the transformation workshop. They identified first-party (1P) dependencies—internal libraries that other production applications rely on. Q Developer does not guarantee upgrade of 1P dependencies. With a combination of Q Developer and manual work, the customer upgraded many of these common 1P dependencies leading up to the workshop. This step was crucial to gain maximum acceleration while using Q Developer for the upgrades.

Amazon uses Q Developer internally to upgrade Java applications following company-wide campaigns. The central team that owns the campaigns provides detailed guidance on which Java applications can be upgraded with Q Developer most effectively. This team also manages Amazon’s internal build system and provides tooling to automate part of the manual efforts. They are able to achieve significant savings: Amazon was able to upgrade more than 50% of production applications in six months, and 79% of the auto-generated code reviews were applied without additional changes.

Use Q Developer to upgrade your applications

To ensure that Q Developer is properly applied to the specific characteristics of their codebases, customers create and follow a transformation approach. Teams and individuals who understand the scope of the upgrade run campaigns across the company to effectively utilize Q Developer. To maximize the acceleration from Q Developer, these teams classify the applications which need to be upgraded, identify which ones can be upgraded using Q Developer, and estimate the manual effort required. That estimate provides a baseline to measure the value added by Q Developer agent for code transformation. The preparation phase is crucial before starting the execution phase of the upgrade. Each of the steps in the preparation phase plays an important role in maximizing the acceleration of Amazon Q in their upgrade processes.

  1. Classifying the applications to upgrade: Q Developer supports the upgrade of the 30 most common Java libraries. Q Developer’s performance on less common and internal libraries is lower compared to the common libraries. In this case, you can use a combination of Q Developer and manual steps. It’s recommended to include both production applications and internal dependencies in this step. You should also classify your applications and internal libraries based on whether and how they are used by other applications, which will help prioritize the applications to upgrade first in campaigns. Classifying applications by libraries used can help you identify the best upgrade approach using Q Developer.
  2. Defining baselines of efficiency: To measure the efficiency of the upgrade effort in your organization, it is crucial to establish baselines. Based on the classification of applications, use Q Developer in a pilot for each class to see which libraries are transformed correctly, and which ones have to be done manually. This helps you operationalize the process of using Q Developer and the manual steps required, and understand how this procedure accelerates the upgrade of a certain class of applications. Some customers use manual effort hours for each upgrade on dependency versions and deprecated code as baseline and compare the manual effort hours with time taken when completing the upgrade using Q Developer. For example, you can classify the applications based on the main frameworks used before upgrading applications using Q Developer. Compare the time taken by Q Developer with manual upgrade hours to understand which applications can be upgraded by Q Developer most effectively.
  3. Identifying applications for migration: Decide which applications to use Q Developer for, and prioritize the applications to upgrade in waves based on expected acceleration and business value. You can prioritize the applications which are most used by other applications and upgrade them in the initial campaign, then upgrade the rest of the applications in the subsequent campaigns. By addressing the foundational components first, the overall upgrade process will be streamlined. In Amazon, a centralized internal team defines migration waves and identifies which packages would be included in the upgrade campaign. Additionally, this team conducts analysis of the apps to determine the likelihood of the upgrade being successful using Q Developer, and provides an estimate of the remaining engineering effort needed to complete the upgrade. The team uses this information to select applications and uses an Amazon-internal tool to assign the upgrade tasks to the team owning the applications. While SDEs were free to run the upgrade on their own, following the campaign with a set deadline mobilized the application owner teams to complete the upgrade.

Use Q Developer to automate upgrade tasks

Once the preparation phase is completed, you can start the execution phase. Software developers can use Q Developer to accelerate many of the steps in the execution phase.

  1. Assess the components of an application to upgrade. When you use Q Developer to start a transformation, a transformation plan is generated at the beginning so that you can view which dependencies and deprecated code will be upgraded.
  2. Research and update dependency versions compatible with the target version. Q Developer will analyze your app and attempt to update the dependencies to the versions compatible with the target Java version and in some cases the latest version.
  3. Replace deprecated methods and API calls which are not compatible with the target version. Q Developer will detect the deprecated code and attempt to update to what’s recommended in the compatible Java version.
  4. Review the modified code and address any conflicts or issues that may arise. Q Developer will return code changes to you at the end of the transformation. If the transformation is successful, the app will compile in Java 17. If the transformation is partially successful, Q Developer was able to upgrade library versions and make code changes but could not compile the transformed app successfully in Java 17. Check out this part of our documentation on how to handle partial transformations.
  5. Test the upgraded application thoroughly to ensure correct functionality. Q Developer will run the unit tests and integration tests in your app when compiling in the target version.

Conclusion

As organizations face the pressing need to modernize their Java applications, Amazon Q Developer emerges as a powerful ally in this complex journey. The customer success stories demonstrate the tangible benefits of leveraging AI-assisted code transformation: significant time savings, reduced manual effort, and accelerated upgrade processes.

Q Developer not only addresses the technical challenges of Java upgrades, but also enables organizations to approach these initiatives strategically. By classifying applications, establishing baselines, and prioritizing upgrades, teams can maximize the efficiency of their modernization efforts. While Q Developer streamlines much of the upgrade process, it is important to note that some challenges may still arise. For a comprehensive understanding of potential challenges and detailed guidance on getting started with Q Developer, we encourage you to explore our public documentation.

The journey to Java 17 and beyond doesn’t have to be daunting. With Amazon Q Developer, you have a powerful tool at your disposal to accelerate your upgrade process, reduce costs, and ensure your applications remain secure, performant, and future-ready.

Take the first step towards modernizing your Java ecosystem today. Explore Amazon Q Developer and discover how it can transform your upgrade strategy. See Getting Started with Amazon Q Developer agent for code transformation for a how-to guide on using Q Developer to transform Java applications.

About the authors

Jonathan Vogel

Jonathan is a Developer Advocate at AWS. He was a DevOps Specialist Solutions Architect at AWS for two years prior to taking on the Developer Advocate role. Prior to AWS, he practiced professional software development for over a decade. Jonathan enjoys music, birding and climbing rocks.

Yiyi Guo

Yiyi is a Senior Product Manager at AWS working on Amazon Q Developer agent for code transformation. She focuses on leveraging generative AI to accelerate enterprise application modernization.

Enhancing data privacy with layered authorization for Amazon Bedrock Agents

Post Syndicated from Jeremy Ware original https://aws.amazon.com/blogs/security/enhancing-data-privacy-with-layered-authorization-for-amazon-bedrock-agents/

Customers are finding several advantages to using generative AI within their applications. However, using generative AI adds new considerations when reviewing the threat model of an application, whether you’re using it to improve the customer experience for operational efficiency, to generate more tailored or specific results, or for other reasons.

Generative AI models are inherently non-deterministic, meaning that even when given the same input, the output they generate can vary because of the probabilistic nature of the models. When using managed services such as Amazon Bedrock in your workloads, there are additional security considerations to help ensure protection of data that’s accessed by Amazon Bedrock.

In this blog post, we discuss the current challenges that you may face regarding data controls when using generative AI services and how to overcome them using native solutions within Amazon Bedrock and layered authorization.

Definitions

Before we get started, let’s review some definitions.

Amazon Bedrock Agents: You can use Amazon Bedrock Agents to autonomously complete multistep tasks across company systems and data sources. Agents can be used to enrich entry data to provide more accurate results or to automate repetitive tasks. Generative AI agents can make decisions based on input and the environmental data they have access to.

Layered authorization: Layered authorization is the practice of implementing multiple authorization checks between the application components beyond the initial point of ingress. This includes service-to-service authorization, carrying the true end-user identity through application components, and adding end-user authorization for each operation in addition to the service authorization.

Trusted identity propagation: Trusted identity propagation provides more simply defined, granted, and logged user access to AWS resources. Trusted identity propagation is built on the OAuth 2.0 authorization framework, which allows applications to access and share user data securely without the need to share passwords.

Amazon Verified Permissions: Amazon Verified Permissions is a fully managed authorization service that uses the provably correct Cedar policy language, so you can build more secure applications.

Challenge

As you build on AWS, there are several services and features that you can use to help ensure your data or your customers’ data is secure. This might include encryption at-rest with Amazon Simple Storage Service (Amazon S3) default encryption or AWS Key Management Service (AWS KMS) keys, or the use of prefixes in Amazon S3 or partition keys in Amazon DynamoDB to separate tenants’ data. These mechanisms are great for dealing with data at-rest and separation of data partitions, but after a generative AI powered application enables customers to access a variety of data (different sensitivity types of data, multiple tenants’ data, and so on) based on user input, the risk of disclosure of sensitive data increases (see the data privacy FAQ for more information about data privacy at AWS). This is because access to data is now being passed to an untrusted identity (the model) within the workload operating on behalf of the calling principal.

Many customers are using Amazon Bedrock Agents in their architecture to augment user input with additional information to improve responses. Agents might also be used to automate repetitive tasks and streamline workflows. For example, chatbots can be useful tools for improving user experiences, such as summarizing patient test results for healthcare providers. However, it’s important to understand the potential security risks and mitigation strategies when implementing chatbot solutions.

A common architecture involves invoking a chatbot agent through an Amazon API Gateway. The API gateway validates the API call using an Amazon Cognito or AWS Lambda authorizer and then passes the request to the chatbot agent to perform its function.

A potential risk arises when users can provide input prompts to the chatbot agent. This input could lead to prompt injection (OWASP LLM:01) or sensitive data disclosure (OWASP LLM:06) vulnerabilities. The root cause is that the chatbot agent often requires broad access permissions through an AWS Identity and Access Management (IAM) service role with access to various data stores (such as S3 buckets or databases), to fulfill its function. Without proper security controls, a threat actor from one tenant could potentially access or manipulate data belonging to another tenant.

Solution

While there is no single solution that can mitigate all risks, having a proper threat model of your consumer application to identify risks (such as unauthorized access to data) is critical. AWS offers several generative AI security strategies to assist you in generating appropriate threat models. In this post, we focus on layered authorization throughout the application, focusing on a solution to support a consumer application.

Note: This can also be accomplished using Trusted identity propagation (TIP) and Amazon S3 Access Grants for a workforce application.

By using a strong authentication process such as an OpenID Connect (OIDC) identity provider (IdP) for your consumers enhanced with multi-factor authentication (MFA), you can govern access to invoke the agents at the API gateway. We recommend that you also pass custom parameters to the agent—as shown in Figure 1, using the JWT token from the header of the request. With such a configuration, the agent will evaluate an isAuthorized request with Amazon Verified Permissions to confirm that the calling user has access to the data requested prior to the agent running its described function. This architecture is shown in Figure 1:

Figure 1: Authorization architecture


The steps of the architecture are as follows:

  1. The client connects to the application frontend.
  2. The client is redirected to the Amazon Cognito user pool UI for authentication.
  3. The client receives a JWT token from Amazon Cognito.
  4. The application frontend uses the JWT token presented by the client to authorize a request to the Amazon Bedrock agent. The application frontend adds the JWT token to the InvokeAgent API call.
  5. The agent reviews the request, calls the knowledge base if required, and calls the Lambda function. The agent includes the JWT token provided by the application frontend into the Lambda invocation context.
  6. The Lambda function uses the JWT token details to authorize subsequent calls to DynamoDB tables using Verified Permissions (6a), and calls the DynamoDB table only if the call is authorized (6b).

Deep dive

When you design an application behind an API gateway that triggers Amazon Bedrock agents, you must create an IAM service role for your agent with a trust policy that grants AssumeRole access to Amazon Bedrock. This role should allow Amazon Bedrock to get the OpenAPI schema for your agent Action Group Lambda function from the S3 bucket and allow for the bedrock:InvokeModel action to the specified model. If you did not select the default KMS key to encrypt your agent session data, you must grant access in the IAM service role to access the customer managed KMS key. Example policies and trust relationship are shown in the following examples.

The following policy grants permission to invoke an Amazon Bedrock model. This will be granted to the agent. In the resource, we are specifically targeting an approved foundational model (FM).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonBedrockAgentBedrockFoundationModelPolicy",
            "Effect": "Allow",
            "Action": "bedrock:InvokeModel",
            "Resource": [
                "arn:aws:bedrock:us-west-2::foundation-model/your_chosen_model"
            ]
        }
    ]
}

Next, we add a policy statement that allows the Amazon Bedrock agent access to s3:GetObject and targets a specific S3 bucket with a condition that the account number matches one within our organization.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonBedrockAgentDataStorePolicy",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::S3BucketName/*"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "Account_Number"
                }
            }
        }
    ]
}

Finally, we add a trust policy that grants Amazon Bedrock permissions to assume the defined role. We have also added conditional statements to make sure that the service is calling on behalf of our account to help prevent the confused deputy problem.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AmazonBedrockAgentTrustPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "bedrock.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "Account_Number"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:bedrock:us-west-2:Account_Number:agent/*"
                }
            }
        }
    ]
}

Amazon Bedrock agents use a service role and don’t propagate the consumer’s identity natively. This is where the underlying problem of protecting tenants’ data might exist. If the agent is accessing unclassified data, then there’s no need to add layered authorization because there’s no additional segregation of access needed based on the authorization caller. But if the application has access to sensitive data, you must carry authorization into processing the agent’s function.

You can do this by adding an additional layer to the Lambda function triggered by invoking the agent. First, initialize the agent to make an isAuthorized call to Verified Permissions. Only upon an Allow response will the agent perform the rest of its function. If the response from Verified Permissions is Deny, then the agent should return a status 403 or a friendly error message to the user.

Verified Permissions must have pre-built policies to dictate how authorization should occur when data is being accessed. For example, you might have a policy like the following to grant access to patient records if the calling principal is a doctor.

permit(
  principal in Group::"doctor",
  action == Action::"view",
  resource
)
when {
  // Cedar string literals are quoted, and the caller is referenced as `principal`
  resource.fileType == "Sensitive" &&
  resource.patient == principal.patient
};

In this example, the authorization logic to handle this decision is within the agent Lambda. To do so, the Lambda function first builds the entities structure by decoding the JWT passed as a custom parameter to the Amazon Bedrock agent to assess the calling principal’s access. The requested data should also be included in the isAuthorized call. After this data is passed to Verified Permissions, it will assess the access decision based on the context provided and the policies within the policy store. Because Verified Permissions is a policy decision point (PDP), it’s important to note that the allow or deny decision must be enforced at the application level. Based on this decision, access to the data will be allowed or denied. The resources being accessed should be categorized to help the application evaluate access control. For example, if the data is stored in DynamoDB, then patients might be separated by partition keys that are defined in the Verified Permissions schema and referenced in a hierarchical sense.
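The enforcement flow described above, as an added Python example (not code from the original post): the Lambda acts as the enforcement point, asks the PDP before touching data, and fails closed on anything other than an explicit ALLOW. The entity type names, the jwt session attribute, the patientId parameter, the placeholder policy store ID and the stub PDP are assumptions; in a deployment, pdp would be the boto3 verifiedpermissions client and the token's signature would be verified before its claims are parsed.

```python
"""Enforcing the Verified Permissions decision in a Bedrock agent Lambda (added example)."""
import base64
import json
import time

POLICY_STORE_ID = "POLICY_STORE_ID_PLACEHOLDER"  # placeholder, not a real ID
TOKEN_ATTRIBUTE = "jwt"  # assumed session attribute that carries the caller's JWT


def claims_from_jwt(token, now=None):
    """Parse JWT claims; reject malformed, subject-less or expired tokens.

    ASSUMPTION: signature and issuer were verified before this point (for
    example against the user pool's JWKS). This function only parses claims.
    """
    if not isinstance(token, str) or token.count(".") != 2:
        raise PermissionError("malformed token")
    payload = token.split(".")[1]
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    except ValueError as exc:
        raise PermissionError("malformed token") from exc
    if not isinstance(claims, dict) or not claims.get("sub"):
        raise PermissionError("token has no subject")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= (time.time() if now is None else now):
        raise PermissionError("expired token")
    return claims


def build_authz_request(claims, patient_id):
    """isAuthorized request: the end user (not the agent role) is the principal."""
    principal = {"entityType": "User", "entityId": str(claims["sub"])}
    resource = {"entityType": "PatientRecord", "entityId": patient_id}
    groups = [{"entityType": "Group", "entityId": g}
              for g in claims.get("cognito:groups", [])]
    return {
        "policyStoreId": POLICY_STORE_ID,
        "principal": principal,
        "action": {"actionType": "Action", "actionId": "view"},
        "resource": resource,
        "entities": {"entityList": [
            {"identifier": principal, "parents": groups},
            {"identifier": resource,
             "attributes": {"fileType": {"string": "Sensitive"}}},
        ]},
    }


def agent_response(event, status, body):
    """Response envelope for an API-schema action group."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": event.get("apiPath"),
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": status,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }


def handle(event, pdp, fetch_record, now=None):
    """Call the PDP first; touch the data store only on an explicit ALLOW."""
    params = {p.get("name"): p.get("value") for p in event.get("parameters") or []}
    patient_id = str(params.get("patientId", ""))
    try:
        token = (event.get("sessionAttributes") or {}).get(TOKEN_ATTRIBUTE)
        claims = claims_from_jwt(token, now)
        decision = pdp.is_authorized(**build_authz_request(claims, patient_id))["decision"]
    except Exception:
        decision = "DENY"  # fail closed: bad token, PDP error or odd response
    if decision != "ALLOW":
        return agent_response(event, 403, {"error": "not authorized"})
    return agent_response(event, 200, fetch_record(patient_id))


class StubPDP:
    """Constructed stand-in for the Verified Permissions client (offline runs)."""
    def __init__(self, assignments):
        self.assignments = assignments  # {doctor sub: set of patient IDs}

    def is_authorized(self, **request):
        caller = request["entities"]["entityList"][0]
        doctor = {"entityType": "Group", "entityId": "doctor"} in caller["parents"]
        patients = self.assignments.get(request["principal"]["entityId"], set())
        allowed = doctor and request["resource"]["entityId"] in patients
        return {"decision": "ALLOW" if allowed else "DENY"}


def constructed_token(claims):
    """Unsigned token built for the demo; real tokens come from the IdP."""
    def enc(obj):
        return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()
    return f"{enc({'typ': 'JWT'})}.{enc(claims)}.constructed-signature"
```

The fetch_record callable stands in for the DynamoDB read keyed on the patient partition; it is only reached after the ALLOW check, which is the ordering steps 6a and 6b of the architecture describe.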

Conclusion

In this post, you learned how you can improve data protection by using AWS native services to enforce layered authorization throughout a consumer application that uses Amazon Bedrock Agents. This post has shown you the steps to improve enforcement of access controls through identity processes. This can help you build applications using Amazon Bedrock Agents and maintain strong isolation of data to mitigate unintended sensitive data disclosure.

We recommend the Secure Generative AI Solutions using OWASP Framework workshop to learn more about using Verified Permissions and Amazon Bedrock Agents to enforce layered authorization throughout an application.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Jeremy Ware

Jeremy is a Senior Security Specialist Solutions Architect with a focus in identity and access management and security for generative AI workloads. Jeremy and his team help AWS customers implement sophisticated, scalable, secure workloads to solve business challenges. Jeremy has spent many years improving the security maturity at numerous global enterprises. In his free time, Jeremy enjoys the outdoors with his family.
Yuri Duchovny
Yuri Duchovny

Yuri is a New York-based Principal Solutions Architect specializing in cloud security, identity, and compliance. He supports cloud transformations at large enterprises, helping them make optimal technology and organizational decisions. Prior to his AWS role, Yuri’s areas of focus included application and networking security, DoS, and fraud protection. Outside of work, he enjoys skiing, sailing, and traveling the world.
Jason Garman
Jason Garman

Jason is a principal security specialist solutions architect at AWS, based in northern Virginia. Jason helps the world’s largest organizations solve critical security challenges. Before joining AWS, Jason had a variety of roles in the cybersecurity industry including startups, government contractors and private sector companies. He is a published author, holds patents on cybersecurity technologies, and loves to travel with his family.

Accelerate Amazon Redshift Data Lake queries with AWS Glue Data Catalog Column Statistics

Post Syndicated from Kalaiselvi Kamaraj original https://aws.amazon.com/blogs/big-data/accelerate-amazon-redshift-data-lake-queries-with-column-level-statistics/

Amazon Redshift enables you to efficiently query and retrieve structured and semi-structured data from open format files in an Amazon S3 data lake without having to load the data into Amazon Redshift tables. Amazon Redshift extends SQL capabilities to your data lake, enabling you to run analytical queries. Amazon Redshift supports a wide variety of tabular data formats like CSV, JSON, Parquet, and ORC, and open tabular formats like Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg.

You create Redshift external tables by defining the structure for your files, S3 location of the files and registering them as tables in an external data catalog. The external data catalog can be AWS Glue Data Catalog, the data catalog that comes with Amazon Athena, or your own Apache Hive metastore.

Over the last year, Amazon Redshift added several performance optimizations for data lake queries across multiple areas of the query engine, such as rewrite, planning, scan execution, and consumption of AWS Glue Data Catalog column statistics. To get the best performance on data lake queries with Redshift, you can use AWS Glue Data Catalog’s column statistics feature to collect statistics on data lake tables. For Amazon Redshift Serverless instances, you will see improved scan performance through increased parallel processing of S3 files; this happens automatically based on the RPUs used.

In this post, we highlight the performance improvements we observed using industry standard TPC-DS benchmarks. Overall execution time of TPC-DS 3 TB benchmark improved by 3x. Some of the queries in our benchmark experienced up to 12x speed up.

Performance Improvements

Several performance optimizations were done over the last year to improve performance of data lake queries including the following.

  • Consume AWS Glue Data Catalog column statistics and tuning of Redshift optimizer to improve quality of query plans
  • Utilize bloom filters for partition columns
  • Improved scan efficiency for Amazon Redshift Serverless instances through increased parallel processing of files
  • Novel query rewrite rules to merge similar scans
  • Faster retrieval of metadata from AWS Glue Data Catalog

To understand the performance gains, we tested the performance on the industry-standard TPC-DS benchmark using 3 TB data sets and queries that represent different customer use cases. Performance was tested on a Redshift Serverless data warehouse with 128 RPU. In our testing, the dataset was stored in Amazon S3 in Parquet format and AWS Glue Data Catalog was used to manage external databases and tables. Fact tables were partitioned on the date column, and each fact table consisted of approximately 2,000 partitions. All of the tables had their row count table property, numRows, set as per the spectrum query performance guidelines.

We did a baseline run on the Redshift patch version from last year (patch 172). Later, we ran all TPC-DS queries on the latest patch version (patch 180), which includes all performance optimizations added over the last year. Then we used AWS Glue Data Catalog’s column statistics feature to compute statistics for all the tables and measured improvements with the presence of AWS Glue Data Catalog column statistics.

Our analysis revealed that the TPC-DS 3TB Parquet benchmark saw substantial performance gains with these optimizations. Specifically, partitioned Parquet with our latest optimizations achieved 2x faster runtimes compared to the previous implementation. Enabling AWS Glue Data Catalog column statistics further improved performance by 3x versus last year. The following graph illustrates these runtime improvements for the full benchmark (all TPC-DS queries) over the past year, including the additional boost from using AWS Glue Data Catalog column statistics.

Figure 1: Improvement in total runtime of TPC-DS 3T workload

The following graph presents the top queries from the TPC-DS benchmark with the greatest performance improvement over the last year with and without AWS Glue Data Catalog column statistics. You can see that performance improves a lot when statistics exist on AWS Glue Data Catalog (for details on how to get statistics for your Data Lake tables, please refer to optimizing query performance using AWS Glue Data Catalog column statistics). Specifically, multi-join queries will benefit the most from AWS Glue Data Catalog column statistics because the optimizer uses statistics to choose the right join order and distribution strategy.

Figure 2: Speed-up in TPC-DS queries

Let’s discuss some of the optimizations that contributed to improved query performance.

Optimizing with table-level statistics

Amazon Redshift’s design enables it to handle large-scale data challenges with superior speed and cost-efficiency. Its massively parallel processing (MPP) query engine, AI-powered query optimizer, auto-scaling capabilities, and other advanced features allow Redshift to excel at searching, aggregating, and transforming petabytes of data.

However, even the most powerful systems can experience performance degradation if they encounter anti-patterns like grossly inaccurate table statistics, such as the row count metadata.

Without this crucial metadata, Redshift’s query optimizer may be limited in the number of possible optimizations, especially those related to data distribution during query execution. This can have a significant impact on overall query performance.

To illustrate this, consider the following simple query involving an inner join between a large table with billions of rows and a small table with only a few hundred thousand rows.

select small_table.sellerid, sum(large_table.qtysold)
from large_table, small_table
where large_table.salesid = small_table.listid
 and small_table.listtime > '2023-12-01'
 and large_table.saletime > '2023-12-01'
group by 1 order by 1

If executed as-is, with the large table on the right-hand side of the join, the query will lead to sub-optimal performance. This is because the large table will need to be distributed (broadcast) to all Redshift compute nodes to perform the inner join with the small table, as shown in the following diagram.

Figure 3: Inaccurate table statistics lead to limited optimizations and large amounts of data broadcast among compute nodes for a simple inner join

Now, consider a scenario where the table statistics, such as the row count, are accurate. This allows the Amazon Redshift query optimizer to make more informed decisions, such as determining the optimal join order. In this case, the optimizer would immediately rewrite the query to have the large table on the left-hand side of the inner join, so that it is the small table that is broadcast across the Redshift compute nodes, as illustrated in the following diagram.

Figure 4: Accurate table statistics lead to high degree of optimizations and very little data broadcast among compute nodes for a simple inner join

Fortunately, Amazon Redshift automatically maintains accurate table statistics for local tables by running the ANALYZE command in the background. For external tables (data lake tables), however, AWS Glue Data Catalog column statistics are recommended for use with Amazon Redshift as we will discuss in the next section. For more general information on optimizing queries in Amazon Redshift, please refer to the documentation on factors affecting query performance, data redistribution, and Amazon Redshift best practices for designing queries.

Improvements with AWS Glue Data Catalog column statistics

AWS Glue Data Catalog has a feature to compute column-level statistics for Amazon S3 backed external tables. AWS Glue Data Catalog can compute column-level statistics such as NDV, number of nulls, min/max, and average column width for the columns without the need for additional data pipelines. The Amazon Redshift cost-based optimizer uses these statistics to come up with better quality query plans. In addition to consuming statistics, we also made several improvements in cardinality estimation and cost tuning to get high quality query plans, thereby improving query performance.

TPC-DS 3TB dataset showed 40% improvement in total query execution time when these AWS Glue Data Catalog column statistics were provided. Individual TPC-DS queries showed up to 5x improvements in query execution time. Some of the queries that had greater impact in execution time are Q85, Q64, Q75, Q78, Q94, Q16, Q04, Q24 and Q11.

We will go through an example where the cost-based optimizer generated a better query plan with statistics, and show how it improved the execution time.

Let’s consider the following simpler version of TPC-DS Q64 to showcase the query plan differences with statistics.

select i_product_name product_name
,i_item_sk item_sk
,ad1.ca_street_number b_street_number
,ad1.ca_street_name b_street_name
,ad1.ca_city b_city
,ad1.ca_zip b_zip
,d1.d_year as syear
,count(*) cnt
,sum(ss_wholesale_cost) s1
,sum(ss_list_price) s2
,sum(ss_coupon_amt) s3
FROM   tpcds_3t_alls3_pp_ext.store_sales
,tpcds_3t_alls3_pp_ext.store_returns
,tpcds_3t_alls3_pp_ext.date_dim d1
,tpcds_3t_alls3_pp_ext.customer
,tpcds_3t_alls3_pp_ext.customer_address ad1
,tpcds_3t_alls3_pp_ext.item
WHERE
ss_sold_date_sk = d1.d_date_sk AND
ss_customer_sk = c_customer_sk AND
ss_addr_sk = ad1.ca_address_sk and
ss_item_sk = i_item_sk and
ss_item_sk = sr_item_sk and
ss_ticket_number = sr_ticket_number and
i_color in ('firebrick','papaya','orange','cream','turquoise','deep') and
i_current_price between 42 and 42 + 10 and
i_current_price between 42 + 1 and 42 + 15
group by i_product_name
,i_item_sk
,ad1.ca_street_number
,ad1.ca_street_name
,ad1.ca_city
,ad1.ca_zip
,d1.d_year

Without Statistics

The following figure represents the logical query plan of Q64. You can observe that the cardinality estimation of the joins is not accurate. With inaccurate cardinalities, the optimizer produces a sub-optimal query plan, leading to higher execution time.

With Statistics

The following figure represents the logical query plan after consuming AWS Glue Data Catalog column statistics. Based on the highlighted changes, you can observe that the cardinality estimations of JOIN improved by many orders of magnitude, helping the optimizer to choose a better join order and join strategy (broadcast DS_BCAST_INNER vs. distribute DS_DIST_BOTH). Switching the customer_address and customer tables from inner to outer table and changing the join strategy to distribute has a major impact because it reduces the data movement between the nodes and avoids spilling from the hash table.

Figure 5: Logical query plan of Q64 without statistics

Figure 6: Logical query plan of Q64 after consuming AWS Glue Data Catalog column statistics

This change in query plan improved the query execution time of Q64 from 383s to 81s.

Given the greater benefits with AWS Glue Data Catalog column statistics for the optimizer, you should consider collecting stats for your data lake using AWS Glue. If your workload is a JOIN heavy workload, then collecting stats will show greater improvement on your workload. Refer to generating AWS Glue Data Catalog column statistics for instructions on how to collect statistics in AWS Glue Data Catalog.

Query rewrite optimization

We introduced a new query rewrite rule which combines scalar aggregates over the same common expression using slightly different predicates. This rewrite resulted in performance improvements on TPC-DS queries Q09, Q28, and Q88. Let’s focus on Q09 as a representative of these queries, given by the following fragment:

SELECT CASE
WHEN (SELECT COUNT(*)
FROM store_sales
WHERE ss_quantity BETWEEN 1 AND 20) > 48409437
THEN (SELECT AVG(ss_ext_discount_amt)
FROM store_sales
WHERE ss_quantity BETWEEN 1 AND 20)
ELSE (SELECT AVG(ss_net_profit)
FROM store_sales
WHERE ss_quantity BETWEEN 1 AND 20) END
AS bucket1,
<<4 more variations of the CASE expression above>>
FROM reason
WHERE r_reason_sk = 1

In total, there are 15 scans of the fact table store_sales, each one returning various aggregates over different subsets of data. The engine first performs subquery removal and transforms the various expressions in the CASE statements into relational subtrees connected via cross products, and then they are fused into one subquery handling all scalar aggregates. The resulting plan for Q09, described below using SQL for clarity, is given by:

SELECT CASE WHEN v1 > 48409437 THEN t1 ELSE e1 END,
<4 more variations>
FROM (SELECT COUNT(CASE WHEN b1 THEN 1 END) AS v1,
AVG(CASE WHEN b1 THEN ss_ext_discount_amt END) AS t1,
AVG(CASE WHEN b1 THEN ss_net_profit END) AS e1,
<4 more variations>
FROM reason,
(SELECT *,
ss_quantity BETWEEN 1 AND 20 AS b1,
<4 more variations>
FROM store_sales
WHERE ss_quantity BETWEEN 1 AND 20 OR
<4 more variations>))
WHERE r_reason_sk = 1)

In general, this rewrite rule results in the largest improvements both in latency (from 3x to 8x improvements) and bytes read from Amazon S3 (from 6x to 8x reduction in scanned bytes and, consequently, cost).

Bloom filter for partition columns

Amazon Redshift already uses Bloom filters on data columns of external tables in Amazon S3 to enable early and effective data filtering. Last year, we extended this support for partition columns as well. A Bloom filter is a probabilistic, memory-efficient data structure that accelerates join queries at scale by filtering rows that do not match the join relation, significantly reducing the amount of data transferred over the network. Amazon Redshift automatically determines what queries are suitable for leveraging Bloom filters at query runtime.

This optimization resulted in performance improvements on TPC-DS queries Q05, Q17 and Q54, with large improvements in both latency (from 2x to 3x improvement) and bytes read from S3 (from 9x to 15x reduction in scanned bytes and, consequently, cost).

The following is the subquery of Q05 that showed improvements with the runtime filter.

select s_store_id,
sum(sales_price) as sales,
sum(profit) as profit,
sum(return_amt) as returns,
sum(net_loss) as profit_loss
from
( select  ss_store_sk as store_sk,
ss_sold_date_sk  as date_sk,
ss_ext_sales_price as sales_price,
ss_net_profit as profit,
cast(0 as decimal(7,2)) as return_amt,
cast(0 as decimal(7,2)) as net_loss
from tpcds_3t_alls3_pp_ext.store_sales
union all
select sr_store_sk as store_sk,
sr_returned_date_sk as date_sk,
cast(0 as decimal(7,2)) as sales_price,
cast(0 as decimal(7,2)) as profit,
sr_return_amt as return_amt,
sr_net_loss as net_loss
from tpcds_3t_alls3_pp_ext.store_returns
) salesreturnss,
tpcds_3t_alls3_pp_ext.date_dim,
tpcds_3t_alls3_pp_ext.store
where date_sk = d_date_sk
and d_date between cast('1998-08-13' as date)
and (cast('1998-08-13' as date) +  14)
and store_sk = s_store_sk
group by s_store_id

Without bloom filter support on partition columns

The following figure is the logical query plan for the sub-query of Q05. It appends two large fact tables, store_sales (8B rows) and store_returns (863M rows), and then joins the result with the very selective dimension table date_dim and then with the dimension table store. You can observe that the join with the date_dim table reduces the number of rows from 9B to 93M rows.

With bloom filter support on partition columns

With support for bloom filters on partition columns, we now create a bloom filter for the d_date_sk column of the date_dim table and push the bloom filter down to the store_sales and store_returns tables. These bloom filters help to filter out partitions in both the store_sales and store_returns tables because the join happens on the partition column (the number of partitions processed reduces by 10x).

Figure 7: Logical query plan for sub-query of Q05 without bloom filter support on partition columns

Figure 8: Logical query plan for sub-query of Q05 with bloom filter support on partition columns

Overall, a bloom filter on the partition column reduces the number of partitions processed, resulting in fewer S3 listing calls and fewer data files to be read (a reduction in scanned bytes). You can see that we only scan 89M rows from store_sales and 4M rows from store_returns because of the bloom filter. This reduced the number of rows to process at the JOIN level and helped improve the overall query performance by 2x and scanned bytes by 9x.
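The mechanism described in this section, as an added sketch that is not Amazon Redshift code: a Bloom filter is built from the d_date_sk values that survive the date predicate of the Q05 subquery and is then probed with the partition keys of the fact table, so non-matching partitions are never listed or read. The surrogate key range, the 2,000 constructed partitions and the filter sizing are assumptions.

```python
import hashlib
from datetime import date, timedelta


class BloomFilter:
    """Bit-array Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def build_date_dim(first_day=date(1996, 1, 1), days=2000, first_sk=2450000):
    """Constructed date_dim: maps d_date_sk to d_date, one row per day."""
    return {first_sk + i: first_day + timedelta(days=i) for i in range(days)}


def build_join_filter(date_dim, start, span_days):
    """Build side: apply the d_date predicate, then add surviving keys to the filter."""
    end = start + timedelta(days=span_days)
    qualifying = [sk for sk, d in date_dim.items() if start <= d <= end]
    bloom = BloomFilter()
    for sk in qualifying:
        bloom.add(sk)
    return bloom, qualifying


def prune_partitions(partition_keys, bloom):
    """Probe side: keep only partitions whose key may match the join."""
    return [pk for pk in partition_keys if bloom.might_contain(pk)]


def demo():
    date_dim = build_date_dim()
    partitions = sorted(date_dim)  # one fact-table partition per date key
    bloom, qualifying = build_join_filter(date_dim, date(1998, 8, 13), 14)
    kept = prune_partitions(partitions, bloom)
    return len(partitions), qualifying, kept


if __name__ == "__main__":
    total, qualifying, kept = demo()
    print(f"partitions: {total}, matching keys: {len(qualifying)}, scanned: {len(kept)}")
```

Any partition kept beyond the matching keys is a false positive; it costs an extra scan but never changes the join result, because the join itself still checks the exact key.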

Conclusion

In this post, we covered new performance optimizations in Amazon Redshift data lake query processing and how AWS Glue Data Catalog statistics help to enhance the quality of query plans for data lake queries in Amazon Redshift. Together, these optimizations improved the TPC-DS 3 TB benchmark by 3x. Some of the queries in our benchmark saw up to a 12x speed up.

In summary, Amazon Redshift now offers enhanced query performance with optimizations such as AWS Glue Data Catalog column statistics, bloom filters on partition columns, new query rewrite rules and faster retrieval of metadata. These optimizations are enabled by default, and Amazon Redshift users will benefit from better query response times for their workloads. For more information, please reach out to your AWS technical account manager or AWS account solutions architect. They will be happy to provide additional guidance and support.


About the authors

Kalaiselvi Kamaraj is a Sr. Software Development Engineer with Amazon. She has worked on several projects within the Redshift Query processing team and is currently focusing on performance-related projects for Redshift Data Lake.

Mark Lyons is a Principal Product Manager on the Amazon Redshift team. He works on the intersection of data lakes and data warehouses. Prior to joining AWS, Mark held product leadership roles with Dremio and Vertica. He is passionate about data analytics and empowering customers to change the world with their data.

Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas, USA. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.

Keep your firewall rules up-to-date with Network Firewall features

Post Syndicated from Salman Ahmed original https://aws.amazon.com/blogs/security/keep-your-firewall-rules-up-to-date-with-network-firewall-features/

AWS Network Firewall is a managed firewall service that makes it simple to deploy essential network protections for your virtual private clouds (VPCs) on AWS. Network Firewall automatically scales with your traffic, and you can define firewall rules that provide fine-grained control over network traffic.

When you work with security products in a production environment, you need to maintain a consistent effort to keep the security rules synchronized as you make modifications to your environment. To stay aligned with your organization’s best practices, you should diligently review and update security rules, but this can increase your team’s operational overhead.

Since the launch of Network Firewall, we have added new capabilities that simplify your efforts by using managed rules and automated methods to help keep your firewall rules current. This approach can streamline operations for your team and help enhance security by reducing the risk of failures stemming from manual intervention or customer automation processes. You can apply regularly updated security rules with just a few clicks, enabling a wide range of comprehensive protection measures.

In this blog post, I discuss three features—managed rule groups, prefix lists, and tag-based resource groups—offering an in-depth look at how Network Firewall operates to assist you in keeping your rule sets current and effective.

Prerequisites

If this is your first time using Network Firewall, make sure to complete the following prerequisites. However, if you already created rule groups, a firewall policy, and a firewall, then you can skip this section.

Network Firewall and AWS managed rule groups

AWS managed rule groups are collections of predefined, ready-to-use rules that AWS maintains on your behalf. You can use them to address common security use cases and help protect your environment from various types of threats. This can help you stay current with the evolving threat landscape and security best practices.

AWS managed rule groups are available for no additional cost to customers who use Network Firewall. When you work with a stateful rule group—a rule group that uses Suricata-compatible intrusion prevention system (IPS) specifications—you can integrate managed rules that help provide protection from botnet, malware, and phishing attempts.

AWS offers two types of managed rule groups: domain and IP rule groups and threat signature rule groups. AWS regularly maintains and updates these rule groups, so you can use them to help protect against constantly evolving security threats.

When you use Network Firewall, one of the use cases is to protect your outbound traffic from compromised hosts, malware, and botnets. To help meet this requirement, you can use the domain and IP rule group. You can select domain and IP rules based on several factors, such as the following:

  • Domains that are generally legitimate but now are compromised and hosting malware
  • Domains that are known for hosting malware
  • Domains that are generally legitimate but now are compromised and hosting botnets
  • Domains that are known for hosting botnets

The threat signature rule group offers additional protection by supporting several categories of threat signatures to help protect against various types of malware and exploits, denial of service attempts, botnets, web attacks, credential phishing, scanning tools, and mail or messaging attacks.

To use Network Firewall managed rules

  1. Update the existing firewall policy that you created as part of the Prerequisites for this post or create a new firewall policy.
  2. Add a managed rule group to your policy and select from Domain and IP rule groups or Threat signature rule groups.

Figure 1 illustrates the use of AWS managed rules. It shows both the domain and IP rule group and the threat signature rule group, and it includes one specific rule or category from each as a demonstration.

Figure 1: Network Firewall deployed with AWS managed rules

As shown in Figure 1, the process for using AWS managed rules has the following steps:

  1. The Network Firewall policy contains managed rules from the domain and IP rule groups and threat signature rule groups.
  2. If the traffic from a protected subnet passes the checks of the firewall policy as it goes to the Network Firewall endpoint, then it proceeds to the NAT gateway and the internet gateway (depicted with the dashed line in the figure).
  3. If traffic from a protected subnet fails the checks of the firewall policy, the traffic is dropped at the Network Firewall endpoint (depicted with the dotted line).

Inner workings of AWS managed rules

Let’s go deeper into the underlying mechanisms and processes that AWS uses for managed rules. After you configure your firewall with these managed rules, you gain the benefits of the up-to-date rules that AWS manages. AWS pulls updated rule content from the managed rules provider on a fixed cadence for domain-based rules and other managed rule groups.

The Network Firewall team operates a serverless processing pipeline powered by AWS Lambda. This processes the rules from the vendor source, first fetching them so that they can be manipulated and transformed into the managed rule groups. Then the rules are mapped to the appropriate category based on their metadata. The final rules are uploaded to Amazon Simple Storage Service (Amazon S3) to prepare for propagation in each AWS Region.

Finally, Network Firewall processes the rule group content Region by Region, updating the managed rule group object associated with your firewall with the new content from the vendor. For threat signature rule groups, subscribers receive an SNS notification, letting them know that the rules have been updated.

AWS handles the tasks associated with this process so you can deploy and secure your workloads while addressing evolving security threats.

Network Firewall and prefix lists

Network Firewall supports Amazon Virtual Private Cloud (Amazon VPC) prefix lists to simplify management of your firewall rules and policies across your VPCs. With this capability, you can define a prefix list one time and reference it in your rules later. For example, with prefix lists, you can group multiple CIDR blocks into a single object instead of managing them at an individual IP level by creating a prefix list for their specific use case.

AWS offers two types of prefix lists: AWS-managed prefix lists and customer-managed prefix lists. In this post, we focus on customer-managed prefix lists. With customer-managed prefix lists, you can define and maintain your own sets of IP address ranges to meet your specific needs. Although you operate these prefix lists and can add and remove IP addresses, AWS controls and maintains the integration of these prefix lists with Network Firewall.

To use a Network Firewall prefix list

  1. Create a prefix list.
  2. Update your existing rule group that you created as part of the Prerequisites for this post or create a new rule group.
  3. Use IP set references in Suricata-compatible rule groups. In the IP set references section, select Edit, and in the Resource ID section, select the prefix list that you created.

Figure 2 illustrates Network Firewall deployed with a prefix list.

Figure 2: Network Firewall deployed with prefix list

As shown in Figure 2, we use the same design as in our previous example:

  1. We use a prefix list that is referenced in our rule group.
  2. The traffic from the protected subnet goes through the Network Firewall endpoint and NAT gateway and then to the internet gateway. As it passes through the Network Firewall endpoint, the firewall policy that contains the rule group determines if the traffic is allowed or not according to the policy.

Inner workings of prefix lists

After you configure a rule group that references a prefix list, Network Firewall automatically keeps the associated rules up to date. Network Firewall creates an IP set object that corresponds to this prefix list. This IP set object is how Network Firewall internally tracks the state of the prefix list reference, and it contains both resolved IP addresses from the source and additional metadata that’s associated with the IP set, such as which rule groups reference it. AWS manages these references and uses them to track which firewalls need to be updated when the content of these IP sets changes.

The Network Firewall orchestration engine is integrated with prefix lists, and it works in conjunction with Amazon VPC to keep the resolved IPs up to date. The orchestration engine automatically refreshes IPs associated with a prefix list, whether that prefix list is AWS-managed or customer-managed.

When you use a prefix list with Network Firewall, AWS handles a significant portion of the work on your behalf. This managed approach simplifies the process while providing the flexibility that you need to customize the allow or deny list of IP addresses according to your specific security requirements.

Network Firewall and tag-based resource groups

With Network Firewall, you can now use tag-based resource groups to simplify managing your firewall rules. A resource group is a collection of AWS resources that are in the same Region, and that match the criteria specified in the group’s query. A tag-based resource group bases its membership on a query that specifies a list of resource types and tags. Tags are key value pairs that help identify and sort your resources within your organization.

In your stateful firewall rules, you can reference a resource group that you have created for a specific set of Amazon Elastic Compute Cloud (Amazon EC2) instances or elastic network interfaces (ENIs). When these resources change, you don’t have to update your rule group every time. Instead, you can use a tagging policy for the resources that are in your tag-based resource group.

As your AWS environment changes, it’s important to make sure that new resources are using the same egress rules as the current resources. However, managing the changing EC2 instances due to workload changes creates an operational overhead. By using tag-based resource groups in your rules, you can eliminate the need to manually manage the changing resources in your AWS environment.

To use Network Firewall resource groups with a stateful rule group

  1. Create Network Firewall resource groups – Create a resource group for each of two applications. For the example in this blog post, enter the name rg-app-1 for application 1, and rg-app-2 for application 2.
  2. Update your existing rule group that you created as a part of the Prerequisites for this post or create a new rule group. In the IP set references section, select Edit; and in the Resource ID section, choose the resource groups that you created in the previous step (rg-app-1 and rg-app-2).

Now, as your EC2 instances or ENIs scale, those resources stay in sync automatically.
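For both the prefix list procedure and the resource group procedure above, a pre-deployment check can confirm that every IP set reference a Suricata-compatible rule uses is actually declared in the rule group, and that each declaration points at a prefix list or a resource group. This is an added example, not code from the original post; it assumes the rule group JSON form of the console's IP set references section (`ReferenceSets` / `IPSetReferences` / `ReferenceArn`, referenced in rules as `@NAME`), and the account ID, prefix list ID, reference names and rules are constructed.

```python
import re

# A reference such as @APP1 in the rule header (the part before the options).
REF_TOKEN = re.compile(r"@([A-Za-z0-9_]+)")
# IP set references are expected to point at a prefix list or a resource group.
ALLOWED_ARN = re.compile(
    r"^arn:aws:(?:ec2:[^:]*:[^:]*:prefix-list/.+"
    r"|resource-groups:[^:]*:[^:]*:group/.+)$"
)

# Constructed rule group: app-1 may reach the site, app-2 may not.
GOOD_RULE_GROUP = {
    "ReferenceSets": {"IPSetReferences": {
        "APP1": {"ReferenceArn": "arn:aws:resource-groups:us-east-1:111122223333:group/rg-app-1"},
        "APP2": {"ReferenceArn": "arn:aws:resource-groups:us-east-1:111122223333:group/rg-app-2"},
        "PARTNERS": {"ReferenceArn": "arn:aws:ec2:us-east-1:111122223333:prefix-list/pl-PLACEHOLDER"},
    }},
    "RulesSource": {"RulesString": "\n".join([
        'pass tls @APP1 any -> any 443 (tls.sni; content:"example.com"; endswith; sid:1001;)',
        'drop tls @APP2 any -> any 443 (tls.sni; content:"example.com"; endswith; sid:1002;)',
        'pass tcp any any -> @PARTNERS 443 (sid:1003;)',
    ])},
}

# Constructed faulty variant: a typo in a reference name and a wrong ARN type.
BAD_RULE_GROUP = {
    "ReferenceSets": {"IPSetReferences": {
        "APP1": {"ReferenceArn": "arn:aws:resource-groups:us-east-1:111122223333:group/rg-app-1"},
        "APP2": {"ReferenceArn": "arn:aws:resource-groups:us-east-1:111122223333:group/rg-app-2"},
        "LOGS": {"ReferenceArn": "arn:aws:s3:::constructed-bucket"},
    }},
    "RulesSource": {"RulesString": "\n".join([
        'pass tls @APP1 any -> any 443 (tls.sni; content:"example.com"; endswith; sid:1001;)',
        'drop tls @APP_2 any -> any 443 (tls.sni; content:"example.com"; endswith; sid:1002;)',
    ])},
}


def lint_ip_set_references(rule_group):
    """Return a list of (kind, reference_name, rule_line_or_None) findings."""
    declared = rule_group.get("ReferenceSets", {}).get("IPSetReferences", {})
    findings = []

    for name, ref in declared.items():
        if not ALLOWED_ARN.match(ref.get("ReferenceArn", "")):
            findings.append(("bad-arn", name, None))

    used = set()
    rules = rule_group["RulesSource"]["RulesString"].splitlines()
    for line_no, line in enumerate(rules, 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Only the header carries addresses; an "@" inside rule options
        # (for example in a content match) is not a reference.
        header = line.split("(", 1)[0]
        for name in REF_TOKEN.findall(header):
            used.add(name)
            if name not in declared:
                findings.append(("undeclared", name, line_no))

    for name in sorted(set(declared) - used):
        findings.append(("unused", name, None))
    return findings


if __name__ == "__main__":
    print(lint_ip_set_references(GOOD_RULE_GROUP))
    for finding in lint_ip_set_references(BAD_RULE_GROUP):
        print(finding)
```

An undeclared reference is the finding to block on: the drop rule intended for app-2 would not apply to the resource group it was written for.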

Figure 3 illustrates resource groups with a stateful rule group.

Figure 3: Network Firewall deployed with resource groups

As shown in Figure 3, we tagged the EC2 instances as app-1 or app-2. In your stateful rule group, restrict access to a website for app-2, but allow it for app-1:

  1. We use the resource group that is referenced in our rule group.
  2. The traffic from the protected subnet goes through the Network Firewall endpoint and the NAT gateway and then to the internet gateway. As it passes through the Network Firewall endpoint, the firewall policy that contains the rule group referencing the specific resource group determines how to handle the traffic. In the figure, the dashed line shows that the traffic is allowed while the dotted line shows it’s denied based on this rule.

Inner workings of resource groups

For tag-based resource groups, Network Firewall works with resource groups to automatically refresh the contents of the Network Firewall resource groups. Network Firewall first resolves the resources that are associated with the resource group, which are EC2 instances or ENIs that match the tag-based query specified. Then it resolves the IP addresses associated with these resources by calling the relevant Amazon EC2 API.

After the IP addresses are resolved, through either a prefix list or Network Firewall resource group, the IP set is ready for propagation. Network Firewall uploads the refreshed content of the IP set object to Amazon S3, and the data plane capacity (the hardware responsible for packet processing) fetches this new configuration. The stateful firewall engine accepts and applies these updates, which allows your rules to apply to the new IP set content.

By using tag-based resource groups within your workloads, you can delegate a substantial amount of your firewall management tasks to AWS, enhancing efficiency and reducing manual efforts on your part.
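The two resolution steps described above, modeled offline as an added example (not AWS code and not part of the original post): a tag-based query selects EC2 instances and ENIs, their IP addresses become the IP set, and a refresh picks up new members. The inventory, the tag key "app", and the allow/deny model of Figure 3 are assumptions made for illustration.

```python
import ipaddress
import json

# Constructed inventory standing in for what the EC2 API would return.
# IDs, IP addresses and the tag key "app" are made up for this example.
INVENTORY = [
    {"id": "i-0aaa", "type": "AWS::EC2::Instance",
     "tags": {"app": "app-1"}, "ips": ["10.0.1.10"]},
    {"id": "i-0bbb", "type": "AWS::EC2::Instance",
     "tags": {"app": "app-2"}, "ips": ["10.0.1.20", "10.0.1.21"]},
    # Tag keys are case-sensitive: "App" is not "app", so no group picks this up.
    {"id": "i-0ccc", "type": "AWS::EC2::Instance",
     "tags": {"App": "app-2"}, "ips": ["10.0.1.30"]},
    {"id": "eni-0ddd", "type": "AWS::EC2::NetworkInterface",
     "tags": {"app": "app-1"}, "ips": ["10.0.1.40"]},
]


def tag_query(value):
    """Tag-based group query: a list of resource types plus tag filters."""
    return json.dumps({
        "ResourceTypeFilters": ["AWS::EC2::Instance",
                                "AWS::EC2::NetworkInterface"],
        "TagFilters": [{"Key": "app", "Values": [value]}],
    })


GROUPS = {"rg-app-1": tag_query("app-1"), "rg-app-2": tag_query("app-2")}


def resolve_members(query_json, inventory):
    """Step 1: resources of a listed type that satisfy every tag filter."""
    query = json.loads(query_json)
    types = set(query["ResourceTypeFilters"])
    return [
        res for res in inventory
        if res["type"] in types
        and all(res["tags"].get(f["Key"]) in f["Values"]
                for f in query["TagFilters"])
    ]


def resolve_ip_set(query_json, inventory):
    """Step 2: the IP addresses of the member resources form the IP set."""
    return {
        ipaddress.ip_address(ip)
        for res in resolve_members(query_json, inventory)
        for ip in res["ips"]
    }


def refresh(groups, inventory):
    """Re-resolve every group, as happens when membership changes."""
    return {name: resolve_ip_set(q, inventory) for name, q in groups.items()}


def uncovered(groups, inventory):
    """Resources that no group selects, so no group-based rule sees them."""
    covered = {res["id"] for q in groups.values()
               for res in resolve_members(q, inventory)}
    return sorted(res["id"] for res in inventory if res["id"] not in covered)


def verdict(src_ip, ip_sets):
    """Model of the Figure 3 rule: app-2 denied, app-1 allowed.

    "no-match" means this rule does not apply to the source address; the
    rest of the firewall policy decides what happens to that traffic.
    """
    ip = ipaddress.ip_address(src_ip)
    if ip in ip_sets["rg-app-2"]:
        return "deny"
    if ip in ip_sets["rg-app-1"]:
        return "allow"
    return "no-match"


if __name__ == "__main__":
    sets = refresh(GROUPS, INVENTORY)
    for address in ("10.0.1.10", "10.0.1.21", "10.0.1.30"):
        print(address, verdict(address, sets))
    print("not in any group:", uncovered(GROUPS, INVENTORY))
```

The mis-tagged instance is the case to audit for: the rule group itself is correct, but the resource never enters either IP set.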

Considerations

Conclusion

In this blog post, you learned how to use Network Firewall managed rule groups, prefix lists, and tag-based resource groups to harness the automation and user-friendly capabilities of Network Firewall. You also learned more detail about how AWS operates these features on your behalf, to help you deploy a simple-to-use and secure solution. Enhance your current or new Network Firewall deployments by integrating these features today.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Salman Ahmed

Salman is a Senior Technical Account Manager in AWS Enterprise Support. He enjoys helping customers in the travel and hospitality industry to design, implement, and support cloud infrastructure. With a passion for networking services and years of experience, he helps customers adopt various AWS networking services. Outside of work, Salman enjoys photography, traveling, and watching his favorite sports teams.

Leverage IAM Roles for email sending via SES from EC2 and eliminate a common credential risk

Post Syndicated from Zip Zieper original https://aws.amazon.com/blogs/messaging-and-targeting/leverage-iam-roles-for-email-sending-via-ses-from-ec2-and-eliminate-a-common-credential-risk/

Sending automated transactional emails, such as account verifications and password resets, is a common requirement for web applications hosted on Amazon EC2 instances. Amazon SES provides multiple interfaces for sending emails, including SMTP, API, and the SES console itself. The type of SES credential you use with Amazon SES depends on the method through which you are sending the emails.

In this blog post, we describe how to leverage IAM roles for EC2 instances to securely send emails via the Amazon SES API, without the need to embed IAM credentials directly in the application code, link to a shared credentials file, or manage IAM credentials within the EC2 instance. By adopting the approach outlined in this blog, you can enhance security by eliminating the risk of credential exposure and simplify credential management for your web applications.

Solution Overview

Below we provide step-by-step instructions to configure an IAM role with SES permissions to use on your EC2 instance. This allows the EC2 hosted web application to securely send emails via Amazon SES without storing or managing IAM credentials within the EC2 instance. We present an option for running EC2 and SES in the same AWS account, as well as an option to accommodate running EC2 and SES in different AWS accounts. Both options offer a way to enhance security and simplify credential management.

Either option begins with creating an IAM role with SES permissions. Next, the IAM role is attached to your EC2 instance, providing it with the necessary permissions for SES without needing to embed IAM credentials in your application code or on a file in the EC2 instance. In option 2, we’ll add cross-account permissions that allow the code on the EC2 instance in account “A” to send email via the SES API in account “B”. We also provide a sample Python script that demonstrates how to send an email from your EC2 instance using the attached IAM role.

Option 1 – SES and EC2 are in a single AWS account

In a typical scenario where an EC2 instance is operating in the same AWS account as SES, the process of using an IAM role to send emails via SES is straightforward. In the steps below, you’ll configure and attach an IAM role to the EC2 instance. You’ll then update a sample Python script to use the permissions provided by the attached IAM role to send emails via SES. This direct access simplifies the SES sending process, as no explicit credential management is required in the code, nor do you need to include a shared credentials file on the EC2 instance.


EC2 & SES in the same AWS Account

Prerequisites – single AWS account for EC2 and SES

  • A single AWS account in a region that supports SES
  • Verified domain or email identity in Amazon SES.
    • Make note of a verified sending email address here: ___________
  • EC2 instance (Linux) in running state
    • If you don’t have an EC2 instance, create one (Linux)
  • Administrative Access to Amazon SES, IAM and EC2 consoles.
  • Access to a recipient email address to receive test emails from the python script.
    • Make note of a SES verified recipient email address to send test emails here: ___________

Step 1 – Create IAM Role for EC2 instance with SES Permissions

To start, create an IAM role that grants the necessary permissions to send emails using Amazon SES by following these steps:

  • Sign in to the AWS Management Console and open the IAM console.
  • In the navigation pane, choose “Roles,” and then choose “Create role.”
  • Choose the trusted entity type as “AWS service” and select “EC2” as the service that will use this role, then click ‘Next’.
  • Search for and select the “AmazonSESFullAccess” policy from the list (or create a custom policy with the necessary SES permissions), then click ‘Next’.
  • Provide a name for your role (e.g., EC2_SES_SendEmail_Role).
  • Click “Create role“.

Step 2 – Attach the IAM Role to EC2 instance.

Next, attach the IAM role to your EC2 instance:

  • Open the EC2 Management Console.
  • In the navigation pane, choose “Instances,” and select the running EC2 instance to which you want to attach the IAM role.
  • With the instance selected, choose “Actions,” then “Security,” and “Modify IAM role.”
  • Choose the IAM role you created (EC2_SES_SendEmail_Role) from the drop-down menu and click “Update IAM role.”

Step 3 – Create a sample python script that sends emails from the EC2 instance with the attached role.

  • Now that your EC2 instance is configured with the necessary permissions, you can set up an example Python script to send emails via Amazon SES using the IAM Role. Here, we’re using the AWS SDK for Python (Boto3), a powerful and versatile library to interact with the SES API endpoint. Before running the example script, ensure that Python, pip (the package installer for Python), and the Boto3 library are installed on your EC2 instance:
    • Run the ‘python3 --version’ command to check if Python is installed on your EC2 instance. If Python is installed, the version will be displayed; otherwise you’ll receive a ‘command not found’ or similar error message.
      • If Python is not installed, run the command ‘sudo yum install python3 -y’.
    • Run the ‘pip3 --version’ command to check if pip is installed on your EC2 instance. If pip3 is installed, the version will be displayed; otherwise you’ll receive a ‘command not found’ or similar error message.
      • If pip3 is not installed, run the command ‘sudo yum install python3-pip’.
    • Install the Boto3 library, which allows Python scripts to interact with AWS services including SES. Run the command ‘pip3 install boto3’ to install (or update) Boto3 using pip.
  • Save the code below as a Python file named ‘sesemail.py’ on your EC2 instance.
  • Edit ‘sesemail.py’ and replace the placeholder values of SENDER, RECIPIENT, and AWS_REGION with your values (see prerequisites). Do not modify any “” marks.


import boto3
from botocore.exceptions import ClientError

SENDER = "sender@example.com"
RECIPIENT = "recipient@example.com"
#CONFIGURATION_SET = "ConfigSet"
AWS_REGION = "us-west-2"
SUBJECT = "Amazon SES Test Email (SDK for Python) using IAM Role"
BODY_TEXT = ("Amazon SES Test (Python)\r\n"
             "This email was sent with Amazon SES using the "
             "AWS SDK for Python (Boto)."
            )
            
BODY_HTML = """<html>
<head></head>
<body>
  <h1>Amazon SES Test (SDK for Python) using IAM Role</h1>
  <p>This email was sent with
    <a href='https://aws.amazon.com/ses/'>Amazon SES</a> using the
    <a href='https://aws.amazon.com/sdk-for-python/'>
      AWS SDK for Python (Boto)</a>.</p>
</body>
</html>
            """            

CHARSET = "UTF-8"

client = boto3.client('ses',region_name=AWS_REGION)

try:
    response = client.send_email(
        Destination={
            'ToAddresses': [
                RECIPIENT,
            ],
        },
        Message={
            'Body': {
                'Html': {
                    'Charset': CHARSET,
                    'Data': BODY_HTML,
                },
                'Text': {
                    'Charset': CHARSET,
                    'Data': BODY_TEXT,
                },
            },
            'Subject': {
                'Charset': CHARSET,
                'Data': SUBJECT,
            },
        },
        Source=SENDER,
    )   
except ClientError as e:
    print(e.response['Error']['Message'])
else:
    print("Email sent! Message ID:")
    print(response['MessageId'])

  • Run ‘python3 sesemail.py’ to execute the Python script.
  • When ‘python3 sesemail.py’ runs successfully, an email is sent to the RECIPIENT (check the inbox), and the command line will display the sent Message ID.


Option 2 – SES and EC2 are in different AWS accounts

In some scenarios, your EC2 instance might operate in a different AWS account than SES. Let’s call the EC2 AWS account “A” and the SES AWS account “B”. Because the AWS resources in account A don’t automatically have permission to access AWS resources in account B, we need some way to allow the code on EC2 to assume a role in the SES account using the AWS Security Token Service (STS). This involves a method that generates temporary credentials that include an access key, secret access key, and session token, which are only valid for a limited time.


EC2 & SES in different AWS Accounts

In the steps below, you’ll configure and attach an IAM role to the EC2 instance in account “A” such that it can run an example Python script. This Python script can use the permissions provided by the attached IAM role to send emails via SES in account “B”. This approach leverages cross-account access and simplifies sending email from the EC2 instance in account A via SES in account B. As with Option 1, no explicit credential management is required in the code running on EC2, nor do you need to include a shared credentials file on the EC2 instance.

Prerequisites – different AWS accounts for EC2 and SES (use cross-account access)

  • An AWS account “A” with:
    • EC2 instance (Linux) in running state. (If you don’t have an EC2 instance, create one using Amazon Linux)
    • Administrative Access to Amazon IAM and EC2 consoles.
    • Make note of your “A” AWS account ID here: ________________
  • An AWS account “B” with:
    • Verified domain (or email identity for testing only) in Amazon SES
      • Make note of a verified sending email address here: ___________
    • Administrative Access to Amazon SES and IAM consoles.
      • Make note of your “B” AWS account ID here: ________________
    • In the steps below, you will create a “SES_Role_for_account_A” role.
      • Make note of the ARN of the “SES_Role_for_account_A” role here: ___________
    • Access to a recipient email address to receive test emails from the python script.
      • Make note of a SES verified recipient email address to send test emails here: ___________

Step 1 – Create IAM Role in the SES “B” account

  • Sign in to the SES “B” account via the AWS Management Console and open the IAM console.
  • In the navigation pane, choose “Roles,” and then choose “Create role“.
  • Choose the trusted entity type as ‘AWS account’ and select ‘Another AWS account’.
  • Add the AWS account ID where your EC2 instance resides (AWS account “A” in the prerequisites) and click ‘Next’.
  • Search for and select the “AmazonSESFullAccess” policy or create a custom policy with the necessary SES permissions, then click ‘Next’.
  • Provide a name for your role (e.g., ‘SES_Role_for_account_A’).
  • Click “Create role”.
  • Copy the ARN for the new SES_Role_for_account_A (you’ll need the ARN in the next step).
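Both Option 1 (Step 1) and this step offer a custom policy with the necessary SES permissions as the alternative to AmazonSESFullAccess. The following added example (not from the original post) builds such a policy and a trust policy that names one role instead of a whole account, then checks both against broader counterparts. The account IDs, the identity example.com, the two broad policies, and the lint rules are assumptions made for illustration.

```python
import json

ACCOUNT_A = "111122223333"  # constructed: account that runs the EC2 instance
ACCOUNT_B = "444455556666"  # constructed: account that owns the SES identity


def as_list(value):
    """IAM JSON allows a single value or a list in most fields."""
    return value if isinstance(value, list) else [value]


def ses_send_policy(identity_arn, from_address):
    """Permissions policy: send only, one identity, one From address."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "SendFromOneIdentity",
            "Effect": "Allow",
            "Action": ["ses:SendEmail", "ses:SendRawEmail"],
            "Resource": identity_arn,
            "Condition": {"StringEquals": {"ses:FromAddress": from_address}},
        }],
    }


def trust_policy_for_role(role_arn):
    """Trust policy for the role in account B that names one role in A."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": "sts:AssumeRole",
        }],
    }


def lint(policy):
    """Return findings for grants broader than the sending use case needs."""
    findings = []
    for index, stmt in enumerate(as_list(policy["Statement"])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action}")
        if "*" in as_list(stmt.get("Resource", [])) and "Condition" not in stmt:
            findings.append(f"statement {index}: any resource, no condition")
        principals = stmt.get("Principal", {})
        if principals == "*":
            principals = {"AWS": "*"}
        for principal in as_list(principals.get("AWS", [])):
            if (principal == "*" or principal.isdigit()
                    or principal.endswith(":root")):
                findings.append(
                    f"statement {index}: principal {principal} is a whole account")
    return findings


# Constructed stand-in for a full-access grant on SES.
BROAD_PERMISSIONS = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "ses:*", "Resource": "*"}],
}

# Constructed account-wide trust: any principal in account A that is
# allowed to call sts:AssumeRole can assume the role in account B.
ACCOUNT_WIDE_TRUST = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{ACCOUNT_A}:root"},
        "Action": "sts:AssumeRole",
    }],
}

TIGHT_PERMISSIONS = ses_send_policy(
    f"arn:aws:ses:us-west-2:{ACCOUNT_B}:identity/example.com",
    "sender@example.com",
)
TIGHT_TRUST = trust_policy_for_role(
    f"arn:aws:iam::{ACCOUNT_A}:role/EC2_SES_in_account_B_role")

if __name__ == "__main__":
    for name, doc in [("broad permissions", BROAD_PERMISSIONS),
                      ("tight permissions", TIGHT_PERMISSIONS),
                      ("account-wide trust", ACCOUNT_WIDE_TRUST),
                      ("role-scoped trust", TIGHT_TRUST)]:
        print(name, lint(doc) or "no findings")
    print(json.dumps(TIGHT_PERMISSIONS, indent=2))
```

IAM validates principals when a trust policy is saved, so the role-scoped trust policy can only be applied once the EC2 role in account “A” exists.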

Step 2 – Create an IAM policy in the EC2 “A” account that allows this role to assume the SES_Role_for_account_A role you just created in the SES “B” Account.

  • Sign in to the EC2 “A” account via the AWS Management Console and open the IAM console.
  • In the navigation pane, choose “Policies,” and then choose “Create Policy”.
  • Choose the service as ‘EC2’ and select policy editor as JSON.
  • Copy the policy below, and in the policy editor, replace the Resource with the ARN of the SES_Role_for_account_A in the SES account “B” (you created this in step 1).

[copy, paste into policy editor & replace the arn with SES_Role_for_account_A]

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::<SES_Account_ID>:role/<Role_Name>"
    }
  ]
}

  • Click ‘Next’ and provide a name for your policy (e.g., EC2_Policy_for_account_B).
  • Click ‘Create policy’.

Step 3 – Create an IAM role in the EC2 “A” account, and attach the previously created IAM policy (EC2_Policy_for_account_B) to it.

  • In the EC2 “A” account IAM console navigation pane, choose “Roles,” and then choose “Create role.”
  • Choose the trusted entity type as “AWS service” and select “EC2” as the service, then click ‘Next’.

  • Filter by type “customer managed”, search for (EC2_Policy_for_account_B), select that policy, and click ‘Next’ (note – if you are using AWS Session Manager to remotely connect to your EC2 instance, you may need to add the “AmazonSSMManagedInstanceCore” policy to the role).

  • Provide a name for your role (e.g., EC2_SES_in_account_B_role).
  • Click “Create role“.

Step 4 – Attach the IAM Role (EC2_SES_in_account_B_role) to the EC2 instance in AWS account “A”.

  • Open the EC2 Management Console in AWS account “A”
  • In the navigation pane, choose “Instances,” and select the instance to which you want to attach the EC2_SES_in_account_B_role IAM role.
  • With the instance selected, choose “Actions,” then “Security,” and “Modify IAM role.”

  • Choose the IAM role you created (EC2_SES_in_account_B_role) from the drop-down menu.
  • Click “Update IAM role.”

Step 5 – Create a sample python script that sends emails via SES in AWS account “B” from the EC2 instance in AWS account “A” using the EC2 attached role.

  1. Now that your EC2 instance is configured with the necessary permissions, you can set up an example Python script to send emails via Amazon SES in AWS Account “B” using the IAM Role on EC2 in AWS Account “A”. We’ll use the AWS SDK for Python (Boto3), a powerful and versatile library to interact with the SES API endpoint. Before running the example script, ensure that Python, pip (the package installer for Python), and the Boto3 library are installed on your EC2 instance:
    • Run the ‘python3 --version’ command to check if Python is installed on your EC2 instance. If Python is installed, the version will be displayed; otherwise you’ll receive a ‘command not found’ or similar error message.
      • If Python is not installed, run the command ‘sudo yum install python3 -y’.
    • Run the ‘pip3 --version’ command to check if pip is installed on your EC2 instance. If pip3 is installed, the version will be displayed; otherwise you’ll receive a ‘command not found’ or similar error message.
      • If pip3 is not installed, run the command ‘sudo yum install python3-pip’.
    • Install the Boto3 library, which allows Python scripts to interact with AWS services including SES. Run the command ‘pip3 install boto3’ to install (or update) Boto3 using pip.
  2. Save the code below as a Python file named cross_sesemail.py on your EC2 instance.
  3. Edit cross_sesemail.py and replace the placeholder value of ROLE_ARN with the ARN of the SES_Role_for_account_A you created in SES Account “B” (see prerequisites), and SENDER, RECIPIENT, and AWS_REGION with your values (see prerequisites). Do not modify any “” marks.

[copy, edit & replace the ROLE_ARN]

import boto3
from botocore.exceptions import ClientError

# Replace with your role ARN in SES Account
ROLE_ARN = "arn:aws:iam::<Account_ID>:role/<Role_Name>"

# Replace with the AWS Region of SES in account "B"
AWS_REGION = "us-west-2"

# Create an STS client
sts_client = boto3.client('sts')

# Assume the role
assumed_role = sts_client.assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="SESSession"
)

# Extract the temporary credentials
credentials = assumed_role['Credentials']

# Create an SES client using the assumed role credentials
ses_client = boto3.client(
    'ses',
    region_name=AWS_REGION,
    aws_access_key_id=credentials['AccessKeyId'],
    aws_secret_access_key=credentials['SecretAccessKey'],
    aws_session_token=credentials['SessionToken']
)

# Email parameters
SENDER = "sender@example.com"
RECIPIENT = "recipient@example.com"
SUBJECT = "Amazon SES Test (SDK for Python) using cross-account IAM Role"
BODY_TEXT = ("Amazon SES Test (Python)\r\n"
             "This email was sent with Amazon SES using the "
             "AWS SDK for Python (Boto) using IAM Role."
            )
BODY_HTML = """<html>
<head></head>
<body>
  <h1>Amazon SES Test (SDK for Python) using IAM Role</h1>
  <p>This email was sent with
    <a href='https://aws.amazon.com/ses/'>Amazon SES</a> using the
    <a href='https://aws.amazon.com/sdk-for-python/'>
      AWS SDK for Python (Boto)</a> using IAM Role.</p>
</body>
</html>
            """
CHARSET = "UTF-8"

# Send the email
try:
    response = ses_client.send_email(
        Destination={
            'ToAddresses': [RECIPIENT],
        },
        Message={
            'Body': {
                'Html': {
                    'Charset': CHARSET,
                    'Data': BODY_HTML,
                },
                'Text': {
                    'Charset': CHARSET,
                    'Data': BODY_TEXT,
                },
            },
            'Subject': {
                'Charset': CHARSET,
                'Data': SUBJECT,
            },
        },
        Source=SENDER,
    )
except ClientError as e:
    print(e.response['Error']['Message'])
else:
    print("Email sent! Message ID:")
    print(response['MessageId'])

  • Run the Python script with ‘python3 cross_sesemail.py’. When the email is sent successfully, the command line output will display the message ID of the sent email, and the recipient will receive an email.


Conclusion:

By implementing IAM roles for EC2 instances with SES permissions, you can securely send emails via the SES APIs from your web applications without the need to store or manage IAM credentials within the EC2 instance or application code. This approach not only enhances security by eliminating the risk of credential exposure, but also simplifies the management of credentials. With the step-by-step guide provided in this blog post, you can easily configure IAM roles for your EC2 instances and start sending emails via the Amazon SES API in a secure and efficient manner, regardless of whether your EC2 and SES resources reside in the same or different AWS accounts.

Next Steps:

  1. Sign up for the AWS Free Tier and try out Amazon SES with IAM roles for EC2 instances as demonstrated in this blog post.
  2. Consult the AWS documentation on IAM Roles for Amazon EC2 and Amazon SES for more detailed instructions and best practices.
  3. Join the AWS Community Forums to ask questions, share experiences, and learn from other AWS users who have implemented similar solutions for secure email sending from their web applications.

About the Authors

Manas Murali M


Manas Murali M is a Cloud Support Engineer II at AWS and subject matter expert in Amazon Simple Email Service (SES) and Amazon CloudFront. With over 5 years of experience in the IT industry, he is passionate about resolving technical issues for customers. In his free time, he enjoys spending time with friends, traveling, and exploring emerging technologies.


Zip

Zip is an Amazon Pinpoint and Amazon Simple Email Service Sr. Specialist Solutions Architect at AWS. Outside of work he enjoys time with his family, cooking, mountain biking and plogging.

Let’s Architect! Building multi-tenant SaaS systems

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-building-multi-tenant-saas-systems/

Software as a Service (SaaS) applications offer a transformative solution for businesses worldwide, delivering on-demand software solutions to a global audience. However, building a successful SaaS platform demands meticulous architectural planning, especially given the inherent challenges of multi-tenancy. It’s also essential to ensure that each tenant’s data remains isolated and protected from unauthorized access and that multi-tenant systems are cost-optimized and can sustain the scaling of the SaaS business provider.

In this blog post, we will explore some of the key elements and best practices for designing and deploying secure and efficient SaaS systems on AWS.

Building cost-optimized multi-tenant SaaS architectures

Cost is a key factor to consider when we design new systems. Multi-tenancy requires teams to think beyond the basics of auto scaling, adopting strategies that allow their architecture to support complex cost-scaling challenges. In this session, the speaker covers some design patterns for distributed systems to support the continually evolving scale needs of the environment, while optimizing the cost of the infrastructure.


Figure 1. The architectural model chosen for deploying multi-tenant systems—pooled, siloed, or mixed—significantly influences the cost-optimization strategy. Each approach offers distinct trade-offs in terms of resource allocation, scalability, and cost efficiency.

Take me to this video

Well-Architected SaaS Lens

The SaaS Lens for the AWS Well-Architected Framework empowers customers to assess and enhance their cloud-based architectures, fostering a deeper understanding of the business implications of their design choices. By bringing together technical leadership and diverse teams to discuss strategies for improving various aspects of the system, the AWS Well-Architected Framework facilitates collaborative decision-making. Moreover, the AWS account team can provide valuable support in conducting these assessments, offering expert guidance and insights. The AWS SaaS Lens specifically focuses on how to design, deploy, and architect multi-tenant SaaS application workloads within the AWS Cloud.


Figure 2. The microservices running in a multi-tenant environment must be able to reference and apply tenant context within each service. At the same time, it’s also our goal to limit the degree to which developers need to introduce any tenant awareness into their code.

Take me to this well-architected framework

SaaS anywhere: Designing distributed multi-tenant architectures

Not every SaaS provider has the luxury of running all the moving parts of their solution within their own infrastructure. SaaS teams might support a range of diverse system models, where architectures might include customer-hosted data, edge deployment for parts of the application, and on-premises components. In this session, you can learn the strategies to support the complexities of this distributed model without undermining the resilience, operational efficiency, and agility goals of your solution. The video covers how this influences the onboarding, deployment, and profile management of the SaaS environment.


Figure 3. In this architectural pattern, tenants are demanding to have the ML workload in their environment. So, the SaaS provider only manages the SaaS control plane where tenants deploy the application plane in their environment, including the ML workload and the necessary components around it.

Take me to this video

Deploying multi-tenant SaaS applications on Amazon ECS and AWS Fargate

Containers are frequently employed in multi-tenant SaaS environments to enhance scalability, isolation, and resource efficiency. Developing such systems requires addressing multiple challenges, including tenant isolation, tenant on-boarding, tenant-specific metering, monitoring, and other factors related to multi-tenancy. This session explores how to effectively manage all of these aspects when deploying solutions on AWS Fargate.


Figure 4. Microservices architecture can enhance security isolation by dividing applications into smaller, independent services, reducing the potential impact of a breach.

Take me to this video

AWS Serverless SaaS Workshop

Serverless helps to create multi-tenant architectures thanks to services like AWS Lambda that isolate your business logic per request, making them the perfect companion to run a SaaS platform. This workshop provides a hands-on introduction to creating serverless multi-tenant SaaS applications, helping you get started and gain practical experience.


Figure 5. This is the high-level architecture of the web application you will use in the AWS Serverless SaaS Workshop. In the labs, you will use this web application to add features that are needed to build this final SaaS application.

Take me to this workshop

See you next time!

Thanks for reading! Multi-tenant SaaS architectures require a careful design of your system. In this post, you have discovered key elements for properly designing your next SaaS workloads. In the next blog, we will talk about modern data architectures.

To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page.

Six tips to improve the security of your AWS Transfer Family server

Post Syndicated from John Jamail original https://aws.amazon.com/blogs/security/six-tips-to-improve-the-security-of-your-aws-transfer-family-server/

AWS Transfer Family is a secure transfer service that lets you transfer files directly into and out of Amazon Web Services (AWS) storage services using popular protocols such as AS2, SFTP, FTPS, and FTP. When you launch a Transfer Family server, there are multiple options that you can choose depending on what you need to do. In this blog post, I describe six security configuration options that you can activate to fit your needs and provide instructions for each one.

Use our latest security policy to help protect your transfers from newly discovered vulnerabilities

By default, newly created Transfer Family servers use our strongest security policy, but for compatibility reasons, existing servers require that you update your security policy when a new one is issued. Our latest security policy, including our FIPS-based policy, can help reduce your risks of known vulnerabilities such as CVE-2023-48795, also known as the Terrapin Attack. In 2020, we had already removed support for the ChaCha20-Poly1305 cryptographic construction and CBC with Encrypt-then-MAC (EtM) encryption modes, so customers using our later security policies did not need to worry about the Terrapin Attack. Transfer Family will continue to publish improved security policies to offer you the best possible options to help ensure the security of your Transfer Family servers. See Edit server details for instructions on how to update your Transfer Family server to the latest security policy.

Use slashes in session policies to limit access

If you’re using Amazon Simple Storage Service (Amazon S3) as your data store with a Transfer Family server, the session policy for your S3 bucket grants and limits access to objects in the bucket. Amazon S3 is an object store and not a file system, so it has no concept of directories, only prefixes. You cannot, for example, set permissions on a directory the way you might on a file system. Instead, you set session policies on prefixes.

Even though there isn’t a file system, the slash character still plays an important role. Imagine you have a bucket named DailyReports and you’re trying to authorize certain entities to access the objects in that bucket. If your session policy is missing a slash in the Resource section, such as arn:aws:s3:::$DailyReports*, then you should add a slash to make it arn:aws:s3:::$DailyReports/*. Without the slash (/) before the asterisk (*), your session policy might allow access to buckets you don’t intend. For example, if you also have buckets named DailyReports-archive and DailyReports-testing, then a role with permission arn:aws:s3:::$DailyReports* will also grant access to objects in those buckets, which is probably not what you want. A role with permission arn:aws:s3:::$DailyReports/* won’t grant access to objects in your DailyReports-archive bucket, because the slash (/) makes it clear that only objects whose prefix begins with DailyReports/ will match, and all objects in DailyReports-archive will have a prefix of DailyReports-archive/, which won’t match your pattern. To check to see if this is an issue, follow the instructions in Creating a session policy for an Amazon S3 bucket to find your AWS Identity and Access Management (IAM) session policy.
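The effect of the missing slash can be checked mechanically. This added example (not from the original post) emulates IAM-style wildcard matching on resource ARNs, shows which of the three buckets named above each pattern reaches, and flags S3 resource patterns whose wildcard sits in the bucket-name part. The sample session policy, the object key, and the function names are constructed for illustration, and bucket names are written as literal names.

```python
import re

S3_PREFIX = "arn:aws:s3:::"

# Bucket names taken from the paragraph above.
BUCKETS = ["DailyReports", "DailyReports-archive", "DailyReports-testing"]


def as_list(value):
    """IAM JSON allows a single value or a list in most fields."""
    return value if isinstance(value, list) else [value]


def arn_pattern_to_regex(pattern):
    """Translate IAM-style wildcards: * is any run of characters, ? is one."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")
        elif char == "?":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, arn):
    """True when the whole ARN matches the resource pattern."""
    return arn_pattern_to_regex(pattern).fullmatch(arn) is not None


def buckets_reachable(pattern, bucket_names, probe_key="2024/report.csv"):
    """Buckets in which an object (constructed key) matches the pattern."""
    return [name for name in bucket_names
            if matches(pattern, f"{S3_PREFIX}{name}/{probe_key}")]


def risky_resources(policy):
    """S3 resource patterns with a wildcard before the first slash.

    Such a pattern can match other buckets whose names share the prefix.
    A bare bucket ARN (used for bucket-level actions) is not flagged.
    """
    risky = []
    for stmt in as_list(policy["Statement"]):
        for resource in as_list(stmt.get("Resource", [])):
            if not resource.startswith(S3_PREFIX):
                continue
            bucket_part = resource[len(S3_PREFIX):].split("/", 1)[0]
            if "*" in bucket_part or "?" in bucket_part:
                risky.append(resource)
    return risky


def make_session_policy(object_pattern):
    """Constructed session policy: list one bucket, read objects by pattern."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "s3:ListBucket",
             "Resource": f"{S3_PREFIX}DailyReports"},
            {"Effect": "Allow", "Action": "s3:GetObject",
             "Resource": [object_pattern]},
        ],
    }


WITHOUT_SLASH = make_session_policy(f"{S3_PREFIX}DailyReports*")
WITH_SLASH = make_session_policy(f"{S3_PREFIX}DailyReports/*")

if __name__ == "__main__":
    for policy in (WITHOUT_SLASH, WITH_SLASH):
        pattern = policy["Statement"][1]["Resource"][0]
        print(pattern)
        print("  reaches objects in:", buckets_reachable(pattern, BUCKETS))
        print("  flagged:", risky_resources(policy))
```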

Use scope down policies to back up logical directory mappings

When creating a logical directory mapping with a role that has more access than you intend to give your users, it’s important to use session policies to tailor the access appropriately. This provides an extra layer of protection against accidental changes to your logical directory mapping opening access to files you didn’t intend.

Details on how to construct a session policy for an S3 bucket can be found in Creating a session policy for an Amazon S3 bucket, and Create fine-grained session permissions using IAM managed policies provides additional context. Amazon S3 also offers IAM Access Analyzer to assist with this process.

Don’t place NLBs in front of a Transfer Family server

We’ve spoken with many customers who have configured a Network Load Balancer (NLB) to route traffic to their Transfer Family server. Usually, they’ve done this either because they created their server before we offered a way to access it from both inside their VPC and from the internet, or to support FTP on the internet. This not only increases the cost for the customer, it can cause other issues, which we describe in this section.

If you’re using this configuration, we encourage you to move to a VPC endpoint and use an Elastic IP. Placing an NLB in front of your Transfer Family server removes your ability to see the source IP of your users, because Transfer Family will see only the IP address of your NLB. This not only degrades your ability to audit who is accessing your server, it can also impact performance. Transfer Family uses the source IP to shard your connections across our data plane. In the case of FTPS, this means that instead of being able to have 10,000 simultaneous connections, a Transfer Family server with an NLB in front of it would be limited to only 300 simultaneous connections. If you have a use case that requires you to place an NLB in front of your Transfer Family server, reach out to the Transfer Family Product Management team through AWS Support or discuss issues on AWS re:Post, so we can look for options to help you take full advantage of our service.

Protect your API Gateway instance with WAF

If you’re using the custom identity provider capability of Transfer Family, you connect your identity provider through Amazon API Gateway. As a best practice, Transfer Family recommends that you use AWS Web Application Firewall (WAF) to help protect your API Gateway. This will allow you to create access control lists (ACLs) for your API Gateway instance to allow access for only AWS and anyone in the ACL. To help protect your API Gateway instance, see Securing AWS Transfer Family with AWS Web Application Firewall and Amazon API Gateway.

FTPS customers should use TLS session resumption

One of the security challenges with FTPS is that it uses two separate ports to process read/write requests. An analogy to this in the physical world would be going through a drive-thru window where you pay for your food and someone else can cut in front of you to receive your order at the second window. For this reason, security measures have been added to the FTPS protocol over time. In a client-server protocol, there are server-side configurations and client-side configurations.

TLS session resumption helps protect client connections as they hand off between the FTPS control port and the data port. The server sends a unique identifier for each session on the control port, and the client is meant to send that same session identifier back on the data port. This gives the server confidence that it’s talking to the same client on the data port that initiated the session on the control port. Transfer Family endpoints provide three options for session resumption:

  1. Disabled – The server ignores whether the client sends a session ID and doesn’t check that it’s correct, if it is sent. This option exists for backward compatibility reasons, but we don’t recommend it.
  2. Enabled – The server will transmit session IDs and will enforce session IDs if the client uses them, but clients who don’t use session IDs are still allowed to connect. We only recommend this as a transitional state to Enforced to verify client compatibility.
  3. Enforced – Clients must support TLS session resumption, or the server won’t transmit data to them. This is our default and recommended setting.

To use the console to see your TLS session resumption settings:

  1. Sign in to the AWS Management Console in the account where your transfer server runs and go to AWS Transfer Family. Be sure to select the correct AWS Region.
  2. To find your Transfer Family server endpoint, find your Transfer Family server in the console and choose Main Server Details.
  3. Select Additional Details.
  4. Under TLS Session Resumption, you will see if your server is enforcing TLS session resumption.
  5. If some of your users don’t have access to modern FTPS clients that support TLS, you can choose Edit to choose a different option.
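The same check as the console steps above, done across every server in a Region, as an added example (not code from the original post). It reads the `Protocols` and `ProtocolDetails.TlsSessionResumptionMode` fields of the Transfer Family DescribeServer response; the function names and the server descriptions in `SAMPLE_SERVERS` are constructed, and `describe_all_servers` assumes you pass it a boto3 `transfer` client.

```python
# Advice text paraphrases the three options listed in the post.
ADVICE = {
    "DISABLED": "session IDs are not checked; move to ENFORCED",
    "ENABLED": "transitional state; move to ENFORCED once clients are verified",
}


def resumption_findings(server_descriptions):
    """Flag FTPS-enabled servers whose TLS session resumption mode is not ENFORCED.

    server_descriptions: iterable of the 'Server' dicts returned by DescribeServer.
    """
    findings = []
    for server in server_descriptions:
        if "FTPS" not in server.get("Protocols", []):
            continue  # the setting only matters for FTPS endpoints
        mode = server.get("ProtocolDetails", {}).get("TlsSessionResumptionMode", "UNKNOWN")
        if mode != "ENFORCED":
            findings.append({
                "ServerId": server.get("ServerId", "<unknown>"),
                "Mode": mode,
                "Advice": ADVICE.get(mode, "mode not reported; check the server in the console"),
            })
    return findings


def describe_all_servers(transfer_client):
    """Yield the DescribeServer 'Server' dict for every server in the client's Region."""
    paginator = transfer_client.get_paginator("list_servers")
    for page in paginator.paginate():
        for summary in page["Servers"]:
            yield transfer_client.describe_server(ServerId=summary["ServerId"])["Server"]


# Constructed descriptions, trimmed to the fields the check reads.
SAMPLE_SERVERS = [
    {"ServerId": "s-constructed0001", "Protocols": ["FTPS"],
     "ProtocolDetails": {"TlsSessionResumptionMode": "ENFORCED"}},
    {"ServerId": "s-constructed0002", "Protocols": ["SFTP", "FTPS"],
     "ProtocolDetails": {"TlsSessionResumptionMode": "ENABLED"}},
    {"ServerId": "s-constructed0003", "Protocols": ["SFTP"],
     "ProtocolDetails": {}},
    {"ServerId": "s-constructed0004", "Protocols": ["FTPS"],
     "ProtocolDetails": {"TlsSessionResumptionMode": "DISABLED"}},
]

if __name__ == "__main__":
    # Against a real account: import boto3 and call
    # resumption_findings(describe_all_servers(boto3.client("transfer")))
    for finding in resumption_findings(SAMPLE_SERVERS):
        print(finding["ServerId"], finding["Mode"], "-", finding["Advice"])
```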

Conclusion

Transfer Family offers many benefits to help secure your managed file transfer (MFT) solution as the threat landscape evolves. The steps in this post can help you get the most out of Transfer Family to help protect your file transfers. As the requirements for a secure, compliant architecture for file transfers evolve and threats become more sophisticated, Transfer Family will continue to offer optimized solutions and provide actionable advice on how you can use them. For more information, see Security in AWS Transfer Family and take our self-paced security workshop.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Transfer Family re:Post or contact AWS Support.
 

John Jamail

John is the Head of Engineering for AWS Transfer Family. Prior to joining AWS, he spent eight years working in data security focused on security incident and event monitoring (SIEM), governance, risk, and compliance (GRC), and data loss prevention (DLP).

Differentiate generative AI applications with your data using AWS analytics and managed databases

Post Syndicated from Diego Colombatto original https://aws.amazon.com/blogs/big-data/differentiate-generative-ai-applications-with-your-data-using-aws-analytics-and-managed-databases/

While the potential of generative artificial intelligence (AI) is increasingly under evaluation, organizations are at different stages in defining their generative AI vision. In many organizations, the focus is on large language models (LLMs), and foundation models (FMs) more broadly. This is just the tip of the iceberg, because what enables you to obtain differential value from generative AI is your data.

Generative AI applications are still applications, so you need the following:

  • Operational databases to support the user experience for interaction steps outside of invoking generative AI models
  • Data lakes to store your domain-specific data, and analytics to explore them and understand how to use them in generative AI
  • Data integrations and pipelines to manage (sourcing, transforming, enriching, and validating, among others) and render data usable with generative AI
  • Governance to manage aspects such as data quality, privacy and compliance to applicable privacy laws, and security and access controls

LLMs and other FMs are trained on a generally available collective body of knowledge. If you use them as is, they’re going to provide generic answers with no differential value for your company. However, if you use generative AI with your domain-specific data, it can provide a valuable perspective for your business and enable you to build differentiated generative AI applications and products that will stand out from others. In essence, you have to enrich the generative AI models with your differentiated data.

On the importance of company data for generative AI, McKinsey stated that “If your data isn’t ready for generative AI, your business isn’t ready for generative AI.”

In this post, we present a framework to implement generative AI applications enriched and differentiated with your data. We also share a reusable, modular, and extendible asset to quickly get started with adopting the framework and implementing your generative AI application. This asset is designed to augment catalog search engine capabilities with generative AI, improving the end-user experience.

You can extend the solution in directions such as the business intelligence (BI) domain with customer 360 use cases, and the risk and compliance domain with transaction monitoring and fraud detection use cases.

Solution overview

There are three key data elements (or context elements) you can use to differentiate the generative AI responses:

  • Behavioral context – How do you want the LLM to behave? Which persona should the FM impersonate? We call this behavioral context. You can provide these instructions to the model through prompt templates.
  • Situational context – Is the user request part of an ongoing conversation? Do you have any conversation history and states? We call this situational context. Also, who is the user? What do you know about the user and their request? This data is derived from your purpose-built data stores and previous interactions.
  • Semantic context – Is there any meaningfully relevant data that would help the FMs generate the response? We call this semantic context. This is typically obtained from vector stores and searches. For example, if you’re using a search engine to find products in a product catalog, you could store product details, encoded into vectors, into a vector store. This will enable you to run different kinds of searches.

Using these three context elements together is more likely to provide a coherent, accurate answer than relying purely on a generally available FM.

There are different approaches to design this type of solution; one method is to use generative AI with up-to-date, context-specific data by supplementing the in-context learning pattern using Retrieval Augmented Generation (RAG) derived data, as shown in the following figure. A second approach is to use your fine-tuned or custom-built generative AI model with up-to-date, context-specific data.

The framework used in this post enables you to build a solution with or without fine-tuned FMs and using all three context elements, or a subset of these context elements, using the first approach. The following figure illustrates the functional architecture.

Technical architecture

When implementing an architecture like that illustrated in the previous section, there are some key aspects to consider. The primary aspect is that, when the application receives the user input, it should process it and provide a response to the user as quickly as possible, with minimal response latency. This part of the application should also use data stores that can handle the throughput in terms of concurrent end-users and their activity. This means predominantly using transactional and operational databases.

Depending on the goals of your use case, you might store prompt templates separately in Amazon Simple Storage Service (Amazon S3) or in a database, if you want to apply different prompts for different usage conditions. Alternatively, you might treat them as code and use source code control to manage their evolution over time.

NoSQL databases like Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon MemoryDB can provide low read latencies and are well suited to handle your conversation state and history (situational context). The document and key value data models allow you the flexibility to adjust the schema of the conversation state over time.

User profiles or other user information (situational context) can come from a variety of database sources. You can store that data in relational databases like Amazon Aurora, NoSQL databases, or graph databases like Amazon Neptune.

The semantic context originates from vector data stores or machine learning (ML) search services. Amazon Aurora PostgreSQL-Compatible Edition with pgvector and Amazon OpenSearch Service are great options if you want to interact with vectors directly. Amazon Kendra, our ML-based search engine, is a great fit if you want the benefits of semantic search without explicitly maintaining vectors yourself or tuning the similarity algorithms to be used.

Amazon Bedrock is a fully managed service that makes high-performing FMs from leading AI startups and Amazon available through a unified API. You can choose from a wide range of FMs to find the model that is best suited for your use case. Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Amazon Bedrock provides integrations with both Aurora and OpenSearch Service, so you don’t have to explicitly query the vector data store yourself.

The following figure summarizes the AWS services available to support the solution framework described so far.

Catalog search use case

We present a use case showing how to augment the search capabilities of an existing search engine for product catalogs, such as ecommerce portals, using generative AI and customer data.

Each customer will have their own requirements, so we adopt the framework presented in the previous sections and show an implementation of the framework for the catalog search use case. You can use this framework for both catalog search use cases and as a foundation to be extended based on your requirements.

One additional benefit of this catalog search implementation is that it plugs into existing ecommerce portals, search engines, and recommender systems, so you don’t have to redesign or rebuild your processes and tools; this solution will augment what you currently have with limited changes required.

The solution architecture and workflow are shown in the following figure.

The workflow consists of the following steps:

  1. The end-user browses the product catalog and submits a search, in natural language, using the web interface of the frontend catalog application (not shown). The catalog frontend application sends the user search to the generative AI application. Application logic is currently implemented as a container, but it can be deployed with AWS Lambda as required.
  2. The generative AI application connects to Amazon Bedrock to convert the user search into embeddings.
  3. The application connects with OpenSearch Service to search and retrieve relevant search results (using an OpenSearch index containing products). The application also connects to another OpenSearch index to get user reviews for products listed in the search results. In terms of searches, different options are possible, such as k-NN, hybrid search, or sparse neural search. For this post, we use k-NN search. At this stage, before creating the final prompt for the LLM, the application can perform an additional step to retrieve situational context from operational databases, such as customer profiles, user preferences, and other personalization information.
  4. The application gets prompt templates from an S3 data lake and creates the engineered prompt.
  5. The application sends the prompt to Amazon Bedrock and retrieves the LLM output.
  6. The user interaction is stored in a data lake for downstream usage and BI analysis.
  7. The Amazon Bedrock output retrieved in Step 5 is sent to the catalog application frontend, which shows results on the web UI to the end-user.
  8. DynamoDB stores the product list used to display products in the ecommerce product catalog. DynamoDB zero-ETL integration with OpenSearch Service is used to replicate product keys into OpenSearch.

Security considerations

Security and compliance are key concerns for any business. When adopting the solution described in this post, you should always factor in the Security Pillar best practices from the AWS Well-Architected Framework.

There are different security categories to consider and different AWS Security services you can use in each security category. The following are some examples relevant for the architecture shown in this post:

  • Data protection – You can use AWS Key Management Service (AWS KMS) to manage keys and encrypt data based on the data classification policies defined. You can also use AWS Secrets Manager to manage, retrieve, and rotate database credentials, API keys, and other secrets throughout their lifecycles.
  • Identity and access management – You can use AWS Identity and Access Management (IAM) to specify who or what can access services and resources in AWS, centrally manage fine-grained permissions, and analyze access to refine permissions across AWS.
  • Detection and response – You can use AWS CloudTrail to track and provide detailed audit trails of user and system actions to support audits and demonstrate compliance. Additionally, you can use Amazon CloudWatch to observe and monitor resources and applications.
  • Network security – You can use AWS Firewall Manager to centrally configure and manage firewall rules across your accounts and AWS network security services, such as AWS WAF, AWS Network Firewall, and others.

Conclusion

In this post, we discussed the importance of using customer data to differentiate generative AI usage in applications. We presented a reference framework (including a functional architecture and a technical architecture) to implement a generative AI application using customer data and an in-context learning pattern with RAG-provided data. We then presented an example of how to apply this framework to design a generative AI application using customer data to augment search capabilities and personalize the search results of an ecommerce product catalog.

Contact AWS to get more information on how to implement this framework for your use case. We’re also happy to share the technical asset presented in this post to help you get started building generative AI applications with your data for your specific use case.


About the Authors

Diego Colombatto is a Senior Partner Solutions Architect at AWS. He brings more than 15 years of experience in designing and delivering Digital Transformation projects for enterprises. At AWS, Diego works with partners and customers advising how to leverage AWS technologies to translate business needs into solutions.

Angel Conde Manjon is a Sr. EMEA Data & AI PSA, based in Madrid. He has previously worked on research related to Data Analytics and Artificial Intelligence in diverse European research projects. In his current role, Angel helps partners develop businesses centered on Data and AI.

Tiziano Curci is a Manager, EMEA Data & AI PDS at AWS. He leads a team that works with AWS Partners (G/SI and ISV), to leverage the most comprehensive set of capabilities spanning databases, analytics and machine learning, to help customers unlock the power of data through an end-to-end data strategy.

Integrate sparse and dense vectors to enhance knowledge retrieval in RAG using Amazon OpenSearch Service

Post Syndicated from Yuanbo Li original https://aws.amazon.com/blogs/big-data/integrate-sparse-and-dense-vectors-to-enhance-knowledge-retrieval-in-rag-using-amazon-opensearch-service/

In the context of Retrieval-Augmented Generation (RAG), knowledge retrieval plays a crucial role, because the effectiveness of retrieval directly impacts the maximum potential of large language model (LLM) generation.

Currently, in RAG retrieval, the most common approach is to use semantic search based on dense vectors. However, dense embeddings do not perform well in understanding specialized terms or jargon in vertical domains. A more advanced method is to combine it with traditional inverted-index (BM25) based retrieval, but this approach requires spending a considerable amount of time customizing lexicons, synonym dictionaries, and stop-word dictionaries for optimization.

In this post, instead of using the BM25 algorithm, we introduce sparse vector retrieval. This approach offers improved term expansion while maintaining interpretability. We walk through the steps of integrating sparse and dense vectors for knowledge retrieval using Amazon OpenSearch Service and run experiments on public datasets to show its advantages. The full code is available in the GitHub repo aws-samples/opensearch-dense-spase-retrieval.

What is sparse vector retrieval?

Sparse vector retrieval is a recall method based on an inverted index, with an added step of term expansion. It comes in two modes: document-only and bi-encoder. For more details about these two terms, see Improving document retrieval with sparse semantic encoders.

Simply put, in document-only mode, term expansion is performed only during document ingestion. In bi-encoder mode, term expansion is conducted both during ingestion and at the time of query. Bi-encoder mode improves performance but may cause more latency. The following figure demonstrates its effectiveness.

Neural sparse search in OpenSearch achieves 12.7% (document-only) to 20% (bi-encoder) higher NDCG@10, comparable to the TAS-B dense vector model.

With neural sparse search, you don’t need to configure the dictionary yourself. It will automatically expand terms for the user. Additionally, in an OpenSearch index with a small and specialized dataset, hit terms are generally few, and the calculated term frequency may lead to unreliable term weights. This may lead to significant bias or distortion in BM25 scoring. However, sparse vector retrieval first expands terms, greatly increasing the number of hit terms compared to before. This helps produce more reliable scores.

Although the absolute metrics of the sparse vector model can’t surpass those of the best dense vector models, it possesses unique and advantageous characteristics. For instance, in terms of the NDCG@10 metric, as mentioned in Improving document retrieval with sparse semantic encoders, evaluations on some datasets reveal that its performance could be better than state-of-the-art dense vector models, such as in the DBPedia dataset. This indicates a certain level of complementarity between them. Intuitively, for some extremely short user inputs, the vectors generated by dense vector models might have significant semantic uncertainty, where overlaying with a sparse vector model could be beneficial. Additionally, sparse vector retrieval still maintains interpretability, and you can still observe the scoring calculation through the explanation command. To take advantage of both methods, OpenSearch has already introduced a built-in feature called hybrid search.

How to combine dense and sparse?

1. Deploy a dense vector model

To get more valuable test results, we selected Cohere-embed-multilingual-v3.0, which is one of several popular models used in production for dense vectors. We can access it through Amazon Bedrock and use the following two functions to create a connector for bedrock-cohere and then register it as a model in OpenSearch. You can get its model ID from the response.

import json

import boto3
import requests
from requests_aws4auth import AWS4Auth

def create_bedrock_cohere_connector(account_id, aos_endpoint, input_type='search_document'):
    # input_type could be search_document | search_query
    service = 'es'
    session = boto3.Session()
    credentials = session.get_credentials()
    region = session.region_name
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

    path = '/_plugins/_ml/connectors/_create'
    url = 'https://' + aos_endpoint + path

    role_name = "OpenSearchAndBedrockRole"
    role_arn = "arn:aws:iam::{}:role/{}".format(account_id, role_name)
    model_name = "cohere.embed-multilingual-v3"

    bedrock_url = "https://bedrock-runtime.{}.amazonaws.com/model/{}/invoke".format(region, model_name)

    payload = {
      "name": "Amazon Bedrock Connector: Cohere doc embedding",
      "description": "The connector to the Bedrock Cohere multilingual doc embedding model",
      "version": 1,
      "protocol": "aws_sigv4",
      "parameters": {
        "region": region,
        "service_name": "bedrock"
      },
      "credential": {
        "roleArn": role_arn
      },
      "actions": [
        {
          "action_type": "predict",
          "method": "POST",
          "url": bedrock_url,
          "headers": {
            "content-type": "application/json",
            "x-amz-content-sha256": "required"
          },
          # Pass the requested input_type through instead of hard-coding search_document
          "request_body": "{ \"texts\": ${parameters.texts}, \"input_type\": \"" + input_type + "\" }",
          "pre_process_function": "connector.pre_process.cohere.embedding",
          "post_process_function": "connector.post_process.cohere.embedding"
        }
      ]
    }
    headers = {"Content-Type": "application/json"}

    r = requests.post(url, auth=awsauth, json=payload, headers=headers)
    r.raise_for_status()  # surface an auth or validation error instead of a KeyError below
    return r.json()["connector_id"]
    
def register_and_deploy_aos_model(aos_client, model_name, model_group_id, description, connecter_id):
    request_body = {
        "name": model_name,
        "function_name": "remote",
        "model_group_id": model_group_id,
        "description": description,
        "connector_id": connecter_id
    }

    response = aos_client.transport.perform_request(
        method="POST",
        url="/_plugins/_ml/models/_register?deploy=true",
        body=json.dumps(request_body)
    )

    # The model ID is read from this response
    return response

2. Deploy a sparse vector model

Currently, you can’t deploy the sparse vector model in an OpenSearch Service domain. You must deploy it in Amazon SageMaker first, then integrate it through an OpenSearch Service model connector. For more information, see Amazon OpenSearch Service ML connectors for AWS services.

Complete the following steps:

2.1 On the OpenSearch Service console, choose Integrations in the navigation pane.

2.2 Under Integration with Sparse Encoders through Amazon SageMaker, choose to configure a VPC domain or public domain.

Next, you configure the AWS CloudFormation template.

2.3 Enter the parameters as shown in the following screenshot.

2.4 Get the sparse model ID from the stack output.

3. Set up pipelines for ingestion and search

Use the following code to create pipelines for ingestion and search. With these two pipelines, you don’t need to run model inference yourself; you ingest only the text field.

PUT /_ingest/pipeline/neural-sparse-pipeline
{
  "description": "neural sparse encoding pipeline",
  "processors" : [
    {
      "sparse_encoding": {
        "model_id": "<neural_sparse_model_id>",
        "field_map": {
           "content": "sparse_embedding"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "<cohere_ingest_model_id>",
        "field_map": {
          "content": "dense_embedding"
        }
      }
    }
  ]
}

PUT /_search/pipeline/hybird-search-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "l2"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.5,
              0.5
            ]
          }
        }
      }
    }
  ]
}
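What the search pipeline above does to the two sub-query result sets, as an added example (not code from the original post or from OpenSearch): it models l2 normalization followed by a weighted arithmetic mean. The function names and scores are constructed, and treating a document missing from one sub-query as scoring 0 there is an assumption of this model.

```python
import math


def l2_normalize(scores):
    """Divide each score by the L2 norm of all scores returned by one sub-query."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    if norm == 0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: s / norm for doc_id, s in scores.items()}


def combine_arithmetic_mean(per_query_scores, weights):
    """Weighted mean of the normalized scores, one weight per sub-query.

    Assumption: a document absent from a sub-query contributes 0 for it.
    """
    doc_ids = set().union(*per_query_scores)
    total_weight = sum(weights)
    return {
        doc_id: sum(w * scores.get(doc_id, 0.0)
                    for w, scores in zip(weights, per_query_scores)) / total_weight
        for doc_id in doc_ids
    }


def hybrid_rank(sub_query_hits, weights=(0.5, 0.5)):
    """Normalize each sub-query's hits, combine them, and sort by combined score."""
    normalized = [l2_normalize(hits) for hits in sub_query_hits]
    combined = combine_arithmetic_mean(normalized, weights)
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Constructed raw scores. Sparse (rank_features) and dense (cosine) scores live
# on different scales, which is why they are normalized before being combined.
SPARSE_HITS = {"doc_a": 3.0, "doc_b": 4.0}
DENSE_HITS = {"doc_b": 0.6, "doc_c": 0.8}

if __name__ == "__main__":
    # Weights are listed in the same order as the sub-queries of the hybrid query.
    print(hybrid_rank([SPARSE_HITS, DENSE_HITS], weights=(0.5, 0.5)))
    print(hybrid_rank([SPARSE_HITS, DENSE_HITS], weights=(0.8, 0.2)))
```

With equal weights the document found by both sub-queries ranks first; shifting weight toward the sparse sub-query moves doc_a above doc_c.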

4. Create an OpenSearch index with dense and sparse vectors

Use the following code to create an OpenSearch index with dense and sparse vectors. You must specify the default_pipeline as the ingestion pipeline created in the previous step.

PUT {index-name}
{
    "settings" : {
        "index":{
            "number_of_shards" : 1,
            "number_of_replicas" : 0,
            "knn": "true",
            "knn.algo_param.ef_search": 32
        },
        "default_pipeline": "neural-sparse-pipeline"
    },
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart"},
            "dense_embedding": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {
                        "ef_construction": 512,
                        "m": 32
                    }
                }            
            },
            "sparse_embedding": {
                "type": "rank_features"
            }
        }
    }
}

Testing methodology

1. Experimental data selection

For retrieval evaluation, we previously used the datasets from BeIR, but not all datasets from BeIR are suitable for RAG. To mimic the knowledge retrieval scenario, we chose BeIR/fiqa and squad_v2 as our experimental datasets. The schema of the data is shown in the following figures.

The following is a data preview of squad_v2.

The following is a query preview of BeIR/fiqa.

The following is a corpus preview of BeIR/fiqa.

You can find question and context equivalent fields in the BeIR/fiqa datasets. This is almost the same as the knowledge recall in RAG. In subsequent experiments, we input the context field into the index of OpenSearch as text content, and use the question field as a query for the retrieval test.

2. Test data ingestion

The following script ingests data into the OpenSearch Service domain:

import time

from tqdm import tqdm
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from setup_model_and_pipeline import get_aos_client

# Placeholders: set these for your environment before running
aos_endpoint = "<your-opensearch-domain-endpoint>"
index_name = "<index-name>"
dataset_name = "fiqa"  # the BeIR/fiqa dataset used in this post
data_root_dir = "<local-dataset-directory>"

aos_client = get_aos_client(aos_endpoint)

def bulk_with_retry(aos_client, bulk_body):
    # Send one bulk request; if any item failed, wait a second and retry once
    response = aos_client.bulk(bulk_body, request_timeout=100)
    if response["errors"]:
        print("bulk request reported errors, retrying")
        print(response)
        time.sleep(1)
        response = aos_client.bulk(bulk_body, request_timeout=100)
    return response

def ingest_dataset(corpus, aos_client, index_name, bulk_size=50):
    bulk_body = []
    for i, (_id, body) in enumerate(tqdm(corpus.items()), start=1):
        text = body["title"] + " " + body["text"]
        bulk_body.append({"index": {"_index": index_name, "_id": _id}})
        bulk_body.append({"content": text})
        if i % bulk_size == 0:
            bulk_with_retry(aos_client, bulk_body)
            bulk_body = []

    # Flush the last partial batch; skip it when empty, because an empty bulk body is rejected
    if bulk_body:
        response = aos_client.bulk(bulk_body, request_timeout=100)
        assert not response["errors"]
    aos_client.indices.refresh(index=index_name)

url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip"
data_path = util.download_and_unzip(url, data_root_dir)
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
ingest_dataset(corpus, aos_client=aos_client, index_name=index_name)

3. Performance evaluation of retrieval

In RAG knowledge retrieval, we usually focus on the relevance of top results, so our evaluation uses recall@4 as the metric indicator. The whole test will include various retrieval methods to compare, such as bm25_only, sparse_only, dense_only, hybrid_sparse_dense, and hybrid_dense_bm25.

The following script uses hybrid_sparse_dense to demonstrate the evaluation logic:

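import json

from tqdm import tqdm
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# aos_client, index_name, dataset_name and data_root_dir are set as in the ingestion script;
# sparse_model_id and dense_model_id are the model IDs obtained in the deployment steps

def search_by_dense_sparse(aos_client, index_name, query, sparse_model_id, dense_model_id, topk=4):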
    request_body = {
      "size": topk,
      "query": {
        "hybrid": {
          "queries": [
            {
              "neural_sparse": {
                  "sparse_embedding": {
                    "query_text": query,
                    "model_id": sparse_model_id,
                    "max_token_score": 3.5
                  }
              }
            },
            {
              "neural": {
                  "dense_embedding": {
                      "query_text": query,
                      "model_id": dense_model_id,
                      "k": 10
                    }
                }
            }
          ]
        }
      }
    }

    response = aos_client.transport.perform_request(
        method="GET",
        url=f"/{index_name}/_search?search_pipeline=hybird-search-pipeline",
        body=json.dumps(request_body)
    )

    return response["hits"]["hits"]
    
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset_name}.zip"
data_path = util.download_and_unzip(url, data_root_dir)
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
run_res={}
for _id, query in tqdm(queries.items()):
    hits = search_by_dense_sparse(aos_client, index_name, query, sparse_model_id, dense_model_id, topk)
    run_res[_id]={item["_id"]:item["_score"] for item in hits}
    
for query_id, doc_dict in tqdm(run_res.items()):
    if query_id in doc_dict:
        doc_dict.pop(query_id)
res = EvaluateRetrieval.evaluate(qrels, run_res, [1, 4, 10])
print("search_by_dense_sparse:")
print(res)

Results

In the context of RAG, usually the developer doesn’t pay attention to the metric NDCG@10; the LLM will pick up the relevant context automatically. We care more about the recall metric. Based on our experience of RAG, we measured recall@1, recall@4, and recall@10 for your reference.

The dataset BeIR/fiqa is mainly used for evaluation of retrieval, whereas squad_v2 is mainly used for evaluation of reading comprehension. In terms of retrieval, squad_v2 is much less complicated than BeIR/fiqa. In the real RAG context, the difficulty of retrieval may not be as high as with BeIR/fiqa, so we evaluate both datasets.

The hybrid_dense_sparse method is always beneficial. The following table shows our results.

| Method | BeIR/fiqa Recall@1 | BeIR/fiqa Recall@4 | BeIR/fiqa Recall@10 | squad_v2 Recall@1 | squad_v2 Recall@4 | squad_v2 Recall@10 |
|---|---|---|---|---|---|---|
| bm25 | 0.112 | 0.215 | 0.297 | 0.59 | 0.771 | 0.851 |
| dense | 0.156 | 0.316 | 0.398 | 0.671 | 0.872 | 0.925 |
| sparse | 0.196 | 0.334 | 0.438 | 0.684 | 0.865 | 0.926 |
| hybrid_dense_sparse | 0.203 | 0.362 | 0.456 | 0.704 | 0.885 | 0.942 |
| hybrid_dense_bm25 | 0.156 | 0.316 | 0.394 | 0.671 | 0.871 | 0.925 |

Conclusion

The new neural sparse search feature in OpenSearch Service version 2.11, when combined with dense vector retrieval, can significantly improve the effectiveness of knowledge retrieval in RAG scenarios. Compared to the combination of bm25 and dense vector retrieval, it’s more straightforward to use and more likely to achieve better results.

OpenSearch Service version 2.12 has recently upgraded its Lucene engine, significantly enhancing the throughput and latency performance of neural sparse search. But the current neural sparse search only supports English. In the future, other languages might be supported. As the technology continues to evolve, it stands to become a popular and widely applicable way to enhance retrieval performance.


About the Author

YuanBo Li is a Specialist Solution Architect in GenAI/AIML at Amazon Web Services. His interests include RAG (Retrieval-Augmented Generation) and Agent technologies within the field of GenAI, and he is dedicated to proposing innovative GenAI technical solutions tailored to meet diverse business needs.

Charlie Yang is an AWS engineering manager with the OpenSearch Project. He focuses on machine learning, search relevance, and performance optimization.

River Xie is a Gen AI specialist solution architect at Amazon Web Services. River is interested in Agent/Multi-Agent workflow and Large Language Model inference optimization, and is passionate about leveraging cutting-edge Generative AI technologies to develop modern applications that solve complex business challenges.

Ren Guo is a manager of Generative AI Specialist Solution Architect Team for the domains of AIML and Data at AWS, Greater China Region.

Best Practices for working with Pull Requests in Amazon CodeCatalyst

Post Syndicated from Fahim Sajjad original https://aws.amazon.com/blogs/devops/best-practices-for-working-with-pull-requests-in-amazon-codecatalyst/

According to the Well-Architected DevOps Guidance, “A peer review process for code changes is a strategy for ensuring code quality and shared responsibility. To support separation of duties in a DevOps environment, every change should be reviewed and approved by at least one other person before merging.” Development teams often implement the peer review process in their Software Development Lifecycle (SDLC) by leveraging Pull Requests (PRs). Amazon CodeCatalyst has recently released three new features to facilitate a robust peer review process. Pull Request Approval Rules enforce a minimum number of approvals to ensure multiple peers review a proposed change prior to a progressive deployment. Amazon Q pull request summaries can automatically summarize code changes in a PR, saving time for both the creator and reviewer. Lastly, Nested Comments allows teams to organize conversations and feedback left on a PR to ensure efficient resolution.

This blog will demonstrate how a DevOps lead can leverage new features available in CodeCatalyst to accomplish the following requirements covering best practices: 1. Require at least two people to review every PR prior to deployment, and 2. Reduce the review time to merge (RTTM).

Prerequisites

If you are using CodeCatalyst for the first time, you’ll need the following to follow along with the steps outlined in the blog post:

Pull request approval rules

Approval rules can be configured for branches in a repository. When you create a PR whose destination branch has an approval rule configured for it, the requirements for the rule must be met before the PR can be merged.

In this section, you will implement approval rules on the default branch (main in this case) in the application’s repository to implement the new ask from leadership requiring that at least two people review every PR before deployment.

Step 1: Creating the application
Pull Request approval rules work with every project but in this blog, we’ll leverage the Modern three-tier web application blueprint for simplicity to implement PR approval rules for merging to the main branch.

The image shows the interface of the Amazon CodeCatalyst platform, which allows users to create new projects in three different ways. The three options are "Start with a blueprint", "Bring your own code", and "Start from scratch". In the image, the "Start with a blueprint" option is selected, and the "Modern three-tier web application" blueprint is chosen.

Figure 1: Creating a new Modern three-tier application Blueprint

  1. First, within your space click “Create Project” and select the Modern three-tier web application CodeCatalyst Blueprint as shown above in Figure 1.
  2. Enter a Project name and select: Lambda for the Compute Platform and Amplify Hosting for Frontend Hosting Options. Additionally, ensure your AWS account is selected along with creating a new IAM Role.
  3. Finally, click Create Project and a new project will be created based on the Blueprint.

Once the project is successfully created, the application will deploy via a CodeCatalyst workflow, assuming the AWS account and IAM role were set up correctly. The deployed application will be similar to the Mythical Mysfits website.

Step 2: Creating an approval rule

Next, to satisfy the new requirement of ensuring at least two people review every PR before deployment, you will create the approval rule for members when they create a pull request to merge into the main branch.

  1. Navigate to the project you created in the previous step.
  2. In the navigation pane, choose Code, and then choose Source repositories.
  3. Next, choose the mysfits repository that was created as part of the Blueprint.
    1. On the overview page of the repository, choose Branches.
    2. For the main branch, click View under the Approval Rules column.
  4. In Minimum number of approvals, the number corresponds to the number of approvals required before a pull request can be merged to that branch.
  5. Now, you’ll change the approval rule to satisfy the requirement to ensure at least 2 people review every PR. Choose Manage settings. On the settings page for the source repository, in Approval rules, choose Edit.
  6. In Destination Branch, from the drop-down list, choose main as the name of the branch to configure an approval rule. In Minimum number of approvals, enter 2, and then choose Save.
The image shows an interface for creating an approval rule. It allows users to specify the destination branch and the minimum number of approvals required before a pull request can be merged. In the image, 'main' is selected as the destination branch, and '2' is set as the minimum number of approvals. The interface also provides "Cancel" and "Save" buttons to either discard or commit the approval rule settings.

Figure 2: Creating a new approval rule

Note: You must have the Project administrator role to create and manage approval rules in CodeCatalyst projects. You cannot create approval rules for linked repositories.

When implementing approval rules and branch restrictions in your repositories, ensure you take into consideration the following best practices:

  • For branches deemed critical or important, ensure only highly privileged users are allowed to Push to the Branch and Delete the Branch in the branch rules. This prevents accidental deletion of critical or important branches as well as ensuring any changes introduced to the branch are reviewed before deployment.
  • Ensure Pull Request approval rules are in place for branches your team considers critical or important. While there is no specific recommended number due to varying team size and project complexity, the minimum number of approvals is recommended to be at least one and research has found the optimal number to be two.

In this section, you walked through the steps to create a new approval rule to satisfy the requirement of ensuring at least two people review every PR before deployment on your CodeCatalyst repository.

Amazon Q pull request summaries

Now, you begin exploring ways that can help development teams reduce MTTR. You begin reading about Amazon Q pull request summaries and how this feature can automatically summarize code changes and start to explore this feature in further detail.

While creating a pull request, in Pull request description, you can leverage the Write description for me feature, as seen in Figure 3 below, to have Amazon Q create a description of the changes contained in the pull request.

The image displays an interface for a pull request details page. At the top, it shows the source repository where the changes being reviewed are located, which is "mysfits1ru6c". Below that, there are two dropdown menus - one for the destination branch where the changes will be merged, set to "main", and one for the source branch containing the changes, set to "test-branch". The interface also includes a field for the pull request title, which is set to "Updated Title", and an optional description field. The description field has a button labeled "Write description for me" that allows the user to have the system automatically generate a description for the pull request leveraging Amazon Q.

Figure 3: Amazon Q write description for me feature

Once the description is generated, you can Accept and add to description, as seen in Figure 4 below. As a best practice, once Amazon Q has generated the initial PR summary, you should incorporate any specific organizational or team requirements into the summary before creating the PR. This allows developers to save time and reduce MTTR in generating the PR summary while ensuring all requirements are met.

The image displays an interface for a pull request details page. It shows the source repository as "mystits1ruc" and the destination branch as "main", with the source branch set to "test-branch". The interface also includes a field for the pull request title, which is set to "Updated Title". Underneath that is the optional Pull Request description, which is populated with a description generated from Amazon Q. Below the description field, there are two buttons - "Accept and add to description" and "Hide preview" - that allow the user to accept the description and add it to the pull request.

Figure 4: PR Summary generated by Amazon Q

CodeCatalyst offers an Amazon Q feature that summarizes pull request comments, enabling developers to quickly grasp key points. When many comments are left by reviewers, it can be difficult to understand common themes in the feedback, or even be sure that you’ve addressed all the comments in all revisions. You can use the Create comment summary feature to have Amazon Q analyze the comments and provide a summary for you, as seen in Figure 5 below.

The image shows an interface where pull request title is set to "New Title Update," and the description provides details on the changes being made. Below the description, there is a "Comment summary" section that offers instructions for summarizing the pull request comments. Additionally, there is a "Create comment summary" button, which allows the user to generate a summary of the comments using Amazon Q.

Figure 5: Comment summary

Nested Comments

When reviewing various PRs for the development teams, you notice that feedback and subsequent conversations often happen within disparate and separate comments. This makes reviewing, understanding and addressing the feedback cumbersome and time consuming for the individual developers. Nested Comments in CodeCatalyst can organize conversations and reduce MTTR.

You’ll leverage the existing project to walk through how to use the Nested Comments feature:

Step 1: Creating the PR

  1. Click the mysfits repository, and on the overview page of the repository, choose More, and then choose Create branch.
  2. Open the web/index.html file
    • Edit the file to update the text in the <title> block to Mythical Mysfits new title update! and Commit the changes.
  3. Create a pull request by using test-branch as the Source branch and main as the Destination branch. Your PR should now look similar to Figure 6 below:
The image shows the Amazon CodeCatalyst interface, which is used to compare code changes between different revisions of a project. The interface displays a side-by-side view of the "web/index.html" file, highlighting the changes made between the main branch and Revision 1. The differences are ready for review, as indicated by the green message at the top.

Figure 6: Pull Request with updated Title

Step 2: Review PR and add Comments

  1. Review the PR, ensure you are on the Changes tab (similar to Figure 3), click the Comment icon and leave a comment. Normally this would be done by the Reviewer but you will simulate being both the Reviewer and Developer in this walkthrough.
  2. With the comment still open, hit Reply and add another comment as a response to the initial comment. The PR should now look similar to Figure 7 below.
This image shows a pull request interface where changes have been made to the HTML title of a web page. Below the code changes, there is a section for comments related to this pull request. The comments show nested comments between two developers, who are discussing and confirming the changes to the title.

Figure 7: PR with Nested Comments

When leaving comments on a PR in CodeCatalyst, ensure you take into consideration the following best practices:

  • Feedback or conversation focused on a specific topic or piece of code should leverage the nested comments feature. This will ensure the conversation can be easily followed and that context and intent are not lost in a sea of individual comments.
  • The author of the PR should address all comments by either making updates to the code or replying to the comment. This indicates to the reviewer that each comment was reviewed and addressed accordingly.
  • Feedback should be constructive in nature on PRs. Research has found that, “destructive criticism had a negative impact on participants’ moods and motivation to continue working.”

Clean-up

As part of following the steps in this blog post, if you upgraded your space to Standard or Enterprise tier, please ensure you downgrade to the Free tier to avoid any unwanted additional charges. Additionally, delete any projects you may have created during this walkthrough.

Conclusion

In today’s fast-paced software development environment, maintaining a high standard for code changes is crucial. With its recently introduced features, including Pull Request Approval Rules, Amazon Q pull request summaries, and nested comments, CodeCatalyst empowers development teams to ensure a robust pull request review process is in place. These features streamline collaboration, automate documentation tasks, and facilitate organized discussions, enabling developers to focus on delivering high-quality code while maximizing productivity. By leveraging these powerful tools, teams can confidently merge code changes into production, knowing that they have undergone rigorous review and meet the necessary standards for reliability and performance.

About the authors

Brent Everman

Brent is a Senior Technical Account Manager with AWS, based out of Pittsburgh. He has over 17 years of experience working with enterprise and startup customers. He is passionate about improving the software development experience and specializes in AWS’ Next Generation Developer Experience services.

Brendan Jenkins

Brendan Jenkins is a Solutions Architect at Amazon Web Services (AWS) working with Enterprise AWS customers providing them with technical guidance and helping achieve their business goals. He has an area of specialization in DevOps and Machine Learning technology.

Fahim Sajjad

Fahim is a Solutions Architect at Amazon Web Services. He helps customers transform their business by helping in designing their cloud solutions and offering technical guidance. Fahim graduated from the University of Maryland, College Park with a degree in Computer Science. He has a deep interest in AI and Machine learning. Fahim enjoys reading about new advancements in technology and hiking.

Abdullah Khan

Abdullah is a Solutions Architect at AWS. He attended the University of Maryland, Baltimore County where he earned a degree in Information Systems. Abdullah currently helps customers design and implement solutions on the AWS Cloud. He has a strong interest in artificial intelligence and machine learning. In his spare time, Abdullah enjoys hiking and listening to podcasts.

Optimize cost and performance for Amazon MWAA

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/optimize-cost-and-performance-for-amazon-mwaa/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed service for Apache Airflow that allows you to orchestrate data pipelines and workflows at scale. With Amazon MWAA, you can design Directed Acyclic Graphs (DAGs) that describe your workflows without managing the operational burden of scaling the infrastructure. In this post, we provide guidance on how you can optimize performance and save cost by following best practices.

Amazon MWAA environments include four Airflow components hosted on groups of AWS compute resources: the scheduler that schedules the work, the workers that implement the work, the web server that provides the UI, and the metadata database that keeps track of state. For intermittent or varying workloads, optimizing costs while maintaining price and performance is crucial. This post outlines best practices to achieve cost optimization and efficient performance in Amazon MWAA environments, with detailed explanations and examples. It may not be necessary to apply all of these best practices for a given Amazon MWAA workload; you can selectively choose and implement relevant and applicable principles for your specific workloads.

Right-sizing your Amazon MWAA environment

Right-sizing your Amazon MWAA environment makes sure you have an environment that is able to concurrently scale across your different workloads to provide the best price-performance. The environment class you choose for your Amazon MWAA environment determines the size and the number of concurrent tasks supported by the worker nodes. In Amazon MWAA, you can choose from five different environment classes. In this section, we discuss the steps you can follow to right-size your Amazon MWAA environment.

Monitor resource utilization

The first step in right-sizing your Amazon MWAA environment is to monitor the resource utilization of your existing setup. You can monitor the underlying components of your environments using Amazon CloudWatch, which collects raw data and processes data into readable, near real-time metrics. With these environment metrics, you have greater visibility into key performance indicators to help you appropriately size your environments and debug issues with your workflows. Based on the concurrent tasks needed for your workload, you can adjust the environment size as well as the maximum and minimum workers needed. CloudWatch will provide CPU and memory utilization for all the underlying AWS services utilized by Amazon MWAA. Refer to Container, queue, and database metrics for Amazon MWAA for additional details on available metrics for Amazon MWAA. These metrics also include the number of base workers, additional workers, schedulers, and web servers.

Analyze your workload patterns

Next, take a deep dive into your workflow patterns. Examine DAG schedules, task concurrency, and task runtimes. Monitor CPU/memory usage during peak periods. Query CloudWatch metrics and Airflow logs. Identify long-running tasks, bottlenecks, and resource-intensive operations for optimal environment sizing. Understanding the resource demands of your workload will help you make informed decisions about the appropriate Amazon MWAA environment class to use.

Choose the right environment class

Match requirements to Amazon MWAA environment class specifications (mw1.small to mw1.2xlarge) that can handle your workload efficiently. You can vertically scale up or scale down an existing environment through an API, the AWS Command Line Interface (AWS CLI), or the AWS Management Console. Be aware that a change in the environment class requires a scheduled downtime.

Fine tune configuration parameters

Fine-tuning configuration parameters in Apache Airflow is crucial for optimizing workflow performance and cost reductions. It allows you to tune settings such as Auto scaling, parallelism, logging, and DAG code optimizations.

Auto scaling

Amazon MWAA supports worker auto scaling, which automatically adjusts the number of running worker and web server nodes based on your workload demands. You can specify the minimum and maximum number of Airflow workers that run in your environment. For worker node auto scaling, Amazon MWAA uses RunningTasks and QueuedTasks metrics, where (tasks running + tasks queued) / (tasks per worker) = (required workers). If the required number of workers is greater than the current number of running workers, Amazon MWAA will add additional worker instances using AWS Fargate, up to the maximum value specified by the maximum worker configuration.

Auto scaling in Amazon MWAA will gracefully downscale when there are more additional workers than required. For example, let’s assume a large Amazon MWAA environment with a minimum of 1 worker and a maximum of 10, where each large Amazon MWAA worker can support up to 20 tasks. Let’s say, each day at 8:00 AM, DAGs start up that use 190 concurrent tasks. Amazon MWAA will automatically scale to 10 workers, because the required workers = 190 requested tasks (some running, some queued) / 20 (tasks per worker) = 9.5 workers, rounded up to 10. At 10:00 AM, half of the tasks complete, leaving 85 running. Amazon MWAA will then downscale to 6 workers (95 tasks/20 tasks per worker = 5.25 workers, rounded up to 6). Any workers that are still running tasks remain protected during downscaling until they’re complete, and no tasks will be interrupted. As the queued and running tasks decrease, Amazon MWAA will remove workers without affecting running tasks, down to the minimum specified worker count.

Web server auto scaling in Amazon MWAA allows you to automatically scale the number of web servers based on CPU utilization and active connection count. Amazon MWAA makes sure your Airflow environment can seamlessly accommodate increased demand, whether from REST API requests, AWS CLI usage, or more concurrent Airflow UI users. You can specify the maximum and minimum web server count while configuring your Amazon MWAA environment.

Logging and metrics

In this section, we discuss the steps to select and set the appropriate log configurations and CloudWatch metrics.

Choose the right log levels

If enabled, Amazon MWAA will send Airflow logs to CloudWatch. You can view the logs to determine Airflow task delays or workflow errors without the need for additional third-party tools. You need to enable logging to view Airflow DAG processing, tasks, scheduler, web server, and worker logs. You can enable Airflow logs at the INFO, WARNING, ERROR, or CRITICAL level. When you choose a log level, Amazon MWAA sends logs for that level and higher levels of severity. Standard CloudWatch logs charges apply, so reducing log levels where possible can reduce overall costs. Use the most appropriate log level based on environment, such as INFO for dev and UAT, and ERROR for production.

Set appropriate log retention policy

By default, logs are kept indefinitely and never expire. To reduce CloudWatch cost, you can adjust the retention policy for each log group.
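
One way to apply a retention policy across an environment's log groups, as an added example that is not code from the original post. It assumes the log groups are named airflow-<environment name>-<log type>; the environment names and the FakeLogsClient stand-in are constructed so the example runs offline. Groups that already have an explicit policy are left alone, because those logs may be kept deliberately for audit or incident investigation.

```python
# Log types assumed to make up the suffix of an Amazon MWAA log group name.
MWAA_LOG_TYPES = ("DAGProcessing", "Scheduler", "Task", "WebServer", "Worker")


def apply_retention(logs_client, environment_name, retention_days, dry_run=True):
    """Set retention on this environment's log groups that currently never expire.

    Returns the names of the log groups that were (or, in dry-run mode, would be) changed.
    """
    prefix = f"airflow-{environment_name}-"
    changed = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate(logGroupNamePrefix=prefix):
        for group in page["logGroups"]:
            name = group["logGroupName"]
            # A prefix match alone would also catch another environment whose
            # name merely starts with ours (for example "prod-eu" for "prod").
            if name[len(prefix):] not in MWAA_LOG_TYPES:
                continue
            # A group without retentionInDays never expires; groups that
            # already have an explicit policy are not overwritten.
            if "retentionInDays" in group:
                continue
            if not dry_run:
                logs_client.put_retention_policy(
                    logGroupName=name, retentionInDays=retention_days
                )
            changed.append(name)
    return changed


class FakeLogsClient:
    """Constructed stand-in for boto3.client("logs") serving one page of sample groups."""

    def __init__(self, groups):
        self.groups = groups
        self.put_calls = []

    def get_paginator(self, operation_name):
        assert operation_name == "describe_log_groups"
        return self

    def paginate(self, logGroupNamePrefix):
        matching = [g for g in self.groups
                    if g["logGroupName"].startswith(logGroupNamePrefix)]
        return [{"logGroups": matching}]

    def put_retention_policy(self, logGroupName, retentionInDays):
        self.put_calls.append((logGroupName, retentionInDays))


# Constructed sample data.
SAMPLE_GROUPS = [
    {"logGroupName": "airflow-prod-Task"},                             # never expires
    {"logGroupName": "airflow-prod-Scheduler", "retentionInDays": 30}, # explicit policy
    {"logGroupName": "airflow-prod-Worker"},                           # never expires
    {"logGroupName": "airflow-prod-eu-Task"},                          # other environment
]


def demo():
    client = FakeLogsClient(SAMPLE_GROUPS)
    changed = apply_retention(client, "prod", 90, dry_run=False)
    return changed, client.put_calls


if __name__ == "__main__":
    print(demo())
```

Against a real account, pass boto3.client("logs") as logs_client and review the dry-run output before running with dry_run=False.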

Choose required CloudWatch metrics

You can choose which Airflow metrics are sent to CloudWatch by using the Amazon MWAA configuration option metrics.statsd_allow_list. Refer to the complete list of available metrics. Some metrics such as schedule_delay and duration_success are published per DAG, whereas others such as ti.finish are published per task per DAG.

Therefore, the cumulative number of DAGs and tasks directly influence your CloudWatch metric ingestion costs. To control CloudWatch costs, choose to publish selective metrics. For example, the following will only publish metrics that start with scheduler and executor:

metrics.statsd_allow_list = scheduler,executor

We recommend using metrics.statsd_allow_list with metrics.metrics_use_pattern_match.

An effective practice is to utilize regular expression (regex) pattern matching against the entire metric name instead of only matching the prefix at the beginning of the name.

Monitor CloudWatch dashboards and set up alarms

Create a custom dashboard in CloudWatch and add alarms for a particular metric to monitor the health status of your Amazon MWAA environment. Configuring alarms allows you to proactively monitor the health of the environment.

Optimize AWS Secrets Manager invocations

Airflow has a mechanism to store secrets such as variables and connection information. By default, these secrets are stored in the Airflow meta database. Airflow users can optionally configure a centrally managed location for secrets, such as AWS Secrets Manager. When specified, Airflow will first check this alternate secrets backend when a connection or variable is requested. If the alternate backend contains the needed value, it is returned; if not, Airflow will check the meta database for the value and return that instead. One of the factors affecting the cost to use Secrets Manager is the number of API calls made to it.

On the Amazon MWAA console, you can configure the backend Secrets Manager path for the connections and variables that will be used by Airflow. By default, Airflow searches for all connections and variables in the configured backend. To reduce the number of API calls Amazon MWAA makes to Secrets Manager on your behalf, configure it to use a lookup pattern. By specifying a pattern, you narrow the possible paths that Airflow will look at. This will help in lowering your costs when using Secrets Manager with Amazon MWAA.

To use a secrets cache, enable AIRFLOW_SECRETS_USE_CACHE with a TTL to help reduce the number of Secrets Manager API calls.

For example, if you want to only look up a specific subset of connections, variables, or config in Secrets Manager, set the relevant *_lookup_pattern parameter. This parameter takes a regex string as its value. To look up connections starting with m in Secrets Manager, your configuration file should look like the following code:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {
  "connections_prefix": "airflow/connections",
  "connections_lookup_pattern": "^m",
  "profile_name": "default"
}
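
A simplified model of the lookup order described above, as an added example; it is not Airflow's or Amazon MWAA's code, and the class names, connection IDs and URIs are constructed for illustration. It counts backend calls with and without a lookup pattern and a TTL cache, and shows that a secret whose name falls outside the pattern is never fetched from Secrets Manager.

```python
import re
import time


class CountingSecretsManager:
    """Stand-in for AWS Secrets Manager that counts lookups (each models one API call)."""

    def __init__(self, secrets):
        self.secrets = dict(secrets)
        self.calls = 0

    def get_secret_value(self, secret_id):
        self.calls += 1
        return self.secrets.get(secret_id)  # None models "secret not found"


class ConnectionResolver:
    """Lookup order: local cache, alternate secrets backend, then the metadata database."""

    def __init__(self, backend, metadata_db, prefix="airflow/connections",
                 lookup_pattern=None, cache_ttl=None, clock=time.monotonic):
        self.backend = backend
        self.metadata_db = metadata_db
        self.prefix = prefix
        self.pattern = re.compile(lookup_pattern) if lookup_pattern else None
        self.cache_ttl = cache_ttl
        self.clock = clock
        self._cache = {}  # conn_id -> (expires_at, value)

    def get(self, conn_id):
        now = self.clock()
        cached = self._cache.get(conn_id)
        if cached and cached[0] > now:
            return cached[1]

        value = None
        # Only names that match the lookup pattern are sent to the backend;
        # without a pattern, every request costs one backend call, hit or miss.
        if self.pattern is None or self.pattern.search(conn_id):
            value = self.backend.get_secret_value(f"{self.prefix}/{conn_id}")
        if value is None:
            value = self.metadata_db.get(conn_id)
        if value is not None and self.cache_ttl:
            self._cache[conn_id] = (now + self.cache_ttl, value)
        return value


# Constructed sample data: no real hosts or credentials.
SECRETS_MANAGER = {
    "airflow/connections/mysql_prod": "mysql://db.example.internal:3306/app",
    "airflow/connections/snowflake_prod": "snowflake://acct.example.internal/dw",
}
METADATA_DB = {"postgres_dev": "postgres://dev.example.internal:5432/app"}
REQUESTS = ["mysql_prod", "postgres_dev", "mysql_prod", "postgres_dev", "http_default"]


def count_backend_calls(requests, lookup_pattern=None, cache_ttl=None):
    """Resolve each requested connection; return (backend call count, resolved values)."""
    backend = CountingSecretsManager(SECRETS_MANAGER)
    resolver = ConnectionResolver(backend, METADATA_DB,
                                  lookup_pattern=lookup_pattern, cache_ttl=cache_ttl)
    resolved = [resolver.get(conn_id) for conn_id in requests]
    return backend.calls, resolved


if __name__ == "__main__":
    print("no pattern, no cache:", count_backend_calls(REQUESTS)[0])
    print("pattern ^m          :", count_backend_calls(REQUESTS, "^m")[0])
    print("pattern ^m + cache  :", count_backend_calls(REQUESTS, "^m", 300)[0])
    # A secret outside the pattern is skipped even though the backend holds it.
    print("snowflake_prod with ^m:", count_backend_calls(["snowflake_prod"], "^m"))
```

When narrowing a lookup pattern, check that every secret name stored under the configured prefix still matches it; a name that no longer matches silently resolves from the metadata database or not at all.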

DAG code optimization

Schedulers and workers are two components that are involved in parsing the DAG. After the scheduler parses the DAG and places it in a queue, the worker picks up the DAG from the queue. At that point, all the worker knows is the DAG_id and the Python file, along with some other info. The worker has to parse the Python file in order to run the task.

DAG parsing is run twice, once by the scheduler and then by the worker. Because the workers are also parsing the DAG, the amount of time it takes for the code to parse dictates the number of workers needed, which adds to the cost of running those workers.

For example, for a total of 200 DAGs having 10 tasks each, taking 60 seconds per task plus 20 seconds to parse the DAG, we can calculate the following:

  • Total tasks across all DAGs = 2,000
  • Time per task = 60 seconds + 20 seconds (parse DAG)
  • Total time = 2,000 * 80 = 160,000 seconds
  • Total time per worker = 72,000 seconds
  • Number of workers needed = Total time/Total time per worker = 160,000/72,000 = ~3

Now, let’s increase the time taken to parse the DAGs to 100 seconds:

  • Total tasks across all DAGs = 2,000
  • Time per task = 60 seconds + 100 seconds
  • Total time = 2,000 * 160 = 320,000 seconds
  • Total time per worker = 72,000 seconds
  • Number of workers needed = Total time/Total time per worker = 320,000/72,000 = ~5

As you can see, when the DAG parsing time increased from 20 seconds to 100 seconds, the number of worker nodes needed increased from 3 to 5, thereby adding compute cost.

To reduce the time it takes for parsing the code, follow the best practices in the subsequent sections.

Remove top-level imports

Code imports will run every time the DAG is parsed. If you don’t need the libraries being imported to create the DAG objects, move the import to the task level instead of defining it at the top. After it’s defined in the task, the import will be called only when the task is run.

Avoid multiple calls to databases like the meta database or an external system database. Variables used within the DAG are defined in the meta database or in a backend system like Secrets Manager. Use templating (Jinja) so that the calls to populate the variables are made only at task runtime and not at DAG parsing time.

For example, see the following code:

import pendulum
from airflow import DAG
from airflow.decorators import task
import numpy as np  # <-- DON'T DO THAT!

with DAG(
    dag_id="example_python_operator",
    schedule=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=["example"],
) as dag:

    @task()
    def print_array():
        """Print Numpy array."""
        import numpy as np  # <-- INSTEAD DO THIS!
        a = np.arange(15).reshape(3, 5)
        print(a)
        return a
    print_array()

The following code is another example:

# Bad example
from airflow.decorators import task
from airflow.models import Variable
from airflow.operators.bash import BashOperator

foo_var = Variable.get("foo")  # DON'T DO THAT

bash_use_variable_bad_1 = BashOperator(
    task_id="bash_use_variable_bad_1", bash_command="echo variable foo=${foo_env}", env={"foo_env": foo_var}
)

bash_use_variable_bad_2 = BashOperator(
    task_id="bash_use_variable_bad_2",
    bash_command=f"echo variable foo=${Variable.get('foo')}",  # DON'T DO THAT
)

bash_use_variable_bad_3 = BashOperator(
    task_id="bash_use_variable_bad_3",
    bash_command="echo variable foo=${foo_env}",
    env={"foo_env": Variable.get("foo")},  # DON'T DO THAT
)

# Good example
bash_use_variable_good = BashOperator(
    task_id="bash_use_variable_good",
    bash_command="echo variable foo=${foo_env}",
    env={"foo_env": "{{ var.value.get('foo') }}"},
)

@task
def my_task():
    # This is fine: my_task is called only when the task runs, not when DAGs are parsed.
    var = Variable.get("foo")
    print(var)

Writing DAGs

Complex DAGs with a large number of tasks and dependencies between them can impact performance of scheduling. One way to keep your Airflow instance performant and well utilized is to simplify and optimize your DAGs.

For example, a DAG that has simple linear structure A → B → C will experience less delays in task scheduling than a DAG that has a deeply nested tree structure with an exponentially growing number of dependent tasks.

Dynamic DAGs

In the following example, a DAG is defined with hardcoded table names from a database. A developer has to define N number of DAGs for N number of tables in a database.

# Bad example
from datetime import datetime

dag_params = getData()
no_of_dags = int(dag_params["no_of_dags"]['N'])
# build a dag for each number in no_of_dags
for n in range(no_of_dags):
    dag_id = 'dynperf_t1_{}'.format(str(n))
    default_args = {'owner': 'airflow', 'start_date': datetime(2022, 2, 2, 12, n)}

To reduce verbose and error-prone work, use dynamic DAGs. The following definition of the DAG is created after querying a database catalog, and creates as many DAGs dynamically as there are tables in the database. This achieves the same objective with less code.

import boto3

def getData():
    # Read the DAG creation parameters from a DynamoDB table
    client = boto3.client('dynamodb')
    response = client.get_item(
        TableName="mwaa-dag-creation",
        Key={'key': {'S': 'mwaa'}}
    )
    return response["Item"]

Stagger DAG schedules

Running all DAGs simultaneously or within a short interval in your environment can result in a higher number of worker nodes required to process the tasks, thereby increasing compute costs. For business scenarios where the workload is not time-sensitive, consider spreading the schedule of DAG runs in a way that maximizes the utilization of available worker resources.

DAG folder parsing

Simpler DAGs are usually only in a single Python file; more complex DAGs might be spread across multiple files and have dependencies that should be shipped with them. You can either do this all inside of the DAG_FOLDER, with a standard filesystem layout, or you can package the DAG and all of its Python files up as a single .zip file. Airflow will look into all the directories and files in the DAG_FOLDER. Use the .airflowignore file to specify which directories or files Airflow should intentionally ignore. This will increase the efficiency of finding a DAG within a directory, improving parsing times.

Deferrable operators

You can run deferrable operators on Amazon MWAA. Deferrable operators have the ability to suspend themselves and free up the worker slot. No tasks in the worker means fewer required worker resources, which can lower the worker cost.

For example, let’s assume you’re using a large number of sensors that wait for something to occur and occupy worker node slots. By making the sensors deferrable and using worker auto scaling improvements to aggressively downscale workers, you will immediately see an impact where fewer worker nodes are needed, saving on worker node costs.

Dynamic Task Mapping

Dynamic Task Mapping lets a workflow create a number of tasks at runtime based on current data, rather than the DAG author having to know in advance how many tasks would be needed. This is similar to defining your tasks in a for loop, but instead of having the DAG file fetch the data and do that itself, the scheduler can do this based on the output of a previous task. Right before a mapped task is run, the scheduler will create N copies of the task, one for each input.

Stop and start the environment

You can stop and start your Amazon MWAA environment based on your workload requirements, which will result in cost savings. You can perform the action manually or automate stopping and starting Amazon MWAA environments. Refer to Automating stopping and starting Amazon MWAA environments to reduce cost to learn how to automate the stop and start of your Amazon MWAA environment retaining metadata.

Conclusion

In conclusion, implementing performance optimization best practices for Amazon MWAA can significantly reduce overall costs while maintaining optimal performance and reliability. Key strategies include right-sizing environment classes based on CloudWatch metrics, managing logging and monitoring costs, using lookup patterns with Secrets Manager, optimizing DAG code, and selectively stopping and starting environments based on workload demands. Continuously monitoring and adjusting these settings as workloads evolve can maximize your cost-efficiency.


About the Authors

Sriharsh Adari is a Senior Solutions Architect at AWS, where he helps customers work backward from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core area of expertise includes technology strategy, data analytics, and data science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Retina Satish is a Solutions Architect at AWS, bringing her expertise in data analytics and generative AI. She collaborates with customers to understand business challenges and architect innovative, data-driven solutions using cutting-edge technologies. She is dedicated to delivering secure, scalable, and cost-effective solutions that drive digital transformation.

Jeetendra Vaidya is a Senior Solutions Architect at AWS, bringing his expertise to the realms of AI/ML, serverless, and data analytics domains. He is passionate about assisting customers in architecting secure, scalable, reliable, and cost-effective solutions.

Encryption in transit over external networks: AWS guidance for NYDFS and beyond

Post Syndicated from Aravind Gopaluni original https://aws.amazon.com/blogs/security/encryption-in-transit-over-external-networks-aws-guidance-for-nydfs-and-beyond/

On November 1, 2023, the New York State Department of Financial Services (NYDFS) issued its Second Amendment (the Amendment) to its Cybersecurity Requirements for Financial Services Companies adopted in 2017, published within Section 500 of 23 NYCRR 500 (the Cybersecurity Requirements; the Cybersecurity Requirements as amended by the Amendment, the Amended Cybersecurity Requirements). In the introduction to its Cybersecurity Resource Center, the Department explains that the revisions are aimed at addressing the changes in the increasing sophistication of threat actors, the prevalence of and relative ease in running cyberattacks, and the availability of additional controls to manage cyber risks.

This blog post focuses on the revision to the encryption in transit requirement under section 500.15(a). It outlines the encryption capabilities and secure connectivity options offered by Amazon Web Services (AWS) to help customers demonstrate compliance with this updated requirement. The post also provides best practices guidance, emphasizing the shared responsibility model. This enables organizations to design robust data protection strategies that address not only the updated NYDFS encryption requirements but potentially also other security standards and regulatory requirements.

The target audience for this information includes security leaders, architects, engineers, and security operations team members, as well as risk, compliance, and audit professionals.

Note that the information provided here is for informational purposes only; it is not legal or compliance advice and should not be relied on as legal or compliance advice. Customers are responsible for making their own independent assessments and should obtain appropriate advice from their own legal and compliance advisors regarding compliance with applicable NYDFS regulations.

500.15 Encryption of nonpublic information

The updated requirement in the Amendment states that:

  1. As part of its cybersecurity program, each covered entity shall implement a written policy requiring encryption that meets industry standards, to protect nonpublic information held or transmitted by the covered entity both in transit over external networks and at rest.
  2. To the extent a covered entity determines that encryption of nonpublic information at rest is infeasible, the covered entity may instead secure such nonpublic information using effective alternative compensating controls that have been reviewed and approved by the covered entity’s CISO in writing. The feasibility of encryption and effectiveness of the compensating controls shall be reviewed by the CISO at least annually.

This section of the Amendment removes the covered entity’s chief information security officer’s (CISO) discretion to approve compensating controls when encryption of nonpublic information in transit over external networks is deemed infeasible. The Amendment mandates that, effective November 2024, organizations must encrypt nonpublic information transmitted over external networks without the option of implementing alternative compensating controls. While the use of security best practices such as network segmentation, multi-factor authentication (MFA), and intrusion detection and prevention systems (IDS/IPS) can provide defense in depth, these compensating controls are no longer sufficient to replace encryption in transit over external networks for nonpublic information.

However, the Amendment still allows for the CISO to approve the use of alternative compensating controls where encryption of nonpublic information at rest is deemed infeasible. AWS is committed to providing industry-standard encryption services and capabilities to help protect customer data at rest in the cloud, offering customers the ability to add layers of security to their data at rest, providing scalable and efficient encryption features. This includes the following services:

While the above highlights encryption-at-rest capabilities offered by AWS, the focus of this blog post is to provide guidance and best practice recommendations for encryption in transit.

AWS guidance and best practice recommendations

Cloud network traffic encompasses connections to and from the cloud and traffic between cloud service provider (CSP) services. From an organization’s perspective, CSP networks and data centers are deemed external because they aren’t under the organization’s direct control. The connection between the organization and a CSP, typically established over the internet or dedicated links, is considered an external network. Encrypting data in transit over these external networks is crucial and should be an integral part of an organization’s cybersecurity program.

AWS implements multiple mechanisms to help ensure the confidentiality and integrity of customer data during transit and at rest across various points within its environment. While AWS employs transparent encryption at various transit points, we strongly recommend incorporating encryption by design into your architecture. AWS provides robust encryption-in-transit capabilities to help you adhere to compliance requirements and mitigate the risks of unauthorized disclosure and modification of nonpublic information in transit over external networks.

Additionally, AWS recommends that financial services institutions adopt a secure by design (SbD) approach to implement architectures that are pre-tested from a security perspective. SbD helps establish control objectives, security baselines, security configurations, and audit capabilities for workloads running on AWS.

Security and Compliance is a shared responsibility between AWS and the customer. Shared responsibility can vary depending on the security configuration options for each service. You should carefully consider the services you choose because your organization’s responsibilities vary depending on the services used, the integration of those services into your IT environment, and applicable laws and regulations. AWS provides resources such as service user guides and AWS Customer Compliance Guides, which map security best practices for individual services to leading compliance frameworks, including NYDFS.

Protecting connections to and from AWS

We understand that customers place a high priority on privacy and data security. That’s why AWS gives you ownership and control over your data through services that allow you to determine where your content will be stored, secure your content in transit and at rest, and manage access to AWS services and resources for your users. When architecting workloads on AWS, classifying data based on its sensitivity, criticality, and compliance requirements is essential. Proper data classification allows you to implement appropriate security controls and data protection mechanisms, such as Transport Layer Security (TLS) at the application layer, access control measures, and secure network connectivity options for nonpublic information over external networks. When it comes to transmitting nonpublic information over external networks, it’s a recommended practice to identify network segments traversed by this data based on your network architecture. While AWS employs transparent encryption at various transit points, it’s advisable to implement encryption solutions at multiple layers of the OSI model to establish defense in depth and enhance end-to-end encryption capabilities. Although requirement 500.15 of the Amendment doesn’t mandate end-to-end encryption, implementing such controls can provide an added layer of security and can help demonstrate that nonpublic information is consistently encrypted during transit.

AWS offers several options to achieve this. While not every option provides end-to-end encryption on its own, using them in combination helps to ensure that nonpublic information doesn’t traverse open, public networks unprotected. These options include:

  • Using AWS Direct Connect with IEEE 802.1AE MAC Security Standard (MACsec) encryption
  • VPN connections
  • Secure API endpoints
  • Client-side encryption of data before sending it to AWS

AWS Direct Connect with MACsec encryption

AWS Direct Connect provides direct connectivity to the AWS network through third-party colocation facilities, using a cross-connect between an AWS owned device and either a customer- or partner-owned device. Direct Connect can reduce network costs, increase bandwidth throughput, and provide a more consistent network experience than internet-based connections. Within Direct Connect connections (a physical construct) there will be one or more virtual interfaces (VIFs). These are logical entities and are reflected as industry-standard 802.1Q VLANs on the customer equipment terminating the Direct Connect connection. Depending on the type of VIF, they will use either public or private IP addressing. There are three different types of VIFs:

  • Public virtual interface – Establish connectivity between AWS public endpoints and your data center, office, or colocation environment.
  • Transit virtual interface – Establish private connectivity between AWS Transit Gateways and your data center, office, or colocation environment. Transit Gateways is an AWS managed high availability and scalability regional network transit hub used to interconnect Amazon Virtual Private Cloud (Amazon VPC) and customer networks.
  • Private virtual interface – Establish private connectivity between Amazon VPC resources and your data center, office, or colocation environment.

By default, a Direct Connect connection isn’t encrypted from your premises to the Direct Connect location because AWS cannot assume your on-premises device supports the MACsec protocol. With MACsec, Direct Connect delivers native, near line-rate, point-to-point encryption, ensuring that data communications between AWS and your corporate network remain protected. MACsec is supported on 10 Gbps and 100 Gbps dedicated Direct Connect connections at selected points of presence. Using Direct Connect with MACsec-enabled connections and combining it with the transparent physical network encryption offered by AWS from the Direct Connect location through the AWS backbone not only benefits you by allowing you to securely exchange data with AWS, but also enables you to use the highest available bandwidth. For additional information on MACsec support and cipher suites, see the MACsec section in the Direct Connect FAQs.

Figure 1 illustrates a sample reference architecture for securing traffic from corporate network to your VPCs over Direct Connect with MACsec and AWS Transit Gateways.

Figure 1: Sample architecture for using Direct Connect with MACsec encryption

In the sample architecture, Layer 2 encryption through MACsec only encrypts the traffic from your on-premises systems to the AWS device in the Direct Connect location. You therefore need to consider additional encryption solutions at Layer 3, 4, or 7 to get closer to end-to-end encryption, up to the device where you’re comfortable having the packets decrypted. The next section reviews an option for network layer encryption using AWS Site-to-Site VPN.

Direct Connect with Site-to-Site VPN

AWS Site-to-Site VPN is a fully managed service that creates a secure connection between your corporate network and your Amazon VPC using IP security (IPsec) tunnels over the internet. Data transferred between your VPC and the remote network routes over an encrypted VPN connection to help maintain the confidentiality and integrity of data in transit. Each VPN connection consists of two tunnels between a virtual private gateway or transit gateway on the AWS side and a customer gateway on the on-premises side. Each tunnel supports a maximum throughput of up to 1.25 Gbps. See Site-to-Site VPN quotas for more information.

You can use Site-to-Site VPN over Direct Connect to achieve secure IPsec connection with the low latency and consistent network experience of Direct Connect when reaching resources in your Amazon VPCs.

Figure 2 illustrates a sample reference architecture for establishing end-to-end IPsec-encrypted connections between your networks and Transit Gateway over a private dedicated connection.

Figure 2: Encrypted connections between the AWS Cloud and a customer’s network using VPN

While Direct Connect with MACsec and Site-to-Site VPN with IPsec can provide encryption at the physical and network layers respectively, they primarily secure the data in transit between your on-premises network and the AWS network boundary. To further enhance the coverage for end-to-end encryption, it is advisable to use TLS encryption. In the next section, let’s review mechanisms for securing API endpoints on AWS using TLS encryption.

Secure API endpoints

APIs act as the front door for applications to access data, business logic, or functionality from other applications and backend services.

AWS enables you to establish secure, encrypted connections to its services using public AWS service API endpoints. Public AWS owned service API endpoints (AWS managed services like Amazon Simple Queue Service (Amazon SQS), AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), others) have certificates that are owned and deployed by AWS. By default, requests to these public endpoints use HTTPS. To align with evolving technology and regulatory standards for TLS, as of February 27, 2024, AWS has updated its TLS policy to require a minimum of TLS 1.2, thereby deprecating support for TLS 1.0 and 1.1 versions on AWS service API endpoints across each of our AWS Regions and Availability Zones.

Additionally, to enhance connection performance, AWS has begun enabling TLS version 1.3 globally for its service API endpoints. If you’re using the AWS SDKs or AWS Command Line Interface (AWS CLI), you will automatically benefit from TLS 1.3 after a service enables it.

While requests to public AWS service API endpoints use HTTPS by default, a few services, such as Amazon S3 and Amazon DynamoDB, allow using either HTTP or HTTPS. If the client or application chooses HTTP, the communication isn’t encrypted. Customers are responsible for enforcing HTTPS connections when using such AWS services. To help ensure secure communication, you can establish an identity perimeter by using the IAM policy condition key aws:SecureTransport in your IAM roles to evaluate the connection and mandate HTTPS usage.
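One way to check that a policy document actually mandates HTTPS through aws:SecureTransport, as an added example that is not AWS code: the function below classifies a policy as enforced, partial or missing. The function names, the sample policies, the bucket name and the account ID are constructed for illustration, and Resource scope is not evaluated.

```python
def _as_list(value):
    """IAM JSON accepts a single value or a list in most positions."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _matches_insecure_transport(stmt):
    """True if the statement applies to requests where aws:SecureTransport is false."""
    for operator, block in stmt.get("Condition", {}).items():
        if operator not in ("Bool", "BoolIfExists"):
            continue
        for key, values in block.items():
            vals = _as_list(values)
            # Condition key names are case-insensitive.
            if key.lower() == "aws:securetransport" and vals and all(
                str(v).lower() == "false" for v in vals
            ):
                return True
    return False


def _applies_to_every_principal(stmt):
    if "NotPrincipal" in stmt:
        return False
    if "Principal" not in stmt:
        return True  # identity-based policy: applies to the role it is attached to
    principal = stmt["Principal"]
    return principal == "*" or (
        isinstance(principal, dict) and "*" in _as_list(principal.get("AWS"))
    )


def _covers_service(stmt, service):
    if "NotAction" in stmt:
        return False
    return any(a in ("*", f"{service}:*") for a in _as_list(stmt.get("Action")))


def tls_enforcement(policy, service="s3"):
    """Classify how a policy handles plain-HTTP requests for one service.

    enforced: a Deny on aws:SecureTransport=false covers every principal and action
    partial:  such a Deny exists but only for some principals or actions
    missing:  no such Deny; an Allow conditioned on HTTPS does not count, because
              another policy can still allow the same action over HTTP
    """
    denies = [
        s for s in _as_list(policy.get("Statement"))
        if s.get("Effect") == "Deny" and _matches_insecure_transport(s)
    ]
    if not denies:
        return "missing"
    if any(_applies_to_every_principal(s) and _covers_service(s, service) for s in denies):
        return "enforced"
    return "partial"


# Constructed sample policies for a placeholder bucket.
BUCKET = ["arn:aws:s3:::example-bucket", "arn:aws:s3:::example-bucket/*"]

ALLOW_ONLY = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:role/example-reader"},
    "Action": "s3:GetObject", "Resource": BUCKET,
    "Condition": {"Bool": {"aws:SecureTransport": "true"}}}]}

DENY_WRITES_ONLY = {"Version": "2012-10-17", "Statement": [{
    "Effect": "Deny", "Principal": "*",
    "Action": ["s3:PutObject"], "Resource": BUCKET,
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}}]}

DENY_ALL_HTTP = {"Version": "2012-10-17", "Statement": {
    "Effect": "Deny", "Principal": "*",
    "Action": "s3:*", "Resource": BUCKET,
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}}}

if __name__ == "__main__":
    for name, doc in [("ALLOW_ONLY", ALLOW_ONLY),
                      ("DENY_WRITES_ONLY", DENY_WRITES_ONLY),
                      ("DENY_ALL_HTTP", DENY_ALL_HTTP)]:
        print(name, "->", tls_enforcement(doc))
```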

As enterprises increasingly adopt cloud computing and microservices architectures, teams frequently build and manage internal applications exposed as private API endpoints. Customers are responsible for managing the certificates on private customer-owned endpoints. AWS helps you deploy private customer-owned identities (that is, TLS certificates) through the use of AWS Certificate Manager (ACM) private certificate authorities (PCA) and the integration with AWS services that offer private customer-owned TLS termination endpoints.

ACM is a fully managed service that lets you provision, manage, and deploy public and private TLS certificates for use with AWS services and internal connected resources. ACM minimizes the time-consuming manual process of purchasing, uploading, and renewing TLS certificates. You can provide certificates for your integrated AWS services either by issuing them directly using ACM or by importing third-party certificates into the ACM management system. ACM offers two options for deploying managed X.509 certificates. You can choose the best one for your needs.

  • AWS Certificate Manager (ACM) – This service is for enterprise customers who need a secure web presence using TLS. ACM certificates are deployed through Elastic Load Balancing (ELB), Amazon CloudFront, Amazon API Gateway, and other integrated AWS services. The most common application of this type is a secure public website with significant traffic requirements. ACM also helps to simplify security management by automating the renewal of expiring certificates.
  • AWS Private Certificate Authority (Private CA) – This service is for enterprise customers building a public key infrastructure (PKI) inside the AWS Cloud and is intended for private use within an organization. With AWS Private CA, you can create your own certificate authority (CA) hierarchy and issue certificates with it for authenticating users, computers, applications, services, servers, and other devices. Certificates issued by a private CA cannot be used on the internet. For more information, see the AWS Private CA User Guide.

You can use a centralized API gateway service, such as Amazon API Gateway, to securely expose customer-owned private API endpoints. API Gateway is a fully managed service that allows developers to create, publish, maintain, monitor, and secure APIs at scale. With API Gateway, you can create RESTful APIs and WebSocket APIs, enabling near real-time, two-way communication applications. API Gateway operations must be encrypted in-transit using TLS, and require the use of HTTPS endpoints. You can use API Gateway to configure custom domains for your APIs using TLS certificates provisioned and managed by ACM. Developers can optionally choose a specific TLS version for their custom domain names. For use cases that require mutual TLS (mTLS) authentication, you can configure certificate-based mTLS authentication on your custom domains.

Pre-encryption of data to be sent to AWS

Depending on the risk profile and sensitivity of the data that’s being transferred to AWS, you might want to choose encrypting data in an application running on your corporate network before sending it to AWS (client-side encryption). AWS offers a variety of SDKs and client-side encryption libraries to help you encrypt and decrypt data in your applications. You can use these libraries with the cryptographic service provider of your choice, including AWS Key Management Service or AWS CloudHSM, but the libraries do not require an AWS service.

  • The AWS Encryption SDK is a client-side encryption library that you can use to encrypt and decrypt data in your application and is available in several programming languages, including a command-line interface. You can use the SDK to encrypt your data before you send it to an AWS service. The SDK offers advanced data protection features, including envelope encryption and additional authenticated data (AAD). It also offers secure, authenticated, symmetric key algorithm suites, such as 256-bit AES-GCM with key derivation and signing.
  • The AWS Database Encryption SDK is a set of software libraries developed in open source that enable you to include client-side encryption in your database design. The SDK provides record-level encryption solutions. You specify which fields are encrypted and which fields are included in the signatures that help ensure the authenticity of your data. Encrypting your sensitive data in transit and at rest helps ensure that your plaintext data isn’t available to a third party, including AWS. The AWS Database Encryption SDK for DynamoDB is designed especially for DynamoDB applications. It encrypts the attribute values in each table item using a unique encryption key. It then signs the item to protect it against unauthorized changes, such as adding or deleting attributes or swapping encrypted values. After you create and configure the required components, the SDK transparently encrypts and signs your table items when you add them to a table. It also verifies and decrypts them when you retrieve them. Searchable encryption in the AWS Database Encryption SDK enables you to search encrypted records without decrypting the entire database. This is accomplished by using beacons, which create a map between the plaintext value written to a field and the encrypted value that is stored in your database. For more information, see the AWS Database Encryption SDK Developer Guide.
  • The Amazon S3 Encryption Client is a client-side encryption library that enables you to encrypt an object locally to help ensure its security before passing it to Amazon S3. It integrates seamlessly with the Amazon S3 APIs to provide a straightforward solution for client-side encryption of data before uploading to Amazon S3. After you instantiate the Amazon S3 Encryption Client, your objects are automatically encrypted and decrypted as part of your Amazon S3 PutObject and GetObject requests. Your objects are encrypted with a unique data key. You can use both the Amazon S3 Encryption Client and server-side encryption to encrypt your data. The Amazon S3 Encryption Client is supported in a variety of programming languages and supports industry-standard algorithms for encrypting objects and data keys. For more information, see the Amazon S3 Encryption Client developer guide.

Encryption in-transit inside AWS

AWS implements responsible and sophisticated technical and physical controls that are designed to help prevent unauthorized access to or disclosure of your content. To protect data in transit, traffic traversing through the AWS network that is outside of AWS physical control is transparently encrypted by AWS at the physical layer. This includes traffic between AWS Regions (except China Regions), traffic between Availability Zones, and between Direct Connect locations and Regions through the AWS backbone network.

Network segmentation

When you create an AWS account, AWS offers a virtual networking option to launch resources in a logically isolated virtual private network (VPN), Amazon Virtual Private Cloud (Amazon VPC). A VPC is limited to a single AWS Region and every VPC has one or more subnets. VPCs can be connected externally using an internet gateway (IGW), VPC peering connection, VPN, Direct Connect, or Transit Gateways. Traffic within your VPC is considered internal because you have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

As a customer, you maintain ownership of your data, you select which AWS services can process, store, and host your data, and you choose the Regions in which your data is stored. AWS doesn’t automatically replicate data across Regions unless you choose to do so. Data transmitted over the AWS global network between Regions and Availability Zones is automatically encrypted at the physical layer before leaving AWS secured facilities. Cross-Region traffic that uses Amazon VPC and Transit Gateway peering is automatically bulk-encrypted when it exits a Region.

Encryption between instances

AWS provides secure and private connectivity between Amazon Elastic Compute Cloud (Amazon EC2) instances of all types. The Nitro System is the underlying foundation for modern Amazon EC2 instances. It’s a combination of purpose-built server designs, data processors, system management components, and specialized firmware that provides the underlying foundation for EC2 instances launched since the beginning of 2018. Instance types that use the offload capabilities of the underlying Nitro System hardware automatically encrypt in-transit traffic between instances. This encryption uses Authenticated Encryption with Associated Data (AEAD) algorithms, with 256-bit encryption and has no impact on network performance. To support this additional in-transit traffic encryption between instances, instances must be of supported instance types, in the same Region, and in the same VPC or peered VPCs. For a list of supported instance types and additional requirements, see Encryption in transit.

Conclusion

The second Amendment to the NYDFS Cybersecurity Regulation underscores the criticality of safeguarding nonpublic information during transmission over external networks. By mandating encryption for data in transit and eliminating the option for compensating controls, the Amendment reinforces the need for robust, industry-standard encryption measures to protect the confidentiality and integrity of sensitive information.

AWS provides a comprehensive suite of encryption services and secure connectivity options that enable you to design and implement robust data protection strategies. The transparent encryption mechanisms that AWS has built into services across its global network infrastructure, secure API endpoints with TLS encryption, and services such as Direct Connect with MACsec encryption and Site-to-Site VPN, can help you establish secure, encrypted pathways for transmitting nonpublic information over external networks.

By embracing the principles outlined in this blog post, financial services organizations can address not only the updated NYDFS encryption requirements for section 500.15(a) but can also potentially demonstrate their commitment to data security across other security standards and regulatory requirements.

For further reading on considerations for AWS customers regarding adherence to the Second Amendment to the NYDFS Cybersecurity Regulation, see the AWS Compliance Guide to NYDFS Cybersecurity Regulation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Financial Services re:Post and AWS Security, Identity, & Compliance re:Post, or contact AWS Support.
 

Aravind Gopaluni

Aravind is a Senior Security Solutions Architect at AWS, helping financial services customers navigate ever-evolving cloud security and compliance needs. With over 20 years of experience, he has honed his expertise in delivering robust solutions to numerous global enterprises. Away from the world of cybersecurity, he cherishes traveling and exploring cuisines with his family.
Stephen Eschbach

Stephen is a Senior Compliance Specialist at AWS, helping financial services customers meet their security and compliance objectives on AWS. With over 18 years of experience in enterprise risk, IT GRC, and IT regulatory compliance, Stephen has worked and consulted for several global financial services companies. Outside of work, Stephen enjoys family time, kids’ sports, fishing, golf, and Texas BBQ.

Reducing long-term logging expenses by 4,800% with Amazon OpenSearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/reducing-long-term-logging-expenses-by-4800-with-amazon-opensearch-service/

When you use Amazon OpenSearch Service for time-bound data like server logs, service logs, application logs, clickstreams, or event streams, storage cost is one of the primary drivers for the overall cost of your solution. Over the last year, OpenSearch Service has released features that have opened up new possibilities for storing your log data in various tiers, enabling you to trade off data latency, durability, and availability. In October 2023, OpenSearch Service announced support for im4gn data nodes, with NVMe SSD storage of up to 30 TB. In November 2023, OpenSearch Service introduced or1, the OpenSearch-optimized instance family, which delivers up to 30% price-performance improvement over existing instances in internal benchmarks and uses Amazon Simple Storage Service (Amazon S3) to provide 11 nines of durability. Finally, in May 2024, OpenSearch Service announced general availability for Amazon OpenSearch Service zero-ETL integration with Amazon S3. These new features join OpenSearch’s existing UltraWarm instances, which provide an up to 90% reduction in storage cost per GB, and UltraWarm’s cold storage option, which lets you detach UltraWarm indexes and durably store rarely accessed data in Amazon S3.

This post works through an example to help you understand the trade-offs available in cost, latency, throughput, data durability and availability, retention, and data access, so that you can choose the right deployment to maximize the value of your data and minimize the cost.

Examine your requirements

When designing your logging solution, you need a clear definition of your requirements as a prerequisite to making smart trade-offs. Carefully examine your requirements for latency, durability, availability, and cost. Additionally, consider which data you choose to send to OpenSearch Service, how long you retain data, and how you plan to access that data.

For the purposes of this discussion, we divide OpenSearch instance storage into two classes: ephemeral backed storage and Amazon S3 backed storage. The ephemeral backed storage class includes OpenSearch nodes that use Nonvolatile Memory Express SSDs (NVMe SSDs) and Amazon Elastic Block Store (Amazon EBS) volumes. The Amazon S3 backed storage class includes UltraWarm nodes, UltraWarm cold storage, or1 instances, and Amazon S3 storage you access with the service’s zero-ETL with Amazon S3. When designing your logging solution, consider the following:

  • Latency – If you need results in milliseconds, then you must use ephemeral backed storage. If seconds or minutes are acceptable, you can lower your cost by using Amazon S3 backed storage.
  • Throughput – As a general rule, ephemeral backed storage instances will provide higher throughput. Instances that have NVMe SSDs, like the im4gn, generally provide the best throughput, with EBS volumes providing good throughput. or1 instances take advantage of Amazon EBS storage for primary shards while using Amazon S3 with segment replication to reduce the compute cost of replication, thereby offering indexing throughput that can match or even exceed NVMe-based instances.
  • Data durability – Data stored in the hot tier (you deploy these as data nodes) has the lowest latency, and also the lowest durability. OpenSearch Service provides automated recovery of data in the hot tier through replicas, which provide durability with added cost. Data that OpenSearch stores in Amazon S3 (UltraWarm, UltraWarm cold storage, zero-ETL with Amazon S3, and or1 instances) gets the benefit of 11 nines of durability from Amazon S3.
  • Data availability – Best practices dictate that you use replicas for data in ephemeral backed storage. When you have at least one replica, you can continue to access all of your data, even during a node failure. However, each replica adds a multiple of cost. If you can tolerate temporary unavailability, you can reduce replicas through or1 instances, with Amazon S3 backed storage.
  • Retention – Data in all storage tiers incurs cost. The longer you retain data for analysis, the more cumulative cost you incur for each GB of that data. Identify the maximum amount of time you must retain data before it loses all value. In some cases, compliance requirements may restrict your retention window.
  • Data access – Amazon S3 backed storage instances generally have a much higher storage to compute ratio, providing cost savings but with insufficient compute for high-volume workloads. If you have high query volume or your queries span a large volume of data, ephemeral backed storage is the right choice. Direct query (Amazon S3 backed storage) is perfect for large volume queries for infrequently queried data.

As you consider your requirements along these dimensions, your answers will guide your choices for implementation. To help you make trade-offs, we work through an extended example in the following sections.

OpenSearch Service cost model

To understand how to cost an OpenSearch Service deployment, you need to understand the cost dimensions. OpenSearch Service has two different deployment options: managed clusters and serverless. This post considers managed clusters only, because Amazon OpenSearch Serverless already tiers data and manages storage for you. When you use managed clusters, you configure data nodes, UltraWarm nodes, and cluster manager nodes, selecting Amazon Elastic Compute Cloud (Amazon EC2) instance types for each of these functions. OpenSearch Service deploys and manages these nodes for you, providing OpenSearch and OpenSearch Dashboards through a REST endpoint. You can choose Amazon EBS backed instances or instances with NVMe SSD drives. OpenSearch Service charges an hourly cost for the instances in your managed cluster. If you choose Amazon EBS backed instances, the service will charge you for the storage provisioned, and any provisioned IOPs you configure. If you choose or1 nodes, UltraWarm nodes, or UltraWarm cold storage, OpenSearch Service charges for the Amazon S3 storage consumed. Finally, the service charges for data transferred out.

Example use case

We use an example use case to examine the trade-offs in cost and performance. The cost and sizing of this example are based on best practices, and are directional in nature. Although you can expect to see similar savings, all workloads are unique and your actual costs may vary substantially from what we present in this post.

For our use case, Fizzywig, a fictitious company, is a large soft drink manufacturer. They have many plants for producing their beverages, with copious logging from their manufacturing line. They started out small, with an all-hot deployment and generating 10 GB of logs daily. Today, that has grown to 3 TB of log data daily, and management is mandating a reduction in cost. Fizzywig uses their log data for event debugging and analysis, as well as historical analysis over one year of log data. Let’s compute the cost of storing and using that data in OpenSearch Service.

Ephemeral backed storage deployments

Fizzywig’s current deployment is 189 r6g.12xlarge.search data nodes (no UltraWarm tier), with ephemeral backed storage. When you index data in OpenSearch Service, OpenSearch builds and stores index data structures that are usually about 10% larger than the source data, and you need to leave 25% free storage space for operating overhead. Three TB of daily source data will use 4.125 TB of storage for the first (primary) copy, including overhead. Fizzywig follows best practices, using two replica copies for maximum data durability and availability, with the OpenSearch Service Multi-AZ with Standby option, increasing the storage need to 12.375 TB per day. To store 1 year of data, multiply by 365 days to get 4.5 PB of storage needed.

To provision this much storage, they could also choose im4gn.16xlarge.search instances, or or1.16xlarge.search instances. The following table gives the instance counts for each of these instance types, and with one, two, or three copies of the data.

| Instance type | Max Storage (GB) per Node | Primary (1 Copy) | Primary + Replica (2 Copies) | Primary + 2 Replicas (3 Copies) |
|---|---|---|---|---|
| im4gn.16xlarge.search | 30,000 | 52 | 104 | 156 |
| or1.16xlarge.search | 36,000 | 42 | 84 | 126 |
| r6g.12xlarge.search | 24,000 | 63 | 126 | 189 |

The preceding table and the following discussion are strictly based on storage needs. or1 instances and im4gn instances both provide higher throughput than r6g instances, which will reduce cost further. The amount of compute saved varies between 10–40% depending on the workload and the instance type. These savings do not pass straight through to the bottom line; they require scaling and modification of the index and shard strategy to fully realize them. The preceding table and subsequent calculations take the general assumption that these deployments are over-provisioned on compute, and are storage-bound. You would see more savings for or1 and im4gn, compared with r6g, if you had to scale higher for compute.

The following table represents the total cluster costs for the three different instance types across the three different data storage sizes specified. These are based on on-demand US East (N. Virginia) AWS Region costs and include instance hours, Amazon S3 cost for the or1 instances, and Amazon EBS storage costs for the or1 and r6g instances.

| Instance type | Primary (1 Copy) | Primary + Replica (2 Copies) | Primary + 2 Replicas (3 Copies) |
|---|---|---|---|
| im4gn.16xlarge.search | $3,977,145 | $7,954,290 | $11,931,435 |
| or1.16xlarge.search | $4,691,952 | $9,354,996 | $14,018,041 |
| r6g.12xlarge.search | $4,420,585 | $8,841,170 | $13,261,755 |

This table gives you the one-copy, two-copy, and three-copy costs (including Amazon S3 and Amazon EBS costs, where applicable) for this 4.5 PB workload. For this post, “one copy” refers to the first copy of your data, with the replication factor set to zero. “Two copies” includes a replica copy of all of the data, and “three copies” includes a primary and two replicas. As you can see, each replica adds a multiple of cost to the solution. Of course, each replica adds availability and durability to the data. With one copy (primary only), you would lose data in the case of a single node outage (with an exception for or1 instances). With one replica, you might lose some or all data in a two-node outage. With two replicas, you could lose data only in a three-node outage.

The or1 instances are an exception to this rule. or1 instances can support a one-copy deployment. These instances use Amazon S3 as a backing store, writing all index data to Amazon S3, as a means of replication, and for durability. Because all acknowledged writes are persisted in Amazon S3, you can run with a single copy, but with the risk of losing availability of your data in case of a node outage. If a data node becomes unavailable, any impacted indexes will be unavailable (red) during the recovery window (usually 10–20 minutes). Carefully evaluate whether you can tolerate this unavailability with your customers as well as your system (for example, your ingestion pipeline buffer). If so, you can drop your cost from $14 million to $4.7 million based on the one-copy (primary) column illustrated in the preceding table.

Reserved Instances

OpenSearch Service supports Reserved Instances (RIs), with 1-year and 3-year terms, with no up-front cost (NURI), partial up-front cost (PURI), or all up-front cost (AURI). All reserved instance commitments lower cost, with 3-year, all up-front RIs providing the deepest discount. Applying a 3-year AURI discount to Fizzywig’s workload gives the annual costs shown in the following table.

| Instance type | Primary | Primary + Replica | Primary + 2 Replicas |
|---|---|---|---|
| im4gn.16xlarge.search | $1,909,076 | $3,818,152 | $5,727,228 |
| or1.16xlarge.search | $3,413,371 | $6,826,742 | $10,240,113 |
| r6g.12xlarge.search | $3,268,074 | $6,536,148 | $9,804,222 |

RIs provide a straightforward way to save cost, with no code or architecture changes. Adopting RIs for this workload brings the im4gn cost for three copies down to $5.7 million, and the one-copy cost for or1 instances down to $3.2 million.

Amazon S3 backed storage deployments

The preceding deployments are useful as a baseline and for comparison. In actuality, you would choose one of the Amazon S3 backed storage options to keep costs manageable.

OpenSearch Service UltraWarm instances store all data in Amazon S3, using UltraWarm nodes as a hot cache on top of this full dataset. UltraWarm works best for interactive querying of data in small time-bound slices, such as running multiple queries against 1 day of data from 6 months ago. Evaluate your access patterns carefully and consider whether UltraWarm’s cache-like behavior will serve you well. UltraWarm first-query latency scales with the amount of data you need to query.

When designing an OpenSearch Service domain for UltraWarm, you need to decide on your hot retention window and your warm retention window. Most OpenSearch Service customers use a hot retention window that varies between 7–14 days, with warm retention making up the rest of the full retention period. For our Fizzywig scenario, we use 14 days hot retention and 351 days of UltraWarm retention. We also use a two-copy (primary and one replica) deployment in the hot tier.

The 14-day, hot storage need (based on a daily ingestion rate of 4.125 TB) is 115.5 TB. You can deploy six instances of any of the three instance types to support this indexing and storage. UltraWarm stores a single replica in Amazon S3, and doesn’t need additional storage overhead, making your 351-day storage need 1.158 PiB. You can support this with 58 UltraWarm1.large.search instances. The following table gives the total cost for this deployment, with 3-year AURIs for the hot tier. The or1 instances’ Amazon S3 cost is rolled into the S3 column.

| Instance type | Hot | UltraWarm | S3 | Total |
|---|---|---|---|---|
| im4gn.16xlarge.search | $220,278 | $1,361,654 | $333,590 | $1,915,523 |
| or1.16xlarge.search | $337,696 | $1,361,654 | $418,136 | $2,117,487 |
| r6g.12xlarge.search | $270,410 | $1,361,654 | $333,590 | $1,965,655 |

You can further reduce the cost by moving data to UltraWarm cold storage. Cold storage reduces cost by reducing availability of the data—to query the data, you must issue an API call to reattach the target indexes to the UltraWarm tier. A typical pattern for 1 year of data keeps 14 days hot, 76 days in UltraWarm, and 275 days in cold storage. Following this pattern, you use 6 hot nodes and 13 UltraWarm1.large.search nodes. The following table illustrates the cost to run Fizzywig’s 3 TB daily workload. The or1 cost for Amazon S3 usage is rolled into the UltraWarm nodes + S3 column.

| Instance type | Hot | UltraWarm nodes + S3 | Cold | Total |
|---|---|---|---|---|
| im4gn.16xlarge.search | $220,278 | $377,429 | $261,360 | $859,067 |
| or1.16xlarge.search | $337,696 | $461,975 | $261,360 | $1,061,031 |
| r6g.12xlarge.search | $270,410 | $377,429 | $261,360 | $909,199 |

By employing Amazon S3 backed storage options, you’re able to reduce cost even further, with a single-copy or1 deployment at $337,000, and a maximum of $1 million annually with or1 instances.
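The storage arithmetic behind the all-hot, hot/UltraWarm, and hot/UltraWarm/cold plans in this post can be written as a small sizing model. This is an added example, not code from the original post: the names `LogWorkload` and `size_tiers` are illustrative, the 20 TB capacity per UltraWarm1.large.search node is an assumption chosen to be consistent with the post's 58-node and 13-node figures, and sizing cold storage at the same per-day index size as UltraWarm is also an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LogWorkload:
    daily_source_tb: float        # raw log volume per day (the post uses 3 TB)
    index_overhead: float = 0.10  # index structures are about 10% larger than the source
    free_space: float = 0.25      # headroom kept free on hot (ephemeral backed) nodes

    @property
    def daily_index_tb(self) -> float:
        """Indexed size of one day of logs, as stored in the Amazon S3 backed tiers."""
        return self.daily_source_tb * (1 + self.index_overhead)

    @property
    def daily_hot_tb(self) -> float:
        """Hot-tier storage for one copy of one day, including free-space headroom."""
        return self.daily_index_tb * (1 + self.free_space)


def size_tiers(w, hot_days, warm_days=0, cold_days=0,
               hot_copies=2, ultrawarm_node_tb=20.0):
    """Return storage (TB) per tier and the UltraWarm node count for a retention plan.

    hot_copies counts the primary plus replicas in the hot tier. UltraWarm and
    cold storage hold a single copy in Amazon S3 with no free-space overhead.
    ultrawarm_node_tb is an assumed per-node capacity (see the lead-in).
    """
    if min(hot_days, warm_days, cold_days) < 0 or hot_copies < 1:
        raise ValueError("retention days must be >= 0 and hot_copies >= 1")
    hot_tb = w.daily_hot_tb * hot_days * hot_copies
    warm_tb = w.daily_index_tb * warm_days
    cold_tb = w.daily_index_tb * cold_days
    return {
        "retention_days": hot_days + warm_days + cold_days,
        "hot_tb": round(hot_tb, 3),
        "warm_tb": round(warm_tb, 3),
        "cold_tb": round(cold_tb, 3),
        # round() before ceil() so float noise cannot add a spurious node
        "ultrawarm_nodes": math.ceil(round(warm_tb / ultrawarm_node_tb, 9)),
    }


if __name__ == "__main__":
    fizzywig = LogWorkload(daily_source_tb=3)
    plans = {
        "all hot, 3 copies": size_tiers(fizzywig, 365, hot_copies=3),
        "14 hot / 351 UltraWarm": size_tiers(fizzywig, 14, 351),
        "14 hot / 76 UltraWarm / 275 cold": size_tiers(fizzywig, 14, 76, 275),
    }
    for name, plan in plans.items():
        print(f"{name}: {plan}")
```

Changing `daily_source_tb`, the retention split, or `hot_copies` shows how much of the footprint each decision moves out of the replicated hot tier before any pricing is applied.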

OpenSearch Service zero-ETL for Amazon S3

When you use OpenSearch Service zero-ETL for Amazon S3, you keep all your secondary and older data in Amazon S3. Secondary data is the higher-volume data that has lower value for direct inspection, such as VPC Flow Logs and WAF logs. For these deployments, you keep the majority of infrequently queried data in Amazon S3, and only the most recent data in your hot tier. In some cases, you sample your secondary data, keeping a percentage in the hot tier as well. Fizzywig decides that they want to have 7 days of all of their data in the hot tier. They will access the rest with direct query (DQ).

When you use direct query, you can store your data in JSON, Parquet, and CSV formats. Parquet format is optimal for direct query and provides about 75% compression on the data. Fizzywig is using Amazon OpenSearch Ingestion, which can write Parquet format data directly to Amazon S3. Their 3 TB of daily source data compresses to 750 GB of daily Parquet data. OpenSearch Service maintains a pool of compute units for direct query. You are billed hourly for these OpenSearch Compute Units (OCUs), scaling based on the amount of data you access. For this conversation, we assume that Fizzywig will have some debugging sessions and run 50 queries daily over one day worth of data (750 GB). The following table summarizes the annual cost to run Fizzywig’s 3 TB daily workload, 7 days hot, 358 days in Amazon S3.

| Instance type | Hot | DQ Cost | OR1 S3 | Raw Data S3 | Total |
|---|---|---|---|---|---|
| im4gn.16xlarge.search | $220,278 | $2,195 | $0 | $65,772 | $288,245 |
| or1.16xlarge.search | $337,696 | $2,195 | $84,546 | $65,772 | $490,209 |
| r6g.12xlarge.search | $270,410 | $2,195 | $0 | $65,772 | $338,377 |

That’s quite a journey! Fizzywig’s cost for logging has come down from as high as $14 million annually to as low as $288,000 annually using direct query with zero-ETL from Amazon S3. That’s a savings of 4,800%!

Sampling and compression

In this post, we have looked at one data footprint to let you focus on data size, and the trade-offs you can make depending on how you want to access that data. OpenSearch has additional features that can further change the economics by reducing the amount of data you store.

For logs workloads, you can employ OpenSearch Ingestion sampling to reduce the size of data you send to OpenSearch Service. Sampling is appropriate when your data as a whole has statistical characteristics where a part can be representative of the whole. For example, if you’re running an observability workload, you can often send as little as 10% of your data to get a representative sampling of the traces of request handling in your system.

You can further employ a compression algorithm for your workloads. OpenSearch Service recently released support for Zstandard (zstd) compression that can bring higher compression rates and lower decompression latencies as compared to the default, best compression.

Conclusion

With OpenSearch Service, Fizzywig was able to balance cost, latency, throughput, durability and availability, data retention, and preferred access patterns. They were able to save 4,800% for their logging solution, and management was thrilled.

Across the board, im4gn comes out with the lowest absolute dollar amounts. However, there are a couple of caveats. First, or1 instances can provide higher throughput, especially for write-intensive workloads. This may mean additional savings through reduced need for compute. Additionally, with or1’s added durability, you can maintain availability and durability with lower replication, and therefore lower cost. Another factor to consider is RAM; the r6g instances provide additional RAM, which speeds up queries for lower latency. When coupled with UltraWarm, and with different hot/warm/cold ratios, r6g instances can also be an excellent choice.

Do you have a high-volume, logging workload? Have you benefitted from some or all of these methods? Let us know!


About the Author

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have vector, search, and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor’s of the Arts from the University of Pennsylvania, and a Master’s of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.