Tag Archives: Advanced (300)

How to set up Amazon Cognito for federated authentication using Azure AD

Post Syndicated from Ratan Kumar original https://aws.amazon.com/blogs/security/how-to-set-up-amazon-cognito-for-federated-authentication-using-azure-ad/

In this blog post, I’ll walk you through the steps to integrate Azure AD as a federated identity provider in Amazon Cognito user pool. A user pool is a user directory in Amazon Cognito that provides sign-up and sign-in options for your app users.

Identity management and authentication flow can be challenging when you need to support requirements such as OAuth, social authentication, and login using a Security Assertion Markup Language (SAML) 2.0 based identity provider (IdP) to meet your enterprise identity management requirements. Amazon Cognito provides you a managed, scalable user directory, user sign-up and sign-in, and federation through third-party identity providers. An added benefit for developers is that it provides you a standardized set of tokens (Identity, Access and Refresh Token). So, in situations when you have to support authentication with multiple identity providers (e.g. Social authentication, SAML IdP, etc.), you don’t have to write code for handling different tokens issued by different identity providers. Instead, you can just work with a consistent set of tokens issued by Amazon Cognito user pool.

Figure 1: High-level architecture for federated authentication in a web or mobile app

Figure 1: High-level architecture for federated authentication in a web or mobile app

As shown in Figure 1, the high-level application architecture of a serverless app with federated authentication typically involves following steps:

  1. User selects their preferred IdP to authenticate.
  2. User gets re-directed to the federated IdP for login. On successful authentication, the IdP posts back a SAML assertion or token containing user’s identity details to an Amazon Cognito user pool.
  3. Amazon Cognito user pool issues a set of tokens to the application
  4. Application can use the token issued by the Amazon Cognito user pool for authorized access to APIs protected by Amazon API Gateway.

To learn more about the authentication flow with SAML federation, see the blog post Building ADFS Federation for your Web App using Amazon Cognito User Pools.

Step-by-step instructions for enabling Azure AD as federated identity provider in an Amazon Cognito user pool

This post will walk you through the following steps:

  1. Create an Amazon Cognito user pool
  2. Add Amazon Cognito as an enterprise application in Azure AD
  3. Add Azure AD as SAML identity provider (IDP) in Amazon Cognito
  4. Create an app client and use the newly created SAML IDP for Azure AD


You’ll need to have administrative access to Azure AD, an AWS account and the AWS Command Line Interface (AWS CLI) installed on your machine. Follow the instructions for installing, updating, and uninstalling the AWS CLI version 2; and then to configure your installation, follow the instructions for configuring the AWS CLI. If you don’t want to install AWS CLI, you can also run these commands from AWS CloudShell which provides a browser-based shell to securely manage, explore, and interact with your AWS resources.

Step 1: Create an Amazon Cognito user pool

The procedures in this post use the AWS CLI, but you can also follow the instructions to use the AWS Management Console to create a new user pool.

To create a user pool in the AWS CLI

  1. Use the following command to create a user pool with default settings. Be sure to replace <yourUserPoolName> with the name you want to use for your user pool.
    aws cognito-idp create-user-pool \
    --pool-name <yourUserPoolName>

    You should see an output containing number of details about the newly created user pool.

  2. Copy the value of user pool ID, in this example, ap-southeast-2_xx0xXxXXX. You will need this value for the next steps.
    "UserPool": {
            "Id": "ap-southeast-2_xx0xXxXXX",
            "Name": "example-corp-prd-userpool"
           "Policies": { …

Add a domain name to user pool

One of the many useful features of Amazon Cognito is hosted UI which provides a configurable web interface for user sign in. Hosted UI is accessible from a domain name that needs to be added to the user pool. There are two options for adding a domain name to a user pool. You can either use an Amazon Cognito domain, or a domain name that you own. This solution uses an Amazon Cognito domain, which will look like the following:


To add a domain name to user pool

  1. Use following CLI command to add an Amazon Cognito domain to the user pool. Replace <yourDomainPrefix> with a unique domain name prefix (for example example-corp-prd). Note that you cannot use keywords aws, amazon, or cognito for domain prefix.
    aws cognito-idp create-user-pool-domain \
    --domain <yourDomainPrefix> \
    --user-pool-id <yourUserPoolID>

Prepare information for Azure AD setup

Next, you prepare Identifier (Entity ID) and Reply URL, which are required to add Amazon Cognito as an enterprise application in Azure AD (done in Step 2 below). Azure AD expects these values in a very specific format. In a text editor, note down your values for Identifier (Entity ID) and Reply URL according to the following formats:

  • For Identifier (Entity ID) the format is:

    For example:


  • For Reply URL the format is:

    For example:


    Note: The Reply URL is the endpoint where Azure AD will send SAML assertion to Amazon Cognito during the process of user authentication.

Update the placeholders above with your values (without < >), and then note the values of Identifier (Entity ID) and Reply URL in a text editor for future reference.

For more information, see Adding SAML Identity Providers to a User Pool in the Amazon Cognito Developer Guide.

Step 2: Add Amazon Cognito as an enterprise application in Azure AD

In this step, you add an Amazon Cognito user pool as an application in Azure AD, to establish a trust relationship between them.

To add new application in Azure AD

  1. Log in to the Azure Portal.
  2. In the Azure Services section, choose Azure Active Directory.
  3. In the left sidebar, choose Enterprise applications.
  4. Choose New application.
  5. On the Browse Azure AD Gallery page, choose Create your own application.
  6. Under What’s the name of your app?, enter a name for your application and select Integrate any other application you don’t find in the gallery (Non-gallery), as shown in Figure 2. Choose Create.
    Figure 2: Add an enterprise app in Azure AD

    Figure 2: Add an enterprise app in Azure AD

It will take few seconds for the application to be created in Azure AD, then you should be redirected to the Overview page for the newly added application.

Note: Occasionally, this step can result in a Not Found error, even though Azure AD has successfully created a new application. If that happens, in Azure AD navigate back to Enterprise applications and search for your application by name.

To set up Single Sign-on using SAML

  1. On the Getting started page, in the Set up single sign on tile, choose Get started, as shown in Figure 3.
    Figure 3: Application configuration page in Azure AD

    Figure 3: Application configuration page in Azure AD

  2. On the next screen, select SAML.
  3. In the middle pane under Set up Single Sign-On with SAML, in the Basic SAML Configuration section, choose the edit icon ().
  4. In the right pane under Basic SAML Configuration, replace the default Identifier ID (Entity ID) with the Identifier (Entity ID) you copied previously. In the Reply URL (Assertion Consumer Service URL) field, enter the Reply URL you copied previously, as shown in Figure 4. Choose Save.
    Figure 4: Azure AD SAML-based Sign-on setup

    Figure 4: Azure AD SAML-based Sign-on setup

  5. In the middle pane under Set up Single Sign-On with SAML, in the User Attributes & Claims section, choose Edit.
  6. Choose Add a group claim.
  7. On the User Attributes & Claims page, in the right pane under Group Claims, select Groups assigned to the application, leave Source attribute as Group ID, as shown in Figure 5. Choose Save.
    Figure 5: Option to select group claims to release to Amazon Cognito

    Figure 5: Option to select group claims to release to Amazon Cognito

    This adds the group claim so that Amazon Cognito can receive the group membership detail of the authenticated user as part of the SAML assertion.

  8. In a text editor, note down the Claim names under Additional claims, as shown in Figure 5. You’ll need these when creating attribute mapping in Amazon Cognito.
  9. Close the User Attributes & Claims screen by choosing the X in the top right corner. You’ll be redirected to the Set up Single Sign-on with SAML page.
  10. Scroll down to the SAML Signing Certificate section, and copy the App Federation Metadata Url by choosing the copy into clipboard icon (highlighted with red arrow in Figure 6). Keep this URL in a text editor, as you’ll need it in the next step.
    Figure 6: Copy SAML metadata URL from Azure AD

    Figure 6: Copy SAML metadata URL from Azure AD

Step 3: Add Azure AD as SAML IDP in Amazon Cognito

Next, you need an attribute in the Amazon Cognito user pool where group membership details from Azure AD can be received, and add Azure AD as an identity provider.

To add custom attribute to user pool and add Azure AD as an identity provider

  1. Use the following CLI command to add a custom attribute to the user pool. Replace <yourUserPoolID> and <customAttributeName> with your own values.
    aws cognito-idp add-custom-attributes \
    --user-pool-id <yourUserPoolID> \
    --custom-attributes Name=<customAttributeName>,AttributeDataType="String"

    If the command succeeds, you’ll not see any output.

  2. Use the following CLI command to add Azure AD as an identity provider. Be sure to replace the following with your own values:
    • Replace <yourUserPoolID> with Amazon Cognito user pool ID copied previously.
    • Replace <IDProviderName> with a name for your identity provider (for example, Example-Corp-IDP).
    • Replace <MetadataURLCopiedFromAzureAD> with the Metadata URL copied from Azure AD.
    • Replace <customAttributeName> with custom attribute name created previously.
    aws cognito-idp create-identity-provider \
    --user-pool-id <yourUserPoolID> \
    --provider-name=<IDProviderName> \
    --provider-type SAML \
    --provider-details MetadataURL=<MetadataURLCopiedFromAzureAD> \
    --attribute-mapping email=http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress,<customAttributeName>=http://schemas.microsoft.com/ws/2008/06/identity/claims/groups

    Successful running of this command adds Azure AD as a SAML IDP to your Amazon Cognito user pool.

Step 4: Create an app client and use the newly created SAML IDP for Azure AD

Before you can use Amazon Cognito in your web application, you need to register your app with Amazon Cognito as an app client. An app client is an entity within an Amazon Cognito user pool that has permission to call unauthenticated API operations (operations that do not require an authenticated user), for example to register, sign in, and handle forgotten passwords.

To create an app client

  1. Use following command to create an app client. Be sure to replace the following with your own values:
    • Replace <yourUserPoolID> with the Amazon Cognito user pool ID created previously.
    • Replace <yourAppClientName> with a name for your app client.
    • Replace <callbackURL> with the URL of your web application that will receive the authorization code. It must be an HTTPS endpoint, except for in a local development environment where you can use http://localhost:PORT_NUMBER.
    • Use parameter –allowed-o-auth-flows for allowed OAuth flows that you want to enable. In this example, we use code for Authorization code grant.
    • Use parameter –allowed-o-auth-scopes to specify which OAuth scopes (such as phone, email, openid) Amazon Cognito will include in the tokens. In this example, we use openid.
    • Replace <IDProviderName> with the same name you used for ID provider previously.
    aws cognito-idp create-user-pool-client \
    --user-pool-id <yourUserPoolID> \
    --client-name <yourAppClientName> \
    --no-generate-secret \
    --callback-urls <callbackURL> \
    --allowed-o-auth-flows code \
    --allowed-o-auth-scopes openid email\
    --supported-identity-providers <IDProviderName> \

Successful running of this command will provide an output in following format. In a text editor, note down the ClientId for referencing in the web application. In this following example, the ClientId is 7xyxyxyxyxyxyxyxyxyxy.

    "UserPoolClient": {
        "UserPoolId": "ap-southeast-2_xYYYYYYY",
        "ClientName": "my-client-name",
        "ClientId": "7xyxyxyxyxyxyxyxyxyxy",
        "LastModifiedDate": "2021-05-04T17:33:32.936000+12:00",
        "CreationDate": "2021-05-04T17:33:32.936000+12:00",
        "RefreshTokenValidity": 30,
        "SupportedIdentityProviders": [
        "CallbackURLs": [
        "AllowedOAuthFlows": [
        "AllowedOAuthScopes": [
            "openid", "email"
        "AllowedOAuthFlowsUserPoolClient": true

Test the setup

Next, do a quick test to check if everything is configured properly.

  1. Open the Amazon Cognito console.
  2. Choose Manage User Pools, then choose the user pool you created in Step 1: Create an Amazon Cognito user pool.
  3. In the left sidebar, choose App client settings, then look for the app client you created in Step 4: Create an app client and use the newly created SAML IDP for Azure AD. Scroll to the Hosted UI section and choose Launch Hosted UI, as shown in Figure 7.
    Figure 7: App client settings showing link to access Hosted UI

    Figure 7: App client settings showing link to access Hosted UI

  4. On the sign-in page as shown in Figure 8, you should see all the IdPs that you enabled on the app client. Choose the Azure-AD button, which redirects you to the sign-in page hosted on https://login.microsoftonline.com/.
    Figure 8: Amazon Cognito hosted UI

    Figure 8: Amazon Cognito hosted UI

  5. Sign in using your corporate ID. If everything is working properly, you should be redirected back to the callback URL after successful authentication.

(Optional) Add authentication to a single page application

One way to add secure authentication using Amazon Cognito into a single page application (SPA) is to use the Auth.federatedSignIn() method of Auth class from AWS Amplify. AWS Amplify provides SDKs to integrate your web or mobile app with a growing list of AWS services, including integration with Amazon Cognito user pool. The federatedSign() method will render the hosted UI that gives users the option to sign in with the identity providers that you enabled on the app client (in Step 4), as shown in Figure 8. One advantage of hosted UI is that you don’t have to write any code for rendering it. Additionally, it will transparently implement the Authorization code grant with PKCE and securely provide your client-side application with the tokens (ID, Access and Refresh) that are required to access the backend APIs.

For a sample web application and instructions to connect it with Amazon Cognito authentication, see the aws-amplify-oidc-federation GitHub repository.


In this blog post, you learned how to integrate an Amazon Cognito user pool with Azure AD as an external SAML identity provider, to allow your users to use their corporate ID to sign in to web or mobile applications.

For more information about this solution, see our video Integrating Amazon Cognito with Azure Active Directory (from timestamp 25:26) on the official AWS twitch channel. In the video, you’ll find an end-to-end demo of how to integrate Amazon Cognito with Azure AD, and then how to use AWS Amplify SDK to add authentication to a simple React app (using the example of a pet store). The video also includes how you can access group membership details from Azure AD for authorization and fine-grained access control.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Cognito forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Ratan Kumar

Ratan is a solutions architect based out of Auckland, New Zealand. He works with large enterprise customers helping them design and build secure, cost-effective, and reliable internet scale applications using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and twitch sessions.


Vishwanatha Nayak

Vish is a solutions architect at AWS. He engages with customers to create innovative solutions that are secure, reliable, and cost optimised to address business problems and accelerate the adoption of AWS services. He has over 15 years of experience in various software development, consulting, and architecture roles.

Scale Up Language Detection with Amazon Comprehend and S3 Batch Operations

Post Syndicated from Ameer Hakme original https://aws.amazon.com/blogs/architecture/scale-up-language-detection-with-amazon-comprehend-and-s3-batch-operations/

Organizations have been collecting text data for years. Text data can help you intelligently address a range of challenges, from customer experience to analytics. These mixed language, unstructured datasets can contain a wealth of information within business documents, emails, and webpages. If you’re able to process and interpret it, this information can provide insight that can help guide your business decisions.

Amazon Comprehend is a natural language processing (NLP) service that extracts insights from text datasets. Amazon Comprehend asynchronous batch operations provides organizations with the ability to detect dominant languages from text documents stored in Amazon Simple Storage Service (S3) buckets. The asynchronous operations support a maximum document size of 1 MB for language detection. They can process up to one million documents per batch, for a total size of 5 GB.

But what if your organization has millions, or even billions of documents stored in an S3 bucket waiting for language detection processing? What if your language detection process requires customization to let you organize your documents based on language? What if you need to create a search index that can help you quickly audit your text document repositories?

In this blog post, we walk through a solution using Amazon S3 Batch Operations to initiate language detection jobs with AWS Lambda and Amazon Comprehend.

Real world language detection solution architecture

In our example, we have tens of millions of text objects stored in a single S3 bucket. These need to be processed to detect the dominant language. To create a language detection job, we must supply the S3 Batch Operations with a manifest file that lists all text objects. We can use an Amazon S3 Inventory report as an input to the manifest file to create S3 bucket object lists.

One of the supported S3 Batch Operations is invoking an AWS Lambda function. The S3 Batch Operations job uses LambdaInvoke to run a Lambda function on every object listed in a manifest. Lambda jobs are subject to overall Lambda concurrency limits for the account and each Lambda invocation will have a defined runtime. Organizations can request a service quota increase if necessary. Lambda functions in a single AWS account and in one Region share the concurrency limit. You can set reserved capacity for Lambda functions to ensure that they can be invoked even when overall capacity has been exhausted.

The Lambda function can be customized to take further actions based on the output received from Amazon Comprehend. The following diagram shows an architecture for language detection with S3 Batch Operations and Amazon Comprehend.

Figure 1. Language detection with S3 Batch Operations and Amazon Comprehend

Figure 1. Language detection with S3 Batch Operations and Amazon Comprehend

Here is the architecture flow, as shown in Figure 1:

  1. S3 Batch Operations will pull the manifest file from the source S3 bucket.
  2. The S3 Batch Operations job will invoke the language detection Lambda function for each object listed in the manifest file. Lambda function code will perform a preliminary scan to check the file size, file extension, or any other requirements before calling Amazon Comprehend API. The Lambda function will then read the text object from S3 and then call the Amazon Comprehend API to detect the dominant language.
  3. The Language Detection API automatically identifies text written in over 100 languages. The API response contains the dominant language with a confidence score supporting the interpretation. An example API response would be: {‘LanguageCode’: ‘fr’, ‘Score’: 0.9888556003570557}. Once the Lambda function receives the API response, Lambda will return a message back to S3 Batch Operations with a result code.
  4. The Lambda function will then publish a message to an Amazon Simple Notification Service (SNS) topic.
  5. An Amazon Simple Queue Service (SQS) queue subscribed to the SNS topic will receive the message with all required information related to each processed text object.
  6. The SQS queue will invoke a Lambda function to process the message.
  7. The Lambda function will move the targeted S3 object to a destination S3 bucket.
  8. S3 Batch Operations will generate a completion report and will store it in an S3 bucket. The completion report will contain additional information for each task, including the object key name and version, status, error codes, and error descriptions.

Leverage SNS fanout pattern for more complex use cases

This blog post describes the basic building blocks for the solution, but it can be extended for more complex use cases, as illustrated in Figure 2. Using an SNS fanout application integration pattern would enable many SQS queues to subscribe to the same SNS topic. These SQS queues would receive identical notifications for the processed text objects, and you could implement downstream services for additional evaluation. For example, you can store text object metadata in an Amazon DynamoDB table. You can further analyze the number of processed text objects, dominant languages, object size, word count, and more.

Your source S3 bucket may have objects being uploaded in real time in addition to the existing batch processes. In this case, you could process these objects in a new batch job, or process them individually during upload by using S3 event triggers and Lambda.

Figure 2. Extending the solution

Figure 2. Extending the solution


You can implement a language detection job in a number of ways. All the Amazon Comprehend single document and synchronous API batch operations can be used for real-time analysis. Asynchronous batch operations can analyze large documents and large collections of documents. However, by using S3 Batch Operations, you can scale language detection batch operations to billions of text objects stored in S3. This solution has the flexibility to add customized functionality. This may be useful for more complex jobs, or when you want to capture different data points from your S3 objects.

For further reading:

Managing temporary elevated access to your AWS environment

Post Syndicated from James Greenwood original https://aws.amazon.com/blogs/security/managing-temporary-elevated-access-to-your-aws-environment/

In this post you’ll learn about temporary elevated access and how it can mitigate risks relating to human access to your AWS environment. You’ll also be able to download a minimal reference implementation and use it as a starting point to build a temporary elevated access solution tailored for your organization.


While many modern cloud architectures aim to eliminate the need for human access, there often remain at least some cases where it is required. For example, unexpected issues might require human intervention to diagnose or fix, or you might deploy legacy technologies into your AWS environment that someone needs to configure manually.

AWS provides a rich set of tools and capabilities for managing access. Users can authenticate with multi-factor authentication (MFA), federate using an external identity provider, and obtain temporary credentials with limited permissions. AWS Identity and Access Management (IAM) provides fine-grained access control, and AWS Single Sign-On (AWS SSO) makes it easy to manage access across your entire organization using AWS Organizations.

For higher-risk human access scenarios, your organization can supplement your baseline access controls by implementing temporary elevated access.

What is temporary elevated access?

The goal of temporary elevated access is to ensure that each time a user invokes access, there is an appropriate business reason for doing so. For example, an appropriate business reason might be to fix a specific issue or deploy a planned change.

Traditional access control systems require users to be authenticated and authorized before they can access a protected resource. Becoming authorized is typically a one-time event, and a user’s authorization status is reviewed periodically—for example as part of an access recertification process.

With persistent access, also known as standing access, a user who is authenticated and authorized can invoke access at any time just by navigating to a protected resource. The process of invoking access does not consider the reason why they are invoking it on each occurrence. Today, persistent access is the model that AWS Single Sign-On supports, and is the most common model used for IAM users and federated users.

With temporary elevated access, also known as just-in-time access, users must be authenticated and authorized as before—but furthermore, each time a user invokes access an additional process takes place, whose purpose is to identify and record the business reason for invoking access on this specific occasion. The process might involve additional human actors or it might use automation. When the process completes, the user is only granted access if the business reason is appropriate, and the scope and duration of their access is aligned to the business reason.

Why use temporary elevated access?

You can use temporary elevated access to mitigate risks related to human access scenarios that your organization considers high risk. Access generally incurs risk when two elements come together: high levels of privilege, such as ability to change configuration, modify permissions, read data, or update data; and high-value resources, such as production environments, critical services, or sensitive data. You can use these factors to define a risk threshold, above which you enforce temporary elevated access, and below which you continue to allow persistent access.

Your motivation for implementing temporary elevated access might be internal, based on your organization’s risk appetite; or external, such as regulatory requirements applicable to your industry. If your organization has regulatory requirements, you are responsible for interpreting those requirements and determining whether a temporary elevated access solution is required, and how it should operate.

Regardless of the source of requirement, the overall goal is to reduce risk.

Important: While temporary elevated access can reduce risk, the preferred approach is always to automate your way out of needing human access in the first place. Aim to use temporary elevated access only for infrequent activities that cannot yet be automated. From a risk perspective, the best kind of human access is the kind that doesn’t happen at all.

The AWS Well-Architected Framework provides guidance on using automation to reduce the need for human user access:

How can temporary elevated access help reduce risk?

In scenarios that require human intervention, temporary elevated access can help manage the risks involved. It’s important to understand that temporary elevated access does not replace your standard access control and other security processes, such as access governance, strong authentication, session logging and monitoring, and anomaly detection and response. Temporary elevated access supplements the controls you already have in place.

The following are some of the ways that using temporary elevated access can help reduce risk:

1. Ensuring users only invoke elevated access when there is a valid business reason. Users are discouraged from invoking elevated access habitually, and service owners can avoid potentially disruptive operations during critical time periods.

2. Visibility of access to other people. With persistent access, user activity is logged—but no one is routinely informed when a user invokes access, unless their activity causes an incident or security alert. With temporary elevated access, every access invocation is typically visible to at least one other person. This can arise from their participation in approvals, notifications, or change and incident management processes which are multi-party by nature. With greater visibility to more people, inappropriate access by users is more likely to be noticed and acted upon.

3. A reminder to be vigilant. Temporary elevated access provides an overt reminder for users to be vigilant when they invoke high-risk access. This is analogous to the kind security measures you see in a physical security setting. Imagine entering a secure facility. You see barriers, fences, barbed wire, CCTV, lighting, guards, and signs saying “You are entering a restricted area.” Temporary elevated access has a similar effect. It reminds users there is a heightened level of control, their activity is being monitored, and they will be held accountable for any actions they perform.

4. Reporting, analytics, and continuous improvement. A temporary elevated access process records the reasons why users invoke access. This provides a rich source of data to analyze and derive insights. Management can see why users are invoking access, which systems need the most human access, and what kind of tasks they are performing. Your organization can use this data to decide where to invest in automation. You can measure the amount of human access and set targets to reduce it. The presence of temporary elevated access might also incentivize users to automate common tasks, or ask their engineering teams to do so.

Implementing temporary elevated access

Before you examine the reference implementation, first take a look at a logical architecture for temporary elevated access, so you can understand the process flow at a high level.

A typical temporary elevated access solution involves placing an additional component between your identity provider and the AWS environment that your users need to access. This is referred to as a temporary elevated access broker, shown in Figure 1.

Figure 1: A logical architecture for temporary elevated access

Figure 1: A logical architecture for temporary elevated access

When a user needs to perform a task requiring temporary elevated access to your AWS environment, they will use the broker to invoke access. The broker performs the following steps:

1. Authenticate the user and determine eligibility. The broker integrates with your organization’s existing identity provider to authenticate the user with multi-factor authentication (MFA), and determine whether they are eligible for temporary elevated access.

Note: Eligibility is a key concept in temporary elevated access. You can think of it as pre-authorization to invoke access that is contingent upon additional conditions being met, described in step 3. A user typically becomes eligible by becoming a trusted member of a team of admins or operators, and the scope of their eligibility is based on the tasks they’re expected to perform as part of their job function. Granting and revoking eligibility is generally based on your organization’s standard access governance processes. Eligibility can be expressed as group memberships (if using role-based access control, or RBAC) or user attributes (if using attribute-based access control, or ABAC). Unlike regular authorization, eligibility is not sufficient to grant access on its own.

2. Initiate the process for temporary elevated access. The broker provides a way to start the process for gaining temporary elevated access. In most cases a user will submit a request on their own behalf—but some broker designs allow access to be initiated in other ways, such as an operations user inviting an engineer to assist them. The scope of a user’s requested access must be a subset of their eligibility. The broker might capture additional information about the context of the request in order to perform the next step.

3. Establish a business reason for invoking access. The broker tries to establish whether there is a valid business reason for invoking access with a given scope on this specific occasion. Why does this user need this access right now? The process of establishing a valid business reason varies widely between organizations. It might be a simple approval workflow, a quorum-based authorization, or a fully automated process. It might integrate with existing change and incident management systems to infer the business reason for access. A broker will often provide a way to expedite access in a time-critical emergency, which is a form of break-glass access. A typical broker implementation allows you to customize this step.

4. Grant time-bound access. If the business reason is valid, the broker grants time-bound access to the AWS target environment. The scope of access that is granted to the user must be a subset of their eligibility. Further, the scope and duration of access granted should be necessary and sufficient to fulfill the business reason identified in the previous step, based on the principle of least privilege.

A minimal reference implementation for temporary elevated access

To get started with temporary elevated access, you can deploy a minimal reference implementation accompanying this blog post. Information about deploying, running and extending the reference implementation is available in the Git repo README page.

Note: You can use this reference implementation to complement the persistent access that you manage for IAM users, federated users, or manage through AWS Single Sign-On. For example, you can use the multi-account access model of AWS SSO for persistent access management, and create separate roles for temporary elevated access using this reference implementation.

To establish a valid business reason for invoking access, the reference implementation uses a single-step approval workflow. You can adapt the reference implementation and replace this with a workflow or business logic of your choice.

To grant time-bound access, the reference implementation uses the identity broker pattern. In this pattern, the broker itself acts as an intermediate identity provider which conditionally federates the user into the AWS target environment granting a time-bound session with limited scope.

Figure 2 shows the architecture of the reference implementation.

Figure 2: Architecture of the reference implementation

Figure 2: Architecture of the reference implementation

To illustrate how the reference implementation works, the following steps walk you through a user’s experience end-to-end, using the numbers highlighted in the architecture diagram.

Starting the process

Consider a scenario where a user needs to perform a task that requires privileged access to a critical service running in your AWS environment, for which your security team has configured temporary elevated access.

Loading the application

The user first needs to access the temporary elevated access broker so that they can request the AWS access they need to perform their task.

  1. The user navigates to the temporary elevated access broker in their browser.
  2. The user’s browser loads a web application using web static content from an Amazon CloudFront distribution whose target is an Amazon S3 bucket.

The broker uses a web application that runs in the browser, known as a Single Page Application (SPA).

Note: CloudFront and S3 are only used for serving web static content. If you prefer, you can modify the solution to serve static content from a web server in your private network.

Authenticating users

  1. The user is redirected to your organization’s identity provider to authenticate. The reference implementation uses the OpenID Connect Authorization Code flow with Proof Key for Code Exchange (PKCE).
  2. The user returns to the application as an authenticated user with an access token and ID token signed by the identity provider.

The access token grants delegated authority to the browser-based application to call server-side APIs on the user’s behalf. The ID token contains the user’s attributes and group memberships, and is used for authorization.

Calling protected APIs

  1. The application calls APIs hosted by Amazon API Gateway and passes the access token and ID token with each request.
  2. For each incoming request, API Gateway invokes a Lambda authorizer using AWS Lambda.

The Lambda authorizer checks whether the user’s access token and ID token are valid. It then uses the ID token to determine the user’s identity and their authorization based on their group memberships.

Displaying information

  1. The application calls one of the /get… API endpoints to fetch data about previous temporary elevated access requests.
  2. The /get… API endpoints invoke Lambda functions which fetch data from a table in Amazon DynamoDB.

The application displays information about previously-submitted temporary elevated access requests in a request dashboard, as shown in Figure 3.

Figure 3: The request dashboard

Figure 3: The request dashboard

Submitting requests

A user who is eligible for temporary elevated access can submit a new request in the request dashboard by choosing Create request. As shown in Figure 4, the application then displays a form with input fields for the IAM role name and AWS account ID the user wants to access, a justification for invoking access, and the duration of access required.

Figure 4: Submitting requests

Figure 4: Submitting requests

The user can only request an IAM role and AWS account combination for which they are eligible, based on their group memberships.

Note: The duration specified here determines a time window during which the user can invoke sessions to access the AWS target environment if their request is approved. It does not affect the duration of each session. Session duration can be configured independently.

  1. When a user submits a new request for temporary elevated access, the application calls the /create… API endpoint, which writes information about the new request to the DynamoDB table.

The user can submit multiple concurrent requests for different role and account combinations, as long as they are eligible.

Generating notifications

The broker generates notifications when temporary elevated access requests are created, approved, or rejected.

  1. When a request is created, approved, or rejected, a DynamoDB stream record is created for notifications.
  2. The stream record then invokes a Lambda function to handle notifications.
  3. The Lambda function reads data from the stream record, and generates a notification using Amazon Simple Notification Service (Amazon SNS).

By default, when a user submits a new request for temporary elevated access, an email notification is sent to all authorized reviewers. When a reviewer approves or rejects a request, an email notification is sent to the original requester.

Reviewing requests

A user who is authorized to review requests can approve or reject requests submitted by other users in a review dashboard, as shown in Figure 5. For each request awaiting their review, the application displays information about the request, including the business justification provided by the requester.

Figure 5: The review dashboard

Figure 5: The review dashboard

The reviewer can select a request, determine whether the request is appropriate, and choose either Approve or Reject.

  1. When a reviewer approves or rejects a request, the application calls the /approve… or /reject… API endpoint, which updates the status of the request in the DynamoDB table and initiates a notification.

Invoking sessions

After a requester is notified that their request has been approved, they can log back into the application and see their approved requests, as shown in Figure 6. For each approved request, they can invoke sessions. There are two ways they can invoke a session, by choosing either Access console or CLI.

Figure 6: Invoking sessions

Figure 6: Invoking sessions

Both options grant the user a session in which they assume the IAM role in the AWS account specified in their request.

When a user invokes a session, the broker performs the following steps.

  1. When the user chooses Access console or CLI, the application calls one of the /federate… API endpoints.
  2. The /federate… API endpoint invokes a Lambda function, which performs the following three checks before proceeding:
    1. Is the user authenticated? The Lambda function checks that the access and ID tokens are valid and uses the ID token to determine their identity.
    2. Is the user eligible? The Lambda function inspects the user’s group memberships in their ID token to confirm they are eligible for the AWS role and account combination they are seeking to invoke.
    3. Is the user elevated? The Lambda function confirms the user is in an elevated state by querying the DynamoDB table, and verifying whether there is an approved request for this user whose duration has not yet ended for the role and account combination they are seeking to invoke.
  3. If all three checks succeed, the Lambda function calls sts:AssumeRole to fetch temporary credentials on behalf of the user for the IAM role and AWS account specified in the request.
  4. The application returns the temporary credentials to the user.
  5. The user obtains a session with temporary credentials for the IAM role in the AWS account specified in their request, either in the AWS Management Console or AWS CLI.

Once the user obtains a session, they can complete the task they need to perform in the AWS target environment using either the AWS Management Console or AWS CLI.

The IAM roles that users assume when they invoke temporary elevated access should be dedicated for this purpose. They must have a trust policy that allows the broker to assume them. The trusted principal is the Lambda execution role used by the broker’s /federate… API endpoints. This ensures that the only way to assume those roles is through the broker.

In this way, when the necessary conditions are met, the broker assumes the requested role in your AWS target environment on behalf of the user, and passes the resulting temporary credentials back to them. By default, the temporary credentials last for one hour. For the duration of a user’s elevated access they can invoke multiple sessions through the broker, if required.

Session expiry

When a user’s session expires in the AWS Management Console or AWS CLI, they can return to the broker and invoke new sessions, as long as their elevated status is still active.

Ending elevated access

A user’s elevated access ends when the requested duration elapses following the time when the request was approved.

Figure 7: Ending elevated access

Figure 7: Ending elevated access

Once elevated access has ended for a particular request, the user can no longer invoke sessions for that request, as shown in Figure 7. If they need further access, they need to submit a new request.

Viewing historical activity

An audit dashboard, as shown in Figure 8, provides a read-only view of historical activity to authorized users.

Figure 8: The audit dashboard

Figure 8: The audit dashboard

Logging session activity

When a user invokes temporary elevated access, their session activity in the AWS control plane is logged to AWS CloudTrail. Each time they perform actions in the AWS control plane, the corresponding CloudTrail events contain the unique identifier of the user, which provides traceability back to the identity of the human user who performed the actions.

The following example shows the userIdentity element of a CloudTrail event for an action performed by user [email protected] using temporary elevated access.

"userIdentity": {
    "type": "AssumedRole",
    "principalId": "AROACKCEVSQ6C2EXAMPLE:[email protected]-TempAccessRoleS3Admin",
    "arn": "arn:aws:sts::111122223333:assumed-role/TempAccessRoleS3Admin/[email protected]-TempAccessRoleS3Admin",
    "accountId": "111122223333",
    "sessionContext": {
        "sessionIssuer": {
            "type": "Role",
            "principalId": "AROACKCEVSQ6C2EXAMPLE",
            "arn": "arn:aws:iam::111122223333:role/TempAccessRoleS3Admin",
            "accountId": "111122223333",
            "userName": "TempAccessRoleS3Admin"
        "webIdFederationData": {},
        "attributes": {
            "mfaAuthenticated": "true",
            "creationDate": "2021-07-02T13:24:06Z"

Security considerations

The temporary elevated access broker controls access to your AWS environment, and must be treated with extreme care in order to prevent unauthorized access. It is also an inline dependency for accessing your AWS environment and must operate with sufficient resiliency.

The broker should be deployed in a dedicated AWS account with a minimum of dependencies on the AWS target environment for which you’ll manage access. It should use its own access control configuration following the principle of least privilege. Ideally the broker should be managed by a specialized team and use its own deployment pipeline, with a two-person rule for making changes—for example by requiring different users to check in code and approve deployments. Special care should be taken to protect the integrity of the broker’s code and configuration and the confidentiality of the temporary credentials it handles.

See the reference implementation README for further security considerations.

Extending the solution

You can extend the reference implementation to fit the requirements of your organization. Here are some ways you can extend the solution:

  • Customize the UI, for example to use your organization’s branding.
  • Keep network traffic within your private network, for example to comply with network security policies.
  • Change the process for initiating and evaluating temporary elevated access, for example to integrate with a change or incident management system.
  • Change the authorization model, for example to use groups with different scope, granularity, or meaning.
  • Use SAML 2.0, for example if your identity provider does not support OpenID Connect.

See the reference implementation README for further details on extending the solution.


In this blog post you learned about temporary elevated access and how it can help reduce risk relating to human user access. You learned that you should aim to eliminate the need to use high-risk human access through the use of automation, and only use temporary elevated access for infrequent activities that cannot yet be automated. Finally, you studied a minimal reference implementation for temporary elevated access which you can download and customize to fit your organization’s needs.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS IAM forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


James Greenwood

James is a principal security solutions architect who helps helps AWS Financial Services customers meet their security and compliance objectives in the AWS cloud. James has a background in identity and access management, authentication, credential management, and data protection with more than 20 years experience in the financial services industry.


Bikash Behera

Bikash is a principal solutions architect who provides transformation guidance to AWS Financial Services customers and develops solutions for high priority customer objectives. Bikash has been delivering transformation guidance and technology solutions to the financial services industry for the last 25 years.


Kevin Higgins

Kevin is a principal cloud architect with AWS Professional Services. He helps customers with the architecture, design, and development of cloud-optimized infrastructure solutions. As a member of the Microsoft Global Specialty Practice, he collaborates with AWS field sales, training, support, and consultants to help drive AWS product feature roadmap and go-to-market strategies.

Running a Cost-effective NLP Pipeline on Serverless Infrastructure at Scale

Post Syndicated from Eitan Sela original https://aws.amazon.com/blogs/architecture/running-a-cost-effective-nlp-pipeline-on-serverless-infrastructure-at-scale/

Amenity Analytics develops enterprise natural language processing (NLP) platforms for the finance, insurance, and media industries that extract critical insights from mountains of documents. We provide a scalable way for businesses to get a human-level understanding of information from text.

In this blog post, we will show how Amenity Analytics improved the continuous integration (CI) pipeline speed by 15x. We hope that this example can help other customers achieve high scalability using AWS Step Functions Express.

Amenity Analytics’ models are developed using both a test-driven development (TDD) and a behavior-driven development (BDD) approach. We verify the model accuracy throughout the model lifecycle—from creation to production, and on to maintenance.

One of the actions in the Amenity Analytics model development cycle is backtesting. It is an important part of our CI process. This process consists of two steps running in parallel:

  • Unit tests (TDD): checks that the code performs as expected
  • Backtesting tests (BDD): validates that the precision and recall of our models is similar or better than previous

The backtesting process utilizes hundreds of thousands of annotated examples in each “code build.” To accomplish this, we initially used the AWS Step Functions default workflow. AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability.

Challenge with the existing Step Functions solution

We found that Step Functions standard workflow has a bucket of 5,000 state transitions with a refill rate of 1,500. Each annotated example has ~10 state transitions. This creates millions of state transitions per code build. Since state transitions are limited and couldn’t be increased to our desired amount, we often faced delays and timeouts. Developers had to coordinate their work with each other, which inevitably slowed down the entire development cycle.

To resolve these challenges, we migrated from Step Functions standard workflows to Step Functions Express workflows, which have no limits on state transitions. In addition, we changed the way each step in the pipeline is initiated, from an async call to a sync API call.

Step Functions Express workflow solution

When a model developer merges their new changes, the CI process starts the backtesting for all existing models.

  • Each model is checked to see if the annotated examples were already uploaded and saved in the Amazon Simple Storage Service (S3) cache. The check is made by a unique key representing the list of items. Once a model is reviewed, the review items will rarely be changed.
  • If the review items haven’t been uploaded yet, it uploads them and initiates an unarchive process. This way the review items can be used in the next phase.
  • When the items are uploaded, an API call is invoked using Amazon API Gateway with the review items keys, see Figure 1.
  • The request is forwarded to an AWS Lambda function. It is responsible for validating the request and sending a job message to an Amazon Simple Queue Service (SQS) queue.
  • The SQS messages are consumed by concurrent Lambda functions, which synchronously invoke a Step Function. The number of Lambda functions are limited to ensure that they don’t exceed their limit in the production environment.
  • When an item is finished in the Step Function, it creates an SQS notification message. This message is inserted into a queue and consumed as a message batch by a Lambda function. The function then sends an AWS IoT message containing all relevant messages for each individual user.
Figure 1. Step Functions Express workflow solution

Figure 1. Step Functions Express workflow solution

Main Step Function Express workflow pipeline

Step Functions Express supports only sync calls. Therefore, we replaced the previous async Amazon Simple Notification Service (SNS) and Amazon SQS, with sync calls to API Gateway.

Figure 2 shows the workflow for a single document in Step Function Express:

  1. Generate a document ID for the current iteration
  2. Perform base NLP analysis by calling another Step Function Express wrapped by an API Gateway
  3. Reformat the response to be the same as all other “logic” steps results
  4. Verify the result by the “Choice” state – if failed go to end, otherwise, continue
  5. Perform the Amenity core NLP analysis in three model invocations: Group, Patterns, and Business Logic (BL)
  6. For each of the model runtime steps:
    • Check if the result is correct
    • If failed, go to end, otherwise continue
  7. Return a formatted result at the end
Figure 2. Workflow for a single document

Figure 2. Workflow for a single document

Base NLP analysis Step Function Express

For our base NLP analysis, we use Spacy. Figure 3 shows how we used it in Step Functions Express:

  1. Confirm if text exists in cache (this means it has been previously analyzed)
  2. If it exists, return the cached result
  3. If it doesn’t exist, split the text to a manageable size (Spacy has text size limitations)
  4. All the texts parts are analyzed in parallel by Spacy
  5. Merge the results into a single, analyzed document and save it to the cache
  6. If there was an exception during the process, it is handled in the “HandleStepFunctionExceptionState”
  7. Send a reference to the analyzed document if successful
  8. Send an error message if there was an exception
Figure 3. Base NLP analysis for a single document

Figure 3. Base NLP analysis for a single document


Our backtesting migration was deployed on August 10, and unit testing migration on September 14. After the first migration, the CI was limited by the unit tests, which took ~25 minutes. When the second migration was deployed, the process time was reduced to ~6 minutes (P95).

Figure 4. Process time reduced from 50 minutes to 6 minutes

Figure 4. Process time reduced from 50 minutes to 6 minutes


By migrating from standard Step Functions to Step Functions Express, Amenity Analytics increased processing speed 15x. A complete pipeline that used to take ~45 minutes with standard Step Functions, now takes ~3 minutes using Step Functions Express. This migration removed the need for users to coordinate workflow processes to create a build. Unit testing (TDD) went from ~25 mins to ~30 seconds. Backtesting (BDD) went from taking more than 1 hour to ~6 minutes.

Switching to Step Functions Express allows us to focus on delivering business value faster. We will continue to explore how AWS services can help us drive more value to our users.

For further reading:

Managing permissions with grants in AWS Key Management Service

Post Syndicated from Rick Yin original https://aws.amazon.com/blogs/security/managing-permissions-with-grants-in-aws-key-management-service/

AWS Key Management Service (AWS KMS) helps customers to use encryption to secure their data. When creating a new encrypted Amazon Web Services (AWS) resource, such as an Amazon Relational Database Service (Amazon RDS) database or an Amazon Simple Storage Service (Amazon S3) bucket, all you have to do is provide an AWS KMS key ID that you control and the data will be encrypted and the complexity of protecting and making encryption keys highly available is reduced.

If you’re considering delegating encryption to an AWS service to use a key under your control when it encrypts your data in that service, you might wonder how to ensure the AWS service can only use your key when you want it to and not have full access to decrypt any of your resources at any time. The answer is to use scoped-down dynamic permissions in AWS KMS. Specifically, a combination of permissions that you define in the KMS key policy document along with additional permissions that are created dynamically using KMS grants define the conditions under which one or more AWS services can use your KMS keys to encrypt and decrypt your data.

In this blog post, I discuss:

  • An example of how an AWS service uses your KMS key policy and grants to securely manage access to your encryption keys. The example uses Amazon RDS and demonstrates how the block storage volume behind your database instance is encrypted.
  • Best practices for using grants from AWS KMS in your own workloads.
  • Recent performance improvements when using grants in AWS KMS.

Case study: How RDS uses grants from AWS KMS to encrypt your database volume

Many Amazon RDS instance types are hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance where the underlying storage layer is an Amazon Elastic Block Store (Amazon EBS) volume. The blocks of the EBS volume that stores the database content are encrypted under a randomly generated 256-bit symmetric data key that is itself encrypted under a KMS key that you configure RDS to use when you create your database instance. Let’s look at how RDS interacts with EBS, EC2, and AWS KMS to securely create an RDS instance using an KMS key.

When you send a request to RDS to create your database, there are several asynchronous requests being made among the RDS, EC2, EBS, and KMS services to:

  1. Create the underlying storage volume with a unique encryption key.
  2. Create the compute instance in EC2.
  3. Load the database engine into the EC2 instance.
  4. Give the EC2 instance permissions to use the encryption key to read and write data to the database storage volume.

The initial authenticated request that you make to RDS to create a new database is made by an AWS Identity and Access Management (IAM) principal in your account (e.g. a user or role). Once the request is received, a series of things has to happen:

  1. RDS needs to request EBS to create an encrypted volume to store your future data.
  2. EBS needs to request AWS KMS generate a unique 256-bit data key for the volume and encrypt it under the KMS key you told RDS to use.
  3. RDS then needs to request that EC2 launch an instance, attach that encrypted volume, and make the data key available to EC2 for use in reads and writes to the volume.

From your perspective, the IAM principal used to create the database also must have permissions in the KMS key policy for the GenerateDataKeyWithoutPlaintext and Decrypt actions. This enables the unique 256-bit data key to be created and encrypted under the desired KMS key as well as allowing the user or role to have the data key decrypted and provisioned to the Nitro card managing your EC2 instance so that reads/writes can happen from/to the database. Given the asynchronous nature of the process of creating the database vs. launching the database volume in the future, how do the RDS, EBS, and EC2 services all get the necessary least privileged permissions to create and provision the data key for use with your database? The answer starts with your IAM principal having permission for the AWS KMS CreateGrant action in the key policy.

RDS uses the identity from your IAM principal to create a grant in AWS KMS that allows it to create other grants for EC2 and EBS with very limited permissions that are further scoped down compared to the original permissions your IAM principal has on the AWS KMS key. A total of three grants are created:

  • The initial RDS grant.
  • A subsequent EBS grant that allows EBS to call AWS KMS and generate a 256-bit data key that is encrypted under the KMS key you defined when creating your database.
  • The attachment grant, which allows the specific EC2 instance hosting your database volume to decrypt the encrypted data key for and provision it for use during I/O between the instance and the EBS volume.

RDS grant

In this example, let’s say you’ve created an RDS instance with an ID of db-1234 and specified a KMS key for encryption. The following grant is created on the KMS key, allowing RDS to create more grants for EC2 and EBS to use in the asynchronous processes required to launch your database instance. The RDS grant is as follows:

{Grantee Principal: '<Regional RDS Service Account>', Encryption Context: '"aws:rds:db-id": "db-1234"', Operations: ['CreateGrant', 'Decrypt', 'GenerateDataKeyWithoutPlaintext']}

In plain English, this grant gives RDS permissions to use the KMS key for three specific operations (API actions) only when the call specifies the RDS instance ID db-1234 in the Encryption Context parameter. The grant provides access for the the grantee principal, which in this case is the value shown for the <Regional RDS service account>. This grant is created in AWS KMS and associated with your KMS key. Because the EC2 instance hasn’t yet been created and launched, the grantee principal cannot include the EC2 instance ID and must instead be the regional RDS service account.

EBS grant

With the RDS instance and initial AWS KMS grant created, RDS requests EC2 to launch an instance for the RDS database. EC2 creates an instance with a unique ID (e.g. i-1234567890abcdefg) using EC2 permissions you gave to the original IAM principal. In addition to the EC2 instance being created, RDS requests that Amazon EBS create an encrypted volume dedicated to the database. As a part of volume creation, EBS needs permission to call AWS KMS to generate a unique 256-bit data key for the volume and encrypt that data key under the KMS key you defined.

The EC2 instance ID is used as the name of the identity for future calls to AWS KMS, so RDS inserts it as the grantee principal in the EBS grant it creates. The EBS grant is as follows:

{Grantee Principal: '<RDS-Host-Role>:i-1234567890abcdefg', Encryption Context: '"aws:rds:db-id": "db-1234"', Operations: ['CreateGrant', 'Decrypt', 'GenerateDataKeyWithoutPlaintext']}}

You’ll notice that this grant uses the same encryption context as the initial RDS grant. However, now that we have the EC2 instance ID associated with the database ID, the permissions that EBS gets to use your key as the grantee principal can be scoped down to require both values. Once this grant is created, EBS can create the EBS volume (e.g. vol-0987654321gfedcba) and call AWS KMS to generate and encrypt a 256-bit data key that can only be used for that volume. This encrypted data key is stored by EBS in preparation for the volume attachment process.

Attachment grant

The final step in creating the RDS instance is to attach the EBS volume to the EC2 instance hosting your database. EC2 now uses the previously created EBS grant to create the attachment grant with the i-1234567890abcdefg instance identity. This grant allows EC2 to decrypt the encrypted data key, provision it to the Nitro card that manages the instance, and begin encrypting I/O to the EBS volume of the RDS database. The attachment grant in this example will be as follows:

{Grantee Principal: 'EC2 Instance Role:i-1234567890abcdefg', Encryption Context: '"aws:rds:db-id": "db-1234", "aws:ebs:id":"vol-0987654321gfedcba"', Operations: ['Decrypt']}

The attachment grant is the most restrictive of the three grants. It requires the caller to know the IDs of all the AWS entities involved: EC2 instance ID, EBS volume ID, and RDS database ID. This design ensures that your KMS key can only be used for decryption by these AWS services in order to launch the specific RDS database you want.

The encrypted EBS volume is now active and attached to the EC2 instance. Should you terminate the RDS instance, the services retire all the relevant KMS grants so they no longer have any permission to use your KMS key to decrypt the 256-bit data key required to decrypt data in your database. If you need to launch your encrypted database again, a similar set of three grants will be dynamically created with the RDS database, EC2 instance, and EBS volume IDs used to scope down permissions on the AWS KMS key.

The process described in the previous paragraphs is graphically shown in Figure 1:
Figure 1: How Amazon RDS uses Amazon EC2, Amazon EBS, and AWS KMS to create an encrypted RDS instance

Considering all the AWS KMS key permissions that are added and removed as a part of launching a database, you might ask why not just use the key policy document to make these changes? A KMS key allows only one key policy with a maximum document size of 32 KB. Because one key could be used to encrypt any number of AWS resources, trying to dynamically add and remove scoped-down permissions related to each resource to the key policy document creates two risks. First, the maximum allowable size of the key policy document (32KB) might be exceeded. Second, depending on how many resources are being accessed concurrently, you may exceed the request rate quota for the PutKeyPolicy API action in AWS KMS.

In contrast, there can be any number of grants on a given AWS KMS key, each grant specifying a scoped-down permission for the use of a KMS key with any AWS service that integrated with AWS KMS. Grant creation and deletion is also designed for much higher-volume request rates than modifications to the key policy document. Finally, permission to call PutKeyPolicy is a highly privileged permission, as it lets the caller make unrestricted changes to the permissions on the key, including changes to administrative permissions to disable or schedule the key for deletion. Grants on a key can only allow permissions to use the key, not administer the key. Also, grants that allow the creation of other grants by other IAM principals prohibit the escalation of privilege. In the RDS example above, the permissions RDS receives from the IAM principal in your account during the first CreateGrant request cannot be more permissive than what you defined for the IAM principal in the KMS key policy. The permissions RDS gives to EC2 and EBS during the database creation process cannot be more permissive than the original permission RDS has from the initial grant. This design ensures that AWS services cannot escalate their privileges and use your KMS key for purposes different than what you intend.

Best practices for using AWS KMS grants

AWS KMS grants are a powerful tool to dynamically define permissions to use keys. They are automatically created when you use server-side encryption features in various AWS services. You can also use grants to control permission in your own applications that perform client-side encryption. Here are some best practices to consider:

  • Design the permissions to be as scoped down as possible. Use a specific grantee principal, such as an IAM role, and give the principal access only to the AWS KMS API actions that are needed. You can further limit the scope of grants with the Encryption Context parameter by using any element you want to ensure callers are using the AWS KMS key only for the intended purpose. Below is a specific example that grants AWS account 123456789012 permission to call the GenerateDataKey or Decrypt APIs, but only if the supplied encryption context for customerID is 5678.
    {Actions: 'GenerateDataKey, Decrypt', Grantee Principal: '123456789012', Encryption Context: '"customerID": "5678"'}

    This grant could prevent your application from decrypting data belonging to customer “5678” without explicitly passing the expected customerID in the request to AWS KMS. This may be a useful defense-in-depth mechanism to prevent unauthorized access to your customers’ data if your application’s AWS credentials were compromised and used from a different caller who doesn’t know that encryption context is a required parameter for all reads and writes in order to encrypt and decrypt data.

    For more information on how you can use encryption context in AWS KMS permissions, requests, and AWS CloudTrail logs, see How to Protect the Integrity of Your Encrypted Data by Using AWS Key Management Service and EncryptionContext.

  • Remember that grants don’t automatically expire. Your code needs to retire or revoke them once you know the permission is no longer needed on the KMS key. Grants that aren’t retired are leftover permissions that might create a security risk for encrypted resources. See retiring and revoking grants in the AWS KMS developer guide for more detail.
  • Avoid creating duplicate grants. A duplicate grant is a grant that shares the same AWS KMS key ID, API actions, grantee principal, encryption context, and name. If you retire the original grant after use and not the duplicates, then the leftover duplicate grants can lead to unintended access to encrypt or decrypt data.

Recent performance improvements to AWS KMS grants: Removing a resource quota

For customers who use AWS KMS to encrypt resources in AWS services that use grants, there used to be cases where AWS KMS had to enforce a quota on the number of concurrently active resources that could be encrypted under the same KMS key. For example, customers of Amazon RDS, Amazon WorkSpaces, or Amazon EBS would run into this quota at very large scale. This was the Grants for a given principal per key quota and was previously set to 500. You might have seen the error message “Keys only support 500 grants per grantee principal in this region” when trying to create a resource in one of these services.

We recently made a change to AWS KMS to remove this quota entirely and this error message no longer exists. With this quota removed, you can now attach unlimited grants to any KMS key when using any AWS service.


In this blog post, you’ve seen how services such as Amazon RDS use AWS KMS grants to pass scoped-down permissions through the Amazon EC2 and Amazon EBS instances. You also saw some best practices for using AWS KMS grants in your own applications. Finally, you learned about how AWS KMS has improved grants by removing one of the resource quotas.

Below are some additional resources for AWS KMS and grants.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Rick Yin

Rick is a software development engineer on the AWS KMS team. His current focus is helping to scale AWS KMS to meet increasing customer demand by making sure we can serve our requests at ultra-low latency and ultra-high availability. In his free time, Rick enjoys learning about history and trying to stay in shape. He has recently taken up rowing.

Deep learning image vector embeddings at scale using AWS Batch and CDK

Post Syndicated from Filip Saina original https://aws.amazon.com/blogs/devops/deep-learning-image-vector-embeddings-at-scale-using-aws-batch-and-cdk/

Applying various transformations to images at scale is an easily parallelized and scaled task. As a Computer Vision research team at Amazon, we occasionally find that the amount of image data we are dealing with can’t be effectively computed on a single machine, but also isn’t large enough to justify running a large and potentially costly AWS Elastic Map Reduce (EMR) job. This is when we can utilize AWS Batch as our main computing environment, as well as Cloud Development Kit (CDK) to provision the necessary infrastructure in order to solve our task.

In Computer Vision, we often need to represent images in a more concise and uniform way. Working with standard image files would be challenging, as they can vary in resolution or are otherwise too large in terms of dimensionality to be provided directly to our models. For that reason, the common practice for deep learning approaches is to translate high-dimensional information representations, such as images, into vectors that encode most (if not all) information present in them — in other words, to create vector embeddings.

This post will demonstrate how we utilize the AWS Batch platform to solve a common task in many Computer Vision projects — calculating vector embeddings from a set of images so as to allow for scaling.

 Architecture Overview

Diagram explained in post.

Figure 1: High-level architectural diagram explaining the major solution components.

As seen in Figure 1, AWS Batch will pull the docker image containing our code onto provisioned hosts and start the docker containers. Our sample code, referenced in this post, will then read the resources from S3, conduct the vectorization, and write the results as entries in the DynamoDB Table.

In order to run our image vectorization task, we will utilize the following AWS cloud components:

  • Amazon ECR — Elastic Container Registry is a Docker image repository from which our batch instances will pull the job images;
  • S3 — Amazon Simple Storage Service will act as our image source from which our batch jobs will read the image;
  • Amazon DynamoDB — NoSQL database in which we will write the resulting vectors and other metadata;
  • AWS Lambda — Serverless compute environment which will conduct some pre-processing and, ultimately, trigger the batch job execution; and
  • AWS Batch — Scalable computing environment powering our models as embarrassingly parallel tasks running as AWS Batch jobs.

To translate an image to a vector, we can utilize a pre-trained model architecture, such as AlexNet, ResNet, VGG, or more recent ones, like ResNeXt and Vision Transformers. These model architectures are available in most of the popular deep learning frameworks, and they can be further modified and extended depending on our project requirements. For this post, we will utilize a pre-trained ResNet18 model from MxNet. We will output an intermediate layer of the model, which will result in a 512 dimensional representation, or, in other words, a 512 dimensional vector embedding.

Deployment using Cloud Development Kit (CDK)

In recent years, the idea of provisioning cloud infrastructure components using popular programming languages was popularized under the term of infrastructure as code (IaC). Instead of writing a file in the YAML/JSON/XML format, which would define every cloud component we want to provision, we might want to define those components trough a popular programming language.

As part of this post, we will demonstrate how easy it is to provision infrastructure on AWS cloud by using Cloud Development Kit (CDK). The CDK code included in the exercise is written in Python and defines all of the relevant exercise components.

Hands-on exercise

1. Deploying the infrastructure with AWS CDK

For this exercise, we have provided a sample batch job project that is available on Github (link). By using that code, you should have every component required to do this exercise, so make sure that you have the source on your machine. The root of your sample project local copy should contain the following files:

batch_job_cdk - CDK stack code of this batch job project
src_batch_job - source code for performing the image vectorization
src_lambda - source code for the lambda function which will trigger the batch job execution
app.py - entry point for the CDK tool
cdk.json - config file specifying the entry point for CDK
requirements.txt - list of python dependencies for CDK 
  1. Make sure you have installed and correctly configured the AWS CLI and AWS CDK in your environment. Refer to the CDK documentation for more information, as well as the CDK getting started guide.
  2. Set the CDK_DEPLOY_ACCOUNT and CDK_DEPLOY_REGION environmental variables, as described in the project README.md.
  3. Go to the sample project root and install the CDK python dependencies by running pip install -r requirements.txt.
  4. Install and configure Docker in your environment.
  5. If you have multiple AWS CLI profiles, utilize the --profile option to specify which profile to use for deployment. Otherwise, simply run cdk deploy and deploy the infrastructure to your AWS account set in step 1.

NOTE: Before deploying, make sure that you are familiar with the restrictions and limitations of the AWS services we are using in this post. For example, if you choose to set an S3 bucket name in the CDK Bucket construct, you must avoid naming conflicts that might cause deployment errors.

The CDK tool will now trigger our docker image build, provision the necessary AWS infrastructure (i.e., S3 Bucket, DynamoDB table, roles and permissions), and, upon completion, upload the docker image to a newly created repository on Amazon Elastic Container Registry (ECR).

2. Upload data to S3

Console explained in post.

Figure 2: S3 console window with uploaded images to the `images` directory.

After CDK has successfully finished deploying, head to the S3 console screen and upload images you want to process to a path in the S3 bucket. For this exercise, we’ve added every image to the `images` directory, as seen in Figure 2.

For larger datasets, utilize the AWS CLI tool to sync your local directory with the S3 bucket. In that case, consider enabling the ‘Transfer acceleration’ option of your S3 bucket for faster data transfers. However, this will incur an additional fee.

3. Trigger batch job execution

Once CDK has completed provisioning our infrastructure and we’ve uploaded the image data we want to process, open the newly created AWS Lambda in the AWS console screen in order to trigger the batch job execution.

To do this, create a test event with the following JSON body:

"Paths": [

The JSON body that we provide as input to the AWS Lambda function defines a list of paths to directories in the S3 buckets containing images. Having the ability to dynamically provide paths to directories with images in S3, lets us combine multiple data sources into a single AWS Batch job execution. Furthermore, if we decide in the future to put an API Gateway in front of the Lambda, you could pass every parameter of the batch job with a simple HTTP method call.

In this example, we specified just one path to the `images` directory in the S3 bucket, which we populated with images in the previous step.

Console screen explained in post.

Figure 3: AWS Lambda console screen of the function that triggers batch job execution. Modify the batch size by modifying the `image_batch_limit` variable. The value of this variable will depend on your particular use-case, computation type, image sizes, as well as processing time requirements.

The python code will list every path under the images S3 path, batch them into batches of desired size, and finally save the paths to batches as txt files under tmp S3 path. Each path to a txt files in S3 will be passed as an input to a batch jobs.

Select the newly created event, and then trigger the Lambda function execution. The AWS Lambda function will submit the AWS Batch jobs to the provisioned AWS Batch compute environment.

Batch job explained in post.

Figure 4: Screenshot of a running AWS Batch job that creates feature vectors from images and stores them to DynamoDB.

Once the AWS Lambda execution finishes its execution, we can monitor the AWS Batch jobs being processed on the AWS console screen, as seen in Figure 4. Wait until every job has finished successfully.

4. View results in DynamoDB

Image vectorization results.

Figure 5: Image vectorization results stored for each image as a entry in the DynamoDB table.

Once every batch job is successfully finished, go to the DynamoDB AWS cloud console and see the feature vectors stored as strings obtained from the numpy tostring method, as well as other data we stored in the table.

When you are ready to access the vectors in one of your projects, utilize the code snippet provided here:

#!/usr/bin/env python3

import numpy as np
import boto3

def vector_from(item):
    item : DynamoDB response item object
    vector = np.frombuffer(item['Vector'].value, dtype=item['DataType'])
    assert len(vector) == item['Dimension']
    return vector

def vectors_from_dydb(dynamodb, table_name, image_ids):
    dynamodb : DynamoDB client
    table_name : Name of the DynamoDB table
    image_ids : List of id's to query the DynamoDB table for

    response = dynamodb.batch_get_item(
        RequestItems={table_name: {'Keys': [{'ImageId': val} for val in image_ids]}},

    query_vectors =  [vector_from(item) for item in response['Responses'][table_name]]
    query_image_ids =  [item['ImageId'] for item in response['Responses'][table_name]]

    return zip(query_vectors, query_image_ids)
def process_entry(vector, image_id):
    NOTE - Add your code here.

def main():
    Reads vectors from the batch job DynamoDB table containing the vectorization results.
    dynamodb = boto3.resource('dynamodb', region_name='eu-central-1')
    table_name = 'aws-blog-batch-job-image-transform-dynamodb-table'

    image_ids = ['B000KT6OK6', 'B000KTC6X0', 'B000KTC6XK', 'B001B4THHG']

    for vector, image_id in vectors_from_dydb(dynamodb, table_name, image_ids):
        process_entry(vector, image_id)

if __name__ == "__main__":

This code snippet will utilize the boto3 client to access the results stored in the DynamoDB table. Make sure to update the code variables, as well as to modify this implementation to one that fits your use-case.

5. Tear down the infrastructure using CDK

To finish off the exercise, we will tear down the infrastructure that we have provisioned. Since we are using CDK, this is very simple — go to the project root directory and run:

cdk destroy

After a confirmation prompt, the infrastructure tear-down should be underway. If you want to follow the process in more detail, then go to the CloudFormation console view and monitor the process from there.

NOTE: The S3 Bucket, ECR image, and DynamoDB table resource will not be deleted, since the current CDK code defaults to RETAIN behavior in order to prevent the deletion of data we stored there. Once you are sure that you don’t need them, remove those remaining resources manually or modify the CDK code for desired behavior.


In this post we solved an embarrassingly parallel job of creating vector embeddings from images using AWS batch. We provisioned the infrastructure using Python CDK, uploaded sample images, submitted AWS batch job for execution, read the results from the DynamoDB table, and, finally, destroyed the AWS cloud resources we’ve provisioned at the beginning.

AWS Batch serves as a good compute environment for various jobs. For this one in particular, we can scale the processing to more compute resources with minimal or no modifications to our deep learning models and supporting code. On the other hand, it lets us potentially reduce costs by utilizing smaller compute resources and longer execution times.

The code serves as a good point for beginning to experiment more with AWS batch in a Deep Leaning/Machine Learning setup. You could extend it to utilize EC2 instances with GPUs instead of CPUs, utilize Spot instances instead of on-demand ones, utilize AWS Step Functions to automate process orchestration, utilize Amazon SQS as a mechanism to distribute the workload, as well as move the lambda job submission to another compute resource, or pretty much tailor your project for anything else you might need AWS Batch to do.

And that brings us to the conclusion of this post. Thanks for reading, and feel free to leave a comment below if you have any questions. Also, if you enjoyed reading this post, make sure to share it with your friends and colleagues!

About the author

Filip Saina

Filip is a Software Development Engineer at Amazon working in a Computer Vision team. He works with researchers and engineers across Amazon to develop and deploy Computer Vision algorithms and ML models into production systems. Besides day-to-day coding, his responsibilities also include architecting and implementing distributed systems in AWS cloud for scalable ML applications.

Generating DevOps Guru Proactive Insights for Amazon ECS

Post Syndicated from Trishanka Saikia original https://aws.amazon.com/blogs/devops/generate-devops-guru-proactive-insights-in-ecs-using-container-insights/

Monitoring is fundamental to operating an application in production, since we can only operate what we can measure and alert on. As an application evolves, or the environment grows more complex, it becomes increasingly challenging to maintain monitoring thresholds for each component, and to validate that they’re still set to an effective value. We not only want monitoring alarms to trigger when needed, but also want to minimize false positives.

Amazon DevOps Guru is an AWS service that helps you effectively monitor your application by ingesting vended metrics from Amazon CloudWatch. It learns your application’s behavior over time and then detects anomalies. Based on these anomalies, it generates insights by first combining the detected anomalies with suspected related events from AWS CloudTrail, and then providing the information to you in a simple, ready-to-use dashboard when you start investigating potential issues. Amazon DevOpsGuru makes use of the CloudWatch Containers Insights to detect issues around resource exhaustion for Amazon ECS or Amazon EKS applications. This helps in proactively detecting issues like memory leaks in your applications before they impact your users, and also provides guidance as to what the probable root-causes and resolutions might be.

This post will demonstrate how to simulate a memory leak in a container running in Amazon ECS, and have it generate a proactive insight in Amazon DevOps Guru.

Solution Overview

The following diagram shows the environment we’ll use for our scenario. The container “brickwall-maker” is preconfigured as to how quickly to allocate memory, and we have built this container image and published it to our public Amazon ECR repository. Optionally, you can build and host the docker image in your own private repository as described in step 2 & 3.

After creating the container image, we’ll utilize an AWS CloudFormation template to create an ECS Cluster and an ECS Service called “Test” with a desired count of two. This will create two tasks using our “brickwall-maker” container image. The stack will also enable Container Insights for the ECS Cluster. Then, we will enable resource coverage for this CloudFormation stack in Amazon DevOpsGuru in order to start our resource analysis.

Architecture Diagram showing the service “Test” using the container “brickwall-maker” with a desired count of two. The two ECS Task’s vended metrics are then processed by CloudWatch Container Insights. Both, CloudWatch Container Insights and CloudTrail, are ingested by Amazon DevOps Guru which then makes detected insights available to the user. [Image: DevOpsGuruBlog1.png]V1: DevOpsGuruBlog1.drawio (https://api.quip-amazon.com/2/blob/fbe9AAT37Ge/LdkTqbmlZ8uNj7A44pZbnw?name=DevOpsGuruBlog1.drawio&s=cVbmAWsXnynz) V2: DevOpsGuruBlog1.drawio (https://api.quip-amazon.com/2/blob/fbe9AAT37Ge/SvsNTJLEJOHHBls_kV7EwA?name=DevOpsGuruBlog1.drawio&s=cVbmAWsXnynz) V3: DevOpsGuruBlog1.drawio (https://api.quip-amazon.com/2/blob/fbe9AAT37Ge/DqKTxtQvmOLrzM3KcF_oTg?name=DevOpsGuruBlog1.drawio&s=cVbmAWsXnynz)

Source provided on GitHub:

  • DevOpsGuru.yaml
  • EnableDevOpsGuruForCfnStack.yaml
  • Docker container source


1. Create your IDE environment

In the AWS Cloud9 console, click Create environment, give your environment a Name, and click Next step. On the Environment settings page, change the instance type to t3.small, and click Next step. On the Review page, make sure that the Name and Instance type are set as intended, and click Create environment. The environment creation will take a few minutes. After that, the AWS Cloud9 IDE will open, and you can continue working in the terminal tab displayed in the bottom pane of the IDE.

Install the following prerequisite packages, and ensure that you have docker installed:

sudo yum install -y docker
sudo service docker start 
docker --version
Clone the git repository in order to download the required CloudFormation templates and code:

git clone https://github.com/aws-samples/amazon-devopsguru-brickwall-maker

Change to the directory that contains the cloned repository

cd amazon-devopsguru-brickwall-maker

2. Optional : Create ECR private repository

If you want to build your own container image and host it in your own private ECR repository, create a new repository with the following command and then follow the steps to prepare your own image:

aws ecr create-repository —repository-name brickwall-maker

3. Optional: Prepare Docker Image

Authenticate to Amazon Elastic Container Registry (ECR) in the target region

aws ecr get-login-password --region ap-northeast-1 | \
    docker login --username AWS --password-stdin \

In the above command, as well as in the following shown below, make sure that you replace 123456789012 with your own account ID.

Build brickwall-maker Docker container:

docker build -t brickwall-maker .

Tag the Docker container to prepare it to be pushed to ECR:

docker tag brickwall-maker:latest 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/brickwall-maker:latest

Push the built Docker container to ECR

docker push 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/brickwall-maker:latest

4. Launch the CloudFormation template to deploy your ECS infrastructure

To deploy your ECS infrastructure, run the following command (replace your own private ECR URL or use our public URL) in the ParameterValue) to launch the CloudFormation template :

aws cloudformation create-stack --stack-name myECS-Stack \
--template-body file://DevOpsGuru.yaml \
--parameters ParameterKey=ImageUrl,ParameterValue=public.ecr.aws/p8v8e7e5/myartifacts:brickwallv1

5. Enable DevOps Guru to monitor the ECS Application

Run the following command to enable DevOps Guru for monitoring your ECS application:

aws cloudformation create-stack \
--stack-name EnableDevOpsGuruForCfnStack \
--template-body file://EnableDevOpsGuruForCfnStack.yaml \
--parameters ParameterKey=CfnStackNames,ParameterValue=myECS-Stack

6. Wait for base-lining of resources

This step lets DevOps Guru complete the baselining of the resources and benchmark the normal behavior. For this particular scenario, we recommend waiting two days before any insights are triggered.

Unlike other monitoring tools, the DevOps Guru dashboard would not present any counters or graphs. In the meantime, you can utilize CloudWatch Container Insights to monitor the cluster-level, task-level, and service-level metrics in ECS.

7. View Container Insights metrics

  • Open the CloudWatch console.
  • In the navigation pane, choose Container Insights.
  • Use the drop-down boxes near the top to select ECS Services as the resource type to view, then select DevOps Guru as the resource to monitor.
  • The performance monitoring view will show you graphs for several metrics, including “Memory Utilization”, which you can watch increasing from here. In addition, it will show the list of tasks in the lower “Task performance” pane showing the “Avg CPU” and “Avg memory” metrics for the individual tasks.

8. Review DevOps Guru insights

When DevOps Guru detects an anomaly, it generates a proactive insight with the relevant information needed to investigate the anomaly, and it will list it in the DevOps Guru Dashboard.

You can view the insights by clicking on the number of insights displayed in the dashboard. In our case, we expect insights to be shown in the “proactive insights” category on the dashboard.

Once you have opened the insight, you will see that the insight view is divided into the following sections:

  • Insight Overview with a basic description of the anomaly. In this case, stating that Memory Utilization is approaching limit with details of the stack that is being affected by the anomaly.
  • Anomalous metrics consisting of related graphs and a timeline of the predicted impact time in the future.
  • Relevant events with contextual information, such as changes or updates made to the CloudFormation stack’s resources in the region.
  • Recommendations to mitigate the issue. As seen in the following screenshot, it recommends troubleshooting High CPU or Memory Utilization in ECS along with a link to the necessary documentation.

The following screenshot illustrates an example insight detail page from DevOps Guru

 An example of an ECS Service’s Memory Utilization approaching a limit of 100%. The metric graph shows the anomaly starting two days ago at about 22:00 with memory utilization increasing steadily until the anomaly was reported today at 18:08. The graph also shows a forecast of the memory utilization with a predicted impact of reaching 100% the next day at about 22:00.

Potentially related events on a timeline and below them a list of recommendations. Two deployment events are shown without further details on a timeline. The recommendations table links to one document on how to troubleshoot high CPU or memory utilization in Amazon ECS.


This post describes how DevOps Guru continuously monitors resources in a particular region in your AWS account, as well as proactively helps identify problems around resource exhaustion such as running out of memory, in advance. This helps IT operators take preventative actions even before a problem presents itself, thereby preventing downtime.

Cleaning up

After walking through this post, you should clean up and un-provision the resources in order to avoid incurring any further charges.

  1. To un-provision the CloudFormation stacks, on the AWS CloudFormation console, choose Stacks. Select the stack name, and choose Delete.
  2. Delete the AWS Cloud9 environment.
  3. Delete the ECR repository.

About the authors

Trishanka Saikia

Trishanka Saikia is a Technical Account Manager for AWS. She is also a DevOps enthusiast and works with AWS customers to design, deploy, and manage their AWS workloads/architectures.

Gerhard Poul

Gerhard Poul is a Senior Solutions Architect at Amazon Web Services based in Vienna, Austria. Gerhard works with customers in Austria to enable them with best practices in their cloud journey. He is passionate about infrastructure as code and how cloud technologies can improve IT operations.

Forensic investigation environment strategies in the AWS Cloud

Post Syndicated from Sol Kavanagh original https://aws.amazon.com/blogs/security/forensic-investigation-environment-strategies-in-the-aws-cloud/

When a deviation from your secure baseline occurs, it’s crucial to respond and resolve the issue quickly and follow up with a forensic investigation and root cause analysis. Having a preconfigured infrastructure and a practiced plan for using it when there’s a deviation from your baseline will help you to extract and analyze the information needed to determine the impact, scope, and root cause of an incident and return to operations confidently.

Time is of the essence in understanding the what, how, who, where, and when of a security incident. You often hear of automated incident response, which has repeatable and auditable processes to standardize the resolution of incidents and accelerate evidence artifact gathering.

Similarly, having a standard, pristine, pre-configured, and repeatable forensic clean-room environment that can be automatically deployed through a template allows your organization to minimize human interaction, keep the larger organization separate from contamination, hasten evidence gathering and root cause analysis, and protect forensic data integrity. The forensic analysis process assists in data preservation, acquisition, and analysis to identify the root cause of an incident. This approach can also facilitate the presentation or transfer of evidence to outside legal entities or auditors. AWS CloudFormation templates—or other infrastructure as code (IaC) provisioning tools—help you to achieve these goals, providing your business with consistent, well-structured, and auditable results that allow for a better overall security posture. Having these environments as a permanent part of your infrastructure allows them to be well documented and tested, and gives you opportunities to train your teams in their use.

This post provides strategies that you can use to prepare your organization to respond to secure baseline deviations. These strategies take the form of best practices around Amazon Web Services (AWS) account structure, AWS Organizations organizational units (OUs) and service control policies (SCPs), forensic Amazon Virtual Private Cloud (Amazon VPC) and network infrastructure, evidence artifacts to be collected, AWS services to be used, forensic analysis tool infrastructure, and user access and authorization to the above. The specific focus is to provide an environment where Amazon Elastic Compute Cloud (Amazon EC2) instances with forensic tooling can be used to examine evidence artifacts.

This post presumes that you already have an evidence artifact collection procedure or that you are implementing one and that the evidence can be transferred to the accounts described here. If you are looking for advice on how to automate artifact collection, see How to automate forensic disk collection for guidance.

Infrastructure overview

A well-architected multi-account AWS environment is based on the structure provided by Organizations. As companies grow and need to scale their infrastructure with multiple accounts, often in multiple AWS Regions, Organizations offers programmatic creation of new AWS accounts combined with central management and governance that helps them to do so in a controlled and standardized manner. This programmatic, centralized approach should be used to create the forensic investigation environments described in the strategy in this blog post.

The example in this blog post uses a simplified structure with separate dedicated OUs and accounts for security and forensics, shown in Figure 1. Your organization’s architecture might differ, but the strategy remains the same.

Note: There might be reasons for forensic analysis to be performed live within the compromised account itself, such as to avoid shutting down or accessing the compromised instance or resource; however, that approach isn’t covered here.

Figure 1: AWS Organizations forensics OU example

Figure 1: AWS Organizations forensics OU example

The most important components in Figure 1 are:

  • A security OU, which is used for hosting security-related access and services. The security OU and the associated AWS accounts should be owned and managed by your security organization.
  • A forensics OU, which should be a separate entity, although it can have some similarities and crossover responsibilities with the security OU. There are several reasons for having it within a separate OU and account. Some of the more important reasons are that the forensics team might be a different team than the security team (or a subset of it), certain investigations might be under legal hold with additional access restrictions, or a member of the security team could be the focus of an investigation.

When speaking about Organizations, accounts, and the permissions required for various actions, you must first look at SCPs, a core functionality of Organizations. SCPs offer control over the maximum available permissions for all accounts in your organization. In the example in this blog post, you can use SCPs to provide similar or identical permission policies to all the accounts under the forensics OU, which is being used as a resource container. This policy overrides all other policies, and is a crucial mechanism to ensure that you can explicitly deny or allow any API calls desired. Some use cases of SCPs are to restrict the ability to disable AWS CloudTrail, restrict root user access, and ensure that all actions taken in the forensic investigation account are logged. This provides a centralized way to avoid changing individual policies for users, groups, or roles. Accessing the forensic environment should be done using a least-privilege model, with nobody capable of modifying or compromising the initially collected evidence. For an investigation environment, denying all actions except those you want to list as exceptions is the most straightforward approach. Start with the default of denying all, and work your way towards the least authorizations needed to perform the forensic processes established by your organization. AWS Config can be a valuable tool to track the changes made to the account and provide evidence of these changes.

Keep in mind that once the restrictive SCP is applied, even the root account or those with administrator access won’t have access beyond those permissions; therefore, frequent, proactive testing as your environment changes is a best practice. Also, be sure to validate which principals can remove the protective policy, if required, to transfer the account to an outside entity. Finally, create the environment before the restrictive permissions are applied, and then move the account under the forensic OU.

Having a separate AWS account dedicated to forensic investigations is best to keep your larger organization separate from the possible threat of contamination from the incident itself, ensure the isolation and protection of the integrity of the artifacts being analyzed, and keeping the investigation confidential. Separate accounts also avoid situations where the threat actors might have used all the resources immediately available to your compromised AWS account by hitting service quotas and so preventing you from instantiating an Amazon EC2 instance to perform investigations.

Having a forensic investigation account per Region is also a good practice, as it keeps the investigative capabilities close to the data being analyzed, reduces latency, and avoids issues of the data changing regulatory jurisdictions. For example, data residing in the EU might need to be examined by an investigative team in North America, but the data itself cannot be moved because its North American architecture doesn’t align with GDPR compliance. For global customers, forensics teams might be situated in different locations worldwide and have different processes. It’s better to have a forensic account in the Region where an incident arose. The account as a whole could also then be provided to local legal institutions or third-party auditors if required. That said, if your AWS infrastructure is contained within Regions only in one jurisdiction or country, then a single re-creatable account in one Region with evidence artifacts shared from and kept in their respective Regions could be an easier architecture to manage over time.

An account created in an automated fashion using a CloudFormation template—or other IaC methods—allows you to minimize human interaction before use by recreating an entirely new and untouched forensic analysis instance for each separate investigation, ensuring its integrity. Individual people will only be given access as part of a security incident response plan, and even then, permissions to change the environment should be minimal or none at all. The post-investigation environment would then be either preserved in a locked state or removed, and a fresh, blank one created in its place for the subsequent investigation with no trace of the previous artifacts. Templating your environment also facilitates testing to ensure your investigative strategy, permissions, and tooling will function as intended.

Accessing your forensics infrastructure

Once you’ve defined where your investigative environment should reside, you must think about who will be accessing it, how they will do so, and what permissions they will need.

The forensic investigation team can be a separate team from the security incident response team, the same team, or a subset. You should provide precise access rights to the group of individuals performing the investigation as part of maintaining least privilege.

You should create specific roles for the various needs of the forensic procedures, each with only the permissions required. As with SCPs and other situations described here, start with no permissions and add authorizations only as required while establishing and testing your templated environments. As an example, you might create the following roles within the forensic account:

Responder – acquire evidence

Investigator – analyze evidence

Data custodian – manage (copy, move, delete, and expire) evidence

Analyst – access forensics reports for analytics, trends, and forecasting (threat intelligence)

You should establish an access procedure for each role, and include it in the response plan playbook. This will help you ensure least privilege access as well as environment integrity. For example, establish a process for an owner of the Security Incident Response Plan to verify and approve the request for access to the environment. Another alternative is the two-person rule. Alert on log-in is an additional security measure that you can add to help increase confidence in the environment’s integrity, and to monitor for unauthorized access.

You want the investigative role to have read-only access to the original evidence artifacts collected, generally consisting of Amazon Elastic Block Store (Amazon EBS) snapshots, memory dumps, logs, or other artifacts in an Amazon Simple Storage Service (Amazon S3) bucket. The original sources of evidence should be protected; MFA delete and S3 versioning are two methods for doing so. Work should be performed on copies of copies if rendering the original immutable isn’t possible, especially if any modification of the artifact will happen. This is discussed in further detail below.

Evidence should only be accessible from the roles that absolutely require access—that is, investigator and data custodian. To help prevent potential insider threat actors from being aware of the investigation, you should deny even read access from any roles not intended to access and analyze evidence.

Protecting the integrity of your forensic infrastructures

Once you’ve built the organization, account structure, and roles, you must decide on the best strategy inside the account itself. Analysis of the collected artifacts can be done through forensic analysis tools hosted on an EC2 instance, ideally residing within a dedicated Amazon VPC in the forensics account. This Amazon VPC should be configured with the same restrictive approach you’ve taken so far, being fully isolated and auditable, with the only resources being dedicated to the forensic tasks at hand.

This might mean that the Amazon VPC’s subnets will have no internet gateways, and therefore all S3 access must be done through an S3 VPC endpoint. VPC flow logging should be enabled at the Amazon VPC level so that there are records of all network traffic. Security groups should be highly restrictive, and deny all ports that aren’t related to the requirements of the forensic tools. SSH and RDP access should be restricted and governed by auditable mechanisms such as a bastion host configured to log all connections and activity, AWS Systems Manager Session Manager, or similar.

If using Systems Manager Session Manager with a graphical interface is required, RDP or other methods can still be accessed. Commands and responses performed using Session Manager can be logged to Amazon CloudWatch and an S3 bucket, this allows auditing of all commands executed on the forensic tooling Amazon EC2 instances. Administrative privileges can also be restricted if required. You can also arrange to receive an Amazon Simple Notification Service (Amazon SNS) notification when a new session is started.

Given that the Amazon EC2 forensic tooling instances might not have direct access to the internet, you might need to create a process to preconfigure and deploy standardized Amazon Machine Images (AMIs) with the appropriate installed and updated set of tooling for analysis. Several best practices apply around this process. The OS of the AMI should be hardened to reduce its vulnerable surface. We do this by starting with an approved OS image, such as an AWS-provided AMI or one you have created and managed yourself. Then proceed to remove unwanted programs, packages, libraries, and other components. Ensure that all updates and patches—security and otherwise—have been applied. Configuring a host-based firewall is also a good precaution, as well as host-based intrusion detection tools. In addition, always ensure the attached disks are encrypted.

If your operating system is supported, we recommend creating golden images using EC2 Image Builder. Your golden image should be rebuilt and updated at least monthly, as you want to ensure it’s kept up to date with security patches and functionality.

EC2 Image Builder—combined with other tools—facilitates the hardening process; for example, allowing the creation of automated pipelines that produce Center for Internet Security (CIS) benchmark hardened AMIs. If you don’t want to maintain your own hardened images, you can find CIS benchmark hardened AMIs on the AWS Marketplace.

Keep in mind the infrastructure requirements for your forensic tools—such as minimum CPU, memory, storage, and networking requirements—before choosing an appropriate EC2 instance type. Though a variety of instance types are available, you’ll want to ensure that you’re keeping the right balance between cost and performance based on your minimum requirements and expected workloads.

The goal of this environment is to provide an efficient means to collect evidence, perform a comprehensive investigation, and effectively return to safe operations. Evidence is best acquired through the automated strategies discussed in How to automate incident response in the AWS Cloud for EC2 instances. Hashing evidence artifacts immediately upon acquisition is highly recommended in your evidence collection process. Hashes, and in turn the evidence itself, can then be validated after subsequent transfers and accesses, ensuring the integrity of the evidence is maintained. Preserving the original evidence is crucial if legal action is taken.

Evidence and artifacts can consist of, but aren’t limited to:

Access to the control plane logs mentioned above—such as the CloudTrail logs—can be accessed in one of two ways. Ideally, the logs should reside in a central location with read-only access for investigations as needed. However, if not centralized, read access can be given to the original logs within the source account as needed. Read access to certain service logs found within the security account, such as AWS Config, Amazon GuardDuty, Security Hub, and Amazon Detective, might be necessary to correlate indicators of compromise with evidence discovered during the analysis.

As previously mentioned, it’s imperative to have immutable versions of all evidence. This can be achieved in many ways, including but not limited to the following examples:

  • Amazon EBS snapshots, including hibernation generated memory dumps:
    • Original Amazon EBS disks are snapshotted, shared to the forensics account, used to create a volume, and then mounted as read-only for offline analysis.
  • Amazon EBS volumes manually captured:
    • Linux tools such as dc3dd can be used to stream a volume to an S3 bucket, as well as provide a hash, and then made immutable using an S3 method from the next bullet point.
  • Artifacts stored in an S3 bucket, such as memory dumps and other artifacts:
    • S3 Object Lock prevents objects from being deleted or overwritten for a fixed amount of time or indefinitely.
    • Using MFA delete requires the requestor to use multi-factor authentication to permanently delete an object.
    • Amazon S3 Glacier provides a Vault Lock function if you want to retain immutable evidence long term.
  • Disk volumes:
    • Linux: Mount in read-only mode.
    • Windows: Use one of the many commercial or open-source write-blocker applications available, some of which are specifically made for forensic use.
  • CloudTrail:
  • AWS Systems Manager inventory:
  • AWS Config data:
    • By default, AWS Config stores data in an S3 bucket, and can be protected using the above methods.

Note: AWS services such as KMS can help enable encryption. KMS is integrated with AWS services to simplify using your keys to encrypt data across your AWS workloads.

An example use case of Amazon EBS disks being shared as evidence to the forensics account, the following figure—Figure 2—is a simplified S3 bucket folder structure you could use to store and work with evidence.

Figure 2 shows an S3 bucket structure for a forensic account. An S3 bucket and folder is created to hold incoming data—for example, from Amazon EBS disks—which is streamed to Incoming Data > Evidence Artifacts using dc3dd. The data is then copied from there to a folder in another bucket—Active Investigation > Root Directory > Extracted Artifacts—to be analyzed by the tooling installed on your forensic Amazon EC2 instance. Also, there are folders under Active Investigation for any investigation notes you make during analysis, as well as the final reports, which are discussed at the end of this blog post. Finally, a bucket and folders for legal holds, where an object lock will be placed to hold evidence artifacts at a specific version.

Figure 2: Forensic account S3 bucket structure

Figure 2: Forensic account S3 bucket structure


Finally, depending on the severity of the incident, your on-premises network and infrastructure might also be compromised. Having an alternative environment for your security responders to use in case of such an event reduces the chance of not being able to respond in an emergency. Amazon services such as Amazon Workspaces—a fully managed persistent desktop virtualization service—can be used to provide your responders a ready-to-use, independent environment that they can use to access the digital forensics and incident response tools needed to perform incident-related tasks.

Aside from the investigative tools, communications services are among the most critical for coordination of response. You can use Amazon WorkMail and Amazon Chime to provide that capability independent of normal channels.


The goal of a forensic investigation is to provide a final report that’s supported by the evidence. This includes what was accessed, who might have accessed it, how it was accessed, whether any data was exfiltrated, and so on. This report might be necessary for legal circumstances, such as criminal or civil investigations or situations requiring breach notifications. What output each circumstance requires should be determined in advance in order to develop an appropriate response and reporting process for each. A root cause analysis is vital in providing the information required to prepare your resources and environment to help prevent a similar incident in the future. Reports should not only include a root cause analysis, but also provide the methods, steps, and tools used to arrive at the conclusions.

This article has shown you how you can get started creating and maintaining forensic environments, as well as enable your teams to perform advanced incident resolution investigations using AWS services. Implementing the groundwork for your forensics environment, as described above, allows you to use automated disk collection to begin iterating on your forensic data collection capabilities and be better prepared when security events occur.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on one of the AWS Security, Identity, and Compliance forums or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Sol Kavanagh

Sol is a senior solutions architect and CISSP with 20+ years of experience in the enterprise space. He is passionate about security and helping customers build solutions in the AWS Cloud. In his spare time he enjoys distance cycling, adventure travelling, Buddhist philosophy, and Muay Thai.

Migrate and secure your Windows PKI to AWS with AWS CloudHSM

Post Syndicated from Govindarajan Varadan original https://aws.amazon.com/blogs/security/migrate-and-secure-your-windows-pki-to-aws-with-aws-cloudhsm/

AWS CloudHSM provides a cloud-based hardware security module (HSM) that enables you to easily generate and use your own encryption keys in AWS. Using CloudHSM as part of a Microsoft Active Directory Certificate Services (AD CS) public key infrastructure (PKI) fortifies the security of your certificate authority (CA) private key and ensures the security of the trust hierarchy. In this blog post, we walk you through how to migrate your existing Microsoft AD CS CA private key to the HSM in a CloudHSM cluster.

The challenge

Organizations implement public key infrastructure (PKI) as an application to provide integrity and confidentiality between internal and customer-facing applications. A PKI provides encryption/decryption, message hashing, digital certificates, and digital signatures to ensure these security objectives are met. Microsoft AD CS is a popular choice for creating and managing a CA for enterprise applications such as Active Directory, Exchange, and Systems Center Configuration Manager. Moving your Microsoft AD CS to AWS as part of your overall migration plan allows you to continue to use your existing investment in Windows certificate auto enrollment for users and devices without disrupting existing workflows or requiring new certificates to be issued. However, when you migrate an on-premises infrastructure to the cloud, your security team may determine that storing private keys on the AD CS server’s disk is insufficient for protecting the private key that signs the certificates issued by the CA. Moving from storing private keys on the AD CS server’s disk to a hardware security module (HSM) can provide the added security required to maintain trust of the private keys.

This walkthrough shows you how to migrate your existing AD CS CA private key to the HSM in your CloudHSM cluster. The resulting configuration avoids the security concerns of using keys stored on your AD CS server, and uses the HSM to perform the cryptographic signing operations.


For this walkthrough, you should have the following in place:

Migrating a domain

In this section, you will walk through migrating your AD CS environment to AWS by using your existing CA certificate and private key that will be secured in CloudHSM. In order to securely migrate the private key into the HSM, you will install the CloudHSM client and import the keys directly from the existing CA server.

This walkthrough includes the following steps:

  1. Create a crypto user (CU) account
  2. Import the CA private key into CloudHSM
  3. Export the CA certificate and database
  4. Configure and import the certificate into the new Windows CA server
  5. Install AD CS on the new server

The operations you perform on the HSM require the credentials of an HSM user. Each HSM user has a type that determines the operations you can perform when authenticated as that user. Next, you will create a crypto user (CU) account to use with your CA servers, to manage keys and to perform cryptographic operations.

To create the CU account

  1. From the on-premises CA server, use the following command to log in with the crypto officer (CO) account that you created when you activated the cluster. Be sure to replace <co_password> with your CO password.
    loginHSM CO admin <co_password>

  2. Use the following command to create the CU account. Replace <cu_user> and <cu_password> with the username and password you want to use for the CU.
    createUser CU <cu_user> <cu_password>

  3. Use the following command to set the login credentials for the HSM on your system and enable the AWS CloudHSM client for Windows to use key storage providers (KSPs) and Cryptography API: Next Generation (CNG) providers. Replace <cu_user> and <cu_password> with the username and password of the CU.
    set_cloudhsm_credentials.exe --username <cu_user> password <cu_password>

Now that you have the CloudHSM client installed and configured on the on-premises CA server, you can import the CA private key from the local server into your CloudHSM cluster.

To import the CA private key into CloudHSM

  1. Open an administrative command prompt and navigate to C:\Program Files\Amazon\CloudHSM.
  2. To identify the unique container name for your CA’s private key, enter certutil -store my to list all certificates stored in the local machine store. The CA certificate will be shown as follows:
    ================ Certificate 0 ================
    Serial Number: <certificate_serial_number>
    Issuer: CN=example-CA, DC=example, DC=com
     NotBefore: 6/25/2021 5:04 PM
     NotAfter: 6/25/2022 5:14 PM
    Subject: CN=example-CA-test3, DC=example, DC=com
    Certificate Template Name (Certificate Type): CA
    CA Version: V0.0
    Signature matches Public Key
    Root Certificate: Subject matches Issuer
    Template: CA, Root Certification Authority
    Cert Hash(sha1): cb7c09cd6c76d69d9682a31fbdbbe01c29cebd82
      Key Container = example-CA-test3
      Unique container name: <unique_container_name>
      Provider = Microsoft Software Key Storage Provider
    Signature test passed

  3. Verify that the key is backed by the Microsoft Software Key Storage Provider and make note of the <unique_container_name> from the output, to use it in the following steps.
  4. Use the following command to set the environment variable n3fips_password. Replace <cu_user> and <cu_password> with the username and password for the CU you created earlier for the CloudHSM cluster. This variable will be used by the import_key command in the next step.
    set n3fips_password=<cu_user>:<cu_password>

  5. Use the following import_key command to import the private key into the HSM. Replace <unique_container_name> with the value you noted earlier.
    import_key.exe -RSA "<unique_container_name>

The import_key command will report that the import was successful. At this point, your private key has been imported into the HSM, but the on-premises CA server will continue to run using the key stored locally.

The Active Directory Certificate Services Migration Guide for Windows Server 2012 R2 uses the Certification Authority snap-in to migrate the CA database, as well as the certificate and private key. Because you have already imported your private key into the HSM, next you will need to make a slight modification to this process and export the certificate manually, without its private key.

To export the CA certificate and database

  1. To open the Microsoft Management Console (MMC), open the Start menu and in the search field, enter MMC, and choose Enter.
  2. From the File menu, select Add/Remove Snapin.
  3. Select Certificates and choose Add.
  4. You will be prompted to select which certificate store to manage. Select Computer account and choose Next.
  5. Select Local Computer, choose Finish, then choose OK.
  6. In the left pane, choose Personal, then choose Certificates. In the center pane, locate your CA certificate, as shown in Figure 1.
    The MMC Certificates snap-in displays the Certificates directories for the local computer. The Personal Certificates location is open displaying the example-CA-test3 certificate.

    Figure 1: Microsoft Management Console Certificates snap-in

  7. Open the context (right-click) menu for the certificate, choose All Tasks, then choose Export.
  8. In the Certificate Export Wizard, choose Next, then choose No, do not export the private key.
  9. Under Select the format you want to use, select Cryptographic Message Syntax Standard – PKCS #7 format file (.p7b) and select Include all certificates in the certification path if possible, as shown in Figure 2.
    The Certificate Export Wizard window is displayed.  This windows is prompting for the selection of an export format.  The toggle is selected for Cryptographic Message Syntax Standard – PKCS #7 Certificates (.P7B) and the check box is marked to Include all certificates in the certification path if possible.

    Figure 2: Certificate Export Wizard

  10. Save the file in a location where you’ll be able to locate it later, so you will be able to copy it to the new CA server.
  11. From the Start menu, browse to Administrative Tools, then choose Certificate Authority.
  12. Open the context (right-click) menu for your CA and choose All Tasks, then choose Back up CA.
  13. In the Certificate Authority Backup Wizard, choose Next. For items to back up, select only Certificate database and certificate database log. Leave all other options unselected.
  14. Under Back up to this location, choose Browse and select a new empty folder to hold the backup files, which you will move to the new CA later.
  15. After the backup is complete, in the MMC, open the context (right-click) menu for your CA, choose All Tasks, then choose Stop service.

At this point, until you complete the migration, your CA will no longer be issuing new certificates.

To configure and import the certificate into the new Windows CA server

  1. Open a Remote Desktop session to the EC2 instance that you created in the prerequisite steps, which will serve as your new AD CS certificate authority.
  2. Copy the certificate (.p7b file) backup from the on-premises CA server to the EC2 instance.
  3. On your EC2 instance, locate the certificate you just copied, as shown in Figure 3. Open the certificate to start the import process.
    The Certificate Manager tool window shows the Certificates directory for the p7b file that was opened. The main window for this location is displaying the example-CA-test3 certificate.

    Figure 3: Certificate Manager tool

  4. Select Install Certificate. For Store Location, select Local Machine.
  5. Select Place the Certificates in the following store. Allowing Windows to place the certificate automatically will install it as a trusted root certificate, rather than a server certificate.
  6. Select Browse, select the Personal store, and then choose OK.
  7. Choose Next, then choose Finish to complete the certificate installation.

At this point, you’ve installed the public key and certificate from the on-premises CA server to your EC2-based Windows CA server. Next, you need to link this installed certificate with the private key, which is now stored on the CloudHSM cluster, in order to make it functional for signing issued certificates and CRLs.

To link the certificate with the private key

  1. Open an administrative command prompt and navigate to C:\Program Files\Amazon\CloudHSM.
  2. Use the following command to set the environment variable n3fips_password. Replace <cu_user> and <cu_password> with the username and password for the CU that you created earlier for the CloudHSM cluster. This variable will be used by the import_key command in the next step.
    set n3fips_password=<cu_user>:<cu_password>

  3. Use the following import_key command to represent all keys stored on the HSM in a new key container in the key storage provider. This step is necessary to allow the cryptography tools to see the CA private key that is stored on the HSM.
    import_key -from HSM -all

  4. Use the following Windows certutil command to find your certificate’s unique serial number.
    certutil -store my

    Take note of the CA certificate’s serial number.

  5. Use the following Windows certutil command to link the installed certificate with the private key stored on the HSM. Replace <certificate_serial_number> with the value noted in the previous step.
    certutil -repairstore my <certificate_serial_number>

  6. Enter the command certutil -store my. The CA certificate will be shown as follows. Verify that the certificate is now linked with the HSM-backed private key. Note that the private key is using the Cavium Key Store Provider. Also note the message Encryption test passed, which means that the private key is usable for encryption.
    ================ Certificate 0 ================
    Serial Number: <certificate_serial_number>
    Issuer: CN=example-CA, DC=example, DC=com
     NotBefore: 6/25/2021 5:04 PM
     NotAfter: 6/25/2022 5:14 PM
    Subject: CN=example-CA, DC=example, DC=com
    Certificate Template Name (Certificate Type): CA
    CA Version: V0.0
    Signature matches Public Key
    Root Certificate: Subject matches Issuer
    Template: CA, Root Certification Authority
    Cert Hash(sha1): cb7c09cd6c76d69d9682a31fbdbbe01c29cebd82
      Key Container = PRV_KEY_IMPORT-6-9-7e5cde
      Provider = Cavium Key Storage Provider
    Private key is NOT exportable
    Encryption test passed

Now that your CA certificate and key materials are in place, you are ready to setup your EC2 instance as a CA server.

To install AD CS on the new server

  1. In Microsoft’s documentation to Install the Certificate Authority role on your new EC2 instance, follow steps 1-8. Do not complete the remaining steps, because you will be configuring the CA to use the existing HSM backed certificate and private-key instead of generating a new key.
  2. In Confirm installation selections, select Install.
  3. After your installation is complete, Server Manager will show a notification banner prompting you to configure AD CS. Select Configure Active Directory Certificate Services from this prompt.
  4. Select either Standalone or Enterprise CA installation, based upon the configuration of your on-premises CA.
  5. Select Use Existing Certificate and Private Key and browse to select the CA certificate imported from your on-premises CA server.
  6. Select Next and verify your location for the certificate database files.
  7. Select Finish to complete the wizard.
  8. To restore the CA database backup, from the Start menu, browse to Administrative Tools, then choose Certificate Authority.
  9. Open the context (right-click) menu for the certificate authority and choose All Tasks, then choose Restore CA. Browse to and select the database backup that you copied from the on-premises CA server.

Review the Active Directory Certificate Services Migration Guide for Windows Server 2012 R2 to complete migration of your remaining Microsoft Public Key Infrastructure (PKI) components. Depending on your existing CA environment, these steps may include establishing new CRL and AIA endpoints, configuring Windows Routing and Remote Access to use the new CA, or configuring certificate auto enrollment for Windows clients.


In this post, we walked you through migrating an on-premises Microsoft AD CS environment to an AWS environment that uses AWS CloudHSM to secure the CA private key. By migrating your existing Windows PKI backed by AWS CloudHSM, you can continue to use your Windows certificate auto enrollment for users and devices with your private key secured in a dedicated HSM.

For more information about setting up and managing CloudHSM, see Getting Started with AWS CloudHSM and the AWS Security Blog post CloudHSM best practices to maximize performance and avoid common configuration pitfalls.

If you have feedback about this blog post, submit comments in the Comments section below. You can also start a new thread on the AWS CloudHSM forum to get answers from the community.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Govindarajan Varadan

Govindarajan is a senior solutions architect at AWS based out of Silicon Valley in California. He works with AWS customers to help them achieve their business objectives by innovating at scale, modernizing their applications, and adopting game-changing technologies like AI/ML.


Brian Benscoter

Brian is a senior solutions architect at AWS with a passion for governance at scale and is based in Charlotte, NC. Brian works with enterprise AWS customers to help them design, deploy, and scale applications to achieve their business goals.


Axel Larsson

Axel is an enterprise solutions architect at AWS. He has helped several companies migrate to AWS and modernize their architecture. Axel is passionate about helping organizations establish a solid foundation in the cloud, enabled by security best practices.

Correlate security findings with AWS Security Hub and Amazon EventBridge

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/correlate-security-findings-with-aws-security-hub-and-amazon-eventbridge/

In this blog post, we’ll walk you through deploying a solution to correlate specific AWS Security Hub findings from multiple AWS services that are related to a single AWS resource, which indicates an increased possibility that a security incident has happened.

AWS Security Hub ingests findings from multiple AWS services, including Amazon GuardDuty, Amazon Inspector, Amazon Macie, AWS Firewall Manager, AWS Identity and Access Management (IAM) Access Analyzer, and AWS Systems Manager Patch Manager. Findings from each service are normalized into the AWS Security Finding Format (ASFF), so that you can review findings in a standardized format and take action quickly. You can use AWS Security Hub to provide a single view of all security-related findings, where you can set up alerting, automatic remediation, and ingestion into third-party incident management systems for specific findings.

Although Security Hub can ingest a vast number of integrations and findings, it cannot create correlation rules like a Security Information and Event Management (SIEM) tool can. However, you can create such rules using EventBridge. It’s important to take a closer look when multiple AWS security services generate findings for a single resource, because this potentially indicates elevated risk. Depending on your environment, the initial number of findings in AWS Security Hub findings may be high, so you may need to prioritize which findings require immediate action. AWS Security Hub natively gives you the ability to filter findings by resource, account, and many other details. With the solution in this post, when one of these correlated sets of findings is detected, a new finding is created and pushed to AWS Security Hub by using the Security Hub BatchImportFindings API operation. You can then respond to these new security incident-oriented findings through ticketing, chat, or incident management systems.


This solution requires that you have AWS Security Hub enabled in your AWS account. In addition to AWS Security Hub, the following services must be enabled and integrated to AWS Security Hub:

Solution overview

In this solution, you will use a combination of AWS Security Hub, Amazon EventBridge, AWS Lambda, and Amazon DynamoDB to ingest and correlate specific findings that indicate a higher likelihood of a security incident. Each correlation is focused on multiple specific AWS security service findings for a single AWS resource.

The following list shows the correlated findings that are detected by this solution. The Description section for each finding correlation provides context for that correlation, the Remediation section provides general recommendations for remediation, and the Prevention/Detection section provides guidance to either prevent or detect one or more findings within the correlation. With the code provided, you can also add more correlations than those listed here by modifying the Cloud Development Kit (CDK) code and AWS Lambda code. The Solution workflow section breaks down the flow of the solution. If you choose to implement automatic remediation, each finding correlation will be created with the following AWS Security Hub Finding Format (ASFF) fields:

- Severity: CRITICAL
- ProductArn: arn:aws:securityhub:<REGION>:<AWS_ACCOUNT_ID>:product/<AWS_ACCOUNT_ID>/default

These correlated findings are created as part of this solution:

  1. Any Amazon GuardDuty Backdoor findings and three critical common vulnerabilities and exposures (CVEs) from Amazon Inspector that are associated with the same Amazon Elastic Compute Cloud (Amazon EC2) instance.
    • Description: Amazon Inspector has found at least three critical CVEs on the EC2 instance. CVEs indicate that the EC2 instance is currently vulnerable or exposed. The EC2 instance is also performing backdoor activities. The combination of these two findings is a stronger indication of an elevated security incident.
    • Remediation: It’s recommended that you isolate the EC2 instance and follow standard protocol to triage the EC2 instance to verify if the instance has been compromised. If the instance has been compromised, follow your standard Incident Response process for post-instance compromise and forensics. Redeploy a backup of the EC2 instance by using an up-to-date hardened Amazon Machine Image (AMI) or apply all security-related patches to the redeployed EC2 instance.
    • Prevention/Detection: To mitigate or prevent an Amazon EC2 instance from missing critical security updates, consider using Amazon Systems Manager Patch Manager to automate installing security-related patching for managed instances. Alternatively, you can provide developers up-to-date hardened Amazon Machine Images (AMI) by using Amazon EC2 Image Builder. For detection, you can set the AMI property called ‘DeprecationTime’ to indicate when the AMI will become outdated and respond accordingly.
  2. An Amazon Macie sensitive data finding and an Amazon GuardDuty S3 exfiltration finding for the same Amazon Simple Storage Service (Amazon S3) bucket.
    • Description: Amazon Macie has scanned an Amazon S3 bucket and found a possible match for sensitive data. Amazon GuardDuty has detected a possible exfiltration finding for the same Amazon S3 bucket. The combination of these findings indicates a higher risk security incident.
    • Remediation: It’s recommended that you review the source IP and/or IAM principal that is making the S3 object reads against the S3 bucket. If the source IP and/or IAM principal is not authorized to access sensitive data within the S3 bucket, follow your standard Incident Response process for post-compromise plan for S3 exfiltration. For example, you can restrict an IAM principal’s permissions, revoke existing credentials or unauthorized sessions, restricting access via the Amazon S3 bucket policy, or using the Amazon S3 Block Public Access feature.
    • Prevention/Detection: To mitigate or prevent exposure of sensitive data within Amazon S3, ensure the Amazon S3 buckets are using least-privilege bucket policies and are not publicly accessible. Alternatively, you can use the Amazon S3 Block Public Access feature. Review your AWS environment to make sure you are following Amazon S3 security best practices. For detection, you can use Amazon Config to track and auto-remediate Amazon S3 buckets that do not have logging and encryption enabled or publicly accessible.
  3. AWS Security Hub detects an EC2 instance with a public IP and unrestricted VPC Security Group; Amazon GuardDuty unusual network traffic behavior finding; and Amazon GuardDuty brute force finding.
    • Description: AWS Security Hub has detected an EC2 instance that has a public IP address attached and a VPC Security Group that allows traffic for ports outside of ports 80 and 443. Amazon GuardDuty has also determined that the EC2 instance has multiple brute force attempts and is communicating with a remote host on an unusual port that the EC2 instance has not previously used for network communication. The correlation of these lower-severity findings indicates a higher-severity security incident.
    • Remediation: It’s recommended that you isolate the EC2 instance and follow standard protocol to triage the EC2 instance to verify if the instance has been compromised. If the instance has been compromised, follow your standard Incident Response process for post-instance compromise and forensics.
    • Prevention/Detection: To mitigate or prevent these events from occurring within your AWS environment, determine whether the EC2 instance requires a public-facing IP address and review the VPC Security Group(s) has only the required rules configured. Review your AWS environment to make sure you are following Amazon EC2 best practices. For detection, consider implementing AWS Firewall Manager to continuously audit and limit VPC Security Groups.

The solution workflow, shown in Figure 1, is as follows:

  1. Security Hub ingests findings from integrated AWS security services.
  2. An EventBridge rule is invoked based on Security Hub findings in GuardDuty, Macie, Amazon Inspector, and Security Hub security standards.
  3. The EventBridge rule invokes a Lambda function to store the Security Hub finding, which is passed via EventBridge, in a DynamoDB table for further analysis.
  4. After the new findings are stored in DynamoDB, another Lambda function is invoked by using Dynamo StreamSets and a time-to-live (TTL) set to delete finding entries that are older than 30 days.
  5. The second Lambda function looks at the resource associated with the new finding entry in the DynamoDB table. The Lambda function checks for specific Security Hub findings that are associated with the same resource.
Figure 1: Architecture diagram describing the flow of the solution

Figure 1: Architecture diagram describing the flow of the solution

Solution deployment

You can deploy the solution through either the AWS Management Console or the AWS Cloud Development Kit (AWS CDK).

To deploy the solution by using the AWS Management Console

In your account, launch the AWS CloudFormation template by choosing the following Launch Stack button. It will take approximately 10 minutes for the CloudFormation stack to complete.
Select the Launch Stack button to launch the template

To deploy the solution by using the AWS CDK

You can find the latest code in the aws-security GitHub repository where you can also contribute to the sample code. The following commands show how to deploy the solution by using the AWS CDK. First, the CDK initializes your environment and uploads the AWS Lambda assets to Amazon S3. Then, you can deploy the solution to your account. For <INSERT_AWS_ACCOUNT>, specify the account number, and for <INSERT_REGION>, specify the AWS Region that you want the solution deployed to.

cdk bootstrap aws://<INSERT_AWS_ACCOUNT>/<INSERT_REGION>

cdk deploy


In this blog post, we walked through a solution to use AWS services, including Amazon EventBridge, AWS Lambda, and Amazon DynamoDB, to correlate AWS Security Hub findings from multiple different AWS security services. The solution provides a framework to prioritize specific sets of findings that indicate a higher likelihood that a security incident has occurred, so that you can prioritize and improve your security response.

If you have feedback about this post, submit comments in the Comments section below. If you have any questions about this post, start a thread on the AWS Security Hub forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Marshall Jones

Marshall is a worldwide security specialist solutions architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he is focused on helping enterprise AWS customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.


Jonathan Nguyen

Jonathan is a shared delivery team senior security consultant at AWS. His background is in AWS security, with a focus on threat detection and incident response. He helps enterprise customers develop a comprehensive AWS security strategy, deploy security solutions at scale, and train customers on AWS security best practices.

Designing a High-volume Streaming Data Ingestion Platform Natively on AWS

Post Syndicated from Soonam Jose original https://aws.amazon.com/blogs/architecture/designing-a-high-volume-streaming-data-ingestion-platform-natively-on-aws/

The total global data storage is projected to exceed 200 zettabytes by 2025. This exponential growth of data demands increased vigilance against cybercrimes. Emerging cybersecurity trends include increasing service attacks, ransomware, and critical infrastructure threats. Businesses are changing how they approach cybersecurity and are looking for new ways to tackle these threats. In the past, they have relied on internal IT or engaged a managed security services provider (MSSP) to monitor and prevent unauthorized access and attacks.

An end-to-end analytics solution should ingest and process log data streaming from various computing and IoT devices. It can then make processed data available to analytics systems users in near-real-time. However, the sheer volume of data in the future makes this difficult to address in a reliable and cost-effective manner.

In this blog post, we present three approaches for a high-volume log data ingestion and processing platform natively on Amazon Web Services (AWS). We also compare the pros and cons of each. We’ll discuss factors to consider when evaluating the different options and their associated flexibility, to take full advantage of AWS. We will showcase a fictional use case for a top MSSP who ingests high volumes of logs from security devices to cloud. This MSSP also performs downstream analytics and threat detection modeling.

The options we present here have a log collection platform (LCP) on-premises. It collects logs from security devices and sensors, performs necessary translations and tokenization, and pushes compressed log files to the processing tier on cloud. The collection platform can also be modernized to have the IoT-enabled devices send logs to AWS IoT services. This will push the data to Amazon Kinesis, a managed service for collecting and analyzing streaming data.

Approach 1: Amazon Kinesis for log ingestion and format conversion

Figure 1 illustrates a comprehensive solution that uses managed and serverless services on AWS.

Figure 1. Amazon Kinesis for log ingestion and format conversion

Figure 1. Amazon Kinesis for log ingestion and format conversion

1. LCP will invoke a scalable producer application for Amazon Kinesis Data Streams running on AWS Fargate behind an Application Load Balancer. The producer application will use the Amazon Kinesis Producer Library (KPL). KPL aggregates and batches data records to make ingestion into the data stream more efficient. The application may provide compressed records to the KPL to have it manage object compression.

The application can be set up as an HTTP endpoint that receives log files and processes them using KPL. Customer ID sent as part of an HTTP request header can be used to maintain affinity. The application can run in a Docker container, which is orchestrated by Amazon ECS on AWS Fargate. A target tracking scaling policy can manage the number of parallel running data ingestion containers to manage scalability of the ingestion process.

2. Amazon Kinesis Scaling Utility can be used to scale data streams up or down by a count, or as a percentage of the total fleet. The scaling utility archive file can be imported as a library to AWS Lambda. It will automatically manage the number of shards in the stream based on the observed PUT or GET rate of the stream. The combination of customer ID and security device ID may be used to define the partition key.

3. Records uploaded to the stream by the producer application will be consumed by Lambda. It will perform gateway transformations (required by all downstream consumers) and the normalization of record format. Any additional consumer level transformations may be handled separately, associated with respective consumers.

A combination of batch window and batch size configurations can improve efficiency of function invocations. Batch windows are the maximum amount of time in seconds to gather records before invoking the function. Batch size is the number of records to send to the function in each batch. The Lambda function will throttle sending records to Amazon Kinesis Data Firehose. Error handling will be accomplished via retries with a smaller batch size, with number of retries limited as appropriate. It will discard records that are too old.

An Amazon Simple Queue Service (SQS) queue can be configured as a failed-event destination for further offline analysis. A Lambda function can read from the error SQS queue to do basic checks and determine appropriate follow-up actions. This can be an initiated email for additional investigation or a command to discard the message.

4. Output of transformations by Lambda will be saved to the short term (hot) storage Amazon S3 bucket via Kinesis Data Firehose. This can efficiently handle Parquet format conversion required by downstream analytics applications. Kinesis Data Firehose delivery streams will be created per customer and configured with associated AWS Glue Data Catalog table, to perform parquet format conversion.

5. AWS Glue jobs will be used to consolidate and write larger files to the long term (cold) storage bucket.

6. The data in the cold storage bucket will be accessed by internal SOC analysts for threat detection and mitigation.

7. The data in cold storage buckets will also be accessed by end customers via dashboards in Amazon QuickSight.

8. This architecture also provides additional options to modernize streaming analytics using Amazon Kinesis Data Analytics or AWS Glue streaming jobs as appropriate.

While this architecture proposes a fully managed, end-to-end solution, the sheer volume of log messages may drive up the total cost of the solution. This is especially true for Kinesis Data Streams and Kinesis Data Firehose costs.

Approach 2: Containerized application on AWS Fargate for ingestion and Amazon Kinesis for format conversion

An alternative approach shown in Figure 2 replaces the gateway Kinesis Data Streams and transformations, with a containerized application on Fargate. Conversion to Parquet format and writing to the S3 bucket is still handled by Kinesis Data Firehose.

Figure 2. Containerized application for ingestion and Amazon Kinesis for format conversion

Figure 2. Containerized application for ingestion and Amazon Kinesis for format conversion

1. LCP will upload log files to a raw storage bucket in Amazon S3.

2. A Lambda function will process Event Notifications from the raw data storage bucket. It can insert Amazon S3 object pointers to a Kinesis Data Stream partitioned by Customer ID and Device ID.

3. The producer application will retrieve the Event Notifications from the Data Stream and retrieve corresponding log files from S3. It will perform initial aggregations and transformations, and output to Kinesis Data Firehose. The application can run in a Docker container that is orchestrated by Amazon ECS on Fargate. A target tracking scaling policy can manage the number of parallel running data ingestion containers, to manage scalability of the ingestion process. ECS cluster capacity can be scaled up or down based on Amazon CloudWatch alarms.

4. Kinesis Data Firehose converts to Parquet format, zips the data, and persists to a short-term storage bucket in S3. This is backed by Glue Data Catalog.

Steps 5, 6 and 7 perform consolidation and availability of the processed data to downstream consumers, as in the previous approach.

This option uses the built-in capabilities of Kinesis Data Firehose to transform to Parquet format and deliver to S3. Note that higher costs associated with the service may still be cost prohibitive for larger data volumes.

Approach 3: Containerized application on AWS Fargate for ingestion and format conversion

Figure 3 uses a containerized application running on Fargate for both gateway transformations. This app also provides conversion to Parquet format before writing the files to a short term (hot) storage bucket. All the other steps are the same as in option 2.

Figure 3. Containerized application for ingestion and format conversion

Figure 3. Containerized application for ingestion and format conversion

This option offers the least expensive way to transform, aggregate, and enrich the incoming log records, as well as convert them to Parquet format. But it comes with additional overhead for custom development of format conversion, checkpointing, error handling, and application management. Evaluate based on your business needs and workflow.


In this post, we discussed multiple approaches to design a platform on AWS to ingest and process high-volume security log records. We compared the pros and cons for each option. Amazon Kinesis is a fully managed and scalable service that helps easily collect, process, and analyze video and data streams in real time. A solution primarily based on Kinesis may become cost prohibitive due to large data volumes. Consider alternate approaches that use containerized applications on AWS Fargate. The trade-off would be the ability for custom development versus application management overhead.

To improve your security log analysis solution, explore one of the approaches we illustrate and customize as appropriate to fit your unique needs.

Align with best practices while creating infrastructure using CDK Aspects

Post Syndicated from Om Prakash Jha original https://aws.amazon.com/blogs/devops/align-with-best-practices-while-creating-infrastructure-using-cdk-aspects/

Organizations implement compliance rules for cloud infrastructure to ensure that they run the applications according to their best practices. They utilize AWS Config to determine overall compliance against the configurations specified in their internal guidelines. This is determined after the creation of cloud resources in their AWS account. This post will demonstrate how to use AWS CDK Aspects to check and align with best practices before the creation of cloud resources in your AWS account.

The AWS Cloud Development Kit (CDK) is an open-source software development framework that lets you define your cloud application resources using familiar programming languages, such as TypeScript, Python, Java, and .NET. The expressive power of programming languages to define infrastructure accelerates the development process and improves the developer experience.

AWS Config is a service that enables you to assess, audit, and evaluate your AWS resource configurations. Config continuously monitors and records your AWS resource configurations, as well as lets you automate the evaluation of recorded configurations against desired configurations. React to non-compliant resources and change their state either automatically or manually.

AWS Config helps customers run their workloads on AWS in a compliant manner. Some customers want to detect it up front, and then only provision compliant resources. Some configurations are important for the customers, so they might not provision resources without having them compliant from the beginning. The following are examples of such configurations:

  • Amazon S3 bucket must not be created with public access
  • Amazon S3 bucket encryption must be enabled
  • Database deletion protection must be enabled

CDK Aspects

CDK Aspects are a way to apply an operation to every construct in a given scope. The aspect could verify something about the state of the constructs, such as ensuring that all buckets are encrypted, or it could modify the constructs, such as by adding tags.

An aspect is a class that implements the IAspect interface shown below. Aspects employ visitor patter, which allows them to add a new operation to existing object structures without modifying the structures. In object-oriented programming and software engineering, the visitor design pattern is a method for separating an algorithm from an object structure on which it operates.

interface IAspect {
   visit(node: IConstruct): void;

An AWS CDK app goes through the following lifecycle phases when you call cdk deploy. These phases are also shown in the diagram below. Learn more about the CDK application lifecycle at this page.

  1. Construction
  2. Preparation
  3. Validation
  4. Synthesis
  5. Deployment

Understanding the CDK Deploy

CDK Aspects become relevant during the Prepare phase, where it makes the final modifications round in the constructs to setup their final state. This Prepare phase happens automatically. All constructs have their internal list of Aspects which are called and applied during the Prepare phase. Add your custom aspects in a scope by calling the following method:

Aspects.of(myConstruct).add(new SomeAspect(...));

When you call the method above, constructs add the custom aspects to the list of internal aspects. When CDK application goes through the Prepare phase, then AWS CDK calls the visit method of the object for the constructs and all of its children in top-down order. The visit method is free to change anything in the construct.

How to align with or check configuration compliance using CDK Aspects

In the following sections, you will see how to implement CDK Aspects for some common use cases when provisioning the cloud resources. CDK Aspects are extensible, and you can extend it for any suitable use cases in order to implement additional rules. Apply CDK Aspects not only to verify against the best practices, but also to update the state before resource creation.

The code below creates the cloud resources to be verified against the best practices or to be updated using Aspects in the following sections.

export class AwsCdkAspectsStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    //Create a VPC with 3 availability zones
    const vpc = new ec2.Vpc(this, 'MyVpc', {
      maxAzs: 3,

    //Create a security group
    const sg = new ec2.SecurityGroup(this, 'mySG', {
      vpc: vpc,
      allowAllOutbound: true

    //Add ingress rule for SSH from the public internet
    sg.addIngressRule(ec2.Peer.anyIpv4(), ec2.Port.tcp(22), 'SSH access from anywhere')

    //Launch an EC2 instance in private subnet
    const instance = new ec2.Instance(this, 'MyInstance', {
      vpc: vpc,
      machineImage: ec2.MachineImage.latestAmazonLinux(),
      instanceType: new ec2.InstanceType('t3.small'),
      vpcSubnets: {subnetType: ec2.SubnetType.PRIVATE},
      securityGroup: sg

    //Launch MySQL rds database instance in private subnet
    const database = new rds.DatabaseInstance(this, 'MyDatabase', {
      engine: rds.DatabaseInstanceEngine.mysql({
        version: rds.MysqlEngineVersion.VER_5_7
      vpc: vpc,
      vpcSubnets: {subnetType: ec2.SubnetType.PRIVATE},
      deletionProtection: false

    //Create an s3 bucket
    const bucket = new s3.Bucket(this, 'MyBucket')

In the first section, you will see the use cases and code where Aspects are used to verify the resources against the best practices.

  1. VPC CIDR range must start with specific CIDR IP
  2. Security Group must not have public ingress rule
  3. EC2 instance must use approved AMI
//Verify VPC CIDR range
export class VPCCIDRAspect implements IAspect {
    public visit(node: IConstruct): void {
        if (node instanceof ec2.CfnVPC) {
            if (!node.cidrBlock.startsWith('192.168.')) {
                Annotations.of(node).addError('VPC does not use standard CIDR range starting with "192.168."');

//Verify public ingress rule of security group
export class SecurityGroupNoPublicIngressAspect implements IAspect {
    public visit(node: IConstruct) {
        if (node instanceof ec2.CfnSecurityGroup) {

        function checkRules (rules :Array<IngressProperty>) {
            if(rules) {
                for (const rule of rules.values()) {
                    if (!Tokenization.isResolvable(rule) && (rule.cidrIp == '' || rule.cidrIp == '::/0')) {
                        Annotations.of(node).addError('Security Group allows ingress from public internet.');

//Verify AMI of EC2 instance
export class EC2ApprovedAMIAspect implements IAspect {
    public visit(node: IConstruct) {
        if (node instanceof ec2.CfnInstance) {
            if (node.imageId != 'approved-image-id') {
                Annotations.of(node).addError('EC2 Instance is not using approved AMI.');

In the second section, you will see the use cases and code where Aspects are used to update the resources in order to make them compliant before creation.

  1. S3 bucket encryption must be enabled. If not, then enable
  2. S3 bucket versioning must be enabled. If not, then enable
  3. RDS instance must have deletion protection enabled. If not, then enable
//Enable versioning of bucket if not enabled
export class BucketVersioningAspect implements IAspect {
    public visit(node: IConstruct): void {
        if (node instanceof s3.CfnBucket) {
            if (!node.versioningConfiguration
                || (!Tokenization.isResolvable(node.versioningConfiguration)
                    && node.versioningConfiguration.status !== 'Enabled')) {
                Annotations.of(node).addInfo('Enabling bucket versioning configuration.');
                node.addPropertyOverride('VersioningConfiguration', {'Status': 'Enabled'})

//Enable server side encryption for the bucket if no encryption is enabled
export class BucketEncryptionAspect implements IAspect {
    public visit(node: IConstruct): void {
        if (node instanceof s3.CfnBucket) {
            if (!node.bucketEncryption) {
                Annotations.of(node).addInfo('Enabling default S3 server side encryption.');
                node.addPropertyOverride('BucketEncryption', {
                        "ServerSideEncryptionConfiguration": [
                                "ServerSideEncryptionByDefault": {
                                    "SSEAlgorithm": "AES256"

//Enable deletion protection of DB instance if not already enabled
export class RDSDeletionProtectionAspect implements IAspect {
    public visit(node: IConstruct) {
        if (node instanceof rds.CfnDBInstance) {
            if (! node.deletionProtection) {
                Annotations.of(node).addInfo('Enabling deletion protection of DB instance.');
                node.addPropertyOverride('DeletionProtection', true);

Once you create the aspects, add them in a particular scope. That scope can be App, Stack, or Construct. In the example below, all aspects are added in the scope of Stack.

const app = new cdk.App();

const stack = new AwsCdkAspectsStack(app, 'MyApplicationStack');

Aspects.of(stack).add(new VPCCIDRAspect());
Aspects.of(stack).add(new SecurityGroupNoPublicIngressAspect());
Aspects.of(stack).add(new EC2ApprovedAMIAspect());
Aspects.of(stack).add(new RDSDeletionProtectionAspect());
Aspects.of(stack).add(new BucketEncryptionAspect());
Aspects.of(stack).add(new BucketVersioningAspect());


Once you call cdk deploy for the above code with aspects added, you will see the output below. The deployment will not continue until you resolve the errors and modifications conducted to make some of the resources compliant.

Screenshot displaying CDK errors.

Also utilize Aspects to make general modifications to the resources regardless of any compliance checks. For example, use it apply mandatory tags to every taggable resource. Tags is an example of implementing CDK Aspects in order to achieve this functionality. Utilizing the code below, add or remove a tag from all taggable resources and their children in the scope of a Construct.

Tags.of(myConstruct).add('key', 'value');

Below is an example of adding the Department tag to every resource created in the scope of Stack.

Tags.of(stack).add('Department', 'Finance');

CDK Aspects are ways for developers to align with and check best practices in their infrastructure configurations using the programming language of choice. AWS CloudFormation Guard (cfn-guard) provides compliance administrators with a simple, policy-as-code language to author policies and apply them to enforce best practices. Aspects are applied before generation of the CloudFormation template in Prepare phase, but cfn-guard is applied after generation of the CloudFormation template and before the Deploy phase. Developers can use Aspects or cfn-guard or both as part of a CI/CD pipeline to stop deployment of non-compliant resources, but CloudFormation Guard is the way to go when you want to enforce compliances and prevent deployment of non-compliant resources.


If you are utilizing AWS CDK to provision your infrastructure, then you can start using Aspects to align with best practices before resources are created. If you are utilizing CloudFormation template to manage your infrastructure, then you can read this blog to learn how to migrate the CloudFormation template to AWS CDK. After the migration, utilize CDK Aspects not only to evaluate compliance of your resources against the best practices, but also modify their state to make them compliant before they are created.

About the Authors

Om Prakash Jha

Om Prakash Jha is a Solutions Architect at AWS. He helps customers build well-architected applications on AWS in the retail industry vertical. He has more than a decade of experience in developing, designing, and architecting mission critical applications. His passion is DevOps and application modernization. Outside of his work, he likes to read books, watch movies, and explore part of the world with his family.

How to set up a two-way integration between AWS Security Hub and Jira Service Management

Post Syndicated from Ramesh Venkataraman original https://aws.amazon.com/blogs/security/how-to-set-up-a-two-way-integration-between-aws-security-hub-and-jira-service-management/

If you use both AWS Security Hub and Jira Service Management, you can use the new AWS Service Management Connector for Jira Service Management to create an automated, bidirectional integration between these two products that keeps your Security Hub findings and Jira issues in sync. In this blog post, I’ll show you how to set up this integration.

As a Jira administrator, you’ll then be able to create Jira issues from Security Hub findings automatically, and when you update those issues in Jira, the changes are automatically replicated back into the original Security Hub findings. For example, if you resolve an issue in Jira, the workflow status of the finding in Security Hub will also be resolved. This way, Security Hub always has up-to-date status about your security posture.

Watch a demonstration of the integration.


To complete this walkthrough, you’ll need a Jira instance with the connector configured. For more information on how to set this up, see AWS Service Management Connector for Jira Service Management in the AWS Service Catalog Administrator Guide. At the moment, this connector can be used with Atlassian Data Center.

On the AWS side, you need Security Hub enabled in your AWS account. For more information, see Enabling Security Hub manually.

This walkthrough uses an AWS CloudFormation template to create the necessary AWS resources for this integration. In this template, I use the AWS Region us-east-1, but you can use any of the supported Regions for Security Hub.

Deploy the solution

In this solution, you will first deploy an AWS CloudFormation stack that sets up the necessary AWS resources that are needed to set up the integration in Jira.

To download and run the CloudFormation template

  1. Download the sample template for this walkthrough.
  2. In the AWS CloudFormation console, choose Create stack, choose With new resources (standard), and then select Template is ready.
  3. For Specify template, choose Upload a template file and select the template that you downloaded in step 1.

To create the CloudFormation stack

  1. In the CloudFormation console, choose Specify stack details, and enter a Stack name (in the example, I named mine SecurityHub-Jira-Integration).
  2. Keep the other default values as shown in Figure 1, and then choose Next.
    Figure 1: Creating a CloudFormation stack

    Figure 1: Creating a CloudFormation stack

  3. On the Configure stack options page, choose Next.
  4. On the Review page, select the check box I acknowledge that AWS CloudFormation might create IAM resources with custom names. (Optional) If you would like more information about this acknowledgement, choose Learn more.
  5. Choose Create stack.
    Figure 2: Acknowledge creation of IAM resources

    Figure 2: Acknowledge creation of IAM resources

  6. After the CloudFormation stack status is CREATE_COMPLETE, you can see the list of resources that are created, as shown in Figure 3.
    Figure 3: Resources created from the CloudFormation template

    Figure 3: Resources created from the CloudFormation template

Next, you’ll integrate Jira with Security Hub.

To integrate Jira with Security Hub

  1. In the Jira dashboard, choose the gear icon to open the JIRA ADMINISTRATION menu, and then choose Manage apps
    Figure 4: Jira Manage apps

    Figure 4: Jira Manage apps

  2. On the Administration screen, under AWS SERVICE MANAGEMENT CONNECTOR in the left navigation menu, choose AWS accounts
    Figure 5: Choose AWS accounts

    Figure 5: Choose AWS accounts

  3. Choose Connect new account to open a page where you can configure Jira to access an AWS account. 
    Figure 6: Connect new account

    Figure 6: Connect new account

  4. Enter values for the account alias and user credentials. For the account alias, I’ve named my account SHJiraIntegrationAccount. In the SecurityHub-Jira-Integration CloudFormation stack that you created previously, see the Outputs section to get the values for SCSyncUserAccessKey, SCSyncUserSecretAccessKey, SCEndUserAccessKey, and SCEndUserSecretAccessKey, as shown in Figure 7.
    Figure 7: CloudFormation Outputs details

    Figure 7: CloudFormation Outputs details

    Important: Because this is an example walkthrough, I show the access key and secret key generated as CloudFormation outputs. However, if you’re using the AWS Service Management Connector for Jira in a production workload, see How do I create an AWS access key? to understand the connectivity and to create the access key and secret key for users. Visit that link to create an IAM user and access key. For the permissions that are required for the IAM user, you can review the permissions and policies outlined in the template.

  5. In Jira, on the Connect new account page, enter all the values from the CloudFormation Outputs that you saw in step 4, and choose the Region you used to launch your CloudFormation resources. I chose the Region US East (N.Virginia)/us-east-1.
  6. Choose Connect, and you should see a success message for the test connection. You can also choose Test connectivity after connecting the account, as shown in figure 8. 
    Figure 8: Test connectivity

    Figure 8: Test connectivity

The connector is preconfigured to automatically create Jira incidents for Security Hub findings. The findings will have the same information in both the AWS Security Hub console and the Jira console.

Test the integration

Finally, you can test the integration between Security Hub and Jira Service Management.

To test the integration

  1. For this walkthrough, I’ve created a new project from the Projects console in Jira. If you have an existing project, you can link the AWS account to the project.
  2. In the left navigation menu, under AWS SERVICE MANAGEMENT CONNECTOR, choose Connector settings.
  3. On the AWS Service Management Connector settings page, under Projects enabled for Connector, choose Add Jira project, and select the project you want to connect to the AWS account. 
    Figure 9: Add the Jira project

    Figure 9: Add the Jira project

  4. On the same page, under OpsCenter Configuration, choose the project to associate with the AWS accounts. Under Security Hub Configuration, associate the Jira project with the AWS account. Choose Save after you’ve configured the project.
  5. On the AWS accounts page, choose Sync now
    Figure 10: Sync now

    Figure 10: Sync now

  6. In the top pane, under Issues, choose Search for issues.
  7. Choose the project that you added in step 3. You will see a screen like the one shown in Figure 11.

    To further filter just the Security Hub findings, you can also choose AWS Security Hub Finding under Type

    Figure 11: A Security Hub finding in the Jira console

    Figure 11: A Security Hub finding in the Jira console

  8. You can review the same finding from Security Hub in the AWS console, as shown in Figure 12, to verify that it’s the same as the finding you saw in step 7. 
    Figure 12: A Security Hub finding in the AWS console

    Figure 12: A Security Hub finding in the AWS console

  9. On the Jira page for the Security Hub finding (the same page discussed in step 7), you can update the workflow status to Notified, after which the issue status changes to NOTIFIED, as shown in Figure 13. 
    Figure 13: Update the finding status to NOTIFIED

    Figure 13: Update the finding status to NOTIFIED

    You can navigate to the AWS Security Hub console and look at the finding’s workflow, as shown in Figure 14. The workflow should say NOTIFIED, as you updated it in the Jira console.

    Figure 14: The Security Hub finding workflow updated to NOTIFIED

    Figure 14: The Security Hub finding workflow updated to NOTIFIED

  10. You can now fix the issue from the Security Hub console. When you resolve the finding from Security Hub, it will also show up as resolved in the Jira console.
  11. (Optional) In the AWS Service Management Connector in the Jira console, you can configure several settings, such as Sync Interval, SQS Queue Name, and Number of messages to pull from SQS, as shown in Figure 15. You can also synchronize Security Hub findings according to their Severity value.
    Figure 15: Jira settings for Security Hub

    Figure 15: Jira settings for Security Hub


In this blog post, I showed you how to set up the new two-way integration of AWS Security Hub and Jira by using the AWS Service Management Connector for Jira Service Management. To learn more about Jira’s integration with Security Hub, watch the video AWS Security Hub – Bidirectional integration with Jira Service Management Center, and see AWS Service Management Connector for Jira Service Management in the AWS Service Catalog Administrator Guide. To download the free AWS Service Management Connector for Jira, see the Atlassian Marketplace. If you have additional questions, you can post them to the AWS Security Hub forum.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Ramesh Venkataraman

Ramesh is a Solutions Architect who enjoys working with customers to solve their technical challenges using AWS services. Outside of work, Ramesh enjoys following stack overflow questions and answers them in any way he can.

Serverless Architecture for a Structured Data Mining Solution

Post Syndicated from Uri Rotem original https://aws.amazon.com/blogs/architecture/serverless-architecture-for-a-structured-data-mining-solution/

Many businesses have an essential need for structured data stored in their own database for business operations and offerings. For example, a company that produces electronics may want to store a structured dataset of parts. This requires the following properties: color, weight, connector type, and more.

This data may already be available from external sources. In many cases, one source is sufficient. But often, multiple data sources from different vendors must be incorporated. Each data source might have a different structure for the same data field, which is problematic. Achieving one unified structure from variable sources can be difficult, and is a classic data mining problem.

We will break the problem into two main challenges:

  1. Locate and collect data. Collect from multiple data sources and load data into a data store.
  2. Unify the collected data. Since the collected data has no constraints, it might be stored in different structures and file formats. To use the collected data, it must be unified by performing an extract, transform, load (ETL) process. This matches the different data sources and creates one unified data store.

In this post, we demonstrate a pipeline of services, built on top of a serverless architecture that will handle the preceding challenges. This architecture supports large-scale datasets. Because it is a serverless solution, it is also secure and cost effective.

We use Amazon SageMaker Ground Truth as a tool for classifying the data, so that no custom code is needed to classify different data sources.

Data mining and structuring

There are three main steps to explore in order to solve these challenges:

  1. Collect the data – Data mine from different sources
  2. Map the data – Construct a dictionary of key-value pairs without writing code
  3. Structure the collected data – Enrich your dataset with a unified collection of data that was collected and mapped in steps 1 and 2

Following is an example of a use case and solution flow using this architecture:

  • In this scenario, a company must enrich an empty data base with items and properties, see Figure 1.
Figure 1. Company data before data mining

Figure 1. Company data before data mining

  • Data will then be collected from multiple data sources, and stored in the cloud, as shown in Figure 2.
Figure 2. Collecting the data by SKU from different sources

Figure 2. Collecting the data by SKU from different sources

  • To unify different property names, SageMaker Ground Truth is used to label the property names with a list of properties. The results are stored in Amazon DynamoDB, shown in Figure 3.
Figure 3. Mapping the property names to match a unified name

Figure 3. Mapping the property names to match a unified name

  • Finally, the database is populated and enriched by the mapped properties from the different data sources. This can be iterated with new sources to further enrich the data base, see Figure 4.
Figure 4. Company data after data mining, mapping, and structuring

Figure 4. Company data after data mining, mapping, and structuring

1. Collect the data

Using this serverless architecture illustrated in Figure 5, your teams can minimize the effort and cost. You’ll be able to handle large-scale datasets to collect and store the data required for your business.

Figure 5. Serverless architecture for parallel data collection

Figure 5. Serverless architecture for parallel data collection

We use Amazon S3 as it is a highly scalable and durable object storage service, and can store the original dataset. It will initiate an event that will invoke a Lambda function to start a state machine, using the original dataset as its input.

AWS Step Functions are used to orchestrate the process of preparing the dataset for parallel scraping of the items. It will automatically manage the queue of items to be processed when the dataset is large. Step Functions ensures visibility of the process, reports errors, and decouples the compute-intensive scraping operation per item.

The state machine has two steps:

  1. ETL the data to clean and standardize it. Store each item in Amazon DynamoDB, a fast and flexible NoSQL database service for any scale. The ETL function will create an array of all the items identifiers. The identifier is a unique describer of the item, such as manufacturer ID and SKU.
  2. Using the Map functionality of Step Functions, a Lambda function will be invoked for each item. This runs all your scrapers for that item and stores the results in an S3 bucket.

This solution requires custom implementation of only these two functions, according to your own dataset and scraping sources. The ETL Lambda function will contain logic needed to transform your input into an array of identifiers. The scraper Lambda function will contain logic to locate the data in the source and then store it.

Scraper function flow

For each data source, write your own scraper. The Lambda function can run them sequentially.

  1. Use the identifier input to locate the item in each one of the external sources. The data source can be an API, a webpage, a PDF file, or other source.
    • API: Collecting this data will be specific to the interface provided.
    • Webpages: Data is collected with custom code. There are open source libraries that are popular for this task, such as Beautiful Soup.
    • PDF files: Consider using Amazon Textract. Amazon Textract will give you key-value pairs and table analysis.
  2. Transform the response to key-value pairs as part of the scraper logic.
  3. Store the key-value pairs in a sub folder of the scraper responses S3 bucket, and name it after that data source.

2. Mapping the responses

Figure 6. Pipeline for property mapping

Figure 6. Pipeline for property mapping

This pipeline is initiated after the data is collected. It creates a labeling job of Named Entity Recognition, with a pre-defined set of labels. The labeling work will be split among your Workforces. When the job is completed, the output manifest file for named entity recognition is used for the final ETL Lambda. This manually locates the labeling key and values detected by your workforce, and places the results in a reusable mapping table in DynamoDB.

Services used:

Amazon SageMaker Ground Truth is a fully managed data labeling service that helps you build highly accurate training datasets for machine learning (ML). By using Ground Truth, your teams can unify different data sources to match each other, so they can be identified and used in your applications.

Figure 7. Example of one line item being labeled by one of the Workforce team members

Figure 7. Example of one line item being labeled by one of the Workforce team members

3. Structure the collected data

Figure 8. Architecture diagram of entire data collection and classification process

Figure 8. Architecture diagram of entire data collection and classification process

Using another Lambda function (see in Figure 8, populate items properties), we use the collected data (1), and the mapping (2), to populate the unified dataset into the original data DynamoDB table (3).


In this blog, we showed a solution to automatically collect and structure data. We used a serverless architecture that requires minimal effort, to build a reusable asset that can unify different property definitions from different data sources. Minimal effort is involved in structuring this data, as we use Amazon SageMaker Ground Truth to match and reconcile the new data sources.

For further reading:

Validate IAM policies in CloudFormation templates using IAM Access Analyzer

Post Syndicated from Matt Luttrell original https://aws.amazon.com/blogs/security/validate-iam-policies-in-cloudformation-templates-using-iam-access-analyzer/

In this blog post, I introduce IAM Policy Validator for AWS CloudFormation (cfn-policy-validator), an open source tool that extracts AWS Identity and Access Management (IAM) policies from an AWS CloudFormation template, and allows you to run existing IAM Access Analyzer policy validation APIs against the template. I also show you how to run the tool in a continuous integration and continuous delivery (CI/CD) pipeline to validate IAM policies in a CloudFormation template before they are deployed to your AWS environment.

Embedding this validation in a CI/CD pipeline can help prevent IAM policies that have IAM Access Analyzer findings from being deployed to your AWS environment. This tool acts as a guardrail that can allow you to delegate the creation of IAM policies to the developers in your organization. You can also use the tool to provide additional confidence in your existing policy authoring process, enabling you to catch mistakes prior to IAM policy deployment.

What is IAM Access Analyzer?

IAM Access Analyzer mathematically analyzes access control policies that are attached to resources, and determines which resources can be accessed publicly or from other accounts. IAM Access Analyzer can also validate both identity and resource policies against over 100 checks, each designed to improve your security posture and to help you to simplify policy management at scale.

The IAM Policy Validator for AWS CloudFormation tool

IAM Policy Validator for AWS CloudFormation (cfn-policy-validator) is a new command-line tool that parses resource-based and identity-based IAM policies from your CloudFormation template, and runs the policies through IAM Access Analyzer checks. The tool is designed to run in the CI/CD pipeline that deploys your CloudFormation templates, and to prevent a deployment when an IAM Access Analyzer finding is detected. This ensures that changes made to IAM policies are validated before they can be deployed.

The cfn-policy-validator tool looks for all identity-based policies, and a subset of resource-based policies, from your templates. For the full list of supported resource-based policies, see the cfn-policy-validator GitHub repository.

Parsing IAM policies from a CloudFormation template

One of the challenges you can face when parsing IAM policies from a CloudFormation template is that these policies often contain CloudFormation intrinsic functions (such as Ref and Fn::GetAtt) and pseudo parameters (such as AWS::AccountId and AWS::Region). As an example, it’s common for least privileged IAM policies to reference the Amazon Resource Name (ARN) of another CloudFormation resource. Take a look at the following example CloudFormation resources that create an Amazon Simple Queue Service (Amazon SQS) queue, and an IAM role with a policy that grants access to perform the sqs:SendMessage action on the SQS queue.

Figure 1- Example policy in CloudFormation template

Figure 1- Example policy in CloudFormation template

As you can see in Figure 1, line 21 uses the function Fn::Sub to restrict this policy to MySQSQueue created earlier in the template.

In this example, if you were to pass the root policy (lines 15-21) as written to IAM Access Analyzer, you would get an error because !Sub ${MySQSQueue.Arn} is syntax that is specific to CloudFormation. The cfn-policy-validator tool takes the policy and translates the CloudFormation syntax to valid IAM policy syntax that Access Analyzer can parse.

The cfn-policy-validator tool recognizes when an intrinsic function such as !Sub ${MySQSQueue.Arn} evaluates to a resource’s ARN, and generates a valid ARN for the resource. The tool creates the ARN by mapping the type of CloudFormation resource (in this example AWS::SQS::Queue) to a pattern that represents what the ARN of the resource will look like when it is deployed. For example, the following is what the mapping looks like for the SQS queue referenced previously:

AWS::SQS::Queue.Arn -> arn:${Partition}:sqs:${Region}:${Account}:${QueueName}

For some CloudFormation resources, the Ref intrinsic function also returns an ARN. The cfn-policy-validator tool handles these cases as well.

Cfn-policy-validator walks through each of the six parts of an ARN and substitutes values for variables in the ARN pattern (any text contained within ${}). The values of ${Partition} and ${Account} are taken from the identity of the role that runs the cfn-policy-validator tool, and the value for ${Region} is provided as an input flag. The cfn-policy-validator tool performs a best-effort resolution of the QueueName, but typically defaults it to the name of the CloudFormation resource (in the previous example, MySQSQueue). Validation of policies with IAM Access Analyzer does not rely on the name of the resource, so the cfn-policy-validator tool is able to substitute a replacement name without affecting the policy checks.

The final ARN generated for the MySQSQueue resource looks like the following (for an account with ID of 111111111111):


The cfn-policy-validator tool substitutes this generated ARN for !Sub ${MySQSQueue.Arn}, which allows the cfn-policy-validator tool to parse a policy from the template that can be fed into IAM Access Analyzer for validation. The cfn-policy-validator tool walks through your entire CloudFormation template and performs this ARN substitution until it has generated ARNs for all policies in your template.

Validating the policies with IAM Access Analyzer

After the cfn-policy-validator tool has your IAM policies in a valid format (with no CloudFormation intrinsic functions or pseudo parameters), it can take those policies and feed them into IAM Access Analyzer for validation. The cfn-policy-validator tool runs resource-based and identity-based policies in your CloudFormation template through the ValidatePolicy action of the IAM Access Analyzer. ValidatePolicy is what ensures that your policies have correct grammar and follow IAM policy best practices (for example, not allowing iam:PassRole to all resources). The cfn-policy-validator tool also makes a call to the CreateAccessPreview action for supported resource policies to determine if the policy would grant unintended public or cross-account access to your resource.

The cfn-policy-validator tool categorizes findings from IAM Access Analyzer into the categories blocking or non-blocking. Findings categorized as blocking cause the tool to exit with a non-zero exit code, thereby causing your deployment to fail and preventing your CI/CD pipeline from continuing. If there are no findings, or only non-blocking findings detected, the tool will exit with an exit code of zero (0) and your pipeline can to continue to the next stage. For more information about how the cfn-policy-validator tool decides what findings to categorize as blocking and non-blocking, as well as how to customize the categorization, see the cfn-policy-validator GitHub repository.

Example of running the cfn-policy-validator tool

This section guides you through an example of what happens when you run a CloudFormation template that has some policy violations through the cfn-policy-validator tool.

The following template has two CloudFormation resources with policy findings: an SQS queue policy that grants account 111122223333 access to the SQS queue, and an IAM role with a policy that allows the role to perform a misspelled sqs:ReceiveMessages action. These issues are highlighted in the policy below.

Important: The policy in Figure 2 is written to illustrate a CloudFormation template with potentially undesirable IAM policies. You should be careful when setting the Principal element to an account that is not your own.

Figure 2: CloudFormation template with undesirable IAM policies

Figure 2: CloudFormation template with undesirable IAM policies

When you pass this template as a parameter to the cfn-policy-validator tool, you specify the AWS Region that you want to deploy the template to, as follows:

cfn-policy-validator validate --template-path ./template.json --region us-east-1

After the cfn-policy-validator tool runs, it returns the validation results, which includes the actual response from IAM Access Analyzer:

    "BlockingFindings": [
            "findingType": "ERROR",
            "code": "INVALID_ACTION",
            "message": "The action sqs:ReceiveMessages does not exist.",
            "resourceName": "MyRole",
            "policyName": "root",
            "details": …
            "findingType": "SECURITY_WARNING",
            "code": "EXTERNAL_PRINCIPAL",
            "message": "Resource policy allows access from external principals.",
            "resourceName": "MyQueue",
            "policyName": "QueuePolicy",
            "details": …
    "NonBlockingFindings": []

The output from the cfn-policy-validator tool includes the type of finding, the code for the finding, and a message that explains the finding. It also includes the resource and policy name from the CloudFormation template, to allow you to quickly track down and resolve the finding in your template. In the previous example, you can see that IAM Access Analyzer has detected two findings, one security warning and one error, which the cfn-policy-validator tool has classified as blocking. The actual response from IAM Access Analyzer is returned under details, but is excluded above for brevity.

If account 111122223333 is an account that you trust and you are certain that it should have access to the SQS queue, then you can suppress the finding for external access from the 111122223333 account in this example. Modify the call to the cfn-policy-validator tool to ignore this specific finding by using the –-allow-external-principals flag, as follows:

cfn-policy-validator validate --template-path ./template.json --region us-east-1 --allow-external-principals 111122223333

When you look at the output that follows, you’re left with only the blocking finding that states that sqs:ReceiveMessages does not exist.

    "BlockingFindings": [
            "findingType": "ERROR",
            "code": "INVALID_ACTION",
            "message": "The action sqs:ReceiveMessages does not exist.",
            "resourceName": "MyRole",
            "policyName": "root",
            "details": …
    "NonBlockingFindings": []

To resolve this finding, update the template to have the correct spelling, sqs:ReceiveMessage (without trailing s).

For the full list of available flags and commands supported, see the cfn-policy-validator GitHub repository.

Now that you’ve seen an example of how you can run the cfn-policy-validator tool to validate policies that are on your local machine, you will take it a step further and see how you can embed the cfn-policy-validator tool in a CI/CD pipeline. By embedding the cfn-policy-validator tool in your CI/CD pipeline, you can ensure that your IAM policies are validated each time a commit is made to your repository.

Embedding the cfn-policy-validator tool in a CI/CD pipeline

The CI/CD pipeline you will create in this post uses AWS CodePipeline and AWS CodeBuild. AWS CodePipeline is a continuous delivery service that enables you to model, visualize, and automate the steps required to release your software. AWS CodeBuild is a fully-managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy. You will also use AWS CodeCommit as the source repository where your CloudFormation template is stored. AWS CodeCommit is a fully managed source-control service that hosts secure Git-based repositories.

To deploy the pipeline

  1. To deploy the CloudFormation template that builds the source repository and AWS CodePipeline pipeline, select the following Launch Stack button.
    Select the Launch Stack button to launch the template
  2. In the CloudFormation console, choose Next until you reach the Review page.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources and choose Create stack.
  4. Open the AWS CodeCommit console and choose Repositories.
  5. Select cfn-policy-validator-source-repository.
  6. Download the template.json and template-configuration.json files to your machine.
  7. In the cfn-policy-validator-source-repository, on the right side, select Add file and choose Upload file.
  8. Choose Choose File and select the template.json file that you downloaded previously.
  9. Enter an Author name and an E-mail address and choose Commit changes.
  10. In the cfn-policy-validator-source-repository, repeat steps 7-9 for the template-configuration.json file.

To view validation in the pipeline

  1. In the AWS CodePipeline console choose the IAMPolicyValidatorPipeline.
  2. Watch as your commit travels through the pipeline. If you followed the previous instructions and made two separate commits, you can ignore the failed results of the first pipeline execution. As shown in Figure 3, you will see that the pipeline fails in the Validation stage on the CfnPolicyValidator action, because it detected a blocking finding in the template you committed, which prevents the invalid policy from reaching your AWS environment.
    Figure 3: Validation failed on the CfnPolicyValidator action

    Figure 3: Validation failed on the CfnPolicyValidator action

  3. Under CfnPolicyValidator, choose Details, as shown in Figure 3.
  4. In the Action execution failed pop-up, choose Link to execution details to view the cfn-policy-validator tool output in AWS CodeBuild.

Architectural overview of deploying cfn-policy-validator in your pipeline

You can see the architecture diagram for the CI/CD pipeline you deployed in Figure 4.
Figure 4: CI/CD pipeline that performs IAM policy validation using the AWS CloudFormation Policy Validator and IAM Access Analyzer

Figure 4 shows the following steps, starting with the CodeCommit source action on the left:

  1. The pipeline starts when you commit to your AWS CodeCommit source code repository. The AWS CodeCommit repository is what contains the CloudFormation template that has the IAM policies that you would like to deploy to your AWS environment.
  2. AWS CodePipeline detects the change in your source code repository and begins the Validation stage. The first step it takes is to start an AWS CodeBuild project that runs the CloudFormation template through the AWS CloudFormation Linter (cfn-lint). The cfn-lint tool validates your template against the CloudFormation resource specification. Taking this initial step ensures that you have a valid CloudFormation template before validating your IAM policies. This is an optional step, but a recommended one. Early schema validation provides fast feedback for any typos or mistakes in your template. There’s little benefit to running additional static analysis tools if your template has an invalid schema.
  3. If the cfn-lint tool completes successfully, you then call a separate AWS CodeBuild project that invokes the IAM Policy Validator for AWS CloudFormation (cfn-policy-validator). The cfn-policy-validator tool then extracts the identity-based and resource-based policies from your template, as described earlier, and runs the policies through IAM Access Analyzer.

    Note: if your template has parameters, then you need to provide them to the cfn-policy-validator tool. You can provide parameters as command-line arguments, or use a template configuration file. It is recommended to use a template configuration file when running validation with an AWS CodePipeline pipeline. The same file can also be used in the deploy stage to deploy the CloudFormation template. By using the template configuration file, you can ensure that you use the same parameters to validate and deploy your template. The CloudFormation template for the pipeline provided with this blog post defaults to using a template configuration file.

    If there are no blocking findings found in the policy validation, the cfn-policy-validator tool exits with an exit code of zero (0) and the pipeline moves to the next stage. If any blocking findings are detected, the cfn-policy-validator tool will exit with a non-zero exit code and the pipeline stops, to prevent the deployment of undesired IAM policies.

  4. The final stage of the pipeline uses the AWS CloudFormation action in AWS CodePipeline to deploy the template to your environment. Your template will only make it to this stage if it passes all static analysis checks run in the Validation Stage.

Cleaning Up

To avoid incurring future charges, in the AWS CloudFormation console delete the validate-iam-policy-pipeline stack. This will remove the validation pipeline from your AWS account.


In this blog post, I introduced the IAM Policy Validator for AWS CloudFormation (cfn-policy-validator). The cfn-policy-validator tool automates the parsing of identity-based and resource-based IAM policies from your CloudFormation templates and runs those policies through IAM Access Analyzer. This enables you to validate that the policies in your templates follow IAM best practices and do not allow unintended external access to your AWS resources.

I showed you how the IAM Policy Validator for AWS CloudFormation can be included in a CI/CD pipeline. This allows you to run validation on your IAM policies on every commit to your repository and only deploy the template if validation succeeds.

For more information, or to provide feedback and feature requests, see the cfn-policy-validator GitHub repository.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Matt Luttrell

Matt is a Sr. Solutions Architect on the AWS Identity Solutions team. When he’s not spending time chasing his kids around, he enjoys skiing, cycling, and the occasional video game.

Securely extend and access on-premises Active Directory domain controllers in AWS

Post Syndicated from Mangesh Budkule original https://aws.amazon.com/blogs/security/securely-extend-and-access-on-premises-active-directory-domain-controllers-in-aws/

If you have an on-premises Windows Server Active Directory infrastructure, it’s important to plan carefully how to extend it into Amazon Web Services (AWS) when you’re migrating or implementing cloud-based applications. In this scenario, existing applications require Active Directory for authentication and identity management. When you migrate these applications to the cloud, having a locally accessible Active Directory domain controller is an important factor in achieving fast, reliable, and secure Active Directory authentication.

In this blog post, I’ll provide guidance on how to securely extend your existing Active Directory domain to AWS and optimize your infrastructure for maximum performance. I’ll also show you a best practice that implements a remote desktop gateway solution to access your domain controllers securely while using the minimum required ports. Additionally, you will learn about how AWS Systems Manager Session Manager port forwarding helps provide a secure and simple way to manage your domain resources remotely, without the need to open inbound ports and maintain RDGW hosts.

Administrators can use this blog post as guidance to design Active Directory on Amazon Elastic Compute Cloud (Amazon EC2) domain controllers. This post can also be used to determine which ports and protocols are required for domain controller infrastructure communication in a segmented network.

Design and guidelines for EC2-hosted domain controllers

This section provides a set of best practices for designing and deploying EC2-hosted domain controllers in AWS.

AWS has multiple options for hosting Active Directory on AWS, which are discussed in detail in the Active Directory Domain Services on AWS Design and Planning Guide. One option is to use AWS Directory Service for Microsoft Active Directory (AWS Managed Microsoft AD). AWS Managed Microsoft AD provides you with a complete new forest and domain to start your Active Directory deployment on AWS. However, if you prefer to extend your existing Active Directory domain infrastructure to AWS and manage it yourself, you have the option of running Active Directory on EC2-hosted domain controllers. See our Quick Start guide for instructions on how to deploy both of these options (AWS Managed Microsoft AD or EC2-hosted domain controllers on AWS).

If you’re operating in more than one AWS Region and require Active Directory to be available in all these Regions, use the best practices in the Design and Planning Guide for a multi-Region deployment strategy. Within each of the Regions, follow the guidelines and best practices described in this blog post.

Figure 1 shows an example of how to deploy Active Directory on EC2 instances in multiple Regions with multiple virtual private clouds (VPCs). In this example, I’m showing the Active Directory design in multiple Regions that interconnect to each other by using AWS Transit Gateway.

Figure 1: Extended EC2 domain controllers architecture

Figure 1: Extended EC2 domain controllers architecture

In order to extend your existing Active Directory deployment from on-premises to AWS as shown in the example, you do two things. First, you add additional domain controllers (running on Amazon EC2) to your existing domain. Second, you place the domain controllers in multiple Availability Zones (AZs) within your VPC, in multiple Regions, by keeping the same forest (Example.com) and domain structure.

Consider these best practices when you deploy or extend Active Directory on EC2 instances:

  1. We recommend deploying at least two domain controllers (DCs) in each Region and configuring a minimum of two AZs, to provide high availability.
  2. If you require additional domain controllers to achieve your performance goals, add more domain controllers to existing AZs or deploy to another available AZ.
  3. It’s important to define Active Directory sites and subnets correctly to prevent clients from using domain controllers that are located in different Regions, which causes increased latency.
  4. Configure the VPC in a Region as a single Active Directory site and configure Active Directory subnets accordingly in the AD Sites and Services console. This configuration confirms that your clients correctly select the closest available domain controller.
  5. If you have multiple VPCs, centralize the Active Directory services in one of your existing VPCs or create a shared services VPC to centralize the domain controllers.
  6. Make sure that robust inter-Region connectivity exists between all of the Regions. Within AWS, you can leverage cross-Region VPC peering to achieve highly available private connectivity between Regions. You can also use the Transit Gateway VPC solution, as shown in Figure 1, to interconnect multiple Regions.
  7. Make sure that you’re deploying your domain controllers in a private subnet without internet access.
  8. Keep your security patches up to date on EC2 domain controllers. You can use AWS Systems Manager to patch your domain controllers in a controlled manner.
  9. Have a dedicated AWS account for directory services and don’t share the account with other general services and applications. This helps you to control access to the AWS account and add domain controller–specific automation.
  10. If your users need to manage AWS services and access AWS applications with their Active Directory credentials, we recommend integrating your identity service with the management account in AWS Organizations. You can configure the AWS Single Sign-On (AWS SSO) service to use AD Connector in a primary account VPC to connect to self-managed Active Directory domain controllers that are hosted in a Shared Services account.

    Alternatively, you can deploy AWS Managed Microsoft AD in the management account, with trust to your EC2 Active Directory domain, to allow users from any trusted domain to access AWS applications. However, you could host these EC2 domain controllers in the primary account, similar to the AWS Managed AD option.

  11. Build domain controllers with end-to-end automation using version control (for example, GIT and AWS CodeCommit) and Desired State Configuration (DSC)/PowerShell.

Security considerations for EC2-hosted domains

This section explains how you can maximize the security of your extended EC2-hosted domain controller infrastructure, and use AWS services to help achieve security compliance. You should also refer to your organization’s IT system security policies to determine the most relevant recommendations to implement.

AWS operates under a shared security responsibility model, where AWS is responsible for the security of the underlying cloud infrastructure and you are responsible for securing workloads you deploy in AWS.

Our recommendations for security for EC2-hosted domains are as follows:

  1. We recommend that you place EC2-hosted domain controllers in a single dedicated AWS account or deploy them in your AWS Organizations management account. This makes it possible for you to use your Active Directory credentials for authentication to access the AWS Management Console and other AWS applications.
  2. Use tag-based policies to restrict access to domain controllers if you’re using the Shared Services account for hosting domain controllers.
  3. Take advantage of the EC2 Image Builder service to deploy a domain controller that uses a CIS standard base image. By doing this, you can avoid manual deployment by setting up an image pipeline.
  4. Secure the AWS account where the domain controllers are running by following the principle of least privilege and by using role-based access control.
  5. Take advantage of these AWS services to help secure your workloads and application:
    • AWS Landing Zone–A solution that helps you more quickly set up a secure, multi-account AWS environment, based on AWS best practices.
    • AWS Organizations–A service that helps you centrally manage and govern your environment as you grow and scale your AWS resources.
    • Amazon Guard Duty–An automated threat detection service that continuously monitors for suspicious activity and unauthorized behavior to protect your AWS accounts, workloads, and data that are stored in Amazon Simple Storage Service (Amazon S3).
    • Amazon Detective–A service that can analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities.
    • Amazon Inspector–An automated security assessment service that helps improve the security and compliance of applications that are deployed on AWS.
    • AWS Security Hub–A service that provides customers with a comprehensive view of their security and compliance status across their AWS accounts. You can import critical patch compliance findings into Security Hub for easy reference.

Use data encryption

AWS offers you the ability to add a layer of security to your data at rest in the cloud, providing scalable and efficient encryption features. These are some best practices for data encryption:

  1. Encrypt the Amazon Elastic Block Store (Amazon EBS) volumes that are attached to the domain controllers, and keep the customer master key (CMK) safe with AWS Key Management Service (AWS KMS) or AWS CloudHSM, according to your security team’s guidance and policies.
  2. Consider using a separate CMK for the Active Directory and restrict access to the CMK to a specific team.
  3. Enable LDAP over SSL (LDAPS) on all domain controllers, for secure authentication, if your application supports LDAPS authentication.
  4. Deploy and manage a public key infrastructure (PKI) on AWS. For more information, see the Microsoft PKI Quick Start guide.

Restrict account and instance access

Provide management access for directory service accounts and domain controller instances only to the specific team that manages the Active Directory. To do this, follow these guidelines:

  1. Restrict access to an EC2 domain controller’s start, stop, and terminate behavior by using AWS Identity and Access Management (IAM) policy and resources tags. Example: Restrict-ec2-iam
  2. Restrict access to Amazon EBS volumes and snapshots.
  3. Restrict account root access and implement multi-factor authentication (MFA) for this access.

Network access control for domain controllers

Whenever possible, block all unnecessary traffic to and from your domain controllers to limit the communication so that only the necessary ports are opened between a domain controller and another computer. Use these best practices:

  1. Allow only the required network ports between the client and domain controllers, and between domain controllers.
  2. Use a security group to narrow down the access to domain controllers.
  3. Use network access control lists (network ACLs) to filter Active Directory ports as this gives you better control than using ephemeral ports.
  4. Deploy domain controllers in private subnets.
  5. Route only the required subnets into the VPC that contains the domain controllers.

Secure administration

AWS provides services that continuously monitor your data, accounts, and workloads to help protect them from unauthorized access. We recommend that you take advantage of the following services to securely administer your domain controller’s deployment:

  1. Use AWS Systems Manager Session Manager or Run Command to manage your instances remotely. The command actions are sent to Amazon CloudWatch Logs or Amazon S3 for auditing purposes. Leaving inbound Remote Desktop Protocol (RDP), WinRM ports, and remote PowerShell ports open on your instances greatly increases the risk of entities running unauthorized or malicious commands on the instances. Session Manager helps you improve your security posture by letting you close these inbound ports, which frees you from managing SSH keys and certificates, bastion hosts, and jump boxes.
  2. Use Amazon EventBridge to set up rules to detect when changes happen to your domain controller EC2 instances and to send notifications by using Amazon Simple Notification Service (Amazon SNS) when a command is run.
  3. Manage configuration drift on EC2 instances. Systems Manager State Manager helps you automate the process of keeping your domain controller EC2 instances in the desired state and integrates with Systems Manager Compliance.
  4. Avoid any manual interventions while you build and manage domain controllers. Automate the domain join process for Amazon EC2 instances from multiple AWS accounts and Regions.
  5. For developing your applications with domain controllers, use the Windows DC locator service or use the Dynamic DNS (DDNS) service of your AWS Managed Microsoft AD to locate domain controllers. Do not hard-code applications with the address of a domain controller.
  6. Use AWS Config to manage your domain controller configuration.
  7. Use Systems Manager Parameter Store or Secrets Manager to store all secrets, as well as configurations for your domain controller automation.
  8. Use version control to update the domain controller source code with pipeline approvals to avoid any misconfigurations and faulty deployments.

Logging and monitoring

AWS provides tools and features that you can use to see what’s happening in your AWS environment. We recommend that you use these logging and monitoring practices for your EC2-hosted domain controllers:

  1. Enable VPC Flow Logs data for each domain controller’s accounts to monitor the traffic that’s reaching your domain controller instance.
  2. Log Windows and Active Directory events in Amazon CloudWatch Logs for increased visibility.
  3. Consider setting up alerts and notifications for key security events for EC2 domain controllers, in real time. These alerts can be sent to your Red and Blue security response teams for further analysis.
  4. Deploy the CloudWatch agent or the Amazon Kinesis Agent for Windows on EC2 for detail monitoring and alerting at the domain controller operating system level.
  5. Log Systems Manager API calls with AWS CloudTrail.

Other security considerations

As a best practice, implement domain controller security at the operating system level, according to your security team’s recommendations. We recommend these options:

  1. Block executables from running on domain controllers.
  2. Prevent web browsing from domain controllers.
  3. Configure a Windows Server Core base image for domain controllers.
  4. Integrate bastion hosts with Systems Manager Session Manager and use MFA to manage domain controllers remotely.
  5. Perform regular system state backups of your Active Directory environments. Encrypt these backups.
  6. Perform Active Directory administrative management from a remote server, and avoid logging in to domain controllers interactively unless needed.
  7. For FSMO roles, you can follow the same recommendations you would follow for your on-premises deployment to determine FSMO roles on domain controllers. For more information, see these best practices from Microsoft. In the case of AWS Managed Microsoft AD, all domain controllers and FSMO role assignments are managed by AWS and don’t require you to manage or change them.

Domain controller ports

In this section, I’m going to cover the network ports and protocols that are needed to deploy domain services securely. Understanding how traffic flows and is processed by a network firewall is essential when someone requests or implements firewall rules, to avoid any connectivity issues.

Here are some common problems that you might observe because of network port blockage:

  • The RPC server is unavailable
  • Name resolution issues
  • A connectivity issue is detected with LDAP, LDAPS, and Kerberos
  • Domain replication issues
  • Domain authentication issues
  • Domain trust issues between on-premises Active Directory and AWS Managed Microsoft AD
  • AD Connector connectivity issues
  • Issues with domain join, password reset, and more

Understand Active Directory firewall ports

You must allow traffic from your on-premises network to the VPC that contains your extended domain controllers. To do this, make sure that the firewall ports that opened with the VPC subnets that were used to deploy your EC2-hosted domain controllers and the security group rules that are configured on your domain controllers both allow the network traffic to support domain trusts.

Domain controller to domain controller core ports requirements

The following table lists the port requirements for establishing DC-to-DC communication in all versions of Windows Server.

Source Destination Protocol Port Type Active Directory usage Type of traffic
Any domain controller Any domain controller TCP and UDP 53 Bi-directional User and computer authentication, name resolution, trusts DNS
TCP and UDP 88 Bi-directional User and computer authentication, forest level trusts Kerberos
UDP 123 Bi-directional Windows Time, trusts Windows Time
TCP 135 Bi-directional Replication RPC, Endpoint Mapper (EPM)
UDP 137 Bi-directional User and computer authentication NetLogon, NetBIOS name resolution
UDP 138 Bi-directional Distributed File System (DFS), Group Policy DFSN, NetLogon, NetBIOS Datagram Service
TCP 139 Bi-directional User and computer authentication, replication DFSN, NetBIOS Session Service, NetLogon
TCP and UDP 389 Bi-directional Directory, replication, user, and computer authentication, Group Policy, trustss LDAP
TCP and UDP 445 Bi-directional Replication, user, and computer authentication, Group Policy, trusts SMB, CIFS, SMB2, DFSN, LSARPC, NetLogonR, SamR, SrvSvc
TCP and UDP 464 Bi-directional Replication, user, and computer authentication, trusts Kerberos change/set password
TCP 636 Bi-directional Directory, replication, user, and computer authentication, Group Policy, trusts LDAP SSL (required only if LDAP over SSL is configured)
TCP 3268 Bi-directional Directory, replication, user, and computer authentication, Group Policy, trusts LDAP Global Catalog (GC)
TCP 3269 Bi-directional Directory, replication, user, and computer authentication, Group Policy, trusts LDAP GC SSL (required only if LDAP over SSL is configured)
TCP 5722 Bi-directional File replication RPC, DFSR (SYSVOL)
TCP 9389 Bi-directional AD DS web services SOAP
TCP Dynamic 49152–65535 Bi-directional Replication, user, and computer authentication, Group Policy, trusts RPC, DCOM, EPM, DRSUAPI, NetLogonR, SamR, File Replication Service (FRS)
UDP Dynamic 49152–65535 Bi-directional Group Policy DCOM, RPC, EPM

Note: There is no need to open a DNS port on domain controllers if you are not using a domain controller as a DNS server, or if you’re using any third-party DNS solutions.

Client to domain controller core ports requirements

The following table lists the port requirements for establishing client to domain controller communication for Active Directory.

Source Destination Protocol Port Type Usage Type of traffic
All internal company client network IP subnets Any domain controller TCP 53 Uni-directional DNS DNS
UDP 53 Uni-directional DNS Kerberos
TCP 88 Uni-directional Kerberos Auth Kerberos
UDP 88 Uni-directional Kerberos Auth Kerberos
UDP 123 Uni-directional Windows Time Windows Time
TCP 135 Uni-directional RPC, EPM RPC, EPM
UDP 137 Uni-directional NetLogon, NetBIOS name User and computer authentication
UDP 138 Uni-directional DFSN, NetLogon, NetBIOS datagram service DFS, Group Policy, NetBIOS, NetLogon, browsing
TCP 389 Uni-directional LDAP Directory, replication, user, and computer authentication, Group Policy, trust
UDP 389 Uni-directional LDAP Directory, replication, user, and computer authentication, Group Policy, trust
TCP 445 Uni-directional SMB, CIFS, SMB3, DFSN, LSARPC, NetLogonR, SamR, SrvSvc Replication, user, and computer authentication, Group Policy, trust
TCP 464 Uni-directional Kerberos change/set password Replication, user, and computer authentication, trust
UDP 464 Uni-directional Kerberos change/set password Replication, user, and computer authentication, trust
TCP 636 Uni-directional LDAP SSL Directory, replication, user, and computer authentication, Group Policy, trust
TCP 3268 Uni-directional LDAP GC Directory, replication, user, and computer authentication, Group Policy, trust
TCP 3269 Uni-directional LDAP GC SSL Directory, replication, user, and computer authentication, Group Policy, trust
TCP 9389 Uni-directional SOAP AD DS web service
TCP 49152–65535 Uni-directional DCOM, RPC, EPM Group Policy


  • You must allow network traffic communication from your on-premises network to the VPC that contains your AWS-hosted EC2 domain controllers.
  • You also can restrict DC-to-DC replication traffic and DC-to-client communications to specific ports.
  • Packet fragmentation can cause issues with services such as Kerberos. You should make sure that maximum transmission unit (MTU) sizes match your network devices.
  • Additionally, unless a tunneling protocol is used to encapsulate traffic to Active Directory, ranges of ephemeral TCP ports between 49152 to 65535 are required. Ephemeral ports are also known as service response ports. These ports are created dynamically for session responses for each client that establishes a session. These ports are required not only for Windows but for Linux and UNIX.

Manage domain controllers securely using a bastion host and RDGW

We recommend that you restrict the domain controller’s management by using a secure, highly available, and scalable Microsoft Remote Desktop Gateway (RDGW) solution in conjunction with bastion hosts. A bastion host that is designed to work with a specific part of the infrastructure should work with that unit only, and nothing else. Limiting the use of bastion hosts to a specific instance example domain controller can help improve your security posture.

The reference architecture shown in Figure 2 restricts management access to your domain controllers and access via port 443. The bastion hosts in the diagram are configured to only allow RDP from the RDGW.

For additional security, follow these best practices:

  • Configure RDGW and bastions hosts to use MFA for logins.
  • Implement login restrictions by using a Group Policy Object (GPO), so that only required administrators log in to RDGW and the bastion host, based on their group membership.

Bastion host to domain controllers ports requirements

The following table lists the port requirements for establishing bastion host-to-DC communication in all versions of Windows Server.

Source Destination Protocol Ports Type Usage Type of traffic
Bastion host to domain controller Any domain controller subset TCP 443 Uni-directional TPKT Remote Protocol Gateway access
UDP 3389 Uni-directional TPKT Remote Desktop Protocol
TCP 3389 Uni-directional WS-Man Remote Desktop Protocol
TCP 5985 Bi-directional HTTPS Windows Remote Management (WinRM)
TCP 5985 Bi-directional WS-Man Windows Remote Management (WinRM)

You can also take advantage of Systems Manager Session Manager to manage domain joined resources instead of using bastion hosts for management. This option eliminates the need to manage bastion infrastructure and open any inbound rules. It also integrates natively with IAM and AWS CloudTrail, two services that enhance your security and audit posture. In the next section, I’ll discuss Session Manager and how it is useful in this context.

Session Manager port forwarding

Active Directory administrators are accustomed to managing domain resources by using Remote Server Administrators Tools (RSAT) that are installed on either their workstations or a member server in the domain (for example, RDP to a bastion host). Although RDP is effective, using RDP requires more management, such as managing inbound rules for port 3389. In some cases, having this port exposed to the internet might put your systems at risk. For example, systems can be susceptible to brute force or unauthorized dictionary activity. Instead of using a RDGW host and opening RDP inbound RDP ports, we recommend using the Session Manager Service, which provides port-forwarding ability without opening inbound ports.

Port forwarding provides the ability to forward traffic between your clients to open ports on your EC2 instance. After you configure port forwarding, you can connect to the local port and access the server application that is running inside the instance, as shown in Figure 3. To configure the port-forwarding feature in Session Manager, you can use IAM policies and the AWS-StartPortForwardingSession document.

Figure 3: Session Manager tunnel

Figure 3: Session Manager tunnel

To start a session using the AWS Command Line Interface (AWS CLI), run the following command.

aws ssm start-session --target "instance-id" --document-name AWS StartPortForwardingSession -parameters portNumber="3389",localPortNumber="9999"

Note: You can use any available ephemeral port. 9999 is just an example. Install and configure the AWS CLI, if you haven’t already.

You can also start a session by using an IAM policy like the one shown in the following example. To learn more about creating IAM policies for Session Manager, see the topic Quickstart default IAM policies for Session Manager.

In this policy example, I created the policy for Systems Manager for both AWS-StartPortForwadingSession and AWS-StartSSHSession for Linux (SSH) environments, for your reference and guidance.

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [
                "arn:aws:ec2:*: <AccountID>:instance/*",
                "arn:aws:ssm:*: <AccountID>:document/SSM-SessionManagerRunShell"
            "Condition": {
                "StringLike": {
                    "ssm:resourceTag/TAGKEY": [

            "Effect": "Allow",
            "Action": [
            "Resource": [

            "Effect": "Allow",

            "Action": [



            "Resource": "*"
            "Effect": "Allow",
            "Action": [
            "Resource": [

When you use the port-forwarding feature in Session Manager, you have the option to use an auditing service like AWS CloudTrail to provide a record of the connections made to your instances. You can also monitor the session by using Amazon CloudWatch Events with Amazon SNS to receive notifications when a user starts or ends session activity.

There is no additional charge for accessing EC2 instances by using Session Manager port forwarding. Port forwarding is available today in all AWS Regions where Systems Manager is available. You will be charged for the outgoing bandwidth from the NAT Gateway or your VPC Private Link.

Bastion host architecture using Session Manager

In this section, I discuss how to use a bastion host with Session Manager. Session Manager uses the Systems Manager infrastructure to create an SSH-like session with an instance. Session Manager tunnels real SSH connections, which allows you to tunnel to another resource within your VPC directly from your local machine. A managed instance that you create, acts as a bastion host, or gateway, to your AWS resources. The benefits of this configuration are:

  • Increased security: This configuration uses only one EC2 instance (the bastion host), and connects outbound port 443 to Systems Manager infrastructure. This allows you to use Session Manager without any inbound connections. The local resource must allow inbound traffic only from the instance that is acting as bastion host. Therefore, there is no need to open any inbound rule publicly.
  • Ease of use: You can access resources in your private VPC directly from your local machine.

In the example shown in Figure 4, the EC2 instance is acting as a domain controller that must be accessed securely by an Active Directory administrator who is working remotely via bastion host. To support this use case, I’ve chosen to use an interface VPC endpoint for Systems Manager, in order to facilitate private connectivity between Systems Manager Agent (SSM Agent) on the EC2 instance that is acting as a bastion host, and the Systems Manager service endpoints. You can configure Session Manager to enable port forwarding between the administrator’s local workstation and the private EC2 bastion instances, so that they can securely access the bastion host from the internet. This architecture helps you to eliminate RDGW infrastructure setup and reduce management efforts. You can add MFA at the bastion host level to enhance security.


  • If you want to use the AWS CLI to start and end sessions that connect you to your managed instances, you must first install the Session Manager plugin on your local machine.
  • Make sure that the bastion host has SSM Agent installed, because Session Manager only works with Systems Manager managed instances.
  • Follow the steps in Creating an interface endpoint to create the following interface endpoints:
    • com.amazonaws.<region>.ssm – The endpoint for the Systems Manager service.
    • com.amazonaws.><region>.ec2messages – Systems Manager uses this endpoint to make calls from the SSM Agent to the Systems Manager service.
    • com.amazonaws.<region>.ec2 – The endpoint to the EC2 service. If you’re using Systems Manager to create VSS-enabled snapshots, you must ensure that you have this endpoint. Without the EC2 endpoint defined, a call to enumerate attached EBS volumes fails. This causes the Systems Manager command to fail.
    • com.amazonaws.<region>.ssmmessages – This endpoint is required for connecting to your instances through a secure data channel by using Session Manager, in this case the port-forwarding requirement.

Support for domain controllers in Session Manager

You can use Session Manager to connect EC2 domain controllers directly, as well. To initiate a connection with either the default Session Manager connection or the port-forwarding feature discussed in this post, complete these steps.

To initiation a connection

  1. Create the ssm-user in your domain.
  2. Add the ssm-user to the domain groups that grant the user local access to the domain controller. One example is to add the user to the Domain Admins group.

IMPORTANT: Follow your organization’s security best practices when you grant the ssm-user access to the domain.


In this blog post, I described best practices for deploying domain controllers on EC2 instances and extending on-premises Active Directory to AWS for your guidance and quick reference. I also covered how you can maximize security for your extended EC2-hosted domain controller infrastructure by using AWS services. In addition, you learned about how AWS Systems Manager Session Manager port forwarding to RDP provides a simple and secure way to manage your domain resources remotely, without the need to open inbound ports and maintain RDGW hosts. Port forwarding works for Windows and Linux instances. It’s available today in all AWS Regions where Systems Manager is available. Depending on your use case, you should consider additional protection mechanisms per your organization’s security best practices.

To learn more about migrating Windows Server or SQL Server, visit Windows on AWS. For more information about how AWS can help you modernize your legacy Windows applications, see Modernize Windows Workloads with AWSContact us to start your modernization journey today.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Mangesh Budkule

Mangesh is a Microsoft Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance and technical assistance on AWS services, improving the value of their solutions when using AWS.

Manage your AWS Directory Service credentials using AWS Secrets Manager

Post Syndicated from Ashwin Bhargava original https://aws.amazon.com/blogs/security/manage-your-aws-directory-service-credentials-using-aws-secrets-manager/

AWS Secrets Manager helps you protect the secrets that are needed to access your applications, services, and IT resources. With this service, you can rotate, manage, and retrieve database credentials, API keys, OAuth tokens, and other secrets throughout their lifecycle. The secret value rotation feature has built-in integration for services like Amazon Relational Database Service (Amazon RDS) , whose credentials can be rotated. The same integration functionality can also be extended to other types of secrets, including API keys and OAuth tokens, with the help of AWS Lambda functions.

This blog post provides details on how Secrets Manager can be used to store and rotate the admin password of AWS Directory Service at a specified frequency. Customers who use the directory services in AWS can deploy the solution in this blog post to minimize the effort spent by their operations team to manually rotate the password (which is one of the best practices of password management). These customers can also benefit by using the secure API access of Secrets Manager to allow access by applications that are using Active Directory–specific accounts. A good example is having an application to reset passwords for AD users and can be done using the API access.

Solution overview

When you configure AWS Directory Service, one of the inputs the service expects is the password for the admin user (administrator). By using an AWS Lambda function and Secrets Manager, you can store the password and rotate it periodically.

Figure 1 shows the architecture diagram for this solution.

Figure 1: Architecture diagram

Figure 1: Architecture diagram

The workflow is as follows:

  1. During initial setup (which can be performed either manually or through a CloudFormation template), the password of the admin user is stored as a secret in Secrets Manager. The secret is in the JSON format and contains three fields: Directory ID, UserName, and Password. The secret is encrypted using KMS Key to provide an added layer of security.
  2. This secret is attached to a Lambda function that controls rotation.
  3. This rotation Lambda function generates a new password, updates Active Directory, and then updates the secret. The function can be invoked on as-needed basis or at a desired interval. The CFN template we provide in this post schedules the rotation at a 30-day interval.
  4. Applications can securely fetch the new secret value from Secrets Manager.

Prerequisites and assumptions

To implement this solution, you need an AWS account to test the solution and access AWS services.

Also be aware of the following:

  1. In this solution, you will configure all the (supported) services in the same virtual private cloud (VPC) to simplify networking considerations.
  2. The predefined admin user name for Simple Active Directory is Administrator.
  3. The predefined password is a random 12-character string.

Important: The AWS CloudFormation template that we provide deploys a Simple Active Directory. This is for testing and demonstration purposes; you can modify or reuse the solution for other types of Active Directory solutions.

Deploy the solution

To deploy the solution, you first provision the baseline networking and other resources by using a CloudFormation stack.

The resource provisioning in this step creates these resources:

  • An Amazon Virtual Private Cloud (Amazon VPC) with two private subnets
  • AWS Directory Service installed and configured in the VPC
  • A Secrets Manager secret with rotation enabled
  • A Lambda function inside the VPC
  • These AWS Identity and Access Management (IAM) roles and permissions:
    • Secrets Manager has permission to invoke Lambda functions
    • The Lambda function has permission to update the secret in Secrets Manager
    • The Lambda function has permission to update the password for Directory Service

To deploy the solution by using the CloudFormation template

  1. You can use this downloadable template to set up the resources. To launch directly through the console, choose the following Launch Stack button, which creates the stack in the us-east-1 AWS Region.
    Select the Launch Stack button to launch the template
  2. Choose Next to go to the Specify stack details page.
  3. The bucket hosting the Lambda function code is predefined for ease of implementation, but you can edit the bucket name if necessary. Specify any other template details as needed, and then choose Next.
  4. (Optional) On the Configure Stack Options page, enter any tags, and then choose Next.
  5. On the Review page, select the check box for I acknowledge that AWS CloudFormation might create IAM resources with custom names, and choose Create stack.

It takes approximately 20–25 minutes for the provisioning to complete. When the stack status shows Create Complete, review the outputs that were created by navigating to the Outputs tab, as shown in Figure 2.

Figure 2: Outputs created by the CloudFormation template

Figure 2: Outputs created by the CloudFormation template

Now that the stack creation has completed successfully, you should validate the resources that were created.

To validate the resources

  1. Navigate to the AWS Directory Service console. You should see a new directory service that has the corp.com directory set up.
  2. Navigate to the AWS Secrets Manager console and review the secret that was created called DSAdminPswd. Choose the secret value, and then choose Retrieve secret value to reveal the secret values.
    Figure 3: Checking the secret value in the Secrets Manager console

    Figure 3: Checking the secret value in the Secrets Manager console

  3. As you might have noticed, the secret value changed from what was initially generated in the template. The Lambda function was invoked when it was attached to the secret, which caused the secret to rotate. To verify that the secret value changed, navigate to the Amazon CloudWatch console, and then navigate to Log groups.
  4. In the search bar, type the Lambda function name dj-rotate-lambda to filter on the log group name.
    Figure 4: CloudWatch log groups

    Figure 4: CloudWatch log groups

  5. Choose the log group /aws/lambda/dj-rotate-lambda to open the detailed log streams.
  6. Look at the Log streams and open the recent log stream to view the series of rotation events.
    Figure 5: The log data for a complete rotation

    Figure 5: The log data for a complete rotation

    You should see that each of the four stages of rotation (create, set, test, and finish) are called in the right sequence. A Success message in the finishSecret stage confirms the successful rotation of the secret value.

The next step is to rotate the secret manually or set a policy for rotation.

To rotate the secret

The CloudFormation automation has set the rotation configuration to rotate the secret every 30 days. You can alternatively initiate another rotation by choosing Rotate secret immediately, as shown in Figure 6. You will observe the log stream (in CloudWatch Logs) changing, followed by the new secret value.

Figure 6: Manual rotation of the secret

Figure 6: Manual rotation of the secret

You can also edit the rotation configuration by choosing Edit rotation and configuring the rotation policy that suits your organizational standards, as shown in Figure 7.

Figure 7: Editing the rotation configuration

Figure 7: Editing the rotation configuration

Code walkthrough

The rotation Lambda function works in four stages:

  1. CreateSecret – In this stage, the Lambda function creates a new password for the administrator user and sets up the staging label AWSPENDING for the secret’s new value.
  2. SetSecret – In this stage, the Lambda function fetches the newly generated password by using the label AWSPENDING and sets it as the password to the Active Directory administrator user.
  3. TestSecret – In this stage, the Lambda function verifies that the password is working by using the kinit command and the necessary dependent libraries of the Linux OS (the base OS for Lambda functions). If successful, the function continues to the next stage. In the case of failure, the catch block reverts the password of the Active Directory administrator user to the value in the AWSCURRENT label.
  4. FinishSecret – This is the final stage, where the Lambda function moves the labels AWSCURRENT from the current version of secret to the new version. And the same time, the old version of the secret is given AWSPREVIOUS label.

The Lambda function is written in Python 3.7 runtime and uses AWS SDK for Python (Boto3) API calls for interacting with Secrets Manager and Directory Services.

The directory ID and Secrets Manager endpoint are supplied as environment variables to the Lambda function, as shown in Figure 8. The secret ID is fetched from the event context.

Figure 8: Environment variables setup

Figure 8: Environment variables setup

You can download the Lambda code that is used for the rotation logic and modify it to suit your organizational needs. For instance, the random password is configured to have a length of 12 characters, excluding special characters and punctuations, as shown in the following code snippet. You can modify this configuration as needed.

newpasswd = service_client.get_random_password(PasswordLength=12,ExcludeCharacters='/@"\'\\',ExcludePunctuation=True)

As mentioned in the Prerequisites section, make sure that you do proper testing in development or test environments before proceeding to deploy the solution in production environments.


After you complete and test this solution, clean up the resources by deleting the AWS CloudFormation stack called aws-ds-creds-manager. For more information on deleting the stacks, see Deleting a stack on the AWS CloudFormation console.


In this post, we demonstrated how to use the AWS Secrets Manager service to store and rotate the AWS Directory Service Simple Active Directory admin password. You can also use this solution to rotate the AWS Managed Microsoft AD directory.

There are many other code samples listed in the AWS Code Sample Catalog that show how to rotate the passwords for other database services that are supported by this service.

You can find additional rotation Lambda function examples in the open source AWS library for Secrets Manager.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Ashwin Bhargava

Ashwin is a DevOps Consultant at AWS working in Professional Services Canada. He is a DevOps expert and a security enthusiast with more than 13 years of development and consulting experience.


Satya Vajrapu

Satya is a Senior DevOps Consultant with AWS. He works with customers to help design, architect, and develop various practices and tools in the DevOps and cloud toolchain.

Disaster Recovery (DR) for a Third-party Interactive Voice Response on AWS

Post Syndicated from Priyanka Kulkarni original https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-for-a-third-party-interactive-voice-response-on-aws/

Voice calling systems are prevalent and necessary to many businesses today. They are usually designed to provide a 24×7 helpline support across multiple domains and use cases. Reliability and availability of such systems are important for a good customer experience. The thoughtful design of a cost-optimized solution will allow your business to sustain the system into the future.

We address a scenario in which you are mandated to host the workload on a corporate data center (DC), and configure the backup site on Amazon Web Services (AWS). Since the primary objective of a backup site is disaster recovery (DR) management, this site is often referred to as a DR site.

Disaster Recovery on AWS

DR strategy defines the recovery objectives for downtime and data loss. The workload has a recovery time objective (RTO) and a recovery point objective (RPO). RTO is the maximum acceptable delay between the interruption of service and the restoration of service. RPO is the maximum acceptable amount of time since the last data recovery point. AWS defines four DR strategies in increasing order of complexity, and decreasing order of RTO and RPO. These are backup and restore, active/passive (pilot light or warm standby), or active/active.

Figure 1. Disaster recovery (DR) options

Figure 1. Disaster recovery (DR) options

In our use case, the DR site on AWS must serve the user traffic with RPO and RTO in seconds. Warm standby is the optimal choice in this case. It is a scaled-down version of a fully functional environment, and is always running in the cloud.

Amazon Connect is an omnichannel cloud contact center that helps you provide great customer service at a lower cost. But in some situations, Amazon Connect may not be available. In other cases, the customer may want to use their home developed or third-party contact center application. Our solution is designed to help in both these scenarios.

This architecture enables customers facing challenges of cost overhead with redundant Session Initiation Protocol (SIP) trunks for the DC and DR sites. It allows you to optimize your spend and yet retain a reliable workflow.

SIP trunk communication on AWS

Let’s see how the SIP trunk termination on the AWS network handles the failover scenario of a third-party IVR application installed on Amazon EC2 at the DR site.

There will be two connections made from the AWS Direct Connect location (DX). The first will be for a point-to-point connectivity between the corporate DC and the AWS DR site. The second connection will be originating from the multiplexer (MUX) of the telecom provider who is providing you the SIP trunk.

The telecom provider will lay the SIP trunk from its MUX to the customer router at the DX location. At this point, the mode of communication becomes IP-based. The telecom provider will send the call to the IP address attached to the Network Load Balancer (NLB) in Amazon Virtual Private Cloud (VPC).

Figure 2. Communication circuitry at telecom side

Figure 2. Communication circuitry at telecom side

AWS Network Load Balancers can now distribute traffic to AWS resources using their IP addresses and instance IDs as targets. You can also distribute the traffic with on-premises resources over AWS Direct Connect. Load balancing across AWS and on-premises resources using the same load balancer streamlines migrate-to-cloud, burst-to-cloud, or failover-to-cloud.

In the backup site, the NLB will point to the Session Border Controller (SBC). This is a special-purpose device that protects and regulates IP communications flows. You can bring your own SBC, or you can use an SBC offered in the AWS Marketplace.

Best practices for high availability of IVR solution on AWS

  • Configure the multiple Availability Zone (Multi-AZ) SBC setup
  • Make sure that the telecom provider for the SIP trunk is different from the internet service provider (ISP). This is for last mile connectivity for the DC from Direct Connect
  • Consider redundancy for Direct Connect by using a Site-to-Site VPN tunnel
Figure 3. Solution architecture of DR on AWS for a third-party IVR solution

Figure 3. Solution architecture of DR on AWS for a third-party IVR solution

Communication flow for an IVR solution deployed on a corporate DC and its DR on AWS

  1. The callers are received on the telecom providers SIP line, which terminates on the AWS Direct Connect location.
  2. At the DX location, you will configure a route in the AWS router to send the traffic to the IP address of the NLB. The NLB should be configured to perform health checks on the virtual machine in your on-premises DC. Based on these health checks, the NLB will do the routing and the failover.
  3. In a live scenario with successful health checks at the DC, the NLB will forward the call to the IP of the on-premises virtual machine. This is where the IVR application will be installed.
  4. The communication between the NLB in Amazon VPC and the virtual machine in DC, will happen over Direct Connect.
  5. In a DR scenario, the NLB will failover the communication to SBCs in Amazon VPC.


This solution is useful when a third-party IVR system is deployed in a corporate data center, and the passive DR site is hosted on AWS. Cost optimization on telecom components is an important aspect of this design. AWS Direct Connect provides dedicated connectivity to the AWS environment, from 50 Mbps up to 10 Gbps. This gives you managed and controlled latency. It also provides provisioned bandwidth, so your workload can connect to AWS resources in a reliable, scalable, and cost-effective way.

The solution in this blog explains the end-to-end flow of communication, from the user to the IVR agents. It also provides insights into managing failover and failback between DR and the DR site.

Further Reading:

Hybrid Cloud Architectures Using Self-hosted Apache Kafka and AWS Glue

Post Syndicated from Brandon Rubadou original https://aws.amazon.com/blogs/architecture/hybrid-cloud-architectures-using-self-hosted-apache-kafka-and-aws-glue/

Using analytics to gain insights from a variety of datasets is key to successful transformation. There are many options to consider to realize the full value and potential of our data in a hybrid cloud infrastructure. Common practice is to route data produced from on-premises to a central repository or data lake. Here it can be consumed by multiple applications.

You can use an Apache Kafka cluster for data movement from on-premises to the data lake, using Amazon Simple Storage Service (Amazon S3). But you must either replicate the topics onto a cloud cluster, or develop a custom connector to read and copy the topics to Amazon S3. This presents a challenge for many customers.

This blog presents another option; an architecture solution leveraging AWS Glue.

Kafka and ETL processing

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. You can use Kafka clusters as a system to move data between systems. Producers typically publish data (or push) to a Kafka topic, where an application can consume it. Consumers are usually custom applications that feed data into respective target applications. These targets can be a data warehouse, an Amazon OpenSearch Service cluster, or others.

AWS Glue offers the ability to create jobs that will extract, transform, and load (ETL) data. This allows you to consume from many sources, such as from Apache Kafka, Amazon Kinesis Data Streams, or Amazon Managed Streaming for Apache Kafka (Amazon MSK). The jobs cleanse and transform the data, and then load the results into Amazon S3 data lakes or JDBC data stores.

Hybrid solution and architecture design

In most cases, the first step in building a responsive and manageable architecture is to review the data itself. For example, if we are processing insurance policy data from a financial organization, our data may contain fields that identify customer data. These can be account ID, an insurance claim identifier, and the dollar amount of the specific claim. Glue provides the ability to change any of these field types into the expected data lake schema type for processing.

Figure 1. Data flow - Source to data lake target

Figure 1. Data flow – Source to data lake target

Next, AWS Glue must be configured to connect to the on-premises Kafka server (see Figure 1). Private and secure connectivity to the on-premises environment can be established via AWS Direct Connect or a VPN solution. Traffic from the Amazon Virtual Private Cloud (Amazon VPC) is allowed to access the cluster directly. You can do this by creating a three-step streaming ETL job:

  1. Create a Glue connection to the on-premises Kafka source
  2. Create a Data Catalog table
  3. Create an ETL job, which saves to an S3 data lake

Configuring AWS Glue

  1. Create a connection. Using AWS Glue, create a secure SSL connection in the Data Catalog using the predefined Kafka connection type. Enter the hostname of the on-premises cluster and use the custom-managed certificate option for additional security. If you are in a development environment, you are required to generate a self-signed SSL certificate. Use your Kafka SSL endpoint to connect to Glue. (AWS Glue also supports client authentication for Apache Kafka streams.)
  2. Specify a security group. To allow AWS Glue to communicate between its components, specify a security group with a self-referencing inbound rule for all TCP ports. By creating this rule, you can restrict the source to the same security group in the Amazon VPC. Ensure you check the default security group for your VPC, as it could have a preconfigured self-referencing inbound rule for ALL traffic.
  3. Create the Data Catalog. Glue can auto-create the data schema. Since it’s a simple flat file, use the schema detection function of Glue. Set up the Kafka topic and refer to the connection.
  4. Define the job properties. Create the AWS Identity and Access Management (IAM) role to allow Glue to connect to S3 data. Select an S3 bucket and format. In this case, we use CSV and enable schema detection.

The Glue job can be scheduled, initiated manually, or by using an event driven architecture. Note that Glue does not yet support the “test connection” option within the console. Make sure you set the “Job Timeout” and enter a duration in minutes because the default value is blank.

When the job runs, it pulls the latest topics from the source on-premises Kafka cluster. Glue supports checkpoints to ensure that all source data is processed. By default, AWS Glue processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data arriving later. You can modify this window size to increase timeliness or aggregation accuracy. AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read. AWS Glue bills hourly for streaming ETL jobs only while they are running.

Now that the connection is complete and the job is created, we can format the source data needed for the data lake. AWS Glue offers a set of built-in transforms that you can use to process your data using your ETL script. The transformed data is then placed in S3, where it can be leveraged as part of a larger data lake environment.

Many additional steps can be taken to render even more value from the information. For example, one team may choose to use a business intelligence tool like Amazon QuickSight to visualize and embed the data into an internal dashboard. Another team may want to use event driven architectures to notify financial analysts and initiate downstream actions when specific types of data are discovered. There are endless opportunities that should be determined by the business needs.


In this blog post, we have given an overview of an architecture that provides hybrid cloud data integration and analytics capability. Once the data is transformed and hosted in the S3 data lake, we can provide secure, reliable access to gain valuable insights. This solution allows for a variety of different producers and consumers, with the ability to handle increasing volumes of data.

AWS Glue along with Apache Kafka will ensure that your on-premises workloads are tightly integrated with your larger data lake solution.

If you have questions, post your thoughts in the comments section.

For further reading:

Orchestrate Jenkins Workloads using Dynamic Pod Autoscaling with Amazon EKS

Post Syndicated from Vladimir Toussaint original https://aws.amazon.com/blogs/devops/orchestrate-jenkins-workloads-using-dynamic-pod-autoscaling-with-amazon-eks/

This blog post will demonstrate how to leverage Jenkins with Amazon Elastic Kubernetes Service (EKS) by running a Jenkins Manager within an EKS pod. In doing so, we can run Jenkins workloads by allowing Amazon EKS to spawn dynamic Jenkins Agent(s) in order to perform application and infrastructure deployment. Traditionally, customers will setup a Jenkins Manager-Agent architecture that contains a set of manually added nodes with no autoscaling capabilities. Implementing this strategy will ensure that a robust approach optimizes the performance with the right-sized compute capacity and work needed to successfully perform the build tasks.

In setting up our Amazon EKS cluster with Jenkins, we’ll utilize the eksctl simple CLI tool for creating clusters on EKS. Then, we’ll build both the Jenkins Manager and Jenkins Agent image. Afterward, we’ll run a container deployment on our cluster to access the Jenkins application and utilize the dynamic Jenkins Agent pods to run pipelines and jobs.

Solution Overview

The architecture below illustrates the execution steps.

Solution Architecture diagram
Figure 1. Solution overview diagram

Disclaimer(s): (Note: This Jenkins application is not configured with a persistent volume storage. Therefore, you must establish and configure this template to fit that requirement).

To accomplish this deployment workflow, we will do the following:

Centralized Shared Services account

  1. Deploy the Amazon EKS Cluster into a Centralized Shared Services Account.
  2. Create the Amazon ECR Repository for the Jenkins Manager and Jenkins Agent to store docker images.
  3. Deploy the kubernetes manifest file for the Jenkins Manager.

Target Account(s)

  1. Establish a set of AWS Identity and Access Management (IAM) roles with permissions for cross-across access from the Share Services account into the Target account(s).

Jenkins Application UI

  1. Jenkins Plugins – Install and configure the Kubernetes Plugin and CloudBees AWS Credentials Plugin from Manage Plugins (you will not have to manually install this since it will be packaged and installed as part of the Jenkins image build).
  2. Jenkins Pipeline Example—Fetch the Jenkinsfile to deploy an S3 Bucket with CloudFormation in the Target account using a Jenkins parameterized pipeline.


The following is the minimum requirements for ensuring this solution will work.

Account Prerequisites

  • Shared Services Account: The location of the Amazon EKS Cluster.
  • Target Account: The destination of the CI/CD pipeline deployments.

Build Requirements

Clone the Git Repository

git clone https://github.com/aws-samples/jenkins-cloudformation-deployment-example.git

Security Considerations

This blog provides a high-level overview of the best practices for cross-account deployment and isolation maintenance between the applications. We evaluated the cross-account application deployment permissions and will describe the current state as well as what to avoid. As part of the security best practices, we will maintain isolation among multiple apps deployed in these environments, e.g., Pipeline 1 does not deploy to the Pipeline 2 infrastructure.


A Jenkins manager is running as a container in an EC2 compute instance that resides within a Shared AWS account. This Jenkins application represents individual pipelines deploying unique microservices that build and deploy to multiple environments in separate AWS accounts. The cross-account deployment utilizes the target AWS account admin credentials in order to do the deployment.

This methodology means that it is not good practice to share the account credentials externally. Additionally, the deployment errors risk should be eliminated and application isolation should be maintained within the same account.

Note that the deployment steps are being run using AWS CLIs, thus our solution will be focused on AWS CLI usage.

The risk is much lower when utilizing CloudFormation / CDK to conduct deployments because the AWS CLIs executed from the build jobs will specify stack names as parametrized inputs and the very low probability of stack-name error. However, it remains inadvisable to utilize admin credentials of the target account.

Best Practice — Current Approach

We utilized cross-account roles that can restrict unauthorized access across build jobs. Behind this approach, we will utilize the assume-role concept that will enable the requesting role to obtain temporary credentials (from the STS service) of the target role and execute actions permitted by the target role. This is safer than utilizing hard-coded credentials. The requesting role could be either the inherited EC2 instance role OR specific user credentials. However, in our case, we are utilizing the inherited EC2 instance role.

For ease of understanding, we will refer the target-role as execution-role below.

Cross account roles for Jenkins build jobs
Figure 2. Current approach

  • As per the security best practice of assigning minimum privileges, we must first create execution role in IAM in the target account that has deployment permissions (either via CloudFormation OR via CLI’s), e.g., app-dev-role in Dev account and app-prod-role in Prod account.
  • For each of those roles, we configure a trust relationship with the parent account ID (Shared Services account). This enables any roles in the Shared Services account (with assume-role permission) to assume the execution-role and deploy it on respective hosting infrastructure, e.g., the app-dev-role in Dev account will be a common execution role that will deploy various apps across infrastructure.
  • Then, we create a local role in the Shared Services account and configure credentials within Jenkins to be utilized by the Build Jobs. Provide the job with the assume-role permissions and specify the list of ARNs across every account. Alternatively, the inherited EC2 instance role can also be utilized to assume the execution-role.

Create Cross-Account IAM Roles

Cross-account IAM roles allow users to securely access AWS resources in a target account while maintaining the observability of that AWS account. The cross-account IAM role includes a trust policy allowing AWS identities in another AWS account to assume the given role. This allows us to create a role in one AWS account that delegates specific permissions to another AWS account.

  • Create an IAM role with a common name in each target account. The role name we’ve created is AWSCloudFormationStackExecutionRole. The role must have permissions to perform CloudFormation actions and any actions regarding the resources that will be created. In our case, we will be creating an S3 Bucket utilizing CloudFormation.
  • This IAM role must also have an established trust relationship to the Shared Services account. In this case, the Jenkins Agent will be granted the ability to assume the role of the particular target account from the Shared Services account.
  • In our case, the IAM entity that will assume the AWSCloudFormationStackExecutionRole is the EKS Node Instance Role that associated with the EKS Cluster Nodes.
    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "*"

Build Docker Images

Build the custom docker images for the Jenkins Manager and the Jenkins Agent, and then push the images to AWS ECR Repository. Navigate to the docker/ directory, then execute the command according to the required parameters with the AWS account ID, repository name, region, and the build folder name jenkins-manager/ or jenkins-agent/ that resides in the current docker directory. The custom docker images will contain a set of starter package installations.

Deploy Jenkins Application

After building both images, navigate to the k8s/ directory, modify the manifest file for the Jenkins image, and then execute the Jenkins manifest.yaml template to setup the Jenkins application. (Note: This Jenkins application is not configured with a persistent volume storage. Therefore, you will need to establish and configure this template to fit that requirement).

# Fetch the Application URL or navigate to the AWS Console for the Load Balancer
kubectl get svc -n jenkins

# Verify that jenkins deployment/pods are up running
kubectl get pods -n jenkins

# Replace with jenkins manager pod name and fetch Jenkins login password
kubectl exec -it pod/<JENKINS-MANAGER-POD-NAME> -n jenkins -- cat /var/jenkins_home/secrets/initialAdminPassword
  • The Kubernetes Plugin and CloudBees AWS Credentials Plugin should be installed as part of the Jenkins image build from the Managed Plugins.
  • Navigate: Manage Jenkins → Configure Global Security
  • Set the Crumb Issuer to remove the error pages in order to prevent Cross Site Request Forgery exploits.

Screenshot of Crumb isssuer
Figure 3. Configure Global Security

Configure Jenkins Kubernetes Cloud

  • Navigate: Manage Jenkins → Manage Nodes and Clouds → Configure Clouds
  • Click: Add a new cloud → select Kubernetes from the drop menus

Screenshot to configure Cloud on Jenkins
Figure 4a. Jenkins Configure Nodes and Clouds

Note: Before proceeding, please ensure that you can access your Amazon EKS cluster information, whether it is through Console or CLI.

  • Enter a Name in the Kubernetes Cloud configuration field.
  • Enter the Kubernetes URL which can be found via AWS Console by navigating to the Amazon EKS service and locating the API server endpoint of the cluster, or run the command kubectl cluster-info.
  • Enter the namespace that will be utilized in the Kubernetes Namespace field. This will determine where the dynamic kubernetes pods will spawn. In our case, the name of the namespace is jenkins.
  • During the initial setup of Jenkins Manager on kubernetes, there is an environment variable JENKINS_URL that automatically utilizes the Load Balancer URL to resolve requests. However, we will resolve our requests locally to the cluster IP address.
    • The format is as follows: https://<service-name>.<namespace>.svc.cluster.local

Configuring Kubernetes cloud for Jenkins
Figure 4b. Configure Kubernetes Cloud

Set AWS Credentials

Security concerns are a key reason why we’re utilizing an IAM role instead of access keys. For any given approach involving IAM, it is the best practice to utilize temporary credentials.

  • You must have the AWS Credentials Binding Plugin installed before this step. Enter the unique ID name as shown in the example below.
  • Enter the IAM Role ARN you created earlier for both the ID and IAM Role to use in the field as shown below.

Setting up credentials on Jenkins
Figure 5. AWS Credentials Binding

Configuring Global credentials
Figure 6. Managed Credentials

Create a pipeline

  • Navigate to the Jenkins main menu and select new item
  • Create a Pipeline

Screenshot for Pipeline configuration
Figure 7. Create a pipeline

Configure Jenkins Agent

Setup a Kubernetes YAML template after you’ve built the agent image. In this example, we will be using the k8sPodTemplate.yaml file stored in the k8s/ folder.

CloudFormation Execution Scripts

This deploy-stack.sh file can accept four different parameters and conduct several types of CloudFormation stack executions such as deploy, create-changeset, and execute-changeset. This is also reflected in the stages of this Jenkinsfile pipeline. As for the delete-stack.sh file, two parameters are accepted, and, when executed, it will delete a CloudFormation stack based on the given stack name and region.


In this Jenkinsfile, the individual pipeline build jobs will deploy individual microservices. The k8sPodTemplate.yaml is utilized to specify the kubernetes pod details and the inbound-agent that will be utilized to run the pipeline.

Jenkins Pipeline: Execute a pipeline

  • Click Build with Parameters and then select a build action.

Configuring stackname in Jenkins configuration
Figure 8a. Build with Parameters

  • Examine the pipeline stages even further for the choice you selected. Also, view more details of the stages below and verify in your AWS account that the CloudFormation stack was executed.

Jenkins pipeline dashboard
Figure 8b. Pipeline Stage View

  • The Final Step is to execute your pipeline and watch the pods spin up dynamically in your terminal. As is shown below, the Jenkins agent pod spawned and then terminated after the work completed. Watch this task on your own by executing the following command:
# Watch the pods spawn in the "jenkins" namespace
kubectl get pods -n jenkins -w

CLI output showing Jenkins POD status
Figure 9. Watch Jenkins Agent Pods Spawn

Code Repository



In order to avoid incurring future charges, delete the resources utilized in the walkthrough.

  • Delete the EKS cluster. You can utilize the eksctl to delete the cluster.
  • Delete any remaining AWS resources created by EKS such as AWS LoadBalancer, Target Groups, etc.
  • Delete any related IAM entities.


This post walked you through the process of building out Amazon EKS based infrastructure and integrating Jenkins to orchestrate workloads. We demonstrated how you can utilize this to deploy securely across multiple accounts with dynamic Jenkins agents and create alignment to your business with similar use cases. To learn more about Amazon EKS, see our documentation pages or explore our console.

About the Authors

Vladimir Toussaint Headshot1.png

Vladimir P. Toussaint

Vladimir is a DevOps Cloud Architect at Amazon Web Services. He works with GovCloud customers to build solutions and capabilities as they move to the cloud. Previous to Amazon Web Services, Vladimir has leveraged container orchestration tools such as Kubernetes to securely manage microservice applications for large enterprises.

Matt Noyce Headshot1.png

Matt Noyce

Matt is a Sr. Cloud Application Architect at Amazon Web Services. He works primarily with health care and life sciences customers to help them architect and build applications, data lakes, and DevOps pipelines that solve their business needs. In his spare time Matt likes to run and hike along with enjoying time with friends and family.

Nikunj Vaidya Headshot1.png

Nikunj Vaidya

Nikunj is a DevOps Tech Leader at Amazon Web Services. He offers technical guidance to the customers on AWS DevOps solutions and services that would streamline the application development process, accelerate application delivery, and enable maintaining a high bar of software quality. Prior to AWS, Nikunj has worked in software engineering roles, leading transformation projects, driving releases and improvements in the software quality and customer experience.