Tag Archives: Management Tools

Proactively validate your AWS CloudFormation templates with AWS Lambda

Post Syndicated from Kirankumar Chandrashekar original https://aws.amazon.com/blogs/devops/proactively-validate-your-aws-cloudformation-templates-with-aws-lambda/

AWS CloudFormation is a service that allows you to define, manage, and provision your AWS cloud infrastructure using code. To enhance this process and ensure your infrastructure meets your organization’s standards, AWS offers CloudFormation Hooks. These Hooks are extension points that allow you to invoke custom logic at specific points during CloudFormation stack operations, enabling you to perform validations, make modifications, or trigger additional processes. Among these, the Lambda hook is a powerful option provided by AWS. This managed hook allows you to use Lambda functions to validate your CloudFormation templates before deployment. By using a Lambda hook, you can invoke custom logic to check infrastructure configurations on create or update or delete CloudFormation resources or stacks or change sets, as well as create or update operations for AWS Cloud Control API (CCAPI) resources. This enables you to enforce defined policies for your infrastructure-as-code (IaC), preventing the deployment of non-compliant resources or emitting warnings for potential issues. In this blog post, you will explore how to use a Lambda hook to validate your CloudFormation templates before deployment, ensuring your infrastructure is compliant and secure from the start.

Introducing Lambda Hook

The Lambda hook is an AWS-provided managed hook with the type AWS::Hooks::LambdaHook. It simplifies the integration of custom logic into CloudFormation stacks. This powerful feature allows you to focus on building and testing your custom logic as a Lambda function, without the complexity of creating a hook from scratch.

By using the Lambda hook, you can activate a pre-built hook and deploy your custom logic into a Lambda function using familiar tools like AWS CLI or AWS Serverless Application Model (SAM) or AWS Cloud Development Kit (CDK). This approach reduces the number of components you need to manage in your workflow, allowing for more streamlined operations. The Lambda hook also offers flexible evaluation capabilities, enabling you to respond to specific template properties or configurations as needed.

One of the key advantages of the Lambda hook is the enhanced control it provides. You can benefit from features such as VPC integration, local logging, and granular resource management, all while leveraging the power of AWS Lambda functions. To get started with the Lambda hook, you’ll need to activate it in your AWS account. This activation process eliminates the need for authoring, testing, packaging, and deploying a custom hook using the AWS CloudFormation Command Line Interface (CFN-CLI), significantly simplifying your workflow.

Example Use Case: S3 Bucket Versioning Validation

This blog post demonstrates using the Lambda hook to validate S3 Bucket versioning before deployment. While focused on S3 buckets, this approach can be applied to other resource types, properties, stack, and change set operations.

By leveraging the Lambda hook, you’ll streamline custom logic integration into your CloudFormation stacks. The process involves:

  1. Activating Lambda hook of type AWS::Hooks::LambdaHook in your account
  2. Writing a Lambda function for validation
  3. Providing the Lambda ARN as input to the hook

This example showcases how to enhance your infrastructure-as-code practices, ensuring compliant and secure deployments from the start.

Architecture

This section shows you how the Lambda hook and Lambda function work together to enhance your CloudFormation deployments.

Lambda hook and Lambda function

First, you need to create a Lambda function with the business logic to respond to the hook. Then, you need to create an IAM execution role with the necessary permissions to invoke the Lambda Function. Once you have the Lambda function and the IAM execution role, you can activate the AWS provided Lambda hook. Follow the steps in the documentation to activate a Lambda hook from the AWS console. Alternatively, you can activate it using the AWS Command Line Interface (AWS CLI) by using the activate-type and set-type-configuration commands. Lastly, you can also use AWS::CloudFormation::LambdaHook as a CloudFormation resource to activate and configure Lambda hook from a CloudFormation template. You can share this resource across your other accounts and regions using AWS CloudFormation StackSets by following this blog.

Lambda hook in action

The following diagram and explanation illustrate the step-by-step workflow of how Lambda hook integrates with your CloudFormation operations, providing a visual representation of the process from template creation to resource deployment or modification.

Lambda Hooks Architecture and its working in action

Diagram 1: Lambda hooks in action

The architecture diagram illustrates the step-by-step flow of how the Lambda hook is used during a CloudFormation stack operation.

  1. Author a template: Author a CloudFormation template, including the necessary resources to configure.
  2. Create the stack: The CloudFormation stack creation process has started, but the process of creating the defined resources in the template has not yet begun.
  3. Request is received by CloudFormation service: When a resource creation, update, or deletion is requested, the CloudFormation service receives the request.
  4. Invoke the Hook: The CloudFormation service then invokes the Lambda hook.
  5. The hook invokes your the Lambda Function: The Lambda hook, in turn, triggers the execution of the Lambda function that was defined in the hook activation.
  6. The Lambda function processes the request and responds back to the Hook: The Lambda function processes the request, performing validation, or additional tasks as required. The Lambda function then responds back to the Lambda hook.
  7. The stack workflow progresses further in either continuing the resource creation/update/deletion with/without a warning or fails: Based on the Lambda function’s output, the Lambda hook either allows the stack operation to proceed with the resource operation (for example, creation of the resource), or deny the resource operation causing a rollback of the stack.

This workflow demonstrates how Lambda hook seamlessly integrates into the CloudFormation stack deployment process, allowing you to implement custom validations, enforce policies, and extend the capabilities of your infrastructure-as-code deployments through the power of Lambda functions. By leveraging the Lambda hook and the custom Lambda function, customers can extend the capabilities of their CloudFormation deployments, enabling advanced use cases such as resource validation, or additional task execution.

Sample Deployment

This section shows you how to enable the Lambda hook, which is of type AWS::Hooks::LambdaHook, and add the business logic in the Lambda function to validate the versioning configuration of an S3 bucket. The sample solution shown in this blog post demonstrates the hook triggering for the resource type AWS::S3::Bucket, and if you want to trigger this for every resource type, then you can use the Resource filter within Hook filters configuration that can take wildcard "AWS::*::*" as a value or multiple targets of resource types for example "AWS::S3::Bucket", "AWS::DynamoDB::Table", and you’ll also want to make sure that the Lambda Function has the logic to handle the additional resource type. You can also add additional Hook targets , for example to validate your STACK or CHANGE_SET.

In the example used in this blog post, you will configure the hook and activate on create and update operations operations. For more information about TargetFilters, see Hook configuration schema and for more information about Lambda hook see here. With these modifications, you need to consider two important points: First, you will need to handle the business logic to deal with different resource types in your Lambda function code. Second, additional pricing may apply based on your resource usage, for more details see the Lambda pricing page.

Creating the Lambda Function

You can create a Lambda function in several ways – on the AWS Console, using CloudFormation, using AWS CLI, or by directly invoking the API via SDK. In this section, we will cover creating a Lambda function with a few clicks on the AWS console. See Using Lambda with infrastructure as code (IaC) for deploying Lambda Function using SAM CLI, CDK or CloudFormation.

Create the Lambda function on the AWS console by following create a Lambda function with the console instructions and use the following sample Python code.

"""Example Lambda function called by AWS::Hooks::LambdaHook."""


import logging


HOOK_INVOCATION_POINTS = [
    "CREATE_PRE_PROVISION",
    "UPDATE_PRE_PROVISION",
    "DELETE_PRE_PROVISION",
]

TARGET_NAMES = [
    "AWS::S3::Bucket",
]

LOGGER = logging.getLogger()

LOGGER.setLevel("INFO")


def lambda_handler(event, context):
  """Define the entry point of the function."""
  try:
    request = event
    
    LOGGER.info(f"Request: {request}")

    invocation_point = request["actionInvocationPoint"]
    LOGGER.info(f"Invocation point: {invocation_point}")

    target_name = request["requestData"]["targetName"]
    LOGGER.info(f"Target name: {target_name}")
    
    clientRequestToken = request["clientRequestToken"]

    if (
      invocation_point not in HOOK_INVOCATION_POINTS
      or target_name not in TARGET_NAMES
    ):
      message = (
        f"Skipping {target_name} evaluation for {invocation_point}."
      )
      LOGGER.info(message)
      payload = {
        "clientRequestToken": clientRequestToken,
        "hookStatus": "SUCCESS",
        "errorCode": None,
        "message": message,
        "callbackContext": None,
        "callbackDelaySeconds": 0,
      }
      LOGGER.debug(payload)
      return payload

    target_model = request["requestData"]["targetModel"]

    resource_properties = (
      target_model.get("resourceProperties")
      if target_model and target_model.get("resourceProperties")
      else None
    )
    LOGGER.debug(f"Resource properties: {resource_properties}")

    versioning_configuration = (
      resource_properties.get("VersioningConfiguration")
      if resource_properties
      and resource_properties.get("VersioningConfiguration")
      else None
    )
    versioning_configuration_status = (
      versioning_configuration.get("Status")
      if versioning_configuration
      and versioning_configuration.get("Status")
      else None
    )
    if (
      not resource_properties
      or not versioning_configuration
      or not versioning_configuration_status
      or not versioning_configuration_status == "Enabled"
    ):
      message = "Versioning not set or not enabled for the S3 bucket."
      LOGGER.error(message)
      payload = {
        "clientRequestToken": clientRequestToken,
        "hookStatus": "FAILED",
        "errorCode": "NonCompliant",
        "message": message,
      }
    else:
      message = "Versioning is enabled for the S3 bucket."
      LOGGER.info(message)
      payload = {
        "clientRequestToken": clientRequestToken,
        "hookStatus": "SUCCESS",
        "errorCode": None,
        "message": message,
      }

    LOGGER.debug(payload)
    return payload
  except Exception as exception:
    message = str(exception)
    payload = {
      "clientRequestToken":  event["clientRequestToken"],
      "hookStatus": "FAILED",
      "errorCode": "InternalFailure",
      "message": message,
      "callbackContext": None,
      "callbackDelaySeconds": 0,
    }
    LOGGER.error(message)
    return payload

Example event sent to Lambda by the hook

{
  "clientRequestToken": "XXXXXXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "awsAccountId": "111111111111",
  "stackId": "arn:aws:cloudformation:<AWS-Region>:111111111111:stack/example-stack/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "changeSetId": "None",
  "hookTypeName": "AWS::Hooks::LambdaHook",
  "hookTypeVersion": "00000001",
  "hookModel":
    {
      "LambdaFunction": "example-hook-function-name",
    },
  "actionInvocationPoint": "CREATE_PRE_PROVISION",
  "requestData":
    {
      "targetName": "AWS::S3::Bucket",
      "targetType": "AWS::S3::Bucket",
      "targetLogicalId": "Bucket",
      "callerCredentials": "None",
      "providerCredentials": "None",
      "providerLogGroupName": "None",
      "hookEncryptionKeyArn": "None",
      "hookEncryptionKeyRole": "None",
      "targetModel":
        {
          "resourceProperties":
            {
              "PublicAccessBlockConfiguration":
                { "RestrictPublicBuckets": true, "IgnorePublicAcls": true },
              "BucketName": "XXXXXXXXXXXXXXXXXXXXXXXXXXX",
              "VersioningConfiguration": { "Status": "Enabled" },
            },
          "previousResourceProperties": "None",
        },
    },
  "requestContext": { "invocation": 1, "callbackContext": "None" },
}

Explanation of the Lambda Function code

The Lambda Function code is designed to process the event received from the Lambda hook and validate the versioning configuration of the target S3 bucket resource. Here’s a detailed explanation of the code:

  1. The function first extracts the relevant information from the event, including the invocation point and the target resource type.
  2. It then checks if the current invocation point is in the configured HOOK_INVOCATION_POINTS list and if the target resource type is AWS::S3::Bucket. If not, the function returns a success response, skipping the validation for this particular invocation.

Note: this code that skips the validation is put here as a fallback logic in the event the user has not chosen to use TargetFilters. As this is a wildcard hook, without TargetFilters the hook will always be invoked for any AWS resource type described in the template, and since the hook targets preCreate, preUpdate, and preDelete by default, the hook will be invoked for these invocation points by default. To narrow the scope and reduce costs by avoiding to invoke the hook for all AWS resource type targets and invocation points, use TargetFilters.

  1. Next, the function retrieves the resource properties from the event, specifically looking for the VersioningConfiguration property and its Status.
  2. If the VersioningConfiguration property is not present or its Status is not set to Enabled, the function returns a failure response, indicating that the versioning is not enabled for the S3 bucket.
  3. If the versioning is enabled, the function returns a success response.
  4. The function also includes a fallback mechanism to return a failure response in case of any other exceptions. By evaluating this sample code, you can validate the versioning configuration of the S3 bucket during the CloudFormation stack creation and update processes, with your infrastructure-as-code policies.

Enabling Lambda Hook in your AWS Account/Region

  1. Navigate to the AWS CloudFormation service on the AWS Console, then choose “Create Hook” → “with Lambda” from the main Hooks page:

Lambda Hooks Creation

Diagram 2: Create a Hook with Lambda console page

You will see the page explaining how the Lambda function work as a hook.

Lambda hooks provide lambda

Diagram 3: Provide a Lambda function to Hook Console page

  1. Provide the Hooks details: the name, the Lambda function it should take, the type, and the mode. You can also create your execution role directly from the console by choosing “New execution role”.
Lambdas Hooks Details Config

Diagram 3: Provide a Lambda function to Hook Console page

You can review the Lambda hook and activate it from the next page.

Lambda Hooks details and configuration

Diagram 4: Review Lambda hook Console page

Test a sample

In this section, you will test the hook and the Lambda Function that you activated for a S3 bucket resource.

Create an S3 Bucket without versioning

AWSTemplateFormatVersion: "2010-09-09"

Description: This CloudFormation template provisions an S3 Bucket without versioning enabled

Resources:
  S3Bucket:
    DeletionPolicy: Delete
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub test-bucket-versioning-1-${AWS::Region}-${AWS::AccountId}

You will see the hook invoking Lambda function and the Lambda Function responding back with a failure message since the Versioning is not enabled.

When you create or update a stack with the template above, the Lambda hook will be invoked, and the Lambda Function will respond with a failure message since bucket versioning is not enabled. The Lambda Function code will extract the resourceProperties from the event, check the VersioningConfiguration property, and find that the Status is not set to Enabled. As a result, if you use the example template above where you describe the S3 bucket without versioning enabled, the Lambda Function will send a failure response back to the hook, causing the CloudFormation stack operation to fail as shown in the following screenshot.

Lambda Hooks Failure stack

Diagram 5: Lambda Hook failure Stack

Create an S3 Bucket with versioning enabled You can try creating an S3 Bucket with versioning enabled to see how Hooks assessment succeeded.

AWSTemplateFormatVersion: "2010-09-09"

Description: This CloudFormation template provisions an S3 Bucket with Versioning enabled

Resources:
  S3Bucket:
    DeletionPolicy: Delete
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub test-bucket-versioning-2-${AWS::Region}-${AWS::AccountId}
      VersioningConfiguration:
        Status: Enabled

In this case, you will see the hook invoking the Lambda function and getting a success message since the Versioning is enabled

When you create a stack with this CloudFormation template, the Lambda hook will be invoked, and the Lambda Function will respond with a success message since the versioning is enabled. The Lambda Function code will extract the resourceProperties from the event, check the VersioningConfiguration property, and find that the Status is set to Enabled. As a result, the Lambda Function will send a success response back to the hook, allowing the CloudFormation stack operation to proceed as shown in the following screenshot.

Lambda Hooks Success stack

Diagram 6: Lambda Hook success Stack

By testing these two scenarios, you can verify that the Lambda hook and the associated Lambda Function are working as expected, enforcing the S3 bucket versioning policy during CloudFormation stack operations.

Clean up

To clean up, refer to the following documentation to delete CloudFormation Stacks: Deleting a stack on the AWS CloudFormation console or Deleting a stack using AWS CLI. Refer to documentation to deactivate the hook.

Conclusion

In this blog post, you explored the capabilities of CloudFormation Hooks and how they can be leveraged to extend the functionality of your infrastructure-as-code deployments. Specifically, you learned about the Lambda hook, a pre-built hook that simplifies the process of integrating custom logic into your CloudFormation stacks.

By activating the Lambda hook, and deploying a custom Lambda Function, you were able to validate the versioning configuration of an S3 bucket during the CloudFormation stack creation and update processes. This approach allows you to enforce infrastructure-as-code policies and ensure compliance at the point of deployment, rather than relying on post-deployment checks or indirect governance mechanisms. The ability to leverage familiar tools and workflows, such as the AWS CLI, AWS SAM, CI/CD pipelines, or the AWS CDK, makes it easier to incorporate custom logic into your CloudFormation deployments. This reduces the overhead and complexity associated with traditional hook orchestration and packaging, empowering you to streamline your infrastructure-as-code practices.

As you continue to build and deploy your cloud infrastructure, consider exploring the various CloudFormation Hooks available, for example, see aws-cloudformation/aws-cloudformation-samples and aws-cloudformation/community-registry-extensions GitHub repositories. The approach demonstrated in this blog post can be applied to other resource types supported by CloudFormation, allowing you to validate and enforce policies for a wide range of infrastructure components, from EC2 instances and VPCs to databases and application services.

About the Author

kirankumar.jpeg

Kirankumar Chandrashekar is a Sr. Solutions Architect for Strategic Accounts at AWS. He focuses on leading customers in architecting DevOps, modernization using serverless, containers and container orchestration technologies like Docker, ECS, EKS to name a few. Kirankumar is passionate about DevOps, Infrastructure as Code, modernization and solving complex customer issues. He enjoys music, as well as cooking and traveling.

stella.jpegStella Hie is a Sr. Product Manager Technical for AWS Infrastructure as Code. She focuses on proactive control and governance space, working on delivering the best experience for customers to use AWS solutions safely. Outside of work, she enjoys hiking, playing piano, and watching live shows.

Introducing a managed hook for Guard

Post Syndicated from Brian Terry original https://aws.amazon.com/blogs/devops/introducing-a-managed-hook-for-guard/

In today’s cloud-driven world, maintaining compliance and enforcing organizational policies across your infrastructure is more critical than ever. AWS CloudFormation, a service that enables you to model, provision, and manage AWS and third-party resources through Infrastructure as Code (IaC), has been a cornerstone for automating cloud deployments. While CloudFormation simplifies resource management, ensuring compliance with internal and external standards requires robust tools and strategies to align with organizational and regulatory requirements.

Enter the new managed hook for Guard, a new hook designed to seamlessly integrate compliance into your AWS CloudFormation workflows. AWS CloudFormation Hooks are a powerful tool that allow you to validate and enforce policies during the provisioning and management of resources in your CloudFormation stacks. With a Guard hook, you can now bring your AWS CloudFormation Guard rules directly into your CloudFormation stacks, change sets, and individual resources. This feature eliminates the need to incorporate the Guard runtime into custom hooks, simplifying compliance enforcement and enhancing your security posture.

What is a Guard hook?

A Guard hook is a service-managed CloudFormation Hook that integrates AWS CloudFormation Guard—a policy-as-code tool—directly into your CloudFormation deployment process. AWS CloudFormation Guard uses a domain-specific language (DSL) to define compliance rules in an easily readable and reusable format, enabling you to codify and enforce organizational policies. This hook allows you to automatically validate your CloudFormation templates against these established compliance rules, enforcing policies seamlessly without the need to manage custom Lambda functions or Guard runtimes. By identifying non-compliant resources early, it helps streamline deployments and ensures that your infrastructure aligns with organizational standards right from the start.

Benefits of Using a Guard hook

Guard hooks provide unique advantages by combining the flexibility of CloudFormation Hooks with the expressiveness of the Guard Domain Specific Language (DSL), giving you a powerful tool for infrastructure compliance and security. As a fully managed CloudFormation hook, Guard hooks eliminate the need for a custom hook implementation and management, allowing you to leverage AWS’s service-driven infrastructure to streamline policy enforcement.With a Guard hook, you can express compliance rules and configurations directly in the Guard DSL.

Guard hooks specifically add value by enabling these Guard-based compliance policies to be applied automatically during stack operations. This helps validate that your infrastructure consistently aligns with security and compliance standards, reducing the risks of misconfiguration and non-compliance. Additionally, Guard hooks allow you to take advantage of CloudFormation’s native stack-level and resource-level hooks, adding depth to your compliance by enforcing policies across entire stacks and individual resources alike.

Use Cases

Guard hooks can be applied in various scenarios to enhance compliance and security. In security compliance, for instance, it can ensure that all databases and storage services have encryption enabled or prevent security groups from allowing unrestricted inbound traffic. Regarding resource configuration, it can enforce the use of approved EC2 instance types or require specific tags on all resources for cost tracking and management.

For operational governance, Guard hooks can validate that resources do not exceed predefined thresholds, such as storage size, ensuring that resources are provisioned within acceptable limits. It can also enforce network policies by ensuring the use of approved VPCs and subnets, maintaining network security and compliance with organizational standards.

A resource hook in CloudFormation targets individual resources within a template, such as EC2 instances, S3 buckets, or IAM roles. It is particularly useful for enforcing specific standards or compliance requirements on individual resources, thereby helping customers ensure each one meets defined criteria. For instance, you could use a resource-level hook to enforce that every S3 bucket created through CloudFormation has server-side encryption enabled. In contrast, a stack or change set hook targets the entire CloudFormation stack as a single entity, allowing for the application of guardrails or validation requirements across all resources within the stack collectively. This helps to ensure that the entire stack, not just individual resources, aligns with set policies before the stack is created or updated. For example, a stack-level hook could be used to prevent a stack from containing certain types of resources or to ensure that a specific number of resources are created in each stack. Together, these hooks provide different layers of validation to enforce compliance and configuration standards at both the resource and stack levels.

Getting Started

Getting started with a Guard hook involves configuring the hook, uploading your Guard rules to Amazon S3, activating the hook, and enabling it to run against your CloudFormation stacks. Below are the detailed steps to help you set up and start using Guard Hook effectively.

Pre-requisites Create an Amazon S3 Bucket

If you do not have one already, you will need to create an S3 bucket in your region. This bucket will hold your Guard policy files and optionally the full Guard evaluation output if so desired.

Configuring a Guard hook Create your rule

In this section, you will provide an example of AWS CloudFormation Guard rules that help enforce best practices for Amazon S3 buckets. Please create a file called safebucket.guard and include the following rules:

# Rule Identifier:
#    S3_BUCKET_PUBLIC_WRITE_PROHIBITED
#
# Select all S3 resources from the incoming template (payload)
#
let s3_buckets_public_write_prohibited = Resources.*[ Type == 'AWS::S3::Bucket' ]

rule S3_BUCKET_PUBLIC_WRITE_PROHIBITED when %s3_buckets_public_write_prohibited !empty {
  %s3_buckets_public_write_prohibited.Properties.PublicAccessBlockConfiguration exists
  %s3_buckets_public_write_prohibited.Properties.PublicAccessBlockConfiguration.BlockPublicAcls == true
  %s3_buckets_public_write_prohibited.Properties.PublicAccessBlockConfiguration.BlockPublicPolicy == true
  %s3_buckets_public_write_prohibited.Properties.PublicAccessBlockConfiguration.IgnorePublicAcls == true
  %s3_buckets_public_write_prohibited.Properties.PublicAccessBlockConfiguration.RestrictPublicBuckets == true
  <<
    Violation: S3 Bucket public write access must be restricted.
    Fix: Set the S3 Bucket PublicAccessBlockConfiguration properties—BlockPublicAcls, BlockPublicPolicy, IgnorePublicAcls, RestrictPublicBuckets—to true.
  >>
}

# Rule Identifier:
#    S3_BUCKET_VERSIONING_ENABLED
#
# Select all S3 resources from the incoming template (payload)
#
let s3_buckets_versioning_enabled = Resources.*[ Type == 'AWS::S3::Bucket' ]

rule S3_BUCKET_VERSIONING_ENABLED when %s3_buckets_versioning_enabled !empty {
  %s3_buckets_versioning_enabled.Properties.VersioningConfiguration exists
  %s3_buckets_versioning_enabled.Properties.VersioningConfiguration.Status == 'Enabled'
  <<
    Violation: S3 Bucket versioning must be enabled.
    Fix: Set the S3 Bucket property VersioningConfiguration.Status to 'Enabled'.
  >>
}

In this file, you can define rules to prevent public write access and require versioning on all S3 buckets. Here’s what the two rules in this example do:

  • S3_BUCKET_PUBLIC_WRITE_PROHIBITED:
    • This rule identifies S3 buckets within a CloudFormation template and checks if the public write access is restricted. It ensures that each S3 bucket has PublicAccessBlockConfiguration properties enabled to block public ACLs, block public policies, ignore public ACLs, and restrict public buckets. If a bucket does not meet these criteria, a violation message is triggered, instructing the user to set these properties to true.
  • S3_BUCKET_VERSIONING_ENABLED:
    • This rule targets S3 buckets and verifies that versioning is enabled. The rule checks if the VersioningConfiguration property exists and has its Status set to 'Enabled'. If this configuration is missing or set incorrectly, it alerts the user to enable versioning on the S3 bucket, providing an additional layer of data protection.

These rules help you maintain strict security and operational best practices by ensuring that S3 buckets are configured according to compliance requirements. Place these rules in safebucket.guard, and you’re ready to start applying them to your CloudFormation templates using Guard Hook to enforce policies without any manual intervention.

Upload your rules to S3

The Guard hook expects your policy files to be uploaded to an S3 bucket to then be pulled down during hook execution time. The hook expects your rules to be either a single .guard file, or multiple .guard files uploaded as a .zip or .tar.gz

Once you have uploaded your Guard rule or Guard rules bundle to S3, take note of the S3 URI of the uploaded object, as you will need this later to configure the Guard hook. Example: s3://your-bucket-name/safebucket.guard

You can also get the S3 URI by copy-pasting it from the console

Screenshot of an Amazon S3 object properties page displaying details for the file safebucket.guard in the safes3rules bucket.

Authoring the Hook

You can author and activate a Guard hook from the console, CLI, and using the AWS::CloudFormation::GuardHook CloudFormation resource.

Console

You can use the new console flow to define and activate your hook. Click on Create a Hook and choose With Guard.

  1. You can type in your Guard rules S3 URI from the step before, or you can browse your S3 bucket using the console button.Screenshot of the AWS CloudFormation interface showing the 'Provide your Guard rules' step in the 'Create a Hook with Guard' workflow.
  2. You can configure the hook by performing the following steps:
    1. In the Hook Name field, enter My::Company::GuardHook.
    2. Under Hook Targets, select both Stacks and Resources. By selecting both Stacks and Resources under Hook Targets, you’re instructing CloudFormation to apply your hook validations at both levels. This ensures that individual resources and the overall stack comply with your defined policies, providing a comprehensive safeguard across deployments.
    3. For Hook Mode, choose Fail.
    4. In Execution Role, select New execution role and enter my-guard-hook-role for the name if you want CloudFormation to create the necessary role and permissions. Alternatively, an IAM role ARN you created manually (learn more here).Screenshot of the AWS CloudFormation interface showing the 'Hook details and settings' step in the 'Create a Hook with Guard' workflow.
  3. While we didn’t include a filter in this example, you can limit the executions of the hook to the intended resource types by using a filter. For more information about filters view our documentation page. You can skip to the next step if you do not have anything to filter.Screenshot of the AWS CloudFormation interface showing the 'Apply Hook filters (optional)' step in the 'Create a Hook with Guard' workflow.
  4. Review the hook details and then click the Activate Hook button.

Screenshot of the AWS CloudFormation interface showing the 'Hook details and settings' step in the 'Create a Hook with Guard' workflow.

CLI

export AWS_ACCOUNT_ID=<Your_Account_Id>

aws cloudformation activate-type --type HOOK \
--type-name AWS::Hooks::GuardHook \
--publisher-id aws-hooks \
--type-name-alias MyCompany::Hooks::GuardHook \
--execution-role arn:aws:iam::${AWS_ACCOUNT_ID}:role/<Your_ExecutionRole_Name> \
--region eu-central-1

Note: You can create your own type-name-alias → MyCompany::Hooks::GuardHook is just an example

Using the CloudFormation resource

Using the AWS::CloudFormation::GuardHook resource in a CloudFormation template allows you to enforce compliance checks and validations on your AWS resources before creating, modifying, or deleting them within a stack. By specifying Guard rules using the Guard domain-specific language (DSL), you can define policies to assess resource configurations and security standards. When integrated into a CloudFormation template, AWS::CloudFormation::GuardHook evaluates the resources during stack operations and can either block non-compliant actions or provide warnings, depending on the configuration. This tool benefits organizations by automating governance, enhancing security, and reducing manual intervention, which help ensure that deployments adhere to best practices and compliance requirements consistently across environments.

Here is an example of how to create a Guard hook with the AWS::CloudFormation::GuardHook resource in a YAML file.

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  GuardHookRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: GuardHookExecutionRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: hooks.cloudformation.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: S3AccessPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject
                  - s3:GetObjectVersion
                  - s3:PutObject
                  - s3:ListBucket
                Resource: 
                  - arn:aws:s3:::<YourBucket>/*
                  - arn:aws:s3:::<YourBucket>

  MyGuardHook:
    Type: AWS::CloudFormation::GuardHook
    Properties:
      Alias: "my::guard::hook"
      ExecutionRole: !GetAtt GuardHookRole.Arn
      FailureMode: FAIL
      HookStatus: ENABLED
      LogBucket: "mysafebucket"
      RuleLocation:
        Uri: "s3://<YourBucket>/safebucket.guard"
      TargetOperations:
        - STACK
        - RESOURCE

For more about AWS::CloudFormation::GuardHook , see our docs.

Testing your hook (non-compliant stack)

Now at this point, the Guard Hook should be running against all S3 buckets deployed via CloudFormation stacks in your account. We can confirm this by creating a new stack, from the CLI.

To do so, create a file named s3-fail.yaml and add the following content:

AWSTemplateFormatVersion: "2010-09-09"
Description: This template creates a simple s3 bucket
   
Resources:
  DemoBucket:
   Type: AWS::S3::Bucket

Create a stack, and specify your template in the AWS Command Line Interface (AWS CLI). In the following example, specify the stack name as my-fail-stack and the template name as s3-fail.yaml.

aws cloudformation create-stack --stack-name my-fail-stack --template-body file://s3-fail.yaml

View Hook Invocations

Console

  1. From Stack Events tab, enable Hook Invocations field
    Screenshot of an AWS CloudFormation interface showing how to enable Hook invocations output.
  2. You can see the completed assessment and the result in summarized view (how many rules compliant vs. not) and Stack status based on your hook mode:Screenshot of the 'Hook invocation details' dialog in AWS CloudFormation. The dialog displays the Hook name as 'My::Company::GuardHook,' shown as a clickable link. The Hook status is 'HOOK_COMPLETE_FAILED,' highlighted in red with an error icon. The failure mode is set to 'Fail,' and the invocation point is 'PRE_PROVISION.' The status reason explains that the Hook failed validation due to the following rules being violated: 'S3_BUCKET_PUBLIC_WRITE_PROHIBITED' and 'S3_BUCKET_VERSIONING_ENABLED.

CLI

  1. View stack events by running below command:
    aws cloudformation describe-stack-events --stack-name my-fail-stack
  2. You can see the completed assessment and the result (how many rules compliant vs. not compliant as part of describe-stack-events. As the Hook Failure Mode is set to Fail, you can see the stack status: CREATE_FAILED due to Hook failure.
                ...
                "EventId": "my-fail-stack-d7943992-9366-4d3a-8bad-9695a5f6b310",
                "StackName": "my-fail-stack",
                "LogicalResourceId": "my-fail-stack",
                "PhysicalResourceId": "",
                "ResourceType": "STACK",
                "Timestamp": "2024-11-12T15:54:29.582000+00:00",
                "ResourceStatus": "CREATE_IN_PROGRESS",
                "HookType": "My::Company::GuardHook",
                "HookStatus": "HOOK_COMPLETE_FAILED",
                "HookStatusReason": "Hook failed with message: Template failed validation, the following rule(s) failed: S3_BUCKET_PUBLIC_WRITE_PROHIBITED, S3_BUCKET_VERSIONING_ENABLED.",
                "HookInvocationPoint": "PRE_PROVISION",
                "HookFailureMode": "FAIL"
            }

Viewing Guard hooks output from Log Bucket

If you specified the logging bucket, now you can view the Guard hook invocation json and evaluate the output stored in your S3 bucket.

Console

Screenshot of an Amazon S3 interface showing the list of Guard Hook log files.

CLI

You can download the log file by running the command below:

aws s3api get-object --bucket <Your log bucket> --key <insert_the-cfn-guard-validate-report-key.json> <Your_Outfile_DirectoryandName>

For more information about downloading S3 Object, see our docs.

Testing your hook (compliant stack)

To do so, create a file named s3-pass.yaml and add the following content:

AWSTemplateFormatVersion: "2010-09-09"
Description: This template creates a simple S3 bucket with versioning enabled and public write access disabled

Resources:
  DemoBucket:
   Type: AWS::S3::Bucket
   Properties:
      VersioningConfiguration:
        Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        IgnorePublicAcls: true
        BlockPublicPolicy: true
        RestrictPublicBuckets: true

Notice in the template, our bucket now does not allow public write access and enables versioning.

Create a stack, and specify your template in the AWS Command Line Interface (AWS CLI). In the following example, specify the stack name as my-pass-stack and the template name as s3-pass.yaml.

aws cloudformation create-stack --stack-name my-pass-stack --template-body file://s3-pass.yaml

View Hook Invocations

Console

  1. You can see the completed assessment and the the bucket was created.Screenshot of the 'Hook invocation details' dialog in AWS CloudFormation. The dialog displays the Hook name as 'My::Company::GuardHook,' shown as a clickable link. The Hook status is 'HOOK_COMPLETE_SUCCEEDED,' highlighted in green with a success icon. The failure mode is set to 'Fail,' and the invocation point is 'PRE_PROVISION.' The status reason indicates success with the message: 'Hook succeeded with message: Successful validation.
    Screenshot of the AWS CloudFormation 'Events' tab for a stack named "my-pass-stack." The CloudFormation stack shows successful deployment.

CLI

  1. View stack events by running below command:
    aws cloudformation describe-stack-events --stack-name my-pass-stack
  2. You can see the completed assessment and the result (how many rules compliant vs. not compliant as part of describe-stack-event
               ...
                "StackName": "my-pass-stack",
                "LogicalResourceId": "DemoBucket",
                "PhysicalResourceId": "my-pass-stack-demobucket-rmm9cngo70xq",
                "ResourceType": "AWS::S3::Bucket",
                "Timestamp": "2024-11-12T17:01:41.898000+00:00",
                "ResourceStatus": "CREATE_COMPLETE",
                "ResourceProperties": "{\"PublicAccessBlockConfiguration\":{\"RestrictPublicBuckets\":\"true\",\"BlockPublicPolicy\":\"true\",\"BlockPublicAcls\":\"true\",\"IgnorePublicAcls\":\"true\"},\"VersioningConfiguration\":{\"Status\":\"Enabled\"}}",
                "ClientRequestToken": "Console-CreateStack-46ec0d77-5d2b-3e2a-9c82-fcce5e60a2f5"
            },

Guard hooks revolutionize the way you enforce compliance and security policies within your AWS infrastructure. By leveraging a service-managed hook, you can seamlessly integrate policy validation into your CloudFormation deployments without additional operational overhead.

Take advantage of Guard hooks today to:

  • Simplify compliance enforcement.
  • Enhance your security posture.
  • Streamline your deployment workflows.
  • Focus on innovation rather than infrastructure management.

Ready to enhance your cloud compliance and security? Start integrating Guard hooks into your CloudFormation templates and experience a new level of simplicity and efficiency in policy enforcement.

Additional Resources

Brian Terry

Brian Terry, Senior WW Data & AI PSA, is an innovation leader with 20+ years of experience in technology and engineering. Pursuing a Ph.D. in Computer Science at the University of North Dakota. Brian has spearheaded generative AI projects, optimized infrastructure scalability, and driven partner integration strategies. He is passionate about leveraging technology to deliver scalable, resilient solutions that foster business growth and innovation

Delon Crammer

Delon Crammer

Delon Crammer is a Cloud Architect with expertise in Infrastructure and DevOps. He specializes in designing scalable, secure cloud architectures and delivering innovative, high-performing solutions tailored to complex technical needs.

Peek inside your AWS CloudFormation Deployments with timeline view

Post Syndicated from Idriss Laouali Abdou original https://aws.amazon.com/blogs/devops/peek-inside-your-aws-cloudformation-deployments-with-timeline-view/

AWS CloudFormation makes it easy to model and provision your cloud application infrastructure as code. CloudFormation templates can be written directly in JSON or YAML, or they can be generated by tools like the AWS Cloud Development Kit (CDK). These templates are submitted to the CloudFormation service and the resources are deployed together as stacks, in dependency order. Stack events can be viewed in a tabular format in the console, which provides fine-grained details about each event. And now there is a new feature that offers a more visually intuitive view of the events called the deployment timeline view, which provides a visualization of the sequence of actions CloudFormation took in a stack operation. This visual timeline shows you the exact order in which resources are provisioned, dependencies between resources, provisioning times, and likely root causes for any deployment failures. It complements the existing tabular stack events view by giving you additional context and visibility into what happens behind the scenes during deployments.

CloudFormation deployment timeline view feature overview

Figure 1: CloudFormation deployment timeline view feature overview

How It Works

To use the new deployment timeline view, simply initiate a stack create, update, or delete operation. In the AWS CloudFormation Console, choose the Events tab, then click the Timeline view tab next to Table View. As CloudFormation begins provisioning the resources defined in your template, you’ll see each resource operation appear as a bar in the timeline view. The resources are organized in a vertical stack, chronologically ordered with the most recent resource operation at the top. Each resource action is visualized horizontally, segmented by different color bars for each resource action:

CloudFormation deployment timeline view color bars for each resource action

In dark mode, in-progress rollback and completed rollback switch colors.

You can hover over any bar to see additional details like the full resource name and the start/end times of the resource action. If a deployment fails, CloudFormation will highlight the resource operation that was the likely root cause node in a red-and-white striped bar, which you can hover over to see the specific failure reason.

Creating a simple stack

Create a simple stack using CloudFormation Console. You will initiate a deployment and then explore the visual timeline view.

1. From the CloudFormation console, click Create stack and then choose with new resources.

CloudFormation’s create stack console button.

Figure 2: CloudFormation’s create stack console button 

2. In the Create stack wizard, click Choose an existing template and then choose Upload a template file. Click Choose file to select and upload the template file of the stack to deploy. In this blog you will use the template available here. Download the template or you can use your own template. If you decide to use your own template, it will result in a different deployment timeline view.

CloudFormation’s create stack console wizardFigure 3: CloudFormation’s create stack console wizard

The application stack includes the following key resources: Amazon EC2 instance (AWS::EC2::Instance), an Amazon VPC (AWS::EC2::VPC) and related VPC resources like subnets (AWS::EC2::Subnet) and an internet gateway (AWS::EC2::InternetGateway).

3. Once the template is uploaded to S3, click Next, then provide a stack name and parameters if needed. Complete the remaining stack deployment configuration steps, then initiate the stack creation operation.

4. On the stack events page, click the Deployment timeline tab next to Events.

You should now see the deployment timeline view rendering in real-time as CloudFormation provisions the VPC networking resources first, followed by the security group, subnet, and lastly, the EC2 instance.

CloudFormation in-progress stack deployment timeline viewFigure 4: Real-time CloudFormation in-progress deployment timeline view 

5. The timeline refreshes every five seconds to update the deployment progress. Five seconds later you have the below timeline view.

Updated CloudFormation deployment timeline after five second

Figure 5: Updated CloudFormation deployment timeline after five second

CloudFormation completed deployment timeline view Figure 6: CloudFormation completed deployment timeline view 

The above timeline shows, from the bottom to the top, bars representing each resource CloudFormation provisioned, with virtual vertical dependency lines showing the orders in which each resource operation occurred. The bars change colors as operations progress from blue (in-progress), to yellow consistency check, green (success) or red (failed).

CloudFormation completed deployment timeline view Figure 7: CloudFormation deployment timeline view – resource detail popover

Hover over any bar to see details like the start/end time of each deployment phase and the full resource name. The detailed information in the graph’s popover shows that CloudFormation created the InternetGateway resource in two seconds, then waited for 15-seconds to check if the resource was fully operational before marking as complete. This phase is called the resource consistency check phase, also known as the resource stabilization phase. It allows CloudFormation and other Infrastructure as Code (IaC) tools to ensure resilient deployments. To learn more, read this post about CloudFormation optimistic stabilization strategy.

Stack in a rollback complete state

When a stack deployment fails, CloudFormation rolls your stack back to its initial stable state before the current stack operation. The deployment timeline view below shows a failed stack operation in a complete rollback state. You can see the likely root cause resource highlighted in a red-and-white striped color so you can instantly identify it and start troubleshooting.

Deployment timeline view of a stack in rollback complete view 1

Deployment timeline view of a stack in rollback complete popover view 2

Figure 8: Deployment timeline view of a stack in rollback complete

Conclusion

The new CloudFormation deployment timeline view provides visibility into the orchestration flow and dependencies involved when CloudFormation provisions resources defined in your infrastructure-as-code templates. With this visual timeline view, you can quickly identify the root cause of deployment failures before operations complete and better understand the provisioning process. This feature is available now in all AWS regions where CloudFormation is supported. Visit the CloudFormation console to start using deployment timeline view.

About the authors:

Idriss Laouali Abdou

Idriss is a Senior Product Manager AWS, working on delivering the best experience for AWS IaC customers. Outside of work, you can either find him creating educational content helping thousands of students, cooking, or dancing.

Jamie To

Jamie is a Front End Engineer and has been delivering console features to AWS IaC customers for the last 3 years. Outside of work, Jamie enjoys drawing and playing foosball.

Puran Zhang

Puran has been contributing to IaC console as Front-end engineer for the last 4 years. Outside of work, you will likely find him in the kitchen, over the coffee bar, or on his hasty way to the next restaurant.

Efficiently monitor your On Demand Capacity Reservations (ODCR) by Grouping on CloudWatch Dimensions

Post Syndicated from aostan original https://aws.amazon.com/blogs/compute/efficiently-monitor-your-on-demand-capacity-reservations-odcr-by-grouping-on-cloudwatch-dimensions/

This post is written by Ballu Singh, Principal Solutions Architect at AWS, Ankush Goyal, Enterprise Support Lead in AWS Enterprise Support, Hasan Tariq, Principal Solutions Architect with AWS and Ninad Joshi, Senior Solutions Architect at AWS.

The On-Demand Capacity Reservations (ODCR) allows you to reserve compute capacity for your Amazon Elastic Compute Cloud (Amazon EC2) instances in a specific Availability Zone (AZ) for any duration. It makes sure that you always have access to your Amazon EC2 capacity when you need it. This is ideal for users who need to make sure their instances are available during critical events, even when it is stopped and restarted. Users can create ODCR anytime, without the need for a one or three-year term commitment.

In the post Automate the Creation of On-Demand Capacity Reservations for running EC2 instances, we discussed a solution for automating ODCR operations for existing EC2 instances. The post included creating, modifying, and canceling Capacity Reservations. We also showed monitoring Capacity Reservation usage using the Amazon CloudWatch metric InstanceUtilization, which indicates the percentage of reserved capacity currently in use. This metric is essential for effectively monitoring and optimizing your ODCR consumption.

On August 1st, 2024, AWS introduced new CloudWatch dimensions for Amazon EC2 ODCR. Using these enhancements you can now group CloudWatch metrics for ODCR by dimensions, such as InstanceType, AvailabilityZone, InstanceMatchCriteria, InstancePlatform, Tenancy, CapacityReservationId, or across the Capacity Reservations within a selected AWS Region. With these new dimensions, you no longer need to create a new alarm each time a new Capacity Reservation ID (CRID) is added. Furthermore, there is no longer a need to poll ODCR metadata using the describe-capacity-reservations AWS CLI command or API, because this information is now readily available through CloudWatch metrics.

This post shows you how to create CloudWatch alarms for ODCR using these new dimensions. The setup methodology helps you get the information directly in the CloudWatch console instead of having to call the DescribeCapacityReservations API or invoking the describe-capacity-reservationsCLI command.

Summary

  • The Prerequisites section outlines the necessary prerequisites and assumptions that should be completed before implementing the technical steps described later in this post. This includes any accounts, services, permissions, or configurations that need to be set up in advance.
  • The Setup section describes the specific scenario and infrastructure environment assumed for demonstrating the CloudWatch dimensions and alarms discussed in this post.
  • In the Implementation details section, we do a deep dive into the technical implementation, such as code snippets and step-by-step configurations for creating CloudWatch alarms for metric InstanceUtilization grouped by six dimensions outlined earlier.
  • The Cleaning up section provides steps to prevent ongoing charges after experimenting with the infrastructure and alarms created here.
  • Finally, in the Conclusion section, we recap the key points explored around CloudWatch, dimensions, metrics, and alarms. This content can serve as a solid foundation for implementing more advanced monitoring, optimization, and architectural best practices going forward.

Prerequisites

This solution needs you to complete the following prerequisites:

  1. Create Capacity Reservations in your account by following the ODCR Workshop self-paced lab. The solution needs scripts from this lab. If you are using any other Capacity Reservations for this lab, then you must use parameters according to your environment setup (for example AWS Region, AZ, platform, and instance match criteria)
  2. All the code used in this post is publicly available in the accompanying GitHub repository. Refer to the json included in the GitHub repository for the AWS Identity and Access Management (IAM) role permissions for IAM users used in the solution.
  3. Refer to the preceding GitHub repository for the code, and save the txt file in the same directory with other Python scripts. You may want to run the requirements.txt file if you don’t have appropriate dependencies to run the rest of the Python scripts. You can run this using the following command:

pip3 install -r requirements.txt

Setup

For this post, we have provided Python scripts to create CloudWatch alarms for Capacity Reservation usage metric for ODCR, for example InstanceUtilization. These alarms can be grouped using the new dimensions: InstanceType, AvailabilityZone, InstanceMatchCriteria, InstancePlatform, Tenancy, CapacityReservationId, or across the Capacity Reservations within a selected Region. We also created an Amazon Simple Notification Service (Amazon SNS) topic named ODCRAlarmTopic to notify you when there’s a breach with your CloudWatch alarm’s threshold.

To get started, download the scripts for creating a CloudWatch alarm using each aforementioned dimension from the GitHub repository in the Prerequisites section.

We envision a scenario where multiple Capacity Reservations exist across a Region in your AWS account. The goal is to identify any unused Capacity Reservations to optimize capacity usage and reduce unnecessary charges. Unused capacity can be identified by creating a CloudWatch alarm for the InstanceUtilization metric. The alarm can be grouped by one of six dimensions: AZ, Instance Match Criteria, Instance Type, InstancePlatform, Tenancy, or across the Capacity Reservations. You must set the alarm threshold that aligns to your usage optimization targets.

With Capacity Reservations, charges apply to any unused capacity. Users accept these charges because reservations provide capacity assurance. However, with improved reservation usage, users can make sure their reserved capacity is fully used. A triggered CloudWatch alarm signifies unused capacity. It can notify users to take near real-time action to optimize capacity and eliminate charges for unused reservations.

Implementation details

The following sections show the implementation details of alarms for each dimension.

For each alarm that we create in this section, you can set a threshold based on your usage optimization goals. For this post, we are setting the threshold to 75%. When these alarms are in place and the CloudWatch alarm breaches that threshold, the system enters an alarm state and sends an SNS notification to ODCRAlarmTopic. This process helps identify and address potential issues or opportunities for optimization related to the specific monitored dimension.

Creating CloudWatch alarm using AllCapacityReservations dimension

In this scenario, an organization is currently using the Capacity Reservations at 100% usage, but it needs to be notified when the total capacity usage drops to less than or equal to the threshold value. To do so, we use the InstanceUtilization metric for ODCR and group it with the AllCapacityReservations dimension. You can run the by_all_capacity_reservations.py script provided in the GitHub repository to create this CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

  • RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
  • EmailAddress: The email address to receive notifications (for example [email protected]).

Optional input parameters

  • Dimension (default: AllCapacityReservations): The dimension for the CloudWatch alarm.
  • MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
  • ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
  • Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
  • Protocol (default: email): The protocol for the SNS subscription.
  • TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_all_capacity_reservations.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to ODCRAlarmTopic>>

This creates a CloudWatch alarm that monitors InstanceUtilization for the Capacity Reservations in the Region you specified. You can confirm the alarm has been created by reviewing the by_all_capacity_reservations.log created in the same folder where you ran the script from. The following is an example log file content that confirms the creation of the alarm.

2024-10-17 19:32:34,922  __main__  INFO: Creating CloudWatch Alarm for the InstanceUtilization metric with AllCapacityReservations dimension in the us-east-1 region.

2024-10-17 19:32:35,514  __main__  INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:32:35,516  __main__  INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1: XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:32:35,986  __main__  INFO: Successfully created CloudWatch Alarm for the InstanceUtilization metric with AllCapacityReservations dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms and search for ODCRAlarm-InstanceUtilization-AllCapacityReservations.

Figure 1: Example AllCapacityReservations Alarm Setup validation using CloudWatch console

Figure 1: Example AllCapacityReservations Alarm Setup validation using CloudWatch console

Creating CloudWatch alarm using InstanceType dimension

After receiving an alert on total Capacity Reservations usage dropping below the threshold, you may want to view the usage drop by a specific instance type. To do so you can use the InstanceUtilization metric for ODCR and group it with the InstanceType dimension. You can use the by_instanceType.py script provided in the GitHub repository to create this CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

  • RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
  • EmailAddress: The email address to receive notifications (for example [email protected]).
  • InstanceType: The instance type for the CloudWatch alarm (for example t2.micro).

Optional input parameters

  • Dimension (default: InstanceType): The dimension for the CloudWatch alarm.
  • MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
  • ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
  • Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
  • Protocol (default: email): The protocol for the SNS subscription.
  • TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_instanceType.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --InstanceType t2.micro

This should create the InstanceType dimension alarm. You can confirm this by reviewing the by_instanceType.log created in the same folder where you ran the script from. The following is an example log file content that confirms the creation of this alarm.

2024-10-17 19:46:07,288  __main__  INFO: Creating CloudWatch Alarm for the InstanceUtilization metric with InstanceType dimension in the us-east-1 region.

2024-10-17 19:46:07,804  __main__  INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:46:07,804  __main__  INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:46:08,285  __main__  INFO: Successfully created CloudWatch Alarm for the InstanceUtilization metric with InstanceType dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms and, for example, search for ODCRAlarm-InstanceUtilization-InstanceType-t2.micro.

Figure 2: Example InstanceType Alarm Setup validation using CloudWatch console

Figure 2: Example InstanceType Alarm Setup validation using CloudWatch console

Creating CloudWatch alarm using AvailabilityZone dimension

After receiving an alert on total Capacity Reservations usage at the instance level dropping below the threshold, you may want to view the usage drop by a specific AZ. You can do so by using the InstanceUtilization metric for ODCR and group it with the AvailabilityZone dimension. You can use the by_availabilityZone.py script provided in the GitHub repository to create this CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

  • RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
  • EmailAddress: The email address to receive notifications (for example [email protected]).
  • AvailabilityZone: The AZ for the CloudWatch alarm (for example us-east-1a).

Optional input parameters

  • Dimension (default: AvailabilityZone): The dimension for the CloudWatch alarm.
  • MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
  • ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
  • Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
  • Protocol (default: email): The protocol for the SNS subscription.
  • TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_availabilityZone.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --AvailabilityZone us-east-1b

This should create the AvailabilityZone alarm. You can confirm this by reviewing the by_availabilityZone.log created in the same folder where you ran the script from. The following is an example log file content that confirms the creation of this alarm.

2024-10-17 19:38:39,141  __main__  INFO: Creating CloudWatch Alarm for the InstanceUtilization with AvailabilityZone dimension in the us-east-1 region.

2024-10-17 19:38:39,667  __main__  INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:38:39,667  __main__  INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:38:40,172  __main__  INFO: Successfully created CloudWatch Alarm for the InstanceUtilization metric with AvailabilityZone dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms, and search for ODCRAlarm-InstanceUtilization-AvailabilityZone-us-east-1b.

Figure 3: Example AvailabilityZone Alarm Setup validation using CloudWatch console

Figure 3: Example AvailabilityZone Alarm Setup validation using CloudWatch console

Create CloudWatch alarm using the InstancePlatform dimension

Based on workload requirements, organizations use different platforms such as Windows and Linux/UNIX for EC2 instances. They may want to be notified when the usage drops below threshold for a particular platform. To achieve this, we can use the InstanceUtilization metric for ODCR and group it with the InstancePlatform dimension. You can use the by_platform.py script provided in the GitHub repository to create the CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

  • RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
  • EmailAddress: The email address to receive notifications (for example [email protected]).
  • InstancePlatform: The InstancePlatform for the CloudWatch alarm. For exampleE: ‘Linux/UNIX’. Supported InstancePlatform are’Linux/UNIX’,’Red Hat Enterprise Linux’,’SUSE Linux’,’Windows’,’Windows with SQL Server’,’Windows with SQL Server Enterprise’,’Windows with SQL Server Standard’,’Windows with SQL Server Web’,’Linux with SQL Server Standard’,’Linux with SQL Server Web’,’Linux with SQL Server Enterprise’,’RHEL with SQL Server Standard’,’RHEL with SQL Server Enterprise’,’RHEL with SQL Server Web’,’RHEL with HA’,’RHEL with HA and SQL Server Standard’,’RHEL with HA and SQL Server Enterprise’,’Ubuntu Pro’

Optional input parameters

  • Dimension (default: InstancePlatform): The dimension for the CloudWatch alarm.
  • MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
  • ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
  • Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
  • Protocol (default: email): The protocol for the SNS subscription.
  • TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_platform.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --InstancePlatform Linux/Unix

This should create the InstancePlatform alarm. You can confirm this by reviewing the by_platform.log created in the same folder where you ran the script from. The following log entry shows a confirmation the creation of this alarm.

2024-10-17 19:52:03,839  __main__  INFO: Creating CloudWatch Alarm for the InstanceUtilization with Platform dimension in the us-east-1 region.

2024-10-17 19:52:04,345  __main__  INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:52:04,345  __main__  INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:52:04,854  __main__  INFO: Successfully created CloudWatch Alarm for the InstanceUtilization with Platform dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms, and search for ODCRAlarm-InstanceUtilization-Platform-Linux/Unix.

Figure 4: Example AvailabilityZone Alarm Setup validation using CloudWatch console

Figure 4: Example AvailabilityZone Alarm Setup validation using CloudWatch console

Creating CloudWatch alarm using the InstanceMatchCriteria dimension

Capacity Reservations are configured as either open or targeted. If the Capacity Reservation is open, then the new and existing instances that have matching attributes automatically run in the capacity of the Capacity Reservation. If the Capacity Reservation is targeted, then instances must specifically target it to run in the reserved capacity. Organizations using these configurations may want to be notified when instance usage drops in either open or targeted Capacity Reservations. To achieve this, we use the InstanceUtilization metric for ODCR and group it with the InstanceMatchCriteria dimension. You can use the by_instanceMatchCriteria.py script provided in the GitHub repository to create the CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

  • RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
  • EmailAddress: The email address to receive notifications (for example [email protected]).
  • InstanceMatchCriteria: The tenancy for the CloudWatch alarm. Supported values are ‘open’ and ‘targeted’.

Optional input parameters

  • Dimension (default: InstanceMatchCriteria): The dimension for the CloudWatch alarm.
  • MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
  • ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
  • Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
  • Protocol (default: email): The protocol for the SNS subscription.
  • TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_instanceMatchCriteria.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --InstanceMatchCriteria open

This should create the InstanceMatchCriteria alarm. You can confirm this by reviewing the by_instanceMatchCriteria.log created in the same folder where you ran the script from. The following log entry confirms the creation of such alarm.

2024-10-17 19:43:25,463  __main__  INFO: Creating CloudWatch Alarm for the InstanceUtilization with InstanceMatchCriteria dimension in the us-east-1 region.

2024-10-17 19:43:25,996  __main__  INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:43:25,996  __main__  INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:43:26,552  __main__  INFO: Successfully created CloudWatch Alarm for the InstanceUtilization metric with InstanceMatchCriteria dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms, and search for ODCRAlarm-InstanceUtilization-InstanceMatchCriteria-open.

Figure 5: Example InstanceMatchCriteria Alarm Setup validation using CloudWatch console

Figure 5: Example InstanceMatchCriteria Alarm Setup validation using CloudWatch console

Creating CloudWatch alarm using the Tenancy dimension

By default, EC2 instances run on shared tenancy hardware. However, if users want, they can also choose dedicated tenancy. Organizations using both types of tenancy for their workload may want to be notified when instance usage drops in either of these tenancies. To achieve this, we use the InstanceUtilization metric for ODCR and group it with the Tenancy dimension. You can run the by_tenancy.py script provided in the GitHub repository to create the CloudWatch alarm.

Prior to running the script, you must determine the following parameters:

Necessary input parameters

  • RegionName: The Region where the CloudWatch alarm should be created (for example us-east-1).
  • EmailAddress: The email address to receive notifications (for example [email protected]).
  • Tenancy: The tenancy for the CloudWatch alarm. Supported Tenancy are ‘default’ and ‘dedicated’.

Optional input parameters

  • Dimension (default: Tenancy): The dimension for the CloudWatch alarm.
  • MetricName (default: InstanceUtilization): The metric name for the CloudWatch alarm.
  • ComparisonOperator (default: LessThanOrEqualToThreshold): The comparison operator for the CloudWatch alarm.
  • Threshold (default: 75.0): The threshold value for the CloudWatch alarm.
  • Protocol (default: email): The protocol for the SNS subscription.
  • TopicName (default: ODCRAlarmTopic): The name of the SNS topic.

After you’ve determined your input parameters, run the following Python script with your desired parameters to set up the alarm:

python3 by_instanceMatchCriteria.py --RegionName <<Insert here the region where you have the Capacity Reservations>> --EmailAddress <<Insert here the email address subscribed to the ODCRAlarmTopic>> --Tenancy default

This should create the Tenancy dimension alarm successfully. You can confirm this by reviewing the by_tenancy.log created in the same folder where you ran the script from. The following entry confirms the creation of the alarm.

2024-10-17 19:56:14,331  __main__  INFO: Creating CloudWatch Alarm for the InstanceUtilization with Tenancy dimension in the us-east-1 region.

2024-10-17 19:56:14,809  __main__  INFO: The SNS topic 'ODCRAlarmTopic' already exists with ARN: arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:56:14,810  __main__  INFO: Please ensure you have subscribed to the SNS Topic arn:aws:sns:us-east-1:XXXXXXXXXXXX:ODCRAlarmTopic.

2024-10-17 19:56:15,287  __main__  INFO: Successfully created CloudWatch Alarm for the InstanceUtilization with Tenancy dimension in the us-east-1 region.

You can also validate in the CloudWatch console. Choose All alarms and search for ODCRAlarm-InstanceUtilization-Tenancy-default.

Figure 6: Example Tenancy Dimension Alarm Setup validation using CloudWatch console

Figure 6: Example Tenancy Dimension Alarm Setup validation using CloudWatch console

Other options

As of this post’s publication, there is no native support to create a CloudWatch alarm on two dimensions. However, you can create a custom CloudWatch metric and create an alarm on that metric.

Cleaning up

If you used the ODCR workshop to create Capacity Reservations in your account, then follow the Clean-up step of the workshop to delete the Capacity Reservations and EC2 instances to stop incurring any charges. If you created any other EC2 instances or Capacity Reservations for this post, terminate those EC2 instances and cancel those Capacity Reservations.

To delete the alarms you created in this post, follow these steps given in the CloudWatch documentation.

Conclusion

In this post, we explored how to use the new Amazon CloudWatch dimensions for Amazon EC2 ODCR to efficiently monitor and maintain constant Capacity Reservations and achieve a higher level of usage, thereby saving costs associated with unused capacity. By automating the creation of CloudWatch alarms for Capacity Reservation usage metrics, specifically InstanceUtilization, you can gain more granular insights into your reserved capacity. This includes grouping metrics by Instance Type, Availability Zone, Platform, Instance Match Criteria, Tenancy, or across the Capacity Reservations in a Region.

We also used an Amazon SNS topic to receive near-real time alerts when thresholds are breached. These tools enable you to effectively monitor and optimize your ODCR usage, making sure that you maintain efficient and cost-effective capacity management during critical events.

For more details, refer to the updated Capacity Reservations documentation. If you have any questions or feedback, feel free to share them in the comments section or contact AWS Support.

Author Bios

Ballu Singh

Ballu Singh

Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay area and helps users architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.

Ankush Goyal

Ankush Goyal

Ankush is an Enterprise Support Lead in AWS Enterprise Support who helps Enterprise Support users streamline their cloud operations on AWS. He enjoys working with users to help them design, implement, and support cloud infrastructure. He is a results-driven IT professional with over 20 years of experience.

Hasan Tariq

Hasan Tariq

Hasan Tariq is a Principal Solutions Architect with AWS. He helps Financial Services users accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions.

Ninad Joshi

Ninad Joshi

Ninad Joshi is a Senior Solutions Architect at AWS, helping global AWS users design secure, scalable, and cost-effective solutions in cloud to solve their complex real-world business challenges. Ninad specializes in AI/ML and Generative AI. Prior to joining AWS, Ninad worked as a software developer for 13+ years. Outside of his professional endeavors, Ninad enjoys playing chess and exploring different gambits.

 

Instant Well-Architected CDK Resources with Solutions Constructs Factories

Post Syndicated from Biff Gaut original https://aws.amazon.com/blogs/devops/instant-well-architected-cdk-resources-with-solutions-constructs-factories/

For several years, AWS Solutions Constructs have helped thousands of AWS Cloud Development Kit (CDK) users accelerate their creation of well-architected workloads by providing small, composable patterns linking two or more AWS services, such as an Amazon S3 bucket triggering an AWS Lambda function. Over this time, customers with use cases that don’t match an existing Solutions Construct have expressed a desire to create individual well-architected resources directly. Solutions Constructs Factories allow clients to create well-architected individual resources using the same internal code that Solutions Constructs use when composing larger patterns. While deploying a single AWS resource using the AWS CDK is often a trivial task, deploying that resource following all the best practices requires more knowledge and effort. For instance, a properly configured S3 bucket should include versioning, encryption, access logging, bucket policies allowing only TLS calls, and lifecycle policies. The AWS Solutions Constructs S3BucketFactory()method implements a fully well-architected CDK S3 bucket with all the best practices configured, including an additional bucket to hold S3 Access Logs. At the same time, every aspect of each resource can be overridden to satisfy any unique requirements for your use case. Best of all, the resources created are all standard CDK L2 constructs that integrate with any CDK stack.

Solutions Constructs Factories are a new feature of AWS Solutions Constructs, a library of common architectural patterns built on top of the AWS CDK. These multi-service patterns allow you to deploy multiple resources with a single CDK Construct. Solutions Constructs follow best practices by default – both for the configuration of the individual resources as well as their interaction. While each Solutions Construct implements a very small architectural pattern, they are designed so that multiple constructs can be combined by sharing a common resource. For instance, a Solutions Construct that implements an S3 bucket invoking a Lambda function can be deployed along with a second Solutions Construct that deploys a Lambda function that writes to an Amazon SQS queue by sharing the same Lambda function between the two constructs. There are currently over 70 Solutions Constructs available, covering Amazon API Gateway, Amazon Simple Storage Service (S3), Amazon SNS, Amazon Simple Queue Service (SQS), AWS Lambda, AWS Secrets Manager, IoT, Amazon ElasticCache, AWS Step Functions, AWS Fargate, AWS WAF, Application Load Balancers, Amazon Kinesis, Amazon Data Firehose, Amazon Sagemaker, AWS Elemental Mediastore, and more.

The AWS CDK provides a programming model above the static AWS CloudFormation template, representing all AWS resources with instantiated objects in a high-level programming language. When you instantiate CDK objects in your Typescript (or other language) code, the CDK “compiles” those objects into a JSON template, then deploys that template with CloudFormation. The use of programming languages such as Typescript or Python rather than the declarative YAML or JSON allows much more flexibility in defining your infrastructure. If you are not yet familiar with the AWS CDK you should check it out.

Visual representation of how AWS Solutions Constructs build abstractions upon the AWS CDK, which is then compiled into static CloudFormation templates. Solutions Constructs Factories are a feature found within AWS Solutions Constructs

Infrastructure as Code Abstraction Layers

How Constructs Factories Work

This demo will guide you through building a small CDK stack that uses one of the new Solutions Constructs Factories. The app will use a factory to create an S3 bucket that fully implements all the recommended best practices with a single function call.

First, create a Typescript CDK app and install the Solutions Constructs libraries:

md factories-blog
cd factories-blog
cdk init -l=typescript
npm install @aws-solutions-constructs/aws-constructs-factories

Next, open lib/factories-blog-stack.ts, delete the comments and add the indicated new lines of code:

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
// Add this import statement:
import { ConstructsFactories } from '@aws-solutions-constructs/aws-constructs-factories';

export class FactoriesBlogStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Add these two lines
    const factories = new ConstructsFactories(this, 'constructs-factories');
    const response = factories.s3BucketFactory('default-bucket', {});
  }
}

Now build and deploy the app:

npm run build
cdk deploy

A key consideration for Constructs Factories is that the resources they create integrate seamlessly with the rest of your CDK stack. That’s why factories are implemented as methods that return standard AWS CDK L2 constructs rather than L3 constructs. Instantiating the ConstructsFactories object gives you access to the factory methods, but on its own adds nothing to your stack. In this case you called s3BucketFactory(), passing two arguments: a string id upon which all resource names will be based, and an empty S3BucketFactoryProps object. The method added 4 resources to your CDK stack. You can explore everything this app just deployed either in the console or through the synthesized template in ./cdk.out. Here’s a synopsis of what you’ll find:

  • An S3 bucket for storage (the goal of the call), with the following configuration:
    • AES-256 server side encryption
    • All public ACLs and policies blocked
    • Versioning enabled
    • A bucket policy blocking all access not via aws:SecureTransport (TSL)
    • Access logging enabled to a separate bucket
    • A lifecycle rule transitioning non-current versions to Glacier after 90 days
  • A second S3 bucket to contain the access logs, with a similar configuration:
    • AES-256 server side encryption
    • All public ACLs and policies blocked
    • Versioning enabled
    • A bucket policy blocking all access not via aws:SecureTransport (TSL)

A single call has deployed two resources with all best practices configured by default! As most workloads have some unique requirements, all of these resources can be customized to integrate into the overall stack. Diving a little deeper, recall the empty S3BucketFactoryProps object you passed to the factory call. The S3BucketFactoryProps argument allows you to send any properties to override the default settings for the main bucket and the logging bucket. For instance, if you need the bucket to send notifications to EventBridge you can enable that through the properties:

const response = factories.s3BucketFactory('customized-bucket', {
  bucketProps: {
    eventBridgeEnabled:true
  }
});

Second, the factory call returned an S3FactoryResponse object, which exposes standard CDK L2 Bucket constructs for both the main and logging buckets (you can also access the new BucketPolicies through these Bucket constructs).

const arn = response.s3Bucket.bucketArn;

Available Factories

The first release of Solutions Constructs Factories focuses on services where configuring a resource according to best practices requires the launch of additional resources, or other additional complexity. We currently support 3 different resources:

S3 Buckets

The s3BucketFactory() function will create an S3 bucket with:

  • TLS access required
  • Access logging enabled (to an additional log bucket by default)
  • Versioning enabled
  • All public ACLs and policies blocked
  • AWS managed server side encryption
  • Lifecycle policies that transition non-current versions to S3 Glacier after 90 days

Step Functions State Machines

The stateMachineFactory() function will create a Step Functions state machine with:

  • CloudWatch logs names prefixed with /aws/vendedlogs/ and unique to your stack, avoiding any issues with resource policy size without risk of name collisions in your account.
  • CloudWatch alarms for:
    • 1 or more failed executions
    • 1 or more throttled executions
    • 1 or more aborted executions

SQS Queues

The sqsQueueFactory() function will create an SQS queue with:

  • KMS managed encryption
  • A DLQ configured to accept messages that can’t be processed
  • A resource policy ensuring that only the queue owner can perform operations on the queue

Conclusion

A single call to an AWS Solutions Constructs Factory deploys a resource with all best practice settings, as well as any additional infrastructure required to make the resource part of a well-architected application. Since the factories return standard AWS CDK L2 constructs, you can use them in any new or existing CDK stack. The first Solutions Constructs Factories were launched earlier this year and more will be added soon. If you have factories you’d like to see implemented, enter an Issue on the Solutions Constructs github page. The next time you launch any of these three resources in a stack, try using a Constructs factory – you may never directly instantiate a CDK construct again.

AWS CloudFormation Linter (cfn-lint) v1

Post Syndicated from Kevin DeJong original https://aws.amazon.com/blogs/devops/aws-cloudformation-linter-v1/

Introduction

The CloudFormation Linter, cfn-lint, is a powerful tool designed to enhance the development process of AWS CloudFormation templates. It serves as a static analysis tool that checks CloudFormation templates for potential errors and best practices, ensuring that your infrastructure as code adheres to AWS best practices and standards. With its comprehensive rule set and customizable configuration options, cfn-lint provides developers with valuable insights into their CloudFormation templates, helping to streamline the deployment process, improve code quality, and optimize AWS resource utilization.

What’s Changing?

With cfn-lint v1, we are introducing a set of major enhancements that involve breaking changes. This upgrade is particularly significant as it converts from using the CloudFormation spec to using CloudFormation registry resource provider schemas. This change is aimed at improving the overall performance, stability, and compatibility of cfn-lint, ensuring a more seamless and efficient experience for our users.

Key Features of cfn-lint v1

  1. CloudFormation Registry Resource Provider Schemas: The migration to registry schemas brings a more robust and standardized approach to validating CloudFormation templates, offering improved accuracy in linting. We use additional data sources like the AWS pricing API and botocore (the foundation to the AWS CLI and AWS SDK for Python (Boto3)) to improve the schemas and increase the accuracy of our validation. We extend the schemas with additional keywords and logic to extend validation from the schemas.
  2. Rule Simplification: for this upgrade, we rewrote over 100 rules. Where possible, we rewrote rules to leverage JSON schema validation, which allows us to use common logic across rules. The result is that we now return more common error messages across our rules.
  3. Region Support: cfn-lint supports validation of resource types across regions. v1 expands this validation to check resource properties across all unique schemas for the resource type.

Transition Guidelines

To facilitate a seamless transition, we advise following these steps:

Review Templates

While we aim to preserve backward compatibility, we recommend reviewing your CloudFormation templates to ensure they align with the latest version. This step helps preempt any potential issues in your pipeline or deployment processes. If necessary, you can enforce pinning to cfn-lint v0 by running pip install --upgrade "cfn-lint<1"

Handling cfn-lint configurations

Throughout the process of rewriting rules, we’ve restructured some of the logic. Consequently, if you’ve been ignoring a specific rule, it’s possible that the logic associated with it has shifted to a new rule. As you transition to v1, you may need to adjust your template ignore rules configuration accordingly. Here is a subset of some of the changes with a focus on some of the more significant changes.

  • In v0, rule E3002 validated valid resource property names but it also validated object and array type checks. In v1 all type checks are now in E3012.
  • In v0, rule E3017 validated that when a property had a certain value other properties may be required. This validation has been rewritten into individual rules. This should allow more flexibility in ignoring and configuring rules.
  • In v0, rule E2522 validated when at least one of a list of properties is required. That logic has been moved to rule E3015.
  • In v0, rule E2523 validated when only one property from a list is required. That logic has been moved to rule E3014.

Adapting extensions to cfn-lint

If you’ve extended cfn-lint with custom rules or utilized it as a library, be aware that there have been some API changes. It’s advisable to thoroughly test your rules and packages to ensure consistency as you upgrade to v1.

Upgrade to cfn-lint v1

Upon the release of the new version, we highly recommend upgrading to cfn-lint v1 to capitalize on its enriched features and improvements. You can upgrade using pip by running pip install --upgrade cfn-lint.

Stay Updated

Keep yourself informed by monitoring our communication channels for announcements, release notes, and any additional information pertinent to cfn-lint v1. You can follow us on Discord. cfn-lint is an open source solution so you can submit issues on GitHub or follow our v1 discussion on GitHub.

Dependencies

cfn-lint v1 uses Python optional dependencies to reduce the amount of dependencies we install for standard usage. If you want to leverage features like graph, or output formats junit and sarif, you will have to change your install commands.

  • pip install cfn-lint[graph] – will include pydot to create graphs of resource dependencies using --build-graph
  • pip install cfn-lint[junit] – will include the packages to output JUnit using --output junit
  • pip install cfn-lint[sarif] – will include the packages to output SARIF using --output sarif

cfn-lint v0 support

We will continue to update and support cfn-lint v0 until early 2025. This includes regular releases to new CloudFormation spec files. We will only add new features into v1.

Thank You for Your Continued Support

We appreciate your continued trust and support as we work to enhance cfn-lint. Our team is committed to providing you with the best possible experience, and we believe that cfn-lint v1 will elevate your CloudFormation template development process.

If you have any questions or concerns, please don’t hesitate to reach out on our GitHub page.

Kevin DeJong

Kevin DeJong is a Developer Advocate – Infrastructure as Code at AWS. He is creator and maintainer of cfn-lint. Kevin has been working with the CloudFormation service for over 6+ years.

Simplify AWS CloudTrail log analysis with natural language query generation in CloudTrail Lake (preview)

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/simplify-aws-cloudtrail-log-analysis-with-natural-language-query-generation-in-cloudtrail-lake-preview/

Today, I am happy to announce in preview the generative artificial intelligence (generative AI)–powered natural language query generation in AWS CloudTrail Lake, which is a managed data lake for capturing, storing, accessing, and analyzing AWS CloudTrail activity logs to meet compliance, security, and operational needs. You can ask a question using natural language about these activity logs (management and data events) stored in CloudTrail Lake without having the technical expertise to write a SQL query or spend time to decode the exact structure of activity events. For example, you might ask, “Tell me how many database instances are deleted without a snapshot”, and the feature will convert that question to a CloudTrail Lake query, which you can run as-is or modify to get the requested event information. Natural language query generation makes the process of exploration of AWS activity logs simpler.

Now, let me show you how to start using natural language query generation.

Getting started with natural language query generation
The natural language query generator uses generative AI to produce a ready-to-use SQL query from your prompt, which you can then choose to run in the query editor of CloudTrail Lake.

In the AWS CloudTrail console, I choose Query under Lake. The query generator can only generate queries for event data stores that collect CloudTrail management and data events. I choose an event data store for my CloudTrail Lake query from the dropdown list in Event data store. In the Query generator, I enter the following prompt in the Prompt field using natural language:

How many errors were logged during the past month?

Then, I choose Generate query. The following SQL query is automatically generated:

SELECT COUNT(*) AS error_count
FROM 8a6***
WHERE eventtime >= '2024-04-21 00:00:00'
    AND eventtime <= '2024-05-21 23:59:59'
    AND (
        errorcode IS NOT NULL
        OR errormessage IS NOT NULL
    )

I choose Run to see the results.

This is interesting, but I want to know more details. I want to see which services had the most errors and why these actions were erroring out. So I enter the following prompt to request additional details:

How many errors were logged during the past month for each service and what was the cause of each error?

I choose Generate query, and the following SQL query is generated:

SELECT eventsource,
    errorcode,
    errormessage,
    COUNT(*) AS errorCount
FROM 8a6***
WHERE eventtime >= '2024-04-21 00:00:00'
    AND eventtime <= '2024-05-21 23:59:59'
    AND (
        errorcode IS NOT NULL
        OR errormessage IS NOT NULL
    )
GROUP BY 1,
    2,
    3
ORDER BY 4 DESC;

I choose Run to see the results.

In the results, I see that my account experiences most number of errors related to Amazon S3, and top errors are related to CORS and object level configuration. I can continue to dig deeper to see more details by asking further questions. But now let me give natural language query generator another instruction. I enter the following prompt in the Prompt field:

What are the top 10 AWS services that I used in the past month? Include event name as well.

I choose Generate query, and the following SQL query is generated. This SQL statement retrieves the field names (eventSource,
eventName, COUNT(*) AS event_count), restricts the rows with the date interval of the past month in the WHERE clause, groups the rows by eventSource and eventName, sorts them by the usage count, and limit the result to 10 rows as I requested in a natural language.

SELECT eventSource,
    eventName,
    COUNT(*) AS event_count
FROM 8a6***
WHERE eventTime >= timestamp '2024-04-21 00:00:00'
    AND eventTime <= timestamp '2024-05-21 23:59:59'
GROUP BY 1,
    2
ORDER BY 3 DESC
LIMIT 10;

Again, I choose Run to see the results.

I now have a better understanding of how many errors were logged during the past month, what service the error was for, and what caused the error. You can try asking questions in plain language and run the generated queries over your logs to see how this feature works with your data.

Join the preview
Natural language query generation is available in preview in the US East (N. Virginia) Region as part of CloudTrail Lake.

You can use natural language query generation in preview for no additional cost. CloudTrail Lake query charges apply when running the query to generate results. For more information, visit AWS CloudTrail Pricing.

To learn more and get started using natural language query generation, visit AWS CloudTrail Lake User Guide.

— Esra

Quickly adopt new AWS features with the Terraform AWS Cloud Control provider

Post Syndicated from Welly Siauw original https://aws.amazon.com/blogs/devops/quickly-adopt-new-aws-features-with-the-terraform-aws-cloud-control-provider/

Introduction

Today, we are pleased to announce the general availability of the Terraform AWS Cloud Control (AWS CC) Provider, enabling our customers to take advantage of AWS innovations faster. AWS has been continually expanding its services to support virtually any cloud workload; supporting over 200 fully featured services and delighting customers through its rapid pace of innovation with over 3,400 significant new features in 2023. Our customers use Infrastructure as Code (IaC) tools such as HashiCorp Terraform among others as a best-practice to provision and manage these AWS features and services as part of their cloud infrastructure at scale. With the Terraform AWS CC Provider launch, AWS customers using Terraform as their IaC tool can now benefit from faster time-to-market by building cloud infrastructure with the latest AWS innovations that are typically available on the Terraform AWS CC Provider on the day of launch. For example, AWS customer Meta’s Oculus Studios was able to quickly leverage Amazon GameLift to support their game development. “AWS and Hashicorp have been great partners in helping Oculus Studios standardize how we deploy our GameLift infrastructure using industry best practices.” said Mick Afaneh, Meta’s Oculus Studios Central Technology.

The Terraform AWS CC Provider leverages AWS Cloud Control API to automatically generate support for hundreds of AWS resource types, such as Amazon EC2 instances and Amazon S3 buckets. Since the AWS CC provider is automatically generated, new features and services on AWS can be supported as soon as they are available on AWS Cloud Control API, addressing any coverage gaps in the existing Terraform AWS standard provider. This automated process allows the AWS CC provider to deliver new resources faster because it does not have to wait for the community to author schema and resource implementations for each new service. Today, the AWS CC provider supports 950+ AWS resources and data sources, with more support being added as AWS service teams continue to adopt the Cloud Control API standard.

As a Terraform practitioner, using the AWS CC Provider would feel familiar to the existing workflow. You can employ the configuration blocks shown below, while specifying your preferred region.

terraform {
  required_providers {
    awscc = {
      source  = "hashicorp/awscc"
      version = "~> 1.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "awscc" {
  region = "us-east-1"
}

provider "aws" {
  region = "us-east-1"
}

During Terraform plan or apply, the AWS CC Terraform provider interacts with AWS Cloud Control API to provision the resources by calling its consistent Create, Read, Update, Delete, or List (CRUD-L) APIs.

AWS Cloud Control API

AWS service teams own, publish, and maintain resources on the AWS CloudFormation Registry using a standardized resource model. This resource model uses uniform JSON schemas and provisioning logic that codifies the expected behavior and error handling associated with CRUD-L operations. This resource model enables AWS service teams to expose their service features in an easily discoverable, intuitive, and uniform format with standardized behavior. Launched in September 2021, AWS Cloud Control API exposes these resources through a set of five consistent CRUD-L operations without any additional work from service teams. Using Cloud Control API, developers can manage the lifecycle of hundreds of AWS and third-party resources with consistent resource-oriented API instead of using distinct service-specific APIs. Furthermore, Cloud Control API is up-to-date with the latest AWS resources as soon as they are available on the CloudFormation Registry, typically on the day of launch. You can read more on launch day requirement for Cloud Control API in this blog post. This enables AWS Partners such as HashiCorp to take advantage of consistent CRUD-L API operations and integrate Terraform with Cloud Control API just once, and then automatically access new AWS resources without additional integration work.

History and Evolution of the Terraform AWS CC Provider

The general availability of Terraform AWS CC Provider project is a culmination of 4+ years of collaboration between AWS and HashiCorp. Our teams partnered across the Product, Engineering, Partner, and Customer Support functions in influencing, shaping, and defining the customer experience leading up to the the technical preview announcement of the AWS CC provider in September 2021. At technical preview, the provider supported more than 300 resources. Since then, we have added an additional 600+ resources to the provider, bringing the total to 950+ supported resources at general availability.

Beyond just increasing resource coverage, we gathered additional signals from customer feedback during the technical preview and rolled out several improvements since September 2021. Customers care deeply about the user experience on the providers available on the Terraform registry. Customers sought practical examples in the form of sample HCL configurations for each resource that they could use to immediately test in order to confidently start using the provider. This prompted us to enrich the AWS CC provider with hundreds of practical examples for popular AWS CC provider resources in the Terraform registry. This was made possible by contributions of hundreds of Amazonians who became early adopters of the AWS CC provider. We also published a how-to guide for anyone interested in contributing to AWS CC provider examples. Furthermore, customers also wanted to minimize context switching by moving between Terraform and AWS service documentation on what each attribute of a resource signified and the type of values it needed as part of configuration. This empowered us to prioritize augmenting the provider with rich resource attribute description with information taken from AWS documentation. The documentation provides detailed information of how to use the attributes, enumerations of the accepted attribute values and other relevant information for dozens of popularly used AWS resources.

We also worked with HashiCorp on various bug fixes and feature enhancements for the AWS CC provider, as well as the upstream Cloud Control API dependencies. We improved handling for resources with complex nested attribute schemas, implemented various bug fixes to resolve unintended resource replacement, and refined provider behavior under various conditions to support the idempotency expected by Terraform practitioners. While this are not an exhaustive list of improvements, we continue to listen to customer feedback and iterate on improving the experience. We encourage you to try out the provider and share feedback on the AWS CC provider’s GitHub page.

Using the AWS CC Provider

Let’s take an example of a recently introduced service, Amazon Q Business, a fully managed, generative AI-powered assistant that you can configure to answer questions, provide summaries, generate content, and complete tasks based on your enterprise data. Amazon Q Business resources were available in AWS CC provider shortly after the April 30th 2024 launch announcement. In the following example, we’ll create a demo Amazon Q Business application and deploy the web experience.

data "aws_caller_identity" "current" {}

data "aws_ssoadmin_instances" "example" {}

resource "awscc_qbusiness_application" "example" {
  description                  = "Example QBusiness Application"
  display_name                 = "Demo_QBusiness_App"
  attachments_configuration    = {
    attachments_control_mode = "ENABLED"
  }
  identity_center_instance_arn = data.aws_ssoadmin_instances.example.arns[0]
}

resource "awscc_qbusiness_web_experience" "example" {
  application_id              = awscc_qbusiness_application.example.id
  role_arn                    = awscc_iam_role.example.arn
  subtitle                    = "Drop a file and ask questions"
  title                       = "Demo Amazon Q Business"
  welcome_message             = "Welcome, please enter your questions"
}

resource "awscc_iam_role" "example" {
  role_name   = "Amazon-QBusiness-WebExperience-Role"
  description = "Grants permissions to AWS Services and Resources used or managed by Amazon Q Business"
  assume_role_policy_document = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "QBusinessTrustPolicy"
        Effect = "Allow"
        Principal = {
          Service = "application.qbusiness.amazonaws.com"
        }
        Action = [
          "sts:AssumeRole",
          "sts:SetContext"
        ]
        Condition = {
          StringEquals = {
            "aws:SourceAccount" = data.aws_caller_identity.current.account_id
          }
          ArnEquals = {
            "aws:SourceArn" = awscc_qbusiness_application.example.application_arn
          }
        }
      }
    ]
  })
  policies = [{
    policy_name = "qbusiness_policy"
    policy_document = jsonencode({
      Version = "2012-10-17"
      Statement = [
        {
          Sid = "QBusinessConversationPermission"
          Effect = "Allow"
          Action = [
            "qbusiness:Chat",
            "qbusiness:ChatSync",
            "qbusiness:ListMessages",
            "qbusiness:ListConversations",
            "qbusiness:DeleteConversation",
            "qbusiness:PutFeedback",
            "qbusiness:GetWebExperience",
            "qbusiness:GetApplication",
            "qbusiness:ListPlugins",
            "qbusiness:GetChatControlsConfiguration"
          ]
          Resource = awscc_qbusiness_application.example.application_arn
        }
      ]
    })
  }]
}

As you see in this example, you can use both the AWS and AWS CC providers in the same configuration file. This allows you to easily incorporate new resources available in the AWS CC provider into your existing configuration with minimal changes. The AWS CC provider also accepts the same authentication method and provider-level features available in the AWS provider. This means you don’t have to add additional configuration in your CI/CD pipeline to start using the AWS CC provider. In addition, you can also add custom agent information inside the provider block as described in this documentation.

Things to know

The AWS CC provider is unique due to how it was developed and its dependencies with Cloud Control API and AWS resource model in the CloudFormation registry. As such, there are things that you should know before you start using the AWS CC provider.

  • The AWS CC provider is generated from the latest CloudFormation schemas, and will release weekly containing all new AWS services and enhancements added to Cloud Control API.
  • Certain resources available in the CloudFormation schema are not compatible with the AWS CC provider due to nuances in the schema implementation. You can find them on the GitHub issue list here. We are actively working to add these resources to the AWS CC provider.
  • The AWS CC provider requires Terraform CLI version 1.0.7 or higher.
  • Every AWS CC provider resource includes a top-level attribute `id` that acts as the resource identifier. If the CloudFormation resource schema also has a similarly named top-level attribute `id`, then that property is mapped to a new attribute named `<type>_id`. For example `web_experience_id` for `awscc_qbusiness_web_experience` resource.
  • If a resource attribute is not defined in the Terraform configuration, the AWS CC provider will honor the default values specified in the CloudFormation resource schema. If the resource schema does not include a default value, AWS CC provider will use attribute value stored in the Terraform state (taken from Cloud Control API GetResponse after resource was created).
  • In correlation to the default value behavior as stated above, when an attribute value is removed from the Terraform configuration (e.g. by commenting the attribute), the AWS CC provider will use the previous attribute value stored in the Terraform state. As such, no drift will be detected on the resource configuration when you run Terraform plan / apply.
  • The AWS CC provider data sources are either plural or singular with filters based on `id` attribute. Currently there is no native support for metadata sources such as `aws_region` or `aws_caller_identity`. You can continue to leverage the AWS provider data sources to complement your Terraform configuration.

If you want to dive deeper into AWS CC provider resource behavior, we encourage you to check the documentation here.

Conclusion

The AWS CC provider is now generally available and will be the fastest way for customers to access newly launched AWS features and services using Terraform. We will continue to add support for more resources, additional examples and enriching the schema descriptions. You can start using the AWS CC provider alongside your existing AWS standard provider. To learn more about the AWS CC provider, please check the HashiCorp announcement blog post. You can also follow the workshop on how to get started with AWS CC provider. If you are interested in contributing with practical examples for AWS CC provider resources, check out the how-to guide. For more questions or if you run into any issues with the new provider, don’t hesitate to submit your issue in the AWS CC provider GitHub repository.

Authors

Manu Chandrasekhar

Manu is an AWS DevOps consultant with close to 19 years of industry experience wearing QA/DevOps/Software engineering and management hats. He looks to enable teams he works with to be self-sufficient in
modelling/provisioning Infrastructure in cloud and guides them in cloud adoption. He believes that by improving the developer experience and reducing the barrier of entry to any technology with the advancements in automation and AI, software deployment and delivery can be a non-event.

Rahul Sharma

Rahul is a Principal Product Manager-Technical at Amazon Web Services with over three and a half years of cumulative product management experience spanning Infrastructure as Code (IaC) and Customer Identity and Access Management (CIAM) space.

Welly Siauw

As a Principal Partner Solution Architect, Welly led the co-build and co-innovation strategy with AWS ISV partners. He is passionate about Terraform, Developer Experience and Cloud Governance. Welly joined AWS in 2018 and carried with him almost 2 decades of experience in IT operations, application development, cyber security, and oil exploration. In between work, he spent time tinkering with espresso machines and outdoor hiking.

AWS Weekly Roundup — Claude 3 Haiku in Amazon Bedrock, AWS CloudFormation optimizations, and more — March 18, 2024

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-claude-3-haiku-in-amazon-bedrock-aws-cloudformation-optimizations-and-more-march-18-2024/

Storage, storage, storage! Last week, we celebrated 18 years of innovation on Amazon Simple Storage Service (Amazon S3) at AWS Pi Day 2024. Amazon S3 mascot Buckets joined the celebrations and had a ton of fun! The 4-hour live stream was packed with puns, pie recipes powered by PartyRock, demos, code, and discussions about generative AI and Amazon S3.

AWS Pi Day 2024

AWS Pi Day 2024 — Twitch live stream on March 14, 2024

In case you missed the live stream, you can watch the recording. We’ll also update the AWS Pi Day 2024 post on community.aws this week with show notes and session clips.

Last week’s launches
Here are some launches that got my attention:

Anthropic’s Claude 3 Haiku model is now available in Amazon Bedrock — Anthropic recently introduced the Claude 3 family of foundation models (FMs), comprising Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus. Claude 3 Haiku, the fastest and most compact model in the family, is now available in Amazon Bedrock. Check out Channy’s post for more details. In addition, my colleague Mike shows how to get started with Haiku in Amazon Bedrock in his video on community.aws.

Up to 40 percent faster stack creation with AWS CloudFormation — AWS CloudFormation now creates stacks up to 40 percent faster and has a new event called CONFIGURATION_COMPLETE. With this event, CloudFormation begins parallel creation of dependent resources within a stack, speeding up the whole process. The new event also gives users more control to shortcut their stack creation process in scenarios where a resource consistency check is unnecessary. To learn more, read this AWS DevOps Blog post.

Amazon SageMaker Canvas extends its model registry integrationSageMaker Canvas has extended its model registry integration to include time series forecasting models and models fine-tuned through SageMaker JumpStart. Users can now register these models to the SageMaker Model Registry with just a click. This enhancement expands the model registry integration to all problem types supported in Canvas, such as regression/classification tabular models and CV/NLP models. It streamlines the deployment of machine learning (ML) models to production environments. Check the Developer Guide for more information.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS news
Here are some additional news items, open source projects, and Twitch shows that you might find interesting:

AWS Build On Generative AIBuild On Generative AI — Season 3 of your favorite weekly Twitch show about all things generative AI is in full swing! Streaming every Monday, 9:00 US PT, my colleagues Tiffany and Darko discuss different aspects of generative AI and invite guest speakers to demo their work. In today’s episode, guest Martyn Kilbryde showed how to build a JIRA Agent powered by Amazon Bedrock. Check out show notes and the full list of episodes on community.aws.

Amazon S3 Connector for PyTorch — The Amazon S3 Connector for PyTorch now lets PyTorch Lightning users save model checkpoints directly to Amazon S3. Saving PyTorch Lightning model checkpoints is up to 40 percent faster with the Amazon S3 Connector for PyTorch than writing to Amazon Elastic Compute Cloud (Amazon EC2) instance storage. You can now also save, load, and delete checkpoints directly from PyTorch Lightning training jobs to Amazon S3. Check out the open source project on GitHub.

AWS open source news and updates — My colleague Ricardo writes this weekly open source newsletter in which he highlights new open source projects, tools, and demos from the AWS Community.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS at NVIDIA GTC 2024 — The NVIDIA GTC 2024 developer conference is taking place this week (March 18–21) in San Jose, CA. If you’re around, visit AWS at booth #708 to explore generative AI demos and get inspired by AWS, AWS Partners, and customer experts on the latest offerings in generative AI, robotics, and advanced computing at the in-booth theatre. Check out the AWS sessions and request 1:1 meetings.

AWS SummitsAWS Summits — It’s AWS Summit season again! The first one is Paris (April 3), followed by Amsterdam (April 9), Sydney (April 10–11), London (April 24), Berlin (May 15–16), and Seoul (May 16–17). AWS Summits are a series of free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS.

AWS re:InforceAWS re:Inforce — Join us for AWS re:Inforce (June 10–12) in Philadelphia, PA. AWS re:Inforce is a learning conference focused on AWS security solutions, cloud security, compliance, and identity. Connect with the AWS teams that build the security tools and meet AWS customers to learn about their security journeys.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Antje

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

How we sped up AWS CloudFormation deployments with optimistic stabilization

Post Syndicated from Bhavani Kanneganti original https://aws.amazon.com/blogs/devops/how-we-sped-up-aws-cloudformation-deployments-with-optimistic-stabilization/

Introduction

AWS CloudFormation customers often inquire about the behind-the-scenes process of provisioning resources and why certain resources or stacks take longer to provision compared to the AWS Management Console or AWS Command Line Interface (AWS CLI). In this post, we will delve into the various factors affecting resource provisioning in CloudFormation, specifically focusing on resource stabilization, which allows CloudFormation and other Infrastructure as Code (IaC) tools to ensure resilient deployments. We will also introduce a new optimistic stabilization strategy that improves CloudFormation stack deployment times by up to 40% and provides greater visibility into resource provisioning through the new CONFIGURATION_COMPLETE status.

AWS CloudFormation is an IaC service that allows you to model your AWS and third-party resources in template files. By creating CloudFormation stacks, you can provision and manage the lifecycle of the template-defined resources manually via the AWS CLI, Console, AWS SAM, or automatically through an AWS CodePipeline, where CLI and SAM can also be leveraged or through Git sync. You can also use AWS Cloud Development Kit (AWS CDK) to define cloud infrastructure in familiar programming languages and provision it through CloudFormation, or leverage AWS Application Composer to design your application architecture, visualize dependencies, and generate templates to create CloudFormation stacks.

Deploying a CloudFormation stack

Let’s examine a deployment of a containerized application using AWS CloudFormation to understand CloudFormation’s resource provisioning.

Sample application architecture to deploy an ECS service

Figure 1. Sample application architecture to deploy an ECS service

For deploying a containerized application, you need to create an Amazon ECS service. To set up the ECS service, several key resources must first exist: an ECS cluster, an Amazon ECR repository, a task definition, and associated Amazon VPC infrastructure such as security groups and subnets.
Since you want to manage both the infrastructure and application deployments using AWS CloudFormation, you will first define a CloudFormation template that includes: an ECS cluster resource (AWS::ECS::Cluster), a task definition (AWS::ECS::TaskDefinition), an ECR repository (AWS::ECR::Repository), required VPC resources like subnets (AWS::EC2::Subnet) and security groups (AWS::EC2::SecurityGroup), and finally, the ECS Service (AWS::ECS::Service) itself. When you create the CloudFormation stack using this template, the ECS service (AWS::ECS::Service) is the final resource created, as it waits for the other resources to finish creation. This brings up the concept of Resource Dependencies.

Resource Dependency:

In CloudFormation, resources can have dependencies on other resources being created first. There are two types of resource dependencies:

  • Implicit: CloudFormation automatically infers dependencies when a resource uses intrinsic functions to reference another resource. These implicit dependencies ensure the resources are created in the proper order.
  • Explicit: Dependencies can be directly defined in the template using the DependsOn attribute. This allows you to customize the creation order of resources.

The following template snippet shows the ECS service’s dependencies visualized in a dependency graph:

Template snippet:

ECSService:
    DependsOn: [PublicRoute] #Explicit Dependency
    Type: 'AWS::ECS::Service'
    Properties:
      ServiceName: cfn-service
      Cluster: !Ref ECSCluster #Implicit Dependency
      DesiredCount: 2
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          SecurityGroups:
            - !Ref SecurityGroup #Implicit Dependency
          Subnets:
            - !Ref PublicSubnet #Implicit Dependency
      TaskDefinition: !Ref TaskDefinition #Implicit Dependency

Dependency Graph:

CloudFormation’s dependency graph for a containerized application

Figure 2. CloudFormation’s dependency graph for a containerized application

Note: VPC Resources in the above graph include PublicSubnet (AWS::EC2::Subnet), SecurityGroup (AWS::EC2::SecurityGroup), PublicRoute (AWS::EC2::Route)

In the above template snippet, the ECS Service (AWS::ECS::Service) resource has an explicit dependency on the PublicRoute resource, specified using the DependsOn attribute. The ECS service also has implicit dependencies on the ECSCluster, SecurityGroup, PublicSubnet, and TaskDefinition resources. Even without an explicit DependsOn, CloudFormation understands that these resources must be created before the ECS service, since the service references them using the Ref intrinsic function. Now that you understand how CloudFormation creates resources in a specific order based on their definition in the template file, let’s look at the time taken to provision these resources.

Resource Provisioning Time:

The total time for CloudFormation to provision the stack depends on the time required to create each individual resource defined in the template. The provisioning duration per resource is determined by several time factors:

  • Engine Time: CloudFormation Engine Time refers to the duration spent by the service reading and persisting data related to a resource. This includes the time taken for operations like parsing and interpreting the CloudFormation template, and for the resolution of intrinsic functions like Fn::GetAtt and Ref.
  • Resource Creation Time: The actual time an AWS service requires to create and configure the resource. This can vary across resource types provisioned by the service.
  • Resource Stabilization Time: The duration required for a resource to reach a usable state after creation.

What is Resource Stabilization?

When provisioning AWS resources, CloudFormation makes the necessary API calls to the underlying services to create the resources. After creation, CloudFormation then performs eventual consistency checks to ensure the resources are ready to process the intended traffic, a process known as resource stabilization. For example, when creating an ECS service in the application, the service is not readily accessible immediately after creation completes (after creation time). To ensure the ECS service is available to use, CloudFormation performs additional verification checks defined specifically for ECS service resources. Resource stabilization is not unique to CloudFormation and must be handled to some degree by all IaC tools.

Stabilization Criteria and Stabilization Timeout

For CloudFormation to mark a resource as CREATE_COMPLETE, the resource must meet specific stabilization criteria called stabilization parameters. These checks validate that the resource is not only created but also ready for use.

If a resource fails to meet its stabilization parameters within the allowed stabilization timeout period, CloudFormation will mark the resource status as CREATE_FAILED and roll back the operation. Stabilization criteria and timeouts are defined uniquely for each AWS resource supported in CloudFormation by the service, and are applied during both resource create and update workflows.

AWS CloudFormation vs AWS CLI to provision resources

Now, you will create a similar ECS service using the AWS CLI. You can use the following AWS CLI command to deploy an ECS service using the same task definition, ECS cluster and VPC resources created earlier using CloudFormation.

Command:

aws ecs create-service \
    --cluster CFNCluster \
    --service-name service-cli \
    --task-definition task-definition-cfn:1 \
    --desired-count 2 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-xxx],securityGroups=[sg-yyy],assignPublicIp=ENABLED}" \
    --region us-east-1

The following snippet from the output of the above command shows that the ECS Service has been successfully created and its status is ACTIVE.

Snapshot of the ECS service API call's response

Figure 3. Snapshot of the ECS service API call

However, when you navigate to the ECS console and review the service, tasks are still in the Pending state, and you are unable to access the application.

ECS tasks status in the AWS console

Figure 4. ECS console view

You have to wait for the service to reach a steady state before you can successfully access the application.

ECS service events from the AWS console

Figure 5. ECS service events from the AWS console

When you create the same ECS service using AWS CloudFormation, the service is accessible immediately after the resource reaches a status of CREATE_COMPLETE in the stack. This reliable availability is due to CloudFormation’s resource stabilization process. After initially creating the ECS service, CloudFormation waits and continues calling the ECS DescribeServices API action until the service reaches a steady state. Once the ECS service passes its consistency checks and is fully ready for use, only then will CloudFormation mark the resource status as CREATE_COMPLETE in the stack. This creation and stabilization orchestration allows you to access the service right away without any further delays.

The following is an AWS CloudTrail snippet of CloudFormation performing DescribeServices API calls during Stabilization:

Snapshot of AWS CloudTrail event for DescribeServices API call

Figure 6. Snapshot of AWS CloudTrail event

By handling resource stabilization natively, CloudFormation saves you the extra coding effort and complexity of having to implement custom status checks and availability polling logic after resource creation. You would have to develop this additional logic using tools like the AWS CLI or API across all the infrastructure and application resources. With CloudFormation’s built-in stabilization orchestration, you can deploy the template once and trust that the services will be fully ready after creation, allowing you to focus on developing your application functionality.

Evolution of Stabilization Strategy

CloudFormation’s stabilization strategy couples resource creation with stabilization such that the provisioning of a resource is not considered COMPLETE until stabilization is complete.

Historic Stabilization Strategy

For resources that have no interdependencies, CloudFormation starts the provisioning process in parallel. However, if a resource depends on another resource, CloudFormation will wait for the entire resource provisioning operation of the dependency resource to complete before starting the provisioning of the dependent resource.

CloudFormation’s historic stabilization strategy

Figure 7. CloudFormation’s historic stabilization strategy

The diagram above shows a deployment of some of the ECS application resources that you deploy using AWS CloudFormation. The Task Definition (AWS::ECS::TaskDefinition) resource depends on the ECR Repository (AWS::ECR::Repository) resource, and the ECS Service (AWS::ECS:Service) resource depends on both the Task Definition and ECS Cluster (AWS::ECS::Cluster) resources. The ECS Cluster resource has no dependencies defined. CloudFormation initiates creation of the ECR Repository and ECS Cluster resources in parallel. It then waits for the ECR Repository to complete consistency checks before starting provisioning of the Task Definition resource. Similarly, creation of the ECS Service resource begins only when the Task Definition and ECS Cluster resources have been created and are ready. This sequential approach ensures safety and stability but causes delays. CloudFormation strictly deploys dependent resources one after the other, slowing down deployment of the entire stack. As the number of interdependent resources grows, the overall stack deployment time increases, creating a bottleneck that prolongs the whole stack operation.

New Optimistic Stabilization Strategy

To improve stack provisioning times and deployment performance, AWS CloudFormation recently launched a new optimistic stabilization strategy. The optimistic strategy can reduce customer stack deployment duration by up to 40%. It allows dependent resources to be created in parallel. This concurrent resource creation helps significantly improve deployment speed.

CloudFormation’s new optimistic stabilizationstrategy

Figure 8. CloudFormation’s new optimistic stabilization strategy

The diagram above shows deployment of the same 4 resources discussed in the historic strategy. The Task Definition (AWS::ECS::TaskDefinition) resource depends on the ECR Repository (AWS::ECR::Repository) resource, and the ECS Service (AWS::ECS:Service) resource depends on both the Task Definition and ECS Cluster (AWS::ECS::Cluster) resources. The ECS Cluster resource has no dependencies defined. CloudFormation initiates creation of the ECR Repository and ECS Cluster resources in parallel. Then, instead of waiting for the ECR Repository to complete consistency checks, it starts creating the Task Definition when the ECR Repository completes creation, but before stabilization is complete. Similarly, creation of the ECS Service resource begins after Task Definition and ECS Cluster creation. The change was made because not all resources require their dependent resources to complete consistency checks before starting creation. If the ECS Service fails to provision because the Task Definition or ECS Cluster resources are still undergoing consistency checks, CloudFormation will wait for those dependencies to complete their consistency checks before attempting to create the ECS Service again.

CloudFormation’s new stabilization strategy with the retry capability

Figure 9. CloudFormation’s new stabilization strategy with the retry capability

This parallel creation of dependent resources with automatic retry capabilities results in faster deployment times compared to the historical linear resource provisioning strategy. The Optimistic stabilization strategy currently applies only to create workflows with resources that have implicit dependencies. For resources with an explicit dependency, CloudFormation leverages the historic strategy in deploying resources.

Improved Visibility into Resource Provisioning

When creating a CloudFormation stack, a resource can sometimes take longer to provision, making it appear as if it’s stuck in an IN_PROGRESS state. This can be because CloudFormation is waiting for the resource to complete consistency checks during its resource stabilization step. To improve visibility into resource provisioning status, CloudFormation has introduced a new “CONFIGURATION_COMPLETE” event. This event is emitted at both the individual resource level and the overall stack level during create workflow when resource(s) creation or configuration is complete, but stabilization is still in progress.

CloudFormation stack events of the ECS Application

Figure 10. CloudFormation stack events of the ECS Application

The above diagram shows the snapshot of stack events of the ECS application’s CloudFormation stack named ECSApplication. Observe the events from the bottom to top:

  • At 10:46:08 UTC-0600, ECSService (AWS::ECS::Service) resource creation was initiated.
  • At 10:46:09 UTC-0600, the ECSService has CREATE_IN_PROGRESS status in the Status tab and CONFIGURATION_COMPLETE status in the Detailed status tab, meaning the resource was successfully created and the consistency check was initiated.
  • At 10:46:09 UTC-0600, the stack ECSApplication has CREATE_IN_PROGRESS status in the Status tab and CONFIGURATION_COMPLETE status in the Detailed status tab, meaning all the resources in the ECSApplication stack are successfully created and are going through stabilization. This stack level CONFIGURATION_COMPLETE status can also be viewed in the stack’s Overview tab.
CloudFormation Overview tab for the ECSApplication stack

Figure 11. CloudFormation Overview tab for the ECSApplication stack

  • At 10:47:09 UTC-0600, the ECSService has CREATE_COMPLETE status in the Status tab, meaning the service is created and completed consistency checks.
  • At 10:47:10 UTC-0600, ECSApplication has CREATE_COMPLETE status in the Status tab, meaning all the resources are successfully created and completed consistency checks.

Conclusion:

In this post, I hope you gained some insights into how CloudFormation deploys resources and the various time factors that contribute to the creation of a stack and its resources. You also took a deeper look into what CloudFormation does under the hood with resource stabilization and how it ensures the safe, consistent, and reliable provisioning of resources in critical, high-availability production infrastructure deployments. Finally, you learned about the new optimistic stabilization strategy to shorten stack deployment times and improve visibility into resource provisioning.

About the authors:

Picture of author Bhavani Kanneganti

Bhavani Kanneganti

Bhavani is a Principal Engineer at AWS Support. She has over 7 years of experience solving complex customer issues on the AWS Cloud pertaining to infrastructure-as-code and container orchestration services such as CloudFormation, ECS, and EKS. She also works closely with teams across AWS to design solutions that improve customer experience. Outside of work, Bhavani enjoys cooking and traveling.

Picture of author Idriss Laouali Abdou

Idriss Laouali Abdou

Idriss is a Senior Product Manager AWS, working on delivering the best experience for AWS IaC customers. Outside of work, you can either find him creating educational content helping thousands of students, cooking, or dancing.

Amazon CloudWatch Application Signals for automatic instrumentation of your applications (preview)

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/amazon-cloudwatch-application-signals-for-automatic-instrumentation-of-your-applications-preview/

One of the challenges with distributed systems is that they are made up of many interdependent services, which add a degree of complexity when you are trying to monitor their performance. Determining which services and APIs are experiencing high latencies or degraded availability requires manually putting together telemetry signals. This can result in time and effort establishing the root cause of any issues with the system due to the inconsistent experiences across metrics, traces, logs, real user monitoring, and synthetic monitoring.

You want to provide your customers with continuously available and high-performing applications. At the same time, the monitoring that assures this must be efficient, cost-effective, and without undifferentiated heavy lifting.

Amazon CloudWatch Application Signals helps you automatically instrument applications based on best practices for application performance. There is no manual effort, no custom code, and no custom dashboards. You get a pre-built, standardized dashboard showing the most important metrics, such as volume of requests, availability, latency, and more, for the performance of your applications. In addition, you can define Service Level Objectives (SLOs) on your applications to monitor specific operations that matter most to your business. An example of an SLO could be to set a goal that a webpage should render within 2000 ms 99.9 percent of the time in a rolling 28-day interval.

Application Signals automatically correlates telemetry across metrics, traces, logs, real user monitoring, and synthetic monitoring to speed up troubleshooting and reduce application disruption. By providing an integrated experience for analyzing performance in the context of your applications, Application Signals gives you improved productivity with a focus on the applications that support your most critical business functions.

My personal favorite is the collaboration between teams that’s made possible by Application Signals. I started this post by mentioning that distributed systems are made up of many interdependent services. On the Service Map, which we will look at later in this post, if you, as a service owner, identify an issue that’s caused by another service, you can send a link to the owner of the other service to efficiently collaborate on the triage tasks.

Getting started with Application Signals
You can easily collect application and container telemetry when creating new Amazon EKS clusters in the Amazon EKS console by enabling the new Amazon CloudWatch Observability EKS add-on. Another option is to enable for existing Amazon EKS Clusters or other compute types directly in the Amazon CloudWatch console.

Create service map

After enabling Application Signals via the Amazon EKS add-on or Custom option for other compute types, Application Signals automatically discovers services and generates a standard set of application metrics such as volume of requests and latency spikes or availability drops for APIs and dependencies, to name a few.

Specify platform

All of the services discovered and their golden metrics (volume of requests, latency, faults and errors) are then automatically displayed on the Services page and the Service Map. The Service Map gives you a visual deep dive to evaluate the health of a service, its operations, dependencies, and all the call paths between an operation and a dependency.

Auto-generated map

The list of services that are enabled in Application Signals will also show in the services dashboard, along with operational metrics across all of your services and dependencies to easily spot anomalies. The Application column is auto-populated if the EKS cluster belongs to an application that’s tagged in AppRegistry. The Hosted In column automatically detects which EKS pod, cluster, or namespace combination the service requests are running in, and you can select one to go directly to Container Insights for detailed container metrics such as CPU or memory utilization, to name a few.

Team collaboration with Application Signals
Now, to expand on the team collaboration that I mentioned at the beginning of this post. Let’s say you consult the services dashboard to do sanity checks and you notice two SLO issues for one of your services named pet-clinic-frontend. Your company maintains a set of SLOs, and this is the view that you use to understand how the applications are performing against the objectives. For the services that are tagged in AppRegistry all teams have a central view of the definition and ownership of the application. Further navigation to the service map gives you even more details on the health of this service.

At this point you make the decision to send the link to thepet-clinic-frontendservice to Sarah whose details you found in the AppRegistry. Sarah is the person on-call for this service. The link allows you to efficiently collaborate with Sarah because it’s been curated to land directly on the triage view that is contextualized based on your discovery of the issue. Sarah notices that the POST /api/customer/owners latency has increased to 2k ms for a number of requests and as the service owner, dives deep to arrive at the root cause.

Clicking into the latency graph returns a correlated list of traces that correspond directly to the operation, metric, and moment in time, which helps Sarah to find the exact traces that may have led to the increase in latency.

Sarah uses Amazon CloudWatch Synthetics and Amazon CloudWatch RUM and has enabled the X-Ray active tracing integration to automatically see the list of relevant canaries and pages correlated to the service. This integrated view now helps Sarah gain multiple perspectives in the performance of the application and quickly troubleshoot anomalies in a single view.

Available now
Amazon CloudWatch Application Signals is available in preview and you can start using it today in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Asia Pacific (Sydney), and Asia Pacific (Tokyo).

To learn more, visit the Amazon CloudWatch user guide. You can submit your questions to AWS re:Post for Amazon CloudWatch, or through your usual AWS Support contacts.

Veliswa

New myApplications in the AWS Management Console simplifies managing your application resources

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-myapplications-in-the-aws-management-console-simplifies-managing-your-application-resources/

Today, we are announcing the general availability of myApplications supporting application operations, a new set of capabilities that help you get started with your applications on AWS, operate them with less effort, and move faster at scale. With myApplication in the AWS Management Console, you can more easily manage and monitor the cost, health, security posture, and performance of your applications on AWS.

The myApplications experience is available in the Console Home, where you can access an Applications widget that lists the applications in an account. Now, you can create your applications more easily using the Create application wizard, connecting resources in your AWS account from one view in the console. The created application will automatically display in myApplications, and you can take action on your applications.

When you choose your application in the Applications widget in the console, you can see an at-a-glance view of key application metrics widgets in the applications dashboard. Here you can find, debug operational issues, and optimize your applications.

With a single action on the applications dashboard, you can dive deeper to act on specific resources in the relevant services, such as Amazon CloudWatch for application performance, AWS Cost Explorer for cost and usage, and AWS Security Hub for security findings.

Getting started with myApplications
To get started, on the AWS Management Console Home, choose Create application in the Applications widget. In the first step, input your application name and description.

In the next step, you can add your resources. Before you can search and add resources, you should turn on and set up AWS Resource Explorer, a managed capability that simplifies the search and discovery of your AWS resources across AWS Regions.

Choose Add resources and select the resources to add to your applications. You can also search by keyword, tag, or AWS CloudFormation stack to integrate groups of resources to manage the full lifecycle of your application.

After confirming, your resources are added, new awsApplication tags applied, and the myApplications dashboard will be automatically generated.

Now, let’s see which widgets can be useful.

The Application summary widget displays the name, description, and tag so you know which application you are working on. The Cost and usage widget visualizes your AWS resource costs and usage from AWS Cost Explorer, including the application’s current and forecasted month-end costs, top five billed services, and a monthly application resource cost trend chart. You can monitor spend, look for anomalies, and click to take action where needed.

The Compute widget summarizes of application compute resources, information about which are in alarm, and trend charts from CloudWatch showing basic metrics such as Amazon EC2 instance CPU utilization and AWS Lambda invocations. You also can assess application operations, look for anomalies, and take action.

The Monitoring and Operations widget displays alarms and alerts for resources associated with your application, service level objectives (SLOs), and standardized application performance metrics from CloudWatch Application Signals. You can monitor ongoing issues, assess trends, and quickly identify and drill down on any issues that might impact your application.

The Security widget shows the highest priority security findings identified by AWS Security Hub. Findings are listed by severity and service, so you can monitor their security posture and click to take action where needed.

The DevOps widget summarizes operational insights from AWS System Manager Application Manager, such as fleet management, state management, patch management, and configuration management status so you can assess compliance and take action.

You can also use the Tagging widget to assist you in reviewing and applying tags to your application.

Now available
You can enjoy this new myApplications capability, a new application-centric experience to easily manage and monitor applications on AWS. myApplications capability is available in the following AWS Regions: US East (Ohio, N. Virginia), US West (N. California, Oregon), South America (São Paulo), Asia Pacific (Hyderabad, Jakarta, Mumbai, Osaka, Seoul, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, London, Paris, Stockholm), Middle East (Bahrain) Regions.

AWS Premier Tier Services Partners— Escala24x7, IBM, Tech Mahindra, and Xebia will support application operations with complementary features and services.

Give it a try now in the AWS Management Console and send feedback to AWS re:Post for AWS Management Console, using the feedback link on the myApplications dashboard, or through your usual AWS Support contacts.

Channy

Use natural language to query Amazon CloudWatch logs and metrics (preview)

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/use-natural-language-to-query-amazon-cloudwatch-logs-and-metrics-preview/

To make it easy to interact with your operational data, Amazon CloudWatch is introducing today natural language query generation for Logs and Metrics Insights. With this capability, powered by generative artificial intelligence (AI), you can describe in English the insights you are looking for, and a Logs or Metrics Insights query will be automatically generated.

This feature provides three main capabilities for CloudWatch Logs and Metrics Insights:

  • Generate new queries from a description or a question to help you get started easily.
  • Query explanation to help you learn the language including more advanced features.
  • Refine existing queries using guided iterations.

Let’s see how these work in practice with a few examples. I’ll cover logs first and then metrics.

Generate CloudWatch Logs Insights queries with natural language
In the CloudWatch console, I select Log Insights in the Logs section. I then select the log group of an AWS Lambda function that I want to investigate.

I choose the Query generator button to open a new Prompt field where I enter what I need using natural language:

Tell me the duration of the 10 slowest invocations

Then, I choose Generate new query. The following Log Insights query is automatically generated:

fields @timestamp, @requestId, @message, @logStream, @duration 
| filter @type = "REPORT" and @duration > 1000
| sort @duration desc
| limit 10

Console screenshot.

I choose Run query to see the results.

Console screenshot.

I find that now there’s too much information in the output. I prefer to see only the data I need, so I enter the following sentence in the Prompt and choose Update query.

Show only timestamps and latency

The query is updated based on my input and only the timestamp and duration are returned:

fields @timestamp, @duration 
| filter @type = "REPORT" and @duration > 1000
| sort @duration desc
| limit 10

I run the updated query and get a result that is easier for me to read.

Console screenshot.

Now, I want to know if there are any errors in the log. I enter this sentence in the Prompt and generate a new query:

Count the number of ERROR messages

As requested, the generated query is counting the messages that contain the ERROR string:

fields @message
| filter @message like /ERROR/
| stats count()

I run the query and find out that there are more errors than I expected. I need more information.

Console screenshot.

I use this prompt to update the query and get a better distribution of the errors:

Show the errors per hour

The updated query uses the bin() function to group the result in one hour intervals.

fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) by bin(1h)

Let’s see a more advanced query about memory usage. I select the log groups of a few Lambda functions and type:

Show invocations with the most over-provisioned memory grouped by log stream

Before generating the query, I choose the gear icon to toggle the options to include my prompt and an explanation as comment. Here’s the result (I split the explanation over multiple lines for readability):

# Show invocations with the most over-provisioned memory grouped by log stream

fields @logStream, @memorySize/1000/1000 as memoryMB, @maxMemoryUsed/1000/1000 as maxMemoryUsedMB, (@memorySize/1000/1000 - @maxMemoryUsed/1000/1000) as overProvisionedMB 
| stats max(overProvisionedMB) as maxOverProvisionedMB by @logStream 
| sort maxOverProvisionedMB desc

# This query finds the amount of over-provisioned memory for each log stream by
# calculating the difference between the provisioned and maximum memory used.
# It then groups the results by log stream and calculates the maximum
# over-provisioned memory for each log stream. Finally, it sorts the results
# in descending order by the maximum over-provisioned memory to show
# the log streams with the most over-provisioned memory.

Now, I have the information I need to understand these errors. On the other side, I also have EC2 workloads. How are those instances running? Let’s look at some metrics.

Generate CloudWatch Metrics Insights queries with natural language
In the CloudWatch console, I select All metrics in the Metrics section. Then, in the Query tab, I use the Editor. If you prefer, the Query generator is available also in the Builder.

I choose Query generator like before. Then, I enter what I need using plain English:

Which 10 EC2 instances have the highest CPU utilization?

I choose Generate new query and get a result using the Metrics Insights syntax.

SELECT AVG("CPUUtilization")
FROM SCHEMA("AWS/EC2", InstanceId)
GROUP BY InstanceId
ORDER BY AVG() DESC
LIMIT 10

To see the graph, I choose Run.

Console screenshot.

Well, it looks like my EC2 instances are not doing much. This result shows how those instances are using the CPU, but what about storage? I enter this in the prompt and choose Update query:

How about the most EBS writes?

The updated query replaces the average CPU utilization with the sum of bytes written to all EBS volumes attached to the instance. It keeps the limit to only show the top 10 results.

SELECT SUM("EBSWriteBytes")
FROM SCHEMA("AWS/EC2", InstanceId)
GROUP BY InstanceId
ORDER BY SUM() DESC
LIMIT 10

I run the query and, by looking at the result, I have a better understanding of how storage is being used by my EC2 instances.

Try entering some requests and run the generated queries over your logs and metrics to see how this works with your data.

Things to know
Amazon CloudWatch natural language query generation for logs and metrics is available in preview in the US East (N. Virginia) and US West (Oregon) AWS Regions.

There is no additional cost for using natural language query generation during the preview. You only pay for the cost of running the queries according to CloudWatch pricing.

Generated queries are produced by generative AI and dependent on factors including the data selected and available in your account. For these reasons, your results may vary.

When generating a query, you can include your original request and an explanation of the query as comments. To do so, choose the gear icon in the bottom right corner of the query edit window and toggle those options.

This new capability can help you generate and update queries for logs and metrics, saving you time and effort. This approach allows engineering teams to scale their operations without worrying about specific data knowledge or query expertise.

Use natural language to analyze your logs and metrics with Amazon CloudWatch.

Danilo

Amazon CloudWatch Logs now offers automated pattern analytics and anomaly detection

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-cloudwatch-logs-now-offers-automated-pattern-analytics-and-anomaly-detection/

Searching through log data to find operational or business insights often feels like looking for a needle in a haystack. It usually requires you to manually filter and review individual log records. To help you with that, Amazon CloudWatch has added new capabilities to automatically recognize and cluster patterns among log records, extract noteworthy content and trends, and notify you of anomalies using advanced machine learning (ML) algorithms trained using decades of Amazon and AWS operational data.

Specifically, CloudWatch now offers the following:

  • The Patterns tab on the Logs Insights page finds recurring patterns in your query results and lets you analyze them in detail. This makes it easier to find what you’re looking for and drill down into new or unexpected content in your logs.
  • The Compare button in the time interval selector on the Logs Insights page lets you quickly compare the query result for the selected time range to a previous period, such as the previous day, week, or month. In this way, it takes less time to see what has changed compared to a previous stable scenario.
  • The Log Anomalies page in the Logs section of the navigation pane automatically surfaces anomalies found in your logs while they are processed during ingestion.

Let’s see how these work in practice with a typical troubleshooting journey. I will look at some application logs to find key patterns, compare two time periods to understand what changed, and finally see how detecting anomalies can help discover issues.

Finding recurring patterns in the logs
In the CloudWatch console, I choose Logs Insights from the Logs section of the navigation pane. To start, I have selected which log groups I want to query. In this case, I select a log group of a Lambda function that I want to inspect and choose Run query.

In the Pattern tab, I see the patterns that have been found in these log groups. One of the patterns seems to be an error. I can select it to quickly add it as a filter to my query and focus on the logs that contain this pattern. For now, I choose the magnifying glass icon to analyze the pattern.

Console screenshot.

In the Pattern inspect window, a histogram with the occurrences of the pattern in the selected time period is shown. After the histogram, samples from the logs are provided.

Console screenshot.

The variable parts of the pattern (such as numbers) have been extracted as “tokens.” I select the Token values tab to see the values for a token. I can select a token value to quickly add it as a filter to the query and focus on the logs that contain this pattern with this specific value.

Console screenshot.

I can also look at the Related patterns tab to see other logs that typically occurred at the same time as the pattern I am analyzing. For example, if I am looking at an ERROR log that was always written alongside a DEBUG log showing more details, I would see that relationship there.

Comparing logs with a previous period
To better understand what is happening, I choose the Compare button in the time interval selector. This updates the query to compare results with a previous period. For example, I choose Previous day to see what changed compared to yesterday.

Console screenshot.

In the Patterns tab, I notice that there has actually been a 10 percent decrease in the number of errors, so the current situation might not be too bad.

I choose the magnifying glass icon on the pattern with severity type ERROR to see a full comparison of the two time periods. The graph overlaps the occurrences of the pattern over the two periods (now and yesterday in this case) inside the selected time range (one hour).

Console screenshot.

Errors are decreasing but are still there. To reduce those errors, I make some changes to the application. I come back after some time to compare the logs, and a new ERROR pattern is found that was not present in the previous time period.

Console screenshot.

My update probably broke something, so I roll back to the previous version of the application. For now, I’ll keep it as it is because the number of errors is acceptable for my use case.

Detecting anomalies in the log
I am reassured by the decrease in errors that I discovered comparing the logs. But how can I know if something unexpected is happening? Anomaly detection for CloudWatch Logs looks for unexpected patterns in the logs as they are processed during ingestion and can be enabled at log group level.

I select Log groups in the navigation pane and type a filter to see the same log group I was looking at before. I choose Configure in the Anomaly detection column and select an Evaluation frequency of 5 minutes. Optionally, I can use a longer interval (up to 60 minutes) and add patterns to process only specific log events for anomaly detection.

After I activate anomaly detection for this log group, incoming logs are constantly evaluated against historical baselines. I wait for a few minutes and, to see what has been found, I choose Log anomalies from the Logs section of the navigation pane.

Console screenshot.

To simplify this view, I can suppress anomalies that I am not interested in following. For now, I choose one of the anomalies in order to inspect the corresponding pattern in a way similar to before.

Console screenshot.

After this additional check, I am convinced there are no urgent issues with my application. With all the insights I collected with these new capabilities, I can now focus on the errors in the logs to understand how to solve them.

Things to know
Amazon CloudWatch automated log pattern analytics is available today in all commercial AWS Regions where Amazon CloudWatch Logs is offered excluding the China (Beijing), the China (Ningxia), and Israel (Tel Aviv) Regions.

The patterns and compare query features are charged according to existing Logs Insights query costs. Comparing a one-hour time period against another one-hour time period is equivalent to running a single query over a two-hour time period. Anomaly detection is included as part of your log ingestion fees, and there is no additional charge for this feature. For more information, see CloudWatch pricing.

Simplify how you analyze logs with CloudWatch automated log pattern analytics.

Danilo

Amazon Managed Service for Prometheus collector provides agentless metric collection for Amazon EKS

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/amazon-managed-service-for-prometheus-collector-provides-agentless-metric-collection-for-amazon-eks/

Today, I’m happy to announce a new capability, Amazon Managed Service for Prometheus collector, to automatically and agentlessly discover and collect Prometheus metrics from Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Managed Service for Prometheus collector consists of a scraper that discovers and collects metrics from Amazon EKS applications and infrastructure without needing to run any collectors in-cluster.

This new capability provides fully managed Prometheus-compatible monitoring and alerting with Amazon Managed Service for Prometheus. One of the significant benefits is that the collector is fully managed, automatically right-sized, and scaled for your use case. This means you don’t have to run any compute for collectors to collect the available metrics. This helps you optimize metric collection costs to monitor your applications and infrastructure running on EKS.

With this launch, Amazon Managed Service for Prometheus now supports two major modes of Prometheus metrics collection: AWS managed collection, a fully managed and agentless collector, and customer managed collection.

Getting started with Amazon Managed Service for Prometheus Collector
Let’s take a look at how to use AWS managed collectors to ingest metrics using this new capability into a workspace in Amazon Managed Service for Prometheus. Then, we will evaluate the collected metrics in Amazon Managed Service for Grafana.

When you create a new EKS cluster using the Amazon EKS console, you now have the option to enable AWS managed collector by selecting Send Prometheus metrics to Amazon Managed Service for Prometheus. In the Destination section, you can also create a new workspace or select your existing Amazon Managed Service for Prometheus workspace. You can learn more about how to create a workspace by following the getting started guide.

Then, you have the flexibility to define your scraper configuration using the editor or upload your existing configuration. The scraper configuration controls how you would like the scraper to discover and collect metrics. To see possible values you can configure, please visit the Prometheus Configuration page.

Once you’ve finished the EKS cluster creation, you can go to the Observability tab on your cluster page to see the list of scrapers running in your EKS cluster.

The next step is to configure your EKS cluster to allow the scraper to access metrics. You can find the steps and information on Configuring your Amazon EKS cluster.

Once your EKS cluster is properly configured, the collector will automatically discover metrics from your EKS cluster and nodes. To visualize the metrics, you can use Amazon Managed Grafana integrated with your Prometheus workspace. Visit the Set up Amazon Managed Grafana for use with Amazon Managed Service for Prometheus page to learn more.

The following is a screenshot of metrics ingested by the collectors and visualized in an Amazon Managed Grafana workspace. From here, you can run a simple query to get the metrics that you need.

Using AWS CLI and APIs
Besides using the Amazon EKS console, you can also use the APIs or AWS Command Line Interface (AWS CLI) to add an AWS managed collector. This approach is useful if you want to add an AWS managed collector into an existing EKS cluster or make some modifications to the existing collector configuration.

To create a scraper, you can run the following command:

aws amp create-scraper \ 
       --source eksConfiguration="{clusterArn=<EKS-CLUSTER-ARN>,securityGroupIds=[<SG-SECURITY-GROUP-ID>],subnetIds=[<SUBNET-ID>]}" \ 
       --scrape-configuration configurationBlob=<BASE64-CONFIGURATION-BLOB> \ 
       --destination=ampConfiguration={workspaceArn="<WORKSPACE_ARN>"}

You can get most of the parameter values from the respective AWS console, such as your EKS cluster ARN and your Amazon Managed Service for Prometheus workspace ARN. Other than that, you also need to define the scraper configuration defined as configurationBlob.

Once you’ve defined the scraper configuration, you need to encode the configuration file into base64 encoding before passing the API call. The following is the command that I use in my Linux development machine to encode sample-configuration.yml into base64 and copy it onto the clipboard.

$ base64 sample-configuration.yml | pbcopy

Now Available
The Amazon Managed Service for Prometheus collector capability is now available to all AWS customers in all AWS Regions where Amazon Managed Service for Prometheus is supported.

Learn more:

Happy building!
Donnie

New – Multi-account search in AWS Resource Explorer

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-multi-account-search-in-aws-resource-explorer/

With AWS Resource Explorer, you can search for and discover your resources, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Kinesis data streams, and Amazon DynamoDB tables, across AWS Regions. Starting today, you can also search across accounts within your organization.

It takes just a few minutes to turn on and configure Resource Explorer for an entire organization or a specific organizational unit (OU) and use simple free-form text and filtered searches to find relevant AWS resources across accounts and Regions.

Multi-account search is available in the Resource Explorer console, anywhere in the AWS Management Console through the unified search bar (the search bar at the top of every AWS console page), using the AWS Command Line Interface (AWS CLI), AWS SDKs, or AWS Chatbot. In this way, you can locate a resource quickly, navigate to the appropriate account and service, and take action.

When operating in a well-architected manner, multiple AWS accounts are used to help isolate and manage business applications and data. You can now use Resource Explorer to simplify how you explore your resources across accounts and act on them at scale. For example, Resource Explorer can help you locate impacted resources across your entire organization when investigating increased operational costs, troubleshooting a performance issue, or remediating a security alert.

Let’s see how this works in practice.

Setting up multi-account search
You can set up multi-account search for your organization in four steps:

  1. Enable trusted access for AWS Account Management.
  2. Configure Resource Explorer in every account in the organization or in the OU you want to search through. You can do that in just a few clicks using AWS Systems Manager Quick Setup. Optionally, you can use AWS CloudFormation, or other management tools you are comfortable with.
  3. It is not mandatory, but we suggest creating a delegated admin account for AWS Account Management. Then, to centralize all the required permissions for multi-account creation, we recommend using the delegated admin account to create Resource Explorer multi-account views.
  4. Finally, you can create a multi-account view to start searching across the organization.

Create a multi-account view
I already implemented the first three steps in the previous list. Using the delegated admin account, I go to the Resource Explorer console. There, I choose Views in the Explore resources section and create a view.

I enter a name for the view and select Organization-wide resources visibility. In this way, I can allow visibility of resources located in accounts across my entire organization or in specific OUs. For this view, I select the whole organization.

Console screenshot.

For the Region, I select the one where I have the aggregator index. The aggregator index contains a replicated copy of the local index in every other Region where Resource Explorer has been turned on. Optionally, I can use a filter to limit which resources should be included in this view. I choose to include all resources and additional resource attributes such as tags.

Console screenshot.

Then, I complete the creation of the view. Now, by granting access to the view, I can control who can access what resource information in Resource Explorer.

Using multi-account search
To try the new multi-account view, I choose Resource search from the Explore resources section of the navigation pane. In my query, I want to see if there are Amazon ElastiCache resources for an old version of Redis. I type elasticache:* redis3.2 in the Query field.

Console screenshot.

In the results, I see the different AWS accounts and Regions where these resources are based. For resources in my account, there is a link in the first column that opens that resource in the console. For resources in other accounts, I can use the console with the appropriate account and service to get more information or take action.

Things to know
Multi-account search is available in the following AWS Regions: [[ Regions ]].

There is no additional charge for using AWS Resource Explorer, including for multi-account searches.

To share views with other accounts in an organization, we suggest you use the delegated admin account to create the view with the necessary visibility in terms of resources, Regions, and accounts within the organization and then use AWS Resource Access Manager to share access to the view. For example, you can create a view for a specific OU and then share the view with an account in that OU.

Search for and discover relevant resources across accounts in your organization and across Regions with AWS Resource Explorer.

Danilo

Deploy container applications in a multicloud environment using Amazon CodeCatalyst

Post Syndicated from Pawan Shrivastava original https://aws.amazon.com/blogs/devops/deploy-container-applications-in-a-multicloud-environment-using-amazon-codecatalyst/

In the previous post of this blog series, we saw how organizations can deploy workloads to virtual machines (VMs) in a hybrid and multicloud environment. This post shows how organizations can address the requirement of deploying containers, and containerized applications to hybrid and multicloud platforms using Amazon CodeCatalyst. CodeCatalyst is an integrated DevOps service which enables development teams to collaborate on code, and build, test, and deploy applications with continuous integration and continuous delivery (CI/CD) tools.

One prominent scenario where multicloud container deployment is useful is when organizations want to leverage AWS’ broadest and deepest set of Artificial Intelligence (AI) and Machine Learning (ML) capabilities by developing and training AI/ML models in AWS using Amazon SageMaker, and deploying the model package to a Kubernetes platform on other cloud platforms, such as Azure Kubernetes Service (AKS) for inference. As shown in this workshop for operationalizing the machine learning pipeline, we can train an AI/ML model, push it to Amazon Elastic Container Registry (ECR) as an image, and later deploy the model as a container application.

Scenario description

The solution described in the post covers the following steps:

  • Setup Amazon CodeCatalyst environment.
  • Create a Dockerfile along with a manifest for the application, and a repository in Amazon ECR.
  • Create an Azure service principal which has permissions to deploy resources to Azure Kubernetes Service (AKS), and store the credentials securely in Amazon CodeCatalyst secret.
  • Create a CodeCatalyst workflow to build, test, and deploy the containerized application to AKS cluster using Github Actions.

The architecture diagram for the scenario is shown in Figure 1.

Solution architecture diagram

Figure 1 – Solution Architecture

Solution Walkthrough

This section shows how to set up the environment, and deploy a HTML application to an AKS cluster.

Setup Amazon ECR and GitHub code repository

Create a new Amazon ECR and a code repository. In this case we’re using GitHub as the repository but you can create a source repository in CodeCatalyst or you can choose to link an existing source repository hosted by another service if that service is supported by an installed extension. Then follow the application and Docker image creation steps outlined in Step 1 in the environment creation process in exposing Multiple Applications on Amazon EKS. Create a file named manifest.yaml as shown, and map the “image” parameter to the URL of the Amazon ECR repository created above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: multicloud-container-deployment-app
  labels:
    app: multicloud-container-deployment-app
spec:
  selector:
    matchLabels:
      app: multicloud-container-deployment-app
  replicas: 2
  template:
    metadata:
      labels:
        app: multicloud-container-deployment-app
    spec:
      nodeSelector:
        "beta.kubernetes.io/os": linux
      containers:
      - name: ecs-web-page-container
        image: <aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository>
        imagePullPolicy: Always
        ports:
            - containerPort: 80
        resources:
          limits:
            memory: "100Mi"
            cpu: "200m"
      imagePullSecrets:
          - name: ecrsecret
---
apiVersion: v1
kind: Service
metadata:
  name: multicloud-container-deployment-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: multicloud-container-deployment-app

Push the files to Github code repository. The multicloud-container-app github repository should look similar to Figure 2 below

Files in multicloud container app github repository 

Figure 2 – Files in Github repository

Configure Azure Kubernetes Service (AKS) cluster to pull private images from ECR repository

Pull the docker images from a private ECR repository to your AKS cluster by running the following command. This setup is required during the azure/k8s-deploy Github Actions in the CI/CD workflow. Authenticate Docker to an Amazon ECR registry with get-login-password by using aws ecr get-login-password. Run the following command in a shell where AWS CLI is configured, and is used to connect to the AKS cluster. This creates a secret called ecrsecret, which is used to pull an image from the private ECR repository.

kubectl create secret docker-registry ecrsecret\
 --docker-server=<aws_account_id>.dkr.ecr.us-west-2.amazonaws.com/<my_repository>\
 --docker-username=AWS\
 --docker-password= $(aws ecr get-login-password --region us-west-2)

Provide ECR URI in the variable “–docker-server =”.

CodeCatalyst setup

Follow these steps to set up CodeCatalyst environment:

Configure access to the AKS cluster

In this solution, we use three GitHub Actions – azure/login, azure/aks-set-context and azure/k8s-deploy – to login, set the AKS cluster, and deploy the manifest file to the AKS cluster respectively. For the Github Actions to access the Azure environment, they require credentials associated with an Azure Service Principal.

Service Principals in Azure are identified by the CLIENT_ID, CLIENT_SECRET, SUBSCRIPTION_ID, and TENANT_ID properties. Create the Service principal by running the following command in the azure cloud shell:

az ad sp create-for-rbac \
    --name "ghActionHTMLapplication" \
    --scope /subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP> \
    --role Contributor \
    --sdk-auth

The command generates a JSON output (shown in Figure 3), which is stored in CodeCatalyst secret called AZURE_CREDENTIALS. This credential is used by azure/login Github Actions.

JSON output stored in AZURE-CREDENTIALS secret

Figure 3 – JSON output

Configure secrets inside CodeCatalyst Project

Create three secrets CLUSTER_NAME (Name of AKS cluster), RESOURCE_GROUP(Name of Azure resource group) and AZURE_CREDENTIALS(described in the previous step) as described in the working with secret document. The secrets are shown in Figure 4.

Secrets in CodeCatalyst

Figure 4 – CodeCatalyst Secrets

CodeCatalyst CI/CD Workflow

To create a new CodeCatalyst workflow, select CI/CD from the navigation on the left and select Workflows (1). Then, select Create workflow (2), leave the default options, and select Create (3) as shown in Figure 5.

Create CodeCatalyst CI/CD workflow

Figure 5 – Create CodeCatalyst CI/CD workflow

Add “Push to Amazon ECR” Action

Add the Push to Amazon ECR action, and configure the environment where you created the ECR repository as shown in Figure 6. Refer to adding an action to learn how to add CodeCatalyst action.

Create ‘Push to ECR’ CodeCatalyst Action

Figure 6 – Create ‘Push to ECR’ Action

Select the Configuration tab and specify the configurations as shown in Figure7.

Configure ‘Push to ECR’ CodeCatalyst Action

Figure 7 – Configure ‘Push to ECR’ Action

Configure the Deploy action

1. Add a GitHub action for deploying to AKS as shown in Figure 8.

Github action to deploy to AKS

Figure 8 – Github action to deploy to AKS

2. Configure the GitHub action from the configurations tab by adding the following snippet to the GitHub Actions YAML property:

- name: Install Azure CLI
  run: pip install azure-cli
- name: Azure login
  id: login
  uses: azure/[email protected]
  with:
    creds: ${Secrets.AZURE_CREDENTIALS}
- name: Set AKS context
  id: set-context
  uses: azure/aks-set-context@v3
  with:
    resource-group: ${Secrets.RESOURCE_GROUP}
    cluster-name: ${Secrets.CLUSTER_NAME}
- name: Setup kubectl
  id: install-kubectl
  uses: azure/setup-kubectl@v3
- name: Deploy to AKS
  id: deploy-aks
  uses: Azure/k8s-deploy@v4
  with:
    namespace: default
    manifests: manifest.yaml
    pull-images: true

Github action configuration for deploying application to AKS

Figure 9 – Github action configuration

3. The workflow is now ready and can be validated by choosing ‘Validate’ and then saved to the repository by choosing ‘Commit’.
We have implemented an automated CI/CD workflow that builds the container image of the application (refer Figure 10), pushes the image to ECR, and deploys the application to AKS cluster. This CI/CD workflow is triggered as application code is pushed to the repository.

Automated CI/CD workflow

Figure 10 – Automated CI/CD workflow

Test the deployment

When the HTML application runs, Kubernetes exposes the application using a public facing load balancer. To find the external IP of the load balancer, connect to the AKS cluster and run the following command:

kubectl get service multicloud-container-deployment-service

The output of the above command should look like the image in Figure 11.

Output of kubectl get service command

Figure 11 – Output of kubectl get service

Paste the External IP into a browser to see the running HTML application as shown in Figure 12.

HTML application running successfully in AKS

Figure 12 – Application running in AKS

Cleanup

If you have been following along with the workflow described in the post, you should delete the resources you deployed so you do not continue to incur charges. First, delete the Amazon ECR repository using the AWS console. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project. There’s no cost associated with the CodeCatalyst project and you can continue using it. Finally, if you deployed the application on a new AKS cluster, delete the cluster from the Azure console. In case you deployed the application to an existing AKS cluster, run the following commands to delete the application resources.

kubectl delete deployment multicloud-container-deployment-app
kubectl delete services multicloud-container-deployment-service

Conclusion

In summary, this post showed how Amazon CodeCatalyst can help organizations deploy containerized workloads in a hybrid and multicloud environment. It demonstrated in detail how to set up and configure Amazon CodeCatalyst to deploy a containerized application to Azure Kubernetes Service, leveraging a CodeCatalyst workflow, and GitHub Actions. Learn more and get started with your Amazon CodeCatalyst journey!

If you have any questions or feedback, leave them in the comments section.

About Authors

Picture of Pawan

Pawan Shrivastava

Pawan Shrivastava is a Partner Solution Architect at AWS in the WWPS team. He focusses on working with partners to provide technical guidance on AWS, collaborate with them to understand their technical requirements, and designing solutions to meet their specific needs. Pawan is passionate about DevOps, automation and CI CD pipelines. He enjoys watching MMA, playing cricket and working out in the gym.

Picture of Brent

Brent Van Wynsberge

Brent Van Wynsberge is a Solutions Architect at AWS supporting enterprise customers. He accelerates the cloud adoption journey for organizations by aligning technical objectives to business outcomes and strategic goals, and defining them where needed. Brent is an IoT enthusiast, specifically in the application of IoT in manufacturing, he is also interested in DevOps, data analytics and containers.

Picture of Amandeep

Amandeep Bajwa

Amandeep Bajwa is a Senior Solutions Architect at AWS supporting Financial Services enterprises. He helps organizations achieve their business outcomes by identifying the appropriate cloud transformation strategy based on industry trends, and organizational priorities. Some of the areas Amandeep consults on are cloud migration, cloud strategy (including hybrid & multicloud), digital transformation, data & analytics, and technology in general.

Picture of Brian

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Let’s Architect! Getting started with containers

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-getting-started-with-containers/

Most of AWS customers building cloud-native applications or modernizing applications choose containers to run their microservices applications to accelerate innovation and time to market while lowering their total cost of ownership (TCO). Using containers in AWS comes with other benefits, such as increased portability, scalability, and flexibility.

The combination of containers technologies and AWS services also provides features such as load balancing, auto scaling, and service discovery, making it easier to deploy and manage applications at scale.

In this edition of Let’s Architect! we share useful resources to help you to get started with containers on AWS.

Container Build Lens

This whitepaper describes the Container Build Lens for the AWS Well-Architected Framework. It helps customers review and improve their cloud-based architectures and better understand the business impact of their design decisions. The document describes general design principles for containers, as well as specific best practices and implementation guidance using the Six Pillars of the Well-Architected Framework.

Take me to explore the Containers Build Lens!

Follow Containers Build Lens Best practices to architect your containers-based workloads

Follow Containers Build Lens Best practices to architect your containers-based workloads.

EKS Workshop

The EKS Workshop is a useful resource to familiarize yourself with Amazon Elastic Kubernetes Service (Amazon EKS) by practicing on real use-cases. It is built to help users learn about Amazon EKS features and integrations with popular open-source projects. The workshop is abstracted into high-level learning modules, including Networking, Security, DevOps Automation, and more. These are further broken down into standalone labs focusing on a particular feature, tool, or use case.

Once you’re done experimenting with EKS Workshop, start building your environments with Amazon EKS Blueprints, a collection of Infrastructure as Code (IaC) modules that helps you configure and deploy consistent, batteries-included Amazon EKS clusters across accounts and regions following AWS best practices. Amazon EKS Blueprints are available in both Terraform and CDK.

Take me to this workshop!

The workshop is abstracted into high-level learning modules, including Networking, Security, DevOps Automation, and more.

The workshop is abstracted into high-level learning modules, including Networking, Security, DevOps Automation, and more.

Architecting for resiliency on AWS App Runner

Learn how to architect an highly available and resilient application using AWS App Runner. With App Runner, you can start with just the source code of your application or a container image. The complexity of running containerized applications is abstracted away, including the cloud resources needed for running your web application or API. App Runner manages load balancers, TLS certificates, auto scaling, logs, metrics, teachability and more, so you can focus on implementing your business logic in a highly scalable and elastic environment.

Take me to this blog post!

A high-level architecture for an available and resilient application with AWS App Runner.

A high-level architecture for an available and resilient application with AWS App Runner

Securing Kubernetes: How to address Kubernetes attack vectors

As part of designing any modern system on AWS, it is necessary to think about the security implications and what can affect your security posture. This session introduces the fundamentals of the Kubernetes architecture and common attack vectors. It also includes security controls provided by Amazon EKS and suggestions on how to address them. With these strategies, you can learn how to reduce risk for your Kubernetes-based workloads.

Take me to this video!

Some common attack vectors that need addressing with Kubernetes

Some common attack vectors that need addressing with Kubernetes

See you next time!

Thanks for exploring architecture tools and resources with us!

Next time we’ll talk about serverless.

To find all the posts from this series, check out the Let’s Architect! page of the AWS Architecture Blog.

Optimizing GPU utilization for AI/ML workloads on Amazon EC2

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/optimizing-gpu-utilization-for-ai-ml-workloads-on-amazon-ec2/

­­­­This blog post is written by Ben Minahan, DevOps Consultant, and Amir Sotoodeh, Machine Learning Engineer.

Machine learning workloads can be costly, and artificial intelligence/machine learning (AI/ML) teams can have a difficult time tracking and maintaining efficient resource utilization. ML workloads often utilize GPUs extensively, so typical application performance metrics such as CPU, memory, and disk usage don’t paint the full picture when it comes to system performance. Additionally, data scientists conduct long-running experiments and model training activities on existing compute instances that fit their unique specifications. Forcing these experiments to be run on newly provisioned infrastructure with proper monitoring systems installed might not be a viable option.

In this post, we describe how to track GPU utilization across all of your AI/ML workloads and enable accurate capacity planning without needing teams to use a custom Amazon Machine Image (AMI) or to re-deploy their existing infrastructure. You can use Amazon CloudWatch to track GPU utilization, and leverage AWS Systems Manager Run Command to install and configure the agent across your existing fleet of GPU-enabled instances.

Overview

First, make sure that your existing Amazon Elastic Compute Cloud (Amazon EC2) instances have the Systems Manager Agent installed, and also have the appropriate level of AWS Identity and Access Management (IAM) permissions to run the Amazon CloudWatch Agent. Next, specify the configuration for the CloudWatch Agent in Systems Manager Parameter Store, and then deploy the CloudWatch Agent to our GPU-enabled EC2 instances. Finally, create a CloudWatch Dashboard to analyze GPU utilization.

Architecture Diagram depicting the integration between AWS Systems Manager with RunCommand Arguments stored in SSM Parameter Store, your Amazon GPU enabled EC2 instance with installed Amazon CloudWatch Agen­t, and Amazon CloudWatch Dashboard that aggregates and displays the ­reported metrics.

  1. Install the CloudWatch Agent on your existing GPU-enabled EC2 instances.
  2. Your CloudWatch Agent configuration is stored in Systems Manager Parameter Store.
  3. Systems Manager Documents are used to install and configure the CloudWatch Agent on your EC2 instances.
  4. GPU metrics are published to CloudWatch, which you can then visualize through the CloudWatch Dashboard.

Prerequisites

This post assumes you already have GPU-enabled EC2 workloads running in your AWS account. If the EC2 instance doesn’t have any GPUs, then the custom configuration won’t be applied to the CloudWatch Agent. Instead, the default configuration is used. For those instances, leveraging the CloudWatch Agent’s default configuration is better suited for tracking resource utilization.

For the CloudWatch Agent to collect your instance’s GPU metrics, the proper NVIDIA drivers must be installed on your instance. Several AWS official AMIs including the Deep Learning AMI already have these drivers installed. To see a list of AMIs with the NVIDIA drivers pre-installed, and for full installation instructions for Linux-based instances, see Install NVIDIA drivers on Linux instances.

Additionally, deploying and managing the CloudWatch Agent requires the instances to be running. If your instances are currently stopped, then you must start them to follow the instructions outlined in this post.

Preparing your EC2 instances

You utilize Systems Manager to deploy the CloudWatch Agent, so make sure that your EC2 instances have the Systems Manager Agent installed. Many AWS-provided AMIs already have the Systems Manager Agent installed. For a full list of the AMIs which have the Systems Manager Agent pre-installed, see Amazon Machine Images (AMIs) with SSM Agent preinstalled. If your AMI doesn’t have the Systems Manager Agent installed, see Working with SSM Agent for instructions on installing based on your operating system (OS).

Once installed, the CloudWatch Agent needs certain permissions to accept commands from Systems Manager, read Systems Manager Parameter Store entries, and publish metrics to CloudWatch. These permissions are bundled into the managed IAM policies AmazonEC2RoleforSSM, AmazonSSMReadOnlyAccess, and CloudWatchAgentServerPolicy. To create a new IAM role and associated IAM instance profile with these policies attached, you can run the following AWS Command Line Interface (AWS CLI) commands, replacing <REGION_NAME> with your AWS region, and <INSTANCE_ID> with the EC2 Instance ID that you want to associate with the instance profile:

aws iam create-role --role-name CloudWatch-Agent-Role --assume-role-policy-document  '{"Statement":{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}}'
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam create-instance-profile --instance-profile-name CloudWatch-Agent-Instance-Profile
aws iam add-role-to-instance-profile --instance-profile-name CloudWatch-Agent-Instance-Profile --role-name CloudWatch-Agent-Role
aws ec2 associate-iam-instance-profile --region <REGION_NAME> --instance-id <INSTANCE_ID> --iam-instance-profile Name=CloudWatch-Agent-Instance-Profile

Alternatively, you can attach the IAM policies to your existing IAM role associated with an existing IAM instance profile.

aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws ec2 associate-iam-instance-profile --region <REGION_NAME> --instance-id <INSTANCE_ID> --iam-instance-profile Name=<INSTANCE_PROFILE>

Once complete, you should see that your EC2 instance is associated with the appropriate IAM role.

An Amazon EC2 Instance with the CloudWatch-Agent-Role IAM Role attached

This role should have the AmazonEC2RoleforSSM, AmazonSSMReadOnlyAccess and CloudWatchAgentServerPolicy IAM policies attached.

The CloudWatch-Agent-Role IAM Role’s attached permission policies, Amazon EC2 Role for SSM, CloudWatch Agent Server ¬Policy, and Amazon SSM Read Only Access

Configuring and deploying the CloudWatch Agent

Before deploying the CloudWatch Agent onto our EC2 instances, make sure that those agents are properly configured to collect GPU metrics. To do this, you must create a CloudWatch Agent configuration and store it in Systems Manager Parameter Store.

Copy the following into a file cloudwatch-agent-config.json:

{
    "agent": {
        "metrics_collection_interval": 60,
        "run_as_user": "cwagent"
    },
    "metrics": {
        "aggregation_dimensions": [
            [
                "InstanceId"
            ]
        ],
        "append_dimensions": {
            "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            "ImageId": "${aws:ImageId}",
            "InstanceId": "${aws:InstanceId}",
            "InstanceType": "${aws:InstanceType}"
        },
        "metrics_collected": {
            "cpu": {
                "measurement": [
                    "cpu_usage_idle",
                    "cpu_usage_iowait",
                    "cpu_usage_user",
                    "cpu_usage_system"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ],
                "totalcpu": false
            },
            "disk": {
                "measurement": [
                    "used_percent",
                    "inodes_free"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ]
            },
            "diskio": {
                "measurement": [
                    "io_time"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ]
            },
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ],
                "metrics_collection_interval": 60
            },
            "swap": {
                "measurement": [
                    "swap_used_percent"
                ],
                "metrics_collection_interval": 60
            },
            "nvidia_gpu": {
                "measurement": [
                    "utilization_gpu",
                    "temperature_gpu",
                    "utilization_memory",
                    "fan_speed",
                    "memory_total",
                    "memory_used",
                    "memory_free",
                    "pcie_link_gen_current",
                    "pcie_link_width_current",
                    "encoder_stats_session_count",
                    "encoder_stats_average_fps",
                    "encoder_stats_average_latency",
                    "clocks_current_graphics",
                    "clocks_current_sm",
                    "clocks_current_memory",
                    "clocks_current_video"
                ],
                "metrics_collection_interval": 60
            }
        }
    }
}

Run the following AWS CLI command to deploy a Systems Manager Parameter CloudWatch-Agent-Config, which contains a minimal agent configuration for GPU metrics collection. Replace <REGION_NAME> with your AWS Region.

aws ssm put-parameter \
--region <REGION_NAME> \
--name CloudWatch-Agent-Config \
--type String \
--value file://cloudwatch-agent-config.json

Now you can see a CloudWatch-Agent-Config parameter in Systems Manager Parameter Store, containing your CloudWatch Agent’s JSON configuration.

CloudWatch-Agent-Config stored in Systems Manager Parameter Store

Next, install the CloudWatch Agent on your EC2 instances. To do this, you can leverage Systems Manager Run Command, specifically the AWS-ConfigureAWSPackage document which automates the CloudWatch Agent installation.

  1. Run the following AWS CLI command, replacing <REGION_NAME> with the Region into which your instances are deployed, and <INSTANCE_ID> with the EC2 Instance ID on which you want to install the CloudWatch Agent.
aws ssm send-command \
--query 'Command.CommandId' \
--region <REGION_NAME> \
--instance-ids <INSTANCE_ID> \
--document-name AWS-ConfigureAWSPackage \
--parameters '{"action":["Install"],"installationType":["In-place update"],"version":["latest"],"name":["AmazonCloudWatchAgent"]}'

2. To monitor the status of your command, use the get-command-invocation AWS CLI command. Replace <COMMAND_ID> with the command ID output from the previous step, <REGION_NAME> with your AWS region, and <INSTANCE_ID> with your EC2 instance ID.

aws ssm get-command-invocation --query Status --region <REGION_NAME> --command-id <COMMAND_ID> --instance-id <INSTANCE_ID>

3.Wait for the command to show the status Success before proceeding.

$ aws ssm send-command \
	 --query 'Command.CommandId' \
    --region us-east-2 \
    --instance-ids i-0123456789abcdef \
    --document-name AWS-ConfigureAWSPackage \
    --parameters '{"action":["Install"],"installationType":["Uninstall and reinstall"],"version":["latest"],"additionalArguments":["{}"],"name":["AmazonCloudWatchAgent"]}'

"5d8419db-9c48-434c-8460-0519640046cf"

$ aws ssm get-command-invocation --query Status --region us-east-2 --command-id 5d8419db-9c48-434c-8460-0519640046cf --instance-id i-0123456789abcdef

"Success"

Repeat this process for all EC2 instances on which you want to install the CloudWatch Agent.

Next, configure the CloudWatch Agent installation. For this, once again leverage Systems Manager Run Command. However, this time the AmazonCloudWatch-ManageAgent document which applies your custom agent configuration is stored in the Systems Manager Parameter Store to your deployed agents.

  1. Run the following AWS CLI command, replacing <REGION_NAME> with the Region into which your instances are deployed, and <INSTANCE_ID> with the EC2 Instance ID on which you want to configure the CloudWatch Agent.
aws ssm send-command \
--query 'Command.CommandId' \
--region <REGION_NAME> \
--instance-ids <INSTANCE_ID> \
--document-name AmazonCloudWatch-ManageAgent \
--parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["/CloudWatch-Agent-Config"],"optionalRestart":["yes"]}'

2. To monitor the status of your command, utilize the get-command-invocation AWS CLI command. Replace <COMMAND_ID> with the command ID output from the previous step, <REGION_NAME> with your AWS region, and <INSTANCE_ID> with your EC2 instance ID.

aws ssm get-command-invocation --query Status --region <REGION_NAME> --command-id <COMMAND_ID> --instance-id <INSTANCE_ID>

3. Wait for the command to show the status Success before proceeding.

$ aws ssm send-command \
    --query 'Command.CommandId' \
    --region us-east-2 \
    --instance-ids i-0123456789abcdef \
    --document-name AmazonCloudWatch-ManageAgent \
    --parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["/CloudWatch-Agent-Config"],"optionalRestart":["yes"]}'

"9a4a5c43-0795-4fd3-afed-490873eaca63"

$ aws ssm get-command-invocation --query Status --region us-east-2 --command-id 9a4a5c43-0795-4fd3-afed-490873eaca63 --instance-id i-0123456789abcdef

"Success"

Repeat this process for all EC2 instances on which you want to install the CloudWatch Agent. Once finished, the CloudWatch Agent installation and configuration is complete, and your EC2 instances now report GPU metrics to CloudWatch.

Visualize your instance’s GPU metrics in CloudWatch

Now that your GPU-enabled EC2 Instances are publishing their utilization metrics to CloudWatch, you can visualize and analyze these metrics to better understand your resource utilization patterns.

The GPU metrics collected by the CloudWatch Agent are within the CWAgent namespace. Explore your GPU metrics using the CloudWatch Metrics Explorer, or deploy our provided sample dashboard.

  1. Copy the following into a file, cloudwatch-dashboard.json, replacing instances of <REGION_NAME> with your Region:
{
    "widgets": [
        {
            "height": 10,
            "width": 24,
            "y": 16,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{"expression": "SELECT AVG(nvidia_smi_utilization_gpu) FROM SCHEMA(\"CWAgent\", InstanceId) GROUP BY InstanceId","id": "q1"}]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "title": "GPU Core Utilization",
                "yAxis": {
                    "left": {"label": "Percent","max": 100,"min": 0,"showUnits": false}
                }
            }
        },
        {
            "height": 7,
            "width": 8,
            "y": 0,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{"expression": "SELECT AVG(nvidia_smi_utilization_gpu) FROM SCHEMA(\"CWAgent\", InstanceId)", "label": "Utilization","id": "q1"}]
                ],
                "view": "gauge",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "title": "Average GPU Core Utilization",
                "yAxis": {"left": {"max": 100, "min": 0}
                },
                "liveData": false
            }
        },
        {
            "height": 9,
            "width": 24,
            "y": 7,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{ "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_used\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m1", "visible": false }],
                    [{ "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_total\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m2", "visible": false }],
                    [{ "expression": "SEARCH(' MetricName=\"mem_used_percent\" {CWAgent, InstanceId} ', 'Average')", "id": "m3", "visible": false }],
                    [{ "expression": "100*AVG(m1)/AVG(m2)", "label": "GPU", "id": "e2", "color": "#17becf" }],
                    [{ "expression": "AVG(m3)", "label": "RAM", "id": "e3" }]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {"min": 0,"max": 100,"label": "Percent","showUnits": false}
                },
                "title": "Average Memory Utilization"
            }
        },
        {
            "height": 7,
            "width": 8,
            "y": 0,
            "x": 8,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ { "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_used\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m1", "visible": false } ],
                    [ { "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_total\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m2", "visible": false } ],
                    [ { "expression": "100*AVG(m1)/AVG(m2)", "label": "Utilization", "id": "e2" } ]
                ],
                "sparkline": true,
                "view": "gauge",
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {"min": 0,"max": 100}
                },
                "liveData": false,
                "title": "GPU Memory Utilization"
            }
        }
    ]
}

2. run the following AWS CLI command, replacing <REGION_NAME> with the name of your Region:

aws cloudwatch put-dashboard \
    --region <REGION_NAME> \
    --dashboard-name My-GPU-Usage \
    --dashboard-body file://cloudwatch-dashboard.json

View the My-GPU-Usage CloudWatch dashboard in the CloudWatch console for your AWS region..

An example CloudWatch dashboard, My-GPU-Usage, showing the GPU usage metrics over time.

Cleaning Up

To avoid incurring future costs for resources created by following along in this post, delete the following:

  1. My-GPU-Usage CloudWatch Dashboard
  2. CloudWatch-Agent-Config Systems Manager Parameter
  3. CloudWatch-Agent-Role IAM Role

Conclusion

By following along with this post, you deployed and configured the CloudWatch Agent across your GPU-enabled EC2 instances to track GPU utilization without pausing in-progress experiments and model training. Then, you visualized the GPU utilization of your workloads with a CloudWatch Dashboard to better understand your workload’s GPU usage and make more informed scaling and cost decisions. For other ways that Amazon CloudWatch can improve your organization’s operational insights, see the Amazon CloudWatch documentation.

New – Self-Service Provisioning of Terraform Open-Source Configurations with AWS Service Catalog

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-self-service-provisioning-of-terraform-open-source-configurations-with-aws-service-catalog/

With AWS Service Catalog, you can create, govern, and manage a catalog of infrastructure as code (IaC) templates that are approved for use on AWS. These IaC templates can include everything from virtual machine images, servers, software, and databases to complete multi-tier application architectures. You can control which IaC templates and versions are available, what is configured by each version, and who can access each template based on individual, group, department, or cost center. End users such as engineers, database administrators, and data scientists can then quickly discover and self-service provision approved AWS resources that they need to use to perform their daily job functions.

When using Service Catalog, the first step is to create products based on your IaC templates. You can then collect products, together with configuration information, in a portfolio.

Starting today, you can define Service Catalog products and their resources using either AWS CloudFormation or Hashicorp Terraform and choose the tool that better aligns with your processes and expertise. You can now integrate your existing Terraform configurations into Service Catalog to have them part of a centrally approved portfolio of products and share it with the AWS accounts used by your end users. In this way, you can prevent inconsistencies and mitigate the risk of noncompliance.

When resources are deployed by Service Catalog, you can maintain least privilege access during provisioning and govern tagging on the deployed resources. End users of Service Catalog pick and choose what they need from the list of products and versions they have access to. Then, they can provision products in a single action regardless of the technology (CloudFormation or Terraform) used for the deployment.

The Service Catalog hub-and-spoke model that enables organizations to govern at scale can now be extended to include Terraform configurations. With the Service Catalog hub and spoke model, you can centrally manage deployments using a management/user account relationship:

  • One management account – Used to create Service Catalog products, organize them into portfolios, and share portfolios with user accounts
  • Multiple user accounts (up to thousands) – A user account is any AWS account in which the end users of Service Catalog are provisioning resources.

Let’s see how this works in practice.

Creating an AWS Service Catalog Product Using Terraform
To get started, I install the Terraform Reference Engine (provided by AWS on GitHub) that configures the code and infrastructure required for the Terraform open-source engine to work with AWS Service Catalog. I only need to do this once, in the management account for Service Catalog, and the setup takes just minutes. I use the automated installation script:

./deploy-tre.sh -r us-east-1

To keep things simple for this post, I create a product deploying a single EC2 instance using AWS Graviton processors and the Amazon Linux 2023 operating system. Here’s the content of my main.tf file:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.16"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region  = "us-east-1"
}

resource "aws_instance" "app_server" {
  ami           = "ami-00c39f71452c08778"
  instance_type = "t4g.large"

  tags = {
    Name = "GravitonServerWithAmazonLinux2023"
  }
}

I sign in to the AWS Management Console in the management account for Service Catalog. In the Service Catalog console, I choose Product list in the Administration section of the navigation pane. There, I choose Create product.

In Product details, I select Terraform open source as Product type. I enter a product name and description and the name of the owner.

Console screenshot.

In the Version details, I choose to Upload a template file (using a tar.gz archive). Optionally, I can specify the template using an S3 URL or an external code repository (on GitHub, GitHub Enterprise Server, or Bitbucket) using an AWS CodeStar provider.

Console screenshot.

I enter support details and custom tags. Note that tags can be used to categorize your resources and also to check permissions to create a resource. Then, I complete the creation of the product.

Adding an AWS Service Catalog Product Using Terraform to a Portfolio
Now that the Terraform product is ready, I add it to my portfolio. A portfolio can include both Terraform and CloudFormation products. I choose Portfolios from the Administrator section of the navigation pane. There, I search for my portfolio by name and open it. I choose Add product to portfolio. I search for the Terraform product by name and select it.

Console screenshot.

Terraform products require a launch constraint. The launch constraint specifies the name of an AWS Identity and Access Management (IAM) role that is used to deploy the product. I need to separately ensure that this role is created in every account with which the product is shared.

The launch role is assumed by the Terraform open-source engine in the management account when an end user launches, updates, or terminates a product. The launch role also contains permissions to describe, create, and update a resource group for the provisioned product and tag the product resources. In this way, Service Catalog keeps the resource group up-to-date and tags the resources associated with the product.

The launch role enables least privilege access for end users. With this feature, end users don’t need permission to directly provision the product’s underlying resources because your Terraform open-source engine assumes the launch role to provision those resources, such as an approved configuration of an Amazon Elastic Compute Cloud (Amazon EC2) instance.

In the Launch constraint section, I choose Enter role name to use a role I created before for this product:

  • The trust relationship of the role defines the entities that can assume the role. For this role, the trust relationship includes Service Catalog and the management account that contains the Terraform Reference Engine.
  • For permissions, the role allows to provision, update, and terminate the resources required by my product and to manage resource groups and tags on those resources.

Console screenshot.

I complete the addition of the product to my portfolio. Now the product is available to the end users who have access to this portfolio.

Launching an AWS Service Catalog Product Using Terraform
End users see the list of products and versions they have access to and can deploy them in a single action. If you already use Service Catalog, the experience is the same as with CloudFormation products.

I sign in to the AWS Console in the user account for Service Catalog. The portfolio I used before has been shared by the management account with this user account. In the Service Catalog console, I choose Products from the Provisioning group in the navigation pane. I search for the product by name and choose Launch product.

Console screenshot.

I let Service Catalog generate a unique name for the provisioned product and select the product version to deploy. Then, I launch the product.

Console screenshot.

After a few minutes, the product has been deployed and is available. The deployment has been managed by the Terraform Reference Engine.

Console screenshot.

In the Associated tags tab, I see that Service Catalog automatically added information on the portfolio and the product.

Console screenshot.

In the Resources tab, I see the resources created by the provisioned product. As expected, it’s an EC2 instance, and I can follow the link to open the Amazon EC2 console and get more information.

Console screenshot.

End users such as engineers, database administrators, and data scientists can continue to use Service Catalog and launch the products they need without having to consider if they are provisioned using Terraform or CloudFormation.

Availability and Pricing
AWS Service Catalog support for Terraform open-source configurations is available today in all AWS Regions where it is offered. There is no change in pricing when using Terraform. With Service Catalog, you pay for the API calls you make to the service, and you can start for free with the free tier. You also pay for the resources used and created by the Terraform Reference Engine. For more information, see Service Catalog Pricing.

Enable self-service provisioning at scale for your Terraform open-source configurations.

Danilo