
Enhancing cloud security in AI/ML: The little pickle story

Post Syndicated from Nur Gucu original https://aws.amazon.com/blogs/security/enhancing-cloud-security-in-ai-ml-the-little-pickle-story/

As AI and machine learning (AI/ML) become increasingly accessible through cloud service providers (CSPs) such as Amazon Web Services (AWS), new security issues can arise that customers need to address. AWS provides a variety of services for AI/ML use cases, and developers often interact with these services through different programming languages. In this blog post, we focus on Python and its pickle module, which supports a process called pickling to serialize and deserialize object structures. This functionality simplifies data management and the sharing of complex data across distributed systems. However, because of potential security issues, it’s important to use pickling with care (see the warning note in the Python documentation for the pickle module). In this post, we show you how to build secure AI/ML workloads that use this powerful Python module, how to detect where it’s in use (including places you might not know about) and when it might be getting abused, and which alternative approaches can help you avoid these issues.

Quick tips

Understanding insecure pickle serialization and deserialization in Python

Effective data management is crucial in Python programming, and many developers turn to the pickle module for serialization. However, issues can arise when deserializing data from untrusted sources. The byte stream that pickling produces is specific to Python, and its contents can’t be thoroughly evaluated until it’s unpickled. This is where security controls and validation become critical. Without proper validation, there’s a risk that an unauthorized user could inject unexpected code, potentially leading to arbitrary code execution, data tampering, or even unintended access to a system. In the context of AI model loading, secure deserialization is particularly important because it helps prevent outside parties from modifying model behavior, injecting backdoors, or causing inadvertent disclosure of sensitive data.

Throughout this post, we will refer to pickle serialization and deserialization collectively as pickling. Similar issues can be present in other languages (for example, Java and PHP) when untrusted data is used to recreate objects or data structures, resulting in potential security issues such as arbitrary code execution, data corruption, and unauthorized access.
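To make the risk concrete, the following minimal sketch shows why unpickling amounts to running code chosen by whoever produced the bytes. The class name Payload is hypothetical, and the built-in sorted is a harmless stand-in for whatever callable an unauthorized user might pick. The point is only that the callable runs inside pickle.loads, before the application has a chance to inspect the result.

```python
import pickle


class Payload:
    """Hypothetical object whose pickled form names a callable to invoke."""

    def __reduce__(self):
        # pickle records "call sorted([3, 1, 2])" instead of the object's state.
        # A harmful payload would name a different callable and arguments here.
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Payload())

# The callable runs during loading; no Payload instance is ever rebuilt.
result = pickle.loads(blob)
print(type(result).__name__, result)
```

Because the receiving side never asked for a sorted list, this illustrates that the content of the byte stream, not the loading code, decides what gets executed.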

Static code analysis compared to dynamic testing for detecting pickling

Security code reviews, including static code analysis, offer valuable early detection and thorough coverage of pickling-related issues. By examining source code (including third-party libraries and custom code) before deployment, teams can minimize security risks in a cost-effective way. Tools that provide static analysis can automatically flag unsafe pickling patterns, giving developers actionable insights to address issues promptly. Regular code reviews also help developers improve secure coding skills over time.

While static code analysis provides a comprehensive white-box approach, dynamic testing can uncover context-specific issues that only appear during runtime. Both methods are important. In this post, we focus primarily on the role of static code analysis in identifying unsafe pickling.

Tools like Amazon CodeGuru and Semgrep are effective at detecting security issues early. For open source projects, Semgrep is a great option to maintain consistent security checks.

The risks of insecure pickling in AI/ML

Pickling issues in AI/ML contexts can be especially concerning.

  • Unvalidated object loading: AI/ML models are often serialized for future use. Loading these models from untrusted sources without validation can result in arbitrary code execution. Libraries such as pickle, joblib, and some YAML configurations allow serialization but must be handled securely.
    • For example: If a web application stores user input using pickle and unpickles it later with no validation, an unauthorized user could craft a harmful payload that executes arbitrary code on the server.
  • Data integrity: The integrity of pickled data is critical. Unexpectedly crafted data could corrupt models, resulting in incorrect predictions or behaviors, which is especially concerning in sensitive domains such as finance, healthcare, and autonomous systems.
    • For example: A team updates its AI model architecture or preprocessing steps but forgets to retrain and save the updated model. Loading the old pickled model under new code might trigger errors or unpredictable outcomes.
  • Exposure of sensitive information: Pickling often includes all attributes of an object, potentially exposing sensitive data such as credentials or secrets.
    • For example: An ML model might contain database credentials within its serialized state. If shared or stored without precautions, an unauthorized user who unpickles the file might gain unintended access to these credentials.
  • Insufficient data protection: When sent across networks or stored without encryption, pickled data can be intercepted, leading to inadvertent disclosure of sensitive information.
    • For example: In a healthcare environment, a pickled AI model containing patient data could be transmitted over an unsecured network, enabling an outside party to intercept and read sensitive information.
  • Performance overhead: Pickling can be slower than other serialization formats (such as JSON or Protocol Buffers), which can affect ML and large language model (LLM) applications when inference speed is critical.
    • For example: In a real-time natural language processing (NLP) application using an LLM, heavy pickling or unpickling operations might reduce responsiveness and degrade the user experience.
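For the exposure of sensitive information described in the list above, one possible mitigation is to control what pickle stores. The sketch below is illustrative only: ModelWrapper and its db_password attribute are hypothetical names, and the approach assumes you can change the class being pickled. It uses the standard __getstate__ hook to drop the credential before serialization.

```python
import pickle


class ModelWrapper:
    """Hypothetical model holder that also carries a live credential."""

    def __init__(self, weights, db_password):
        self.weights = weights
        self.db_password = db_password

    def __getstate__(self):
        # Copy the instance state and drop fields that must never be persisted.
        state = self.__dict__.copy()
        state.pop("db_password", None)
        return state


wrapper = ModelWrapper(weights=[0.1, 0.2], db_password="example-secret")
blob = pickle.dumps(wrapper)

# The restored object has its weights but no credential attribute.
restored = pickle.loads(blob)
print(restored.weights, hasattr(restored, "db_password"))
```

Keeping credentials out of model objects altogether is the more robust design; the hook is a fallback for objects you cannot easily restructure.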

Detecting unsafe unpickling with static code analysis tools

Static code analysis (SCA) is a valuable practice for applications dealing with pickled data, because it helps detect insecure pickling before deployment. By integrating SCA tools into the development workflow, teams can spot questionable deserialization patterns as soon as code is committed. This proactive approach reduces the risk of events involving unexpected code execution or unintended access due to unsafe object loading.

For instance, in a financial services application where objects are routinely pickled, an SCA tool can scan new commits to detect unvalidated unpickling. If identified, the development team can quickly address the issue, protecting both the integrity of the application and sensitive financial data.

Patterns in the source code

There are various ways to load a pickle object in Python, and many Python libraries include their own function for doing so. Detection methods therefore need to be tailored to your team’s coding practices and the package dependencies your project actually uses. An effective approach is to catalog all Python libraries used in the project, then create custom rules in your static code analysis tool to detect unsafe pickling or unpickling within those libraries.

CodeGuru and other static analysis tools continue to evolve their capability to detect unsafe pickling patterns. Organizations can use these tools and create custom rules to identify potential security issues in AI/ML pipelines.

Let’s define the steps for creating a safe process for addressing pickling issues:

  1. Generate a list of all the Python libraries that are used in your repository or environment.
  2. Check the static code analysis tool in your pipeline for its current rules and its ability to accept custom rules. If the tool can discover all the libraries used in your project, you can rely on it. If it can’t, consider adding custom rules for the libraries it misses.
  3. Most of the issues can be identified with well-designed, context-driven patterns in the static code analysis tool. For addressing the pickling issues, you need to identify pickling and unpickling functions.
  4. Implement and test the custom rules to verify full coverage of pickling and unpickling risks. Let’s identify patterns for a few libraries:
    • NumPy can efficiently pickle and unpickle arrays; useful for scientific computing workflows requiring serialized arrays. To catch potential unsafe pickle usage in NumPy, custom rules could target patterns like:
      import numpy as np
      data = np.load('data.npy', allow_pickle=True)

    • npyfile is a utility for loading NumPy arrays from pickled files. You can add the following patterns to your custom rules to discover potentially unsafe pickle object usage.
      import npyfile
      data = npyfile.load('example.pkl')

    • pandas can pickle and unpickle DataFrames using pickle, allowing for efficient storage and retrieval of tabular data. You can add the following patterns to your custom rules to discover potentially unsafe pickle object usage.
      import pandas as pd
      df = pd.read_pickle('dataframe.pkl')

    • joblib is often used for pickling and unpickling Python objects that involve large data, especially NumPy arrays, more efficiently than standard pickle. You can add the following patterns to your custom rules to discover potentially unsafe pickle object usage.
      from joblib import load
      data = load('large_data.pkl')

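    • Scikit-learn relies on joblib for pickling and unpickling objects, which is particularly useful for models. Older releases bundled it as sklearn.externals.joblib (removed in scikit-learn 0.23), so the import shown here mainly appears in legacy code. You can add the following patterns to your custom rules to discover potentially unsafe pickle object usage.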
      from sklearn.externals import joblib
      data = joblib.load('example.pkl')

    • PyTorch provides utilities for loading pickled objects that are especially useful for ML models and tensors. You can add the following patterns to your custom rule format to discover potentially unsafe pickle object usage.
      import torch
      data = torch.load('example.pkl')

By searching for these functions and parameters in code, you can set up targeted rules that highlight potential issues with pickling.
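The patterns above can be prototyped with Python's standard ast module before you commit them to a dedicated tool. The following is a minimal sketch, not a CodeGuru or Semgrep rule: the function names and the SUSPECT_CALLS set are assumptions made for illustration, and the matcher only recognizes the literal dotted names shown. It does not resolve import aliases, so it would miss, for example, from joblib import load.

```python
import ast

# Dotted call names treated as pickle-loading sinks (illustrative, not exhaustive).
SUSPECT_CALLS = {
    "pickle.load", "pickle.loads",
    "pd.read_pickle", "pandas.read_pickle",
    "joblib.load", "torch.load",
}
NUMPY_LOADS = {"np.load", "numpy.load"}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_unsafe_loads(source):
    """Return sorted (line, description) pairs for suspicious load calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name in SUSPECT_CALLS:
            findings.append((node.lineno, name))
        elif name in NUMPY_LOADS:
            # np.load is only a pickle sink when allow_pickle=True is passed.
            for kw in node.keywords:
                if (kw.arg == "allow_pickle"
                        and isinstance(kw.value, ast.Constant)
                        and kw.value.value is True):
                    findings.append((node.lineno, name + " (allow_pickle=True)"))
    return sorted(findings)


sample = "\n".join([
    "import numpy as np",
    "import pandas as pd",
    "a = np.load('data.npy', allow_pickle=True)",
    "b = np.load('data.npy')",
    "df = pd.read_pickle('dataframe.pkl')",
])
findings = find_unsafe_loads(sample)
for line, description in findings:
    print(f"line {line}: {description}")
```

A production rule would also need to track aliases and wrapper functions, which is where dedicated static analysis tools add value over a sketch like this.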

Effective mitigation

Addressing pickling issues requires not only detection, but also clear guidance on remediation. Consider recommending more secure formats or validations where possible as follows:

  • PyTorch
    • Use Safetensors to store tensors. If pickling remains necessary, add integrity checks (for example, hashing) for serialized data.
  • pandas
    • Verify data sources and integrity when using pd.read_pickle. Encourage safer alternatives (for example, CSV, HDF5, or Parquet) to help avoid pickling risks.
  • scikit-learn (via joblib)
    • Consider Skops for safer persistence. If switching formats isn’t feasible, implement strict validation checks before loading.
  • General advice
    • Identify safer libraries or methods whenever possible.
    • Switch to formats such as CSV or JSON for data, unless object-specific serialization is absolutely required.
    • Perform source and integrity checks before loading pickle files—even those considered trusted.
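As one way to implement the integrity checks mentioned above, the following sketch prepends an HMAC-SHA256 tag to the pickled bytes and verifies it before unpickling. The helper names and the hard-coded demo key are assumptions for illustration; in practice the key would come from a secrets store. Note that a valid tag only shows the bytes came from a key holder and were not altered. It does not make the pickle contents themselves safe.

```python
import hashlib
import hmac
import pickle

TAG_LENGTH = 32  # size of a SHA-256 digest in bytes


def _tag(payload: bytes, key: bytes) -> bytes:
    return hmac.new(key, payload, hashlib.sha256).digest()


def dumps_signed(obj, key: bytes) -> bytes:
    """Pickle obj and prepend an HMAC tag computed over the pickled bytes."""
    payload = pickle.dumps(obj)
    return _tag(payload, key) + payload


def loads_verified(blob: bytes, key: bytes):
    """Verify the tag first; unpickle only if the bytes are unmodified."""
    tag, payload = blob[:TAG_LENGTH], blob[TAG_LENGTH:]
    if not hmac.compare_digest(tag, _tag(payload, key)):
        raise ValueError("Integrity check failed; refusing to unpickle.")
    return pickle.loads(payload)


demo_key = b"demo-key-not-for-production"  # hypothetical key for the example
blob = dumps_signed({"weights": [1, 2, 3]}, demo_key)
restored = loads_verified(blob, demo_key)
print(restored)
```

The verification happens before pickle.loads is called, so a tampered file is rejected without any of its contents being interpreted.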

Example

The following example applies the preceding guidance: it encrypts pickled data before storing it in Amazon S3 and loads it through a restricted unpickler.

import io
import base64
import pickle
import boto3
import numpy as np
from cryptography.fernet import Fernet

###############################################################################
# 1) RESTRICTED UNPICKLER
###############################################################################
#
# By default, pickle can execute arbitrary code when loading. Here we implement
# a custom Unpickler that only allows certain safe modules/classes. Adjust this
# to your application's requirements.
#

class RestrictedUnpickler(pickle.Unpickler):
    """
    Restricts unpickling to only the modules/classes we explicitly allow.
    """
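    # NumPy rebuilds arrays through internal helper functions, so those helpers
    # must be allowed as well. Their module path depends on the NumPy version
    # and the pickle protocol; verify these names against your environment.
    allowed_modules = {
        "numpy": set(["ndarray", "dtype"]),
        "numpy.core.multiarray": set(["_reconstruct"]),   # NumPy 1.x
        "numpy._core.multiarray": set(["_reconstruct"]),  # NumPy 2.x
        "numpy.core.numeric": set(["_frombuffer"]),       # NumPy 1.x, protocol 5
        "numpy._core.numeric": set(["_frombuffer"]),      # NumPy 2.x, protocol 5
        "builtins": set(["tuple", "list", "dict", "set", "frozenset", "int", "float", "bool", "str"])
    }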

    def find_class(self, module, name):
        if module in self.allowed_modules:
            if name in self.allowed_modules[module]:
                return super().find_class(module, name)
        # If not allowed, raise an error to prevent arbitrary code execution.
        raise pickle.UnpicklingError(f"Global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    """Helper function to load pickle data using the RestrictedUnpickler."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

###############################################################################
# 2) AWS KMS & ENCRYPTION HELPERS
###############################################################################

def generate_data_key(kms_key_id: str, region: str = "us-east-1"):
    """
    Generates a fresh data key using AWS KMS. 
    Returns (plaintext_key, encrypted_data_key).
    """
    kms_client = boto3.client("kms", region_name=region)
    response = kms_client.generate_data_key(KeyId=kms_key_id, KeySpec='AES_256')
    
    # Plaintext data key (use to encrypt the pickle data locally)
    plaintext_key = response["Plaintext"]
    # Encrypted data key (store along with your ciphertext)
    encrypted_data_key = response["CiphertextBlob"]
    return plaintext_key, encrypted_data_key

def decrypt_data_key(encrypted_data_key: bytes, region: str = "us-east-1"):
    """
    Decrypts the encrypted data key via AWS KMS, returning the plaintext key.
    """
    kms_client = boto3.client("kms", region_name=region)
    response = kms_client.decrypt(CiphertextBlob=encrypted_data_key)
    return response["Plaintext"]

def build_fernet_key(plaintext_key: bytes) -> Fernet:
    """
    Construct a Fernet instance from a 32-byte data key.
    Fernet requires a 32-byte key *encoded* in URL-safe base64.
    """
    if len(plaintext_key) < 32:
        raise ValueError("Data key is smaller than 32 bytes; cannot build a Fernet key.")
    fernet_key = base64.urlsafe_b64encode(plaintext_key[:32])
    return Fernet(fernet_key)

###############################################################################
# 3) MAIN LOGIC
###############################################################################

def upload_pickled_data_s3(
    numpy_obj: np.ndarray,
    bucket_name: str,
    s3_key: str,
    kms_key_id: str,
    region: str = "us-east-1"
):
    """
    Pickle a numpy object, encrypt it locally, and upload the ciphertext + 
    encrypted data key to S3.
    """
    # 1. Generate data key from KMS
    plaintext_key, encrypted_data_key = generate_data_key(kms_key_id, region)
    
    # 2. Build Fernet from plaintext data key
    fernet = build_fernet_key(plaintext_key)
    
    # 3. Serialize the numpy object with pickle
    pickled_data = pickle.dumps(numpy_obj, protocol=pickle.HIGHEST_PROTOCOL)
    
    # 4. Encrypt the pickled data
    encrypted_data = fernet.encrypt(pickled_data)
    
    # 5. Upload to S3 along with the encrypted data key (in metadata)
    s3_client = boto3.client("s3", region_name=region)
    s3_client.put_object(
        Bucket=bucket_name,
        Key=s3_key,
        Body=encrypted_data,
        Metadata={
            "encrypted_data_key": base64.b64encode(encrypted_data_key).decode("utf-8")
        }
    )
    print(f"Encrypted pickle uploaded to s3://{bucket_name}/{s3_key}")

def download_and_unpickle_data_s3(
    bucket_name: str,
    s3_key: str,
    region: str = "us-east-1"
) -> np.ndarray:
    """
    Download the ciphertext and the encrypted data key from S3. Decrypt the data 
    key with KMS, use it to decrypt the pickled data, then load with a restricted 
    unpickler for safety.
    """
    s3_client = boto3.client("s3", region_name=region)
    
    # 1. Get object from S3
    response = s3_client.get_object(Bucket=bucket_name, Key=s3_key)
    
    # 2. Extract the encrypted data key from metadata
    metadata = response["Metadata"]
    encrypted_data_key_b64 = metadata.get("encrypted_data_key")
    if not encrypted_data_key_b64:
        raise ValueError("Missing encrypted_data_key in S3 object metadata.")
    
    encrypted_data_key = base64.b64decode(encrypted_data_key_b64)
    
    # 3. Decrypt data key via KMS
    plaintext_key = decrypt_data_key(encrypted_data_key, region)
    fernet = build_fernet_key(plaintext_key)
    
    # 4. Decrypt the pickled data
    encrypted_data = response["Body"].read()
    decrypted_pickled_data = fernet.decrypt(encrypted_data)
    
    # 5. Use restricted unpickler to load the numpy object
    numpy_obj = restricted_loads(decrypted_pickled_data)
    
    return numpy_obj

###############################################################################
# DEMO USAGE
###############################################################################

if __name__ == "__main__":
    # --- Replace with your actual values ---
    KMS_KEY_ID = "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
    BUCKET_NAME = "your-secure-bucket"
    S3_OBJECT_KEY = "encrypted_npy_demo.bin"
    AWS_REGION = "us-east-1"  # or region of your choice
    
    # Example numpy array
    original_array = np.random.rand(2, 3)
    print("Original Array:")
    print(original_array)
    
    # Upload (pickle + encrypt) to S3
    upload_pickled_data_s3(
        numpy_obj=original_array,
        bucket_name=BUCKET_NAME,
        s3_key=S3_OBJECT_KEY,
        kms_key_id=KMS_KEY_ID,
        region=AWS_REGION
    )
    
    # Download (decrypt + unpickle) from S3
    retrieved_array = download_and_unpickle_data_s3(
        bucket_name=BUCKET_NAME,
        s3_key=S3_OBJECT_KEY,
        region=AWS_REGION
    )
    
    print("\nRetrieved Array:")
    print(retrieved_array)
    
    # Verify integrity
    assert np.allclose(original_array, retrieved_array), "Arrays do not match!"
    print("\nSuccess! The retrieved array matches the original array.")
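The allow-list behavior of a restricted unpickler can be checked locally without any AWS resources. The following sketch is a reduced, standard-library-only variant of the same idea. The names AllowListUnpickler, ALLOWED, and Payload are hypothetical, and the built-in sorted again stands in for a callable that an unauthorized user might reference.

```python
import io
import pickle

# Illustrative allow-list of (module, name) pairs; adjust to your application.
ALLOWED = {("builtins", "list"), ("builtins", "dict")}


class AllowListUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global by name.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"Global '{module}.{name}' is forbidden")


def allow_list_loads(data: bytes):
    return AllowListUnpickler(io.BytesIO(data)).load()


class Payload:
    """Hypothetical payload that asks the unpickler to call sorted()."""

    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


# Plain containers reference no globals, so they load normally.
plain = allow_list_loads(pickle.dumps({"weights": [1, 2, 3]}))

# The payload references builtins.sorted, which is not on the allow-list.
try:
    allow_list_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError as exc:
    blocked = True
    print(f"Rejected: {exc}")
```

The rejection happens when the global is looked up, before the callable can run, which is why the allow-list should stay as small as the application permits.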

Conclusion

With the rapid expansion of cloud technologies, integrating static code analysis into your AI/ML development process is increasingly important. While pickling offers a powerful way to serialize objects for AI/ML and LLM applications, you can mitigate potential risks by applying manual secure code reviews, setting up automated SCA with custom rules, and following best practices such as using alternative serialization methods or verifying data integrity.

When working with ML models on AWS, see the AWS Well-Architected Framework’s Machine Learning Lens for guidance on secure architecture and recommended practices. By combining these approaches, you can maintain a strong security posture and streamline the AI/ML development lifecycle.

 
 

Nur Gucu

Nur is a Security Engineer at Amazon with 10 years of offensive security expertise, specializing in generative AI security, security architecture, and offensive testing. Nur has developed security products at startups and banks, and has created frameworks for emerging technologies. She brings practical experience to solve complex AI security challenges, and yet believes as many as six impossible things before breakfast.
Matt Schwartz

Matt is an Amazon Principal Security Engineer specializing in generative AI security, risk management, and cloud computing expertise. His two-decade expertise enables organizations to implement advanced AI while maintaining strict security protocols. Matthew develops strategic frameworks that safeguard critical assets and ensure compliance in the evolving digital landscape, securing complex systems during transformations.

Effectively implementing resource controls policies in a multi-account environment

Post Syndicated from Tatyana Yatskevich original https://aws.amazon.com/blogs/security/effectively-implementing-resource-controls-policies-in-a-multi-account-environment/

Every organization strives to empower teams to drive innovation while safeguarding their data and systems from unintended access. For organizations that have thousands of Amazon Web Services (AWS) resources spread across multiple accounts, organization-wide permissions guardrails can help maintain secure and compliant configurations. For example, some AWS services support resource-based policies that can be used to grant identities permissions to perform actions on the resources they’re attached to. With the management of resource-based policies frequently delegated to application owners, central security teams use permissions guardrails to help ensure that possible misconfigurations don’t lead to unintended access to these resources.

In this post, we discuss how you can use resource control policies (RCPs) to centrally restrict access to resources. We demonstrate how RCPs can help improve your security posture while allowing even more freedom to developers in managing their resources, thus reducing friction between central security and application teams. Using a sample use case, we uncover key considerations for designing and effectively implementing RCPs in your organization at scale.

If you’re new to RCPs, we recommend starting with Introducing resource control policies (RCPs), a new type of authorization policy in AWS Organizations, which provides an introduction to RCPs and their role in your security strategy.

RCP implementation journey

RCPs are a type of authorization policy in AWS Organizations. RCPs work alongside service control policies (SCPs) to help establish permissions guardrails across multiple accounts in your organization. To understand their differences and use cases, see General use cases for SCPs and RCPs and Enforcing enterprise-wide preventive controls with AWS Organizations.

We recommend implementing permissions guardrails, including RCPs, using the following iterative process, which consists of five phases (as shown in Figure 1).

  1. Examine your security control objectives
  2. Design permissions guardrails
  3. Anticipate potential impacts
  4. Implement permissions guardrails
  5. Monitor permissions guardrails

Figure 1: Permissions guardrails implementation journey

This phased approach helps ensure an effective integration of RCPs into your security strategy, improving your security posture while helping to maintain business continuity. Let’s explore each phase of RCP implementation in detail and outline key considerations for an effective implementation strategy.

Phase 1: Examine your security control objectives

The first step in implementing RCPs is identifying areas where RCPs can help improve your security posture or optimize the implementation of controls for your organization’s specific security control objectives.

Your control objectives can be influenced by a variety of factors such as compliance and regulatory requirements, legal and contractual obligations, types of workloads, data classification, and your organization’s threat model. After your control objectives are well-defined and prioritized, identify those that can be achieved using RCPs.

Like SCPs, RCPs are designed to establish coarse-grained access controls, security invariants that rarely change and serve as always-on boundaries across a wide range of AWS resources in your accounts. RCPs aren’t for managing fine-grained access controls. You will keep using policies such as resource-based and identity-based policies to apply least-privilege permissions.

More specifically, the following are key control objectives that you can achieve using RCPs:

  • Establish a data perimeter around your AWS resources. For example, you can use RCPs to help ensure that only trusted identities can access your AWS resources.
  • Mitigate the cross-service confused deputy risk. You can use RCPs to help ensure that your AWS resources are accessed by AWS services only on behalf of your organization.
  • Apply consistent access controls to your AWS resources regardless of the identities accessing them. For example, you can use RCPs to help ensure your Amazon Simple Storage Service (Amazon S3) buckets require TLS v1.2 or higher for in-transit encryption.

For additional use cases and types of controls that can be implemented using RCPs, you can explore the resource control policy examples repository. In this post, we demonstrate how to help ensure that only trusted identities can access your AWS Identity and Access Management (IAM) roles.

Let’s begin with the scenario illustrated in Figure 2. Your company’s central cloud team manages your corporate AWS Organizations organization, which consists of two corporate AWS accounts. An IAM principal in Account A should be able to assume an IAM role in Account B to perform day-to-day operations. To align to the broader control objective of Only trusted identities can access my resources, the central security team wants to make sure that the IAM role in Account B (my resource) can only be assumed by IAM principals that belong to their organization (trusted identities).

Figure 2: Simple scenario depicting a trusted identity accessing an IAM role

One way of achieving this control objective is to follow the principle of least-privilege and make sure that the role trust policy, the resource-based policy attached to the IAM role, only allows access to identities that require that access. The following is an example trust policy that grants permissions to Role A in Account A to assume Role B in Account B.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantCrossAccountAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<my-account-a-id>:role/RoleA"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

In organizations that have only a few accounts, central teams typically manage these policies. While this centralized governance model helps ensure that trust policies applied to roles are always restricted to trusted identities, it can also impede the productivity of application teams when operating at a greater scale.

Assume that your company has started growing its cloud footprint so much that your central security team now must achieve the same control objective with hundreds of IAM roles that are spread across multiple AWS accounts, as demonstrated in Figure 3.

Figure 3: Restricting access by managing individual IAM role trust policies

At this scale, we see organizations delegating permissions management to application teams to better support the growth of their business and empower developers to innovate faster. While central security teams no longer have full control over the permissions granted to resources across AWS accounts, they must make sure that access is aligned with their organization’s security standard. For example, they might want to make sure that the GrantCrossAccountAccess statement that is now managed by developers doesn’t inadvertently grant access to an account that doesn’t belong to their organization. Previously, central security teams typically achieved this by developing automated mechanisms to insert a standard statement into all trust policies. This statement helped ensure that access remained bounded to their organization, even when developers configured broad access permissions for their roles. The following is an example trust policy where a developer granted permissions to an external account through the GrantCrossAccountAccess statement. However, because of the RestrictAccessToMyOrg statement added to the policy by the central security team, the external account will be unable to use these permissions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantCrossAccountAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<noncorp-account-id>:role/<role-name>"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Sid": "RestrictAccessToMyOrg",
      "Effect": "Deny",
      "Principal": {
        "AWS": "*"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:PrincipalOrgID": "<my-org-id>"
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}

The RestrictAccessToMyOrg statement uses the aws:PrincipalOrgID and aws:PrincipalIsAWSService condition keys to restrict access to principals within your organization or to AWS service principals. The BoolIfExists operator with the aws:PrincipalIsAWSService condition key is required if the roles you’re applying a control to are service roles that are used by AWS services to perform operations on your behalf. When an AWS service assumes a service role, it uses its AWS service principal, an identity that is owned by AWS and that does not belong to your organization.
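The combined effect of the two condition operators can be easier to follow as a small truth function. The sketch below is a simplified model for illustration, not the IAM policy evaluation engine: the function names and the MY_ORG_ID value are hypothetical, the request context is represented as a plain dictionary, and only the RestrictAccessToMyOrg conditions are modeled. The Deny applies only when both conditions match, and an IfExists operator counts as matching when the key is absent.

```python
MY_ORG_ID = "o-exampleorgid"  # hypothetical organization ID


def string_not_equals_if_exists(context, key, expected):
    # Matches when the key is absent or its value differs from the expected one.
    return key not in context or context[key] != expected


def bool_if_exists(context, key, expected):
    # Matches when the key is absent or its value equals the expected one.
    return key not in context or context[key] == expected


def deny_applies(context):
    """True when both conditions of the Deny statement match the request."""
    return (
        string_not_equals_if_exists(context, "aws:PrincipalOrgID", MY_ORG_ID)
        and bool_if_exists(context, "aws:PrincipalIsAWSService", "false")
    )


cases = {
    "principal in my organization": {
        "aws:PrincipalOrgID": MY_ORG_ID, "aws:PrincipalIsAWSService": "false"},
    "principal in another organization": {
        "aws:PrincipalOrgID": "o-otherorg", "aws:PrincipalIsAWSService": "false"},
    "principal outside any organization": {
        "aws:PrincipalIsAWSService": "false"},
    "AWS service principal": {
        "aws:PrincipalIsAWSService": "true"},
}
for label, context in cases.items():
    print(f"{label}: {'denied' if deny_applies(context) else 'not denied'}")
```

In this model, only the two external principals are denied. Principals in the organization and AWS service principals fall through to the Allow statements in the trust policy.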

The central security teams could, for example, use AWS Config rules to detect misconfigurations and then use AWS Config remediation to automatically add the RestrictAccessToMyOrg statement to the IAM roles’ trust policies when new IAM roles are created or their trust policies are changed. Even though the addition of the RestrictAccessToMyOrg statement to trust policies can be automated, RCPs can greatly simplify enforcement of such coarse-grained controls in a multi-account environment.
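The kind of misconfiguration such detection looks for can be sketched as a pure function over a trust policy document. This is a hypothetical check, not an AWS Config rule: the function name, the set of organization account IDs, and the example account numbers are assumptions, and the sketch only inspects Allow statements with AWS principals.

```python
def external_principals(trust_policy, org_account_ids):
    """Return principals in Allow statements that fall outside the given accounts."""
    flagged = []
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if principal == "*":
            flagged.append("*")
            continue
        aws_principals = principal.get("AWS", [])
        if isinstance(aws_principals, str):
            aws_principals = [aws_principals]
        for value in aws_principals:
            if value == "*":
                flagged.append(value)
                continue
            # The account ID is the fifth field of an ARN; bare IDs are used as-is.
            account_id = value.split(":")[4] if value.startswith("arn:") else value
            if account_id not in org_account_ids:
                flagged.append(value)
    return flagged


# Hypothetical account IDs for illustration.
org_accounts = {"111122223333", "444455556666"}
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::111122223333:role/RoleA"},
         "Action": "sts:AssumeRole"},
        {"Effect": "Allow",
         "Principal": {"AWS": "arn:aws:iam::999999999999:role/Partner"},
         "Action": "sts:AssumeRole"},
    ],
}
flagged = external_principals(policy, org_accounts)
print(flagged)
```

A check like this has to run against every role and every policy change, which is the operational burden that a centrally attached guardrail removes.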

Phase 2: Design permissions guardrails

Central security teams can implement permissions guardrails by creating an RCP that centrally blocks external access to IAM roles. The RCP that you will implement contains similar restrictions to the RestrictAccessToMyOrg statement that you used in the IAM trust policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictAccessToMyOrg",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "sts:AssumeRole",
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:PrincipalOrgID": "<my-org-id>"
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}

As with SCPs, you attach the RCP to an account, organizational unit (OU), or the root of your organization. After being attached, the RCP automatically applies to applicable resources (in this case, IAM roles) within the scope of that AWS Organizations entity. This centralized approach alleviates the need to modify hundreds of trust policies across multiple accounts, lowering the operational overhead for central security teams and helping ensure consistent access controls are applied at scale. RCPs also help you achieve separation of duties, with developers still managing their least-privilege permissions in trust policies and administrators applying coarse-grained access controls in RCPs. If developers make configuration mistakes while managing permissions for their applications, the preventative access controls implemented using RCPs will help ensure that they stay within your organization’s access control guidelines. See How AWS enforcement code logic evaluates requests to allow or deny access to understand how different policy types impact the authorization process.
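Creating and attaching the RCP can also be scripted. The following sketch builds the policy document shown earlier and passes it to the AWS Organizations create_policy and attach_policy operations through a client object supplied by the caller. The function names, the default policy name, and the description text are assumptions for illustration, and the sketch assumes that the RCP policy type is already enabled in your organization and that the caller has the required permissions.

```python
import json


def build_org_only_rcp(org_id: str) -> str:
    """Return the RestrictAccessToMyOrg policy as a JSON string."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RestrictAccessToMyOrg",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "sts:AssumeRole",
                "Resource": "*",
                "Condition": {
                    "StringNotEqualsIfExists": {"aws:PrincipalOrgID": org_id},
                    "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


def create_and_attach_rcp(org_client, org_id: str, target_id: str,
                          name: str = "RestrictAccessToMyOrg") -> str:
    """Create the RCP and attach it to an account, OU, or root; return its ID.

    org_client is expected to behave like boto3.client("organizations").
    """
    response = org_client.create_policy(
        Content=build_org_only_rcp(org_id),
        Description="Deny sts:AssumeRole to principals outside the organization",
        Name=name,
        Type="RESOURCE_CONTROL_POLICY",
    )
    policy_id = response["Policy"]["PolicySummary"]["Id"]
    org_client.attach_policy(PolicyId=policy_id, TargetId=target_id)
    return policy_id
```

With boto3 installed and credentials configured, a call might look like create_and_attach_rcp(boto3.client("organizations"), "<my-org-id>", "<target-id>"), where the target is the ID of an account, an OU, or the organization root. Passing the client in as a parameter keeps the policy-building logic testable without AWS access.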

If you’re transitioning existing controls from resource-based policies to RCPs, use the opportunity to reassess the control design based on your current control objectives and the additional benefits offered by RCPs. For example, your previous controls might have been limited to specific resource types, such as IAM roles in this use case, or to particular accounts, such as those storing the most sensitive data. RCPs enable you to extend controls to additional resources across your entire organization, reducing operational overhead through centralized management of permissions guardrails.

If you need to apply a control on resources not yet covered by RCPs, you can implement or retain your custom automation for enforcing controls with resource-based policies. See the List of AWS services that support RCPs and Resources and entities not restricted by RCPs and plan for additional controls if applicable.

While designing your RCPs, consider the following guidelines.

Design for operational excellence

A key foundation for effectively implementing and operating permissions guardrails like RCPs is organizing your AWS environment using multiple accounts. Account boundaries and strategic placement of workloads across them allow you to apply tailored access controls that align with data sensitivity and specific access requirements. Grouping accounts into OUs within AWS Organizations enables more effective access control, even in scenarios where cross-account access is required. Figure 4 illustrates an example organization structure, demonstrating how RCPs can be applied at various levels of the organizational hierarchy to adhere to the security requirements of different workloads.

Figure 4: A sample organization with RCPs applied at various levels


When operating at scale, consider delegating policy management to a central security account in your organization. With AWS Organizations resource-based delegation, central teams don’t need access to the management account for any SCP- or RCP-related changes or troubleshooting.

Review Achieving operational excellence with design considerations for AWS Organizations SCPs, which focuses on SCPs but also covers foundational principles for designing and implementing permissions guardrails at scale. These considerations also apply to RCPs for enabling operational excellence. Additionally, see AWS Organizations quotas and RCP evaluation for the RCP-related quotas and unique implementation details.

Define your governance

Establishing clear governance helps you define how to implement and continuously manage RCPs within your organization. This includes the operating model, change management processes, and exceptions handling procedures. RCPs provide authorization controls similar to SCPs and therefore should integrate with your existing governance framework rather than requiring separate oversight. For example, if your change management process requires two-person approval for SCP changes, you should consider applying the same approval process for RCP implementation. You should also adopt the same mechanisms you currently use to prevent unauthorized changes or detect drifts in your policies.

Plan for exceptions

There might be scenarios where you have a few resources that should be accessible publicly or by identities that don’t belong to your organization. If you’re organizing your resources across multiple accounts and OUs based on their compliance requirements or a common set of controls, then you most likely have such resources in a dedicated set of accounts or OUs, such as the Public Data OU in Figure 4. These accounts or OUs can have applicable policies that account for their unique access requirements.

Another option to accommodate these scenarios is to use the aws:ResourceAccount or aws:ResourceOrgPaths condition key to exclude certain accounts from the control. For example, the following policy prevents identities outside your organization from assuming IAM roles unless the identity is an AWS service principal or the role being assumed belongs to Account A.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictAccessToMyOrgExceptMyAccounts",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "sts:AssumeRole",
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:PrincipalOrgID": "<my-org-id>",
          "aws:ResourceAccount": "<my-account-a-id>"
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}
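To see how the three conditions in this statement combine, here is a minimal Python sketch of the matching logic. It is a simplified model and not the IAM evaluation engine: it assumes that every condition in the statement must match for the Deny to apply, and that the ...IfExists operators treat a missing key as a match. The organization ID, account ID, and function name are hypothetical placeholders.

```python
ORG_ID = "o-exampleorgid"   # hypothetical stand-in for <my-org-id>
ACCOUNT_A = "111122223333"  # hypothetical stand-in for <my-account-a-id>


def rcp_denies(ctx: dict) -> bool:
    """Return True when the sample Deny statement matches the request context.

    Simplified model: all conditions must match, and a missing key counts as a
    match for the ...IfExists operators (dict.get returns None, which never
    equals the expected value).
    """
    outside_org = ctx.get("aws:PrincipalOrgID") != ORG_ID
    not_account_a = ctx.get("aws:ResourceAccount") != ACCOUNT_A
    not_aws_service = ctx.get("aws:PrincipalIsAWSService", "false") == "false"
    return outside_org and not_account_a and not_aws_service


# An external identity assuming a role in an account other than Account A.
external_other_account = {
    "aws:PrincipalOrgID": "o-someoneelse",
    "aws:ResourceAccount": "444455556666",
    "aws:PrincipalIsAWSService": "false",
}
# The same external identity assuming a role in Account A (the exception).
external_account_a = dict(external_other_account, **{"aws:ResourceAccount": ACCOUNT_A})
```

A result of False only means that this RCP statement does not deny the request. The role's trust policy still has to allow the caller.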

There also might be situations where your company’s trusted partners or acquisitions need to be granted an exception for access to a subset of your company’s resources distributed across multiple accounts. For example, your company might integrate with Cloud Security Posture Management (CSPM) tools that assume roles in your accounts to assess your accounts’ security posture, as shown in Figure 5.

Figure 5: Representative view of granting exceptions to trusted partners


When implementing a control with an RCP that by default will apply to all resources of the entity it’s attached to, you can manage resource specific exceptions using the aws:ResourceTag condition key. In addition, use the aws:PrincipalAccount context key to conditionally grant exceptions based on the AWS account ID of the trusted partner.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RestrictAccessToMyOrgExceptTaggedRoles",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "sts:AssumeRole",
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {
                    "aws:PrincipalOrgID": "<my-org-id>",
                    "aws:ResourceTag/partner-access-exception": "trusted-partner"
                },
                "BoolIfExists": {
                    "aws:PrincipalIsAWSService": "false"
                }
            }
        },
        {
            "Sid": "RestrictAccessForTaggedRoles",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "sts:AssumeRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/partner-access-exception": "trusted-partner"
                },
                "StringNotEqualsIfExists": {
                    "aws:PrincipalAccount": "<trusted-partner-account-id>"
                }
            }
        }
    ]
}

Let’s examine the two statements in the preceding RCP:

  • RestrictAccessToMyOrgExceptTaggedRoles

    This statement helps ensure that your roles can only be assumed by identities that belong to your organization or by AWS service principals, unless a role is tagged with partner-access-exception set to trusted-partner.

  • RestrictAccessForTaggedRoles

    This statement further restricts access by helping ensure that the roles that have the partner-access-exception tag can only be assumed by identities that belong to your trusted partner account.
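The interaction between the two statements can be checked with a small Python sketch. As before, this is a simplified model under stated assumptions (all conditions in a statement must match, a missing key counts as a match for ...IfExists operators, and StringEquals needs the key to be present). The organization ID, partner account ID, and function names are hypothetical.

```python
ORG_ID = "o-exampleorgid"         # hypothetical stand-in for <my-org-id>
PARTNER_ACCOUNT = "999988887777"  # hypothetical stand-in for <trusted-partner-account-id>
TAG_KEY = "aws:ResourceTag/partner-access-exception"


def statement_one_denies(ctx: dict) -> bool:
    """RestrictAccessToMyOrgExceptTaggedRoles (simplified)."""
    return (
        ctx.get("aws:PrincipalOrgID") != ORG_ID
        and ctx.get(TAG_KEY) != "trusted-partner"
        and ctx.get("aws:PrincipalIsAWSService", "false") == "false"
    )


def statement_two_denies(ctx: dict) -> bool:
    """RestrictAccessForTaggedRoles (simplified)."""
    return (
        ctx.get(TAG_KEY) == "trusted-partner"
        and ctx.get("aws:PrincipalAccount") != PARTNER_ACCOUNT
    )


def rcp_denies(ctx: dict) -> bool:
    """A request is denied if either Deny statement matches."""
    return statement_one_denies(ctx) or statement_two_denies(ctx)


partner = {
    "aws:PrincipalOrgID": "o-partnerorg",
    "aws:PrincipalAccount": PARTNER_ACCOUNT,
    "aws:PrincipalIsAWSService": "false",
}
partner_on_tagged_role = dict(partner, **{TAG_KEY: "trusted-partner"})
partner_on_untagged_role = dict(partner)
stranger_on_tagged_role = dict(
    partner_on_tagged_role, **{"aws:PrincipalAccount": "123412341234"}
)
```

In this model, the partner is not denied on a tagged role, is denied on an untagged role, and any other account is denied on a tagged role, which matches the intent described for the two statements.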

If you have a well-known, tightly scoped set of resources that need to be excluded, you can also use the IAM policy element, NotResource, to list the Amazon Resource Names (ARNs) of resources to exclude from the control.

When implementing tag-based exception processes, establishing strict controls over tag management is key. Unauthorized modifications of tags on resources, principals, or sessions could impact your security posture by enabling unintended access. You should implement controls to help prevent unauthorized tag manipulation. For example, the following SCP restricts the use of the partner-access-exception tag to the admin role so that unauthorized users cannot alter the control by attaching, detaching, or modifying the tag.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictAccessToExceptionTag",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:PrincipalArn": "<admin-role-arn>"
        },
        "ForAnyValue:StringEquals": {
          "aws:TagKeys": [
            "partner-access-exception"
          ]
        }
      }
    }
  ]
}

You should also make sure that the partner-access-exception tag cannot be passed as a session tag when identities assume roles. See the sample RCP in the data perimeter policy examples repository.

Phase 3: Anticipate potential impacts

Before rolling out RCPs, you need to understand their potential impact on your organization. Introducing new policies or modifying existing ones without proper validation can disrupt your security-productivity balance. Be aware that overly restrictive policies might inadvertently impede legitimate data flows that are essential for achieving your business objectives.

Consider using AWS Identity and Access Management Access Analyzer to monitor effective permissions across resources in your organization. For our IAM role example, use an organization external access analyzer to identify IAM roles in your organization that are shared with external entities. This analysis will help you to create appropriate exceptions or lock down any overly permissive access.

Another effective method to assess impact is to review and analyze your account activity using AWS CloudTrail. For example, if you centralize all your CloudTrail logs in an S3 bucket, you can use Amazon Athena to query these logs. Specifically, look for STS API calls made against your IAM roles by identities outside your organization. Then, compare the results with your list of known trusted partners and those you have already accounted for in your RCPs. Based on this analysis, determine if you need to add the partner-access-exception tag to additional IAM roles and further refine the policy before enforcement. This is essential to ensure trusted partner integrations continue to function as expected when you enforce your RCPs. Furthermore, use this analysis to identify any illegitimate access patterns in your environment and plan for necessary remediations, further enhancing your security posture as part of RCP implementation.
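As an illustration of the comparison step, the following Python sketch filters CloudTrail records that have already been exported (for example, from Athena query results) and lists AssumeRole calls made by accounts outside the organization. It is a minimal sketch: the function name and the sample account IDs are hypothetical, only the AssumeRole event name is checked, and it assumes the standard CloudTrail record fields eventSource, eventName, userIdentity, and requestParameters.

```python
def external_assume_role_calls(records, org_account_ids, known_partner_ids):
    """List AssumeRole calls whose caller account is outside the organization."""
    findings = []
    for rec in records:
        if rec.get("eventSource") != "sts.amazonaws.com":
            continue
        if rec.get("eventName") != "AssumeRole":
            continue
        identity = rec.get("userIdentity") or {}
        if identity.get("type") == "AWSService":
            continue  # AWS service principals are exempt in the sample RCPs
        caller = identity.get("accountId")
        if caller is None or caller in org_account_ids:
            continue
        findings.append(
            {
                "caller_account": caller,
                "role_arn": (rec.get("requestParameters") or {}).get("roleArn"),
                "known_partner": caller in known_partner_ids,
            }
        )
    return findings


# Hypothetical sample data for illustration only.
sample_records = [
    {
        "eventSource": "sts.amazonaws.com",
        "eventName": "AssumeRole",
        "userIdentity": {"type": "AssumedRole", "accountId": "999988887777"},
        "requestParameters": {"roleArn": "arn:aws:iam::111122223333:role/cspm-audit"},
    },
    {
        "eventSource": "sts.amazonaws.com",
        "eventName": "AssumeRole",
        "userIdentity": {"type": "IAMUser", "accountId": "111122223333"},
        "requestParameters": {"roleArn": "arn:aws:iam::111122223333:role/app"},
    },
]
results = external_assume_role_calls(
    sample_records,
    org_account_ids={"111122223333"},
    known_partner_ids={"999988887777"},
)
```

Entries with known_partner set to False are candidates for closer review, and entries with known_partner set to True point at roles that might need the partner-access-exception tag before enforcement.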

For detailed guidance on how to perform an impact analysis in your environment, see Analyze your account activity to evaluate impact and refine controls, which describes the tools and options you need to be able to conduct the analysis.

Phase 4: Implement permissions guardrails

As you transition into the implementation phase, consider the following key factors to promote a smooth rollout while enhancing your security posture.

Deployment automation and integration

Use your existing deployment pipelines to implement RCPs, the same as you do for SCPs. This approach will minimize operational overhead while maintaining consistency in the deployment of your controls.

You can use the AWS CloudFormation AWS::Organizations::Policy resource type to deploy RCPs as infrastructure as code (IaC) using your continuous integration and continuous delivery (CI/CD) pipeline. If you’re using AWS Control Tower and the Customizations for AWS Control Tower solution (CfCT) for account management and want to deploy your custom RCPs, use rcp as the deploy_method in the CfCT manifest file. You can also take advantage of the AWS Control Tower provided RCP-based controls to streamline the implementation.
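As a rough sketch of the infrastructure as code approach, the following Python snippet assembles a CloudFormation template that declares an RCP with the AWS::Organizations::Policy resource type. The logical ID, policy name, description, and target ID are hypothetical placeholders, and the policy content is the sample statement shown earlier in this post. Treat it as a starting point to adapt to your pipeline rather than a finished template.

```python
import json

# Sample RCP content from this post; placeholders are left unresolved.
RCP_DOCUMENT = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RestrictAccessToMyOrgExceptMyAccounts",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "sts:AssumeRole",
            "Resource": "*",
            "Condition": {
                "StringNotEqualsIfExists": {
                    "aws:PrincipalOrgID": "<my-org-id>",
                    "aws:ResourceAccount": "<my-account-a-id>",
                },
                "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
            },
        }
    ],
}


def build_rcp_template(policy_name, policy_document, target_ids):
    """Return a CloudFormation template (as a dict) that declares one RCP."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "RestrictAssumeRoleRcp": {  # hypothetical logical ID
                "Type": "AWS::Organizations::Policy",
                "Properties": {
                    "Name": policy_name,
                    "Type": "RESOURCE_CONTROL_POLICY",
                    "Description": "Restrict sts:AssumeRole to the organization",
                    "Content": policy_document,
                    "TargetIds": target_ids,
                },
            }
        },
    }


# "<test-ou-id>" is a placeholder for the OU or account used in the first wave.
template = build_rcp_template("restrict-assume-role", RCP_DOCUMENT, ["<test-ou-id>"])
template_json = json.dumps(template, indent=2)
```

Starting with a test OU in TargetIds and widening the list in later pipeline stages fits the progressive rollout described in the next section.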

Progressive deployment in stages

As with SCPs, AWS strongly advises against attaching RCPs in production environments without thoroughly testing the impact that the policies have on resources in your accounts. Follow standard CI/CD processes and begin your RCP rollout in lower environments by attaching them to individual test accounts or OUs first. After you validate that the controls behave as expected, gradually promote the RCPs to upper environments.

If your goal is to transition an existing control from resource-based policies to RCPs, keep your resource-based policies in place while conducting the progressive rollout. After you have completed rolling out your RCPs and confirmed that they operate as expected, you can consider deactivating the automation you used to apply the control using resource-based policies. This approach lets you deploy RCPs without impacting your existing security posture or disrupting business workflows.

Additionally, consider deploying RCPs to a subset of resources or accounts first to limit the scope of impact and provide an opportunity to test and refine your deployment and operational processes. You can follow your standard prioritization approach to define deployment waves. For example, start with resources or accounts that store sensitive data or pose the highest risk, based on your current operational practices and other controls that might be in place. For additional best practices, see OPS06-BP03 Employ safe deployment strategies in the AWS Well-Architected Framework: Operational Excellence Pillar whitepaper.

Phase 5: Monitor permissions guardrails

Finally, establish monitoring processes to help ensure that controls for preventing external access to your resources operate as expected. You can use the same tools you used for impact analysis. For example, you can use IAM Access Analyzer external access findings to understand the impact of your RCPs on resource permissions. This information will help you verify that your RCPs are crafted in accordance with your intent and plan remediation actions, if required. You can also set alerts for occurrences of unintended access patterns observed in your CloudTrail logs.

Furthermore, follow the phased approach outlined in this post to regularly review and update your controls to help ensure that they align with evolving business and security objectives. Consider factors such as organizational changes, changes in partner relationships, data criticality shifts, and opportunities for expanding your RCP coverage. This continuous improvement process helps maintain the effectiveness of your security controls while supporting business growth and transformation.

Conclusion

In this post, we discussed how to effectively implement coarse-grained access controls on AWS resources at scale using RCPs. You can use the phased implementation approach described here to achieve your security control objectives while minimizing the risk of disrupting your business workflows. You can apply the same approach to implement other preventative controls, such as SCPs, across your multi-account environment.

Remember that RCPs, like SCPs, provide a powerful mechanism for enforcing coarse-grained controls across multiple accounts in your organization. They don’t replace your least-privilege controls and should be part of a broader, multi-layered approach to data security that includes other well-architected security design principles.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Tatyana Yatskevich

Tatyana is a Principal Solutions Architect in AWS Identity. She works with customers to help them build and operate in AWS in a secure and efficient manner.
Harsha W Sharma

Harsha is a Principal Solutions Architect with AWS in New York. He works with Global Financial Services customers to help them design and develop scalable, secure and resilient architectures on AWS.

Master architecture decision records (ADRs): Best practices for effective decision-making

Post Syndicated from Christoph Kappey original https://aws.amazon.com/blogs/architecture/master-architecture-decision-records-adrs-best-practices-for-effective-decision-making/

Architecture decision records (ADRs) help you document and communicate important process and architecture decisions in your engineering projects. Based on our experience implementing over 200 ADRs across multiple projects, we’ve developed best practices that can help you streamline your decision-making processes and improve team collaboration.

In this post, you’ll learn:

  • How to implement ADRs in your organization
  • Best practices based on more than 200 ADRs across multiple projects
  • Practical tips for streamlining architectural decision-making
  • Real-world examples from projects with 10 to more than 100 team members

Common challenges in architecture decision-making

Before implementing ADRs, your teams might face these common challenges:

  1. Team alignment – Development teams spend a huge part of their time (20–30%, based on our project experience of the past 3 years) coordinating with other teams, which can slow down feature deployment and increase costs through repeated architecture refactoring
  2. Design flexibility – Finding the right balance between upfront design and evolving architecture when working with agile and DevOps approaches
  3. Nonfunctional requirements – Making trade-offs between security, maintainability, and scalability requirements
  4. Changing requirements – Adapting architectural decisions to evolving business goals while maintaining system integrity
  5. Knowledge transfer – Onboarding new team members efficiently and making sure they follow the team’s current way of working

How to streamline the decision-making process

We base the recommendations in this post on our experience with several projects, working with teams with fewer than 10 team members as well as complex projects with 100 team members across 10 work streams. We embarked on ambitious projects with a green-field start as well as projects covering ongoing development of new features in production. Especially in teams with 100 people contributing to the code base, we faced the challenge of making sure that collaboration was seamless and decision-making consistent.

To address this challenge, we implemented an ADR mechanism, which served as our guiding light throughout the project’s lifecycle. After more than 3 years of following this approach, we’ve amassed a wealth of experience and best practices that we’re excited to share with the software development community. By capturing the context, alternatives considered, and the rationale behind each decision, ADRs foster transparency, knowledge-sharing, and accountability within teams. Our goal is to guide you through the process of writing effective ADRs with the following best practice recommendations:

  1. Keep ADR meetings short and focused – Effective ADR meetings should be concise and time-bound. Aim to keep them 30–45 minutes maximum. This focused approach keeps discussions on track and participants engaged throughout the process.
  2. Embrace the readout meeting style – Adopt the readout meeting style, where participants spend 10–15 minutes reading the ADR document. Encourage attendees to provide written comments on sections, paragraphs, or sentences that require clarification or where they have differing opinions. This approach promotes active engagement and fosters a bias for action and frugality.
  3. Maintain a cross-functional yet lean participant list – Invite representatives from each team that might be affected by the architectural decision but strive to keep the total number of participants below 10. This cross-functional representation provides diverse perspectives while maintaining a lean and efficient decision-making process, aligning with the principles of frugality and bias for action.
  4. Focus on a single decision – Keep ADRs concise by focusing on a single decision. Don’t hesitate to split up decisions if necessary. Concentrating on one decision at a time simplifies the decision-making process so that participants can thoroughly evaluate the impact during readout sessions. This approach aligns with the principles of ownership and customer obsession.
  5. Separate design from decision – Use a separate design document mechanism to explore alternative options thoroughly. Reference these design documents within the ADR, adhering to the principles of invention and simplification.
  6. Address comments and resolve feedback – Actively follow up on comments received during the ADR review process. Resolve all comments, either by incorporating changes or by discussing and reaching a consensus with the comment author. This practice demonstrates a commitment to delivering results and fostering a sense of ownership.
  7. Push for a timely decision – Avoid prolonged discussions and multiple readout meetings. Based on our experience, one to three ADR readouts should be sufficient. If more sessions are required, reevaluate the dependencies and consider reducing the number of invitees or reducing the scope of the ADR. Most of the decisions are two-way door decisions, meaning that they can be changed with little impact in the future. It’s always better to make a decision and try it fast instead of endlessly discussing it. This approach aligns with the AWS principles of working backwards, customer obsession, delivering results, and being right a lot.
  8. Embrace team collaboration – Approving an ADR is a team effort. The author must own the document and gather feedback from all affected teams before finalizing the decision. This practice encourages having backbone, disagreeing and committing, and fostering a collaborative environment.
  9. Maintain and follow the process – Keep ADRs up to date and follow the established process. If an ADR supersedes a previous one, document the change and link the new ADR in the superseded document. Insist on the highest standards by adhering to the defined processes—consider ADRs as a team law.
  10. Centralize ADR storage – Store ADRs in a central location accessible to all project members, regardless of their team affiliation. This practice promotes transparency and makes sure that architectural decisions are readily available to everyone involved.

Implementation tips and success measures

When implementing these practices, we recommend that you start small with a pilot team, create clear templates, and establish review cycles. Defining success measures such as time to decision, team satisfaction, architecture rework reduction, or cross-team collaboration improvement helps you evaluate your decision-making process.
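Two of these recommendations are easy to support with lightweight tooling: linking a superseded ADR to its replacement, and measuring time to decision. The following Python sketch shows one possible shape for such a record. The field names and status values (proposed, accepted, superseded) are assumptions for illustration and are not a format prescribed by this post.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Adr:
    number: int
    title: str
    proposed_on: date
    status: str = "proposed"  # assumed values: proposed, accepted, superseded
    decided_on: Optional[date] = None
    superseded_by: Optional[int] = None

    def accept(self, on: date) -> None:
        """Record the decision date when the ADR is approved."""
        self.status = "accepted"
        self.decided_on = on

    def days_to_decision(self) -> Optional[int]:
        """Time to decision in days, or None while the ADR is still open."""
        if self.decided_on is None:
            return None
        return (self.decided_on - self.proposed_on).days


def supersede(old: Adr, new: Adr) -> None:
    """Mark the old ADR as superseded and link it to the new one."""
    old.status = "superseded"
    old.superseded_by = new.number


# Hypothetical ADRs for illustration only.
adr_7 = Adr(7, "Use a single queue per work stream", date(2024, 3, 1))
adr_7.accept(date(2024, 3, 15))
adr_12 = Adr(12, "Use a shared event bus", date(2024, 9, 2))
supersede(adr_7, adr_12)
```

Averaging days_to_decision across accepted ADRs gives a simple baseline for the time to decision measure.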

Conclusion

By implementing these best practices for ADRs, you’ll streamline your decision-making processes, foster collaboration, and make sure that architectural decisions are well-documented, communicated, and aligned with your organization’s principles and goals. Embrace these practices and witness the positive impact they have on the success of your software projects.

Read the AWS Prescriptive Guidance for an introduction to ADRs and an example ADR, or visit the homepage of the ADR GitHub organization.



AWS KMS CloudWatch metrics help you better track and understand how your KMS keys are being used

Post Syndicated from Norman Li original https://aws.amazon.com/blogs/security/aws-kms-cloudwatch-metrics-help-you-better-track-and-understand-how-your-kms-keys-are-being-used/

AWS Key Management Service (AWS KMS) is pleased to launch key-level filtering for AWS KMS API usage in Amazon CloudWatch metrics, providing enhanced visibility to help customers improve their operational efficiency and aid in security and compliance risk management.

AWS KMS currently publishes account-level AWS KMS API usage metrics to Amazon CloudWatch, enabling you to monitor and manage your API usage. However, if you’re using numerous KMS keys, pinpointing the ones with the highest request rate quota usage or significant API costs becomes challenging. For example, if you have more than 10 active KMS keys in your account, prior to this launch you would have needed to build a custom CloudTrail and Amazon Athena based solution to locate which specific keys are driving the majority of API usage and costs. With the new CloudWatch metrics, which are available under the AWS/KMS namespace in CloudWatch, you can track, understand, and set alerts on detailed API usage at the individual KMS key level without building a costly customized solution.

This blog post explores several use cases to help you better take advantage of these newly introduced CloudWatch metrics to manage your AWS KMS API usage and costs. The use cases cover viewing and understanding your API usage at the key level, and creating CloudWatch alerts to detect unintentional runaway usage.

Overview of new CloudWatch metrics for KMS keys

With CloudWatch metrics for KMS keys, you can now do the following:

  1. View the API usage for a specific KMS key, filtered by individual API operations (for example, Encrypt, Decrypt, or GenerateDataKey).
  2. See the aggregated usage across cryptographic operations for a given KMS key.
  3. Set up an alarm if a specific KMS key exceeds a specified threshold on a single API operation, or a set of API operations.

This streamlined approach allows you to quickly monitor, understand, and troubleshoot the API usage patterns of your KMS keys, without the overhead of the previous multi-step process. Let’s detail how these key-level API usage metrics can be used in two real-world examples.

Example 1: How to locate the KMS keys that consume the most API usage quota or contribute the most API charges

When you surpass your AWS KMS API request rate quotas, you can view your AWS KMS API utilization within the Service Quotas console. However, you might still find it cumbersome to identify the KMS keys that consume the largest share of your request quota. Similarly, when you receive AWS KMS API charges that exceed your expectations, you can check the detailed billing usage in each AWS Region in Cost Explorer, but you cannot easily locate the KMS keys with the most API charges. This process becomes even more challenging when you manage a large number of KMS keys.

With the key-level API usage CloudWatch metrics, you can use the advanced metric query option to query CloudWatch Metrics Insights data with a user-friendly dialect of SQL to locate the KMS keys that consume the largest portion of the API usage quota or contribute the most API charges.

Walkthrough

To use Amazon CloudWatch Metrics Insights to identify the top 20 KMS keys that have the most cryptographic API usage up to the last 3 hours, complete the following steps:

  1. Open the CloudWatch console.
  2. In the navigation pane, choose Metrics, and then choose All metrics.
  3. Choose the Multi source query tab.
  4. For the data source, choose CloudWatch Metrics Insights.
  5. You can enter the following example query in Editor view:

    Note: In Builder view, the metric namespace, metric name, filter by, group by, order by, and limit options are shown. In Editor view, the same options as in Builder view are shown in query format.

        SELECT SUM(SuccessfulRequest)
        FROM SCHEMA("AWS/KMS", KeyArn, Operation)
        GROUP BY KeyArn
        ORDER BY MAX() DESC
        LIMIT 20

  6. Choose Run in the Editor view or Graph query in the Builder view.
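The same query can also be run programmatically. The sketch below builds the request parameters for the CloudWatch GetMetricData API, which accepts a Metrics Insights query in the Expression field of a metric data query. The function name, query ID, and 5-minute period are assumptions for illustration. The boto3 call is shown only as a comment so that the snippet runs without AWS credentials.

```python
from datetime import datetime, timedelta, timezone

TOP_KEYS_QUERY = (
    "SELECT SUM(SuccessfulRequest) "
    'FROM SCHEMA("AWS/KMS", KeyArn, Operation) '
    "GROUP BY KeyArn "
    "ORDER BY MAX() DESC "
    "LIMIT 20"
)


def build_get_metric_data_request(now=None, period_seconds=300):
    """Build GetMetricData parameters covering the last 3 hours."""
    now = now or datetime.now(timezone.utc)
    return {
        "MetricDataQueries": [
            {
                "Id": "topKmsKeys",  # hypothetical query ID
                "Expression": TOP_KEYS_QUERY,
                "Period": period_seconds,
            }
        ],
        "StartTime": now - timedelta(hours=3),
        "EndTime": now,
    }


request = build_get_metric_data_request()

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("cloudwatch").get_metric_data(**request)
```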

Example 2: How to set a new detailed alarm on unintentional runaway AWS KMS API usage

Running big data processing workflows that read Amazon Simple Storage Service (Amazon S3) files encrypted by KMS keys is a common scenario for analytics, business reporting, or machine learning projects. Typically, these workflows read a limited number of files from S3 on each invocation. However, misconfigured workflows could unintentionally read a large number of S3 files, which could result in exceeding your AWS KMS API request rate quotas or incurring undesirable charges due to spiky AWS KMS API usage. Historically, to address this issue, you would have had to build a customized alarm system by following these steps: 1) send AWS CloudTrail events generated by AWS KMS to Amazon CloudWatch Logs; 2) write queries in Amazon CloudWatch Logs Insights to track your API request usage; and 3) enable anomaly detection on the corresponding CloudWatch Logs Insights math expression.

Now, with key-level API usage CloudWatch metrics, you can directly enable anomaly detection on these metrics to set up alarms for anomalous AWS KMS API usage patterns. This provides a more streamlined and efficient way to monitor and detect potential runaway workflows. By using these CloudWatch metrics and anomaly detection capabilities, you can proactively identify and address unintended increases in AWS KMS API usage, helping to avoid unexpected charges or service disruptions in your analytics, reporting, or machine learning pipelines.

Walkthrough

Consider a scenario where you have an analytics workflow that runs frequently, which uses the Decrypt AWS KMS API operation on a KMS key to decrypt and read data from S3. You would like to enable anomaly detection on the KMS key to trigger an alarm when the Decrypt call volume to the specific KMS key deviates from its expected trend or pattern. To do so, complete the following steps:

  1. Open the CloudWatch console.
  2. In the navigation pane, choose Metrics, and then choose All metrics.
  3. Choose KMS, and then choose KeyArn, Operation.
  4. In the search bar, enter the Amazon Resource Name (ARN) of the key, and then choose Search. Select the CloudWatch metric you would like to enable anomaly detection for.
  5. Navigate to Graphed metrics, and using the Statistic and Period drop-down lists, choose the statistic and period that you would like to monitor. Then you can enable anomaly detection by selecting the Pulse icon.

    Figure 1: How to enable anomaly detection on a SuccessfulRequest metric


  6. If needed, you can tune the anomaly detection by changing the sensitivity, which adjusts the width of the expected band.

    Figure 2: Anomaly detection is enabled on the SuccessfulRequest metric. The gray band illustrates the expected range of values and the anomaly is in red
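If you prefer to define the alarm in code, the following Python sketch builds parameters for the CloudWatch PutMetricAlarm API with an ANOMALY_DETECTION_BAND expression on the key-level metric. The alarm name, evaluation periods, band width, and key ARN are hypothetical values, and the dimension names follow the KeyArn and Operation schema used earlier in this post. The boto3 call is left as a comment so that the snippet runs without AWS access.

```python
def build_anomaly_alarm_request(key_arn, operation="Decrypt", band_width=2, period_seconds=300):
    """Build PutMetricAlarm parameters for an anomaly detection alarm."""
    return {
        "AlarmName": f"kms-{operation.lower()}-anomaly",  # hypothetical name
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,  # assumed value; tune for your workload
        "ThresholdMetricId": "band",
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "requests",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/KMS",
                        "MetricName": "SuccessfulRequest",
                        "Dimensions": [
                            {"Name": "KeyArn", "Value": key_arn},
                            {"Name": "Operation", "Value": operation},
                        ],
                    },
                    "Period": period_seconds,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(requests, {band_width})",
                "Label": "Expected request volume",
                "ReturnData": True,
            },
        ],
    }


# Placeholder ARN for illustration only.
alarm_request = build_anomaly_alarm_request(
    "arn:aws:kms:us-east-1:111122223333:key/<key-id>"
)

# With boto3 installed and credentials configured, the alarm could be created as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_request)
```

The GreaterThanUpperThreshold operator only alarms on usage above the band, which fits the runaway usage scenario. Alarming on drops as well would need a different comparison operator.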


Conclusion

This blog post highlighted the newly introduced key-level filtering capability for the AWS KMS API usage in CloudWatch. We showed two real-world use cases to demonstrate how you can use the new CloudWatch metrics. These use cases include improving operational visibility, setting up proactive alarms on anomalies in KMS API usage patterns, and potentially tracking detailed key usage for compliance purposes.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread in the AWS Key Management Service re:Post.
 

Norman Li

Norman is a Software Development Manager for AWS KMS. In this role, Norman leads the development of visibility features, as well as internal scalability initiatives. Outside of work, Norman likes to spend time in the beautiful Pacific Northwest mountains.
Haiyu Zhen

Haiyu is a Senior Software Development Engineer for AWS KMS. She specializes in building secure, large-scale distributed systems and is passionate about enhancing cloud-native application security without compromising performance.

Develop and test AWS Glue 5.0 jobs locally using a Docker container

Post Syndicated from Subramanya Vajiraya original https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-5-0-jobs-locally-using-a-docker-container/

AWS Glue is a serverless data integration service that allows you to process and integrate data coming through different data sources at scale. AWS Glue 5.0, the latest version of AWS Glue for Apache Spark jobs, provides a performance-optimized Apache Spark 3.5 runtime experience for batch and stream processing. With AWS Glue 5.0, you get improved performance, enhanced security, support for the next generation of Amazon SageMaker, and more. AWS Glue 5.0 enables you to develop, run, and scale your data integration workloads and get insights faster.

AWS Glue accommodates various development preferences through multiple job creation approaches. For developers who prefer direct coding, Python or Scala development is available using the AWS Glue ETL library.

Building production-ready data platforms requires robust development processes and continuous integration and delivery (CI/CD) pipelines. To support diverse development needs—whether on local machines, Docker containers on Amazon Elastic Compute Cloud (Amazon EC2), or other environments—AWS provides an official AWS Glue Docker image through the Amazon ECR Public Gallery. The image enables developers to work efficiently in their preferred environment while using the AWS Glue ETL library.

In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post is an updated version of the post Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container, and uses AWS Glue 5.0.

Available Docker images

The following Docker images are available in the Amazon ECR Public Gallery:

  • AWS Glue version 5.0: public.ecr.aws/glue/aws-glue-libs:5

AWS Glue Docker images are compatible with both x86_64 and arm64.

In this post, we use public.ecr.aws/glue/aws-glue-libs:5 and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue 5.0 Spark jobs. The image contains the following:

To set up your container, you pull the image from the ECR Public Gallery and then run the container. We demonstrate how to run your container with the following methods, depending on your requirements:

  • spark-submit
  • REPL shell (pyspark)
  • pytest
  • Visual Studio Code

Prerequisites

Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac, Windows, or Linux. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.

Configure AWS credentials

To enable AWS API calls from the container, set up your AWS credentials with the following steps:

  1. Create an AWS named profile.
  2. Open cmd on Windows or a terminal on Mac/Linux, and run the following command:
PROFILE_NAME="profile_name"

In the following sections, we use this AWS named profile.
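
Before starting the container, it can help to confirm that the named profile actually exists in the credentials file that will be mounted. The following is a minimal sketch using only the Python standard library; the helper name list_profiles and the sample file contents are illustrative assumptions, not part of the AWS tooling.

```python
import configparser


def list_profiles(credentials_text: str) -> list[str]:
    """Return the profile names found in the text of an AWS credentials file."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    return parser.sections()


# Hypothetical file contents; a real file usually lives at ~/.aws/credentials.
sample = """
[default]
aws_access_key_id = EXAMPLEKEY
aws_secret_access_key = examplesecret

[profile_name]
aws_access_key_id = EXAMPLEKEY2
aws_secret_access_key = examplesecret2
"""

profiles = list_profiles(sample)
print("profile_name" in profiles)
```

To check a real file, read its text (for example with pathlib.Path.read_text) and pass it to the same helper.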

Pull the image from the ECR Public Gallery

If you’re running Docker on Windows, choose the Docker icon (right-click) and choose Switch to Linux containers before pulling the image.

Run the following command to pull the image from the ECR Public Gallery:

docker pull public.ecr.aws/glue/aws-glue-libs:5

Run the container

Now you can run a container using this image. You can choose any of the following methods based on your requirements.

spark-submit

You can run an AWS Glue job script by running the spark-submit command on the container.

Write your job script (sample.py in the following example) and save it under the /local_path_to_workspace/src/ directory using the following commands:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/src
$ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}

These variables are used in the following docker run command. The sample code (sample.py) used in the spark-submit command is included in the appendix at the end of this post.

Run the following command to run the spark-submit command on the container to submit a new Spark application:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_spark_submit \
    public.ecr.aws/glue/aws-glue-libs:5 \
    spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME
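
If you prefer to drive the container from a script (for example, in a CI job), the same invocation can be assembled in Python. This is a sketch under stated assumptions: the function name build_spark_submit_command is hypothetical, and the arguments simply mirror the docker run command above.

```python
import os
import shlex


def build_spark_submit_command(workspace: str, script_name: str, profile: str) -> list[str]:
    """Assemble the docker run arguments shown above as a list suitable for subprocess.run."""
    aws_dir = os.path.join(os.path.expanduser("~"), ".aws")
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{aws_dir}:/home/hadoop/.aws",
        "-v", f"{workspace}:/home/hadoop/workspace/",
        "-e", f"AWS_PROFILE={profile}",
        "--name", "glue5_spark_submit",
        "public.ecr.aws/glue/aws-glue-libs:5",
        "spark-submit", f"/home/hadoop/workspace/src/{script_name}",
    ]


cmd = build_spark_submit_command("/local_path_to_workspace", "sample.py", "profile_name")
# Print the command for inspection; nothing is executed here.
print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True). When no terminal is attached (as in many CI systems), the -t flag typically has to be dropped.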

REPL shell (pyspark)

You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the pyspark command on the container to start the REPL shell:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark

You will see the following output:

Python 3.11.6 (main, Jan  9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.2-amzn-1
      /_/

Using Python version 3.11.6 (main, Jan  9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740643079929).
SparkSession available as 'spark'.
>>> 

With this REPL shell, you can code and test interactively.

pytest

For unit testing, you can use pytest for AWS Glue Spark job scripts.

Run the following commands for preparation:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}

Now let’s invoke pytest using docker run:

$ docker run -i --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pytest \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c "python3 -m pytest --disable-warnings"

When pytest finishes executing unit tests, your output will look something like the following:

============================= test session starts ==============================
platform linux -- Python 3.11.6, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/hadoop/workspace
plugins: integration-mark-0.2.0
collected 1 item

tests/test_sample.py .                                                   [100%]

======================== 1 passed, 1 warning in 34.28s =========================

Visual Studio Code

To set up the container with Visual Studio Code, complete the following steps:

  1. Install Visual Studio Code.
  2. Install Python.
  3. Install Dev Containers.
  4. Open the workspace folder in Visual Studio Code.
  5. Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac).
  6. Enter Preferences: Open Workspace Settings (JSON).
  7. Press Enter.
  8. Enter the following JSON and save it:
{
    "python.defaultInterpreterPath": "/usr/bin/python3.11",
    "python.analysis.extraPaths": [
        "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip:/usr/lib/spark/python/:/usr/lib/spark/python/lib/"
    ]
}

Now you’re ready to set up the container.

  1. Run the Docker container:
$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
  2. Start Visual Studio Code.
  3. Choose Remote Explorer in the navigation pane.
  4. Choose the container public.ecr.aws/glue/aws-glue-libs:5 (right-click) and choose Attach in Current Window.
  5. If the following dialog appears, choose Got it.
  6. Open /home/hadoop/workspace/.
  7. Create an AWS Glue PySpark script and choose Run.

You should see the successful run of the AWS Glue PySpark script.

Changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker image

The following are the major changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images:

  • In AWS Glue 5.0, there is a single container image for both batch and streaming jobs. This differs from AWS Glue 4.0, where there was one image for batch and another for streaming.
  • In AWS Glue 5.0, the default user name of the container is hadoop. In AWS Glue 4.0, the default user name was glue_user.
  • In AWS Glue 5.0, several additional libraries, including JupyterLab and Livy, have been removed from the image. You can manually install them.
  • In AWS Glue 5.0, the Iceberg, Hudi, and Delta libraries are all pre-loaded by default, and the environment variable DATALAKE_FORMATS is no longer needed. In AWS Glue 4.0 and earlier, the environment variable DATALAKE_FORMATS was used to specify which table format to load.

The preceding list is specific to the Docker image. To learn more about AWS Glue 5.0 updates, see Introducing AWS Glue 5.0 for Apache Spark and Migrating AWS Glue for Spark jobs to AWS Glue version 5.0.

Considerations

Keep in mind that the following features are not supported when using the AWS Glue container image to develop job scripts locally:

Conclusion

In this post, we explored how the AWS Glue 5.0 Docker images provide a flexible foundation for developing and testing AWS Glue job scripts in your preferred environment. These images, readily available in the Amazon ECR Public Gallery, streamline the development process by offering a consistent, portable environment for AWS Glue development.

To learn more about how to build an end-to-end development pipeline, see End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue. We encourage you to explore these capabilities and share your experiences with the AWS community.


Appendix A: AWS Glue job sample codes for testing

This appendix provides AWS Glue job sample code for testing purposes: a job script and a unit test for it. You can use them in the tutorial.

The following sample.py code uses the AWS Glue ETL library with an Amazon Simple Storage Service (Amazon S3) API call. The code requires Amazon S3 permissions in AWS Identity and Access Management (IAM). You need to grant either the IAM managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows you to make ListBucket and GetObject API calls for the S3 path.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class GluePythonSampleTest:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
        else:
            jobname = "test"
        self.job.init(jobname, args)

    def run(self):
        dyf = read_json(self.context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
        dyf.printSchema()

        self.job.commit()


def read_json(glue_context, path):
    dynamicframe = glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format='json'
    )
    return dynamicframe


if __name__ == '__main__':
    GluePythonSampleTest().run()

The following test_sample.py code is a sample for a unit test of sample.py:

import pytest
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import sys
from src import sample


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
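    job.init(args['JOB_NAME'], args)

    # Hand the context to the tests, then commit the job once the module's tests finish.
    yield context

    job.commit()


# Illustrative test (an assumption, because the listing above ends at job.init):
# it checks only that the public sample dataset loads with at least one record.
def test_read_json_returns_records(glue_context):
    dyf = sample.read_json(glue_context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
    assert dyf.toDF().count() > 0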

Appendix B: Adding JDBC drivers and Java libraries

To add a JDBC driver not currently available in the container, you can create a new directory under your workspace with the JAR files you need and mount the directory to /opt/spark/jars/ in the docker run command. JAR files found under /opt/spark/jars/ within the container are automatically added to Spark Classpath and will be available for use during the job run.

For example, you can use the following docker run command to add JDBC driver jars to a PySpark REPL shell:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $WORKSPACE_LOCATION/jars/:/opt/spark/jars/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jdbc \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark

As highlighted earlier, the customJdbcDriverS3Path connection option can’t be used to import a custom JDBC driver from Amazon S3 in AWS Glue container images.
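
Before mounting the directory, a quick check that it really contains the JAR files you expect can save a debugging round trip. The following standard-library sketch lists the .jar files in a directory; the helper name find_jars and the driver file name are hypothetical, and a temporary directory stands in for $WORKSPACE_LOCATION/jars/.

```python
from pathlib import Path
import tempfile


def find_jars(jar_dir: str) -> list[str]:
    """Return the sorted names of the .jar files directly under jar_dir."""
    return sorted(p.name for p in Path(jar_dir).glob("*.jar") if p.is_file())


# Demonstration with a temporary directory standing in for $WORKSPACE_LOCATION/jars/.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "example-jdbc-driver.jar").touch()  # hypothetical driver file name
    Path(tmp, "notes.txt").touch()                # ignored: not a JAR file
    demo_jars = find_jars(tmp)

print(demo_jars)
```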

Appendix C: Adding Livy and JupyterLab

The AWS Glue 5.0 container image doesn’t have Livy installed by default. You can create a new container image extending the AWS Glue 5.0 container image as the base. The following Dockerfile demonstrates how you can extend the Docker image to include additional components you need to enhance your development and testing experience.

To get started, create a directory on your workstation and place the Dockerfile.livy_jupyter file in the directory:

$ mkdir -p $WORKSPACE_LOCATION/jupyterlab/
$ cd $WORKSPACE_LOCATION/jupyterlab/
$ vim Dockerfile.livy_jupyter

The following code is Dockerfile.livy_jupyter:

FROM public.ecr.aws/glue/aws-glue-libs:5 AS glue-base

ENV LIVY_SERVER_JAVA_OPTS="--add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED"

# Download Livy
ADD --chown=hadoop:hadoop https://dlcdn.apache.org/incubator/livy/0.8.0-incubating/apache-livy-0.8.0-incubating_2.12-bin.zip ./

# Install Livy
RUN unzip apache-livy-0.8.0-incubating_2.12-bin.zip && \
rm apache-livy-0.8.0-incubating_2.12-bin.zip && \
mv apache-livy-0.8.0-incubating_2.12-bin livy && \
mkdir -p livy/logs

# Configure Livy. Each heredoc gets its own RUN instruction so that its
# closing EOF stands alone on a line, which the shell requires to end a heredoc.
RUN cat <<EOF >> livy/conf/livy.conf
livy.server.host = 0.0.0.0
livy.server.port = 8998
livy.spark.master = local
livy.repl.enable-hive-context = true
livy.spark.scala-version = 2.12
EOF

# Configure Livy logging
RUN cat <<EOF >> livy/conf/log4j.properties
log4j.rootCategory=INFO,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.eclipse.jetty=WARN
EOF

# Switching to root user temporarily to install dev dependency packages
USER root 
RUN dnf update -y && dnf install -y krb5-devel gcc python3.11-devel
USER hadoop

# Install SparkMagic and JupyterLab
RUN export PATH=$HOME/.local/bin:$HOME/livy/bin/:$PATH && \
printf "numpy<2\nIPython<=7.14.0\n" > /tmp/constraint.txt && \
pip3.11 --no-cache-dir install --constraint /tmp/constraint.txt --user pytest boto==2.49.0 jupyterlab==3.6.8 IPython==7.14.0 ipykernel==5.5.6 ipywidgets==7.7.2 sparkmagic==0.21.0 jupyterlab_widgets==1.1.11 && \
jupyter-kernelspec install --user $(pip3.11 --no-cache-dir show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/sparkkernel && \
jupyter-kernelspec install --user $(pip3.11 --no-cache-dir show sparkmagic | grep Location | cut -d" " -f2)/sparkmagic/kernels/pysparkkernel && \
jupyter server extension enable --user --py sparkmagic && \
cat <<EOF >> /home/hadoop/.local/bin/entrypoint.sh
#!/usr/bin/env bash
mkdir -p /home/hadoop/workspace/
livy-server start
sleep 5
jupyter lab --no-browser --ip=0.0.0.0 --allow-root --ServerApp.root_dir=/home/hadoop/workspace/ --ServerApp.token='' --ServerApp.password=''
EOF

# Setup Entrypoint script
RUN chmod +x /home/hadoop/.local/bin/entrypoint.sh

# Add default SparkMagic Config
ADD --chown=hadoop:hadoop https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/refs/heads/master/sparkmagic/example_config.json .sparkmagic/config.json

# Update PATH var
ENV PATH=/home/hadoop/.local/bin:/home/hadoop/livy/bin/:$PATH

ENTRYPOINT ["/home/hadoop/.local/bin/entrypoint.sh"]

Run the docker build command to build the image:

docker build \
    -t glue_v5_livy \
    --file $WORKSPACE_LOCATION/jupyterlab/Dockerfile.livy_jupyter \
    $WORKSPACE_LOCATION/jupyterlab/

When the image build is complete, you can use the following docker run command to start the newly built image:

docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -p 8998:8998 \
    -p 8888:8888 \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jupyter  \
    glue_v5_livy

Appendix D: Adding extra Python libraries

In this section, we discuss adding extra Python libraries and installing Python packages using pip.

Local Python libraries

To add local Python libraries, place them under a directory and assign the path to $EXTRA_PYTHON_PACKAGE_LOCATION:

$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $EXTRA_PYTHON_PACKAGE_LOCATION:/home/hadoop/workspace/extra_python_path/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pylib \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c 'export PYTHONPATH=/home/hadoop/workspace/extra_python_path/:$PYTHONPATH; pyspark'

To validate that the path has been added to PYTHONPATH, you can check for its existence in sys.path:

Python 3.11.6 (main, Jan  9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.2-amzn-1
      /_/

Using Python version 3.11.6 (main, Jan  9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740719582296).
SparkSession available as 'spark'.
>>> import sys
>>> "/home/hadoop/workspace/extra_python_path" in sys.path
True
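
Note that the path was exported with a trailing slash but checked without one. If you want a check that doesn't depend on such details, you can normalize both sides before comparing. The following is a small sketch; the helper name on_sys_path is hypothetical, and the sys.path.append call only simulates what the PYTHONPATH export does inside the container.

```python
import os
import sys


def on_sys_path(directory: str) -> bool:
    """Return True if directory is on sys.path, ignoring trailing slashes and redundant separators."""
    target = os.path.normpath(directory)
    return any(os.path.normpath(entry) == target for entry in sys.path if entry)


# Simulate the container setup by adding the path from the example, trailing slash included.
sys.path.append("/home/hadoop/workspace/extra_python_path/")
print(on_sys_path("/home/hadoop/workspace/extra_python_path"))
```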

Installing Python packages using pip

To install packages from PyPI (or any other artifact repository) using pip, you can use the following approach:

docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    -e SCRIPT_FILE_NAME=$SCRIPT_FILE_NAME \
    --name glue5_pylib \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c 'pip3 install snowflake==1.0.5; spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME'

About the Authors

Subramanya Vajiraya is a Sr. Cloud Engineer (ETL) at AWS Sydney specializing in AWS Glue. He is passionate about helping customers solve issues related to their ETL workloads and implement scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and taking long walks with his dog Ollie.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Unlock the power of optimization in Amazon Redshift Serverless

Post Syndicated from Ricardo Serafim original https://aws.amazon.com/blogs/big-data/unlock-the-power-of-optimization-in-amazon-redshift-serverless/

Amazon Redshift Serverless automatically scales compute capacity to match workload demands, measuring this capacity in Redshift Processing Units (RPUs). Although traditional scaling primarily responds to query queue times, the new AI-driven scaling and optimization feature offers a more sophisticated approach by considering multiple factors including query complexity and data volume. Intelligent scaling addresses key data warehouse challenges by preventing both over-provisioning of resources for performance and under-provisioning to save costs, particularly for workloads that fluctuate based on daily patterns or monthly cycles.

Amazon Redshift Serverless now offers enhanced flexibility in configuring workgroups through two primary methods. Users can either set a base capacity, specifying the baseline RPUs for query execution, with options ranging from 8 to 1024 RPUs and each RPU providing 16 GB of memory, or they can opt for the price-performance target. Amazon Redshift Serverless AI-driven scaling and optimization can adapt more precisely to diverse workload requirements and employs intelligent resource management, automatically adjusting resources during query execution for optimal performance. Consider using AI-driven scaling and optimization if your current workload requires 32 to 512 base RPUs. We don’t recommend using this feature for workloads that need fewer than 32 or more than 512 base RPUs.
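
As a quick illustration of the figures above, the following sketch converts a base capacity into memory (16 GB per RPU) and checks it against the 32 to 512 RPU range recommended for AI-driven scaling and optimization. The function names are hypothetical; only the constants come from this post.

```python
MEMORY_GB_PER_RPU = 16          # each RPU provides 16 GB of memory
RECOMMENDED_RANGE = (32, 512)   # base RPUs for which the feature is recommended


def workgroup_memory_gb(base_rpus: int) -> int:
    """Memory implied by a base capacity, in GB."""
    return base_rpus * MEMORY_GB_PER_RPU


def in_recommended_range(base_rpus: int) -> bool:
    """True if the base capacity falls in the recommended 32 to 512 RPU range."""
    low, high = RECOMMENDED_RANGE
    return low <= base_rpus <= high


for rpus in (8, 32, 128, 512, 1024):
    print(rpus, workgroup_memory_gb(rpus), in_recommended_range(rpus))
```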

In this post, we demonstrate how Amazon Redshift Serverless AI-driven scaling and optimization impacts performance and cost across different optimization profiles.

Options in AI-driven scaling and optimization

Amazon Redshift Serverless AI-driven scaling and optimization offers an intuitive slider interface, letting you balance price and performance goals. You can select from five optimization profiles, ranging from Optimized for Cost to Optimized for Performance, as shown in the following diagram. Your slider position determines how Amazon Redshift allocates resources and implements AI-driven scaling and optimizations, to achieve your desired price-performance target.

Sliding bar

The slider offers the following options:

  1. Optimized for Cost (1)
    • Prioritizes cost savings over performance
    • Allocates minimum resources in favor of saving on costs
    • Best for workloads where performance isn’t time-critical
  2. Cost-Balanced (25)
    • Balances towards cost savings while maintaining reasonable performance
    • Allocates moderate resources
    • Suitable for mixed workloads with some flexibility in query time
  3. Balanced (50)
    • Provides equal emphasis on cost efficiency and performance
    • Allocates optimal resources for most use cases
    • Ideal for general-purpose workloads
  4. Performance-Balanced (75)
    • Favors performance while maintaining some cost control
    • Allocates additional resources when needed
    • Suitable for workloads requiring consistently fast query elapsed time
  5. Optimized for Performance (100)
    • Maximizes performance regardless of cost
    • Provides maximum available resources
    • Best for time-critical workloads requiring fastest possible query delivery
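
The five slider positions above can be captured as a simple lookup, which is convenient when workgroups are configured from code. In the sketch below, the profile names and levels come from the list above; the shape of the returned settings fragment (the status and level keys) is an assumption modeled on our understanding of the price-performance target setting, so verify it against the Amazon Redshift Serverless API reference before relying on it.

```python
PROFILE_LEVELS = {
    "Optimized for Cost": 1,
    "Cost-Balanced": 25,
    "Balanced": 50,
    "Performance-Balanced": 75,
    "Optimized for Performance": 100,
}


def price_performance_target(profile: str) -> dict:
    """Build a settings fragment for a profile (key names are assumptions; see the text above)."""
    if profile not in PROFILE_LEVELS:
        raise ValueError(f"Unknown profile: {profile!r}")
    return {"status": "ENABLED", "level": PROFILE_LEVELS[profile]}


print(price_performance_target("Balanced"))
```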

Which workloads to consider for AI-driven scaling and optimizations

The Amazon Redshift Serverless AI-driven scaling and optimization capabilities can be applied to almost every analytical workload. Amazon Redshift will assess and apply optimizations according to your price-performance target—cost, balance, or performance.

Most analytical workloads operate on millions or even billions of rows and generate aggregations and complex calculations. These workloads have high variability for query patterns and number of queries. The Amazon Redshift Serverless AI-driven scaling and optimization will improve the price, performance, or both because it learns the patterns (the repeatability of your workload) and will allocate more resources towards performance improvements if you’re performance-focused or fewer resources if you’re cost-focused.

Cost-effectiveness of AI-driven scaling and optimization

To determine the effectiveness of Amazon Redshift Serverless AI-driven scaling and optimization, you first need to measure your current price-performance. We encourage you to use sys_query_history to calculate the total elapsed time of your workload, noting the start time and end time. Then use sys_serverless_usage to calculate the cost for the same period. You can use the query from the Amazon Redshift documentation and add the same start and end times. This establishes your current price-performance and gives you a baseline to compare against.
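
The arithmetic behind such a baseline is simple enough to sketch. The example below assumes that you have already fetched the charged seconds per interval from sys_serverless_usage for your measurement window, that these values represent charged RPU-seconds, and that you know the price per RPU-hour for your Region. All numbers are placeholders, not measurements, and the function names are hypothetical.

```python
from datetime import datetime


def workload_elapsed_seconds(start: datetime, end: datetime) -> float:
    """Wall-clock duration of the measurement window, in seconds."""
    return (end - start).total_seconds()


def cost_incurred(charged_seconds: list[int], price_per_rpu_hour: float) -> float:
    """Convert charged RPU-seconds into cost: RPU-hours multiplied by the hourly RPU price."""
    return sum(charged_seconds) / 3600 * price_per_rpu_hour


# Placeholder inputs for illustration only.
window_start = datetime(2025, 1, 1, 0, 0)
window_end = datetime(2025, 1, 1, 1, 0)
usage_rows = [3600, 7200]   # hypothetical charged seconds per interval
price = 0.5                 # hypothetical price per RPU-hour

print(workload_elapsed_seconds(window_start, window_end), cost_incurred(usage_rows, price))
```

Recording both numbers for the same window before and after changing the price-performance target gives you a like-for-like comparison.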

If such measurement isn’t practical because your workloads run continuously and you can’t determine a fixed start and end time, you can compare holistically instead: check your month-over-month cost and your users’ sentiment towards performance and system stability, and look for improvements in data delivery or reductions in overall monthly processing times.

Benchmark conducted and results

We evaluated the optimization options using the TPCDS 3TB dataset from the AWS Labs GitHub repository (amazon-redshift-utils). We deployed this dataset across three Amazon Redshift Serverless workgroups configured as Optimized for Cost, Balanced, and Optimized for Performance. To create a realistic reporting environment, we configured three Amazon Elastic Compute Cloud (Amazon EC2) instances with JMeter (one per endpoint) and ran 15 selected TPCDS queries concurrently for approximately 1 hour, as shown in the following screenshot.

We disabled the result cache to make sure Amazon Redshift Serverless ran all queries directly, providing accurate measurements. This setup helped us capture authentic performance characteristics across each optimization profile. Also, we designed our test environment without setting the Amazon Redshift Serverless workgroup max capacity parameter—a key configuration that controls the maximum RPUs available to your data warehouse. By removing this limit, we could clearly showcase how different configurations affect scaling behavior in our test endpoints.

Jmeter

Our comprehensive test plan included running each of the 15 queries 355 times, generating 5,325 queries per test cycle. The AI-driven scaling and optimization needs multiple iterations to identify patterns and optimize RPUs, so we ran this workload 10 times. Through these repetitions, the AI learned and adapted its behavior, processing a total of 53,250 queries throughout our testing period.

The testing revealed how the AI-driven scaling and optimization system adapts and optimizes performance across three distinct configuration profiles: Optimized for Cost, Balanced, and Optimized for Performance.

Queries and elapsed time

Although we ran the same core workload repeatedly, we used variable parameters in JMeter to generate different values for the WHERE clause conditions. This approach created similar but not identical workloads, introducing natural variations that showed how the system handles real-world scenarios with varying query patterns.

Our elapsed time analysis demonstrates how each configuration achieved its performance objectives, as shown by the average elapsed time for each endpoint in the following screenshot.

Average Elapsed Time per Endpoint

The results matched our expectations: the Optimized for Performance configuration delivered significant speed improvements, running queries approximately two times as fast as the Balanced configuration and four times as fast as the Optimized for Cost setup.

The following screenshots show the elapsed time breakdown for each test.

[Screenshots: Optimized for Cost - Elapsed Time; Balanced - Elapsed Time; Optimized for Performance - Elapsed Time]

The following screenshot shows the tenth and final test iteration, which demonstrates distinct performance differences across configurations.

Per Configuration - Elapsed Time

To make the comparison clearer, we categorized our query elapsed times into three groups:

  • Short queries – Less than 10 seconds
  • Medium queries – From 10 seconds to 10 minutes
  • Long queries – More than 10 minutes
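
The three groups above translate directly into a small classification function. This is a sketch; the function name and the sample elapsed times are hypothetical, while the thresholds (10 seconds and 10 minutes) come from the list above.

```python
from collections import Counter


def duration_bucket(elapsed_seconds: float) -> str:
    """Classify a query by elapsed time using the thresholds from the list above."""
    if elapsed_seconds < 10:
        return "short"      # less than 10 seconds
    if elapsed_seconds <= 600:
        return "medium"     # from 10 seconds to 10 minutes
    return "long"           # more than 10 minutes


# Hypothetical elapsed times in seconds, for illustration only.
elapsed = [2.5, 45, 600, 900]
bucket_counts = Counter(duration_bucket(t) for t in elapsed)
print(bucket_counts)
```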

Considering our last test, the analysis shows:

| Duration per configuration | Optimized for Cost | Balanced | Optimized for Performance |
| --- | --- | --- | --- |
| Short queries (<10 sec) | 1488 | 1743 | 3290 |
| Medium queries (10 sec – 10 min) | 3633 | 3579 | 2035 |
| Long queries (>10 min) | 204 | 3 | 0 |
| TOTAL | 5325 | 5325 | 5325 |
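
To compare the configurations as proportions rather than raw counts, the figures from the table can be converted into percentage shares. The following sketch uses only the numbers reported above; the variable and function names are hypothetical.

```python
# Query counts per duration group, copied from the table above.
counts = {
    "Optimized for Cost": {"short": 1488, "medium": 3633, "long": 204},
    "Balanced": {"short": 1743, "medium": 3579, "long": 3},
    "Optimized for Performance": {"short": 3290, "medium": 2035, "long": 0},
}


def bucket_shares(buckets: dict[str, int]) -> dict[str, float]:
    """Return each group's share of the total, as a percentage rounded to one decimal."""
    total = sum(buckets.values())
    return {name: round(100 * n / total, 1) for name, n in buckets.items()}


shares = {config: bucket_shares(buckets) for config, buckets in counts.items()}
for config, share in shares.items():
    print(config, share)
```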

The configuration’s capacity directly impacts query elapsed time. The Optimized for Cost configuration limits resources to save money, resulting in longer query times, making it best suited for workloads that aren’t time critical, where cost savings are prioritized. The Balanced configuration provides moderate resource allocation, striking a middle ground by effectively handling medium-duration queries and maintaining reasonable performance for short queries while nearly eliminating long-running queries. In contrast, the Optimized for Performance configuration allocates more resources, which increases costs but delivers faster query results, making it best for latency-sensitive workloads where query speed is critical.

Capacity used during the tests

Our comparison of the three configurations reveals how Amazon Redshift Serverless AI-driven scaling and optimization technology adapts resource allocation to meet user expectations. The monitoring showed both Base RPU variations and distinct scaling patterns across configurations—scaling up aggressively for faster performance or maintaining lower RPUs to optimize costs.

The Optimized for Cost configuration starts at 128 RPUs and increases to 256 RPUs after three tests. To maintain cost-efficiency, this setup limits the maximum RPU allocation during scaling, even when facing query queuing.

In the following table, we can observe the costs for this Optimized for Cost configuration.

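| Test # | Starting RPUs | Scaled up to (RPUs) | Cost incurred |
| --- | --- | --- | --- |
| 1 | 128 | 1408 | $254.17 |
| 2 | 128 | 1408 | $258.39 |
| 3 | 128 | 1408 | $261.92 |
| 4 | 256 | 1408 | $245.57 |
| 5 | 256 | 1408 | $247.11 |
| 6 | 256 | 1408 | $257.25 |
| 7 | 256 | 1408 | $254.27 |
| 8 | 256 | 1408 | $254.27 |
| 9 | 256 | 1408 | $254.11 |
| 10 | 256 | 1408 | $256.15 |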

The strategic RPU allocation by Amazon Redshift Serverless helps optimize costs, as demonstrated between tests 3 and 4: after the starting capacity increased from 128 to 256 RPUs, the cost per test dropped from $261.92 to $245.57. This is shown in the following graph.

Optimized for Cost - Cost Average

Although the Optimized for Cost configuration changed its base RPUs, the Balanced configuration kept its base RPUs unchanged but scaled up to 2176 RPUs, beyond the 1408 RPUs that were the maximum used by the Optimized for Cost setup. The following table shows the figures for the Balanced configuration.

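| Test # | Starting RPUs | Scaled up to (RPUs) | Cost incurred |
| --- | --- | --- | --- |
| 1 | 192 | 2176 | $261.48 |
| 2 | 192 | 2112 | $270.90 |
| 3 | 192 | 2112 | $265.26 |
| 4 | 192 | 2112 | $260.20 |
| 5 | 192 | 2112 | $262.12 |
| 6 | 192 | 2112 | $253.18 |
| 7 | 192 | 2112 | $272.80 |
| 8 | 192 | 2112 | $272.80 |
| 9 | 192 | 2112 | $263.72 |
| 10 | 192 | 2112 | $243.28 |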

The Balanced configuration, averaging $262.57 per test, delivered significantly better performance while costing only 3% more than the Optimized for Cost configuration, which averaged $254.32 per test. As demonstrated in the previous section, this performance advantage is evident in the elapsed time comparisons. The following graph shows the costs for the Balanced configuration.

Balanced - Cost Average

As expected for the Optimized for Performance configuration, resource usage was higher in order to deliver the higher performance. In this configuration, we can also observe that after two tests, the engine adapted itself to start with a higher number of RPUs to serve queries faster.

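| Test # | Starting RPUs | Scaled up to (RPUs) | Cost incurred |
| --- | --- | --- | --- |
| 1 | 512 | 2753 | $295.07 |
| 2 | 512 | 2327 | $280.29 |
| 3 | 768 | 2560 | $333.52 |
| 4 | 768 | 2991 | $295.36 |
| 5 | 768 | 2479 | $308.72 |
| 6 | 768 | 2816 | $324.08 |
| 7 | 768 | 2413 | $300.45 |
| 8 | 768 | 2413 | $300.45 |
| 9 | 768 | 2107 | $321.07 |
| 10 | 768 | 2304 | $284.93 |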

Despite a 19% cost increase from the second test to the third test, when the starting capacity rose from 512 to 768 RPUs, most subsequent tests remained below the $304.39 average cost.

Optimized for Performance - Cost Average

The Optimized for Performance configuration maximizes resource usage to achieve faster query times, prioritizing speed over cost efficiency.

The final cost-performance analysis reveals compelling results:

  • The Balanced configuration delivered twofold better performance while costing only 3.25% more than the Optimized for Cost setup.
  • The Optimized for Performance configuration achieved fourfold faster elapsed time with a 19.69% cost increase compared to the Optimized for Cost option (average cost of $304.39 versus $254.32 per test).
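
The cost side of this comparison can be recomputed from the per-test costs listed in the three tables above. The following sketch averages the ten tests per configuration and expresses each average relative to the Optimized for Cost baseline; the variable and function names are hypothetical, and the inputs are the reported figures only.

```python
from statistics import mean

# Cost incurred per test (USD), copied from the three tables above.
costs = {
    "Optimized for Cost": [254.17, 258.39, 261.92, 245.57, 247.11,
                           257.25, 254.27, 254.27, 254.11, 256.15],
    "Balanced": [261.48, 270.90, 265.26, 260.20, 262.12,
                 253.18, 272.80, 272.80, 263.72, 243.28],
    "Optimized for Performance": [295.07, 280.29, 333.52, 295.36, 308.72,
                                  324.08, 300.45, 300.45, 321.07, 284.93],
}

averages = {name: mean(values) for name, values in costs.items()}


def percent_increase(value: float, baseline: float) -> float:
    """Relative increase of value over baseline, in percent."""
    return (value / baseline - 1) * 100


baseline = averages["Optimized for Cost"]
for name, avg in averages.items():
    print(f"{name}: average ${avg:.2f}, {percent_increase(avg, baseline):+.2f}% vs Optimized for Cost")
```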

The following chart illustrates our cost-performance findings:

Average Billing and Elapsed Time per Endpoint

It’s important to note that these results reflect our specific test scenario. Each workload has unique characteristics, and the performance and cost differences between configurations might vary significantly in other use cases. Our findings serve as a reference point rather than a universal benchmark. Additionally, we didn’t test two intermediate configurations available in Amazon Redshift Serverless: one between Optimized for Cost and Balanced, and another between Balanced and Optimized for Performance.

Conclusion

The test results demonstrate how Amazon Redshift Serverless AI-driven scaling and optimization can help organizations find their ideal balance between cost and performance across different workload requirements. Although our test results serve as a reference point, each organization should evaluate their specific workload requirements and price-performance targets. The flexibility of five different optimization profiles, combined with intelligent resource allocation, enables teams to fine-tune their data warehouse operations for optimal efficiency.

To get started with Amazon Redshift Serverless AI-driven scaling and optimization, we recommend:

  1. Establishing your current price-performance baseline
  2. Identifying your workload patterns and requirements
  3. Testing different optimization profiles with your specific workloads
  4. Monitoring and adjusting based on your results

By using these capabilities, organizations can achieve better resource utilization while meeting their specific performance and cost objectives.

Ready to optimize your Amazon Redshift Serverless workloads? Visit the AWS Management Console today to enable Amazon Redshift Serverless AI-driven scaling and optimization and start exploring the different optimization profiles. For more information, check out our documentation on Amazon Redshift Serverless AI-driven scaling and optimization, or contact your AWS account team to discuss your specific use case.


About the Authors

Ricardo Serafim is a Senior Analytics Specialist Solutions Architect at AWS. He has been helping companies with Data Warehouse solutions since 2007.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Andre Hass is a Senior Technical Account Manager at AWS, specialized in AWS Data Analytics workloads. With more than 20 years of experience in databases and data analytics, he helps customers optimize their data solutions and navigate complex technical challenges. When not immersed in the world of data, Andre can be found pursuing his passion for outdoor adventures. He enjoys camping, hiking, and exploring new destinations with his family on weekends or whenever an opportunity arises.

Four ways to grant cross-account access in AWS

Post Syndicated from Anshu Bathla original https://aws.amazon.com/blogs/security/four-ways-to-grant-cross-account-access-in-aws/

As your Amazon Web Services (AWS) environment grows, you might develop a need to grant cross-account access to resources. This could be for various reasons, such as enabling centralized operations across multiple AWS accounts, sharing resources across teams or projects within your organization, or integrating with third-party services. However, granting cross-account access requires careful consideration of your security, availability, and manageability requirements.

In this blog post, we explore four different ways to grant cross-account access using resource-based policies. Each method has its own unique tradeoffs, and the best choice depends on your specific requirements and use case.

Evaluating different techniques for granting cross-account access

Cross-account access is granted by identity-based policies and resource-based policies in AWS Identity and Access Management (IAM). Identity-based policies attach to an IAM role, while resource-based policies attach to resources like Amazon Simple Storage Service (Amazon S3) buckets and AWS Key Management Service (AWS KMS) keys. Resource-based policies require you to specify one or more principals (IAM users or roles) that are allowed to access the resource.

Your choice of how to specify the principal in a resource-based policy impacts some aspects of both the confidentiality and the availability of your solution. Understanding this impact and making the right tradeoffs for your use case is the focus of this post.

An example scenario

Imagine that you have an S3 bucket in your AWS account (Account A) that needs to be accessed by different principals in another AWS account (Account B). For this scenario, we assume that the principals in Account B have the necessary access to S3 in their identity-based policies, and we will focus on authoring the resource-based policies in Account A. While the methods explained here use Amazon S3, the concepts discussed apply to all AWS services that support resource-based policies. In the following sections, we walk through four different ways to grant cross-account access in this scenario and discuss the tradeoffs of each.

Method 1: Grant access to a specific IAM role using the Principal element of the resource-based policy

In this example, you use an S3 bucket policy to grant access to a specific IAM role (RoleFromAccountB) in Account B by specifying the IAM role’s Amazon Resource Name (ARN) in the Principal element of the policy in Account A.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRoleInThePrincipalElement",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/RoleFromAccountB"
      },
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket-account-a/*"
    }
  ]
}

Using this bucket policy, if someone in Account B deletes or recreates the role (RoleFromAccountB), then that role can no longer access the amzn-s3-demo-bucket-account-a bucket, even if that role is recreated with the same name. The reason is that when you save this policy, the role ARN is mapped to the unique ID of the role, which looks something like this: AROADBQP57FF2AEXAMPLE. You will see a role identifier in the Principal element of your resource-based policies if you view them after you delete the role that they referenced.

This behavior is intentional. The resource-based policy only allows the specific instance of the role that you set as principal at the time of policy creation. This helps prevent unintended access to your resources if you delete a role, but forget to update your resource-based policy to remove that role. This behavior can also cause an availability risk because the role (RoleFromAccountB) will have a new unique ID when it is recreated and will no longer have access to the bucket. Roles can be recreated for a number of reasons, including accidentally when you use tools such as infrastructure as code.

You might consider choosing this method if:

  • You own the roles in both Account A and Account B and can control the creation and deletion of these roles.
  • You want your resource-based policy in Account A to stop granting access when the specified role (RoleFromAccountB) is deleted.
  • You prioritize granular access control over potential availability concerns if the role (RoleFromAccountB) is deleted.

Method 2: Grant access to an account using the Principal element of the resource-based policy

In this example, you grant access to a specific account in the Principal element of the resource-based policy. This resource-based policy in Account A allows any user or role in Account B to read the objects, provided that the user or role also has an identity-based policy that grants that access.

Note: You can use either "Principal": {"AWS": "111122223333"} or "Principal": {"AWS": "arn:aws:iam::111122223333:root"} in the Principal element. They are equivalent, and the long-form ARN does not represent the root user.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccountInThePrincipalElement",
      "Principal": {
        "AWS": "111122223333"
      },
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket-account-a/*"
    }
  ]
}

This resource-based policy helps avoid the potential availability issue discussed for Method 1. If a role in Account B that needs to have access to the bucket is recreated, it will still have access after the recreation of that role. This is because you don’t specify a role in the Principal element—instead, you specify an account. If you use Method 2, you must be comfortable delegating access control decisions to the owner of that account.

This approach explicitly delegates access control decisions to IAM in the other account (Account B). Principals in Account B have access to this bucket if allowed by their identity-based policies.

You might consider choosing this method if:

  • You need to grant access to many principals in Account B.
  • You want to delegate the access decision to the account where the principal exists (Account B).
  • You prioritize ease of management and availability over granular access control.

Method 3: Grant access to a specific IAM role using the aws:PrincipalArn condition

This method expands on Method 2 and adds a condition that grants access only to a specific IAM role. Similar to Method 2, you use the account number as the value of the Principal element, but also use the aws:PrincipalArn condition key to limit access to a specific principal in Account B.

The aws:PrincipalArn condition key is a global condition key that compares the ARN of the principal that made the request with the ARN that you specify in the policy. For IAM roles, the request context returns the ARN of the role, not the ARN of the user that assumed the role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccountInPrincipalAndRoleInPrincipalArn",
      "Principal": {
        "AWS": "111122223333"
      },
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket-account-a/*",
      "Condition": {
        "ArnEquals": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/RoleFromAccountB"
        }
      }
    }
  ]
}

This policy comes with the same availability benefits as the policy in Method 2: access to this resource will survive role recreation. This is because the role is translated to its unique identifier only when it is used in the Principal element. It is not translated to a unique identifier when it is used in a condition. If the role (RoleFromAccountB) in Account B is recreated, accidentally or intentionally, the policy will continue to grant access because the role matches the role ARN specified in the condition key of the resource-based policy in Account A. As a result, Method 3 provides a balanced approach to availability and security.
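
The difference between Method 1 and Method 3 under role recreation can be illustrated with a toy model. This is a sketch under stated assumptions, not how IAM is implemented: the Role class, the create_role helper, and the generated IDs are hypothetical stand-ins for the real role ARN and unique ID.

```python
from dataclasses import dataclass
from itertools import count

_id_counter = count(1)


@dataclass(frozen=True)
class Role:
    arn: str
    unique_id: str  # stands in for an ID such as AROADBQP57FF2AEXAMPLE


def create_role(arn: str) -> Role:
    """Hypothetical helper: every creation gets a fresh unique ID, even for the same ARN."""
    return Role(arn=arn, unique_id=f"AROA-EXAMPLE-{next(_id_counter)}")


ROLE_ARN = "arn:aws:iam::111122223333:role/RoleFromAccountB"
original = create_role(ROLE_ARN)

# Method 1: the ARN in the Principal element is mapped to the unique ID when the policy is saved.
method1_saved_principal = original.unique_id
# Method 3: the aws:PrincipalArn condition keeps the ARN string as written.
method3_condition_arn = ROLE_ARN


def method1_allows(role: Role) -> bool:
    return role.unique_id == method1_saved_principal


def method3_allows(role: Role) -> bool:
    return role.arn == method3_condition_arn


# The role is deleted and recreated with the same name in Account B.
recreated = create_role(ROLE_ARN)

print("Method 1, original role: ", method1_allows(original))
print("Method 1, recreated role:", method1_allows(recreated))
print("Method 3, recreated role:", method3_allows(recreated))
```

In this model the recreated role keeps its ARN but receives a new unique ID, so only the ARN comparison used by the condition key still matches.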

You might consider choosing this method if:

  • You are comfortable that this policy will continue to grant access to the role specified in the aws:PrincipalArn condition key if that role (RoleFromAccountB) is recreated.
  • You don’t own the Account B you are granting access to and don’t control when that role may be recreated.
  • You want a balance of availability and confidentiality.

Method 4: Grant access to an entire AWS Organizations organization

This method is focused on a different use case and is not an alternative to the methods listed earlier. Use this method if you have a resource (an S3 bucket, in this example) that you want to share with your entire organization, but not share with anyone outside of it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccessToAnEntireOrganization",
      "Principal": {
        "AWS": "*"
      },
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket-account-a/*",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgId": "o-12345"
        },
        "StringNotEquals": {
          "aws:PrincipalAccount": "${aws:ResourceAccount}"
        }
      }
    }
  ]
}

There is no way to specify an organization by using the Principal element of a resource-based policy, so you must use the aws:PrincipalOrgId condition key to restrict access to a specific organization. In this policy, you specify a wildcard in the Principal element, which says that anyone can access the bucket. Then the condition reduces “anyone” to just those AWS account principals that belong to the specified organization and have an identity-based policy that allows them access.

You then add an additional conditional block that compares the aws:PrincipalAccount condition key to the aws:ResourceAccount condition key by using a policy variable. This extra conditional block is optional and excludes the account that owns the bucket (Account A) from the allow statement. The reason for using this extra conditional block is so that principals in Account A still require an allow statement in their identity-based policy to access this bucket. If you choose to exclude this aws:PrincipalAccount comparison, principals in Account A are granted access to the bucket without an explicit allow statement in their identity-based policy. Policy evaluation logic only requires either the identity-based policy or the resource-based policy (but not both) to allow a request when the principal and resource are in the same account.
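
The same-account and cross-account rule described above can be written as a small decision function. This is a deliberately simplified sketch: it ignores explicit deny statements, service control policies, permissions boundaries, and session policies, and the function name is hypothetical.

```python
def request_allowed(same_account: bool, identity_allows: bool, resource_allows: bool) -> bool:
    """Simplified model of the allow decision for a single request."""
    if same_account:
        # Within one account, an allow in either policy type is sufficient.
        return identity_allows or resource_allows
    # Across accounts, both the identity-based and the resource-based policy must allow.
    return identity_allows and resource_allows


# A principal in Account A (the bucket owner) with no allow in its identity-based policy:
# without the aws:PrincipalAccount exclusion, the bucket policy statement matches it.
without_exclusion = request_allowed(same_account=True, identity_allows=False, resource_allows=True)
# With the exclusion, the statement no longer applies to Account A principals.
with_exclusion = request_allowed(same_account=True, identity_allows=False, resource_allows=False)
# A principal in another account of the organization still needs its own identity-based allow.
cross_account = request_allowed(same_account=False, identity_allows=False, resource_allows=True)

print(without_exclusion, with_exclusion, cross_account)
```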

You might consider choosing this method if:

  • You have a shared resource that should be accessible to your entire organization.

Conclusion

Choosing a method to grant cross-account access requires careful consideration of your requirements and use case. Each of the four methods discussed in this blog post has its own advantages and tradeoffs. By understanding these methods and their implications, you can decide on the most appropriate approach to grant cross-account access to your AWS resources. Remember to regularly review and audit your resource-based policies to verify that they align with your security and access requirements.

To learn how resource-based policies work with Amazon S3, see the blog post IAM Policies and Bucket Policies and ACLs! Oh My! Controlling Access to S3 Resources.

 

Anshu Bathla

Anshu is a Lead Consultant – SRC at AWS, based in Gurugram, India. He works with customers across diverse verticals to help strengthen their security infrastructure and achieve their security goals. Outside of work, Anshu enjoys reading books and gardening at home.

Jay Goradia

Jay is a Technical Account Manager (TAM) at AWS who works closely with enterprise customers to accelerate their cloud journey through strategic guidance and technical expertise. Using his security background, he helps organizations understand security best practices in AWS.

WellRight modernizes to an event-driven architecture to manage bursty and unpredictable traffic

Post Syndicated from John Lee original https://aws.amazon.com/blogs/architecture/wellright-modernizes-to-an-event-driven-architecture-to-manage-bursty-and-unpredictable-traffic/

WellRight is a leading comprehensive corporate wellness platform provider that helps organizations and employees drive meaningful outcomes through personalized wellness programs. The platform increases engagement and benefit utilization by delivering engaging challenges across multiple dimensions of wellness, from physical activities like step tracking to mental health initiatives and team-building exercises.

In this post, we share how WellRight optimized the cost and performance of their application through a ground-up modernization to an event-driven architecture.

The challenge

WellRight’s infrastructure often experiences bursty and unpredictable traffic patterns. For instance, clients can upload bulk user data at any time. A single upload can affect tens of thousands of users and cascade into millions of changes. WellRight’s legacy monolithic infrastructure had several challenges when faced with such traffic:

  • Multiple processes such as registration, progress calculation, and reward distribution relied on a single server, leading to a noisy neighbor problem.
  • Certain core services were isolated to avoid the noisy neighbor problem, but with high burst workloads, auto scaling didn’t react fast enough to meet the demand. This led to queues backing up with millions of requests. In addition, the database also had to be overprovisioned to avoid throttling, adding to the overall cost.
  • Parts of the application were not designed with auto scaling in mind, leading to overprovisioning of resources.

The following figure shows the Number of Messages Received metric from a sample Amazon Simple Queue Service (Amazon SQS) queue. WellRight would often receive bursts of events at unpredictable times.

A line graph showing the number of messages received in an SQS queue, with a sharp spike amid otherwise zero activity.

Solution overview

To address the challenges, WellRight made the strategic decision to transition to an event-driven architecture using fully managed AWS services. WellRight’s platform is driven by asynchronous state changes that propagate through multiple wellness programs, which makes it well suited for an event-driven architecture that can be broken down into microservices. Managed services such as AWS Lambda, Amazon SQS, and Amazon DynamoDB were appealing because they would eliminate the need to manage servers, allowing WellRight to focus on core business logic and reducing the operational burden on their engineering team. They would also remove the need to overprovision infrastructure or continuously right-size resources. Each microservice would scale automatically as needed with no manual effort, minimizing costs. The loosely coupled architecture would give the WellRight team the flexibility to add programs or modify existing ones without affecting existing workflows.

Design

WellRight’s initial event-driven architecture was centered around using serverless and fully managed services. DynamoDB was used as a primary data store for user information. For instance, when a user makes progress on their step challenge, the update in the DynamoDB table would propagate through DynamoDB Streams to Amazon EventBridge. Then, the event would be routed to the appropriate SQS queue, which functions as a buffer and provides fault tolerance to the events. A Lambda function would then process individual user metrics and update the Programs table. The Programs table uses DynamoDB Streams to send out updates using Amazon Simple Notification Service (Amazon SNS), keeping users informed about their progress and rankings.
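
The last processing step in this flow, a Lambda function consuming events from an SQS queue, might look like the following minimal sketch. It is not WellRight’s code: the update_program_progress helper and the userId and steps fields are hypothetical, and the handler assumes that the SQS message body is an EventBridge event envelope with a detail object. It also assumes that partial batch responses (ReportBatchItemFailures) are enabled on the event source mapping, which the source does not state.

```python
import json


def update_program_progress(user_id: str, steps: int) -> None:
    """Hypothetical placeholder for the write to the Programs table."""
    if steps < 0:
        raise ValueError("steps must be non-negative")


def handler(event, context):
    """Process a batch of SQS records and report only the failed ones back to Lambda."""
    failures = []
    for record in event.get("Records", []):
        try:
            envelope = json.loads(record["body"])  # EventBridge event delivered through SQS
            detail = envelope["detail"]            # payload fields below are assumed names
            update_program_progress(detail["userId"], int(detail["steps"]))
        except Exception:
            # Only this message is retried (and eventually sent to the DLQ), not the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Reporting failures per message keeps one malformed event from forcing the whole batch to be reprocessed, which matters when a single upload fans out into millions of changes.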

The following diagram illustrates the flow of an event after a user update.

The first iteration of the event-driven architecture fared better than the monolithic legacy application, but the bursty nature of the traffic was still an issue. Lambda functions triggered by SQS queues scaled rapidly, handling in under 15 minutes requests that previously required 30 servers and took hours to process. Lambda provided WellRight the scalability that they needed, but the rapid scaling introduced a new challenge: during times of extremely high load, DynamoDB was throttled and Lambda concurrency limits were reached, which left many unprocessed messages in the dead-letter queue (DLQ).

Maximum concurrency solution

In January 2023, AWS introduced the maximum concurrency feature for Lambda functions using Amazon SQS as an event source. This new feature allowed WellRight to control the concurrency of their Lambda functions for each SQS queue. Prior to this launch, Lambda functions would continue to scale as long as there were messages in the SQS queue. At times, the functions would scale up to the concurrency limit and be throttled as a result. With this feature in place, the functions do not scale beyond the configured maximum concurrency value. This gave WellRight fine-grained control over the overall throughput of the system. WellRight adjusts the maximum concurrency value as needed to protect downstream processes from being overwhelmed, while still responding to customer requests in a timely manner.

The following screenshot of the Lambda console shows the maximum concurrency for the function is set to 100 for an SQS trigger.

An AWS Lambda configuration screen showing a trigger from an SQS progress-calculation-queue with maximum concurrency set to 100, alongside a diagram illustrating the SQS to Lambda connection.
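
The setting shown in the console can also be applied programmatically. The following is a minimal sketch that uses the Lambda UpdateEventSourceMapping API through a boto3-style client; the wrapper function name and the mapping UUID are placeholders, and the client is passed in so the function can be exercised without AWS credentials.

```python
def set_sqs_maximum_concurrency(lambda_client, mapping_uuid: str, maximum_concurrency: int):
    """Cap how many concurrent invocations one SQS event source mapping can drive."""
    # Lambda accepts maximum concurrency values between 2 and 1000 for SQS event sources.
    if not 2 <= maximum_concurrency <= 1000:
        raise ValueError("maximum_concurrency must be between 2 and 1000")
    return lambda_client.update_event_source_mapping(
        UUID=mapping_uuid,
        ScalingConfig={"MaximumConcurrency": maximum_concurrency},
    )


# Example usage (requires boto3 and AWS credentials; the UUID is a placeholder):
#   import boto3
#   set_sqs_maximum_concurrency(boto3.client("lambda"), "<event-source-mapping-uuid>", 100)
```

Setting the cap per queue, rather than relying only on the function-level concurrency limit, lets each integration be tuned to what its downstream table can absorb.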

WellRight converted all Amazon SQS to Lambda integrations to use this feature. This provided WellRight with full control over the throughput of customer requests while preventing the system from being overloaded. With the maximum concurrency feature, WellRight reduced message processing failures by 99% and eliminated DynamoDB throttling events. The feature was enabled for all Amazon SQS and Lambda integrations, including those without scaling issues, as a safeguard for potential future scaling demands.

Performance and cost savings

WellRight’s event-driven architecture significantly improved their ability to handle bursty and unpredictable traffic patterns. The managed serverless services can scale instantaneously to handle these traffic spikes, providing a seamless experience for their clients. With their previous legacy architecture, clients experienced lags in challenge progress, leaderboards, and reward processing.

Now, clients continue to upload updates with over 1 million entries at any time, and WellRight can maintain up-to-the-minute leaderboards and reward processing. The transition to the new architecture has also yielded significant cost savings for WellRight. Prior to the serverless architecture, their baseline architecture required several large Amazon Elastic Compute Cloud (Amazon EC2) instances to handle the initial burst of traffic. After implementing the event-driven architecture, WellRight reduced their costs by 70% on the progress calculation service.

Future plans

WellRight is currently in the process of rolling out the new event-driven architecture to the remaining clients. By the end of 2024, WellRight plans to retire the majority of their remaining servers, further reducing their infrastructure costs.

Conclusion

WellRight’s transition to an event-driven architecture on AWS has been a successful endeavor. By using fully managed services such as Lambda, Amazon SQS, and DynamoDB, they have been able to handle bursty and unpredictable traffic patterns efficiently, while providing a seamless experience for their clients. The introduction of maximum concurrency for Lambda functions has been a game changer, allowing WellRight to control the throughput of their Lambda functions and avoid overwhelming downstream resources.

Overall, the event-driven architecture has enabled WellRight to scale efficiently, improve performance, and reduce costs of their progress calculation service by over 70%. As they continue to optimize their serverless architecture and migrate remaining clients, WellRight is well-positioned to further enhance their platform and provide an exceptional experience to their customers.

To learn more about building event-driven architectures, including key concepts, best practices, AWS services, and getting started resources, visit Serverless Land.



From log analysis to rule creation: How AWS Network Firewall automates domain-based security for outbound traffic

Post Syndicated from Mary Kay Sondecker original https://aws.amazon.com/blogs/security/from-log-analysis-to-rule-creation-how-aws-network-firewall-automates-domain-based-security-for-outbound-traffic/

When it comes to controlling incoming (ingress) and outgoing (egress) network traffic, organizations typically focus heavily on inbound traffic controls—carefully restricting what traffic can enter their network perimeter. However, this approach addresses only inbound security challenges. Modern applications rely heavily on third-party code through operating systems, libraries, and packages. This dependency can create potential security vulnerabilities. If these components are compromised, affected workloads might attempt to connect to unauthorized command and control servers or send sensitive data to unauthorized destinations on the internet.

This is why implementing strong outbound traffic controls—particularly through domain-based allowlisting—has become a critical security best practice. Rather than allowing unrestricted outbound access or maintaining an ever-growing denylist of low-reputation domains, many organizations are shifting to domain-based allowlisting. This approach restricts outbound communications to explicitly trusted domains, reduces potential risk surfaces, and helps to protect against both known and unknown threats. However, manually identifying and maintaining these allowlists has traditionally been a complex and time-consuming process.

AWS Network Firewall automated domain lists improve visibility into network traffic patterns and simplify outbound traffic control management. This feature provides analytics for HTTP and HTTPS network traffic, helping organizations understand domain usage patterns. It also automates firewall log analysis to create rules based on your network traffic. By combining increased visibility with automation, this feature enhances your security awareness and helps to improve the effectiveness of your firewall rules.

In this blog post, we’ll guide you through the implementation of the AWS Network Firewall automated domain list feature, providing a detailed overview, step-by-step instructions, and best practices to optimize your network security.

Overview of automated domain lists and traffic insights

Domain-based security allows you to control network traffic based on the domain names that your applications and users are trying to access. This approach offers a more intuitive and flexible way to create firewall rules, focusing on the destinations your network is trying to reach rather than just IP addresses. However, effectively configuring and managing firewall rules remains challenging for some customers, especially in large environments where connected devices, applications, and traffic patterns are continuously growing and changing. Organizations might struggle to keep up with these changes, leading to outdated or ineffective firewall rules and policies that are either too permissive, exposing the network to risks, or too restrictive, blocking legitimate traffic.

Let’s explore how automated domain lists address these challenges through various use cases and benefits:

Preventive and detective security controls

  1. Domain control through allowlisting – Establishing domain allowlists aligns with the security principle of least privilege for network traffic. A least-privilege model adjusts the scope of what a workload can do across the network, from infinite and undefined to scoped-down and well-defined, enabling better insight into potentially risky behaviors. By limiting outbound connections to only approved domains, organizations can more effectively control and monitor workload communications.
  2. Rule audit and compliance – Domain allowlisting makes it clear which domains are allowed, supporting alignment with standards like the Payment Card Industry Data Security Standard (PCI DSS), Health Insurance Portability and Accountability Act (HIPAA), Cybersecurity Maturity Model Certification (CMMC), and General Data Protection Regulation (GDPR).
  3. Preventive controls enable detection – Preventive controls also act as detective controls, establishing a baseline for normal domain access patterns. With a domain allowlist in place, security teams can better detect workloads that show signs of unauthorized activity.
  4. Incident response support – Domain reporting provides the latest list of workload domains accessed, enabling quick identification of potentially malicious domains during security incidents. This information helps teams prioritize which workloads may need immediate attention.

Operational value

  1. Initial firewall setup and management – Automated allowlisting involves analyzing existing traffic patterns and recommending domain-based rules, which simplifies the process of establishing baseline firewall rules. This helps organizations quickly deploy effective security policies, potentially reducing the time and expertise needed for initial firewall configuration and ongoing management.
  2. Application modernization – Allowlisting supports adjusting firewall rules to accommodate rapidly changing traffic patterns in microservices and containerized environments, helping security to keep pace with evolving architectures.
  3. Cross-environment consistency – Allowlisting enables consistent firewall rule creation and management across multi-cloud and hybrid environments, regardless of where applications or data reside.

How the automated domain list feature works

Automated domain lists work by analyzing your HTTP and HTTPS traffic, generating reports on frequently accessed domains, and providing a convenient way to create rules based on actual network traffic patterns. To begin using automated domain lists in AWS Network Firewall, sign in to the AWS Management Console, access the Network Firewall service, and either work with an existing firewall or create a new one. Then follow the rest of the steps in this post.

Step 1: Enable traffic analysis mode to capture HTTP and HTTPS traffic domain logs

After you’ve selected a firewall, in the left navigation pane, choose Configure advanced settings. Select the Enable traffic analysis mode checkbox to enable it, as shown in Figure 1. Network Firewall uses this logging mode to collect data on observed domains for HTTP and HTTPS traffic to create domain reports.

Figure 1: Enabling traffic analysis mode for a firewall

To stop collecting data on frequently accessed domains in your network traffic, clear the checkbox to disable traffic analysis mode, as shown in Figure 2. Note that if you disable traffic analysis mode, you won’t be able to generate domain reports.

Figure 2: Disabling traffic analysis mode

Once traffic analysis mode is enabled, you’re ready to generate a domain report based on observed network traffic. Next, you can go to the Monitoring and observability tab and choose Create report.

Figure 3: Traffic analysis mode enabled: Now you’re ready to generate domain-based reports

Step 2: Create a domain report

The domain report summarizes the HTTP and HTTPS traffic observed by your firewall for up to 30 days (or for the duration since firewall activation if less than 30 days). Select the checkbox next to each traffic analysis type you want to include in the report—HTTP, HTTPS, or both.

Important: Use your monthly domain report to examine 30 days of traffic behavior. Each report type (HTTP, HTTPS) is available once every 30 days at no additional cost.
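
Steps 1 and 2 can also be scripted. The following sketch assumes that your SDK version exposes the Network Firewall UpdateFirewallAnalysisSettings and StartAnalysisReport operations as the boto3 methods update_firewall_analysis_settings and start_analysis_report, with HTTP_HOST corresponding to HTTP traffic and TLS_SNI to HTTPS traffic. Verify the operation names and parameters against the API reference before relying on it. The wrapper function name is hypothetical, and the client is passed in as an argument.

```python
VALID_ANALYSIS_TYPES = {"HTTP_HOST", "TLS_SNI"}  # HTTP and HTTPS traffic, respectively


def enable_analysis_and_start_report(nfw_client, firewall_name: str, analysis_type: str) -> str:
    """Enable traffic analysis for both traffic types, then request one domain report."""
    if analysis_type not in VALID_ANALYSIS_TYPES:
        raise ValueError(f"analysis_type must be one of {sorted(VALID_ANALYSIS_TYPES)}")
    # Step 1: turn on traffic analysis mode for the firewall.
    nfw_client.update_firewall_analysis_settings(
        FirewallName=firewall_name,
        EnabledAnalysisTypes=sorted(VALID_ANALYSIS_TYPES),
    )
    # Step 2: start a report for the requested traffic type and return its ID.
    response = nfw_client.start_analysis_report(
        FirewallName=firewall_name,
        AnalysisType=analysis_type,
    )
    return response["AnalysisReportId"]


# Example usage (requires boto3 and AWS credentials; the firewall name is a placeholder):
#   import boto3
#   enable_analysis_and_start_report(boto3.client("network-firewall"), "my-firewall", "TLS_SNI")
```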

Figure 4: Create a domain report that includes traffic analysis types HTTP, HTTPS, or both

To see the status of your domain report, go to the Reports section in the console for your specific firewall. When the report is ready, you can review the report directly in the console or download it, as shown in Figure 5.

Figure 5: The list of domain reports in the Reports section of the console for your specific firewall

Step 3: Review the report details

The report details include the traffic type (HTTP or HTTPS) and the observation period (start and end dates). By default, the report covers the last 30 days, or the entire period since traffic analysis was enabled if that is less than 30 days. The report also shows these details:

  • The Domain list shows the fully qualified domain names (FQDNs) observed in the network traffic, such as aws.com or subdomain.aws.com.
  • The Access attempt count refers to the overall count of connection requests to the domain, including both successful and failed attempts.
  • The Unique sources field shows the number of distinct source IP addresses that connected to the domain, indicating its popularity. For example, if one workload connects to aws.com, then count = 1; if 1,000 workloads connect to aws.com, then count = 1,000.
  • The First accessed field shows when the domain was first seen in your traffic, while Last accessed shows when it was most recently seen. This includes both successful and failed attempts to access the domain.
  • The Protocol field indicates how the domain was observed—through either HTTP or HTTPS traffic (in other words, HTTP headers or a TLS handshake).

An example report is shown in Figure 6.

Figure 6: Example domain report details: 30-day analysis

Step 4: (Optional) Create a domain list rule group

You can copy the list of observed domains from the report to a stateful domain list rule group and update your firewall policy. To do so, in the Report details section, choose Create domain list group to use the firewall policy wizard to create or update your firewall rules. The selected domains are automatically copied to a domain list rule group, as shown in Figure 7. For detailed instructions, see the AWS Network Firewall documentation.

Figure 7: Option to copy over the observed domain lists and create a domain list rule group using the firewall policy wizard
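
If you prefer to script this step instead of using the wizard, the structure of a domain list rule group can be sketched as follows. This is a minimal illustration, not the wizard's own output: the function name and the sample domains are hypothetical, and the choice to inspect both HTTP_HOST and TLS_SNI is an assumption that you should adjust to your traffic.

```python
import json


def build_domain_allowlist_rule_group(domains):
    """Build the RuleGroup structure for a stateful domain list rule group."""
    # Normalize case and drop blanks and duplicates while preserving order.
    targets = list(dict.fromkeys(d.strip().lower() for d in domains if d.strip()))
    return {
        "RulesSource": {
            "RulesSourceList": {
                "Targets": targets,
                # Assumption: inspect both the HTTP Host header and the TLS SNI.
                "TargetTypes": ["HTTP_HOST", "TLS_SNI"],
                "GeneratedRulesType": "ALLOWLIST",
            }
        }
    }


if __name__ == "__main__":
    observed = ["aws.com", "subdomain.aws.com", "AWS.com"]  # hypothetical report rows
    print(json.dumps(build_domain_allowlist_rule_group(observed), indent=2))
```

The resulting dictionary is shaped to be passed as the RuleGroup argument of the Network Firewall CreateRuleGroup API (for example, through an AWS SDK) together with a name, the STATEFUL type, and a capacity that you choose.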

Best practices for implementing domain allowlists

When you implement domain allowlisting, consider the following guidelines for operational success. We recommend that you also consult your own internal compliance and security policies.

  1. Start with a strategy of generous allowlisting:
    • Begin with broader, more generous allowlist rules rather than a refined list to reduce the risk of accidentally blocking legitimate domains.
    • Focus on getting to a Default Deny policy so that you can benefit from its risk surface reduction.
    • Create flexible rules for trusted domains, including second-level domains and top-level domains, such as allowing access to subdomains under your registered second-level domain. Or allow access to second-level domains under top-level domains that your organization trusts—for example, .mil, .gov, or .edu.
    • Use custom Suricata rules with regex capabilities to handle complex traffic efficiently. See Examples of stateful rules for Network Firewall.
    • Remember that even a broad allowlist provides better security than having no allowlist at all.
  2. Make iterative improvements:
    • After you establish an initial generous allowlist and Default Deny rules, evaluate the rules to determine which areas you might want to start narrowing down further. Use alert rules before pass rules in order to log the specific domains a pass rule might be allowing access to.
    • Adjust logging levels based on domain trust levels and monitoring requirements.
    • Review and update rules based on operational insights and changing requirements.
    • Take a pragmatic and iterative approach to rule refinement rather than attempting to make the ruleset very strict from the start.
  3. Set up robust logging:
  4. Additional considerations:
    • After you enable traffic analysis mode, the automated domain lists feature provides visibility into your network traffic, reporting on observed connections. Although it doesn’t distinguish between allowed and blocked traffic, the domain list report can help you identify the most critical domains to include in your firewall rules.
    • The domain traffic data used to generate the list of domain recommendations is available for up to the last 30 days after traffic analysis has been enabled. This allows you to focus on the most relevant and recent network activity when optimizing your firewall policies.
    • Data collection for automated domain lists is opt-in and performed independently of the firewall policy and logging configuration. Enabling the feature doesn’t impact the performance of the firewall itself.
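
The guidance about placing alert rules before pass rules can be sketched with a small helper that emits a pair of Suricata-compatible rules for one trusted domain suffix. This is an illustrative sketch only: the helper name, the message text, and the sid values are hypothetical, and it assumes a firewall policy that uses strict rule ordering with $HOME_NET and $EXTERNAL_NET defined.

```python
def allowlist_rules(domain_suffix, first_sid):
    """Return an alert rule followed by a pass rule for one domain suffix."""
    # Normalize to a leading-dot suffix such as ".example.com".
    suffix = "." + domain_suffix.strip().lstrip(".").lower()
    # Match the TLS SNI on the domain itself and on any of its subdomains.
    match = f'tls.sni; dotprefix; content:"{suffix}"; nocase; endswith; flow:to_server'
    alert = (
        f"alert tls $HOME_NET any -> $EXTERNAL_NET any "
        f'(msg:"allowlisted TLS domain {suffix}"; {match}; sid:{first_sid}; rev:1;)'
    )
    passing = (
        f"pass tls $HOME_NET any -> $EXTERNAL_NET any "
        f"({match}; sid:{first_sid + 1}; rev:1;)"
    )
    # The alert rule comes first so the specific domain is logged before it is passed.
    return [alert, passing]
```

Calling allowlist_rules("example.com", 1000001) returns two rule strings that you could review and paste into a stateful rule group. Validate the generated rules in a non-production firewall before you rely on them.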

Conclusion

With AWS Network Firewall automated domain lists, you can simplify your firewall management process, create more effective rules based on actual traffic patterns, and maintain a strong security posture with less manual effort. This feature helps you address common challenges such as keeping up with rapidly changing application landscapes, managing security across complex environments, and adhering to compliance requirements. To learn more about Network Firewall and its features, see the product page and service documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Network Firewall re:Post forum or contact AWS Support.
 

Mary Kay Sondecker

Mary Kay is a Senior Product Manager at AWS, focused on AWS Network Firewall. With over two decades of experience in the technology industry, she is passionate about helping customers easily implement effective, scalable cloud solutions to drive better business outcomes.
Jesse Lepich

Jesse is a Senior Security Solutions Architect at AWS based in Lake St. Louis, Missouri, focused on helping customers implement native AWS security services. Outside of cloud security, his interests include relaxing with family, barefoot waterskiing, snowboarding and snow skiing, surfing, boating, and mountain climbing.
Michael Leighty

Michael is a Senior Security Solutions Architect at AWS, based in Atlanta. He specializes in helping customers design and implement effective network security controls, drawing from extensive experience at leading network security vendors. At AWS, he works closely with service teams to drive continuous improvement in security services based on customer needs and feedback.
Jason Goode

Jason is a Senior Security GTM Content Specialist at AWS, where he develops content strategies that bridge technical concepts with practical business solutions. Based in Austin, Texas, he leverages his creative background and expertise to help organizations understand and use native AWS security services.

Realizing twelve-factors with the AWS Well-Architected Framework

Post Syndicated from Michael Phorn original https://aws.amazon.com/blogs/architecture/realizing-twelve-factors-with-the-aws-well-architected-framework/

Organizations that follow the principles of the twelve-factor app and want to improve their development velocity might benefit from understanding how to realize those concepts on Amazon Web Services (AWS). In this post, I help you correlate twelve-factor app concepts with AWS guidance as you architect solutions on AWS.

Twelve-factors

Let’s start with a quick recap of twelve-factors. The Twelve-Factor App was published in 2011 by Adam Wiggins as a collaboration between developers at Heroku. It was published at a time when developers were shifting from a paradigm of writing software-as-a-service (SaaS) applications in their own cloud environments to having the applications hosted on a cloud provider, such as AWS. The authors’ intent was to provide “a broad set of conceptual solutions” for building applications that were portable and resilient. The principles centered around reducing the burden across the software lifecycle, from application introduction, through maintenance and operations, to sunsetting. These principles were captured in the following 12 factors:

  1. Codebase
  2. Dependencies
  3. Config
  4. Backing services
  5. Build, release, run
  6. Processes
  7. Port binding
  8. Concurrency
  9. Disposability
  10. Development and production environment parity
  11. Logs
  12. Admin process

These principles are best practices for building portable and resilient applications.

At AWS, we have the Well-Architected Framework to capture cloud and architecture best practices, which contains similar practices to twelve-factors. The Framework draws on years of collective experience from AWS Solutions Architects building solutions across business verticals and use cases. The results are architectures that support secure, high-performing, resilient, and cost-effective systems in the cloud. Whether you’re responsible for the underlying infrastructure or the application, the Framework helps you, as the CTO, architect, developer, or operations team member, understand the benefits and trade-offs of the decisions that have to be made.

A brief history of the AWS Well-Architected Framework

AWS published the first version of the Framework in 2012, and we released the AWS Well-Architected Framework whitepaper in 2015. Following the initial introduction, we added the Operational Excellence pillar in 2016 and released pillar-specific whitepapers and AWS Well-Architected Lenses in 2017. The following year, the AWS Well-Architected Tool was launched.

While twelve-factors focuses on application characteristics, the AWS Well-Architected Framework provides architectural guidance. When your architecture undergoes a Well-Architected review, you can meet the guidance for a twelve-factors application more easily. With some factors, the Framework helps the application developer delegate some responsibility from the application to the infrastructure. Both frameworks aim to help you deliver applications and services that are robust, scalable, and cloud centered. The AWS Well-Architected Framework helps you reinforce these mechanisms.

The six pillars of the AWS Well-Architected Framework

Let’s explore the six pillars of the AWS Well-Architected Framework, what each aims to achieve, and where the twelve-factors concepts intersect with AWS guidance.

The following figure shows the twelve factors and how they map to processes in AWS, which are described in this section.

1. Operational excellence

The operational excellence pillar helps you review your organization’s ability to support development and run workloads efficiently. You can use the topics in this pillar to evaluate how you operate your solutions. The pillar guides you through inspection of organizational structure, inspection of your mechanisms, and identification of obstacles and roadblocks that might slow your ability to innovate. The results include a feedback loop of continuous improvement for operating the infrastructure and solutions.

The factors you capture through operational excellence are codebase (I) and development and production environment parity (X). Codebase prescribes that there is exactly one codebase used to deploy everywhere, which echoes the purpose of reducing the operational burden of maintaining your software. The argument for a single codebase is consistency, traceability, and efficiency across a unified development lifecycle. The second factor is development and production environment parity, which encourages developers to create smaller but more frequent deployments. It also encourages developers to maintain parity not just of the core software, but also of the backing services between environments. Parity of environments is conducive to smoother development and deployment processes. Additionally, this parity helps developers catch issues in a non-production environment more consistently.

AWS services that can help you achieve operational excellence are AWS CodeConnections, AWS CodePipeline, AWS CloudFormation, AWS Systems Manager, Amazon CloudWatch, AWS Config, AWS CloudTrail, Amazon EventBridge, AWS X-Ray, AWS Organizations, AWS Control Tower, AWS Trusted Advisor, AWS Service Catalog, AWS Proton, Amazon CodeGuru (Preview), AWS Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and AWS Step Functions.

2. Security

The security pillar describes how to use AWS Cloud technologies to protect data, systems, and assets that improve your security posture. At AWS, we advocate the shared responsibility model, which applies to the security pillar. AWS is responsible for providing a secure environment for managing and operating your systems and solutions, but it is your responsibility to implement those best practices in the context of your requirements. The security pillar describes best practices such as reviewing how you manage identities for people and machines, which helps you store secrets securely.

The config (III) factor, which can be mapped to the security pillar, advises you to store variables and items that depend upon the environment as environment variables. This allows you to move between deployments without having to update your code. Configuration settings such as database connection strings, API keys, credentials, and other sensitive information should be separated from the application code. At AWS, we provide services that can be used to securely meet this requirement, including AWS Secrets Manager, AWS Systems Manager Parameter Store, AWS Certificate Manager, and AWS Key Management Service (AWS KMS).
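
As a minimal sketch of the config factor, the following example reads deployment-specific settings from the environment rather than from code. The variable names DATABASE_URL and LOG_LEVEL and the Settings class are hypothetical and chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str


def load_settings(env=os.environ):
    """Read deployment-specific settings from the environment."""
    try:
        database_url = env["DATABASE_URL"]  # hypothetical variable name
    except KeyError:
        # Fail fast at startup instead of falling back to a hardcoded value.
        raise RuntimeError("DATABASE_URL must be set for this deployment") from None
    return Settings(database_url=database_url, log_level=env.get("LOG_LEVEL", "INFO"))
```

The same code then runs unchanged in every environment, and only the environment differs. For sensitive values, one option is to have the deployment inject a reference (such as a secret name) and resolve it at startup from a service like AWS Secrets Manager.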

AWS services that can help you achieve security are AWS Identity and Access Management (IAM), Amazon GuardDuty, AWS Shield, AWS Web Application Firewall (WAF), Amazon Inspector, AWS CloudHSM, Amazon Macie, AWS Security Hub, AWS Config, AWS CloudTrail, Amazon VPC (Virtual Private Cloud), AWS Direct Connect, Amazon Cognito, AWS Firewall Manager, AWS Network Firewall, and AWS IAM Access Analyzer.

3. Reliability

The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. Reliability means that your architecture and systems:

  • Appropriately scale resources to meet demands
  • Mitigate disruptions caused by misconfiguration or transient network issues
  • Recover when disruptions do occur

Automation of scaling and recovery are best practices within the reliability pillar.

Because twelve-factors helps developers deliver a reliable application, multiple factors are categorized under the reliability pillar of the AWS Well-Architected Framework. Backing services (IV) explains that you should have flexibility for integrating services. This way, when your system experiences issues with availability, the application can replace the troubled service without code changes. You should choose the right resource that provides scalability while optimizing costs. Dependencies (II) describes that applications declare and isolate dependencies to become modular and self-contained. This speeds up recovery by simplifying the setup for handlers of the application code. Applications that adhere to the processes (VI) factor run as a collection of stateless processes to support scaling. This is equivalent to creating microservices that can scale up or down depending upon the workload or bring in additional instances when one fails. Disposability (IX) suggests that an application’s processes can be started and stopped rapidly, which makes the application resilient to failures and able to scale elastically.

AWS services that can help you achieve reliability are Amazon EC2 Auto Scaling, Elastic Load Balancing (ELB), Amazon RDS Multi-AZ, Amazon Simple Storage Service (Amazon S3), AWS CloudFormation, Amazon Route 53, AWS Shield, AWS Backup, Amazon CloudWatch, AWS Systems Manager, AWS Global Accelerator, Amazon Aurora, AWS Lambda, Amazon DynamoDB, and AWS Transit Gateway.

4. Performance efficiency

The principles under the performance efficiency pillar focus on using computing resources to build architectures on AWS that efficiently deliver sustained performance as demand changes and technologies evolve. Topics in this pillar include simplifying the consumption of technologies that align with your goals, the ability to go global in minutes, and reducing the time and effort needed to deliver a service.

The concurrency (VIII) factor prioritizes management of processes, which should be stateless and allow for horizontal scaling, promoting performance efficiency. The backing services (IV) factor also falls under this category because it dictates flexibility in integration. This flexibility enables the application to maximize performance by using the right resource that meets scalability and performance requirements.

AWS services that can help you achieve performance efficiency are Amazon Elastic Compute Cloud (Amazon EC2), Amazon EC2 Auto Scaling, Amazon Elastic Block Store (Amazon EBS), Amazon S3, Amazon Aurora, Amazon DynamoDB, Amazon ElastiCache, Amazon CloudFront, Application Auto Scaling, Elastic Load Balancing (ELB), AWS Lambda, Amazon API Gateway, AWS Step Functions, Amazon SQS, Amazon Kinesis, AWS Global Accelerator, AWS X-Ray, Amazon CloudWatch, and AWS Compute Optimizer.

5. Cost optimization

The cost optimization pillar provides guidance for the architecture’s ability to operate systems and deliver the business value at the lowest price point. The cost optimization reviews help you avoid unnecessary costs, analyze and attribute expenditure, and use appropriate pricing models.

The build, release, run (V) factor advocates for strict separation between the build, release, and run stages and for discipline around efficient handling of application deployments. This aligns with the cost optimization pillar because cost-effective operations are typically a result of well-designed processes and mechanisms. AWS services that can support the build, release, run factor are AWS CodeBuild and AWS CodeDeploy.

Other AWS services that can help you with cost optimization are AWS Cost Explorer, AWS Budgets, AWS Data Exports, AWS Trusted Advisor, AWS Compute Optimizer, EC2 Spot Instances, AWS Savings Plans, Amazon S3 Intelligent-Tiering, AWS Lambda, Amazon Aurora, Application Auto Scaling, AWS Organizations, AWS Resource Groups, Tag Editor, AWS Marketplace, AWS License Manager, AWS Glue, and Amazon Athena.

6. Sustainability

The sustainability pillar focuses on minimizing the environmental impact of running workloads in the cloud. Topics in this pillar include reviewing the lifecycle of your data and retention policies as a methodology to use only what is needed.

The disposability (IX) factor is aligned to sustainability because it highlights an application’s ability to rapidly start and shut down at a moment’s notice. This provides agility and optimized use of resources during the life of the application.
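
One way to picture disposability is a worker that stops cleanly when the platform sends a termination signal. The following is a minimal sketch under that assumption; the function names are hypothetical, and a real worker would also need to return unfinished work to its queue.

```python
import signal
import threading

stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: ask the work loop to stop before the next job."""
    stop_requested.set()


def install_handlers():
    """Register the handler for the signals a platform typically sends on shutdown."""
    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def run_worker(jobs, handle):
    """Process jobs until they are exhausted or a stop has been requested."""
    done = []
    for job in jobs:
        if stop_requested.is_set():
            break  # leave the remaining jobs for another instance
        done.append(handle(job))
    return done
```

A process would call install_handlers() once at startup and then run_worker(...). Because each job is handled independently and no state is kept in the process, an instance can be stopped or replaced at short notice.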

AWS services that can help you achieve sustainability are AWS Customer Carbon Footprint Tool, Amazon EC2 Auto Scaling, AWS Lambda, Amazon EC2 Spot Instances, Amazon EBS gp3 volumes, Amazon S3 Intelligent-Tiering, Amazon S3 Lifecycle configurations, AWS Graviton processors, Amazon Aurora Serverless, Amazon RDS Multi-AZ deployments, AWS Compute Optimizer, AWS Well-Architected Tool, and Amazon CloudWatch.

Remaining factors

Port binding, logs, and admin processes aren’t specifically categorized into the pillars of the AWS Well-Architected Framework. However, these factors can be addressed as an essential part of the services that AWS delivers.

The seventh factor: Port binding

The port binding factor says that an application should be bound to a specific port when it’s hosted as a web application, with the intention of making the application completely self-contained. AWS offers different ways to achieve this principle, depending on how your application is deployed. When you implement port binding on AWS, features such as security, service discovery, and dynamic port mapping help simplify and secure your applications through services like Amazon EC2, Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS Elastic Beanstalk, and AWS App Runner.
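
A minimal sketch of port binding, using only the Python standard library: the application carries its own HTTP server and takes its port from the environment. The PORT variable name, the default of 8080, and the handler are assumptions for illustration.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the example quiet


def make_server(port=None):
    """Create a self-contained HTTP server bound to the configured port."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))  # hypothetical variable name
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling make_server().serve_forever() starts the service. A load balancer or container port mapping can then route traffic to whichever port the environment assigned, without any change to the application.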

The eleventh factor: Logs

The logs factor dictates that an application should treat its logs as an event stream whose routing and storage are managed completely by the execution environment. AWS offers many types of logging to capture different aspects of your application and the supporting infrastructure. Amazon CloudWatch provides centralized log management that monitors, stores, and provides access to log files from AWS services. For more detail, see AWS services for logging and monitoring.
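
As a sketch of the logs factor, the application below writes one JSON object per event to standard output and leaves collection to the execution environment. The field names and the build_logger helper are hypothetical choices for illustration.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name="app", stream=sys.stdout):
    """Return a logger that emits JSON lines to the given stream (stdout by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace any handlers from earlier configuration
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

The application never opens or rotates log files itself. On many AWS compute services, output written to standard output can be forwarded to Amazon CloudWatch Logs by the platform or by a logging agent.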

The twelfth factor: Admin processes

The admin processes factor advises application developers to perform administrative tasks in an isolated manner to minimize the impact on the main application. At AWS, this factor is realized as a separation of the control plane and the data plane. The control plane is responsible for managing, configuring, and controlling the network or system infrastructure, while the data plane is responsible for handling the actual user data or traffic. This separation is an inherent part of AWS services. We believe this separation allows AWS to deliver services that are scalable, highly available, secure, and efficient.

Applying the AWS Well-Architected Framework

The Framework shouldn’t be treated as a checklist that you review after development is complete. Instead, a review should be conducted during the design phase to help you learn and apply architectural best practices. By the end of development, architects should have built a solution that facilitates faster, lower-risk service building and deployment. The Framework is not a static document, and as AWS evolves, architects continue to learn from working with customers and refine the definition of well-architected.

Conclusion

If you are familiar with twelve-factors or want to develop a twelve-factors app on AWS, read more about the AWS Well-Architected Framework. Consider starting a review project on your own to explore the detailed questions underneath each pillar, or review a specific workload that you’re already working on. You can use one of the many AWS Well-Architected Tool lenses to focus on applying these best practices to the services that you’re using. To get started on a lens review, see AWS Well-Architected Tool, which is accessible at no charge through the AWS Management Console.


Building AI-powered customer experiences using a modern communications hub

Post Syndicated from Osman Duman original https://aws.amazon.com/blogs/messaging-and-targeting/building-ai-powered-customer-experiences-using-a-modern-communications-hub/

Customers expect organizations to anticipate and seamlessly fulfill their needs, engaging them with personalized content when, where, and how they prefer. They want context-sensitive, dynamic interactions with nuanced conversations across all communication channels. Organizations are under growing pressure to modernize customer experience workflows to drive loyalty and improve operational efficiency. Leveraging the latest advancements in Generative AI (GenAI), such as hyper-personalization and Agentic AI, presents new challenges. Organizations require a scalable, reusable architecture to integrate GenAI into their customer engagement systems without a complete overhaul of the disparate solutions they currently operate.

This blog post explores how to build an AI-powered modern communications hub using open-source GitHub samples that integrate SMS/MMS and WhatsApp services with GenAI capabilities. Organizations can create innovative AI-powered customer experiences with a quick proof-of-concept without disrupting existing systems.

In combination with Vector Databases and Retrieval Augmented Generation (RAG), GenAI makes it possible to reorganize knowledge into a single system and query from a single user interface through natural language conversation with a chatbot or virtual assistant. Funneling customer communications through a multi-channel communications hub linked with GenAI capabilities helps unify customer engagement mechanisms and streamlines the creation of rich customer experiences. Customers meet AI agents and Q&A bots on the communication channel that is most convenient for them to self-serve their needs. Organizations can build communications-channel-agnostic customer experiences while collecting channel engagement events and conversational data into a centralized data store for real-time insights, ad-hoc queries, analytics, and ML training.

Solution overview

At the core of the solution is the Modern Communications Hub that connects digital communication channels with key GenAI services, like Amazon Bedrock and Amazon Q, along with AWS ML, database, storage, and serverless computing services. AWS End User Messaging and Amazon SES provide API-level access to digital communication channels, offering secure, scalable, high-performance, and cost-effective services for enterprise applications to exchange SMS/MMS, WhatsApp, push and voice notifications, and email with customers.

A collection of open-source sample code, published in the AWS-samples GitHub repository, illustrates how to facilitate generative conversations on SMS/MMS and WhatsApp channels. This will be extended to include email services. Two key components form the foundation of the GenAI Integration Samples: the Multi-channel Chat with AI Agents and Q&A Bots and the Engagement Database and Analytics for End User Messaging and SES. We will simply refer to these as the Conversation Processor and Engagement Database in the solution diagram.

This diagram shows the solution architecture at Level 300.

The Conversation Processor receives customer messages via AWS End User Messaging and Amazon Simple Email Service (SES), stores the conversation details, and invokes the relevant Amazon Bedrock Agent. Amazon Bedrock Agents use Large Language Models (LLMs) and knowledge bases to analyze tasks, break them into actionable steps, execute those steps or search the knowledge base, observe outcomes, and iteratively refine their approach until they complete the task and produce a response. Alternatively, the Conversation Processor can function as a Q&A bot, in which case it uses Amazon Bedrock Knowledge Bases along with its RAG feature to generate an LLM answer and send it back on the same channel as the customer’s message.
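
The agent invocation step can be sketched as follows. This is not the sample repository's code: the ask_agent helper is hypothetical, and it assumes a boto3 bedrock-agent-runtime client is passed in and that one session ID is reused per customer conversation.

```python
def ask_agent(client, agent_id, agent_alias_id, session_id, text):
    """Send one customer message to a Bedrock agent and return its full reply."""
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,  # reuse per conversation so the agent keeps context
        inputText=text,
    )
    parts = []
    # The reply arrives as an event stream; the text is carried in "chunk" events.
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)
```

In a deployment, the client would typically be created with boto3.client("bedrock-agent-runtime"), and the returned text would be sent back to the customer on the channel the message arrived on.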

The Engagement Database collects and combines customer engagement data and conversational logs from across communication channels, storing the information in a centralized data lake on Amazon S3. By converting the data into a common, canonical format, the solution simplifies querying and analysis of these inbound events. A Lambda Transformer function leverages Apache Velocity Templates to transform the incoming JSON data, enabling real-time insights.

The raw event data stored in the Amazon S3 data lake can then be fed into other AWS services for further processing. For example, the data can flow into Amazon Connect Customer Data Profiles or Amazon SageMaker to support machine learning model training. Data analysts can use Amazon Athena to issue direct queries for detailed ad-hoc reporting, or to send the data to Amazon QuickSight for advanced visualizations and natural language querying capabilities through Amazon Q in QuickSight.

NOTE: End users might send personally identifiable information (PII) in messages. To protect customer privacy, consider using Amazon Comprehend to assist in redacting PII before storing messages in S3. The following blog post provides a good overview of how to use Comprehend to redact PII: Redact sensitive data from streaming data in near-real time using Amazon Comprehend and Amazon Kinesis Data Firehose.
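
As a rough sketch of that redaction step, the following helper replaces each PII span reported by Amazon Comprehend with its entity type before the message is stored. The redact_pii helper and the 0.5 score threshold are assumptions for illustration, and the client is expected to be a boto3 comprehend client.

```python
def redact_pii(client, text, language_code="en", min_score=0.5):
    """Replace PII spans found by Amazon Comprehend with their entity type."""
    result = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    # Assumption: ignore low-confidence detections below min_score.
    entities = [e for e in result["Entities"] if e["Score"] >= min_score]
    # Replace from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + f"[{entity['Type']}]" + text[end:]
    return text
```

Tune the threshold and the set of entity types you redact against your own privacy requirements, because automated detection can miss or over-match content.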

Amazon Bedrock provides core GenAI capabilities such as LLMs, Knowledge Bases, Retrieval Augmented Generation (RAG), AI agents, and Guardrails, to understand customer asks, determine what action to take, and what to communicate back. Amazon Bedrock Knowledge Bases provide organization specific business knowledge and reasoning, while Amazon Bedrock Agents automate multistep tasks by seamlessly connecting with company systems, APIs, and data sources.

Prerequisites

The following prerequisites are necessary to build your modern communications hub:

  • An AWS account. Sign up for an AWS account at AWS website if you don’t have one.
  • Appropriate AWS Identity and Access Management (IAM) roles and permissions for Amazon Bedrock, AWS End User Messaging, and Amazon S3. For more information, see Create a service role for model import.
  • AWS End User Messaging Configuration: You’ll need to configure the necessary origination identity in the AWS End User Messaging service to deliver messages via SMS or WhatsApp. If configuring SMS, a registered and active SMS Origination Phone Number must be provisioned in AWS End User Messaging SMS. (Within the United States, use 10DLC or Toll-Free Numbers (TFNs).) If configuring WhatsApp, an active number that has been registered with Meta/WhatsApp should be provisioned in AWS End User Messaging Social.
  • Amazon Bedrock models: Bedrock Anthropic Claude 3.0 Sonnet and Titan Text Embeddings V2 enabled in your region. These are the default models used by the solution, but you are free to experiment with different models.
  • Docker Installed and Running – This is used locally to package resources for deployment.
  • Node (> v18) and NPM (> v8.19) installed and configured on your computer
  • The AWS Command Line Interface (AWS CLI) installed and configured
  • AWS CDK (v2) installed and configured on your computer.

Deploy the Conversation Processor and Engagement Database

Deploy the following two solutions. While not required, it is best to deploy them in this order, as outputs from the Engagement Database can be used in the Multi-Channel Chat example:

  1. Engagement Database and Analytics for End User Messaging and SES
  2. Multi-channel Chat with AI Agents and Q&A Bots

Each solution contains detailed instructions to deploy the required services using the AWS Cloud Development Kit (CDK). The first Engagement Database solution will create an Amazon Data Firehose stream that can be used as an input to the second Multi-Channel Chat application so that data can be stored and queried in the Engagement Database.

Multi-Channel Chat with AI Agents and Q&A Bot Data Sources
This solution demonstrates how users can interact with three different knowledge sources. You might not need all three, but this should serve as a good example for building the right knowledge source for your particular use case:

NOTE: The starter project creates an S3 bucket to store the documents used for the Bedrock Knowledge Base. Please consider using Amazon Macie to assist in the discovery of potentially sensitive data in S3 buckets. Amazon Macie can be enabled on a free trial for 30 days, up to 150GB per account.

  • Build your Knowledge Base on Amazon Bedrock using a Web Crawler. Optionally configure your knowledge base to scan or crawl website(s) to populate your knowledge base.
  • Amazon Bedrock Agents: Optionally enable your users to chat with an Amazon Bedrock agent. Agents have the added benefit of supporting knowledge bases for answering questions and walking users through collecting the information needed to automate a task such as making a reservation. Sample agents are available in the Amazon Bedrock Agent Samples repository. Note that you will need to have an Amazon Bedrock agent created in your region prior to deploying the solution.

Conclusion

A Modern Communications Hub, loosely coupled with core Generative AI services, establishes a composable foundation on which to build communication-channel-agnostic customer experiences. Build one by using the GenAI Integration Samples, Conversation Processor and Engagement Database, combined with the secure, scalable, high-performance, and cost-effective digital communication services of AWS End User Messaging and Amazon SES. This will provide a single point of conversational access to knowledge bases and agentic AI capabilities on Amazon Bedrock. Start experimenting with AI-powered customer experience innovations with a quick proof-of-concept that won’t interfere with your present customer engagement setup.

About the Authors

Enhancing telecom security with AWS

Post Syndicated from Kal Krishnan original https://aws.amazon.com/blogs/security/enhancing-telecom-security-with-aws/

If you’d like to skip directly to the detailed mapping between the CISA guidance and AWS security controls and best practices, visit our GitHub page.
 

Implementing CISA’s enhanced visibility and hardening guidance for communications infrastructure

In response to recent cybersecurity incidents attributed to actors from the People’s Republic of China, a number of cybersecurity agencies led by the U.S. Cybersecurity and Infrastructure Security Agency (CISA) have jointly released comprehensive guidance for securing communications infrastructure. As communications service providers (CSPs) migrate their workloads to the cloud, they must take steps to implement these security measures effectively in cloud environments.

This blog post describes how CSPs can use Amazon Web Services (AWS) capabilities to implement this guidance while benefiting from the advantages of the cloud.

The guidance focuses on two key domains:

  • Strengthening visibility: Enabling security teams to monitor, detect, and respond to potential threats through comprehensive visibility into digital assets
  • Hardening systems and devices: Implementing robust security controls and configurations to minimize vulnerabilities and help prevent unauthorized access

Overview of fundamental cloud concepts

Before exploring the specific guidance in this post, it’s important to understand how security recommendations apply differently to public cloud environments than to private infrastructure. A common tendency in the telecommunications industry is to treat public clouds as merely scaled-up versions of private clouds. This can result in a misunderstanding of security capabilities and underutilization of cloud-native security features of the public cloud.

The fundamental difference lies in how public clouds are architected—specifically for multi-tenancy, with strong tenant isolation as a cornerstone of their design. In AWS, virtual resources are isolated by default and require explicit configuration to interconnect. For example, when you create a virtual private cloud (VPC) with Amazon VPC, this logically isolated network does not permit inbound or outbound traffic until specific routes and ports are deliberately configured. Similarly, Amazon Simple Storage Service (Amazon S3) buckets are private by default, requiring explicit configuration to grant access. This isolation extends to the core of our virtualization infrastructure through the AWS Nitro System, which provides unprecedented workload isolation—even AWS operators with the highest privileges have no technical access to customer workloads. Furthermore, data moving between Nitro System based virtual machines or across our global backbone network is automatically encrypted, providing additional layers of protection beyond customer-implemented encryption.

This secure-by-design and secure-by-default philosophy permeates throughout AWS service design and operations. It isn’t merely a design choice—it’s a business imperative driven by the critical need for operational resilience and customer trust in the public cloud model. Our commitment to principles of this sort is reflected in our participation as a signatory to CISA’s Secure by Design Pledge.

When AWS customers operate in the public cloud, understanding the shared responsibility model is paramount. This model clearly delineates security responsibilities: AWS is responsible for security of the cloud, while customers are responsible for security in the cloud. This division of responsibilities significantly reduces your operational burden, because AWS assumes responsibility for securing everything about and inside the cloud services it provides, all the way down to the physical protection of data centers. As a result, you can concentrate your security resources where they matter most—protecting your applications and workloads—while AWS handles the undifferentiated heavy lifting of infrastructure security.

This shared responsibility model becomes even more advantageous when considering the economies of scale inherent to public cloud operations. The massive scale of AWS allows us to invest more in securing the foundations than a single enterprise could achieve independently, creating a security multiplier effect that benefits all customers. A compelling example of this scale advantage is our comprehensive threat intelligence program, which deploys honeypot sensors throughout our global network. These sensors observe more than 100 million potential threat interactions and probes daily. Using artificial intelligence and machine learning (AI/ML), we analyze this information and take swift, often automated actions to mitigate threats. In the first half of 2023 alone, this program enabled us to dismantle the sources of approximately 230,000 Layer 7 distributed denial of service (DDoS) events. We also provide this intelligence to customers through services like Amazon GuardDuty, extending the benefits of our scale to our customers.

The scale of AWS operations not only enables exemplary threat intelligence, but also necessitates extensive automation of our security operations. Several routine tasks, such as feature and patch deployments and configuration updates, are fully automated through deployment pipelines. Automation has the added benefit of taking humans out of the loop, thereby decreasing opportunities for mistakes.

Our scale also facilitates our comprehensive compliance with security standards across multiple industries and jurisdictions. Our global presence and diverse customer base necessitate adherence to the most stringent security requirements worldwide. Through the AWS Compliance Program, we’ve achieved 143 security standards and compliance certifications, ranging from ISO standards for cloud security and privacy to industry-specific regulations in finance and healthcare, as well as government security programs. This includes independent verification of our claims on the isolation properties of our Nitro System virtualization infrastructure. Consequently, you benefit from this scale-driven compliance, gaining access to a secure, certified infrastructure that implements state-of-the-art security systems.

These are a few reasons why, in a blog post titled Why cloud first is not a security problem, the UK’s National Cyber Security Centre concluded that “using the cloud securely should be your primary concern – not the underlying security of the public cloud.”

Private clouds, on the other hand, are typically within the control of a single organization and are single-tenanted, offering relatively weak workload isolation. Virtual resources in private clouds usually default to being interconnected upon creation, and so require explicit steps to increase isolation. Manual operations, with their opportunities for mistakes as well as potential involvement of threat actors, are often still a large part of private cloud workflows. Rarely do they undergo the level of security scrutiny that public clouds are routinely subjected to. These, and other differences, mean that security risks in each offering are inherently different, so correspondingly distinct security controls and solution architectures are needed to mitigate these risks.

Implementing hardening guidance with AWS

Your cloud resources and data are contained in an AWS account. An account acts as an identity and access management isolation boundary. When you need to share resources and data between two accounts, you must explicitly allow this access. This reduces the risk of lateral movement between accounts.
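
As a rough illustration of what "explicitly allow" can look like, the following Python sketch builds an IAM role trust policy that permits exactly one other account to assume a role. The function name and the placeholder account ID are assumptions made for this example; the policy document is only constructed and printed, not applied to any account.

```python
import json


def cross_account_trust_policy(trusted_account_id: str) -> dict:
    """Build a role trust policy that lets one other account assume the role."""
    if not (trusted_account_id.isdigit() and len(trusted_account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only principals in this one account may request the role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # 111122223333 is a placeholder account ID, not a real account.
    print(json.dumps(cross_account_trust_policy("111122223333"), indent=2))
```

Without a statement like this on the role (and matching permissions in the other account), a cross-account request is denied, which is the isolation property described above.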

Designing your AWS environment correctly lays a strong foundation that can help you meet the CISA cybersecurity guidance. AWS Control Tower, working with AWS Organizations, enables you to establish a well-architected, multi-account environment based on security best practices.

For detailed guidance on creating a secure landing zone for telecom workloads, refer to our comprehensive blog post on this topic.

We’ve analyzed the recommendations in CISA’s guidance and grouped them into six categories across the two key domains. Refer to the GitHub page linked at the bottom of this post, which has further detailed guidance on the relevant AWS services that can assist your implementation of each individual security measure in the guidance.

Logging and monitoring

The guidance in this category emphasizes the importance of increasing visibility to understand network activity, which is essential for detecting anomalies and responding to incidents. Key security controls include the following: have a robust asset management capability, enable logging at various levels, centralize logging, protect the logging and monitoring infrastructure, and use security tools to detect anomalies and incidents.

Enhanced visibility is an inherent advantage of the public cloud model, particularly in AWS. This transparency is not just a bolted-on feature, but a fundamental necessity driven by the API-centric, pay-as-you-go business model. To accurately bill customers, AWS has built comprehensive tracking and logging capabilities into its core architecture. As a result, AWS provides robust tools that allow you to capture, centralize, and monitor logs across every layer of your network workload. This level of visibility extends far beyond what’s typically achievable in traditional infrastructure, offering you unprecedented insight into your IT assets and user activities.

Another key piece of guidance in this area is to centralize security-related logging while isolating the logging infrastructure from other production environments. You can implement this guidance in AWS by using Amazon Security Lake together with OpenSearch implemented in separate accounts, with access restricted to just the security organization. Alternatively, this solution provides a best-practice implementation of creating collection and ingestion pipelines to allow for centralization and inspection of log sources across your AWS workloads without the use of Security Lake.

Configuration and change management

The guidance in this category emphasizes the centralization, security, and protection of configurations. It highlights the importance of detecting and providing alerts for unauthorized modifications, auditing configurations for compliance, and a change management process that automates routine changes to minimize unintended drift.

In AWS, you can implement infrastructure and configuration as code, which allows for central storage of configuration in repositories, tracking changes through version control, and implementing changes through approved change management processes. You can use code repositories and continuous integration and continuous delivery (CI/CD) pipelines to automate the implementation of these configurations, helping you increase deployment speed, maintain consistency, simplify management, and implement a rigorous and auditable change control process.

Regardless of how infrastructure is deployed and managed, you can use the AWS Config service to automatically track the current state and history of a wide set of configuration information across more than 100 AWS services and hundreds of their resource types. You can also write custom AWS Config rules to take automatic actions whenever sensitive resources are modified, or take advantage of more than 400 AWS managed rules in AWS Config that send alerts or create automatic remediations when critical resources change state.
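
To make the idea of a custom rule more concrete, here is a minimal Python sketch of the evaluation logic such a rule might contain. It assumes the rule is backed by an AWS Lambda function that receives the configuration item inside a JSON-encoded `invokingEvent` field; the required tag key `owner` and the function names are hypothetical choices for this example. The sketch stops before reporting the result, which a real rule would do by sending the evaluation and the event's result token to the AWS Config `PutEvaluations` API.

```python
import json

REQUIRED_TAG = "owner"  # hypothetical tag key chosen for this sketch


def evaluate(configuration_item: dict) -> dict:
    """Return an evaluation record: compliant only if the required tag exists."""
    tags = configuration_item.get("tags") or {}
    compliant = REQUIRED_TAG in tags
    return {
        "ComplianceResourceType": configuration_item["resourceType"],
        "ComplianceResourceId": configuration_item["resourceId"],
        "ComplianceType": "COMPLIANT" if compliant else "NON_COMPLIANT",
        "OrderingTimestamp": configuration_item["configurationItemCaptureTime"],
    }


def evaluation_from_event(event: dict) -> dict:
    """Extract the configuration item from a rule invocation and evaluate it."""
    # The invoking event arrives as a JSON-encoded string, not a nested dict.
    invoking_event = json.loads(event["invokingEvent"])
    return evaluate(invoking_event["configurationItem"])
```

Keeping the compliance decision in a pure function like `evaluate` makes it easy to unit test the rule before wiring it to AWS Config.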

Identity and access management

The guidance in this category emphasizes the importance of active account and permissions management, use of phishing-resistant authentication methods, implementing least privilege through role-based access controls, managing emergency access, and limiting sessions.

Authentication and authorization, which are critical components of access control, are managed through AWS Identity and Access Management (IAM), AWS IAM Identity Center, and AWS Organizations. AWS provides you with capabilities to manage permissions at scale across identities, resources, and services, including mandating the use of multi-factor authentication (MFA) for logins. Furthermore, these capabilities support customers adhering to the principle of least privilege by encouraging time-bound, session-based credential management by using AWS Security Token Service (AWS STS).
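
One common way to mandate MFA is an IAM policy statement that denies most actions when the request was not authenticated with MFA. The Python sketch below only builds such a statement as a dictionary; the statement ID and the exact list of exempted actions (those a user needs in order to register an MFA device in the first place) are assumptions for this example and should be adapted to your own environment.

```python
import json

# Actions a user still needs before MFA is set up (an illustrative selection).
MFA_SETUP_ACTIONS = [
    "iam:CreateVirtualMFADevice",
    "iam:EnableMFADevice",
    "iam:GetUser",
    "iam:ListMFADevices",
    "iam:ListVirtualMFADevices",
    "iam:ResyncMFADevice",
    "sts:GetSessionToken",
]


def deny_without_mfa_policy() -> dict:
    """Build a policy that denies everything except MFA setup when MFA is absent."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAllExceptListedIfNoMFA",
                "Effect": "Deny",
                "NotAction": MFA_SETUP_ACTIONS,
                "Resource": "*",
                # BoolIfExists also matches requests where the key is missing.
                "Condition": {
                    "BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_without_mfa_policy(), indent=2))
```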

Software running in the cloud that needs to call cloud APIs receives its temporary and frequently rotated credentials automatically through IAM roles for Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Lambda, helping to eliminate the need for long-term credentials that can leak or be compromised. Access to cloud APIs from on-premises software can be safely boot-strapped from enterprise identity management technologies by using AWS IAM Roles Anywhere. You can even protect authentication to non-cloud technologies with a combination of roles and the use of AWS Secrets Manager to protect and automatically rotate secrets such as database passwords.

Network and traffic management

The guidance in this category emphasizes segmenting workloads and networks to limit the potential for lateral movement and exposure to the internet, monitoring and regulating traffic flows by using policies, and securing unused ports.

You can achieve network micro-segmentation, a critical aspect of modern security architecture, through VPCs and subnets. You can, for example, segregate internet-facing components of your application from those that don’t require such access by placing them in separate VPCs and enabling internet access only on the VPC that requires it. You can control traffic flows within and between segments by using a variety of network services—routing tables, internet gateways, transit gateways, and firewall services, to name a few. This segmentation minimizes your risk from unauthorized activity that originates from the internet and limits the potential for lateral movement in the event of a breach.
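
Before creating VPCs and subnets, it can help to plan the address space so that segments stay distinct. The following sketch uses only the Python standard library to carve a VPC CIDR block into named subnets and to check whether two VPC ranges overlap. The CIDR ranges and segment names are hypothetical, and nothing here calls AWS.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, new_prefix: int, names: list[str]) -> dict[str, str]:
    """Assign consecutive subnets of the given prefix length to each name."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = network.subnets(new_prefix=new_prefix)
    plan = {}
    for name in names:
        try:
            plan[name] = str(next(subnets))
        except StopIteration:
            raise ValueError(f"{vpc_cidr} has no room for subnet {name!r}") from None
    return plan


def ranges_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Report whether two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


if __name__ == "__main__":
    # Hypothetical ranges: one VPC for internet-facing parts, one for internal parts.
    print(plan_subnets("10.0.0.0/16", 24, ["public-a", "public-b"]))
    print(plan_subnets("10.1.0.0/16", 24, ["private-a", "private-b"]))
    print(ranges_overlap("10.0.0.0/16", "10.1.0.0/16"))
```

Non-overlapping ranges matter if the segments are later connected in a controlled way, for example through a transit gateway.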

To implement the guidance regarding out-of-band management, you can architect your network connections to separate management traffic from network signaling traffic by using subnets—for example, a single EC2 instance can have multiple elastic network interfaces (ENIs) attached to different subnets or even different VPCs: one that permits only management traffic and another that permits only signaling traffic.

Strong cryptography to encrypt data at rest and in transit

The guidance in this category emphasizes using strong cipher suites, secure versions of encryption protocols, and PKI-based certificates to protect data at rest and in transit.

Encryption, a cornerstone of data protection, is comprehensively addressed in AWS offerings. API endpoints of AWS services support TLS 1.3 (and a minimum of TLS 1.2) with secure standards-based cipher suites, encryption keys, and advanced security features like HKDF (HMAC-based extract-and-expand key derivation function) for added security. AWS services that manage customer secrets sent over the wire also support post-quantum cryptography. For example, AWS Key Management Service (AWS KMS), AWS Certificate Manager, and AWS Secrets Manager support a hybrid post-quantum key exchange option for the TLS network encryption protocol. In its use of the Border Gateway Protocol (BGP), AWS uses Resource Public Key Infrastructure (RPKI) and Route Origin Authorization (ROA) to protect the Amazon IP address space and routes from misconfigurations and hijacking.
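
On the client side, you can mirror this minimum protocol version in your own code. The following is a small sketch using Python's standard `ssl` module that creates a client context which refuses anything older than TLS 1.2 while still negotiating TLS 1.3 where the server supports it. The function name is an assumption for this example.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Create a client TLS context that rejects protocol versions below TLS 1.2."""
    # The default context already verifies certificates and host names.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    ctx = strict_client_context()
    print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

A context like this can be passed to standard library clients such as `urllib.request.urlopen(url, context=ctx)`.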

You can also use AWS cryptographic services such as AWS KMS, AWS CloudHSM, and AWS Certificate Manager to help secure your data in transit and at rest. Keys that you create in AWS KMS are protected by FIPS 140-2 Level 3 validated hardware security modules (HSMs), and there is no mechanism for anyone, including AWS service operators, to view, access, or export plaintext key material.

AWS Secrets Manager helps you securely manage, retrieve, and rotate database credentials, application credentials, OAuth tokens, API keys, and other secrets throughout their lifecycles. For more details on AWS cryptography solutions and best practices, refer to Encryption best practices for AWS services.

Vulnerability management

This guidance emphasizes minimizing exploitation risks through proper lifecycle management, regular patching, and elimination of insecure protocols. AWS helps address these requirements through both shared responsibility and innovative architectural approaches.

Under the shared responsibility model, AWS manages the security of underlying infrastructure. This includes maintaining up-to-date systems and services, disabling insecure protocols and unused ports, and providing Security Bulletins for timely vulnerability notifications. AWS services are supported through contractually defined terms, so that you don’t need to worry about end-of-life infrastructure components.

For your applications, AWS enables a transformative approach to vulnerability management through ephemeral resources and immutable infrastructure. Instead of maintaining long-lived instances that require continuous patching, you can maintain a single, hardened, and frequently updated Amazon Machine Image (AMI) as your golden image. When updates are needed, rather than patching running instances, you simply deploy new instances with your application code installed from an updated AMI. Similar approaches also apply to container-based workloads. Workloads based on AWS Lambda reduce your patching responsibility even further, because only the code that contains your business logic (and any supporting layers you have chosen) needs to be updated—AWS patches the underlying hypervisors, operating systems, and containers automatically. This approach enables you to keep your systems in a known, secure state while reducing both the threat surface and operational overhead. You can further enhance security by using AWS networking features like security groups to disable insecure protocols, such as enforcing HTTPS rather than HTTP.
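
As a sketch of how you might audit for plain HTTP exposure, the following Python function inspects a list of inbound rules and reports whether any of them admits TCP port 80. It assumes the rules are shaped like the `IpPermissions` entries returned when describing a security group (`IpProtocol`, `FromPort`, `ToPort`); the function name is hypothetical and the example does not call AWS.

```python
def allows_plain_http(ip_permissions: list[dict]) -> bool:
    """Return True if any inbound rule admits traffic on TCP port 80."""
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            return True  # "-1" means all protocols and all ports
        if protocol != "tcp":
            continue
        from_port = rule.get("FromPort")
        to_port = rule.get("ToPort")
        if from_port is None or to_port is None:
            continue
        # A port range such as 0-1024 also covers port 80.
        if from_port <= 80 <= to_port:
            return True
    return False


if __name__ == "__main__":
    https_only = [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443}]
    print(allows_plain_http(https_only))
```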

Conclusion

The comprehensive guidance from cybersecurity agencies provides a crucial framework for securing telecommunications infrastructure. As demonstrated throughout this post, AWS offers a robust set of native services and capabilities that align with the recommendations from CISA and allied governments. From enhanced visibility through logging and monitoring, to strong identity management, network segmentation, encryption, and vulnerability management, AWS provides the tools you need to implement these security controls effectively while maintaining operational efficiency. The shared responsibility model, combined with AWS continuous innovation in security, enables telecommunications companies to build and maintain resilient, secure cloud environments.

Visit our GitHub page for detailed information on implementing CISA guidance with AWS services.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Kal Krishnan

Kal is a telecom industry specialist with AWS Security. Since 2019, he has led a global program focused on helping AWS telecom customers achieve their security and compliance goals on their cloud journey. He has over 25 years of experience working on multiple generations of mobile network technologies. Before joining AWS, he was a Technical Fellow in the field of emergency calling and wireless location.
Danny Cortegaca

Danny is a Security Specialist Solutions Architect and is the Telco lead for AWS Industries. He joined AWS in 2021 and partners with some of the largest organizations in the world to help them navigate complex security and regulatory environments. He loves talking about application security with customers and has helped many adopt threat modeling into their practices.
Ruben Merz

Ruben is a Principal Solutions Architect in the AWS Industries Telecom Business Unit. He works with global telecom customers and regulated industries to help them transform with AWS.

Learning AWS best practices from Amazon Q in the Console

Post Syndicated from Brendan Jenkins original https://aws.amazon.com/blogs/devops/learning-aws-best-practices-from-amazon-q-in-the-console/

Operators, administrators, developers, and many other personas who use the AWS console run into common issues such as missing permissions or bugs in AWS Lambda code. To help alleviate this burden, AWS released Amazon Q to assist console users with these use cases. Amazon Q is AWS’s generative AI-powered assistant that helps write code, answer questions, generate content, solve problems, manage AWS resources, and take actions. A component of Amazon Q is Amazon Q Developer. One way to interact with the service is to chat with Amazon Q Developer in the AWS Management Console, the AWS Console Mobile Application, on AWS websites, AWS Documentation websites, and chat channels integrated with AWS Chatbot to learn about AWS services. You can ask Q Developer about best practices, recommendations, step-by-step instructions for AWS tasks, and architecting your AWS resources and workflows.

In this blog post, we will highlight best practices for interacting with Q Developer in the console including topics such as using Q Developer in the console to generate code snippets, architect workloads, and understand your costs.

Prerequisites

To follow along with these examples, the following prerequisites are required:

Overview

Here are some examples of how Amazon Q Developer in the console can be used:

Please note: Amazon Q in the console may generate different output than shown in the examples below due to its non-deterministic nature.

To start, access the Amazon Q Developer service by signing in to the AWS console and clicking the Amazon Q icon on the right-hand panel as shown below in Figure 1. Authenticate if necessary:

Figure 1: Accessing Amazon Q Chat

Use Q to learn about AWS services and best practices

In this section, we will look at how Amazon Q can help you learn about AWS services and outline the best practices for using those services.

Learn about services available in AWS

Whether you are just learning or are an experienced user, Amazon Q provides a simple way to discover AWS capabilities and get helpful information whenever needed.
For example, if you want to learn how to auto scale your compute instances based on a metric, you can ask Amazon Q in the console.

Sample prompt –
I need to set a autoscaling group for EC2 instances with this requirement, if
CPU utilization goes above 50% for 5 minutes then it should add new instance
and if CPU utilization drops below 50% for 5 minutes then it should delete 1
instance.

User entering prompt about setting up an auto scaling compute Instance based on a specific metric and threshold and Q generating a response.

Figure 2: User Prompt and Response from Amazon Q for Auto Scaling Compute Instance based on a specific metric and threshold

The response from Amazon Q lists the steps to set up an Auto Scaling group based on the requirements provided in the prompt.
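
To give a sense of what those steps amount to, here is a hedged Python sketch (not the output from Amazon Q) that assembles the parameters for the two scaling rules in the prompt: add one instance when average CPU is above 50% for 5 minutes, and remove one when it is below 50% for 5 minutes. The dictionaries follow the parameter names of the Auto Scaling `PutScalingPolicy` and CloudWatch `PutMetricAlarm` APIs; the group name, the naming scheme, and the function itself are assumptions for this example, and nothing is sent to AWS.

```python
def scaling_rules(group_name: str, threshold: float = 50.0, minutes: int = 5) -> list[dict]:
    """Build policy and alarm parameters for one scale-out and one scale-in rule."""
    rules = []
    for direction, operator, adjustment in (
        ("scale-out", "GreaterThanThreshold", 1),
        ("scale-in", "LessThanThreshold", -1),
    ):
        rules.append({
            "policy": {
                "AutoScalingGroupName": group_name,
                "PolicyName": f"{group_name}-{direction}",
                "PolicyType": "SimpleScaling",
                "AdjustmentType": "ChangeInCapacity",
                "ScalingAdjustment": adjustment,  # +1 adds, -1 removes an instance
            },
            "alarm": {
                "AlarmName": f"{group_name}-cpu-{direction}",
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group_name}],
                "Statistic": "Average",
                "Period": minutes * 60,  # one evaluation window, in seconds
                "EvaluationPeriods": 1,
                "Threshold": threshold,
                "ComparisonOperator": operator,
            },
        })
    return rules


if __name__ == "__main__":
    # "demo-asg" is a hypothetical Auto Scaling group name.
    for rule in scaling_rules("demo-asg"):
        print(rule["alarm"]["AlarmName"], rule["policy"]["ScalingAdjustment"])
```

In a real setup, each alarm would also need the ARN of its scaling policy as an alarm action, which is only known after the policy has been created.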

Ask specific questions about AWS services

Amazon Q in the console can also help you run systems that deliver business value while keeping best practices in mind. With natural language prompts, you can learn the best practices for using AWS services.

Sample Prompt –

I am using API Gateway for REST APIs. I have configured timeout for requests. I would like to learn additional best practices to reduce and handle long running requests.

User entering prompt to Q about keeping compute costs low and Q generating a response.

Figure 3: User Prompt and Response from Amazon Q keeping compute costs low

As shown above in Figure 3, Amazon Q then summarizes various ways to optimize for long-running requests in Amazon API Gateway, for example, implementing timeouts and using asynchronous invocation for long-running operations where possible.

Use Q to generate code snippets or scripts to automate tasks using AWS SDK or AWS CLI

Developers or System Administrators can use Q to generate code snippets or scripts to automate tasks instead of spending time going through documentation.

How to write an AWS Lambda function to read data from S3

For example, a developer may want to write an AWS Lambda function that reads data from an Amazon S3 bucket but not know how to get started. Amazon Q can help.

User Prompt and Response from Amazon Q on instructions on how to write the lambda function

Figure 4: User Prompt and Response from Amazon Q on instructions to write a Lambda function

User Prompt and Response from Amazon Q on instructions on how to write the lambda function

Figure 5: Response from Amazon Q with sample code to write a Lambda function

As shown above in Figures 4 and 5, Amazon Q returns step-by-step instructions on how to write the Lambda function, including sample code for reference.
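
For readers who cannot see the screenshots, the following is a minimal sketch of what such a function could look like. It is our own illustration rather than the code Amazon Q produced. It assumes the function is triggered by an S3 event notification and that the function's role is allowed to read the object; the optional `s3_client` parameter is an addition for this example so the handler can be exercised locally with a stand-in client.

```python
import urllib.parse


def lambda_handler(event, context, s3_client=None):
    """Read the S3 object named in the triggering event and report its size."""
    if s3_client is None:
        import boto3  # imported lazily so the sketch can run without it locally

        s3_client = boto3.client("s3")

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])

    response = s3_client.get_object(Bucket=bucket, Key=key)
    body = response["Body"].read()
    return {"bucket": bucket, "key": key, "size_bytes": len(body)}
```

When deployed, Lambda calls the handler with only `event` and `context`, so the default client is created with `boto3.client("s3")`.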

How do I upload a file to an S3 bucket using the AWS CLI?

If a developer or IT professional uses the AWS CLI frequently but struggles to find the right commands to accomplish a task, Amazon Q can help.

Sample Prompt

How do I upload a file to an S3 bucket using the AWS CLI?

User Prompt and Response from Q with CLI command to upload a file to an S3 bucket

Figure 6: User Prompt and Response from Q with CLI command to upload file to S3 bucket

As shown above in Figure 6, Amazon Q returns the CLI command to upload a file to an S3 bucket. The response also suggests the command to verify the upload.

Use Console-to-Code to write code to automate use of other services

Console-to-Code records your console actions, then uses generative AI to suggest code in your preferred format. It currently supports CLI commands, CDK Java, CDK Python, CDK TypeScript, and CloudFormation JSON/YAML.

Let’s say a developer wants to generate a CloudFormation YAML template with an Amazon EC2 instance and an Amazon RDS database instance. For this, the developer can go to the Amazon EC2 and Amazon RDS consoles. On the right side, choose the Console-to-Code icon and choose Start Recording.

As shown in the figures below, once the Amazon EC2 instance and then the Amazon RDS DB instance are launched, stop the recording and download the CloudFormation YAML template.

Figure 7: Console-to-Code recording Amazon EC2 and Amazon RDS Database instance launch

Figure 8: Generate Infrastructure-As-Code from the Console-to-Code recording

Figure 9: Generated CloudFormation YAML template from the Console-to-Code recording

Use Q to architect a workload & optimize

In this section, we will focus on leveraging Amazon Q to architect new workloads and optimize existing workloads.

Architecting Workloads

Suppose a developer wants to build a new financial services application with the following requirements:

  • The architecture must be serverless and event driven.
  • It must provide functionality such as deposits and withdrawals.
  • It should use low-cost, efficient AWS services.
  • It should scale with traffic spikes that can reach thousands of requests per second.
  • It should support REST APIs.

Let’s prompt Amazon Q Developer by copying the following prompt and pasting it into the chat window:

Sample Prompt:

How can I build a new financial services application with the following requirements: it should be serverless and event driven, must provide functionality as deposits and withdrawals, It should use low-cost efficient AWS services, It should scale with traffic spikes that can reach thousands of requests per second, and lastly it should support REST APIs.

Amazon Q should provide a response on how to architect this workload. Here is a sample response received from Amazon Q:

Figure 10: Response from Amazon Q for architecture

Amazon Q was able to provide the following response in full:

________________________________________________

To build a new financial services application with the requirements you described, I would recommend the following AWS services and architecture:

  1. Serverless and Event-Driven Architecture:
    • Use AWS Lambda for the core application logic, which will allow you to run your code without provisioning or managing servers.
    • Integrate AWS Lambda with other serverless services like Amazon API Gateway for the REST API, Amazon DynamoDB for the database, and Amazon SNS/SQS for event-driven messaging.
  2. Deposits and Withdrawals Functionality:
    • Use Amazon DynamoDB as the database to store customer account information and transaction details.
    • Implement the deposit and withdrawal logic in AWS Lambda functions, which can be triggered by events from Amazon API Gateway or Amazon SNS/SQS.
  3. Low-Cost and Efficient AWS Services:
    • Leverage AWS Lambda, which charges only for the compute time you consume, and Amazon DynamoDB, which provides a pay-per-request pricing model.
    • Use Amazon API Gateway’s pay-per-call pricing to keep costs low for the REST API.
    • Consider using Amazon SQS and Amazon SNS for their low-cost, highly available, and scalable messaging capabilities.
  4. Scalability and High-Performance:
    • AWS Lambda and the other serverless services you choose will automatically scale to handle traffic spikes of thousands of requests per second.
    • You can further optimize performance by using features like Amazon API Gateway’s caching, Amazon DynamoDB’s auto-scaling, and Amazon SQS’s message batching.
  5. REST API Support:
    • Integrate your application logic in AWS Lambda with Amazon API Gateway to expose a RESTful API. Amazon API Gateway will handle the API management, including routing, authentication, and throttling.

To get started, I recommend reviewing the following AWS documentation:

Please let me know if you have any other questions!
Sources:
Key AWS services – Financial Services Industry Lens
Harnessing the scale of AWS for financial simulations | AWS HPC Blog
RPC vs REST – Difference Between API Architectures – AWS

________________________________________________

As shown above, Amazon Q provided service recommendations for our architecture based on the requirements in the prompt, along with links to the recommended services.

Optimizing workloads

Now, let’s say a developer or IT professional has an architecture that revolves around Amazon EC2, with an instance called Server-1-demo deployed in AWS, and wants to optimize it to help save on costs.

Similar to the previous section, open a new chat window within the Amazon Q chat in the console and enter the following prompt:

Sample Prompt:

Based on the current CPU utilization of my EC2 Server-1-demo, what do you recommend I do to cost optimize?

As a result, Q provides the following response:

Prompt to Q about the optimization of an EC2 instance based on CPU utilization and Response from Amazon Q

Figure 11: Optimization Response from Amazon Q

As shown, Amazon Q took the CPU utilization of Server-1-demo into account and recommended, among other things, moving to newer instance types such as those based on AWS Graviton, which is designed to deliver the best price performance for cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2).

Use Q to understand your costs

Another way to leverage Q in the console is to analyze costs. A developer or IT professional can use Amazon Q to retrieve and analyze cost data from AWS Cost Explorer, asking questions about AWS costs and receiving answers in natural language that reflect the actual costs of the AWS account.

Now, open a new Amazon Q chat window in the console and let’s try the following prompt as an example:

Sample Prompt:

Show me the breakdown of EC2 costs by instance type for the last 30 days.


Figure 12: Response from Amazon Q for the breakdown of EC2 costs in the last month

As shown above in Figure 12, Q Developer provides a detailed breakdown of EC2 costs by instance type for the last 30 days.

Now, try another example:

Sample Prompt:

What was my cost breakdown by service for the past three months?


Figure 13: Response from Amazon Q for last 3 months spend analysis

As shown in Figure 13, Amazon Q provides detailed cost breakdowns, including percentages of total spend, making it easy to understand your AWS usage and expenses. This feature allows you to quickly identify your highest-cost services and track spending trends over time. Always verify your cost data with AWS Cost Explorer for the most accurate information. For more information on this capability, check out this blog covering the feature in more detail.
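One way to cross-check an answer like the one above is to query the Cost Explorer API directly. The following is a minimal sketch, not the mechanism Amazon Q uses internally: the function names are hypothetical, the SERVICE filter value is an assumption about how EC2 compute appears in your account, and pagination through NextPageToken is omitted for brevity.

```python
from datetime import date, timedelta


def build_ec2_cost_query(today: date, days: int = 30) -> dict:
    """Build get_cost_and_usage parameters for EC2 cost grouped by instance type."""
    start = today - timedelta(days=days)
    return {
        # Cost Explorer treats the End date as exclusive
        "TimePeriod": {"Start": start.isoformat(), "End": today.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                # Assumed service name; confirm the value in your own account
                "Values": ["Amazon Elastic Compute Cloud - Compute"],
            }
        },
        "GroupBy": [{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
    }


def summarize_costs(response: dict) -> dict:
    """Sum the unblended cost per instance type across all returned periods."""
    totals = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            instance_type = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[instance_type] = totals.get(instance_type, 0.0) + amount
    return totals


def fetch_ec2_costs(today: date) -> dict:
    """Call Cost Explorer; needs boto3, credentials, and ce:GetCostAndUsage."""
    import boto3  # imported lazily so the helpers above work without it

    ce = boto3.client("ce")
    return summarize_costs(ce.get_cost_and_usage(**build_ec2_cost_query(today)))
```

Calling fetch_ec2_costs(date.today()) returns a dictionary of instance type to total cost, which you can compare against the figures that Amazon Q reports.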

Best practices for using Amazon Q in the console

The previous sections showcased examples of leveraging Amazon Q capabilities for AWS application architecture and account management. In both cases, the input given to Amazon Q directly affects its output quality. Your question should be concise, clear and contain the necessary details for the tool to understand the scenario and what should be answered. The recommended approach for providing effective input to a generative AI chat bot is called Prompt Engineering. By adhering to the following best practices, you can achieve improved outcomes when using Amazon Q:

  • Specify the task you want Q to do: explain a concept, compare services, list pros and cons, generate a CLI command, generate a code snippet, suggest architecture options for a scenario.
  • Provide context: Why do you need to know this concept? For which part of your application or architecture will you apply the knowledge?


          Figure 14: Amazon Q asks for additional details to better answer the question.

In this example, we asked for scenarios, which is what type of answer we want. We also specified the service we want scenarios about, and the edge case of the scenario (after instance creation). Amazon Q summarized our question and provided scenarios and sources in accordance with what we asked.

  • Break a series of questions into multiple questions.
  • Ask for one task at a time.
  • Don’t stop at the first answer; keep asking questions that use the information to help Q provide more enriched responses.


Figure 15: Amazon Q uses chat context to give an answer.

In this scenario, the provided input was not enough for Q to provide a good answer, so it asked for more details and considered both inputs as a context to the answer.

  • Be mindful of security-related questions about your account: Amazon Q in the console might not provide answers that address security in your account.
  • Your input can contain a maximum of 1,000 characters. This is another reason to be concise when providing input.
  • Create a new conversation if you are going to start a new topic. Unnecessary context will reduce the specificity of Q’s answers to your new situation.


Figure 16: Amazon Q does not provide security tips. Create a new conversation. Maximum allowed characters are 1000.

Conclusion

In this blog post, we explored the various ways in which Amazon Q, AWS’s generative AI-powered assistant, is used in the AWS Console to enhance productivity and reduce ramp-up time for developers, DevOps engineers, and architects. Amazon Q functions as an AWS consultant, offering advice on various tasks, such as understanding AWS services and implementing best practices, as well as generating code snippets and automating CLI commands. The tool’s capability to help architect new workloads and enhance existing ones based on specific needs was demonstrated with detailed examples. The importance of prompt engineering – crafting clear, concise prompts to elicit high-quality responses from the AI assistant – was also discussed. By embracing the capabilities of Amazon Q in the console, AWS users can streamline their workflows and speed up their cloud journey. Whether you’re a seasoned cloud architect or starting out, this AI-powered assistant can serve as a partner, helping you navigate the AWS landscape and unlock new levels of efficiency. As you continue exploring the possibilities of Amazon Q, follow the best practices outlined in this post and experiment!

About the authors

Brendan Jenkins

Brendan Jenkins is a Tech Lead Solutions Architect at Amazon Web Services (AWS) working with Enterprise AWS customers providing them with technical guidance and helping achieve their business goals. He has an area of specialization in DevOps and Machine Learning technology.

Renu Yadav

Renu Yadav is a Solutions Architect at Amazon Web Services (AWS), where she works with enterprise-level AWS customers, providing them with technical guidance and helping them achieve their business objectives. Renu has a strong passion for learning, with an area of specialization in DevOps. She leverages her expertise in this domain to assist AWS customers in optimizing their cloud infrastructure and streamlining their software development and deployment processes.

Maria Mendes

Maria Mendes is a Solutions Architect and has been part of the CSC SA team since 2022, working with Small and Medium-sized business customers. Maria’s daily work consists of architecture reviews, providing AWS services best practices guidance, executing workshops with customers, and participating in multi-customer AWS event speaking activities. She is a generalist solutions architect and is also part of a technical field community inside AWS that is focused on DevOps services.

Amazon Redshift enhances security by changing default behavior in 2025

Post Syndicated from Yanzhu Ji original https://aws.amazon.com/blogs/security/amazon-redshift-enhances-security-by-changing-default-behavior-in-2025/

Today, I’m thrilled to announce that Amazon Redshift, a widely used, fully managed, petabyte-scale data warehouse, is taking a significant step forward in strengthening the default security posture of our customers’ data warehouses. Some default security settings for newly created provisioned clusters, Amazon Redshift Serverless workgroups, and clusters restored from snapshots have changed. These changes include disabling public accessibility, enabling database encryption, and enforcing secure connections.

Amazon Redshift already supports encryption in transit and at rest. Database encryption is crucial because it helps safeguard sensitive data from unauthorized access. Furthermore, restricting public access can be advantageous because it limits the threat surface and helps prevent unauthorized access to the database. By confining the Amazon Redshift cluster within the customer’s virtual private cloud (VPC), the cluster is isolated from the public internet, which significantly reduces the possibility of unauthorized parties discovering and accessing the data warehouse.

Enforcing secure connections is another essential security measure. This enforces encryption of communication between the applications and the database, reducing the risk of eavesdropping and man-in-the-middle exploits, which helps protect the confidentiality and integrity of the data being transmitted.

By implementing additional security defaults for newly created provisioned clusters, Serverless workgroups, and clusters restored from snapshots, Amazon Redshift helps customers adhere to best practices in data security without requiring additional setup, reducing the risk of potential misconfigurations.

These new security enhancements include three key changes:

  1. Disabling public access by default – Public accessibility will be disabled by default for newly created or restored provisioned clusters. This means that newly created clusters will be accessible only within your VPC and not from the public internet. With this change, if you create a provisioned cluster from the AWS Management Console, the cluster is created with public access disabled by default. Specifically, the PubliclyAccessible parameter will be set to false by default. This change will also be reflected in the CreateCluster and RestoreFromClusterSnapshot API operations and the corresponding console, AWS CLI, and AWS CloudFormation operations. By default, connections to clusters will only be permitted from client applications within the same VPC. To access your data warehouse from applications in another VPC, you have to configure cross-VPC access.

    If you still need public access, you must explicitly override the default and set the PubliclyAccessible parameter to true when you run the CreateCluster or RestoreFromClusterSnapshot API operations. With a publicly accessible cluster, we recommend that you always use security groups or network access control lists (network ACLs) to restrict access.

  2. Enabling encryption by default – With this change, the ability to create unencrypted clusters will no longer be available in the Amazon Redshift console. When you use the console, CLI, API, or CloudFormation to create a provisioned cluster without specifying an AWS Key Management Service (AWS KMS) key, the cluster will automatically be encrypted with an AWS-owned key. The AWS-owned key is managed by AWS.

    This update might impact you if you are creating unencrypted clusters by using automated scripts or using data sharing with unencrypted clusters. If you regularly create new unencrypted consumer clusters and use them for data sharing, review your configurations to verify that the producer and consumer clusters are both encrypted to reduce the chance that you will experience disruptions in your data-sharing workloads.

  3. Enforcing secure connections by default – With this change, a new default parameter group named default.redshift-2.0 will be introduced for newly created or restored clusters, with the require_ssl parameter set to true by default. New clusters created without a specified parameter group will automatically use the default.redshift-2.0 parameter group. When you create a cluster through the console, the new default.redshift-2.0 parameter group will be automatically selected. This change will also be reflected in the CreateCluster and RestoreFromClusterSnapshot API operations, as well as in the corresponding console, AWS CLI, and AWS CloudFormation operations.

    For customers who are using existing or custom parameter groups, the service will continue to honor the require_ssl value specified in your parameter group. However, we recommend that you update the require_ssl parameter to true in order to enhance the security of your connections. You continue to have the option to change the require_ssl value in your custom parameter groups as needed. You can follow the procedure in this topic in the Amazon Redshift Management Guide to configure security options for connections.
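Scripts that create clusters can state these three settings explicitly rather than relying on defaults, which keeps their behavior the same before and after the change. The following is a minimal sketch of CreateCluster parameters for boto3. The function name, node type, and cluster size are illustrative assumptions, and the parameter group name is taken from the description above.

```python
def build_create_cluster_params(cluster_id, admin_user, kms_key_id=None,
                                publicly_accessible=False,
                                parameter_group="default.redshift-2.0"):
    """Return CreateCluster keyword arguments with explicit security settings."""
    params = {
        "ClusterIdentifier": cluster_id,
        "NodeType": "ra3.xlplus",       # illustrative node type
        "ClusterType": "multi-node",
        "NumberOfNodes": 2,             # illustrative cluster size
        "MasterUsername": admin_user,
        "ManageMasterPassword": True,   # let AWS Secrets Manager hold the password
        "PubliclyAccessible": publicly_accessible,
        "Encrypted": True,
        "ClusterParameterGroupName": parameter_group,
    }
    if kms_key_id:
        # Without a key, the cluster is encrypted with an AWS-owned key
        params["KmsKeyId"] = kms_key_id
    return params


def create_cluster(cluster_id, admin_user, **options):
    """Create the cluster; needs boto3, credentials, and redshift:CreateCluster."""
    import boto3  # imported lazily so the builder above works without it

    redshift = boto3.client("redshift")
    return redshift.create_cluster(
        **build_create_cluster_params(cluster_id, admin_user, **options)
    )
```

If a workload still needs public access, pass publicly_accessible=True deliberately and restrict access with security groups or network ACLs.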

We recommend that all Amazon Redshift customers review their current configurations for this service and consider implementing the new security measures across their applications. These security enhancements could impact existing workflows that rely on public access, unencrypted clusters, or non-SSL connections. We recommend that you review and update your configurations, scripts, and tools to align with these new defaults.
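As a starting point for that review, you could scan the output of the DescribeClusters and DescribeClusterParameters API operations. The sketch below works on the response dictionaries that boto3 returns. The function names are hypothetical, and pagination through Marker is left out.

```python
def find_clusters_to_review(describe_clusters_response):
    """Map cluster identifiers to the settings that differ from the new defaults."""
    findings = {}
    for cluster in describe_clusters_response.get("Clusters", []):
        issues = []
        if cluster.get("PubliclyAccessible"):
            issues.append("publicly accessible")
        if not cluster.get("Encrypted"):
            issues.append("not encrypted")
        if issues:
            findings[cluster["ClusterIdentifier"]] = issues
    return findings


def parameter_group_requires_ssl(describe_cluster_parameters_response):
    """Return True if the require_ssl parameter is set to true."""
    for param in describe_cluster_parameters_response.get("Parameters", []):
        if param.get("ParameterName") == "require_ssl":
            return str(param.get("ParameterValue", "")).lower() == "true"
    return False


def review_account(region):
    """Fetch live data; needs boto3, credentials, and redshift:Describe* access."""
    import boto3  # imported lazily so the helpers above work without it

    redshift = boto3.client("redshift", region_name=region)
    return find_clusters_to_review(redshift.describe_clusters())
```

The parameter group names attached to each cluster appear under ClusterParameterGroups in the DescribeClusters output, and each can be passed to describe_cluster_parameters to check require_ssl.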

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Yanzhu Ji

Yanzhu Ji is a Senior Product Manager on the Amazon Redshift team. She has extensive experience in database security and developing product vision and strategy for industry-leading data products and platforms. She excels at building robust software products using web development, system design, database, and distributed programming techniques.

Testing and evaluating GuardDuty detections

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/testing-and-evaluating-guardduty-detections/

Amazon GuardDuty is a threat detection service that continuously monitors, analyzes, and processes Amazon Web Services (AWS) data sources and logs in your AWS environment. GuardDuty uses threat intelligence feeds, such as lists of malicious IP addresses and domains, file hashes, and machine learning (ML) models to identify suspicious and potentially malicious activity in your AWS environment. When GuardDuty identifies a potential security issue, it creates a GuardDuty finding that gives you information about what the potential security issue is, the resources involved, and contextualized information that’s key to remediating the issue. GuardDuty helps you monitor for the latest threats by continually expanding threat detection to emerging and common threats.

Whether you’re new to GuardDuty or are a long-time user, it’s recommended that you understand the different GuardDuty finding types and finding details and practice responding to them as suggested in the security pillar of AWS Well-Architected.

In this blog post, I dive deep into an open source tool for testing GuardDuty findings and then walk through three examples of how you can use this tool to test and improve your response to GuardDuty findings.

Overview

If you want to learn more about GuardDuty, you can read about the finding types in this AWS documentation. However, customers often want realistic findings in their environment to understand what a finding looks like and to practice responding hands on. While you can use GuardDuty to create sample findings in your environment, these findings are approximations populated with placeholder values and look different from real findings. Additionally, you cannot practice remediation with these findings because they’re not tied to actual resources in your account. This can be helpful if you only want to see what details are in a finding, but if you want to practice a real-world scenario, these sample findings might not be adequate.

To address this use case and provide customers with a secure and reliable way to test the threat detection capabilities of GuardDuty, the GuardDuty service team launched an open source project called GuardDuty Tester. The GuardDuty Tester creates infrastructure in your environment to simulate different security issues so that you can test GuardDuty findings that mirror actual security issues that you might encounter, such as crypto mining or a reverse shell being created on an Amazon Elastic Compute Cloud (Amazon EC2) instance. The GuardDuty Tester was originally released in 2018 as an AWS CloudFormation template and was focused more on testing investigation workflows than on a wide range of finding types. AWS has since released an updated version that uses the AWS Cloud Development Kit (AWS CDK) to make the infrastructure code easier to read and expanded the test coverage to over 100 unique finding types and resource combinations.

The ability to create findings across different resource types such as Amazon EC2, Amazon Simple Storage Service (Amazon S3), and Amazon Elastic Kubernetes Service (Amazon EKS) is a valuable resource for your security team, allowing them to simulate various types of threats with isolated infrastructure so that you don’t need to compromise your deployed workloads to improve response actions and techniques. Remember that the GuardDuty Tester doesn’t cover every possible scenario, but is instead focused on threat intelligence and rules-based findings. Anomaly-based findings, which require learning about how you operate your environment, aren’t included in the GuardDuty Tester.

Getting started with the GuardDuty Tester

The GuardDuty Tester is deployed by using the AWS CDK to create the required infrastructure and scripts to generate the GuardDuty findings. For safety, AWS recommends that you deploy the GuardDuty Tester in a nonproduction environment in an account that’s used specifically for this purpose. This way, your security team can differentiate between test GuardDuty findings and findings for other workloads that they’re monitoring.

In this post, I won’t walk through configuring the GuardDuty Tester because this is already documented in the GuardDuty documentation. Instead, I will go over what you need to know about the GuardDuty Tester and some of the benefits.

Figure 1 shows the GuardDuty Tester architecture, which includes the resources necessary to create GuardDuty findings for various protection plans such as Amazon S3 buckets, Amazon EC2 instances, and an Amazon EKS cluster. The tester also deploys a dedicated GuardDuty Tester instance where you will run the scripts needed to create the GuardDuty findings.

Figure 1: GuardDuty Tester architecture


The GuardDuty Tester provides key features including:

  • A wide range of threat scenario simulations: The GuardDuty Tester can create findings for resources including Amazon S3, AWS Identity and Access Management (IAM), Amazon Elastic Container Service (Amazon ECS) for both Amazon EC2 and AWS Fargate hosted workloads, Amazon EKS, and AWS Lambda, and it covers over 105 threat scenarios. This includes GuardDuty runtime monitoring as well as other GuardDuty protection plans.
  • Access through AWS Systems Manager: The GuardDuty Tester provides secure access through Systems Manager only, which minimizes the ports that are open to the internet.
  • Modular scripts: With an expanded library of tests available, the GuardDuty Tester accepts user parameters to set the scope of the tests to run, which gives you greater flexibility for different testing scenarios.

Setting up the GuardDuty Tester environment is straightforward and requires only a few commands. As outlined in the documentation and the README file in the repository, there are a number of prerequisites to set up the stack. These prerequisites include Python 3+, git, the AWS Command Line Interface (AWS CLI), the AWS Systems Manager Session Manager plugin, npm, Docker, and a subscription to the Kali Linux image for Amazon EC2. You will have to subscribe to the Kali Linux image in AWS Marketplace, but you will be charged for the instance only while the GuardDuty Tester is deployed. After these prerequisites are met, you can clone the repository, install the packages, and deploy the GuardDuty Tester to your AWS account.

Deploying the GuardDuty Tester can take 20–30 minutes, but if you’re following along with this post, I assume that you have deployed the GuardDuty Tester into your environment and have started your Systems Manager session as stated on Part A of Step 3 – Run tester scripts in the GuardDuty documentation. Now, I will dive into the first testing example.

Manual investigation

The first test use case is about getting familiar with what GuardDuty findings look like and the details that a finding gives you. This might be one of your first steps after turning on GuardDuty, or this might be an activity that you perform to help new team members understand GuardDuty findings.

To start a manual investigation:

  1. Run the following command in your Systems Manager session to view the GuardDuty Tester options.
    python3 guardduty_tester.py --help
  2. Run the following command in your Systems Manager session to create the first test finding.
    python3 guardduty_tester.py --ec2 --runtime only --tactics impact
  3. Before creating the findings, the GuardDuty Tester prompts you to confirm that it’s allowed to change GuardDuty settings in the environment. For example, if you’ve chosen to create findings related to the GuardDuty runtime monitoring feature but don’t have this feature enabled, the GuardDuty Tester will enable it for the tests and then disable it after testing is complete.

    Note: This will start the 30-day trial of the enabled features in this account, in this AWS Region, even if the feature is disabled after testing is complete. More information about GuardDuty pricing and free trials can be found on the GuardDuty pricing page.

  4. After choosing y which indicates “yes”, the GuardDuty Tester reports the number of domain reputation findings it’s expecting. Figure 2 shows an example of the expected findings. You can learn more about domain reputation findings in the GuardDuty finding documentation.

    Figure 2: Generated GuardDuty findings in the console


  5. After the GuardDuty Tester is finished, wait a few minutes and then go to the AWS Management Console for GuardDuty to see the findings. In this example, there are four new GuardDuty findings as expected from step 4 and shown in Figure 3. With the findings generated, you can start your manual investigation.

    Figure 3: GuardDuty finding details


In the preceding figure, you can see some of the finding details presented—such as the action type and the process information—that can help you quickly identify what trigger started the suspicious communication. From here, I encourage you to use this finding to practice your runbooks for investigation and response. For example, you might start with validating and triaging the finding before moving into evidence collection and remediation. If you don’t have incident response runbooks already built, you can use this finding as an example to get started. There are multiple open source examples such as AWS incident response playbooks and AWS customer response playbook. A runbook will help your team evaluate the information provided in the GuardDuty finding and understand what else they need to know about your specific environment to properly respond to the finding. For example, in the finding, you will have resource and actor information but not things such as who is the account owner or point of contact for security for that account.
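For triage practice outside the console, the same findings can be pulled with the GuardDuty API. The following sketch is one possible approach and not part of the GuardDuty Tester. The function names are hypothetical, only the first detector in the Region is used, and only the first page of up to 50 findings is read.

```python
def summarize_findings(findings):
    """Reduce GetFindings output to a few triage fields, highest severity first."""
    rows = [
        {
            "type": finding.get("Type"),
            "severity": finding.get("Severity"),
            "resource_type": finding.get("Resource", {}).get("ResourceType"),
            "title": finding.get("Title"),
        }
        for finding in findings
    ]
    return sorted(rows, key=lambda row: row["severity"] or 0, reverse=True)


def fetch_findings(region):
    """Read findings; needs boto3, credentials, and guardduty:List*/Get* access."""
    import boto3  # imported lazily so summarize_findings works without it

    guardduty = boto3.client("guardduty", region_name=region)
    detector_ids = guardduty.list_detectors()["DetectorIds"]
    if not detector_ids:
        return []
    detector_id = detector_ids[0]
    finding_ids = guardduty.list_findings(
        DetectorId=detector_id, MaxResults=50
    )["FindingIds"]
    if not finding_ids:
        return []
    details = guardduty.get_findings(DetectorId=detector_id, FindingIds=finding_ids)
    return summarize_findings(details["Findings"])
```

A summary like this can feed the validation and triage step of a runbook before you move on to evidence collection.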

Creating alerts

The next use case highlights how to create alerts based on GuardDuty findings. When setting up alerting automation with tools such as Amazon Simple Notification Service (Amazon SNS) and Slack, you should create a finding using the GuardDuty Tester to test that you’ve configured your alert correctly. See Creating custom responses to GuardDuty findings for information about creating alerts with either of these tools. Figure 4 shows a sample EventBridge rule that will send GuardDuty findings to SNS.

Figure 4: EventBridge rule to send GuardDuty findings to SNS


For this post, I assume that you’ve already configured an Amazon EventBridge rule and Amazon SNS alert.
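If you have not set up the rule yet, the following sketch shows one way to create it with boto3. It is not necessarily the rule shown in Figure 4: the function names, target ID, and optional severity threshold are assumptions for illustration, and the SNS topic's access policy must separately allow events.amazonaws.com to publish.

```python
import json


def guardduty_finding_pattern(min_severity=None):
    """Build an EventBridge event pattern that matches GuardDuty findings."""
    pattern = {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
    }
    if min_severity is not None:
        # Numeric matching limits alerts to findings at or above the threshold
        pattern["detail"] = {"severity": [{"numeric": [">=", min_severity]}]}
    return pattern


def create_sns_alert_rule(rule_name, topic_arn, min_severity=None):
    """Create the rule and target; needs boto3, credentials, and events:Put* access."""
    import boto3  # imported lazily so the pattern builder works without it

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(guardduty_finding_pattern(min_severity)),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "sns-alert", "Arn": topic_arn}],
    )
```

Leaving min_severity unset forwards every finding, which is useful while testing because the GuardDuty Tester generates findings across several severity levels.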

To test alerts:

  1. Run the following command in your Systems Manager session to create a privileged container finding.
    python3 guardduty_tester.py --finding 'PrivilegeEscalation:Kubernetes/PrivilegedContainer'
  2. Shortly after creating this finding, you should see an SNS alert based on the finding type.

Figure 5: SNS notification from a GuardDuty finding


If you’ve configured the alert correctly, you will see an email similar to Figure 5. The email demonstrates that SNS notifications were successfully configured and tested using the GuardDuty Tester. If this is a new finding, you will receive this SNS notification shortly after the GuardDuty Tester generates the finding, but if this is an updated finding, then the timing will be based on the notification frequency configured in the account.

There are many ways that customers consume GuardDuty findings in their environments. Whether you’re using Amazon SNS or another mechanism such as a chat application, ticketing system, or a security information and event management (SIEM) solution, you can use this example of an EventBridge rule and the GuardDuty Tester to test out your notification pipeline.

Automated response

For the third use case, I show you how to create an automated action based on a GuardDuty finding. In this example, I create a finding based on an EC2 instance connecting to a Bitcoin mining domain. Then, based on this finding, I use Lambda to tag the instance to assist with identification during the investigation steps that follow. Although this is a simple example, it shows you what you can do by combining EventBridge rules and Lambda functions. If you want to create an automated response for GuardDuty runtime monitoring findings that requires making a host-level modification, you can use EventBridge rules with AWS Systems Manager Run Command to run commands locally on a host to remediate a security issue.

Start by creating a Lambda function that will take a GuardDuty event delivered by EventBridge, pull out the instance ID information, and then use that as a parameter in the create_tags API call. See the following example code.

import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # Initialize up front so the error handler can log even if extraction fails
    instance_id = None
    try:
        # Extract the necessary information from the GuardDuty finding
        instance_id = event['detail']['resource']['instanceDetails']['instanceId']
        account_id = event['detail']['accountId']
        region = event['detail']['region']

        # Create an EC2 client
        ec2 = boto3.client('ec2', region_name=region)

        # Add the "infected" and "cryptomining" tag value pair to the instance
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[
                {
                    'Key': 'infected',
                    'Value': 'cryptomining'
                }
            ]
        )

        logger.info(f"Tagged instance {instance_id} with 'infected=cryptomining' in account {account_id} and region {region}")
        return {
            'statusCode': 200,
            'body': 'Instance tagged successfully'
        }
    except Exception as e:
        logger.error(f"Error tagging instance {instance_id}: {str(e)}")
        return {
            'statusCode': 500,
            'body': f"Error tagging instance: {str(e)}"
        }
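The handler above assumes that every event carries EC2 instance details. If the rule ever forwards a finding for another resource type, that lookup raises a KeyError. A more defensive extraction could look like the following sketch, where extract_instance_id is a hypothetical helper and not part of the original function.

```python
def extract_instance_id(event):
    """Return the instance ID from a GuardDuty event, or None if it has none."""
    resource = event.get("detail", {}).get("resource", {})
    return resource.get("instanceDetails", {}).get("instanceId")
```

The handler could then return early, without calling create_tags, when the helper returns None.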

Next, I create an EventBridge rule specific to the Bitcoin mining finding that I want to test, shown in Figure 6. The target is the Lambda function that I just created.
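A rule like this one can also be expressed in code. The sketch below builds an event pattern that filters on the finding type and wires it to a Lambda function. The function names, target ID, and statement ID are hypothetical, and the local matches helper is a simplified check for this one pattern shape, not a reimplementation of EventBridge matching.

```python
import json

CRYPTO_FINDING_TYPE = "CryptoCurrency:EC2/BitcoinTool.B!DNS"


def finding_type_pattern(finding_types):
    """Build an event pattern that matches only the given GuardDuty finding types."""
    return {
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
        "detail": {"type": list(finding_types)},
    }


def matches(pattern, event):
    """Simplified local check of an event against the pattern shape built above."""
    if event.get("source") not in pattern["source"]:
        return False
    if event.get("detail-type") not in pattern["detail-type"]:
        return False
    return event.get("detail", {}).get("type") in pattern["detail"]["type"]


def create_lambda_rule(rule_name, function_arn):
    """Create the rule, target, and invoke permission; needs boto3 and credentials."""
    import boto3  # imported lazily so the helpers above work without it

    events = boto3.client("events")
    rule_arn = events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(finding_type_pattern([CRYPTO_FINDING_TYPE])),
        State="ENABLED",
    )["RuleArn"]
    events.put_targets(Rule=rule_name, Targets=[{"Id": "tag-instance", "Arn": function_arn}])
    # EventBridge needs permission to invoke the function
    boto3.client("lambda").add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
```

Filtering on the finding type keeps the tagging function from running on unrelated findings.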

Figure 6: EventBridge rule for crypto mining GuardDuty finding


Now that the EventBridge rule is in place with the Lambda function as the target, I can use the GuardDuty Tester to trigger a Bitcoin mining finding and test my solution with the following command.

python3 guardduty_tester.py --finding 'CryptoCurrency:EC2/BitcoinTool.B!DNS'

After the finding is generated, I go to my EC2 instance, where there’s a new instance tag with a key of infected and a value of cryptomining, shown in Figure 7.

Figure 7: Updated tags after automated response


Although this is a general example, you can use the same approach across various actions that you might take in response to a GuardDuty finding and then test them using the GuardDuty Tester. Examples include using Lambda to add logic in AWS WAF, a network access control list (network ACL), or AWS Network Firewall to block suspicious traffic, or using Systems Manager Run Command to end a malicious process that’s running on a host.
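As an illustration of the Run Command option, the following sketch builds a SendCommand request that ends a process on a Linux instance by using the AWS-RunShellScript document. The function names are hypothetical. The process ID is assumed to have been read from the finding's process details beforehand, and the instance must be managed by Systems Manager.

```python
def build_kill_process_command(instance_id, pid):
    """Return SendCommand keyword arguments that end one process by PID."""
    # Refuse anything that is not a plain PID above 1 to avoid shell injection
    # and to avoid signalling the init process
    if not isinstance(pid, int) or pid <= 1:
        raise ValueError(f"refusing to build a kill command for pid {pid!r}")
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": [f"kill -9 {pid}"]},
        "Comment": "Terminate process flagged by GuardDuty",
    }


def end_process(instance_id, pid, region):
    """Send the command; needs boto3, credentials, and ssm:SendCommand."""
    import boto3  # imported lazily so the builder above works without it

    ssm = boto3.client("ssm", region_name=region)
    return ssm.send_command(**build_kill_process_command(instance_id, pid))
```

Because ending a process is disruptive, a test account with findings generated by the GuardDuty Tester is a safer place to try this than a production workload.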

Conclusion

The updated GuardDuty Tester represents a significant advancement in helping organizations validate and gain confidence in GuardDuty threat detection. The GuardDuty Tester now provides more comprehensive coverage of GuardDuty runtime monitoring and protection plans across various AWS services.

By using the GuardDuty Tester and following the use cases in this post, you can proactively assess your threat detection readiness, identify potential gaps, and implement necessary measures to help you fortify your AWS environments against evolving cyber threats.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Marshall Jones

Marshall is a Worldwide Security Specialist Solutions Architect at AWS. His background is in AWS consulting and security architecture, with a focus on a variety of security domains including edge, threat detection, and compliance. Today, he’s focused on helping enterprise AWS customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

Announcing upcoming changes to the AWS Security Token Service global endpoint

Post Syndicated from Palak Arora original https://aws.amazon.com/blogs/security/announcing-upcoming-changes-to-the-aws-security-token-service-global-endpoint/

AWS launched AWS Security Token Service (AWS STS) in August 2011 with a single global endpoint (https://sts.amazonaws.com), hosted in the US East (N. Virginia) AWS Region. To reduce dependency on a single Region, STS launched AWS STS Regional endpoints (https://sts.{Region_identifier}.{partition_domain}) in February 2015. These Regional endpoints allow you to use STS in the same Region as your workloads, which improves both performance and reliability.

However, many customers and third-party tools continue to call the STS global endpoint, and as a result, these customers don’t get the benefits of STS Regional endpoints. To help improve the resiliency and performance of your applications, we are making changes to the STS global endpoint, with no action required from customers. These changes will be released in the coming weeks.

In this blog post, we discuss the upcoming changes to the STS global endpoint and their benefits, and provide our recommendation on which STS endpoint to use going forward.

Upcoming changes to the STS global endpoint

The changes being made to the STS global endpoint will help enhance resiliency and improve performance. Today, all the requests to the STS global endpoint are served by the US East (N. Virginia) Region. Starting in early 2025, requests to the STS global endpoint will be automatically served in the same Region as your AWS deployed workloads. For example, if your application calls sts.amazonaws.com from the US West (Oregon) Region, your calls will be served locally in the US West (Oregon) Region instead of being served by the US East (N. Virginia) Region.

With this change, requests to the STS global endpoint will be served locally if your request originated from AWS Regions that are enabled by default (see footnote 1). However, requests to the STS global endpoint will continue to be served in the US East (N. Virginia) Region if your request originated from opt-in Regions or if you used STS from outside AWS, such as in your on-premises network or data centers.

We will gradually roll out this change to AWS Regions that are enabled by default by mid-2025, starting with the Europe (Stockholm) Region.

We’ve taken the following measures to help avoid disruptions to your existing processes:

  • AWS CloudTrail logs for requests made to the STS global endpoints will be sent to the US East (N. Virginia) Region. CloudTrail logs for requests handled by STS Regional endpoints will continue to be logged to their respective Region in CloudTrail, even if the requests are served locally.
  • CloudTrail logs for operations performed by the STS global and Regional endpoints will have the additional fields endpointType and awsServingRegion to clarify which endpoint and Region served the request.
  • Requests made to the sts.amazonaws.com endpoints will have a value of us-east-1 for the aws:RequestedRegion condition key, regardless of which Region served the request.
  • Requests handled by the sts.amazonaws.com endpoints will not share a request quota with the Regional STS endpoints.

1. In addition, for your requests to be served locally, your DNS request for sts.amazonaws.com must be handled by an Amazon DNS Server in Amazon Virtual Private Cloud (Amazon VPC).
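
The routing rules described above can be summarized as a small decision function. This is only a sketch of the behavior as stated in this post, not an AWS API: the function name, its parameters, and the example set of Regions are hypothetical, and the real list of Regions enabled by default should come from your own account.

```python
from typing import Optional, Set

GLOBAL_ENDPOINT_HOME = "us-east-1"  # US East (N. Virginia)


def expected_serving_region(
    origin_region: Optional[str],
    default_enabled_regions: Set[str],
    uses_amazon_dns_in_vpc: bool = True,
) -> str:
    """Return the Region expected to serve a call to sts.amazonaws.com.

    origin_region is None when the caller is outside AWS
    (for example, an on-premises network).
    """
    if origin_region is None:
        return GLOBAL_ENDPOINT_HOME  # calls from outside AWS stay in us-east-1
    if origin_region not in default_enabled_regions:
        return GLOBAL_ENDPOINT_HOME  # opt-in Regions stay in us-east-1
    if not uses_amazon_dns_in_vpc:
        return GLOBAL_ENDPOINT_HOME  # footnote 1: DNS must be resolved by Amazon DNS in a VPC
    return origin_region  # served locally


# Illustrative subset only; supply the Regions enabled by default in your account.
enabled_by_default = {"us-west-2", "eu-north-1"}
print(expected_serving_region("us-west-2", enabled_by_default))
print(expected_serving_region("ap-east-1", enabled_by_default))
```

The second call models an opt-in Region (Asia Pacific (Hong Kong)), so the sketch falls back to US East (N. Virginia).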

Our recommendation

We continue to recommend that you use the appropriate STS Regional endpoints whenever possible. If you’re using STS from outside AWS, such as in your on-premises networks or data centers, we recommend you use the STS Regional endpoint that is hosted in the same Region as the AWS resource that you need STS credentials to access. If you’re building in opt-in Regions such as Asia Pacific (Hong Kong) or Asia Pacific (Jakarta), we recommend that you use the STS endpoint from the opt-in Region that is hosting your workload. By following the steps in the blog post How to use Regional AWS STS endpoints, you can identify workloads that are still using the global STS endpoint and get insights into how to reconfigure them when required.
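
As a minimal sketch of what "use the Regional endpoint" means in practice, the helper below builds the Regional STS endpoint URL for a given Region. The function name and the Region-format check are hypothetical, and the URL pattern assumes the standard aws partition (the China Regions use a different domain suffix, so the sketch rejects them).

```python
import re

# Loose shape check for Region codes such as us-west-2 or ap-southeast-3.
_REGION_PATTERN = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d+$")


def regional_sts_endpoint(region: str) -> str:
    """Build the Regional STS endpoint URL (standard aws partition only)."""
    if not _REGION_PATTERN.match(region):
        raise ValueError(f"not a Region code: {region!r}")
    if region.startswith("cn-"):
        raise ValueError("China Regions use a different endpoint domain")
    return f"https://sts.{region}.amazonaws.com"


# Asia Pacific (Hong Kong) and Asia Pacific (Jakarta) are opt-in Regions.
for code in ("ap-east-1", "ap-southeast-3"):
    print(regional_sts_endpoint(code))
```

An SDK client can then be pointed at this URL explicitly, or the SDK can be configured to prefer Regional endpoints, depending on the SDK and version you use.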

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Palak Arora

Palak is a Senior Product Manager at AWS Identity. She has over eight years of cybersecurity experience, with a specialization in the Identity and Access Management (IAM) domain. She has helped various customers across different sectors define their enterprise and customer IAM roadmaps and strategies, and improve their overall technology risk landscape.
Liam Wadman

Liam is a Principal Solutions Architect with the AWS Identity team. When he’s not building exciting solutions on AWS or helping customers, he’s often found in the hills of British Columbia on his mountain bike. Liam points out that you cannot spell LIAM without IAM.

Automate topic provisioning and configuration using Terraform with Amazon MSK

Post Syndicated from Vijay Kardile original https://aws.amazon.com/blogs/big-data/automate-topic-provisioning-and-configuration-using-terraform-with-amazon-msk/

As organizations deploy Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters across multiple use cases, the manual management of topic configurations can be challenging. This can lead to several issues:

  • Inefficiency – Manual configuration is time-consuming and error-prone, especially for large deployments. Maintaining consistency across multiple configurations can be difficult. To avoid this, Kafka administrators often set the auto.create.topics.enable property on brokers, which leads to cluster operation inefficiency.
  • Human error – Manual configuration increases the risk of mistakes that can disrupt data flow and impact applications relying on Amazon MSK.
  • Scalability challenges – Scaling an Amazon MSK environment with manual configuration is cumbersome. Adding new topics or modifying existing ones requires manual intervention, hindering agility.

These challenges highlight the need for a more automated and robust approach to MSK topic configuration management.

In this post, we address this problem by using Terraform to optimize the configuration of MSK topics. This solution supports both provisioned and serverless MSK clusters.

Solution overview

Customers want a better way to manage the overhead of topics and their configurations. Manually handling topic configurations can be cumbersome and error-prone, making it difficult to keep track of changes and updates.

To address these challenges, you can use Terraform, an infrastructure as code (IaC) tool by HashiCorp. Terraform allows you to manage and provision infrastructure declaratively. It uses human-readable configuration files written in HashiCorp Configuration Language (HCL) to define the desired state of infrastructure resources. These resources can span virtual machines, networks, databases, and a vast array of cloud provider-specific offerings.

Terraform offers a compelling solution to the challenges of manual Kafka topic configuration. Terraform allows you to define and manage your Kafka topics through code. This approach provides several key benefits:

  • Automation – Terraform automates the creation, modification, and deletion of MSK topics.
  • Consistency and repeatability – Terraform configurations provide consistent topic structures and settings across your entire Amazon MSK environment. This simplifies management and reduces the likelihood of configuration drift.
  • Scalability – Terraform enables you to provision and manage large numbers of MSK topics, facilitating the growth of your Amazon MSK environment.
  • Version control – Terraform configurations are stored in version control systems, allowing you to track changes, roll back if needed, and collaborate effectively on your Amazon MSK infrastructure.

By using Terraform for MSK topic configuration management, you can streamline your operations, minimize errors, and have a robust and scalable Amazon MSK environment.
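
To illustrate the scalability point, one possible approach is to keep topic definitions as data and generate the Terraform resource blocks from them. The sketch below is an assumption about how you might organize this, not part of the solution in this post: the helper names and the example topic names are hypothetical, and it only emits the three kafka_topic arguments used later in this post.

```python
import re
from typing import Dict, List, Union

TopicSpec = Dict[str, Union[str, int]]


def render_topic(name: str, partitions: int, replication_factor: int) -> str:
    """Render one kafka_topic resource block as HCL text."""
    # Terraform resource labels cannot contain dots, so derive a safe label.
    label = re.sub(r"[^A-Za-z0-9_-]", "_", name)
    if not re.match(r"[A-Za-z_]", label):
        label = "_" + label
    return (
        f'resource "kafka_topic" "{label}" {{\n'
        f'  name               = "{name}"\n'
        f"  replication_factor = {replication_factor}\n"
        f"  partitions         = {partitions}\n"
        f"}}\n"
    )


def render_topics(topics: List[TopicSpec]) -> str:
    """Render several topics, separated by blank lines."""
    return "\n".join(render_topic(**spec) for spec in topics)


# Hypothetical topic definitions for illustration.
topics = [
    {"name": "orders", "partitions": 12, "replication_factor": 3},
    {"name": "payments.events", "partitions": 6, "replication_factor": 3},
]
hcl = render_topics(topics)
print(hcl)
```

Generated text like this could be written to a .tf file and reviewed in version control like any hand-written configuration.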

In this post, we provide a comprehensive guide for using Terraform to manage Amazon MSK configurations. We explore the process of installing Terraform on Amazon Elastic Compute Cloud (Amazon EC2), defining and decentralizing topic configurations, and deploying and updating configurations in an automated manner.

Prerequisites

Before proceeding with the solution, make sure you have the following resources and access:

By making sure you have these prerequisites in place, you will be ready to streamline your topic configurations with Terraform.

Install Terraform on your client machine

When your cluster and client machine are ready, SSH to your client machine (Amazon EC2) and install Terraform.

  1. Run the following commands to install Terraform:
    sudo yum update -y
    sudo yum install -y yum-utils shadow-utils
    sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
    sudo yum -y install terraform

  2. Run the following command to check the installation:
    terraform -v
    

If the command prints the installed Terraform version, the installation was successful and you are ready to automate your MSK topic configuration.

Provision an MSK topic using Terraform

To provision the MSK topic, complete the following steps:

  1. Create a new file called main.tf and copy the following code into this file, replacing the BOOTSTRAP_SERVERS and AWS_REGION information with the details for your cluster. For instructions on retrieving the bootstrap_servers information for IAM authentication from your MSK cluster, see Getting the bootstrap brokers for an Amazon MSK cluster. This script is common for Amazon MSK provisioned and MSK Serverless.
    terraform {
      required_providers {
        kafka = {
          source = "Mongey/kafka"
        }
      }
    }
    provider "kafka" {
      # Replace {BOOTSTRAP_SERVERS} and {AWS_REGION} with the values for your cluster.
      bootstrap_servers = [{BOOTSTRAP_SERVERS}]
      tls_enabled       = true
      sasl_mechanism    = "aws-iam"
      sasl_aws_region   = {AWS_REGION}
      # Named AWS profile used for IAM authentication.
      sasl_aws_profile  = "dev"
    }
    resource "kafka_topic" "sampleTopic" {
      name               = "sampleTopic"
      replication_factor = 1
      partitions         = 50
    }

  2. Add the IAM bootstrap server endpoints in a comma-separated list format:
    BOOTSTRAP_SERVERS = ["b-2.mskcluster…. ","b-3.mskcluster…. ","b-1.mskcluster…. "]

  3. Run the command terraform init to initialize Terraform and download the required providers.

The terraform init command initializes a working directory containing Terraform configuration files (main.tf). This is the first command that should be run after writing a new Terraform configuration.

  4. Run the command terraform plan to review the run plan.

This command shows the changes that Terraform will make to the infrastructure based on the provided configuration. This step is optional but is often used as a preview of the changes Terraform will make.

  5. If the plan looks correct, run the command terraform apply to apply the configuration.
  6. When prompted for confirmation before proceeding, enter yes.

The terraform apply command runs the actions proposed in a Terraform plan. Terraform will create the sampleTopic topic in your MSK cluster.

  7. After the terraform apply command is complete, verify that the topic has been created with the help of the kafka-topics.sh utility:
    kafka/bin/kafka-topics.sh \
      --bootstrap-server "b-1…..amazonaws.com:9098" \
      --command-config ./kafka/bin/client.properties \
      --list

You can use the kafka-topics.sh tool with the --list option to retrieve a list of topics associated with your MSK cluster. For more information, refer to the createtopic documentation.
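
If you run these verification commands from a script, building the argument list in one place helps avoid quoting and line-continuation mistakes. The following is a minimal sketch: the helper name and the broker address are hypothetical placeholders, and only the flags shown in this post are used.

```python
import shlex
from typing import List

KAFKA_TOPICS = "kafka/bin/kafka-topics.sh"


def kafka_topics_command(bootstrap_server: str, command_config: str, *action: str) -> List[str]:
    """Build the argv list for kafka-topics.sh with the common connection flags."""
    return [
        KAFKA_TOPICS,
        "--bootstrap-server", bootstrap_server,
        "--command-config", command_config,
        *action,
    ]


# Placeholder broker address; use one of your cluster's IAM bootstrap brokers.
list_cmd = kafka_topics_command(
    "broker.example.com:9098", "./kafka/bin/client.properties", "--list"
)
describe_cmd = kafka_topics_command(
    "broker.example.com:9098", "./kafka/bin/client.properties",
    "--describe", "--topic", "sampleTopic",
)
print(shlex.join(list_cmd))
print(shlex.join(describe_cmd))
```

The resulting list can be passed directly to subprocess.run on the client machine where the Kafka tools are installed.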

Update the MSK topic configuration using Terraform

To update the MSK topic configuration, let’s assume we want to change the number of partitions from 50 to 10 on our topic. We need to perform the following steps:

  1. Verify the number of partitions on the topic using the --describe command:
    kafka/bin/kafka-topics.sh \
      --bootstrap-server "b-1…...amazonaws.com:9098" \
      --command-config ./kafka/bin/client.properties \
      --describe \
      --topic sampleTopic

This command will show 50 partitions on the sampleTopic topic.

  2. Modify the Terraform file main.tf and change the value of the partitions parameter to 10:
    resource "kafka_topic" "sampleTopic" {
      name               = "sampleTopic"
      replication_factor = 1
      partitions         = 10
    }

  3. Run the command terraform plan to review the run plan.
  4. If the plan shows the changes, run the command terraform apply to apply the configuration.
  5. When prompted for confirmation before proceeding, enter yes.

Because Kafka does not support reducing the partition count of an existing topic, Terraform will drop and recreate the sampleTopic topic with the changed configuration. Data already in the topic is lost when the topic is dropped, so review the plan carefully before applying a change like this.

  6. To verify the changed number of partitions on the topic, rerun the --describe command:
    kafka/bin/kafka-topics.sh \
      --bootstrap-server "b-1…...amazonaws.com:9098" \
      --command-config ./kafka/bin/client.properties \
      --describe --topic sampleTopic

Now, this command will show 10 partitions on the sampleTopic topic.
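
To check the partition count from a script rather than by eye, you could parse the --describe output. The sketch below assumes the summary line contains a "PartitionCount: N" field, which is how kafka-topics.sh usually reports it; the helper name and the sample text are illustrative and not captured from a real cluster.

```python
import re


def partition_count(describe_output: str) -> int:
    """Extract the partition count from kafka-topics.sh --describe output."""
    match = re.search(r"PartitionCount:\s*(\d+)", describe_output)
    if match is None:
        raise ValueError("no PartitionCount field found in the output")
    return int(match.group(1))


# Illustrative summary line, modeled on the usual --describe format (an assumption).
sample_output = "Topic: sampleTopic\tPartitionCount: 10\tReplicationFactor: 1\tConfigs: \n"
print(partition_count(sample_output))
```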

Delete the MSK topic using Terraform

When you no longer need the infrastructure, you can remove all resources created by your Terraform file.

  1. Run the command terraform destroy to remove the topic.
  2. When prompted for confirmation before proceeding, enter yes.

Terraform will delete the sampleTopic topic from your MSK cluster.

  3. To verify, rerun the --list command:
    kafka/bin/kafka-topics.sh \
      --bootstrap-server "b-1…..amazonaws.com:9098" \
      --command-config ./kafka/bin/client.properties \
      --list

Now, this command will not show the sampleTopic topic.
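
Because updating and destroying topics can delete data, it can help to flag destructive actions before you confirm an apply or destroy. One way to do this, sketched below, is to save a plan with terraform plan -out=tfplan, export it with terraform show -json tfplan, and look for resource changes whose actions include delete. The helper name and the sample plan are hypothetical, and the sample includes only the fields the function reads.

```python
import json
from typing import List


def destructive_changes(plan_json: str) -> List[str]:
    """List resources that a Terraform plan would delete or replace."""
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement is reported as a delete and a create on the same resource.
        if "delete" in actions:
            flagged.append(f'{change["address"]}: {"+".join(actions)}')
    return flagged


# Trimmed, hand-written sample of the JSON plan representation.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "kafka_topic.sampleTopic", "change": {"actions": ["delete", "create"]}},
        {"address": "kafka_topic.otherTopic", "change": {"actions": ["no-op"]}},
    ]
})
print(destructive_changes(sample_plan))
```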

Conclusion

In this post, we addressed the common challenges associated with manual MSK topic configuration management and presented a robust Terraform-based solution. Using Terraform for automated topic provisioning and configuration streamlines your processes, fosters scalability, and enhances flexibility. Additionally, it facilitates automated deployments and centralized management.

We encourage you to explore Terraform as a means to optimize Amazon MSK configurations and unlock further efficiencies within your streaming data pipelines.


About the author

Vijay Kardile is a Sr. Technical Account Manager with Enterprise Support, India. With over two decades of experience in IT Consulting and Engineering, he specializes in Analytics services, particularly Amazon EMR and Amazon MSK. He has empowered numerous enterprise clients by facilitating their adoption of various AWS services and offering expert guidance on attaining operational excellence.

How to implement IAM policy checks with Visual Studio Code and IAM Access Analyzer

Post Syndicated from Anshu Bathla original https://aws.amazon.com/blogs/security/how-to-implement-iam-policy-checks-with-visual-studio-code-and-iam-access-analyzer/

In a previous blog post, we introduced the IAM Access Analyzer custom policy check feature, which allows you to validate your policies against custom rules. Now we’re taking a step further and bringing these policy checks directly into your development environment with the AWS Toolkit for Visual Studio Code (VS Code).

In this blog post, we show how you can integrate IAM Access Analyzer custom policy check capability into VS Code, so you can identify overly permissive IAM policies and fine-tune access controls early in the development process. This proactive approach to security and compliance helps to ensure that your IAM policies are validated before they are deployed, reducing the risk of introducing misconfigurations or granting unintended access. It also saves developer time by providing fast feedback to developers when they write a policy that does not meet organizational standards.

What is the problem?

Although security teams oversee an organization’s overall security posture, developers create applications that require specific permissions. To enable developers to work efficiently while maintaining high security standards, organizations often seek ways to safely delegate the authoring of AWS Identity and Access Management (IAM) policies to developers. Many AWS customers manually review developer-authored IAM policies before deploying them to production environments to help prevent granting excessive or unintended permissions. However, depending on the volume and complexity of policies, these manual reviews can be time-consuming, leading to development delays and potential bottlenecks in the deployment of applications and services. Organizations need to balance secure access management with the agility required for rapid application development and deployment.

How to use IAM Access Analyzer custom policy checks in VS Code

Custom policy checks are a feature in IAM Access Analyzer that is designed to help security teams proactively identify and analyze critical permissions within their IAM policies. In this section, we provide step-by-step instructions for using custom policy checks directly in VS Code.

Prerequisites

To complete the examples in our walkthrough, you first need to do the following:

  1. Install Python version 3.6 or later.
  2. Assuming you are already using the VS Code Integrated Development Environment (IDE), search for and install the AWS Toolkit extension.
  3. Configure your AWS role credentials to connect the toolkit to AWS.
  4. Install the IAM Policy Validator for AWS CloudFormation, available on GitHub. Alternatively, you can install the IAM Policy Validator for Terraform from GitHub if you are using Terraform as infrastructure-as-code in your organization.
  5. So that you can open IAM Access Analyzer policy checks in the VS Code editor, open the VS Code Command Palette by pressing Ctrl+Shift+P, search for IAM Policy Checks, and then choose AWS: Open IAM Policy Checks as shown in Figure 1.
    Figure 1: Search for the AWS: Open IAM Policy Checks option

By using the IAM policy checks option in VS Code, you can perform four types of checks:

  • ValidatePolicy
  • CheckNoPublicAccess
  • CheckAccessNotGranted
  • CheckNoNewAccess

We’ll walk through examples of each of these checks in the sections that follow.

Example 1: ValidatePolicy

In this example, we use the ValidatePolicy option provided by the IAM policy check plugin to validate IAM policies against IAM policy grammar and AWS best practices. When you run this check, you can view policy validation check findings that include security warnings, errors, general warnings, and suggestions for your policy. These actionable recommendations help you author policies that are aligned with AWS best practices.

To run the ValidatePolicy check

  1. Let’s use the following IAM policy for illustration purposes. You can see that resource * (a wildcard) is being used in the first statement, which indicates that the iam:PassRole action is allowed for all resources.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:PutObject"],
          "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
        }
      ]
    }

  2. In the VS Code editor, navigate to the IAM Policy Checks pane. Choose the document type JSON Policy Language and policy type Identity. Then choose Run Policy Validation.
    Figure 2: IAM Access Analyzer ValidatePolicy check results

    You can see that Access Analyzer has detected an issue, which is shown in the PROBLEMS pane.

    Figure 3: Problems pane with finding details for the ValidatePolicy check

    The security warning shown in Figure 3 states that the iam:PassRole action with a wildcard (*) in the resource can be overly permissive because it allows the ability to pass any IAM role in that account.

  3. Now, let’s modify the IAM policy by replacing the wildcard (*) with a specific role Amazon Resource Name (ARN).
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "arn:aws:iam::111122223333:role/sample_role"
        },
        {
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:PutObject"],
          "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
        }
      ]
    }
    

  4. Verify the policy again by running the ValidatePolicy check to make sure that it doesn’t generate findings after you updated the IAM policy.
    Figure 4: Results of the ValidatePolicy check after IAM policy correction
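
To make the finding in this example concrete, the following sketch looks for the one pattern that ValidatePolicy flagged here: an Allow statement that grants iam:PassRole on a wildcard resource. It is a simplified local illustration with hypothetical helper names, not how IAM Access Analyzer evaluates policies, and it covers none of the service's other checks.

```python
import json
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """IAM policy elements may be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def passrole_wildcard_statements(policy: dict) -> List[int]:
    """Return indexes of Allow statements granting iam:PassRole on Resource "*"."""
    hits = []
    for index, statement in enumerate(as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue
        actions = [action.lower() for action in as_list(statement.get("Action"))]
        if "iam:passrole" in actions and "*" in as_list(statement.get("Resource")):
            hits.append(index)
    return hits


original = json.loads("""
{"Version": "2012-10-17", "Statement": [
  {"Effect": "Allow", "Action": "iam:PassRole", "Resource": "*"},
  {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject"],
   "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"}]}
""")
corrected = json.loads(json.dumps(original))
corrected["Statement"][0]["Resource"] = "arn:aws:iam::111122223333:role/sample_role"

print(passrole_wildcard_statements(original))
print(passrole_wildcard_statements(corrected))
```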

Example 2: CheckNoPublicAccess

With the CheckNoPublicAccess option, you can verify whether your resource policy grants public access for supported resource types.

To run the CheckNoPublicAccess check

  1. To test whether a policy does not allow public access, create a new bucket using a CloudFormation template and attach a resource policy that grants access to any principal to see the objects in this bucket.

    WARNING: This sample bucket policy should not be used in production. Using a wildcard in the principal element of a bucket policy would allow any IAM principal to view the contents of the bucket.

    Resources:
      MyBucket:
        Type: 'AWS::S3::Bucket'
        Properties:
          BucketName: amzn-s3-demo-bucket
      MyBucketPolicy:
        Type: 'AWS::S3::BucketPolicy'
        Properties:
          Bucket:
            Ref: 'MyBucket'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal: "*"
                Action: 's3:GetObject'
                Resource:
                  Fn::Join:
                    - ''
                    - - 'arn:aws:s3:::'
                      - Ref: 'MyBucket'
                      - '/*'

  2. Select the document type CloudFormation template and then choose Run Custom Policy Check to see whether this resource policy passes the CheckNoPublicAccess check.
    Figure 5: IAM Access Analyzer CheckNoPublicAccess check results

    The policy check returns a failed result because this bucket does allow public access.

    Figure 6: Problems pane finding details for CheckNoPublicAccess check

  3. Next, fix this policy to allow access from a role within the same account by restricting the policy to a specific role ARN.
    Resources:
      MyBucket:
        Type: 'AWS::S3::Bucket'
        Properties:
          BucketName: amzn-s3-demo-bucket
      MyBucketPolicy:
        Type: 'AWS::S3::BucketPolicy'
        Properties:
          Bucket:
            Ref: 'MyBucket'
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                  AWS: 'arn:aws:iam::111122223333:role/sample_role'
                Action: 's3:GetObject'
                Resource:
                  Fn::Join:
                    - ''
                    - - 'arn:aws:s3:::'
                      - Ref: 'MyBucket'
                      - '/*'

  4. Re-run the CheckNoPublicAccess check. The resource policy no longer grants public access and the status of the policy check is PASS.
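
The difference between the two bucket policies can be illustrated with a rough local heuristic that flags Allow statements with a wildcard principal and no Condition element. This is a sketch under strong simplifying assumptions, with hypothetical helper names; the real CheckNoPublicAccess check reasons about the whole policy, including conditions, so treat this only as an illustration of the failing pattern.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Normalize a single value or a list to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def wildcard_principal_statements(policy_document: dict) -> List[int]:
    """Return indexes of unconditioned Allow statements with a wildcard principal."""
    hits = []
    for index, statement in enumerate(as_list(policy_document.get("Statement"))):
        principal = statement.get("Principal")
        is_wildcard = principal == "*" or (
            isinstance(principal, dict) and "*" in as_list(principal.get("AWS"))
        )
        if statement.get("Effect") == "Allow" and is_wildcard and "Condition" not in statement:
            hits.append(index)
    return hits


# JSON equivalents of the two PolicyDocument sections in this example.
public_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*",
    }],
}
restricted_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/sample_role"},
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*",
    }],
}
print(wildcard_principal_statements(public_policy))
print(wildcard_principal_statements(restricted_policy))
```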

Example 3: CheckAccessNotGranted

The CheckAccessNotGranted option allows you to check whether a policy allows access to a list of IAM actions and resource ARNs. You can use this check to give developers fast feedback that certain permissions or access to certain resources are not allowed.

To run the CheckAccessNotGranted check

  1. Identify sensitive actions and resources.

    In the VS Code editor, under Custom Policy Checks, choose the check type CheckAccessNotGranted. Using a comma-separated list, create a list of actions and resource ARNs that you don’t want to allow in your IAM policy. You can also create a JSON file with your actions and resources by using the syntax shown in Figure 7. For this example, set the s3:PutBucketPolicy and dynamodb:DeleteTable IAM actions to “not allowed” in the IAM policy.

    Figure 7: Configure the CheckAccessNotGranted check

  2. Create a sample CloudFormation template that contains an IAM policy attached to an IAM role, as follows. This policy grants access to some of the actions that you deemed sensitive in Figure 7.
    Resources:
      CreateTagsLambdaRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
            - Effect: Allow
              Principal:
                Service: lambda.amazonaws.com
              Action: sts:AssumeRole
          Policies:
          - PolicyName: my-application-access
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
              - Effect: Allow
                Action:
                - ec2:DescribeInstances
                Resource: "*"
              - Effect: Allow
                Action:
                - s3:GetObject
                - s3:PutBucketPolicy
                - dynamodb:DeleteTable
                Resource: "*"            
              
          RoleName: sample-role
    

  3. In the VS Code editor, choose Run Custom Policy Check to identify whether one of the sensitive actions or resources is allowed in the IAM policy. The policy check returns FAIL because the policy has the actions s3:PutBucketPolicy and dynamodb:DeleteTable, which you marked as actions that you don’t want developers to grant access to. Remove the restricted actions from the policy and run the check again to see a PASS result for the policy check.
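
As a rough local approximation of what this check looks for, the sketch below reports which of the listed actions are granted by Allow statements in a policy, including grants made through wildcards such as s3:*. The helper names are hypothetical, and the logic ignores resources, conditions, and Deny statements, all of which the real CheckAccessNotGranted check takes into account.

```python
from fnmatch import fnmatchcase
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Normalize a single value or a list to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def granted_restricted_actions(policy: dict, restricted: List[str]) -> List[str]:
    """Return the restricted actions that some Allow statement grants."""
    found = set()
    for statement in as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in as_list(statement.get("Action")):
            for action in restricted:
                # IAM action names are case-insensitive and may contain * and ?.
                if fnmatchcase(action.lower(), pattern.lower()):
                    found.add(action)
    return sorted(found)


# JSON equivalent of the inline policy in the CloudFormation template above.
role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ec2:DescribeInstances"], "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutBucketPolicy", "dynamodb:DeleteTable"],
            "Resource": "*",
        },
    ],
}
restricted_actions = ["s3:PutBucketPolicy", "dynamodb:DeleteTable"]
print(granted_restricted_actions(role_policy, restricted_actions))
```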

Example 4: CheckNoNewAccess

The CheckNoNewAccess option is a custom policy check that verifies whether your policy grants new access compared to a reference policy.

You use a reference policy to check whether a candidate policy allows more access than the reference policy does. In other words, the check passes if the candidate policy is a subset of the reference policy. A reference policy typically starts by allowing all access. You then add a statement or statements that deny the access that you want the reference policy to check for. For more details and examples of reference policies, see the iam-access-analyzer-custom-policy-check-samples repository on GitHub.

The ability to use a reference policy provides you with the flexibility to look for almost anything in an IAM policy. This is useful when you have custom requirements for your organization that may not be met with some of the other custom policy checks.

To run the CheckNoNewAccess check

  1. Create a reference policy: In your project, create a new JSON policy document that will serve as your reference policy.

    The following reference policy checks that an IAM role trust policy only grants access to an allowlisted set of AWS services. This enables you to allow builders to create roles, but constrain the use of those roles to the set of AWS services specified.

    In this reference policy, only the specified AWS service principals ec2.amazonaws.com, lambda.amazonaws.com, and ecs-tasks.amazonaws.com are allowed to assume the role.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowThisSetOfServicePrincipals",
          "Effect": "Allow",
          "Principal": {
            "Service": [
              "ec2.amazonaws.com",
              "lambda.amazonaws.com",
              "ecs-tasks.amazonaws.com"
            ]
          },
          "Action": "sts:AssumeRole"
        },
        {
          "Sid": "AllowOtherSTSActions",
          "Effect": "Allow",
          "Principal": "*",
          "NotAction": "sts:AssumeRole"
        }
      ]
    }
    

  2. Enter the reference policy in the VS Code editor. In the IAM Policy Checks pane, select the check type CheckNoNewAccess. Then set the reference policy type to Resource, because this is a trust policy that defines which principals can assume the role. In addition, provide the path of the reference policy that you created in Step 1. You can also directly enter the reference policy as a JSON policy document, as shown in Figure 8.
    Figure 8: Enter the reference policy for the CheckNoNewAccess check

  3. Create a CloudFormation template, as follows. This template creates an IAM role that allows the AWS service principals lambda.amazonaws.com and glue.amazonaws.com to assume the sample-application-role IAM role.
    Resources:
      SampleApplicationRole:
        Type: AWS::IAM::Role
        Properties:
          AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
            - Effect: Allow
              Principal:
                Service: 
                - lambda.amazonaws.com
                - glue.amazonaws.com
              Action: sts:AssumeRole
          Policies:
          - PolicyName: my-application-access
            PolicyDocument:
              Version: '2012-10-17'
              Statement:
              - Effect: Allow
                Action:
                - s3:GetObject
                Resource: "arn:aws:s3:::amzn-s3-demo-bucket/*"
          RoleName: sample-application-role
    

  4. In the VS Code editor, choose Run Custom Policy Check to check your CloudFormation template against the reference policy you configured in Step 1. The check will return FAIL and you will see a security warning in the editor in the PROBLEMS pane.
    Figure 9: Problems pane finding details for the CheckNoNewAccess check

    The issue is that glue.amazonaws.com was not listed as a service principal that was allowed to assume a role in your reference policy. You can remove glue.amazonaws.com from the CloudFormation template and re-run the check to receive a PASS result.
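
For this particular reference policy, the failing result can be reproduced in miniature by comparing the service principals that each trust policy allows to call sts:AssumeRole. The sketch below is a simplified illustration with hypothetical helper names; the real CheckNoNewAccess check compares the full set of permissions granted by the two policies, not only service principals.

```python
from typing import Any, List, Set


def as_list(value: Any) -> List[Any]:
    """Normalize a single value or a list to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def assume_role_services(trust_policy: dict) -> Set[str]:
    """Collect service principals that Allow statements let call sts:AssumeRole."""
    services: Set[str] = set()
    for statement in as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in as_list(statement.get("Action")):
            continue
        principal = statement.get("Principal")
        if isinstance(principal, dict):
            services.update(as_list(principal.get("Service")))
    return services


reference_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowThisSetOfServicePrincipals",
            "Effect": "Allow",
            "Principal": {"Service": [
                "ec2.amazonaws.com", "lambda.amazonaws.com", "ecs-tasks.amazonaws.com",
            ]},
            "Action": "sts:AssumeRole",
        },
        {"Sid": "AllowOtherSTSActions", "Effect": "Allow",
         "Principal": "*", "NotAction": "sts:AssumeRole"},
    ],
}
candidate_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": ["lambda.amazonaws.com", "glue.amazonaws.com"]},
        "Action": "sts:AssumeRole",
    }],
}
new_services = assume_role_services(candidate_trust_policy) - assume_role_services(reference_policy)
print(sorted(new_services))
```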

Conclusion

In this post, we explored how you can use the integration of VS Code with IAM Access Analyzer in your development workflow to make sure that your IAM policies align with best practices and adhere to your organization’s security requirements. The four critical checks provided by IAM Access Analyzer can be summarized as follows:

  • The ValidatePolicy check provides actionable recommendations that help you author policies that are aligned with AWS best practices.
  • The CheckNoPublicAccess check helps protect resources from being exposed publicly and mitigates the risk of unauthorized public access.
  • The CheckAccessNotGranted check looks for specific IAM actions and resource ARNs to help enforce access restrictions and help prevent unauthorized access to critical data or services.
  • The CheckNoNewAccess check validates that the permissions granted in your IAM policies remain within the intended scope, as defined by your organization’s requirements.

Install or update the AWS Toolkit for VS Code today, and make sure that you have the CloudFormation Policy Validator or Terraform Policy Validator, to take advantage of these features.

If you have feedback about this post, submit comments in the Comments section below.

Anshu Bathla

Anshu is a Lead Consultant – SRC at AWS, based in Gurugram, India. He works with customers across diverse verticals to help strengthen their security infrastructure and achieve their security goals. Outside of work, Anshu enjoys reading books and gardening at his home garden.

Manoj Kumar

Manoj is a Lead Consultant – SRC at AWS, based in Gurugram, India. He collaborates with diverse clients to design and implement comprehensive AWS Cloud security solutions. His expertise helps organizations fortify their cloud infrastructures, achieve compliance objectives, and provide robust data protection while using the advanced security features of AWS to support their business objectives.

Unlocking AWS Console: Diagnosing Errors with Amazon Q Developer

Post Syndicated from Marco Frattallone original https://aws.amazon.com/blogs/devops/unlocking-aws-console-diagnosing-errors-with-amazon-q-developer/

Introduction

Developers, IT Operators, and in some cases, Site Reliability Engineers (SREs) are responsible for deploying and operating infrastructure and applications, as well as responding to and resolving incidents effectively and in a timely manner. Effective incident management requires quick diagnosis, root cause analysis, and implementation of corrective actions. Diagnosing the root cause can be challenging in the context of modern systems that involve multiple resources deployed across distributed environments. Amazon Q Developer, a generative AI-powered assistant, can help simplify this process by diagnosing errors you receive in the AWS Management Console.

Amazon Q Developer can save you critical time when dealing with production issues by helping to diagnose errors related to your AWS environment. These errors could be the result of misconfiguration across multiple resources, and they usually require you to navigate between several service consoles to identify the root cause. Amazon Q Developer applies machine learning models to automate diagnosis of errors that arise in the AWS Console interface. This reduces the mean time to repair (MTTR) and minimizes the impact of incidents on business operations.

This blog post explores the Amazon Q Developer feature that diagnoses errors in the AWS Management Console while you work with AWS services. We describe how this feature works to give you guidance on troubleshooting, and we take a look behind the scenes at the processes that power it.

Diagnose with Amazon Q

The Diagnose with Amazon Q feature is activated when an error occurs in the console for an AWS service that is currently supported by this functionality, and a user with appropriate permissions clicks the Diagnose with Amazon Q button next to the error message. Amazon Q provides a natural language explanation that analyzes the root cause of the error. With a second click on Help me resolve, Amazon Q displays an ordered list of instructions which can be used to resolve the error condition. Once completed, you can provide feedback on whether the resolution provided by Amazon Q was helpful.

To make things concrete, we consider two running examples.

Example 1: Assume that you try to delete an S3 bucket which is not empty. This results in an error message:

This bucket is not empty. Buckets must be empty before they can be deleted. To delete all objects in the bucket, use the empty bucket configuration.

Example 2: Suppose that you try to list objects in a particular S3 bucket, but lack IAM permissions to do so. This results in an error message:

Insufficient permissions to list objects. After you or your AWS administrator has updated your permissions to allow the s3:ListBucket action, refresh the page. Learn more about Identity and access management in Amazon S3.

The user clicks the “Launch Instances” button in the EC2 service console in the AWS Management Console. The user enters all the required information and clicks the “Launch Instance” button. This results in an “Instance launch failed” error appearing in the console along with a “Diagnose with Amazon Q” button. The user clicks the button, which brings up a new window titled “Diagnose console errors with Amazon Q”. Soon an “Analysis” section appears with a message describing, in natural language, the issue with IAM permissions to launch new EC2 instances. The user clicks the “Help me resolve” button. After a few seconds, a “Resolution” section appears with the steps to resolve the error.

Diagnose with Amazon Q: IAM permissions error related to an EC2 instance launch

Behind the Scenes: How Amazon Q generates a diagnosis

When you click the Diagnose with Amazon Q button next to the error message in the AWS Management Console, Amazon Q generates an Analysis that expresses the root cause of the error in natural language. This step is assisted by Large Language Models (LLMs) and is based on context information only. The context provided to the LLM includes the error message shown in the console, the URL of the triggering action, and the IAM role of the user signed in to the AWS Console. The service always operates within the permissions granted by your role as you operate in the AWS Console, so privileges are never escalated beyond what is assigned to you.

When you click the Help me resolve button after you have reviewed the analysis, Amazon Q retrieves additional information about the state of the resources in the AWS account where the error occurred. This is accomplished by interrogating the customer account in various ways. In this phase, the system actively decides which information is still missing and issues interrogation requests against internal services to fill the gap. Interrogation is not needed for simple errors, such as Example 1 above, but becomes essential for resolving more complex errors, where information from the context proves insufficient.

Given the context, error analysis, user permissions, and results of account interrogation, Amazon Q generates step-by-step Resolution instructions. This step is assisted by LLMs.

After implementing and validating the steps provided by Amazon Q to resolve the error in the console, you have the option to provide feedback on your experience.

A flow diagram illustrating an error resolution process using Amazon Q. The process begins with an error. The user then diagnoses the issue with Amazon Q, which gets context information from the AWS Console and provides an Analysis. The user requests help to resolve the error. The system enriches the prompt by interrogating the signed-in user's account. The model generates step-by-step resolution instructions. These instructions go through a validation process before being presented to the user for implementation.

Diagram showing Interactions between User, AWS Console and Amazon Q Developer

Context Information

Contextual information helps the LLMs to generate more relevant and informed outputs. Context is provided to Amazon Q as input from the console automatically. As the basis for all further analysis and decisions, it should be as rich as possible. At a minimum, Amazon Q obtains the error message, the URL for the triggering action, and the IAM role that the signed-in user assumes. The system automatically extracts relevant identifiers from the context. In our running Example 1, the URL may be https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2, from which Amazon Q extracts aws_region = "us-west-2" and s3_bucket_name = "my-bucket-123456".
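As a rough illustration of this extraction step, the following Python sketch pulls the two identifiers out of a console URL. The function name extract_identifiers and the assumption that the bucket name follows a "bucket" or "buckets" path segment are ours, inferred from the two example URLs in this post; this is not how Amazon Q is actually implemented.

```python
from urllib.parse import urlparse, parse_qs


def extract_identifiers(console_url: str) -> dict:
    """Extract aws_region and s3_bucket_name from an S3 console URL.

    Hypothetical helper: the URL layout is assumed from the examples
    in this post and may differ for other console pages.
    """
    parsed = urlparse(console_url)
    query = parse_qs(parsed.query)
    identifiers = {}

    # The region is carried in the "region" query parameter.
    if "region" in query:
        identifiers["aws_region"] = query["region"][0]

    # The bucket name is the path segment after "bucket" or "buckets".
    segments = [s for s in parsed.path.split("/") if s]
    for i, segment in enumerate(segments[:-1]):
        if segment in ("bucket", "buckets"):
            identifiers["s3_bucket_name"] = segments[i + 1]
            break

    return identifiers


url = "https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2"
print(extract_identifiers(url))
# {'aws_region': 'us-west-2', 's3_bucket_name': 'my-bucket-123456'}
```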

Beyond this minimum context, Amazon Q can obtain additional information from the console about what the user sees on the screen when the error happens, such as the content of text fields or widgets in the current UI. Amazon Q can also make use of specific context provided by the underlying service. In the case of Example 2 above, the bucket name is extracted from the URL, the action s3:ListBucket from the error message, and Amazon Q may obtain additional information from IAM about related policies and their Allow or Deny statements.

Interrogating the signed-in user’s Account

Diagnose with Amazon Q is not just a passive receiver of context information. It has built-in capabilities for actively requesting additional information. This includes developing an understanding of the resources in the AWS account and their relationship with the resource experiencing the error. Such interrogation queries are planned by a subsystem based on context information. It provides a low-latency and deterministic approach to finding resources and their relationships. This relationship context provided to the LLM, such as EBS volumes attached to an EC2 instance or policies included in the attached IAM role, improves the accuracy of root cause analysis for diagnosing the error.

In the simple running Example 1, where the error is due to a non-empty S3 bucket, the error message and the console URL contain all the necessary information to proceed, and active interrogation is not required. On the other hand, for the IAM permission error in Example 2, it’s helpful to understand the permissions on the IAM role associated with the resource experiencing the error. Using internal IAM services, Amazon Q can fetch identity-level policies for the role and resource-level policies for the affected resource, and diagnose the cause of the error based on them. To be concrete, the URL for Example 2 may be https://s3.console.aws.amazon.com/s3/buckets/my-bucket-123456?region=us-west-2&bucketType=general&tab=objects, from which Amazon Q extracts the region and the S3 bucket name. It can also extract the action s3:ListBucket from the error message itself. Based on this information, Amazon Q can fetch bucket policies for my-bucket-123456 and identity-level policies for the role, then scan those for the presence or absence of the s3:ListBucket action, or call internal IAM services to provide additional information about the cause of access being denied.
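The policy scan described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not real IAM policy evaluation and not Amazon Q's implementation: it looks only at the Effect and Action fields and ignores Resource, Condition, NotAction, and Principal. The function names and the sample policies are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single item or a list in several fields."""
    return value if isinstance(value, list) else [value]


def find_statements(policy: dict, action: str) -> dict:
    """Collect Allow and Deny statements whose Action matches `action`.

    Action names are compared case-insensitively, with * and ?
    treated as wildcards.
    """
    matches = {"Allow": [], "Deny": []}
    for statement in _as_list(policy.get("Statement", [])):
        patterns = _as_list(statement.get("Action", []))
        if any(fnmatchcase(action.lower(), p.lower()) for p in patterns):
            matches[statement["Effect"]].append(statement)
    return matches


def classify(policies: list, action: str) -> str:
    """Summarize a set of policies as a candidate root cause."""
    allows, denies = [], []
    for policy in policies:
        found = find_statements(policy, action)
        allows.extend(found["Allow"])
        denies.extend(found["Deny"])
    if denies:
        return "explicit_deny"
    if not allows:
        return "missing_allow"
    return "allowed"


# Hypothetical identity policy attached to the role: no list permission.
role_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"}],
}
print(classify([role_policy], "s3:ListBucket"))
```

With the sample policy, the scan reports a missing Allow statement, which corresponds to the root cause assumed for Example 2. A Deny statement in either the role policy or the bucket policy would instead be reported as an explicit deny.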

This subsystem uses the AWS Cloud Control API (CCAPI), which Amazon Q calls on your behalf with the permissions granted by your IAM role. As part of onboarding to Amazon Q, the AmazonQFullAccess managed policy is attached to the role that can access Amazon Q. This managed policy contains the ListResources and GetResource CCAPI IAM permissions, so every role given that managed policy has access to the CCAPI read and list endpoints. If you do not attach the AmazonQFullAccess managed policy to the required roles, you will need to attach the ListResources and GetResource permissions directly to those roles.
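If you attach the two permissions directly, the policy document could look like the one built by the sketch below. The post names only ListResources and GetResource. The cloudformation: service prefix, the statement ID, and the wildcard resource are our assumptions, so check the Cloud Control API documentation and your own least-privilege requirements before using anything like this.

```python
import json


def ccapi_read_policy() -> dict:
    """Build a minimal identity policy granting CCAPI read and list access.

    Assumption: Cloud Control API actions are namespaced under the
    "cloudformation:" prefix in IAM. The Sid is a hypothetical label.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudControlReadAndList",
                "Effect": "Allow",
                "Action": [
                    "cloudformation:GetResource",
                    "cloudformation:ListResources",
                ],
                "Resource": "*",
            }
        ],
    }


# Print the JSON document, ready to paste into the IAM policy editor.
print(json.dumps(ccapi_read_policy(), indent=2))
```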

Generating Step-by-step Resolution Instructions

At this point, all acquired information is synthesized by Amazon Q in order to generate useful and actionable resolution instructions. As an illustration, possible sample instructions for the running examples under consideration are listed below. As the models are updated and improved over time, the responses can change.

For Example 1, sample instructions could look like:

  1. Navigate to the S3 console, click “Buckets”, and select the my-bucket-123456 bucket
  2. Click on the “Empty” tab.
  3. If your bucket contains a large number of objects, creating a lifecycle rule to delete all objects in the bucket might be a more efficient way of emptying your bucket
  4. Type “permanently delete” in text input field and confirm that all objects are to be removed.
  5. Retry deleting the my-bucket-123456 S3 bucket.

For Example 2, you may obtain:

  1. Go to the IAM console. Edit the IAM policy attached to the role ReadOnly
  2. Allow for the s3:ListBucket action for resource being the S3 bucket ARN arn:aws:s3:::my-bucket-123456.
  3. Save the updated IAM policy
  4. Refresh the S3 console page to list the objects in the bucket my-bucket-123456

Note that the instructions contain information inferred from the context, such as the bucket name my-bucket-123456, instead of placeholders. Instructions returned by Diagnose with Amazon Q are complete and fine-grained enough to be followed without any extra effort. In fact, while the service makes use of an LLM to synthesize resolution instructions, Amazon Q uses post-processing to correct frequently occurring mistakes. For example, in Example 2 above, the LLM may have returned the ARN as arn:aws:s3:<region>::<bucket_name>, which would be corrected to what is shown above.

The instructions returned for Example 2 above assume that the reason for the user not being able to list objects is a missing Allow statement in the policies attached to the ReadOnly role. Other root causes could be a Deny statement in a policy attached to the S3 bucket or to the ReadOnly role. Diagnose with Amazon Q can use account interrogation to identify the correct root cause and propose the right resolution. In the example above, it can fetch the policies attached to the ReadOnly role and check whether s3:ListBucket is indeed missing, or fetch policies attached to the bucket my-bucket-123456.
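To illustrate what such a post-processing rule might look like, here is a hypothetical Python sketch that rewrites S3 ARNs carrying a spurious region field. S3 bucket and object ARNs leave both the region and account fields empty, so the rule blanks the region whenever the account field is already empty. The function name and regular expression are ours; the post does not describe the actual rules Amazon Q applies.

```python
import re

# Matches arn:aws:s3:<anything>::<resource>, that is, an S3 ARN with an
# empty account field. ARNs that carry an account ID are left alone.
_S3_ARN_EMPTY_ACCOUNT = re.compile(
    r"arn:aws:s3:[^:\s]*::(?P<resource>[A-Za-z0-9.\-_/*]+)"
)


def fix_s3_arns(text: str) -> str:
    """Drop the region field from S3 bucket and object ARNs in `text`."""
    return _S3_ARN_EMPTY_ACCOUNT.sub(
        lambda m: "arn:aws:s3:::" + m.group("resource"), text
    )


step = "Allow s3:ListBucket on arn:aws:s3:us-west-2::my-bucket-123456"
print(fix_s3_arns(step))
# Allow s3:ListBucket on arn:aws:s3:::my-bucket-123456
```

The rule is idempotent, so an ARN that is already correct passes through unchanged.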

Validation

One goal for Diagnose with Amazon Q is to attain wide coverage of AWS rapidly while keeping the quality bar high, so that you obtain useful, actionable advice wherever you encounter an error. An important prerequisite for this goal is a robust and flexible evaluation system. Evaluating systems based on generative AI is challenging because of the large output space (natural language) and non-deterministic behavior.

In a nutshell, our validation system is based on building a large dataset of errors, where each record has a certain number of annotations. Each record contains the context, that is, the templatized error message and console URL, in which my-bucket-123456 is replaced by {{s3_bucket_name}} and us-west-2 by {{aws_region}}. Annotations include Infrastructure as Code (CloudFormation) descriptions of the erroneous account state and the triggering action, as well as ground truth responses obtained from expert annotators. These records allow us to simulate the behavior of variants of our system without human interaction and many times faster than real time (by way of parallelization). We are also developing automated validation metrics for comparing ground truth annotations and system responses, based on which offline evaluations can be run fully automatically.

This validation system allows us to rapidly validate new ideas by comparing them against the current state, while also guarding against regressions. While human experts are still needed to provide annotations of error records, we actively innovate to speed up and simplify these tasks by building annotation tools that avoid natural language input, have validations built in, and ask annotators to correct system output rather than provide ground truth annotations from scratch.
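The templatization of a record's context can be pictured with the following Python sketch. It assumes that the identifiers have already been extracted as a name-to-value mapping. The helper names templatize and instantiate are hypothetical, and the real dataset tooling is not described in this post.

```python
def templatize(text: str, identifiers: dict) -> str:
    """Replace concrete identifier values with {{name}} placeholders."""
    # Replace longer values first, so that a value contained in another
    # value does not break the longer one.
    for name, value in sorted(identifiers.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(value, "{{" + name + "}}")
    return text


def instantiate(template: str, identifiers: dict) -> str:
    """Inverse step: fill placeholders with values for a simulated run."""
    for name, value in identifiers.items():
        template = template.replace("{{" + name + "}}", value)
    return template


ids = {"s3_bucket_name": "my-bucket-123456", "aws_region": "us-west-2"}
url = "https://s3.console.aws.amazon.com/s3/bucket/my-bucket-123456/delete?region=us-west-2"
print(templatize(url, ids))
# https://s3.console.aws.amazon.com/s3/bucket/{{s3_bucket_name}}/delete?region={{aws_region}}
```

Storing templates rather than concrete values lets the same record be replayed against different simulated accounts by calling instantiate with fresh identifiers.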

Conclusion

The Diagnose with Amazon Q feature of Amazon Q Developer allows you to determine the cause of an error in the AWS Console without needing to navigate to multiple service consoles. By providing tailored, step-by-step instructions specific to your AWS account and error context, Amazon Q Developer empowers you to troubleshoot and resolve issues efficiently. This helps your organization achieve greater operational efficiency, reduce downtime, improve service quality, and free up valuable human resources so that they can focus on higher-value activities. This post also described how AI and machine learning capabilities work behind the scenes to enable this functionality.

About the authors

Matthias Seeger, Principal Applied Scientist, AWS NGDE Science

Matthias Seeger is a Principal Applied Scientist at AWS.

Marco Frattallone, Sr. TAM, AWS Enterprise Support

Marco Frattallone is a Senior Technical Account Manager at AWS focused on supporting Partners. He works closely with Partners to help them build, deploy, and optimize their solutions on AWS, providing guidance and leveraging best practices. Marco is passionate about technology and helps Partners stay at the forefront of innovation. Outside work, he enjoys outdoor cycling, sailing, and exploring new cultures.

Surabhi Tandon, Sr EAE, AWS Support

Surabhi Tandon is a Senior Technical Account Manager at Amazon Web Services (AWS). She helps enterprise customers achieve operational excellence and supports them on their cloud journey on AWS by providing strategic technical guidance. Surabhi is a builder with an interest in Generative AI, automation, and DevOps. Outside of work, she enjoys hiking, reading, and spending time with family and friends.

Securing a city-sized event: How Amazon integrates physical and logical security at re:Invent

Post Syndicated from Steve Schmidt original https://aws.amazon.com/blogs/security/securing-a-city-sized-event-how-amazon-integrates-physical-and-logical-security-at-reinvent/

Securing an event of the magnitude of AWS re:Invent—the Amazon Web Services annual conference in Las Vegas—is no small feat. The most recent event, in December, operated on the scale of a small city, spanning seven venues over twelve miles and nearly seven million square feet across the bustling Las Vegas Strip.

Keeping all 60,000 in-person attendees, 400,000 online participants, and their data secure requires a sophisticated blend of physical and logical security measures—a challenge that we’ve addressed by building an integrated security strategy that brings both sides together. We used every resource available to us, including drones, K9 units, our network security teams, and much more, to help protect every person attending the event and their data.

Figure 1: The re:Invent Command Post

Security is a team sport

At Amazon, our physical security and information security (logical) teams work together to secure our customers, employees, and infrastructure across our diverse range of businesses at scale against a wide range of threats. At large events such as re:Invent, this integrated approach allows us to protect the many aspects of our event—from our attendees, to our on-site computers and servers, to our Wi-Fi network and its users—as comprehensively as possible.

Amazon doesn’t work alone, either. Our event security teams coordinate with Las Vegas Metropolitan Police and over 40 different agencies, including counterterrorism, bomb squad personnel, and first responders.

Figure 2: K9 units – valued members of our onsite security team

These teams are co-located in the Command Post—the nerve center of our security operations. Here, physical and logical security converge as nearly every element of our security footprint comes together, and we monitor the event for threats in real-time. This includes our event security management teams, our intelligence team, and our CCTV camera operators, alongside local law enforcement and emergency management services. As an added layer of protection, we also operate a dedicated Wireless Security Operations Center (WiSOC) in close coordination with our main Command Post, which serves as the primary hub for our wireless and cybersecurity teams.

Fostering open dialogue and information-sharing is critical for effective collaboration to secure re:Invent. And as the threat landscape continues to evolve, organizations must prioritize closing the gap between physical and logical security. Not only is this integrated approach the key to effectively securing a city-sized event such as re:Invent, but it also helps us protect our customers, employees, and company every day.

City-scale security

We deploy a number of integrated security measures at re:Invent to protect our physical and digital assets. When it comes to physical security, the primary concern is, of course, human safety. At re:Invent, we deploy thousands of security personnel, including guards, K9 units, and first responders to help respond to and assist with any issues, such as medical events, fires, theft, or overcrowding. We have CCTV cameras stationed in high-traffic areas and implement strict access control measures, including walkthrough screening detectors at entry points and a robust credentialing system, to create a safe and secure environment for our attendees.

We also have help from drones. The automated, high-flying craft provide a bird’s eye view at re:Play—the culminating concert at the Las Vegas Festival Grounds—and help coordinate responses to issues. Using AWS cloud solutions, live footage is streamed directly to our onsite security teams to monitor crowd flow.

Figure 3: A security team member showcases a drone used to help secure re:Play

We’re also focused on the security of our network, which in turn protects its users, our attendees. Our wireless and cybersecurity teams work to identify anomalous activity across our network, including signs of spoofing, a tactic where actors set up lookalike Wi-Fi networks in an attempt to lure attendees into connecting to their network instead of ours.

Amazon also secures the presentations given by re:Invent’s cloud computing and AI experts, executives, and engineers. To have confidence in sharing their insights, speakers must know that their talks run on secure, uninterrupted channels streaming to hundreds of thousands of viewers around the world. Our re:Invent mobile app is built with security in mind, too, so attendees have a safe place to manage events and in-conference needs.

Our integrated approach to security is made possible by the AWS Cloud, which helps us support the different components of our security operation and share critical information rapidly. Whether we’re facing a logical security threat, physical security concern, or a wellness incident, our success hinges on our response time—and running our operations in the AWS Cloud enables us to move quickly.

Amazon will continue investing in and strengthening our unified approach to help make sure that, no matter the vector of the threat, our teams will have a cohesive, unified response. We’re proud to be a leader in this space and hope our learnings can help others enhance their own security resilience, both inside and outside of events.

For more about this year’s re:Invent, see:

If you have feedback about this post, submit comments in the Comments section below.

Steve Schmidt

Steve is the chief security officer for Amazon and has been with the company since February 2008. He leads the information security, physical security, security engineering, and regulatory program teams. From 2010 to 2022, Steve was the chief information security officer for AWS. Prior to joining Amazon, Steve had an extensive career at the FBI, where he served as a senior executive.