Tag Archives: Best practices

Making sense of secrets management on Amazon EKS for regulated institutions

2024-08-19 Piyush Mattoo

Post Syndicated from Piyush Mattoo original https://aws.amazon.com/blogs/security/making-sense-of-secrets-management-on-amazon-eks-for-regulated-institutions/

Amazon Web Services (AWS) customers operating in a regulated industry, such as the financial services industry (FSI) or healthcare, are required to meet their regulatory and compliance obligations, such as the Payment Card Industry Data Security Standard (PCI DSS) or Health Insurance Portability and Accountability Act (HIPPA).

AWS offers regulated customers tools, guidance and third-party audit reports to help meet compliance requirements. Regulated industry customers often require a service-by-service approval process when adopting cloud services to make sure that each adopted service aligns with their regulatory obligations and risk tolerance. How financial institutions can approve AWS services for highly confidential data walks through the key considerations that customers should focus on to help streamline the approval of cloud services. In this post we cover how regulated customers, especially FSI customers, can approach secrets management on Amazon Elastic Kubernetes Service (Amazon EKS) to help meet data protection and operational security requirements. Amazon EKS gives you the flexibility to start, run, and scale Kubernetes applications in the AWS Cloud or on-premises.

Applications often require sensitive information such as passwords, API keys, and tokens to connect to external services or systems. Kubernetes has secrets objects for managing these types of sensitive information. Additional tools and approaches have evolved to supplement the Kubernetes Secrets to help meet the compliance requirements of regulated organizations. One of the driving forces behind the evolution of these tools for regulated customers is that the native Kubernetes Secrets values aren’t encrypted but encoded as base64 strings; meaning that their values can be decoded by a threat actor with either API access or authorization to create a pod in a namespace containing the secret. There are options such as GoDaddy Kubernetes External Secrets, AWS Secrets and Configuration Provider (ASCP) for the Kubernetes Secrets Store CSI Driver, Hashicorp Vault, and Bitnami Sealed secrets that you can use to can help to improve the security, management, and audibility of your secrets usage.

In this post, we cover some of the key decisions involved in choosing between External Secrets Operator (ESO), Sealed Secrets, and ASCP for the Kubernetes Secrets Store Container Storage Interface (CSI) Driver, specifically for FSI customers with regulatory demands. These decision points are also broadly applicable to customers operating in other regulated industries.

AWS Shared Responsibility Model

Security and compliance is a shared responsibility between AWS and the customer. The AWS Shared Responsibility Model describes this as security of the cloud and security in the cloud:

AWS responsibility – Security of the cloud: AWS is responsible for protecting the infrastructure that runs the services offered in the AWS Cloud. For Amazon EKS, AWS is responsible for the Kubernetes control plane, which includes the control plane nodes and etcd database. Amazon EKS is certified by multiple compliance programs for regulated and sensitive applications. The effectiveness of the security controls are regularly tested and verified by third-party auditors as part of the AWS compliance programs.
Customer responsibility – Security in the cloud: Customers are responsible for the security and compliance of customer configured systems and services deployed on AWS. This includes responsibility for securely deploying, configuring and managing ESO within their Amazon EKS cluster. For Amazon EKS, the customer responsibility depends upon the worker nodes you pick to run your workloads and cluster configuration as shown in Figure 1. In the case of Amazon EKS deployment using Amazon Elastic Compute Cloud (Amazon EC2) hosts, the customer responsibility includes the following areas:
- The security configuration of the data plane, including the configuration of the security groups that allow traffic to pass from the Amazon EKS control plane into the customer virtual private cloud (VPC).
- The configuration of the nodes and the containers themselves.
- The nodes’ operating system, including updates and security patches.
- Other associated application software:
  - Setting up and managing network controls, such as firewall rules.
  - Managing platform-level identity and access management, either with or in addition to AWS Identity and Access Management (IAM).
- The sensitivity of your data, such as personally identifiable information (PII), keys, passwords, and tokens
  - Customers are responsible for enforcing access controls to protect their data and secrets.
  - Customers are responsible for monitoring and logging activities related to secrets management including auditing access, detecting anomalies and responding to security incidents.
- Your company’s requirements, applicable laws and regulations
- When using AWS Fargate, the operational overhead for customers is reduced in the following areas:
  - The customer is not responsible for updating or patching the host system.
  - Fargate manages the placement and scaling of containers.

Figure 1: AWS Shared Responsibility Model with Fargate and Amazon EC2 based workflows

As an example of the Shared Responsibility Model in action, consider a typical FSI workload accepting or processing payments cards and subject to PCI DSS requirements. PCI DSS v4.0 requirement 3 focuses on guidelines to secure cardholder data while at rest and in transit:

Control ID	Control description
3.6	Cryptographic keys used to protect stored account data are secured.
3.6.1.2	Store secret and private keys used to encrypt and decrypt cardholder data in one (or more) of the following forms: Encrypted with a key-encrypting key that is at least as strong as the data-encrypting key, and that is stored separately from the data-encrypting key. Stored within a secure cryptographic device (SCD), such as a hardware security module (HSM) or PTS-approved point-of-interaction device. Has at least two full-length key components or key shares, in accordance with an industry-accepted method. Note: It is not required that public keys be stored in one of these forms.
3.6.1.3	Access to cleartext cryptographic key components is restricted to the fewest number of custodians necessary.

NIST frameworks and controls are also broadly adopted by FSI customers. NIST Cyber Security Framework (NIST CSF) and NIST SP 800-53 (Security and Privacy Controls for Information Systems and Organizations) include the following controls that apply to secrets:

Regulation or framework	Control ID	Control description
NIST CSF	PR.AC-1	Identities and credentials are issued, managed, verified, revoked, and audited for authorized devices, users and processes.
NIST CSF	PR.DS-1	Data-at-rest is protected.
NIST 800-53.r5	AC-2(1) AC-3(15)	Secrets should have automatic rotation enabled. Delete unused secrets.

Based on the preceding objectives, the management of secrets can be categorized into two broad areas:

Identity and access management ensures separation of duties and least privileged access.
Strong encryption, using a dedicated cryptographic device, introduces a secure boundary between the secrets data and keys, while maintaining appropriate management over the cryptographic keys.

Choosing your secrets management provider

To help choose a secrets management provider and apply compensating controls effectively, in this section we evaluate three different options based on the key objectives derived from the PCI DSS and NIST controls described above and other considerations such as operational overhead, high availability, resiliency, and developer or operator experience.

Architecture and workflow

The following architecture and component descriptions highlight the different architectural approaches and responsibilities of each solution’s components, ranging from controllers and operators, command-line interface (CLI) tools, custom resources, and CSI drivers working together to facilitate secure secrets management within Kubernetes environments.

External Secrets Operator (ESO) extends the Kubernetes API using a custom resource definition (CRD) for secret retrieval. ESO enables integration with external secrets management systems such as AWS Secrets Manager, HashiCorp Vault, Google Secrets Manager, Azure Key Vault, IBM Cloud Secrets Manager, and various other systems. ESO watches for changes to an external secret store and keeps Kubernetes secrets in sync. These services offer features that aren’t available with native Kubernetes Secrets, such as fine-grained access controls, strong encryption, and automatic rotation of secrets. By using these purpose-built tools outside of a Kubernetes cluster, you can better manage risk and benefit from central management of secrets across multiple Amazon EKS clusters. For more information, see the detailed walkthrough of using ESO to synchronize secrets from Secrets Manager to your Amazon EKS Fargate cluster.

ESO is comprised of a cluster-side controller that automatically reconciles the state within the Kubernetes cluster and updates the related secrets anytime the external API’s secret undergoes a change.

Figure 2: ESO workflow

Sealed Secrets is an open source project by Bitnami comprised of a Kubernetes controller coupled with a client-side CLI tool with the objective to store secrets in Git in a secure fashion. Sealed Secrets encrypts your Kubernetes secret into a SealedSecret, which can also be deployed to a Kubernetes cluster using kubectl. For more information, see the detailed walkthough of using tools from the Sealed Secrets open source project to manage secrets in your Amazon EKS clusters.

Sealed Secrets comprises of three main components: First, there is an operator or a controller which is deployed onto a Kubernetes cluster. The controller is responsible for decrypting your secrets. Second, you have a CLI tool called Kubeseal that takes your secret and encrypts it. Third, you have a CRD. Instead of creating regular secrets, you create SealedSecrets, which is a CRD defined within Kubernetes. That is how the operator knows when to perform the decryption process within your Kubernetes cluster.

Upon startup, the controller looks for a cluster-wide private-public key pair and generates a new 4096-bit RSA public-private key pair if one doesn’t exist. The private key is persisted in a secret object in the same namespace as the controller. The public key portion of this is made publicly available to anyone wanting to use Sealed Secrets with this cluster.

Figure 3: Sealed Secrets workflow

The AWS Secrets Manager and Config Provider (ASCP) for Secret Store CSI driver is an open source tool from AWS that allows secrets from Secrets Manager and Parameter Store, a capability of AWS Systems Manager, to be mounted as files inside Amazon EKS pods. It uses a CRD called SecretProviderClass to specify which secrets or parameters to mount. Upon a pod start or restart, the CSI driver retrieves the secrets or parameters from AWS and writes them to a tmpfs volume mounted in the pod. The volume is automatically cleaned up when the pod is deleted, making sure that secrets aren’t persisted. For more information, see the detailed walkthrough on how to set up and configure the ASCP to work with Amazon EKS.

ASCP comprises of a cluster-side controller acting as the provider, allowing secrets from Secrets Manager, and parameters from Parameter Store to appear as files mounted in Kubernetes pods. Secrets Store CSI Driver is a DaemonSet with three containers: node-driver-registrar, which registers the CSI driver with Kubelet; secrets-store, which implements the CSI Node service gRPC services for mounting and unmounting volumes during pod creation and deletion; and liveness-probe, which monitors the health of the CSI driver and reports to Kubernetes for automatic issue detection and pod restart.

Figure 4: AWS Secrets Manager and configuration provider

In the next section, we cover some of the key decisions involved in choosing whether to use ESO, Sealed Secrets, or ASCP for regulated customers to help meet their regulatory and compliance needs.

Comparing ESO, Sealed Secrets, and ASCP objectives

All three solutions address different aspects of secure secrets management and aim to help FSI customers meet their regulatory compliance requirements while upholding the protection of sensitive data in Kubernetes environments.

ESO synchronizes secrets from external APIs into Kubernetes, targeting the cluster operator and application developer personas. The cluster operator is responsible for setting up ESO and managing access policies. The application developer is responsible for defining external secrets and the application configuration.

Sealed Secrets encrypts your Kubernetes secrets before storing them in version control systems such as public Git repositories. This is the case if you decide to check in your Kubernetes manifest to a Git repository granting access to your sensitive secrets to anyone who has access to the Git repository. This is ultimately the reason why Sealed Secrets was created and the sealed secret can be decrypted only by the controller running in the target cluster.

Using ASCP, you can securely store and manage your secrets in Secrets Manager and retrieve them through your applications running on Kubernetes without having to write custom code. Secrets Manager provides features such as rotation, auditing, and access control that can help FSI customers meet regulatory compliance requirements and maintain a robust security posture.

Installation

The deployment and configuration details that follow highlight the different approaches and resources used by each solution to integrate with Kubernetes and external secret stores, catering to the specific requirements of secure secrets management in containerized environments.

ESO provides Helm charts for ease of operator deployment. External Secrets provides custom resources like SecretStore and ExternalSecret for configuring the required operator functionality to synchronize external secrets to your cluster. For instance, SecretStore can be used by the cluster operator to be able to connect to AWS Secrets Manager using appropriate credentials to pull in the secrets.

To install Sealed Secrets, you can deploy the Sealed Secrets Controller onto the Kubernetes cluster. You can deploy the manifest by itself or you can use a Helm chart to deploy the Sealed Secrets Controller for you. After the controller is installed, you use the Kubeseal client-side utility to encrypt secrets using asymmetric cryptography. If you don’t already have the Kubeseal CLI installed, see the installation instructions.

ASCP provides Helm charts to assist in operator deployment. The ASCP operator provides custom resources such as SecretProviderClass to provide provider-specific parameters to the CSI driver. During pod start and restart, the CSI driver will communicate with the provider using gRPC to retrieve the secret content from the external secret store you specified in the SecretProviderClass custom resource. Then the volume is mounted in the pod as tmpfs and the secret contents are written to the volume.

Encryption and key management

These solutions use robust encryption mechanisms and key management practices provided by external secret stores and AWS services such as AWS Key Management Service (AWS KMS) and Secrets Manager. However, additional considerations and configurations might be required to meet specific regulatory requirements, such as PCI DSS compliance for handling sensitive data.

ESO relies on encryption features within the external secrets management system. For instance, Secrets Manager supports envelope encryption with AWS KMS which is FIPS 140-2 Level 3 certified. Secrets Manager has several compliance certifications making it a great fit for regulated workloads. FIPS 140-2 Level 3 ensures only strong encryption algorithms approved by NIST can be used to protect data. It also defines security requirements for the cryptographic module, creating logical and physical boundaries.

Both AWS KMS and Secrets Manager help you to manage key lifecycle and to integrate with other AWS Services. In terms of key rotation, both provide automatic rotation of secrets that runs on a schedule (which you define), and abstract the complexity of managing different versions of keys. For AWS managed keys, the key rotation happens automatically once every year by default. With customer managed keys (CMKs), automatic key rotation is available but not enabled by default.

When using SealedSecrets, you use the Kubeseal tool to convert a standard Kubernetes Secret into a Sealed Secrets resource. The contents of the Sealed Secrets are encrypted with the public key served by the Sealed Secrets Controller as described in the Sealed Secrets project homepage.

In the absence of cloud native secrets management integration, you might have to add compensating controls to achieve the regulatory standards required by your organization. In cases where the underlying SealedSecrets data is sensitive in nature, such as cardholder PII, PCI requires that you store sensitive secrets in a cryptographic device such as a hardware security module (HSM). You can use Secrets Manager to store the master key generated to seal the secrets. However, this you will have to enable additional integration with Amazon EKS APIs to fetch the master key securely from the EKS cluster. You will also have to modify your deployment process to use a master key from Secrets Manager. The applications running in the EKS cluster must have permissions to fetch the SealedSecret and master key from Secrets Manager. This might involve configuring the application to interact with Amazon EKS APIs and Secrets Manager. For non-sensitive data, Kubeseal can be used directly within the EKS cluster to manage secrets and sealing keys.

For key rotation, you can store the controller generated private key in Parameter Store as a SecureString. You can use the advanced tier in Parameter Store if the file containing the private keys exceeds the Standard tier limit of up to 4,096 characters. In addition, if you want to add key rotation, you can use AWS KMS.

The ASCP relies on encryption features within the chosen secret store, such as Secrets Manager. Secrets Manager supports integration with AWS KMS for an additional layer of security by storing encryption keys separately. The Secrets Store CSI Driver facilitates secure interaction with the secret store, but doesn’t directly encrypt secrets. Encrypting mounted content can provide further protection, but introduces operational overhead related to key management.

ASCP relies on Secrets Manager and AWS KMS for encryption and decryption capabilities. As a recommendation, you can encrypt mounted content to further protect the secrets. However, this introduces the additional operational overhead of managing encryption keys and addressing key rotation.

Additional considerations

These solutions address various aspects of secure secrets management, ranging from centralized management, compliance, high availability, performance, developer experience, and integration with existing investments, catering to the specific needs of FSI customers in their Kubernetes environments.

ESO can be particularly useful when you need to manage an identical set of secrets across multiple Kubernetes clusters. Instead of configuring, managing, and rotating secrets at each cluster level individually, you can synchronize your secrets across your clusters. This simplifies secrets management by providing a single interface to manage secrets across multiple clusters and environments.

External secrets management systems typically offer advanced security features such as encryption at rest, access controls, audit logs, and integration with identity providers. This helps FSI customers ensure that sensitive information is stored and managed securely in accordance with regulatory requirements.

FSI customers usually have existing investments in their on-premises or cloud infrastructure, including secrets management solutions. ESO integrates seamlessly with existing secrets management systems and infrastructure, allowing FSI customers to use their investment in these systems without requiring significant changes to their workflow or tooling. This makes it easier for FSI customers to adopt and integrate ESO into their existing Kubernetes environments.

ESO provides capabilities for enforcing policies and governance controls around secrets management such as access control, rotation policies, and audit logging when using services like Secrets Manager. For FSI customers, audits and compliance are critical and ESO verifies that access to secrets is tracked and audit trails are maintained, thereby simplifying the process of demonstrating adherence to regulatory standards. For instance, secrets stored inside Secrets Manager can be audited for compliance with AWS Config and AWS Audit Manager. Additionally, ESO uses role-based access control (RBAC) to help prevent unauthorized access to Kubernetes secrets as documented in the ESO security best practices guide.

High availability and resilience are critical considerations for mission critical FSI applications such as online banking, payment processing, and trading services. By using external secrets management systems designed for high availability and disaster recovery, ESO helps FSI customers ensure secrets are available and accessible in the event of infrastructure failure or outages, thereby minimizing service disruption and downtime.

FSI workloads often experience spikes in transaction volumes, especially during peak days or hours. ESO is designed to efficiently managed a large volume of secrets by using external secrets management that’s optimized for performance and scalability.

In terms of monitoring, ESO provides Prometheus metrics to enable fine-grained monitoring of access to secrets. Amazon EKS pods offer diverse methods to grant access to secrets present on external secrets management solutions. For example, in non-production environments, access can be granted through IAM instance profiles assigned to the Amazon EKS worker nodes. For production, using IAM roles for service accounts (IRSA) is recommended. Furthermore, you can achieve namespace level fine-grained access control by using annotations.

ESO also provides options to configure operators to use a VPC endpoint to comply with FIPS requirements.

Additional developer productivity benefits provided by ESO include support for JSON objects (Secret key/value in the AWS Management console) or strings (Plaintext in the console). With JSON objects, developers can programmatically update multiple values atomically when rotating a client certificate and private key.

The benefit of Sealed Secrets, as discussed previously, is when you upload your manifest to a Git repository. The manifest will contain the encrypted SealedSecrets and not the regular secrets. This assures that no one has access to your sensitive secrets even when they have access to your Git repository. Sealed Secrets offer a few benefits to developers in terms of developer experience. Sealed Secrets gives you access to manage your secrets, making them more readily available to developers. Sealed Secrets offers VSCode extension to assist in integrating it into the software development lifecycle (SDLC). Using Sealed Secrets, you can store the encrypted secrets in the version control systems such as Gitlab and GitHub. Sealed Secrets can reduce operational overhead related to updating dependent objects because whenever a secret resource is updated, the same update is applied to the dependent objects.

ASCP integration with the Kubernetes Secrets Store CSI Driver on Amazon EKS offers enhanced security through seamless integration with Secrets Manager and Parameter Store, ensuring encryption, access control, and auditing. It centralizes management of sensitive data, simplifying operations and reducing the risk of exposure. The dynamic secrets injection capability facilitates secure retrieval and injection of secrets into Kubernetes pods, while automatic rotation provides up-to-date credentials without manual intervention. This combined solution streamlines deployment and management, providing a secure, scalable, and efficient approach to handling secrets and configuration settings in Kubernetes applications.

Consolidated threat model

We created a threat model based on the architecture of the three solution offerings. The threat model provides a comprehensive view of the potential threats and corresponding mitigations for each solution, allowing organizations to proactively address security risks and ensure the secure management of secrets in their Kubernetes environments.

X = Mitigations applicable to the solution

Threat	Mitigations	ESO	Sealed Secrets	ASCP
Unauthorized access or modification of secrets	Implement least privilege access principles Rotate and manage credentials securely Enable RBAC and auditing in Kubernetes	X	X	X
Insider threat (for example, a rogue administrator who has legitimate access)	Implement least privilege access principles Enable auditing and monitoring Enforce separation of duties and job rotation	X	X
Compromise of the deployment process	Secure and harden the deployment pipeline Implement secure coding practices Enable auditing and monitoring		X
Unauthorized access or tampering of secrets during transit	Enable encryption in transit using TLS Implement mutual TLS authentication between components Use private networking or VPN for secure communication	X	X	X
Compromise of the Kubernetes API server because of vulnerabilities or misconfiguration	Secure and harden the Kubernetes API server Enable authentication and authorization mechanisms (for example, mutual TLS and RBAC) Keep Kubernetes components up-to-date and patched Enable Kubernetes audit logging and monitoring	X
Vulnerability in the external secrets controller leading to privilege escalation or data exposure	Keep the external secrets controller up-to-date and patched Regularly monitor for and apply security updates Implement least privilege access principles Enable auditing and monitoring	X
Compromise of the Secrets Store CSI Driver, node-driver-registrar, Secrets Store CSI Provider, kubelet, or Pod could lead to unauthorized access or exposure of secrets	Implement least privilege principles and role-based access controls Regularly patch and update the components Monitor and audit the component activities			X
Unauthorized access or data breach in Secrets Manager could expose sensitive secrets	Implement strong access controls and access logging for Secrets Manager Encrypt secrets at rest and in transit Regularly rotate and update secrets	X		X

Shortcomings and limitations

The following limitations and drawbacks highlight the importance of carefully evaluating the specific requirements and constraints of your organization before adopting any of these solutions. You should consider factors such as team expertise, deployment environments, integration needs, and compliance requirements to promote a secure and efficient secrets management solution that aligns with your organization’s needs.

ESO doesn’t include a default way to restrict network traffic to and from ESO using network policies or similar network or firewall mechanisms. The application team is responsible for properly configuring network policies to improve the overall security posture of ESO within your Kubernetes cluster.

Any time an external secret associated with ESO is rotated, you must restart the deployment that uses that particular external secret. Given the inherent risks associated with integrating an external entity or third-party solution into your system, including ESO, it’s crucial to implement a comprehensive threat model similar to the Kubernetes Admission Control Threat Model.

Also, ESO set up is complicated and the controller must be installed on the Kubernetes cluster.

SealedSecrets cannot be reused across namespaces unless they’re re-encrypted or made cluster-wide, which makes it challenging to manage secrets across multiple namespaces consistently. The need to manually rotate and re-encrypt SealedSecrets with new keys can introduce operational overhead, especially in large-scale environments with numerous secrets. The old sealing keys pose a potential risk of misuse by unauthorized users, which increases the risk. To mitigate both risks (high overhead and old secrets), you should implement additional controls such as deleting older keys as part of the key rotation process or periodically rotate sealing keys and make sure that old sealed secret resources are re-encrypted with the new keys. Sealed Secrets doesn’t support external secret stores such as HashiCorp Vault, or cloud provider services such as Secrets Manager, Parameter Store, or Azure Key Vault. Sealed Secrets requires a Kubeseal client-side binary to encrypt secrets. This can be a concern in FSI environments where client-side tools are restricted by security policies.

While ASCP provides seamless integration with Secrets Manager and Parameter Store, teams unfamiliar with these AWS services might need to invest some additional effort to fully realize the benefits. This additional effort is justified by the long-term benefits of centralized secrets management and access control provided by these services. Additionally, relying primarily on AWS services for secrets management can potentially limit flexibility in deploying to alternative cloud providers or on-premises environments in the future. These factors should be carefully evaluated based on the specific needs and constraints of the application and deployment environment.

Conclusion

We have provided a summary of three options for managing secrets in Amazon EKS, ESO, Sealed Secrets, and AWS Secrets and Configuration Provider (ASCP), and the key considerations for FSI customers when choosing between them. The choice depends on several factors including existing investments in secrets management systems, specific security needs and compliance requirements, preference for a Kubernetes native solution or willingness to accept vendor lock-in.

The guidance provided here covers the strengths, limitations, and trade-offs of each option, allowing regulated institutions to make an informed decision based on their unique requirements and constraints. This guidance can be adapted and tailored to fit the specific needs of an organization, providing a secure and efficient secrets management solution for their Amazon EKS workloads, while aligning with the stringent security and compliance standards of the regulated institutions.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Cloud infrastructure entitlement management in AWS

2024-08-14 Mathangi Ramesh

Post Syndicated from Mathangi Ramesh original https://aws.amazon.com/blogs/security/cloud-infrastructure-entitlement-management-in-aws/

Customers use Amazon Web Services (AWS) to securely build, deploy, and scale their applications. As your organization grows, you want to streamline permissions management towards least privilege for your identities and resources. At AWS, we see two customer personas working towards least privilege permissions: security teams and developers. Security teams want to centrally inspect permissions across their organizations to identify and remediate access-related risks, such as excessive permissions, anomalous access to resources or compliance of identities. Developers want policy verification tools that help them set effective permissions and maintain least privilege as they build their applications.

Customers are increasingly turning to cloud infrastructure entitlement management (CIEM) solutions to guide their permissions management strategies. CIEM solutions are designed to identify, manage, and mitigate risks associated with access privileges granted to identities and resources in cloud environments. While the specific pillars of CIEM vary, four fundamental capabilities are widely recognized: rightsizing permissions, detecting anomalies, visualization, and compliance reporting. AWS provides these capabilities through services such as AWS Identity and Access Management (IAM) Access Analyzer, Amazon GuardDuty, Amazon Detective, AWS Audit Manager, and AWS Security Hub. I explore these services in this blog post.

Rightsizing permissions

Customers primarily explore CIEM solutions to rightsize their existing permissions by identifying and remediating identities with excessive permissions that pose potential security risks. In AWS, IAM Access Analyzer is a powerful tool designed to assist you in achieving this goal. IAM Access Analyzer guides you to set, verify, and refine permissions.

After IAM Access Analyzer is set up, it continuously monitors AWS Identity and Access Management (IAM) users and roles within your organization and offers granular visibility into overly permissive identities. This empowers your security team to centrally review and identify instances of unused access, enabling them to take proactive measures to refine access and mitigate risks.

While most CIEM solutions prioritize tools for security teams, it’s essential to also help developers make sure that their policies adhere to security best practices before deployment. IAM Access Analyzer provides developers with policy validation and custom policy checks to make sure their policies are functional and secure. Now, they can use policy recommendations to refine unused access, making sure that identities have only the permissions required for their intended functions.

Anomaly detection

Security teams use anomaly detection capabilities to identify unexpected events, observations, or activities that deviate from the baseline behavior of an identity. In AWS, Amazon GuardDuty supports anomaly detection in an identity’s usage patterns, such as unusual sign-in attempts, unauthorized access attempts, or suspicious API calls made using compromised credentials.

By using machine learning and threat intelligence, GuardDuty can establish baselines for normal behavior and flag deviations that might indicate potential threats or compromised identities. When establishing CIEM capabilities, your security team can use GuardDuty to identify threat and anomalous behavior pertaining to their identities.

Visualization

With visualization, you have two goals. The first is to centrally inspect the security posture of identities, and the second is to comprehensively understand how identities are connected to various resources within your AWS environment. IAM Access Analyzer provides a dashboard to centrally review identities. The dashboard helps security teams gain visibility into the effective use of permissions at scale and identify top accounts that need attention. By reviewing the dashboard, you can pinpoint areas that need focus by analyzing accounts with the highest number of findings and the most commonly occurring issues such as unused roles.

Amazon Detective helps you to visually review individual identities in AWS. When GuardDuty identifies a threat, Detective generates a visual representation of identities and their relationships with resources, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Simple Storage Service (Amazon S3) buckets, or AWS Lambda functions. This graphical view provides a clear understanding of the access patterns associated with each identity. Detective visualizes access patterns, highlighting unusual or anomalous activities related to identities. This can include unauthorized access attempts, suspicious API calls, or unexpected resource interactions. You can depend on Detective to generate a visual representation of the relationship between identities and resources.

Compliance reporting

Security teams work with auditors to assess whether identities, resources, and permissions adhere to the organization’s compliance requirements. AWS Audit Manager automates evidence collection to help you meet compliance reporting and audit needs. These automated evidence packages include reporting on identities. Specifically, you can use Audit Manager to analyze IAM policies and roles to identify potential misconfigurations, excessive permissions, or deviations from best practices.

Audit Manager provides detailed compliance reports that highlight non-compliant identities or access controls, allowing your auditors and security teams to take corrective actions and support ongoing adherence to regulatory and organizational standards. In addition to monitoring and reporting, Audit Manager offers guidance to remediate certain types of non-compliant identities or access controls, reducing the burden on security teams and supporting timely resolution of identified issues.

Single pane of glass

While customers appreciate the diverse capabilities AWS offers across various services, they also seek a unified and consolidated view that brings together data from these different sources. AWS Security Hub addresses this need by providing a single pane of glass that enables you to gain a holistic understanding of your security posture. Security Hub acts as a centralized hub, consuming findings from multiple AWS services and presenting a comprehensive view of how identities are being managed and used across the organization.

Conclusion

CIEM solutions are designed to identify, manage, and mitigate risks associated with access privileges granted to identities and resources in cloud environments. The AWS services mentioned in this post can help you achieve your CIEM goals. If you want to explore CIEM capabilities in AWS, use the services mentioned in this post or see the following resources.

Resources

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Generating Accurate Git Commit Messages with Amazon Q Developer CLI Context Modifiers

2024-08-07 Ryan Yanchuleff

Post Syndicated from Ryan Yanchuleff original https://aws.amazon.com/blogs/devops/generating-accurate-git-commit-messages-with-amazon-q-developer-cli-context-modifiers/

Writing clear and concise Git commit messages is crucial for effective version control and collaboration. However, when working with complex projects or codebases, providing additional context can be challenging. In this blog post, we’ll explore how to leverage Amazon Q Developer to analyze our code changes for us and produce meaningful commit messages for Git.

Amazon Q is the most capable generative AI-powered assistant for accelerating software development and leveraging companies’ internal data. It assists developers and IT professionals with all their tasks—from coding, testing, and upgrading applications, to diagnosing errors, performing security scanning and fixes, and optimizing AWS resources. Amazon Q Developer has advanced, multistep planning and reasoning capabilities that can transform (for example, perform Java version upgrades) and implement new features generated from developer requests. Q Developer is available in the IDE, the AWS Console, and on the command line interface (CLI).

Overview of solution

With the Amazon Q Developer CLI, you can engage in natural language conversations, ask questions, and receive responses from Amazon Q Developer, all from your terminal’s command-line interface. One of the powerful features of the Amazon Q Developer CLI is its ability to integrate contextual information from your local development environment. A context modifier in the Amazon Q CLI is a special keyword that allows you to provide additional context to Amazon Q from your local development environment. This context helps Amazon Q better understand the specific use case you’re working on and provide more relevant and accurate responses.

The Amazon Q CLI supports three context modifiers:

@git: This modifier allows you to share your Git repository status with Amazon Q, including the current branch, staged and unstaged changes, and commit history.
@env: By using this modifier, you can provide Amazon Q with your local shell environment variables, which can be helpful for understanding your development setup and configuration.
@history: This modifier enables you to share your recent shell command history with Amazon Q, giving it insights into your actions and the context in which you’re working

By using these context modifiers, you can enhance Amazon Q’s understanding of your specific use case, enabling it to provide more relevant and context-aware responses tailored to your local development environment.

Now let’s dive deeper into how we can use the @git context modifier to craft better Git commit messages. By incorporating the @git context modifier, you can provide additional details about the changes made to your Git repository, such as the affected files, branches, and other Git-related metadata. This not only improves code comprehension but also facilitates better collaboration within your team. We’ll walk through practical examples and best practices, equipping you with the knowledge to take your Git commit messages to the next level using the @git context modifier.

Prerequisites

For this walkthrough, you should have the following prerequisites:

A code base versioned with git
Amazon Q Developer CLI (OSX only): Install the Amazon Q Developer CLI by following the instructions provided in the Amazon Q Developer documentation. This may involve downloading and installing a package or using a package manager like pip or npm.
Amazon Q Developer Subscription: Subscribe to the Amazon Q Developer service. This can be done through the AWS Management Console or by following the instructions in the Amazon Q Developer documentation.

Walkthrough

Open a terminal and navigate to the directory that contains your git project
From within your project directory, run the git status command to view which files have been modified or added since your last commit. Any untracked files (new files) will appear in red, and modified files will be shown in green.

Figure 1 – git status execution from within your project director
Use the git add command to stage the files you want to commit. For example, git add app.py requirements.txt perm_policies/explicit_dependencies_stack.py will stage the specific files, or git add . will add all modified and untracked files in the current directory recursively.
After staging your files, use the q chat command to generate commit message using the @git context modifier. From within the Q Developer Chat context, attach @git to the end of your prompt to engage the context modifier.

Figure 2 – Amazon Q chat interaction using @git context modifier to generate a commit message
Copy the generated commit message and exit Amazon Q Developer chat
Commit your changes using `git commit”
Paste your commit message in default editor to create a new commit with the staged changes.

Figure 3 – paste copied git commit message
Finally, use `git push` to upload your local commits to the remote repository, allowing others to access your changes.

Conclusion

In this post, we looked at how to maximize your productivity by using Amazon Q Developer. Using the @git context modifier in Amazon Q Developer CLI enables you to enrich your Git commit messages with relevant details about the changes made to your codebase. Clear and informative commit messages are essential for effective collaboration and code maintenance. By leveraging this powerful feature, you can provide valuable context, such as affected files, branches, and other Git metadata, making it easier for team members to understand the scope and purpose of each commit.

To continue improving your software lifecycle management (SLCM) you can also check out other Amazon Q Developer capabilities, such as code analysis, debugging, and refactoring suggestions. Finally, stay tuned for upcoming Amazon Q Developer features and enhancements that could further streamline your development processes.
Learn more and get started with the Amazon Q Developer Free Tier.

Implementing Identity-Aware Sessions with Amazon Q Developer

2024-08-07 Ryan Yanchuleff

Post Syndicated from Ryan Yanchuleff original https://aws.amazon.com/blogs/devops/implementing-identity-aware-sessions-with-amazon-q-developer/

“Be yourself; everyone else is already taken.”
-Oscar Wilde

In the real world as in the world of technology and authentication, the ability to understand who we are is important on many levels. In this blog post, we’ll look at how the ability to uniquely identify ourselves in the AWS console can lead to a better overall experience, particularly when using Amazon Q Developer. We explore the features that become available to us when Q Developer can uniquely identify our sessions in the console and match them with our subscriptions and resources. We’ll look at how we can accomplish this goal using identity-aware sessions, a capability now available with AWS IAM Identity Center. Finally, we’ll walk through the steps necessary to enable it in your AWS Organization today.

Amazon Q Developer is a generative AI-powered assistant for software development. Accessible from multiple contexts including the IDE, the command line, and the AWS Management Console, the service offers two different pricing tiers: free and Pro. In this post, we’ll explore how to use Q Developer Pro in the AWS Console with identity-aware sessions. We’ll also explore the recently introduced ability to chat about AWS account resources within the Q Developer Chat window in the AWS Console to inquire about resources in an AWS account when identity-aware sessions are enabled.

Connecting your corporate source of identities to IAM Identity Center creates a shared record of your workforce and users’ group associations. This allows AWS applications to interact with one another efficiently on behalf of your users because they all reference this shared record of attributes. As a result, users have a consistent, continuous experience across AWS applications. Once your source of identities is connected to IAM Identity Center, your identity provider administrator can decide which users and groups will be synchronized with Identity Center. Your Amazon Q Developer administrator sees these synchronized users and groups only within Amazon Q Developer and can assign Q Developer Pro subscriptions to them.

User Identity in the AWS Console

To access the AWS Console, you must first obtain an IAM session – most commonly by using Identity Center Access Portal, IAM federation, or IAM (or root) users. Users can also use IAM Identity Center or a third party federated login mechanism. In this post, we’ll be using Microsoft Entra ID, but many other providers are available. Of all these options, however, only logging in with IAM Identity Center provides us with enough context to uniquely identity the user automatically by default. Identity-aware sessions will make this work.

Using Q Developer with a direct IAM Identity Center connection

Figure 1: Logging into an AWS account via the IAM Identity Center enables Q Developer to match the user with an active Pro subscription.

To meet customers where they are and allow them to build on their existing configurations, IAM Identity Center includes a mechanism that allows users to obtain an identity-aware session to access Q in the Console, regardless of how they originally logged in to the Console.

Let’s look at a real-world example to explore how this might work. Let’s assume our organization is currently using Microsoft Entra ID alongside AWS Organizations to federate our users into AWS accounts. This grants them access to the AWS console for accounts in our AWS Organization and enables our users to be assigned IAM roles and permissions. While secure, this access method does not allow Q Developer to easily associate the user with their Entra ID identity and to match them to a Q Developer subscription.

Using Q Developer with a 3P Identity Provider and IAM Identity Center

Figure 2: Using Entra ID, the user is federated into the AWS account and assumes an IAM role without further context in the console. Q Developer can obtain that context by authenticating the user with identity-aware sessions. This process is first attempted manually before prompting the user for credentials

To provide identity-aware sessions to these users, we can enable IAM Identity Center for the Organization and integrate it with our Entra ID instance. This allows us to sync our users and groups from Entra ID and assign them to subscriptions in our AWS Applications such as Amazon Q Developer.

We then go one step further and enable identity-aware sessions for our Identity Center instance. Identity-aware sessions allow Amazon Q to access user’s unique identifier in Identity Center so that it can then look up a user’s subscription and chat history. When the user opens the Console Chat, Q Developer checks whether the current IAM session already includes a valid identity-aware context. If this context is not available, Q will then verify the account is part of an Organization and has an IAM Identity Center instance with identity-aware sessions enabled. If so, it will prompt the user to authenticate with IAM Identity Center. Otherwise, the chat will throw an error.

With a valid Q Developer Pro subscription now verified, the user’s interactions with the Q Chat window will include personalization such as access to prior chat history, the ability to chat about AWS account resources, and higher request limits for multiple capabilities included with Q Developer Pro. This will persist with the user for the duration of their AWS Console session.

Configuring Identity-Aware Sessions

Identity-aware sessions are only available for instances of IAM Identity Center deployed at the AWS Organization level. (Account-level instances of IAM Identity Center do not support this feature). Once IAM Identity Center is configured, the option to enable Identity-aware sessions needs to be manually selected. (NOTE: This is a one-way door option which, once enabled, cannot be disabled. For more information about prerequisites and considerations for this feature, you can review the documentation here.)

To begin, verify that you have enabled AWS Organizations across your accounts. Once you have completed this, you are ready to enable IAM Identity Center and enable identity-aware sessions. The steps below should be completed by a member of your infrastructure administration team.

For customers who already have an Organization-based instance of IAM Identity Center configured, skip to Step 4 below. For those organizations who would like to read more about IAM Identity Center before completing the following steps, you can find details in the documentation available here.

Walkthrough

From within the management account or security account configured in your AWS Configuration, access the AWS Console and navigate to the AWS IAM Identity Center in the region where you wish to deploy your organization’s Identity Center instance.
Choose the “Enable” option where you will be presented with an option to setup Identity Center at the Organization level or as a single account instance. Choose the “Enable with AWS Organizations” to have access to identity-aware sessions.

Choose the IAM Identity Center Context - Organization vs Account

After Identity Center has been enabled, navigate to the “Settings” page from the left-hand navigation menu. Note that under the “Details” section, the “Identity-aware sessions” option is currently marked as “Disabled”.

IAM Identity Center General Settings - Identity-Aware Sessions disabled by default

Choose the “Enable” option from the Details section or select it from the blue prompt below the Details section.

Identity-Aware Info Prompt allows users to enable

Choose “Enable” from the popup box that appears to confirm your choice.

Identity-Aware Confirmation Prompt

Once IAM Identity Center is enabled and Identity-aware sessions are enabled, you can then proceed by either creating a user manually in Identity Center to log in with, or by connecting your Identity Center instance to a third-party provider like Entra ID, Ping, or Okta. For more information on how to complete this process, please see the documentation for the various third-party providers available.
If you don’t have Q Developer enabled, you will want to do so now. From within the AWS Console, using the search bar navigate to the Amazon Q Developer service. As a best practice, we recommend configuring Q Developer in your management account.

Amazon Q Developer Home Screen - Control panel to add subscriptions

Begin by clicking the “Subscribe to Amazon Q” button to enable Q Developer in your account. You will see a green check denoting that Q has successfully been paired with IAM Identity Center.

Amazon Q Developer Paired with IAM Identity Center - Info Notice

Choose “Subscribe” to enable Q Developer Pro.

Amazon Q Developer Pro Subscription Info Panel

Enable Q Developer Pro in the popup prompt

Amazon Q Developer Pro Confirmation

From here, you can then assign users and groups from the Q Developer prompt or you may assign them from within the IAM Identity Center using the Amazon Q Application Profile.

IAM Identity Center - Q Developer Profile for Managed Applications

Once your users and groups have been assigned, they are now able to begin using Q Developer in both the AWS account console and their individual IDE’s.

Why Use Q Developer Pro?

In this final section, we’ll explore the benefits of using Amazon Q Developer Pro. There are three main areas of benefit:

Chat History

Q Developer Pro can store your chat history and restore it from previous sessions each time you begin. This enables you to develop a context within the chat about things that are relevant to your interests and in turn inform the feedback you receive from Q going forward.

Chat about your AWS account resources

Q Developer Pro can leverage your IAM permissions to make requests regarding resources and costs associated with your account (assuming you have the appropriate policies). This enables you to inquire about certain resources deployed in a given region, or ask questions about cost such as the overall EC2 spend in a given period of time.

Figure 4: From the Q Chat panel, you can inquire about resources deployed in your account. (This capability requires you to have the necessary permissions to view information about the requested resource.)

Personalization

Identity-aware sessions also enable you to benefit from custom settings in your Q Chat. For example, you can enable cross-region access for your Q Chat sessions which enable you to ask questions about resources in the current region but also all other regions in your account.

Conclusion

As a new feature of IAM Identity Center, identity-aware sessions enable an AWS Console user to access their Q Developer Pro subscription in the Q Chat panel. This provides them with richer conversations with Q Developer about their accounts and maintains those conversations over time with stored chat history. Enabling this feature involves no additional cost and only a single setting change in a configured IAM Identity Center organization instance. Once made, users will be able to benefit from the full feature set of Amazon Q Developer regardless of how they log into the account.

OpenSearch optimized instance (OR1) is game changing for indexing performance and cost

2024-08-07 Cedric Pelvet

Post Syndicated from Cedric Pelvet original https://aws.amazon.com/blogs/big-data/opensearch-optimized-instance-or1-is-game-changing-for-indexing-performance-and-cost/

Amazon OpenSearch Service securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search.

In this post, we examine the OR1 instance type, an OpenSearch optimized instance introduced on November 29, 2023.

OR1 is an instance type for Amazon OpenSearch Service that provides a cost-effective way to store large amounts of data. A domain with OR1 instances uses Amazon Elastic Block Store (Amazon EBS) volumes for primary storage, with data copied synchronously to Amazon Simple Storage Service (Amazon S3) as it arrives. OR1 instances provide increased indexing throughput with high durability.

To learn more about OR1, see the introductory blog post.

While actively writing to an index, we recommend that you keep one replica. However, you can switch to zero replicas after a rollover and the index is no longer being actively written.

This can be done safely because the data is persisted in Amazon S3 for durability.

Note that in case of a node failure and replacement, your data will be automatically restored from Amazon S3, but would be partially unavailable during the repair operation, so you should not consider it for cases where searches on non-actively written indices require high availability.

Goal

In this blog post, we’ll explore how OR1 impacts the performance of OpenSearch workloads.

By providing segment replication, OR1 instances save CPU cycles by indexing only on the primary shards. By doing that, the nodes are able to index more data with the same amount of compute, or to use fewer resources for indexing and thus have more available for search and other operations.

For this post, we’re going to consider an indexing-heavy workload and do some performance testing.

Traditionally, Amazon Elastic Compute Cloud (Amazon EC2) R6g instances are a high performant choice for indexing-heavy workloads, relying on Amazon EBS storage. Im4gn instances provide local NVMe SSD for high throughput and low latency disk writes.

We will compare OR1 indexing performance relative to these two instance types, focusing on indexing performance only for scope of this blog.

Setup

For our performance testing, we set up multiple components, as shown in the following figure:

Architecture diagram

For the testing process:

AWS Step Functions orchestrates an initialization step to clean up the environment and set up the index mapping and to run the batch testing.
AWS Batch runs parallel jobs to index log data in OpenTelemetry JSON format.
The jobs run a custom Rust program that generates randomized logs using the OpenSearch Rust Client with AWS Identity and Access Management (IAM) authentication.
The OpenSearch Service domain is set up with OpenSearch 2.11, two availability zones, fine-grained access control, encryption at rest using AWS Key Management Service (AWS KMS), and encryption in transit using TLS.

The index mapping, which is part of our initialization step, is as follows:

{
  "index_patterns": [
    "logs-*"
  ],
  "data_stream": {
    "timestamp_field": {
      "name": "time"
    }
  },
  "template": {
    "settings": {
      "number_of_shards": <VARYING>,
      "number_of_replicas": 1,
      "refresh_interval": "20s"
    },
    "mappings": {
      "dynamic": false,
      "properties": {
        "traceId": {
          "type": "keyword"
        },
        "spanId": {
          "type": "keyword"
        },
        "severityText": {
          "type": "keyword"
        },
        "flags": {
          "type": "long"
        },
        "time": {
          "type": "date",
          "format": "date_time"
        },
        "severityNumber": {
          "type": "long"
        },
        "droppedAttributesCount": {
          "type": "long"
        },
        "serviceName": {
          "type": "keyword"
        },
        "body": {
          "type": "text"
        },
        "observedTime": {
          "type": "date",
          "format": "date_time"
        },
        "schemaUrl": {
          "type": "keyword"
        },
        "resource": {
          "type": "flat_object"
        },
        "instrumentationScope": {
          "type": "flat_object"
        }
      }
    }
  }
}

As you can see, we’re using a data stream to simplify the rollover configuration and keep the maximum primary shard size under 50 GiB, as per best practices.

We optimized the mapping to avoid any unnecessary indexing activity and use the flat_object field type to avoid field mapping explosion.

For reference, the Index State Management (ISM) policy we used is as follows:

{
  "policy": {
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_primary_shard_size": "50gb"
            }
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      {
        "index_patterns": [
          "logs-*"
        ]
      }
    ]
  }
}

Our average document size is 1.6 KiB and the bulk size is 4,000 documents per bulk, which makes approximately 6.26 MiB per bulk (uncompressed).

Testing protocol

The protocol parameters are as follows:

Number of data nodes: 6 or 12
Jobs parallelism: 75, 40
Primary shard count: 12, 48, 96 (for 12 nodes)
Number of replicas: 1 (total of 2 copies)
Instance types (each with 16 vCPUs):
- or1.4xlarge.search
- r6g.4xlarge.search
- im4gn.4xlarge.search

Cluster	Instance type	vCPU	RAM	JVM size
or1-target	or1.4xlarge.search	16	128	32
im4gn-target	im4gn.4xlarge.search	16	64	32
r6g-target	r6g.4xlarge.search	16	128	32

Note that the im4gn cluster has half the memory of the other two, but still each environment has the same JVM heap size of approximately 32 GiB.

Performance testing results

For the performance testing, we started with 75 parallel jobs and 750 batches of 4,000 documents per client (a total 225 million documents). We then adjusted the number of shards, data nodes, replicas, and jobs.

Configuration 1: 6 data nodes, 12 primary shards, 1 replica

For this configuration, we used 6 data nodes, 12 primary shards, and 1 replica, we observed the following performance:

Cluster	CPU usage	Time taken	Indexing speed
or1-target	65-80%	24 min	156 kdoc/s	243 MiB/s
im4gn-target	89-97%	34 min	110 kdoc/s	172 MiB/s
r6g-target	88-95%	34 min	110 kdoc/s	172 MiB/s

Highlighted in this table, im4gn and r6g clusters have very high CPU usage, triggering admission control, which rejects document.

The OR1 shows a CPU below 80 percent sustained, which is a very good target.

Things to keep in mind:

In production, don’t forget to retry indexing with exponential backoff to avoid dropping unindexed documents because of intermittent rejections.
The bulk indexing operation returns 200 OK but can have partial failures. The body of the response must be checked to validate that all the documents were indexed successfully.

By reducing the number of parallel jobs from 75 to 40, while maintaining 750 batches of 4,000 documents per client (total 120M documents), we get the following:

Cluster	CPU usage	Time taken	Indexing speed
or1-target	25-60%	20 min	100 kdoc/s	156 MiB/s
im4gn-target	75-93%	19 min	105 kdoc/s	164 MiB/s
r6g-target	77-90%	20 min	100 kdoc/s	156 MiB/s

The throughput and CPU usage decreased, but the CPU remains high on Im4gn and R6g, while the OR1 is showing more CPU capacity to spare.

Configuration 2: 6 data nodes, 48 primary shards, 1 replica

For this configuration, we increased the number of primary shards from 12 to 48, which provides more parallelism for indexing:

Cluster	CPU usage	Time taken	Indexing speed
or1-target	60-80%	21 min	178 kdoc/s	278 MiB/s
im4gn-target	67-95%	34 min	110 kdoc/s	172 MiB/s
r6g-target	70-88%	37 min	101 kdoc/s	158 MiB/s

The indexing throughput increased for the OR1, but the Im4gn and R6g didn’t see an improvement because their CPU utilization is still very high.

Reducing the parallel jobs to 40 and keeping 48 primary shards, we can see that the OR1 gets a little more pressure as the minimum CPU increases from 12 primary shards, and the CPU for R6g looks much better. For the Im4gn however, the CPU is still high.

Cluster	CPU usage	Time taken	Indexing speed
or1-target	40-60%	16 min	125 kdoc/s	195 MiB/s
im4gn-target	80-94%	18 min	111 kdoc/s	173 MiB/s
r6g-target	70-80%	21 min	95 kdoc/s	148 MiB/s

Configuration 3: 12 data nodes, 96 primary shards, 1 replica

For this configuration, we started with the original configuration and added more compute capacity, moving from 6 nodes to 12 and increasing the number of primary shards to 96.

Cluster	CPU usage	Time taken	Indexing speed
or1-target	40-60%	18 min	208 kdoc/s	325 MiB/s
im4gn-target	74-90%	20 min	187 kdoc/s	293 MiB/s
r6g-target	60-78%	24 min	156 kdoc/s	244 MiB/s

The OR1 and the R6g are performing well with CPU usage below 80 percent, with OR1 giving 33 percent better performance with 30 percent less CPU usage compared to R6g.

The Im4gn is still at 90 percent CPU, but the performance is also very good.

Reducing the number of parallel jobs from 75 to 40, we get:

Cluster	CPU usage	Time taken	Indexing speed
or1-target	40-60%	11 min	182 kdoc/s	284 MiB/s
im4gn-target	70-90%	11 min	182 kdoc/s	284 MiB/s
r6g-target	60-77%	12 min	167 kdoc/s	260 MiB/s

Reducing the number of parallel jobs to 40 from 75 brought the OR1 and Im4gn instances on par and the R6g very close.

Interpretation

The OR1 instances speed up indexing because only the primary shards need to be written while the replica is produced by copying segments. While being more performant compared to Img4n and R6g instances, the CPU usage is also lower, which gives room for additional load (search) or cluster size reduction.

We can compare a 6-node OR1 cluster with 48 primary shards, indexing at 178 thousand documents per second, to a 12-node Im4gn cluster with 96 primary shards, indexing at 187 thousand documents per second or to a 12-node R6g cluster with 96 primary shards, indexing at 156 thousand documents per second.

The OR1 performs almost as well as the larger Im4gn cluster, and better than the larger R6g cluster.

How to size when using OR1 instances

As you can see in the results, OR1 instances can process more data at higher throughput rates. However, when increasing the number of primary shards, they don’t perform as well because of the remote backed storage.

To get the best throughput from the OR1 instance type, you can use larger batch sizes than usual, and use an Index State Management (ISM) policy to roll over your index based on size so that you can effectively limit the number of primary shards per index. You can also increase the number of connections because the OR1 instance type can handle more parallelism.

For search, OR1 doesn’t directly impact the search performance. However, as you can see, the CPU usage is lower on OR1 instances than on Im4gn and R6g instances. That enables either more activity (search and ingest), or the possibility to reduce the instance size or count, which would result in a cost reduction.

Conclusion and recommendations for OR1

The new OR1 instance type gives you more indexing power than the other instance types. This is important for indexing-heavy workloads, where you index in batch every day or have a high sustained throughput.

The OR1 instance type also enables cost reduction because their price for performance is 30 percent better than existing instance types. When adding more than one replica, price for performance will decrease because the CPU is barely impacted on an OR1 instance, while other instance types would have indexing throughput decrease.

Check out the complete instructions for optimizing your workload for indexing using this repost article.

About the author

Cédric Pelvet is a Principal AWS Specialist Solutions Architect. He helps customers design scalable solutions for real-time data and search workloads. In his free time, his activities are learning new languages and practicing the violin.

Using Amazon APerf to go from 50% below to 36% above performance target

2024-08-07 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/using-amazon-aperf-to-go-from-50-below-to-36-above-performance-target/

This post is written by Tyler Jones, Senior Solutions Architect – Graviton, AWS.

Performance tuning the Renaissance Finagle-http benchmark

Sometimes software doesn’t perform the way it’s expected to across different systems. This can be due to a configuration error, code bug, or differences in hardware performance. Amazon APerf is a powerful tool designed to help identify and address performance issues on AWS instances and other computers. APerf captures comprehensive system metrics simultaneously and then visualizes them in an interactive report. The report allows users to analyze metrics such as CPU usage, interrupts, memory usage, and CPU core performance counters (PMU) together. APerf is particularly useful for performance tuning workloads across different instance types, as it can generate side-by-side reports for easy comparison. APerf is valuable for developers, system administrators, and performance engineers who need to optimize application performance on AWS. From here on we use the Renaissance benchmark as an example to demonstrate how APerf is used to debug and find performance bottlenecks.

The example

The Renaissance finagle-http benchmark was unexpectedly found to run 50% slower on a c7g.16xl Graviton3 than on a reference instance, both initially using the Linux-5.15.60 kernel. This is unexpected behavior.Graviton3 should be performing as good or better than our reference instance as it does for other Java based workloads. It’s likely there is a configuration problem somewhere. The Renaissance finagle-http benchmark is written in Scala but produces Java byte code, so our investigation will focus on the Java JVM as well as system-level configurations.

Overview

System performance tuning is an iterative process that is conducted in two main phases, the first focuses on overall system issues, and the second focuses on CPU core bottlenecks. APerf is used to assist in both phases.

APerf can render several instances’ data in one report, side by side, typically a reference system and the system to be tuned. The reference system provides values to compare against. In isolation, metrics are harder to evaluate. A metric may be acceptable in general but the comparison to the reference system makes it easier to spot room for improvement.

APerf helps to identify unusual system behavior, such as high interrupt load, excessive I/O wait, unusual network layer patterns, and other such issues in the first phase. After adjustments are made to address these issues, for example by modifying JVM flags, the second phase starts. Using the system tuning of the first phase, fresh APerf data is collected and evaluated with a focus on CPU core performance metrics.

Any inferior metric of the SUT CPU core, as compared to the reference system, holds potential for improvement. In the following section we discuss the two phases in detail.
For more background on system tuning, refer to the Performance Runbook in the AWS Graviton Getting Started guide.

Initial data collection

Here is an example for how 240 seconds of system data is collected with APerf:

#enable PMU access
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
#APerf has to open more than the default limit of files.
ulimit -n 65536
#usually aperf would be run in another terminal. 
#For illustration purposes it is send to the background here
./aperf record --run-name finagle_1 --period 240 &
#With 64 CPUs it takes APerf a little less than 15s to report readiness 
#for data collection.
sleep 15
java -jar renaissance-gpl-0.14.2.jar -r 8 finagle-http

The APerf report is generated as follows:

./aperf report --run finagle_1

Then, the report can be viewed with a web browser:

firefox aperf_report_finagle_1/index.html

The APerf report can render several data sets in the same report, side-by-side.

./aperf report --run finagle_1_c7g --run finagle_1_reference --run ...

Note that it is crucial to examine the CPU usage over time shown in APerf. Some data may be gathered while the system is idle. The metrics during idle times have no significant value.

First phase: system level

In this phase we look for differences on the system level. To do this APerf collects data during runs of finagle-http on c7g.16xl and the reference. The reference system provides the target numbers to compare against. Any large difference warrants closer inspection.

The first differences can be seen in the following figure.

The APerf CPU usage plot shows larger drops on Graviton3 (highlighted in red) at the beginning of each run than on the reference instance.

Figure 1: CPU usage. c7g.16xl on the left, reference on the right.

The log messages about the GC runtime hint at a possible reason as they coincide with the dips in CPU usage.

====== finagle-http (web) [default], iteration 0 started ======
GC before operation: completed in 32.692 ms, heap usage 87.588 MB → 31.411 MB.
====== finagle-http (web) [default], iteration 0 completed (6534.533 ms) ======

The JVM tends to spend a significant amount of time in garbage collection, during which time it has to suspend all threads, and choosing a different GC may have a positive impact.

The default GC on OpenJDK17 is G1GC. Using parallelGC is an alternative given that the instances have 64 CPUs, and thus GC can be performed highly parallel. The Graviton Getting Started guide also recommends checking the GC log when working on Java performance issues. A cross check using the JVM’s -Xlog:gc option confirms the reduced GC time with parallelGC.

The second difference is evident in the following figure, the CPU-to-CPU interrupts (IPI). There is more than 10x higher activity on Graviton 3, which means additional IRQ work c7g.16xl on which the reference system does not have to spend CPU cycles.

Figure 2: IPI0/RES Interrupts. c7g.16xl on the left, reference on the right.

Grepping through kernel commit messages can help find patches that address a particular issue, such as IPI inefficiencies.

This scheduling patch improves performance by 19%. Another IPI patch provides an additional 1% improvement. Switching to Linux-5.15.61 allows us to use these IPI improvements. The following figure shows the effect in APerf.

Figure 3: IPI0 Interrupts (c7g.16xl)

Second phase: CPU core level

Now that the system level issues are mitigated, the focus is on the CPU cores. The data collected by APerf shows PMU data where the reference instance and Graviton3 differ significantly.

PMU metric	Graviton3	Reference system
Branch Prediction Misses/1000 Instructions	16	4
Instruction Translation Lookaside Buffer (TLB) Misses/1000 Instructions	8.3	4
Instructions Per Clock cycle	0.6	0.6*

*Note that reference has a 20% higher CPU clock than c7g.16xl. As a rule of thumb, instructions per clock (IPC) multiplied by clock rate equals work done by a CPU.

Addressing CPU Core bottlenecks

The first improvement in PMU metrics stems from the parallelGC option. Although the intention was to increase CPU usage, the following figure shows lowered branch miss counts as well. Limiting the JIT tiered compilation to only use C2 mode helps branch prediction by reducing branch indirection and increasing the locality of executed code. Finally adding Transparent Huge Pages helps branch prediction logic and avoids lengthy address translation look-ups in DDR memory. The following graphs show the effects of the chosen JVM options.

Branch misses/1000 instructions (c7g.16xl)

Figure 4: Branch misses per 1k instructions

JVM Options from left to right:

-XX:+UseParallelGC
-XX:+UseParallelGC -XX:-TieredCompilation
-XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages

With the options listed under the preceding figure, APerf shows the branch prediction miss rate decreasing from the initial 16 to 11. Branch mis-predictions incur significant performance penalties as they result in wasted cycles spent computing results that ultimately need to be discarded. Furthermore, these mis-predictions cause the prefetching and cache subsystems to fail to load the necessary subsequent instructions into cache. Consequently, costly pipeline stalls and frontend stalls occur, preventing the CPU from executing instructions.

Code sparsity

Figure 5: Code sparsity

JVM Options from left to right:

-XX:+UseParallelGC
-XX:+UseParallelGC -XX:-TieredCompilation
-XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages

Code sparsity is a measure of how compact the instruction code is packed and how closely related code is placed. This is where turning off tiered compilation shows its effect. Lower sparsity helps branch prediction and the cache subsystem.

Instruction TLB misses/1000 instructions (c7g.16xl)

Figure 6: Instruction TLB misses per 1k Instructions

JVM Options from left to right:

-XX:+UseParallelGC
-XX:+UseParallelGC -XX:-TieredCompilation
-XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages

The big decrease in TLB misses is caused by the use of transparent huge pages, which increase the likelihood that a virtual address translation is present in the TLB, since fewer entries are needed. Translation table walks are avoided that otherwise need to traverse entries in DDR memory that cost hundreds of CPU cycles to read.

Instructions per clock cycle (IPC)

Figure 7: Instructions per clock cycle

JVM Options from left to right:

-XX:+UseParallelGC
-XX:+UseParallelGC -XX:-TieredCompilation
-XX:+UseParallelGC -XX:-TieredCompilation -XX:+UseTransparentHugePages

The preceding figure shows IPC increasing from 0.58 to 0.71 as JVM flags are added.

Results

This table summarizes the measures taken for performance improvement and their results.

JVM option set	baseline	1	2	3
IPC	0.6	0.6	0.63	0.71
Branch misses/1000 instructions	16	14	12	11
ITLB misses/1000 instructions	8.3	7.7	8.8	1.1
Benchmark runtime [ms]	6000	3922	3843	3512
execution time improvement vs baseline

+parallelGC
+parallelGC -tieredCompilation
+parallelGC -tieredCompilation +UseTransparentHugePages

Re-examining the setup of our testing enviornment: Changing where the load-generator lives

With the preceding efforts, c7g.16xl is within 91% of the reference system. The c7g.16xl branch prediction miss rate is still higher at 11 than the references at 4. As shown in the preceding figure, reduced branch prediction misses have a strong positive effect on performance. What follows is an experiment to achieve parity or better with the reference system based on the reduction of branch prediction misses.

Finagle-http serves HTTP requests generated by wrk2, which is a load generator implemented in C. The expectation is that the c7g.16xl branch predictor works better with the native wrk2 binary, unlike the Renaissance load generator, which is executing on the JVM. The wrk2 load generator and the finagle-http are assigned through taskset to separate sets of CPUs: 16 CPUs for wrk2 and 48 CPUs for finagle-http. The idea here is to have the branch predictors on these CPU sets focus on a limited code set. The following diagram illustrates the difference between the Renaissance and the experimental setup.

With this CPU performance tuned setup, c7g.16xl can now handle a 36% higher request load than the reference using the same configuration, at an average latency limit of 1.5ms. This illustrates the impact that system tuning with APerf can have. The same system that scored 50% lower than the comparison system now exceeds it by 36%.
The following APerf data shows the improvement of key PMU metrics that lead to the performance jump.

Branch prediction misses/1000 instructions

Figure 8: Branch misses per 1k instructions

Left chart: Optimized Java-only setup. Right chart: Finagle-http with wrk2 load generator

The branch prediction miss rate is reduced to 1.5 from 11 with the Java-only setup.

IPC

Figure 9: Instructions Per Clock Cycle

Left chart: Optimized Java-only setup. Right chart: Finagle-http with wrk2 load generator

The IPC steps up from 0.7 to 1.5 due to the improvement in branch prediction.

Code sparsity

Figure 10: Code Sparsity

Left chart: Optimized Java-only setup. Right chart: Finagle-http with wrk2 load generator

The code sparsity decreases to 0.014 from 0.21, a factor of 15.

Conclusion

AWS created the APerf tool to aid in root cause analysis and help address performance issues for any workload. APerf is a standalone binary that captures relevant data simultaneously as a time series, such as CPU usage, interrupt frequency, memory usage, and CPU core metrics (PMU counters). APerf can generate reports for multiple data captures, making it easy to spot differences between instance types. We were able to use this data to analyze why Graviton3 was underperforming and to also see the impact of our changes in terms of performance. Using APerf we were able to successfully adjust configuration parameters and go from 50% below our performance target to 36% more performant than our reference system and associated performance target. Without Aperf, collecting these metrics and visualizing them is a non-trival task. With Aperf you can capture and visualize these metrics with two short commands, saving you time and effort so you can focus on what matters most: getting the most performance from your application.

Network perimeter security protections for generative AI

2024-08-07 Riggs Goodman III

Post Syndicated from Riggs Goodman III original https://aws.amazon.com/blogs/security/network-perimeter-security-protections-for-generative-ai/

Generative AI–based applications have grown in popularity in the last couple of years. Applications built with large language models (LLMs) have the potential to increase the value companies bring to their customers. In this blog post, we dive deep into network perimeter protection for generative AI applications. We’ll walk through the different areas of network perimeter protection you should consider, discuss how those apply to generative AI–based applications, and provide architecture patterns. By implementing network perimeter protection for your generative AI–based applications, you gain controls to help protect from unauthorized use, cost overruns, distributed denial of service (DDoS), and other threat actors or curious users.

Perimeter protection for LLMs

Network perimeter protection for web applications helps answer important questions, for example:

Who can access the app?
What kind of data is sent to the app?
How much data is the app is allowed to use?

For the most part, the same network protection methods used for other web apps also work for generative AI apps. The main focus of these methods is controlling network traffic that is trying to access the app, not the specific requests and responses the app creates. We’ll focus on three key areas of network perimeter protection:

Authentication and authorization for the app’s frontend
Using a web application firewall
Protection against DDoS attacks

The security concerns of using LLMs in these apps, including issues with prompt injections, sensitive information leaks, or excess agency, is beyond the scope of this post.

Frontend authentication and authorization

When designing network perimeter protection, you first need to decide whether you will allow certain users to access the application, based on whether they are authenticated (AuthN) and whether they are authorized (AuthZ) to ask certain questions of the generative AI–based applications. Many generative AI–based applications sit behind an authentication layer so that a user must sign in to their identity provider before accessing the application. For public applications that are not behind any authentication (a chatbot, for example), additional considerations are required with regard to AWS WAF and DDoS protection, which we discuss in the next two sections.

Let’s look at an example. Amazon API Gateway is an option for customers for the application frontend, providing metering of users or APIs with authentication and authorization. It’s a fully managed service that makes it convenient for developers to publish, maintain, monitor, and secure APIs at scale. With API Gateway, you create AWS Lambda authorizers to control access to APIs within your application. Figure 1 shows how access works for this example.

Figure 1: An API Gateway, Lambda authorizer, and basic filter in the signal path between client and LLM

The workflow in Figure 1 is as follows:

A client makes a request to your API that is fronted by the API Gateway.
When the API Gateway receives the request, it sends the request to a Lambda authorizer that authenticates the request through OAuth, SAML, or another mechanism. The Lambda authorizer returns an AWS Identity and Access Management (IAM) policy to the API Gateway, which will permit or deny the request.
If permitted, the API Gateway sends the API request to the backend application. In Figure 1, this is a Lambda function that provides additional capabilities in the area of LLM security, standing in for more complex filtering. In addition to the Lambda authorizer, you can configure throttling on the API Gateway on a per-client basis or on the application methods clients are accessing before traffic makes it to the backend application. Throttling can provide some mitigation against not only DDoS attacks but also model cloning and inversion attacks.
Finally, the application sends requests to your LLM that is deployed on AWS. In this example, the LLM is deployed on Amazon Bedrock.

The combination of Lambda authorizers and throttling helps support a number of perimeter protection mechanisms. First, only authorized users gain access to the application, helping to prevent bots and the public from accessing the application. Second, for authorized users, you limit the rate at which they can invoke the LLM to prevent excessive costs related to requests and responses to the LLM. Third, after users have been authenticated and authorized by the application, the application can pass identity information to the backend data access layer in order to restrict the data available to the LLM, aligning with what the user is authorized to access.

Besides API Gateway, AWS provides other options you can use to provide frontend authentication and authorization. AWS Application Load Balancer (ALB) supports OpenID Connect (OIDC) capabilities to require authentication to your OIDC provider prior to access. For internal applications, AWS Verified Access combines both identity and device trust signals to permit or deny access to your generative AI application.

AWS WAF

Once the authentication or authorization decision is made, the next consideration for network perimeter protection is on the application side. New security risks are being identified for generative AI–based applications, as described in the OWASP Top 10 for Large Language Model Applications. These risks include insecure output handling, insecure plugin design, and other mechanisms that cause the application to provide responses that are outside the desired norm. For example, a threat actor could craft a direct prompt injection to the LLM, which causes the LLM behave improperly. Some of these risks (insecure plugin design) can be addressed by passing identity information to the plugins and data sources. However, many of those protections fall outside the network perimeter protection and into the realm of security within the application. For network perimeter protection, the focus is on validating the users who have access to the application and supporting rules that allow, block, or monitor web requests based on network rules and patterns at the application level prior to application access.

In addition, bot traffic is an important consideration for web-based applications. According to Security Today, 47% of all internet traffic originates from bots. Bots that send requests to public applications drive up the cost of using generative AI–based applications by causing higher request loads.

To protect against bot traffic before the user gains access to the application, you can implement AWS WAF as part of the perimeter protection. Using AWS WAF, you can deploy a firewall to monitor and block the HTTP(S) requests that are forwarded to your protected web application resources. These resources exist behind Amazon API Gateway, ALB, AWS Verified Access, and other resources. From a web application point of view, AWS WAF is used to prevent or limit access to your application before invocation of your LLM takes place. This is an important area to consider because, in addition to protecting the prompts and completions going to and from the LLM itself, you want to make sure only legitimate traffic can access your application. AWS Managed Rules or AWS Marketplace managed rule groups provide you with predefined rules as part of a rule group.

Let’s expand the previous example. As your application shown in Figure 1 begins to scale, you decide to move it behind Amazon CloudFront. CloudFront is a web service that gives you a distributed ingress into AWS by using a global network of edge locations. Besides providing distributed ingress, CloudFront gives you the option to deploy AWS WAF in a distributed fashion to help protect against SQL injections, bot control, and other options as part of your AWS WAF rules. Let’s walk through the new architecture in Figure 2.

Figure 2: Adding AWS WAF and CloudFront to the client-to-model signal path

The workflow shown in Figure 2 is as follows:

A client makes a request to your API. DNS directs the client to a CloudFront location, where AWS WAF is deployed.
CloudFront sends the request through an AWS WAF rule to determine whether to block, monitor, or allow the traffic. If AWS WAF does not block the traffic, AWS WAF sends it to the CloudFront routing rules.

Note: It is recommended that you restrict access to the API Gateway so users cannot bypass the CloudFront distribution to access the API Gateway. An example of how to accomplish this goal can be found in the Restricting access on HTTP API Gateway Endpoint with Lambda Authorizer blog post.
CloudFront sends the traffic to the API Gateway, where it runs through the same traffic path as discussed in Figure 1.

To dive into more detail, let’s focus on bot traffic. With AWS WAF Bot Control, you can monitor, block, or rate limit bots such as scrapers, scanners, crawlers, status monitors, and search engines. Bot Control provides multiple options in terms of configured rules and inspection levels. For example, if you use the targeted inspection level of the rule group, you can challenge bots that don’t self-identify, making it harder and more expensive for malicious bots to operate against your generative AI–based application. You can use the Bot Control managed rule group alone or in combination with other AWS Managed Rules rule groups and your own custom AWS WAF rules. Bot Control also provides granular visibility on the number of bots that are targeting your application, as shown in Figure 3.

Figure 3: Bot control dashboard for bot requests and non-bot requests

How does this functionality help you? For your generative AI–based application, you gain visibility into how bots and other traffic are targeting your application. AWS WAF provides options to monitor and customize the web request handling of bot traffic, including allowing specific bots or blocking bot traffic to your application. In addition to bot control, AWS WAF provides a number of different managed rule groups, including baseline rule groups, use-case specific rule groups, IP reputation rules groups, and others. For more information, take a look at the documentation on both AWS Managed Rules rule groups and AWS Marketplace managed rule groups.

DDoS protection

The last topic we’ll cover in this post is DDoS with LLMs. Similar to threats against other Layer 7 applications, threat actors can send requests that consume an exceptionally high amount of resources, which results in a decline in the service’s responsiveness or an increase in the cost to run the LLMs that are handling the high number of requests. Although throttling can help support a per-user or per-method rate limit, DDoS attacks use more advanced threat vectors that are difficult to protect against with throttling.

AWS Shield helps to provide protection against DDoS for your internet-facing applications, both at Layer 3/4 with Shield standard or Layer 7 with Shield Advanced. For example, Shield Advanced responds automatically to mitigate application threats by counting or blocking web requests that are part of the exploit by using web access control lists (ACLs) that are part of your already deployed AWS WAF. Depending on your requirements, Shield can provide multiple layers of protection against DDoS attacks.

Figure 4 shows how your deployment might look after Shield is added to the architecture.

Figure 4: Adding Shield Advanced to the client-to-model signal path

The workflow in Figure 4 is as follows:

A client makes a request to your API. DNS directs the client to a CloudFront location, where AWS WAF and Shield are deployed.
CloudFront sends the request through an AWS WAF rule to determine whether to block, monitor, or allow the traffic. AWS Shield can mitigate a wide range of known DDoS attack vectors and zero-day attack vectors. Depending on the configuration, Shield Advanced and AWS WAF work together to rate-limit traffic coming from individual IP addresses. If AWS WAF or Shield Advanced don’t block the traffic, the services will send it to the CloudFront routing rules.
CloudFront sends the traffic to the API Gateway, where it will run through the same traffic path as discussed in Figure 1.

When you implement AWS Shield and Shield Advanced, you gain protection against security events and visibility into both global and account-level events. For example, at the account level, you get information on the total number of events seen on your account, the largest bit rate and packet rate for each resource, and the largest request rate for CloudFront. With Shield Advanced, you also get access to notifications of events that are detected by Shield Advanced and additional information about detected events and mitigations. These metrics and data, along with AWS WAF, provide you with visibility into the traffic that is trying to access your generative AI–based applications. This provides mitigation capabilities before the traffic accesses your application and before invocation of the LLM.

Considerations

When deploying network perimeter protection with generative AI applications, consider the following:

AWS provides multiple options, on both the frontend authentication and authorization side and the AWS WAF side, for how to configure perimeter protections. Depending on your application architecture and traffic patterns, multiple resources can provide the perimeter protection with AWS WAF and integrate with identity providers for authentication and authorization decisions.
You can also deploy more advanced LLM-specific prompt and completion filters by using Lambda functions and other AWS services as part of your deployment architecture. Perimeter protection capabilities are focused on preventing undesired traffic from reaching the end application.
Most of the network perimeter protections used for LLMs are similar to network perimeter protection mechanisms for other web applications. The difference is that additional threat vectors come into play compared to regular web applications. For more information on the threat vectors, see OWASP Top 10 for Large Language Model Applications and Mitre ATLAS.

Conclusion

In this blog post, we discussed how traditional network perimeter protection strategies can provide defense in depth for generative AI–based applications. We discussed the similarities and differences between LLM workloads and other web applications. We walked through why authentication and authorization protection is important, showing how you can use Amazon API Gateway to throttle through usage plans and to provide authentication through Lambda authorizers. Then, we discussed how you can use AWS WAF to help protect applications from bots. Lastly, we talked about how AWS Shield can provide advanced protection against different types of DDoS attacks at scale. For additional information on network perimeter protection and generative AI security, take a look at other blogs posts in the AWS Security Blog Channel.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Migrating your on-premises workloads to AWS Outposts rack

2024-08-06 Macey Neff

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/migrating-your-on-premises-workloads-to-aws-outposts-rack/

This post is written by Craig Warburton, Senior Solutions Architect, Hybrid. Sedji Gaouaou, Senior Solutions Architect, Hybrid. Brian Daugherty, Principal Solutions Architect, Hybrid.

Migrating workloads to AWS Outposts rack offers you the opportunity to gain the benefits of cloud computing while keeping your data and applications on premises.

For organizations with strict data residency requirements, by deploying AWS infrastructure and services on premises, you can keep sensitive data and mission-critical applications within your own data centers or facilities, helping ensure compliance with data sovereignty laws and regulatory frameworks.

On the other hand, if your organization does not have stringent data residency requirements, you may opt for a hybrid approach, using both Outposts rack and the AWS Regions. With this flexibility, you can process and store data in the most appropriate location based on factors such as latency, cost optimization, and application requirements.

In this post, we cover the best options to migrate your workloads to Outposts rack, taking into account your specific data residency requirements. We explore strategies, tools, and best practices to enable a successful migration tailored to your organization’s needs.

Overview

AWS has a number of services to help you migrate and rehost workloads, including AWS Migration Hub, AWS Application Migration Service, AWS Elastic Disaster Recovery. Alternatively, you can use backup and recovery solutions provided by AWS partners.

At AWS, we use the 7 Rs framework to help organizations evaluate and choose the appropriate migration strategy for moving applications and workloads to the AWS Cloud. The 7 Rs represent:

Rehosting (rehost or lift and shift)
Replatforming (lift, tinker, and shift)
Repurchasing (republish or re-vendor)
Refactoring (re-architecting)
Retiring
Retaining (revisit)
Relocating (remigrate).

This post focuses on rehosting and the services available to help rehost on-premises applications to Outposts rack.

Before getting started with any migration, AWS recommends a three-phase approach to migrating workloads to the cloud (AWS Region or Outposts rack). The three phases are assess, mobilize, and migrate and modernize.

Figure 1: Diagram showing the three migration phases of assess, mobilize, and migrate and modernize

This post describes the steps that you can take in the migrate and modernize phase. However, the assess and mobilize phases are also critical to allow you to understand what applications will be migrated, the dependencies between them, and the planning associated with how and when migration will occur.

AWS Migration Hub is a cloud migration service provided by AWS that helps organizations accelerate and simplify the process of migrating workloads to AWS. It provides a unified location to track the progress of application migrations across multiple AWS and partner services. This service can be used to help work through all three phases of migration, and we recommend that you start with this service and complete each phase accordingly. The assess phase should help you identify any applications that require consideration when migrating (including any data residency requirements), and the mobilize phase defines the approach to take.

Workload migration to AWS Outposts rack: With staging environment in an AWS Region

After deploying an Outpost rack to your desired on-premises location, you can perform migrations of on-premises systems and virtual machines using either Application Migration Service or third-party backup and recovery services. Both scenarios are described in the following sections.

Scenario 1: Using AWS Application Migration Service

Application Migration Service is able to lift and shift a large number of physical or virtual servers without compatibility issues, performance disruption, or long cutover windows.

In this scenario, at least one Outpost rack is deployed on premises with the following prerequisites:

At least one Outpost rack installed and activated
The Outposts rack must be in Direct VPC Routing (DVR) mode
VPC in Region containing subnet for staging resources
VPC extended to the Outposts rack containing subnet for target resources
An AWS Replication Agent installed on each source server

The following diagram shows the solution architecture and includes the on-premises servers that will be migrated from the local network to the Outposts rack. It also includes the staging VPC in Region used to deploy the replication servers, Amazon S3 to store the Amazon EBS snapshots and the target VPC extended to Outposts rack.

Figure 2: Architecture diagram showing migration with Application Migration Service

Step 1: Outposts rack configuration

You can work with AWS specialists to size your Outpost for your workload and application requirements. In this scenario, you don’t need additional Outposts rack capacity for the migration because the staging area will be deployed in the Region (see 1 in Figure 2).

Step 2: Prepare Application Migration service

Set up Application Migration Service from the console in the Region your Outposts rack is anchored to. If this is your first setup, choose Get started on the AWS Application Migration Service console. When creating the replication settings template, make sure your staging area is using subnets in the parent Region (see 2 in Figure 2).

Step 3: Install the AWS Replication Agent to the source servers or machines

For large migrations, source servers may have a wide variety of operating system versions and may be distributed across multiple data centers. AWS Application Migration Service offers the MGN connector, a feature that allows you to automate running commands on your source environment. Finally, ensure that communication is possible between the agent and Application Migration Service (see 3 in Figure 2).

In the following image, there is an example of deploying the AWS Replication Agent providing the required parameters (Region, AWS access key and AWS secret access key).

Once the AWS Replication Agent is installed, the server will be added to the AWS Application Migration Service console. Next, it will undergo the initial sync process, which will be completed when showing the Ready for testing lifecycle state in the Application Migration Service console.

Step 4: Configure launch settings

Prior to testing or cutting over an instance, you must configure the launch settings by creating Amazon Elastic Compute Cloud (Amazon EC2) launch templates, ensuring that you select your extended virtual private cloud (VPC) and subnet deployed on Outposts rack and using an appropriate, available instance type (see 4 in Figure 2).

To identify EC2 instances configured on your Outpost, you can use the following AWS Command Line Interface (AWS CLI):

Outposts get-outpost-instance-types \

--outpost-id op-abcdefgh123456789

The output of this command lists the instance types and sizes configured on your Outpost:

InstanceTypes:

- InstanceType: c5.xlarge

- InstanceType: c5.4xlarge

- InstanceType: r5.2xlarge

- InstanceType: r5.4xlarge

With knowledge of the instance types configured, you can now determine how many of each are available. For example, the following AWS CLI command, which is run on the account that owns the Outpost, lists the number of c5.xlarge instances available for use:

aws cloudwatch get-metric-statistics \

--namespace AWS/Outposts \

--metric-name AvailableInstanceType_Count \

--statistics Average --period 3600 \

--start-time $(date -u -Iminutes -d '-1hour') \

--end-time $(date -u -Iminutes) \

--dimensions \

Name=OutpostId,Value=op-abcdefgh123456789 \

Name=InstanceType,Value=c5.xlarge

This command returns:

Datapoints:

- Average: 10.0

Timestamp: '2024-04-10T10:39:00+00:00'

Unit: Count

Label: AvailableInstanceType_Count

The output indicates that there were (on average) 10 c5.xlarge instances available in the specified time period (1 hour). Using the same command for the other instance types, you discover that there are also 20 c5.4xlarge, 10 r5.2xlarge, and 6 r5.4xlarge available for use in completing the required EC2 launch templates.

Step 5: Install AWS Systems Manager Agent in your on your target instances

Once the launch settings are defined, you must activate the post-launch actions for either a specific server or all the servers. You must leave the Install the Systems Manager agent and allow executing actions on launched servers option toggled on in order for post-launch actions to work. Untoggling the option would disallow Application Migration Service to install the AWS Systems Manager Agent (SSM Agent) on your servers, and post-launch actions would no longer be executed on them (see 5 in Figure 2).

Figure 3: Post-launch actions on the Application Migration Service console

Step 6: Testing and cutover

Once you have configured the launch settings for each source server, you are ready to launch the servers as test instances. Best practice is to test instances before cutover.

Figure 4: Application Migration Service console ready to launch test instances

Finally, after completing the testing of all the source servers, you are ready for cutover (see 6 on Figure 2). Prior to launching cutover instances, check that the source servers are listed as Ready for cutover under Migration lifecycle and Healthy under Data replication status.

Figure 5: Application Migration Console ready for cutover

To launch the cutover instances, select the instances you want to cutover and then select Launch cutover instances under Cutover (see Figure 5).

The AWS Application Migration Service console will indicate Cutover finalized when the cutover has completed successfully, the selected source servers’ Migration lifecycle column will show the Cutover complete status, the Data replication status column will show Disconnected, and the Next step column will show Mark as archived. The source servers have now been successfully migrated into AWS. You can now archive your source servers that have launched cutover instances.

Scenario 2: Using partner backup and replication solutions

You may already be using a third-party or AWS Partner solution to create on-premises backups of bare-metal or virtualized systems. These solutions often use local disk-arrays or object stores to create tiered backups of systems covering restore-points going back years, days, or just a few hours or minutes.

These solutions may also have inherent capabilities to restore from these backups directly to the AWS, enabling migration of on-premises systems to EC2 instances deployed to Outposts rack.

In the scenario illustrated in Figure 6, the partner backup and replication service (BR) creates backups (see 1 in Figure 6) of virtual machines to on-premises disk or object storage repositories. Using the service’s AWS integration, virtual machines can be restored (see 2 in Figure 6) to an EC2 instance deployed on Outposts rack, which is also on premises. The restoration may follow a process that uses helper instances and volumes (see 3 in Figure 6) during intermediate steps to create Amazon Elastic Block Store (Amazon EBS) snapshots (see 4 in Figure 6) and then Amazon Machine Images (AMIs) of the systems being migrated (see 5 in Figure 6), which are ultimately deployed (see 6 in Figure 6) to Outposts rack.

Figure 6: Architecture diagram of the partner backup and replication scenario

When performing this type of migration, there will typically be a stage where you are asked to specify parameters defining the target VPC and subnets. These should be the VPC being extended to the Outpost and a subnet that has been created in that VPC on the Outpost. You will also need to specify an EC2 instance type that is available on the Outpost, which can be discovered using the process described in the previous section.

Workload migration to AWS Outposts rack: With staging environment on an AWS Outpost rack

Data residency can be a critical consideration for organizations that collect and store sensitive information, such as personally identifiable information (PII), financial data or medical records. AWS Elastic Disaster Recovery, supported on Outposts rack, helps enable seamless replication of on-premises data to Outposts rack and addresses data residency concerns by keeping data within your on-premises environment, using Amazon EBS and Amazon S3 on Outposts.

In this scenario, an Outpost rack is deployed on premises with the following prerequisites:

At least one Outpost rack installed and activated
The Outposts rack must be in Direct VPC Routing (DVR) mode
VPC extended to the Outposts rack containing subnets for staging and target resources
Amazon S3 on Outposts (required for all Elastic Disaster Recovery replication destinations)
An AWS Replication Agent installed on each source server.

The following diagram shows the solution architecture and includes the on-premises servers that will be migrated from the local network to the Outposts rack. It also includes the staging VPC used to deploy the replication servers on Outposts rack, Amazon S3 on Outposts to store the local Amazon EBS snapshots and the target VPC extended to Outposts rack.

Figure 7: Architecture diagram for workflow migration to AWS Outposts rack

Step 1: Outposts rack configuration

To use Elastic Disaster Recovery on Outposts rack, you need to configure both Amazon EBS and Amazon S3 on Outposts to support nearly continuous replication and point-in-time recovery for your workload needs (see 1 in Figure 7). Specifically, you need to size Amazon EBS and Amazon S3 on Outposts capacity according to your workload capacity requirements and application interdependencies. To do this, you can define dependency groups–each dependency group is a collection of applications and their underlying infrastructure with technical or non-technical dependencies. A 2:1 ratio is recommended for the EBS volumes to be used for near-continuous replication; a 1:1 ratio is recommended for the Amazon S3 on Outposts ratio for EBS snapshots. For example, to migrate 40 terabytes (TB) of workloads, you need to plan for 80TB of EBS volumes and 40TB of S3 on Outposts capacity.

Step 2: Extend VPC to your Outposts rack

Once your Outpost has been provisioned and is available, extend the required Amazon Virtual Private Cloud (Amazon VPC) connection to the Outpost from the Region by creating the desired staging and target subnets (see 2 in Figure 7).

Step 3: Prepare AWS Elastic Disaster Recovery service

Prepare the AWS Elastic Disaster Recovery service from the AWS console to set the default replication and launch settings. When defining these settings, make sure that the Outposts resources available are chosen for staging and target subnets and instance and storage type (see 3 in Figure 7).

Step 4: Install the AWS Replication Agent to the source servers or machines

The next phase will be to install the AWS Replication Agent to the source servers and to ensure that communication is possible between the replication agent and your Outposts replication subnet through the Outposts local gateway to ensure that replication traffic uses the local network (see 4 in Figure 7).

Step 5: Continuous block-level replication

Staging area resources are automatically created and managed by Elastic Disaster Recovery. Once the AWS Replication Agent has been deployed, continuous block-level replication (compressed and encrypted in transit) will occur (see 5 in Figure 7) over the local network.

Step 6: Launch Outposts rack resources

Finally, migrated instances can now be launched using Outposts rack resources based on the launch settings defined previously (see 6 in Figure 7).

Conclusion

In this post, you have learned how to migrate your workloads from your on-premises environment to Outposts rack based on your specific data residency requirements. When you have the flexibility of using Regional services, AWS migration services or partner solutions can be used with infrastructure already in place. If your data must stay on-premises, using AWS Elastic Disaster Recovery allows you to migrate your data without using Regional services, allowing you to migrate to Outposts rack without your data leaving the boundary of a certain geographic location.

To learn more about an end-to-end migration and modernization journey, visit AWS Migration Hub.

Hardening the RAG chatbot architecture powered by Amazon Bedrock: Blueprint for secure design and anti-pattern migration

2024-08-06 Magesh Dhanasekaran

Post Syndicated from Magesh Dhanasekaran original https://aws.amazon.com/blogs/security/hardening-the-rag-chatbot-architecture-powered-by-amazon-bedrock-blueprint-for-secure-design-and-anti-pattern-migration/

This blog post demonstrates how to use Amazon Bedrock with a detailed security plan to deploy a safe and responsible chatbot application. In this post, we identify common security risks and anti-patterns that can arise when exposing a large language model (LLM) in an application. Amazon Bedrock is built with features you can use to mitigate vulnerabilities and incorporate secure design principles. This post highlights architectural considerations and best practice strategies to enhance the reliability of your LLM-based application.

Amazon Bedrock unleashes the fusion of generative artificial intelligence (AI) and LLMs, empowering you to craft impactful chatbot applications. As with technologies handling sensitive data and intellectual property, it’s crucial that you prioritize security and adopt a robust security posture. Without proper measures, these applications can be susceptible to risks such as prompt injection, information disclosure, model exploitation, and regulatory violations. By proactively addressing these security considerations, you can responsibly use Amazon Bedrock foundation models and generative AI capabilities.

The chatbot application use case represents a common pattern in enterprise environments, where businesses want to use the power of generative AI foundation models (FMs) to build their own applications. This falls under the Pre-trained models category of the Generative AI Security Scoping Matrix. In this scope, businesses directly integrate with FMs like Anthropic’s Claude through Amazon Bedrock APIs to create custom applications, such as customer support Retrieval Augmented Generation (RAG) chatbots, content generation tools, and decision support systems.

This post provides a comprehensive security blueprint for deploying chatbot applications that integrate with Amazon Bedrock, enabling the responsible adoption of LLMs and generative AI in enterprise environments. We outline mitigation strategies through secure design principles, architectural considerations, and best practices tailored to the challenges of integrating LLMs and generative AI capabilities.

By following the guidance in this post, you can proactively identify and mitigate risks associated with deploying and operating chatbot applications that integrate with Amazon Bedrock and use generative AI models. The guidance can help you strengthen the security posture, protect sensitive data and intellectual property, maintain regulatory compliance, and responsibly deploy generative AI capabilities within your enterprise environments.

This post contains the following high-level sections:

Chatbot application architecture overview

The chatbot application architecture described in this post represents an example implementation that uses various AWS services and integrates with Amazon Bedrock and Anthropic’s Claude 3 Sonnet LLM. This baseline architecture serves as a foundation to understand the core components and their interactions. However, it’s important to note that there can be multiple ways for customers to design and implement a chatbot architecture that integrates with Amazon Bedrock, depending on their specific requirements and constraints. Regardless of the implementation approach, it’s crucial to incorporate appropriate security controls and follow best practices for secure design and deployment of generative AI applications.

The chatbot application allows users to interact through a frontend interface and submit prompts or queries. These prompts are processed by integrating with Amazon Bedrock, which uses the Anthropic Claude 3 Sonnet LLM and a knowledge base built from ingested data. The LLM generates relevant responses based on the prompts and retrieved context from the knowledge base. While this baseline implementation outlines the core functionality, it requires incorporating security controls and following best practices to mitigate potential risks associated with deploying generative AI applications. In the subsequent sections, we discuss security anti-patterns that can arise in such applications, along with their corresponding mitigation strategies. Additionally, we present a secure and responsible architecture blueprint for the chatbot application powered by Amazon Bedrock.

Figure 1: Baseline chatbot application architecture using AWS services and Amazon Bedrock

Components in the chatbot application baseline architecture

The chatbot application architecture uses various AWS services and integrates with the Amazon Bedrock service and Anthropic’s Claude 3 Sonnet LLM to deliver an interactive and intelligent chatbot experience. The main components of the architecture (as shown in Figure 1) are:

User interaction layer:
Users interact with the chatbot application through the Streamlit frontend (3), a Python-based open-source library, used to build the user-friendly and interactive interface.
Amazon Elastic Container Service (Amazon ECS) on AWS Fargate:
A fully managed and scalable container orchestration service that eliminates the need to provision and manage servers, allowing you to run containerized applications without having to manage the underlying compute infrastructure.
Application hosting and deployment:
The Streamlit application (3) components are hosted and deployed on Amazon ECS on AWS Fargate (2), maintaining scalability and high availability. This architecture represents the application and hosting environment in an independent virtual private cloud (VPC) to promote a loosely-coupled architecture. The Streamlit frontend can be replaced with your organization’s specific frontend and quickly integrated with the backend Amazon API Gateway in the VPC. An application load balancer is used to distribute traffic to the Streamlit application instances.
API Gateway driven Lambda Integration:
In this example architecture, instead of directly invoking the Amazon Bedrock service from the frontend, an API Gateway backed by an AWS Lambda function (5) is used as an intermediary layer. This approach promotes better separation of concerns, scalability, and secure access to Amazon Bedrock by limiting direct exposure from the frontend.
Lambda:
Lambda provides highly scalable, short-term serverless compute. Here, the requests from Streamlit are processed. First, the history of the user’s session is retrieved from Amazon DynamoDB (6). Second, the user’s question, history, and the context are formatted into a prompt template and queried against Amazon Bedrock with the knowledge base, employing retrieval augmented generation (RAG).
DynamoDB:
DynamoDB is responsible for storing and retrieving chat history, conversation history, recommendations, and other relevant data using the Lambda function.
Amazon S3:
Amazon Simple Storage Services (Amazon S3) is a data storage service and is used here for storing data artifacts that are ingested into the knowledge base.
Amazon Bedrock:
Amazon Bedrock plays a central role in the architecture. It handles the questions posed by the user using Anthropic Claude 3 Sonnet LLM (9) combined with a previously generated knowledge base (10) of the customer’s organization-specific data.
Anthropic Claude 3 Sonnet:
Anthropic Claude 3 Sonnet is the LLM used to generate tailored recommendations and responses based on user inputs and the context retrieved from the knowledge base. It’s part of the text analysis and generation module in Amazon Bedrock.
Knowledge base and data ingestion:
Relevant documents classified as public are ingested from Amazon S3 (9) into in an Amazon Bedrock knowledge base. Knowledge bases are backed by Amazon OpenSearch Service. Amazon Titan Embeddings (10) are used to generate the vector embeddings database of the documents. Storing the data as vector embeddings allows for semantic similarity searching of the documents to retrieve the context of the question posed by the user (RAG). By providing the LLM with context in addition to the question, there’s a much higher chance of getting a useful answer from the LLM.

Comprehensive logging and monitoring strategy

This section outlines a comprehensive logging and monitoring strategy for the Amazon Bedrock-powered chatbot application, using various AWS services to enable centralized logging, auditing, and proactive monitoring of security events, performance metrics, and potential threats.

Logging and auditing:
- AWS CloudTrail: Logs API calls made to Amazon Bedrock, including InvokeModel requests, as well as information about the user or service that made the request.
- AWS CloudWatch Logs: Captures and analyzes Amazon Bedrock invocation logs, user prompts, generated responses, and errors or warnings encountered during the invocation process.
- Amazon OpenSearch Service: Logs and indexes data related to the OpenSearch integration, context data retrievals, and knowledge base operations.
- AWS Config: Monitors and audits the configuration of resources related to the chatbot application and Amazon Bedrock service, including IAM policies, VPC settings, encryption key management, and other resource configurations.
Monitoring and alerting:
- AWS CloudWatch: Monitors metrics specific to Amazon Bedrock, such as the number of model invocations, latency of invocations, and error metrics (client-side errors, server-side errors, and throttling). Configures targeted CloudWatch alarms to proactively detect and respond to anomalies or issues related to Bedrock invocations and performance.
- AWS GuardDuty: Continuously monitors CloudTrail logs for potential threats and unauthorized activity within the AWS environment.
- AWS Security Hub: Provides centralized security posture management and compliance checks.
- Amazon Security Lake: Provides a centralized data lake for log analysis; is integrated with CloudTrail and SecurityHub.
Security information and event management integration:
- Integrate with security information and event management (SIEM) solutions for centralized log management, real-time monitoring of security events, and correlation of logging data from multiple sources (CloudTrail, CloudWatch Logs, OpenSearch Service, and so on).
Continuous improvement:
- Regularly review and update logging and monitoring configurations, alerting thresholds, and integration with security solutions to address emerging threats, changes in application requirements, or evolving best practices.

Security anti-patterns and mitigation strategies

This section identifies and explores common security anti-patterns associated with the Amazon Bedrock chatbot application architecture. By recognizing these anti-patterns early in the development and deployment phases, you can implement effective mitigation strategies and fortify your security posture.

Addressing security anti-patterns in the Amazon Bedrock chatbot application architecture is crucial for several reasons:

Data protection and privacy: The chatbot application processes and generates sensitive data, including personal information, intellectual property, and confidential business data. Failing to address security anti-patterns can lead to data breaches, unauthorized access, and potential regulatory violations.
Model integrity and reliability: Vulnerabilities in the chatbot application can enable bad actors to manipulate or exploit the underlying generative AI models, compromising the integrity and reliability of the generated outputs. This can have severe consequences, particularly in decision-support or critical applications.
Responsible AI deployment: As the adoption of generative AI models continues to grow, it’s essential to maintain responsible and ethical deployment practices. Addressing security anti-patterns is crucial for maintaining trust, transparency, and accountability in the chatbot application powered by AI models.
Compliance and regulatory requirements: Many industries and regions have specific regulations and guidelines governing the use of AI technologies, data privacy, and information security. Addressing security anti-patterns is a critical step towards adhering to and maintaining compliance for the chatbot application.

The security anti-patterns that are covered in this post include:

Lack of secure authentication and access controls
Insufficient input validation and sanitization
Insecure communication channels
Inadequate prompt and response logging, auditing, and non-repudiation
Insecure data storage and access controls
Failure to secure FMs and generative AI components
Lack of responsible AI governance and ethics
Lack of comprehensive testing and validation

Anti-pattern 1: Lack of secure authentication and access controls

In a generative AI chatbot application using Amazon Bedrock, a lack of secure authentication and access controls poses significant risks to the confidentiality, integrity, and availability of the system. Identity spoofing and unauthorized access can enable threat actors to impersonate legitimate users or systems, gain unauthorized access to sensitive data processed by the chatbot application, and potentially compromise the integrity and confidentiality of the customer’s data and intellectual property used by the application.

Identity spoofing and unauthorized access are important areas to address in this architecture, as the chatbot application handles user prompts and responses, which may contain sensitive information or intellectual property. If a threat actor can impersonate a legitimate user or system, they can potentially inject malicious prompts, retrieve confidential data from the knowledge base, or even manipulate the responses generated by the Anthropic Claude 3 LLM integrated with Amazon Bedrock.

Anti-pattern examples

Exposing the Streamlit frontend interface or the API Gateway endpoint without proper authentication mechanisms, potentially allowing unauthenticated users to interact with the chatbot application and inject malicious prompts.
Storing or hardcoding AWS access keys or API credentials in the application code or configuration files, increasing the risk of credential exposure and unauthorized access to AWS services like Amazon Bedrock or DynamoDB.
Implementing weak or easily guessable passwords for administrative or service accounts with elevated privileges to access the Amazon Bedrock service or other critical components.
Lacking multi-factor authentication (MFA) for AWS Identity and Access Management (IAM) users or roles with privileged access, increasing the risk of unauthorized access to AWS resources, including the Amazon Bedrock service, if credentials are compromised.

Mitigation strategies

To mitigate the risks associated with a lack of secure authentication and access controls, implement robust IAM controls, as well as continuous logging, monitoring, and threat detection mechanisms.

IAM controls:

Use industry-standard protocols like OAuth 2.0 or OpenID Connect, and integrate with AWS IAM Identity Center or other identity providers for centralized authentication and authorization for the Streamlit frontend interface and AWS API Gateway endpoints.
Implement fine-grained access controls using AWS IAM policies and resource-based policies to restrict access to only the necessary Amazon Bedrock resources, Lambda functions, and other components required for the chatbot application.
Enforce the use of MFA for all IAM users, roles, and service accounts with access to critical components like Amazon Bedrock, DynamoDB, or the Streamlit application.

Continuous logging and monitoring and threat detection:

See the Comprehensive logging and monitoring strategy section for guidance on implementing centralized logging and monitoring solutions to track and audit authentication events, access attempts, and potential unauthorized access or credential misuse across the chatbot application components and Amazon Bedrock service, as well as using CloudWatch, Lambda, and GuardDuty to detect and respond to anomalous behavior and potential threats.

Anti-pattern 2: Insufficient input sanitization and validation

Insufficient input validation and sanitization in a generative AI chatbot application can expose the system to various threats, including injection events, data tampering, adversarial events, and data poisoning events. These vulnerabilities can lead to unauthorized access, data manipulation, and compromised model outputs.

Injection events: If user prompts or inputs aren’t properly sanitized and validated, a threat actor can potentially inject malicious code, such as SQL code, leading to unauthorized access or manipulation of the DynamoDB chat history data. Additionally, if the chatbot application or components process user input without proper validation, a threat actor can potentially inject and run arbitrary code on the backend systems, compromising the entire application.

Data tampering: A threat actor can potentially modify user prompts or payloads in transit between the chatbot interface and Amazon Bedrock service, leading to unintended model responses or actions. Lack of data integrity checks can allow a threat actor to tamper with the context data exchanged between Amazon Bedrock and OpenSearch, potentially leading to incorrect or malicious search results influencing the LLM responses.

Data poisoning events: If the training data or context data used by the LLM or chatbot application isn’t properly validated and sanitized, bad actors can potentially introduce malicious or misleading data, leading to biased or compromised model outputs.

Anti-pattern examples

Failure to validate and sanitize user prompts before sending them to Amazon Bedrock, potentially leading to injection events or unintended data exposure.
Lack of input validation and sanitization for context data retrieved from OpenSearch, allowing malformed or malicious data to influence the LLM’s responses.
Insufficient sanitization of LLM-generated responses before displaying them to users, enabling potential code injection or rendering of harmful content.
Inadequate sanitization of user input in the Streamlit application or Lambda functions, failing to remove or escape special characters, code snippets, or potentially malicious patterns, enabling code injection events.
Insufficient validation and sanitization of training data or other data sources used by the LLM or chatbot application, allowing data poisoning events that can introduce malicious or misleading data, leading to biased or compromised model outputs.
Allowing unrestricted character sets, input lengths, or special characters in user prompts or data inputs, enabling adversaries to craft inputs that bypass input validation and sanitization mechanisms, potentially causing undesirable or malicious outputs.
Relying solely on deny lists for input validation, which can be quickly bypassed by adversaries, potentially leading to injection events, data tampering, or other exploit scenarios.

Mitigation strategies

To mitigate the risks associated with insufficient input validation and sanitization, implement robust input validation and sanitization mechanisms throughout the chatbot application and its components.

Input validation and sanitization:

Implement strict input validation rules for user prompts at the chatbot interface and Amazon Bedrock service boundaries, defining allowed character sets, maximum input lengths, and disallowing special characters or code snippets. Use Amazon Bedrock’s Guardrails feature, which allows defining denied topics and content filters to remove undesirable and harmful content from user interactions with your applications.
Use allow lists instead of deny lists for input validation to maintain a more robust and comprehensive approach.
Sanitize user input by removing or escaping special characters, code snippets, or potentially malicious patterns.

Data flow validation:

Validate and sanitize data flows between components, including:
- User prompts sent to the FM and responses generated by the FM and returned to the chatbot interface.
- Training data, context data, and other data sources used by the FM or chatbot application.

Protective controls:

Use AWS Web Application Firewall (WAF) for input validation and protection against common web exploits.
Use AWS Shield for protection against distributed denial of service (DDoS) events.
Use CloudTrail to monitor API calls to Amazon Bedrock, including InvokeModel requests.
See the Comprehensive logging and monitoring strategy section for guidance on implementing Lambda functions, Amazon EventBridge rules, and CloudWatch Logs to analyze CloudTrail logs, ingest application logs, user prompts, and responses, and integrate with incident response and SIEM solutions for detecting, investigating, and mitigating security incidents related to input validation and sanitization, including jailbreaking attempts and anomalous behavior.

Anti-pattern 3: Insecure communication channels

Insecure communication channels between chatbot application components can expose sensitive data to interception, tampering, and unauthorized access risks. Unsecured channels enable man-in-the-middle events where threat actors intercept, modify data in transit such as user prompts, responses, and context data, leading to data tampering, malicious payload injection, and unauthorized information access.

Anti-pattern examples

Failure to use AWS PrivateLink for secure service-to-service communication within the VPC, exposing communications between Amazon Bedrock and other AWS services to potential risks over the public internet, even when using HTTPS.
Absence of data integrity checks or mechanisms to detect and prevent data tampering during transmission between components.
Failure to regularly review and update communication channel configurations, protocols, and encryption mechanisms to address emerging threats and ensure compliance with security best practices.

Mitigation strategies

To mitigate the risks associated with insecure communication channels, implement secure communication mechanisms and enforce data integrity throughout the chatbot application’s components and their interactions. Proper encryption, authentication, and integrity checks should be employed to protect sensitive data in transit and help prevent unauthorized access, data tampering, and man-in-the-middle events.

Secure communication channels:

Use PrivateLink for secure service-to-service communication between Amazon Bedrock and other AWS services used in the chatbot application architecture. PrivateLink provides a private, isolated communication channel within the Amazon VPC, eliminating the need to traverse the public internet. This mitigates the risk of potential interception, tampering, or unauthorized access to sensitive data transmitted between services, even when using HTTPS.
Use AWS Certificate Manager (ACM) to manage and automate the deployment of SSL/TLS certificates used for secure communication between the chatbot frontend interface (the Streamlit application) and the API Gateway endpoint. ACM simplifies the provisioning, renewal, and deployment of SSL/TLS certificates, making sure that communication channels between the user-facing components and the backend API are securely encrypted using industry-standard protocols and up-to-date certificates.

Continuous logging and monitoring:

See the Comprehensive Logging and Monitoring Strategy section for guidance on implementing centralized logging and monitoring mechanisms to detect and respond to potential communication channel anomalies or security incidents, including monitoring communication channel metrics, API call patterns, request payloads, and response data, using AWS services like CloudWatch, CloudTrail, and AWS WAF.

Network segmentation and isolation controls

Implement network segmentation by deploying the Amazon ECS cluster within a dedicated VPC and subnets, isolating it from other components and restricting communication based on the principle of least privilege.
Create separate subnets within the VPC for the public-facing frontend tier and the backend application tier, further isolating the components.
Use AWS security groups and network access control lists (NACLs) to control inbound and outbound traffic at the instance and subnet levels, respectively, for the ECS cluster and the frontend instances.

Anti-pattern 4: Inadequate logging, auditing, and non-repudiation

Inadequate logging, auditing, and non-repudiation mechanisms in a generative AI chatbot application can lead to several risks, including a lack of accountability, challenges in forensic analysis, and compliance concerns. Without proper logging and auditing, it’s challenging to track user activities, diagnose issues, perform forensic analysis in case of security incidents, and demonstrate compliance with regulations or internal policies.

Anti-pattern examples

Lack of logging for data flows between components, such as user prompts sent to Amazon Bedrock, context data exchanged with OpenSearch, and responses from the LLM, hindering investigative efforts in case of security incidents or data breaches.
Insufficient logging of user activities within the chatbot application—such as sign in attempts, session duration, and actions performed—limiting the ability to track and attribute actions to specific users.
Absence of mechanisms to ensure the integrity and authenticity of logged data, allowing potential tampering or repudiation of logged events.
Failure to securely store and protect log data from unauthorized access or modification, compromising the reliability and confidentiality of log information.

Mitigation strategies

To mitigate the risks associated with inadequate logging, auditing, and non-repudiation, implement comprehensive logging and auditing mechanisms to capture critical events, user activities, and data flows across the chatbot application components. Additionally, measures must be taken to maintain the integrity and authenticity of log data, help prevent tampering or repudiation, and securely store and protect log information from unauthorized access.

Comprehensive logging and auditing:

See the Comprehensive logging and monitoring strategy section for detailed guidance on implementing logging mechanisms using CloudTrail, CloudWatch Logs, and OpenSearch Service, as well as using CloudTrail for logging and monitoring API calls, especially Amazon Bedrock API calls and other API activities within the AWS environment, using CloudWatch for monitoring Amazon Bedrock-specific metrics, and ensuring log data integrity and non-repudiation through the CloudTrail log file integrity validation feature and implementing S3 Object Lock and S3 Versioning for log data stored in Amazon S3.
Make sure that log data is securely stored and protected from unauthorized access by using AWS Key Management Service (AWS KMS) for encryption at rest and implementing restrictive IAM policies and resource-based policies to control access to log data.
Retain log data for an appropriate period based on compliance requirements, using CloudTrail log file integrity validation and CloudWatch Logs retention periods and data archiving capabilities.

User activity monitoring and tracking:

Use CloudTrail for logging and monitoring API calls, especially Amazon Bedrock API calls and other API activities within the AWS environment, such as API Gateway, Lambda, and DynamoDB. Additionally, use CloudWatch for monitoring metrics specific to Amazon Bedrock, including the number of model invocations, latency, and error metrics (client-side errors, server-side errors, and throttling).
Integrate with security information and event management (SIEM) solutions for centralized log management and real-time monitoring of security events.

Data integrity and non-repudiation:

Implement digital signatures or non-repudiation mechanisms to verify the integrity and authenticity of logged data, minimizing tampering or repudiation of logged events. Use the CloudTrail log file integrity validation feature, which uses industry-standard algorithms (SHA-256 for hashing and SHA-256 with RSA for digital signing) to provide non-repudiation and verify log data integrity. For log data stored in Amazon S3, enable S3 Object Lock and S3 Versioning to provide an immutable, write once, read many (WORM) data storage model, helping to prevent object deletions or modifications, and maintaining data integrity and non-repudiation. Additionally, implement S3 bucket policies and IAM policies to restrict access to log data stored in S3, further enhancing the security and non-repudiation of logged events.

Anti-pattern 5: Insecure data storage and access controls

Insecure data storage and access controls in a generative AI chatbot application can lead to significant risks, including information disclosure, data tampering, and unauthorized access. Storing sensitive data, such as chat history, in an unencrypted or insecure manner can result in information disclosure if the data store is compromised or accessed by unauthorized entities. Additionally, a lack of proper access controls can allow unauthorized parties to access, modify, or delete data, leading to data tampering or unauthorized access.

Anti-pattern examples

Storing chat history data in DynamoDB without encryption at rest using AWS KMS customer-managed keys (CMKs).
Lack of encryption at rest using CMKs from AWS KMS for data in OpenSearch, Amazon S3, or other components that handle sensitive data.
Overly permissive access controls or lack of fine-grained access control mechanisms for the DynamoDB chat history, OpenSearch, Amazon S3, or other data stores, increasing the risk of unauthorized access or data breaches.
Storing sensitive data in clear text, or using insecure encryption algorithms or key management practices.
Failure to regularly review and rotate encryption keys or update access control policies to address potential security vulnerabilities or changes in access requirements.

Mitigation strategies

To mitigate the risks associated with insecure data storage and access controls, implement robust encryption mechanisms, secure key management practices, and fine-grained access control policies. Encrypting sensitive data at rest and in transit, using customer-managed encryption keys from AWS KMS, and implementing least- privilege access controls based on IAM policies and resource-based policies can significantly enhance the security and protection of data within the chatbot application architecture.

Key management and encryption at rest:

Implement AWS KMS to manage and control access to CMKs for data encryption across components like DynamoDB, OpenSearch, and Amazon S3.
- Use CMKs to configure DynamoDB to automatically encrypt chat history data at rest.
- Configure OpenSearch and Amazon S3 to use encryption at rest with AWS KMS CMKs for data stored in these services.
- CMKs provide enhanced security and control, allowing you to create, rotate, disable, and revoke encryption keys, enabling better key isolation and separation of duties.
- CMKs enable you to enforce key policies, audit key usage, and adhere to regulatory requirements or organizational policies that mandate customer-managed encryption keys.
- CMKs offer portability and independence from specific services, allowing you to migrate or integrate data across multiple services while maintaining control over the encryption keys.
- AWS KMS provides a centralized and secure key management solution, simplifying the management and auditing of encryption keys across various components and services.
Implement secure key management practices, including:
- Regular key rotation to maintain the security of your encrypted data.
- Separation of duties to make sure that no single individual has complete control over key management operations.
- Strict access controls for key management operations, using IAM policies and roles to enforce the principle of least privilege.

Fine-grained access controls:

Implement fine-grained access controls for the DynamoDB chat history data store, OpenSearch, Amazon S3, and other data stores using IAM policies and roles.
Implement fine-grained access controls and define least-privilege access policies for all resources handling sensitive data, such as the DynamoDB chat history data store, OpenSearch, Amazon S3, and other data stores or services. For example, use IAM policies and resource-based policies to restrict access to specific DynamoDB tables, OpenSearch domains, and S3 buckets, limiting access to only the necessary actions (for example, read, write, and list) based on the principle of least privilege. Extend this approach to all resources handling sensitive data within the chatbot application architecture, making sure that access is granted only to the minimum required resources and actions necessary for each component or user role.

Continuous improvement:

Regularly review and update encryption configurations, access control policies, and key management practices to address potential security vulnerabilities or changes in access requirements.

Anti-pattern 6: Failure to secure FM and generative AI components

Inadequate security measures for FMs and generative AI components in a chatbot application can lead to severe risks, including model tampering, unintended information disclosure, and denial of service. Threat actors can manipulate unsecured FMs and generative AI models to generate biased, harmful, or malicious responses, potentially causing significant harm or reputational damage.

Lack of proper access controls or input validation can result in unintended information disclosure, where sensitive data is inadvertently included in model responses. Additionally, insecure FM or generative AI components can be vulnerable to denial-of-service events, disrupting the availability of the chatbot application and impacting its functionality.

Anti-pattern examples

Insecure model fine tuning practices, such as using untrusted or compromised data sources, can lead to biased or malicious models.
Lack of continuous monitoring for FM and generative AI components, leaving them vulnerable to emerging threats or known vulnerabilities.
Lack of guardrails or safety measures to control and filter the outputs of FMs and generative AI components, potentially leading to the generation of harmful, biased, or undesirable content.
Inadequate access controls or input validation for prompts and context data sent to the FM components, increasing the risk of injection events or unintended information disclosure.
Failure to implement secure deployment practices for FM and generative AI components, including secure communication channels, encryption of model artifacts, and access controls.

Mitigation strategies

To mitigate the risks associated with inadequately secured foundational models (FMs) and generative AI components, implement secure integration mechanisms, robust model fine-tuning and deployment practices, continuous monitoring, and effective guardrails and safety measures. These mitigation strategies help prevent model tampering, unintended information disclosure, denial-of-service events, and the generation of harmful or undesirable content, while ensuring the security, reliability, and ethical alignment of the chatbot application’s generative AI capabilities.

Secure integration with LLMs and knowledge bases:

Implement secure communication channels (for example HTTPS or PrivateLink) between Amazon Bedrock, OpenSearch, and the FM components to help prevent unauthorized access or data tampering.
Implement strict input validation and sanitization for prompts and context data sent to the FM components to help prevent injection events or unintended information disclosure.
Implement access controls and least-privilege principles for the OpenSearch integration to limit the data accessible to the LLM components.

Secure model fine tuning, deployment, and monitoring:

Establish secure and auditable fine-tuning pipelines, using trusted and vetted data sources, to help prevent tampering or the introduction of biases.
Implement secure deployment practices for FM and generative AI components, including access controls, secure communication channels, and encryption of model artifacts.
Continuously monitor FM and generative AI components for security vulnerabilities, performance issues, and unintended behavior.
Implement rate-limiting, throttling, and load-balancing mechanisms to help prevent denial-of-service events on FM and generative AI components.
Regularly review and audit FM and generative AI components for compliance with security policies, industry best practices, and regulatory requirements.

Guardrails and safety measures

Implement guardrails, which are safety measures designed to reduce harmful outputs and align the behavior of FMs and generative AI components with human values.
Use keyword-based filtering, metric-based thresholds, human oversight, and customized guardrails tailored to the specific risks and cultural and ethical norms of each application domain.
Monitor the effectiveness of guardrails through performance benchmarking and adversarial testing.

Jailbreak robustness testing

Conduct jailbreak robustness testing by prompting the FMs and generative AI components with a diverse set of jailbreak attempts across different prohibited scenarios to identify weaknesses and improve model robustness.

Anti-pattern 7: Lack of responsible AI governance and ethics

While the previous anti-patterns focused on technical security aspects, it is equally important to address the ethical and responsible governance of generative AI systems. Without strong governance frameworks, ethical guidelines, and accountability measures, chatbot applications can result in unintended consequences, biased outcomes, and a lack of transparency and trust.

Anti-pattern examples

Lack of an established ethical AI governance framework, including principles, policies, and processes to guide the responsible development and deployment of the generative AI chatbot application.
Insufficient measures to ensure transparency, explainability, and interpretability of the LLM and generative AI components, making it difficult to understand and audit their decision-making processes.
Absence of mechanisms for stakeholder engagement, public consultation, and consideration of societal impacts, potentially leading to a lack of trust and acceptance of the chatbot application.
Failure to address potential biases, discrimination, or unfairness in the training data, models, or outputs of the generative AI system.
Inadequate processes for testing, validation, and ongoing monitoring of the chatbot application’s ethical behavior and alignment with organizational values and societal norms.

Mitigation strategies

To minimize a lack of responsible AI governance and ethics, establish a comprehensive ethical AI governance framework, promote transparency and interpretability, engage stakeholders and consider societal impacts, address potential biases and fairness issues, implement continuous improvement and monitoring processes, and use guardrails and safety measures. These mitigation strategies help to foster trust, accountability, and ethical alignment in the development and deployment of the generative AI chatbot application, mitigating the risks of unintended consequences, biased outcomes, and a lack of transparency.

Ethical AI governance framework:

Establish an ethical AI governance framework, including principles, policies, and processes to guide the responsible development and deployment of the generative AI chatbot application.
Define clear ethical guidelines and decision-making frameworks to address potential ethical dilemmas, biases, or unintended consequences.
Implement accountability measures, such as designated ethics boards, ethics officers, or external advisory committees, to oversee the ethical development and deployment of the chatbot application.

Transparency and interpretability:

Implement measures to promote transparency and interpretability of the LLM and generative AI components, allowing for auditing and understanding of their decision-making processes.
Provide clear and accessible information to stakeholders and users about the chatbot application’s capabilities, limitations, and potential biases or ethical considerations.

Stakeholder engagement and societal impact:

Establish mechanisms for stakeholder engagement, public consultation, and consideration of societal impacts, fostering trust and acceptance of the chatbot application.
Conduct impact assessments to identify and mitigate potential negative consequences or risks to individuals, communities, or society.

Bias and fairness:

Address potential biases, discrimination, or unfairness in the training data, models, or outputs of the generative AI system through rigorous testing, bias mitigation techniques, and ongoing monitoring.
Promote diverse and inclusive representation in the development, testing, and governance processes to reduce potential biases and blind spots.

Continuous improvement and monitoring:

Implement processes for ongoing testing, validation, and monitoring of the chatbot application’s behavior and alignment with organizational values and societal norms.
Regularly review and update the AI governance framework, policies, and processes to address emerging ethical challenges, societal expectations, and regulatory developments.

Guardrails and safety measures:

Implement guardrails, such as Guardrails for Amazon Bedrock, which are safety measures designed to reduce harmful outputs and align the behavior of LLMs and generative AI components with human values and responsible AI policies.
Use Guardrails for Amazon Bedrock to define denied topics and content filters to remove undesirable and harmful content from interactions between users and your applications.
- Define denied topics using natural language descriptions to specify topics or subject areas that are undesirable in the context of your application.
- Configure content filters to set thresholds for filtering harmful content across categories such as hate, insults, sexuality, and violence based on your use cases and responsible AI policies.
- Use the personally identifiable information (PII) redaction feature to redact information such as names, email addresses, and phone numbers from LLM-generated responses or block user inputs that contain PII.
Integrate Guardrails for Amazon Bedrock with CloudWatch to monitor and analyze user inputs and LLM responses that violate defined policies, enabling proactive detection and response to potential issues.
Monitor the effectiveness of guardrails through performance benchmarking and adversarial testing, continuously refining and updating the guardrails based on real-world usage and emerging ethical considerations.

Jailbreak robustness testing:

Conduct jailbreak robustness testing by prompting the LLMs and generative AI components with a diverse set of jailbreak attempts across different prohibited scenarios to identify weaknesses and improve model robustness.

Anti-pattern 8: Lack of comprehensive testing and validation

Inadequate testing and validation processes for the LLM system and the generative AI chatbot application can lead to unidentified vulnerabilities, performance bottlenecks, and availability issues. Without comprehensive testing and validation, organizations might fail to detect potential security risks, functionality gaps, or scalability and performance limitations before deploying the application in a production environment.

Anti-pattern examples

Lack of functional testing to validate the correctness and completeness of the LLM’s responses and the chatbot application’s features and functionalities.
Insufficient performance testing to identify bottlenecks, resource constraints, or scalability limitations under various load conditions.
Absence of security testing, such as penetration testing, vulnerability scanning, and adversarial testing to uncover potential security vulnerabilities or model exploits.
Failure to incorporate automated testing and validation processes into a continuous integration and continuous deployment (CI/CD) pipeline, leading to manual and one-time testing efforts that might overlook critical issues.
Inadequate testing of the chatbot application’s integration with external services and components, such as Amazon Bedrock, OpenSearch, and DynamoDB, potentially leading to compatibility issues or data integrity problems.

Mitigation strategies

To address the lack of comprehensive testing and validation, implement a robust testing strategy encompassing functional, performance, security, and integration testing. Integrate automated testing into a CI/CD pipeline, conduct security testing like threat modeling and penetration testing, and use adversarial validation techniques. Continuously improve testing processes to verify the reliability, security, and scalability of the generative AI chatbot application.

Comprehensive testing strategy:

Establish a comprehensive testing strategy that includes functional testing, performance testing, load testing, security testing, and integration testing for the LLM system and the overall chatbot application.
Define clear testing requirements, test cases, and acceptance criteria based on the application’s functional and non-functional requirements, as well as security and compliance standards.

Automated testing and CI/CD integration:

Incorporate automated testing and validation processes into a CI/CD pipeline, enabling continuous monitoring and assessment of the LLM’s performance, security, and reliability throughout its lifecycle.
Use automated testing tools and frameworks to streamline the testing process, improve test coverage, and facilitate regression testing.

Security testing and adversarial validation:

Conduct threat modeling exercises early in the design process and as soon as the design is finalized for the chatbot application architecture to proactively identify potential security risks and vulnerabilities. Subsequently, conduct regular security testing—including penetration testing, vulnerability scanning, and adversarial testing—to uncover and validate identified security vulnerabilities or model exploits.
Implement adversarial validation techniques, such as prompting the LLM with carefully crafted inputs designed to expose weaknesses or vulnerabilities, to improve the model’s robustness and security.

Performance and load testing:

Perform comprehensive performance and load testing to identify potential bottlenecks, resource constraints, or scalability limitations under various load conditions.
Use tools and techniques for load generation, stress testing, and capacity planning to ensure the chatbot application can handle anticipated user traffic and workloads.

Integration testing:

Conduct thorough integration testing to validate the chatbot application’s integration with external services and components, such as Amazon Bedrock, OpenSearch, and DynamoDB, maintaining seamless communication and data integrity.

Continuous improvement:

Regularly review and update the testing and validation processes to address emerging threats, new vulnerabilities, or changes in application requirements.
Use testing insights and results to continuously improve the LLM system, the chatbot application, and the overall security posture.

Common mitigation strategies for all anti-patterns

Regularly review and update security measures, access controls, monitoring mechanisms, and guardrails for LLM and generative AI components to address emerging threats, vulnerabilities, and evolving responsible AI best practices.
Conduct regular security assessments, penetration testing, and code reviews to identify and remediate vulnerabilities or misconfigurations related to logging, auditing, and non-repudiation mechanisms.
Stay current with security best practices, guidance, and updates from AWS and industry organizations regarding logging, auditing, and non-repudiation for generative AI applications.

Secure and responsible architecture blueprint

After discussing the baseline chatbot application architecture and identifying critical security anti-patterns associated with generative AI applications built using Amazon Bedrock, we now present the secure and responsible architecture blueprint. This blueprint (Figure 2) incorporates the recommended mitigation strategies and security controls discussed throughout the anti-pattern analysis.

Figure 2: Secure and responsible generative AI chatbot architecture blueprint

In this target state architecture, unauthenticated users interact with the chatbot application through the frontend interface (1), where it’s crucial to mitigate the anti-pattern of insufficient input validation and sanitization by implementing secure coding practices and input validation. The user inputs are then processed through AWS Shield, AWS WAF, and CloudFront (2), which provide DDoS protection, web application firewall capabilities, and a content delivery network, respectively. These services help mitigate insufficient input validation, web exploits, and lack of comprehensive testing by using AWS WAF for input validation and conducting regular security testing.

The user requests are then routed through API Gateway (3), which acts as the entry point for the chatbot application, facilitating API connections to the Streamlit frontend. To address anti-patterns related to authentication, insecure communication, and LLM security, it’s essential to implement secure authentication protocols, HTTPS/TLS, access controls, and input validation within API Gateway. Communication between the VPC resources and API Gateway is secured through VPC endpoints (4), using PrivateLink for secure private communication and attaching endpoint policies to control which AWS principals can access the API Gateway service (8), mitigating the insecure communication channels anti-pattern.

The Streamlit application (5) is hosted on Amazon ECS in a private subnet within the VPC. It hosts the frontend interface and must implement secure coding practices and input validation to mitigate insufficient input validation and sanitization. User inputs are then processed by Lambda (6), a serverless compute service hosted within the VPC, which connects to Amazon Bedrock, OpenSearch, and DynamoDB through VPC endpoints (7). These VPC endpoints have endpoint policies attached to control access, enabling secure private communication between the Lambda function and the services, mitigating the insecure communication channels anti-pattern. Within Lambda, strict input validation rules, allow-lists, and user input sanitization are implemented to address the input validation anti-pattern.

User requests from the chatbot application are sent to Amazon Bedrock (12), a generative AI solution that powers the LLM capabilities. To mitigate the failure to secure FM and generative AI components anti-pattern, secure communication channels, input validation, and sanitization for prompts and context data must be implemented when interacting with Amazon Bedrock.

Amazon Bedrock interacts with OpenSearch Service (9) using Amazon Bedrock knowledge bases to retrieve relevant context data for the user’s question. The knowledge base is created by ingesting public documents from Amazon S3 (10). To mitigate the anti-pattern of insecure data storage and access controls, implement encryption at rest using AWS KMS and fine-grained IAM policies and roles for access control within OpenSearch Service. Titan Embeddings (11) are the format of the vector embeddings, which represent the documents stored in Amazon S3. The vector format enables similarity calculation and retrieval of relevant information (12). To address the failure to secure FM and generative AI components anti-pattern, secure integration with Titan Embeddings and input data validation should be implemented.

The knowledge base data, user prompts, and context data are processed by Amazon Bedrock (13) with the Claude 3 LLM (14). To address the anti-patterns of failure to secure FM and generative AI components, as well as lack of responsible AI governance and ethics, secure communication channels, input validation, ethical AI governance frameworks, transparency and interpretability measures, stakeholder engagement, bias mitigation, and guardrails like Guardrails for Amazon Bedrock should be implemented.

The generated responses and recommendations are then stored and retrieved in Amazon DynamoDB (15) by the Lambda function. To mitigate insecure data storage and access, encrypting data at rest with AWS KMS (16) and implement fine-grained access controls through IAM policies and roles.

Comprehensive logging, auditing, and monitoring mechanisms are provided by CloudTrail (17), CloudWatch (18), and AWS Config (19) to address the inadequate logging, auditing, and non-repudiation anti-pattern. See the Comprehensive logging and monitoring strategy section for detailed guidance on implementing comprehensive logging, auditing, and monitoring mechanisms using CloudTrail, CloudWatch, CloudWatch Logs, and AWS Config to address the inadequate logging, auditing, and non-repudiation anti-pattern; including logging API calls made to Amazon Bedrock service, monitoring Amazon Bedrock-specific metrics, capturing and analyzing Bedrock invocation logs, and monitoring and auditing the configuration of resources related to the chatbot application and Amazon Bedrock service.

IAM (20) plays a crucial role in the overall architecture and in mitigating anti-patterns related to authentication and insecure data storage and access. IAM roles and permissions are critical in enforcing secure authentication mechanisms, least privilege access, multi-factor authentication, and robust credential management across the various components of the chatbot application. Additionally, service control policies (SCPs) can be configured to restrict access to specific models or knowledge bases within Amazon Bedrock, preventing unauthorized access or use of sensitive intellectual property.

Finally, GuardDuty (21), Amazon Inspector (22), Security Hub (23), and Security Lake (24) have been included as additional recommended services to further enhance the security posture of the chatbot application. GuardDuty (21) provides threat detection across the control and data planes, Amazon Inspector (22) enables vulnerability assessments and continuous monitoring of Amazon ECS and Lambda workloads. Security Hub (23) offers centralized security posture management and compliance checks, while Security Lake (24) acts as a centralized data lake for log analysis, integrated with CloudTrail and SecurityHub.

Conclusion

By identifying critical anti-patterns and providing comprehensive mitigation strategies, you now have a solid foundation for a secure and responsible deployment of generative AI technologies in enterprise environments.

The secure and responsible architecture blueprint presented in this post serves as a comprehensive guide for organizations that want to use the power of generative AI while ensuring robust security, data protection, and ethical governance. By incorporating industry-leading security controls—such as secure authentication mechanisms, encrypted data storage, fine-grained access controls, secure communication channels, input validation and sanitization, comprehensive logging and auditing, secure FM integration and monitoring, and responsible AI guardrails—this blueprint addresses the unique challenges and vulnerabilities associated with generative AI applications.

Moreover, the emphasis on comprehensive testing and validation processes, as well as the incorporation of ethical AI governance principles, makes sure that you can not only mitigate potential risks, but also promote transparency, explainability, and interpretability of the LLM components, while addressing potential biases and ensuring alignment with organizational values and societal norms.

By following the guidance outlined in this post and depicted in the architectural blueprint, you can proactively identify and mitigate potential risks, enhance the security posture of your generative AI-based chatbot solutions, protect sensitive data and intellectual property, maintain regulatory compliance, and responsibly deploy LLMs and generative AI technologies in your enterprise environments.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Improve Apache Kafka scalability and resiliency using Amazon MSK tiered storage

2024-08-02 Sai Maddali

Post Syndicated from Sai Maddali original https://aws.amazon.com/blogs/big-data/improve-apache-kafka-scalability-and-resiliency-using-amazon-msk-tiered-storage/

Since the launch of tiered storage for Amazon Managed Streaming for Apache Kafka (Amazon MSK), customers have embraced this feature for its ability to optimize storage costs and improve performance. In previous posts, we explored the inner workings of Kafka, maximized the potential of Amazon MSK, and delved into the intricacies of Amazon MSK tiered storage. In this post, we deep dive into how tiered storage helps with faster broker recovery and quicker partition migrations, facilitating faster load balancing and broker scaling.

Apache Kafka availability

Apache Kafka is a distributed log service designed to provide high availability and fault tolerance. At its core, Kafka employs several mechanisms to provide reliable data delivery and resilience against failures:

Kafka replication – Kafka organizes data into topics, which are further divided into partitions. Each partition is replicated across multiple brokers, with one broker acting as the leader and the others as followers. If the leader broker fails, one of the follower brokers is automatically elected as the new leader, providing continuous data availability. The replication factor determines the number of replicas for each partition. Kafka maintains a list of in-sync replicas (ISRs) for each partition, which are the replicas that are up to date with the leader.
Producer acknowledgments – Kafka producers can specify the required acknowledgment level for write operations. This makes sure the data is durably persisted on the configured number of replicas before the producer receives an acknowledgment, reducing the risk of data loss.
Consumer group rebalancing – Kafka consumers are organized into consumer groups, where each consumer in the group is responsible for consuming a subset of the partitions. If a consumer fails, the partitions it was consuming are automatically reassigned to the remaining consumers in the group, providing continuous data consumption.
Zookeeper or KRaft for cluster coordination – Kafka relies on Apache ZooKeeper or KRaft for cluster coordination and metadata management. It maintains information about brokers, topics, partitions, and consumer offsets, enabling Kafka to recover from failures and maintain a consistent state across the cluster.

Kafka’s storage architecture and its impact on availability and resiliency

Although Kafka provides robust fault-tolerance mechanisms, in the traditional Kafka architecture, brokers store data locally on their attached storage volumes. This tight coupling of storage and compute resources can lead to several issues, impacting availability and resiliency of the cluster:

Slow broker recovery – When a broker fails, the recovery process involves transferring data from the remaining replicas to the new broker. This data transfer can be slow, especially for large data volumes, leading to prolonged periods of reduced availability and increased recovery times.
Inefficient load balancing – Load balancing in Kafka involves moving partitions between brokers to distribute the load evenly. However, this process can be resource-intensive and time-consuming, because it requires transferring large amounts of data between brokers.
Scaling limitations – Scaling a Kafka cluster traditionally involves adding new brokers and rebalancing partitions across the expanded set of brokers. This process can be disruptive and time-consuming, especially for large clusters with high data volumes.

How Amazon MSK tiered storage improves availability and resiliency

Amazon MSK offers tiered storage, a feature that allows configuring local and remote tiers. This greatly decouples compute and storage resources and thereby addresses the aforementioned challenges, improving availability and resiliency of Kafka clusters. You can benefit from the following:

Faster broker recovery – With tiered storage, data automatically moves from the faster Amazon Elastic Block Store (Amazon EBS) volumes to the more cost-effective storage tier over time. New messages are initially written to Amazon EBS for fast performance. Based on your local data retention policy, Amazon MSK transparently transitions that data to tiered storage. This frees up space on the EBS volumes for new messages. When broker fails and recovers either due to node or volume failure, the catch-up is faster because it only needs to catch up data stored on the local tier from the leader.
Efficient load balancing – Load balancing in Amazon MSK with tiered storage is more efficient because there is less data to move while reassigning partition. This process is faster and less resource-intensive, enabling more frequent and seamless load balancing operations.
Faster scaling – Scaling an MSK cluster with tiered storage is a seamless process. New brokers can be added to the cluster without the need for a large amount of data transfer and longer time for partition rebalancing. The new brokers can start serving traffic much faster, because the catch-up process takes less time, improving the overall cluster throughput and reducing downtime during scaling operations.

As shown in the following figure, MSK brokers and EBS volumes are tightly coupled. On a three-AZ deployed cluster, when you create a topic with replication factor three, Amazon MSK spreads those three replicas across all three Availability Zones and the EBS volumes attached with that broker store all the topic data spread across three Availability Zones. If you need to move a partition from one broker to another, Amazon MSK needs to move all the segments (both active and closed) from the existing broker to the new brokers, as illustrated in the following figure.

However, when you enable tiered storage for that topic, Amazon MSK transparently moves all closed segments for a topic from EBS volumes to tiered storage. That storage provides the built-in capability for durability and high availability with virtually unlimited storage capacity. With closed segments moved to tiered storage and only active segments on the local volume, your local storage footprint remains minimal regardless of topic size. If you need to move the partition to a new broker, the data movement is very minimal across the brokers. The following figure illustrates this updated configuration.

Amazon MSK tiered storage addresses the challenges posed by Kafka’s traditional storage architecture, enabling faster broker recovery, efficient load balancing, and seamless scaling, thereby improving availability and resiliency of your cluster. To learn more about the core components of Amazon MSK tiered storage, refer to Deep dive on Amazon MSK tiered storage.

A real-world test

We hope that you now understand how Amazon MSK tiered storage can improve your Kafka resiliency and availability. To test it, we created a three-node cluster with the new m7g instance type. We created a topic with a replication factor of three and without using tiered storage. Using the Kafka performance tool, we ingested 300 GB of data into the topic. Next, we added three new brokers to the cluster. Because Amazon MSK doesn’t automatically move partitions to these three new brokers, they will remain idle until we rebalance the partitions across all six brokers.

Let’s consider a scenario where we need to move all the partitions from the existing three brokers to the three new brokers. We used the kafka-reassign-partitions tool to move the partitions from the existing three brokers to the newly added three brokers. During this partition movement operation, we observed that the CPU usage was high, even though we weren’t performing any other operations on the cluster. This indicates that the high CPU usage was due to the data replication to the new brokers. As shown in the following metrics, the partition movement operation from broker 1 to broker 2 took approximately 75 minutes to complete.

Additionally, during this period, CPU utilization was elevated.

After completing the test, we enabled tiered storage on the topic with local.retention.ms=3600000 (1 hour) and retention.ms=31536000000. We continuously monitored the RemoteCopyBytesPerSec metrics to determine when the data migration to tiered storage was complete. After 6 hours, we observed zero activity on the RemoteCopyBytesPerSec metrics, indicating that all closed segments had been successfully moved to tiered storage. For instructions to enable tiered storage on an existing topic, refer to Enabling and disabling tiered storage on an existing topic.

We then performed the same test again, moving partitions to three empty brokers. This time, the partition movement operation was completed in just under 15 minutes, with no noticeable CPU usage, as shown in the following metrics. This is because, with tiered storage enabled, all the data has already been moved to the tiered storage, and we only have the active segment in the EBS volume. The partition movement operation is only moving the small active segment, which is why it takes less time and minimal CPU to complete the operation.

Conclusion

In this post, we explored how Amazon MSK tiered storage can significantly improve the scalability and resilience of Kafka. By automatically moving older data to the cost-effective tiered storage, Amazon MSK reduces the amount of data that needs to be managed on the local EBS volumes. This dramatically improves the speed and efficiency of critical Kafka operations like broker recovery, leader election, and partition reassignment. As demonstrated in the test scenario, enabling tiered storage reduced the time taken to move partitions between brokers from 75 minutes to just under 15 minutes, with minimal CPU impact. This enhanced the responsiveness and self-healing ability of the Kafka cluster, which is crucial for maintaining reliable, high-performance operations, even as data volumes continue to grow.

If you’re running Kafka and facing challenges with scalability or resilience, we highly recommend using Amazon MSK with the tiered storage feature. By taking advantage of this powerful capability, you can unlock the true scalability of Kafka and make sure your mission-critical applications can keep pace with ever-increasing data demands.

To get started, refer to Enabling and disabling tiered storage on an existing topic. Additionally, check out Automated deployment template of Cruise Control for Amazon MSK for effortlessly rebalancing your workload.

About the Authors

Sai Maddali is a Senior Manager Product Management at AWS who leads the product team for Amazon MSK. He is passionate about understanding customer needs, and using technology to deliver services that empowers customers to build innovative applications. Besides work, he enjoys traveling, cooking, and running.

Nagarjuna Koduru is a Principal Engineer in AWS, currently working for AWS Managed Streaming For Kafka (MSK). He led the teams that built MSK Serverless and MSK Tiered storage products. He previously led the team in Amazon JustWalkOut (JWO) that is responsible for real time tracking of shopper locations in the store. He played pivotal role in scaling the stateful stream processing infrastructure to support larger store formats and reducing the overall cost of the system. He has keen interest in stream processing, messaging and distributed storage infrastructure.

Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.

Unlock scalability, cost-efficiency, and faster insights with large-scale data migration to Amazon Redshift

2024-08-01 Chanpreet Singh

Post Syndicated from Chanpreet Singh original https://aws.amazon.com/blogs/big-data/unlock-scalability-cost-efficiency-and-faster-insights-with-large-scale-data-migration-to-amazon-redshift/

Large-scale data warehouse migration to the cloud is a complex and challenging endeavor that many organizations undertake to modernize their data infrastructure, enhance data management capabilities, and unlock new business opportunities. As data volumes continue to grow exponentially, traditional data warehousing solutions may struggle to keep up with the increasing demands for scalability, performance, and advanced analytics.

Migrating to Amazon Redshift offers organizations the potential for improved price-performance, enhanced data processing, faster query response times, and better integration with technologies such as machine learning (ML) and artificial intelligence (AI). However, you might face significant challenges when planning for a large-scale data warehouse migration. These challenges can range from ensuring data quality and integrity during the migration process to addressing technical complexities related to data transformation, schema mapping, performance, and compatibility issues between the source and target data warehouses. Additionally, organizations must carefully consider factors such as cost implications, security and compliance requirements, change management processes, and the potential disruption to existing business operations during the migration. Effective planning, thorough risk assessment, and a well-designed migration strategy are crucial to mitigating these challenges and implementing a successful transition to the new data warehouse environment on Amazon Redshift.

In this post, we discuss best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

Success criteria for large-scale migration

The following diagram illustrates a scalable migration pattern for an extract, load, and transform (ELT) scenario using Amazon Redshift data sharing patterns.

The following diagram illustrates a scalable migration pattern for extract, transform, and load (ETL) scenario.

Migration pattern extract, transform, and load (ETL) scenarios

Success criteria alignment by all stakeholders (producers, consumers, operators, auditors) is key for successful transition to a new Amazon Redshift modern data architecture. The success criteria are the key performance indicators (KPIs) for each component of the data workflow. This includes the ETL processes that capture source data, the functional refinement and creation of data products, the aggregation for business metrics, and the consumption from analytics, business intelligence (BI), and ML.

KPIs make sure you can track and audit optimal implementation, achieve consumer satisfaction and trust, and minimize disruptions during the final transition. They measure workload trends, cost usage, data flow throughput, consumer data rendering, and real-life performance. This makes sure the new data platform can meet current and future business goals.

Migration from a large-scale mission-critical monolithic legacy data warehouse (such as Oracle, Netezza, Teradata, or Greenplum) is typically planned and implemented over 6–16 months, depending on the complexity of the existing implementation. The monolithic data warehouse environments that have been built over the last 30 years contain proprietary business logic and multiple data design patterns, including an operation data store, star or Snowflake schema, dimension and facts, data warehouses and data marts, online transaction processing (OLTP) real-time dashboards, and online analytic processing (OLAP) cubes with multi-dimensional analytics. The data warehouse is highly business critical with minimal allowable downtime. If your data warehouse platform has gone through multiple enhancements over the years, your operational service levels documentation may not be current with the latest operational metrics and desired SLAs for each tenant (such as business unit, data domain, or organization group).

As part of the success criteria for operational service levels, you need to document the expected service levels for the new Amazon Redshift data warehouse environment. This includes the expected response time limits for dashboard queries or analytical queries, elapsed runtime for daily ETL jobs, desired elapsed time for data sharing with consumers, total number of tenants with concurrency of loads and reports, and mission-critical reports for executives or factory operations.

As part of your modern data architecture transition strategy, the migration goal of a new Amazon Redshift based platform is to use the scalability, performance, cost-optimization, and additional lake house capabilities of Amazon Redshift, resulting in improving the existing data consumption experience. Depending on your enterprise’s culture and goals, your migration pattern of a legacy multi-tenant data platform to Amazon Redshift could use one of the following strategies:

Leapfrog strategy – In this strategy, you move to an AWS modern data architecture and migrate one tenant at a time. For an example, refer to How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform.
Organic strategy – This strategy uses a lift and shift data schema using migration tools. For an example, see GE Aviation Modernizes Technology Stack and Improves Data Accessibility Using Amazon Redshift).
Strangler strategy – This involves creating an abstraction layer for consumption and transitioning one component at a time. For more details, see Strangler Fig Application.

A majority of organizations opt for the organic strategy (lift and shift) when migrating their large data platforms to Amazon Redshift. This approach uses AWS migration tools such as the AWS Schema Conversion Tool (AWS SCT) or the managed service version DMS Schema Conversion to rapidly meet goals around data center exit, cloud adoption, reducing legacy licensing costs, and replacing legacy platforms.

By establishing clear success criteria and monitoring KPIs, you can implement a smooth migration to Amazon Redshift that meets performance and operational goals. Thoughtful planning and optimization are crucial, including optimizing your Amazon Redshift configuration and workload management, addressing concurrency needs, implementing scalability, tuning performance for large result sets, minimizing schema locking, and optimizing join strategies. This will enable right-sizing the Redshift data warehouse to meet workload demands cost-effectively. Thorough testing and performance optimization will facilitate a smooth transition with minimal disruption to end-users, fostering exceptional user experiences and satisfaction. A successful migration can be accomplished through proactive planning, continuous monitoring, and performance fine-tuning, thereby aligning with and delivering on business objectives.

Migration involves the following phases, which we delve into in the subsequent sections:

Assessment
- Discovery of workload and integrations
- Dependency analysis
- Effort estimation
- Team sizing
- Strategic wave planning
Functional and performance
- Code conversion
- Data validation
Measure and benchmark KPIs
- Platform-level KPIs
- Tenant-level KPIs
- Consumer-level KPIs
- Sample SQL
Monitoring Amazon Redshift performance and continual optimization
- Identify top offending queries
- Optimization strategies

To achieve a successful Amazon Redshift migration, it’s important to address these infrastructure, security, and deployment considerations simultaneously, thereby implementing a smooth and secure transition.

Assessment

In this section, we discuss the steps you can take in the assessment phase.

Discovery of workload and integrations

Conducting discovery and assessment for migrating a large on-premises data warehouse to Amazon Redshift is a critical step in the migration process. This phase helps identify potential challenges, assess the complexity of the migration, and gather the necessary information to plan and implement the migration effectively. You can use the following steps:

Data profiling and assessment – This involves analyzing the schema, data types, table sizes, and dependencies. Special attention should be given to complex data types such as arrays, JSON, or custom data types and custom user-defined functions (UDFs), because they may require specific handling during the migration process. Additionally, it’s essential to assess the volume of data and daily incremental data to be migrated, and estimate the required storage capacity in Amazon Redshift. Furthermore, analyzing the existing workload patterns, queries, and performance characteristics provides valuable insights into the resource requirements needed to optimize the performance of the migrated data warehouse in Amazon Redshift.
Code and query assessment – It’s crucial to assess the compatibility of existing SQL code, including queries, stored procedures, and functions. The AWS SCT can help identify any unsupported features, syntax, or functions that need to be rewritten or replaced to achieve a seamless integration with Amazon Redshift. Additionally, it’s essential to evaluate the complexity of the existing processes and determine if they require redesigning or optimization to align with Amazon Redshift best practices.
Performance and scalability assessment – This includes identifying performance bottlenecks, concurrency issues, or resource constraints that may be hindering optimal performance. This analysis helps determine the need for performance tuning or workload management techniques that may be required to achieve optimal performance and scalability in the Amazon Redshift environment.
Application integrations and mapping – Embarking on a data warehouse migration to a new platform necessitates a comprehensive understanding of the existing technology stack and business processes intertwined with the legacy data warehouse. Consider the following:
- Meticulously document all ETL processes, BI tools, and scheduling mechanisms employed in conjunction with the current data warehouse. This includes commercial tools, custom scripts, and any APIs or connectors interfacing with source systems.
- Take note of any custom code, frameworks, or mechanisms utilized in the legacy data warehouse for tasks such as managing slowly changing dimensions (SCDs), generating surrogate keys, implementing business logic, and other specialized functionalities. These components may require redevelopment or adaptation to operate seamlessly on the new platform.
- Identify all upstream and downstream applications, as well as business processes that rely on the data warehouse. Map out their specific dependencies on database objects, tables, views, and other components. Trace the flow of data from its origins in the source systems, through the data warehouse, and ultimately to its consumption by reporting, analytics, and other downstream processes.
Security and access control assessment – This includes reviewing the existing security model, including user roles, permissions, access controls, data retention policies, and any compliance requirements and industry regulations that need to be adhered to.

Dependency analysis

Understanding dependencies between objects is crucial for a successful migration. You can use system catalog views and custom queries on your on-premises data warehouses to create a comprehensive object dependency report. This report shows how tables, views, and stored procedures rely on each other. This also involves analyzing indirect dependencies (for example, a view built on top of another view, which in turn uses a set of tables), and having a complete understanding of data usage patterns.

Effort estimation

The discovery phase serves as your compass for estimating the migration effort. You can translate those insights into a clear roadmap as follows:

Object classification and complexity assessment – Based on the discovery findings, categorize objects (tables, views, stored procedures, and so on) based on their complexity. Simple tables with minimal dependencies will require less effort to migrate than intricate views or stored procedures with complex logic.
Migration tools – Use the AWS SCT to estimate the base migration effort per object type. The AWS SCT can automate schema conversion, data type mapping, and function conversion, reducing manual effort.
Additional considerations – Factor in additional tasks beyond schema conversion. This may include data cleansing, schema optimization for Amazon Redshift performance, unit testing of migrated objects, and migration script development for complex procedures. The discovery phase sheds light on potential schema complexities, allowing you to accurately estimate the effort required for these tasks.

Team sizing

With a clear picture of the effort estimate, you can now size the team for the migration.

Person-months calculation

Divide the total estimated effort by the desired project duration to determine the total person-months required. This provides a high-level understanding of the team size needed.

For example, for a ELT migration project from an on-premises data warehouse to Amazon Redshift to be completed within 6 months, we estimate the team requirements based on the number of schemas or tenants (for example, 30), number of database tables (for example, 5,000), average migration estimate for a schema (for example, 4 weeks based on complexity of stored procedures, tables and views, platform-specific routines, and materialized views), and number of business functions (for example, 2,000 segmented by simple, medium, and complex patterns). We can determine the following are needed:

Migration time period (65% migration/35% for validation & transition) = 0.8* 6 months = 5 months or 22 weeks
Dedicated teams = Number of tenants / (migration time period) / (average migration period for a tenant) = 30/5/1 = 6 teams
Migration team structure:
- One to three data developers with stored procedure conversion expertise per team, performing over 25 conversions per week
- One data validation engineer per team, testing over 50 objects per week
- One to two data visualization experts per team, confirming consumer downstream applications are accurate and performant
A common shared DBA team with performance tuning expertise responding to standardization and challenges
A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions

Team composition expertise

Based on the skillsets required for various migration tasks, we assemble a team with the right expertise. Platform architects define a well-architected platform. Data engineers are crucial for schema conversion and data transformation, and DBAs can handle cluster configuration and workload monitoring. An engagement or project management team makes sure the project runs smoothly, on time, and within budget.

For example, for an ETL migration project from Informatica/Greenplum to a target Redshift lakehouse with an Amazon Simple Storage Service (Amazon S3) data lake to be completed within 12 months, we estimate the team requirements based on the number of schemas and tenants (for example, 50 schemas), number of database tables (for example, 10,000), average migration estimate for a schema (6 weeks based on complexity of database objects), and number of business functions (for example, 5,000 segmented by simple, medium, and complex patterns). We can determine the following are needed:

An open data format ingestion architecture processing the source dataset and refining the data in the S3 data lake. This requires a dedicated team of 3–7 members building a serverless data lake for all data sources. Ingestion migration implementation is segmented by tenants and type of ingestion patterns, such as internal database change data capture (CDC); data streaming, clickstream, and Internet of Things (IoT); public dataset capture; partner data transfer; and file ingestion patterns.
The migration team composition is tailored to the needs of a project wave. Depending on each migration wave and what is being done in the wave (development, testing, or performance tuning), the right people will be engaged. When the wave is complete, the people from that wave will move to another wave.
A loading team builds a producer-consumer architecture in Amazon Redshift to process concurrent near real-time publishing of data. This requires a dedicated team of 3–7 members building and publishing refined datasets in Amazon Redshift.
A shared DBA group of 3–5 individuals helping with schema standardization, migration challenges, and performance optimization outside the automated conversion.
Data transformation experts to convert database stored functions in the producer or consumer.
A migration sprint plan for 10 months with 2 sprint weeks with multiple waves to release tenants to the new architecture.
A validation team to confirm a reliable and complete migration.
One to two data visualization experts per team, confirming that consumer downstream applications are accurate and performant.
A platform architecture team (3–5 individuals) focused on platform design, service levels, availability, operational standards, cost, observability, scalability, performance, and design pattern issue resolutions.

Strategic wave planning

Migration waves can be determined as follows:

Dependency-based wave delineation – Objects can be grouped into migration waves based on their dependency relationships. Objects with no or minimal dependencies will be prioritized for earlier waves, whereas those with complex dependencies will be migrated in subsequent waves. This provides a smooth and sequential migration process.
Logical schema and business area alignment – You can further revise migration waves by considering logical schema and business areas. This allows you to migrate related data objects together, minimizing disruption to specific business functions.

Functional and performance

In this section, we discuss the steps for refactoring the legacy SQL codebase to leverage Redshift SQL best practices, build validation routines to ensure accuracy and completeness during the transition to Redshift, capturing KPIs to ensure similar or better service levels for consumption tools/downstream applications, and incorporating performance hooks and procedures for scalable and performant Redshift Platform.

Code conversion

We recommend using the AWS SCT as the first step in the code conversion journey. The AWS SCT is a powerful tool that can streamline the database schema and code migrations to Amazon Redshift. With its intuitive interface and automated conversion capabilities, the AWS SCT can significantly reduce the manual effort required during the migration process. Refer to Converting data warehouse schemas to Amazon Redshift using AWS SCT for instructions to convert your database schema, including tables, views, functions, and stored procedures, to Amazon Redshift format. For an Oracle source, you can also use the managed service version DMS Schema Conversion.

When the conversion is complete, the AWS SCT generates a detailed conversion report. This report highlights any potential issues, incompatibilities, or areas requiring manual intervention. Although the AWS SCT automates a significant portion of the conversion process, manual review and modifications are often necessary to address various complexities and optimizations.

Some common cases where manual review and modifications are typically required include:

Incompatible data types – The AWS SCT may not always handle custom or non-standard data types, requiring manual intervention to map them to compatible Amazon Redshift data types.
Database-specific SQL extensions or proprietary functions – If the source database uses SQL extensions or proprietary functions specific to the database vendor (for example, STRING_AGG() or ARRAY_UPPER functions, or custom UDFs for PostgreSQL), these may need to be manually rewritten or replaced with equivalent Amazon Redshift functions or UDFs. The AWS SCT extension pack is an add-on module that emulates functions present in a source database that are required when converting objects to the target database.
Performance optimization – Although the AWS SCT can convert the schema and code, manual optimization is often necessary to take advantage of the features and capabilities of Amazon Redshift. This may include adjusting distribution and sort keys, converting row-by-row operations to set-based operations, optimizing query plans, and other performance tuning techniques specific to Amazon Redshift.
Stored procedures and code conversion – The AWS SCT offers comprehensive capabilities to seamlessly migrate stored procedures and other code objects across platforms. Although its automated conversion process efficiently handles the majority of cases, certain intricate scenarios may necessitate manual intervention due to the complexity of the code and utilization of database-specific features or extensions. To achieve optimal compatibility and accuracy, it’s advisable to undertake testing and validation procedures during the migration process.

After you address the issues identified during the manual review process, it’s crucial to thoroughly test the converted stored procedures, as well as other database objects and code, such as views, functions, and SQL extensions, in a non-production Redshift cluster before deploying them in the production environment. This exercise is mostly undertaken by QA teams. This phase also involves conducting holistic performance testing (individual queries, batch loads, consumption reports and dashboards in BI tools, data mining applications, ML algorithms, and other relevant use cases) in addition to functional testing to make sure the converted code meets the required performance expectations. The performance tests should simulate production-like workloads and data volumes to validate the performance under realistic conditions.

Data validation

When migrating data from an on-premises data warehouse to a Redshift cluster on AWS, data validation is a crucial step to confirm the integrity and accuracy of the migrated data. There are several approaches you can consider:

Custom scripts – Use scripting languages like Python, SQL, or Bash to develop custom data validation scripts tailored to your specific data validation requirements. These scripts can connect to both the source and target databases, extract data, perform comparisons, and generate reports.
Open source tools – Use open source data validation tools like Amazon Deequ or Great Expectations. These tools provide frameworks and utilities for defining data quality rules, validating data, and generating reports.
AWS native or commercial tools – Use AWS native tools such as AWS Glue Data Quality or commercial data validation tools like Collibra Data Quality. These tools often provide comprehensive features, user-friendly interfaces, and dedicated support.

The following are different types of validation checks to consider:

Structural comparisons – Compare the list of columns and data types of columns between the source and target (Amazon Redshift). Any mismatches should be flagged.
Row count validation – Compare the row counts of each core table in the source data warehouse with the corresponding table in the target Redshift cluster. This is the most basic validation step to make sure no data has been lost or duplicated during the migration process.
Column-level validation – Validate individual columns by comparing column-level statistics (min, max, count, sum, average) for each column between the source and target databases. This can help identify any discrepancies in data values or data types.

You can also consider the following validation strategies:

Data profiling – Perform data profiling on the source and target databases to understand the data characteristics, identify outliers, and detect potential data quality issues. For example, you can use the data profiling capabilities of AWS Glue Data Quality or the Amazon Deequ
Reconciliation reports – Produce detailed validation reports that highlight errors, mismatches, and data quality issues. Consider generating reports in various formats (CSV, JSON, HTML) for straightforward consumption and integration with monitoring tools.
Automate the validation process – Integrate the validation logic into your data migration or ETL pipelines using scheduling tools or workflow orchestrators like Apache Airflow or AWS Step Functions.

Lastly, keep in mind the following considerations for collaboration and communication:

Stakeholder involvement – Involve relevant stakeholders, such as business analysts, data owners, and subject matter experts, throughout the validation process to make sure business requirements and data quality expectations are met.
Reporting and sign-off – Establish a clear reporting and sign-off process for the validation results, involving all relevant stakeholders and decision-makers.

Measure and benchmark KPIs

For multi-tenant Amazon Redshift implementation, KPIs are segmented at the platform level, tenant level, and consumption tools level. KPIs evaluate the operational metrics, cost metrics, and end-user response time metrics. In this section, we discuss the KPIs needed for achieving a successful transition.

Platform-level KPIs

As new tenants are gradually migrated to the platform, it’s imperative to monitor the current state of Amazon Redshift platform-level KPIs. The current KPI’s state will help the platform team make the necessary scalability modifications (add nodes, add consumer clusters, add producer clusters, or increase concurrency scaling clusters). Amazon Redshift query monitoring rules (QMR) also help govern the overall state of data platform, providing optimal performance for all tenants by managing outlier workloads.

The following table summarizes the relevant platform-level KPIs.

Component	KPI	Service Level and Success Criteria
ETL	Ingestion data volume	Daily or hourly peak volume in GBps, number of objects, number of threads.
	Ingestion threads	Peak hourly ingestion threads (COPY or INSERT), number of dependencies, KPI segmented by tenants and domains.
	Stored procedure volume	Peak hourly stored procedure invocations segmented by tenants and domains.
	Concurrent load	Peak concurrent load supported by the producer cluster; distribution of ingestion pattern across multiple producer clusters using data sharing.
	Data sharing dependency	Data sharing between producer clusters (objects refreshed, locks per hour, waits per hour).
Workload	Number of queries	Peak hour query volume supported by cluster segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, domain, or sub-domain.
	Number of queries per queue	Peak hour query volume supported by priority automatic WLM queue segmented by short (less than 10 seconds), medium (less than 60 seconds), long (less than 5 minutes), very long (less than 30 minutes), and outlier (more than 30 minutes); segmented by tenant, business group, domain, or sub-domain.
	Runtime pattern	Total runtime per hour; max, median, and average run pattern; segmented by service class across clusters.
	Wait time patterns	Total wait time per hour; max, median, and average wait pattern for queries waiting.
Performance	Leader node usage	Service level for leader node (recommended less than 80%).
	Compute node CPU usage	Service level for compute node (recommended less than 90%).
	Disk I/O usage per node	Service level for disk I/O per node.
	QMR rules	Number of outlier queries stopped by QMR (large scan, large spilling disk, large runtime); logging thresholds for potential large queries running more than 5 minutes.
	History of WLM queries	Historical trend of queries stored in historical archive table for all instances of queries in STL_WLM_QUERY; trend analysis over 30 days, 60 days, and 90 days to fine-tune the workload across clusters.
Cost	Total cost per month of Amazon Redshift platform	Service level for mix of instances (reserved, on-demand, serverless), cost of Concurrency Scaling, cost of Amazon Redshift Spectrum usage. Use AWS tools like AWS Cost Explorer or daily cost usage report to capture monthly costs for each component.
	Daily Concurrency Scaling usage	Service limits to monitor cost for concurrency scaling; invoke for outlier activity on spikes.
	Daily Amazon Redshift Spectrum usage	Service limits to monitor cost for using Amazon Redshift Spectrum; invoke for outlier activity.
	Redshift Managed Storage usage cost	Track usage of Redshift Managed Storage, monitoring wastage on temporary, archival, and old data assets.
Localization	Remote or on-premises tools	Service level for rendering large datasets to remote destinations.
Localization	Data transfer to remote tools	Data transfer to BI tools or workstations outside the Redshift cluster VPC; separation of datasets to Amazon S3 using the unload feature, avoiding bottlenecks at leader node.

Tenant-level KPIs

Tenant-level KPIs help capture current performance levels from the legacy system and document expected service levels for the data flow from the source capture to end-user consumption. The captured legacy KPIs assist in providing the best target modern Amazon Redshift platform (a single Redshift data warehouse, a lake house with Amazon Redshift Spectrum, and data sharing with the producer and consumer clusters). Cost usage tracking at the tenant level helps you spread the cost of a shared platform across tenants.

The following table summarizes the relevant tenant-level KPIs.

Component	KPI	Service Level and Success Criteria
Cost	Compute usage by tenant	Track usage by tenant, business group, or domain; capture query volume by business unit associating Redshift user identity to internal business unit; data observability by consumer usage for data products helping with cost attribution.
ETL	Orchestration SLA	Service level for daily data availability.
	Runtime	Service level for data loading and transformation.
	Data ingestion volume	Peak expected volume for service level guarantee.
Query consumption	Response time	Response time SLA for query patterns (dashboards, SQL analytics, ML analytics, BI tool caching).
	Concurrency	Peak query consumers for tenant.
	Query volume	Peak hourly volume service levels and daily query volumes.
	Individual query response for critical data consumption	Service level and success criteria for critical workloads.

Consumer-level KPIs

A multi-tenant modern data platform can set service levels for a variety of consumer tools. The service levels provide guidance to end-users of the capability of the new deployment.

The following table summarizes the relevant consumer-level KPIs.

Consumer	KPI	Service Level and Success Criteria
BI tools	Large data extraction	Service level for unloading data for caching or query rendering a large result dataset.
Dashboards	Response time	Service level for data refresh.
SQL query tools	Response time	Service level for response time by query type.
SQL query tools	Concurrency	Service level for concurrent query access by all consumers.
One-time analytics	Response time	Service level for large data unloads or aggregation.
ML analytics	Response time	Service level for large data unloads or aggregation.

Sample SQL

The post includes sample SQL to capture daily KPI metrics. The following example KPI dashboard trends assist in capturing historic workload patterns, identifying deviations in workload, and providing guidance on the platform workload capacity to meet the current workload and anticipated growth patterns.

The following figure shows a daily query volume snapshot (queries per day and queued queries per day, which waited a minimum of 5 seconds).

Figure shows a daily query volume snapshot (queries per day and queued queries per day, which waited a minimum of 5 seconds)

The following figure shows a daily usage KPI. It monitors percentage waits and median wait for waiting queries (identifies the minimal threshold for wait to compute waiting queries and median of all wait times to infer deviation patterns).

Figure shows a daily usage KPI. It monitors percentage waits and median wait for waiting queries (identifies the minimal threshold for wait to compute waiting queries and median of all wait times to infer deviation patterns)

The following figure illustrates concurrency usage (monitors concurrency compute usage for Concurrency Scaling clusters).

The following figure illustrates concurrency usage (monitors concurrency compute usage for Concurrency Scaling clusters)

The following figure shows a 30-day pattern (computes volume in terms of total runtime and total wait time).

The following figure shows a 30-day pattern (computes volume in terms of total runtime and total wait time)

Monitoring Redshift performance and continual optimization

Amazon Redshift uses automatic table optimization (ATO) to choose the right distribution style, sort keys, and encoding when you create a table with AUTO options. Therefore, it’s a good practice to take advantage of the AUTO feature and create tables with DISTSTYLE AUTO, SORTKEY AUTO, and ENCODING AUTO. When tables are created with AUTO options, Amazon Redshift initially creates tables with optimal keys for the best first-time query performance possible using information such as the primary key and data types. In addition, Amazon Redshift analyzes the data volume and query usage patterns to evolve the distribution strategy and sort keys to optimize performance over time. Finally, Amazon Redshift performs table maintenance activities on your tables that reduce fragmentation and make sure statistics are up to date.

During a large, phased migration, it’s important to monitor and measure Amazon Redshift performance against target KPIs at each phase and implement continual optimization. As new workloads are onboarded at each phase of the migration, it’s recommended to perform regular Redshift cluster reviews and analyze query pattern and performance. Cluster reviews can be done by engaging the Amazon Redshift specialist team through AWS Enterprise support or your AWS account team. The goal of a cluster review includes the following:

Use cases – Review the application use cases and determine if the design is suitable to solve for those use cases.
End-to-end architecture – Assess the current data pipeline architecture (ingestion, transformation, and consumption). For example, determine if too many small inserts are occurring and review their ETL pipeline. Determine if integration with other AWS services can be useful, such as AWS Lake Formation, Amazon Athena, Redshift Spectrum, or Amazon Redshift federation with PostgreSQL and MySQL.
Data model design – Review the data model and table design and provide recommendations for sort and distribution keys, keeping in mind best practices.
Performance – Review cluster performance metrics. Identify bottlenecks or irregularities and suggest recommendations. Dive deep into specific long-running queries to identify solutions specific to the customer’s workload.
Cost optimization – Provide recommendations to reduce costs where possible.
New features – Stay up to date with the new features in Amazon Redshift and identify where they can be used to meet these goals.

New workloads can introduce query patterns that could impact performance and miss target SLAs. A number of factors can affect query performance. In the following sections, we discuss aspects impacting query speed and optimizations for improving Redshift cluster performance.

Identify top offending queries

A compute node is partitioned into slices. More nodes means more processors and more slices, which enables you to redistribute the data as needed across the slices. However, more nodes also means greater expense, so you will need to find the balance of cost and performance that is appropriate for your system. For more information on Redshift cluster architecture, see Data warehouse system architecture. Each node type offers different sizes and limits to help you scale your cluster appropriately. The node size determines the storage capacity, memory, CPU, and price of each node in the cluster. For more information on node types, see Amazon Redshift pricing.

Redshift Test Drive is an open source tool that lets you evaluate which different data warehouse configuration options are best suited for your workload. We created Redshift Test Drive from Simple Replay and Amazon Redshift Node Configuration Comparison (see Compare different node types for your workload using Amazon Redshift for more details) to provide a single entry point for finding the best Amazon Redshift configuration for your workload. Redshift Test Drive also provides additional features such as a self-hosted analysis UI and the ability to replicate external objects that a Redshift workload may interact with. With Amazon Redshift Serverless, you can start with a base Redshift Processing Unit (RPU), and Redshift Serverless automatically scales based on your workload needs.

Optimization strategies

If you choose to fine-tune manually, the following are key concepts and considerations:

Data distribution – Amazon Redshift stores table data on the compute nodes according to a table’s distribution style. When you run a query, the query optimizer redistributes the data to the compute nodes as needed to perform any joins and aggregations. Choosing the right distribution style for a table helps minimize the impact of the redistribution step by locating the data where it needs to be before the joins are performed. For more information, see Working with data distribution styles.
Data sort order – Amazon Redshift stores table data on disk in sorted order according to a table’s sort keys. The query optimizer and query processor use the information about where the data is located to reduce the number of blocks that need to be scanned and thereby improve query speed. For more information, see Working with sort keys.
Dataset size – A higher volume of data in the cluster can slow query performance for queries, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate (a condition in the WHERE clause) to restrict the query dataset.
Concurrent operations – Amazon Redshift offers a powerful feature called automatic workload management (WLM) with query priorities, which enhances query throughput and overall system performance. By intelligently managing multiple concurrent operations and allocating resources dynamically, automatic WLM makes sure high-priority queries receive the necessary resources promptly, while lower-priority queries are processed efficiently without compromising system stability. This advanced queuing mechanism allows Amazon Redshift to optimize resource utilization, minimizing potential bottlenecks and maximizing query throughput, ultimately delivering a seamless and responsive experience for users running multiple operations simultaneously.
Query structure – How your query is written will affect its performance. As much as possible, write queries to process and return as little data as will meet your needs. For more information, see Amazon Redshift best practices for designing queries.
Queries with a long return time – Queries with a long return time can impact the processing of other queries and overall performance of the cluster. It’s critical to identify and optimize them. You can optimize these queries by either moving clients to the same network or using the UNLOAD feature of Amazon Redshift, and then configure the client to read the output from Amazon S3. To identify percentile and top running queries, you can download the sample SQL notebook system queries. You can import this in Query Editor V2.0.

Conclusion

In this post, we discussed best practices for assessing, planning, and implementing a large-scale data warehouse migration into Amazon Redshift.

The assessment phase of a data migration project is critical for implementing a successful migration. It involves a comprehensive analysis of the existing workload, integrations, and dependencies to accurately estimate the effort required and determine the appropriate team size. Strategic wave planning is crucial for prioritizing and scheduling the migration tasks effectively. Establishing KPIs and benchmarking them helps measure progress and identify areas for improvement. Code conversion and data validation processes validate the integrity of the migrated data and applications. Monitoring Amazon Redshift performance, identifying and optimizing top offending queries, and conducting regular cluster reviews are essential for maintaining optimal performance and addressing any potential issues promptly.

By addressing these key aspects, organizations can seamlessly migrate their data workloads to Amazon Redshift while minimizing disruptions and maximizing the benefits of Amazon Redshift.

We hope this post provides you with valuable guidance. We welcome any thoughts or questions in the comments section.

About the authors

Chanpreet Singh is a Senior Lead Consultant at AWS, specializing in Data Analytics and AI/ML. He has over 17 years of industry experience and is passionate about helping customers build scalable data warehouses and big data solutions. In his spare time, Chanpreet loves to explore nature, read, and enjoy with his family.

Harshida Patel is a Analytics Specialist Principal Solutions Architect, with AWS.

Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Ram Bhandarkar is a Principal Data Architect at AWS based out of Northern Virginia. He helps customers with planning future Enterprise Data Strategy and assists them with transition to Modern Data Architecture platform on AWS. He has worked with building and migrating databases, data warehouses and data lake solutions for over 25 years.

Vijay Bagur is a Sr. Technical Account Manager. He works with enterprise customers to modernize and cost optimize workloads, improve security posture, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, biking and traveling.

Let’s Architect! Designing Well-Architected systems

2024-07-31 Vittorio Denti

Post Syndicated from Vittorio Denti original https://aws.amazon.com/blogs/architecture/lets-architect-well-architected-systems/

The design of cloud workloads can be a complex task, where a perfect and universal solution doesn’t exist. We should balance all the different trade-offs and find an optimal solution based on our context. But how does it work in practice? Which guiding principles should we follow? Which are the most important areas we should focus on?

In this blog, we will try to answer some of these questions by sharing a set of resources related to the AWS Well-Architected Framework. The Framework shares a set of methods to help you understand the pros and cons of decisions you make while building cloud systems. By following this resource, you will learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems in the cloud. The framework is constantly updated; it evolves as the technology landscape changes. Check out the latest updates from June 2024.

Build secure applications on AWS the Well-Architected way

The AWS Well-Architected Framework is constantly updated across all six pillars. The security pillar added a new best practice area: application security (AppSec). In this session, you can learn about the best practices highlighted in this area. Review four key domains: organization and culture, security of the pipeline, security in the pipeline, and dependency management. Each area provides a set of principles that you can implement and provides a complete view of how you design, develop, build, deploy, and operate secure workloads in the cloud.

Security should be part of the end-to-end development process, and implementing best practices both in the application code as well as in the underlying infrastructure components.

Figure 1. Security should be part of the end-to-end development process, and implementing best practices both in the application code as well as in the underlying infrastructure components.

Take me to this video

Announcing the AWS Well-Architected Mergers and Acquisitions Lens

How can we integrate different systems as a consequence of an acquisition? Mergers and acquisitions operations bring different people with different backgrounds together, with a need of driving systems convergence. Both organization and technical challenges can arise in this scenario. The Mergers and Acquisitions (M&A) Lens is a collection of customer-proven design principles, best practices, and prescriptive guidance to help you integrate the IT systems of two or more organizations. This lens helps companies follow AWS prescribed best practices during technical integration, drive cost optimization, and expedite merger and acquisition value realization.

If the seller company runs on another cloud platform or on-premises, the acquirer should plan a cloud migration while guaranteeing continuity of service.

Figure 2. If the seller company runs on another cloud platform or on-premises, the acquirer should plan a cloud migration while guaranteeing continuity of service.

Take me to this blog

AWS Well-Architected Labs

One of the best ways to become familiar with new concepts and methodologies consist of doing hands-on work to absorb the techniques properly. For each Let’s Architect! blog, we tend to share at least one workshop associated with the topic. The AWS Well-Architected Framework covers six different pillars, so today we share the AWS Well-Architected Labs to cover each area of the framework. Feel free to jump across the different workshops and start building!

Sustainability is one of the pillars in the framework. Asynchronous and scheduled processing are key techniques for improving the sustainability and costs of cloud architectures.

Figure 3. Sustainability is one of the pillars in the framework. Asynchronous and scheduled processing are key techniques for improving the sustainability and costs of cloud architectures.

Take me to this workshop

Gain confidence in system correctness and resilience with formal methods

Distributed systems are difficult to design. It’s even more difficult to test them and prove they are working. Formal methods enable the early discovery of design bugs that can escape the guardrails of design reviews and automated testing only to get uncovered in production. This video shows how AWS uses P, an open source, state machine–based programming language for formal modelling and analysis of distributed systems.

You can learn from AWS engineers and architects how to use P for your own applications to find bugs early in the development process and increase developer velocity. This tool is used in AWS to reason out the correctness of cloud services (for example, Amazon Simple Storage Service and Amazon DynamoDB).

Figure 4. An example of a distributed system for processing transactions.

Take me to this video

See you next time!

Thanks for reading! Hopefully, you got interesting insights into the methodologies for designing Well-Architected systems. In the next blog, we will talk about multi-region architectures. We will understand when they are actually needed, and which design principles should be applied.

To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page.

Balance deployment speed and stability with DORA metrics

2024-07-31 Rostislav Markov

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/devops/balance-deployment-speed-and-stability-with-dora-metrics/

Development teams adopt DevOps practices to increase the speed and quality of their software delivery. The DevOps Research and Assessment (DORA) metrics provide a popular method to measure progress towards that outcome. Using four key metrics, senior leaders can assess the current state of team maturity and address areas of optimization.

This blog post shows you how to make use of DORA metrics for your Amazon Web Services (AWS) environments. We share a sample solution which allows you to bootstrap automatic metric collection in your AWS accounts.

Benefits of collecting DORA metrics

DORA metrics offer insights into your development teams’ performance and capacity by measuring qualitative aspects of deployment speed and stability. They also indicate the teams’ ability to adapt by measuring the average time to recover from failure. This helps product owners in defining work priorities, establishing transparency on team maturity, and developing a realistic workload schedule. The metrics are appropriate for communication with senior leadership. They help commit leadership support to resolve systemic issues inhibiting team satisfaction and user experience.

Use case

This solution is applicable to the following use case:

Development teams have a multi-account AWS setup including a tooling account where the CI/CD tools are hosted, and an operations account for log aggregation and visualization.
Developers use GitHub code repositories and AWS CodePipeline to promote code changes across application environment accounts.
Tooling, operations, and application environment accounts are member accounts in AWS Control Tower or workload accounts in the Landing Zone Accelerator on AWS solution.
Service impairment resulting from system change is logged as OpsItem in AWS Systems Manager OpsCenter.

Overview of solution

The four key DORA metrics

The ‘four keys’ measure team performance and ability to react to problems:

Deployment Frequency measures the frequency of successful change releases in your production environment.
Lead Time For Changes measures the average time for committed code to reach production.
Change Failure Rate measures how often changes in production lead to service incidents/failures, and is complementary to Mean Time Between Failure.
Mean Time To Recovery measures the average time from service interruption to full recovery.

The first two metrics focus on deployment speed, while the other two indicate deployment stability (Figure 1). We recommend organizations to set their own goals (that is, DORA metric targets) based on service criticality and customer needs. For a discussion of prior DORA benchmark data and what it reveals about the performance of development teams, consult How DORA Metrics Can Measure and Improve Performance.

Figure 1. Overview of DORA metrics

Consult the GitHub code repository Balance deployment speed and stability with DORA metrics for a detailed description of the metric calculation logic. Any modifications to this logic should be made carefully.

For example, the Change Failure Rate focuses on changes that impair the production system. Limiting the calculation to tags (such as hotfixes) on pull requests would exclude issues related to the build process. It’s important to match system change records that lead to actual impairments in production. Limiting the calculation to the number of failed deployments from the deployment pipeline only considers deployments that didn’t reach production. We use AWS Systems Manager OpsCenter as the system of records for change-related outages, rather than relying solely on data from CI/CD tools.

Similarly, Mean Time To Recovery measures the duration from a service impairment in production to a successful pipeline run. We encourage teams to track both pipeline status and recovery time, as frequent pipeline failure can indicate insufficient local testing and potential pipeline engineering issues.

Gathering DORA events

Our metric calculation process runs in four steps:

In the tooling account, we send events from CodePipeline to the default event bus of Amazon EventBridge.
Events are forwarded to custom event buses which process them according to the defined metrics and any filters we may have set up.
The custom event buses call AWS Lambda functions which forward metric data to Amazon CloudWatch. CloudWatch gives us an aggregated view of each of the metrics. From Amazon CloudWatch, you can send the metrics to another designated dashboard like Amazon Managed Grafana.
As part of the data collection, the Lambda function will also query GitHub for the relevant commit to calculate the lead time for changes metric. It will query AWS Systems Manager for OpsItem data for change failure rate and mean time to recovery metrics. You can create OpsItems manually as part of your change management process or configure CloudWatch alarms to create OpsItems automatically.

Figure 2 visualizes these steps. This setup can be replicated to a group of accounts of one or multiple teams.

This figure visualizes the aforementioned four steps of our metric calculation process. AWS Lambda functions process all events and publish custom metrics in Amazon CloudWatch.

Figure 2. DORA metric setup for AWS CodePipeline deployments

Walkthrough

Follow these steps to deploy the solution in your AWS accounts.

Prerequisites

For this walkthrough, you should have the following prerequisites:

AWS accounts for tooling, operations, and application environments
Install Python version 3.9 or later
AWS Cloud Development Kit (AWS CDK) v2 installed
Set up an AWS CDK pipeline
Access to AWS CodePipeline, AWS Systems Manager OpsCenter and GitHub

Deploying the solution

Clone the GitHub code repository Balance deployment speed and stability with DORA metrics.

Before you start deploying or working with this code base, there are a few configurations you need to complete in the constants.py file in the cdk/ directory. Open the file in your IDE and update the following constants:

TOOLING_ACCOUNT_ID & TOOLING_ACCOUNT_REGION: These represent the AWS account ID and AWS region for AWS CodePipeline (that is, your tooling account).
OPS_ACCOUNT_ID & OPS_ACCOUNT_REGION: These are for your operations account (used for centralized log aggregation and dashboard).
TOOLING_CROSS_ACCOUNT_LAMBDA_ROLE: The IAM Role for cross-account access that allows AWS Lambda to post metrics from your tooling account to your operations account/Amazon CloudWatch dashboard.
DEFAULT_MAIN_BRANCH: This is the default branch in your code repository that’s used to deploy to your production application environment. It is set to “main” by default, as we assumed feature-driven development (GitFlow) on the main branch; update if you use a different naming convention.
APP_PROD_STAGE_NAME: This is the name of your production stage and set to “DeployPROD” by default. It’s reserved for teams with trunk-based development.

Setting up the environment

To set up your environment on MacOS and Linux:

Create a virtual environment:
```
$ python3 -m venv .venv
```
Activate the virtual environment: On MacOS and Linux:
```
$ source .venv/bin/activate
```

Alternatively, to set up your environment on Windows:

Create a virtual environment:
```
% .venv\Scripts\activate.bat
```
Install the required Python packages:
```
$ pip install -r requirements.txt
```

To configure the AWS Command Line Interface (AWS CLI):

Follow the configuration steps in the AWS CLI User Guide.
```
$ aws configure sso
```
Configure your user profile (for example, Ops for operations account, Tooling for tooling account). You can check user profile names in the credentials file.

Deploying the CloudFormation stacks

Switch directory
```
$ cd cdk
```
Bootstrap CDK
```
$ cdk bootstrap –-profile Ops
```
Synthesize the AWS CloudFormation template for this project:
```
$ cdk synth
```
To deploy a specific stack (see Figure 3 for an overview), specify the stack name and AWS account number(s) in the following command:
```
$ cdk deploy <Stack-Name> --profile {Tooling, Ops}
```
To launch the DoraToolingEventBridgeStack stack in the Tooling account:
```
$ cdk deploy DoraToolingEventBridgeStack --profile Tooling
```
To launch the other stacks in the Operations account (including DoraOpsGitHubLogsStack, DoraOpsDeploymentFrequencyStack, DoraOpsLeadTimeForChangeStack, DoraOpsChangeFailureRateStack, DoraOpsMeanTimeToRestoreStack, DoraOpsMetricsDashboardStack):
```
$ cdk deploy DoraOps* --profile Ops
```

The following figure shows the resources you’ll launch with each CloudFormation stack. This includes six AWS CloudFormation stacks in operations account. The first stack sets up log integration for GitHub commit activity. Four stacks contain a Lambda function which creates one of the DORA metrics. The sixth stack creates the consolidated dashboard in Amazon CloudWatch.

Figure 3. Resources provisioned with this solution

Testing the deployment

To run the provided tests:

$ pytest

Understanding what you’ve built

Deployed resources in tooling account

The DoraToolingEventBridgeStack includes Amazon EventBridge rules with a target of the central event bus in the operations account, plus an AWS IAM role with cross-account access to put events in the operations account. The event pattern for invoking our EventBridge rules listens for deployment state changes in AWS CodePipeline:

{
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "source": ["aws.codepipeline"]
}

Deployed resources in operations account

The Lambda function for Deployment Frequency tracks the number of successful deployments to production, and posts the metric data to Amazon CloudWatch. You can add a dimension with the repository name in Amazon CloudWatch to filter on particular repositories/teams.
The Lambda function for the Lead Time For Change metric calculates the duration from the first commit to successful deployment in production. This covers all factors contributing to lead time for changes, including code reviews, build, test, as well as the deployment itself.
The Lambda function for Change Failure Rate keeps track of the count of successful deployments and the count of system impairment records (OpsItems) in production. It publishes both as metrics to Amazon CloudWatch and the latter calculates the ratio, as shown in below example.
The Lambda function for Mean Time To Recovery keeps track of all deployments with status SUCCEEDED in production and whose repository branch name references an existing OpsItem ID. For every matching event, the function gets the creation time of the OpsItem record and posts the duration between OpsItem creation and successful re-deployment to the CloudWatch dashboard.

All Lambda functions publish metric data to Amazon CloudWatch using the PutMetricData API. The final calculation of the four keys is performed on the CloudWatch dashboard. The solution includes a simple CloudWatch dashboard so you can validate the end-to-end data flow and confirm that it has deployed successfully:

This simple CloudWatch dashboard displays the four DORA metrics for three reporting periods: per day, per week, and per month.

Cleaning up

Remember to delete example resources if you no longer need them to avoid incurring future costs.

You can do this via the CDK CLI:

$ cdk destroy <Stack-Name> --profile {Tooling, Ops}

Alternatively, go to the CloudFormation console in each AWS account, select the stacks related to DORA and click on Delete. Confirm that the status of all DORA stacks is DELETE_COMPLETE.

Conclusion

DORA metrics provide a popular method to measure the speed and stability of your deployments. The solution in this blog post helps you bootstrap automatic metric collection in your AWS accounts. The four keys help you gain consensus on team performance and provide data points to back improvement suggestions. We recommend using the solution to gain leadership support for systemic issues inhibiting team satisfaction and user experience. To learn more about developer productivity research, we encourage you to also review alternative frameworks including DevEx and SPACE.

Further resources

If you enjoyed this post, you may also like:

Author bio

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

2024-07-29 Michael Greenshtein

Post Syndicated from Michael Greenshtein original https://aws.amazon.com/blogs/big-data/monitoring-apache-iceberg-metadata-layer-using-aws-lambda-aws-glue-and-aws-cloudwatch/

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. Data lakes provide a unified repository for organizations to store and use large volumes of data. This enables more informed decision-making and innovative insights through various analytics and machine learning applications.

Despite their advantages, traditional data lake architectures often grapple with challenges such as understanding deviations from the most optimal state of the table over time, identifying issues in data pipelines, and monitoring a large number of tables. As data volumes grow, the complexity of maintaining operational excellence also increases. Monitoring and tracking issues in the data management lifecycle are essential for achieving operational excellence in data lakes.

This is where Apache Iceberg comes into play, offering a new approach to data lake management. Apache Iceberg is an open table format designed specifically to improve the performance, reliability, and scalability of data lakes. It addresses many of the shortcomings of traditional data lakes by providing features such as ACID transactions, schema evolution, row-level updates and deletes, and time travel.

In this blog post, we’ll discuss how the metadata layer of Apache Iceberg can be used to make data lakes more efficient. You will learn about an open-source solution that can collect important metrics from the Iceberg metadata layer. Based on collected metrics, we will provide recommendations on how to improve the efficiency of Iceberg tables. Additionally, you will learn how to use Amazon CloudWatch anomaly detection feature to detect ingestion issues.

Deep dive into Iceberg’s Metadata layer

Before diving into a solution, let’s understand how the Apache Iceberg metadata layer works. The Iceberg metadata layer provides an open specification instructing integrated big data engines such as Spark or Trino how to run read and write operations and how to resolve concurrency issues. It’s crucial for maintaining inter-operability between different engines. It stores detailed information about tables such as schema, partitioning, and file organization in versioned JSON and Avro files. This ensures that each change is tracked and reversible, enhancing data governance and auditability.

Apache Iceberg metadata layer architecture diagram

History and versioning: Iceberg’s versioning feature captures every change in table metadata as immutable snapshots, facilitating data integrity, historical views, and rollbacks.

File organization and snapshot management: Metadata closely manages data files, detailing file paths, formats, and partitions, supporting multiple file formats like Parquet, Avro, and ORC. This organization helps with efficient data retrieval through predicate pushdown, minimizing unnecessary data scans. Snapshot management allows concurrent data operations without interference, maintaining data consistency across transactions.

In addition to its core metadata management capabilities, Apache Iceberg also provides specialized metadata tables—snapshots, files, and partitions—that provide deeper insights and control over data management processes. These tables are dynamically generated and provide a live view of the metadata for query purposes, facilitating advanced data operations:

Snapshots table: This table lists all snapshots of a table, including snapshot IDs, timestamps, and operation types. It enables users to track changes over time and manage version history effectively.
Files table: The files table provides detailed information on each file in the table, including file paths, sizes, and partition values. It is essential for optimizing read and write performance.
Partitions table: This table shows how data is partitioned across different files and provides statistics for each partition, which is crucial for understanding and optimizing data distribution.

Metadata tables enhance Iceberg’s functionality by making metadata queries straightforward and efficient. Using these tables, data teams can gain precise control over data snapshots, file management, and partition strategies, further improving data system reliability and performance.

Before you get started

The next section describes a packaged open source solution using Apache Iceberg’s metadata layer and AWS services to enhance monitoring across your Iceberg tables.

Before we deep dive into the suggested solution, let’s mention Iceberg MetricsReporter, which is a native way to emit metrics for Apache Iceberg. It supports two types of reports: one for commits and one for scans. The default output is log based. It produces log files as a result of commit or scan operations. To submit metrics to CloudWatch or any other monitoring tool, users need to create and configure a custom MetricsReporter implementation. MetricsReporter is supported in Apache Iceberg v1.1.0 and later versions, and customers who want to use it must enable it through Spark configuration on their existing pipelines.

The following is deployed independently and doesn’t require any configuration changes to existing data pipelines. It can immediately start monitoring all the tables within the AWS account and AWS Region where it’s deployed. This solution introduces an additional latency of metrics arrival between 20 and 80 seconds compared to MetricsReporter but offers seamless integration without the need for custom configurations or changes to current workflows.

Solution overview

This solution is specifically designed for customers who run Apache Iceberg on Amazon Simple Storage Service (Amazon S3) and use AWS Glue as their data catalog.

Key features

This solution uses an AWS Lambda deployment package to collect metrics from Apache Iceberg tables. The metrics are then submitted to CloudWatch where you can create metrics visualizations to help recognize trends and anomalies over time.

The solution is designed to be lightweight, focusing on collecting metrics directly from the Iceberg metadata layer without scanning the actual data layer. This approach significantly reduces the compute capacity required, making it efficient and cost-effective. Key features of the solution include:

Time-series metrics collection: The solution monitors Iceberg tables continuously to identify trends and detect anomalies in data ingestion rates, partition skewness, and more.
Event-driven architecture: The solution uses Amazon EventBridge to launch a Lambda function when the state of an AWS Glue Data Catalog table changes. This ensures real-time metrics collection every time a transaction is committed to an Iceberg table.
Efficient data retrieval: Incorporates minimal compute resources by utilizing AWS Glue interactive sessions and the pyiceberg library to directly access Iceberg metadata tables such as snapshots, partitions, and files.

Metrics tracked

As of the blog release date, the solution collects over 25 metrics. These metrics are categorized into several groups:

Snapshot metrics: Include total and changes in data files, delete files, records added or removed, and size changes.
Partition and file metrics: Aggregated and per-partition metrics like average, maximum, minimum record counts and file sizes, which help in understanding data distribution and help optimizing storage.

To see the complete list of metrics, go to the GitHub repository.

Visualizing data with CloudWatch dashboards

The solution also provides a sample CloudWatch dashboard to visualize the collected metrics. Metrics visualization is important for real-time monitoring and detecting operational issues. The provided helper script simplifies the set up and deployment of the dashboard.

Amazon CloudWatch dashboard

You can go to the GitHub repository to learn more about how to deploy the solution in your AWS account.

What are the vital metrics for Apache Iceberg tables?

This section discusses specific metrics from Iceberg’s metadata and explains why they’re important for monitoring data quality and system performance. The metrics are broken down into three parts: insight, challenge, and action. This provides a clear path for practical application. In this section, we provide only a subset of the available metrics that the solution can collect, for a complete list, see the solution Github page.

1. snapshot.added_data_files, snapshot.added_records

Metric insight: The number of data files and number of records added to the table during the last transaction. The ingestion rate measures the speed at which new data is added to the data lake. This metric helps identify bottlenecks or inefficiencies in data pipelines, guiding capacity planning and scalability decisions.
Challenge: A sudden drop in the ingestion rate can indicate failures in data ingestion pipelines, source system outages, configuration errors or traffic spikes.
Action: Teams need to establish real-time monitoring and alert systems to detect drops in ingestion rates promptly, allowing quick investigations and resolutions.

2. files.avg_record_count, files.avg_file_size

Metric insight: These metrics provide insights into the distribution and storage efficiency of the table. Small file sizes might suggest excessive fragmentation.
Challenge: Excessively small file sizes can indicate inefficient data storage leading to increased read operations and higher I/O costs.
Action: Implementing regular data compaction processes helps consolidate small files, optimizing storage and enhancing content delivery speeds as demonstrated by a streaming service. Data Catalog offers automatic compaction of Apache Iceberg tables. To learn more about compacting Apache Iceberg tables, see Enable compaction in Working with tables on the AWS Glue console.

3. partitions.skew_record_count, partitions.skew_file_count

Metric insight: The metrics indicate the asymmetry of the data distribution across the available table partitions. A skewness value of zero, or very close to zero, suggests that the data is balanced. Positive or negative skewness values might indicate a problem.
Challenge: Imbalances in data distribution across partitions can lead to inefficiencies and slow query responses.
Action: Regularly analyze data distribution metrics to adjust partitioning configuration. Apache Iceberg allows you to transform partitions dynamically, which enables optimization of table partitioning as query patterns or data volumes change, without impacting your existing data.

4. snapshot.deleted_records, snapshot.total_delete_files, snapshot.added_position_deletes

Metric insight: Deletion metrics in Apache Iceberg provide important information on the volume and nature of data deletions within a table. These metrics help track how often data is removed or updated, which is essential for managing data lifecycle and compliance with data retention policies.
Challenge: High values in these metrics can indicate excessive deletions or updates, which might lead to fragmentation and decreased query performance.
Action: To address these challenges, run compaction periodically to ensure deleted rows do not persist in new files. Regularly review and adjust data retention policies and consider expiring old snapshots to keep only necessary amount of data files. You can run compaction operation on specific partitions using Amazon Athena Optimize

Effective monitoring is essential for making informed decisions about necessary maintenance actions for Apache Iceberg tables. Determining the right timing for these actions is crucial. Implementing timely preventative maintenance ensures high operational efficiency of the data lake and helps to address potential issues before they become significant problems.

Using Amazon CloudWatch for anomaly detection and alerts

This section assumes that you have completed the solution setup and collected operational metrics from your Apache Iceberg tables into Amazon CloudWatch.

Now you can start setting up some alerts and detect anomalies.

We guide you on setting up the anomaly detection and configuring alerts in CloudWatch to monitor the snapshot.added_records metric, which indicates the ingestion rate of data written into an Apache Iceberg table.

Set up anomaly detection

CloudWatch anomaly detection applies machine learning algorithms to continuously analyze system metrics, determine normal baselines, and identify items that are outside of the established patterns. Here is how you configure it:

Amazon CloudWatch anomaly detection screenshot

Select Metrics: In the AWS Management Console for Cloudwatch, go to the Metrics tab and search for and select snapshot.added_records.
Create anomaly detection models: Choose the Graphed metrics tab and click the Pulse icon to enable anomaly detection.
Set Sensitivity: The second parameter of the ANOMALY_DETECTION_BAND (m1, 5) is to adjust the sensitivity of the anomaly detection. The goal is to balance detecting real issues and reducing false positives.

Configure alerts

After the anomaly detection model is set up, set up an alert to notify operations teams about potential issues:

Create alarm: Choose the bell icon under Actions on the same Graphed metrics tab.
Alarm settings: Set the alarm to notify the operations team when the snapshot.added_records metric is outside the anomaly detection band for two consecutive periods. This helps reduce the risk of false alerts.
Alarm actions: Configure CloudWatch to send an alarm email to the operations team. In addition to sending emails, CloudWatch alarm actions can automatically launch remediation processes, such as scaling operations or initiating data compaction.

Best practices

Regularly review and adjust models: As data patterns evolve, periodically review and adjust anomaly detection models and alarm settings to remain effective.
Comprehensive coverage: Ensure that all critical aspects of the data pipeline are monitored, not just a few metrics.
Documentation and communication: Maintain clear documentation of what each metric and alarm represent and ensure that your operations team understands the monitoring set up and response procedures. Set up the alerting mechanisms to send notifications through appropriate channels such as email, corporate messenger, or telephone to ensure your operations team stays informed and can quickly address the issues.
Create playbooks and automate remediation tasks: Establish detailed playbooks that describe step-by-step responses for common scenarios identified by alerts. Additionally, automate remediation tasks where possible to speed up response times and reduce the manual burden on teams. This ensures consistent and effective responses to all incidents.

CloudWatch anomaly detection and alerting features help organizations proactively manage their data lakes. This ensures data integrity, reduces downtime, and maintains high data quality. As a result, it enhances operational efficiency and supports robust data governance.

Conclusion

In this blog post, we explored Apache Iceberg’s transformative impact on data lake management. Apache Iceberg addresses the challenges of big data with features like ACID transactions, schema evolution, and snapshot isolation, enhancing data reliability, query performance, and scalability.

We delved into Iceberg’s metadata layer and related metadata tables such as snapshots, files, and partitions that allow easy access to crucial information about the current state of the table. These metadata tables facilitate the extraction of performance-related data, enabling teams to monitor and optimize the data lake’s efficiency.

Finally, we showed you a practical solution for monitoring Apache Iceberg tables using Lambda, AWS Glue, and CloudWatch. This solution uses Iceberg’s metadata layer and CloudWatch monitoring capabilities to provide a proactive operational framework. This framework detects trends and anomalies, ensuring robust data lake management.

About the Author

Avatar Michael Greenshtein is a Senior Analytics Specialist at Amazon Web Services. He is an experienced data professional with over 8 years in cloud computing and data management. Michael is passionate about open-source technology and Apache Iceberg.

Testing your applications with Amazon Q Developer

2024-07-29 Svenja Raether

Post Syndicated from Svenja Raether original https://aws.amazon.com/blogs/devops/testing-your-applications-with-amazon-q-developer/

Testing code is a fundamental step in the field of software development. It ensures that applications are reliable, meet quality standards, and work as intended. Automated software tests help to detect issues and defects early, reducing impact to end-user experience and business. In addition, tests provide documentation and prevent regression as code changes over time.

In this blog post, we show how the integration of generative AI tools like Amazon Q Developer can further enhance unit testing by automating test scenarios and generating test cases.

Amazon Q Developer helps developers and IT professionals with all of their tasks across the software development lifecycle – from coding, testing, and upgrading, to troubleshooting, performing security scanning and fixes, optimizing AWS resources, and creating data engineering pipelines. It integrates into your Integrated Development Environment (IDE) and aids with providing answers to your questions. Amazon Q Developer supports you across the Software Development Lifecycle (SLDC) by enriching feature and test development with step-by-step instructions and best practices. It learns from your interactions and training itself over time to output personalized, and tailored answers.

Solution overview

In this blog, we show how to use Amazon Q Developer to:

learn about software testing concepts and frameworks
identify unit test scenarios
write unit test cases
refactor test code
mock dependencies
generate sample data

Note: Amazon Q Developer may generate an output different from this blog post’s examples due to its nondeterministic nature.

Using Amazon Q Developer to learn about software testing frameworks and concepts

As you start gaining experience with testing, Amazon Q Developer can accelerate your learning through conversational Q&A directly within the AWS Management Console or the IDE. It can explain topics, provide general advice and share helpful resources on testing concepts and frameworks. It gives personalized recommendations on resources which makes the learning experience more interactive and accelerates the time to get started with writing unit tests. Let’s introduce an example conversation to demonstrate how you can leverage Amazon Q Developer for learning before attempting to write your first test case.

Example – Select and install frameworks

A unit testing framework is a software tool used for test automation, design and management of test cases. Upon starting a software project, you may be faced with the selection of a framework depending on the type of tests, programming language, and underlying technology. Let’s ask for recommendations around unit testing frameworks for Python code running in an AWS Lambda function.

In the Visual Studio Code IDE, a user asks Amazon Q Developer for recommendations on unit test frameworks suitable for AWS Lambda functions written in Python using the following prompt: “Can you recommend unit testing frameworks for AWS Lambda functions in Python?”. Amazon Q Developer returns a numbered list of popular unit testing frameworks, including Pytest, unittest, Moto, AWS SAM Local, and AWS Lambda Powertools. Amazon Q Developer returns two URLs in the Sources section. The first URL links to an article called “AWS Lambda function testing in Python – AWS Lambda” from the AWS documentation, and the other URL to an AWS DevOps Blog post with the title “Unit Testing AWS Lambda with Python and Mock AWS Services”.

Figure 1: Recommend unit testing frameworks for AWS Lambda functions in Python

In Figure 1, Amazon Q Developer answers with popular frameworks (pytest, unittest, Moto, AWS SAM Command Line Interface, and Powertools for AWS Lambda) including a brief description for each of them. It provides a reference to the sources of its response at the bottom and suggested follow up questions. As a next step, you may want to refine what you are looking for with a follow-up question. Amazon Q Developer uses the context from your previous questions and answers to give more precise answers as you continue the conversation. For example, one of the frameworks it suggested using was pytest. If you don’t know how to install that locally, you can ask something like “What are the different options to install pytest on my Linux machine?”. As shown in Figure 2, Amazon Q Developer provides installation recommendation using Python.

In the Visual Studio Code IDE, a user asks Amazon Q Developer about the different options to install pytest on a Linux machine using the following prompt: “What are the different options to install pytest on my Linux machine?”. Amazon Q Developer replies with four different options: using pip, using a package manager, using a virtual environment, and using a Python distribution. Each option includes the steps to install pytest. A source URL is included for an article called “How to install pytest in Python? – Be on the Right Side of Change”.

Figure 2: Options to install pytest

Example – Explain concepts

Amazon Q Developer can also help you to get up to speed with testing concepts, such as mocking service dependencies. Let’s ask another follow up question to explain the benefits of mocking AWS services.

In the Visual Studio Code IDE, a user asks Amazon Q Developer about the key benefits of mocking AWS Services’ API calls to create unit tests for Lambda function code using the following prompt: “What are the key benefits of mocking AWS Services' API calls to create unit tests for Lambda function code?”. Amazon Q Developer replies with six key benefits, including: Isolation of the Lambda function code, Faster feedback loop, Consistent and repeatable tests, Cost savings, Improved testability, and Easier debugging. Each benefit has a brief explanation attached to it. The Sources section includes one URL to an AWS DevOps Blog post with the title “Unit Testing AWS Lambda with Python and Mock AWS Services”.

Figure 3: Benefits of mocking AWS services

The above conversation in Figure 3 shows how Amazon Q Developer can help to understand concepts. Let’s learn more about Moto.

In the Visual Studio Code IDE, a user asks Amazon Q Developer: “What is Moto used for?”. Amazon Q Developer provides a brief explanation of the Moto library, and how it can be used to create simulations of various AWS resources, such as: AWS Lambda, Amazon Dynamo DB, Amazon S3, Amazon EC2, AWS IAM, and many other AWS services. Amazon Q Developer provides a simple code example of how to use Moto to mock the AWS Lambda service in a Pytest test case by using the @mock_lambda decorator.

Figure 4: Follow up question about Moto

In Figure 4, Amazon Q Developer gives more details about the Moto framework and provides a short example code snippet for mocking an AWS Lambda service.

Best Practice – Write clear prompts

Writing clear prompts helps you to get the desired answers from Amazon Q. A lack of clarity and topic understanding may result in unclear questions and irrelevant or off-target responses. Note how those prompts contain specific description of what the answer should provide. For example, Figure 1 includes the programming language (Python) and service (AWS Lambda) to be considered in the expected answer. If unfamiliar with a topic, leverage Amazon Q Developer as part of your research, to better understand that topic.

Using Amazon Q Developer to identify unit test cases

Understanding the purpose and intended functionality of the code is important for developing relevant test cases. We introduce an example use case in Python, which handles payroll calculation for different hour rates, hours worked, and tax rates.

"""
This module provides a Payroll class to calculate the net pay for an employee.

The Payroll class takes in the hourly rate, hours worked, and tax rate, and
calculates the gross pay, tax amount, and net pay.
"""

class Payroll:
    """
    A class to handle payroll calculations.
    """

    def __init__(self, hourly_rate: float, hours_worked: float, tax_rate: float):
        self._validate_inputs(hourly_rate, hours_worked, tax_rate)
        self.hourly_rate = hourly_rate
        self.hours_worked = hours_worked
        self.tax_rate = tax_rate

    def _validate_inputs(self, hourly_rate: float, hours_worked: float, tax_rate: float) -> None:
        """
        Validate the input values for the Payroll class.

        Args:
            hourly_rate (float): The employee's hourly rate.
            hours_worked (float): The number of hours the employee worked.
            tax_rate (float): The tax rate to be applied to the employee's gross pay.

        Raises:
            ValueError: If the hourly rate, hours worked, or tax rate is not a positive number, 
            or if the tax rate is not between 0 and 1.
        """
        if hourly_rate <= 0:
            raise ValueError("Hourly rate must be a non-negative number.")
        if hours_worked < 0:
            raise ValueError("Hours worked must be a non-negative number.")
        if tax_rate < 0 or tax_rate >= 1:
            raise ValueError("Tax rate must be between 0 and 1.")

    def gross_pay(self) -> float:
        """
        Calculate the employee's gross pay.

        Returns:
            float: The employee's gross pay.
        """
        return self.hourly_rate * self.hours_worked

    def tax_amount(self) -> float:
        """
        Calculate the tax amount to be deducted from the employee's gross pay.

        Returns:
            float: The tax amount.
        """
        return self.gross_pay() * self.tax_rate

    def net_pay(self) -> float:
        """
        Calculate the employee's net pay after deducting taxes.

        Returns:
            float: The employee's net pay.
        """
        return self.gross_pay() - self.tax_amount()

The example shows how Amazon Q Developer can be used to identify test scenarios before writing the actual cases. Let’s ask Amazon Q Developer to suggest test cases for the Payroll class.

In the Visual Studio Code IDE, a user has a Payroll Python class open on the right side of the editor. The Payroll class is a module used to calculate the net pay for an employee. It takes the hourly rate, hours worked, and tax rate to calculate the gross pay, tax amount, and net pay. The user asks Amazon Q Developer to list unit test scenarios for the Payroll class using the following prompt: “Can you list unit test scenarios for the Payroll class?”. Amazon Q Developer provides eight different unit test scenarios: Test valid input values, Test invalid input values, Test edge cases, Test methods behavior, Test error handling, Test data types, Test rounding behavior, and Test consistency.

Figure 5: Suggest unit test scenarios

In Figure 5, Amazon Q Developer provides a list of different scenarios specific to the Payroll class, including valid, error, and edge cases.

Using Amazon Q Developer to write unit tests

Developers can have a collaborative conversation with Amazon Q Developer, which helps to unpack the code and think through testing cases to check that important test cases are captured but also edge cases are identified. This section focuses on how to facilitate the quick generation of unit test cases, based on the cases recommended in the previous section. Let’s start with a question around best practices when writing unit tests with pytest.

In the Visual Studio Code IDE, a user asks Amazon Q Developer “What are the best practices for unit testing with pytest?”. Amazon Q Developer replies with ten best practices, including: Keep tests simple and focused, Use descriptive test function names, Organize your tests, Run tests multiple times in random order, Utilize static code analysis tools, Focus on behavior not implementation, Use fixtures to set up test data, Parameterize your tests, and Integrate with continuous integration. A source URL is also provided for a Medium article called “Python Unit Tests with pytest Guide”

Figure 6: Recommended best practices for testing with pytest

In Figure 6, Amazon Q Developer provides a list of best practices for writing effective unit tests. Let’s follow up by asking to generate one of the suggested test cases.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to generate a test case using pytest to make sure that the Payroll class raises a ValueError when the hourly rate holds a negative value. Amazon Q Developer generates a code sample using the pytest.raises() context manager to satisfy the requirement. It also provides instructions on how to run the text by prefixing the test module with pytest and running the command in the terminal. The user can now click on Insert at cursor button to insert the test case into the module, and run the test.

Figure 7: Generate a unit test case

Amazon Q Developer includes code in its response which you can copy or insert directly into your file by choosing Insert at cursor. Figure 7 displays valid unit tests covering some of the suggested scenarios and best practices, such as being simple and holding descriptive naming. It also states how to run the test using a command for the terminal.

Best Practice – Provide context

Context allows Amazon Q Developer to offer tailored responses that are more in sync with the conversation. In the chat interface, the flow of the ongoing conversation and past interactions are a critical contextual element. Other ways to provide context are selecting the code-under-test, keeping any relevant files, such as test examples, open in the editor and leveraging conversation context such as asking for best practices and example test scenarios before writing the test cases.

Using Amazon Q Developer to refactor unit tests

To improve code quality, Amazon Q Developer can be used to recommend improvements and refactor parts of the code base. To illustrate Amazon Q Developer refactoring functionality, we prepared test cases for the Payroll class which deviate from some of the suggested best practices.

Example – Send to Amazon Q Refactor

Let’s follow up by asking to refactor the code built-in Amazon Q > Refactor functionality.

In the Visual Studio Code IDE, a user selects the code in test_payroll_refactor module and asks Amazon Q Developer to refactor it via the Amazon Q Developer Refactor functionality, accessible through right click. This code contains ambiguous function and variable names, and might be hard to read without context. Amazon Q Developer then generates the refactored code and outlines the changes made: renamed test functions, removed variable names, and unnecessary comments, as functions are now self-explanatory. The user can now use the Insert at Cursor feature to add the code to the test_payroll_refactored module, and run the tests.

Figure 8: Refactor test cases

In Figure 8, the refactoring renamed the function and variable names to be more descriptive and therefore removed the comments. The recommendation is inserted in the second file to verify it runs correctly.

Best Practice – Apply human judgement and continuously interact with Amazon Q Developer

Note that code generations should always be reviewed and adjusted before used in your projects. Amazon Q Developer can provide you with initial guidance, but you might not get a perfect answer. A developer’s judgement should be applied to value usefulness of the generated code and iterations should be used to continuously improve the results.

Using Amazon Q Developer for mocking dependencies and generating sample data

More complex application architectures may require developers to mock dependencies and use sample data to test specific functionalities. The second code example contains a save_to_dynamodb_table function that writes a job_posting object into a specific Amazon DynamoDB table. This function references the TABLE_NAME environment variable to specify the name of the table in which the data should be saved.

We break down the tasks for Amazon Q Developer into three smaller steps for testing: Generate a fixture for mocking the TABLE_NAME environment variable name, generate instances of the given class to be used as test data, and generate the test.

Example – Generate fixtures

Pytest provides the capability to define fixtures to set defined, reliable, and consistent context for tests. Let’s ask Amazon Q Developer to write a fixture for the TABLE_NAME environment variable.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to create a pytest fixture to mock the TABLE_NAME environment variable using the following prompt: “Create a pytest fixture which mocks my TABLE_NAME environment variable”. Amazon Q Developer replies with a code example showing how the fixture uses the os.environ dictionary to temporarily set the environment variable value to a mock value just for the duration of any test case using the fixture. The code example also includes a yield keyword to pause the fixture, and then delete the mock environment variable to restore the actual value once the test is completed.

Figure 9: Generate a pytest fixture

The result in Figure 9 shows that Amazon Q Developer generated a simple fixture for the TABLE_NAME environment variable. It provides code showing how to use the fixture in the actual test case with additional comments for its content.

Example – Generate data

Amazon Q Developer provides capabilities that can help you generate input data for your tests based on a schema, data model, or table definition. The save_to_dynamodb_table saves an instance of the job posting class to the table. Let’s ask Amazon Q Developer to create a sample instance based on this definition.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to create a sample valid instance of the selected JobPosting class using the following prompt: “Create a sample valid instance of the selected JobPostings class”. The JobPosting class includes multiple fields, including: id, title, description, salary, location, company, employment type, and application_deadline. Amazon Q Developer provides a valid snippet for the JobPosting class, incorporating a UUID for the id field, nested amount and currency for the Salary class, and generic sample values for the remaining fields.

Figure 10: Generate sample data

The answer shows a valid instance of the class in Figure 10 containing common example values for the fields.

Example – Generate unit test cases with context

The code being tested relies on an external library, boto3. To make sure that this dependency is included, we leave a comment specifying that boto3 should be mocked using the Moto library. Additionally, we tell Amazon Q Developer to consider the test instance named job_posting and the fixture named mock_table_name for reference. Developers can now provide a prompt to generate the test case using the context from previous tasks or use comments as inline prompts to generate the test within the test file itself.

In the Visual Studio Code IDE, a user is leveraging inline prompts to generate an autocomplete suggestion from Amazon Q Developer. The inline prompt response is trying to fill out the test_save_to_dynamodb_table function with a mock test asserting the previous JobPosting fields. The user can decide to accept or reject the provided code completion suggestion.

Figure 11: Inline prompts for generating unit test case

Figure 11 shows the recommended code using inline prompts, which can be accepted as the unit test for the save_to_dynamodb_table function.

Best Practice – Break down larger tasks into smaller ones

For cases where Amazon Q Developer does not have much context or example code to refer to, such as writing unit tests from scratch, it is helpful to break down the tasks into smaller tasks. Amazon Q Developer will get more context with each step and can result in more effective responses.

Conclusion

Amazon Q Developer is a powerful tool that simplifies the process of writing and executing unit tests for your application. The examples provided in this post demonstrated that it can be a helpful companion throughout different stages of your unit test process. From initial learning to investigation and writing of test cases, the Chat, Generate, and Refactor capabilities allow you to speed up and improve your test generation. Using clear and concise prompts, context, an iterative approach, and small scoped tasks to interact with Amazon Q Developer improves the generated answers.

To learn more about Amazon Q Developer, see the following resources:

About the authors

Building a scalable streaming data platform that enables real-time and batch analytics of electric vehicles on AWS

2024-07-17 Ayush Agrawal

Post Syndicated from Ayush Agrawal original https://aws.amazon.com/blogs/big-data/building-a-scalable-streaming-data-platform-that-enables-real-time-and-batch-analytics-of-electric-vehicles-on-aws/

The automobile industry has undergone a remarkable transformation because of the increasing adoption of electric vehicles (EVs). EVs, known for their sustainability and eco-friendliness, are paving the way for a new era in transportation. As environmental concerns and the push for greener technologies have gained momentum, the adoption of EVs has surged, promising to reshape our mobility landscape.

The surge in EVs brings with it a profound need for data acquisition and analysis to optimize their performance, reliability, and efficiency. In the rapidly evolving EV industry, the ability to harness, process, and derive insights from the massive volume of data generated by EVs has become essential for manufacturers, service providers, and researchers alike.

As the EV market is expanding with many new and incumbent players trying to capture the market, the major differentiating factor will be the performance of the vehicles.

Modern EVs are equipped with an array of sensors and systems that continuously monitor various aspects of their operation including parameters such as voltage, temperature, vibration, speed, and so on. From battery management to motor performance, these data-rich machines provide a wealth of information that, when effectively captured and analyzed, can revolutionize vehicle design, enhance safety, and optimize energy consumption. The data can be used to do predictive maintenance, device anomaly detection, real-time customer alerts, remote device management, and monitoring.

However, managing this deluge of data isn’t without its challenges. As the adoption of EVs accelerates, the need for robust data pipelines capable of collecting, storing, and processing data from an exponentially growing number of vehicles becomes more pronounced. Moreover, the granularity of data generated by each vehicle has increased significantly, making it essential to efficiently handle the ever-increasing number of data points. The challenges include not only the technical intricacies of data management but also concerns related to data security, privacy, and compliance with evolving regulations.

In this blog post, we delve into the intricacies of building a reliable data analytics pipeline that can scale to accommodate millions of vehicles, each generating hundreds of metrics every second using Amazon OpenSearch Ingestion. We also provide guidelines and sample configurations to help you implement a solution.

Of the prerequisites that follow, the IOT topic rule and the Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster can be set up by following How to integrate AWS IoT Core with Amazon MSK. The steps to create an Amazon OpenSearch Service cluster are available in Creating and managing Amazon OpenSearch Service domains.

Prerequisites

Before you begin the implementing the solution, you need the following:

IOT topic rule
Amazon MSK Simple Authentication and Security Layer/Salted Challenge Response Mechanism (SASL/SCRAM) cluster
Amazon OpenSearch Service domain

Solution overview

The following architecture diagram provides a scalable and fully managed modern data streaming platform. The architecture uses Amazon OpenSearch Ingestion to stream data into OpenSearch Service and Amazon Simple Storage Service (Amazon S3) to store the data. The data in OpenSearch powers real-time dashboards. The data can also be used to notify customers of any failures occurring on the vehicle (see Configuring alerts in Amazon OpenSearch Service). The data in Amazon S3 is used for business intelligence and long-term storage.

Architecture diagram

In the following sections, we focus on the following three critical pieces of the architecture in depth:

1. Amazon MSK to OpenSearch ingestion pipeline

2. Amazon OpenSearch Ingestion pipeline to OpenSearch Service

3. Amazon OpenSearch Ingestion to Amazon S3

Solution Walkthrough

Step 1: MSK to Amazon OpenSearch Ingestion pipeline

Because each electric vehicle streams massive volumes of data to Amazon MSK clusters through AWS IoT Core, making sense of this data avalanche is critical. OpenSearch Ingestion provides a fully managed serverless integration to tap into these data streams.

The Amazon MSK source in OpenSearch Ingestion uses Kafka’s Consumer API to read records from one or more MSK topics. The MSK source in OpenSearch Ingestion seamlessly connects to MSK to ingest the streaming data into OpenSearch Ingestion’s processing pipeline.

The following snippet illustrates the pipeline configuration for an OpenSearch Ingestion pipeline used to ingest data from an MSK cluster.

While creating an OpenSearch Ingestion pipeline, add the following snippet in the Pipeline configuration section.

version: "2"
msk-pipeline: 
  source: 
    kafka: 
      acknowledgments: true                  
      topics: 
         - name: "ev-device-topic " 
           group_id: "opensearch-consumer" 
           serde_format: json                 
      aws: 
        # Provide the Role ARN with access to MSK. This role should have a trust relationship with osis-pipelines.amazonaws.com 
        sts_role_arn: "arn:aws:iam:: ::<<account-id>>:role/opensearch-pipeline-Role"
        # Provide the region of the domain. 
        region: "<<region>>" 
        msk: 
          # Provide the MSK ARN.  
          arn: "arn:aws:kafka:<<region>>:<<account-id>>:cluster/<<name>>/<<id>>"

When configuring Amazon MSK and OpenSearch Ingestion, it’s essential to establish an optimal relationship between the number of partitions in your Kafka topics and the number of OpenSearch Compute Units (OCUs) allocated to your ingestion pipelines. This optimal configuration ensures efficient data processing and maximizes throughput. You can read more about it in Configure recommended compute units (OCUs) for the Amazon MSK pipeline.

Step 2: OpenSearch Ingestion pipeline to OpenSearch Service

OpenSearch Ingestion offers a direct method for streaming EV data into OpenSearch. The OpenSearch sink plugin channels data from multiple sources directly into the OpenSearch domain. Instead of manually provisioning the pipeline, you define the capacity for your pipeline using OCUs. Each OCU provides 6 GB of memory and two virtual CPUs. To use OpenSearch Ingestion auto-scaling optimally, it’s essential to configure the maximum number of OCUs for a pipeline based on the number of partitions in the topics being ingested. If a topic has a large number of partitions (for example, more than 96, which is the maximum OCUs per pipeline), it’s recommended to configure the pipeline with a maximum of 1–96 OCUs. This way, the pipeline can automatically scale up or down within this range as needed. However, if a topic has a low number of partitions (for example, fewer than 96), it’s advisable to set the maximum number of OCUs to be equal to the number of partitions. This approach ensures that each partition is processed by a dedicated OCU enabling parallel processing and optimal performance. In scenarios where a pipeline ingests data from multiple topics, the topic with the highest number of partitions should be used as a reference to configure the maximum OCUs. Additionally, if higher throughput is required, you can create another pipeline with a new set of OCUs for the same topic and consumer group, enabling near-linear scalability.

OpenSearch Ingestion provides several pre-defined configuration blueprints that can help you quickly build your ingestion pipeline on AWS

The following snippet illustrates pipeline configuration for an OpenSearch Ingestion pipeline using OpenSearch as a SINK with a dead letter queue (DLQ) to Amazon S3. When a pipeline encounters write errors, it creates DLQ objects in the configured S3 bucket. DLQ objects exist within a JSON file as an array of failed events.

sink: 
      - opensearch: 
          # Provide an AWS OpenSearch Service domain endpoint 
          hosts: [ "https://<<domain-name>>.<<region>>.es.amazonaws.com" ] 
          aws: 
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com 
            sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>" 
          # Provide the region of the domain. 
            region: "<<region>>" 
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection 
          # serverless: true 
          # index name can be auto-generated from topic name 
          index: "index_ev_pipe-%{yyyy.MM.dd}" 
          # Enable 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x 
          #distribution_version: "es6" 
          # Enable the S3 DLQ to capture any failed requests in Ohan S3 bucket 
          dlq: 
            s3: 
            # Provide an S3 bucket 
              bucket: "<<bucket-name>>"
            # Provide a key path prefix for the failed requests
              key_path_prefix: "oss-pipeline-errors/dlq"
            # Provide the region of the bucket.
              region: "<<region>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
              sts_role_arn: "arn:aws:iam:: <<account-id>>:role/<<role-name>>"

Step 3: OpenSearch Ingestion to Amazon S3

OpenSearch Ingestion offers a built-in sink for loading streaming data directly into S3. The service can compress, partition, and optimize the data for cost-effective storage and analytics in Amazon S3. Data loaded into S3 can be partitioned for easier query isolation and lifecycle management. Partitions can be based on vehicle ID, date, geographic region, or other dimensions as needed for your queries.

The following snippet illustrates how we’ve partitioned and stored EV data in Amazon S3.

- s3:
            aws:
              # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
                sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
              # Provide the region of the domain.
                region: "<<region>>"
            # Replace with the bucket to send the logs to
            bucket: "evbucket"
            object_key:
              # Optional path_prefix for your s3 objects
              path_prefix: "index_ev_pipe/year=%{yyyy}/month=%{MM}/day=%{dd}/hour=%{HH}"
            threshold:
              event_collect_timeout: 60s
            codec:
              parquet:
                auto_schema: true

The pipeline can be created following the steps in Creating Amazon OpenSearch Ingestion pipelines.

The following is the complete pipeline configuration, combining the configuration of all three steps. Update the Amazon Resource Names (ARNs), AWS Region, Open Search Service domain endpoint, and S3 names as needed.

The entire OpenSearch Ingestion pipeline configuration can be directly copied into the ‘Pipeline configuration’ field in the AWS Management Console while creating the OpenSearch Ingestion pipeline

version: "2"
msk-pipeline: 
  source: 
    kafka: 
      acknowledgments: true           # Default is false  
      topics: 
         - name: "<<msk-topic-name>>" 
           group_id: "opensearch-consumer" 
           serde_format: json        
      aws: 
        # Provide the Role ARN with access to MSK. This role should have a trust relationship with osis-pipelines.amazonaws.com 
        sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
        # Provide the region of the domain. 
        region: "<<region>>" 
        msk: 
          # Provide the MSK ARN.  
          arn: "arn:aws:kafka:us-east-1:<<account-id>>:cluster/<<cluster-name>>/<<cluster-id>>" 
  processor:
      - parse_json:
  sink: 
      - opensearch: 
          # Provide an AWS OpenSearch Service domain endpoint 
          hosts: [ "https://<<opensearch-service-domain-endpoint>>.us-east-1.es.amazonaws.com" ] 
          aws: 
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com 
            sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>" 
          # Provide the region of the domain. 
            region: "<<region>>" 
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection 
          # index name can be auto-generated from topic name 
          index: "index_ev_pipe-%{yyyy.MM.dd}" 
          # Enable 'distribution_version' setting if the AWS OpenSearch Service domain is of version Elasticsearch 6.x 
          #distribution_version: "es6" 
          # Enable the S3 DLQ to capture any failed requests in Ohan S3 bucket 
          dlq: 
            s3: 
            # Provide an S3 bucket 
              bucket: "<<bucket-name>>"
            # Provide a key path prefix for the failed requests
              key_path_prefix: "oss-pipeline-errors/dlq"
            # Provide the region of the bucket.
              region: "<<region>>"
            # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
              sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
      - s3:
            aws:
              # Provide a Role ARN with access to the bucket. This role should have a trust relationship with osis-pipelines.amazonaws.com
                sts_role_arn: "arn:aws:iam::<<account-id>>:role/<<role-name>>"
              # Provide the region of the domain.
                region: "<<region>>"
            # Replace with the bucket to send the logs to
            bucket: "<<bucket-name>>"
            object_key:
              # Optional path_prefix for your s3 objects
              path_prefix: "index_ev_pipe/year=%{yyyy}/month=%{MM}/day=%{dd}/hour=%{HH}"
            threshold:
              event_collect_timeout: 60s
            codec:
              parquet:
                auto_schema: true

Real-time analytics

After the data is available in OpenSearch Service, you can build real-time monitoring and notifications. OpenSearch Service has robust support for multiple notification channels, allowing you to receive alerts through services like Slack, Chime, custom webhooks, Microsoft Teams, email, and Amazon Simple Notification Service (Amazon SNS).

The following screenshot illustrates supported notification channels in OpenSearch Service.

The notification feature in OpenSearch Service allows you to create monitors that will watch for certain conditions or changes in your data and launch alerts, such as monitoring vehicle telemetry data and launching alerts for issues like battery degradation or abnormal energy consumption. For example, you can create a monitor that analyzes battery capacity over time and notifies the on-call team using Slack if capacity drops below expected degradation curves in a significant number of vehicles. This could indicate a potential manufacturing defect requiring investigation.

In addition to notifications, OpenSearch Service makes it easy to build real-time dashboards to visually track metrics across your fleet of vehicles. You can ingest vehicle telemetry data like location, speed, fuel consumption, and so on, and visualize it on maps, charts, and gauges. Dashboards can provide real-time visibility into vehicle health and performance.

The following screenshot illustrates creating a sample dashboard on OpenSearch Service

Opensearch Dashboard

A key benefit of OpenSearch Service is its ability to handle high sustained ingestion and query rates with millisecond latencies. It distributes incoming vehicle data across data nodes in a cluster for parallel processing. This allows OpenSearch to scale out to handle very large fleets while still delivering the real-time performance needed for operational visibility and alerting.

Batch analytics

After the data is available in Amazon S3, you can build a secure data lake to power a variety of analytics use cases deriving powerful insights. As an immutable store, new data is continually stored in S3 while existing data remains unaltered. This serves as a single source of truth for downstream analytics.

For business intelligence and reporting, you can analyze trends, identify insights, and create rich visualizations powered by the data lake. You can use Amazon QuickSight to build and share dashboards without needing to set up servers or infrastructure. Here’s an example of a Quicksight dashboard for IoT device data. For example, you can use a dashboard to gain insights from historical data that can help with better vehicle and battery design.

The Amazon Quicksight public gallery shows examples of dashboards across different domains.

You should consider Amazon OpenSearch dashboards for your operational day-to-day use cases to identify issues and alert in near real time whereas Amazon Quicksight should be used to analyze big data stored in a lake house and generate actionable insights from them.

Clean up

Delete the OpenSearch pipeline and Amazon MSK cluster to stop incurring costs on these services.

Conclusion

In this post, you learned how Amazon MSK, OpenSearch Ingestion, OpenSearch Services, and Amazon S3 can be integrated to ingest, process, store, analyze, and act on endless streams of EV data efficiently.

With OpenSearch Ingestion as the integration layer between streams and storage, the entire pipeline scales up and down automatically based on demand. No more complex cluster management or lost data from bursts in streams.

See Amazon OpenSearch Ingestion to learn more.

About the authors

Ayush Agrawal is a Startups Solutions Architect from Gurugram, India with 11 years of experience in Cloud Computing. With a keen interest in AI, ML, and Cloud Security, Ayush is dedicated to helping startups navigate and solve complex architectural challenges. His passion for technology drives him to constantly explore new tools and innovations. When he’s not architecting solutions, you’ll find Ayush diving into the latest tech trends, always eager to push the boundaries of what’s possible.

Fraser Sequeira is a Solutions Architect with AWS based in Mumbai, India. In his role at AWS, Fraser works closely with startups to design and build cloud-native solutions on AWS, with a focus on analytics and streaming workloads. With over 10 years of experience in cloud computing, Fraser has deep expertise in big data, real-time analytics, and building event-driven architecture on AWS.

AWS announces workspace context awareness for Amazon Q Developer chat

2024-07-11 Will Matos

Post Syndicated from Will Matos original https://aws.amazon.com/blogs/devops/aws-announces-workspace-context-awareness-for-amazon-q-developer-chat/

Today, Amazon Web Services (AWS) announced the release of workspace context awareness in Amazon Q Developer chat. By including @workspace in your prompt, Amazon Q Developer will automatically ingest and index all code files, configurations, and project structure, giving the chat comprehensive context across your entire application within the integrated development environment (IDE).

Throughout the software development lifecycle, developers, and IT professionals face challenges in understanding, creating, troubleshooting, and modernizing complex applications. Amazon Q Developer’s workspace context enhances its ability to address these issues by providing a comprehensive understanding of the entire codebase. During the planning phase, it enables gaining insights into the overall architecture, dependencies, and coding patterns, facilitating more complete and accurate responses and descriptions. When writing code, workspace awareness allows Amazon Q Developer to suggest code and solutions that require context from different areas without constant file switching. For troubleshooting, deeper contextual understanding aids in comprehending how various components interact, accelerating root cause analysis. By learning about the collection of files and folders in your workspace, Amazon Q Developer delivers accurate and relevant responses, tailored to the specific codebase, streamlining development efforts across the entire software lifecycle.

In this post, we’ll explore:

How @workspace works
How to use @workspace to:
- Understand your projects and code
- Navigate your project using natural language descriptions
- Generate code and solutions that demonstrate workspace awareness
How to get started

Our goal is for you to gain a comprehensive understanding of how the @workspace context will increase developer productivity and accuracy of Amazon Q Developer responses.

How @workspace works

Amazon Q Developer chat now utilizes advanced machine learning in the client to provide a comprehensive understanding of your codebase within the integrated development environment (IDE).

The key processes are:

Local Workspace Indexing: When enabled, Amazon Q Developer will index programming files or configuration files in your project upon triggering with @workspace for the first time. Non-essential files like binaries and those specified in .gitignore are intelligently filtered out during this process, focusing only on relevant code assets. This indexing takes approximately 5–20 minutes for workspace sizes up to 200 MB.
Persisted and Auto-Refreshed Index: The created index is persisted to disk, allowing fast loading on subsequent openings of the same workspace. If the index is over 24 hours old, it is automatically refreshed by re-indexing all files.
Memory-Aware Indexing: To prevent resource overhead, indexing stops at either a hard limit on size or when available system memory reaches a minimum threshold.
Continuous Updates: After initial indexing, the index is incrementally updated whenever you finish editing files in the IDE by closing the file or moving to another tab.

By creating and maintaining a comprehensive index of your codebase, Amazon Q Developer is empowered with workspace context awareness, enabling rich, project-wide assistance tailored to your development needs. When responding to chat requests, instructions, and questions, Amazon Q Developer can now use its knowledge of the rest of the workspace to augment the context of the currently open files.

Let’s see how @workspace can help!

Customer Use-Cases

Onboarding and Knowledge Sharing

You can quickly get up to speed on implementation details by asking questions like What are the key classes with application logic in this @workspace?
Animation showing Amazon Q Developer chat being prompted for "What are the key classes with application logic in this @workspace?". Q Developer responds with listing of classes and folders that are considered important with an explanation.

You can ask other discovery questions about how code works: Does this application authenticate players prior to starting a game? @workspace
Animation of a chat conversation demonstrating the workspace context capability in Amazon Q Developer. The user asks '@workspace Does this application authenticate players prior to starting a game?'. Q Developer responds with a detailed explanation of the application's behavior, indicating that it does not perform player authentication before starting a game.

You can then follow up with documentation requests: Can you document, using markdown, the workflow in this @workspace to start a game?
Animation of a chat conversation using the workspace context capability in Amazon Q Developer. The user requests '@workspace Can you document using markdown the workflow to start a game?'. Q Developer responds with detailed markdown-formatted documentation outlining the steps and code involved in starting a game within the application.

Project-Wide Code Discovery and Understanding

You can understand how a function or class is used across the workspace by asking: How does this application validate the guessed word’s correctness? @workspace
Animation of a chat conversation using the workspace context capability in Amazon Q Developer. The user inquires 'How does this application validate the guessed word's correctness? @workspace '. Q Developer provides an explanation detailing the specific classes and functions involved in validating a player's word guess within the application.

You can then ask about how this is communicated to the player: @workspace How are the results of this validation rendered to the web page?
Animation of a chat conversation using the workspace context capability in Amazon Q Developer. The user asks '@workspace How are the results of this validation rendered to the web page?'. Q Developer explains the process of rendering the validation results on the web page, including identifying the specific code responsible for this functionality.

Code Generation with Multi-File Context

You can also generate tests, new features, or extend functionality while leveraging context from other project files with prompts like @workspace Find the class that generates the random words and create a unit test for it.
Animation of a chat conversation using the workspace context capability in Amazon Q Developer. The user requests '@workspace Find the class that generates the random words and create a unit test for it'.

Project-Wide Analysis and Remediation

Create data flow diagrams that require application workflow knowledge with @workspace Can you provide a UML diagram of the data flow between the front-end and the back-end? Using your built-in UML Previewer you can view the resulting diagram.
Animation of a chat conversation using the workspace context capability in Amazon Q Developer. The user requests '@workspace Can you provide a UML diagram of the dataflow between the front-end and the back-end?'.

With @workspace, Amazon Q Developer chat becomes deeply aware of workspace’s unique code, enabling efficient development, maintenance, and knowledge sharing.

Getting Started

Enabling the powerful workspace context capability in Amazon Q Developer chat is straightforward. Here’s how to get started:

Open Your Project in the IDE: Launch your integrated development environment (IDE) and open the workspace or project you want the Amazon Q to understand.
Start a New Chat Session: Start a new chat session within the Amazon Q Developer chat panel if not already open.
“Enable” Workspace Context :To activate the project-wide context, simply include @workspace in the prompt. For example: How does the authentication flow work in this @workspace? When enabled, the first time Amazon Q Developer sees the @workspace keyword for the current workspace, Amazon Q Developer will ingest and analyze the code, configuration, and structure of the project.
1. If not already enabled, Amazon Q Developer will instruct you to do so.
2. Select the check the box for Amazon Q: Local Workspace Index
Try Different Query Types: With @workspace context, you can ask a wide range of questions and provide instructions that leverage the full project context:
1. Where is the business logic to handle users in this @workspace?
2. @workspace Explain the data flow between the frontend and backend.
3. Add new API tests using the existing test utilities found in the @workspace.
Iterate and Refine: Try rephrasing your query or explicitly including more context by selecting code or mentioning specific files when the response doesn’t meet your expectations. The more relevant information you provide, the better Amazon Q Developer can understand your intent. For optimal results using workspace context, it’s recommended to use specific terminology present in your codebase, avoid overly broad queries, and leverage available examples, references, and code snippets to steer Amazon Q Developer effectively.

Conclusion

In this post we introduced Amazon Q Developer’s workspace awareness in chat via the @workspace keyword, highlighting the benefits of using workspace when understanding code, responding to questions, and generating new code. By allowing Amazon Q Developer to analyze and understand your project structure, workspace context unlocks new possibilities for development productivity gains.

If you are new to Amazon Q Developer, I highly recommend you check out Amazon Q Developer documentation and the Q-Words workshop.

About the author:

Strategies for achieving least privilege at scale – Part 2

2024-07-09 Joshua Du Lac

Post Syndicated from Joshua Du Lac original https://aws.amazon.com/blogs/security/strategies-for-achieving-least-privilege-at-scale-part-2/

In this post, we continue with our recommendations for achieving least privilege at scale with AWS Identity and Access Management (IAM). In Part 1 of this two-part series, we described the first five of nine strategies for implementing least privilege in IAM at scale. We also looked at a few mental models that can assist you to scale your approach. In this post, Part 2, we’ll continue to look at the remaining four strategies and related mental models for scaling least privilege across your organization.

6. Empower developers to author application policies

If you’re the only developer working in your cloud environment, then you naturally write your own IAM policies. However, a common trend we’ve seen within organizations that are scaling up their cloud usage is that a centralized security, identity, or cloud team administrator will step in to help developers write customized IAM policies on behalf of the development teams. This may be due to variety of reasons, including unfamiliarity with the policy language or a fear of creating potential security risk by granting excess privileges. Centralized creation of IAM policies might work well for a while, but as the team or business grows, this practice often becomes a bottleneck, as indicated in Figure 1.

Figure 1: Bottleneck in a centralized policy authoring process

This mental model is known as the theory of constraints. With this model in mind, you should be keen to search for constraints, or bottlenecks, faced by your team or organization, identify the root cause, and solve for the constraint. That might sound obvious, but when you’re moving at a fast pace, the constraint might not appear until agility is already impaired. As your organization grows, a process that worked years ago might no longer be effective today.

A software developer generally understands the intent of the applications they build, and to some extent the permissions required. At the same time, the centralized cloud, identity, or security teams tend to feel they are the experts at safely authoring policies, but lack a deep knowledge of the application’s code. The goal here is to enable developers to write the policies in order to mitigate bottlenecks.

The question is, how do you equip developers with the right tools and skills to confidently and safely create the required policies for their applications? A simple way to start is by investing in training. AWS offers a variety of formal training options and ramp-up guides that can help your team gain a deeper understanding of AWS services, including IAM. However, even self-hosting a small hackathon or workshop session in your organization can drive improved outcomes. Consider the following four workshops as simple options for self-hosting a learning series with your teams.

How and when to use different IAM policy types workshop – Learn when to use which policy type, and who should own and manage the policy.
IAM policy learning experience workshop – Learn how to write different types of IAM policies and implement access controls on principals and resources, using conditions to scope down access.
IAM troubleshooting workshop – Learn how to create fine-grained access policies with the help of the IAM API, AWS Management Console, IAM Access Analyzer, and AWS CloudTrail, and review key concepts of the IAM policy evaluation logic.
Refining IAM Permissions Like A Pro – Learn how to use IAM Access Analyzer programmatically, use tools to check IAM policies in CI/CD pipeline and AWS Lambda functions, and get hands-on practice in using the tools from the perspectives of both Security and DevOps teams.

As a next step, you can help your teams along the way by setting up processes that foster collaboration and improve quality. For example, peer reviews are highly recommended, and we’ll cover this later. Additionally, administrators can use AWS native tools such as permissions boundaries and IAM Access Analyzer policy generation to help your developers begin to author their own policies more safely.

Let’s look at permissions boundaries first. An IAM permissions boundary should generally be used to delegate the responsibility of policy creation to your development team. You can set up the developer’s IAM role so that they can create new roles only if the new role has a specific permissions boundary attached to it, and that permissions boundary allows you (as an administrator) to set the maximum permissions that can be granted by the developer. This restriction is implemented by a condition on the developer’s identity-based policy, requiring that specific actions—such as iam:CreateRole or iam:CreatePolicy—are allowed only if a specified permissions boundary is attached.

In this way, when a developer creates an IAM role or policy to grant an application some set of required permissions, they are required to add the specified permissions boundary that will “bound” the maximum permissions available to that application. So even if the policy that the developer creates—such as for their AWS Lambda function—is not sufficiently fine-grained, the permissions boundary helps the organization’s cloud administrators make sure that the Lambda function’s policy is not greater than a maximum set of predefined permissions. So with permissions boundaries, your development team can be allowed to create new roles and policies (with constraints) without administrators creating a manual bottleneck.

Another tool developers can use is IAM Access Analyzer policy generation. IAM Access Analyzer reviews your CloudTrail logs and autogenerates an IAM policy based on your access activity over a specified time range. This greatly simplifies the process of writing granular IAM policies that allow end users access to AWS services.

A classic use case for IAM Access Analyzer policy generation is to generate an IAM policy within the test environment. This provides a good starting point to help identify the needed permissions and refine your policy for the production environment. For example, IAM Access Analyzer can’t identify the production resources used, so it adds resource placeholders for you to modify and add the specific Amazon Resource Names (ARNs) your application team needs. However, not every policy needs to be customized, and the next strategy will focus on reusing some policies.

7. Maintain well-written policies

Strategies seven and eight focus on processes. The first process we’ll focus on is to maintain well-written policies. To begin, not every policy needs to be a work of art. There is some wisdom in reusing well-written policies across your accounts, because that can be an effective way to scale permissions management. There are three steps to approach this task:

Identify your use cases
Create policy templates
Maintain repositories of policy templates

For example, if you were new to AWS and using a new account, we would recommend that you use AWS managed policies as a reference to get started. However, the permissions in these policies might not fit how you intend to use the cloud as time progresses. Eventually, you would want to identify the repetitive or common use cases in your own accounts and create common policies or templates for those situations.

When creating templates, you must understand who or what the template is for. One thing to note here is that the developer’s needs tend to be different from the application’s needs. When a developer is working with resources in your accounts, they often need to create or delete resources—for example, creating and deleting Amazon Simple Storage Service (Amazon S3) buckets for the application to use.

Conversely, a software application generally needs to read or write data—in this example, to read and write objects to the S3 bucket that was created by the developer. Notice that the developer’s permissions needs (to create the bucket) are different than the application’s needs (reading objects in the bucket). Because these are different access patterns, you’ll need to create different policy templates tailored to the different use cases and entities.

Figure 2 highlights this issue further. Out of the set of all possible AWS services and API actions, there are a set of permissions that are relevant for your developers (or more likely, their DevOps build and delivery tools) and there’s a set of permissions that are relevant for the software applications that they are building. Those two sets may have some overlap, but they are not identical.

Figure 2: Visualizing intersecting sets of permissions by use case

When discussing policy reuse, you’re likely already thinking about common policies in your accounts, such as default federation permissions for team members or automation that runs routine security audits across multiple accounts in your organization. Many of these policies could be considered default policies that are common across your accounts and generally do not vary. Likewise, permissions boundary policies (which we discussed earlier) can have commonality across accounts with low amounts of variation. There’s value in reusing both of these sets of policies. However, reusing policies too broadly could cause challenges if variation is needed—to make a change to a “reusable policy,” you would have to modify every instance of that policy, even if it’s only needed by one application.

You might find that you have relatively common resource policies that multiple teams need (such as an S3 bucket policy), but with slight variations. This is where you might find it useful to create a repeatable template that abides by your organization’s security policies, and make it available for your teams to copy. We call it a template here, because the teams might need to change a few elements, such as the Principals that they authorize to access the resource. The policies for the applications (such as the policy a developer creates to attach to an Amazon Elastic Compute Cloud (Amazon EC2) instance role) are generally more bespoke or customized and might not be appropriate in a template.

Figure 3 illustrates that some policies have low amounts of variation while others are more bespoke.

Figure 3: Identifying bespoke versus common policy types

Regardless of whether you choose to reuse a policy or turn it into a template, an important step is to store these reusable policies and templates securely in a repository (in this case, AWS CodeCommit). Many customers use infrastructure-as-code modules to make it simple for development teams to input their customizations and generate IAM policies that fit their security policies in a programmatic way. Some customers document these policies and templates directly in the repository while others use internal wikis accompanied with other relevant information. You’ll need to decide which process works best for your organization. Whatever mechanism you choose, make it accessible and searchable by your teams.

8. Peer review and validate policies

We mentioned in Part 1 that least privilege is a journey and having a feedback loop is a critical part. You can implement feedback through human review, or you can automate the review and validate the findings. This is equally as important for the core default policies as it is for the customized, bespoke policies.

Let’s start with some automated tools you can use. One great tool that we recommend is using AWS IAM Access Analyzer policy validation and custom policy checks. Policy validation helps you while you’re authoring your policy to set secure and functional policies. The feature is available through APIs and the AWS Management Console. IAM Access Analyzer validates your policy against IAM policy grammar and AWS best practices. You can view policy validation check findings that include security warnings, errors, general warnings, and suggestions for your policy.

Let’s review some of the finding categories.

Finding type	Description
Security	Includes warnings if your policy allows access that AWS considers a security risk because the access is overly permissive.
Errors	Includes errors if your policy includes lines that prevent the policy from functioning.
Warning	Includes warnings if your policy doesn’t conform to best practices, but the issues are not security risks.
Suggestions	Includes suggestions if AWS recommends improvements that don’t impact the permissions of the policy.

Custom policy checks are a new IAM Access Analyzer capability that helps security teams accurately and proactively identify critical permissions in their policies. You can use this to check against a reference policy (that is, determine if an updated policy grants new access compared to an existing version of the policy) or check against a list of IAM actions (that is, verify that specific IAM actions are not allowed by your policy). Custom policy checks use automated reasoning, a form of static analysis, to provide a higher level of security assurance in the cloud.

One technique that can help you with both peer reviews and automation is the use of infrastructure-as-code. By this, we mean you can write and deploy your IAM policies as AWS CloudFormation templates (CFTs) or AWS Cloud Development Kit (AWS CDK) applications. You can use a software version control system with your templates so that you know exactly what changes were made, and then test and deploy your default policies across multiple accounts, such as by using AWS CloudFormation StackSets.

In Figure 4, you’ll see a typical development workflow. This is a simplified version of a CI/CD pipeline with three stages: a commit stage, a validation stage, and a deploy stage. In the diagram, the developer’s code (including IAM policies) is checked across multiple steps.

Figure 4: A pipeline with a policy validation step

In the commit stage, if your developers are authoring policies, you can quickly incorporate peer reviews at the time they commit to the source code, and this creates some accountability within a team to author least privilege policies. Additionally, you can use automation by introducing IAM Access Analyzer policy validation in a validation stage, so that the work can only proceed if there are no security findings detected. To learn more about how to deploy this architecture in your accounts, see this blog post. For a Terraform version of this process, we encourage you to check out this GitHub repository.

9. Remove excess privileges over time

Our final strategy focuses on existing permissions and how to remove excess privileges over time. You can determine which privileges are excessive by analyzing the data on which permissions are granted and determining what’s used and what’s not used. Even if you’re developing new policies, you might later discover that some permissions that you enabled were unused, and you can remove that access later. This means that you don’t have to be 100% perfect when you create a policy today, but can rather improve your policies over time. To help with this, we’ll quickly review three recommendations:

Restrict unused permissions by using service control policies (SCPs)
Remove unused identities
Remove unused services and actions from policies

First, as discussed in Part 1 of this series, SCPs are a broad guardrail type of control that can deny permissions across your AWS Organizations organization, a set of your AWS accounts, or a single account. You can start by identifying services that are not used by your teams, despite being allowed by these SCPs. You might also want to identify services that your organization doesn’t intend to use. In those cases, you might consider restricting that access, so that you retain access only to the services that are actually required in your accounts. If you’re interested in doing this, we’d recommend that you review the Refining permissions in AWS using last accessed information topic in the IAM documentation to get started.

Second, you can focus your attention more narrowly to identify unused IAM roles, unused access keys for IAM users, and unused passwords for IAM users either at an account-specific level or the organization-wide level. To do this, you can use IAM Access Analyzer’s Unused Access Analyzer capability.

Third, the same Unused Access Analyzer capability also enables you to go a step further to identify permissions that are granted but not actually used, with the goal of removing unused permissions. IAM Access Analyzer creates findings for the unused permissions. If the granted access is required and intentional, then you can archive the finding and create an archive rule to automatically archive similar findings. However, if the granted access is not required, you can modify or remove the policy that grants the unintended access. The following screenshot shows an example of the dashboard for IAM Access Analyzer’s unused access findings.

Figure 5: Screenshot of IAM Access Analyzer dashboard

When we talk to customers, we often hear that the principle of least privilege is great in principle, but they would rather focus on having just enough privilege. One mental model that’s relevant here is the 80/20 rule (also known as the Pareto principle), which states that 80% of your outcome comes from 20% of your input (or effort). The flip side is that the remaining 20% of outcome will require 80% of the effort—which means that there are diminishing returns for additional effort. Figure 6 shows how the Pareto principle relates to the concept of least privilege, on a scale from maximum privilege to perfect least privilege.

Figure 6: Applying the Pareto principle (80/20 rule) to the concept of least privilege

The application of the 80/20 rule to permissions management—such as refining existing permissions—is to identify what your acceptable risk threshold is and to recognize that as you perform additional effort to eliminate that risk, you might produce only diminishing returns. However, in pursuit of least privilege, you’ll still want to work toward that remaining 20%, while being pragmatic about the remainder of the effort.

Remember that least privilege is a journey. Two ways to be pragmatic along this journey are to use feedback loops as you refine your permissions, and to prioritize. For example, focus on what is sensitive to your accounts and your team. Restrict access to production identities first before moving to environments with less risk, such as development or testing. Prioritize reviewing permissions for roles or resources that enable external, cross-account access before moving to the roles that are used in less sensitive areas. Then move on to the next priority for your organization.

Conclusion

Thank you for taking the time to read this two-part series. In these two blog posts, we described nine strategies for implementing least privilege in IAM at scale. Across these nine strategies, we introduced some mental models, tools, and capabilities that can assist you to scale your approach. Let’s consider some of the key takeaways that you can use in your journey of setting, verifying, and refining permissions.

Cloud administrators and developers will set permissions, and can use identity-based policies or resource-based policies to grant access. Administrators can also use multiple accounts as boundaries, and set additional guardrails by using service control policies, permissions boundaries, block public access, VPC endpoint policies, and data perimeters. When cloud administrators or developers create new policies, they can use IAM Access Analyzer’s policy generation capability to generate new policies to grant permissions.

Cloud administrators and developers will then verify permissions. For this task, they can use both IAM Access Analyzer’s policy validation and peer review to determine if the permissions that were set have issues or security risks. These tools can be leveraged in a CI/CD pipeline too, before the permissions are set. IAM Access Analyzer’s custom policy checks can be used to detect nonconformant updates to policies.

To both verify existing permissions and refine permissions over time, cloud administrators and developers can use IAM Access Analyzer’s external access analyzers to identify resources that were shared with external entities. They can also use either IAM Access Advisor’s last accessed information or IAM Access Analyzer’s unused access analyzer to find unused access. In short, if you’re looking for a next step to streamline your journey toward least privilege, be sure to check out IAM Access Analyzer.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Strategies for achieving least privilege at scale – Part 1

2024-07-09 Joshua Du Lac

Post Syndicated from Joshua Du Lac original https://aws.amazon.com/blogs/security/strategies-for-achieving-least-privilege-at-scale-part-1/

Least privilege is an important security topic for Amazon Web Services (AWS) customers. In previous blog posts, we’ve provided tactical advice on how to write least privilege policies, which we would encourage you to review. You might feel comfortable writing a few least privilege policies for yourself, but to scale this up to thousands of developers or hundreds of AWS accounts requires strategy to minimize the total effort needed across an organization.

At re:Inforce 2022, we recommended nine strategies for achieving least privilege at scale. Although the strategies we recommend remain the same, this blog series serves as an update, with a deeper discussion of some of the strategies. In this series, we focus only on AWS Identity and Access Management (IAM), not application or infrastructure identities. We’ll review least privilege in AWS, then dive into each of the nine strategies, and finally review some key takeaways. This blog post, Part 1, covers the first five strategies, while Part 2 of the series covers the remaining four.

Overview of least privilege

The principle of least privilege refers to the concept that you should grant users and systems the narrowest set of privileges needed to complete required tasks. This is the ideal, but it’s not so simple when change is constant—your staff or users change, systems change, and new technologies become available. AWS is continually adding new services or features, and individuals on your team might want to adopt them. If the policies assigned to those users were perfectly least privilege, then you would need to update permissions constantly as the users ask for more or different access. For many, applying the narrowest set of permissions could be too restrictive. The irony is that perfect least privilege can cause maximum effort.

We want to find a more pragmatic approach. To start, you should first recognize that there is some tension between two competing goals—between things you don’t want and things you do want, as indicated in Figure 1. For example, you don’t want expensive resources created, but you do want freedom for your builders to choose their own resources.

Figure 1: Tension between two competing goals

There’s a natural tension between competing goals when you’re thinking about least privilege, and you have a number of controls that you can adjust to securely enable agility. I’ve spoken with hundreds of customers about this topic, and many focus primarily on writing near-perfect permission policies assigned to their builders or machines, attempting to brute force their way to least privilege.

However, that approach isn’t very effective. So where should you start? To answer this, we’re going to break this question down into three components: strategies, tools, and mental models. The first two may be clear to you, but you might be wondering, “What is a mental model”? Mental models help us conceptualize something complex as something relatively simpler, though naturally this leaves some information out of the simpler model.

Teams

Teams generally differ based on the size of the organization. We recognize that each customer is unique, and that customer needs vary across enterprises, government agencies, startups, and so on. If you feel the following example descriptions don’t apply to you today, or that your organization is too small for this many teams to co-exist, then keep in mind that the scenarios might be more applicable in the future as your organization continues to grow. Before we can consider least privilege, let’s consider some common scenarios.

Customers who operate in the cloud tend to have teams that fall into one of two categories: decentralized and centralized. Decentralized teams might be developers or groups of developers, operators, or contractors working in your cloud environment. Centralized teams often consist of administrators. Examples include a cloud environment team, an infrastructure team, the security team, the network team, or the identity team.

Scenarios

To achieve least privilege in an organization effectively, teams must collaborate. Let’s consider three common scenarios:

Creating default roles and policies (for teams and monitoring)
Creating roles and policies for applications
Verifying and refining existing permissions

The first scenario focuses on the baseline set of roles and permissions that are necessary to start using AWS. Centralized teams (such as a cloud environmentteam or identity and access management team) commonly create these initial default roles and policies that you deploy by using your account factory, IAM Identity Center, or through AWS Control Tower. These default permissions typically enable federation for builders or enable some automation, such as tools for monitoring or deployments.

The second scenario is to create roles and policies for applications. After foundational access and permissions are established, the next step is for your builders to use the cloud to build. Decentralized teams (software developers, operators, or contractors) use the roles and policies from the first scenario to then create systems, software, or applications that need their own permissions to perform useful functions. These teams often need to create new roles and policies for their software to interact with databases, Amazon Simple Storage Service (Amazon S3), Amazon Simple Queue Service (Amazon SQS) queues, and other resources.

Lastly, the third scenario is to verify and refine existing permissions, a task that both sets of teams should be responsible for.

Journeys

At AWS, we often say that least privilege is a journey, because change is a constant. Your builders may change, systems may change, you may swap which services you use, and the services you use may add new features that your teams want to adopt, in order to enable faster or more efficient ways of working. Therefore, what you consider least privilege today may be considered insufficient by your users tomorrow.

This journey is made up of a lifecycle of setting, verifying, and refining permissions. Cloud administrators and developers will set permissions, they will then verify permissions, and then they refine those permissions over time, and the cycle repeats as illustrated in Figure 2. This produces feedback loops of continuous improvement, which add up to the journey to least privilege.

Figure 2: Least privilege is a journey

Strategies for implementing least privilege

The following sections will dive into nine strategies for implementing least privilege at scale:

Part 1 (this post):

(Plan) Begin with coarse-grained controls
(Plan) Use accounts as strong boundaries around resources
(Plan) Prioritize short-term credentials
(Policy) Enforce broad security invariants
(Policy) Identify the right tool for the job

Part 2:

(Policy) Empower developers to author application policies
(Process) Maintain well-written policies
(Process) Peer-review and validate policies
(Process) Remove excess privileges over time

To provide some logical structure, the strategies can be grouped into three categories—plan, policy, and process. Plan is where you consider your goals and the outcomes that you want to achieve and then design your cloud environment to simplify those outcomes. Policy focuses on the fact that you will need to implement some of those goals in either the IAM policy language or as code (such as infrastructure-as-code). The Process category will look at an iterative approach to continuous improvement. Let’s begin.

1. Begin with coarse-grained controls

Most systems have relationships, and these relationships can be visualized. For example, AWS accounts relationships can be visualized as a hierarchy, with an organization’s management account and groups of AWS accounts within that hierarchy, and principals and policies within those accounts, as shown in Figure 3.

Figure 3: Icicle diagram representing an account hierarchy

When discussing least privilege, it’s tempting to put excessive focus on the policies at the bottom of the hierarchy, but you should reverse that thinking if you want to implement least privilege at scale. Instead, this strategy focuses on coarse-grained controls, which refer to a top-level, broader set of controls. Examples of these broad controls include multi-account strategy, service control policies, blocking public access, and data perimeters.

Before you implement coarse-grained controls, you must consider which controls will achieve the outcomes you desire. After the relevant coarse-grained controls are in place, you can tailor the permissions down the hierarchy by using more fine-grained controls along the way. The next strategy reviews the first coarse-grained control we recommend.

2. Use accounts as strong boundaries around resources

Although you can start with a single AWS account, we encourage customers to adopt a multi-account strategy. As customers continue to use the cloud, they often need explicit security boundaries, the ability to control limits, and billing separation. The isolation designed into an AWS account can help you meet these needs.

Customers can group individual accounts into different assortments (organizational units) by using AWS Organizations. Some customers might choose to align this grouping by environment (for example: Dev, Pre-Production, Test, Production) or by business units, cost center, or some other option. You can choose how you want to construct your organization, and AWS has provided prescriptive guidance to assist customers when they adopt a multi-account strategy.

Similarly, you can use this approach for grouping security controls. As you layer in preventative or detective controls, you can choose which groups of accounts to apply them to. When you think of how to group these accounts, consider where you want to apply your security controls that could affect permissions.

AWS accounts give you strong boundaries between accounts (and the entities that exist in those accounts). As shown in Figure 4, by default these principals and resources cannot cross their account boundary (represented by the red dotted line on the left).

Figure 4: Account hierarchy and account boundaries

In order for these accounts to communicate with each other, you need to explicitly enable access by adding narrow permissions. For use cases such as cross-account resource sharing, or cross-VPC networking, or cross-account role assumptions, you would need to explicitly enable the required access by creating the necessary permissions. Then you could review those permissions by using IAM Access Analyzer.

One type of analyzer within IAM Access Analyzer, external access, helps you identify resources (such as S3 buckets, IAM roles, SQS queues, and more) in your organization or accounts that are shared with an external entity. This helps you identify if there’s potential for unintended access that could be a security risk to your organization. Although you could use IAM Access Analyzer (external access) with a single account, we recommend using it at the organization level. You can configure an access analyzer for your entire organization by setting the organization as the zone of trust, to identify access allowed from outside your organization.

To get started, you create the analyzer and it begins analyzing permissions. The analysis may produce findings, which you can review for intended and unintended access. You can archive the intended access findings, but you’ll want to act quickly on the unintended access to mitigate security risks.

In summary, you should use accounts as strong boundaries around resources, and use IAM Access Analyzer to help validate your assumptions and find unintended access permissions in an automated way across the account boundaries.

3. Prioritize short-term credentials

When it comes to access control, shorter is better. Compared to long-term access keys or passwords that could be stored in plaintext or mistakenly shared, a short-term credential is requested dynamically by using strong identities. Because the credentials are being requested dynamically, they are temporary and automatically expire. Therefore, you don’t have to explicitly revoke or rotate the credentials, nor embed them within your application.

In the context of IAM, when we’re discussing short-term credentials, we’re effectively talking about IAM roles. We can split the applicable use cases of short-term credentials into two categories—short-term credentials for builders and short-term credentials for applications.

Builders (human users) typically interact with the AWS Cloud in one of two ways; either through the AWS Management Console or programmatically through the AWS CLI. For console access, you can use direct federation from your identity provider to individual AWS accounts or something more centralized through IAM Identity Center. For programmatic builder access, you can get short-term credentials into your AWS account through IAM Identity Center using the AWS CLI.

Applications created by builders need their own permissions, too. Typically, when we consider short-term credentials for applications, we’re thinking of capabilities such as IAM roles for Amazon Elastic Compute Cloud (Amazon EC2), IAM roles for Amazon Elastic Container Service (Amazon ECS) tasks, or AWS Lambda execution roles. You can also use IAM Roles Anywhere to obtain temporary security credentials for workloads and applications that run outside of AWS. Use cases that require cross-account access can also use IAM roles for granting short-term credentials.

However, organizations might still have long-term secrets, like database credentials, that need to be stored somewhere. You can store these secrets with AWS Secrets Manager, which will encrypt the secret by using an AWS KMS encryption key. Further, you can configure automatic rotation of that secret to help reduce the risk of those long-term secrets.

4. Enforce broad security invariants

Security invariants are essentially conditions that should always be true. For example, let’s assume an organization has identified some core security conditions that they want enforced:

Block access for the AWS account root user
Disable access to unused AWS Regions
Prevent the disabling of AWS logging and monitoring services (AWS CloudTrail or Amazon CloudWatch)

You can enable these conditions by using service control policies (SCPs) at the organization level for groups of accounts using an organizational unit (OU), or for individual member accounts.

Notice these words—block, disable, and prevent. If you’re considering these actions in the context of all users or all principals except for the administrators, that’s where you’ll begin to implement broad security invariants, generally by using service control policies. However, a common challenge for customers is identifying what conditions to apply and the scope. This depends on what services you use, the size of your organization, the number of teams you have, and how your organization uses the AWS Cloud.

Some actions have inherently greater risk, while others may have nominal risk or are more easily reversible. One mental model that has helped customers to consider these issues is an XY graph, as illustrated in the example in Figure 5.

Figure 5: Using an XY graph for analyzing potential risk versus frequency of use

The X-axis in this graph represents the potential risk associated with using a service functionality within a particular account or environment, while the Y-axis represents the frequency of use of that service functionality. In this representative example, the top-left part of the graph covers actions that occur frequently and are relatively safe—for example, read-only actions.

The functionality in the bottom-right section is where you want to focus your time. Consider this for yourself—if you were to create a similar graph for your environment—what are the actions you would consider to be high-risk, with an expected low or rare usage within your environment? For example, if you enable CloudTrail for logging, you want to make sure that someone doesn’t invoke the CloudTrail StopLogging API operation or delete the CloudTrail logs. Another high-risk, low-usage example could include restricting AWS Direct Connect or network configuration changes to only your network administrators.

Over time, you can use the mental model of the XY graph to decide when to use preventative guardrails for actions that should never happen, versus conditional or alternative guardrails for situational use cases. You could also move from preventative to detective security controls, while accounting for factors such as the user persona and the environment type (production, development, or testing). Finally, you could consider doing this exercise broadly at the service level before thinking of it in a more fine-grained way, feature-by-feature.

However, not all controls need to be custom to your organization. To get started quickly, here are some examples of documented SCPs as well as AWS Control Tower guardrail references. You can adopt those or tailor them to fit your environment as needed.

5. Identify the right tools for the job

You can think of IAM as a toolbox that offers many tools that provide different types of value. We can group these tools into two broad categories: guardrails and grants.

Guardrails are the set of tools that help you restrict or deny access to your accounts. At a high level, they help you figure out the boundary for the set of permissions that you want to retain. SCPs are a great example of guardrails, because they enable you to restrict the scope of actions that principals in your account or your organization can take. Permissions boundaries are another great example, because they enable you to safely delegate the creation of new principals (roles or users) and permissions by setting maximum permissions on the new identity.

Although guardrails help you restrict access, they don’t inherently grant any permissions. To grant permissions, you use either an identity-based policy or resource-based policy. Identity policies are attached to principals (roles or users), while resource-based policies are applied to specific resources, such as an S3 bucket.

A common question is how to decide when to use an identity policy versus a resource policy to grant permissions. IAM, in a nutshell, seeks to answer the question: who can access what? Can you spot the nuance in the following policy examples?

Policies attached to principals

{
      "Effect": "Allow",
      "Action": "x",
      "Resource": "y",
      "Condition": "z"
    }

Policies attached to resources

{
      "Effect": "Allow",
      "Principal": "w",
      "Action": "x",
      "Resource": "y",
      "Condition": "z"
    }

You likely noticed the difference here is that with identity-based (principal) policies, the principal is implicit (that is, the principal of the policy is the entity to which the policy is applied), while in a resource-based policy, the principal must be explicit (that is, the principal has to be specified in the policy). A resource-based policy can enable cross-account access to resources (or even make a resource effectively public), but the identity-based policies likewise need to allow the access to that cross-account resource. Identity-based policies with sufficient permissions can then access resources that are “shared.” In essence, both the principal and the resource need to be granted sufficient permissions.

When thinking about grants, you can address the “who” angle by focusing on the identity-based policies, or the “what” angle by focusing on resource-based policies. For additional reading on this topic, see this blog post. For information about how guardrails and grants are evaluated, review the policy evaluation logic documentation.

Lastly, if you’d like a detailed walkthrough on choosing the right tool for the job, we encourage you to read the IAM policy types: How and when to use them blog post.

Conclusion

This blog post walked through the first five (of nine) strategies for achieving least privilege at scale. For the remaining four strategies, see Part 2 of this series.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Refactoring to Serverless: From Application to Automation

2024-07-03 Sindhu Pillai

Post Syndicated from Sindhu Pillai original https://aws.amazon.com/blogs/devops/refactoring-to-serverless-from-application-to-automation/

Serverless technologies not only minimize the time that builders spend managing infrastructure, they also help builders reduce the amount of application code they need to write. Replacing application code with fully managed cloud services improves both the operational characteristics and the maintainability of your applications thanks to a cleaner separation between business logic and application topology. This blog post shows you how.

Serverless isn’t a runtime; it’s an architecture

Since the launch of AWS Lambda in 2014, serverless has evolved to be more than just a cloud runtime. The ability to easily deploy and scale individual functions, coupled with per-millisecond billing, has led to the evolution of modern application architectures from monoliths towards loosely-coupled applications. Functions typically communicate through events, an interaction model that’s supported by a combination of serverless integration services, such as Amazon EventBridge and Amazon SNS, and Lambda’s asynchronous invocation model.

Modern distributed architectures with independent runtime elements (like Lambda functions or containers) have a distinct topology graph that represents which elements talk to others. In the diagram below, Amazon API Gateway, Lambda, EventBridge, and Amazon SQS interact to process an order in a typical Order Processing System. The topology has a major influence on the application’s runtime characteristics like latency, throughput, or resilience.

The role of cloud automation evolves

Cloud automation languages, commonly referred to as IaC (Infrastructure as Code), date back to 2011 with the launch of CloudFormation, which allowed users to declare a set of cloud resources in configuration files instead of issuing a series of API calls or CLI commands. Initial document-oriented automation languages like AWS CloudFormation and Terraform were soon complemented by frameworks like AWS Cloud Development Kit (CDK), CDK for Terraform, and Pulumi that introduced the ability to write cloud automation code in popular general-purpose languages like TypeScript, Python, or Java.

The role of cloud automation evolved alongside serverless application architectures. Because serverless technologies free builders from having to manage infrastructure, there really isn’t any “I” in serverless IaC anymore. Instead, serverless cloud automation primarily defines the application’s topology by connecting Lambda functions with event sources or targets, which can be other Lambda functions. This approach more closely resembles “AaC” – Architecture as Code – as the automation now defines the application’s architecture instead of provisioning infrastructure elements.

Improving serverless applications with automation code

By utilizing AWS serverless runtime features, automation code can frequently achieve the same functionality as your application code.

For example, the Lambda function below, written in TypeScript, sends a message to EventBridge:

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => { 
    const result = // some logic
    const eventParam = new PutEventsCommand({
        Entries: [
            {
              Detail: JSON.stringify(result),
              DetailType: 'OrderCreated',
              EventBusName: process.env.EVENTBUS_NAME,
            }
          ]
    });
    await eventBridgeClient.send(eventParam);     return {
       statusCode: 200,
       body: JSON.stringify({ message: 'Order created', result }),
    };
};

You can achieve the same behavior using AWS Lambda Destinations, which instructs the Lambda runtime to publish an event after the completion of the function. You can configure Lambda destinations via below AWS CDK code, also written in TypeScript:

import {EventBridgeDestination} from "aws-cdk-lib/aws-lambda-destinations"

const createOrderLambda = new Function(this,'createOrderLambda', {
    functionName: `OrderService`,
    runtime: Runtime.NODEJS_20_X,
    code: Code.fromAsset('lambda-fns/send-message-using-destination'),
    handler: 'OrderService.handler',
 onSuccess: new EventBridgeDestination(eventBus)
});

With the AWS CDK, you can use the same programming languages for both application and automation code, allowing you to switch easily between the two.

The Lambda function can now focus on the business logic and doesn’t contain any reference to message sending or EventBridge. This separation of concerns is a best practice because changes to the business logic do not run the risk of breaking the architecture and vice versa.

export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
    const result = //some logic
    return {
        statusCode: 200,
        body: JSON.stringify({ message: 'Order created', result }),
     };
};

Instructing the serverless Lambda runtime to send the event has several advantages over hand-coding it inside the application code

It decouples application logic from topology. The message destination, consisting of the type of the service (e.g., EventBridge vs. another Lambda Function) and the destination’s ARN, define the application’s architecture (or topology). Embedding message sending in the application code mixes architecture with business logic. Handling the sending of the message in the runtime separates concerns and avoids having to touch the application code for a topology change.
It makes the composition explicit. If application code sends a message, it will likely read the destination from an environment variable, which is passed to the Lambda function. The name of the variable that is used for this purpose is buried in the application code, forcing you to rely on naming conventions. Defining all dependencies between service instances in automation code keeps them in a central location, and allows you to use code analysis and refactoring tools to reason about your architecture or make changes to it.
It avoids simple mistakes. Redundant code can lead to mistakes. For example, debugging a Lambda function that accidentally swapped day and month in the message’s date field took hours. Letting the runtime send messages avoids such errors.
Higher-level constructs simplify permission grants. Cloud automation libraries like CDK allow the creation of higher-level constructs, which can combine multiple resources and include necessary IAM permissions. You’ll write less code and avoid debugging cycles.
The runtime is more robust. Delegating message sending to the serverless runtime takes care of any required retries, ensuring the message to be sent and freeing builders from having to write extra code for such undifferentiated heavy lifting.

In summary, letting the managed service handle message passing makes your serverless application cleaner and more robust. We also like to say that it becomes “serverless-native” because it fully utilizes the native services available to the application.

Refactoring to serverless-native

Shifting code from application to automation is what we call “Refactoring to Serverless”. Refactoring is a term popularized by Martin Fowler in the late 90s to describe the restructuring of source code to alter its structure without changing its external behavior. Code refactoring can be as simple as extracting code into a separate method or more sophisticated like replacing conditional expressions with polymorphism.

Developers refactor their code to improve its readability and maintainability. A common approach in Test-Driven Development (TDD) is the so-called red-green-refactor cycle: write a test, which will be red because the functionality isn’t implemented, then write the code to make the test green, and finally refactor to counteract the growing entropy in the codebase.

Serverless refactoring takes inspiration from this concept but augments it to the context of serverless automation:

Serverless refactoring: A controlled technique for improving the design of serverless applications by replacing application code with equivalent automation code.

Let’s explore how serverless refactoring can enhance the design and runtime characteristics of a serverless application. The diagram below shows an AWS Step Functions workflow that performs a quality check through image recognition. An early implementation, shown on the left, would use an intermediate AWS Lambda function to call the Amazon Rekognition service. Thanks to the launch of Step Functions’ AWS SDK service integrations in 2021, you can refactor the workflow to directly call the Rekognition API. This refactored design, seen on the right, eliminates the Lambda function (assuming it didn’t perform any additional tasks), thereby reducing costs and runtime complexity.

Replacing Lambda with Service Integration in Step Function workflow

See the AWS CDK implementation for this refactoring, in TypeScript, on GitHub.

Refactoring Limitations

The initial example of replacing application code to send a message to SQS via Lambda Destinations reveals that refactoring from application to automation code isn’t 100% behavior-preserving.

First, Lambda Destinations are only triggered when the function is invoked asynchronously. For synchronous invocations, the function passes the results back to the caller, and does not invoke the destination. Second, the serverless runtime wraps the data returned from the function inside a message envelope, affecting how the message recipient parses the JSON object. The message data is placed inside the responsePayload field if sending to another Lambda function or the detail field if sending to an EventBridge destination. Last, Lambda Destinations sends a message after the function completes, whereas application code could send the message at any point during the execution.

Lambda Destination Execution

The last change in behavior will be transparent to well-architected asynchronous applications because they won’t depend on the timing of message delivery. If a Lambda function continues processing after sending a message (for example, to EventBridge), that code can’t assume that the message has been processed because delivery is asynchronous. A rare exception could be a loop waiting for the results from the downstream message processing, but such loops violate the principles of asynchronous integration and also waste compute resources (Amazon Step Functions is a great choice for asynchronous callbacks). If such behavior is required, it can be achieved by splitting the Lambda function into two parts.

Can Serverless Refactoring be Automated?

Traditional code refactoring like “Extract Method” is automated thanks to built-in support by many code editors. Serverless refactoring isn’t (yet) a fully automatic, 100%-equivalent code transformation because it translates application code into automation code (or vice versa). While AI-powered tools like Amazon Q Developer are getting us closer to that vision, we consider serverless refactoring primarily as a design technique for developers to better utilize the AWS runtime. Improved code design and runtime characteristics outweigh behavior differences, especially if your application includes automated tests.

Incorporating refactoring into your team structures

If a single team owns both the application and the automation code, refactoring takes place inside the team. However, serverless refactoring can cross team boundaries when separate teams develop business logic versus managing the underlying infrastructure, configuration, and deployment.

In such a model, AWS recommends that the development team be responsible for both the application code and the application-specific automation, such as the CDK code to configure Lambda Destinations, Step Functions workflows, or EventBridge routing. Splitting application and application-specific automation across teams would make the development team dependent on the platform team for each refactoring and introduce unnecessary friction.

If both teams use the same Infrastructure-as-Code (IaC) tool, say AWS CDK, the platform team can build reusable templates and constructs that encapsulate organizational requirements and guardrails, such as CDK constructs for S3 buckets with encryption enabled. Development teams can easily consume those resources across CDK stacks.

However, teams could use different IaC tools, for example, the infrastructure team prefers CloudFormation but the development team prefers AWS CDK. In this setup, development teams can build their automation on top of the CFN Modules provided by the infrastructure team. However, they won’t benefit from the same high-level programming abstractions as they do with CDK.

Collaboration in a split-team model

Continuous Refactoring

Just like traditional code refactoring, refactoring to serverless isn’t a one-time activity but an essential aspect of your software delivery. Because adding functionality increases your application’s complexity, regular refactoring can help keep complexity at bay and maintain your development velocity. Like with Continuous Delivery, you can improve your software delivery with Continuous Refactoring.

Teams who encounter difficulties with serverless refactoring might be lacking automated test coverage or cloud automation. So, refactoring can become a useful forcing function for teams to exercise software delivery hygiene, for example by implementing automated tests.

Getting Started

The refactoring samples discussed here are a subset of an extensive catalog of open source code examples, which you can find along with AWS CDK implementation examples at refactoringserverless.com. You can also dive deeper into how serverless refactoring can make your application architecture more loosely coupled in a separate blog post.

Use the examples to accelerate your own refactoring effort. Now Go Refactor!