Tag Archives: Technical How-to

Hardening the RAG chatbot architecture powered by Amazon Bedrock: Blueprint for secure design and anti-pattern migration

Post Syndicated from Magesh Dhanasekaran original https://aws.amazon.com/blogs/security/hardening-the-rag-chatbot-architecture-powered-by-amazon-bedrock-blueprint-for-secure-design-and-anti-pattern-migration/

This blog post demonstrates how to use Amazon Bedrock with a detailed security plan to deploy a safe and responsible chatbot application. In this post, we identify common security risks and anti-patterns that can arise when exposing a large language model (LLM) in an application. Amazon Bedrock is built with features you can use to mitigate vulnerabilities and incorporate secure design principles. This post highlights architectural considerations and best practice strategies to enhance the reliability of your LLM-based application.

Amazon Bedrock unleashes the fusion of generative artificial intelligence (AI) and LLMs, empowering you to craft impactful chatbot applications. As with technologies handling sensitive data and intellectual property, it’s crucial that you prioritize security and adopt a robust security posture. Without proper measures, these applications can be susceptible to risks such as prompt injection, information disclosure, model exploitation, and regulatory violations. By proactively addressing these security considerations, you can responsibly use Amazon Bedrock foundation models and generative AI capabilities.

The chatbot application use case represents a common pattern in enterprise environments, where businesses want to use the power of generative AI foundation models (FMs) to build their own applications. This falls under the Pre-trained models category of the Generative AI Security Scoping Matrix. In this scope, businesses directly integrate with FMs like Anthropic’s Claude through Amazon Bedrock APIs to create custom applications, such as customer support Retrieval Augmented Generation (RAG) chatbots, content generation tools, and decision support systems.

This post provides a comprehensive security blueprint for deploying chatbot applications that integrate with Amazon Bedrock, enabling the responsible adoption of LLMs and generative AI in enterprise environments. We outline mitigation strategies through secure design principles, architectural considerations, and best practices tailored to the challenges of integrating LLMs and generative AI capabilities.

By following the guidance in this post, you can proactively identify and mitigate risks associated with deploying and operating chatbot applications that integrate with Amazon Bedrock and use generative AI models. The guidance can help you strengthen the security posture, protect sensitive data and intellectual property, maintain regulatory compliance, and responsibly deploy generative AI capabilities within your enterprise environments.

This post contains the following high-level sections:

Chatbot application architecture overview

The chatbot application architecture described in this post represents an example implementation that uses various AWS services and integrates with Amazon Bedrock and Anthropic’s Claude 3 Sonnet LLM. This baseline architecture serves as a foundation to understand the core components and their interactions. However, it’s important to note that there can be multiple ways for customers to design and implement a chatbot architecture that integrates with Amazon Bedrock, depending on their specific requirements and constraints. Regardless of the implementation approach, it’s crucial to incorporate appropriate security controls and follow best practices for secure design and deployment of generative AI applications.

The chatbot application allows users to interact through a frontend interface and submit prompts or queries. These prompts are processed by integrating with Amazon Bedrock, which uses the Anthropic Claude 3 Sonnet LLM and a knowledge base built from ingested data. The LLM generates relevant responses based on the prompts and retrieved context from the knowledge base. While this baseline implementation outlines the core functionality, it requires incorporating security controls and following best practices to mitigate potential risks associated with deploying generative AI applications. In the subsequent sections, we discuss security anti-patterns that can arise in such applications, along with their corresponding mitigation strategies. Additionally, we present a secure and responsible architecture blueprint for the chatbot application powered by Amazon Bedrock.

Figure 1: Baseline chatbot application architecture using AWS services and Amazon Bedrock

Figure 1: Baseline chatbot application architecture using AWS services and Amazon Bedrock

Components in the chatbot application baseline architecture

The chatbot application architecture uses various AWS services and integrates with the Amazon Bedrock service and Anthropic’s Claude 3 Sonnet LLM to deliver an interactive and intelligent chatbot experience. The main components of the architecture (as shown in Figure 1) are:

  1. User interaction layer:
    Users interact with the chatbot application through the Streamlit frontend (3), a Python-based open-source library, used to build the user-friendly and interactive interface.
  2. Amazon Elastic Container Service (Amazon ECS) on AWS Fargate:
    A fully managed and scalable container orchestration service that eliminates the need to provision and manage servers, allowing you to run containerized applications without having to manage the underlying compute infrastructure.
  3. Application hosting and deployment:
    The Streamlit application (3) components are hosted and deployed on Amazon ECS on AWS Fargate (2), maintaining scalability and high availability. This architecture represents the application and hosting environment in an independent virtual private cloud (VPC) to promote a loosely-coupled architecture. The Streamlit frontend can be replaced with your organization’s specific frontend and quickly integrated with the backend Amazon API Gateway in the VPC. An application load balancer is used to distribute traffic to the Streamlit application instances.
  4. API Gateway driven Lambda Integration:
    In this example architecture, instead of directly invoking the Amazon Bedrock service from the frontend, an API Gateway backed by an AWS Lambda function (5) is used as an intermediary layer. This approach promotes better separation of concerns, scalability, and secure access to Amazon Bedrock by limiting direct exposure from the frontend.
  5. Lambda:
    Lambda provides highly scalable, short-term serverless compute. Here, the requests from Streamlit are processed. First, the history of the user’s session is retrieved from Amazon DynamoDB (6). Second, the user’s question, history, and the context are formatted into a prompt template and queried against Amazon Bedrock with the knowledge base, employing retrieval augmented generation (RAG).
  6. DynamoDB:
    DynamoDB is responsible for storing and retrieving chat history, conversation history, recommendations, and other relevant data using the Lambda function.
  7. Amazon S3:
    Amazon Simple Storage Services (Amazon S3) is a data storage service and is used here for storing data artifacts that are ingested into the knowledge base.
  8. Amazon Bedrock:
    Amazon Bedrock plays a central role in the architecture. It handles the questions posed by the user using Anthropic Claude 3 Sonnet LLM (9) combined with a previously generated knowledge base (10) of the customer’s organization-specific data.
  9. Anthropic Claude 3 Sonnet:
    Anthropic Claude 3 Sonnet is the LLM used to generate tailored recommendations and responses based on user inputs and the context retrieved from the knowledge base. It’s part of the text analysis and generation module in Amazon Bedrock.
  10. Knowledge base and data ingestion:
    Relevant documents classified as public are ingested from Amazon S3 (9) into in an Amazon Bedrock knowledge base. Knowledge bases are backed by Amazon OpenSearch Service. Amazon Titan Embeddings (10) are used to generate the vector embeddings database of the documents. Storing the data as vector embeddings allows for semantic similarity searching of the documents to retrieve the context of the question posed by the user (RAG). By providing the LLM with context in addition to the question, there’s a much higher chance of getting a useful answer from the LLM.

Comprehensive logging and monitoring strategy

This section outlines a comprehensive logging and monitoring strategy for the Amazon Bedrock-powered chatbot application, using various AWS services to enable centralized logging, auditing, and proactive monitoring of security events, performance metrics, and potential threats.

  1. Logging and auditing:
    • AWS CloudTrail: Logs API calls made to Amazon Bedrock, including InvokeModel requests, as well as information about the user or service that made the request.
    • AWS CloudWatch Logs: Captures and analyzes Amazon Bedrock invocation logs, user prompts, generated responses, and errors or warnings encountered during the invocation process.
    • Amazon OpenSearch Service: Logs and indexes data related to the OpenSearch integration, context data retrievals, and knowledge base operations.
    • AWS Config: Monitors and audits the configuration of resources related to the chatbot application and Amazon Bedrock service, including IAM policies, VPC settings, encryption key management, and other resource configurations.
  2. Monitoring and alerting:
    • AWS CloudWatch: Monitors metrics specific to Amazon Bedrock, such as the number of model invocations, latency of invocations, and error metrics (client-side errors, server-side errors, and throttling). Configures targeted CloudWatch alarms to proactively detect and respond to anomalies or issues related to Bedrock invocations and performance.
    • AWS GuardDuty: Continuously monitors CloudTrail logs for potential threats and unauthorized activity within the AWS environment.
    • AWS Security Hub: Provides centralized security posture management and compliance checks.
    • Amazon Security Lake: Provides a centralized data lake for log analysis; is integrated with CloudTrail and SecurityHub.
  3. Security information and event management integration:
    • Integrate with security information and event management (SIEM) solutions for centralized log management, real-time monitoring of security events, and correlation of logging data from multiple sources (CloudTrail, CloudWatch Logs, OpenSearch Service, and so on).
  4. Continuous improvement:
    • Regularly review and update logging and monitoring configurations, alerting thresholds, and integration with security solutions to address emerging threats, changes in application requirements, or evolving best practices.

Security anti-patterns and mitigation strategies

This section identifies and explores common security anti-patterns associated with the Amazon Bedrock chatbot application architecture. By recognizing these anti-patterns early in the development and deployment phases, you can implement effective mitigation strategies and fortify your security posture.

Addressing security anti-patterns in the Amazon Bedrock chatbot application architecture is crucial for several reasons:

  1. Data protection and privacy: The chatbot application processes and generates sensitive data, including personal information, intellectual property, and confidential business data. Failing to address security anti-patterns can lead to data breaches, unauthorized access, and potential regulatory violations.
  2. Model integrity and reliability: Vulnerabilities in the chatbot application can enable bad actors to manipulate or exploit the underlying generative AI models, compromising the integrity and reliability of the generated outputs. This can have severe consequences, particularly in decision-support or critical applications.
  3. Responsible AI deployment: As the adoption of generative AI models continues to grow, it’s essential to maintain responsible and ethical deployment practices. Addressing security anti-patterns is crucial for maintaining trust, transparency, and accountability in the chatbot application powered by AI models.
  4. Compliance and regulatory requirements: Many industries and regions have specific regulations and guidelines governing the use of AI technologies, data privacy, and information security. Addressing security anti-patterns is a critical step towards adhering to and maintaining compliance for the chatbot application.

The security anti-patterns that are covered in this post include:

  1. Lack of secure authentication and access controls
  2. Insufficient input validation and sanitization
  3. Insecure communication channels
  4. Inadequate prompt and response logging, auditing, and non-repudiation
  5. Insecure data storage and access controls
  6. Failure to secure FMs and generative AI components
  7. Lack of responsible AI governance and ethics
  8. Lack of comprehensive testing and validation

Anti-pattern 1: Lack of secure authentication and access controls

In a generative AI chatbot application using Amazon Bedrock, a lack of secure authentication and access controls poses significant risks to the confidentiality, integrity, and availability of the system. Identity spoofing and unauthorized access can enable threat actors to impersonate legitimate users or systems, gain unauthorized access to sensitive data processed by the chatbot application, and potentially compromise the integrity and confidentiality of the customer’s data and intellectual property used by the application.

Identity spoofing and unauthorized access are important areas to address in this architecture, as the chatbot application handles user prompts and responses, which may contain sensitive information or intellectual property. If a threat actor can impersonate a legitimate user or system, they can potentially inject malicious prompts, retrieve confidential data from the knowledge base, or even manipulate the responses generated by the Anthropic Claude 3 LLM integrated with Amazon Bedrock.

Anti-pattern examples

  • Exposing the Streamlit frontend interface or the API Gateway endpoint without proper authentication mechanisms, potentially allowing unauthenticated users to interact with the chatbot application and inject malicious prompts.
  • Storing or hardcoding AWS access keys or API credentials in the application code or configuration files, increasing the risk of credential exposure and unauthorized access to AWS services like Amazon Bedrock or DynamoDB.
  • Implementing weak or easily guessable passwords for administrative or service accounts with elevated privileges to access the Amazon Bedrock service or other critical components.
  • Lacking multi-factor authentication (MFA) for AWS Identity and Access Management (IAM) users or roles with privileged access, increasing the risk of unauthorized access to AWS resources, including the Amazon Bedrock service, if credentials are compromised.

Mitigation strategies

To mitigate the risks associated with a lack of secure authentication and access controls, implement robust IAM controls, as well as continuous logging, monitoring, and threat detection mechanisms.

IAM controls:

  • Use industry-standard protocols like OAuth 2.0 or OpenID Connect, and integrate with AWS IAM Identity Center or other identity providers for centralized authentication and authorization for the Streamlit frontend interface and AWS API Gateway endpoints.
  • Implement fine-grained access controls using AWS IAM policies and resource-based policies to restrict access to only the necessary Amazon Bedrock resources, Lambda functions, and other components required for the chatbot application.
  • Enforce the use of MFA for all IAM users, roles, and service accounts with access to critical components like Amazon Bedrock, DynamoDB, or the Streamlit application.

Continuous logging and monitoring and threat detection:

  • See the Comprehensive logging and monitoring strategy section for guidance on implementing centralized logging and monitoring solutions to track and audit authentication events, access attempts, and potential unauthorized access or credential misuse across the chatbot application components and Amazon Bedrock service, as well as using CloudWatch, Lambda, and GuardDuty to detect and respond to anomalous behavior and potential threats.

Anti-pattern 2: Insufficient input sanitization and validation

Insufficient input validation and sanitization in a generative AI chatbot application can expose the system to various threats, including injection events, data tampering, adversarial events, and data poisoning events. These vulnerabilities can lead to unauthorized access, data manipulation, and compromised model outputs.

Injection events: If user prompts or inputs aren’t properly sanitized and validated, a threat actor can potentially inject malicious code, such as SQL code, leading to unauthorized access or manipulation of the DynamoDB chat history data. Additionally, if the chatbot application or components process user input without proper validation, a threat actor can potentially inject and run arbitrary code on the backend systems, compromising the entire application.

Data tampering: A threat actor can potentially modify user prompts or payloads in transit between the chatbot interface and Amazon Bedrock service, leading to unintended model responses or actions. Lack of data integrity checks can allow a threat actor to tamper with the context data exchanged between Amazon Bedrock and OpenSearch, potentially leading to incorrect or malicious search results influencing the LLM responses.

Data poisoning events: If the training data or context data used by the LLM or chatbot application isn’t properly validated and sanitized, bad actors can potentially introduce malicious or misleading data, leading to biased or compromised model outputs.

Anti-pattern examples

  • Failure to validate and sanitize user prompts before sending them to Amazon Bedrock, potentially leading to injection events or unintended data exposure.
  • Lack of input validation and sanitization for context data retrieved from OpenSearch, allowing malformed or malicious data to influence the LLM’s responses.
  • Insufficient sanitization of LLM-generated responses before displaying them to users, enabling potential code injection or rendering of harmful content.
  • Inadequate sanitization of user input in the Streamlit application or Lambda functions, failing to remove or escape special characters, code snippets, or potentially malicious patterns, enabling code injection events.
  • Insufficient validation and sanitization of training data or other data sources used by the LLM or chatbot application, allowing data poisoning events that can introduce malicious or misleading data, leading to biased or compromised model outputs.
  • Allowing unrestricted character sets, input lengths, or special characters in user prompts or data inputs, enabling adversaries to craft inputs that bypass input validation and sanitization mechanisms, potentially causing undesirable or malicious outputs.
  • Relying solely on deny lists for input validation, which can be quickly bypassed by adversaries, potentially leading to injection events, data tampering, or other exploit scenarios.

Mitigation strategies

To mitigate the risks associated with insufficient input validation and sanitization, implement robust input validation and sanitization mechanisms throughout the chatbot application and its components.

Input validation and sanitization:

  • Implement strict input validation rules for user prompts at the chatbot interface and Amazon Bedrock service boundaries, defining allowed character sets, maximum input lengths, and disallowing special characters or code snippets. Use Amazon Bedrock’s Guardrails feature, which allows defining denied topics and content filters to remove undesirable and harmful content from user interactions with your applications.
  • Use allow lists instead of deny lists for input validation to maintain a more robust and comprehensive approach.
  • Sanitize user input by removing or escaping special characters, code snippets, or potentially malicious patterns.

Data flow validation:

  • Validate and sanitize data flows between components, including:
    • User prompts sent to the FM and responses generated by the FM and returned to the chatbot interface.
    • Training data, context data, and other data sources used by the FM or chatbot application.

Protective controls:

  • Use AWS Web Application Firewall (WAF) for input validation and protection against common web exploits.
  • Use AWS Shield for protection against distributed denial of service (DDoS) events.
  • Use CloudTrail to monitor API calls to Amazon Bedrock, including InvokeModel requests.
  • See the Comprehensive logging and monitoring strategy section for guidance on implementing Lambda functions, Amazon EventBridge rules, and CloudWatch Logs to analyze CloudTrail logs, ingest application logs, user prompts, and responses, and integrate with incident response and SIEM solutions for detecting, investigating, and mitigating security incidents related to input validation and sanitization, including jailbreaking attempts and anomalous behavior.

Anti-pattern 3: Insecure communication channels

Insecure communication channels between chatbot application components can expose sensitive data to interception, tampering, and unauthorized access risks. Unsecured channels enable man-in-the-middle events where threat actors intercept, modify data in transit such as user prompts, responses, and context data, leading to data tampering, malicious payload injection, and unauthorized information access.

Anti-pattern examples

  • Failure to use AWS PrivateLink for secure service-to-service communication within the VPC, exposing communications between Amazon Bedrock and other AWS services to potential risks over the public internet, even when using HTTPS.
  • Absence of data integrity checks or mechanisms to detect and prevent data tampering during transmission between components.
  • Failure to regularly review and update communication channel configurations, protocols, and encryption mechanisms to address emerging threats and ensure compliance with security best practices.

Mitigation strategies

To mitigate the risks associated with insecure communication channels, implement secure communication mechanisms and enforce data integrity throughout the chatbot application’s components and their interactions. Proper encryption, authentication, and integrity checks should be employed to protect sensitive data in transit and help prevent unauthorized access, data tampering, and man-in-the-middle events.

Secure communication channels:

  • Use PrivateLink for secure service-to-service communication between Amazon Bedrock and other AWS services used in the chatbot application architecture. PrivateLink provides a private, isolated communication channel within the Amazon VPC, eliminating the need to traverse the public internet. This mitigates the risk of potential interception, tampering, or unauthorized access to sensitive data transmitted between services, even when using HTTPS.
  • Use AWS Certificate Manager (ACM) to manage and automate the deployment of SSL/TLS certificates used for secure communication between the chatbot frontend interface (the Streamlit application) and the API Gateway endpoint. ACM simplifies the provisioning, renewal, and deployment of SSL/TLS certificates, making sure that communication channels between the user-facing components and the backend API are securely encrypted using industry-standard protocols and up-to-date certificates.

Continuous logging and monitoring:

  • See the Comprehensive Logging and Monitoring Strategy section for guidance on implementing centralized logging and monitoring mechanisms to detect and respond to potential communication channel anomalies or security incidents, including monitoring communication channel metrics, API call patterns, request payloads, and response data, using AWS services like CloudWatch, CloudTrail, and AWS WAF.

Network segmentation and isolation controls

  • Implement network segmentation by deploying the Amazon ECS cluster within a dedicated VPC and subnets, isolating it from other components and restricting communication based on the principle of least privilege.
  • Create separate subnets within the VPC for the public-facing frontend tier and the backend application tier, further isolating the components.
  • Use AWS security groups and network access control lists (NACLs) to control inbound and outbound traffic at the instance and subnet levels, respectively, for the ECS cluster and the frontend instances.

Anti-pattern 4: Inadequate logging, auditing, and non-repudiation

Inadequate logging, auditing, and non-repudiation mechanisms in a generative AI chatbot application can lead to several risks, including a lack of accountability, challenges in forensic analysis, and compliance concerns. Without proper logging and auditing, it’s challenging to track user activities, diagnose issues, perform forensic analysis in case of security incidents, and demonstrate compliance with regulations or internal policies.

Anti-pattern examples

  • Lack of logging for data flows between components, such as user prompts sent to Amazon Bedrock, context data exchanged with OpenSearch, and responses from the LLM, hindering investigative efforts in case of security incidents or data breaches.
  • Insufficient logging of user activities within the chatbot application—such as sign in attempts, session duration, and actions performed—limiting the ability to track and attribute actions to specific users.
  • Absence of mechanisms to ensure the integrity and authenticity of logged data, allowing potential tampering or repudiation of logged events.
  • Failure to securely store and protect log data from unauthorized access or modification, compromising the reliability and confidentiality of log information.

Mitigation strategies

To mitigate the risks associated with inadequate logging, auditing, and non-repudiation, implement comprehensive logging and auditing mechanisms to capture critical events, user activities, and data flows across the chatbot application components. Additionally, measures must be taken to maintain the integrity and authenticity of log data, help prevent tampering or repudiation, and securely store and protect log information from unauthorized access.

Comprehensive logging and auditing:

  • See the Comprehensive logging and monitoring strategy section for detailed guidance on implementing logging mechanisms using CloudTrail, CloudWatch Logs, and OpenSearch Service, as well as using CloudTrail for logging and monitoring API calls, especially Amazon Bedrock API calls and other API activities within the AWS environment, using CloudWatch for monitoring Amazon Bedrock-specific metrics, and ensuring log data integrity and non-repudiation through the CloudTrail log file integrity validation feature and implementing S3 Object Lock and S3 Versioning for log data stored in Amazon S3.
  • Make sure that log data is securely stored and protected from unauthorized access by using AWS Key Management Service (AWS KMS) for encryption at rest and implementing restrictive IAM policies and resource-based policies to control access to log data.
  • Retain log data for an appropriate period based on compliance requirements, using CloudTrail log file integrity validation and CloudWatch Logs retention periods and data archiving capabilities.

User activity monitoring and tracking:

  • Use CloudTrail for logging and monitoring API calls, especially Amazon Bedrock API calls and other API activities within the AWS environment, such as API Gateway, Lambda, and DynamoDB. Additionally, use CloudWatch for monitoring metrics specific to Amazon Bedrock, including the number of model invocations, latency, and error metrics (client-side errors, server-side errors, and throttling).
  • Integrate with security information and event management (SIEM) solutions for centralized log management and real-time monitoring of security events.

Data integrity and non-repudiation:

  • Implement digital signatures or non-repudiation mechanisms to verify the integrity and authenticity of logged data, minimizing tampering or repudiation of logged events. Use the CloudTrail log file integrity validation feature, which uses industry-standard algorithms (SHA-256 for hashing and SHA-256 with RSA for digital signing) to provide non-repudiation and verify log data integrity. For log data stored in Amazon S3, enable S3 Object Lock and S3 Versioning to provide an immutable, write once, read many (WORM) data storage model, helping to prevent object deletions or modifications, and maintaining data integrity and non-repudiation. Additionally, implement S3 bucket policies and IAM policies to restrict access to log data stored in S3, further enhancing the security and non-repudiation of logged events.

Anti-pattern 5: Insecure data storage and access controls

Insecure data storage and access controls in a generative AI chatbot application can lead to significant risks, including information disclosure, data tampering, and unauthorized access. Storing sensitive data, such as chat history, in an unencrypted or insecure manner can result in information disclosure if the data store is compromised or accessed by unauthorized entities. Additionally, a lack of proper access controls can allow unauthorized parties to access, modify, or delete data, leading to data tampering or unauthorized access.

Anti-pattern examples

  • Storing chat history data in DynamoDB without encryption at rest using AWS KMS customer-managed keys (CMKs).
  • Lack of encryption at rest using CMKs from AWS KMS for data in OpenSearch, Amazon S3, or other components that handle sensitive data.
  • Overly permissive access controls or lack of fine-grained access control mechanisms for the DynamoDB chat history, OpenSearch, Amazon S3, or other data stores, increasing the risk of unauthorized access or data breaches.
  • Storing sensitive data in clear text, or using insecure encryption algorithms or key management practices.
  • Failure to regularly review and rotate encryption keys or update access control policies to address potential security vulnerabilities or changes in access requirements.

Mitigation strategies

To mitigate the risks associated with insecure data storage and access controls, implement robust encryption mechanisms, secure key management practices, and fine-grained access control policies. Encrypting sensitive data at rest and in transit, using customer-managed encryption keys from AWS KMS, and implementing least- privilege access controls based on IAM policies and resource-based policies can significantly enhance the security and protection of data within the chatbot application architecture.

Key management and encryption at rest:

  • Implement AWS KMS to manage and control access to CMKs for data encryption across components like DynamoDB, OpenSearch, and Amazon S3.
    • Use CMKs to configure DynamoDB to automatically encrypt chat history data at rest.
    • Configure OpenSearch and Amazon S3 to use encryption at rest with AWS KMS CMKs for data stored in these services.
    • CMKs provide enhanced security and control, allowing you to create, rotate, disable, and revoke encryption keys, enabling better key isolation and separation of duties.
    • CMKs enable you to enforce key policies, audit key usage, and adhere to regulatory requirements or organizational policies that mandate customer-managed encryption keys.
    • CMKs offer portability and independence from specific services, allowing you to migrate or integrate data across multiple services while maintaining control over the encryption keys.
    • AWS KMS provides a centralized and secure key management solution, simplifying the management and auditing of encryption keys across various components and services.
  • Implement secure key management practices, including:
    • Regular key rotation to maintain the security of your encrypted data.
    • Separation of duties to make sure that no single individual has complete control over key management operations.
    • Strict access controls for key management operations, using IAM policies and roles to enforce the principle of least privilege.

Fine-grained access controls:

  • Implement fine-grained access controls for the DynamoDB chat history data store, OpenSearch, Amazon S3, and other data stores using IAM policies and roles.
  • Implement fine-grained access controls and define least-privilege access policies for all resources handling sensitive data, such as the DynamoDB chat history data store, OpenSearch, Amazon S3, and other data stores or services. For example, use IAM policies and resource-based policies to restrict access to specific DynamoDB tables, OpenSearch domains, and S3 buckets, limiting access to only the necessary actions (for example, read, write, and list) based on the principle of least privilege. Extend this approach to all resources handling sensitive data within the chatbot application architecture, making sure that access is granted only to the minimum required resources and actions necessary for each component or user role.

Continuous improvement:

  • Regularly review and update encryption configurations, access control policies, and key management practices to address potential security vulnerabilities or changes in access requirements.

Anti-pattern 6: Failure to secure FM and generative AI components

Inadequate security measures for FMs and generative AI components in a chatbot application can lead to severe risks, including model tampering, unintended information disclosure, and denial of service. Threat actors can manipulate unsecured FMs and generative AI models to generate biased, harmful, or malicious responses, potentially causing significant harm or reputational damage.

Lack of proper access controls or input validation can result in unintended information disclosure, where sensitive data is inadvertently included in model responses. Additionally, insecure FM or generative AI components can be vulnerable to denial-of-service events, disrupting the availability of the chatbot application and impacting its functionality.

Anti-pattern examples

  • Insecure model fine tuning practices, such as using untrusted or compromised data sources, can lead to biased or malicious models.
  • Lack of continuous monitoring for FM and generative AI components, leaving them vulnerable to emerging threats or known vulnerabilities.
  • Lack of guardrails or safety measures to control and filter the outputs of FMs and generative AI components, potentially leading to the generation of harmful, biased, or undesirable content.
  • Inadequate access controls or input validation for prompts and context data sent to the FM components, increasing the risk of injection events or unintended information disclosure.
  • Failure to implement secure deployment practices for FM and generative AI components, including secure communication channels, encryption of model artifacts, and access controls.

Mitigation strategies

To mitigate the risks associated with inadequately secured foundational models (FMs) and generative AI components, implement secure integration mechanisms, robust model fine-tuning and deployment practices, continuous monitoring, and effective guardrails and safety measures. These mitigation strategies help prevent model tampering, unintended information disclosure, denial-of-service events, and the generation of harmful or undesirable content, while ensuring the security, reliability, and ethical alignment of the chatbot application’s generative AI capabilities.

Secure integration with LLMs and knowledge bases:

  • Implement secure communication channels (for example HTTPS or PrivateLink) between Amazon Bedrock, OpenSearch, and the FM components to help prevent unauthorized access or data tampering.
  • Implement strict input validation and sanitization for prompts and context data sent to the FM components to help prevent injection events or unintended information disclosure.
  • Implement access controls and least-privilege principles for the OpenSearch integration to limit the data accessible to the LLM components.

Secure model fine tuning, deployment, and monitoring:

  • Establish secure and auditable fine-tuning pipelines, using trusted and vetted data sources, to help prevent tampering or the introduction of biases.
  • Implement secure deployment practices for FM and generative AI components, including access controls, secure communication channels, and encryption of model artifacts.
  • Continuously monitor FM and generative AI components for security vulnerabilities, performance issues, and unintended behavior.
  • Implement rate-limiting, throttling, and load-balancing mechanisms to help prevent denial-of-service events on FM and generative AI components.
  • Regularly review and audit FM and generative AI components for compliance with security policies, industry best practices, and regulatory requirements.

Guardrails and safety measures

  • Implement guardrails, which are safety measures designed to reduce harmful outputs and align the behavior of FMs and generative AI components with human values.
  • Use keyword-based filtering, metric-based thresholds, human oversight, and customized guardrails tailored to the specific risks and cultural and ethical norms of each application domain.
  • Monitor the effectiveness of guardrails through performance benchmarking and adversarial testing.

Jailbreak robustness testing

  • Conduct jailbreak robustness testing by prompting the FMs and generative AI components with a diverse set of jailbreak attempts across different prohibited scenarios to identify weaknesses and improve model robustness.

Anti-pattern 7: Lack of responsible AI governance and ethics

While the previous anti-patterns focused on technical security aspects, it is equally important to address the ethical and responsible governance of generative AI systems. Without strong governance frameworks, ethical guidelines, and accountability measures, chatbot applications can result in unintended consequences, biased outcomes, and a lack of transparency and trust.

Anti-pattern examples

  • Lack of an established ethical AI governance framework, including principles, policies, and processes to guide the responsible development and deployment of the generative AI chatbot application.
  • Insufficient measures to ensure transparency, explainability, and interpretability of the LLM and generative AI components, making it difficult to understand and audit their decision-making processes.
  • Absence of mechanisms for stakeholder engagement, public consultation, and consideration of societal impacts, potentially leading to a lack of trust and acceptance of the chatbot application.
  • Failure to address potential biases, discrimination, or unfairness in the training data, models, or outputs of the generative AI system.
  • Inadequate processes for testing, validation, and ongoing monitoring of the chatbot application’s ethical behavior and alignment with organizational values and societal norms.

Mitigation strategies

To minimize a lack of responsible AI governance and ethics, establish a comprehensive ethical AI governance framework, promote transparency and interpretability, engage stakeholders and consider societal impacts, address potential biases and fairness issues, implement continuous improvement and monitoring processes, and use guardrails and safety measures. These mitigation strategies help to foster trust, accountability, and ethical alignment in the development and deployment of the generative AI chatbot application, mitigating the risks of unintended consequences, biased outcomes, and a lack of transparency.

Ethical AI governance framework:

  • Establish an ethical AI governance framework, including principles, policies, and processes to guide the responsible development and deployment of the generative AI chatbot application.
  • Define clear ethical guidelines and decision-making frameworks to address potential ethical dilemmas, biases, or unintended consequences.
  • Implement accountability measures, such as designated ethics boards, ethics officers, or external advisory committees, to oversee the ethical development and deployment of the chatbot application.

Transparency and interpretability:

  • Implement measures to promote transparency and interpretability of the LLM and generative AI components, allowing for auditing and understanding of their decision-making processes.
  • Provide clear and accessible information to stakeholders and users about the chatbot application’s capabilities, limitations, and potential biases or ethical considerations.

Stakeholder engagement and societal impact:

  • Establish mechanisms for stakeholder engagement, public consultation, and consideration of societal impacts, fostering trust and acceptance of the chatbot application.
  • Conduct impact assessments to identify and mitigate potential negative consequences or risks to individuals, communities, or society.

Bias and fairness:

  • Address potential biases, discrimination, or unfairness in the training data, models, or outputs of the generative AI system through rigorous testing, bias mitigation techniques, and ongoing monitoring.
  • Promote diverse and inclusive representation in the development, testing, and governance processes to reduce potential biases and blind spots.

Continuous improvement and monitoring:

  • Implement processes for ongoing testing, validation, and monitoring of the chatbot application’s behavior and alignment with organizational values and societal norms.
  • Regularly review and update the AI governance framework, policies, and processes to address emerging ethical challenges, societal expectations, and regulatory developments.

Guardrails and safety measures:

  • Implement guardrails, such as Guardrails for Amazon Bedrock, which are safety measures designed to reduce harmful outputs and align the behavior of LLMs and generative AI components with human values and responsible AI policies.
  • Use Guardrails for Amazon Bedrock to define denied topics and content filters to remove undesirable and harmful content from interactions between users and your applications.
    • Define denied topics using natural language descriptions to specify topics or subject areas that are undesirable in the context of your application.
    • Configure content filters to set thresholds for filtering harmful content across categories such as hate, insults, sexuality, and violence based on your use cases and responsible AI policies.
    • Use the personally identifiable information (PII) redaction feature to redact information such as names, email addresses, and phone numbers from LLM-generated responses or block user inputs that contain PII.
  • Integrate Guardrails for Amazon Bedrock with CloudWatch to monitor and analyze user inputs and LLM responses that violate defined policies, enabling proactive detection and response to potential issues.
  • Monitor the effectiveness of guardrails through performance benchmarking and adversarial testing, continuously refining and updating the guardrails based on real-world usage and emerging ethical considerations.

Jailbreak robustness testing:

  • Conduct jailbreak robustness testing by prompting the LLMs and generative AI components with a diverse set of jailbreak attempts across different prohibited scenarios to identify weaknesses and improve model robustness.

Anti-pattern 8: Lack of comprehensive testing and validation

Inadequate testing and validation processes for the LLM system and the generative AI chatbot application can lead to unidentified vulnerabilities, performance bottlenecks, and availability issues. Without comprehensive testing and validation, organizations might fail to detect potential security risks, functionality gaps, or scalability and performance limitations before deploying the application in a production environment.

Anti-pattern examples

  • Lack of functional testing to validate the correctness and completeness of the LLM’s responses and the chatbot application’s features and functionalities.
  • Insufficient performance testing to identify bottlenecks, resource constraints, or scalability limitations under various load conditions.
  • Absence of security testing, such as penetration testing, vulnerability scanning, and adversarial testing to uncover potential security vulnerabilities or model exploits.
  • Failure to incorporate automated testing and validation processes into a continuous integration and continuous deployment (CI/CD) pipeline, leading to manual and one-time testing efforts that might overlook critical issues.
  • Inadequate testing of the chatbot application’s integration with external services and components, such as Amazon Bedrock, OpenSearch, and DynamoDB, potentially leading to compatibility issues or data integrity problems.

Mitigation strategies

To address the lack of comprehensive testing and validation, implement a robust testing strategy encompassing functional, performance, security, and integration testing. Integrate automated testing into a CI/CD pipeline, conduct security testing like threat modeling and penetration testing, and use adversarial validation techniques. Continuously improve testing processes to verify the reliability, security, and scalability of the generative AI chatbot application.

Comprehensive testing strategy:

  • Establish a comprehensive testing strategy that includes functional testing, performance testing, load testing, security testing, and integration testing for the LLM system and the overall chatbot application.
  • Define clear testing requirements, test cases, and acceptance criteria based on the application’s functional and non-functional requirements, as well as security and compliance standards.

Automated testing and CI/CD integration:

  • Incorporate automated testing and validation processes into a CI/CD pipeline, enabling continuous monitoring and assessment of the LLM’s performance, security, and reliability throughout its lifecycle.
  • Use automated testing tools and frameworks to streamline the testing process, improve test coverage, and facilitate regression testing.

Security testing and adversarial validation:

  • Conduct threat modeling exercises early in the design process and as soon as the design is finalized for the chatbot application architecture to proactively identify potential security risks and vulnerabilities. Subsequently, conduct regular security testing—including penetration testing, vulnerability scanning, and adversarial testing—to uncover and validate identified security vulnerabilities or model exploits.
  • Implement adversarial validation techniques, such as prompting the LLM with carefully crafted inputs designed to expose weaknesses or vulnerabilities, to improve the model’s robustness and security.

Performance and load testing:

  • Perform comprehensive performance and load testing to identify potential bottlenecks, resource constraints, or scalability limitations under various load conditions.
  • Use tools and techniques for load generation, stress testing, and capacity planning to ensure the chatbot application can handle anticipated user traffic and workloads.

Integration testing:

  • Conduct thorough integration testing to validate the chatbot application’s integration with external services and components, such as Amazon Bedrock, OpenSearch, and DynamoDB, maintaining seamless communication and data integrity.

Continuous improvement:

  • Regularly review and update the testing and validation processes to address emerging threats, new vulnerabilities, or changes in application requirements.
  • Use testing insights and results to continuously improve the LLM system, the chatbot application, and the overall security posture.

Common mitigation strategies for all anti-patterns

  • Regularly review and update security measures, access controls, monitoring mechanisms, and guardrails for LLM and generative AI components to address emerging threats, vulnerabilities, and evolving responsible AI best practices.
  • Conduct regular security assessments, penetration testing, and code reviews to identify and remediate vulnerabilities or misconfigurations related to logging, auditing, and non-repudiation mechanisms.
  • Stay current with security best practices, guidance, and updates from AWS and industry organizations regarding logging, auditing, and non-repudiation for generative AI applications.

Secure and responsible architecture blueprint

After discussing the baseline chatbot application architecture and identifying critical security anti-patterns associated with generative AI applications built using Amazon Bedrock, we now present the secure and responsible architecture blueprint. This blueprint (Figure 2) incorporates the recommended mitigation strategies and security controls discussed throughout the anti-pattern analysis.

Figure 2: Secure and responsible generative AI chatbot architecture blueprint

Figure 2: Secure and responsible generative AI chatbot architecture blueprint

In this target state architecture, unauthenticated users interact with the chatbot application through the frontend interface (1), where it’s crucial to mitigate the anti-pattern of insufficient input validation and sanitization by implementing secure coding practices and input validation. The user inputs are then processed through AWS Shield, AWS WAF, and CloudFront (2), which provide DDoS protection, web application firewall capabilities, and a content delivery network, respectively. These services help mitigate insufficient input validation, web exploits, and lack of comprehensive testing by using AWS WAF for input validation and conducting regular security testing.

The user requests are then routed through API Gateway (3), which acts as the entry point for the chatbot application, facilitating API connections to the Streamlit frontend. To address anti-patterns related to authentication, insecure communication, and LLM security, it’s essential to implement secure authentication protocols, HTTPS/TLS, access controls, and input validation within API Gateway. Communication between the VPC resources and API Gateway is secured through VPC endpoints (4), using PrivateLink for secure private communication and attaching endpoint policies to control which AWS principals can access the API Gateway service (8), mitigating the insecure communication channels anti-pattern.

The Streamlit application (5) is hosted on Amazon ECS in a private subnet within the VPC. It hosts the frontend interface and must implement secure coding practices and input validation to mitigate insufficient input validation and sanitization. User inputs are then processed by Lambda (6), a serverless compute service hosted within the VPC, which connects to Amazon Bedrock, OpenSearch, and DynamoDB through VPC endpoints (7). These VPC endpoints have endpoint policies attached to control access, enabling secure private communication between the Lambda function and the services, mitigating the insecure communication channels anti-pattern. Within Lambda, strict input validation rules, allow-lists, and user input sanitization are implemented to address the input validation anti-pattern.

User requests from the chatbot application are sent to Amazon Bedrock (12), a generative AI solution that powers the LLM capabilities. To mitigate the failure to secure FM and generative AI components anti-pattern, secure communication channels, input validation, and sanitization for prompts and context data must be implemented when interacting with Amazon Bedrock.

Amazon Bedrock interacts with OpenSearch Service (9) using Amazon Bedrock knowledge bases to retrieve relevant context data for the user’s question. The knowledge base is created by ingesting public documents from Amazon S3 (10). To mitigate the anti-pattern of insecure data storage and access controls, implement encryption at rest using AWS KMS and fine-grained IAM policies and roles for access control within OpenSearch Service. Titan Embeddings (11) are the format of the vector embeddings, which represent the documents stored in Amazon S3. The vector format enables similarity calculation and retrieval of relevant information (12). To address the failure to secure FM and generative AI components anti-pattern, secure integration with Titan Embeddings and input data validation should be implemented.

The knowledge base data, user prompts, and context data are processed by Amazon Bedrock (13) with the Claude 3 LLM (14). To address the anti-patterns of failure to secure FM and generative AI components, as well as lack of responsible AI governance and ethics, secure communication channels, input validation, ethical AI governance frameworks, transparency and interpretability measures, stakeholder engagement, bias mitigation, and guardrails like Guardrails for Amazon Bedrock should be implemented.

The generated responses and recommendations are then stored and retrieved in Amazon DynamoDB (15) by the Lambda function. To mitigate insecure data storage and access, encrypting data at rest with AWS KMS (16) and implement fine-grained access controls through IAM policies and roles.

Comprehensive logging, auditing, and monitoring mechanisms are provided by CloudTrail (17), CloudWatch (18), and AWS Config (19) to address the inadequate logging, auditing, and non-repudiation anti-pattern. See the Comprehensive logging and monitoring strategy section for detailed guidance on implementing comprehensive logging, auditing, and monitoring mechanisms using CloudTrail, CloudWatch, CloudWatch Logs, and AWS Config to address the inadequate logging, auditing, and non-repudiation anti-pattern; including logging API calls made to Amazon Bedrock service, monitoring Amazon Bedrock-specific metrics, capturing and analyzing Bedrock invocation logs, and monitoring and auditing the configuration of resources related to the chatbot application and Amazon Bedrock service.

IAM (20) plays a crucial role in the overall architecture and in mitigating anti-patterns related to authentication and insecure data storage and access. IAM roles and permissions are critical in enforcing secure authentication mechanisms, least privilege access, multi-factor authentication, and robust credential management across the various components of the chatbot application. Additionally, service control policies (SCPs) can be configured to restrict access to specific models or knowledge bases within Amazon Bedrock, preventing unauthorized access or use of sensitive intellectual property.

Finally, GuardDuty (21), Amazon Inspector (22), Security Hub (23), and Security Lake (24) have been included as additional recommended services to further enhance the security posture of the chatbot application. GuardDuty (21) provides threat detection across the control and data planes, Amazon Inspector (22) enables vulnerability assessments and continuous monitoring of Amazon ECS and Lambda workloads. Security Hub (23) offers centralized security posture management and compliance checks, while Security Lake (24) acts as a centralized data lake for log analysis, integrated with CloudTrail and SecurityHub.

Conclusion

By identifying critical anti-patterns and providing comprehensive mitigation strategies, you now have a solid foundation for a secure and responsible deployment of generative AI technologies in enterprise environments.

The secure and responsible architecture blueprint presented in this post serves as a comprehensive guide for organizations that want to use the power of generative AI while ensuring robust security, data protection, and ethical governance. By incorporating industry-leading security controls—such as secure authentication mechanisms, encrypted data storage, fine-grained access controls, secure communication channels, input validation and sanitization, comprehensive logging and auditing, secure FM integration and monitoring, and responsible AI guardrails—this blueprint addresses the unique challenges and vulnerabilities associated with generative AI applications.

Moreover, the emphasis on comprehensive testing and validation processes, as well as the incorporation of ethical AI governance principles, makes sure that you can not only mitigate potential risks, but also promote transparency, explainability, and interpretability of the LLM components, while addressing potential biases and ensuring alignment with organizational values and societal norms.

By following the guidance outlined in this post and depicted in the architectural blueprint, you can proactively identify and mitigate potential risks, enhance the security posture of your generative AI-based chatbot solutions, protect sensitive data and intellectual property, maintain regulatory compliance, and responsibly deploy LLMs and generative AI technologies in your enterprise environments.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Magesh Dhanasekaran
Magesh Dhanasekaran

Magesh is a Security Architect at AWS. He has a proven track record providing information security consulting services to financial institutions and government agencies in Australia and the United States. Magesh uses his experience in cloud security architecture, digital transformation, and secure application development practices to provide security advice on AWS products and services. He currently holds multiple industry certifications.
Amy Tipple
Amy Tipple

Amy is a Senior Data Scientist with the Professional Services Data and Machine Learning team and has been with AWS for approximately four years. Amy has worked on several engagements involving generative AI and is an advocate for making sure that generative AI-related security is accessible and understandable for AWS users.

SaaS authentication: Identity management with Amazon Cognito user pools

Post Syndicated from Shubhankar Sumar original https://aws.amazon.com/blogs/security/saas-authentication-identity-management-with-amazon-cognito-user-pools/

Amazon Cognito is a customer identity and access management (CIAM) service that can scale to millions of users. Although the Cognito documentation details which multi-tenancy models are available, determining when to use each model can sometimes be challenging. In this blog post, we’ll provide guidance on when to use each model and review their pros and cons to help inform your decision.

Cognito overview

Amazon Cognito handles user identity management and access control for web and mobile apps. With Cognito user pools, you can add sign-up, sign-in, and access control to your apps. A Cognito user pool is a user directory within a specific AWS Region where users can authenticate and register for applications. In addition, a Cognito user pool is an OpenID Connect (OIDC) identity provider (IdP). App users can either sign in directly through a user pool or federate through a third-party IdP. Cognito issues a user pool token after successful authentication, which can be used to securely access backend APIs and resources.

Cognito issues three types of tokens:

  • ID token – Contains user identity claims like name, email, and phone number. This token type authenticates users and enables authorization decisions in apps and API gateways.
  • Access token – Includes user claims, groups, and authorized scopes. This token type grants access to API operations based on the authenticated user and application permissions. It also enables fine-grained, user-based access control within the application or service.
  • Refresh token – Retrieves new ID and access tokens when these are expired. Access and ID tokens are short-lived, while the refresh token is long-lived. By default, refresh tokens expire 30 days after the user signs in, but this can be configured to a value between 60 minutes and 10 years.

You can find more information on using tokens and their contents in the Cognito documentation.

Multi-tenancy approaches

Software as a service (SaaS) architectures often use silo, pool, or bridge deployment models, which also apply to CIAM services like Cognito. The silo model isolates tenants in dedicated resources. The pool model shares resources between tenants. The bridge model connects siloed and pooled components. This post compares the Cognito silo and pool models for SaaS identity management.

It’s also possible to combine the silo and pool models by having multiple tiers of resources. For example, you could have a siloed tier for sensitive tenant data along with a pooled tier for shared functionality. This is similar to the silo model but with added routing complexity to connect the tiers. When you have multiple pools or silos, this is a similar approach to the pure silo model but with more components to manage.

More detail on these models are included in the AWS SaaS Lens.

We’ve detailed five possible patterns in the following sections and explored the scenarios where each of the patterns can be used, along with the advantages and disadvantages for each. The rest of the post delves deeper into the details of these different patterns, enabling you to make an informed decision that best aligns with your unique requirements and constraints.

Pattern 1: Representing SaaS identity with custom attributes

To implement multi-tenancy in a SaaS application, tenant context needs to be associated with user identity. This allows implementation of the multi-tenant policies and strategies that comprise our SaaS application. Cognito has user pool attributes, which are pieces of information to represent identity. There are standard attributes, such as name and email, that describe the user identity. Cognito also supports custom attributes that can be used to hold information about the user’s relationship to a tenant, such as tenantId.

By using custom attributes for multi-tenancy in Amazon Cognito, the tenant context for each user can be stored in their user profile.

To enable multi-tenancy, you can add a custom attribute like tenantId to the user profile. When a new user signs up, this tenantId attribute can be set to a value indicating which tenant the user belongs to. For example, users with tenantId “1234” belong to Tenant A, while users with tenantId “5678” belong to Tenant B.

The tenantId attribute value gets returned in the ID token after a successful user authentication. (This value can also be added to the access token through customization by using a pre-token generation Lambda trigger.) The application can then inspect this claim to determine which tenant the user belongs to. The tenantId attribute is typically managed at the SaaS platform level and is read-only to users and the application layer. (Note: SaaS providers need to configure the tenantId attribute to be read-only.)

In addition to storing a tenant ID, you can use custom attributes to model additional tenant context. For instance, attributes like tenantName, tenantTier, or tenantRegion could be defined and set appropriately for each user to provide relevant informational context for the application. However, make sure not to use custom attributes as a database—they are meant to represent identity, not store application data. Custom attributes should only contain information that is relevant for authorization decisions and JSON web token (JWT) compactness and should be relatively static because their values are stored in the Cognito directory. Updating frequently changing data requires modifying the directory, which can be cumbersome.

The custom attributes themselves need to be defined at the time of creating the Amazon Cognito user pool, and there is a maximum of 50 custom attributes that you can create. Once the pool is created, these custom attribute fields will be present on every user profile in that user pool. However, they won’t have values populated yet. The actual tenant attribute values get populated only when a new user is created in the user pool. This can be done in two ways:

  1. During user sign-up, a post confirmation AWS Lambda trigger can be used to set the appropriate tenant attribute values based on the user’s input.
  2. An admin user can provision a new user through the AdminCreateUser API operation and specify the tenant attribute values at that time.

After user creation, the custom tenant attribute values can still be updated by an administrator through the AdminUpdateUserAttributes API operation or by a user with the UpdateUserAttributes API operation, if needed. But the key point is that the custom attributes themselves must be predefined at user pool creation, while the values get set later during user creation and provisioning flows. Figure 1 shows how custom attributes are associated with an ID token and used subsequently in downstream applications.

Figure 1: Associating tenant context with custom attributes

Figure 1: Associating tenant context with custom attributes

As shown in Figure 1:

  • The custom tenant attribute values from the user profile are included in the Cognito ID token that is generated after a successful user authentication. These values can be used for access control for other AWS services, such as Amazon API Gateway.
  • You can configure Amazon API Gateway with a Lambda authorizer function that validates the ID token signature (the aws-jwt-verify library can be used for this purpose) and inspects the tenant ID claim in each request.
  • Based on the tenant ID value extracted from the ID token, the Lambda authorizer can determine which backend resources and services each authenticated user is authorized to access.

You can use this method to provide fine-grained access control, as described in this blog post, by using tenant claims as context in addition to the user claims embedded within the token. This pattern of embedding information about the user’s identity, along with details on their associated tenant, in a single token is what AWS refers to as SaaS identity.

The multi-tenancy approaches of using siloed user pools, shared pools, or custom attributes rely on embedding tenant context within the user identity. This is accomplished by having Cognito include claims with tenant information in the JWTs issued after authentication.

The JWT encodes user identity information like the username, email address, and so on. By adding custom claims that contain tenant identifiers or metadata, the tenant context gets tightly coupled to the user identity. The embedded tenant context in the JWT allows applications to implement access control and authorization based on the associated tenant for each user.

This combination of user identity information and tenant context in the issued JWT represents the SaaS identity—a unified identity spanning both user and tenant dimensions. The application uses this SaaS identity for implementing multi-tenant logic and policies.

Pattern 2: Shared user pool (pool model)

A single, shared Amazon Cognito user pool simplifies identity management for multi-tenant SaaS applications. With one consolidated pool, changes and configurations apply across tenants in one place, which can reduce overhead.

For example, you can define password complexity rules and other settings once at the user pool level, and then these settings are shared across tenants. Adding new tenants is streamlined by using the settings in the existing shared pool, without duplicating setup per tenant. This avoids deploying isolated pools when onboarding new tenants.

Additionally, the tokens issued from the shared pool are signed by the same issuer. There is no tenant-specific issuer in the tokens when using a shared pool. For SaaS apps with common identity needs, a shared multi-tenant pool minimizes friction for rapid onboarding despite that loss of per-tenant customization.

Advantages of the pool model:

  • This model uses a single shared user pool for tenants. This simplifies onboarding by setting user attributes rather than configuring multiple user pools.
  • Tenants authenticate using the same application client and user pool, which keeps the SaaS client configuration simple.

Disadvantages of the pool model:

  • Sharing one pool means that settings like password policies and MFA apply uniformly, without customization per tenant.
  • Some resource quotas are managed at a user pool level (for example, the number of application clients or customer attributes), so you need to consider quotas carefully when adopting this model.

Pattern 3: Group-based multi-tenancy (pool model)

Amazon Cognito user pools give an administrator the capability to add groups and associate users with groups. Doing so introduces specific attributes (cognito:groups and cognito:roles) that are managed and maintained by Cognito and available within the ID tokens. (Access tokens only have the cognito:groups attribute.) These groups can be used to enable multi-tenancy by creating a separate group for each tenant. Users can be assigned to the appropriate tenant group based on the value of a custom tenantId attribute. The application can then implement authorization logic to limit access to resources and data based on the user’s tenant group membership that is encoded in the tokens. This provides isolation and access control across tenants, making use of the native group constructs in Cognito rather than relying entirely on custom attributes.

The group information contained in the tokens can then be used by downstream services to make authorization decisions. Groups are often combined with custom attributes for more granular access control. For example, in the SaaS Factory Serverless SaaS – Reference Solution developed by the AWS SaaS Factory team, roles are specified by using Cognito groups, but tenant identity relies on a custom tenantId attribute. The tenant ID attribute provides isolation between tenants, while the groups define individual user roles and access privileges that apply within a tenant.

Figure 2 shows how groups are associated with the user and then the Lambda authorizer can determine which backend resources and services each authenticated user is authorized to access.

Figure 2: Group-based multi-tenancy

Figure 2: Group-based multi-tenancy

In this model, groups can provide role-based controls, while custom attributes like tenant ID provide the contextual information needed to enforce tenant isolation. The authorization decisions are then made by evaluating a user’s group memberships and attribute values in order to provide fine-grained access tailored to each tenant and user. So groups directly enable role-based checks, while custom attributes provide broader context for conditional access across tenants. Together they can provide the data that is needed to implement granular authorization in a multi-tenant application.

Advantages of group-based multi-tenancy:

  • This model uses a single shared user pool for tenants, so that onboarding requires setting user attributes rather than configuring multiple pools.
  • Tenants authenticate through the same application client and pool, keeping SaaS client configuration straightforward.

Disadvantages of group-based multi-tenancy:

  • Sharing one pool means that settings like password policies and MFA apply uniformly without per-tenant customization.
  • There is a limit of 10,000 groups per user pool.

Pattern 4: Dedicated user pool per tenant (silo model)

Another common approach for multi-tenant identity with Cognito is to provision a separate user pool for each tenant. A Cognito user pool is a user directory, so using distinct pools provides maximum isolation. However, this approach requires that you implement tenant routing logic in the application to determine which user pool a user should authenticate against, based on their tenant.

Tenant routing

With separate user pools per tenant (or application clients, as we’ll discuss later), the application needs logic to route each user to the appropriate pool (or client) for authentication. There are a few options that you can use for this approach:

  • Use a subdomain in the URL that maps to the tenant—for example, tenant1.myapp.com routes to Tenant 1’s user pool. This requires mapping subdomains to tenant pools.
  • Rely on unique email domains per tenant—for example, @tenant1.com goes to Tenant 1’s pool. This requires mapping email domains to pools.
  • Have the user select their tenant from a dropdown list. This requires the tenant choices to be configured.
  • Prompt the user to enter a tenant ID code that maps to pools. This requires mapping codes to pools.

No matter the approach you chose, the key requirements are the following:

  • A data point to identify the tenant (such as subdomain, email, selection, or code).
  • A mapping dataset that takes tenant identifying information from the user and looks up the corresponding user pool to route to for authentication.
  • Routing logic to redirect to the appropriate user pool.

For example, the AWS SaaS Factory Serverless Reference Architecture uses the approach shown in Figure 3.

Figure 3: Dedicated user pool per tenant

Figure 3: Dedicated user pool per tenant

The workflow is as follows:

  1. The user enters their tenant name during sign-in.
  2. The tenant name retrieves tenant-specific information like the user pool ID, application client ID, and API URLs.
  3. Tenant-specific information is passed to the SaaS app to initialize authentication to the correct user pool and app client, and this is used to initialize an authorization code flow.
  4. The app redirects to the Cognito hosted UI for authentication.
  5. User credentials are validated, and Cognito issues an OAuth code.
  6. The OAuth code is exchanged for a JWT token from Cognito.
  7. The JWT token is used to authenticate the user to access microservices.

Advantages of the one pool per tenant model:

  • Users exist in a single directory with no cross-tenant visibility. Tokens are issued and signed with keys that are unique to that pool.
  • Each pool can have customized security policies, like password rules or MFA requirements per tenant.
  • Pools can be hosted in different AWS Regions to meet data residency needs.

Potential disadvantages of the one pool per tenant model:

  • There are limits on the number of pools per account. (The default is 1,000 pools, and the maximum is 10,000.)
  • Additional automation is required to create multiple pools, especially with customized configurations.
  • Applications must implement tenant routing to direct authentication requests to the correct user pool.
  • Troubleshooting can be more difficult, because configuration of each pool is managed separately and tenant routing functionality is added.

In summary, separate user pools maximize tenant isolation but require more complex provisioning and routing. You might also need to consider limits on the pool count for large multi-tenant deployments.

Pattern 5: Application client per tenant (bridge model)

You can achieve some extra tenant isolation by using separate application clients per tenant in a single user pool, in addition to using groups and custom attributes. Cognito configurations from the application client, such as OAuth scopes, hosted UI customization, and security policies can be specific to each tenant. The application client also enables external IdP federation per tenant. However, user pool–level settings, such as password policy, remain shared.

Figure 4 shows how a single user pool can be configured with multiple application clients. Each of those application clients is assigned to a tenant. However, this approach requires that you implement tenant routing logic in the application to determine which application client a tenant should be mapped to (similar to the approach we discussed for the shared user pool). Once the user is authenticated, you can configure Amazon API Gateway with a Lambda authorizer function that validates the ID token signature. Subsequently, the Lambda authorizer can determine which backend resources and services each authenticated user is authorized to access.

Figure 4: Application client based multi-tenancy

Figure 4: Application client based multi-tenancy

For tenants that want to use their own IdP through SAML or OpenID Connect federation, you can create a dedicated application client that will redirect users to authenticate with the tenant’s federated IdP. This has some key benefits:

  • If a single external IdP is enabled on the application client, the hosted UI automatically redirects users without presenting Cognito sign-in screens. This provides a familiar sign-in experience for tenants and is frictionless if users have existing sessions with the tenant IdP.
  • Management of user activities like joining and leaving, passwords, and other tasks are entirely handled by the tenant in their own IdP. The SaaS provider doesn’t need to get involved in these processes.

Importantly, even with federation, Cognito still issues tokens after successful external authentication. So the SaaS provider gets consistent tokens from Cognito to validate during authorization, regardless of the IdP.

Attribute mapping

When federating with an external IdP, Amazon Cognito can dynamically map attributes to populate the tokens it issues. This allows attributes like groups, email addresses, and roles created in the IdP to be passed to Cognito during authentication and added to the tokens.

The mapping occurs upon every sign-in, overwriting the existing mapped attributes to stay in sync with the latest IdP values. Therefore, changes made in the external IdP related to mapped attributes are reflected in Cognito after signing in. If a mapped attribute is required in the Cognito user pool, like email for sign-in, it must have an equivalent in the IdP to map. The target attributes in Cognito must be configured as mutable, since immutable attributes cannot be overwritten after creation, even through mapping.

Important: For SaaS identity, tenant attributes should be defined in Cognito rather than mapped from an external IdP. This helps to prevent tenants from tampering with values and maintains isolation. However, user attributes like groups and roles can be mapped from the tenant’s IdP to manage permissions. This allows tenants to configure application roles by using their own IdP groups.

Advantages of the bridge model:

  • This model enables tenant-specific configuration like OAuth scopes, UI, and IdPs.
  • Tenant users access familiar workflows through external IdPs, and when using external IdPs, tenant user management is handled externally.
  • No custom claim mappings are needed, but can be used optionally.
  • Cognito still issues tokens for authorization.

Disadvantages of the bridge model:

  • Requires routing users to the correct app client per tenant.
  • There is a limit on the number of app clients per user pool.
  • Some user pool settings remain shared, such as password policy.
  • There is no dynamic group claim modification.

Conclusion

In this blog post, we explored various ways Amazon Cognito user pools can enable multi-tenant identity for SaaS solutions. A single shared user pool simplifies management but limits the option to customize user pool–level policies, while separate pools maximize isolation and configurability at the cost of complexity. If you use multiple application clients, you can balance tailored options like external IdPs and OAuth scopes with centralized policies in the user pool. Custom claim mappings provide flexibility but require additional logic.

These two approaches can also be combined. For example, you can have dedicated user pools for select high-tier tenants while others share a multi-tenant pool. The optimal choice depends on the specific tenant needs and on the customization that is required.

In this blog post, we have mainly focused on a static approach. You can also use a pre-token generation Lambda trigger to modify tokens by adding, changing, or removing claims dynamically. The trigger can also override the group membership in both the identity and access tokens. Other claim changes only apply to the ID token. A common use case for this trigger is injecting tenant attributes into the token dynamically.

Evaluate the pros and cons of each approach against the requirements of the SaaS architecture and tenants. Often a hybrid model works best. Cognito constructs like user pools, IdPs, and triggers provide various levers that you can use to fine-tune authentication and authorization across tenants.

For further reading on these topics, see the Common Amazon Cognito scenarios topic in the Cognito Developer Guide and the related blog post How to Use Cognito Pre-Token Generation trigger to Customize Claims in ID Tokens.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on Amazon Cognito re:Post

Shubhankar Sumar

Shubhankar Sumar
Shubhankar is a Senior Solutions Architect at AWS, working with enterprise software and SaaS customers across the UK to help architect secure, scalable, efficient and cost-effective systems. He is an experienced software engineer having built many SaaS solutions. Shubhankar specializes in building multi-tenant platform on the cloud. He is also working closely with customers to bring in GenAI capabilities in their SaaS application.

Owen Hawkins

Owen Hawkins
With over 20 years of information security experience, Owen brings deep expertise to his role as a Principal Solutions Architect at AWS. He works closely with ISV customers, drawing on his extensive background in digital banking security. Owen specializes in SaaS and multi-tenant architecture. He is passionate about enabling companies to securely embrace the cloud. Solving complex challenges excites Owen, who thrives on finding innovative ways to protect and run applications on AWS.

Federated access to Amazon Athena using AWS IAM Identity Center

Post Syndicated from Ajay Rawat original https://aws.amazon.com/blogs/security/federated-access-to-amazon-athena-using-aws-iam-identity-center/

Managing Amazon Athena through identity federation allows you to manage authentication and authorization procedures centrally. Athena is a serverless, interactive analytics service that provides a simplified and flexible way to analyze petabytes of data.

In this blog post, we show you how you can use the Athena JDBC driver (which includes a browser Security Assertion Markup Language (SAML) plugin) to connect to Athena from third-party SQL client tools, which helps you quickly implement identity federation capabilities and multi-factor authentication (MFA). This enables automation and enforcement of data access policies across your organization.

You can use AWS IAM Identity Center to federate access to users to AWS accounts. IAM Identity Center integrates with AWS Organizations to manage access to the AWS accounts under your organization. In this post, you will learn how to configure the Athena driver to use the AWS configuration profile credentials. This will allow you to resolve credentials from IAM Identity Center and use the MFA capability of your federation identity provider (IdP).In this post, you will learn how you can integrate the Athena browser-based SAML plugin to add single sign-on (SSO) and MFA capability with your federation identity provider (IdP).

Prerequisites

To implement this solution, you must have the follow prerequisites:

Note: Lake Formation only supports a single role in the SAML assertion. Multiple roles cannot be used.

Solution overview

Figure 1: Solution architecture

Figure 1: Solution architecture

To implement the solution, complete the steps below as shown in Figure 1:

  1. An IAM Identity Center delegated administrator creates two custom permission sets within Identity Center.
  2. An IAM Identity Center delegated administrator assign permission sets to AWS accounts and users and groups. The user has permissions to single sign-on roles that are provisioned in the data lake account. The role created by Identity Center has a name that begins with AWSReservedSSO.
  3. A Lake Formation administrator grants single sign-on roles permissions to the corresponding database and tables.

The solution workflow consists of the following high-level steps as shown in Figure 1:

  1. The user configures IAM Identity Center authentication using the AWS CLI.
  2. The AWS CLI redirects the user to the AWS access portal URL. The user enters workforce identity credentials (username and password). Then chooses Sign in.
  3. The AWS access portal verifies the user’s identity. IAM Identity Center redirects the request to the Identity Center authentication service to validate the user’s credentials.
  4. If MFA is enabled for the user, then they are prompted to authenticate their MFA device.
  5. The user enters or approves the MFA details. The user’s MFA is successfully completed.
  6. The user selects the AWS account to use from the displayed list. Then select the IAM single sign-on role to use from the displayed list.
  7. The user tests the SQL client connection and then uses the client to run a SQL query.
  8. The client makes a call to Athena to retrieve the table and associated metadata from the Data Catalog.
  9. Athena requests access to the data from Lake Formation. Lake Formation invokes the AWS Security Token Service (AWS STS).
  10. Lake Formation invokes AWS STS.
    1. Lake Formation obtains temporary AWS credentials with the permissions of the defined IAM role (sensitive or non-sensitive) associated with the data lake location.
    2. Lake Formation returns temporary credentials to Athena.
  11. Athena uses the temporary credentials to retrieve data objects from Amazon S3.
  12. The Athena engine successfully runs the query and returns the results to the client.

Solution walkthrough

The walkthrough includes five sections that will guide you through the process of creating permission sets, assigning permission sets to AWS Accounts, managing permission sets access using Lake Formation, and setting up third-party SQL clients such as SQL Workbench to connect to your data store and query your data through Athena.

Step 1: Federate onboarding

Federating onboarding is done within the IAM Identity Center account. As part of federated onboarding, you need to create IAM Identity Center users and groups. Groups are a collection of people who have the same security rights and permissions. You can create groups and add users to the groups. Create one IAM Identity Center group for sensitive data and another for non-sensitive data to provide distinct access to different classes of data sets. You can assign access to IAM Identity Center permission sets to a user or group.

To federate onboarding:

  1. Open the AWS Management Console using the IAM Identity Center account and go to IAM Identity Center.
  2. Choose Groups.
  3. Choose Create group.
  4. Enter a Group name and Description .
  5. Choose Create group.

To add a user as a member of a group:

  1. Open the IAM Identity Center console.
  2. Choose Groups.
  3. Select the group name that you want to update.
  4. On the group details page, under Users in this group, choose Add users to group.
  5. On the Add users to group page, under Other users, locate the users you want to add as members and select the check box next to each of them.
  6. Choose Add users to group.

Figure 2: Assigning users to a group

Figure 2: Assigning users to a group

Step 2: Create permission sets

For this step, create two permission sets (sensitive-iam-role and non-sensitive-iam-role). These permission sets can be assigned to users or groups in IAM Identity Center, granting them specific access to AWS account resources.

To create custom permission sets:

  1. In the IAM Identity Center administrator account, under Multi-Account permissions, choose Permission sets.
  2. Choose Create permission set.
  3. On the Select permission set type page, under Permission set type, choose Custom permission set.

    Figure 3: Selecting a permission set

    Figure 3: Selecting a permission set

  4. Choose Next.
  5. On the Specify policies and permission boundary page, expand Inline policy to add custom JSON-formatted policy text.
  6. Insert the following policy and update the S3 bucket name (<s3-bucket-name>), AWS Region (<region>) account ID (<account-id>), CloudWatch alarm name (<AlarmName>), Athena workgroup name (sensitive or non-sensitive) (<WorkGroupName>), KMS key alias name (<KMS-key-alias-name>), and organization ID (<aws-PrincipalOrgID>).
    {
      "Statement": [
        {
          "Action": [
            "lakeformation:SearchTablesByLFTags",
            "lakeformation:SearchDatabasesByLFTags",
            "lakeformation:ListLFTags",
            "lakeformation:GetResourceLFTags",
            "lakeformation:GetLFTag",
            "lakeformation:GetDataAccess",
            "glue:SearchTables",
            "glue:GetTables",
            "glue:GetTable",
            "glue:GetPartitions",
            "glue:GetDatabases",
            "glue:GetDatabase"
          ],
          "Effect": "Allow",
          "Resource": "*",
          "Sid": "LakeformationAccess"
        },
        {
          "Action": [
            "s3:PutObject",
            "s3:ListMultipartUploadParts",
            "s3:ListBucketMultipartUploads",
            "s3:ListBucket",
            "s3:GetObject",
            "s3:GetBucketLocation",
            "s3:CreateBucket",
            "s3:AbortMultipartUpload"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::<s3-bucket-name>/*",
            "arn:aws:s3:::<s3-bucket-name>"
          ],
          "Sid": "S3Access"
        },
        {
          "Action": "s3:ListAllMyBuckets",
          "Effect": "Allow",
          "Resource": "*",
          "Sid": "AthenaS3ListAllBucket"
        },
        {
          "Action": [
            "cloudwatch:PutMetricAlarm",
            "cloudwatch:DescribeAlarms"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:cloudwatch:<region>:<account-id>:alarm:<AlarmName>"
          ],
          "Sid": "CloudWatchLogs"
        },
        {
          "Action": [
            "athena:UpdatePreparedStatement",
            "athena:StopQueryExecution",
            "athena:StartQueryExecution",
            "athena:ListWorkGroups",
            "athena:ListTableMetadata",
            "athena:ListQueryExecutions",
            "athena:ListPreparedStatements",
            "athena:ListNamedQueries",
            "athena:ListEngineVersions",
            "athena:ListDatabases",
            "athena:ListDataCatalogs",
            "athena:GetWorkGroup",
            "athena:GetTableMetadata",
            "athena:GetQueryResultsStream",
            "athena:GetQueryResults",
            "athena:GetQueryExecution",
            "athena:GetPreparedStatement",
            "athena:GetNamedQuery",
            "athena:GetDatabase",
            "athena:GetDataCatalog",
            "athena:DeletePreparedStatement",
            "athena:DeleteNamedQuery",
            "athena:CreatePreparedStatement",
            "athena:CreateNamedQuery",
            "athena:BatchGetQueryExecution",
            "athena:BatchGetNamedQuery"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:athena:<region>:<account-id>:workgroup/<WorkGroupName>",
            "arn:aws:athena:{Region}:{Account}:datacatalog/{DataCatalogName}"
          ],
          "Sid": "AthenaAllow"
        },
        {
          "Action": [
            "kms:GenerateDataKey",
            "kms:DescribeKey",
            "kms:Decrypt"
          ],
          "Condition": {
            "ForAnyValue:StringLike": {
              "kms:ResourceAliases": "<KMS-key-alias-name>"
            }
          },
          "Effect": "Allow",
          "Resource": "*",
          "Sid": "kms"
        },
        {
          "Action": "*",
          "Condition": {
            "StringNotEquals": {
              "aws:PrincipalOrgID": "<aws-PrincipalOrgID>"
            }
          },
          "Effect": "Deny",
          "Resource": "*",
          "Sid": "denyRule"
        }
      ],
      "Version": "2012-10-17"
    }

  7. Update the custom policy to add the corresponding Athena workgroup ARN for the sensitive and non-sensitive IAM roles.

    Note: See the documentation for information about AWS global condition context keys.

  8. Choose Next.
  9. On the Specify permission set details page, enter a name to identify this permission set in IAM Identity Center. The name that you specify for this permission set appears in the AWS access portal as an available role. Users sign in to the AWS access portal, choose an AWS account, and then choose the role.
  10. Choose Next.
  11. On the Review and create page, review the selections that you made, and then choose Create.

Step 3: Assign permission sets to AWS accounts

You can add and remove permissions sets for an IAM user or group by attaching and detaching permission sets. Permission sets define what actions an identity can perform on which AWS resources.

To assign permission sets to AWS accounts:

  1. In the IAM Identity Center administrator account, under Multi-account permissions, choose AWS accounts.
  2. On the AWS accounts page, select one or more AWS accounts that you want to assign single sign-on access to.
  3. Choose Assign users or groups.

    Figure 4: Selecting users and groups

    Figure 4: Selecting users and groups

  4. On the Assign users and groups to “<AWS account name>”, for Selected users and groups, choose the users that you want to create the permission set for. Choose Next.
  5. Select permission sets: On the Assign permission sets to “AWS-account-name” page, select one or more permission sets.
  6. On the Review and submit assignments to AWS-account-name page, for Review and submit, choose Submit.

Step 4. Grant permissions to IAM (single sign-on) roles

A data lake administrator has the broad ability to grant a principal (including themselves) permissions on Data Catalog resources. This includes the ability to manage access controls and permissions for the data lake. When you grant Lake Formation permissions on a specific Data Catalog table, you can also include data filtering specifications. This allows you to further restrict access to certain data within the table, limiting what users can see in their query results based on those filtering rules.

To grant permissions to IAM roles:

In the Lake Formation console, under Permissions in the navigation pane, select Data Lake permissions, and then choose Grant.

To grant Database permissions to IAM roles:

  1. Under Principals, select the IAM role name (for example, Sensitive-IAM-Role).
  2. Under Named Data Catalog resources, go to Databases and select a database (for example, demo).

    Figure 5: Select an IAM role and database

    Figure 5: Select an IAM role and database

  3. Under Database permissions, select Describe and then choose Grant.

    Figure 6: Grant database permissions to an IAM role

    Figure 6: Grant database permissions to an IAM role

To grant tables permissions to IAM roles:

  1. Repeat steps 1 and 2 of the preceding procedure.
  2. Under Tables – optional, select a table name (for example, demo2).

    Figure 7: Select tables within a database to grant access

    Figure 7: Select tables within a database to grant access

  3. Select the desired Table Permissions (for example, select and describe), and then choose Grant.

    Figure 8: Grant access to tables within the database

    Figure 8: Grant access to tables within the database

  4. Repeat steps 1 through 4 to grant access for the respective database and tables for the non-sensitive IAM role.

Step 5: Client-side setup using JDBC

You can use a JDBC connection to connect Athena and SQL client applications (for example, PyCharm or SQL Workbench) to enable analytics and reporting on the data that Athena returns from Amazon S3 databases. To use the Athena JDBC driver, you must specify the driver class from the JAR file. Additionally, you must pass in some parameters to change the authentication mechanism so the athena-sts-auth libraries are used:

  • S3 output location – Where in S3 the Athena service can write its output. For example, s3://path/to/query/bucket/.
  • The IAM Identity Center administrator can configure the session duration for the AWS access portal. The session duration can be set from a minimum of 15 minutes to a maximum of 90 days.

To set up PyCharm

  1. Install Athena JDBC 3.x driver from Athena JDBC 3.x driver.
    1. In the left navigation pane, select JDBC 3.x and then Getting started. Select Uber jar to download a .jar file, which contains the driver and its dependencies.

      Figure 9: Download Athena JDBC jar

      Figure 9: Download Athena JDBC jar

  2. Open PyCharm and create a new project.
    1. Enter a Name for your project
    2. Select the desired project Location
    3. Choose Create

    Figure 10: Create a new project in PyCharm

    Figure 10: Create a new project in PyCharm

  3. Configure Data Source and drivers. Select Data Source, and then choose the plus sign or New to configure new data sources and drivers.

    Figure 11: Add database source properties

    Figure 11: Add database source properties

  4. Configure the Athena driver by selecting the Drivers tab, and then choose the plus sign to add a new driver.

    Figure 12: Add database drivers

    Figure 12: Add database drivers

  5. Under Driver Files, upload the custom JAR file that you downloaded in the Step 1. Select the Athena class dropdown. Enter the driver’s name (for example Athena JDBC Driver). Then choose Apply.

    Figure 13: Add database driver files

    Figure 13: Add database driver files

  6. Configure a new data source. Choose the plus sign and select your driver’s name from the driver dropdown.
  7. Enter the data source name (for example, Athena Demo). For the authentication method, select User & Password. Then choose Apply.

    Figure 14: Create a project data source profile

    Figure 14: Create a project data source profile

  8. Select the SSH/SSL tab and select Use SSL. Verify that the Use truststore options for IDE, JAVA, and system are all selected. Then choose Apply.

    Figure 15: Enable data source profile SSL

    Figure 15: Enable data source profile SSL

  9. Select the Options tab and then select Single Session Mode. Then choose Apply.

    Figure 16: Configure single session mode in PyCharm

    Figure 16: Configure single session mode in PyCharm

  10. Select the General tab and enter the JDBC and single sign-on URL. The following is a sample JDBC URL based on the SAML application:
    jdbc:athena://;CredentialsProvider= ProfileCredentials; ProfileName=<name-of-the-profile>;WorkGroup=<name-of-the-WorkGroup>; 

    1. Choose Apply.
    2. Choose Test Connection. If the profile has expired, refresh the single sign-on session by running aws sso login --profile <profile-name> with the corresponding profile.

    Figure 17: Test the data source connection

    Figure 17: Test the data source connection

  11. After the connection is successful, select the Schemas tab and select All databases and All schemas.

    Figure 18: Select data source databases and schemas

    Figure 18: Select data source databases and schemas

  12. Run a sample test query: SELECT <table-names> FROM <database-name> limit 10;
  13. Verify that the credentials and permissions are working as expected.

To set up SQL Workbench

  1. Open SQL Workbench.
  2. Configure an Athena driver by selecting File and then Manage Drivers.
  3. Enter the Athena JDBC Driver as the name and set the library to browse the path for the location where you downloaded the driver. Enter amazonaws.athena.jdbc.AthenaDriver as the Classname.
  4. Enter the following URL, replacing <name-of-the-WorkGroup> with your workgroup name.
    jdbc:athena://;CredentialsProvider=ProfileCredentials;ProfileName=<name-of-the-profile>;WorkGroup=<name-of-the-WorkGroup>;

  5. Choose OK.
  6. Run a test query, replacing <table-names> and <database-name> with your table and database names:
    SELECT <table-names> FROM <database-name> limit 10;

  7. Verify that the credentials and permissions are working as expected.

Conclusion

In this post, we covered how to use JDBC drivers to connect to Athena from third-party SQL client tools. You were able to set this up without creating IAM users or any type of long-lived credentials that would need to be stored on your developers’ workstations. You learned how to configure IAM Identity Center users and groups, create permission sets, and assign permission sets to AWS Accounts. You also learned how to grant permissions to single sign-on roles using Lake Formation to create distinct access to different classes of data sets and connect to Athena through an SQL client tool (such as PyCharm). This setup can also work with other supported identity sources such as IAM Identity Centerself-managed or on-premises Active Directory, or an external IdP.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Ajay Rawat
Ajay Rawat

Ajay is a Senior Security Consultant, focusing on AWS Identity and Access Management (IAM), data protection, incident response, and operationalizing AWS security services to increase security effectiveness and reduce risk. Ajay is a technology enthusiast and enjoys working with customers to solve their technical challenges and to improve their security posture in the cloud.
Mihir Borkar
Mihir Borkar

Mihir is an AWS Data Architect who excels at simplifying customer challenges with innovative cloud data solutions. Specializing in AWS Lake Formation and AWS Glue, he designs scalable data lakes and analytics platforms, demonstrating expertise in crafting efficient solutions within the AWS Cloud.

Create a customizable cross-company log lake for compliance, Part I: Business Background

Post Syndicated from Colin Carson original https://aws.amazon.com/blogs/big-data/create-a-customizable-cross-company-log-lake-for-compliance-part-i-business-background/

As described in a previous postAWS Session Manager, a capability of AWS Systems Manager, can be used to manage access to Amazon Elastic Compute Cloud (Amazon EC2) instances by administrators who need elevated permissions for setup, troubleshooting, or emergency changes. While working for a large global organization with thousands of accounts, we were asked to answer a specific business question: “What did employees with privileged access do in Session Manager?”

This question had an initial answer: use logging and auditing capabilities of Session Manager and integration with other AWS services, including recording connections (StartSession API calls) with AWS CloudTrail, and recording commands (keystrokes) by streaming session data to Amazon CloudWatch Logs.

This was helpful, but only the beginning. We had more requirements and questions:

  • After session activity is logged to CloudWatch Logs, then what?
  • How can we provide useful data structures that minimize work to read out, delivering faster performance, using more data, with more convenience?
  • How do we support a variety of usage patterns, such as ongoing system-to-system bulk transfer, or an ad-hoc query by a human for a single session?
  • How should we share and implement governance?
  • Thinking bigger, what about the same question for a different service or across more than one use case? How do we add what other API activity happened before or after a connection—in other words, context?

We needed more comprehensive functionality, more customization, and more control than a single service or feature could offer. Our journey began where previous customer stories about using Session Manager for privileged access (similar to our situation), least privilege, and guardrails ended. We had to create something new that combined existing approaches and ideas:

  • Low-level primitives such as Amazon Simple Storage Service (Amazon S3).
  • Latest features and approaches of AWS, such as vertical and horizontal scaling in AWS Glue.
  • Our experience working with legal, audit, and compliance in large enterprise environments.
  • Customer feedback.

In this post, we introduce Log Lake, a do-it-yourself data lake based on logs from CloudWatch and AWS CloudTrail. We share our story in three parts:

  • Part 1: Business background – We share why we created Log Lake and AWS alternatives that might be faster or easier for you.
  • Part 2: Build – We describe the architecture and how to set it up using AWS CloudFormation templates.
  • Part 3: Add – We show you how to add invocation logs, model input, and model output from Amazon Bedrock to Log Lake.

Do you really want to do it yourself?

Before you build your own log lake, consider the latest, highest-level options already available in AWS–they can save you a lot of work. Whenever possible, choose AWS services and approaches that abstract away undifferentiated heavy lifting to AWS so you can spend time on adding new business value instead of managing overhead. Know the use cases services were designed for, so you have a sense of what they already can do today and where they’re going tomorrow.

If that doesn’t work, and you don’t see an option that delivers the customer experience you want, then you can mix and match primitives in AWS for more flexibility and freedom, as we did for Log Lake.

Session Manager activity logging

As we mentioned in our introduction, you can save logging data to AmazonS3add a table on top, and query that table using Amazon Athena—this is what we recommend you consider first because it’s straightforward.

This would result in files with the sessionid in the name. If you want, you can process these files into a calendarday, sessionid, sessiondata format using an S3 event notification that invokes a function (and make sure to save it to a different bucket, in a different table, to avoid causing recursive loops). The function could derive the calendarday and sessionid from the S3 key metadata, and sessiondata would be the entire file contents.

Alternatively, you can sign to one log group in CloudWatch logs, have an Amazon Data Firehose subscription filter move that to S3 (this file would have additional metadata in the JSON content and more customization potential from filters). This was used in our situation, but it wasn’t enough by itself.

AWS CloudTrail Lake

CloudTrail Lake is for running queries on events over years of history and with near real-time latency and offers a deeper and more customizable view of events than CloudTrail Event history. CloudTrail Lake enables you to federate an event data store, which lets you view the metadata in the AWS Glue catalog and run Athena queries. For needs involving one organization and ongoing ingesting from a trail (or point-in-time import from Amazon S3, or both), you can consider CloudTrail Lake.

We considered CloudTrail Lake, as either a managed lake option or source for CloudTrail only, but ended up creating our own AWS Glue job instead. This was because of a combination of reasons, including full control over schema and jobs, ability to ingest data from an S3 bucket of our choosing as an ongoing source, fine-grained filtering on account, AWS Region, and eventName (eventName filtering wasn’t supported for management events ), and cost.

The cost of CloudTrail lake based on uncompressed data ingested (data size can be 10 times larger than in Amazon S3) was a factor for our use case. In one test, we found CloudTrail Lake to be 38 times faster to process the same workload as Log Lake, but Log Lake was 10–100 times less costly depending on filters, timing, and account activity. Our test workload was 15.9 GB file size in S3, 199 million events, and 400 thousand files, spread across over 150 accounts and 3 Regions. Filters Log Lake applied were eventname='StartSession', 'AssumeRole', 'AssumeRoleWithSAML', and five arbitrary allow listed accounts. These tests might be different from your use case, so you should do your own testing, gather your own data, and decide for yourself.

Other services

The products mentioned previously are the most relevant to the outcomes we were trying to accomplish, but you should consider security, identity, and compliance products on AWS, too. These products and features can be used either as an alternative to Log Lake or to add functionality.

As an example, Amazon Bedrock can add functionality in three ways:

  • To skip the search and query Log Lake for you
  • To summarize across logs
  • As a source for logs (similar to Session Manager as a source for CloudWatch logs)

Querying means you can have an AI agent query your AWS Glue catalog (such as the Log Lake catalog) for data-based results. Summarizing means you can use generative artificial intelligence (AI) to summarize your text logs from a knowledge base as part of retrieval augmented generation (RAG), to ask questions like “How many log files are exactly the same? Who changed IAM roles last night?” Considerations and limitations apply.

Adding Amazon Bedrock as a source means using invocation logging to collect requests and responses.

Because we wanted to store very large amounts of data frugally (compressed and columnar format, not text) and produce non-generative (data-based) results that can be used for legal compliance and security, we didn’t use Amazon Bedrock in Log Lake—but we will revisit this topic in Part 3 when we detail how to use the approach we used for Session Manager for Amazon Bedrock.

Business background

When we began talking with our business partners, sponsors, and other stakeholders, important questions, problems, opportunities, and requirements emerged.

Why we needed to do this

Legal, security, identity, and compliance authorities of the large enterprise we were working for had created a customer-specific control. To comply with the control objective, use of elevated privileges required a manager to manually review all available data (including any session manager activity) to confirm or deny if use of elevated privileges was justified. This was a compliance use case that, when solved, could be applied to more use cases such as auditing and reporting.

Note on terms:

  • Here, the customer in customer-specific control means a control that is solely the responsibility of a customer, not AWS, as described in the AWS Shared Responsibility Model.
  • In this article, we define auditing broadly as testing information technology (IT) controls to mitigate risk, by anyone, at any cadence (ongoing as part of day-to-day operations, or one time only). We don’t refer to auditing that is financial, only conducted by an independent third-party, or only at certain times. We use self-review and auditing interchangeably.
  • We also define reporting broadly as presenting data for a specific purpose in a specific format to evaluate business performance and facilitate data-driven decisions—such as answering “how many employees had sessions last week?”

The use case

Our first and most important use case was a manager who needed to review activity, such as from an after-hours on-call page the previous night. If the manager needed to have additional discussions with their employee or needed additional time to consider activity, they had up to a week (7 calendar days) before they needed to confirm or deny elevated privileges were needed, based on their team’s procedures. A manager needed to review an entire set of events that all share the same session, regardless of known keywords or specific strings, as part of all available data in AWS. This was the workflow:

  1. Employee uses homegrown application and standardized workflow to access Amazon EC2 with elevated privileges using Session Manager.
  2. API activity in CloudTrail and continuous logging to CloudWatch logs.
  3. The problem space – Data somehow gets procured, processed, and provided (this would become Log Lake later).
  4. Another homegrown system (different from step 1) presents session activity to managers and applies access controls (a manager should only review activity for their own employees, and not be able to peruse data outside their team). This data might be only one StartSession API call and no session details, or might be thousands of lines from cat file
  5. The manager reviews all available activity, makes an informed decision, and confirms or denies if use was justified.

This was an ongoing day-to-day operation, with a narrow scope. First, this meant only data available in AWS; if something couldn’t be captured by AWS, it was out of scope. If something was possible, it should be made available. Second, this meant only certain workflows; using Session Manager with elevated privileges for a specific, documented standard operating procedure.

Avoiding review

The simplest solution would be to block sessions on Amazon EC2 with elevated privileges, and fully automate build and deployment. This was possible for some but not all workloads, because some workloads required initial setup, troubleshooting, or emergency changes of Marketplace AMIs.

Is accurate logging and auditing possible?

We won’t extensively detail ways to bypass controls here, but there are important limitations and considerations we had to consider, and we recommend you do too.

First, logging isn’t available for sessionType Port, which includes SSH. This could be mitigated by ensuring employees can only use a custom application layer to start sessions without SSH. Blocking direct SSH access to EC2 instances using security group policies is another option.

Second, there are many ways to intentionally or accidentally hide or obfuscate activity in a session, making review of a specific command difficult or impossible. This was acceptable for our use case for multiple reasons:

  • A manager would always know if a session started and needed review from CloudTrail (our source signal). We joined to CloudWatch to meet our all available data requirement.
  • Continuous streaming to CloudWatch logs would log activity as it happened. Additionally, streaming to CloudWatch Logs supported interactive shell access, and our use case only used interactive shell access (sessionType Standard_Stream). Streaming isn’t supported for sessionType, InteractiveCommands, or NonInteractiveCommands.
  • The most important workflow to review involved an engineered application with one standard operating procedure (less variety than all the ways Session Manager could be used).
  • Most importantly, the manager was responsible for reviewing the reports and expected to apply their own judgement and interpret what happened. For example, a manager review could result in a follow up conversation with the employee that could improve business processes. A manager might ask their employee, “Can you help me understand why you ran this command? Do we need to update our runbook or automate something in deployment?”

To protect data against tampering, changes, or deletion, AWS provides tools and features such as AWS Identity and Access Management (IAM) policies and permissions and Amazon S3 Object Lock.

Security and compliance are a shared responsibility between AWS and the customer, and customers need to decide what AWS services and features to use for their use case. We recommend customers consider a comprehensive approach that considers overall system design and includes multiple layers of security controls (defense in depth). For more information, see the Security pillar of the AWS Well-Architected Framework.

Avoiding automation

Manual review can be a painful process, but we couldn’t automate review for two reasons: Legal requirements and to add friction to the feedback loop felt by a manager whenever an employee used elevated privileges, to discourage using elevated privileges.

Works with existing

We had to work with existing architecture, spanning thousands of accounts and multiple AWS Organizations. This meant sourcing data from buckets as an edge and point of ingress. Specifically, CloudTrail data was managed and consolidated outside of CloudTrail, across organizations and trails, into S3 buckets. CloudWatch data was also consolidated to S3 buckets, from Session Manager to CloudWatch Logs, with Amazon Data Firehose subscription filters on CloudWatch Logs pointing to S3. To avoid negative side effects on existing business processes, our business partners didn’t want to change settings in CloudTrail, CloudWatch, and Firehose. This meant Log Lake needed features and flexibility that enabled changes without impacting other workstreams using the same sources.

Event filtering is not a data lake

Before we were asked to help, there were attempts to do event filtering. One attempt tried to monitor session activity using Amazon EventBridge. This was limited to AWS API operations recorded by CloudTrail such as StartSession and didn’t include the information from inside the session, which was in CloudWatch Logs. Another attempt tried event filtering CloudWatch in the form of a subscription filter. Also, an attempt was made using EventBridge Event Bus with EventBridge rules, and storage in Amazon DynamoDB. These attempts didn’t deliver the expected results because of a combination of factors:

Size

Couldn’t accept large session log payloads because of the EventBridge PutEvents limit of 256 KB entry size. Saving large entries to Amazon S3 and using the object URL in the PutEvents entry would avoid this limitation in EventBridge, but wouldn’t pass the most important information the manager needed to review (the event’s sessionData element). This meant managing files and physical dependencies, and losing the metastore benefit of working with data as logical sets and objects.

Storage

Event filtering was a way to process data, not storage or a source of truth. We asked, how do we restore data lost in flight or destroyed after landing? If components are deleted or undergoing maintenance, can we still procure, process, and provide data—at all three layers independently? Without storage, no.

Data quality

No source of truth meant data quality checks weren’t possible.  We couldn’t answer questions like: “Did the last job process more than 90 percent of events from CloudTrail in DynamoDB?” or“What percentage are we missing from source to target?”

Anti-patterns

DynamoDB as long-term storage wasn’t the most appropriate data store for large analytical workloads, low I/O, and highly complex many-to-many joins.

Reading out

Deliveries were fast, but work (and time and cost) was needed after delivery. In other words, queries had to do extra work to transform raw data into the needed format at time of read, which had a significant, cumulative effect on performance and cost. Imagine users running a select * from table without any filters on years of data and paying for storage and compute of those queries.

Cost of ownership

Filtering by event contents (sessionData from CloudWatch) required knowledge of session behavior, which was business logic. This meant changes to business logic required changes to event filtering. Imagine being asked to change CloudWatch filters or EventBridge rules based on a business process change, and trying to remember where to make the change, or troubleshoot why expected events weren’t being passed. This meant a higher cost of ownership and slower cycle times at best, and inability to meet SLA and scale at worst.

Accidental coupling

Creates accidental coupling between downstream consumers and low-level events. Consumers who directly integrate against events might get different schemas at different times for the same events, or events they don’t need. There’s no way to manage data at a higher level than event, at the level of sets (like all events for one sessionid), or at the object level (a table designed for dependencies). In other words, there was no metastore layer that separated the schema from the files, like in a data lake.

More sources (data to load in)

There were other, less important use cases that we wanted to expand to later: inventory management and security.

For inventory management, such as identifying EC2 instances running a Systems Manager agent that’s missing a patch, finding IAM users with inline policies, or finding Redshift clusters with nodes that aren’t RA3. This data would come from AWS Config unless it isn’t a supported resource type. We cut inventory management from scope because AWS Config data could be added to an AWS Glue catalog later, and queried from Athena using an approach like the one described in How to query your AWS resource configuration states using AWS Config and Amazon Athena.

For security, Splunk and OpenSearch were already in use for serviceability and operational analysis, sourcing files from Amazon S3. Log Lake is a complementary approach sourcing from the same data, which adds metadata and simplified data structures at the cost of latency. For more information about having different tools analyze the same data, see Solving big data problems on AWS.

More use cases (reasons to read out)

We knew from the first meeting that this was a bigger opportunity than just building a dataset for sessions from Systems Manager for manual manager review. Once we had procured logs from CloudTrail and CloudWatch, set up Glue jobs to process logs into convenient tables, and were able to join across these tables, we could change filters and configuration settings to answer questions about additional services and use cases, too. Similar to how we process data for Session Manager, we could expand the filters on Log Lake’s Glue jobs, and add data for Amazon Bedrock model invocation logging. For other use cases, we could use Log Lake as a source for automation (rules-based or ML), deep forensic investigations, or string-match searches (such as IP addresses or user names).

Additional technical considerations

*How did we define session? We would always know if a session started from StartSession event in CloudTrail API activity. Regarding when a session ended, we did not use TerminateSession because this was not always present and we considered this domain-specific logic. Log Lake enabled downstream customers to decide how to interpret the data. For example, our most important workflow had a Systems Manager timeout of 15 minutes, and our SLA was 90 minutes. This meant managers knew a session with a start time more than 2 hours prior to the current time was already ended.

*CloudWatch data required additional processing compared to CloudTrail, because CloudWatch logs from Firehose were saved in gzip format without gz suffix and had multiple JSON documents in the same line that needed to be processed to be on separate lines. Firehose can transform and convert records, such as invoking a Lambda function to transform, convert JSON to ORC, and decompress data, but our business partners didn’t want to change existing settings.

How to get the data (a deep dive)

To support the dataset needed for a manager to review, we needed to identify API-specific metadata (time, event source, and event name), and then join it to session data. CloudTrail was necessary because it was the most authoritative source for AWS API activity, specifically StartSession and AssumeRole and AssumeRoleWithSAML events, and contained context that didn’t exist in CloudWatch Logs (such as the error code AccessDenied) which could be useful for compliance and investigation. CloudWatch was necessary because it contained the keystrokes in a session, in the CloudWatch log’s sessionData element. We needed to obtain the AWS source of record from CloudTrail, but we recommend you check with your authorities to confirm you really need to join to CloudTrail. We mention this in case you hear this question “why not derive some sort of earliest eventTime from CloudWatch logs, and skip joining to CloudTrail entirely? That would cut size and complexity by half.”

To join CloudTrail (eventTime, eventname, errorCode, errorMessage, and so on) with CloudWatch (sessionData), we had to do the following:

  1. Get the higher level API data from CloudTrail (time, event source, and event name), as the authoritative source for auditing Session Manager. To get this, we needed to look inside all CloudTrail logs and get only the rows with eventname=‘StartSession’ and eventsource=‘ssm.amazonaws.com’ (events from Systems Manager)—our business partners described this as looking for a needle in a haystack, because this could be only one session event across millions or billions of files. After we obtained this metadata, we needed to extract the sessionid to know what session to join it to, and we chose to extract sessionid from responseelements. Alternatively, we could use useridentity.sessioncontext.sourceidentity if a principal provided it while assuming a role (requires sts:SetSourceIdentity in the role trust policy).

Sample of a single record’s responseelements.sessionid value: "sessionid":"theuser-thefederation-0b7c1cc185ccf51a9"

The actual sessionid was the final element of the logstream: 0b7c1cc185ccf51a9.

  1. Next we needed to get all logs for a single session from CloudWatch. Similarly to CloudTrail, we needed to look inside all CloudWatch logs landing in Amazon S3 from Firehose to identify only the needles that contained "logGroup":"/aws/ssm/sessionlogs". Then, we could get sessionid from logstream or sessionId, and get session activity from the message.sessionData.

Sample of a single record’s logStream element: "sessionId": "theuser-thefederation-0b7c1cc185ccf51a9"

Note: Looking inside the log isn’t always necessary. We did it because we had to work with existing logs Firehose put to Amazon S3, which didn’t have the logstream (and sessionid) in the file name. For example, a file from Firehose might have a name like

cloudwatch-logs-otherlogs-3-2024-03-03-22-22-55-55239a3d-622e-40c0-9615-ad4f5d4381fa

If we were able to use the ability of Session Manager to send to S3 directly, the file name in S3 is the loggroup (theuser-thefederation-0b7c1cc185ccf51a9.dms)and could be used to derive sessionid without looking inside the file.

  1. Downstream of Log Lake, consumers could join on sessionid which was derived in the previous step.

What’s different about Log Lake

If you remember one thing about Log Lake, remember this: Log Lake is a data lake for compliance-related use cases, uses CloudTrail and CloudWatch as data sources, has separate tables for writing (original raw) and reading (read-optimized or readready), and gives you control over all components so you can customize it for yourself.

Here are some of the signature qualities of Log Lake:

Legal, identity, or compliance use cases

This includes deep dive forensic investigation, meaning use cases that are large volume, historical, and analytical. Because Log Lake uses Amazon S3, it can meet regulatory requirements that require write-once-read-many (WORM) storage.

AWS Well-Architected Framework

Log Lake applies real-world, time-tested design principles from the AWS Well-Architected Framework. This includes, but is not limited to:

Operational Excellence also meant knowing service quotas, performing workload testing, and defining and documenting runbook processes. If we hadn’t tried to break something to see where the limit is, then we considered it untested and inappropriate for production use. To test, we would determine the highest single day volume we’d seen in the past year, and then run that same volume in an hour to see if (and how) it would break.

High-Performance, Portable Partition Adding (AddAPart)

Log Lake adds partitions to tables using Lambda functions with SQS, a pattern we call AddAPart. This uses Amazon Simple Query Service (SQS) to decouple triggers (files landing in Amazon S3) from actions (associating that file with metastore partition). Think of this as having four F’s:

This means no AWS Glue crawlers, no alter table or msck repair table to add partitions in Athena, and can be reused across sources and buckets. The management of partitions in Log Lake makes using partition-related features available in AWS Glue, including AWS Glue partition indexes and workload partitioning and bounded execution.

File name filtering uses the same central controls for lower cost of ownership, faster changes, troubleshooting from one location, and emergency levers—this means that if you want to avoid log recursion happening from a specific account, or want to exclude a Region because of regulatory compliance, you can do it in one place, managed by your change control process, before you pay for processing in downstream jobs.

If you want to tell a team, “onboard your data source to our log lake, here are the steps you can use to self-serve,” you can use AddAPart to do that. We describe this in Part 2.

Readready Tables

In Log Lake, data structures offer differentiated value to users, and original raw data isn’t directly exposed to downstream users by default. For each source, Log Lake has a corresponding read-optimized readready table.

Instead of this:

from_cloudtrail_raw

from_cloudwatch_raw

Log Lake exposes only these to users:

from_cloudtrail_readready

from_cloudwatch_readready

In Part 2, we describe these tables in detail. Here are our answers to frequently asked questions about readready tables:

Q: Doesn’t this have an up-front cost to process raw into readready? Why not pass the work (and cost) to downstream users?

A: Yes, and for us the cost of processing partitions of raw into readready happened once and was fixed, and was offset by the variable costs of querying, which was from many company-wide callers (systemic and human), with high frequency, and large volume.

Q: How much better are readready tables in terms of performance, cost, and convenience? How do you achieve these gains? How do you measure “convenience”?

A: In most tests, readready tables are 5–10 times faster to query and more than 2 times smaller in Amazon S3. Log Lake applies more than one technique: omitting columns, partition design, AWS Glue partition indexes, data types (readready tables don’t allow any nested complex data types within a column, such as struct<struct>), columnar storage (ORC), and compression (ZLIB). We measure convenience as the amount of operations required to join on a sessionid; using Log Lake’s readready tables this is 0 (zero).

Q: Do raw and readready use the same files or buckets?

A: No, files and buckets are not shared. This decouples writes from reads, improves both write and read performance, and adds resiliency.

This question is important when designing for large sizes and scaling, because a single job or downstream read alone can span millions of files in Amazon S3. S3 scaling doesn’t happen immediately, so queries against raw or original data involving many tiny JSON files can cause S3 503 errors when it exceeds 5,500 GET/HEAD per second. More than one bucket helps avoid resource saturation. There is another option that we didn’t have when we created Log Lake: S3 Express One Zone. For reliability, we still recommend not putting all your files in one bucket. Also, don’t forget to filter your data.

Customization and control

You can customize and control all components (columns or schema, data types, compression, job logic, job schedule, and so on) because Log Lake is built using AWS primitives—such as Amazon SQS and Amazon S3—for the most comprehensive combination of features with the most freedom to customize. If you want to change something, you can.

From mono to many

Rather than one large, monolithic lake that is tightly coupled to other systems, Log Lake is just one node in a larger network of distributed data products across different data domains—this concept is data mesh. Just like the AWS APIs it is built on, Log Lake abstracts away heavy lifting and enables users to move faster, more efficiently, and not wait for centralized teams to make changes. Log Lake does not try to cover all use cases—instead, Log Lake’s data can be accessed and consumed by domain-specific teams, empowering business experts to self-serve.

When you need more flexibility and freedom

As builders, sometimes you want to dissect a customer experience, find problems, and figure out ways to make it better. That means going a layer down to mix and match primitives together to get more comprehensive features and more customization, flexibility, and freedom.

We built Log Lake for our long-term needs, but it would have been easier in the short-term to save Session Manager logs to Amazon S3 and query them with Athena. If you have considered what already exists in AWS, and you’re sure you need more comprehensive abilities or customization, read on to Part 2: Build, which explains Log Lake’s architecture and how you can set it up.

If you have feedback and questions, let us know in the comments section.

References


About the authors

Colin Carson is a Data Engineer at AWS ProServe. He has designed and built data infrastructure for multiple teams at Amazon, including Internal Audit, Risk & Compliance, HR Hiring Science, and Security.

Sean O’Sullivan is a Cloud Infrastructure Architect at AWS ProServe. He has over 8 years industry experience working with customers to drive digital transformation projects, helping architect, automate, and engineer solutions in AWS.

Deliver Amazon CloudWatch logs to Amazon OpenSearch Serverless

Post Syndicated from Balaji Mohan original https://aws.amazon.com/blogs/big-data/deliver-amazon-cloudwatch-logs-to-amazon-opensearch-serverless/

Amazon CloudWatch Logs collect, aggregate, and analyze logs from different systems in one place. CloudWatch provides subcriptions as a real-time feed of these logs to other services like Amazon Kinesis Data Streams, AWS Lambda, and Amazon OpenSearch Service. These subscriptions are a popular mechanism to enable custom processing and advanced analysis of log data to gain additional valuable insights. At the time of publishing this blog post, these subscription filters support delivering logs to Amazon OpenSearch Service provisioned clusters only. Customers are increasingly adopting Amazon OpenSearch Serverless as a cost-effective option for infrequent, intermittent and unpredictable workloads.

In this blog post, we will show how to use Amazon OpenSearch Ingestion to deliver CloudWatch logs to OpenSearch Serverless in near real-time. We outline a mechanism to connect a Lambda subscription filter with OpenSearch Ingestion and deliver logs to OpenSearch Serverless without explicitly needing a separate subscription filter for it.

Solution overview

The following diagram illustrates the solution architecture.

  1. CloudWatch Logs: Collects and stores logs from various AWS resources and applications. It serves as the source of log data in this solution.
  2. Subscription filter : A CloudWatch Logs subscription filter filters and routes specific log data from CloudWatch Logs to the next component in the pipeline.
  3. CloudWatch exporter Lambda function: This is a Lambda function that receives the filtered log data from the subscription filter. Its purpose is to transform and prepare the log data for ingestion into the OpenSearch Ingestion pipeline.
  4. OpenSearch Ingestion: This is a component of OpenSearch Service. The Ingestion pipeline is responsible for processing and enriching the log data received from the CloudWatch exporter Lambda function before storing it in the OpenSearch Serverless collection.
  5. OpenSearch Service: This is fully managed service that stores and indexes log data, making it searchable and available for analysis and visualization. OpenSearch Service offers two configurations: provisioned domains and serverless. In this setup, we use serverless, which is an auto-scaling configuration for OpenSearch Service.

Prerequisites

Deploy the solution

With the prerequisites in place, you can create and deploy the pieces of the solution.

Step 1: Create PipelineRole for ingestion

  • Open the AWS Management Console for AWS Identity and Access Management (IAM).
  • Choose Policies, and then choose Create policy.
  • Select JSON and paste the following policy into the editor:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "aoss:BatchGetCollection",
                "aoss:APIAccessAll"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:aoss:us-east-1:{accountId}:collection/{collectionId}"
        },
        {
            "Action": [
                "aoss:CreateSecurityPolicy",
                "aoss:GetSecurityPolicy",
                "aoss:UpdateSecurityPolicy"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aoss:collection": "{collection}"
                }
            }
        }
    ]
}

// Replace {accountId}, {collectionId}, and {collection} with your own values
  • Choose Next, choose Next, and name your policy collection-pipeline-policy.
  • Choose Create policy.
  • Next, create a role and attach the policy to it. Choose Roles, and then choose Create role.
  • Select Custom trust policy and paste the following policy into the editor:
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "Service":"osis-pipelines.amazonaws.com"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}
  • Choose Next, and then search for and select the collection-pipeline-policy you just created.
  • Choose Next and name the role PipelineRole.
  • Choose Create role.

Step 2: Configure the network and data policy for OpenSearch collection

  • In the OpenSearch Service console, navigate to the Serverless menu.
  • Create a VPC endpoint by following the instruction in Create an interface endpoint for OpenSearch Serverless.
  • Go to Security and choose Network policies.
  • Choose Create network policy.
  • Configure the following policy
[
  {
    "Rules": [
      {
        "Resource": [
          "collection/{collection name}"
        ],
        "ResourceType": "collection"
      }
    ],
    "AllowFromPublic": false,
    "SourceVPCEs": [
      "{VPC Enddpoint Id}"
    ]
  },
  {
    "Rules": [
      {
        "Resource": [
          "collection/{collection name}"
        ],
        "ResourceType": "dashboard"
      }
    ],
    "AllowFromPublic": true
  }
]
  • Go to Security and choose Data access policies.
  • Choose Create access policy.
  • Configure the following policy:
[
  {
    "Rules": [
      {
        "Resource": [
          "index/{collection name}/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:ReadDocument",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{accountId}:role/PipelineRole",
      "arn:aws:iam::{accountId}:role/Admin"
    ],
    "Description": "Rule 1"
  }
]

Step 3: Create an OpenSearch Ingestion pipeline

  • Navigate to the OpenSearch Service.
  • Go to the Ingestion pipelines section.
  • Choose Create pipeline.
  • Define the pipeline configuration.
version: "2"
 cwlogs-ingestion-pipeline:

  source:

    http:

      path: /logs/ingest

  sink:

    - opensearch:

        # Provide an AWS OpenSearch Service domain endpoint

        hosts: ["https://{collectionId}.{region}.aoss.amazonaws.com"]

        index: "cwl-%{yyyy-MM-dd}"

        aws:

          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com

          sts_role_arn: "arn:aws:iam::{accountId}:role/PipelineRole"

          # Provide the region of the domain.

          region: "{region}"

          serverless: true

          serverless_options:

            network_policy_name: "{Network policy name}"
 # To get the values for the placeholders: 
 # 1. {collectionId}: You can find the collection ID by navigating to the Amazon OpenSearch Serverless Collection in the AWS Management Console, and then clicking on the Collection. The collection ID is listed under the "Overview" section. 
 # 2. {region}: This is the AWS region where your Amazon OpenSearch Service domain is located. You can find this information in the AWS Management Console when you navigate to the domain. 
 # 3. {accountId}: This is your AWS account ID. You can find your account ID by clicking on your username in the top-right corner of the AWS Management Console and selecting "My Account" from the dropdown menu. 
 # 4. {Network policy name}: This is the name of the network policy you have configured for your Amazon OpenSearch Serverless Collection. If you haven't configured a network policy, you can leave this placeholder as is or remove it from the configuration.
 # After obtaining the necessary values, replace the placeholders in the configuration with the actual values.            

Step 4: Create a Lambda function

  • Create a Lambda layer for requests and sigv4 packages. Run the following commands in AWS Cloudshell.
mkdir lambda_layers
 cd lambda_layers
 mkdir python
 cd python
 pip install requests -t ./
 pip install requests_auth_aws_sigv4 -t ./
 cd ..
 zip -r python_modules.zip .


 aws lambda publish-layer-version --layer-name Data-requests --description "My Python layer" --zip-file fileb://python_modules.zip --compatible-runtimes python3.x
import base64
 import gzip
 import json
 import logging
 import json
 import jmespath
 import requests
 from datetime import datetime
 from requests_auth_aws_sigv4 import AWSSigV4
 import boto3


 LOGGER = logging.getLogger(__name__)
 LOGGER.setLevel(logging.INFO)


 def lambda_handler(event, context):

    """Extract the data from the event"""

    data = jmespath.search("awslogs.data", event)

    """Decompress the logs"""

    cwLogs = decompress_json_data(data)

    """Construct the payload to send to OpenSearch Ingestion"""

    payload = prepare_payload(cwLogs)

    print(payload)

    """Ingest the set of events to the pipeline"""    

    response = ingestData(payload)

    return {

        'statusCode': 200

    }
 def decompress_json_data(data):

    compressed_data = base64.b64decode(data)

    uncompressed_data = gzip.decompress(compressed_data)

    return json.loads(uncompressed_data)


 def prepare_payload(cwLogs):

    payload = []

    logEvents = cwLogs['logEvents']

    for logEvent in logEvents:

        request = {}

        request['id'] = logEvent['id']

        dt = datetime.fromtimestamp(logEvent['timestamp'] / 1000) 

        request['timestamp'] = dt.isoformat()

        request['message'] = logEvent['message'];

        request['owner'] = cwLogs['owner'];

        request['log_group'] = cwLogs['logGroup'];

        request['log_stream'] = cwLogs['logStream'];

        payload.append(request)

    return payload

 def ingestData(payload):

    ingestionEndpoint = '{OpenSearch Pipeline Endpoint}'

    endpoint = 'https://' + ingestionEndpoint

    headers = {'Content-Type': 'application/json', 'Accept':'application/json'}

    r = requests.request('POST', f'{endpoint}/logs/ingest', json=payload, auth=AWSSigV4('osis'), headers=headers)

    LOGGER.info('Response received: ' + r.text)

    return r
  • Replace {OpenSearch Pipeline Endpoint}’ with the endpoint of your OpenSearch Ingestion pipeline.
  • Attach the following inline policy in execution role.
{

    "Version": "2012-10-17",

    "Statement": [

        {

            "Sid": "PermitsWriteAccessToPipeline",

            "Effect": "Allow",

            "Action": "osis:Ingest",

            "Resource": "arn:aws:osis:{region}:{accountId}:pipeline/{OpenSearch Pipeline Name}"

        }

    ]
 }
  • Deploy the function.

Step 5: Set up a CloudWatch Logs subscription

  • Grant permission to a specific AWS service or AWS account to invoke the specified Lambda function. The following command grants permission to the CloudWatch Logs service to invoke the cloud-logs Lambda function for the specified log group. This is necessary because CloudWatch Logs cannot directly invoke a Lambda function without being granted permission. Run the following command in CloudShell to add permission.
aws lambda add-permission
 --function-name "{function name}"
 --statement-id "{function name}"
 --principal "logs.amazonaws.com"
 --action "lambda:InvokeFunction"
 --source-arn "arn:aws:logs:{region}:{accountId}:log-group:{log_group}:*"
 --source-account "{accountId}"
  • Create a subscription filter for a log group. The following command creates a subscription filter on the log group, which forwards all log events (because the filter pattern is an empty string) to the Lambda function. Run the following command in Cloudshell to create the subscription filter.
aws logs put-subscription-filter
 --log-group-name {log_group}
 --filter-name {filter name}
 --filter-pattern ""
 --destination-arn arn:aws:lambda:{region}:{accountId}:function:{function name}

Step 6: Testing and verification

  • Generate some logs in your CloudWatch log group. Run the following command in Cloudshell to create sample logs in log group.
aws logs put-log-events --log-group-name {log_group} --log-stream-name {stream_name} --log-events "[{\"timestamp\":{timestamp in millis} , \"message\": \"Simple Lambda Test\"}]"
  • Check the OpenSearch collection to ensure logs are indexed correctly.

Clean up

Remove the infrastructure for this solution when not in use to avoid incurring unnecessary costs.

Conclusion

You saw how to set up a pipeline to send CloudWatch logs to an OpenSearch Serverless collection within a VPC. This integration uses CloudWatch for log aggregation, Lambda for log processing, and OpenSearch Serverless for querying and visualization. You can use this solution to take advantage of the pay-as-you-go pricing model for OpenSearch Serverless to optimize operational costs for log analysis.

To further explore, you can:


About the Authors

Balaji Mohan is a senior modernization architect specializing in application and data modernization to the cloud. His business-first approach ensures seamless transitions, aligning technology with organizational goals. Using cloud-native architectures, he delivers scalable, agile, and cost-effective solutions, driving innovation and growth.

Souvik Bose is a Software Development Engineer working on Amazon OpenSearch Service.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Accelerate your Terraform development with Amazon Q Developer

Post Syndicated from Dr. Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/accelerate-your-terraform-development-with-amazon-q-developer/

This post demonstrates how Amazon Q Developer, a generative AI-powered assistant for software development, helps create Terraform templates. Terraform is an infrastructure as code (IaC) tool that provisions and manages infrastructure on AWS safely and predictably. When used in an integrated development environment (IDE), Amazon Q Developer assists with software development, including code generation, explanation, and improvements. This blog highlights the 5 most used cases to show how Amazon Q Developer can generate Terraform code snippets:

  1. Secure Networking Deployment: Generate Terraform code snippet to create an Amazon Virtual Private Cloud (VPC) with subnets, route tables, and security groups.
  2. Multi-Account CI/CD Pipelines with AWS CodePipeline: Generate Terraform code snippet for setting up continuous integration and continuous delivery (CI/CD) pipelines using AWS CodePipeline across multiple AWS accounts.
  3. Event-Driven Architecture: Generate Terraform code snippet to set up an event-driven architecture using Amazon EventBridge, AWS Lambda, Amazon API Gateway, and other serverless services.
  4. Container Orchestration (Amazon ECS Fargate): Generate Terraform code snippet for deploying and managing container orchestration using Amazon Elastic Container Service (ECS) with Fargate.
  5. Machine Learning Workflows: Generate Terraform module for deploying machine learning workflows using AWS SageMaker.

Terraform Project Structure

First, you want to understand the best practices and requirements for setting up a multi-environment Terraform Infrastructure as Code (IaC) project. Following industry practices from the start is crucial to manage your multi-environment infrastructure and this is where Amazon Q will recommend a standard folder and file structure for organising Terraform templates and managing infrastructure deployments.

Interface showing Amazon Q Developer's recommendations for structuring a Terraform project

Amazon Q Developer recommended organising Terraform templates in separate folders for each environment, with sub-folders to group resources into modules. It provided best practices like version control, remote backend, and tagging for reusability and scalability. We can ask Amazon Q Developer to generate a sample example for better understanding.

Screenshot of the Terraform folder structure generated by Amazon Q Developer

You can see that the recommended folder structure includes a root project folder and sub-folders for environments and modules. The sub-folders manage multiple environments (like development, testing, production), reusable modules (like Virtual Private Cloud (VPC), Elastic Compute Cloud (EC2)) for various components and it demonstrates references to manage Terraform templates for individual environments and components. Let’s explore the top 5 most common use cases one by one.

1. Secure Networking Deployment

Once we setup Terraform project structure, we will ask Amazon Q Developer to give recommendations for networking services and requirements. Amazon Q Developer suggests to use Amazon Virtual Private Cloud (VPC), Subnets, Internet Gateways, Network Access Control Lists (ACLs) and Security Groups.

Amazon Q Developer interface displaying recommendations for core Amazon VPC components

Now, we can ask Amazon Q Developer to generate a Terraform template for building components like Virtual Private Cloud (VPC) and subnets. The Terraform code generated by Amazon Q Developer for a VPC, private subnet, and public subnet. It also added tags to each resource, following best practices. To leverage the suggested code snippet in your template, open the vpc.tf file and either insert it at the cursor or copy it into the file. Additionally, Amazon Q Developer created an Internet Gateway, Route Tables, and associated them with the VPC.

Generated Terraform code snippet for VPC and Subnets shown in the Amazon Q Developer interface

2. Multi-Account CI/CD Pipelines with AWS CodePipeline

Let’s discuss the second use case: a CI/CD (Continuous Integration/Continuous Deployment) pipeline using AWS CodePipeline to deploy Terraform code across multiple AWS accounts. Assume this is your first time working on this use case. Use Amazon Q Developer to understand the pipeline’s design stages and requirements for creating your pipeline across AWS accounts. I asked Amazon Q Developer about the pipeline stages and requirements to deploy across AWS accounts.

Amazon Q Developer displaying generated stages and dependencies for a CI/CD pipeline

Amazon Q Developer provided all the main stages for the CI/CD (Continuous Integration/Continuous Deployment) pipeline:

  1. Source stage pulls the Terraform code from the source code repository.
  2. Plan stage includes linting, validation, or running terraform plan to view the Terraform plan before deploying the code.
  3. Build stage performs additional testing for the infrastructure to ensure all components will be created successfully.
  4. Deploy stage runs terraform apply.

Amazon Q Developer also provided all the requirements needed before creating the CI/CD pipeline. These requirements include terraform is installed in the pipeline environment, IAM roles that will be assumed by the pipeline stages to deploy the terraform code across different accounts, an Amazon S3 bucket to store the state of the terraform code, source code repository to store our terraform files, use parameters to store sensitive data like AWS accounts IDs and AWS CodePipeline to create our CI/CD pipeline.

You will use Amazon Q Developer now to generate the terraform code for the CI/CD pipeline based on stages design proposed by the services in the previous question.

Interface showing Terraform code snippets for CI/CD pipeline stages generated by Amazon Q Developer

We would like to highlight three things here:

1. Amazon Q Developer suggested all the stages for our Continuous Integration/Continuous Deployment (CI/CD) design:

  • Source Stage: this stage pulls the code from the source code repository.
  • Build Stage: this stage will validates the infrastructure code, run a Terraform plan stage, and can automate any review.
  • Deploy Stage: this stage deploys the terraform code to the target AWS account.

2. Amazon Q Developer generated the build stage before the plan and this is expected in any CI/CD pipeline. First, we start with building the artifacts, running any tests or validation steps. If this stage is passed successfully, it will then proceed to terraform plan.
3. Amazon’s Q Developer suggests deploying to separate stages for each target account, aligning with the best practice of using different AWS accounts for development, testing, and production environments. This reduces the blast radius and allows configuring a separate stage for each environment/account.

3. Event-Driven Architecture

For the next use case, we will start by asking Amazon Q Developer to explain Event-driven architecture. As shown below, Amazon Q Developer highlights the important aspects that we need to consider when designing event-driven architecture like:

  • An event that represents an actual change in the application state like S3 object upload.
  • The event source that produces the actual event like Amazon S3.
  • An event route that routes the event between source and target.
  • An event target that consumes the event like AWS Lambda function.

Amazon Q Developer providing an explanation of event-driven architecture components

Assume you want to build a sample Event-driven architecture using Amazon SNS as a simple pub/sub. In order to build such architecture, we will need:

  • An Amazon SNS topic to publish events to an Amazon SQS queue subscriber.
  • An Amazon SQS queue which subscribe to SNS topic.

Sample architecture of Event-driven approach

Asking Amazon Q Developer to create a terraform code for the above use case, it generated a code that can get us started in seconds. The code includes:

  • Create an SNS Topic and SQS queue
  • Subscribe SQS queue to SNS topic
  • AWS IAM policy to allow SNS to write to SQS

Terraform code snippet for an event-driven application generated by Amazon Q Developer

4. Container Orchestration (Amazon ECS Fargate)

Now to our next use case, we want to build an Amazon ECS cluster with Fargate. Before we start, we will ask Amazon Q Developer about Amazon ECS and the difference between using Amazon EC2 or Fargate option.

Amazon Q Developer interface explaining Amazon ECS and its compute options

Amazon Q Developer not only provides details about Amazon ECS capabilities and well-defined description about the difference between Amazon EC2 and Fargate as compute options, and also a guide when to use what to help in decision making process.

Now, we will ask Amazon Q Developer to suggest a terraform code to create an Amazon ECS cluster with Fargate as the compute option.

Generated Terraform code snippet for deploying Amazon ECS on Fargate shown by Amazon Q Developer

Amazon Q Developer suggested the key resources in the Terraform code for the ECS cluster, the resources are:

  • An Amazon ECS Cluster.
  • A sample task definition for an Nginx container with all the required configuration for the container to run.
  • ECS service with Fargate as launch type for the container to run.

5. Secure Machine Learning Workflows

Now let us explore final use case on setting up Machine Learning workflow. Assuming I am new to ML on AWS, I will first try to understand the best practices and requirements for ML workflows. We can ask Amazon Q Developer to get the recommendations for the resources, security policies etc.

Amazon Q Developer's recommendations for best practices in secure ML workflows

Amazon Q Developer provided recommendations to provision SageMaker resources in private VPC, implement authentication and authorization using AWS IAM, encrypt data at rest and in transit using AWS KMS and setup secure CI/CD. Once I get the recommendations, I can identify the resources which I need to build MLOps workflow.

Interface showing suggested resources for building an MLOPs workflow by Amazon Q Developer

Amazon Q Developer suggested resources such as a SageMaker Studio instance, model registry, endpoints, version control, CI/CD Pipeline, CloudWatch etc. for ML model building, training and deployment using SageMaker. Now I can start building Terraform templates to deploy these resources. To get the code recommendations, I can ask Amazon Q Developer to give a Terraform snippet for all these resources.

Amazon Q Developer will generate Terraform snippets for isolated VPC, Security Group, S3 bucket and SageMaker pipeline. Now, as shown earlier, we can leverage Amazon Q Developer to generate Terraform code for each and every resource using comments.

Conclusion

In this blog post, we showed you how to use Amazon Q Developer, a generative AI–powered assistant from AWS, in your IDE to quickly improve your development experience when using an infrastructure-as-code tool like Terraform to create your AWS infrastructure. However, we would like to highlight some important points:

  • Code Validation and Testing: Always thoroughly validate and test the Terraform code generated by Amazon Q Developer in a safe environment before deploying it to production. Automated code generation tools can sometimes produce code that may need adjustments based on your specific requirements and configurations.
  • Security Considerations: Ensure that all security practices are followed, such as least privilege principles for IAM roles, proper encryption methods for sensitive data, and continuous monitoring for security compliance.
  • Limitations of Amazon Q Developer: While Amazon Q Developer provides valuable assistance, it may not cover all edge cases or specific customizations needed for your infrastructure. Always review the generated code and modify it as necessary to fit your unique use cases.
  • Updates and Maintenance: Infrastructure and services on AWS are continuously evolving. Make sure to keep your Terraform code and the use of Amazon Q Developer updated with the latest best practices and AWS service changes.
  • Real-Time Assistance: Use Amazon Q Developer in your IDE to enhance generated code as it provides real-time recommendations based on your existing code and comments. The suggestions are based on your existing code and comments, which are in this case generated by Amazon Q Developer.

To get started with Amazon Q Developer for debugging today navigate to Amazon Q Developer in IDE and simply start asking questions about debugging. Additionally, explore Amazon Q Developer workshop for additional hands-on use cases.

For any inquiries or assistance with Amazon Q Developer, please reach out to your AWS account team.

About the authors:

Dr. Rahul Sharad Gaikwad

Dr. Rahul is a Lead DevOps Consultant at AWS, specializing in migrating and modernizing customer workloads on the AWS Cloud. He is a technology enthusiast with a keen interest in Generative AI and DevOps. He received his Ph.D. in AIOps in 2022. He is recipient of the Indian Achievers’ Award (2024), Best PhD Thesis Award (2024), Research Scholar of the Year Award (2023) and Young Researcher Award (2022).

Omar Kahil

Omar is a Professional Services Senior consultant who helps customers adopt DevOps culture and best practices. He also works to simplify the adoption of AWS services by automating and implementing complex solutions.

Balance deployment speed and stability with DORA metrics

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/devops/balance-deployment-speed-and-stability-with-dora-metrics/

Development teams adopt DevOps practices to increase the speed and quality of their software delivery. The DevOps Research and Assessment (DORA) metrics provide a popular method to measure progress towards that outcome. Using four key metrics, senior leaders can assess the current state of team maturity and address areas of optimization.

This blog post shows you how to make use of DORA metrics for your Amazon Web Services (AWS) environments. We share a sample solution which allows you to bootstrap automatic metric collection in your AWS accounts.

Benefits of collecting DORA metrics

DORA metrics offer insights into your development teams’ performance and capacity by measuring qualitative aspects of deployment speed and stability. They also indicate the teams’ ability to adapt by measuring the average time to recover from failure. This helps product owners in defining work priorities, establishing transparency on team maturity, and developing a realistic workload schedule. The metrics are appropriate for communication with senior leadership. They help commit leadership support to resolve systemic issues inhibiting team satisfaction and user experience.

Use case

This solution is applicable to the following use case:

  • Development teams have a multi-account AWS setup including a tooling account where the CI/CD tools are hosted, and an operations account for log aggregation and visualization.
  • Developers use GitHub code repositories and AWS CodePipeline to promote code changes across application environment accounts.
  • Tooling, operations, and application environment accounts are member accounts in AWS Control Tower or workload accounts in the Landing Zone Accelerator on AWS solution.
  • Service impairment resulting from system change is logged as OpsItem in AWS Systems Manager OpsCenter.

Overview of solution

The four key DORA metrics

The ‘four keys’ measure team performance and ability to react to problems:

  1. Deployment Frequency measures the frequency of successful change releases in your production environment.
  2. Lead Time For Changes measures the average time for committed code to reach production.
  3. Change Failure Rate measures how often changes in production lead to service incidents/failures, and is complementary to Mean Time Between Failure.
  4. Mean Time To Recovery measures the average time from service interruption to full recovery.

The first two metrics focus on deployment speed, while the other two indicate deployment stability (Figure 1). We recommend organizations to set their own goals (that is, DORA metric targets) based on service criticality and customer needs. For a discussion of prior DORA benchmark data and what it reveals about the performance of development teams, consult How DORA Metrics Can Measure and Improve Performance.

Balance between deployment speed and stability in software delivery, utilizing DORA metrics across four quadrants. The horizontal axis depicts speed, progressing from low, infrequent deployments and higher time for changes on the left to rapid, frequent deployments with lower time for changes on the right. Vertically, the stability increases from the bottom, characterized by longer service restoration and higher failure rates, to the top, indicating quick restoration and fewer failures. The top-right quadrant represents the ideal state of high speed and stability, serving as the target for optimized software delivery and high performance.

Figure 1. Overview of DORA metrics

Consult the GitHub code repository Balance deployment speed and stability with DORA metrics for a detailed description of the metric calculation logic. Any modifications to this logic should be made carefully.

For example, the Change Failure Rate focuses on changes that impair the production system. Limiting the calculation to tags (such as hotfixes) on pull requests would exclude issues related to the build process. It’s important to match system change records that lead to actual impairments in production. Limiting the calculation to the number of failed deployments from the deployment pipeline only considers deployments that didn’t reach production. We use AWS Systems Manager OpsCenter as the system of records for change-related outages, rather than relying solely on data from CI/CD tools.

Similarly, Mean Time To Recovery measures the duration from a service impairment in production to a successful pipeline run. We encourage teams to track both pipeline status and recovery time, as frequent pipeline failure can indicate insufficient local testing and potential pipeline engineering issues.

Gathering DORA events

Our metric calculation process runs in four steps:

  1. In the tooling account, we send events from CodePipeline to the default event bus of Amazon EventBridge.
  2. Events are forwarded to custom event buses which process them according to the defined metrics and any filters we may have set up.
  3. The custom event buses call AWS Lambda functions which forward metric data to Amazon CloudWatch. CloudWatch gives us an aggregated view of each of the metrics. From Amazon CloudWatch, you can send the metrics to another designated dashboard like Amazon Managed Grafana.
  4. As part of the data collection, the Lambda function will also query GitHub for the relevant commit to calculate the lead time for changes metric. It will query AWS Systems Manager for OpsItem data for change failure rate and mean time to recovery metrics. You can create OpsItems manually as part of your change management process or configure CloudWatch alarms to create OpsItems automatically.

Figure 2 visualizes these steps. This setup can be replicated to a group of accounts of one or multiple teams.

This figure visualizes the aforementioned four steps of our metric calculation process. AWS Lambda functions process all events and publish custom metrics in Amazon CloudWatch.

Figure 2. DORA metric setup for AWS CodePipeline deployments

Walkthrough

Follow these steps to deploy the solution in your AWS accounts.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Deploying the solution

Clone the GitHub code repository Balance deployment speed and stability with DORA metrics.

Before you start deploying or working with this code base, there are a few configurations you need to complete in the constants.py file in the cdk/ directory. Open the file in your IDE and update the following constants:

  1. TOOLING_ACCOUNT_ID & TOOLING_ACCOUNT_REGION: These represent the AWS account ID and AWS region for AWS CodePipeline (that is, your tooling account).
  2. OPS_ACCOUNT_ID & OPS_ACCOUNT_REGION: These are for your operations account (used for centralized log aggregation and dashboard).
  3. TOOLING_CROSS_ACCOUNT_LAMBDA_ROLE: The IAM Role for cross-account access that allows AWS Lambda to post metrics from your tooling account to your operations account/Amazon CloudWatch dashboard.
  4. DEFAULT_MAIN_BRANCH: This is the default branch in your code repository that’s used to deploy to your production application environment. It is set to “main” by default, as we assumed feature-driven development (GitFlow) on the main branch; update if you use a different naming convention.
  5. APP_PROD_STAGE_NAME: This is the name of your production stage and set to “DeployPROD” by default. It’s reserved for teams with trunk-based development.

Setting up the environment

To set up your environment on MacOS and Linux:

  1. Create a virtual environment:
    $ python3 -m venv .venv
  2. Activate the virtual environment: On MacOS and Linux:
    $ source .venv/bin/activate

Alternatively, to set up your environment on Windows:

  1. Create a virtual environment:
    % .venv\Scripts\activate.bat
  2. Install the required Python packages:
    $ pip install -r requirements.txt

To configure the AWS Command Line Interface (AWS CLI):

  1. Follow the configuration steps in the AWS CLI User Guide.
    $ aws configure sso
  2. Configure your user profile (for example, Ops for operations account, Tooling for tooling account). You can check user profile names in the credentials file.

Deploying the CloudFormation stacks

  1. Switch directory
    $ cd cdk
  2. Bootstrap CDK
    $ cdk bootstrap –-profile Ops
  3. Synthesize the AWS CloudFormation template for this project:
    $ cdk synth
  4. To deploy a specific stack (see Figure 3 for an overview), specify the stack name and AWS account number(s) in the following command:
    $ cdk deploy <Stack-Name> --profile {Tooling, Ops}

    To launch the DoraToolingEventBridgeStack stack in the Tooling account:

    $ cdk deploy DoraToolingEventBridgeStack --profile Tooling

    To launch the other stacks in the Operations account (including DoraOpsGitHubLogsStack, DoraOpsDeploymentFrequencyStack, DoraOpsLeadTimeForChangeStack, DoraOpsChangeFailureRateStack, DoraOpsMeanTimeToRestoreStack, DoraOpsMetricsDashboardStack):

    $ cdk deploy DoraOps* --profile Ops

The following figure shows the resources you’ll launch with each CloudFormation stack. This includes six AWS CloudFormation stacks in operations account. The first stack sets up log integration for GitHub commit activity. Four stacks contain a Lambda function which creates one of the DORA metrics. The sixth stack creates the consolidated dashboard in Amazon CloudWatch.

Figure 3. Resources provisioned with this solution

Testing the deployment

To run the provided tests:

$ pytest

Understanding what you’ve built

Deployed resources in tooling account

The DoraToolingEventBridgeStack includes Amazon EventBridge rules with a target of the central event bus in the operations account, plus an AWS IAM role with cross-account access to put events in the operations account. The event pattern for invoking our EventBridge rules listens for deployment state changes in AWS CodePipeline:

{
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "source": ["aws.codepipeline"]
}

Deployed resources in operations account

  1. The Lambda function for Deployment Frequency tracks the number of successful deployments to production, and posts the metric data to Amazon CloudWatch. You can add a dimension with the repository name in Amazon CloudWatch to filter on particular repositories/teams.
  2. The Lambda function for the Lead Time For Change metric calculates the duration from the first commit to successful deployment in production. This covers all factors contributing to lead time for changes, including code reviews, build, test, as well as the deployment itself.
  3. The Lambda function for Change Failure Rate keeps track of the count of successful deployments and the count of system impairment records (OpsItems) in production. It publishes both as metrics to Amazon CloudWatch and the latter calculates the ratio, as shown in below example.
    This visual shows three graphed metrics in Amazon CloudWatch: metric “m1” calculating number of failed deployments, metric “m2” calculating number of total deployments, and metric “m3” calculating change failure rate by dividing m1 with m2 and multiplying by 100.
  4. The Lambda function for Mean Time To Recovery keeps track of all deployments with status SUCCEEDED in production and whose repository branch name references an existing OpsItem ID. For every matching event, the function gets the creation time of the OpsItem record and posts the duration between OpsItem creation and successful re-deployment to the CloudWatch dashboard.

All Lambda functions publish metric data to Amazon CloudWatch using the PutMetricData API. The final calculation of the four keys is performed on the CloudWatch dashboard. The solution includes a simple CloudWatch dashboard so you can validate the end-to-end data flow and confirm that it has deployed successfully:

This simple CloudWatch dashboard displays the four DORA metrics for three reporting periods: per day, per week, and per month.

Cleaning up

Remember to delete example resources if you no longer need them to avoid incurring future costs.

You can do this via the CDK CLI:

$ cdk destroy <Stack-Name> --profile {Tooling, Ops}

Alternatively, go to the CloudFormation console in each AWS account, select the stacks related to DORA and click on Delete. Confirm that the status of all DORA stacks is DELETE_COMPLETE.

Conclusion

DORA metrics provide a popular method to measure the speed and stability of your deployments. The solution in this blog post helps you bootstrap automatic metric collection in your AWS accounts. The four keys help you gain consensus on team performance and provide data points to back improvement suggestions. We recommend using the solution to gain leadership support for systemic issues inhibiting team satisfaction and user experience. To learn more about developer productivity research, we encourage you to also review alternative frameworks including DevEx and SPACE.

Further resources

If you enjoyed this post, you may also like:

Author bio

Rostislav Markov

Rostislav is principal architect with AWS Professional Services. As technical leader in AWS Industries, he works with AWS customers and partners on their cloud transformation programs. Outside of work, he enjoys spending time with his family outdoors, playing tennis, and skiing.

Ojesvi Kushwah

Ojesvi works as a Cloud Infrastructure Architect with AWS Professional Services supporting global automotive customers. She is passionate about learning new technologies and building observability solutions. She likes to spend her free time with her family and animals.

Integrate Amazon MWAA with Microsoft Entra ID using SAML authentication

Post Syndicated from Satya Chikkala original https://aws.amazon.com/blogs/big-data/integrate-amazon-mwaa-with-microsoft-entra-id-using-saml-authentication/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) provides a fully managed solution for orchestrating and automating complex workflows in the cloud. Amazon MWAA offers two network access modes for accessing the Apache Airflow web UI in your environments: public and private. Customers often deploy Amazon MWAA in private mode and want to use existing login authentication mechanisms and single sign-on (SSO) features to have seamless integration with the corporate Active Directory (AD). Also, the end-users don’t need to log in to the AWS Management Console to access the Airflow UI.

In this post, we illustrate how to configure an Amazon MWAA environment deployed in private network access mode with customer managed VPC endpoints and authenticate users using SAML federated identity using Microsoft Entra ID and Application Load Balancer (ALB). Users can seamlessly log in to the Airflow UI with their corporate credentials and access the DAGs. This solution can be modified for Amazon MWAA public network access mode as well.

Solution overview

The architectural components involved in authenticating the Amazon MWAA environment using SAML SSO are depicted in the following diagram. The infrastructure components include two public subnets and three private subnets. The public subnets are required for the internet-facing ALB. Two private subnets are used to set up the Amazon MWAA environment, and the third private subnet is used to host the AWS Lambda authorizer function. This subnet will have a NAT gateway attached to it, because the function needs to verify the signer to confirm the JWT header has the expected LoadBalancer ARN.

The workflow consists of the following steps:

  1. For SAML configuration, Microsoft Entra ID serves as the identity provider (IdP).
  2. Amazon Cognito serves as the service provider (SP).
  3. ALB has built-in support for Amazon Cognito and authenticates requests.
  4. Post-authentication, ALB forwards the requests to the Lambda authorizer function. The Lambda function decodes the user’s JWT token and validates whether the user’s AD group is mapped to the relevant AWS Identity and Access Management (IAM) role.
  5. If valid, the function creates a web login token and redirects to the Amazon MWAA environment for successful login.

The following are the high-level steps to deploy the solution:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket for artifacts.
  2. Create an SSL certificate and upload it to AWS Certificate Manager (ACM).
  3. Deploy the Amazon MWAA infrastructure stack using AWS CloudFormation.
  4. Configure Microsoft Entra ID services and integrate the Amazon Cognito user pool.
  5. Deploy the ALB CloudFormation stack.
  6. Log in to Amazon MWAA using Microsoft Entra ID user credentials.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account
  • Appropriate IAM permissions to deploy AWS CloudFormation stack resources
  • A Microsoft Azure account is required for creating the Microsoft Entra ID app (IdP config) and Microsoft Entra ID P2.
  • A public certificate for the ALB in the AWS Region where the infrastructure is being deployed and a custom domain name relevant to the certificate.

Create an S3 bucket

In this step, we create an S3 bucket to store your Airflow DAGs, custom plugins in a plugins.zip file, and Python dependencies in a requirements.txt file. This bucket is used by the Amazon MWAA environment to fetch DAGs and dependency files.

  1. On the Amazon S3 console, choose the Region where you want to create a bucket.
  2. In the navigation pane, choose Buckets.
  3. Choose Create bucket.
  4. For Bucket type, select General purpose.
  5. For Bucket name, enter a name for your bucket (for this post, mwaa-sso-blog-<your-aws-account-number>).
  6. Choose Create bucket. 

  7. Navigate to the bucket and choose Create folder.
  8. For Folder name, enter a name (for this post, we name the folder dags).
  9. Choose Create folder.


Import certificates into ACM

ACM is integrated with Elastic Load Balancing (ALB). In this step,  you can request a public certificate using ACM or import a certificate into ACM. To import organization certificates linked to a custom DNS into ACM, you must provide the certificate and its private key. To import a certificate signed by a non-AWS Certificate Authority (CA), you must also include the private and public keys of the certificate.

  1. On the ACM console, choose Import certificate in the navigation pane.
  2. For Certificate body, enter the contents of the cert.pem file.
  3. For Certificate private key, enter the contents of the privatekey.pem file.
  4. Choose Next.


  5. Choose Review and import.
  6. Review the metadata about your certificate and choose Import.

After the import is successful, the status of the imported certificate will show as Issued.

Create the Azure AD service, users, groups, and enterprise application

For the SSO integration with Azure, an enterprise application is required, which acts as the IdP for the SAML flow. We add relevant users and groups to the application and configure the SP (Amazon Cognito) details.

Airflow comes with five default roles: Public, Admin, Op, User, Viewer. In this post, we focus on three: Admin , User and Viewer. We create three roles and three corresponding users and assign memberships appropriately.

  1. Log in to the Azure portal.
  2. Navigate to Enterprise applications and choose New application.

  3. Enter a name for your application (for example, mwaa-environment) and choose Create.



    You can now view the details of your application.


    Now you create two groups.

  4. In the search bar, search for Microsoft Entra ID.

  5. On the Add menu, choose Group.

  6. For Group type, choose a type (for this post, Security).
  7. Enter a group name (for example, airflow-admins) and description.
  8. Choose Create.


  9. Repeat these steps to create two more groups, named airflow-users and airflow-viewers.
  10. Note the object IDs for each group (these are required in a later step).


    Next, you create users.
  11. On the Overview page, on the Add menu, choose User and Create new user.
  12. Enter a name for your user (for example, mwaa-user), display name, and password.
  13. Choose Review + create.


  14. Repeat these steps to create a user called mwaa-admin.
  15. In your airflow-users group details page, choose Members in the navigation pane.
  16. Choose Add members.
  17. Search for and select the users you created and choose Select.


  18. Repeat these steps to add the users to each group.

  19. Navigate to your application and choose Assign users and groups.

  20. Choose Add user/group.

  21. Search for and select the groups you created, then choose Select.

 

Deploy the Amazon MWAA environment stack

For this solution, we provide two CloudFormation templates that set up the services illustrated in the architecture. Deploying the CloudFormation stacks in your account incurs AWS usage charges.

The first CloudFormation stack creates the following resources:
  • A VPC with two public subnets and three private subnets and relevant route tables, NAT gateway, internet gateway, and security group
  • VPC endpoints required for the Amazon MWAA environment
  • An Amazon Cognito user pool and user pool domain
  • Application Load Balancer
Deploy the stack by completing the following steps:
  1. Choose Launch Stack to launch the CloudFormation stack.

  2. For Stack name, enter a name (for example, sso-blog-mwaa-infra-stack).

  3.  Enter the following parameters:

    1. For MWAAEnvironmentName, enter the environment name.

    2. For MwaaS3Bucket, enter the S3 artifacts bucket you created.

    3. For VpcCIDR, enter the specify IP range (CIDR notation) for this VPC.

    4. For PrivateSubnet1CIDR, enter the IP range (CIDR notation) for the private subnet in the first Availability Zone.

    5.  For PrivateSubnet2CIDR, enter the IP range (CIDR notation) for the private subnet in the second Availability Zone.

    6. For PrivateSubnet3CIDR, enter the IP range (CIDR notation) for the private subnet in the third Availability Zone.

    7. For PublicSubnet1CIDR, enter the IP range (CIDR notation) for the public subnet in the first Availability Zone.

    8. For PublicSubnet2CIDR, enter the IP range (CIDR notation) for the public subnet in the second Availability Zone.

  4. Choose Next

  5. Review the template and choose Create stack.

After the stack is deployed successfully, you can view the resources on the stack’s Outputs tab on the AWS CloudFormation console. Note the ALB URL, Amazon Cognito user pool ID, and domain.

 

Integrate the Amazon MWAA application with the Azure enterprise application

Next, you configure the SAML configuration in the enterprise application by adding the SP details and redirect URLs (in this case, the Amazon Cognito details and ALB URL).

  1. In the Azure portal, navigate to your environment.
  2. Choose Set up single sign on.
  3. For Identifier, enter urn:amazon:cognito:sp:<your cognito user_id>.
  4. For Reply URL, enter https://<Your user pool domain>/saml2/idpresponse.
  5. For Sign on URL, enter https://<Your application load balancer DNS>.
  6. In the Attributes & Claims section, choose Add a group claim.
  7. Select Security groups.
  8. For Source attribute, choose Group ID.
  9. Choose Save.
  10. Note the values for App Federation Metadata Url and Login URL.


Deploy the ALB stack

When the SAML configuration is complete on the Azure end, the IdP details have to be configured in Amazon Cognito. When users access the ALB URL, they will be authenticated against the corporate identity using SAML through Amazon Cognito. After they’re authenticated, they’re redirected to the Lambda function for authorization against the group they belong to. The user’s group is then validated against matching IAM role. If it’s valid, the Lambda function adds the web login token to the URL, and the user will gain access to the Amazon MWAA environment.

This CloudFormation stack creates the following resources:

  • Two target groups: the Lambda target group and Amazon MWAA target group
  • Listener rules for the ALB to redirect URL requests to the relevant target groups
  • A user pool client and SAML provider (Azure) details to the Amazon Cognito user pool
  • IAM roles for Admin, User, and Viewer personas required for Airflow
  • The Lambda authorizer function to validate the JWT token and map Azure groups to IAM roles for appropriate Airflow UI access

Deploy the stack by completing the following steps:

  1. Choose Launch Stack to launch the CloudFormation stack:
  2. For Stack name, enter a name (for example, sso-blog-mwaa-alb-stack).

  3. Enter the following parameters:

    1. For MWAAEnvironmentName, enter your environment name.

    2. For ALBCertificateArn, enter the certificate ARN required for ALB. 

    3. For AzureAdminGroupID, enter the group name for the Azure Admin persona.

    4. For AzureUserGroupID, enter the group name for the Azure User persona.

    5. For AzureViewerGroupID, enter the group name for the Azure Viewer persona.

    6. For EntraIDLoginURL, enter the Azure IdP URI.

    7. For AppFederationMetadataURL, enter the URL of the metadata file for the SAML provider. 

  4. Choose Next.

  5. Review the template and choose Create stack.

Test the solution

Now that the SAML configuration and relevant AWS services are created, it’s time to access the Amazon MWAA environment.

  1. Open your web browser and enter the ALB DNS name.
    The SP initiates the sign-in request process and the browser redirects you to the Microsoft login page for credentials.
  2. Enter the Admin user credentials.

    The SAML request sign-in process completes and the SAML response is redirected to the Amazon Cognito user pool attached to the ALB.

    The listener rules will validate the query URL and pass the requests to the Lambda authorizer to validate the JWT and assign the appropriate group (Azure) to role (AWS) mapping.


  3. Repeat the steps to log in with User and Viewer credentials and observe the differences in access.

Clean up

When you’re done experimenting with this solution, it’s essential to clean up your resources to avoid incurring AWS charges.

  1. On the AWS CloudFormation console, delete the stacks you created.
  2. Remove the SSM parameters and private webserver and database VPC endpoints (created by the Lambda events function):
    aws ssm delete-parameters --names "MyFirstParameter" "MySecondParameter"
    aws ec2 delete-vpc-endpoints --vpc-endpoint-ids "Endpoint1" "Endpoint2"

  3. Delete the users, groups, and enterprise application in the Azure environment.

Conclusion

In this post, we demonstrated how to integrate Amazon MWAA with organization Azure AD services. We walked through the solution that solves this problem using infrastructure as code. This solution allows different end-user personas in your organization to access the Amazon MWAA Airflow UI using SAML SSO.

For additional details and code examples for Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.


About the Authors

Satya Chikkala is a Solutions Architect at Amazon Web Services. Based in Melbourne, Australia, he works closely with enterprise customers to accelerate their cloud journey. Beyond work, he is very passionate about nature and photography.

Vijay Velpula is a Data Lake Architect with AWS Professional Services. He assists customers in building modern data platforms by implementing big data and analytics solutions. Outside of his professional responsibilities, Velpula enjoys spending quality time with his family, as well as indulging in travel, hiking, and biking activities.

Federating access to Amazon DataZone with AWS IAM Identity Center and Okta

Post Syndicated from Carlos Gallegos original https://aws.amazon.com/blogs/big-data/federating-access-to-amazon-datazone-with-aws-iam-identity-center-and-okta/

Many customers rely today on Okta or other identity providers (IdPs) to federate access to their technology stack and tools. With federation, security teams can centralize user management in a single place, which helps simplify and brings agility to their day-to-day operations while keeping highest security standards.

To help develop a data-driven culture, everyone inside an organization can use Amazon DataZone. To realize the benefits of using Amazon DataZone for governing data and making it discoverable and available across different teams for collaboration, customers integrate it with their current technology stack. Handling access through their identity provider and preserving a familiar single sign-on (SSO) experience enables customers to extend the use of Amazon DataZone to users across teams in the organization without any friction while keeping centralized control.

Amazon DataZone is a fully managed data management service that makes it faster and simpler for customers to catalog, discover, share, and govern data stored across Amazon Web Services (AWS), on premises, and third-party sources. It also makes it simpler for data producers, analysts, and business users to access data throughout an organization so that they can discover, use, and collaborate to derive data-driven insights.

You can use AWS IAM Identity Center to securely create and manage identities for your organization’s workforce, or sync and use identities that are already set up and available in Okta or other identity provider, to keep centralized control of them. With IAM Identity Center you can also manage the SSO experience of your organization centrally, across your AWS accounts and applications.

This post guides you through the process of setting up Okta as an identity provider for signing in users to Amazon DataZone. The process uses IAM Identity Center and its native integration with Amazon DataZone to integrate with external identity providers. Note that, even though this post focuses on Okta, the presented pattern relies on the SAML 2.0 standard and so can be replicated with other identity providers.

Prerequisites

To build the solution presented in this post, you must have:

Process overview

Throughout this post you’ll follow these high-level steps:

  1. Establish a SAML connection between Okta and IAM Identity Center
  2. Set up automatic provisioning of users and groups in IAM Identity Center so that users and groups in the Okta domain are created in Identity Center.
  3. Assign users and groups to your AWS accounts in IAM Identity Center by assuming an AWS Identity and Access Management (IAM) role.
  4. Access the AWS Management Console and Amazon DataZone portal through Okta SSO.
  5. Manage Amazon DataZone specific permissions in the Amazon DataZone portal.

Setting up user federation with Okta and IAM Identity Center

This guide follows the steps in Configure SAML and SCIM with Okta and IAM Identity Center.

Before you get started, review the following items in your Okta setup:

  • Every Okta user must have a First name, Last name, Username and Display name value specified.
  • Each Okta user has only a single value per data attribute, such as email address or phone number. Users that have multiple values will fail to synchronize. If there are users that have multiple values in their attributes, remove the duplicate attributes before attempting to provision the user in IAM Identity Center. For example, only one phone number attribute can be synchronized. Because the default phone number attribute is work phone, use the work phone attribute to store the user’s phone number, even if the phone number for the user is a home phone or a mobile phone.
  • If you update a user’s address you must have streetAddress, city, state, zipCode and the countryCode value specified. If any of these values aren’t specified for the Okta user at the time of synchronization, the user (or changes to the user) won’t be provisioned.

Okta account

1) Establish a SAML connection between Okta and AWS IAM Identity Center

Now, let’s establish a SAML connection between Okta and AWS IAM Identity Center. First, you’ll create an application in Okta to establish the connection:

  1. Sign in to the Okta admin dashboard, expand Applications, then select Applications.
  2. On the Applications page, choose Browse App Catalog.
  3. In the search box, enter AWS IAM Identity Center, then select the app to add the IAM Identity Center app.

IAM identity center app in Okta

  1. Choose the Sign On tab.

IAM identity center app in Okta - sign on

  1. Under SAML Signing Certificates, select Actions, and then select View IdP Metadata. A new browser tab opens showing the document tree of an XML file. Select all of the XML from <md:EntityDescriptor> to </md:EntityDescriptor> and copy it to a text file.
  2. Save the text file as metadata.xml.

Identity provider metadata in Okta

Leave the Okta admin dashboard open, you will continue using it in the later steps.

Second, you’re going to set up Okta as an external identity provider in IAM Identity Center:

  1. Open the IAM Identity Center console as a user with administrative privileges.
  2. Choose Settings in the navigation pane.
  3. On the Settings page, choose Actions, and then select Change identity source.

Identity provider source in IAM identity center

  1. Under Choose identity source, select External identity provider, and then choose Next.

Identity provider source in IAM identity center

  1. Under Configure external identity provider, do the following:
    1. Under Service provider metadata, choose Download metadata file to download the IAM Identity Center metadata file and save it on your system. You will provide the Identity Center SAML metadata file to Okta later in this tutorial.
      1. Copy the following items to a text file for easy access (you’ll need these values later):
        • IAM Identity Center Assertion Consumer Service (ACS) URL
        • IAM Identity Center issuer URL
    2. Under Identity provider metadata, under IdP SAML metadata, choose Choose file and then select the metadata.xml file you created in the previous step.
    3. Choose Next.
  2. After you read the disclaimer and are ready to proceed, enter accept.
  3. Choose Change identity source.

Identity provider source in IAM identity center

Leave the AWS console open, because you will use it in the next procedure.

  1. Return to the Okta admin dashboard and choose the Sign On tab of the IAM Identity Center app, then choose Edit.
  2. Under Advanced Sign-on Settings enter the following:
    1. For ACS URL, enter the value you copied for IAM Identity Center Assertion Consumer Service (ACS) URL.
    2. For Issuer URL, enter the value you copied for IAM Identity Center issuer URL.
    3. For Application username format, select one of the options from the drop-down menu.
      Make sure the value you select is unique for each user. For this tutorial, select Okta username.
  3. Choose Save.

IAM identity center app in Okta - sign on

2) Set up automatic provisioning of users and groups in AWS IAM Identity Center

You are now able to set up automatic provisioning of users from Okta into IAM Identity Center. Leave the Okta admin dashboard open and return to the IAM Identity Center console for the next step.

  1. In the IAM Identity Center console, on the Settings page, locate the Automatic provisioning information box, and then choose Enable. This enables automatic provisioning in IAM Identity Center and displays the necessary System for Cross-domain Identity Management (SCIM) endpoint and access token information.

Automatic provisioning in IAM identity center

  1. In the Inbound automatic provisioning dialog box, copy each of the values for the following options:
    • SCIM endpoint
    • Access token

You will use these values to configure provisioning in Okta later.

  1. Choose Close.

Automatic provisioning in IAM identity center

  1. Return to the Okta admin dashboard and navigate to the IAM Identity Center app.
  2. On the AWS IAM Identity Center app page, choose the Provisioning tab, and then in the navigation pane, under Settings, choose Integration.
  3. Choose Edit, and then select the check box next to Enable API integration to enable provisioning.
  4. Configure Okta with the SCIM provisioning values from IAM Identity Center that you copied earlier:
    1. In the Base URL field, enter the SCIM endpoint Make sure that you remove the trailing forward slash at the end of the URL.
    2. In the API Token field, enter the Access token value.
  5. Choose Test API Credentials to verify the credentials entered are valid. The message AWS IAM Identity Center was verified successfully! displays.
  6. Choose Save. You are taken to the Settings area, with Integration selected.

API Integration in Okta

  1. Review the following setup before moving forward. In the Provisioning tab, in the navigation pane under Settings, choose To App. Check that all options are enabled. They should be enabled by default, but if not, enable them.

Application provision in Okta

3) Assign users and groups to your AWS accounts in AWS IAM Identity Center by assuming an AWS IAM role

By default, no groups nor users are assigned to your Okta IAM Identity Center app. Complete the following steps to synchronize users with IAM Identity Center.

  1. In the Okta IAM Identity Center app page, choose the Assignments tab. You can assign both people and groups to the IAM Identity Center app.
    1. To assign people:
      1. In the Assignments page, choose Assign, and then choose Assign to people.
      2. Select the Okta users that you want to have access to the IAM Identity Center app. Choose Assign, choose Save and Go Back, and then choose Done.
        This starts the process of provisioning the individual users into IAM Identity Center.

      Users assignment in Okta

    1. To assign groups:
      1. Choose the Push Groups tab. You can create rules to automatically provision Okta groups into IAM Identity Center.

      Groups assignment in Okta

      1. Choose the Push Groups drop-down list and select Find groups by rule.
      2. In the By rule section, set a rule name and a condition. For this post we’re using AWS SSO Rule as rule name and starts with awssso as a group name condition. This condition can be different depending on the name of the group you want to sync.
      3. Choose Create Rule

      Okta SSO group rule

      1. (Optional) To create a new group choose Directory in the navigation pane, and then choose Groups.

      Group creation in Okta

      1. Choose Add group and enter a name, and then choose Save.

      Group creation in Okta

      1. After you have created the group, you can assign people to it. Select the group name to manage the group’s users.

      Group user assign in Okta

      1. Choose Assign people and select the users that you want to assign to the group.

      Group user assign in Okta

      1. You will see the users that are assigned to the group.

      Group user assign in Okta

      1. Going back to Applications in the navigation pane, select the AWS IAM Identity Center app and choose the Push Groups tab. You should have the groups that match the rule synchronized between Okta and IAM Identity Center. The group status should be set to Active after the group and its members are updated in Identity Center.

      Active groups in Okta

  1. Return to the IAM Identity Center console. In the navigation pane, choose Users. You should see the user list that was updated by Okta.

Active users in IAM identity center

  1. In the left navigation, select Groups, you should see the group list that was updated by Okta.

Active groups in IAM identity center

Congratulations! You have successfully set up a SAML connection between Okta and AWS and have verified that automatic provisioning is working.

OPTIONAL: If you need to provide Amazon DataZone console access to the Okta users and groups, you can manage these permissions through the IAM Identity Center console.

  1. In the IAM Identity Center navigation pane, under Multi-account permissions, choose AWS accounts.
  2. On the AWS accounts page, the Organizational structure displays your organizational root with your accounts underneath it in the hierarchy. Select the checkbox for your management account, then choose Assign users or groups.

IAM Roles in IAM identity center

  1. The Assign users and groups workflow displays. It consists of three steps:
    1. For Step 1: Select users and groups choose the user that will be performing the administrator job function. Then choose Next.
    2. For Step 2: Select permission sets choose Create permission set to open a new tab that steps you through the three sub-steps involved in creating a permission set.
      1. For Step 1: Select permission set type complete the following:
        • In Permission set type, choose Predefined permission set.
        • In Policy for predefined permission set, choose AdministratorAccess.
      2. Choose Next.
      3. For Step 2: Specify permission set details, keep the default settings, and choose Next.
        The default settings create a permission set named AdministratorAccess with session duration set to one hour. You can also specify reduced permissions with a custom policy just to allow Amazon DataZone console access.
      4. For Step 3: Review and create, verify that the Permission set type uses the AWS managed policy AdministratorAccess or your custom policy. Choose Create. On the Permission sets page, a notification appears informing you that the permission set was created. You can close this tab in your web browser now.
  2. On the Assign users and groups browser tab, you are still on Step 2: Select permission sets from which you started the create permission set workflow.
  3. In the Permissions sets area, Refresh. The AdministratorAccess permission or your custom policy set you created appears in the list. Select the checkbox for that permission set, and then choose Next.

IAM Roles in IAM identity center

    1. For Step 3: Review and submit review the selected user and permission set, then choose Submit.
      The page updates with a message that your AWS account is being configured. Wait until the process completes.
    2. You are returned to the AWS accounts page. A notification message informs you that your AWS account has been re-provisioned, and the updated permission set is applied. When a user signs in, they will have the option of choosing the AdministratorAccess role or a custom policy role.

4) Access the AWS console and Amazon DataZone portal through Okta SSO

Now, you can test your user access into the console and Amazon DataZone portal using the Okta external identity application.

  1. Sign in to the Okta dashboard using a test user account.
  2. Under My Apps, select the AWS IAM Identity Center icon.

IAM identity center access in Okta

  1. Complete the authentication process using your Okta credentials.

IAM identity center access in Okta

4.1) For administrative users

  1. You’re signed in to the portal and can see the AWS account icon. Expand that icon to see the list of AWS accounts that the user can access. In this tutorial, you worked with a single account, so expanding the icon only shows one account.
  2. Select the account to display the permission sets available to the user. In this tutorial you created the AdministratorAccess permission set.
  3. Next to the permission set are links for the type of access available for that permission set. When you created the permission set, you specified both management console and programmatic access be enabled, so those two options are present. Select Management console to open the console.

AWS Management console

  1. The user is signed in to the console. Using the search bar, look for Amazon DataZone service and open it.
  2. Open the Amazon DataZone console and make sure you have enabled SSO users through IAM Identity Center. In case you haven’t, you can follow the steps in Enable IAM Identity Center for Amazon DataZone.

Note: In this post, we followed the default IAM Identity Center for Amazon DataZone configuration, which has implicit user assignment mode enabled. With this option, any user added to your Identity Center directory can access your Amazon DataZone domain automatically. If you opt for using explicit user assignment instead, remember that you need to manually add users to your Amazon DataZone domain in the Amazon DataZone console for them to have access.
To learn more about how to manage user access to an Amazon DataZone domain, see Manage users in the Amazon DataZone console.

  1. Choose the Open data portal to access the Amazon DataZone Portal.

DataZone console

4.2) For all other users

  1. Choose the Applications tab in the AWS access portal window and choose the Amazon DataZone data portal application link.

DataZone application

  1. In the Amazon DataZone data portal, choose SIGN IN WITH SSO to continue

DataZone portal

Congratulations! Now you’re signed in to the Amazon DataZone data portal using your user that’s managed by Okta.

DataZone portal

5) Manage Amazon DataZone specific permissions in the Amazon DataZone portal

After you have access to the Amazon DataZone portal, you can work with projects, the data assets within, environments, and other constructs that are specific to Amazon DataZone. A project is the overarching construct that brings together people, data, and analytics tools. A project has two roles: owner and contributor. Next, you’ll learn how a user can be made an owner or contributor of existing projects.

These steps must be completed by the existing project owner in the Amazon DataZone portal:

  1. Open the Amazon DataZone portal, select the project in the drop-down list on the left top of the portal and choose the project you own

DataZone project

  1. In the project window, choose the Members tab to see the current users in the project and add a new one.

DataZone project members

  1. Choose Add Members to add a new user. Make sure the User type is SSO User to add an Okta user. Look for the Okta user in the name drop-down list, select it, and select a project role for it. Finally, choose Add Members to add the user.

DataZone project members

  1. The Okta user has been granted the selected project role and can interact with the project, assets, and tools.

DataZone project members

  1. You can also grant permissions to SSO Groups. Choose Add members, then select SSO group in the drop-down list, next select the Group name, set the assigned project role, and choose Add Members.

DataZone project members

  1. The Okta group has been granted the project role and can interact with the project, assets, and tools.

DataZone project members

You can also manage SSO user and group access to the Amazon DataZone data portal from the console. See Manage users in the Amazon DataZone console for additional details.

Clean up

To ensure a seamless experience and avoid any future charges, we kindly request that you follow these steps:

By following these steps, you can effectively clean up the resources utilized in this blog post and prevent any unnecessary charges from accruing.

Summary

In this post, you followed a step-by-step guide to set up and use Okta to federate access to Amazon DataZone with AWS IAM Identity Center. You also learned how to group users and manage their permission in Amazon DataZone. As a final thought, now that you’re familiar with the elements involved in the integration of an external identity provider such as Okta to federate access to Amazon DataZone, you’re ready to try it with other identity providers.

To learn more about, see Managing Amazon DataZone domains and user access.


About the Authors

Carlos Gallegos is a Senior Analytics Specialist Solutions Architect at AWS. Based in Austin, TX, US. He’s an experienced and motivated professional with a proven track record of delivering results worldwide. He specializes in architecture, design, migrations, and modernization strategies for complex data and analytics solutions, both on-premises and on the AWS Cloud. Carlos helps customers accelerate their data journey by providing expertise in these areas. Connect with him on LinkedIn.

Jose Romero is a Senior Solutions Architect for Startups at AWS. Based in Austin, TX, US. He’s passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn.

Arun Pradeep Selvaraj is a Senior Solutions Architect at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build, and reinvent. He is creative, fast-paced, deeply customer-obsessed and uses the working backwards process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.

How to deploy an Amazon OpenSearch cluster to ingest logs from Amazon Security Lake

Post Syndicated from Kevin Low original https://aws.amazon.com/blogs/security/how-to-deploy-an-amazon-opensearch-cluster-to-ingest-logs-from-amazon-security-lake/

January 30, 2025: This post was republished to make the instructions clearer and compatible with OCSF 1.1.


Customers often require multiple log sources across their AWS environment to empower their teams to respond and investigate security events. In part one of this two-part blog post, I show you how you can use Amazon OpenSearch Service to ingest logs collected by Amazon Security Lake to facilitate near real-time monitoring.

Many customers use Security Lake to automatically centralize security data from Amazon Web Services (AWS) environments, software as a service (SaaS) providers, on-premises workloads, and cloud sources into a purpose-built data lake in their AWS environment. OpenSearch Service is a managed service that customers can use to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. It natively integrates with Security Lake to enable customers to perform interactive log analytics and searches across large datasets, create enterprise visualization and dashboards, and perform analysis across disparate applications and logs. With Amazon OpenSearch Security Analytics, customers can also gain visibility into the security posture of their organization’s infrastructure, monitor for anomalous activity, detect potential security threats in near real time, and initiate alerts to pre-configured destinations.

Without using Amazon OpenSearch Service, customers would need to build, deploy and manage infrastructure for an analytics solution, such as an ELK stack.

Prerequisites

Security Lake should already be deployed. For details on how to deploy Security Lake, see Getting started with Amazon Security Lake. You will need AWS Identity and Access Management (IAM) permissions to manage Security Lake, OpenSearch Service, Amazon Cognito, AWS Secrets Manager, and Amazon Elastic Compute Cloud (Amazon EC2), and to create IAM roles to follow along with this post. The solution can be deployed in any AWS Region that has at least 3 Availability Zones, supports Security Lake, OpenSearch, and OpenSearch Ingestion.

Solution overview

The architecture diagram in Figure 1 shows the completed architecture of the solution.

  1. The OpenSearch Service cluster is deployed within a virtual private cloud (VPC) across three Availability Zones for high availability.
  2. The OpenSearch Service cluster ingests logs from Security Lake using an OpenSearch Ingestion pipeline.
  3. The cluster is accessed by end users through a public-facing proxy hosted on an Amazon EC2 instance.
    1. To reduce costs, the template doesn’t deploy a dead letter queue (DLQ) for the OpenSearch Ingestion pipeline. You can add one later if you want.
    2. Instead of a public facing proxy, you can deploy a VPN to access your cluster.
  4. Authentication to the cluster is managed with Amazon Cognito.

Figure 1: Solution architecture

Figure 1: Solution architecture

Planning the deployment

This section will help you plan your OpenSearch service deployment, including what nodes you should choose, the amount of storage to allocate, and where to deploy the cluster.

Deciding instances for the OpenSearch Service master and data nodes

First, determine what instance type to use for the master and data nodes. If your workload generates less than 100 GB of Security Lake logs per day, we recommend using three m6g.large.search master nodes and three r6g.large.search data nodes. You can start small and scale up or scale out later. For more information about deciding the size and number of instances, see Get started with Amazon OpenSearch Service. Note the instance types that you have selected on a text editor because you will use this as an input for the AWS CloudFormation template that you will deploy later.

Configuring storage

To optimize your storage costs, you need to plan your data strategy. In this architecture, Security Lake is used for long-term log storage. Because Security Lake uses Amazon Simple Storage Service (Amazon S3), you can optimize long-term storage costs. You can configure OpenSearch Service to ingest priority logs based on the recent data that you can use for near-real time detection and alerting. Your team can query logs in Security Lake using its Zero-ETL integration with OpenSearch Service to analyze older logs.

Therefore, Security Lake should serve as your primary long-term log storage, with OpenSearch Service storing only the most recent logs.

The number of days of logs in OpenSearch Service will depend on how many days’ worth of data you need to investigate at a given time. I recommend storing 15 days of data in OpenSearch Service. This allows you to react to and investigate the most immediate security events while optimizing storage costs for older logs.

The next step is to determine the volume of logs generated by Security Lake.

  1. Sign in to the Security Lake delegated administrator account.
  2. Go to the AWS Management Console for Security Lake. Choose Usage in the navigation pane.
  3. On the Usage screen, select Last 30 days as the range of usage.
  4. Add the total Actual usage for the last 30 days for the data sources that you intend to send to OpenSearch. If you have used Security Lake for less than 30 days, you can use the Total predicted usage per month. Divide this figure by 30 to get the daily data volume.

Figure 2: Select range of usage

Figure 2: Select range of usage

To determine the total storage needed, multiply the data generated by Security Lake per day by the retention period you chose, then by 1.1 to account for the indexes, then multiply that number by 1.15 for overhead storage. For more information about calculating storage, see Get started with Amazon OpenSearch Service.

To determine the amount of Amazon Elastic Block Store (Amazon EBS) storage that you need per node, take the total amount of storage and divide it by the number of nodes that you have. Round that number up to the nearest whole number. You can increase the amount of storage after deployment when you have a better understanding of your workload. Make a note of this number in a text editor because you’ll use it as an input in the CloudFormation template later.

Example 1: 10 GB of Security Lake logs generated per day, stored for 30 days in OpenSearch Service in three nodes

  • 10 GB of Security Lake logs stored for 30 days = 10 GB * 30 = 300 GB
  • Account for additional space for indexes and overhead space = 300 GB * 1.1 * 1.15 = 379.5 GB
  • Divide the storage required across three nodes, rounded up = 379.5/3 ≈ 127 GB per node
  • You would need 127 GB per node in OpenSearch Service

Example 2: 200 GB of Security Lake logs generated per day, stored for 15 days in OpenSearch Service across six nodes

  • 200 GB of Security Lake logs stored for 15 days = 200 GB * 15 = 3000 GB
  • Account for additional space for indexes and overhead space = 3000 GB * 1.1 * 1.15 = 3795 GB
  • Divide the storage required across three nodes, rounded up = 3795/6 ≈ 633 GB per node
  • You would need 633 GB per node in OpenSearch Service

Where to deploy the cluster?

If you have an AWS Control Tower deployment or have a deployment modelled after the AWS Security Reference Architecture (AWS SRA), Security Lake should be deployed in the Log Archive account. Because security best practices recommend that the Log Archive account should not be frequently accessed, the OpenSearch Service cluster should be deployed into your Audit account or Security Tooling account.

You need to deploy your Security Lake subscriber in the same Region as your Security Lake roll-up Region. If you have more than one roll-up Region, choose the Region that collects logs from the Regions you want to monitor.

Your cluster needs to be deployed in the same Region as your Security Lake subscriber be able to access data.

Setting up the Security Lake subscriber

Before deploying the solution, create a Security Lake subscriber in your Security Lake roll-up Region so that OpenSearch Service can access data from Amazon Security Lake.

  1. Access the Security Lake console in your Log Archive account.
  2. Choose Subscribers in the navigation pane.
  3. Choose Create subscriber.
  4. On the Create subscriber page, enter a name, such as OpenSearch-subscriber.
  5. Under Data Access, select Under S3 notification type, select SQS queue.
  6. Under Subscriber credentials, enter the AWS account ID for the account you plan to deploy the OpenSearch cluster to, which should be your Security Tooling
  7. Enter OpenSearchIngestion-<AWS account ID> under External ID.

    Figure 3: Configuring the Security Lake subscriber

    Figure 3: Configuring the Security Lake subscriber

  8. Leave All log and event sources selected and choose Create.

After the subscriber has been created, you will need to collect information to facilitate the deployment.

To gather necessary information:

  1. Select the subscriber that you just created.
  2. Derive the S3 bucket name from the S3 bucket ARN and store it in a text editor. The Amazon Resource Name (ARN) is formatted as arn:aws:s3:::<bucket name>. The bucket name should look like aws-security-data-lake-<region>-xxxxx.

    Figure 4: Derive the S3 bucket name from the Subscriber details page

    Figure 4: Derive the S3 bucket name from the Subscriber details page

  3. Go to the Amazon Simple Queue Service (Amazon SQS) console and select the SQS queue created as part of the Security Lake subscriber. It should look like AmazonSecurityLake-xxxxxxxxx-Main-Queue. Note the queue’s ARN and URL in your text editor.

    Figure 5: Relevant details from the SQS queue

    Figure 5: Relevant details from the SQS queue

Deploy the solution

To deploy the solution in your Security Tooling account, use a CloudFormation template. This template deploys the OpenSearch Service cluster, OpenSearch Ingestion pipeline, and an AWS Lambda function to initialize the cluster.

To deploy the OpenSearch cluster:

  1. To deploy the CloudFormation template that builds the OpenSearch service cluster, select the Launch Stack button.

    Select this image to open a link that starts building the CloudFormation stack

  2. In the CloudFormation console, make sure that you are in the correct AWS account. You should be in your Security Tooling account. Also make sure that you have selected the same Region as your Security Lake subscriber.
  3. Enter a name for your stack. A name like os-stack-<day>-<month> can help you keep track of deployments.
  4. Enter the instance types and Amazon EBS volume size that you noted earlier.
  5. Enter the IP address range that you want to allow to access the proxy’s security group. You should limit this to your corporate IP range. You can set it as 0.0.0/0 if you want to expose it to the public internet.
  6. Fill in the details of the Security Lake bucket and the subscriber Amazon SQS queue ARN, URL, and Region.

    Figure 6: Add stack parameters

    Figure 6: Add stack parameters

  7. Check the acknowledgements in the Capabilities section.
  8. Choose Create stack to begin deploying the resources.
  9. It will take 20–30 minutes to deploy the multiple nested templates. Wait for the main stack (not the nested ones) to achieve the CREATE_COMPLETE status before proceeding to the next step.

    Note: If you encounter failures while deployment, you can download the CloudFormation file here and select Preserve successfully provisioned resources under Stack failure options while deploying. This will allow you to troubleshoot the stack deployment.

  10. Go to the Outputs pane of the main CloudFormation stack. Save the DashboardsProxyURL, OpenSearchInitRoleARN, and PipelineRole values in a text editor to refer to later.

    Figure 7: The stacks in the CREATE_COMPLETE state with the outputs panel shown

    Figure 7: The stacks in the CREATE_COMPLETE state with the outputs panel shown

  11. Open the DashboardsProxyURL value in a new tab.

    Note: Because the proxy relies on a self-signed certificate, you will get an insecure certificate warning. You can safely ignore this warning and proceed. For a production workload, you should issue a trusted private certificate from your internal public key infrastructure or use AWS Private Certificate Authority.

  12. You will be presented with the Amazon Cognito sign-in page. Use administrator as the username.
  13. Access Secrets Manager to find the password. Select the secret that was created as part of the stack.

    Figure 9: Retrieve the secret value

    Figure 8: The Cognito password in Secrets Manager

  14. Choose Retrieve secret value to get the password.

    Figure 9: Retrieve the secret value

    Figure 9: Retrieve the secret value

  15. After signing in, you will be prompted to change your password and will be redirected to the OpenSearch dashboard.
  16. If you see a pop-up that states Start by adding your own data, select Explore on my own. On the next page, Introducing new OpenSearch Dashboards look & feel, choose Dismiss.
  17. If you see a pop-up that states Select your tenant, select Global, and then choose Confirm.

    Figure 10: Select and confirm your tenant

    Figure 10: Select and confirm your tenant

To initialize the OpenSearch cluster:

  1. Choose the menu icon (three stacked horizontal lines) on the top left and select Security under the Management section.

    Figure 11: Navigating to the Security page in the OpenSearch console

    Figure 11: Navigating to the Security page in the OpenSearch console

  2. Select Roles. On the Roles page, search for the all_access role and select it.
  3. Select Mapped users, and then select Manage mapping.
  4. On the Map user screen, choose Add another backend role. Paste the value for the OpenSearchInitRoleARN from the list of CloudFormation outputs. Choose Map.

    Figure 12: Mapping the role on the Security page in the OpenSearch console

    Figure 12: Mapping the role on the Security page in the OpenSearch console

  5. Leave this tab open and return to the AWS Management console. Go to the AWS Lambda console and select the function named xxxxxx-OS_INIT.
  6. In the function screen, choose Test, and then Create new test event.

    Figure 13: Creating the test event in the Lambda console

    Figure 13: Creating the test event in the Lambda console

  7. Choose Invoke. The function should run for about 30 seconds. The execution results should show the component templates that have been created. This Lambda function creates the component and index templates to ingest Open Cybersecurity Framework (OCSF) formatted data, a set of indices and aliases that correspond with the OCSF classes generated by Security Lake, and a rollover policy that will rollover the index daily or if it becomes larger than 40 GB.

    Figure 14: Invoking the Lambda function in the Lambda console

    Figure 14: Invoking the Lambda function in the Lambda console

To set up the pipeline

  1. Return to the Map user page on the OpenSearch console.
  2. Choose Add another backend role. Paste the value of the PipelineRole from the CloudFormation template output. Choose This will allow the OpenSearch Ingestion to write to the cluster.

    Figure 15: Mapping the OpenSearch Ingestion role

    Figure 15: Mapping the OpenSearch Ingestion role

  3. Access the Amazon S3 console in the Log Archive account where Security Lake is hosted.
  4. Select the Security Lake bucket in your roll-up Region. It should look like aws-security-data-lake-region-xxxxxxxxxx.
  5. Choose Permissions, then Edit under Bucket policy.
  6. Add this policy to the end of the existing bucket policy. Replace the Principal with the ARN of the PipelineRole and the name of your Security Lake bucket in the Resource section.
    {
                "Sid": "Cross Account Permissions",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "<Pipeline role ARN>"
                },
                "Action": "s3:*",
                "Resource": [
                    "arn:aws:s3:::<Security Lake bucket name>/*",
                    "arn:aws:s3:::<Security Lake bucket name>"
                ]
            }

    Figure 16: The modified S3 bucket access policy

    Figure 16: The modified S3 bucket access policy

  7. Choose Save changes.

To upload the index patterns and dashboards

  1. Download the Security-lake-objects.ndjson file by right-clicking on this link and selecting Save link as.
  2. Access the Dashboards Management page through the navigation menu.
  3. Choose Saved objects in the navigation pane.
  4. On the Saved Objects page, choose Import on the right side of the screen.

    Figure 17: Import saved objects

    Figure 17: Import saved objects

  5. Choose Import and select the Security-lake-objects.ndjson file that you downloaded previously.
  6. Leave Create new objects with unique IDs selected and choose Import.
  7. You can now view the ingested logs on the Discover page and visualizations on the Dashboards page, which you can find on the navigation bar.

    Figure 18: The Discover page displaying ingested logs

    Figure 18: The Discover page displaying ingested logs

Clean up

To avoid unwanted charges, delete the main CloudFormation template, named os-stack-<day>-<month> (not the nested stacks).

Figure 19: Select the main stack in the CloudFormation console

Figure 19: Select the main stack in the CloudFormation console

Modify the Security Lake bucket policy in the logging account to remove the section you added that trusted the PipelineRole. Be careful not to modify the rest of the policy because it could impact the functioning of Security Lake and other subscribers.

Figure 20: The S3 bucket policy with the relevant sections that needed to be deleted

Figure 20: The S3 bucket policy with the relevant sections that needed to be deleted

Conclusion

In this post, you learned how to plan an OpenSearch deployment with Amazon OpenSearch Service to ingest logs from Amazon Security Lake. With this solution, you’re able to aggregate and manage logs with Security Lake and visualize and monitor those logs with OpenSearch Service. After deployment, monitor the OpenSearch Service metrics to determine if you need to scale this up or out for improved performance. In part 2, I will show you how to set up the Security Analytics detector to generate alerts to security findings in near-real time.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Kevin Low
Kevin Low

Kevin is a Security Solutions Architect at AWS who helps the largest customers across ASEAN build securely. He specializes in threat detection and incident response and is passionate about integrating resilience and security. Outside of work, he loves spending time with his wife and dog, a poodle called Noodle.

Get started with the new Amazon DataZone enhancements for Amazon Redshift

Post Syndicated from Carmen Manzulli original https://aws.amazon.com/blogs/big-data/get-started-with-the-new-amazon-datazone-enhancements-for-amazon-redshift/

In today’s data-driven landscape, organizations are seeking ways to streamline their data management processes and unlock the full potential of their data assets, while controlling access and enforcing governance. That’s why we introduced Amazon DataZone.

Amazon DataZone is a powerful data management service that empowers data engineers, data scientists, product managers, analysts, and business users to seamlessly catalog, discover, analyze, and govern data across organizational boundaries, AWS accounts, data lakes, and data warehouses.

On March 21, 2024, Amazon DataZone introduced several exciting enhancements to its Amazon Redshift integration that simplify the process of publishing and subscribing to data warehouse assets like tables and views, while enabling Amazon Redshift customers to take advantage of the data management and governance capabilities or Amazon DataZone.

These updates empower the experience for both data users and administrators.

Data producers and consumers can now quickly create data warehouse environments using preconfigured credentials and connection parameters provided by their Amazon DataZone administrators.

Additionally, these enhancements grant administrators greater control over who can access and use the resources within their AWS accounts and Redshift clusters, and for what purpose.

As an administrator, you can now create parameter sets on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and an AWS secret. You can use these parameter sets to create environment profiles and authorize Amazon DataZone projects to use these environment profiles for creating environments.

In turn, data producers and data consumers can now select an environment profile to create environments without having to provide the parameters themselves, saving time and reducing the risk of issues.

In this post, we explain how you can use these enhancements to the Amazon Redshift integration to publish your Redshift tables to the Amazon DataZone data catalog, and enable users across the organization to discover and access them in a self-service fashion. We present a sample end-to-end customer workflow that covers the core functionalities of Amazon DataZone, and include a step-by-step guide of how you can implement this workflow.

The same workflow is available as video demonstration on the Amazon DataZone official YouTube channel.

Solution overview

To get started with the new Amazon Redshift integration enhancements, consider the following scenario:

  • A sales team acts as the data producer, owning and publishing product sales data (a single table in a Redshift cluster called catalog_sales)
  • A marketing team acts as the data consumer, needing access to the sales data in order to analyze it and build product adoption campaigns

At a high level, the steps we walk you through in the following sections include tasks for the Amazon DataZone administrator, Sales team, and Marketing team.

Prerequisites

For the workflow described in this post, we assume a single AWS account, a single AWS Region, and a single AWS Identity and Access Management (IAM) user, who will act as Amazon DataZone administrator, Sales team (producer), and Marketing team (consumer).

To follow along, you need an AWS account. If you don’t have an account, you can create one.

In addition, you must have the following resources configured in your account:

  • An Amazon DataZone domain with admin, sales, and marketing projects
  • A Redshift namespace and workgroup

If you don’t have these resources already configured, you can create them by deploying an AWS CloudFormation stack:

  1. Choose Launch Stack to deploy the provided CloudFormation template.
  2. For AdminUserPassword, enter a password, and take note of this password to use in later steps.
  3. Leave the remaining settings as default.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
  5. When the stack deployment is complete, on the Amazon DataZone console, choose View domains in the navigation pane to see the new created Amazon DataZone domain.
  6. On the Amazon Redshift Serverless console, in the navigation pane, choose Workgroup configuration and see the new created resource.

You should be logged in using the same role that you used to deploy the CloudFormation stack and verify that you’re in the same Region.

As a final prerequisite, you need to create a catalog_sales table in the default Redshift database (dev).

  1. On the Amazon Redshift Serverless console, selected your workgroup and choose Query data to open the Amazon Redshift query editor.
  2. In the query editor, choose your workgroup and select Database user name and password as the type of connection, then provide your admin database user name and password.
  3. Use the following query to create the catalog_sales table, which the Sales team will publish in the workflow:
    CREATE TABLE catalog_sales AS 
    SELECT 146776932 AS order_number, 23 AS quantity, 23.4 AS wholesale_cost, 45.0 as list_price, 43.0 as sales_price, 2.0 as discount, 12 as ship_mode_sk,13 as warehouse_sk, 23 as item_sk, 34 as catalog_page_sk, 232 as ship_customer_sk, 4556 as bill_customer_sk
    UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
    UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
    UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
    UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
    UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
    UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
    UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
    UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
    UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
    UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Now you’re ready to get started with the new Amazon Redshift integration enhancements.

Amazon DataZone administrator tasks

As the Amazon DataZone administrator, you perform the following tasks:

  1. Configure the DefaultDataWarehouseBlueprint.
    • Authorize the Amazon DataZone admin project to use the blueprint to create environment profiles.
    • Create a parameter set on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and AWS secret.
  2. Set up environment profiles for the Sales and Marketing teams.

Configure the DefaultDataWarehouseBlueprint

Amazon DataZone blueprints define what AWS tools and services are provisioned to be used within an Amazon DataZone environment. Enabling the data warehouse blueprint will allow data consumers and data producers to use Amazon Redshift and the Query Editor for data sharing, accessing, and consuming.

  1. On the Amazon DataZone console, choose View domains in the navigation pane.
  2. Choose your Amazon DataZone domain.
  3. Choose Default Data Warehouse.

If you used the CloudFormation template, the blueprint is already enabled.

Part of the new Amazon Redshift experience involves the Managing projects and Parameter sets tabs. The Managing projects tab lists the projects that are allowed to create environment profiles using the data warehouse blueprint. By default, this is set to all projects. For our purpose, let’s grant only the admin project.

  1. On the Managing projects tab, choose Edit.

  1. Select Restrict to only managing projects and choose the AdminPRJ project.
  2. Choose Save changes.

With this enhancement, the administrator can control which projects can use default blueprints in their account to create environment profile

The Parameter sets tab lists parameters that you can create on top of DefaultDataWarehouseBlueprint by providing parameters such as Redshift cluster or Redshift Serverless workgroup name, database name, and the credentials that allow Amazon DataZone to connect to your cluster or workgroup. You can also create AWS secrets on the Amazon DataZone console. Before these enhancements, AWS secrets had to be managed separately using AWS Secrets Manager, making sure to include the proper tags (key-value) for Amazon Redshift Serverless.

For our scenario, we need to create a parameter set to connect a Redshift Serverless workgroup containing sales data.

  1. On the Parameter sets tab, choose Create parameter set.
  2. Enter a name and optional description for the parameter set.
  3. Choose the Region containing the resource you want to connect to (for example, our workgroup is in us-east-1).
  4. In the Environment parameters section, select Amazon Redshift Serverless.

If you already have an AWS secret with credentials to your Redshift Serverless workgroup, you can provide the existing AWS secret ARN. In this case, the secret must be tagged with the following (key-value): AmazonDataZoneDomain: <Amazon DataZone domain ID>.

  1. Because we don’t have an existing AWS secret, we create a new one by choosing Create new AWS Secret.
  2. In the pop-up, enter a secret name and your Amazon Redshift credentials, then choose Create new AWS Secret.

Amazon DataZone creates a new secret using Secrets Manager and makes sure the secret is tagged with the domain in which you’re creating the parameter set.

  1. Enter the Redshift Serverless workgroup name and database name to complete the parameters list. If you used the provided CloudFormation template, use sales-workgroup for the workgroup name and dev for the database name.
  2. Choose Create parameter set.

You can see the parameter set created for your Redshift environment and the blueprint enabled with a single managing project configured.

 

Set up environment profiles for the Sales and Marketing teams

Environment profiles are predefined templates that encapsulate technical details required to create an environment, such as the AWS account, Region, and resources and tools to be added to projects. The next Amazon DataZone administrator task consists of setting up environment profiles, based on the default enabled blueprint, for the Sales and Marketing teams.

This task will be performed from the admin project in the Amazon DataZone data portal, so let’s follow the data portal URL and start creating an environment profile for the Sales team to publish their data.

  1. On the details page of your Amazon DataZone domain, in the Summary section, choose the link for your data portal URL.

When you open the data portal for the first time, you’re prompted to create a project. If you used the provided CloudFormation template, the projects are already created.

  1. Choose the AdminPRJ project.
  2. On the Environments page, choose Create environment profile.
  3. Enter a name (for example, SalesEnvProfile) and optional description (for example, Sales DWH Environment Profile) for the new environment profile.
  4. For Owner, choose AdminPRJ.
  5. For Blueprint, select the DefaultDataWarehouse blueprint (you’ll only see blueprints where the admin project is listed as a managing project).
  6. Choose the current enabled account and the parameter set you previously created.

Then you will see each pre-compiled value for Redshift Serverless. Under Authorized projects, you can pick the authorized projects allowed to use this environment profile to create an environment. By default, this is set to All projects.

  1. Select Authorized projects only.
  2. Choose Add projects and choose the SalesPRJ project.
  3. Configure the publishing permissions for this environment profile. Because the Sales team is our data producer, we select Publish from any schema.
  4. Choose Create environment profile.

Next, you create a second environment profile for the Marketing team to consume data. To do this, you repeat similar steps made for the Sales team.

  1. Choose the AdminPRJ project.
  2. On the Environments page, choose Create environment profile.
  3. Enter a name (for example, MarketingEnvProfile) and optional description (for example, Marketing DWH Environment Profile).
  4. For Owner, choose AdminPRJ.
  5. For Blueprint, select the DefaultDataWarehouse blueprint.
  6. Select the parameter set you created earlier.
  7. This time, keep All projects as the default (alternatively, you could select Authorized projects only and add MarketingPRJ).
  8. Configure the publishing permissions for this environment profile. Because the Marketing team is our data consumer, we select Don’t allow publishing.
  9. Choose Create environment profile.

With these two environment profiles in place, the Sales and Marketing teams can start working on their projects on their own to create their proper environments (resources and tools) with fewer configurations and less risk to incur errors, and publish and consume data securely and efficiently within these environments.

To recap, the new enhancements offer the following features:

  • When creating an environment profile, you can choose to provide your own Amazon Redshift parameters or use one of the parameter sets from the blueprint configuration. If you choose to use the parameter set created in the blueprint configuration, the AWS secret only requires the AmazonDataZoneDomain tag (the AmazonDataZoneProject tag is only required if you choose to provide your own parameter sets in the environment profile).
  • In the environment profile, you can specify a list of authorized projects, so that only authorized projects can use this environment profile to create data warehouse environments.
  • You can also specify what data authorized projects are allowed to be published. You can choose one of the following options: Publish from any schema, Publish from the default environment schema, and Don’t allow publishing.

These enhancements grant administrators more control over Amazon DataZone resources and projects and facilitate the common activities of all roles involved.

Sales team tasks

As a data producer, the Sales team performs the following tasks:

  1. Create a sales environment.
  2. Create a data source.
  3. Publish sales data to the Amazon DataZone data catalog.

Create a sales environment

Now that you have an environment profile, you need to create an environment in order to work with data and analytics tools in this project.

  1. Choose the SalesPRJ project.
  2. On the Environments page, choose Create environment.
  3. Enter a name (for example, SalesDwhEnv) and optional description (for example, Environment DWH for Sales) for the new environment.
  4. For Environment profile, choose SalesEnvProfile.

Data producers can now select an environment profile to create environments, without the need to provide their own Amazon Redshift parameters. The AWS secret, Region, workgroup, and database are ported over to the environment from the environment profile, streamlining and simplifying the experience for Amazon DataZone users.

  1. Review your data warehouse parameters to confirm everything is correct.
  2. Choose Create environment.

The environment will be automatically provisioned by Amazon DataZone with the preconfigured credentials and connection parameters, allowing the Sales team to publish Amazon Redshift tables seamlessly.

Create a data source

Now, let’s create a new data source for our sales data.

  1. Choose the SalesPRJ project.
  2. On the Data page, choose Create data source.
  3. Enter a name (for example, SalesDataSource) and optional description.
  4. For Data source type, select Amazon Redshift.
  5. For Environment¸ choose SalesDevEnv.
  6. For Redshift credentials, you can use the same credentials you provided during environment creation, because you’re still using the same Redshift Serverless workgroup.
  7. Under Data Selection, enter the schema name where your data is located (for example, public) and then specify a table selection criterion (for example, *).

Here, the * indicates that this data source will bring into Amazon DataZone all the technical metadata from the database tables of your schema (in this case, a single table called catalog_sales).

  1. Choose Next.

On the next page, automated metadata generation is enabled. This means that Amazon DataZone will automatically generate the business names of the table and columns for that asset. 

  1. Leave the settings as default and choose Next.
  2. For Run preference, select when to run the data source. Amazon DataZone can automatically publish these assets to the data catalog, but let’s select Run on demand so we can curate the metadata before publishing.
  3. Choose Next.
  4. Review all settings and choose Create data source.
  5. After the data source has been created, you can manually pull technical metadata from the Redshift Serverless workgroup by choosing Run.

When the data source has finished running, you can see the catalog_sales asset correctly added to the inventory.

Publish sales data to the Amazon DataZone data catalog

Open the catalog_sales asset to see details of the new asset (business metadata, technical metadata, and so on).

In a real-world scenario, this pre-publishing phase is when you can enrich the asset providing more business context and information, such as a readme, glossaries, or metadata forms. For example, you can start accepting some metadata automatically generated recommendations and rename the asset or its columns in order to make them more readable, descriptive, and easy to search and understand from a business user.

For this post, simply choose Publish asset to complete the Sales team tasks.

Marketing team tasks

Let’s switch to the Marketing team and subscribe to the catalog_sales asset published by the Sales team. As a consumer team, the Marketing team will complete the following tasks:

  1. Create a marketing environment.
  2. Discover and subscribe to sales data.
  3. Query the data in Amazon Redshift.

Create a marketing environment

To subscribe and access Amazon DataZone assets, the Marketing team needs to create an environment.

  1. Choose the MarketingPRJ project.
  2. On the Environments page, choose Create environment.
  3. Enter a name (for example, MarketingDwhEnv) and optional description (for example, Environment DWH for Marketing).
  4. For Environment profile, choose MarketingEnvProfile.

As with data producers, data consumers can also benefit from a pre-configured profile (created and managed by the administrator) in order to speed up the environment creation process, avoiding mistakes and reducing risks of errors.

  1. Review your data warehouse parameters to confirm everything is correct.
  2. Choose Create environment.

Discover and subscribe to sales data

Now that we have a consumer environment, let’s search the catalog_sales table in the Amazon DataZone data catalog.

  1. Enter sales in the search bar.
  2. Choose the catalog_sales table.
  3. Choose Subscribe.
  4. In the pop-up window, choose your marketing consumer project, provide a reason for the subscription request, and choose Subscribe.

When you get a subscription request as a data producer, Amazon DataZone will notify you through a task in the sales producer project. Because you’re acting as both subscriber and publisher here, you will see a notification.

  1. Choose the notification, which will open the subscription request.

You can see details including which project has requested access, who is the requestor, and why access is needed.

  1. To approve, enter a message for approval and choose Approve.

Now that subscription has been approved, let’s go back to the MarketingPRJ. On the Subscribed data page, catalog_sales is listed as an approved asset, but access hasn’t been granted yet. If we choose the asset, you can see that Amazon DataZone is working on the backend to automatically grant the access. When it’s complete, you’ll see the subscription as granted and the message “Asset added to 1 environment.”

Query data in Amazon Redshift

Now that the marketing project has access to the sales data, we can use the Amazon Redshift Query Editor V2 to analyze the sales data.

  1. Under MarketingPRJ, go to the Environments page and select the marketing environment.
  2. Under the analytics tools, choose Query data with Amazon Redshift, which redirects you to the query editor within the environment of the project.
  3. To connect to Amazon Redshift, choose your workgroup and select Federated user as the connection type.

When you’re connected, you will see the catalog_sales table under the public schema.

  1. To make sure that you have access to this table, run the following query:
SELECT * FROM catalog_sales LIMIT 10

As a consumer, you’re now able to explore data and create reports, or you can aggregate data and create new assets to publish in Amazon DataZone, becoming a producer of a new data product to share with other users and departments.

Clean up

To clean up your resources, complete the following steps:

  1. On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
  2. Clean up all Amazon Redshift resources (workgroup and namespace) to avoid incurring additional charges.

Conclusion

In this post, we demonstrated how you can get started with the new Amazon Redshift integration in Amazon DataZone. We showed how to streamline the experience for data producers and consumers and how to grant administrators control over data resources.

Embrace these enhancements and unlock the full potential of Amazon DataZone and Amazon Redshift for your data management needs.

Resources

For more information, refer to the following resources:

 


About the author

Carmen is a Solutions Architect at AWS, based in Milan (Italy). She is a Data Lover that enjoys helping companies in the adoption of Cloud technologies, especially with Data Analytics and Data Governance. Outside of work, she is a creative people who loves being in contact with nature and sometimes practicing adrenaline activities.

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

Post Syndicated from Michael Greenshtein original https://aws.amazon.com/blogs/big-data/monitoring-apache-iceberg-metadata-layer-using-aws-lambda-aws-glue-and-aws-cloudwatch/

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. Data lakes provide a unified repository for organizations to store and use large volumes of data. This enables more informed decision-making and innovative insights through various analytics and machine learning applications.

Despite their advantages, traditional data lake architectures often grapple with challenges such as understanding deviations from the most optimal state of the table over time, identifying issues in data pipelines, and monitoring a large number of tables. As data volumes grow, the complexity of maintaining operational excellence also increases. Monitoring and tracking issues in the data management lifecycle are essential for achieving operational excellence in data lakes.

This is where Apache Iceberg comes into play, offering a new approach to data lake management. Apache Iceberg is an open table format designed specifically to improve the performance, reliability, and scalability of data lakes. It addresses many of the shortcomings of traditional data lakes by providing features such as ACID transactions, schema evolution, row-level updates and deletes, and time travel.

In this blog post, we’ll discuss how the metadata layer of Apache Iceberg can be used to make data lakes more efficient. You will learn about an open-source solution that can collect important metrics from the Iceberg metadata layer. Based on collected metrics, we will provide recommendations on how to improve the efficiency of Iceberg tables. Additionally, you will learn how to use Amazon CloudWatch anomaly detection feature to detect ingestion issues.

Deep dive into Iceberg’s Metadata layer

Before diving into a solution, let’s understand how the Apache Iceberg metadata layer works. The Iceberg metadata layer provides an open specification instructing integrated big data engines such as Spark or Trino how to run read and write operations and how to resolve concurrency issues. It’s crucial for maintaining inter-operability between different engines. It stores detailed information about tables such as schema, partitioning, and file organization in versioned JSON and Avro files. This ensures that each change is tracked and reversible, enhancing data governance and auditability.

Apache Iceberg metadata layer architecture diagram

History and versioning: Iceberg’s versioning feature captures every change in table metadata as immutable snapshots, facilitating data integrity, historical views, and rollbacks.

File organization and snapshot management: Metadata closely manages data files, detailing file paths, formats, and partitions, supporting multiple file formats like Parquet, Avro, and ORC. This organization helps with efficient data retrieval through predicate pushdown, minimizing unnecessary data scans. Snapshot management allows concurrent data operations without interference, maintaining data consistency across transactions.

In addition to its core metadata management capabilities, Apache Iceberg also provides specialized metadata tables—snapshots, files, and partitions—that provide deeper insights and control over data management processes. These tables are dynamically generated and provide a live view of the metadata for query purposes, facilitating advanced data operations:

  • Snapshots table: This table lists all snapshots of a table, including snapshot IDs, timestamps, and operation types. It enables users to track changes over time and manage version history effectively.
  • Files table: The files table provides detailed information on each file in the table, including file paths, sizes, and partition values. It is essential for optimizing read and write performance.
  • Partitions table: This table shows how data is partitioned across different files and provides statistics for each partition, which is crucial for understanding and optimizing data distribution.

Metadata tables enhance Iceberg’s functionality by making metadata queries straightforward and efficient. Using these tables, data teams can gain precise control over data snapshots, file management, and partition strategies, further improving data system reliability and performance.

Before you get started

The next section describes a packaged open source solution using Apache Iceberg’s metadata layer and AWS services to enhance monitoring across your Iceberg tables.

Before we deep dive into the suggested solution, let’s mention Iceberg MetricsReporter, which is a native way to emit metrics for Apache Iceberg. It supports two types of reports: one for commits and one for scans. The default output is log based. It produces log files as a result of commit or scan operations. To submit metrics to CloudWatch or any other monitoring tool, users need to create and configure a custom MetricsReporter implementation. MetricsReporter is supported in Apache Iceberg v1.1.0 and later versions, and customers who want to use it must enable it through Spark configuration on their existing pipelines.

The following is deployed independently and doesn’t require any configuration changes to existing data pipelines. It can immediately start monitoring all the tables within the AWS account and AWS Region where it’s deployed. This solution introduces an additional latency of metrics arrival between 20 and 80 seconds compared to MetricsReporter but offers seamless integration without the need for custom configurations or changes to current workflows.

Solution overview

This solution is specifically designed for customers who run Apache Iceberg on Amazon Simple Storage Service (Amazon S3) and use AWS Glue as their data catalog.

Solution architecture diagram

Key features

This solution uses an AWS Lambda deployment package to collect metrics from Apache Iceberg tables. The metrics are then submitted to CloudWatch where you can create metrics visualizations to help recognize trends and anomalies over time.

The solution is designed to be lightweight, focusing on collecting metrics directly from the Iceberg metadata layer without scanning the actual data layer. This approach significantly reduces the compute capacity required, making it efficient and cost-effective. Key features of the solution include:

  • Time-series metrics collection: The solution monitors Iceberg tables continuously to identify trends and detect anomalies in data ingestion rates, partition skewness, and more.
  • Event-driven architecture: The solution uses Amazon EventBridge to launch a Lambda function when the state of an AWS Glue Data Catalog table changes. This ensures real-time metrics collection every time a transaction is committed to an Iceberg table.
  • Efficient data retrieval: Incorporates minimal compute resources by utilizing AWS Glue interactive sessions and the pyiceberg library to directly access Iceberg metadata tables such as snapshots, partitions, and files.

Metrics tracked

As of the blog release date, the solution collects over 25 metrics. These metrics are categorized into several groups:

  • Snapshot metrics: Include total and changes in data files, delete files, records added or removed, and size changes.
  • Partition and file metrics: Aggregated and per-partition metrics like average, maximum, minimum record counts and file sizes, which help in understanding data distribution and help optimizing storage.

To see the complete list of metrics, go to the GitHub repository.

Visualizing data with CloudWatch dashboards

The solution also provides a sample CloudWatch dashboard to visualize the collected metrics. Metrics visualization is important for real-time monitoring and detecting operational issues. The provided helper script simplifies the set up and deployment of the dashboard.

Amazon CloudWatch dashboard

You can go to the GitHub repository to learn more about how to deploy the solution in your AWS account.

What are the vital metrics for Apache Iceberg tables?

This section discusses specific metrics from Iceberg’s metadata and explains why they’re important for monitoring data quality and system performance. The metrics are broken down into three parts: insight, challenge, and action. This provides a clear path for practical application. In this section, we provide only a subset of the available metrics that the solution can collect, for a complete list, see the solution Github page.

1. snapshot.added_data_files, snapshot.added_records

  • Metric insight: The number of data files and number of records added to the table during the last transaction. The ingestion rate measures the speed at which new data is added to the data lake. This metric helps identify bottlenecks or inefficiencies in data pipelines, guiding capacity planning and scalability decisions.
  • Challenge: A sudden drop in the ingestion rate can indicate failures in data ingestion pipelines, source system outages, configuration errors or traffic spikes.
  • Action: Teams need to establish real-time monitoring and alert systems to detect drops in ingestion rates promptly, allowing quick investigations and resolutions.

2. files.avg_record_count, files.avg_file_size

  • Metric insight: These metrics provide insights into the distribution and storage efficiency of the table. Small file sizes might suggest excessive fragmentation.
  • Challenge: Excessively small file sizes can indicate inefficient data storage leading to increased read operations and higher I/O costs.
  • Action: Implementing regular data compaction processes helps consolidate small files, optimizing storage and enhancing content delivery speeds as demonstrated by a streaming service. Data Catalog offers automatic compaction of Apache Iceberg tables. To learn more about compacting Apache Iceberg tables, see Enable compaction in Working with tables on the AWS Glue console.

3. partitions.skew_record_count, partitions.skew_file_count

  • Metric insight: The metrics indicate the asymmetry of the data distribution across the available table partitions. A skewness value of zero, or very close to zero, suggests that the data is balanced. Positive or negative skewness values might indicate a problem.
  • Challenge: Imbalances in data distribution across partitions can lead to inefficiencies and slow query responses.
  • Action: Regularly analyze data distribution metrics to adjust partitioning configuration. Apache Iceberg allows you to transform partitions dynamically, which enables optimization of table partitioning as query patterns or data volumes change, without impacting your existing data.

4. snapshot.deleted_records, snapshot.total_delete_files, snapshot.added_position_deletes

  • Metric insight: Deletion metrics in Apache Iceberg provide important information on the volume and nature of data deletions within a table. These metrics help track how often data is removed or updated, which is essential for managing data lifecycle and compliance with data retention policies.
  • Challenge: High values in these metrics can indicate excessive deletions or updates, which might lead to fragmentation and decreased query performance.
  • Action: To address these challenges, run compaction periodically to ensure deleted rows do not persist in new files. Regularly review and adjust data retention policies and consider expiring old snapshots to keep only necessary amount of data files. You can run compaction operation on specific partitions using Amazon Athena Optimize

Effective monitoring is essential for making informed decisions about necessary maintenance actions for Apache Iceberg tables. Determining the right timing for these actions is crucial. Implementing timely preventative maintenance ensures high operational efficiency of the data lake and helps to address potential issues before they become significant problems.

Using Amazon CloudWatch for anomaly detection and alerts

This section assumes that you have completed the solution setup and collected operational metrics from your Apache Iceberg tables into Amazon CloudWatch.

Now you can start setting up some alerts and detect anomalies.

We guide you on setting up the anomaly detection and configuring alerts in CloudWatch to monitor the snapshot.added_records metric, which indicates the ingestion rate of data written into an Apache Iceberg table.

Set up anomaly detection

CloudWatch anomaly detection applies machine learning algorithms to continuously analyze system metrics, determine normal baselines, and identify items that are outside of the established patterns. Here is how you configure it:

Amazon CloudWatch anomaly detection screenshot

  1. Select Metrics: In the AWS Management Console for Cloudwatch, go to the Metrics  tab and search for and select snapshot.added_records.
  2. Create anomaly detection models: Choose the Graphed metrics tab and click the Pulse icon to enable anomaly detection.
  3. Set Sensitivity: The second parameter of the ANOMALY_DETECTION_BAND (m1, 5) is to adjust the sensitivity of the anomaly detection. The goal is to balance detecting real issues and reducing false positives.

Configure alerts

After the anomaly detection model is set up, set up an alert to notify operations teams about potential issues:

  1. Create alarm: Choose the bell icon under Actions on the same Graphed metrics tab.
  2. Alarm settings: Set the alarm to notify the operations team when the snapshot.added_records metric is outside the anomaly detection band for two consecutive periods. This helps reduce the risk of false alerts.
  3. Alarm actions: Configure CloudWatch to send an alarm email to the operations team. In addition to sending emails, CloudWatch alarm actions can automatically launch remediation processes, such as scaling operations or initiating data compaction.

Best practices

  • Regularly review and adjust models: As data patterns evolve, periodically review and adjust anomaly detection models and alarm settings to remain effective.
  • Comprehensive coverage: Ensure that all critical aspects of the data pipeline are monitored, not just a few metrics.
  • Documentation and communication: Maintain clear documentation of what each metric and alarm represent and ensure that your operations team understands the monitoring set up and response procedures. Set up the alerting mechanisms to send notifications through appropriate channels such as email, corporate messenger, or telephone to ensure your operations team stays informed and can quickly address the issues.
  • Create playbooks and automate remediation tasks: Establish detailed playbooks that describe step-by-step responses for common scenarios identified by alerts. Additionally, automate remediation tasks where possible to speed up response times and reduce the manual burden on teams. This ensures consistent and effective responses to all incidents.

CloudWatch anomaly detection and alerting features help organizations proactively manage their data lakes. This ensures data integrity, reduces downtime, and maintains high data quality. As a result, it enhances operational efficiency and supports robust data governance.

Conclusion

In this blog post, we explored Apache Iceberg’s transformative impact on data lake management. Apache Iceberg addresses the challenges of big data with features like ACID transactions, schema evolution, and snapshot isolation, enhancing data reliability, query performance, and scalability.

We delved into Iceberg’s metadata layer and related metadata tables such as snapshots, files, and partitions that allow easy access to crucial information about the current state of the table. These metadata tables facilitate the extraction of performance-related data, enabling teams to monitor and optimize the data lake’s efficiency.

Finally, we showed you a practical solution for monitoring Apache Iceberg tables using Lambda, AWS Glue, and CloudWatch. This solution uses Iceberg’s metadata layer and CloudWatch monitoring capabilities to provide a proactive operational framework. This framework detects trends and anomalies, ensuring robust data lake management.


About the Author

AvatarMichael Greenshtein is a Senior Analytics Specialist at Amazon Web Services. He is an experienced data professional with over 8 years in cloud computing and data management. Michael is passionate about open-source technology and Apache Iceberg.

Accelerate incident response with Amazon Security Lake – Part 2

Post Syndicated from Frank Phillis original https://aws.amazon.com/blogs/security/accelerate-incident-response-with-amazon-security-lake-part-2/

This blog post is the second of a two-part series where we show you how to respond to a specific incident by using Amazon Security Lake as the primary data source to accelerate incident response workflow. The workflow is described in the Unintended Data Access in Amazon S3 incident response playbook, published in the AWS incident response playbooks repository.

The first post in this series outlined prerequisite information and provided steps for setting up Security Lake. The post also highlighted how Security Lake can add value to your incident response capabilities and how that aligns with the National Institute of Standards and Technology (NIST) SP 800-61 Computer Security Incident Handling Guide. We demonstrated how you can set up Security Lake and related services in alignment with the AWS Security Reference Architecture (AWS SRA).

The following diagram shows the service architecture that we configured in the previous post. The highlighted services (Amazon Macie, Amazon GuardDuty, AWS CloudTrail, AWS Security Hub, Amazon Security Lake, Amazon Athena, and AWS Organizations) are relevant to the example referenced in this post, which focuses upon the phases of incident response outlined in NIST SP 800-61.

Figure 1: Example architecture configured in the previous blog post

Figure 1: Example architecture configured in the previous blog post

The first phase of the NIST SP 800-61 framework is preparation, which was covered in part 1 of this two-part blog series. The following sections cover phase 2 (detection and analysis) and phase 3 (containment, eradication, and recovery) of the NIST framework, and demonstrate how Security Lake accelerates your incident response workflow by providing a central datastore of security logs and findings in a standardized format.

Consider a scenario where your security team has received an Amazon GuardDuty alert through Amazon EventBridge for anomalous user behavior. As a result, the team becomes aware of potential unintended data access. GuardDuty noted unusual IAM user activity for the AWS Identity and Access Management (IAM) user Service_Backup and generated a finding. The security team had set up a rule in EventBridge that sends an alert email notifying them of GuardDuty findings relating to IAM user activity. At this point, the security team is unsure if any malicious activity has occurred, however, the username is not familiar. The security team should investigate further to determine whether the finding is a false positive by querying data in Security Lake. In line with the NIST incident response framework, the team moves to phase 2 of this investigation.

Phase 2: Acquire, preserve, and document evidence

The security team wants to investigate which API calls the unfamiliar user has been making. First, the team checks AWS CloudTrail management activity for the user, and they can do that by using Security Lake. They want a list of that user’s activity, which will help them in several ways:

  1. Give them a list of API calls that warrant further investigation (especially Create*)
  2. Give an indication whether the activity is unusual or not (in the context of “typical” user activity within the account, team, user group, or individual user)
  3. Give an indication of when potentially malicious activity might have started (user history)

The team can use Amazon Athena to query CloudTrail management events that were captured by Security Lake. In some cases, compromised user accounts might have existed for a long time previously as legitimate users and made tens of thousands of API calls. In such a case, how would a security team identify the calls that might need further investigation? A quick way to get a summary list of API calls the user has made would be to run a query like the following:

SELECT DISTINCT api.operation FROM amazon_security_lake_table_us_west_2_cloud_trail_mgmt_2_0
WHERE lower(actor.user.name) = 'service_backup'

From the results, the team can determine information and queries of interest to focus on for further investigation. To begin with, the security team uses the preceding query to enumerate the number and type of API calls made by the user, as shown in Figure 2.

Figure 2: Example API call summary made by an IAM user

Figure 2: Example API call summary made by an IAM user

The initial query results in Figure 2 show API calls that could indicate privilege elevation (creating users, attaching user policies, and similar calls are a good indicator). There are other API calls that indicate that additional resources may have been created, such as Amazon Simple Storage Service (Amazon S3) buckets and Amazon Elastic Compute Cloud (Amazon EC2) instances.

Note that in this early phase, the team didn’t time-bound the query. However, if there is a high degree of confidence that the team can focus on a specific time or date range, the query can be further modified. How would the team decide on a time period to focus on? When the team received the alert email from EventBridge, that email included information about the GuardDuty finding, including the time it was observed. The team can use that time as an anchor to search around.

The team now wants to look at the user’s activity in a bit more detail. To do that, they can use a query that returns more detail for each of the API calls the user has made:

SELECT time_dt AS "Time Date", metadata.product.name AS "Log Source", cloud.region, actor.user.type, actor.user.name, actor.user.account.uid AS "User AWS Acc ID", api.operation AS "API Operation", status AS "API Response", api.service.name AS "Service", api.request.data AS "Request Data"
FROM amazon_security_lake_table_us_west_2_cloud_trail_mgmt_2_0
WHERE lower(actor.user.name) = 'service_backup';
Figure 3: Example CloudTrail activity for an IAM user

Figure 3: Example CloudTrail activity for an IAM user

Figure 3 shows the result of the example query. The team observes that the user created an S3 bucket and performed other management plane actions, including creating IAM users and attaching the administrator access policy to a newly created user. Attempts were made to create other resources such as Amazon EC2 instances, but these were not successful. So the team needs to do further investigation on the newly created IAM users and S3 buckets, but they don’t need to take further action on EC2 instances, at least for this user.

The team starts to focus on investigating the IAM permissions of the Service_Backup user because some resources created by this user could lead to privilege elevation (for example: CreateUser >> AttachPolicy >> CreateAccessKey). The team verifies that the policy attached to that new user was AdminAccess. The activities of this newly created user should be investigated. Now, with a broad idea of that user’s activity and the time the activity occurred, the security team wants to focus on IAM activity, so that they can understand what resources have been created. Those resources will likely also need to be investigated.

The team can use the following query to find more detail about the resources that the user has created, modified, or deleted. In this case, the user has also created additional IAM users and roles. The team uses timestamps to limit query results to a specific time range during which they believe the suspicious activity occurred, and can also focus on IAM activity specifically.

SELECT time_dt AS "Time", cloud.region AS "Region", actor.session.issuer AS "Role Issuer", actor.user.name AS "User Name", api.operation AS "API Call", json_extract(api.request.data, '$.userName') AS "Target Principal", json_extract(api.request.data, '$.policyArn') AS "Target Policy", json_extract(api.request.data, '$.roleName') AS "Target Role", actor.user.uid AS "User ID", user.name AS "Target Principal", status AS "Response Status", accountid AS "AWS Account"
FROM amazon_security_lake_table_us_west_2_cloud_trail_mgmt_2_0
WHERE (lower(api.operation) LIKE 'create%' OR lower(api.operation) LIKE 'attach%' OR lower(api.operation) LIKE 'delete%')
AND lower(api.service.name) = 'iam.amazonaws.com'
AND lower(actor.user.name) = 'service_backup'
AND time_dt BETWEEN TIMESTAMP '2024-03-01 00:00:00.000' AND TIMESTAMP  '2024-05-31 23:59:00.000';
Figure 4: CloudTrail IAM activity for a specific user with additional resource detail

Figure 4: CloudTrail IAM activity for a specific user with additional resource detail

This additional detail helps the security team focus on the resources created by the Service_Backup user. These will need to be investigated further and most likely quarantined. If further analysis is required, resources can be disabled, or (in some instances) copied over to a forensic account to conduct the analysis.

Having identified newly created IAM resources that require further investigation, the team now continues by focusing on the resources created in S3. Has the Service_Backup user put objects into that bucket? Have they interacted with objects in any other buckets? To do that, the team queries the S3 data events by using Athena, as follows:

SELECT time_dt AS "Time Date", cloud.region AS "Region", actor.user.type, actor.user.name AS "User Name", api.operation AS "API Call", status AS "Status", api.request.data AS "Request Data", accountid AS "AWS Account ID"
FROM amazon_security_lake_table_us_west_2_s3_data_2_0
WHERE lower(actor.user.name) = 'service_backup';

The security team discovers that the Service_Backup user created an S3 bucket named breach.notify and uploaded a file named data-locked-xxx.txt in the bucket (the bucket name is also returned in the query results shown in Figure 3). Additionally, they see GetObject API calls for potentially sensitive data, followed by DeleteObject API calls for the same data and additional potentially sensitive data files. These are a group of CSV files, for example cc-data-2.csv, as shown in Figure 5.

Figure 5: Example Amazon S3 API activity for an IAM user

Figure 5: Example Amazon S3 API activity for an IAM user

Now the security team has two important goals:

  1. Protect and recover their data and resources
  2. Make sure that any applications that are reliant on those resources or data are available and serving customers

The security team knows that their S3 buckets do contain sensitive data, and they need a quick way to understand which files may be of value to a threat actor. Because the security team was able to quickly investigate their S3 data logs and determine that files have indeed been downloaded and deleted, they already have actionable information. To enrich that with additional context, the team needs a way to verify whether the file contains sensitive data. Amazon Macie can be configured to detect sensitive data in S3 buckets and natively integrate with Security Hub. The team had already configured Macie to scan their buckets and provide alerts if potentially sensitive data is discovered. The team can continue to use Athena to query Security Hub data stored in Security Lake, to see if the Macie results could be related to those files or buckets. The team looks for such findings that were generated around the time that the breach.notify S3 bucket was created, with the unusually named object uploaded, through the following Athena query:

SELECT time_dt AS "Date/Time", metadata.product.name AS "Log Source", cloud.account.uid AS "AWS Account ID", cloud.region AS "AWS Region", resources[1].type AS "Resource Type", resources[1].uid AS "Bucket ARN", resources[2].type AS "Resource Type 2", resources[2].uid AS "Object Name", finding_info.desc AS "Finding summary" FROM amazon_security_lake_table_us_west_2_sh_findings_2_0
WHERE cloud.account.uid = '<YOUR_AWS_ACCOUNT_ID>'
AND lower(metadata.product.name) = 'macie'
AND time_dt BETWEEN TIMESTAMP '2024-03-10 00:00:00.000' AND TIMESTAMP  '2024-03-14 23:59:00.000';
Figure 6: Example Amazon Macie finding summary for data in S3 buckets

Figure 6: Example Amazon Macie finding summary for data in S3 buckets

As Figure 6 shows, the team used the preceding query to pull just the information they needed to help them understand whether there is likely sensitive information in the bucket, which files contain that information, and what kind of information it is. From these results, it appears that the files listed do contain sensitive information, which appears to be credit card related data.

It’s time to stop and briefly review what the team now knows, and what still needs to be done. The team established that the Service_Backup user created additional IAM users and assigned wide-ranging permissions to those users. They also found that the Service_Backup user downloaded what appears to be confidential data, and then deleted those files from the customer’s buckets. Meanwhile, the Service_Backup user created a new S3 bucket and stored ransom files in it. In addition to this, that user also created IAM roles and attempted to create EC2 instances. In our example scenario, the first part of the investigation is complete.

There are a few things to note about what the team has done so far with Security Lake. First, because they’ve set up Security Lake across their entire organization, the team can query results from accounts in their organization, for various types of resources. That in itself saves a significant amount of time during the investigative process. Additionally, the team has seamlessly queried different sets of data to get an outcome—so far they have queried CloudTrail management events, S3 data events, and Macie findings through Security Hub—with the preceding queries done through Security Lake, directly from the Athena console, with no account or role switching, and no tool switching.

Next, we’ll move on to the containment step.

Phase 3: Containment, eradication, and recovery

Having gathered sufficient evidence in the previous phases to act, it’s now time for the team to contain this incident and focus on the AWS API by using either the AWS Management Console, the AWS CLI, or other tools. For the purposes of this blog post, we’ll use the CLI tools.

The team needs to perform several actions to contain the incident. They want to reduce the risk of further data exposure and the creation, modification, or deletion of resources in these AWS accounts. First, they will disable the Service_Backup user’s access and subsequently investigate and assess whether to disable the access of the IAM principal which created that user. Additionally, because Service_Backup created other IAM users, those users must also be investigated using the same process outlined earlier.

Next, the security team needs to determine whether or how they can restore the sensitive data that has been deleted from the bucket. If the bucket has versioning enabled, the act of deleting the object will result in the next most recent version becoming the current version. Alternatively, if they are using AWS Backup to protect their Amazon S3 data, they will be able to restore the most recently backed-up version. It’s worth noting that there could be other ways to restore that data—for example, the organization might have configured cross-Region replication for S3 buckets or other methods to protect their data.

After completing the steps to help prevent further access of the impacted IAM users, and restoring relevant data in impacted S3 buckets, the team now turns their attention to the additional resources created by the now-disabled user. Because these resources include IAM resources, the team needs a list of what has been created and deleted. They could see that information from earlier queries, but now decide to focus just on IAM resources by using the following example query;

SELECT time_dt AS "Time", cloud.region AS "Region", actor.session.issuer AS "Role Issuer", actor.user.name AS "User Name", api.operation AS "API Call", json_extract(api.request.data, '$.userName') AS "Target Principal", json_extract(api.request.data, '$.policyArn') AS "Target Policy", json_extract(api.request.data, '$.roleName') AS "Target Role", actor.user.uid AS "User ID", user.name AS "Target Principal", status AS "Response Status", accountid AS "AWS Account"
FROM amazon_security_lake_table_us_west_2_cloud_trail_mgmt_2_0
WHERE (lower(api.operation) LIKE 'create%' OR lower(api.operation) LIKE 'attach%' OR lower(api.operation) LIKE 'delete%')
AND lower(api.service.name) = 'iam.amazonaws.com'
AND time_dt BETWEEN TIMESTAMP '2024-03-01 00:00:00.000' AND TIMESTAMP  '2024-05-31 23:59:00.000';

This query returns a concise and informative list of activities for several users. There is a separate column for the role name or the user ID, corresponding to IAM roles and users, respectively, as shown in Figure 7.

Figure 7: IAM mutating API activity

Figure 7: IAM mutating API activity

The team uses the AWS CLI to revoke IAM role session credentials, to verify and, if necessary, to modify the role’s trust policy. They also capture an image of the EC2 instance for forensic analysis and terminate the instance. They will copy the data they want to save from the questionable S3 buckets and then delete the buckets, or at least remove the bucket policy.

After completing these tasks, the security team now confirms with the application owners that application recovery is complete or ongoing. They will subsequently review the event and undertake phase 4 of the NIST framework (post-incident activity) to find the root cause, look for opportunities for improvement, and work on remediating any configuration or design flaws that led to the initial breach.

Conclusion

This is the second post in a two-part series about accelerating security incident response with Security Lake. We used anomalous IAM user activity as an incident example to show how you can use Security Lake as a central repository for your security logs and findings, to accelerate the incident response process.

With Security Lake, your security team is empowered to use analytics tools like Amazon Athena to run queries against a central point of security logs and findings from various security data sources, including management logs and S3 data logs from AWS CloudTrail, Amazon Macie findings, Amazon GuardDuty findings, and more.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Frank Phillis

Frank Phillis
Frank is a Senior Solutions Architect (Security) at AWS. He enables customers to get their security architecture right. Frank specializes in cryptography, identity, and incident response. He’s the creator of the popular AWS Incident Response playbooks and regularly speaks at security events. When not thinking about tech, Frank can be found with his family, riding bikes, or making music.
You can follow Frank on LinkedIn.

Jerry Chen

Jerry Chen
Jerry is a Senior Cloud Optimization Success Solutions Architect at AWS. He focuses on cloud security and operational architecture design for AWS customers and partners.
You can follow Jerry on LinkedIn.

Testing your applications with Amazon Q Developer

Post Syndicated from Svenja Raether original https://aws.amazon.com/blogs/devops/testing-your-applications-with-amazon-q-developer/

Testing code is a fundamental step in the field of software development. It ensures that applications are reliable, meet quality standards, and work as intended. Automated software tests help to detect issues and defects early, reducing impact to end-user experience and business. In addition, tests provide documentation and prevent regression as code changes over time.

In this blog post, we show how the integration of generative AI tools like Amazon Q Developer can further enhance unit testing by automating test scenarios and generating test cases.

Amazon Q Developer helps developers and IT professionals with all of their tasks across the software development lifecycle – from coding, testing, and upgrading, to troubleshooting, performing security scanning and fixes, optimizing AWS resources, and creating data engineering pipelines. It integrates into your Integrated Development Environment (IDE) and aids with providing answers to your questions. Amazon Q Developer supports you across the Software Development Lifecycle (SLDC) by enriching feature and test development with step-by-step instructions and best practices. It learns from your interactions and training itself over time to output personalized, and tailored answers.

Solution overview

In this blog, we show how to use Amazon Q Developer to:

  • learn about software testing concepts and frameworks
  • identify unit test scenarios
  • write unit test cases
  • refactor test code
  • mock dependencies
  • generate sample data

Note: Amazon Q Developer may generate an output different from this blog post’s examples due to its nondeterministic nature.

Using Amazon Q Developer to learn about software testing frameworks and concepts

As you start gaining experience with testing, Amazon Q Developer can accelerate your learning through conversational Q&A directly within the AWS Management Console or the IDE. It can explain topics, provide general advice and share helpful resources on testing concepts and frameworks. It gives personalized recommendations on resources which makes the learning experience more interactive and accelerates the time to get started with writing unit tests. Let’s introduce an example conversation to demonstrate how you can leverage Amazon Q Developer for learning before attempting to write your first test case.

Example – Select and install frameworks

A unit testing framework is a software tool used for test automation, design and management of test cases. Upon starting a software project, you may be faced with the selection of a framework depending on the type of tests, programming language, and underlying technology. Let’s ask for recommendations around unit testing frameworks for Python code running in an AWS Lambda function.

In the Visual Studio Code IDE, a user asks Amazon Q Developer for recommendations on unit test frameworks suitable for AWS Lambda functions written in Python using the following prompt: “Can you recommend unit testing frameworks for AWS Lambda functions in Python?”. Amazon Q Developer returns a numbered list of popular unit testing frameworks, including Pytest, unittest, Moto, AWS SAM Local, and AWS Lambda Powertools. Amazon Q Developer returns two URLs in the Sources section. The first URL links to an article called “AWS Lambda function testing in Python – AWS Lambda” from the AWS documentation, and the other URL to an AWS DevOps Blog post with the title “Unit Testing AWS Lambda with Python and Mock AWS Services”.

Figure 1: Recommend unit testing frameworks for AWS Lambda functions in Python

In Figure 1, Amazon Q Developer answers with popular frameworks (pytest, unittest, Moto, AWS SAM Command Line Interface, and Powertools for AWS Lambda) including a brief description for each of them. It provides a reference to the sources of its response at the bottom and suggested follow up questions. As a next step, you may want to refine what you are looking for with a follow-up question. Amazon Q Developer uses the context from your previous questions and answers to give more precise answers as you continue the conversation. For example, one of the frameworks it suggested using was pytest. If you don’t know how to install that locally, you can ask something like “What are the different options to install pytest on my Linux machine?”. As shown in Figure 2, Amazon Q Developer provides installation recommendation using Python.

In the Visual Studio Code IDE, a user asks Amazon Q Developer about the different options to install pytest on a Linux machine using the following prompt: “What are the different options to install pytest on my Linux machine?”. Amazon Q Developer replies with four different options: using pip, using a package manager, using a virtual environment, and using a Python distribution. Each option includes the steps to install pytest. A source URL is included for an article called “How to install pytest in Python? – Be on the Right Side of Change”.

Figure 2: Options to install pytest

Example – Explain concepts

Amazon Q Developer can also help you to get up to speed with testing concepts, such as mocking service dependencies. Let’s ask another follow up question to explain the benefits of mocking AWS services.

In the Visual Studio Code IDE, a user asks Amazon Q Developer about the key benefits of mocking AWS Services’ API calls to create unit tests for Lambda function code using the following prompt: “What are the key benefits of mocking AWS Services' API calls to create unit tests for Lambda function code?”. Amazon Q Developer replies with six key benefits, including: Isolation of the Lambda function code, Faster feedback loop, Consistent and repeatable tests, Cost savings, Improved testability, and Easier debugging. Each benefit has a brief explanation attached to it. The Sources section includes one URL to an AWS DevOps Blog post with the title “Unit Testing AWS Lambda with Python and Mock AWS Services”.

Figure 3: Benefits of mocking AWS services

The above conversation in Figure 3 shows how Amazon Q Developer can help to understand concepts. Let’s learn more about Moto.

In the Visual Studio Code IDE, a user asks Amazon Q Developer: “What is Moto used for?”. Amazon Q Developer provides a brief explanation of the Moto library, and how it can be used to create simulations of various AWS resources, such as: AWS Lambda, Amazon Dynamo DB, Amazon S3, Amazon EC2, AWS IAM, and many other AWS services. Amazon Q Developer provides a simple code example of how to use Moto to mock the AWS Lambda service in a Pytest test case by using the @mock_lambda decorator.

Figure 4: Follow up question about Moto

In Figure 4, Amazon Q Developer gives more details about the Moto framework and provides a short example code snippet for mocking an AWS Lambda service.

Best Practice – Write clear prompts

Writing clear prompts helps you to get the desired answers from Amazon Q. A lack of clarity and topic understanding may result in unclear questions and irrelevant or off-target responses. Note how those prompts contain specific description of what the answer should provide. For example, Figure 1 includes the programming language (Python) and service (AWS Lambda) to be considered in the expected answer. If unfamiliar with a topic, leverage Amazon Q Developer as part of your research, to better understand that topic.

Using Amazon Q Developer to identify unit test cases

Understanding the purpose and intended functionality of the code is important for developing relevant test cases. We introduce an example use case in Python, which handles payroll calculation for different hour rates, hours worked, and tax rates.

"""
This module provides a Payroll class to calculate the net pay for an employee.

The Payroll class takes in the hourly rate, hours worked, and tax rate, and
calculates the gross pay, tax amount, and net pay.
"""

class Payroll:
    """
    A class to handle payroll calculations.
    """

    def __init__(self, hourly_rate: float, hours_worked: float, tax_rate: float):
        self._validate_inputs(hourly_rate, hours_worked, tax_rate)
        self.hourly_rate = hourly_rate
        self.hours_worked = hours_worked
        self.tax_rate = tax_rate

    def _validate_inputs(self, hourly_rate: float, hours_worked: float, tax_rate: float) -> None:
        """
        Validate the input values for the Payroll class.

        Args:
            hourly_rate (float): The employee's hourly rate.
            hours_worked (float): The number of hours the employee worked.
            tax_rate (float): The tax rate to be applied to the employee's gross pay.

        Raises:
            ValueError: If the hourly rate, hours worked, or tax rate is not a positive number, 
            or if the tax rate is not between 0 and 1.
        """
        if hourly_rate <= 0:
            raise ValueError("Hourly rate must be a non-negative number.")
        if hours_worked < 0:
            raise ValueError("Hours worked must be a non-negative number.")
        if tax_rate < 0 or tax_rate >= 1:
            raise ValueError("Tax rate must be between 0 and 1.")

    def gross_pay(self) -> float:
        """
        Calculate the employee's gross pay.

        Returns:
            float: The employee's gross pay.
        """
        return self.hourly_rate * self.hours_worked

    def tax_amount(self) -> float:
        """
        Calculate the tax amount to be deducted from the employee's gross pay.

        Returns:
            float: The tax amount.
        """
        return self.gross_pay() * self.tax_rate

    def net_pay(self) -> float:
        """
        Calculate the employee's net pay after deducting taxes.

        Returns:
            float: The employee's net pay.
        """
        return self.gross_pay() - self.tax_amount()

The example shows how Amazon Q Developer can be used to identify test scenarios before writing the actual cases. Let’s ask Amazon Q Developer to suggest test cases for the Payroll class.

In the Visual Studio Code IDE, a user has a Payroll Python class open on the right side of the editor. The Payroll class is a module used to calculate the net pay for an employee. It takes the hourly rate, hours worked, and tax rate to calculate the gross pay, tax amount, and net pay. The user asks Amazon Q Developer to list unit test scenarios for the Payroll class using the following prompt: “Can you list unit test scenarios for the Payroll class?”. Amazon Q Developer provides eight different unit test scenarios: Test valid input values, Test invalid input values, Test edge cases, Test methods behavior, Test error handling, Test data types, Test rounding behavior, and Test consistency.

Figure 5: Suggest unit test scenarios

In Figure 5, Amazon Q Developer provides a list of different scenarios specific to the Payroll class, including valid, error, and edge cases.

Using Amazon Q Developer to write unit tests

Developers can have a collaborative conversation with Amazon Q Developer, which helps to unpack the code and think through testing cases to check that important test cases are captured but also edge cases are identified. This section focuses on how to facilitate the quick generation of unit test cases, based on the cases recommended in the previous section. Let’s start with a question around best practices when writing unit tests with pytest.

In the Visual Studio Code IDE, a user asks Amazon Q Developer “What are the best practices for unit testing with pytest?”. Amazon Q Developer replies with ten best practices, including: Keep tests simple and focused, Use descriptive test function names, Organize your tests, Run tests multiple times in random order, Utilize static code analysis tools, Focus on behavior not implementation, Use fixtures to set up test data, Parameterize your tests, and Integrate with continuous integration. A source URL is also provided for a Medium article called “Python Unit Tests with pytest Guide”

Figure 6: Recommended best practices for testing with pytest

In Figure 6, Amazon Q Developer provides a list of best practices for writing effective unit tests. Let’s follow up by asking to generate one of the suggested test cases.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to generate a test case using pytest to make sure that the Payroll class raises a ValueError when the hourly rate holds a negative value. Amazon Q Developer generates a code sample using the pytest.raises() context manager to satisfy the requirement. It also provides instructions on how to run the text by prefixing the test module with pytest and running the command in the terminal. The user can now click on Insert at cursor button to insert the test case into the module, and run the test.

Figure 7: Generate a unit test case

Amazon Q Developer includes code in its response which you can copy or insert directly into your file by choosing Insert at cursor. Figure 7 displays valid unit tests covering some of the suggested scenarios and best practices, such as being simple and holding descriptive naming. It also states how to run the test using a command for the terminal.

Best Practice – Provide context

Context allows Amazon Q Developer to offer tailored responses that are more in sync with the conversation. In the chat interface, the flow of the ongoing conversation and past interactions are a critical contextual element. Other ways to provide context are selecting the code-under-test, keeping any relevant files, such as test examples, open in the editor and leveraging conversation context such as asking for best practices and example test scenarios before writing the test cases.

Using Amazon Q Developer to refactor unit tests

To improve code quality, Amazon Q Developer can be used to recommend improvements and refactor parts of the code base. To illustrate Amazon Q Developer refactoring functionality, we prepared test cases for the Payroll class which deviate from some of the suggested best practices.

Example – Send to Amazon Q Refactor

Let’s follow up by asking to refactor the code built-in Amazon Q > Refactor functionality.

In the Visual Studio Code IDE, a user selects the code in test_payroll_refactor module and asks Amazon Q Developer to refactor it via the Amazon Q Developer Refactor functionality, accessible through right click. This code contains ambiguous function and variable names, and might be hard to read without context. Amazon Q Developer then generates the refactored code and outlines the changes made: renamed test functions, removed variable names, and unnecessary comments, as functions are now self-explanatory. The user can now use the Insert at Cursor feature to add the code to the test_payroll_refactored module, and run the tests.

Figure 8: Refactor test cases

In Figure 8, the refactoring renamed the function and variable names to be more descriptive and therefore removed the comments. The recommendation is inserted in the second file to verify it runs correctly.

Best Practice – Apply human judgement and continuously interact with Amazon Q Developer

Note that code generations should always be reviewed and adjusted before used in your projects. Amazon Q Developer can provide you with initial guidance, but you might not get a perfect answer. A developer’s judgement should be applied to value usefulness of the generated code and iterations should be used to continuously improve the results.

Using Amazon Q Developer for mocking dependencies and generating sample data

More complex application architectures may require developers to mock dependencies and use sample data to test specific functionalities. The second code example contains a save_to_dynamodb_table function that writes a job_posting object into a specific Amazon DynamoDB table. This function references the TABLE_NAME environment variable to specify the name of the table in which the data should be saved.

We break down the tasks for Amazon Q Developer into three smaller steps for testing: Generate a fixture for mocking the TABLE_NAME environment variable name, generate instances of the given class to be used as test data, and generate the test.

Example – Generate fixtures

Pytest provides the capability to define fixtures to set defined, reliable, and consistent context for tests. Let’s ask Amazon Q Developer to write a fixture for the TABLE_NAME environment variable.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to create a pytest fixture to mock the TABLE_NAME environment variable using the following prompt: “Create a pytest fixture which mocks my TABLE_NAME environment variable”. Amazon Q Developer replies with a code example showing how the fixture uses the os.environ dictionary to temporarily set the environment variable value to a mock value just for the duration of any test case using the fixture. The code example also includes a yield keyword to pause the fixture, and then delete the mock environment variable to restore the actual value once the test is completed.

Figure 9: Generate a pytest fixture

The result in Figure 9 shows that Amazon Q Developer generated a simple fixture for the TABLE_NAME environment variable. It provides code showing how to use the fixture in the actual test case with additional comments for its content.

Example – Generate data

Amazon Q Developer provides capabilities that can help you generate input data for your tests based on a schema, data model, or table definition. The save_to_dynamodb_table saves an instance of the job posting class to the table. Let’s ask Amazon Q Developer to create a sample instance based on this definition.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to create a sample valid instance of the selected JobPosting class using the following prompt: “Create a sample valid instance of the selected JobPostings class”. The JobPosting class includes multiple fields, including: id, title, description, salary, location, company, employment type, and application_deadline. Amazon Q Developer provides a valid snippet for the JobPosting class, incorporating a UUID for the id field, nested amount and currency for the Salary class, and generic sample values for the remaining fields.

Figure 10: Generate sample data

The answer shows a valid instance of the class in Figure 10 containing common example values for the fields.

Example – Generate unit test cases with context

The code being tested relies on an external library, boto3. To make sure that this dependency is included, we leave a comment specifying that boto3 should be mocked using the Moto library. Additionally, we tell Amazon Q Developer to consider the test instance named job_posting and the fixture named mock_table_name for reference. Developers can now provide a prompt to generate the test case using the context from previous tasks or use comments as inline prompts to generate the test within the test file itself.

In the Visual Studio Code IDE, a user is leveraging inline prompts to generate an autocomplete suggestion from Amazon Q Developer. The inline prompt response is trying to fill out the test_save_to_dynamodb_table function with a mock test asserting the previous JobPosting fields. The user can decide to accept or reject the provided code completion suggestion.

Figure 11: Inline prompts for generating unit test case

Figure 11 shows the recommended code using inline prompts, which can be accepted as the unit test for the save_to_dynamodb_table function.

Best Practice – Break down larger tasks into smaller ones

For cases where Amazon Q Developer does not have much context or example code to refer to, such as writing unit tests from scratch, it is helpful to break down the tasks into smaller tasks. Amazon Q Developer will get more context with each step and can result in more effective responses.

Conclusion

Amazon Q Developer is a powerful tool that simplifies the process of writing and executing unit tests for your application. The examples provided in this post demonstrated that it can be a helpful companion throughout different stages of your unit test process. From initial learning to investigation and writing of test cases, the Chat, Generate, and Refactor capabilities allow you to speed up and improve your test generation. Using clear and concise prompts, context, an iterative approach, and small scoped tasks to interact with Amazon Q Developer improves the generated answers.

To learn more about Amazon Q Developer, see the following resources:

About the authors

Iris Kraja

Iris is a Cloud Application Architect at AWS Professional Services based in New York City. She is passionate about helping customers design and build modern AWS cloud native solutions, with a keen interest in serverless technology, event-driven architectures and DevOps. Outside of work, she enjoys hiking and spending as much time as possible in nature.

Svenja Raether

Svenja is a Cloud Application Architect at AWS Professional Services based in Munich.

Davide Merlin

Davide is a Cloud Application Architect at AWS Professional Services based in Jersey City. He specializes in backend development of cloud-native applications, with a focus on API architecture. In his free time, he enjoys playing video games, trying out new restaurants, and watching new shows.

How to build a CA hierarchy across multiple AWS accounts and Regions for global organization

Post Syndicated from Jiaqing Xue original https://aws.amazon.com/blogs/security/how-to-build-a-ca-hierarchy-across-multiple-aws-accounts-and-regions-for-global-organization/

Building a certificate authority (CA) hierarchy using AWS Private Certificate Authority has been made simple in Amazon Web Services (AWS); however, the CA tree will often reside in one AWS Region in one account. Many AWS customers run their businesses in multiple Regions using multiple AWS accounts and have described the process of creating a CA hierarchy to reflect their organizational structure as complex. This blog post guides you through building a two-level CA hierarchy with CAs in multiple Regions and in multiple accounts.

Certificates, like those provided through AWS Private CA, are usually used to establish encrypted network connections to meet security requirements of an organization’s data, such as database connections. For example, a software as a service (SaaS) company headquartered in Singapore that offers managed database services to users across multiple Regions, such as Oregon (us-west-2), Singapore (ap-southeast-1), and Tokyo (ap-northeast-1). To comply with the regulatory requirements of various countries and Regions, the company needs to build a CA hierarchy with a root CA located in Singapore and subordinate CAs distributed globally.

Another example is a large manufacturing company selling millions of Internet of Things (IoT) devices globally that needs to issue certificates to those devices for identity, authentication, and secure network connections. Different device types or business requirements are handled by different teams that have their own independent AWS accounts within the organization. To meet the certificate policy requirements (such as key length, validity period, and so on) for different types of devices, preserve business flexibility, and control the blast radius of risks, these teams use their own CAs to issue certificates. A customer like this needs a CA hierarchy that aligns with its organizational structure. Subordinate CAs should be deployed to the AWS accounts in different business departments, with each department managing its own subordinate CAs. Each device, or end-entity, typically requires a unique certificate, but the number of private certificates each CA will issue is 1,000,000 by default. Because this number is easy to exceed in a mass-production scenario, a CA hierarchy with multiple subordinate CAs can help mitigate this limitation.

In a CA hierarchy, the root CA is responsible for issuing certificates to subordinate CAs, and subordinate CAs are responsible for issuing end-entity certificates. Because the root CA is the root of the trust chain, loss of trust in this CA can mean that the end-entity certificates it issued will become invalid. Therefore, it’s recommended to place the root CA in a dedicated account and apply strict access control and auditing on its usage. We also recommend you strictly limit the scope of use of the root CA. For more best practices, see AWS Private CA best practices and Designing a CA hierarchy.

Key concepts: Certificate template

In AWS Private CA, when you use AWS Command Line Interface (AWS CLI) or AWS APIs to issue a certificate, you need to supply the certificate template Amazon Resource Name (ARN) to specify the X509v3 parameters of the certificate. A certificate template defines what type of certificates can be issued and allows CA administrators to control what certificates are issued and with what attributes. If you provide no ARN, AWS Private CA will automatically apply the EndEntityCertificate/V1 template to issue an end-entity certificate.

AWS Private CA offers certificate templates that encapsulate best practices for the basic constraint values in X509v3 parameters of certificates. For example, if it’s an end-entity certificate, the cA value of Basic constraints in the parameter is False; if issuing a CA certificate, it is necessary to set the cA value in the parameter to True and set the pathLenConstraint according to the level of the CA using this certificate in the CA hierarchy.

When issuing certificates, you must select the appropriate certificate template based on the certificate’s intended purpose. For more information about certificate templates, see Understanding certificate templates.

Journey of a two-level CA hierarchy

Now that you understand the core concepts of a CA hierarchy, we will show you how to use two CAs located in different accounts and different Regions to establish a CA hierarchy and use the subordinate CA to issue an end-entity certificate associated with a web application. In the example, you will create two CAs: one in the Oregon (us-west-2) Region acting as the root CA and the other one in the Singapore (ap-southeast-1) Region acting as the subordinate CA.

The following figure shows what you will build:

Figure 1: Architecture of a two-level CA hierarchy

Figure 1: Architecture of a two-level CA hierarchy

Prerequisites

You need the following prerequisites to build and implement the solution in this post.

  1. Designate two different AWS accounts for creating the hierarchy. In this example, we use one called Root CA account and the another called Subordinate CA account.
  2. Verify that you can sign in to AWS CloudShell in the Root CA account, and verify that your console user or role has the AWS Identity and Access Management (IAM) policy AWSCertificateManagerPrivateCAFullAccess associated or has permissions to create CAs and issue certificates in AWS Private CA.
  3. Verify that you have access to a role in the Subordinate CA account that has appropriate permissions to create CAs in AWS Private CA, issue certificates using AWS Certificate Manager (ACM), and create an Application Load Balancer (ALB) for a sample web application.

Build the hierarchy

With the prerequisites in place, you can start building the hierarchy, including the:

  • Root CA
  • Subordinate CA
  • Subordinate CA certificate
  • End-entity certificate

To build the root CA

  1. In the AWS Management Console of the root CA account, go to AWS Private CA and make sure that you are in the Oregon (us-west-2) Region.
  2. Create a CA according to this guide. Select Root in the CA type options to indicate that it’s a root CA.

    Figure 2: Create root CA

    Figure 2: Create root CA

  3. After creation, the CA will be in Pending certificate status. Finish configuring your CA by choosing Actions and selecting Install CA certificate.

    Figure 3: Root CA in pending certificate status

    Figure 3: Root CA in pending certificate status

  4. The root CA will change to Active status.

    Figure 4: Root CA in active status

    Figure 4: Root CA in active status

To build the subordinate CA

  1. To build the subordinate CA, sign in to your second account designated for this and make sure your Region is different than that of the root CA. For this example, I’m creating this in the Singapore (ap-southeast-1) Region. Similar to the root CA, create a CA in the console and select Subordinate as the CA type options.

    Figure 5: Create a subordinate CA

    Figure 5: Create a subordinate CA

  2. As with the root CA, the subordinate CA is in the Pending certificate state, so you need to install the certificate for it.

    Figure 6: Subordinate CA in pending certificate status

    Figure 6: Subordinate CA in pending certificate status

To issue the subordinate CA certificate

In a CA hierarchy, the certificate of the subordinate CA must be issued by the higher-level CA, which in this case is the root CA. To configure the subordinate CA to be Active, you need to get the certificate signing request (CSR) of the subordinate CA and use the root CA to sign it and issue a certificate for it.

  1. Remaining within the Subordinate CA account, get the CSR of the subordinate CA for signing. In the details page of the subordinate CA, open the Install subordinate CA certificate page by choosing Actions and selecting Install CA certificate, the CSR will be displayed when you select External private CA as the CA type. Export this to a file called sub-ca.csr on your local machine.

    Although the subordinate CA was managed by AWS Private CA, it will be considered as an external private CA because it’s in a different account and a different Region from the root CA.

    Figure 7: Details page of Install subordinate CA certificate

    Figure 7: Details page of Install subordinate CA certificate

  2. In the example scenario, you want the subordinate CA to only issue certificates directly to end-entities, so its certificate must be signed using the SubordinateCACertificate_PathLen0/V1 template on the root CA. This encodes a limitation into the CA that declares that the subordinate CA cannot sign a CSR for another CA.

    To do this, switch to the Root CA account and use CloudShell to run commands with the AWS CLI. Recreate the file sub-ca.csr using a text editor within CloudShell, and then use the following command to issue subordinate CA certificate. Replace the root-ca-arn with the ARN of the root CA you created previously:

    aws acm-pca issue-certificate --region us-west-2 \
      --certificate-authority-arn <root-ca-arn> \
      --csr file://sub-ca.csr \
      --signing-algorithm SHA256WITHRSA \
      --template-arn arn:aws:acm-pca:::template/SubordinateCACertificate_PathLen0/V1 \
     --validity Value=5,Type=YEARS

    The output looks like the following. Note the CertificateArn that is returned, because it will be used as <sub-ca-certificate-arn> in the following steps:

    {
        "CertificateArn": "<sub-ca-certificate-arn>"
    }

  3. Remaining in the Root CA account, export the subordinate CA’s certificate and certificate chain. Store the text that is returned from the CLI in a file on your local machine for later use in the Subordinate CA account. Replace the <root-ca-arn> and <sub-ca-certificate-arn> with the actual values.
    1. Export the subordinate CA’s certificate:
      aws acm-pca get-certificate --region us-west-2 --certificate-authority-arn <root-ca-arn> --certificate-arn <sub-ca-certificate-arn> --query “Certificate” --output text

    2. Export the subordinate CA’s certificate chain:
      aws acm-pca get-certificate --region us-west-2 --certificate-authority-arn <root-ca-arn> --certificate-arn <sub-ca-certificate-arn> --query “CertificateChain” --output text

  4. Switching back to the Subordinate CA account, import the subordinate CA’s certificate and root certificate chain to activate the subordinate CA. On the install subordinate CA certificate, choose Actions and select Install CA certificate.
    1. Use the exported certificate from the previous step to fill in the certificate body.
    2. Use the exported certificate chain from the previous step to fill in the certificate chain.
    3. Choose Confirm and install.

      Figure 8: Import certificate and certificate chain to the subordinate CA

      Figure 8: Import certificate and certificate chain to the subordinate CA

  5. Now, the subordinate CA is in Active status.

    Figure 9: Subordinate CA in active status

    Figure 9: Subordinate CA in active status

To validate the end-entity certificate

Now that you have established a two-level CA hierarchy, you can issue an end-entity certificate and deploy a web application in the Subordinate CA account. Use a web browser to view the web application and validate the CA hierarchy.

  1. Go to the AWS Certificate Manager console of the subordinate CA account and request a private certificate. In the Certificate authority details section, select the subordinate CA that was previously built.

    Figure 10: Issue certificate using subordinate CA in ACM

    Figure 10: Issue certificate using subordinate CA in ACM

  2. Deploy a sample web application behind an Application Load Balancer (ALB) in the Subordinate CA account. This sample web application uses a feature of the ALB listener that returns a fixed response without needing to configure backend compute resources. Configure the listener to use the certificate issued in the previous step to handle HTTPS traffic.

    Figure 11: Details of the ALB

    Figure 11: Details of the ALB

  3. Check the hierarchical relationship of CAs by visiting the web application and viewing the certificate details in your web browser.

    Figure 12: Certificate details in the web browser

    Figure 12: Certificate details in the web browser

From this image, you can see that the certificate trust chain of the CA hierarchy has been established successfully. The end-entity certificate (on the left) was issued by the CA Example Corp SG L2 CA, which you can identify from its Issuer Name. In the middle is the certificate of Example Corp SG L2 CA, and its issuer is Example Corp Root CA. You can see the certificate of Example Corp Root CA on the right, which is the root CA, and its Issuer Name is itself. When the subject and issuer attributes match, this is indicative of a root CA.

Conclusion

In this post, we showed you how to build a CA hierarchy solution across AWS accounts and Regions, using AWS Private CA. Using this as a guide, you can build a complex CA hierarchy to help meet your global business security and compliance requirements.

Read more

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Private Certificate Authority re:Post or contact AWS Support.

Jiaqing Xue

Jiaqing Xue
Jiaqing is a Solutions Architect at AWS. He focuses on helping customers build more secure, modern applications on AWS. In his spare time, he enjoys coding.

Patrick Palmer

Patrick Palmer
Patrick is a Principal Security Specialist Solutions Architect. He helps customers around the world use AWS services in a secure manner and specializes in cryptography. When not working, he enjoys spending time with his growing family and playing video games.

Xinran Liu

Xinran Liu
Xinran is a Solutions Architect at AWS. Edge computing and DevOps are his areas of interest. He has 14 years of experience leading large-scale engineering projects with expertise in designing and managing both on-premises and cloud-based infrastructures.

Matt Yu

Matt Yu
Matt is a Security Solutions Architect at AWS, responsible for providing customers with end-to-end cloud security solutions. He has extensive experience in identity and access management, threat detection, and cloud security protection.

How ATPCO enables governed self-service data access to accelerate innovation with Amazon DataZone

Post Syndicated from Brian Olsen original https://aws.amazon.com/blogs/big-data/how-atpco-enables-governed-self-service-data-access-to-accelerate-innovation-with-amazon-datazone/

This blog post is co-written with Raj Samineni  from ATPCO.

In today’s data-driven world, companies across industries recognize the immense value of data in making decisions, driving innovation, and building new products to serve their customers. However, many organizations face challenges in enabling their employees to discover, get access to, and use data easily with the right governance controls. The significant barriers along the analytics journey constrain their ability to innovate faster and make quick decisions.

ATPCO is the backbone of modern airline retailing, enabling airlines and third-party channels to deliver the right offers to customers at the right time. ATPCO’s reach is impressive, with its fare data covering over 89% of global flight schedules. The company collaborates with more than 440 airlines and 132 channels, managing and processing over 350 million fares in its database at any given time. ATPCO’s vision is to be the platform driving innovation in airline retailing while remaining a trusted partner to the airline ecosystem. ATPCO aims to empower data-driven decision-making by making high quality data discoverable by every business unit, with the appropriate governance on who can access what.

In this post, using one of ATPCO’s use cases, we show you how ATPCO uses AWS services, including Amazon DataZone, to make data discoverable by data consumers across different business units so that they can innovate faster. We encourage you to read Amazon DataZone concepts and terminologies first to become familiar with the terms used in this post.

Use case

One of ATPCO’s use cases is to help airlines understand what products, including fares and ancillaries (like premium seat preference), are being offered and sold across channels and customer segments. To support this need, ATPCO wants to derive insights around product performance by using three different data sources:

  • Airline Ticketing data – 1 billion airline ticket sales data processed through ATPCO
  • ATPCO pricing data – 87% of worldwide airline offers are powered through ATPCO pricing data. ATPCO is the industry leader in providing pricing and merchandising content for airlines, global distribution systems (GDSs), online travel agencies (OTAs), and other sales channels for consumers to visually understand differences between various offers.
  • De-identified customer master data – ATPCO customer master data that has been de-identified for sensitive internal analysis and compliance.

In order to generate insights that will then be shared with airlines as a data product, an ATPCO analyst needs to be able to find the right data related to this topic, get access to the data sets, and then use it in a SQL client (like Amazon Athena) to start forming hypotheses and relationships.

Before Amazon DataZone, ATPCO analysts needed to find potential data assets by talking with colleagues; there wasn’t an easy way to discover data assets across the company. This slowed down their pace of innovation because it added time to the analytics journey.

Solution

To address the challenge, ATPCO sought inspiration from a modern data mesh architecture. Instead of a central data platform team with a data warehouse or data lake serving as the clearinghouse of all data across the company, a data mesh architecture encourages distributed ownership of data by data producers who publish and curate their data as products, which can then be discovered, requested, and used by data consumers.

Amazon DataZone provides rich functionality to help a data platform team distribute ownership of tasks so that these teams can choose to operate less like gatekeepers. In Amazon DataZone, data owners can publish their data and its business catalog (metadata) to ATPCO’s DataZone domain. Data consumers can then search for relevant data assets using these human-friendly metadata terms. Instead of access requests from data consumer going to a ATPCO’s data platform team, they now go to the publisher or a delegated reviewer to evaluate and approve. When data consumers use the data, they do so in their own AWS accounts, which allocates their consumption costs to the right cost center instead of a central pool. Amazon DataZone also avoids duplicating data, which saves on cost and reduces compliance tracking. Amazon DataZone takes care of all of the plumbing, using familiar AWS services such as AWS Identity and Access Management (IAM), AWS Glue, AWS Lake Formation, and AWS Resource Access Manager (AWS RAM) in a way that is fully inspectable by a customer.

The following diagram provides an overview of the solution using Amazon DataZone and other AWS services, following a fully distributed AWS account model, where data sets like airline ticket sales, ticket pricing, and de-identified customer data in this use case are stored in different member accounts in AWS Organizations.

Implementation

Now, we’ll walk through how ATPCO implemented their solution to solve the challenges of analysts discovering, getting access to, and using data quickly to help their airline customers.

There are four parts to this implementation:

  1. Set up account governance and identity management.
  2. Create and configure an Amazon DataZone domain.
  3. Publish data assets.
  4. Consume data assets as part of analyzing data to generate insights.

Part 1: Set up account governance and identity management

Before you start, compare your current cloud environment, including data architecture, to ATPCO’s environment. We’ve simplified this environment to the following components for the purpose of this blog post:

  1. ATPCO uses an organization to create and govern AWS accounts.
  2. ATPCO has existing data lake resources set up in multiple accounts, each owned by different data-producing teams. Having separate accounts helps control access, limits the blast radius if things go wrong, and helps allocate and control cost and usage.
  3. In each of their data-producing accounts, ATPCO has a common data lake stack: An Amazon Simple Storage Service (Amazon S3) bucket for data storage, AWS Glue crawler and catalog for updating and storing technical metadata, and AWS LakeFormation (in hybrid access mode) for managing data access permissions.
  4. ATPCO created two new AWS accounts: one to own the Amazon DataZone domain and another for a consumer team to use for analytics with Amazon Athena.
  5. ATPCO enabled AWS IAM Identity Center and connected their identity provider (IdP) for authentication.

We’ll assume that you have a similar setup, though you might choose differently to suit your unique needs.

Part 2: Create and configure an Amazon DataZone domain

After your cloud environment is set up, the steps in Part 2 will help you create and configure an Amazon DataZone domain. A domain helps you organize your data, people, and their collaborative projects, and includes a unique business data catalog and web portal that publishers and consumers will use to share, collaborate, and use data. For ATPCO, their data platform team created and configured their domain.

Step 2.1: Create an Amazon DataZone domain

Persona: Domain administrator

Go to the Amazon DataZone console in your domain account. If you use AWS IAM Identity Center for corporate workforce identity authentication, then select the AWS Region in which your Identity Center instance is deployed. Choose Create domain.

  1. Enter a name and description.
  2. Leave Customize encryption settings (advanced) cleared.
  3. Leave the radio button selected for Create and use a new role. AWS creates an IAM role in your account on your behalf with the necessary IAM permissions for accessing Amazon DataZone APIs.
  4. Leave clear the quick setup option for Set-up this account for data consumption and publishing because we don’t plan to publish or consume data in our domain account.
  5. Skip Add new tag for now. You can always come back later to edit the domain and add tags.
  6. Choose Create Domain.

After a domain is created, you will see a domain detail page similar to the following. Notice that IAM Identity Center is disabled by default.

Step 2.2: Enable IAM Identity Center for your Amazon DataZone domain and add a group

Persona: Domain administrator

By default, your Amazon domain, its APIs, and its unique web portal are accessible by IAM principals in this AWS account with the necessary datazone IAM permissions. ATPCO wanted its corporate employees to be able to use Amazon DataZone with their corporate single sign-on SSO credentials without needing secondary federation to IAM roles. AWS Identity Center is the AWS cross-service solution for passing identity provider credentials. You can skip this step if you plan to use IAM principals directly for accessing Amazon DataZone.

Navigate to your Amazon DataZone domain’s detail page and choose Enable IAM Identity Center.

  • Scroll down to the User management section and select Enable users in IAM Identity Center. When you do, User and group assignment method options appear below. Turn on Require assignments. This means that you need to explicitly allow (add) users and groups to access your domain. Choose Update domain.

Now let’s add a group to the domain to provide its members with access. Back on your domain’s detail page, scroll to the bottom and choose the User management tab. Choose Add, and select Add SSO Groups from the drop-down.

  1. Enter the first letters of the group name and select it from the options. After you’ve added the desired groups, choose Add group(s).
  2. You can confirm that the groups are added successfully on the domain’s detail page, under the User management tab by selecting SSO Users and then SSO Groups from the drop-down.

Step 2.3: Associate AWS accounts with the domain for segregated data publishing and consumption

Personas: Domain administrator and AWS account owners

Amazon DataZone supports a distributed AWS account structure, where data assets are segregated from data consumption (such as Amazon Athena usage), and data assets are in their own accounts (owned by their respective data owners). We call these associated accounts. Amazon DataZone and the other AWS services it orchestrates take care of the cross-account data sharing. To make this work, domain and account owners need to perform a one-time account association: the domain needs to be shared with the account, and the account owner needs to configure it for use with Amazon DataZone. For ATPCO, there are four desired associated accounts, three of which are the accounts with data assets stored in Amazon S3 and cataloged in AWS Glue (airline ticketing data, pricing data, and de-identified customer data), and a fourth account that is used for an analyst’s consumption.

The first part of associating an account is to share the Amazon DataZone domain with the desired accounts (Amazon DataZone uses AWS RAM to create the resource policy for you). In ATPCO’s case, their data platform team manages the domain, so a team member does these steps.

  1. Todo this in the Amazon DataZone console, sign in to the domain account and navigate to the domain detail page, and then scroll down and choose the Associated Accounts tab. Choose Request association.
  2. Enter the AWS account ID of the first account to be associated.
  3. Choose Add another account and repeat step one for the remaining accounts to be associated. For ATPCO, there were four to-be associated accounts.
  4. When complete, choose Request Association.

The second part of associating an account is for the account owner to then configure their account for use by Amazon DataZone. Essentially, this process means that the account owner is allowing Amazon DataZone to perform actions in the account, like granting access to Amazon DataZone projects after a subscription request is approved.

  1. Sign in to the associated account and go to the Amazon DataZone console in the same Region as the domain. On the Amazon DataZone home page, choose View requests.
  2. Select the name of the inviting Amazon DataZone domain and choose Review request.

  1. Choose the Amazon DataZone blueprint you want to enable. We select Data Lake in this example because ATPCO’s use case has data in Amazon S3 and consumption through Amazon Athena.

  1. Leave the defaults as-is in the Permissions and resources The Glue Manage Access role allows Amazon DataZone to use IAM and LakeFormation to manage IAM roles and permissions to data lake resources after you approve a subscription request in Amazon DataZone. The Provisioning role allows Amazon DataZone to create S3 buckets and AWS Glue databases and tables in your account when you allow users to create Amazon DataZone projects and environments. The Amazon S3 bucket for data lake is where you specify which S3bucket is used by Amazon DataZone when users store data with your account.

  1. Choose Accept & configure association. This will take you to the associated domains table for this associated account, showing which domains the account is associated with. Repeat this process for other to-be associated accounts.

After the associations are configured by accounts, you will see the status reflected in the Associated accounts tab of the domain detail page.

Step 2.4: Set up environment profiles in the domain

Persona: Domain administrator

The final step to prepare the domain is making the associated AWS accounts usable by Amazon DataZone domain users. You do this with an environment profile, which helps less technical users get started publishing or consuming data. It’s like a template, with pre-defined technical details like blueprint type, AWS account ID, and Region. ATPCO’s data platform team set up an environment profile for each associated account.

To do this in the Amazon DataZone console, the data platform team member sign in to the domain account and navigates to the domain detail page, and chooses Open data portal in the upper right to go to the web-based Amazon DataZone portal.

  1. Choose Select project in the upper-left next to the DataZone icon and select Create Project. Enter a name, like Domain Administration and choose Create. This will take you to your new project page.
  2. In the Domain Administration project page, choose the Environments tab, and then choose Environment profiles in the navigation pane. Select Create environment profile.
    1. Enter a name, such as Sales – Data lake blueprint.
    2. Select the Domain Administration project as owner, and the DefaultDataLake as the blueprint.
    3. Select the AWS account with sales data as well as the preferred Region for new resources, such as AWS Glue and Athena consumption.
    4. Leave All projects and Any database
    5. Finalize your selection by choosing Create Environment Profile.

Repeat this step for each of your associated accounts. As a result, Amazon DataZone users will be able to create environments in their projects to use AWS resources in specific AWS accounts forpublishing or consumption.

Part 3: Publish assets

With Part 2 complete, the domain is ready for publishers to sign in and start publishing the first data assets to the business data catalog so that potential data consumers find relevant assets to help them with their analyses. We’ll focus on how ATPCO published their first data asset for internal analysis—sales data from their airline customers. ATPCO already had the data extracted, transformed, and loaded in a staged S3 bucket and cataloged with AWS Glue.

Step 3.1: Create a project

Persona: Data publisher

Amazon DataZone projects enable a group of users to collaborate with data. In this part of the ATPCO use case, the project is used to publish sales data as an asset in the project. By tying the eventual data asset to a project (rather than a user), the asset will have long-lived ownership beyond the tenure of any single employee or group of employees.

  1. As a data publisher, obtain theURL of the domain’s data portal from your domain administrator, navigate to this sign-in page and authenticate with IAM or SSO. After you’re signed in to the data portal, choose Create Project, enter a name (such as Sales Data Assets) and choose Create.
  2. If you want to add teammates to the project, choose Add Members. On the Project members page, choose Add Members, search for the relevant IAM or SSO principals, and select a role for them in the project. Owners have full permissions in the project, while contributors are not able to edit or delete the project or control membership. Choose Add Members to complete the membership changes.

Step 3.2: Create an environment

Persona: Data publisher

Projects can be comprised of several environments. Amazon DataZone environments are collections of configured resources (for example, an S3 bucket, an AWS Glue database, or an Athena workgroup). They can be useful if you want to manage stages of data production for the same essential data products with separate AWS resources, such as raw, filtered, processed, and curated data stages.

  1. While signed in to the data portal and in the Sales Data Assets project, choose the Environments tab, and then select Create Environment. Enter a name, such as Processed, referencing the processed stage of the underlying data.
  2. Select the Sales – Data lake blueprint environment profile the domain administrator created in Part 2.
  3. Choose Create Environment. Notice that you don’t need any technical details about the AWS account or resources! The creation process might take several minutes while Amazon DataZone sets up Lake Formation, Glue, and Athena.

Step 3.3: Create a new data source and run an ingestion job

Persona: Data publisher

In this use case, ATPCO has cataloged their data using AWS Glue. Amazon DataZone can use AWS Glue as a data source. Amazon DataZone data source (for AWS Glue) is a representation of one or more AWS Glue databases, with the option to set table selection criteria based on their name. Similar to how AWS Glue crawlers scan for new data and metadata, you can run an Amazon DataZone ingestion job against an Amazon DataZone data source (again, AWS Glue) to pull all of the matching tables and technical metadata (such as column headers) as the foundation for one or more data assets. An ingestion job can be run manually or automatically on a schedule.

  1. While signed in to the data portal and in the Sales Data Assets project, choose the Data tab, and then select Data sources. Choose Create Data Source, and enter a name for your data source, such as Processed Sales data in Glue, select AWS Glue as the type, and choose Next.
  2. Select the Processed environment from Step 3.2. In the database name box, enter a value or select from the suggested AWS Glue databases that Amazon DataZone identified in the AWS account. You can add additional criteria and another AWS Glue database.
  3. For Publishing settings, select No. This allows you to review and enrich the suggested assets before publishing them to the business data catalog.
  4. For Metadata generation methods, keep this box selected. Amazon DataZone will provide you with recommended business names for the data assets and its technical schema to publish an asset that’s easier for consumers to find.
  5. Clear Data quality unless you have already set up AWS Glue data quality. Choose Next.
  6. For Run preference, select to run on demand. You can come back later to run this ingestion job automatically on a schedule. Choose Next.
  7. Review the selections and choose Create.

To run the ingestion job for the first time, choose Run in the upper right corner. This will start the job. The run time is dependent on the quantity of databases, tables, and columns in your data source. You can refresh the status by choosing Refresh.

Step 3.4: Review, curate, and publish assets

Persona: Data publisher

After the ingestion job is complete, the matching AWS Glue tables will be added to the project’s inventory. You can then review the asset, including automated metadata generated by Amazon DataZone, add additional metadata, and publish the asset.

  • While signed in to the data portal and in the Sales Data Assets project, go to the Data tab, and select Inventory. You can review each of the data assets generated by the ingestion job. Let’s select the first result. In the asset detail page, you can edit the asset’s name and description to make it easier to find, especially in a list of search results.
  • You can edit the Read Me section and add rich descriptions for the asset, with markdown support. This can help reduce the questions consumers message the publisher with for clarification.
  • You can edit the technical schema (columns), including adding business names and descriptions. If you enabled automated metadata generation, then you’ll see recommendations here that you can accept or reject.
  • After you are done enriching the asset, you can choose Publish to make it searchable in the business data catalog.

Have the data publisher for each asset follow Part 3. For ATPCO, this means two additional teams followed these steps to get pricing and de-identified customer data into the data catalog.

Part 4: Consume assets as part of analyzing data to generate insights

Now that the business data catalog has three published data assets, data consumers will find available data to start their analysis. In this final part, an ATPCO data analyst can find the assets they need, obtain approved access, and analyze the data in Athena, forming the precursor of a data product that ATPCO can then make available to their customer (such as an airline).

Step 4.1: Discover and find data assets in the catalog

Persona: Data consumer

As a data consumer, obtain the URL of the domain’s data portal from your domain administrator, navigate to in the sign-in page, and authenticate with IAM or SSO. In the data portal, enter text to find data assets that match what you need to complete your analysis. In the ATPCO example, the analyst started by entering ticketing data. This returned the sales asset published above because the description noted that the data was related to “sales, including tickets and ancillaries (like premium seat selection preferences).”

The data consumer reviews the detail page of the sales asset, including the description and human-friendly terms in the schema, and confirms that it’s of use to the analysis. They then choose Subscribe. The data consumer is prompted to select a project for the subscription request, in which case they follow the same instructions as creating a project in Step 3.1, naming it Product analysis project. Enter a short justification of the request. Choose Subscribe to send the request to the data publisher.

Repeat Steps 4.2 and 4.3 for each of the needed data assets for the analysis. In the ATPCO use case, this meant searching for and subscribing to pricing and customer data.

While waiting for the subscription requests to be approved, the data consumer creates an Amazon DataZone environment in the Product analysis project, similar to Step 3.2. The data consumer selects an environment profile for their consumption AWS account and the data lake blueprint.

Step 4.2: Review and approve subscription request

Persona: Data publisher

The next time that a member of the Sales Data Assets project signs in to the Amazon DataZone data portal, they will see a notification of the subscription request. Select that notification or navigate in the Amazon DataZone data portal to the project. Choose the Data tab and Incoming requests and then the Requested tab to find the request. Review the request and decide to either Approve or Reject, while providing a disposition reason for future reference.

Step 4.3: Analyze data

Persona: Data consumer

Now that the data consumer has subscribed to all three data assets needed (by repeating steps 4.1-4.2 for each asset), the data consumer navigates to the Product analysis project in the Amazon DataZone data portal. The data consumer can verify that the project has data asset subscriptions by choosing the Data tab and Subscribed data.

Because the project has an environment with the data lake blueprint enabled in their consumption AWS account, the data consumer will see an icon in the right-side tab called Query Data: Amazon Athena. By selecting this icon, they’re taken to the Amazon Athena console.

In the Amazon Athena console, the data consumer sees the data assets their DataZone project is subscribed to (from steps 4.1-4.2). They use the Amazon Athena query editor to query the subscribed data.

Conclusion

In this post, we walked you through an ATPCO use case to demonstrate how Amazon DataZone allows users across an organization to easily discover relevant data products using business terms. Users can then request access to data and build products and insights faster. By providing self-service access to data with the right governance guardrails, Amazon DataZone helps companies tap into the full potential of their data products to drive innovation and data-driven decision making. If you’re looking for a way to unlock the full potential of your data and democratize it across your organization, then Amazon DataZone can help you transform your business by making data-driven insights more accessible and productive.

To learn more about Amazon DataZone and how to get started, refer to the Getting started guide. See the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available.


About the Author

Brian Olsen is a Senior Technical Product Manager with Amazon DataZone. His 15 year technology career in research science and product has revolved around helping customers use data to make better decisions. Outside of work, he enjoys learning new adventurous hobbies, with the most recent being paragliding in the sky.

Mitesh Patel is a Principal Solutions Architect at AWS. His passion is helping customers harness the power of Analytics, machine learning and AI to drive business growth. He engages with customers to create innovative solutions on AWS.

Raj Samineni is the Director of Data Engineering at ATPCO, leading the creation of advanced cloud-based data platforms. His work ensures robust, scalable solutions that support the airline industry’s strategic transformational objectives. By leveraging machine learning and AI, Raj drives innovation and data culture, positioning ATPCO at the forefront of technological advancement.

Sonal Panda is a Senior Solutions Architect at AWS with over 20 years of experience in architecting and developing intricate systems, primarily in the financial industry. Her expertise lies in Generative AI, application modernization leveraging microservices and serverless architectures to drive innovation and efficiency.

How to migrate your AWS CodeCommit repository to another Git provider

Post Syndicated from Rodney Bozo original https://aws.amazon.com/blogs/devops/how-to-migrate-your-aws-codecommit-repository-to-another-git-provider/

Customers can migrate their AWS CodeCommit Git repositories to other Git providers using several methods, such as cloning the repository, mirroring, or migrating specific branches. This blog describes a basic use case to mirror a repository to a generic provider, and links to instructions for mirroring to more specific providers. Your exact steps could vary depending on the type or complexity of your repository, and the decisions made on what and how you want to migrate. This post only describes how to migrate Git repository data, and does not describe exporting other data from CodeCommit such as pull requests.

Pre-requisites

  1. Before you can migrate your CodeCommit repository to another provider, make sure that you have the necessary credentials and permissions to both the AWS Management Console and the other provider’s account. For migrating to GitHub and Gitlab, use CodeCommit static credentials as described in HTTPS users using Git credentials. If you choose to use the generic migration option process described below, any type of CodeCommit credentials can be used. To learn more about setting up AWS CodeCommit access control see Setting up for AWS CodeCommit.
  2. In the AWS CodeCommit console, select the clone URL for the repository you will migrate. The correct clone URL (HTTPS, SSH, or HTTPS (CRC)) depends on which credential type and network protocol you have chosen to use.

Figure 1: Clone repositories

Migrating your AWS CodeCommit repository to a GitLab repository

Using the CodeCommit clone URL in combination with the HTTPS Git repository credentials, follow the guidance in GitLab’s documentation for importing source code from a repository by URL.

Migrating your AWS CodeCommit repository to a GitHub repository

Using the CodeCommit clone URL in combination with the HTTPS Git repository credentials, follow the guidance in GitHub’s documentation for importing source code.

Generic migration to a different repository provider

1.     Clone the AWS CodeCommit Repository
Clone the repository from AWS CodeCommit to your local machine using Git. If you’re using HTTPS, you can do this by running the following command:

git clone --mirror https://your-aws-repository-url your-aws-repository

Replace your-aws-repository-url with the URL of your AWS CodeCommit repository.
Replace your-aws-repository with a name for this repository. Example:

git clone https://git-codecommit.us-east-2.amazonaws.com/v1/repos/MyDemoRepo my-demo-repo

2.     Set up new remote repository

Navigate to the directory of your cloned AWS CodeCommit repository. Then, add the repository URL from the new repository provider as a remote:

git remote add <provider name> <provider-repository-url>

Replace <provider name> with the provider name of your choice. (Example: gitlab)
Replace <provider-repository-url> with the URL of your new repository provider’s repository.

3.     Push your local repository to the new remote repository

This will push all branches and tags to your new repository provider’s repository. The provider name must match the provider name from step 2.

git push <provider name> --mirror

Notes:

  • The remote repository should be empty
  • The remote repository may have protected branches not allowing force push. In this case, navigate to your new repository provider and disable branch protections to allow force push.

4.     Verify the Migration

Once the push is complete, verify that all files, branches, and tags have been successfully migrated to the new repository provider. You can do this by browsing your repository online or by cloning it to another location and checking it locally.

5.     Update Remote URLs

If you plan to continue working with the migrated repository locally, you may want to update the remote URL to point to the new provider’s repository instead of AWS CodeCommit. You can do this using the following command:

git remote set-url origin <provider-repository-url>

Replace <provider-repository-url> with the URL of your new repository provider’s repository.

6.     Update CI/CD Pipelines and fix protected branches

If you have CI/CD pipelines set up that interact with your repository, such as GitLab, GitHub or AWS CodePipeline, update their configuration to reflect the new repository URL. If you removed protected branch permissions in Step 3 you may want to add these back to your main branch.

7.     Inform Your Team

If you’re migrating a repository that others are working on, be sure to inform your team about the migration and provide them with the new repository URL.

8.     Delete the, now migrated, AWS CodeCommit repository

This action cannot be undone. Navigate back to the AWS CodeCommit console and delete the repository that you have migrated using the “Delete Repository” button.

Figure 2: Delete repositories

Conclusion

This post described a few methods to migrate your existing AWS CodeCommit repository to another Git provider. After migration, you have the option to continue to use your current AWS CodeCommit repository, but doing so will likely require a regular sync operation between AWS CodeCommit and the new repository provider. For more information about repository migration, please see the following resources:

Migrate a repository incrementally – This guide is written to migrate a repository to CodeCommit incrementally but can also be used for other Git providers.

Configure SAML federation with Amazon OpenSearch Serverless and Keycloak

Post Syndicated from Arpad Csoke original https://aws.amazon.com/blogs/big-data/configure-saml-federation-with-amazon-opensearch-serverless-and-keycloak/

Amazon OpenSearch Serverless is a serverless version of Amazon OpenSearch Service, a fully managed open search and analytics platform. On Amazon OpenSearch Service you can run petabyte-scale search and analytics workloads without the heavy lifting of managing the underlying OpenSearch Service clusters and Amazon OpenSearch Serverless supports workloads up to 30TB of data for time-series collections. Amazon OpenSearch Serverless provides an installation of OpenSearch Dashboards with every collection created.

The network configuration for an OpenSearch Serverless collection controls how the collection can be accessed over the network. You have the option to make the collection publicly accessible over the internet from any network, or to restrict access to the collection only privately through OpenSearch Serverless-managed virtual private cloud (VPC) endpoints. This network access setting can be defined separately for the collection’s OpenSearch endpoint (used for data operations) and its corresponding OpenSearch Dashboards endpoint (used for visualizing and analyzing data). In this post, we work with a publicly accessible OpenSearch Serverless collection.

SAML enables users to access multiple applications or services with a single set of credentials, eliminating the need for separate logins for each application or service. This improves the user experience and reduces the overhead of managing multiple credentials. We provide SAML authentication for OpenSearch Serverless. With this you can use your existing identity provider (IdP) to offer single sign-on (SSO) for the OpenSearch Dashboards endpoints of serverless collections. OpenSearch Serverless supports IdPs that adhere to the SAML 2.0 standard, including services like AWS IAM Identity Center, Okta, Keycloak, Active Directory Federation Services (AD FS), and Auth0. This SAML authentication mechanism is solely intended for accessing the OpenSearch Dashboards interface through a web browser.

In this post, we show you how to configure SAML authentication for controlling access to public OpenSearch Dashboards using Keycloak as an IdP.

Solution overview

The following diagram illustrates a sample architecture of a solution that allows users to authenticate to OpenSearch Dashboards using SSO with Keycloak.

The sign-in flow includes the following steps:

  1. A user accesses OpenSearch Dashboards in a browser and chooses an IdP from the list.
  2. OpenSearch Serverless generates a SAML authentication request.
  3. OpenSearch Service redirects the request back to the browser.
  4. The browser redirects the user to the selected IdP (Keycloak). Keycloak provides a login page, where users can provide their login credentials.
  5. If authentication was successful, Keycloak returns the SAML response to the browser.
  6. The SAML assertions is sent back to OpenSearch Serverless.
  7. OpenSearch Serverless validates the SAML assertion, and logs the user in to OpenSearch Dashboards.

Prerequisites

To get started, you should have the following prerequisites:

  1. An active OpenSearch Serverless collection
  2. A working Keycloak server (on premises or in the cloud)
  3. The following AWS Identity and Access Management (IAM) permissions to configure SAML authentication in OpenSearch Serverless:
    • aoss:CreateSecurityConfig – Create a SAML provider.
    • aoss:ListSecurityConfig – List all SAML providers in the current account.
    • aoss:GetSecurityConfig – View SAML provider information.
    • aoss:UpdateSecurityConfig – Modify a given SAML provider configuration, including the XML metadata.
    • aoss:DeleteSecurityConfig – Delete a SAML provider.

Create and configure a client in Keycloak

Complete the following steps to create your Keycloak client:

  1. Login to your Keycloak admin page.
  2. In the navigation pane, choose Client.
  3. Choose Create client
  4. For Client type, choose SAML.
  5. For Client ID enter aws:opensearch:AWS_ACCOUNT_ID, where AWS_ACCOUNT_ID is your AWS account ID.
  6. Enter a name and description for your client.
  7. Choose Next.
  8. For Valid redirect URIs, enter the address of the assertion consumer service (ACS), where REGION is the AWS Region in which you have created the OpenSearch Serverless collection.
  9. For Master SAML Processing URL, also enter the preceding ACS address.
  10. Complete your client creation.
  11. After you create the client, you have to disable the Signing keys config setting, because OpenSearch Serverless signed and encrypted requests are not supported. For more details, refer to Considerations.
  12. After you have created the client and disabled the client signature, you can export the SAML 2.0 IdP Metadata by choosing the link on the Realm settings page. You need this metadata, when you create the SAML provider in OpenSearch Serverless.

Create a SAML provider

When your OpenSearch Serverless collection is active, you then create a SAML provider. This SAML provider can be assigned to any collection in the same Region. Complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose SAML authentication under Security.
  2. Choose Create SAML provider.
  3. Enter a name and description for your SAML provider.
  4. Enter the IdP metadata you downloaded earlier from Keycloak.
  5. Under Additional settings, you can optionally add custom user ID and group attributes (for this example, we leave this empty).
  6. Choose Create a SAML provider.

You have now configured a SAML provider for OpenSearch Serverless. Next, you configure the data access policy for accessing collections.

Create a data access policy

After you have configured SAML provider, you have to create data access policies for OpenSearch Serverless to allow access to the users.

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Data access policies under Security.
  2. Choose Create access policy.
  3. Enter a name and optional description for your access policy.
  4. For Policy definition method, select Visual editor.
  5. For Rule name, enter a name.
  6. Under Select principals, for Add principals, choose SAML users and groups.

  7. For SAML provider name, choose the provider you created before.
  8. Choose Save.

  9. Specify the user or group in the format user/USERNAME or group/GROUPNAME. The value of the USERNAME or GROUPNAME should match the value you specified in Keycloak for user-/groupname.
  10. Choose Save.
  11. Choose Grant to grant permissions to resources.
  12. In the Grant resources and permissions section, you can specify access you want to provide for a given user at the collection level, and also at the index pattern level.
    For more information about how to set up more granular access for your users, refer to Supported OpenSearch API operations and permissions and Supported policy permissions.
  13. Choose Save.
  14. You can create additional rules if needed.
  15. Choose Create to create the data access policy.

Now, you have data access policy that will allow users to access the OpenSearch Dashboards and perform the allowed actions there.

Access the OpenSearch Dashboards

Complete the following steps to sign in to the OpenSearch Dashboards:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Dashboard.
  2. In the Collection section, locate your collection and choose Dashboard.

    The OpenSearch login page will open in a new browser tab.
  3. Choose your IdP provider on the dropdown menu and choose Login.

    You will be redirected to the Keycloak sign-in page.
  4. Log in with your SSO credentials.

After a successful login, you will be redirected to OpenSearch Dashboards, and you can perform the actions allowed by the data access policy.

You have successfully federated OpenSearch Dashboards with Keycloak as an IdP.

Cleaning up

When you’re done with this solution, delete the resources you created if you no longer need them.

  1. Delete your OpenSearch Serverless collection.
  2. Delete your data access policy.
  3. Delete the SAML provider.

Conclusion

In this post, we demonstrated how to set up Keycloak as an IdP to access an OpenSearch Serverless dashboard using SAML authentication. For more details, refer to SAML authentication for Amazon OpenSearch Serverless


About the Author

Arpad Csoke is a Solutions Architect at Amazon Web Services. His responsibilities include helping large enterprise customers understand and utilize the AWS environment, acting as a technical consultant to contribute to solving their issues.

How ActionIQ built a truly composable customer data platform using Amazon Redshift

Post Syndicated from Mackenzie Johnson original https://aws.amazon.com/blogs/big-data/how-actioniq-built-a-truly-composable-customer-data-platform-using-amazon-redshift/

This post is written in collaboration with Mackenzie Johnson and Phil Catterall from ActionIQ.

ActionIQ is a leading composable customer data (CDP) platform designed for enterprise brands to grow faster and deliver meaningful experiences for their customers. ActionIQ taps directly into a brand’s data warehouse to build smart audiences, resolve customer identities, and design personalized interactions to unlock revenue across the customer lifecycle. Enterprise brands including Albertsons, Atlassian, Bloomberg, e.l.f. Beauty, DoorDash, HP, and more use ActionIQ to drive growth through better customer experiences.

High costs associated with launching campaigns, the security risk of duplicating data, and the time spent on SQL requests have created a demand for a better solution for managing and activating customer data. Organizations are demanding secure, cost efficient, and time efficient solutions to power their marketing outcomes.

This post will demonstrate how ActionIQ built a connector for Amazon Redshift to tap directly into your data warehouse and deliver a secure, zero-copy CDP. It will cover how you can get started with building a truly composable CDP with Amazon Redshift—from the solution architecture to setting up and testing the connector.

The challenge

Copying or moving data means heavy and complex logistics, along with added incurred cost and security risks associated with replicating data. On the logistics side, data engineering teams have to set up additional extract, transform, and load (ETL) pipelines out of their Amazon Redshift warehouse into ActionIQ, then configure ActionIQ to ingest the data on a recurring basis. Additional ETL jobs means more moving parts, which introduces more potential points of failure, such as breaking schema changes, partial data transfers, delays, and more. All of this requires additional observability overhead to help your team alert on and manage issues as they come up.

These additional ETL jobs add latency to the end-to-end process from data collection to activation, which makes it more likely that your campaigns are activating on stale data and missing key audience members. That will have implications for the customer experience, thereby directly affecting their ability to drive revenue.

The solution

Our solution aims to reduce the logistics already discussed and enables up-to-the-minute data by establishing a secure connection and pushing queries directly down to your data warehouse. Instead of loading full datasets into ActionIQ, the query is pushed to the data warehouse, making it do the hard querying and aggregation work, and wait for the result set.

With Amazon Redshift as your data warehouse, you can run complex workloads with consistently high performance while minimizing the time and effort spent in copying data over to the data warehouse through the use of features like zero-ETL integration with transactional data stores, streaming ingestion, and data sharing. You can also train machine learning models and make predictions directly from your Amazon Redshift data warehouse using familiar SQL commands.

Solution architecture

Within AWS, ActionIQ has a virtual private cloud (VPC) and you have your own VPC. We work within our own private area in AWS, with our own locks and access restrictions. Because ActionIQ is going to have access to your Amazon Redshift data warehouse, this implies that an outside organization (ActionIQ) will be able to make direct database queries to the production database environment.

A Solution architecture diagram showing AWS PrivaeLink set up between ActionIQ's VPC and the customer's VPC

For your information security (infosec) teams to approve this design, we need very clear and tight guardrails to ensure that:

  • ActionIQ only has access to what is absolutely necessary
  • No unintended third party can access these assets

ActionIQ needs to communicate securely to satisfy every information security requirement. To do that, within those AWS environments, you must set up AWS PrivateLink with ActionIQ to create a secure connection between the two VPCs. PrivateLink sets up a secure tunnel between the two VPCs, thus avoiding any opening of either VPC to the public internet. After PrivateLink is set up, ActionIQ needs to be granted privileges to the relevant database objects in your Amazon Redshift data warehouse.

In Amazon Redshift, you must create a distinct database, a service account specifically for ActionIQ, and views to populate the data to be shared with ActionIQ. The views need to adhere to ActionIQ’s data model guidelines, which aren’t rigid, but nonetheless require some structure, such as a clear profile_id that is used in all the views for easy joins between the various data sets.

Getting started with ActionIQ

When starting a hybrid compute integration with Amazon Redshift, it’s key to align your data to ActionIQ’s table types in the following manner:

  • Customer base table: A single dimension table with one record per customer that contains all customers.
  • User info tables: Dimension tables that describe customers and join to the customer base table. They often contain slow-moving or static demographic information and are typically matched one to one with customer records.
  • Event tables: Fact or log-like tables contain events or actions your customers take. The primary key is typically a user_id and timestamp.
  • Entity tables: Dimension tables that describe non-customer objects. They often provide additional information to augment the data in event tables. For example, an entity table could be a product table that contains product metadata and joins to a transaction event table on a product_id.

Visual representation of the relationships of the high level entities in the customer, event and product subject areas

Note: User info and event tables can join on any available identifier to the customer base table, not just the base user ID.

Now you can set up the connection and declare the views in the ActionIQ UI. After ActionIQ establishes a table with master profiles, users can begin to interpret those tables and work with them to build out campaigns.

Establish a secure connection

After setting up PrivateLink, the remaining steps to prepare for hybrid compute are the following:

  1. Create a separate database in Amazon Redshift to define the shared dataset with ActionIQ.
  2. Create a service account for ActionIQ in Amazon Redshift.
  3. Grant READ access to the service account for the dedicated database.
  4. Define the views that will be shared with ActionIQ.

Allow listing

If your data warehouse is on a private network, you must add ActionIQ’s IP addresses to your network’s allow list to allow ActionIQ to access your cloud warehouse. For more information on how to set this up, see the Configure inbound rules for SQL clients.

Database set up

Create an Amazon Redshift user for ActionIQ

  1. Sign in to the Amazon Redshift console
  2. From the navigation menu, choose the Query Editor and connect to your database.
  3. Create a user for ActionIQ:
    CREATE USER actioniq PASSWORD 'password';

  4. Grant permissions to the tables within a given schema you want to give ActionIQ access to:
GRANT USAGE ON SCHEMA 'yourschema' TO actioniq;
GRANT SELECT ON ALL TABLES IN SCHEMA 'yourschema' TO actioniq;

You can then run commands in the query editor to create a new user and to grant permission to the data sets you want to access through ActionIQ.

The result is that ActionIQ now has a programmatic query access to a dedicated database in your Amazon Redshift data warehouse, and that access is limited to that database.

In order to make this easy to govern, we recommend the following guidelines on the shared views:

  • As much as possible, the shared objects should be views and not tables.
  • The views should never use select *, but should explicitly specify each field desired in the view. This has multiple benefits:
    • The schema is robust; even if the underlying table changes, it won’t initiate a change in the shared view
    • It makes it very clear which fields are accessible by ActionIQ and which are not, thereby enabling a proper governance approval process.
  • Limiting privileges to READ access means the data warehouse administrators can be structurally certain that the data views won’t change unless they want them to.

The importance of providing views instead of actual tables is two-fold:

  1. A view doesn’t replicate data. The whole point is to avoid data replication, and we don’t want to replicate data within Amazon Redshift either. With a view, which is essentially a query definition on top of the actual data tables, we avoid the need to replicate data at all. There is a legitimate question of “why not give access to tables directly?” which brings us to the second point.
  2. Tables and data schema change on their own schedule, and ActionIQ needs a stable data schema to work with. By defining a view, we’re also defining a contract for sharing data between you and ActionIQ. The underlying data table can change, and the view definition can absorb this change, without modifying the structure of what the view delivers. This stability is critical for any enterprise software as a service to work effectively with a large organization.

On the ActionIQ side, there’s no caching or persisting of data of any kind. This means that ActionIQ launches a query whenever a scheduled campaign is launched and requires data, and whenever a user of the platform asks for an audience count. In other words, queries will generally happen during business hours, but technically can happen at any time.

Testing the connector

ActionIQ deploys the Amazon Redshift connector and tested queries to validate the success of the connector. After the audience is defined and validated, ActionIQ sends the SQL query to Amazon Redshift and it returns information. We also validate the results with Amazon Redshift to ensure that the logic is correct as intended.

With this, you experience a lean and more transparent deployment process. You can see the queries ActionIQ sends to Amazon Redshift, because the queries are logged. You can see what’s going on, what is attributable to ActionIQ, and can see the growth of adoption and usage.

Image showing connection set up to an Amazon Redshift database

A Connector defines the credentials and other parameters needed to connect to a cloud database or warehouse. The Connector screen is used to create, view and manage your connectors.

Key considerations

Organizations need strong data governance. ActionIQ requires a contract of what the data is going to look like within the defined view. With the dynamic nature of data, strong governance workflows with defined fields are required to run the connector smoothly to achieve the ultimate outcome—driving revenue through marketing campaigns.

Because ActionIQ is used as the central hub of marketing orchestration and activation, it needs to process a lot of queries. Because marketing activity can have significant spikes in activity, it’s prudent to plan for the maximum load on the underlying database.

In one scenario, you might have spikey workloads. With Amazon Redshift Serverless, your data warehouse will be able to scale automatically to manage those spikes. That means Amazon Redshift can absorb large and sudden spikes in queries from ActionIQ without much technical planning.

If workload isolation is a priority and you want to run the ActionIQ workloads using dedicated compute resources, you can use the data sharing feature to create a data share that can be accessed by a dedicated Redshift serverless endpoint. This would allow ActionIQ to query up-to-date data from a separate Redshift serverless instance without the need to copy any data while maintaining complete workload isolation.

The data team needs data to run business intelligence. ActionIQ is driving marketing activation and creating a new data set for the universal contact history—essentially the log of all marketing contacts from the activity. ActionIQ provides this dataset back to Amazon Redshift, which can then be included in the BI reports for ROI measurements.

Conclusion

For your information security teams, the ActionIQ’s Amazon Redshift connector presents a viable solution because ActionIQ doesn’t replicate the data, and the controls outlined establish how ActionIQ accesses the data. Key benefits include:

  • Control: Choose where data is stored and queried to improve security and fit existing technology investments.
  • Performance: Reduce operational effort, increase productivity and cut down on unnecessary technology costs.
  • Power: Use the auto-scaling capabilities of Amazon Redshift for running your workload.

For business teams, the ActionIQ Amazon Redshift connector is querying the freshest data possible. With the connector, there is zero data latency—an important consideration with key audience members that are primed to convert.

ActionIQ is excited to launch the Amazon Redshift connector to activate your data where it lives—within your Amazon Redshift data warehouse—for a zero-copy, real-time experience that drives outcomes with your customers. To learn more about how organizations are modernizing their data platforms using Amazon Redshift, visit the Amazon Redshift page.

Enhance your Amazon Redshift investment with ActionIQ.


About the authors

Mackenzie Johnson is a Senior Manager at ActionIQ. She is an innovative marketing strategist who’s passionate about the convergence of complementary technologies and amplifying joint value. With extensive experience across digital transformation storytelling, she thrives on educating enterprise businesses about the impact of CX based on a data-driven approach.

Phil Catterall is a Senior Product Manager at ActionIQ and leads product development on ActionIQ’s foundational data management, processing, and query federation capabilities. He’s passionate about designing and building scalable data products to empower business users in new ways.

Sain Das is a Senior Product Manager on the Amazon Redshift team and leads Amazon Redshift GTM for partner programs including the Powered by Amazon Redshift and Redshift Ready programs.

Leveraging Amazon Q Developer for Efficient Code Debugging and Maintenance

Post Syndicated from Dr. Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/leveraging-amazon-q-developer-for-efficient-code-debugging-and-maintenance/

In this post, we guide you through five common components of efficient code debugging. We also show you how Amazon Q Developer can significantly reduce the time and effort required to manually identify and fix errors across numerous lines of code. With Amazon Q Developer on your side, you can focus on other aspects of software development, such as design and innovation. This leads to faster development cycles and more frequent releases, as you as a software developer can allocate less time to debugging and more time to building new features and functionality.

Debugging is a crucial part of the software development process. It involves identifying and fixing errors or bugs in the code to ensure that it runs smoothly and efficiently in the production environment. Traditionally, debugging has been a time-consuming and labor-intensive process, requiring developers to manually search through lines of code to investigate and subsequently fix errors. With Amazon Q Developer, a generative AI–powered assistant that helps you learn, plan, create, deploy, and securely manage applications faster, the debugging process becomes more efficient and streamlined. This enables you to easily understand the code, detect anomalies, fix issues and even automatically generate the test cases for your application code.

Let’s consider the following code example, which utilizes Amazon Bedrock, Amazon Polly, and Amazon Simple Storage Service (Amazon S3) to generate a speech explanation of a given prompt. A prompt is the message sent to a model in order to generate a response. It constructs a JSON payload with the prompt (for example – “explain black holes to 8th graders”) and configuration for the Claude model, and invokes the model via the Bedrock Runtime API. Amazon Polly converts the Large Language Model (LLM) response into an audio output format which is then uploaded to an Amazon S3 bucket location:

Sample Code Snippet:

import boto3
import json
# Define boto clients
brt = boto3.client(service_name='bedrock-runtime')
polly = boto3.client('polly')
s3 = boto3.client('s3')
body = json.dumps({
"prompt": "\n\nHuman: explain black holes to 8th graders\n\nAssistant:",
"max_tokens_to_sample": 300,
"temperature": 0.1,
"top_p": 0.9,
})
# Define variable
modelId = 'anthropic.claude-v2'
accept = 'application/json'
contentType = 'application/json'
response = brt.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())
# Print response
print(response_body.get('completion'))
# Send response to Polly
polly_response = polly.synthesize_speech(
Text=response_body.get('completion'),
OutputFormat='mp3',
VoiceId='Joanna'
)
# Write audio to file
with open('output.mp3', 'wb') as file:
file.write(polly_response['AudioStream'].read())
# Upload file to S3 bucket
s3.upload_file('output.mp3', '<REPLACE_WITH_YOUR_BUCKET_NAME>', 'speech.mp3')

Top common ways to debug code:

1. Code Explanation:

The first step in debugging your code is to understand what the code is doing. Very often you have to build upon code from others. Amazon Q Developer comes with an inbuilt feature that allows it to quickly explain the selected code and identify the main components and functions. This feature utilizes natural language processing (NLP) and summarization capability of LLMs, making your code easy to understand, and helping you identify potential issues. To gain deeper insights into your codebase, simply select your code, right-click and select ‘Send to Amazon Q’ option from menu. Finally, click on “Explain” option.

Image showing the interface for sending selected code to Amazon Q Developer for explanation

Amazon Q Developer generates an explanation of the code, highlighting the key components and functions. You can then use this summary to quickly identify potential issues and minimize your debugging efforts. It demonstrates how Amazon Q Developer accepts your code as input and the summarizes it for you. It explains the code in a natural language, which can help developers understand and debug the code.

If you have any follow-up question regarding the given explanation, you can also ask Amazon Q Developer for more in-depth review of a specific task. With respect to the code context, Amazon Q Developer automatically understands the existing code functionality and provides more insights. We ask Amazon Q Developer to explain how the Amazon Polly service is being used for synthesizing in the sample code snippet. It provides detailed step-by-step clarification from the retrieval to the upload process. Using this approach, you can ask any clarifying questions similar to asking a colleague to help you better understand a topic.

Amazon Q Developer explains how Polly is used to Synthesize in the given code

2. Amplify debugging with Logs

The sample code is deficient in terms of support for debugging including proper exception handling mechanisms. This makes it difficult to identify and handle issues during the execution or runtime of the application, further causing delays in the overall development process. You can instruct Amazon Q Developer to add a new code block for debugging your application against potential failures. Amazon Q Developer analyzes the existing code and then recommends to add a new code snippet to include printing and logging of debug messages and exception handling. It embeds ‘try-except’ blocks to catch potential exceptions that may occur. You use the recommended code after reviewing and modifying it. For example, you need to update the variable REPLACE_WITH_YOUR_BUCKET_NAME with your actual Amazon S3 bucket name.

With Amazon Q, you seamlessly enhance your codebase by selecting specific code segments and providing them as prompts. This enables you to explore suggestions for incorporating the required exception handling mechanisms or strategic logging statements.

Amazon Q Developer interface illustrating the addition of debugging code to the original code snippet

Logs are an essential tool for debugging your application code. They allow you to track the execution of your code and identify where errors are occurring. With Amazon Q Developer, you can view and analyze log data from your application, and identify issues and errors that may not be immediately apparent from the code. It demonstrates that the given code does not have the logging statements embedded.

You can provide a prompt to Amazon Q Developer in the chat window asking it to write a code to enable logging and add log statements. This will automatically add logging statements wherever needed. It is useful to log inputs, outputs, and errors when making API calls or invoking models. You may copy the entire recommended code to the clipboard, review it, and then modify as needed. For example, you can decide which logging level you want to use, INFO or DEBUG.

Example of logging statements inserted into the code by Amazon Q Developer for enhanced debugging

3. Anomaly Detection

Another powerful use case for Amazon Q Developer is anomaly detection in your application code. Amazon Q Developer uses machine learning algorithms to identify unusual patterns in your code and highlight potential issues. It can detect issues that may be difficult to detect, such as an array out of bounds, infinite loops, or concurrency issues. For demonstration purposes, we intentionally introduce a simple anomaly in the code. We ask Amazon Q Developer to detect the anomaly in the code when attempting to generate text from the response. A simple prompt about detecting anomalies in the given code generates a useful output. It is able to detect the anomaly in the ‘response_body’ dictionary. Additionally, Amazon Q Developer recommends best practices to check the status code and handle the errors with a code snippet.

Process of submitting selected code to Amazon Q Developer for automated bug fixing

With code anomaly detection at your fingertips, you can promptly identify issues that may be impacting the application’s performance or user experience.

4. Automated Bug Fixing

After conducting all the necessary analysis, you can use Amazon Q Developer to fix bugs and issues in the code. Amazon Q Developer’s automated bug-fixing feature saves you hours you would otherwise spend on debugging and testing the code without its help. Starting with the code example which has an anomaly, you can identify and fix the code issue by simply selecting the code and sending it to Amazon Q Developer to apply fixes.

Process of submitting selected code to Amazon Q Developer for automated bug fixing

Amazon Q Developer identifies and suggests various ways to fix common issues in your code, such as syntax errors, logical errors, and performance issues. It can also recommend optimizations and improvements to your code, helping you to deliver a better user experience and improve your application’s performance.

5. Automated Test Case Generation

Testing is an integral part of code development and debugging. Running high quality test cases improve the quality of the code. With Amazon Q Developer, you can automatically generate the test cases quickly and easily. Amazon Q Developer uses its Large Language Model core to generate test cases based on your code, identifying potential issues and ensuring that your code is comprehensive and reliable.

With automated test case generation, you save time and effort while testing your code, ensuring that it is robust and reliable. A natural language prompt is sent to Amazon Q Developer to suggest test cases for given application code. You can notice that Amazon Q Developer provides a list of possible test cases that can aid in debugging and generating further test cases.

Recommendations and solutions provided by Amazon Q Developer to address issues in the code blocks

After receiving suggestions for test cases, you can also ask Amazon Q Developer to generate code snippet for some or all of the identified test cases.

Test cases proposed by Amazon Q Developer for evaluating the example application code

Conclusion:

Amazon Q Developer revolutionizes the debugging process for developers by leveraging advanced and natural language understanding and generative AI capabilities. From explaining and summarizing code to detecting anomalies, implementing automatic bug fixes, and generating test cases, Amazon Q Developer streamlines common aspects of debugging. Developers can now spend less time manually hunting for errors and more time innovating and building features. By harnessing the power of Amazon Q, organizations can accelerate development cycles, improve code quality, and deliver superior software experiences to their customers.

To get started with Amazon Q Developer for debugging today navigate to Amazon Q Developer in IDE and simply start asking questions about debugging. Additionally, explore Amazon Q Developer workshop for additional hands-on use cases.

For any inquiries or assistance with Amazon Q Developer, please reach out to your AWS account team.

About the authors:

Dr. Rahul Sharad Gaikwad

Dr. Rahul is a Lead DevOps Consultant at AWS, specializing in migrating and modernizing customer workloads on the AWS Cloud. He is a technology enthusiast with a keen interest in Generative AI and DevOps. He received his Ph.D. in AIOps in 2022. He is recipient of the Indian Achievers’ Award (2024), Best PhD Thesis Award (2024), Research Scholar of the Year Award (2023) and Young Researcher Award (2022).

Nishant Dhiman

Nishant is a Senior Solutions Architect at AWS with an extensive background in Serverless, Generative AI, Security and Mobile platform offerings. He is a voracious reader and a passionate technologist. He loves to interact with customers and believes in giving back to community by learning and sharing. Outside of work, he likes to keep himself engaged with podcasts, calligraphy and music.