Tag Archives: Amazon Elastic Kubernetes Service

Let’s Architect! Discovering Generative AI on AWS

2024-04-24 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-generative-ai/

Generative artificial intelligence (generative AI) is a type of AI used to generate content, including conversations, images, videos, and music. Generative AI can be used directly to build customer-facing features (a chatbot or an image generator), or it can serve as an underlying component in a more complex system. For example, it can generate embeddings (or compressed representations) or any other artifact necessary to improve downstream machine learning (ML) models or back-end services.

With the advent of generative AI, it’s fundamental to understand what it is, how it works under the hood, and which options are available for putting it into production. In some cases, it can also be helpful to move closer to the underlying model in order to fine tune or drive domain-specific improvements. With this edition of Let’s Architect!, we’ll cover these topics and share an initial set of methodologies to put generative AI into production. We’ll start with a broad introduction to the domain and then share a mix of videos, blogs, and hands-on workshops.

Navigating the future of AI

Many teams are turning to open source tools running on Kubernetes to help accelerate their ML and generative AI journeys. In this video session, experts discuss why Kubernetes is ideal for ML, then tackle challenges like dependency management and security. You will learn how tools like Ray, JupyterHub, Argo Workflows, and Karpenter can accelerate your path to building and deploying generative AI applications on Amazon Elastic Kubernetes Service (Amazon EKS). A real-world example showcases how Adobe leveraged Amazon EKS to achieve faster time-to-market and reduced costs. You will be also introduced to Data on EKS, a new AWS project offering best practices for deploying various data workloads on Amazon EKS.

Take me to this video!

Figure 1. Containers are a powerful tool for creating reproducible research and production environments for ML.

Generative AI: Architectures and applications in depth

This video session aims to provide an in-depth exploration of the emerging concepts in generative AI. By delving into practical applications and detailing best practices for implementation, the session offers a concrete understanding that empowers businesses to harness the full potential of these technologies. You can gain valuable insights into navigating the complexities of generative AI, equipping you with the knowledge and strategies necessary to stay ahead of the curve and capitalize on the transformative power of these new methods. If you want to dive even deeper, check this generative AI best practices post.

Take me to this video!

Figure 2. Models are growing exponentially: improved capabilities come with higher costs for productionizing them.

SaaS meets AI/ML & generative AI: Multi-tenant patterns & strategies

Working with AI/ML workloads and generative AI in a production environment requires appropriate system design and careful considerations for tenant separation in the context of SaaS. You’ll need to think about how the different tenants are mapped to models, how inferencing is scaled, how solutions are integrated with other upstream/downstream services, and how large language models (LLMs) can be fine-tuned to meet tenant-specific needs.

This video drills down into the concept of multi-tenancy for AI/ML workloads, including the common design, performance, isolation, and experience challenges that you can find during your journey. You will also become familiar with concepts like RAG (used to enrich the LLMs with contextual information) and fine tuning through practical examples.

Take me to this video!

Supporting different tenants might need fetching different context information with RAGs or offering different options for fine-tuning.

Figure 3. Supporting different tenants might need fetching different context information with RAGs or offering different options for fine-tuning.

Achieve DevOps maturity with BMC AMI zAdviser Enterprise and Amazon Bedrock

DevOps Research and Assessment (DORA) metrics, which measure critical DevOps performance indicators like lead time, are essential to engineering practices, as shown in the Accelerate book‘s research. By leveraging generative AI technology, the zAdviser Enterprise platform can now offer in-depth insights and actionable recommendations to help organizations optimize their DevOps practices and drive continuous improvement. This blog demonstrates how generative AI can go beyond language or image generation, applying to a wide spectrum of domains.

Take me to this blog post!

Figure 4. Generative AI is used to provide summarization, analysis, and recommendations for improvement based on the DORA metrics.

Hands-on Generative AI: AWS workshops

Getting hands on is often the best way to understand how everything works in practice and create the mental model to connect theoretical foundations with some real-world applications.

Generative AI on Amazon SageMaker shows how you can build, train, and deploy generative AI models. You can learn about options to fine-tune, use out-of-the-box existing models, or even customize the existing open source models based on your needs.

Building with Amazon Bedrock and LangChain demonstrates how an existing fully-managed service provided by AWS can be used when you work with foundational models, covering a wide variety of use cases. Also, if you want a quick guide for prompt engineering, you can check out the PartyRock lab in the workshop.

Figure 5. An image replacement example that you can find in the workshop.

See you next time!

Thanks for reading! We hope you got some insight into the applications of generative AI and discovered new strategies for using it. In the next blog, we will dive deeper into machine learning.

To revisit any of our previous posts or explore the entire series, visit the Let’s Architect! page.

AWS Weekly Roundup: Anthropic’s Claude 3 Opus in Amazon Bedrock, Meta Llama 3 in Amazon SageMaker JumpStart, and more (April 22, 2024)

2024-04-22 Esra Kayabali

Post Syndicated from Esra Kayabali original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-anthropics-claude-3-opus-in-amazon-bedrock-meta-llama-3-in-amazon-sagemaker-jumpstart-and-more-april-22-2024/

AWS Summits continue to rock the world, with events taking place in various locations around the globe. AWS Summit London (April 24) is the last one in April, and there are nine more in May, including AWS Summit Berlin (May 15–16), AWS Summit Los Angeles (May 22), and AWS Summit Dubai (May 29). Join us to connect, collaborate, and learn about AWS!

While you decide which summit to attend, let’s look at the last week’s new announcements.

Last week’s launches
Last week was another busy one in the world of artificial intelligence (AI). Here are some launches that got my attention.

Anthropic’s Claude 3 Opus now available in Amazon Bedrock – After Claude 3 Sonnet and Claude 3 Haiku, two of the three state-of-the-art models of Anthropic’s Claude 3, Opus is now available in Amazon Bedrock. Cluade 3 Opus is at the forefront of generative AI, demonstrating comprehension and fluency on complicated tasks at nearly human levels. Like the rest of the Claude 3 family, Opus can process images and return text outputs. Claude 3 Opus shows an estimated twofold gain in accuracy over Claude 2.1 on difficult open-ended questions, reducing the likelihood of faulty responses.

Meta Llama 3 now available in Amazon SageMaker JumpStart – Meta Llama 3 is now available in Amazon SageMaker JumpStart, a machine learning (ML) hub that can help you accelerate your ML journey. You can deploy and use Llama 3 foundation models (FMs) with a few steps in Amazon SageMaker Studio or programmatically through the Amazon SageMaker Python SDK. Llama is available in two parameter sizes, 8B and 70B, and can be used to support a broad range of use cases, with improvements in reasoning, code generation, and instruction following. The model will be deployed in an AWS secure environment under your VPC controls, helping ensure data security.

Built-in SQL extension with Amazon SageMaker Studio Notebooks – SageMaker Studio’s JupyterLab now includes a built-in SQL extension to discover, explore, and transform data from various sources using SQL and Python directly within the notebooks. You can now seamlessly connect to popular data services and easily browse and search databases, schemas, tables, and views. You can also preview data within the notebook interface. New features such as SQL command completion, code formatting assistance, and syntax highlighting improve developer productivity. To learn more, visit Explore data with ease: Use SQL and Text-to-SQL in Amazon SageMaker Studio JupyterLab notebooks and the SageMaker Developer Guide.

AWS Split Cost Allocation Data for Amazon EKS – You can now receive granular cost visibility for Amazon Elastic Kubernetes Service (Amazon EKS) in the AWS Cost and Usage Reports (CUR) to analyze, optimize, and chargeback cost and usage for your Kubernetes applications. You can allocate application costs to individual business units and teams based on how Kubernetes applications consume shared Amazon EC2 CPU and memory resources. You can aggregate these costs by cluster, namespace, and other Kubernetes primitives to allocate costs to individual business units or teams. These cost details will be accessible in the CUR 24 hours after opt-in. You can use the Containers Cost Allocation dashboard to visualize the costs in Amazon QuickSight and the CUR query library to query the costs using Amazon Athena.

AWS KMS automatic key rotation enhancements – AWS Key Management Service (AWS KMS) introduces faster options for automatic symmetric key rotation. You can now customize rotation frequency between 90 days to 7 years, invoke key rotation on demand for customer-managed AWS KMS keys, and view the rotation history for any rotated AWS KMS key. There is a nice post on the Security Blog you can visit to learn more about this feature, including a little bit of history about cryptography.

Amazon Personalize automatic solution training – Amazon Personalize now offers automatic training for solutions. With automatic training, you can set a cadence for your Amazon Personalize solutions to automatically retrain using the latest data from your dataset group. This process creates a newly trained machine learning (ML) model, also known as a solution version, and maintains the relevance of Amazon Personalize recommendations for end users. Automatic training mitigates model drift and makes sure recommendations align with users’ evolving behaviors and preferences. With Amazon Personalize, you can personalize your website, app, ads, emails, and more, using the same machine learning technology used by Amazon, without requiring any prior ML experience. To get started with Amazon Personalize, visit our documentation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

We launched existing services and instance types in additional Regions:

Amazon RDS for Oracle extends support for x2iedn in Asia Pacific (Hyderabad, Jakarta, and Osaka), Europe (Milan and Paris), US West (N. California), AWS GovCloud (US-East), and AWS GovCloud (US-West). X2iedn instances are targeted for enterprise-class high-performance databases with high compute (up to 128 vCPUs), large memory (up to 4 TB) and storage throughput requirements (up to 256K IOPS) with a 32:1 ratio of memory to vCPU.
Amazon MSK is now available in Canada West (Calgary) Region. Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service for Apache Kafka and Kafka Connect that makes it easier for you to build and run applications that use Apache Kafka as a data store.
Amazon Cognito is now available in Europe (Spain) Region. Amazon Cognito makes it easy to add authentication, authorization, and user management to your web and mobile apps supporting sign-in with social identity providers such as Apple, Facebook, Google, and Amazon, and enterprise identity providers through standards such as SAML 2.0 and OpenID Connect.
AWS Network Manager is now available in AWS Israel (Tel Aviv) Region. AWS Network Manager reduces the operational complexity of managing global networks across AWS and on-premises locations by providing a single global view of your private network.
AWS Storage Gateway is now available in AWS Canada West (Calgary) Region. AWS Storage Gateway is a hybrid cloud storage service that provides on-premises applications access to virtually unlimited storage in the cloud.
Amazon SQS announces support for FIFO dead-letter queue redrive in the AWS GovCloud (US) Regions. Dead-letter queue redrive is an enhanced capability to improve the dead-letter queue management experience for Amazon Simple Queue Service (Amazon SQS) customers.
Amazon EC2 R6gd instances are now available in Europe (Zurich) Region. R6gd instances are powered by AWS Graviton2 processors and are built on the AWS Nitro System. These instances offer up to 25 Gbps of network bandwidth, up to 19 Gbps of bandwidth to the Amazon Elastic Block Store (Amazon EBS), up to 512 GiB RAM, and up to 3.8TB of NVMe SSD local instance storage.
Amazon Simple Email Service is now available in the AWS GovCloud (US-East) Region. Amazon Simple Email Service (SES) is a scalable, cost-effective, and flexible cloud-based email service that allows you to send marketing, notification, and transactional emails from within any application. To learn more, visit Amazon SES page.
AWS Glue Studio Notebooks is now available in the Middle East (UAE), Asia Pacific (Hyderabad), Asia Pacific (Melbourne), Israel (Tel Aviv), Europe (Spain), and Europe (Zurich) Regions. AWS Glue Studio Notebooks provides interactive job authoring in AWS Glue, which helps simplify the process of developing data integration jobs. To learn more, visit Authoring code with AWS Glue Studio notebooks.
Amazon S3 Access Grants is now available in in the Middle East (UAE), Asia Pacific (Melbourne), Asia Pacific (Hyderabad), and Europe (Spain) Regions. Amazon Simple Storage Service (Amazon S3) Access Grants map identities in directories such as Microsoft Entra ID, or AWS Identity and Access Management (IAM) principals, to datasets in S3. This helps you manage data permissions at scale by automatically granting S3 access to end-users based on their corporate identity. To learn more, visit Amazon S3 Access Grants page.

Other AWS news
Here are some additional news that you might find interesting:

The PartyRock Generative AI Hackathon winners – The PartyRock Generative AI Hackathon concluded with over 7,650 registrants submitting 1,200 projects across four challenge categories, featuring top winners like Parable Rhythm – The Interactive Crime Thriller, Faith – Manga Creation Tools, and Arghhhh! Zombie. Participants showed remarkable creativity and technical prowess, with prizes totaling $60,000 in AWS credits.

I tried the Faith – Manga Creation Tools app using my daughter Arya’s made-up stories and ideas and the result was quite impressive.

Visit Jeff Barr’s post to learn more about how to try the apps for yourself.

AWS open source news and updates – My colleague Ricardo writes about open source projects, tools, and events from the AWS Community. Check out Ricardo’s page for the latest updates.

Upcoming AWS events
Check your calendars and sign up for upcoming AWS events:

AWS Summits – Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Singapore (May 7), Seoul (May 16–17), Hong Kong (May 22), Milan (May 23), Stockholm (June 4), and Madrid (June 5).

AWS re:Inforce – Explore cloud security in the age of generative AI at AWS re:Inforce, June 10–12 in Pennsylvania for 2.5 days of immersive cloud security learning designed to help drive your business initiatives.

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Turkey (May 18), Midwest | Columbus (June 13), Sri Lanka (June 27), Cameroon (July 13), Nigeria (August 24), and New York (August 28).

You can browse all upcoming AWS led in-person and virtual events and developer-focused events here.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Esra

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Integrate Kubernetes policy-as-code solutions into Security Hub

2024-04-18 Joaquin Manuel Rinaudo

Post Syndicated from Joaquin Manuel Rinaudo original https://aws.amazon.com/blogs/security/integrate-kubernetes-policy-as-code-solutions-into-security-hub/

Using Kubernetes policy-as-code (PaC) solutions, administrators and security professionals can enforce organization policies to Kubernetes resources. There are several publicly available PAC solutions that are available for Kubernetes, such as Gatekeeper, Polaris, and Kyverno.

PaC solutions usually implement two features:

Use Kubernetes admission controllers to validate or modify objects before they’re created to help enforce configuration best practices for your clusters.
Provide a way for you to scan your resources created before policies were deployed or against new policies being evaluated.

This post presents a solution to send policy violations from PaC solutions using Kubernetes policy report format (for example, using Kyverno) or from Gatekeeper’s constraints status directly to AWS Security Hub. With this solution, you can visualize Kubernetes security misconfigurations across your Amazon Elastic Kubernetes Service (Amazon EKS) clusters and your organizations in AWS Organizations. This can also help you implement standard security use cases—such as unified security reporting, escalation through a ticketing system, or automated remediation—on top of Security Hub to help improve your overall Kubernetes security posture and reduce manual efforts.

Solution overview

The solution uses the approach described in A Container-Free Way to Configure Kubernetes Using AWS Lambda to deploy an AWS Lambda function that periodically synchronizes the security status of a Kubernetes cluster from a Kubernetes or Gatekeeper policy report with Security Hub. Figure 1 shows the architecture diagram for the solution.

Figure 1: Diagram of solution

This solution works using the following resources and configurations:

A scheduled event which invokes a Lambda function on a 10-minute interval.
The Lambda function iterates through each running EKS cluster that you want to integrate and authenticate by using a Kubernetes Python client and an AWS Identity and Access Management (IAM) role of the Lambda function.
For each running cluster, the Lambda function retrieves the selected Kubernetes policy reports (or the Gatekeeper constraint status, depending on the policy selected) and sends active violations, if present, to Security Hub. With Gatekeeper, if more violations exist than those reported in the constraint, an additional INFORMATIONAL finding is generated in Security Hub to let security teams know of the missing findings.
Optional: EKS cluster administrators can raise the limit of reported policy violations by using the –constraint-violations-limit flag in their Gatekeeper audit operation.
For each running cluster, the Lambda function archives archive previously raised and resolved findings in Security Hub.

You can download the solution from this GitHub repository.

Walkthrough

In the walkthrough, I show you how to deploy a Kubernetes policy-as-code solution and forward the findings to Security Hub. We’ll configure Kyverno and a Kubernetes demo environment with findings in an existing EKS cluster to Security Hub.

The code provided includes an example constraint and noncompliant resource to test against.

Prerequisites

An EKS cluster is required to set up this solution within your AWS environments. The cluster should be configured with either aws-auth ConfigMap or access entries. Optional: You can use eksctl to create a cluster.

The following resources need to be installed on your computer:

Git command line interface.
Bash shell. On Windows 10, you can install the Windows Subsystem for Linux
AWS Command Line Interface (AWS CLI)
eksctl and Kubectl
Python3 and pip

Step 1: Set up the environment

The first step is to install Kyverno on an existing Kubernetes cluster. Then deploy examples of a Kyverno policy and noncompliant resources.

Deploy Kyverno example and policy

Deploy Kyverno in your Kubernetes cluster according to its installation manual using the Kubernetes CLI.
```
kubectl create -f https://github.com/kyverno/kyverno/releases/download/v1.10.0/install.yaml
```

Set up a policy that requires namespaces to use the label thisshouldntexist.

kubectl create -f - << EOF
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-ns-labels
spec:
  validationFailureAction: Audit
  background: true
  rules:
  - name: check-for-labels-on-namespace
    match:
      any:
      - resources:
          kinds:
          - Namespace
    validate:
      message: "The label thisshouldntexist is required."
      pattern:
        metadata:
          labels:
            thisshouldntexist: "?*"
EOF

Deploy a noncompliant resource to test this solution

Create a noncompliant namespace.
```
kubectl create namespace non-compliant
```
Check the Kubernetes policy report status using the following command:
```
kubectl get clusterpolicyreport -o yaml
```

You should see output similar to the following:

apiVersion: v1
items:
- apiVersion: wgpolicyk8s.io/v1alpha2
  kind: ClusterPolicyReport
  metadata:
    creationTimestamp: "2024-02-20T14:00:37Z"
    generation: 1
    labels:
      app.kubernetes.io/managed-by: kyverno
      cpol.kyverno.io/require-ns-labels: "3734083"
    name: cpol-require-ns-labels
    resourceVersion: "3734261"
    uid: 3cfcf1da-bd28-453f-b2f5-512c26065986
  results:
   ...
  - message: 'validation error: The label thisshouldntexist is required. rule check-for-labels-on-namespace
      failed at path /metadata/labels/thisshouldntexist/'
    policy: require-ns-labels
    resources:
    - apiVersion: v1
      kind: Namespace
      name: non-compliant
      uid: d62eb1ad-8a0b-476b-848d-ff6542c57840
    result: fail
    rule: check-for-labels-on-namespace
    scored: true
    source: kyverno
    timestamp:
      nanos: 0
      seconds: 1708437615

Step 2: Solution code deployment and configuration

The next step is to clone and deploy the solution that integrates with Security Hub.

To deploy the solution

Clone the GitHub repository by using your preferred command line terminal:

git clone https://github.com/aws-samples/securityhub-k8s-policy-integration.git

Open the parameters.json file and configure the following values:
1. Policy – Name of the product that you want to enable, in this case policyreport, which is supported by tools such as Kyverno.
2. ClusterNames – List of EKS clusters. When AccessEntryEnabled is enabled, this solution deploys an access entry for the integration to access your EKS clusters.
3. SubnetIds – (Optional) A comma-separated list of your subnets. If you’ve configured the API endpoints of your EKS clusters as private only, then you need to configure this parameter. If your EKS clusters have public endpoints enabled, you can remove this parameter.
4. SecurityGroupId – (Optional) A security group ID that allows connectivity to the EKS clusters. This parameter is only required if you’re running private API endpoints; otherwise, you can remove it. This security group should be allowed ingress from the security group of the EKS control plane.
5. AccessEntryEnabled – (Optional) If you’re using EKS access entries, the solution automatically deploys the access entries with read-only-group permissions deployed in the next step. This parameter is True by default.
Save the changes and close the parameters file.
Set up your AWS_REGION (for example, export AWS_REGION=eu-west-1) and make sure that your credentials are configured for the delegated administrator account.
Enter the following command to deploy:
```
./deploy.sh
```

You should see the following output:

Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - aws-securityhub-k8s-policy-integration

Step 3: Set up EKS cluster access

You need to create the Kubernetes Group read-only-group to allow read-only permissions to the IAM role of the Lambda function. If you aren’t using access entries, you will also need to modify the aws-auth ConfigMap of the Kubernetes clusters.

To configure access to EKS clusters

For each cluster that’s running in your account, run the kube-setup.sh script to create the Kubernetes read-only cluster role and cluster role binding.
(Optional) Configure aws-auth ConfigMap using eksctl if you aren’t using access entries.

Step 4: Verify AWS service integration

The next step is to verify that the Lambda integration to Security Hub is running.

To verify the integration is running

Open the Lambda console, and navigate to the aws-securityhub-k8s-policy-integration-<region> function.
Start a test to import your cluster’s noncompliant findings to Security Hub.
In the Security Hub console, review the recently created findings from Kyverno.

Figure 2: Sample Kyverno findings in Security Hub

Step 5: Clean up

The final step is to clean up the resources that you created for this walkthrough.

To destroy the stack

Use the command line terminal in your laptop to run the following command:
```
./cleanup.sh
```

Conclusion

In this post, you learned how to integrate Kubernetes policy report findings with Security Hub and tested this setup by using the Kyverno policy engine. If you want to test the integration of this solution with Gatekeeper, you can find alternative commands for step 1 of this post in the GitHub repository’s README file.

Using this integration, you can gain visibility into your Kubernetes security posture across EKS clusters and join it with a centralized view, together with other security findings such as those from AWS Config, Amazon Inspector, and more across your organization. You can also try this solution with other tools, such as kube-bench or Gatekeeper. You can extend this setup to notify security teams of critical misconfigurations or implement automated remediation actions by using AWS Security Hub.

For more information on how to use PaC solutions to secure Kubernetes workloads in the AWS cloud, see Amazon Elastic Kubernetes Service (Amazon EKS) workshop, Amazon EKS best practices, Using Gatekeeper as a drop-in Pod Security Policy replacement in Amazon EKS and Policy-based countermeasures for Kubernetes.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

AWS Weekly Roundup: Amazon EC2 G6 instances, Mistral Large on Amazon Bedrock, AWS Deadline Cloud, and more (April 8, 2024)

2024-04-08 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-mistral-large-aws-clean-rooms-ml-aws-deadline-cloud-and-more-april-8-2024/

We’re just two days away from AWS Summit Sydney (April 10–11) and a month away from the AWS Summit season in Southeast Asia, starting with the AWS Summit Singapore (May 7) and the AWS Summit Bangkok (May 30). If you happen to be in Sydney, Singapore, or Bangkok around those dates, please join us.

Last Week’s Launches
If you haven’t read last week’s Weekly Roundup yet, Channy wrote about the AWS Chips Taste Test, a new initiative from Jeff Barr as part of April’ Fools Day.

Here are some launches that caught my attention last week:

New Amazon EC2 G6 instances — We announced the general availability of Amazon EC2 G6 instances powered by NVIDIA L4 Tensor Core GPUs. G6 instances can be used for a wide range of graphics-intensive and machine learning use cases. G6 instances deliver up to 2x higher performance for deep learning inference and graphics workloads compared to Amazon EC2 G4dn instances. To learn more, visit the Amazon EC2 G6 instance page.

Mistral Large is now available in Amazon Bedrock — Veliswa wrote about the availability of the Mistral Large foundation model, as part of the Amazon Bedrock service. You can use Mistral Large to handle complex tasks that require substantial reasoning capabilities. In addition, Amazon Bedrock is now available in the Paris AWS Region.

Amazon Aurora zero-ETL integration with Amazon Redshift now in additional Regions — Zero-ETL integration announcements were my favourite launches last year. This Zero-ETL integration simplifies the process of transferring data between the two services, allowing customers to move data between Amazon Aurora and Amazon Redshift without the need for manual Extract, Transform, and Load (ETL) processes. With this announcement, Zero-ETL integrations between Amazon Aurora and Amazon Redshift is now supported in 11 additional Regions.

Announcing AWS Deadline Cloud — If you’re working in films, TV shows, commercials, games, and industrial design and handling complex rendering management for teams creating 2D and 3D visual assets, then you’ll be excited about AWS Deadline Cloud. This new managed service simplifies the deployment and management of render farms for media and entertainment workloads.

AWS Clean Rooms ML is Now Generally Available — Last year, I wrote about the preview of AWS Clean Rooms ML. In that post, I elaborated a new capability of AWS Clean Rooms that helps you and your partners apply machine learning (ML) models on your collective data without copying or sharing raw data with each other. Now, AWS Clean Rooms ML is available for you to use.

Knowledge Bases for Amazon Bedrock now supports private network policies for OpenSearch Serverless — Here’s exciting news for you who are building with Amazon Bedrock. Now, you can implement Retrieval-Augmented Generation (RAG) with Knowledge Bases for Amazon Bedrock using Amazon OpenSearch Serverless (OSS) collections that have a private network policy.

Amazon EKS extended support for Kubernetes versions now generally available — If you’re running Kubernetes version 1.21 and higher, with this Extended Support for Kubernetes, you can stay up-to-date with the latest Kubernetes features and security improvements on Amazon EKS.

AWS Lambda Adds Support for Ruby 3.3 — Coding in Ruby? Now, AWS Lambda supports Ruby 3.3 as its runtime. This update allows you to take advantage of the latest features and improvements in the Ruby language.

Amazon EventBridge Console Enhancements — The Amazon EventBridge console has been updated with new features and improvements, making it easier for you to manage your event-driven applications with a better user experience.

Private Access to the AWS Management Console in Commercial Regions — If you need to restrict access to personal AWS accounts from the company network, you can use AWS Management Console Private Access. With this launch, you can use AWS Management Console Private Access in all commercial AWS Regions.

From community.aws
The community.aws is a home for us, builders, to share our learnings with building on AWS. Here’s my Top 3 posts from last week:

Other AWS News
Here are some additional news items, open-source projects, and Twitch shows that you might find interesting:

Build On Generative AI – Join Tiffany and Darko to learn more about generative AI, see their demos and discuss different aspects of generative AI with the guest speakers. Streaming every Monday on Twitch, 9:00 AM US PT.

AWS open source news and updates – If you’re looking for various open-source projects and tools from the AWS community, please read the AWS open-source newsletter maintained by my colleague, Ricardo.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS Summits – Join free online and in-person events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Register in your nearest city: Amsterdam (April 9), Sydney (April 10–11), London (April 24), Singapore (May 7), Berlin (May 15–16), Seoul (May 16–17), Hong Kong (May 22), Milan (May 23), Dubai (May 29), Thailand (May 30), Stockholm (June 4), and Madrid (June 5).

AWS re:Inforce – Explore cloud security in the age of generative AI at AWS re:Inforce, June 10–12 in Pennsylvania for two-and-a-half days of immersive cloud security learning designed to help drive your business initiatives.

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Poland (April 11), Bay Area (April 12), Kenya (April 20), and Turkey (May 18).

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Donnie

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Applying Spot-to-Spot consolidation best practices with Karpenter

2024-03-26 Chris Munns

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/applying-spot-to-spot-consolidation-best-practices-with-karpenter/

This post is written by Robert Northard – AWS Container Specialist Solutions Architect, and Carlos Manzanedo Rueda – AWS WW SA Leader for Efficient Compute

Karpenter is an open source node lifecycle management project built for Kubernetes. In this post, you will learn how to use the new Spot-to-Spot consolidation functionality released in Karpenter v0.34.0, which helps further optimize your cluster. Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances are spare Amazon EC2 capacity available for up to 90% off compared to On-Demand prices. One difference between On-Demand and Spot is that Spot Instances can be interrupted by Amazon EC2 when the capacity is needed back. Karpenter’s built-in support for Spot Instances allows users to seamlessly implement Spot best practices and helps users optimize the cost of stateless, fault tolerant workloads. For example, when Karpenter observes a Spot interruption, it automatically starts a new node in response.

Karpenter provisions nodes in response to unschedulable pods based on aggregated CPU, memory, volume requests, and other scheduling constraints. Over time, Karpenter has added functionality to simplify instance lifecycle configuration, providing a termination controller, instance expiration, and drift detection. Karpenter also helps optimize Kubernetes clusters by selecting the optimal instances while still respecting Kubernetes pod-to-node placement nuances, such as nodeSelector, affinity and anti-affinity, taints and tolerations, and topology spread constraints.

The Kubernetes scheduler assigns pods to nodes based on their scheduling constraints. Over time, as workloads are scaled out and scaled in or as new instances join and leave, the cluster placement and instance load might end up not being optimal. In many cases, it results in unnecessary extra costs. Karpenter has a consolidation feature that improves cluster placement by identifying and taking action in situations such as:

when a node is empty
when a node can be removed as the pods that are running on it can be rescheduled into other existing nodes
when the number of pods in a node has gone down and the node can now be replaced with a lower-priced and rightsized variant (which is shown in the following figure)

Karpenter consolidation, replacing one 2xlarge Amazon EC2 Instance with an xlarge Amazon EC2 Instance.

Karpenter versions prior to v0.34.0 only supported consolidation for Amazon EC2 On-Demand Instances. On-Demand consolidation allowed consolidating from On-Demand into Spot Instances and to lower-priced On-Demand Instances. However, once a pod was placed on a Spot Instance, Spot nodes were only removed when the nodes were empty. In v0.34.0, you can enable the feature gate to use Spot-to-Spot consolidation.

Solution overview

When launching Spot Instances, Karpenter uses the price-capacity-optimized allocation strategy when calling the Amazon EC2 instant Fleet API (shown in the following figure) and passes in a selection of compute instance types based on the Karpenter NodePool configuration. The Amazon EC2 Fleet API in instant mode is a synchronous API call that immediately returns a list of instances that launched and any instance that could not be launched. For any instances that could not be launched, Karpenter might request alternative capacity or remove any soft Kubernetes scheduling constraints for the workload.

Karpenter instance orchestration

Spot-to-Spot consolidation needed an approach that was different from On-Demand consolidation. For On-Demand consolidation, rightsizing and lowest price are the main metrics used. For Spot-to-Spot consolidation to take place, Karpenter requires a diversified instance configuration (see the example NodePool defined in the walkthrough) with at least 15 instances types. Without this constraint, there would be a risk of Karpenter selecting an instance that has lower availability and, therefore, higher frequency of interruption.

Prerequisites

The following prerequisites are required to complete the walkthrough:

Install an Amazon Elastic Kubernetes Service (Amazon EKS) cluster (version 1.29 or higher) with Karpenter (v0.34.0 or higher). The Karpenter Getting Started Guide provides steps for setting up an Amazon EKS cluster and adding Karpenter.
Enable replacement with Spot consolidation through the SpotToSpotConsolidation feature gate. This can be enabled during a helm install of the Karpenter chart by adding –-set settings.featureGates.spotToSpotConsolidation=true argument.
Install kubectl, the Kubernetes command line tool for communicating with the Kubernetes control plane API, and kubectl context configured with Cluster Operator and Cluster Developer permissions.

Walkthrough

The following walkthrough guides you through the steps for simulating Spot-to-Spot consolidation.

1. Create a Karpenter NodePool and EC2NodeClass

Create a Karpenter NodePool and EC2NodeClass. Replace the following with your own values. If you used the Karpenter Getting Started Guide to create your installation, then the value would be your cluster name.

Replace <karpenter-discovery-tag-value> with your subnet tag for Karpenter subnet and security group auto-discovery.
Replace <role-name> with the name of the AWS Identity and Access Management (IAM) role for node identity.

cat <<EOF > nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        intent: apps
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c","m","r"]
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values: ["nano","micro","small","medium"]
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values: ["nitro"]
  limits:
    cpu: 100
    memory: 100Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  subnetSelectorTerms:          
    - tags:
        karpenter.sh/discovery: "<karpenter-discovery-tag-value>"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "<karpenter-discovery-tag-value>"
  role: "<role-name>"
  tags:
    Name: karpenter.sh/nodepool/default
    IntentLabel: "apps"
EOF

kubectl apply -f nodepool.yaml

The NodePool definition demonstrates a flexible configuration with instances from the C, M, or R EC2 instance families. The configuration is restricted to use smaller instance sizes but is still diversified as much as possible. For example, this might be needed in scenarios where you deploy observability DaemonSets. If your workload has specific requirements, then see the supported well-known labels in the Karpenter documentation.

2. Deploy a sample workload

Deploy a sample workload by running the following command. This command creates a Deployment with five pod replicas using the pause container image:

cat <<EOF > inflate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        intent: apps
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
              memory: 1.5Gi
EOF
kubectl apply -f inflate.yaml

Next, check the Kubernetes nodes by running a kubectl get nodes CLI command. The capacity pool (instance type and Availability Zone) selected depends on any Kubernetes scheduling constraints and spare capacity size. Therefore, it might differ from this example in the walkthrough. You can see Karpenter launched a new node of instance type c6g.2xlarge, an AWS Graviton2-based instance, in the eu-west-1c Region:

$ kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type -L topology.kubernetes.io/zone -L karpenter.sh/capacity-type

NAME                                     STATUS   ROLES    AGE   VERSION               NODEPOOL   INSTANCE-TYPE   ZONE         CAPACITY-TYPE
ip-10-0-12-17.eu-west-1.compute.internal Ready    <none>   80s   v1.29.0-eks-a5ec690   default    c6g.2xlarge     eu-west-1c   spot

3. Scale in a sample workload to observe consolidation

To invoke a Karpenter consolidation event scale, inflate the deployment to 1. Run the following command:

kubectl scale --replicas=1 deployment/inflate

Tail the Karpenter logs by running the following command. If you installed Karpenter in a different Kubernetes namespace, then replace the name for the -n argument in the command:

kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter --all-containers=true -f --tail=20

After a few seconds, you should see the following disruption via consolidation message in the Karpenter logs. The message indicates the c6g.2xlarge Spot node has been targeted for replacement and Karpenter has passed the following 15 instance types—m6gd.xlarge, m5dn.large, c7a.xlarge, r6g.large, r6a.xlarge and 10 other(s)—to the Amazon EC2 Fleet API:

{"level":"INFO","time":"2024-02-19T12:09:50.299Z","logger":"controller.disruption","message":"disrupting via consolidation replace, terminating 1 candidates ip-10-0-12-181.eu-west-1.compute.internal/c6g.2xlarge/spot and replacing with spot node from types m6gd.xlarge, m5dn.large, c7a.xlarge, r6g.large, r6a.xlarge and 10 other(s)","commit":"17d6c05","command-id":"60f27cb5-98fa-40fb-8231-05b31fd41892"}

Check the Kubernetes nodes by running the following kubectl get nodes CLI command. You can see that Karpenter launched a new node of instance type c6g.large:

$ kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type -L topology.kubernetes.io/zone -L karpenter.sh/capacity-type

NAME                                      STATUS   ROLES    AGE   VERSION               NODEPOOL   INSTANCE-TYPE ZONE       CAPACITY-TYPE
ip-10-0-12-156.eu-west-1.compute.internal           Ready    <none>   2m1s   v1.29.0-eks-a5ec690   default    c6g.large       eu-west-1c   spot

Use kubectl get nodeclaims to list all objects of type NodeClaim and then describe the NodeClaim Kubernetes resource using kubectl get nodeclaim/<claim-name> -o yaml. In the NodeClaim .spec.requirements, you can also see the 15 instance types passed to the Amazon EC2 Fleet API:

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
...
spec:
  nodeClassRef:
    name: default
  requirements:
  ...
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c5.large
    - c5ad.large
    - c6g.large
    - c6gn.large
    - c6i.large
    - c6id.large
    - c7a.large
    - c7g.large
    - c7gd.large
    - m6a.large
    - m6g.large
    - m6gd.large
    - m7g.large
    - m7i-flex.large
    - r6g.large
...

What would happen if a Spot node could not be consolidated?

If a Spot node cannot be consolidated because there are not 15 instance types in the compute selection, then the following message will appear in the events for the NodeClaim object. You might get this event if you overly constrained your instance type selection:

Normal  Unconsolidatable   31s   karpenter  SpotToSpotConsolidation requires 15 cheaper instance type options than the current candidate to consolidate, got 1

Spot best practices with Karpenter

The following are some best practices to consider when using Spot Instances with Karpenter.

Avoid overly constraining instance type selection: Karpenter selects Spot Instances using the price-capacity-optimized allocation strategy, which balances the price and availability of AWS spare capacity. Although a minimum of 15 instances are needed, you should avoid constraining instance types as much as possible. By not constraining instance types, there is a higher chance of acquiring Spot capacity at large scales with a lower frequency of Spot Instance interruptions at a lower cost.
Gracefully handle Spot interruptions and consolidation actions: Karpenter natively handles Spot interruption notifications by consuming events from an Amazon Simple Queue Service (Amazon SQS) queue, which is populated with Spot interruption notifications through Amazon EventBridge. As soon as Karpenter receives a Spot interruption notification, it gracefully drains the interrupted node of any running pods while also provisioning a new node for which those pods can schedule. With Spot Instances, this process needs to complete within 2 minutes. For a pod with a termination period longer than 2 minutes, the old node will be interrupted prior to those pods being rescheduled. To test a replacement node, AWS Fault Injection Service (FIS) can be used to simulate Spot interruptions.
Carefully configure resource requests and limits for workloads: Rightsizing and optimizing your cluster is a shared responsibility. Karpenter effectively optimizes and scales infrastructure, but the end result depends on how well you have rightsized your pod requests and any other Kubernetes scheduling constraints. Karpenter does not consider limits or resource utilization. For most workloads with non-compressible resources, such as memory, it is generally recommended to set requests==limits because if a workload tries to burst beyond the available memory of the host, an out-of-memory (OOM) error occurs. Karpenter consolidation can increase the probability of this as it proactively tries to reduce total allocatable resources for a Kubernetes cluster. For help with rightsizing your Kubernetes pods, consider exploring Kubecost, Vertical Pod Autoscaler configured in recommendation mode, or an open source tool such as Goldilocks.
Configure metrics for Karpenter: Karpenter emits metrics in the Prometheus format, so consider using Amazon Managed Service for Prometheus to track interruptions caused by Karpenter Drift, consolidation, Spot interruptions, or other Amazon EC2 maintenance events. These metrics can be used to confirm that interruptions are not having a significant impact on your service’s availability and monitor NodePool usage and pod lifecycles. The Karpenter Getting Started Guide contains an example Grafana dashboard configuration.

You can learn more about other application best practices in the Reliability section of the Amazon EKS Best Practices Guide.

Cleanup

To avoid incurring future charges, delete any resources you created as part of this walkthrough. If you followed the Karpenter Getting Started Guide to set up a cluster and add Karpenter, follow the clean-up instructions in the Karpenter documentation to delete the cluster. Alternatively, if you already had a cluster with Karpenter, delete the resources created as part of this walkthrough:

kubectl delete -f inflate.yaml
kubectl delete -f nodepool.yaml

Conclusion

In this post, you learned how Karpenter can actively replace a Spot node with another more cost-efficient Spot node. Karpenter can consolidate Spot nodes that have the right balance between lower price and low-frequency interruptions when there are at least 15 selectable instances to balance price and availability.

To get started, check out the Karpenter documentation as well as Karpenter Blueprints, which is a repository including common workload scenarios following the best practices.

You can share your feedback on this feature by a raising a GitHub Issue.

How to use AWS Secrets Manager and ABAC for enhanced secrets management in Amazon EKS

2024-01-09 Nima Fotouhi

Post Syndicated from Nima Fotouhi original https://aws.amazon.com/blogs/security/how-to-use-aws-secrets-manager-and-abac-for-enhanced-secrets-management-in-amazon-eks/

In this post, we show you how to apply attribute-based access control (ABAC) while you store and manage your Amazon Elastic Kubernetes Services (Amazon EKS) workload secrets in AWS Secrets Manager, and then retrieve them by integrating Secrets Manager with Amazon EKS using External Secrets Operator to define more fine-grained and dynamic AWS Identity and Access Management (IAM) permission policies for accessing secrets.

It’s common to manage numerous workloads in an EKS cluster, each necessitating access to a distinct set of secrets. You can verify adherence to the principle of least privilege by creating separate permission policies for each workload to restrict their access. To scale and reduce overhead, Amazon Web Services (AWS) recommends using ABAC to manage workloads’ access to secrets. ABAC helps reduce the number of permission policies needed to scale with your environment.

What is ABAC?

In IAM, a traditional authorization approach is known as role-based access control (RBAC). RBAC sets permissions based on a person’s job function, commonly known as IAM roles. To enforce RBAC in IAM, distinct policies for various job roles are created. As a best practice, only the minimum permissions required for a specific role are granted (principle of least privilege), which is achieved by specifying the resources that the role can access. A limitation of the RBAC model is its lack of flexibility. Whenever new resources are introduced, you must modify policies to permit access to the newly added resources.

Attribute-based access control (ABAC) is an approach to authorization that assigns permissions in accordance with attributes, which in the context of AWS are referred to as tags. You create and add tags to your IAM resources. You then create and configure ABAC policies to permit operations requested by a principal when there’s a match between the tags of the principal and the resource. When a principal uses temporary credentials to make a request, its associated tags come from session tags, incoming transitive sessions tags, and IAM tags. The principal’s IAM tags are persistent, but session tags, and incoming transitive session tags are temporary and set when the principal assumes an IAM role. Note that AWS tags are attached to AWS resources, whereas session tags are only valid for the current session and expire with the session.

How External Secrets Operator works

External Secrets Operator (ESO) is a Kubernetes operator that integrates external secret management systems including Secrets Manager with Kubernetes. ESO provides Kubernetes custom resources to extend Kubernetes and integrate it with Secrets Manager. It fetches secrets and makes them available to other Kubernetes resources by creating Kubernetes Secrets. At a basic level, you need to create an ESO SecretStore resource and one or more ESO ExternalSecret resources. The SecretStore resource specifies how to access the external secret management system (Secrets Manager) and allows you to define ABAC related properties (for example, session tags and transitive tags).

You declare what data (secret) to fetch and how the data should be transformed and saved as a Kubernetes Secret in the ExternalSecret resource. The following figure shows an overview of the process for creating Kubernetes Secrets. Later in this post, we review the steps in more detail.

Figure 1: ESO process

How to use ESO for ABAC

Before creating any ESO resources, you must make sure that the operator has sufficient permissions to access Secrets Manager. ESO offers multiple ways to authenticate to AWS. For the purpose of this solution, you will use the controller’s pod identity. To implement this method, you configure the ESO service account to assume an IAM role for service accounts (IRSA), which is used by ESO to make requests to AWS.

To adhere to the principle of least privilege and verify that each Kubernetes workload can access only its designated secrets, you will use ABAC policies. As we mentioned, tags are the attributes used for ABAC in the context of AWS. For example, principal and secret tags can be compared to create ABAC policies to deny or allow access to secrets. Secret tags are static tags assigned to secrets symbolizing the workload consuming the secret. On the other hand, principal (requester) tags are dynamically modified, incorporating workload specific tags. The only viable option to dynamically modifying principal tags is to use session tags and incoming transitive session tags. However, as of this writing, there is no way to add session and transitive tags when assuming an IRSA. The workaround for this issue is role chaining and passing session tags when assuming downstream roles. ESO offers role chaining, meaning that you can refer to one or more IAM roles with access to Secrets Manager in the SecretStore resource definition, and ESO will chain them with its IRSA to access secrets. It also allows you to define session tags and transitive tags to be passed when ESO assumes the IAM roles with its primary IRSA. The ability to pass session tags allows you to implement ABAC and compare principal tags (including session tags) with secret tags every time ESO sends a request to Secrets Manager to fetch a secret. The following figure shows ESO authentication process with role chaining in one Kubernetes namespace.

Figure 2: ESO AWS authentication process with role chaining (single namespace)

Architecture overview

Let’s review implementing ABAC with a real-world example. When you have multiple workloads and services in your Amazon EKS cluster, each service is deployed in its own unique namespace, and service secrets are stored in Secrets Manager and tagged with a service name (key=service, value=service name). The following figure shows the required resources to implement ABAC with EKS and Secrets Manager.

Figure 3: Amazon EKS secrets management with ABAC

Prerequisites

An AWS account
Permission to create IAM resources
Permission to create Secrets Manager secrets
Permissions to manage KMS keys for Secrets Manager secrets
An existing Amazon EKS cluster
A user who can modify your Kubernetes cluster
AWS Command Line Interface (AWS CLI) and kubectl installed
Helm and eksctl installed

Deploy the solution

Begin by installing ESO:

From a terminal where you usually run your helm commands, run the following helm command to add an ESO helm repository.
```
helm repo add external-secrets https://charts.external-secrets.io
```

Install ESO using the following helm command in a terminal that has access to your target Amazon EKS cluster:

helm install external-secrets \
   external-secrets/external-secrets \
    -n external-secrets \
    --create-namespace \
   --set installCRDs=true

To verify ESO installation, run the following command. Make sure you pass the same namespace as the one you used when installing ESO:
```
kubectl get pods -n external-secrets
```

See the ESO Getting started documentation page for more information on other installation methods, installation options, and how to uninstall ESO.

Create an IAM role to access Secrets Manager secrets

You must create an IAM role with access to Secrets Manager secrets. Start by creating a customer managed policy to attach to your role. Your policy should allow reading secrets from Secrets Manager. The following example shows a policy that you can create for your role:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",k
			"Action": [
				"kms:ListKeys",
				"kms:ListAliases",
				"secretsmanager:ListSecrets"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"kms:Decrypt",
				"kms:DescribeKey"
			],
			"Resource": <KMS Key ARN>
		},
		{
			"Effect": "Allow",
			"Action": [ 
				"secretsmanager:GetSecretValue",
				"secretsmanager:DescribeSecret",
				"secretsmanager:ListSecretVersionIds"
			],
			"Resource": "*",
			"Condition": {
				"StringEquals": {
					"secretsmanager:ResourceTag/ekssecret": "${aws:PrincipalTag/ekssecret}"
				}
			}
		}
	]
}

Consider the following in this policy:

Secrets Manager uses an AWS managed key for Secrets Manager by default to encrypt your secrets. It’s recommended to specify another encryption key during secret creation and have separate keys for separate workloads. Modify the resource element of the second policy statement and replace <KMS Key ARN> with the KMS key ARNs used to encrypt your secrets. If you use the default key to encrypt your secrets, you can remove this statement.
The policy statement conditionally allows access to all secrets. The condition element permits access only when the value of the principal tag, identified by the key service, matches the value of the secret tag with the same key. You can include multiple conditions (in separate statements) to match multiple tags.

After you create your policy, follow the guide for Creating IAM roles to create your role, attaching the policy you created. Use the default value for your role’s trust relationship for now, you will update the trust relationship in the next step. Note the role’s ARN after creation.

Create an IAM role for the ESO service account

Use eksctl to create the IAM role for the ESO service account (IRSA). Before creating the role, you must create an IAM policy. ESO IRSA only needs permission to assume the Secrets Manager access role that you created in the previous step.

Use the following example of an IAM policy that you can create. Replace <Secrets Manager Access Role ARN> with the ARN of the role you created in the previous step and follow creating a customer managed policy to create the policy. After creating the policy, note the policy ARN.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ],
            "Resource": "<Secrets Manager Access Role ARN>"
        }
    ]
}

Next, run the following command to get the account name of the ESO service. You will see a list of service accounts, pick the one that has the same name as your helm release, in this example, the service account is external-secrets.
```
kubectl get serviceaccounts -n external-secrets
```
Next, create an IRSA and configure an ESO service account to assume the role. Run the following command to create a new role and associate it with the ESO service account. Replace the variables in brackets (<example>) with your specific information:
```
eksctl create iamserviceaccount --name <ESO service account> \
--namespace <ESO namespace> --cluster <cluster name> \
--role-name <IRSA name> --override-existing-serviceaccounts \
--attach-policy-arn <policy arn you created earlier> --approve
```
You can validate the operation by following the steps listed in Configuring a Kubernetes service account to assume an IAM role. Note that you had to pass the ‑‑override-existing-serviceaccounts argument because the ESO service account was already created.
After you’ve validated the operation, run the following command to retrieve the IRSA ARN (replace <IRSA name> with the name you used in the previous step):
```
aws iam get-role --role-name <IRSA name> --query Role.Arn
```

Modify the trust relationship of the role you created previously and limit it to your newly created IRSA. The following should resemble your trust relationship. Replace <IRSA Arn> with the IRSA ARN returned in the previous step:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS ACCOUNT ID>:root"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "ArnEquals": {
                    "aws:PrincipalArn": "<IRSA Arn>"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<IRSA Arn>"
            },
            "Action": "sts:TagSession",
            "Condition": {
                "StringLike": {
                    "aws:RequestTag/ekssecret": "*"
                }
            }
        }
    ]
}

Note that you will be using session tags to implement ABAC. When using session tags, trust policies for all roles connected to the identity provider (IdP) passing the tags must have the sts:TagSession permission. For roles without this permission in the trust policy, the AssumeRole operation fails.

Moreover, the condition block of the second statement limits ESO’s ability to pass session tags with the key name ekssecret. We’re using this condition to verify that the ESO role can only create session tags used for accessing secrets manager, and doesn’t gain the ability to set principal tags that might be used for any other purpose. This way, you’re creating a namespace to help prevent further privilege escalations or escapes.

Create secrets in Secrets Manager

You can create two secrets in Secrets Manager and tag them.

Follow the steps in Create an AWS Secrets Manager secret to create two secrets named service1_secret and service2_secret. Add the following tags to your secrets:
- service1_secret:
  - key=ekssecret, value=service1
- service2_secret:
  - key=ekssecret, value=service2
Run the following command to verify both secrets are created and tagged properly:
```
aws secretsmanager list-secrets --query 'SecretList[*].{Name:Name, Tags:Tags}'
```

Create ESO objects in your cluster

Create two namespaces in your cluster:

❯ kubectl create ns service1-ns
❯ kubectl create ns service2-ns

Assume that service1-ns hosts service1 and service2-ns hosts service2. After creating the namespaces for your services, verify that each service is restricted to accessing secrets that are tagged with a specific key-value pair. In this example the key should be ekssecret and the value should match the name of the corresponding service. This means that service1 should only have access to service1_secret, while service2 should only have access to service2_secret. Next, declare session tags in SecretStore object definitions.

Edit the following command snippet using the text editor of your choice and replace every instance of <Secrets Manager Access Role ARN> with the ARN of the IAM role you created earlier to access Secrets Manager secrets. Copy and paste the edited command in your terminal and run it to create a .yaml file in your working directory that contains the SecretStore definitions. Make sure to change the AWS Region to reflect the Region of your Secrets Manager.

cat > secretstore.yml <<EOF
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secretsmanager
  namespace: service1-ns
spec:
  provider:
    aws:
      service: SecretsManager
      role: <Secrets Manager Access Role ARN>
      region: us-west-2
      sessionTags:
        - key: ekssecret
          value: service1
---
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secretsmanager
  namespace: service2-ns
spec:
  provider:
    aws:
      service: SecretsManager
      role: <Secrets Manager Access Role ARN>
      region: us-west-2
      sessionTags:
        - key: ekssecret
          value: service2
EOF

Create SecretStore objects by running the following command:
```
kubectl apply -f secretstore.yml
```
Validate object creation by running the following command:
```
kubectl describe secretstores.external-secrets.io -A
```
Check the status and events section for each object and make sure the store is validated.

Next, create two ExternalSecret objects requesting service1_secret and service2_secret. Copy and paste the following command in your terminal and run it. The command will create a .yaml file in your working directory that contains ExternalSecret definitions.

cat > exrternalsecret.yml <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: service1-es1
  namespace: service1-ns
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: SecretStore
  target:
    name: service1-ns-secret1
    creationPolicy: Owner
  data:
  - secretKey: service1_secret
    remoteRef:
      key: "service1_secret"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: service2-es2
  namespace: service2-ns
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secretsmanager
    kind: SecretStore
  target:
    name: service1-ns-secret2
    creationPolicy: Owner
  data:
  - secretKey: service2_secret
    remoteRef:
      key: "service2_secret"
EOF

Run the following command to create objects:
```
kubectl apply -f exrternalsecret.yml
```
Verify the objects are created by running following command:
```
kubectl get externalsecrets.external-secrets.io -A
```
Each ExternalSecret object should create a Kubernetes secret in the same namespace it was created in. Kubernetes secrets are accessible to services in the same namespace. To demonstrate that both Service A and Service B has access to their secrets, run the following command.
```
kubectl get secrets -A
```

You should see service1-ns-secret1 created in service1-ns namespace which is accessible to Service 1, and service1-ns-secret2 created in service2-ns which is accessible to Service2.

Try creating an ExternalSecrets object in service1-ns referencing service2_secret. Notice that your object shows SecretSyncedError status. This is the expected behavior, because ESO passes different session tags for ExternalSecret objects in each namespace, and when the tag where key is ekssecret doesn’t match the secret tag with the same key, the request will be rejected.

What about AWS Secrets and Configuration Provider (ASCP)?

Amazon offers a capability called AWS Secrets and Configuration Provider (ASCP), which allows applications to consume secrets directly from external stores, including Secrets Manager, without modifying the application code. ASCP is actively maintained by AWS, which makes sure that it remains up to date and aligned with the latest features introduced in Secrets Manager. See How to use AWS Secrets & Configuration Provider with your Kubernetes Secrets Store CSI driver to learn more about how to use ASCP to retrieve secrets from Secrets Manager.

Today, customers who use AWS Fargate with Amazon EKS can’t use the ASCP method due to the incompatibility of daemonsets on Fargate. Kubernetes also doesn’t provide a mechanism to add specific claims to JSON web tokens (JWT) used to assume IAM roles. Today, when using ASCP in Kubernetes, which assumes IAM roles through IAM roles for service accounts (IRSA), there’s a constraint in appending session tags during the IRSA assumption due to JWT claim restrictions, limiting the ability to implement ABAC.

With ESO, you can create Kubernetes Secrets and have your pods retrieve secrets from them instead of directly mounting secrets as volumes in your pods. ESO is also capable of using its controller pod’s IRSA to retrieve secrets, so you don’t need to set up IRSA for each pod. You can also role chain and specify secondary roles to be assumed by ESO IRSA and pass session tags to be used with ABAC policies. ESO’s role chaining and ABAC capabilities help decrease the number of IAM roles required for secrets retrieval. See Leverage AWS secrets stores from EKS Fargate with External Secrets Operator on the AWS Containers blog to learn how to use ESO on an EKS Fargate cluster to consume secrets stored in Secrets Manager.

Conclusion

In this blog post, we walked you through how to implement ABAC with Amazon EKS and Secrets Manager using External Secrets Operator. Implementing ABAC allows you to create a single IAM role for accessing Secrets Manager secrets while implementing granular permissions. ABAC also decreases your team’s overhead and reduces the risk of misconfigurations. With ABAC, you require fewer policies and don’t need to update existing policies to allow access to new services and workloads.

If you have feedback about this post, submit comments in the Comments section below.

Best Practices to help secure your container image build pipeline by using AWS Signer

2023-12-22 Jorge Castillo

Post Syndicated from Jorge Castillo original https://aws.amazon.com/blogs/security/best-practices-to-help-secure-your-container-image-build-pipeline-by-using-aws-signer/

AWS Signer is a fully managed code-signing service to help ensure the trust and integrity of your code. It helps you verify that the code comes from a trusted source and that an unauthorized party has not accessed it. AWS Signer manages code signing certificates and public and private keys, which can reduce the overhead of your public key infrastructure (PKI) management. It also provides a set of features to simplify lifecycle management of your keys and certificates so that you can focus on signing and verifying your code.

In June 2023, AWS announced Container Image Signing with AWS Signer and Amazon EKS, a new capability that gives you native AWS support for signing and verifying container images stored in Amazon Elastic Container Registry (Amazon ECR).

Containers and AWS Lambda functions are popular serverless compute solutions for applications built on the cloud. By using AWS Signer, you can verify that the software running in these workloads originates from a trusted source.

In this blog post, you will learn about the benefits of code signing for software security, governance, and compliance needs. Flexible continuous integration and continuous delivery (CI/CD) integration, management of signing identities, and native integration with other AWS services can help you simplify code security through automation.

Background

Code signing is an important part of the software supply chain. It helps ensure that the code is unaltered and comes from an approved source.

To automate software development workflows, organizations often implement a CI/CD pipeline to push, test, and deploy code effectively. You can integrate code signing into the workflow to help prevent untrusted code from being deployed, as shown in Figure 1. Code signing in the pipeline can provide you with different types of information, depending on how you decide to use the functionality. For example, you can integrate code signing into the build stage to attest that the code was scanned for vulnerabilities, had its software bill of materials (SBOM) approved internally, and underwent unit and integration testing. You can also use code signing to verify who has pushed or published the code, such as a developer, team, or organization. You can verify each of these steps separately by including multiple signing stages in the pipeline. For more information on the value provided by container image signing, see Cryptographic Signing for Containers.

Figure 1: Security IN the pipeline

In the following section, we will walk you through a simple implementation of image signing and its verification for Amazon Elastic Kubernetes Service (Amazon EKS) deployment. The signature attests that the container image went through the pipeline and came from a trusted source. You can use this process in more complex scenarios by adding multiple AWS CodeBuild code signing stages that make use of various AWS Signer signing profiles.

Services and tools

In this section, we discuss the various AWS services and third-party tools that you need for this solution.

CI/CD services

For the CI/CD pipeline, you will use the following AWS services:

AWS CodePipeline — a fully managed continuous delivery service that you can use to automate your release pipelines for fast and reliable application and infrastructure updates.
AWS CodeCommit — a fully managed source control service that hosts secure Git-based repositories.
AWS Signer — a fully managed code-signing service that you can use to help ensure the trust and integrity of your code.
AWS CodeBuild — A fully managed continuous integration service that compiles source code, runs tests, and produces software packages that are ready to deploy.

Container services

You will use the following AWS services for containers for this walkthrough:

Amazon EKS — a managed Kubernetes service to run Kubernetes in the AWS Cloud and on-premises data centers.
Amazon ECR — a fully managed container registry for high-performance hosting, so that you can reliably deploy application images and artifacts anywhere.

Verification tools

The following are publicly available sign verification tools that we integrated into the pipeline for this post, but you could integrate other tools that meet your specific requirements.

Notation — A publicly available Notary project within the Cloud Native Computing Foundation (CNCF). With contributions from AWS and others, Notary is an open standard and client implementation that allows for vendor-specific plugins for key management and other integrations. AWS Signer manages signing keys, key rotation, and PKI management for you, and is integrated with Notation through a curated plugin that provides a simple client-based workflow.
Kyverno — A publicly available policy engine that is designed for Kubernetes.

Solution overview

Figure 2: Solution architecture

Here’s how the solution works, as shown in Figure 2:

Developers push Dockerfiles and application code to CodeCommit. Each push to CodeCommit starts a pipeline hosted on CodePipeline.
CodeBuild packages the build, containerizes the application, and stores the image in the ECR registry.
CodeBuild retrieves a specific version of the image that was previously pushed to Amazon ECR. AWS Signer and Notation sign the image by using the signing profile established previously, as shown in more detail in Figure 3.

Figure 3: Signing images described

AWS Signer and Notation verify the signed image version and then deploy it to an Amazon EKS cluster.

If the image has not previously been signed correctly, the CodeBuild log displays an output similar to the following:

Error: signature verification failed: no signature is associated with "<AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hello-server@<DIGEST>" , make sure the artifact was signed successfully

If there is a signature mismatch, the CodeBuild log displays an output similar to the following:

Error: signature verification failed for all the signatures associated with <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hello-server@<DIGEST>

Kyverno verifies the container image signature for use in the Amazon EKS cluster.
Figure 4 shows steps 4 and 5 in more detail.

Figure 4: Verification of image signature for Kubernetes

Prerequisites

Before getting started, make sure that you have the following prerequisites in place:

An Amazon EKS cluster provisioned.
An Amazon ECR repository for your container images.
A CodeCommit repository with your application code. For more information, see Create an AWS CodeCommit repository.
A CodePipeline pipeline deployed with the CodeCommit repository as the code source and four CodeBuild stages: Build, ApplicationSigning, ApplicationDeployment, and VerifyContainerSign. The CI/CD pipeline should look like that in Figure 5.

Figure 5: CI/CD pipeline with CodePipeline

Walkthrough

You can create a signing profile by using the AWS Command Line Interface (AWS CLI), AWS Management Console or the AWS Signer API. In this section, we’ll walk you through how to sign the image by using the AWS CLI.

To sign the image (AWS CLI)

Create a signing profile for each identity.

# Create an AWS Signer signing profile with default validity period
$ aws signer put-signing-profile \
    --profile-name build_signer \
    --platform-id Notation-OCI-SHA384-ECDSA

Sign the image from the CodeBuild build—your buildspec.yaml configuration file should look like the following:

version: 0.2

phases:
  pre_build:
    commands:
      - aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com
      - REPOSITORY_URI=$AWS_ACCOUNT_ID.dkr.ecr. $AWS_REGION.amazonaws.com/hello-server
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=${COMMIT_HASH:=latest}
      - DIGEST=$(docker manifest inspect $AWS_ACCOUNT_ID.dkr.ecr. $AWS_REGION.amazonaws.com/hello-server:$IMAGE_TAG -v | jq -r '.Descriptor.digest')
      - echo $DIGEST
      
      - wget https://d2hvyiie56hcat.cloudfront.net/linux/amd64/installer/rpm/latest/aws-signer-notation-cli_amd64.rpm
      - sudo rpm -U aws-signer-notation-cli_amd64.rpm
      - notation version
      - notation plugin ls
  build:
    commands:
      - notation sign $REPOSITORY_URI@$DIGEST --plugin com.amazonaws.signer.notation.plugin --id arn:aws:signer: $AWS_REGION:$AWS_ACCOUNT_ID:/signing-profiles/notation_container_signing
      - notation inspect $AWS_ACCOUNT_ID.dkr.ecr. $AWS_REGION.amazonaws.com/hello-server@$DIGEST
      - notation verify $AWS_ACCOUNT_ID.dkr.ecr. $AWS_REGION.amazonaws.com/hello-server@$DIGEST
  post_build:
    commands:
      - printf '[{"name":"hello-server","imageUri":"%s"}]' $REPOSITORY_URI:$IMAGE_TAG > imagedefinitions.json
artifacts:
    files: imagedefinitions.json

The commands in the buildspec.yaml configuration file do the following:

Sign you in to Amazon ECR to work with the Docker images.
Reference the specific image that will be signed by using the commit hash (or another versioning strategy that your organization uses). This gets the digest.
Sign the container image by using the notation sign command. This command uses the container image digest, instead of the image tag.
Install the Notation CLI. In this example, you use the installer for Linux. For a list of installers for various operating systems, see the AWS Signer Developer Guide,
Sign the image by using the notation sign command.
Inspect the signed image to make sure that it was signed successfully by using the notation inspect command.

To verify the signed image, use the notation verify command. The output should look similar to the following:

Successfully verified signature for <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hello-server@<DIGEST>

(Optional) For troubleshooting, print the notation policy from the pipeline itself to check that it’s working as expected by running the notation policy show command:

notation policy show

For this, include the command in the pre_build phase after the notation version command in the buildspec.yaml configuration file.

After the notation policy show command runs, CodeBuild logs should display an output similar to the following:

{
  "version": "1.0",
  "trustPolicies": [
    {
      "name": "aws-signer-tp",
      "registryScopes": [
      "<AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/hello-server"
      ],
      "signatureVerification": {
        "level": "strict"
      },
      "trustStores": [
        "signingAuthority:aws-signer-ts"
      ],
      "trustedIdentities": [
        "arn:aws:signer:<AWS_REGION>:<AWS_ACCOUNT_ID>:/signing-profiles/notation_test"
      ]
    }
  ]
}

To verify the image in Kubernetes, set up both Kyverno and the Kyverno-notation-AWS Signer in your EKS cluster. To get started with Kyverno and the Kyverno-notation-AWS Signer solution, see the installation instructions.

After you install Kyverno and Kyverno-notation-AWS Signer, verify that the controller is running—the STATUS should show Running:

$ kubectl get pods -n kyverno-notation-aws -w

NAME                                    READY   STATUS    RESTARTS   AGE
kyverno-notation-aws-75b7ddbcfc-kxwjh   1/1     Running   0          6h58m

Configure the CodeBuild buildspec.yaml configuration file to verify that the images deployed in the cluster have been previously signed. You can use the following code to configure the buildspec.yaml file.

version: 0.2

phases:
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - aws --version
      - REPOSITORY_URI=${REPO_ECR}
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=${COMMIT_HASH:=latest}
      - curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
      - curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
      - echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
      — chmod +x kubectl
      - mv ./kubectl /usr/local/bin/kubectl
      - kubectl version --client
  build:
    commands:
      - echo Build started on `date`
      - aws eks update-kubeconfig -—name ${EKS_NAME} —-region ${AWS_DEFAULT_REGION}
      - echo Deploying Application
      - sed -i '/image:\ image/image:\ '\"${REPOSITORY_URI}:${IMAGE_TAG}\"'/g' deployment.yaml
      - kubectl apply -f deployment.yaml 
      - KYVERNO_NOTATION_POD=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" -n kyverno-notation-aws)
      - STATUS=$(kubectl logs --tail=1 kyverno-notation-aws-75b7ddbcfc-kxwjh -n kyverno-notation-aws | grep $IMAGE_TAG | grep ERROR)
      - |
        if [[ $STATUS ]]; then
          echo "There is an error"
          exit 1
        else
          echo "No Error"
        fi
  post_build:
    commands:
      - printf '[{"name":"hello-server","imageUri":"%s"}]' $REPOSITORY_URI:$IMAGE_TAG > imagedefinitions.json
artifacts:
    files: imagedefinitions.json

The commands in the buildspec.yaml configuration file do the following:

Set up the environment variables, such as the ECR repository URI and the Commit hash, to build the image tag. The kubectl tool will use this later to reference the container image that will be deployed with the Kubernetes objects.
Use kubectl to connect to the EKS cluster and insert the container image reference in the deployment.yaml file.
After the container is deployed, you can observe the kyverno-notation-aws controller and access its logs. You can check if the deployed image is signed. If the logs contain an error, stop the pipeline run with an error code, do a rollback to a previous version, or delete the deployment if you detect that the image isn’t signed.

Decommission the AWS resources

If you no longer need the resources that you provisioned for this post, complete the following steps to delete them.

To clean up the resources

Delete the EKS cluster and delete the ECR image.
Delete the IAM roles and policies that you used for the configuration of IAM roles for service accounts.
Revoke the AWS Signer signing profile that you created and used for the signing process by running the following command in the AWS CLI:
```
$ aws signer revoke-signing-profile
```

Delete signatures from the Amazon ECR repository. Make sure to replace <AWS_ACCOUNT_ID> and <AWS_REGION> with your own information.

# Use oras CLI, with Amazon ECR Docker Credential Helper, to delete signature
$ oras manifest delete <AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/pause@sha256:ca78e5f730f9a789ef8c63bb55275ac12dfb9e8099e6a0a64375d8a95ed501c4

Note: Using the ORAS project’s oras client, you can delete signatures and other reference type artifacts. It implements deletion by first removing the reference from an index, and then deleting the manifest.

Conclusion

In this post, you learned how to implement container image signing in a CI/CD pipeline by using AWS services such as CodePipeline, CodeBuild, Amazon ECR, and AWS Signer along with publicly available tools such as Notary and Kyverno. By implementing mandatory image signing in your pipelines, you can confirm that only validated and authorized container images are deployed to production. Automating the signing process and signature verification is vital to help securely deploy containers at scale. You also learned how to verify signed images both during deployment and at runtime in Kubernetes. This post provides valuable insights for anyone looking to add image signing capabilities to their CI/CD pipelines on AWS to provide supply chain security assurances. The combination of AWS managed services and publicly available tools provides a robust implementation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Amazon EKS Pod Identity simplifies IAM permissions for applications on Amazon EKS clusters

2023-11-27 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/amazon-eks-pod-identity-simplifies-iam-permissions-for-applications-on-amazon-eks-clusters/

Starting today, you can use Amazon EKS Pod Identity to simplify your applications that access AWS services. This enhancement provides you with a seamless and easy to configure experience that lets you define required IAM permissions for your applications in Amazon Elastic Kubernetes Service (Amazon EKS) clusters so you can connect with AWS services outside the cluster.

Amazon EKS Pod Identity helps you solve growing challenges for managing permissions across many of your EKS clusters.

Simplifying experience with Amazon EKS Pod Identity
In 2019, we introduced IAM roles for service accounts (IRSA). IRSA lets you associate an IAM role with a Kubernetes service account. This helps you to implement the principle of least privilege by giving pods only the permissions they need. This approach prioritizes pods in IAM and helps developers configure applications with fine-grained permissions that enable the least privileged access to AWS services.

Now, with Amazon EKS Pod Identity, it’s even easier to configure and automate granting AWS permissions to Kubernetes identities. As the cluster administrator, you no longer need to switch between Amazon EKS and IAM services to authenticate your applications to all AWS resources.

The overall workflow to start using Amazon EKS Pod Identity can be summarized in a few simple steps:

Step 1: Create an IAM role with required permissions for your application and specify pods.eks.amazonaws.com as the service principal in its trust policy.
Step 2: Install Amazon EKS Pod Identity Agent add-on using the Amazon EKS console or AWS Command Line Interface (AWS CLI).
Step 3: Map the role to a service account directly in the Amazon EKS console, APIs, or AWS CLI.

Once it’s done, any new pods that use that service account will automatically be configured to receive IAM credentials.

Let’s get started
Let me show you how you can get started with EKS Pod Identity. For the demo in this post, I need to configure permission for a simple API running in my Amazon EKS cluster, which will return the list of files in my Amazon Simple Storage Service (Amazon S3) bucket.

First, I need to create an IAM role to provide the required permissions so my applications can run properly. In my case, I need to configure permissions to access my S3 bucket.

Next, on the same IAM role, I need to configure its trust policy and configure the principal to pods.eks.amazonaws.com. The following is the IAM template that I use:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}

At this stage, my IAM role is ready, and now we need to configure the Amazon EKS Pod Identity Agent in my cluster. For this article, I’m using my existing EKS cluster. If you want to learn how to do that, visit Getting started with Amazon EKS.

Moving on, I navigate to the Amazon EKS dashboard and then select my EKS cluster.

In my EKS cluster page, I need to select the Add-ons tab and then choose Get more add-ons.

Then, I need to add the Amazon EKS Pod Identity Agent add-on.

On the next page, I can add additional configuration if needed. In this case, I leave the default configuration and choose Next.

Then, I just need to review my add-on configuration and choose Create.

After a few minutes, the Amazon EKS Pod Identity Agent add-on is active for my cluster.

Once I have Amazon EKS Pod Identity in my cluster, I need to associate the IAM role to my Kubernetes pods.

I need to navigate to the Access tab in my EKS cluster. On the Pod Identity associations section, I select Create Pod Identity association to map my IAM role to Kubernetes pods.

Here, I use the IAM role that I created in the beginning. I also need to define my Kubernetes namespace and service account. If they don’t exist yet, I can type in the name of the namespace and service account. If they already exist, I can select them from the dropdown. Then, I choose Create.

Those are all the steps I need to do to configure IAM permissions for my applications running on Amazon EKS with EKS Pod Identity. Now, I can see my IAM role is listed in Pod Identity associations.

When I test my API running on Amazon EKS, it runs as expected and returns the list of files in my S3 bucket.

curl -X https://<API-URL> -H "Accept: application/json" 

{
   "files": [
         "test-file-1.md",
         "test-file-2.md"
    ]        
}

I found that Amazon EKS Pod Identity simplifies the experience of managing IAM roles for my applications running on Amazon EKS. I can easily reuse IAM roles across multiple EKS clusters without needing to update the role trust policy each time a new cluster is created.

New AWS APIs to configure EKS Pod Identity
You also have the flexibility to configure Amazon EKS Pod Identity for your cluster using AWS CLI. Amazon EKS Pod Identity provides a new set of APIs that you can use.

For example, I can use aws eks create-addon to install the Amazon EKS Pod Identity Agent add-on into my cluster. Here’s the AWS CLI command:

$ aws eks create-addon \
--cluster-name <CLUSTER_NAME> \
--addon-name eks-pod-identity-agent \
--addon-version v1.0.0-eksbuild.1

{
    "addon": {
    "addonName": "eks-pod-identity-agent",
    "clusterName": "<CLUSTER_NAME>",
    "status": "CREATING",
    "addonVersion": "v1.0.0-eksbuild.1",
    "health": {
        "issues": []
        },
    "addonArn": "<ARN>",
    "createdAt": 1697734297.597,
    "modifiedAt": 1697734297.612,
    "tags": {}
    }
}

Another example of what you can do with AWS APIs is to map the IAM role into your Kubernetes pods.

$ aws eks create-pod-identity-association \
  --cluster-name <CLUSTER_NAME> \
  --namespace <NAMESPACE> \
  --service-account <SERVICE_ACCOUNT_NAME> \
  --role-arn <IAM_ROLE_ARN>

Things to know

Availability – Amazon EKS Pod Identity is available in all AWS Regions supported by Amazon EKS, except the AWS GovCloud (US-East), AWS GovCloud (US-West), China (Beijing, operated by Sinnet), and China (Ningxia, operated by NWCD).

Pricing – Amazon EKS Pod Identity is available at no charge.

Supported Amazon EKS cluster – Amazon EKS Pod Identity supports Kubernetes running version 1.24 and above in Amazon EKS. You can see EKS Pod Identity cluster versions for more information.

Supported AWS SDK versions – You need to update your application to use the latest AWS SDK versions. Check out AWS developer tools to find out how to install and update your AWS SDK.

Get started today and visit EKS Pod Identities documentation page to learn more about how to simplify IAM management for your applications.

Happy building!
— Donnie

Amazon Managed Service for Prometheus collector provides agentless metric collection for Amazon EKS

2023-11-26 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/amazon-managed-service-for-prometheus-collector-provides-agentless-metric-collection-for-amazon-eks/

Today, I’m happy to announce a new capability, Amazon Managed Service for Prometheus collector, to automatically and agentlessly discover and collect Prometheus metrics from Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Managed Service for Prometheus collector consists of a scraper that discovers and collects metrics from Amazon EKS applications and infrastructure without needing to run any collectors in-cluster.

This new capability provides fully managed Prometheus-compatible monitoring and alerting with Amazon Managed Service for Prometheus. One of the significant benefits is that the collector is fully managed, automatically right-sized, and scaled for your use case. This means you don’t have to run any compute for collectors to collect the available metrics. This helps you optimize metric collection costs to monitor your applications and infrastructure running on EKS.

With this launch, Amazon Managed Service for Prometheus now supports two major modes of Prometheus metrics collection: AWS managed collection, a fully managed and agentless collector, and customer managed collection.

Getting started with Amazon Managed Service for Prometheus Collector
Let’s take a look at how to use AWS managed collectors to ingest metrics using this new capability into a workspace in Amazon Managed Service for Prometheus. Then, we will evaluate the collected metrics in Amazon Managed Service for Grafana.

When you create a new EKS cluster using the Amazon EKS console, you now have the option to enable AWS managed collector by selecting Send Prometheus metrics to Amazon Managed Service for Prometheus. In the Destination section, you can also create a new workspace or select your existing Amazon Managed Service for Prometheus workspace. You can learn more about how to create a workspace by following the getting started guide.

Then, you have the flexibility to define your scraper configuration using the editor or upload your existing configuration. The scraper configuration controls how you would like the scraper to discover and collect metrics. To see possible values you can configure, please visit the Prometheus Configuration page.

Once you’ve finished the EKS cluster creation, you can go to the Observability tab on your cluster page to see the list of scrapers running in your EKS cluster.

The next step is to configure your EKS cluster to allow the scraper to access metrics. You can find the steps and information on Configuring your Amazon EKS cluster.

Once your EKS cluster is properly configured, the collector will automatically discover metrics from your EKS cluster and nodes. To visualize the metrics, you can use Amazon Managed Grafana integrated with your Prometheus workspace. Visit the Set up Amazon Managed Grafana for use with Amazon Managed Service for Prometheus page to learn more.

The following is a screenshot of metrics ingested by the collectors and visualized in an Amazon Managed Grafana workspace. From here, you can run a simple query to get the metrics that you need.

Using AWS CLI and APIs
Besides using the Amazon EKS console, you can also use the APIs or AWS Command Line Interface (AWS CLI) to add an AWS managed collector. This approach is useful if you want to add an AWS managed collector into an existing EKS cluster or make some modifications to the existing collector configuration.

To create a scraper, you can run the following command:

aws amp create-scraper \ 
       --source eksConfiguration="{clusterArn=<EKS-CLUSTER-ARN>,securityGroupIds=[<SG-SECURITY-GROUP-ID>],subnetIds=[<SUBNET-ID>]}" \ 
       --scrape-configuration configurationBlob=<BASE64-CONFIGURATION-BLOB> \ 
       --destination=ampConfiguration={workspaceArn="<WORKSPACE_ARN>"}

You can get most of the parameter values from the respective AWS console, such as your EKS cluster ARN and your Amazon Managed Service for Prometheus workspace ARN. Other than that, you also need to define the scraper configuration defined as configurationBlob.

Once you’ve defined the scraper configuration, you need to encode the configuration file into base64 encoding before passing the API call. The following is the command that I use in my Linux development machine to encode sample-configuration.yml into base64 and copy it onto the clipboard.

$ base64 sample-configuration.yml | pbcopy

Now Available
The Amazon Managed Service for Prometheus collector capability is now available to all AWS customers in all AWS Regions where Amazon Managed Service for Prometheus is supported.

Learn more:

Amazon Managed Service for Prometheus product page
AWS managed collectors documentation page

Happy building!
— Donnie

Journey to Cloud-Native Architecture Series #7: Using Containers and Cell-based design for higher resiliency and efficiency

2023-10-30 Anuj Gupta

Post Syndicated from Anuj Gupta original https://aws.amazon.com/blogs/architecture/journey-to-cloud-native-architecture-series-7-using-containers-and-cell-based-design-for-higher-resiliency-and-efficiency/

In our previous Journey to Cloud-Native blogposts, we talked about evolving our architecture to become more scalable, secure, and cost effective to handle hyperscale requirements. In this post, we take these next steps: 1/ containerizing our applications to improve resource efficiency, and, 2/ using cell-based design to improve resiliency and time to production.

Containerize applications for standardization and scale

Standardize container orchestration tooling

Selecting the right service for your use case requires considering your organizational needs and skill sets for managing container orchestrators at scale. Our team chose Amazon Elastic Kubernetes Service (EKS), because we have the skills and experience working with Kubernetes. In addition, our leadership was committed to open source, and the topology awareness feature aligned with our resiliency requirements.

For containerizing our Java and .NET applications, we used AWS App2Container (A2C) to create deployment artifacts. AWS A2C reduced the time needed for dependency analysis, and created artifacts that we could plug into our deployment pipeline. A2C supports EKS Blueprints with GitOps which helps reduce the time-to-container adoption. Blueprints also improved consistency and follows security best practices. If you are unsure on the right tooling for running containerized workloads, you can refer to Choosing an AWS container service.

Identify the right tools for logging and monitoring

Logging in to the container environment adds some complexity due to the dynamic number of short-lived log sources. For proper tracing of events, we needed a way to collect logs from all of the system components and applications. We set up the Fluent Bit plugin to collect logs and send them to Amazon CloudWatch.

We used CloudWatch Container Insights for Prometheus for scraping Prometheus metrics from containerized applications. It allowed us to use existing tools (Amazon CloudWatch and Prometheus) to build purpose-built dashboards for different teams and applications. We created dashboards in Amazon Managed Grafana by using the native integration with Amazon CloudWatch. These tools took away the heavy lifting of managing logging and container monitoring from our teams.

Managing resource utilization

In hyperscale environments, a noisy neighbor container can consume all the resources for an entire cluster. Amazon EKS provides the ability to define and apply requests and limits for pods and containers. These values determine the minimum and maximum amount of a resource that a container can have. We also used resource quotas to configure the total amount of memory and CPU that can be used by all pods running in a namespace.

In order to better understand the resource utilization and cost of running individual applications and pods, we implemented Kubecost, using the guidance from Multi-cluster cost monitoring for Amazon EKS using Kubecost and Amazon Managed Service for Prometheus.

Scaling cluster and applications dynamically

With Amazon EKS, the Kubernetes metric server GitHub collects resource metrics from Kubelets and exposes them in the Kubernetes API server. This is accomplished through the metrics API for use by the Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). The HPA consumes these metrics to provide horizontal scaling by increasing the number of replicas to distribute your workloads. The VPA uses these metrics to dynamically adjust for pod resources like CPU/Memory reservations.

With metrics server integration, Amazon EKS takes away the operational complexity of scaling applications and provides more granular controls on scaling applications dynamically. We recommend that hyperscale customers consider HPA as their preferred application autoscaler because it provides resiliency (increased number of replicas) in addition to scaling. On the cluster level, Karpenter provides rapid provisioning and de-provisioning of large numbers of diverse compute resources, in addition to providing cost effective usage. It helps applications hyperscale to meet business growth needs.

Build rollback strategies for failure management

In hyperscale environments, deployment rollouts typically have a low margin for errors and require close to zero downtime. We implemented progressive rollouts, canary deployments, and automated rollbacks to reduce risk for production deployments. We used key performance indicators (KPIs) like application response time and error rates to determine whether to continue or to rollback our deployment. We leveraged the integration with Prometheus to collect metrics and measure KPIs.

Improving resilience and scale using cell-based design

We still needed to handle black swan events and minimize the impact of unexpected failures or scaling events to make them more resilient. We came up with a design that creates independently deployable units of applications with contained fault isolation boundaries. The new design uses a cell-based approach, along with a shuffle sharding technique to further improve resiliency. Peter Vosshall’s re:Invent talk details this approach.

First, we created a cell-based architecture as described in Reducing the Scope of Impact with Cell-Based Architecture. Second, we applied shuffle sharding as described in What about shuffle-sharding?, to further control system impact in case of black swan events.

In Figure 1 we see two components:

A light-weight routing layer that manages the traffic routing of incoming requests to the cells.
The cells themselves, which are independent units with isolated boundaries to prevent system wide impact.

Figure 1. High level cell-based architecture

Figure 1. High Level cell-based architecture

We used best practices from Guidance for Cell-based Architecture on AWS for implementation of our cell-based architectures and routing layer. In order to minimize the risk of failure in the routing layer, we made it the thinnest layer and tested robustness for all possible scenarios. We used a hash function to map cells to customers and stored the mapping in a highly scaled and resilient data store, Amazon DynamoDB. This layer eases the addition of new cells, and provides for a horizontal-scaled application environment to gracefully handle the hypergrowth of customers or orders.

In Figure 2, we revisit the architecture as mentioned on Blog #3 of our series. To manage AWS service limits and reduce blast radius, we spread our applications into multiple AWS accounts.

Figure 2. Architecture from Blog #3

Here’s how the new design looks when implemented on AWS:

Figure 3a. Cell-based architecture

Figure 3a uses Amazon EKS instead of Amazon EC2 for deploying applications, and uses a light-weight routing layer for incoming traffic. An Amazon Route 53 configuration was deployed in a networking shared services AWS account. We have multiple isolated boundaries in our design — an AWS availability zone, an AWS Region, and an AWS account. This makes the architecture more resilient and flexible in hypergrowth.

We used our AWS availability zones as a cell boundary. We implemented shuffle sharding to map customers to multiple cells to reduce impact in case one of the cells goes down. We also used shuffle sharding to scale the number of cells that can process a customer request, in case we have unprecedented requests from a customer.

Cell-based design to handle Black swan events Black swan event mitigation

Let’s discuss how our cell-based application will react to black swan events. First, we need to align the pods into cell groups in the Amazon EKS cluster. Now we can observe how each deployment is isolated from the other deployments. Only the routing layer, which include Amazon Route 53, Amazon DynamoDB, and the load balancer, is common across the system.

Figure 3b. Cell-based architecture zoomed in on EKS

Without a cell-based design, a black swan event could take down our application in the initial attack. Now that our application is spread over three cells, the worst case for an initial attack is 33% of our application’s capacity. We now have resiliency boundaries at the node level and availability zones.

Conclusion

In this blog post, we discussed how containerizing applications can improve resource efficiency and help standardize tooling and processes. This reduces engineering overhead of scaling applications. We talked about how adopting cell-based design and shuffle sharding further improves the resilience posture of our applications, our ability to manage failures, and handle unexpected large scaling events.

Further reading:

Build and deploy to Amazon EKS with Amazon CodeCatalyst

2023-10-02 Vineeth Nair

Post Syndicated from Vineeth Nair original https://aws.amazon.com/blogs/devops/build-and-deploy-to-amazon-eks-with-amazon-codecatalyst/

Amazon CodeCatalyst is an integrated service for software development teams adopting continuous integration and deployment (CI/CD) practices into their software development process. CodeCatalyst puts all of the tools that development teams need in one place, allowing for a unified experience for collaborating on, building, and releasing software. You can also integrate AWS resources with your projects by connecting your AWS accounts to your CodeCatalyst space. By managing all of the stages and aspects of your application lifecycle in one tool, you can deliver software quickly and confidently.

Introduction

Containerization has revolutionized the way we develop, deploy, and scale applications. With the rise of managed container services like Amazon Elastic Kubernetes Service (EKS), developers can leverage the power of Kubernetes without worrying about the underlying infrastructure. In this post, we will focus on how DevOps teams can use CodeCatalyst to build and deploy applications to EKS clusters.

CodeCatalyst offers a collection of pre-built actions that encapsulate common container-related tasks such as building and pushing a container image to an ECR and deploying a Kubernetes manifest. In this walkthrough, we will leverage two actions that can greatly simplify the container build and deployment process. We start by building a simple container image with the ‘Push to Amazon ECR’ action from CodeCatalyst labs. This action simplifies the process of building, tagging and pushing an image to an Amazon Elastic Container Registry (ECR). We will also utilize the ‘Deploy to Kubernetes cluster’ action from AWS for pushing our Kubernetes manifests with our updated image.

Architecture diagram demonstrating how a developer uses Cloud9 and a repository to store code, then pushes the image to Amazon ECR and deploys it to Amazon EKS.
Figure 1: Architectural Diagram.

Prerequisites

To follow along with the post, you will need the following items:

An AWS Builder ID for signing in to CodeCatalyst.
A CodeCatalyst space
Have the Space administrator role assigned in your CodeCatalyst space
Have an AWS account associated with your space along with an associated IAM role
A CodeCatalyst project with a source repository
A CodeCatalyst environment configured with a connection to your target AWS account
An Amazon EKS cluster and an Amazon ECR – we have called the ECR ‘demo-app’.
The CodeCatalyst IAM role added to the aws-auth configMap for the EKS cluster

Walkthrough

In this walkthrough, we will build a simple Nginx based application and push this to an ECR, we will then build and deploy this image to an EKS cluster. The emphasis of this post, will be on how to translate a fairly common pattern with microservices applications to a CodeCatalyst workflow. At the end of the post, our workflow will look like so:

The image shows how codecatalyst worflow configured. 1st stage is pusing the Image to Amazon ECR followed by Deploy to Amazon EKS
Figure 2: CodeCatalyst workflow.

Create the base workflow

To begin, we will create our workflow, in the CodeCatalyst project, Select CI/CD → Workflows → Create workflow:

Image shows how to create workflow which has Source repository and Branch to be selected from drop down option
Figure 3: Create workflow.

Leave the defaults for the Source Repository and Branch, select Create. We will have an empty workflow:

Image shows how an emapy workflow looks like. It doesn't have any sptes configured which need to be added in this file based on our requirement.
Figure 4: Empty workflow.

We can edit the workflow from within the CodeCatalyst console, or use a Dev Environment. We will create an initial commit of this workflow file, ignore any validation errors at this stage:

Image shows how to create a Dev environment. User need to provide a workflow name, commit message along with Repository name and Branch name.
Figure 5: Creating Dev environment.

Connect to CodeCatalyst Dev Environment

For this post, we will use an AWS Cloud9 Dev Environment. Our first step is to connect to the Dev environment. Select Code → Dev Environments. If you do not already a Dev Instance, you can create an instance by selecting Create Dev Environment.

Image shows how a Dev environment looks like. The environment we created in the previous step, will be listed here.
Figure 6: My Dev environment.

Create a CodeCatalyst secret

Prior to adding the code, we will add a CodeCatalyst secret that will be consumed by our workflow. Using CodeCatalyst secrets ensures that we do not store sensitive data in plaintext in our workflow file. To create the secrets in the CodeCatalyst console, browse to CICD -> Secrets. Select Create Secret with the following details:

The image shows how we can create a secret. We need to pass Name and Value along with option desctption to create a secret.
Figure 7: Adding secrets.

Name: eks_cluster_name

Value: <Your EKS Cluster name>

Connect to the CodeCatalyst Dev Environment

We already have a Dev Environment so we will select Resume Instance. A new browser tab opens for the IDE and will be available in less than a minute. Once the IDE is ready, we can go ahead and start creating the Dockerfile and Kubernetes manifest that make up our application

mkdir WebApp
cat <<EOF > WebApp/Dockerfile
FROM nginx
RUN apt-get update && apt-get install -y curl
EOF

The previous command block creates our Dockerfile, which we will build in our CodeCatalyst workflow from an Nginx base image and installs cURL. Next, we will add our Kubernetes manifest file to create a Kubernetes deployment and service for our application:

Create a directory called Manifests and a file inside the directory called demo-app.yaml. Update the file with code for deployment and Kubernetes Service.

The image shows the code structure of demo-app.yaml file
Figure 8: demo-app.yaml file.

The previous code block shows the Kubernetes manifest file for our deployment, along with a Kubernetes service. We modify the image value to include the URI for our ECR as this value is unique. Once we have created our Dockerfile and Kubernetes manifest, pull the latest changes to our repository, including our workflow file that we just created. In our environment, our repository is called eks-demo-app:

cd eks-demo-app && git pull

We can now edit this file in our IDE. In our example our workflow is Workflow_df84 , we will locate Workflow_df84.yaml in the .codecatalyst\workflows directory in our repository. From here we can double click on the file to launch in the IDE for editing:

Image shows empty workflow yaml file.
Figure 9: workflow file in yaml format.

Add the build steps to workflow

We can assign our workflow a name and configure the action for our build phase. The code outlined in the following diagram is our CodeCatalyst workflow definition

The image shows updated workflow file which has triggers and actions filled.
Figure 10: Workflow updated with build phase.

Kustomize starts from here

The image shows Kustomize steps added in the workflow file.
Figure 11: Workflow updated with Kustomize.

Deployment starts from here

The image shows deployment stage added in the workflow file.
Figure 12: Workflow updated with Deployment phase.

The workflow will now contain two CodeCatalyst actions – PushtoAmazonECR which builds and pushes our container image to the ECR. We have also added a dependent stage DeploytoKubernetesCluster which deploys our Kubernetes manifest.

To save our changes we select File -> Save, we can then commit these to our git repository by typing the following at the terminal:

git add . && git commit -m ‘adding workflow’ && git push

The previous command will commit and push our changes the CodeCatalyst source repository, as we have a branch trigger for main defined, this will trigger a run of the workflow. We can monitor the status of the workflow in the CodeCatalyst console by selecting CICD -> Workflows. Locate your workflow and click on Runs to view the status.

We will now have all two stages available, as depicted at the beginning of this walkthrough. We will now have a container image in our ECR along with the newly built image deployed to our EKS cluster.

Cleaning up

If you have been following along with this workflow, you should delete the resources that you have deployed to avoid further changes. First, delete the Amazon ECR repository and Amazon EKS cluster (along with associated IAM roles) using the AWS console. Second, delete the CodeCatalyst project by navigating to project settings and choosing to Delete Project.

Conclusion

In this post, we explained how teams can easily get started building, scanning, and deploying a microservice application to an EKS cluster using CodeCatalyst. We outlined the stages in our workflow that enabled us to achieve the end-to-end build and release cycle. We also demonstrated how to enhance the developer experience of integrating CodeCatalyst with our Cloud9 Dev Environment.

Call to Action

Learn more about CodeCatalyst here.

Set up fine-grained permissions for your data pipeline using MWAA and EKS

2023-09-25 Ulrich Hinze

Post Syndicated from Ulrich Hinze original https://aws.amazon.com/blogs/big-data/set-up-fine-grained-permissions-for-your-data-pipeline-using-mwaa-and-eks/

This is a guest blog post co-written with Patrick Oberherr from Contentful and Johannes Günther from Netlight Consulting.

This blog post shows how to improve security in a data pipeline architecture based on Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and Amazon Elastic Kubernetes Service (Amazon EKS) by setting up fine-grained permissions, using HashiCorp Terraform for infrastructure as code.

Many AWS customers use Amazon EKS to execute their data workloads. The advantages of Amazon EKS include different compute and storage options depending on workload needs, higher resource utilization by sharing underlying infrastructure, and a vibrant open-source community that provides purpose-built extensions. The Data on EKS project provides a series of templates and other resources to help customers get started on this journey. It includes a description of using Amazon MWAA as a job scheduler.

Contentful is an AWS customer and AWS Partner Network (APN) partner. Behind the scenes of their Software-as-a-Service (SaaS) product, the Contentful Composable Content Platform, Contentful uses insights from data to improve business decision-making and customer experience. Contentful engaged Netlight, an APN consulting partner, to help set up a data platform to gather these insights.

Most of Contentful’s application workloads run on Amazon EKS, and knowledge of this service and Kubernetes is widespread in the organization. That’s why Contentful’s data engineering team decided to run data pipelines on Amazon EKS as well. For job scheduling, they started with a self-operated Apache Airflow on an Amazon EKS cluster and later switched to Amazon MWAA to reduce engineering and operations overhead. The job execution remained on Amazon EKS.

Contentful runs a complex data pipeline using this infrastructure, including ingestion from multiple data sources and different transformation jobs, for example using dbt. The whole pipeline shares a single Amazon MWAA environment and a single Amazon EKS cluster. With a diverse set of workloads in a single environment, it is necessary to apply the principle of least privilege, ensuring that individual tasks or components have only the specific permissions they need to function.

By segmenting permissions according to roles and responsibilities, Contentful’s data engineering team was able to create a more robust and secure data processing environment, which is essential for maintaining the integrity and confidentiality of the data being handled.

In this blog post, we walk through setting up the infrastructure from scratch and deploying a sample application using Terraform, Contentful’s tool of choice for infrastructure as code.

Prerequisites

To follow along this blog post, you need the latest version of the following tools installed:

AWS CLI, configured with access to your AWS account
Terraform CLI
kubectl

Overview

In this blog post, you will create a sample application with the following infrastructure:

Architecture drawing of the sample application deployed in this blog post

The sample Airflow workflow lists objects in the source bucket, temporarily stores this list using Airflow XComs, and writes the list as a file to the destination bucket. This application is executed using Amazon EKS pods, scheduled by an Amazon MWAA environment. You deploy the EKS cluster and the MWAA environment into a virtual private cloud (VPC) and apply least-privilege permissions to the EKS pods using IAM roles for service accounts. The configuration bucket for Amazon MWAA contains runtime requirements, as well as the application code specifying an Airflow Directed Acyclic Graph (DAG).

Initialize the project and create buckets

Create a file main.tf with the following content in an empty directory:

locals {
  region = "us-east-1"
}

provider "aws" {
  region = local.region
}

resource "aws_s3_bucket" "source_bucket" {
  bucket_prefix = "source"
}

resource "aws_s3_object" "dummy_object" {
  bucket  = aws_s3_bucket.source_bucket.bucket
  key     = "dummy.txt"
  content = ""
}

resource "aws_ssm_parameter" "source_bucket" {
  name  = "mwaa_source_bucket"
  type  = "SecureString"
  value = aws_s3_bucket.source_bucket.bucket
}

resource "aws_s3_bucket" "destination_bucket" {
  bucket_prefix = "destination"
  force_destroy = true
}

resource "aws_ssm_parameter" "destination_bucket" {
  name  = "mwaa_destination_bucket"
  type  = "SecureString"
  value = aws_s3_bucket.destination_bucket.bucket
}

This file defines the Terraform AWS provider as well as the source and destination bucket, whose names are exported as AWS Systems Manager parameters. It also tells Terraform to upload an empty object named dummy.txt into the source bucket, which enables the Airflow sample application we will create later to receive a result when listing bucket content.

Initialize the Terraform project and download the module dependencies by issuing the following command:

terraform init

Create the infrastructure:

terraform apply

Terraform asks you to acknowledge changes to the environment and then starts deploying resources in AWS. Upon successful deployment, you should see the following success message:

Apply complete! Resources: 5 added, 0 changed, 0 destroyed.

Create VPC

Create a new file vpc.tf in the same directory as main.tf and insert the following:

data "aws_availability_zones" "available" {}

locals {
  cidr = "10.0.0.0/16"
  azs  = slice(data.aws_availability_zones.available.names, 0, 3)
}

module "vpc" {
  name               = "data-vpc"
  source             = "terraform-aws-modules/vpc/aws"
  version            = "~> 4.0"
  cidr               = local.cidr
  azs                = local.azs
  public_subnets     = [for k, v in local.azs : cidrsubnet(local.cidr, 8, k + 48)]
  private_subnets    = [for k, v in local.azs : cidrsubnet(local.cidr, 4, k)]
  enable_nat_gateway = true
}

This file defines the VPC, a virtual network, that will later host the Amazon EKS cluster and the Amazon MWAA environment. Note that we use an existing Terraform module for this, which wraps configuration of underlying network resources like subnets, route tables, and NAT gateways.

Download the VPC module:

terraform init

Deploy the new resources:

terraform apply

Note which resources are being created. By using the VPC module in our Terraform file, much of the underlying complexity is taken away when defining our infrastructure, but it’s still useful to know what exactly is being deployed.

Note that Terraform now handles resources we defined in both files, main.tf and vpc.tf, because Terraform includes all .tf files in the current working directory.

Create the Amazon MWAA environment

Create a new file mwaa.tf and insert the following content:

locals {
  requirements_filename = "requirements.txt"
  airflow_version       = "2.6.3"
  requirements_content  = <<EOT
apache-airflow[cncf.kubernetes]==${local.airflow_version}
EOT
}

module "mwaa" {
  source = "github.com/aws-ia/terraform-aws-mwaa?ref=1066050"

  name              = "mwaa"
  airflow_version   = local.airflow_version
  environment_class = "mw1.small"

  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = slice(module.vpc.private_subnets, 0, 2)

  webserver_access_mode = "PUBLIC_ONLY"

  requirements_s3_path = local.requirements_filename
}

resource "aws_s3_object" "requirements" {
  bucket  = module.mwaa.aws_s3_bucket_name
  key     = local.requirements_filename
  content = local.requirements_content

  etag = md5(local.requirements_content)
}

Like before, we use an existing module to save configuration effort for the Amazon MWAA environment. The module also creates the configuration bucket, which we use to specify the runtime dependency of the application (apache-airflow-cncf-kubernetes) in the requirements.txt file. This package, in combination with the preinstalled package apache-airflow-amazon, enables interaction with Amazon EKS.

Download the MWAA module:

terraform init

Deploy the new resources:

terraform apply

This operation takes 20–30 minutes to complete.

Create the Amazon EKS cluster

Create a file eks.tf with the following content:

module "cluster" {
  source = "github.com/aws-ia/terraform-aws-eks-blueprints?ref=8a06a6e"

  cluster_name    = "data-cluster"
  cluster_version = "1.27"

  vpc_id             = module.vpc.vpc_id
  private_subnet_ids = module.vpc.private_subnets
  enable_irsa        = true

  managed_node_groups = {
    node_group = {
      node_group_name = "node-group"
      desired_size    = 1
    }
  }
  application_teams = {
    mwaa = {}
  }

  map_roles = [{
    rolearn  = module.mwaa.mwaa_role_arn
    username = "mwaa-executor"
    groups   = []
  }]
}

data "aws_eks_cluster_auth" "this" {
  name = module.cluster.eks_cluster_id
}

provider "kubernetes" {
  host                   = module.cluster.eks_cluster_endpoint
  cluster_ca_certificate = base64decode(module.cluster.eks_cluster_certificate_authority_data)
  token                  = data.aws_eks_cluster_auth.this.token
}

resource "kubernetes_role" "mwaa_executor" {
  metadata {
    name      = "mwaa-executor"
    namespace = "mwaa"
  }

  rule {
    api_groups = [""]
    resources  = ["pods", "pods/log", "pods/exec"]
    verbs      = ["get", "list", "create", "patch", "delete"]
  }
}

resource "kubernetes_role_binding" "mwaa_executor" {
  metadata {
    name      = "mwaa-executor"
    namespace = "mwaa"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.mwaa_executor.metadata[0].name
  }
  subject {
    kind      = "User"
    name      = "mwaa-executor"
    api_group = "rbac.authorization.k8s.io"
  }
}

output "configure_kubectl" {
  description = "Configure kubectl: make sure you're logged in with the correct AWS profile and run the following command to update your kubeconfig"
  value       = "aws eks --region ${local.region} update-kubeconfig --name ${module.cluster.eks_cluster_id}"
}

To create the cluster itself, we take advantage of the Amazon EKS Blueprints for Terraform project. We also define a managed node group with one node as the target size. Note that in cases with fluctuating load, scaling your cluster with Karpenter instead of the managed node group approach shown above makes the cluster scale more flexibly. We used managed node groups primarily because of the ease of configuration.

We define the identity that the Amazon MWAA execution role assumes in Kubernetes using the map_roles variable. After configuring the Terraform Kubernetes provider, we give the Amazon MWAA execution role permissions to manage pods in the cluster.

Download the EKS Blueprints for Terraform module:

terraform init

Deploy the new resources:

terraform apply

This operation takes about 12 minutes to complete.

Create IAM roles for service accounts

Create a file roles.tf with the following content:

data "aws_iam_policy_document" "source_bucket_reader" {
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["${aws_s3_bucket.source_bucket.arn}"]
  }
  statement {
    actions   = ["ssm:GetParameter"]
    resources = [aws_ssm_parameter.source_bucket.arn]
  }
}

resource "aws_iam_policy" "source_bucket_reader" {
  name   = "source_bucket_reader"
  path   = "/"
  policy = data.aws_iam_policy_document.source_bucket_reader.json
}

module "irsa_source_bucket_reader" {
  source = "github.com/aws-ia/terraform-aws-eks-blueprints//modules/irsa"

  eks_cluster_id              = module.cluster.eks_cluster_id
  eks_oidc_provider_arn       = module.cluster.eks_oidc_provider_arn
  irsa_iam_policies           = [aws_iam_policy.source_bucket_reader.arn]
  kubernetes_service_account  = "source-bucket-reader-sa"
  kubernetes_namespace        = "mwaa"
  create_kubernetes_namespace = false
}

data "aws_iam_policy_document" "destination_bucket_writer" {
  statement {
    actions   = ["s3:PutObject"]
    resources = ["${aws_s3_bucket.destination_bucket.arn}/*"]
  }
  statement {
    actions   = ["ssm:GetParameter"]
    resources = [aws_ssm_parameter.destination_bucket.arn]
  }
}

resource "aws_iam_policy" "destination_bucket_writer" {
  name   = "irsa_destination_bucket_writer"
  policy = data.aws_iam_policy_document.destination_bucket_writer.json
}

module "irsa_destination_bucket_writer" {
  source = "github.com/aws-ia/terraform-aws-eks-blueprints//modules/irsa"

  eks_cluster_id              = module.cluster.eks_cluster_id
  eks_oidc_provider_arn       = module.cluster.eks_oidc_provider_arn
  irsa_iam_policies           = [aws_iam_policy.destination_bucket_writer.arn]
  kubernetes_service_account  = "destination-bucket-writer-sa"
  kubernetes_namespace        = "mwaa"
  create_kubernetes_namespace = false
}

This file defines two Kubernetes service accounts, source-bucket-reader-sa and destination-bucket-writer-sa, and their permissions against the AWS API, using IAM roles for service accounts (IRSA). Again, we use a module from the Amazon EKS Blueprints for Terraform project to simplify IRSA configuration. Note that both roles only get the minimum permissions that they need, defined using AWS IAM policies.

Download the new module:

terraform init

Deploy the new resources:

terraform apply

Create the DAG

Create a file dag.py defining the Airflow DAG:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

dag = DAG(
    "dag_with_fine_grained_permissions",
    description="DAG with fine-grained permissions",
    default_args={
        "cluster_name": "data-cluster",
        "namespace": "mwaa",
        "get_logs": True,
        "is_delete_operator_pod": True,
    },
    schedule="@hourly",
    start_date=datetime(2023, 1, 1),
    catchup=False,
)

read_bucket = EksPodOperator(
    task_id="read-bucket",
    pod_name="read-bucket",
    service_account_name="source-bucket-reader-sa",
    image="amazon/aws-cli:latest",
    cmds=[
        "sh",
        "-xc",
        "aws s3api list-objects --output json --bucket $(aws ssm get-parameter --name mwaa_source_bucket --with-decryption --query 'Parameter.Value' --output text)  > /airflow/xcom/return.json",
    ],
    do_xcom_push=True,
    dag=dag,
)

write_bucket = EksPodOperator(
    task_id="write-bucket",
    pod_name="write-bucket",
    service_account_name="destination-bucket-writer-sa",
    image="amazon/aws-cli:latest",
    cmds=[
        "sh",
        "-xc",
        "echo '{{ task_instance.xcom_pull('read-bucket')|tojson }}' > list.json; aws s3 cp list.json s3://$(aws ssm get-parameter --name mwaa_destination_bucket  --with-decryption --query 'Parameter.Value' --output text)",
    ],
    dag=dag,
)

read_bucket >> write_bucket

The DAG is defined to run on an hourly schedule, with two tasks read_bucket with service account source-bucket-reader-sa and write_bucket with service account destination-bucket-writer-sa, running after one another. Both are run using the EksPodOperator, which is responsible for scheduling the tasks on Amazon EKS, using the AWS CLI Docker image to run commands. The first task lists files in the source bucket and writes the list to Airflow XCom. The second task reads the list from XCom and stores it in the destination bucket. Note that the service_account_name parameter differentiates what each task is permitted to do.

Create a file dag.tf to upload the DAG code to the Amazon MWAA configuration bucket:

locals {
  dag_filename = "dag.py"
}

resource "aws_s3_object" "dag" {
  bucket = module.mwaa.aws_s3_bucket_name
  key    = "dags/${local.dag_filename}"
  source = local.dag_filename

  etag = filemd5(local.dag_filename)
}

Deploy the changes:

terraform apply

The Amazon MWAA environment automatically imports the file from the S3 bucket.

Run the DAG

In your browser, navigate to the Amazon MWAA console and select your environment. In the top right-hand corner, select Open Airflow UI . You should see the following:

Screenshot of the MWAA user interface

To trigger the DAG, in the Actions column, select the play symbol and then select Trigger DAG. Click on the DAG name to explore the DAG run and its results.

Navigate to the Amazon S3 console and choose the bucket starting with “destination”. It should contain a file list.json recently created by the write_bucket task. Download the file to explore its content, a JSON list with a single entry.

Clean up

The resources you created in this walkthrough incur AWS costs. To delete the created resources, issue the following command:

terraform destroy

And approve the changes in the Terraform CLI dialog.

Conclusion

In this blog post, you learned how to improve the security of your data pipeline running on Amazon MWAA and Amazon EKS by narrowing the permissions of each individual task.

To dive deeper, use the working example created in this walkthrough to explore the topic further: What happens if you remove the service_account_name parameter from an Airflow task? What happens if you exchange the service account names in the two tasks?

For simplicity, in this walkthrough we used a flat file structure with Terraform and Python files inside a single directory. We did not adhere to the standard module structure proposed by Terraform, which is generally recommended. In a real-life project, splitting up the project into multiple Terraform projects or modules may also increase flexibility, speed, and independence between teams owning different parts of the infrastructure.

Lastly, make sure to study the Data on EKS documentation, which provides other valuable resources for running your data pipeline on Amazon EKS, as well as the Amazon MWAA and Apache Airflow documentation for implementing your own use cases. Specifically, have a look at this sample implementation of a Terraform module for Amazon MWAA and Amazon EKS, which contains a more mature approach to Amazon EKS configuration and node automatic scaling, as well as networking.

If you have any questions, you can start a new thread on AWS re:Post or reach out to AWS Support.

About the Authors

Ulrich Hinze is a Solutions Architect at AWS. He partners with software companies to architect and implement cloud-based solutions on AWS. Before joining AWS, he worked for AWS customers and partners in software engineering, consulting, and architecture roles for 8+ years.

Patrick Oberherr is a Staff Data Engineer at Contentful with 4+ years of working with AWS and 10+ years in the Data field. At Contentful he is responsible for infrastructure and operations of the data stack which is hosted on AWS.

Johannes Günther is a cloud & data consultant at Netlight with 5+ years of working with AWS. He has helped clients across various industries designing sustainable cloud platforms and is AWS certified.

Let’s Architect! Security in software architectures

2023-08-16 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-security-in-software-architectures/

Security is fundamental for each product and service you are building with. Whether you are working on the back-end or the data and machine learning components of a system, the solution should be securely built.

In 2022, we discussed security in our post Let’s Architect! Architecting for Security. Today, we take a closer look at general security practices for your cloud workloads to secure both networks and applications, with a mix of resources to show you how to architect for security using the services offered by Amazon Web Services (AWS).

In this edition of Let’s Architect!, we share some practices for protecting your workloads from the most common attacks, introduce the Zero Trust principle (you can learn how AWS itself is implementing it!), plus how to move to containers and/or alternative approaches for managing your secrets.

A deep dive on the current security threat landscape with AWS

This session from AWS re:Invent, security engineers guide you through the most common threat vectors and vulnerabilities that AWS customers faced in 2022. For each possible threat, you can learn how it’s implemented by attackers, the weaknesses attackers tend to leverage, and the solutions offered by AWS to avert these security issues. We describe this as fundamental architecting for security: this implies adopting suitable services to protect your workloads, as well as follow architectural practices for security.

Take me to this re:Invent 2022 session!

Statistics about common attacks and how they can be launched

Zero Trust: Enough talk, let’s build better security

What is Zero Trust? It is a security model that produces higher security outcomes compared with the traditional network perimeter model.

How does Zero Trust work in practice, and how can you start adopting it? This AWS re:Invent 2022 session defines the Zero Trust models and explains how to implement one. You can learn how it is used within AWS, as well as how any architecture can be built with these pillars in mind. Furthermore, there is a practical use case to show you how Delphix put Zero Trust into production.

Take me to this re:Invent 2022 session!

AWS implements the Zero Trust principle for managing interactions across different services

A deep dive into container security on AWS

Nowadays, it’s vital to have a thorough understanding of a container’s underlying security layers. AWS services, like Amazon Elastic Kubernetes Service and Amazon Elastic Container Service, have harnessed these Linux security-layer protections, keeping a sharp focus on the principle of least privilege. This approach significantly minimizes the potential attack surface by limiting the permissions and privileges of processes, thus upholding the integrity of the system.

This re:Inforce 2023 session discusses best practices for securing containers for your distributed systems.

Take me to this re:Inforce 2023 session!

Fundamentals and best practices to secure containers

Migrating your secrets to AWS Secrets Manager

Secrets play a critical role in providing access to confidential systems and resources. Ensuring the secure and consistent management of these secrets, however, presents a challenge for many organizations.

Anti-patterns observed in numerous organizational secrets management systems include sharing plaintext secrets via unsecured means, such as emails or messaging apps, which can allow application developers to view these secrets in plaintext or even neglect to rotate secrets regularly. This detailed guidance walks you through the steps of discovering and classifying secrets, plus explains the implementation and migration processes involved in transferring secrets to AWS Secrets Manager.

Take me to this AWS Security Blog post!

An organization’s perspectives and responsibilities when building a secrets management solution

Conclusion

We’re glad you joined our conversation on building secure architectures! Join us in a couple of weeks when we’ll talk about cost optimization on AWS.

To find all the blogs from this series, visit the Let’s Architect! list of content on the AWS Architecture Blog.

AWS Cloud service considerations for designing multi-tenant SaaS solutions

2023-08-04 Dennis Greene

Post Syndicated from Dennis Greene original https://aws.amazon.com/blogs/architecture/aws-cloud-service-considerations-for-designing-multi-tenant-saas-solutions/

An increasing number of software as a service (SaaS) providers are considering the move from single to multi-tenant to utilize resources more efficiently and reduce operational costs. This blog aims to inform customers of considerations when evaluating a transformation to multi-tenancy in the Amazon Web Services (AWS) Cloud. You’ll find valuable information on how to optimize your cloud-based SaaS design to reduce operating expenses, increase resiliency, and offer a high-performing experience for your customers.

Single versus multi-tenancy

In a multi-tenant architecture, resources like compute, storage, and databases can be shared among independent tenants. In contrast, a single-tenant architecture allocates exclusive resources to each tenant.

Let’s consider a SaaS product that needs to support many customers, each with their own independent deployed website. Using a single-tenant model (see Figure 1), the SaaS provider may opt to utilize a dedicated AWS account to host each tenant’s workloads. To contain their respective workloads, each tenant would have their own Amazon Elastic Compute Cloud (Amazon EC2) instances organized within an Auto Scaling group. Access to the applications running in these EC2 instances would be done via an Application Load Balancer (ALB). Each tenant would be allocated their own database environment using Amazon Relational Database Service (RDS). The website’s storage (consisting of PHP, JavaScript, CSS, and HTML files) would be provided by Amazon Elastic Block Store (EBS) volumes attached to the EC2 instances. The SaaS provider would have a control plane AWS account used to create and modify these tenant-specific accounts.

Figure 1. Single-tenant configuration

To transition to a multi-tenant pattern, the SaaS provider can use containerization to package each website, and a container orchestrator to deploy the websites across shared compute nodes (EC2 instances). Kubernetes can be employed as a container orchestrator, and a website would then be represented by a Kubernetes deployment and its associated pods. A Kubernetes namespace would serve as the logical encapsulation of the tenant-specific resources, as each tenant would be mapped to one Kubernetes namespace. The Kubernetes HorizontalPodAutoscaler can be utilized for autoscaling purposes, dynamically adjusting the number of replicas in the deployment on a given namespace based on workload demands.

When additional compute resources are required, tools such as the Cluster Autoscaler, or Karpenter, can dynamically add more EC2 instances to the shared Kubernetes Cluster. An ALB can be reused by multiple tenants to route traffic to the appropriate pods. For RDS, SaaS providers can use tenant-specific database schemas to separate tenant data. For static data, Amazon Elastic File System (EFS) and tenant-specific directories can be employed. The SaaS provider would still have a control plane AWS account that would now interact with the Kubernetes and AWS APIs to create and update tenant-specific resources.

This transition to a multi-tenant design utilizing Kubernetes, Amazon Elastic Kubernetes Service (EKS), and other managed services offers numerous advantages. It enables efficient resource utilization by leveraging containerization and auto-scaling capabilities, reducing costs, and optimizing performance (see Figure 2).

Figure 2. Multi-tenant configuration

EKS cluster sizing and customer segmentation considerations in multi-tenancy designs

A high concentration of SaaS tenants hosted within the same system results in a large “blast radius.” This means a failure within the system has the potential to impact all resident tenants. This situation can lead to downtime for multiple tenants at once. To address this problem, SaaS providers are encouraged to partition their customers amongst multiple AWS accounts, each with their own deployments of this multi-tenant architecture. The number of tenants that can be present in a single cluster is a determination that can only be made by the SaaS provider after weighing the risks. Compare the shared fate of some subset of their customers, against the possible efficiency benefits of a multi-tenant architecture.

EKS security

SaaS providers must evaluate whether it’s appropriate for them to make use of containers as a workload isolation boundary. This is of particular importance in multi-tenant Kubernetes architectures, given that containers running on a single Amazon EC2 instance will share the underlying Linux kernel. Security vulnerabilities place this shared resource (the EC2 instance) at risk from attack vectors from the host Linux instance. Risk is elevated when any container running in a Kubernetes Pod cluster initiates untrusted code. This risk is heightened if SaaS providers permit tenants to “bring their code”. Kubernetes is a single tenant orchestrator, but with a multi-tenant approach to SaaS architectures, a single instance of the Amazon EKS control plane will be shared among all the workloads running within a cluster. Amazon EKS considers the cluster as the hard isolation security boundary. Every Amazon EKS managed Kubernetes cluster is isolated in a dedicated single-tenant Amazon VPC. At present, hard multi-tenancy can only be implemented by provisioning a unique cluster for each tenant.

EFS considerations

A SaaS provider may consider EFS as the storage solution for the static content of the multiple tenants. This provides them with a straightforward, serverless, and elastic file system. Directories may be used to separate the content for each tenant. While this approach of creating tenant-specific directories in EFS provides many benefits, there may be challenges harvesting per-tenant utilization and performance metrics. This can result in operational challenges for providers that need to granularly meter per-tenant usage of resources. Consequently, noisy neighbors will be difficult to identify and remediate. To resolve this, SaaS providers should consider building a custom solution to monitor the individual tenants in the multi-tenant file system by leveraging storage and throughput/IOPS metrics.

RDS considerations

Multi-tenant workloads, where data for multiple customers or end users is consolidated in the same RDS database cluster, can present operational challenges regarding per-tenant observability. Both MySQL Community Edition and open-source PostgreSQL have limited ability to provide per-tenant observability and resource governance. AWS customers operating multi-tenant workloads often use a combination of ‘database’ or ‘schema’ and ‘database user’ accounts as substitutes. AWS customers should use alternate mechanisms to establish a mapping between a tenant and these substitutes. This will give you the ability to process raw observability data from the database engine externally. You can then map these substitutes back to tenants, and distinguish tenants in the observability data.

Conclusion

In this blog, we’ve shown what to consider when moving to a multi-tenancy SaaS solution in the AWS Cloud, how to optimize your cloud-based SaaS design, and some challenges and remediations. Invest effort early in your SaaS design strategy to explore your customer requirements for tenancy. Work backwards from your SaaS tenants end goals. What level of computing performance do they require? What are the required cyber security features? How will you, as the SaaS provider, monitor and operate your platform with the target tenancy configuration? Your respective AWS account team is highly qualified to advise on these design decisions. Take advantage of reviewing and improving your design using the AWS Well-Architected Framework. The tenancy design process should be followed by extensive prototyping to validate functionality before production rollout.

Related information

AWS Week in Review – Updates on Amazon FSx for NetApp ONTAP, AWS Lambda, eksctl, Karpetner, and More – July 17, 2023

2023-07-17 Channy Yun

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-updates-on-amazon-fsx-for-netapp-ontap-aws-lambda-eksctl-karpetner-and-more-july-17-2023/

The Data Centered: Eastern Oregon, a five-part mini-documentary series looking at the real-life impact of the more than $15 billion investment AWS has made in the local community, and how the company supports jobs, generates economic growth, provides skills training and education, and unlocks opportunities for local businesses suppliers.

Last week, I watched a new episode introducing the Data Center Technician training program offered by AWS to train people with little or no previous technical experience in the skills they need to work in data centers and other information technology (IT) roles. This video reminded me of my first days of cabling and transporting servers in data centers. Remember, there are still people behind cloud computing.

Last Week’s Launches
Here are some launches that got my attention:

Amazon FSx for NetApp ONTAP Updates – Jeff Barr introduced Amazon FSx for NetApp ONTAP support for SnapLock, an ONTAP feature that gives you the power to create volumes that provide write once read many (WORM) functionality for regulatory compliance and ransomware protection. In addition, FSx for NetApp ONTAP now supports IPSec encryption of data in transit and two additional monitoring and troubleshooting capabilities that you can use to monitor file system events and diagnose network connectivity.

AWS Lambda detects and stops recursive loops in Lambda functions – In certain scenarios, due to resource misconfiguration or code defects, a processed event might be sent back to the same service or resource that invoked the Lambda function. This can cause an unintended recursive loop and result in unintended usage and costs for customers. With this launch, Lambda will stop recursive invocations between Amazon SQS, Lambda, and Amazon SNS after 16 recursive calls. For more information, refer to our documentation or the launch blog post.

Email notification

Amazon CloudFront supports for 3072-bit RSA certificates – You can now associate their 3072-bit RSA certificates with CloudFront distributions to enhance communication security between clients and CloudFront edge locations. To get started, associate a 3072-bit RSA certificate with your CloudFront distribution using console or APIs. There are no additional fees associated with this feature. For more information, please refer to the CloudFront Developer Guide.

Running GitHub Actions with AWS CodeBuild – Two weeks ago, AWS CodeBuild started to support GitHub Actions. You can now define GitHub Actions steps directly in the BuildSpec and run them alongside CodeBuild commands. Last week, the AWS DevOps Blog published the blog post about using the Liquibase GitHub Action for deploying changes to an Amazon Aurora database in a private subnet. You can learn how to integrate AWS CodeBuild and nearly 20,000 GitHub Actions developed by the open source community.

CodeBuild configuration showing the GitHub repository URL

Amazon DynamoDB local version 2.0 – You can develop and test applications by running Amazon DynamoDB local in your local development environment without incurring any additional costs. The new 2.0 version allows Java developers to use DynamoDB local to work with Spring Boot 3 and frameworks such as Spring Framework 6 and Micronaut Framework 4 to build modernized, simplified, and lightweight cloud-native applications.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Open Source Updates
Last week, we introduced new open source projects and significant roadmap contributions to the Jupyter community.

New joint maintainership between Weaveworks and AWS for eksctl – Now the eksctl open source project has been moved from the Weaveworks GitHub organization to a new top level GitHub organization—eksctl-io—that will be jointly maintained by Weaveworks and AWS moving forward. The eksctl project can now be found on GitHub.

Karpenter now supports Windows containers – Karpenter is an open source flexible, high-performance Kubernetes node provisioning and management solution that you can use to quickly scale Amazon EKS clusters. With the launch of version 0.29.0, Karpenter extends the automated node provisioning support to Windows containers running on EKS. Read this blog post for a step-by-step guide on how to get started with Karpenter for Windows node groups.

Updates in Amazon Aurora and Amazon OpenSearch Service – Following the announcement of updates to the PostgreSQL database in May by the open source community, we’ve updated Amazon Aurora PostgreSQL-Compatible Edition to support PostgreSQL 15.3, 14.8, 13.11, 12.15, and 11.20. These releases contain product improvements and bug fixes made by the PostgreSQL community, along with Aurora-specific improvements. You can also run OpenSearch version 2.7 in Amazon OpenSearch Service. With OpenSearch 2.7 (also released in May), we’ve made several improvements to observability, security analytics, index management, and geospatial capabilities in OpenSearch Service.

To learn about weekly updates for open source at AWS, check out the latest AWS open source newsletter by Ricardo.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS Storage Day on August 9 – Join a one-day virtual event that will help you to better understand AWS storage services and make the most of your data. Register today.

AWS Global Summits – Sign up for the AWS Summit closest to your city: Hong Kong (July 20), New York City (July 26), Taiwan (August 2-3), São Paulo (August 3), and Mexico City (August 30).

AWS Community Days – Join a community-led conference run by AWS user group leaders in your region: Malaysia (July 22), Philippines (July 29-30), Colombia (August 12), and West Africa (August 19).

AWS re:Invent 2023 – Join us to hear the latest from AWS, learn from experts, and connect with the global cloud community. Registration is now open.

You can browse all upcoming AWS-led in-person and virtual events, and developer-focused events such as AWS DevDay.

Take the AWS Blog Customer Survey
We’re focused on improving our content to provide a better customer experience, and we need your feedback to do so. Take our survey to share insights regarding your experience on the AWS Blog.

This survey is hosted by an external company. AWS handles your information as described in the AWS Privacy Notice. AWS will own the data gathered via this survey and will not share the information collected with survey respondents.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

2023-06-09 Siva Guruvareddiar

Post Syndicated from Siva Guruvareddiar original https://aws.amazon.com/blogs/architecture/simulating-kubernetes-workload-az-failures-with-aws-fault-injection-simulator/

In highly distributed systems, it is crucial to ensure that applications function correctly even during infrastructure failures. One common infrastructure failure scenario is when an entire Availability Zone (AZ) becomes unavailable. Applications are often deployed across multiple AZs to ensure high availability and fault tolerance in cloud environments such as Amazon Web Services (AWS).

Kubernetes helps manage and deploy applications across multiple nodes and AZs, though it can be difficult to test how your applications will behave during an AZ failure. This is where fault injection simulators come in. The AWS Fault Injection Simulator (AWS FIS) service can intentionally inject faults or failures into a system to test its resilience. In this blog post, we will explore how to use an AWS FIS to simulate an AZ failure for Kubernetes workloads.

Solution overview

To ensure that Kubernetes cluster workloads are architected to handle failures, you must test their resilience by simulating real-world failure scenarios. Kubernetes allows you to deploy workloads across multiple AZs to handle failures, but it’s still important to test how your system behaves during AZ failures. To do this, we use a microservice for product details with the aim of running this microservice using auto-scaling with both Cluster Autoscaler (CA, from Kubernetes community) and Karpenter and test how the system responds to varying traffic levels.

This blog post explores a load test to mimic the behavior of hundreds of users accessing the service concurrently to simulate a realistic failure scenario. This test uses AWS FIS to disrupt network connectivity, and simulate AZ failure in a controlled manner. This allows us to measure how users are impacted when using CA and then with Karpenter.

Both CA and Karpenter automatically adjust the size of a cluster based on the resource requirements of the running workloads. By comparing the performance of the microservice under these two autoscaling tools, we can determine which tool is better-suited to handle such scenarios.

Figure 1 demonstrates the solution’s architecture.

Figure 1. Architecture flow for microservices to simulate a realistic failure scenario

Prerequisites

Install the following utilities on a Linux-based host machine, which can be an Amazon Elastic Compute Cloud (Amazon EC2) instance, AWS Cloud9 instance, or a local machine with access to your AWS account:

AWS CLI version 2 to interact with AWS services using CLI commands
Node.js (v16.0.0 or later) and npm (8.10.0 or later)
AWS CDK v2.70.0 or later to build and deploy cloud infrastructure and Kubernetes resources programmatically
Kubectl to communicate with the Kubernetes API server
Helm to manage Kubernetes applications
eks-node-viewer to visualize dynamic node usage within an Amazon Elastic Kubernetes Service (Amazon EKS) cluster

Setting up a microservice environment

This blog post consists of two major parts: Bootstrap and experiment. The bootstrap section provides step-by-step instructions for:

Creating and deploying a sample microservice
Creating an AWS IAM role for the FIS service
Creating an FIS experiment template

By following these bootstrap instructions, you can set up your own environment to test the different autoscaling tools’ performance in Kubernetes.

In the experiment section, we showcase how the system behaves with CA, then Karpenter.

Let’s start by setting a few environment variables using the following code:

export FIS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export FIS_AWS_REGION=us-west-2
export FIS_CLUSTER_NAME="fis-simulation-cluster"

Next, clone the sample repository which contains the code for our solution:

git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd ./containers-blog-maelstrom/fis-simulation-blog

Step 1. Bootstrap the environment

This solution uses Amazon EKS for AWS Cloud Development Kit (AWS CDK) Blueprints to provision our Amazon EKS cluster.

The first step to any AWS CDK deployment is bootstrapping the environment. cdk bootstrap is an AWS Command Line Interface (AWS CLI) tool that prepares the environment with resources required by AWS CDK to perform deployments into that environment (for example, a combination of AWS account and AWS Region).

Let’s run the below commands to bootstrap your environment and install all node dependencies required for deploying the solution:

npm install
cdk bootstrap aws://$FIS_ACCOUNT_ID/$FIS_AWS_REGION

We’ll use Amazon EKS Blueprints for CDK to create an Amazon EKS cluster and deploy add-ons. This stack deploys the following add-ons into the cluster:

AWS Load Balancer Controller
AWS VPC CNI
Core DNS
Kube-proxy

Step 2. Create an Amazon EKS cluster

Run the below command to deploy the Amazon EKS cluster:

npm install
cdk deploy "*" --require-approval never

Deployment takes approximately 20-30 minutes; then you will have a fully functioning Amazon EKS cluster in your account.

fis-simulation-cluster

Deployment time: 1378.09s

Copy and run the aws eks update-kubeconfig ... command from the output section to gain access to your Amazon EKS cluster using kubectl.

Step 3. Deploy a microservice to Amazon EKS

Use the code from the following Github repository and deploy using Helm.

git clone https://github.com/aws-containers/eks-app-mesh-polyglot-demo.git
helm install workshop eks-app-mesh-polyglot-demo/workshop/helm-chart/

Note: You are not restricted to this one as a mandate. If you have other microservices, use the same. This command deploys the following three microservices:

Frontend-node as the UI to the product catalog application
Catalog detail backend
Product catalog backend

To test the resiliency, let’s take one of the microservice viz productdetail, the backend microservice as an example. When checking the status of a service like the following, you will see that proddetail is of type ClusterIP, which is accessible only within the cluster. To access this outside of the cluster, perform the following steps.

kubectl get service proddetail -n workshop
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 5000/TCP 11h
proddetail ClusterIP 10.100.168.219 <none> 3000/TCP 11m

Create ingress class

cat <<EOF | kubectl create -f -
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: aws-alb
spec:
controller: ingress.k8s.aws/alb
EOF

Create ingress resource

cat <<EOF | kubectl create -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: workshop
  name: proddtl-ingress
  annotations:
   alb.ingress.kubernetes.io/scheme: internet-facing
   alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: aws-alb
  rules:
  - http:
   paths:
   - path: /
   pathType: Prefix
   backend:
   service:
   name: proddetail
   port:
   number: 3000
EOF

After this, your web URL is ready:

kubectl get ingress -n workshop
NAME CLASS HOSTS ADDRESS PORTS AGE
proddtl-ingress aws-alb * k8s-workshop-proddtli-166014b35f-354421654.us-west-1.elb.amazonaws.com 80 14s

Test the connectivity from your browser:

Figure 2. Testing the connectivity from your browser

Step 4. Create an IAM Role for AWS FIS

Before an AWS FIS experiment, create an IAM role. Let’s create a trust policy and attach as shown here:

cat > fis-trust-policy.json << EOF
{
   "Version": "2012-10-17",
   "Statement": [
   {
   "Effect": "Allow",
   "Principal": {
   "Service": [
   "fis.amazonaws.com"
   ]
   },
   "Action": "sts:AssumeRole"
   }
   ]
}
EOF
aws iam create-role --role-name my-fis-role --assume-role-policy-document file://permissons/fis-trust-policy.json

Create an AWS FIS policy and attach

aws iam attach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorNetworkAccess
aws iam attach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam attach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEKSAccess
aws iam attach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access
aws iam attach-role-policy --role-name my-fis-role --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorSSMAccess

Step 5: Create an AWS FIS experiment

Use AWS FIS to create an experiment to disrupt the network connectivity as below. Use the following experiment template with the IAM role created from the previous step:

Figure 3. Experiment template

Step 6. Failure simulation with AWS FIS on CA and Karpenter

We’ve completed microservice setup, made it internet-accessible, and created an AWS FIS template to simulate failures. Now let’s experiment with how the system behaves with different autoscalers: CA and Karpenter.

With the microservice available within Amazon EKS cluster, we’ll use Locust to simulate user behavior with a total of 100 users trying to access the URLs concurrently.

For the following experiments, Run 1 shows 100 users trying to access the service without any system disruptions. We’ll then move to AWS FIS and disrupt network connectivity in Run 2. Measuring the user impact and comparing it to the results from the first run provides insights on how the system responds to failures and can be improved for greater reliability and performance.

Simulating failures with Cluster Autoscaler (CA)

To perform this experiment, select your experiment template and click Start experiment, then enter start in the field. Currently 12 replicas of the proddetail microservice are running.

As the following Locust charts detail, Run 1 completed without failures. Run 2 simulated network connectivity disruption, resulting in a visible failure rate of 4 percent, with a peak of 7 failures at one time.

For this experiment, we used a total of seven nodes of type t3.small. Use eks-node-viewer to visualize dynamic node usage within a cluster.

Figure 4. CA experiment results

Figure 5. Dynamic node usage within cluster

Simulating failures with Karpenter

Continuing the same experiment with 12 replicas of the proddetail microservice, this time we are using Karpenter. As in the following figures, the cluster uses a combination of t3.small and “C” and “M” instances provided in Karpenter’s provisioner configuration.

In Run 1, we observe 0 failures. In Run 2, when network connectivity was disrupted by AWS FIS, Karpenter was able to maintain user requests with almost 0 percent failure. This outcome highlights the effectiveness of Karpenter as an autoscaler for maintaining high availability by carefully configuring the provisioner.

Figure 6. Karpenter experiment results

Figure 7. Dynamic node usage within cluster

Cleanup

Use the following commands to clean up your experiment environment.

#delete Ingress resources
kubectl delete ingress proddtl-ingresss -n workshop
kubectl delete ingressclass aws-alb

#delete IAM resources
aws iam delete-role --role-name my-fis-role

#delete FIS resources
fis_template_id=`aws fis list-experiment-templates --region $FIS_AWS_REGION |jq ".experimentTemplates[0].id"`
aws fis delete-experiment-template --id $fis_template_id --region $FIS_AWS_REGION

#delete application resources and cluster
helm uninstall workshop
cdk destroy

Conclusion

This experiment results show that Karpenter performs better and recovers quicker from network disrupt connectivity than Cluster Autoscaler. The figures in this blog post highlight Karpenter’s resiliency and ability to scale and recover from failures quickly.

While this experiment provides valuable insights into the performance and reliability of Kubernetes workloads in the face of failures, it’s important to acknowledge that this is not a true test of an AZ unavailable situation. In a real-world scenario, an AZ failure can have a cascading effect, potentially impacting other services that workloads depend upon. But simulating an AZ failure in a controlled environment helps you better understand how your Kubernetes cluster and applications will behave in an actual failure scenario. This knowledge can help you identify and address any issues before they occur in production, ensuring that your applications remain highly available and resilient.

In summary, this experiment provides good insights into the performance and resilience of Kubernetes workloads. It is not a perfect representation of a real-world AZ failure, but by leveraging tools such as AWS FIS and carefully configuring autoscaling policies, you can take proactive steps to optimize performance and ensure high availability for critical applications.

Let’s Architect! Multi-tenant SaaS architectures

2023-06-07 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-multi-tenant-saas-architectures/

In a multi-tenant architecture multiple instances of an application run on a shared infrastructure. With this type of approach, each tenant is isolated from others, typically through logical separation, while utilizing a shared infrastructure. This allows multiple tenants to use the same application and maintain their data security, privacy, and customization requirements.

Understanding architectural patterns for multi-tenancy has become crucial for architects and developers aiming to deliver scalable, secure, and cost-effective solutions. Isolating tenant data is a fundamental responsibility for Software as a Service (SaaS) providers. In this edition of Let’s Architect!, we talk about comprehensive exploration of multi-tenant architectures, covering various aspects, such as SaaS microservices, SaaS serverless, SaaS EKS, and an insightful whitepaper.

SaaS microservices deep dive: Simplifying multi-tenant development

In this session, Michael Beardsley, Principal Solutions Architect at AWS, takes a deep dive into the realm of multi-tenant microservices, exploring various patterns and strategies that enable the seamless implementation of multi-tenant microservices, all while ensuring that additional complexity is not imposed upon the SaaS builders. He shares practical patterns to simplify the development process by addressing crucial aspect, such as authorization, data access, tenant isolation, metrics, billing, logging, and a plethora of other considerations; this is irrespective of the chosen compute platform (like Amazon Elastic Container Service, Amazon Elastic Kubernetes Service [Amazon EKS], or AWS Lambda) or database solution.

There is another session available that highlights specific techniques and architecture strategies that can directly impact the success of a SaaS business. If you’re interested in learning more about optimizing multi-tenant SaaS architecture, this session is a great opportunity.

Take me to this video!

SaaS multi-tenant microservices

Building a Multi-Tenant SaaS Solution Using AWS Serverless Services

In this AWS Partner Network (APN) Blog post, you will explore a reference solution that presents a comprehensive perspective on a functional multi-tenant serverless SaaS environment. This solution effectively showcases various essential components required to construct a multi-tenant SaaS solution using serverless services, including onboarding processes, tenant isolation mechanisms, data partitioning techniques, a tenant deployment pipeline, and robust observability measures.

By delving into these aspects, you can gain valuable insights into the architecture and design considerations involved in creating a successful multi-tenant SaaS solution.

Take me to this AWS APN blogpost!

Tenant registration flow

Amazon EKS SaaS deep dive: A multi-tenant EKS SaaS solution

In this re:Invent 2021 presentation, Tod Golding, Principal Partner Solutions Architect, chats about a SaaS reference solution that addresses fundamental multi-tenant considerations, examining its approach to core SaaS topics, including tenant isolation, identity, onboarding, tenant administration, and data partitioning. The goal is to explore an Amazon EKS SaaS architecture through the lens of working code and highlight the key architectural strategies that were used in this reference environment.

There is also valuable information available on Github regarding EKS multi-tenancy. Exploring the Github repositories related to EKS multi-tenancy can provide further insights, resources, and practical examples for implementing multi-tenant architectures on EKS. This presentation is an engaging way to dive deeper into this topic and gain a more comprehensive understanding of best practices and real-world implementations.

Take me to this video!

Tenant deployment model

Saas Storage Strategies

Storage represents a challenging aspect of building and delivering multi-tenant software solutions. There are different strategies that can be used to partition tenant data, each with a unique set of trade-offs for implementing separation between tenants. This whitepaper covers different storage models for multi-tenancy; in particular, you can learn about the:

Silo model (data from the tenant is fully isolated)
Pool model (all the tenants use the same database and table)
Bridge model (single database but a different table for each tenant)

For each of these models, the whitepaper describes in detail how they can be implemented, as well as the different trade-offs in terms of isolation and agility. You can also discover how these tenancy models can be implemented specifically on databases, such as Amazon DynamoDB and Amazon Relational Database Service, thus covering both NoSQL and SQL scenarios.

Take me to this whitepaper!

Partitioning model tradeoffs

See you next time!

Thanks for joining our conversation on multi-tenant SaaS architectures! Next time, we’ll talk about open-source technologies.

To find all the blogs from this series, you can check out the Let’s Architect! list of content on the AWS Architecture Blog.

Introducing Amazon EMR on EKS job submission with Spark Operator and spark-submit

2023-06-06 Lotfi Mouhib

Post Syndicated from Lotfi Mouhib original https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-on-eks-job-submission-with-spark-operator-and-spark-submit/

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast and cost-effectively. The EMR runtime provides up to 5.37 times better performance and 76.8% cost savings, when compared to using open-source Apache Spark on Amazon EKS.

Building on the success of Amazon EMR on EKS, customers have been running and managing jobs using the emr-containers API, creating EMR virtual clusters, and submitting jobs to the EKS cluster, either through the AWS Command Line Interface (AWS CLI) or Apache Airflow scheduler. However, other customers running Spark applications have chosen Spark Operator or native spark-submit to define and run Apache Spark jobs on Amazon EKS, but without taking advantage of the performance gains from running Spark on the optimized EMR runtime. In response to this need, starting from EMR 6.10, we have introduced a new feature that lets you use the optimized EMR runtime while submitting and managing Spark jobs through either Spark Operator or spark-submit. This means that anyone running Spark workloads on EKS can take advantage of EMR’s optimized runtime.

In this post, we walk through the process of setting up and running Spark jobs using both Spark Operator and spark-submit, integrated with the EMR runtime feature. We provide step-by-step instructions to assist you in setting up the infrastructure and submitting a job with both methods. Additionally, you can use the Data on EKS blueprint to deploy the entire infrastructure using Terraform templates.

Infrastructure overview

In this post, we walk through the process of deploying a comprehensive solution using eksctl, Helm, and AWS CLI. Our deployment includes the following resources:

A VPC, EKS cluster, and managed node group, set up with the eksctl tool
Essential Amazon EKS managed add-ons, such as the VPC CNI, CoreDNS, and KubeProxy set up with the eksctl tool
Cluster Autoscaler and Spark Operator add-ons, set up using Helm
A Spark job execution AWS Identity and Access Management (IAM) role, IAM policy for Amazon Simple Storage Service (Amazon S3) bucket access, service account, and role-based access control, set up using the AWS CLI and eksctl

Prerequisites

Verify that the following prerequisites are installed on your machine:

The AWS CLI, in order to interact with AWS services. For instructions, refer to Installing or updating the latest version of the AWS CLI.
kubectl, which allows you to run commands against Kubernetes clusters.
eksctl, a simple CLI tool for creating EKS clusters.
Helm 3.7+, the package manager for Kubernetes.

Set up AWS credentials

Before proceeding to the next step and running the eksctl command, you need to set up your local AWS credentials profile. For instructions, refer to Configuration and credential file settings.

Deploy the VPC, EKS cluster, and managed add-ons

The following configuration uses us-west-1 as the default Region. To run in a different Region, update the region and availabilityZones fields accordingly. Also, verify that the same Region is used in the subsequent steps throughout the post.

Enter the following code snippet into the terminal where your AWS credentials are set up. Make sure to update the publicAccessCIDRs field with your IP before you run the command below. This will create a file named eks-cluster.yaml:

cat <<EOF >eks-cluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: emr-spark-operator
  region: us-west-1 # replace with your region
  version: "1.25"
vpc:
  clusterEndpoints:
    publicAccess: true
    privateAccess: true
  publicAccessCIDRs: ["YOUR-IP/32"]
availabilityZones: ["us-west-1a","us-west-1b"] # replace with your region
iam:
  withOIDC: true
  serviceAccounts:
  - metadata:
      name: cluster-autoscaler
      namespace: kube-system
    wellKnownPolicies:
      autoScaler: true
    roleName: eksctl-cluster-autoscaler-role
managedNodeGroups:
  - name: m5x
    instanceType: m5.xlarge
    availabilityZones: ["us-west-1a"]
    volumeSize: 100
    volumeType: gp3
    minSize: 2
    desiredCapacity: 2
    maxSize: 10
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/eks-nvme: "owned" 
addons:
  - name: vpc-cni
    version: latest
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
EOF

Use the following command to create the EKS cluster : eksctl create cluster -f eks-cluster.yaml

Deploy Cluster Autoscaler

Cluster Autoscaler is crucial for automatically adjusting the size of your Kubernetes cluster based on the current resource demands, optimizing resource utilization and cost. Create an autoscaler-helm-values.yaml file and install the Cluster Autoscaler using Helm:

cat <<EOF >autoscaler-helm-values.yaml
---
autoDiscovery:
    clusterName: emr-spark-operator
    tags:
      - k8s.io/cluster-autoscaler/enabled
      - k8s.io/cluster-autoscaler/{{ .Values.autoDiscovery.clusterName }}
awsRegion: us-west-1 # Make sure the region same as the EKS Cluster
rbac:
  serviceAccount:
    create: false
    name: cluster-autoscaler
EOF

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install nodescaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--values autoscaler-helm-values.yaml --debug

You can also set up Karpenter as a cluster autoscaler to automatically launch the right compute resources to handle your EKS cluster’s applications. You can follow this blog on how to setup and configure Karpenter.

Deploy Spark Operator

Spark Operator is an open-source Kubernetes operator specifically designed to manage and monitor Spark applications running on Kubernetes. It streamlines the process of deploying and managing Spark jobs, by providing a Kubernetes custom resource to define, configure and run Spark applications, as well as manage the job life cycle through Kubernetes API. Some customers prefer using Spark Operator to manage Spark jobs because it enables them to manage Spark applications just like other Kubernetes resources.

Currently, customers are building their open-source Spark images and using S3a committers as part of job submissions with Spark Operator or spark-submit. However, with the new job submission option, you can now benefit from the EMR runtime in conjunction with EMRFS. Starting with Amazon EMR 6.10 and for each upcoming version of the EMR runtime, we will release the Spark Operator and its Helm chart to use the EMR runtime.

In this section, we show you how to deploy a Spark Operator Helm chart from an Amazon Elastic Container Registry (Amazon ECR) repository and submit jobs using EMR runtime images, benefiting from the performance enhancements provided by the EMR runtime.

Install Spark Operator with Helm from Amazon ECR

The Spark Operator Helm chart is stored in an ECR repository. To install the Spark Operator, you first need to authenticate your Helm client with the ECR repository. The charts are stored under the following path: ECR_URI/spark-operator.

Authenticate your Helm client and install the Spark Operator:

aws ecr get-login-password \
--region us-west-1 | helm registry login \
--username AWS \
--password-stdin 608033475327.dkr.ecr.us-west-1.amazonaws.com

You can authenticate to other EMR on EKS supported Regions by obtaining the AWS account ID for the corresponding Region. For more information, refer to how to select a base image URI.

Install Spark Operator

You can now install Spark Operator using the following command:

helm install spark-operator-demo \
oci://608033475327.dkr.ecr.us-west-1.amazonaws.com/spark-operator \
--set emrContainers.awsRegion=us-west-1 \
--version 1.1.26-amzn-0 \
--set serviceAccounts.spark.create=false \
--namespace spark-operator \
--create-namespace

To verify that the operator has been installed correctly, run the following command:

helm list --namespace spark-operator -o yaml

Set up the Spark job execution role and service account

In this step, we create a Spark job execution IAM role and a service account, which will be used in Spark Operator and spark-submit job submission examples.

First, we create an IAM policy that will be used by the IAM Roles for Service Accounts (IRSA). This policy enables the driver and executor pods to access the AWS services specified in the policy. Complete the following steps:

As a prerequisite, either create an S3 bucket (aws s3api create-bucket --bucket <ENTER-S3-BUCKET> --create-bucket-configuration LocationConstraint=us-west-1 --region us-west-1) or use an existing S3 bucket. Replace <ENTER-S3-BUCKET> in the following code with the bucket name.

Create a policy file that allows read and write access to an S3 bucket:

cat >job-execution-policy.json <<EOL
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::<ENTER-S3-BUCKET>",
                "arn:aws:s3:::<ENTER-S3-BUCKET>/*",
                "arn:aws:s3:::aws-data-lake-workshop/*",
                "arn:aws:s3:::nyc-tlc",
                "arn:aws:s3:::nyc-tlc/*"
            ]
        }
    ]
}
EOL

Create the IAM policy with the following command:

aws iam create-policy --policy-name emr-job-execution-policy --policy-document file://job-execution-policy.json

Next, create the service account named emr-job-execution-sa-role as well as the IAM roles. The following eksctl command creates a service account scoped to the namespace and service account defined to be used by the executor and driver. Make sure to replace <ENTER_YOUR_ACCOUNT_ID> with your account ID before running the command:
```
eksctl create iamserviceaccount \
--cluster=emr-spark-operator \
--region us-west-1 \
--name=emr-job-execution-sa \
--attach-policy-arn=arn:aws:iam::<ENTER_YOUR_ACCOUNT_ID>:policy/emr-job-execution-policy \
--role-name=emr-job-execution-irsa \
--namespace=data-team-a \
--approve
```

Create an S3 bucket policy to allow only the execution role create in step 4 to write and read from the S3 bucket create in step 1. Make sure to replace <ENTER_YOUR_ACCOUNT_ID> with your account ID before running the command:

cat > bucketpolicy.json<<EOL
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ], "Principal": {
                "AWS": "arn:aws:iam::<ENTER_YOUR_ACCOUNT_ID>:role/emr-job-execution-irsa"
            },
            "Resource": [
                "arn:aws:s3:::<ENTER-S3-BUCKET>",
                "arn:aws:s3:::<ENTER-S3-BUCKET>/*"
            ]
        }
    ]
}
EOL

aws s3api put-bucket-policy --bucket ENTER-S3-BUCKET-NAME --policy file://bucketpolicy.json

Create a Kubernetes role and role binding required for the service account used in the Spark job run:

cat <<EOF >emr-job-execution-rbac.yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: emr-job-execution-sa-role
  namespace: data-team-a
rules:
  - apiGroups: ["", "batch","extensions"]
    resources: ["configmaps","serviceaccounts","events","pods","pods/exec","pods/log","pods/portforward","secrets","services","persistentvolumeclaims"]
    verbs: ["create","delete","get","list","patch","update","watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: emr-job-execution-sa-rb
  namespace: data-team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: emr-job-execution-sa-role
subjects:
  - kind: ServiceAccount
    name: emr-job-execution-sa
    namespace: data-team-a
EOF

Apply the Kubernetes role and role binding definition with the following command:

kubectl apply -f emr-job-execution-rbac.yaml

So far, we have completed the infrastructure setup, including the Spark job execution roles. In the following steps, we run sample Spark jobs using both Spark Operator and spark-submit with the EMR runtime.

Configure the Spark Operator job with the EMR runtime

In this section, we present a sample Spark job that reads data from public datasets stored in S3 buckets, processes them, and writes the results to your own S3 bucket. Make sure that you update the S3 bucket in the following configuration by replacing <ENTER_S3_BUCKET> with the URI to your own S3 bucket refered in step 2 of the “Set up the Spark job execution role and service account” section. Also, note that we are using data-team-a as a namespace and emr-job-execution-sa as a service account, which we created in the previous step. These are necessary to run the Spark job pods in the dedicated namespace, and the IAM role linked to the service account is used to access the S3 bucket for reading and writing data.

Most importantly, notice the image field with the EMR optimized runtime Docker image, which is currently set to emr-6.10.0. You can change this to a newer version when it’s released by the Amazon EMR team. Also, when configuring your jobs, make sure that you include the sparkConf and hadoopConf settings as defined in the following manifest. These configurations enable you to benefit from EMR runtime performance, AWS Glue Data Catalog integration, and the EMRFS optimized connector.

Create the file (emr-spark-operator-example.yaml) locally and update the S3 bucket location so that you can submit the job as part of the next step:

cat <<EOF >emr-spark-operator-example.yaml
---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: taxi-example
  namespace: data-team-a
spec:
  type: Scala
  mode: cluster
  # EMR optimized runtime image
  image: "483788554619.dkr.ecr.eu-west-1.amazonaws.com/spark/emr-6.10.0:latest"
  imagePullPolicy: Always
  mainClass: ValueZones
  mainApplicationFile: s3://aws-data-lake-workshop/spark-eks/spark-eks-assembly-3.3.0.jar
  arguments:
    - s3://nyc-tlc/csv_backup
    - "2017"
    - s3://nyc-tlc/misc/taxi _zone_lookup.csv
    - s3://<ENTER_S3_BUCKET>/emr-eks-results
    - emr_eks_demo
  hadoopConf:
    # EMRFS filesystem config
    fs.s3.customAWSCredentialsProvider: com.amazonaws.auth.WebIdentityTokenCredentialsProvider
    fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem
    fs.AbstractFileSystem.s3.impl: org.apache.hadoop.fs.s3.EMRFSDelegate
    fs.s3.buffer.dir: /mnt/s3
    fs.s3.getObject.initialSocketTimeoutMilliseconds: "2000"
    mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem: "2"
    mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem: "true"
  sparkConf:
    spark.eventLog.enabled: "true"
    spark.eventLog.dir: "s3://<ENTER_S3_BUCKET>/"
    spark.kubernetes.driver.pod.name: driver-nyc-taxi-etl
    # Required for EMR Runtime and Glue Catalogue
    spark.driver.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
    spark.driver.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
    spark.executor.extraClassPath: /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*
    spark.executor.extraLibraryPath: /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native
    # EMRFS commiter
    spark.sql.parquet.output.committer.class: com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter
    spark.sql.parquet.fs.optimized.committer.optimization-enabled: "true"
    spark.sql.emr.internal.extensions: com.amazonaws.emr.spark.EmrSparkSessionExtensions
    spark.executor.defaultJavaOptions: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=70 -XX:OnOutOfMemoryError='kill -9 %p'
    spark.driver.defaultJavaOptions:  -XX:OnOutOfMemoryError='kill -9 %p' -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=70
  sparkVersion: "3.3.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "4g"
    serviceAccount: emr-job-execution-sa
  executor:
    cores: 1
    instances: 4
    memory: "4g"
    serviceAccount: emr-job-execution-sa
EOF

Run the following command to submit the job to the EKS cluster:
```
kubectl apply -f emr-spark-operator-example.yaml
```
The job may take 4–5 minutes to complete, and you can verify the successful message in the driver pod logs.
Verify the job by running the following command:

kubectl get pods -n data-team-a

Enable access to the Spark UI

The Spark UI is an important tool for data engineers because it allows you to track the progress of tasks, view detailed job and stage information, and analyze resource utilization to identify bottlenecks and optimize your code. For Spark jobs running on Kubernetes, the Spark UI is hosted on the driver pod and its access is restricted to the internal network of Kubernetes. To access it, we need to forward the traffic to the pod with kubectl. The following steps take you through how to set it up.

Run the following command to forward traffic to the driver pod:

kubectl port-forward <driver-pod-name> 4040:4040

You should see text similar to the following:

Forwarding from 127.0.0.1:4040 -> 4040
Forwarding from [::1]:4040 → 4040

If you didn’t specify the driver pod name at the submission of the SparkApplication, you can get it with the following command:

kubectl get pods -l spark-role=driver,spark-app-name=<your-spark-app-name> -o jsonpath='{.items[0].metadata.name}'

Open a browser and enter http://localhost:4040 in the address bar. You should be able to connect to the Spark UI.

spark-ui-screenshot

Spark History Server

If you want to explore your job after its run, you can view it through the Spark History Server. The preceding SparkApplication definition has the event log enabled and stores the events in an S3 bucket with the following path: s3://YOUR-S3-BUCKET/. For instructions on setting up the Spark History Server and exploring the logs, refer to Launching the Spark history server and viewing the Spark UI using Docker.

spark-submit

spark-submit is a command line interface for running Apache Spark applications on a cluster or locally. It allows you to submit applications to Spark clusters. The tool enables simple configuration of application properties, resource allocation, and custom libraries, streamlining the deployment and management of Spark jobs.

Beginning with Amazon EMR 6.10, spark-submit is supported as a job submission method. This method currently only supports cluster mode as the submission mechanism. To submit jobs using the spark-submit method, we reuse the IAM role for the service account we set up earlier. We also use the S3 bucket used for the Spark Operator method. The steps in this section take you through how to configure and submit jobs with spark-submit and benefit from EMR runtime improvements.

In order to submit a job, you need to use the Spark version that matches the one available in Amazon EMR. For Amazon EMR 6.10, you need to download the Spark 3.3 version.
You also need to make sure you have Java installed in your environment.
Unzip the file and navigate to the root of the Spark directory.
In the following code, replace the EKS endpoint as well as the S3 bucket then run the script:

./bin/spark-submit \
--class ValueZones \
--master k8s://EKS-ENDPOINT \
--conf spark.kubernetes.namespace=data-team-a \
--conf spark.kubernetes.container.image=608033475327.dkr.ecr.us-west-1.amazonaws.com/spark/emr-6.10.0:latest \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-job-execution-sa \
--conf spark.kubernetes.authenticate.executor.serviceAccountName=emr-job-execution-sa \
--conf spark.driver.extraClassPath="/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
--conf spark.driver.extraLibraryPath="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
--conf spark.executor.extraClassPath="/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
--conf spark.executor.extraLibraryPath="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
--conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
--conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
--conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate \
--conf spark.hadoop.fs.s3.buffer.dir=/mnt/s3 \
--conf spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
--deploy-mode cluster \
s3://aws-data-lake-workshop/spark-eks/spark-eks-assembly-3.3.0.jar s3://nyc-tlc/csv_backup 2017 s3://nyc-tlc/misc/taxi_zone_lookup.csv s3://S3_BUCKET/emr-eks-results emr_eks_demo

The job takes about 7 minutes to complete with two executors of one core and 1 G of memory.

Using custom kubernetes schedulers

Customers running a large volume of jobs concurrently might face challenges related to providing fair access to compute capacity that they aren’t able to solve with the standard scheduling and resource utilization management Kubernetes offers. In addition, customers that are migrating from Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) and are managing their scheduling with YARN queues will not be able to transpose them to Kubernetes scheduling capabilities.

To overcome this issue, you can use custom schedulers like Apache Yunikorn or Volcano.Spark Operator natively supports these schedulers, and with them you can schedule Spark applications based on factors such as priority, resource requirements, and fairness policies, while Spark Operator simplifies application deployment and management. To set up Yunikorn with gang scheduling and use it in Spark applications submitted through Spark Operator, refer to Spark Operator with YuniKorn.

Clean up

To avoid unwanted charges to your AWS account, delete all the AWS resources created during this deployment:

eksctl delete cluster -f eks-cluster.yaml

Conclusion

In this post, we introduced the EMR runtime feature for Spark Operator and spark-submit, and explored the advantages of using this feature on an EKS cluster. With the optimized EMR runtime, you can significantly enhance the performance of your Spark applications while optimizing costs. We demonstrated the deployment of the cluster using the eksctl tool, , you can also use the Data on EKS blueprints for deploying a production-ready EKS which you can use for EMR on EKS and leverage these new deployment methods in addition to the EMR on EKS API job submission method. By running your applications on the optimized EMR runtime, you can further enhance your Spark application workflows and drive innovation in your data processing pipelines.

About the Authors

Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Vara Bonthu is a dedicated technology professional and Worldwide Tech Leader for Data on EKS, specializing in assisting AWS customers ranging from strategic accounts to diverse organizations. He is passionate about open-source technologies, data analytics, AI/ML, and Kubernetes, and boasts an extensive background in development, DevOps, and architecture. Vara’s primary focus is on building highly scalable data and AI/ML solutions on Kubernetes platforms, helping customers harness the full potential of cutting-edge technology for their data-driven pursuits.

AWS Week in Review – Amazon Security Lake Now GA, New Actions on AWS Fault Injection Simulator, and More – June 5, 2023

2023-06-05 Veliswa Boya

Post Syndicated from Veliswa Boya original https://aws.amazon.com/blogs/aws/aws-week-in-review-amazon-security-lake-now-ga-new-actions-on-aws-fault-injection-simulator-and-more-june-5-2023/

Last Wednesday, I traveled to Cape Town to speak at the .Net Developer User Group. My colleague Francois Bouteruche also gave a talk but joined virtually. I enjoyed my time there—what an amazing community! Join the group in order to learn about upcoming events.

Now onto the AWS updates from last week. There was a lot of news related to AWS, and I have compiled a few announcements you need to know. Let’s get started!

Last Week’s Launches
Here are a few launches from last week that you might have missed:

Amazon Security Lake is now Generally Available – This service automatically centralizes security data from AWS environments, SaaS providers, on-premises environments, and cloud sources into a purpose-built data lake stored in your account, making it easier to analyze security data, gain a more comprehensive understanding of security across your entire organization, and improve the protection of your workloads, applications, and data. Read more in Channy’s post announcing the preview of Security Lake.

New AWS Direct Connect Location in Santiago, Chile – The AWS Direct Connect service lets you create a dedicated network connection to AWS. With this service, you can build hybrid networks by linking your AWS and on-premises networks to build applications that span environments without compromising performance. Last week we announced the opening of a new AWS Direct Connect location in Santiago, Chile. This new Santiago location offers dedicated 1 Gbps and 10 Gbps connections, with MACsec encryption available for 10 Gbps. For more information on over 115 Direct Connect locations worldwide, visit the locations section of the Direct Connect product detail pages.

New actions on AWS Fault Injection Simulator for Amazon EKS and Amazon ECS – Had it not been for Adrian Hornsby’s LinkedIn post I would have missed this announcement. We announced the expanded support of AWS Fault Injection Simulator (FIS) for Amazon Elastic Kubernetes Service (EKS) and Amazon Elastic Container Service (ECS). This expanded support adds additional AWS FIS actions for Amazon EKS and Amazon ECS. Learn more about Amazon ECS task actions here, and Amazon EKS pod actions here.

Other AWS News
A few more news items and blog posts you might have missed:

Autodesk Uses Sagemaker to Improve Observability – One of our customers, Autodesk, used AWS services including Amazon Sagemaker, Amazon Kinesis, and Amazon API Gateway to build a platform that enables development and deployment of near-real time personalization experiments by modeling and responding to user behavior data. All this delivered a dynamic, personalized experience for Autodesk’s customers. Read more about the story at AWS Customer Stories.

AWS DMS Serverless – We announced AWS DMS Serverless which lets you automatically provision and scale capacity for migration and data replication. Donnie wrote about this announcement here.

For AWS open-source news and updates, check out the latest newsletter curated by my colleague Ricardo Sueiras to bring you the most recent updates on open-source projects, posts, events, and more.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Upcoming AWS Events
We have the following upcoming events. These give you the opportunity to meet with other tech enthusiasts and learn:

AWS Silicon Innovation Day (June 21) – A one-day virtual event that will allow you to understand AWS Silicon and how you can use AWS’s unique silicon offerings to innovate. Learn more and register here.

AWS Global Summits – Sign up for the AWS Summit closest to where you live: London (June 7), Washington, DC (June 7–8), Toronto (June 14).

AWS Community Days – Join these community-led conferences where event logistics and content are planned, sourced, and delivered by community leaders: Chicago, Illinois (June 15), and Chile (July 1).

And with that, I end my very first Week in Review post, and this was such fun to write. Come back next Monday for another Week in Review!

– Veliswa x

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Hands-on Generative AI: AWS workshops

See you next time!

Solution overview

Walkthrough

Prerequisites

Step 1: Set up the environment

Deploy Kyverno example and policy

Deploy a noncompliant resource to test this solution

Step 2: Solution code deployment and configuration

To deploy the solution

Step 3: Set up EKS cluster access

To configure access to EKS clusters

Step 4: Verify AWS service integration

To verify the integration is running

Step 5: Clean up

To destroy the stack

Conclusion

Solution overview

Prerequisites

Walkthrough

1. Create a Karpenter NodePool and EC2NodeClass

2. Deploy a sample workload

3. Scale in a sample workload to observe consolidation

What would happen if a Spot node could not be consolidated?

Spot best practices with Karpenter

Cleanup

Conclusion

#10: Build a serverless retail solution for endless aisle on AWS

#9: Optimizing data with automated intelligent document processing solutions

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

#6: Let’s Architect! Designing event-driven architectures

#5: Use a reusable ETL framework in your AWS lake house architecture

#4: Invoking asynchronous external APIs with AWS Step Functions

#3: Announcing updates to the AWS Well-Architected Framework

#2: Let’s Architect! Designing architectures for multi-tenancy

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Bonus! Three older special mentions

What is ABAC?

How External Secrets Operator works

How to use ESO for ABAC

Architecture overview

Prerequisites

Deploy the solution

Create an IAM role to access Secrets Manager secrets

Create an IAM role for the ESO service account

Create secrets in Secrets Manager

Create ESO objects in your cluster

What about AWS Secrets and Configuration Provider (ASCP)?

Conclusion

Background

Services and tools

CI/CD services

Container services

Verification tools

Solution overview

Prerequisites

Walkthrough

To sign the image (AWS CLI)

Decommission the AWS resources

To clean up the resources

Conclusion

Containerize applications for standardization and scale

Standardize container orchestration tooling

Identify the right tools for logging and monitoring

Managing resource utilization

Scaling cluster and applications dynamically

Build rollback strategies for failure management

Improving resilience and scale using cell-based design

Cell-based design to handle Black swan events Black swan event mitigation

Conclusion

Introduction

Prerequisites

Walkthrough

Create the base workflow

Connect to CodeCatalyst Dev Environment

Create a CodeCatalyst secret

Connect to the CodeCatalyst Dev Environment

Add the build steps to workflow

Cleaning up