Tag Archives: Amazon Elastic Kubernetes Service

AWS Week In Review — September 26, 2022

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/aws-week-in-review-september-26-2022/

It looks like my travel schedule is coupled with this Week In Review series of blog posts. This week, I am traveling to Fort-de-France in the French Caribbean islands to meet our customers and partners. I enjoy the travel time when I am offline. It gives me the opportunity to reflect on the past or plan for the future.

Last Week’s Launches
Here are some of the launches that caught my eye last week:

Amazon SageMaker Autopilot – Autopilot has added a new ensemble training mode powered by AutoGluon that is 8X faster than the current hyperparameter optimization mode and supports a wide range of algorithms, including LightGBM, CatBoost, XGBoost, Random Forest, Extra Trees, linear models, and neural networks based on PyTorch and FastAI.

AWS Outposts and Amazon EKS – You can now deploy both the worker nodes and the Kubernetes control plane on an Outposts rack. This allows you to maximize your application availability in case of temporary network disconnection on premises. The Kubernetes control plane continues to manage the worker nodes during the disconnection, and no pod eviction happens when network connectivity is reestablished.

Amazon Corretto 19 – Corretto is a no-cost, multiplatform, production-ready distribution of OpenJDK. Corretto is distributed by Amazon under an open source license. This version supports the latest OpenJDK feature release and is available on Linux, Windows, and macOS. You can download Corretto 19 from our downloads page.

Amazon CloudWatch Evidently – Evidently is a fully-managed service that makes it easier to introduce experiments and launches in your application code. Evidently adds support for Client Side Evaluations (CSE) for AWS Lambda, powered by AWS AppConfig. Evidently CSE allows application developers to generate feature evaluations in single-digit milliseconds from within their own Lambda functions. Check the client-side evaluation documentation to learn more.

Amazon S3 on AWS Outposts – S3 on Outposts now supports object versioning. Versioning helps you to locally preserve, retrieve, and restore each version of every object stored in your buckets. Versioning objects makes it easier to recover from both unintended user actions and application failures.

Amazon Polly – Amazon Polly is a service that turns text into lifelike speech. This week, we announced the general availability of Hiujin, Amazon Polly’s first Cantonese-speaking neural text-to-speech (NTTS) voice. With this launch, the Amazon Polly portfolio now includes 96 voices across 34 languages and language variants.

X in Y – We launched existing AWS services in additional Regions:

Other AWS News
Introducing the Smart City Competency program – The AWS Smart City Competency provides best-in-class partner recommendations to our customers and the broader market. With the AWS Smart City Competency, you can quickly and confidently identify AWS Partners to help you address Smart City focused challenges.

An update to IAM role trust policy behavior – This is potentially a breaking change. AWS Identity and Access Management (IAM) is changing an aspect of how role trust policy evaluation behaves when a role assumes itself. Previously, roles implicitly trusted themselves. AWS is changing role assumption behavior to always require self-referential role trust policy grants. This change improves consistency and visibility with regard to role behavior and privileges. This blog post shares the details and explains how to evaluate whether your roles are impacted by this change and what to modify. According to our data, only 0.0001 percent of roles are impacted, and we notified the affected account owners by email.
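To illustrate the new behavior, a role that assumes itself now needs an explicit, self-referential statement in its trust policy. A minimal sketch, with a placeholder account ID and role name:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/example-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}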

Amazon Music Unifies Music Queuing – The Amazon Music team published a blog post to explain how they created a unified music queue across devices. They used AWS AppSync and AWS Amplify to build a robust solution that scales to millions of music lovers.

Upcoming AWS Events
Check your calendar and sign up for an AWS event in your Region and language:

AWS re:Invent – Learn the latest from AWS and get energized by the community present in Las Vegas, Nevada. Registration is open for re:Invent 2022, which will be held from Monday, November 28 to Friday, December 2.

AWS Summits – Come together to connect, collaborate, and learn about AWS. Registration is open for the following in-person AWS Summits: Bogotá (October 4), and Singapore (October 6).

Natural Language Processing (NLP) Summit – The AWS NLP Summit 2022 will host over 25 sessions focusing on the latest trends, hottest research, and innovative applications leveraging NLP capabilities on AWS. It is happening at our UK headquarters in London, October 5–6, and you can register now.

AWS Innovate for every app – This regional online conference is designed to inspire and educate executives and IT professionals about AWS. It offers dozens of technical sessions in eight languages (English, Spanish, French, German, Italian, Japanese, Korean, and Indonesian). Register today: Americas, September 28; Europe, Middle-East, and Africa, October 6; Asia Pacific & Japan, October 20.

AWS Community Days – AWS Community Day events are community-led conferences to share and learn with one another. In September, the AWS community in the US will run events in Arlington, Virginia (September 30). In Europe, Community Day events will be held in October. Join us in Amersfoort, Netherlands (October 3), Warsaw, Poland (October 14), and Dresden, Germany (October 19).

AWS Tour du Cloud – The AWS team in France has prepared a roadshow to meet customers and partners: a one-day free conference in seven cities across the country (Aix-en-Provence, Lille, Toulouse, Bordeaux, Strasbourg, Nantes, and Lyon) and in Fort-de-France, Martinique. See the Tour du Cloud France page for details.

AWS Fest – This third-party event will feature AWS influencers, community heroes, industry leaders, and AWS customers, all sharing AWS optimization secrets (this Wednesday, September 28). You can register for AWS Fest here.

Stay Informed
That is my selection for this week! To better keep up with all of this news, please check out the following resources:

— seb
This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

How Facteus improved Quantamatics performance by adopting Amazon Aurora Serverless and Amazon EKS

Post Syndicated from Aishwarya Subramaniam original https://aws.amazon.com/blogs/architecture/how-facteus-improved-quantamatics-performance-by-adopting-amazon-aurora-serverless-and-amazon-eks/

Facteus Inc. is a leading provider of actionable insights from sensitive transaction data. Facteus safely transforms raw financial transaction data from legacy technologies into actionable information, without compromising data privacy, through its innovative synthetic data process. Quantamatics is one of Facteus’ core product offerings.

Quantamatics accelerates the time it takes a user to go from raw alternative data to insights by providing a cloud-based, turnkey research platform that handles data from ingestion to analysis. The platform saves analysts, data researchers, and data scientists time by doing all the preparation and normalization work before they begin insight discovery. The provided cloud environment also allows for easy and flexible analysis of both provided and external data sources. Quantamatics is a SaaS offering with a subscription model that provides access to both the research platform and the associated Facteus datasets.

In June 2021, Facteus re-architected their monolithic Quantamatics application to use microservices. This blog will contrast the before and after states from a performance and management perspective as they migrated from Snowflake to Amazon Aurora Serverless v2 (Postgres) and from Amazon Elastic Compute Cloud (Amazon EC2) to Amazon Elastic Kubernetes Service (Amazon EKS).

A great place to start when evaluating existing workloads for fault tolerance and reliability is the AWS Well-Architected Framework. The Well-Architected Framework is designed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. Based on six pillars—operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability—the Framework provides a consistent approach for customers to evaluate architectures, and implement designs that will scale over time.

The AWS Well-Architected Tool, available at no charge in the AWS Management Console, lets you create self-assessments to identify and correct gaps in your current architecture. Adhering to Well-Architected principles, Facteus adopted managed services, such as Amazon EKS and Amazon Aurora Serverless, because they reduce the effort of provisioning, configuring, scaling, and backing up infrastructure. Additionally, using managed services helps to save on the overall costs of maintaining the services.

Facteus’ architecture overview

Before

Users can access Quantamatics for their research either through a Jupyter notebook or a Microsoft Excel plugin. Facteus used EC2 instances to directly host the underlying JupyterHub deployments and AWS Elastic Beanstalk to deploy APIs.

The legacy architecture, while cloud-based, had multiple issues that made it ineffective from a maintenance, scalability, and cost perspective (as demonstrated in Figure 1):

  • JupyterHub does not currently support high availability (HA) natively. This meant an EC2 failover required either a relatively long outage while a replacement EC2 node spun up, or roughly double the cost to keep an idle node on standby.
    • Also, because each EC2 instance was specialized, portions of each instance remained unused, resulting in unnecessary costs compared with more modern solutions such as Amazon EKS, which can pool and divide up instances in a more granular fashion.
    • Finally, as the EC2 instances were standalone, solutions would need to be set up to both monitor application health and perform the appropriate actions in case of an outage.
  • Although Elastic Beanstalk deployed API instances in an HA, scalable way, Facteus migrated those instances as well, to fully modernize onto a consistent, microservice-based architecture and to better utilize the pooled resources.

Figure 1. Cloud-based legacy architecture

Quantamatics requires a data warehouse solution running constantly behind an API to deliver acceptable request and response times. While Snowflake is a great data warehousing and big data querying solution, Facteus found it expensive for their deployment. The queries that the Quantamatics APIs run are typically not computationally expensive but do return relatively large amounts of data, which makes transferring the results back to the API over the internet a potential bottleneck.

To address these bottlenecks, Facteus re-architected their application into an Amazon EKS based one, backed with Aurora Serverless v2 (Postgres).

The new architecture resolves the previous problems in two ways (Figure 2):

  • Using Aurora Serverless v2 (Postgres) instead of Snowflake to store and query the API's datasets within the same VPC kept query run time roughly the same but drastically decreased transfer time and associated costs, thanks to the locality of the database and the cost model and scalability of Aurora Serverless v2.
  • By switching to Amazon EKS, the underlying EC2 nodes could easily be pooled and more thoroughly utilized across the various deployments, thus reducing costs. Additionally, as the deployments were now containerized, an outage would result in the quick relocation of those containerized apps (pods) to nodes with capacity, thus reducing downtime and cost.
    • As a side benefit with the move to managed nodes on Amazon EKS, this completely removed the node patching overhead, as Amazon EKS safely handles the patching of the underlying nodes with a single command.
    • Amazon EKS monitors and restarts pods automatically, which eliminated the need to set up and manage a solution that monitors pod health and takes the appropriate actions upon failures.

Figure 2. Contemporary architecture with Amazon EKS and Aurora Serverless v2 (Postgres)

Auto scaling with Amazon EKS and Aurora Serverless

  • Amazon EKS greatly reduced the overhead of setting up and managing autoscaling for Quantamatics in two ways:
    • User compute environments could be spun up as isolated pods, with Amazon EKS spinning nodes up and down automatically based on demand.
    • API instances could also be automatically spun up and down based on network throughput metrics queried by Amazon EKS to handle the requests made by users in a timely fashion.
  • Aurora Serverless v2 automatically scales the database's compute capacity based on the load generated by the corresponding API requests. This reduced costs, because the load varies heavily throughout the day, and it removed the management overhead of spinning read replicas up and down that other solutions would have required.
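As a sketch of how that scaling range is configured, Aurora Serverless v2 capacity bounds are set when creating or modifying the cluster; the identifiers, engine version, and ACU bounds below are illustrative placeholders:

$ aws rds create-db-cluster \
    --db-cluster-identifier quantamatics-demo \
    --engine aurora-postgresql --engine-version 14.4 \
    --master-username postgres --master-user-password <PASSWORD> \
    --serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=16
# The cluster's instances then use the serverless instance class:
$ aws rds create-db-instance \
    --db-instance-identifier quantamatics-demo-instance-1 \
    --db-cluster-identifier quantamatics-demo \
    --db-instance-class db.serverless --engine aurora-postgresql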

Snowflake vs. Aurora Serverless v2 (Postgres) – Quantamatics query performance and cost comparison

The following steps were performed to migrate data from Snowflake to Aurora Serverless v2:

  • Use the Snowflake COPY INTO <location> command to copy the data from the Snowflake database table into one or more files in an S3 bucket.
  • Create tables in Aurora Serverless. Use the aws_commons.create_s3_uri function to build the S3 URI variable for the import.
  • Use the aws_s3.table_import_from_s3 function to import the data file from an Amazon S3 file name prefix.
  • Verify that the information was loaded.
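A command-level sketch of these steps, assuming configured snowsql and psql connections and illustrative bucket, table, and file names (your schema, storage integration, and format options will differ):

# 1. Export from Snowflake to S3
$ snowsql -q "COPY INTO 's3://my-bucket/export/' FROM my_schema.transactions \
    STORAGE_INTEGRATION = my_s3_integration FILE_FORMAT = (TYPE = CSV)"
# 2-3. In Aurora PostgreSQL: enable the extension (tables created beforehand), then import
$ psql -c "CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;"
$ psql -c "SELECT aws_s3.table_import_from_s3( \
    'transactions', '', '(FORMAT csv)', \
    aws_commons.create_s3_uri('my-bucket', 'export/data_0_0_0.csv', 'us-west-2'));"
# 4. Verify that the rows arrived
$ psql -c "SELECT count(*) FROM transactions;"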

This blog post describes importing data from Amazon S3 to Amazon Aurora PostgreSQL.

Testing strategy: Run the corresponding CLI database utility for each database (snowsql vs. psql) from within the VPC, run the same query against each dataset, and write the results as CSV to a local file.
Data set size: ~178,000,000 rows
Result set size: ~418,000 rows

Data source: Snowflake
Configuration: Medium warehouse (running), AWS-based in the same Region as the APIs

  • Cost: ~$0.01 per query based on credit usage
  • Run time: 21.99 seconds total (3.36 seconds query time, 18.63 seconds transfer time)

Data source: Aurora Serverless v2 (Postgres)
Configuration: Idling at four Aurora capacity units (ACUs), with tables and indexes tuned for Quantamatics use cases

  • Cost: ~$0.24 per hour
  • Run time: 7.00 seconds total (3.58 seconds query time, 3.42 seconds transfer time)

Conclusion

The customer was able to achieve similar run times for the given dataset and query, but faster transfer speeds from Aurora Serverless due to the locality of the database. They also realized up to ~40x runtime cost savings by using Aurora Serverless—1,000 queries in Aurora Serverless vs. ~24 queries in Snowflake for the same cost.

Note: These results are specific to Quantamatics use cases where queries are fixed and well-known, and relatively limited in terms of complexity. This allowed the tables and database in Aurora Serverless v2 to be tuned for those specific purposes.

AWS recommends customers review their workloads using the AWS Well-Architected Tool to help ensure that their workloads are performant, secure, and cost-optimized. Well-Architected Framework Reviews are excellent opportunities to work together with your AWS account team and key stakeholders to discuss how modern infrastructure can help you win in the market.

Deploy your Amazon EKS Clusters Locally on AWS Outposts

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/deploy-your-amazon-eks-clusters-locally-on-aws-outposts/

I am pleased to announce the availability of local clusters for Amazon Elastic Kubernetes Service (Amazon EKS) on AWS Outposts. It means that starting today, you can deploy your Amazon EKS cluster entirely on Outposts: both the Kubernetes control plane and the nodes.

Amazon EKS is a managed Kubernetes service that makes it easy for you to run Kubernetes on AWS and on premises. AWS Outposts is a family of fully managed solutions delivering AWS infrastructure and services to virtually any on-premises or edge location for a truly consistent hybrid experience.

To fully understand the benefits of local clusters for Amazon EKS on Outposts, I need to first share a bit of background.

Some customers use Outposts to deploy Kubernetes cluster nodes and pods close to the rest of their on-premises infrastructure. This allows their applications to benefit from low latency access to on-premises services and data while managing the cluster and the lifecycle of the nodes using the same AWS API, CLI, or AWS console as they do for their cloud-based clusters.

Until today, when you deployed Kubernetes applications on Outposts, you typically started by creating an Amazon EKS cluster in the AWS cloud. Then you deployed the cluster nodes on your Outposts machines. In this hybrid cluster scenario, the Kubernetes control plane runs in the parent Region of your Outposts, and the nodes are running on your on-premises Outposts. The Amazon EKS service communicates through the network with the nodes running on the Outposts machine.

But, remember: everything fails all the time. Customers told us the main challenge they have in this scenario is to manage site disconnections. This is something we cannot control, especially when you deploy Outposts on rough edges: areas with poor or intermittent network connections. When the on-premises facility is temporarily disconnected from the internet, the Amazon EKS control plane running in the cloud is unable to communicate with the nodes and the pods. Although the nodes and pods work perfectly and continue to serve the application on the on-premises local network, Kubernetes may consider them unhealthy and schedule them for replacement when the connection is reestablished (see pod eviction in Kubernetes documentation). This may lead to application downtimes when connectivity is restored.

I talked with Chris, our Kubernetes Product Manager and expert, while preparing this blog post. He told me there are at least seven distinct options to configure how a control plane reconnects to its nodes. Unless you master all these options, the system status at re-connection is unpredictable.

To simplify this, we are giving you the ability to host your entire Amazon EKS cluster on Outposts. In this configuration, both the Kubernetes control plane and your worker nodes run locally on premises on your Outposts machine. That way, your cluster continues to operate even in the event of a temporary drop in your service link connection. You can perform cluster operations such as creating, updating, and scaling applications during network disconnects to the cloud.

Local clusters are identical to Amazon EKS in the cloud and automatically deploy the latest security patches to make it easy for you to maintain an up-to-date, secure cluster. You can use the same tooling you use with Amazon EKS in the cloud, and the AWS Management Console provides a single interface for your clusters running on Outposts and in the AWS Cloud.

Let’s See It In Action
Let’s see how we can use this new capability. For this demo, I will deploy the Kubernetes control plane on Amazon Elastic Compute Cloud (Amazon EC2) instances running on premises on an Outposts rack.

I use an Outposts rack already configured. If you want to learn how to get started with Outposts, you can read the steps on the Get Started with AWS Outposts page.

This demo has two parts. First, I create the cluster. Second, I connect to the cluster and create nodes.

Creating Cluster
Before deploying the Amazon EKS local cluster on Outposts, I make sure I have created an IAM cluster role and attached the AmazonEKSLocalOutpostClusterPolicy managed policy. This IAM cluster role is used during cluster creation.

Then, I switch to the Amazon EKS dashboard, and I select Add Cluster, then Create.

On the following page, I choose the location of the Kubernetes control plane: the AWS Cloud or AWS Outposts. I select AWS Outposts and specify the Outposts ID.

The Kubernetes control plane on Outposts is deployed on three EC2 instances for high availability. That’s why I see three Replicas. Then, I choose the instance type according to the number of worker nodes needed for workloads. For example, to handle 0–20 worker nodes, it is recommended to use m5d.large EC2 instances.

On the same page, I specify configuration values for the Kubernetes cluster, such as its Name, Kubernetes version, and the Cluster service role that I created earlier.

On the next page, I configure the networking options. Since Outposts is an extension of an AWS Region, I need to use the VPC and Subnets used by Outposts to enable communication between Kubernetes control plane and worker nodes. For Security Groups, Amazon EKS creates a security group for local clusters that enables communication between my cluster and my VPC. I can also define additional security groups according to my application requirements.

Because we run the Kubernetes control plane inside Outposts, the cluster endpoint can only be accessed privately. This means I can access the Kubernetes cluster only through machines that are deployed in the same VPC, or over the local network via the Outposts local gateway with Direct VPC Routing.

On the next page, I define logging. Logging is disabled by default, and I may enable it as needed. For more details about logging, you can read the Amazon EKS control plane logging documentation.

The last screen allows me to review all configuration options. When I’m satisfied with the configuration, I select Create to create the cluster.

The cluster creation takes a few minutes. To check the cluster creation status, I can use the console or the terminal with the following command:

$ aws eks describe-cluster \
    --region <REGION_CODE> \
    --name <CLUSTER_NAME> \
    --query "cluster.status"

The Status section tells me when the cluster is created and active.

In addition to using the AWS Management Console, I can also create a local cluster using the AWS CLI. Here is the command snippet to create a local cluster with the AWS CLI:

$ aws eks create-cluster \
    --region <REGION_CODE> \
    --name <CLUSTER_NAME> \
    --resources-vpc-config subnetIds=<SUBNET_ID> \
    --role-arn <ARN_CLUSTER_ROLE> \
    --outpost-config controlPlaneInstanceType=<INSTANCE_TYPE>,outpostArns=<ARN_OUTPOST>

Connecting to the Cluster
The endpoint access for a local cluster is private; therefore, I can access it from a local gateway with Direct VPC Routing or from machines that are in the same VPC. To find out how to use local gateways with Outposts, you can follow the information on the Working with local gateways page. For this demo, I use an EC2 instance as a bastion host, and I manage the Kubernetes cluster using kubectl command.

The first thing I do is edit Security Groups to open traffic access from the bastion host. I go to the detail page of the Kubernetes cluster and select the Networking tab. Then I select the link in Cluster security group.

Networking & Security Group

Then, I add inbound rules, and I provide access for the bastion host by specifying its IP address.
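The same rule can be added from the CLI; the security group ID and bastion host IP below are placeholders (the Kubernetes API endpoint listens on port 443):

$ aws ec2 authorize-security-group-ingress \
    --group-id <CLUSTER_SECURITY_GROUP_ID> \
    --protocol tcp --port 443 \
    --cidr <BASTION_HOST_IP>/32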

Once I’ve allowed the access, I create the kubeconfig file on the bastion host by running the command:

$ aws eks update-kubeconfig --region <REGION_CODE> --name <CLUSTER_NAME>

Finally, I use kubectl to interact with the Kubernetes API server, just like usual.

$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-X-Y-Z.us-west-2.compute.internal NotReady control-plane,master 10h v1.21.13 10.X.Y.Z <none> Bottlerocket OS 1.8.0 (aws-k8s-1.21) 5.10.118 containerd://1.6.6+bottlerocket
ip-10-X-Y-Z.us-west-2.compute.internal NotReady control-plane,master 10h v1.21.13 10.X.Y.Z <none> Bottlerocket OS 1.8.0 (aws-k8s-1.21) 5.10.118 containerd://1.6.6+bottlerocket
ip-10-X-Y-Z.us-west-2.compute.internal NotReady control-plane,master 9h v1.21.13 10.X.Y.Z <none> Bottlerocket OS 1.8.0 (aws-k8s-1.21) 5.10.118 containerd://1.6.6+bottlerocket

Kubernetes local clusters running on AWS Outposts run the control plane on three EC2 instances. The output above shows that the status of these three control plane nodes is NotReady. This is expected: they are used exclusively by the control plane, and we cannot use them to schedule pods.

From this stage, you can deploy self-managed node groups using the Amazon EKS local cluster.

Pricing and Availability
Amazon EKS local clusters are charged at the same price as traditional EKS clusters, starting at $0.10 per hour. The EC2 instances required to deploy the Kubernetes control plane and nodes on Outposts are included in the price of your Outposts. As usual, the pricing page has the details.

Amazon EKS local clusters are available in all AWS Regions where Outposts is available.

Go build and create your first EKS local cluster today!

— seb and Donnie.

Choosing an AWS container service to run your modern application

Post Syndicated from Lewis Tang original https://aws.amazon.com/blogs/architecture/choosing-an-aws-container-service-to-run-your-modern-application/

Businesses want to innovate quickly and deliver value even faster. To achieve these goals, the platform needs to enable teams to focus on delivering applications that are reliable, secure, highly available, cost-efficient, and scalable to required sizes.

Consider including containers on AWS in your platform, whether you are trying containers for the first time, spinning out parts of an on-premises solution into microservices in the cloud, or are new to the cloud. Containers can help you achieve a range of business benefits, including increased scalability, agility, flexibility, and cost efficiency.

In this post, we discuss three sets of builder expectations, how AWS container services can help meet your application delivery requirements, and how to choose the appropriate container platform service on AWS.

Decrease container platform operations management overhead

If managing a platform is not your business’s strategic focus (for example, if most of your engineers are code developers), it can be preferable to only manage application development.

Amazon Lightsail containers offer a simple way for developers to deploy their containers to the cloud. With a Docker image you provide for your containers, AWS automatically deploys containerized workloads for you.

Lightsail assigns an HTTPS endpoint that is ready to serve your web application running in the cloud container. It automatically sets up a load-balanced Transport Layer Security (TLS) endpoint and takes care of the TLS certificate. The service replaces unresponsive containers for you automatically; by assigning a Domain Name System (DNS) name to your endpoint, Lightsail maintains the old version until the new version is healthy and ready to go live (Figure 1).

Figure 1. Amazon Lightsail containers
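To give a sense of the workflow, a minimal sketch of deploying a public image with the Lightsail CLI follows; the service name, image, and port are illustrative:

$ aws lightsail create-container-service \
    --service-name my-web-app --power small --scale 1
$ aws lightsail create-container-service-deployment \
    --service-name my-web-app \
    --containers '{"web": {"image": "nginx:latest", "ports": {"80": "HTTP"}}}' \
    --public-endpoint '{"containerName": "web", "containerPort": 80}'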

Another simple way to build and run your containerized web application in AWS is using AWS App Runner, which provides a fully managed container-native service.

Without orchestrators to configure, build pipelines to set up, or load balancers to optimize, you can bring existing containers or use the integrated container build service to go directly from the code repository to deployed application.

The build service can connect to a GitHub repository, providing a Git push workflow that deploys changes automatically. The App Runner orchestration workflow takes care of build, deployment, and configuration tasks, such as host and runtime patching, monitoring, load balancing, and auto scaling (Figure 2). Explore the AWS App Runner documentation and workshop for more details about the service.

Figure 2. AWS App Runner
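As a sketch, creating an App Runner service from a public container image takes a single call; the service name and image are illustrative, and a source-code-based service would use a CodeRepository block instead:

$ aws apprunner create-service \
    --service-name my-web-app \
    --source-configuration '{
        "ImageRepository": {
            "ImageIdentifier": "public.ecr.aws/aws-containers/hello-app-runner:latest",
            "ImageRepositoryType": "ECR_PUBLIC",
            "ImageConfiguration": {"Port": "8000"}
        }
    }'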

When designing an application, you often start with a whiteboard or mental model that has representations of each service and lines for how they interact with each other. When considering an application’s platform architecture, the cloud components are not limited to virtual private cloud (VPC) subnets, load balancers, deployment pipelines, and durable storage for your application’s stateful data. Bringing all underlying cloud components together and making sure the design is well architected can be challenging.

AWS Copilot can provide guided best practices when deploying a microservice architecture that includes multiple services deployed as containers. You can use Copilot to handle cloud component details for you. Given a container image, Copilot works with App Runner or Amazon Elastic Container Service (Amazon ECS) to provision cloud components such as the VPC, and it handles high-availability deployment, load balancer creation, and configuration.

To automate application deployment and new version release, Copilot can create a deployment pipeline so that the latest version of your application is automatically deployed every time you push a new commit to your code repository (as demonstrated in Figure 3).

Figure 3. AWS Copilot pipeline
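In practice, going from a Dockerfile to a deployed, load-balanced service and a release pipeline takes a few Copilot commands; the application and service names are illustrative:

$ copilot init --app demo --name api \
    --type "Load Balanced Web Service" \
    --dockerfile ./Dockerfile --deploy
# Wire up a release pipeline tied to your git repository
$ copilot pipeline init
$ copilot pipeline deploy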

Full-control application deployment with container orchestration

As your business grows, your application portfolio grows with it. Some applications may require Microsoft Windows containers or deep customization of container resource scheduling, monitoring, and logging. To accommodate this, you need the flexibility to configure the required underlying container services while still using an efficient container orchestrator to automate common processes for operational efficiency. This is where Amazon ECS and Amazon Elastic Kubernetes Service (Amazon EKS) can help.

Using Amazon ECS

As demonstrated in Figure 4, Amazon ECS is a highly scalable, high-performance container management service that allows you to easily run containerized applications on a managed cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances or with AWS Fargate, a serverless compute engine for containers. With it, you can launch and stop containerized applications and query the complete state of your cluster. You can also access and configure many familiar features, like security groups and Elastic Load Balancing (ELB), with simple API calls.

Amazon ECS can be used to schedule container placement across your cluster based on resource needs and availability requirements. You can also integrate your own scheduler or third-party schedulers to meet business- or application-specific requirements.

Figure 4. Amazon ECS using AWS Fargate
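For instance, once a task definition is registered, a Fargate-backed service can be created with a few API calls; the cluster, subnet, and security group names below are placeholders:

$ aws ecs create-cluster --cluster-name demo
$ aws ecs register-task-definition --cli-input-json file://taskdef.json
$ aws ecs create-service --cluster demo --service-name web \
    --task-definition web:1 --desired-count 2 --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-0abc123],securityGroups=[sg-0abc123],assignPublicIp=ENABLED}"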

Using Amazon EKS

Amazon EKS is a managed service that can be used to run Kubernetes on AWS without installing, operating, and maintaining your own Kubernetes control plane or nodes. For many developers with Kubernetes experience, Amazon EKS is the preferred option for application container workloads because it provides the flexibility of Kubernetes with the scalability, security, and resiliency of an AWS managed service.

Amazon EKS runs and automatically scales the Kubernetes control plane across multiple AWS Availability Zones to ensure high availability, as in Figure 5. The control plane instances are automatically scaled based on load. Amazon EKS detects and replaces unhealthy control plane instances and provides automated version updates and patching. Amazon EKS enables developers to run up-to-date versions of the open-source Kubernetes software, along with existing or new third-party plugins and tooling. This means you can more easily migrate any standard Kubernetes application to Amazon EKS without code modification.

Scalability and security are essential to your business-critical workloads. Amazon EKS is integrated with many AWS services, including Amazon Elastic Container Registry for container images, ELB for load distribution, IAM for authentication, and Amazon Virtual Private Cloud for isolation.

Figure 5. Amazon EKS scales Kubernetes across multiple availability zones

Conclusion

To innovate and respond to changes faster, businesses need to build modern applications quickly and manage them efficiently. AWS provides container services to run your most sensitive, secure, and business-critical workloads reliably and at scale.

With little-to-no prior container experience, developers can use Lightsail containers to run web application container workloads through an easy-to-use interface. App Runner distills application deployment and management into a single service for running web applications. With Copilot, you get step-by-step best-practice guidance when deploying a microservice architecture with multiple services as containers. Amazon ECS and Amazon EKS give you the flexibility to configure container workloads while maintaining deployment and operational efficiency.

Deploying AWS Lambda functions using AWS Controllers for Kubernetes (ACK)

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/deploying-aws-lambda-functions-using-aws-controllers-for-kubernetes-ack/

This post is written by Rajdeep Saha, Sr. SSA, Containers/Serverless.

AWS Controllers for Kubernetes (ACK) allows you to manage AWS services directly from Kubernetes. With the ACK service controller for AWS Lambda, you can provision and manage Lambda functions with kubectl and custom resources. With ACK, you can have a single consolidated approach to managing container workloads and other AWS services, such as Lambda, directly from Kubernetes without needing additional infrastructure automation tools.

This post walks you through deploying a sample Lambda function from a Kubernetes cluster provided by Amazon EKS.

Use cases

Some of the use cases for provisioning Lambda functions from ACK include:

  • Your organization already has a DevOps process to deploy resources into the Amazon EKS cluster using Kubernetes declarative YAMLs (known as manifest files). With ACK for AWS Lambda, you can now use manifest files to provision Lambda functions without creating separate infrastructure as a code template.
  • Your project has implemented GitOps with Kubernetes. With GitOps, git becomes the single source of truth, and all changes are made via the git repo. In this model, Kubernetes continuously reconciles the git repo (desired state) with the resources running inside the cluster (current state); if any differences are found, the GitOps process automatically applies changes from the git repo to the cluster. Because ACK for AWS Lambda creates the Lambda function as a Kubernetes custom resource, the same GitOps model applies to Lambda.
  • Your organization has established permissions boundaries for different users and groups using role-based access control (RBAC) and IAM roles for service accounts (IRSA). You can reuse this security model for Lambda without having to create new users and policies.

How ACK for AWS Lambda works

  1. The ‘Ops’ team deploys the ACK service controller for Lambda. This controller runs as a pod within the Amazon EKS cluster.
  2. The controller pod needs permission to read the Lambda function code and create the Lambda function. The Lambda function code is stored as a zip file in an S3 bucket for this example. The permissions are granted to the pod using IRSA.
  3. Each AWS service has a separate ACK service controller. This specific controller for AWS Lambda can act on the custom resource type ‘Function’.
  4. The ‘Dev’ team deploys a Kubernetes manifest file with the custom resource type ‘Function’. This manifest file defines the necessary fields required to create the function, such as the S3 bucket name, zip file name, Lambda function IAM role, and so on.
  5. The ACK service controller creates the Lambda function using the values from the manifest file.

Prerequisites

You need a few tools before deploying the sample application. Ensure that you have each of the following in your working environment:

This post uses shell variables to make it easier to substitute the actual names for your deployment. When you see placeholders like NAME=<your xyz name>, substitute in the name for your environment.

Setting up the Amazon EKS cluster

  1. Run the following command to create an Amazon EKS cluster. This single command creates a two-node Amazon EKS cluster with a unique name.
    eksctl create cluster
  2. It may take 15–30 minutes to provision the Amazon EKS cluster. When the cluster is ready, run:
    kubectl get nodes
  3. The output lists the nodes in the cluster.
  4. To get the Amazon EKS cluster name to use throughout the walkthrough, run:
    eksctl get cluster
    
    export EKS_CLUSTER_NAME=<provide the name from the previous command>

Setting up the ACK Controller for Lambda

To set up the ACK Controller for Lambda:

  1. Install an ACK Controller with Helm by following these instructions:
    – Change ‘export SERVICE=s3’ to ‘export SERVICE=lambda’.
    – Change ‘export AWS_REGION=us-west-2’ to reflect your Region appropriately.
  2. To configure IAM permissions for the pod running the Lambda ACK Controller to permit it to create Lambda functions, follow these instructions.
    – Replace ‘SERVICE=”s3”’ with ‘SERVICE=”lambda”’.
  3. Validate that the ACK Lambda controller is running:
    kubectl get pods -n ack-system
  4. The output shows the running ACK Lambda controller pod.

Provisioning a Lambda function from the Kubernetes cluster

In this section, you write a sample “Hello world” Lambda function. You zip up the code and upload the zip file to an S3 bucket. Finally, you deploy that zip file to a Lambda function using the ACK Controller from the EKS cluster you created earlier. For this example, use Python 3.9 as your language runtime.

To provision the Lambda function:

  1. Run the following to create the sample “Hello world” Lambda function code, and then zip it up:
    mkdir my-helloworld-function
    cd my-helloworld-function
    cat << EOF > lambda_function.py 
    import json
    
    def lambda_handler(event, context):
        # TODO implement
        return {
            'statusCode': 200,
            'body': json.dumps('Hello from Lambda!')
        }
    EOF
    zip my-deployment-package.zip lambda_function.py
    
  2. Create an S3 bucket following the instructions here. Alternatively, you can use an existing S3 bucket in the same Region of the Amazon EKS cluster.
  3. Run the following to upload the zip file into the S3 bucket from the previous step:
    export BUCKET_NAME=<provide the bucket name from step 2>
    aws s3 cp  my-deployment-package.zip s3://${BUCKET_NAME}
  4. The output shows:
    upload: ./my-deployment-package.zip to s3://<BUCKET_NAME>/my-deployment-package.zip
  5. Create your Lambda function using the ACK Controller. The full spec with all the available fields is listed here. First, provide a name for the function:
    export FUNCTION_NAME=hello-world-s3-ack
  6. Create and deploy the Kubernetes manifest file. The command at the end, kubectl create -f lambdamanifest.yaml, submits the manifest file, with kind as ‘Function’. The ACK Controller for Lambda identifies this custom ‘Function’ object and deploys the Lambda function based on the manifest file.
    export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
    export LAMBDA_ROLE="arn:aws:iam::${AWS_ACCOUNT_ID}:role/lambda_basic_execution"
    
    cat << EOF > lambdamanifest.yaml 
    apiVersion: lambda.services.k8s.aws/v1alpha1
    kind: Function
    metadata:
     name: $FUNCTION_NAME
     annotations:
       services.k8s.aws/region: $AWS_REGION
    spec:
     name: $FUNCTION_NAME
     code:
       s3Bucket: $BUCKET_NAME
       s3Key: my-deployment-package.zip
     role: $LAMBDA_ROLE
     runtime: python3.9
     handler: lambda_function.lambda_handler
     description: function created by ACK lambda-controller e2e tests
    EOF
    kubectl create -f lambdamanifest.yaml
    
  7. The output shows:
    function.lambda.services.k8s.aws/<FUNCTION_NAME> created
  8. To retrieve the details of the function using a Kubernetes command, run:
    kubectl describe function/$FUNCTION_NAME
  9. This Lambda function returns a “Hello world” message. To invoke the function, run:
    aws lambda invoke --function-name $FUNCTION_NAME  response.json
    cat response.json
    
  10. The Lambda function returns the following output:
    {"statusCode": 200, "body": "\"Hello from Lambda!\""}

Congratulations! You created a Lambda function from your Kubernetes cluster.

To learn how to provision the Lambda function using the ACK controller from an OCI container image instead of a zip file in an S3 bucket, follow these instructions.

Cleaning up

This section cleans up all the resources that you have created. To clean up:

  1. Delete the Lambda function:
    kubectl delete function $FUNCTION_NAME
  2. If you have created a new S3 bucket, delete it by running:
    aws s3 rm s3://${BUCKET_NAME} --recursive
    aws s3api delete-bucket --bucket ${BUCKET_NAME}
  3. Delete the EKS cluster:
    eksctl delete cluster --name $EKS_CLUSTER_NAME
  4. Delete the IAM role created for the ACK Controller. Get the IAM role name by running the following command, then delete the role from the IAM console:
    echo $ACK_CONTROLLER_IAM_ROLE

Conclusion

This blog post shows how AWS Controllers for Kubernetes enables you to deploy a Lambda function directly from your Amazon EKS environment. AWS Controllers for Kubernetes provides a convenient way to connect your Kubernetes applications to AWS services directly from Kubernetes.

ACK is open source: you can request new features and report issues on the ACK community GitHub repository.

For more serverless learning resources, visit Serverless Land.

Design patterns to manage Amazon EMR on EKS workloads for Apache Spark

Post Syndicated from Jamal Arif original https://aws.amazon.com/blogs/big-data/design-patterns-to-manage-amazon-emr-on-eks-workloads-for-apache-spark/

Amazon EMR on Amazon EKS enables you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (Amazon EKS) without provisioning clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management. Kubernetes uses namespaces to provide isolation between groups of resources within a single Kubernetes cluster. Amazon EMR creates a virtual cluster by registering Amazon EMR with a namespace on an EKS cluster. Amazon EMR can then run analytics workloads on that namespace.

In EMR on EKS, you can submit your Spark jobs to Amazon EMR virtual clusters using the AWS Command Line Interface (AWS CLI), SDK, or Amazon EMR Studio. Amazon EMR requests the Kubernetes scheduler on Amazon EKS to schedule pods. For every job you run, EMR on EKS creates a container with an Amazon Linux 2 base image, Apache Spark, and associated dependencies. Each Spark job runs in a pod on Amazon EKS worker nodes. If your Amazon EKS cluster has worker nodes in different Availability Zones, the Spark application driver and executor pods can spread across multiple Availability Zones. In this case, cross-AZ communication incurs data transfer charges and increases data processing latency. If you want to reduce data processing latency and avoid cross-AZ data transfer costs, you should configure Spark applications to run only within a single Availability Zone.

In this post, we share four design patterns to manage EMR on EKS workloads for Apache Spark. We then show how to use a pod template to schedule a job with EMR on EKS, and use Karpenter as our autoscaling tool.

Pattern 1: Manage Spark jobs by pod template

Customers often consolidate multiple applications on a shared Amazon EKS cluster to improve utilization and save costs. However, each application may have different requirements. For example, you may want to run performance-intensive workloads such as machine learning model training jobs on SSD-backed instances for better performance, or fault-tolerant and flexible applications on Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for lower cost. In EMR on EKS, there are a few ways to configure how your Spark job runs on Amazon EKS worker nodes. You can utilize the Spark configurations on Kubernetes with the EMR on EKS StartJobRun API, or you can use Spark’s pod template feature. Pod templates are specifications that determine how to run each pod on your EKS clusters. With pod templates, you have more flexibility and can use pod template files to define Kubernetes pod configurations that Spark doesn’t support.

You can use pod templates to achieve the following benefits:

  • Reduce costs – You can schedule Spark executor pods to run on EC2 Spot Instances while scheduling Spark driver pods to run on EC2 On-Demand Instances.
  • Improve monitoring – You can enhance your Spark workload’s observability. For example, you can deploy a sidecar container via a pod template to your Spark job that can forward logs to your centralized logging application
  • Improve resource utilization – You can support multiple teams running their Spark workloads on the same shared Amazon EKS cluster

You can implement these patterns using pod templates and Kubernetes labels and selectors. Kubernetes labels are key-value pairs that are attached to objects, such as Kubernetes worker nodes, to identify attributes that are meaningful and relevant to users. You can then choose where Kubernetes schedules pods using nodeSelector or Kubernetes affinity and anti-affinity so that it can only run on specific worker nodes. nodeSelector is the simplest way to constrain pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define.
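For example, a pod template that pins Spark executors to Spot capacity in a single Availability Zone could look like the following sketch; the label values assume EKS managed node groups, and the file would be referenced through the spark.kubernetes.executor.podTemplateFile property in your job configuration:

# executor-pod-template.yaml (a matching driver template would select ON_DEMAND)
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
    topology.kubernetes.io/zone: us-west-2b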

Autoscaling in Spark workload

Autoscaling is a function that automatically scales your compute resources up or down in response to changes in demand. For Kubernetes auto scaling, Amazon EKS supports two auto scaling products: the Kubernetes Cluster Autoscaler and the Karpenter open-source auto scaling project. Kubernetes autoscaling ensures your cluster has enough nodes to schedule your pods without wasting resources. If some pods fail to schedule on current worker nodes due to insufficient resources, autoscaling increases the size of the cluster by adding nodes. It also attempts to remove underutilized nodes when their pods can run elsewhere.

Pattern 2: Turn on Dynamic Resource Allocation (DRA) in Spark

Spark provides a mechanism called Dynamic Resource Allocation (DRA), which dynamically adjusts the resources your application occupies based on the workload. With DRA, the Spark driver spawns the initial number of executors and then scales up the number until the specified maximum number of executors is met to process the pending tasks. Idle executors are deleted when there are no pending tasks. It’s particularly useful if you’re not certain how many executors are needed for your job processing.

You can implement it in EMR on EKS by following the Dynamic Resource Allocation workshop.
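At the configuration level, DRA comes down to a handful of Spark properties; because there is no external shuffle service on Kubernetes, shuffle tracking must be enabled alongside it. The bounds below are illustrative and could be passed, for example, in the sparkSubmitParameters of a StartJobRun request:

--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.shuffleTracking.enabled=true
--conf spark.dynamicAllocation.minExecutors=1
--conf spark.dynamicAllocation.maxExecutors=20
--conf spark.dynamicAllocation.executorIdleTimeout=60s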

Pattern 3: Fully control cluster autoscaling by Cluster Autoscaler

Cluster Autoscaler utilizes the concept of node groups as the element of capacity control and scale. In AWS, node groups are implemented by auto scaling groups. Cluster Autoscaler implements this by controlling the DesiredCapacity field of your auto scaling groups.

To save costs and improve resource utilization, you can use Cluster Autoscaler in your Amazon EKS cluster to automatically scale your Spark pods. The following are recommendations for autoscaling Spark jobs with Amazon EMR on EKS using Cluster Autoscaler:

  • Create Availability Zone bounded auto scaling groups to make sure Cluster Autoscaler only adds worker nodes in the same Availability Zone to avoid cross-AZ data transfer charges and data processing latency.
  • Create separate node groups for EC2 On-Demand and Spot Instances. By doing this, you can add or shrink driver pods and executor pods independently.
  • In Cluster Autoscaler, each node in a node group needs to have identical scheduling properties. That includes EC2 instance types, which should be of similar vCPU to memory ratio to avoid inconsistency and wastage of resources. To learn more about Cluster Autoscaler node groups best practices, refer to Configuring your Node Groups.
  • Adhere to Spot Instance best practices and maximize diversification to take advantages of multiple Spot pools. Create multiple node groups for Spark executor pods with different vCPU to memory ratios. This greatly increases the stability and resiliency of your application.
  • When you have multiple node groups, use pod templates and Kubernetes labels and selectors to manage Spark pod deployment to specific Availability Zones and EC2 instance types.

The following diagram illustrates Availability Zone bounded auto scaling groups.

When multiple node groups are created, Cluster Autoscaler offers the concept of expanders, which provide different strategies for selecting which node group to scale. As of this writing, the following strategies are supported: random, most-pods, least-waste, and priority. With multiple node groups of EC2 On-Demand and Spot Instances, you can use the priority expander, which allows Cluster Autoscaler to select the node group that has the highest priority assigned by the user. For configuration details, refer to Priority based expander for Cluster Autoscaler.
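The priority expander reads its configuration from a ConfigMap in the kube-system namespace, where regex patterns match node group names and higher numbers take precedence. A sketch, assuming your Spot node group names contain "spot":

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*ondemand.*
    50:
      - .*spot.*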

Pattern 4: Group-less autoscaling with Karpenter

Karpenter is an open-source, flexible, high-performance Kubernetes cluster auto scaler built with AWS. The overall goal is the same: auto scaling the Amazon EKS cluster to accommodate un-schedulable pods. However, Karpenter takes a different approach than Cluster Autoscaler, known as group-less provisioning. It observes the aggregate resource requests of unscheduled pods and launches the minimal compute resources needed to fit them, bin-packing efficiently and reducing scheduling latency. It can also delete nodes to reduce infrastructure costs. Karpenter works directly with Amazon EC2 Fleet.

To configure Karpenter, you create provisioners that define how Karpenter manages un-schedulable pods and expired nodes. You should utilize the concept of layered constraints to manage scheduling constraints. To reduce EMR on EKS costs and improve Amazon EKS cluster utilization, you can use Karpenter with similar constraints of Single-AZ, On-Demand Instances for Spark driver pods, and Spot Instances for executor pods without creating multiple types of node groups. With its group-less approach, Karpenter allows you to be more flexible and diversify better.

The following are recommendations for auto scaling EMR on EKS with Karpenter:

  • Configure Karpenter provisioners to launch nodes in a single Availability Zone to avoid cross-AZ data transfer costs and reduce data processing latency.
  • Create a provisioner for EC2 Spot Instances and EC2 On-Demand Instances. You can reduce costs by scheduling Spark driver pods to run on EC2 On-Demand Instances and schedule Spark executor pods to run on EC2 Spot Instances.
  • Limit the instance types by providing a list of EC2 instances or let Karpenter choose from all the Spot pools available to it. This follows the Spot best practices of diversifying across multiple Spot pools.
  • Use pod templates and Kubernetes labels and selectors to allow Karpenter to spin up right-sized nodes required for un-schedulable pods.
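A provisioner capturing these constraints might look like the following sketch (Karpenter v1alpha5 API; the zone and instance types are illustrative):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spark-executors
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["us-west-2b"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.xlarge", "m5.2xlarge", "m5a.xlarge"]
  ttlSecondsAfterEmpty: 30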

The following diagram illustrates how Karpenter works.

To summarize the design patterns we discussed:

  1. Pod templates help tailor your Spark workloads. You can configure Spark pods in a single Availability Zone and utilize EC2 Spot Instances for Spark executor pods, resulting in better price-performance.
  2. EMR on EKS supports the DRA feature in Spark. It is useful if you're not certain how many Spark executors are needed for your job; DRA dynamically adjusts the resources your application needs.
  3. Utilizing Cluster Autoscaler enables you to fully control how to autoscale your Amazon EMR on EKS workloads. It improves your Spark application availability and cluster efficiency by rapidly launching right-sized compute resources.
  4. Karpenter simplifies autoscaling with its group-less provisioning of compute resources. The benefits include reduced scheduling latency, and efficient bin-packing to reduce infrastructure costs.

Walkthrough overview

In our example walkthrough, we show how to use a pod template to schedule a job with EMR on EKS. We use Karpenter as our autoscaling tool.

We complete the following steps to implement the solution:

  1. Create an Amazon EKS cluster.
  2. Prepare the cluster for EMR on EKS.
  3. Register the cluster with Amazon EMR.
  4. For Amazon EKS auto scaling, set up Karpenter auto scaling in Amazon EKS.
  5. Submit a sample Spark job using pod templates to run in single Availability Zone and utilize Spot for Spark executor pods.

The following diagram illustrates this architecture.

Prerequisites

To follow along with the walkthrough, ensure that you have the following prerequisite resources:

Create an Amazon EKS cluster

There are two ways to create an EKS cluster: you can use AWS Management Console and AWS CLI, or you can install all the required resources for Amazon EKS using eksctl, a simple command line utility for creating and managing Kubernetes clusters on EKS. For this post, we use eksctl to create our cluster.

Let’s start with installing the tools to set up and manage your Kubernetes cluster.

  1. Install the AWS CLI with the following command (Linux OS) and confirm it works:
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    sudo ./aws/install
    aws --version

    For other operating systems, see Installing, updating, and uninstalling the AWS CLI version.

  2. Install eksctl, the command line utility for creating and managing Kubernetes clusters on Amazon EKS:
    curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
    sudo mv -v /tmp/eksctl /usr/local/bin
    eksctl version

    eksctl is a tool jointly developed by AWS and Weaveworks that automates much of the experience of creating EKS clusters.

  3. Install the Kubernetes command-line tool, kubectl, which allows you to run commands against Kubernetes clusters:
    curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.18.8/2020-09-18/bin/linux/amd64/kubectl
    chmod +x ./kubectl
    sudo mv ./kubectl /usr/local/bin

  4. Create a new file called eks-create-cluster.yaml with the following:
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    
    metadata:
      name: emr-on-eks-blog-cluster
      region: us-west-2
    
    availabilityZones: ["us-west-2b", "us-west-2c", "us-west-2d"]
    
    managedNodeGroups: # On-Demand node group for the Spark job
    - name: singleaz-ng-ondemand
      instanceType: m5.xlarge
      desiredCapacity: 1
      availabilityZones: ["us-west-2b"]
    

  5. Create an Amazon EKS cluster using the eks-create-cluster.yaml file:
    eksctl create cluster -f eks-create-cluster.yaml

    In this Amazon EKS cluster, we create a single managed node group with a general purpose m5.xlarge EC2 instance. Launching the Amazon EKS cluster, its managed node groups, and all dependencies typically takes 10–15 minutes.

  6. After you create the cluster, you can run the following to confirm all node groups were created:
    eksctl get nodegroups --cluster emr-on-eks-blog-cluster

    You can now use kubectl to interact with the created Amazon EKS cluster.

  7. After you create your Amazon EKS cluster, you must configure your kubeconfig file for your cluster using the AWS CLI:
    aws eks --region us-west-2 update-kubeconfig --name emr-on-eks-blog-cluster
    kubectl cluster-info
    

You can now use kubectl to connect to your Kubernetes cluster.

Prepare your Amazon EKS cluster for EMR on EKS

Now we prepare our Amazon EKS cluster to integrate it with EMR on EKS.

  1. Let’s create the namespace emr-on-eks-blog in our Amazon EKS cluster:
    kubectl create namespace emr-on-eks-blog

  2. We use the automation powered by eksctl to create role-based access control permissions and to add the EMR on EKS service-linked role into the aws-auth configmap:
    eksctl create iamidentitymapping --cluster emr-on-eks-blog-cluster --namespace emr-on-eks-blog --service-name "emr-containers"

  3. The Amazon EKS cluster already has an OpenID Connect provider URL. You enable IAM roles for service accounts by associating IAM with the Amazon EKS cluster OIDC:
    eksctl utils associate-iam-oidc-provider --cluster emr-on-eks-blog-cluster --approve

    Now let’s create the IAM role that Amazon EMR uses to run Spark jobs.

  4. Create the file blog-emr-trust-policy.json:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "elasticmapreduce.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

    Set up an IAM role:

    aws iam create-role --role-name blog-emrJobExecutionRole --assume-role-policy-document file://blog-emr-trust-policy.json

    This IAM role contains all permissions that the Spark job needs—for instance, we provide access to S3 buckets and Amazon CloudWatch to access necessary files (pod templates) and share logs.

    Next, we need to attach the required IAM policies to the role so it can write logs to Amazon S3 and CloudWatch.

  5. Create the file blog-emr-policy-document.json with the required IAM policies. Replace the bucket name with your S3 bucket ARN.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": ["arn:aws:s3:::<bucket-name>"]
        },
        {
          "Effect": "Allow",
          "Action": [
            "logs:PutLogEvents",
            "logs:CreateLogStream",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams"
          ],
          "Resource": [
            "arn:aws:logs:::"
          ]
        }
      ]
    }

    Attach it to the IAM role created in the previous step:

    aws iam put-role-policy --role-name blog-emrJobExecutionRole --policy-name blog-EMR-JobExecution-policy --policy-document file://blog-emr-policy-document.json

  6. Now we update the trust relationship between the IAM role we just created and the Amazon EMR service identity. The namespace provided here in the trust policy must match the namespace used when registering the virtual cluster in the next step:
    aws emr-containers update-role-trust-policy --cluster-name emr-on-eks-blog-cluster --namespace emr-on-eks-blog --role-name blog-emrJobExecutionRole --region us-west-2

Register the Amazon EKS cluster with Amazon EMR

Registering your Amazon EKS cluster is the final step to set up EMR on EKS to run workloads.

We create a virtual cluster and map it to the Kubernetes namespace created earlier:

aws emr-containers create-virtual-cluster \
    --region us-west-2 \
    --name emr-on-eks-blog-cluster \
    --container-provider '{
       "id": "emr-on-eks-blog-cluster",
       "type": "EKS",
       "info": {
          "eksInfo": {
              "namespace": "emr-on-eks-blog"
          }
       }
    }'

After you register, you should get confirmation that your EMR virtual cluster is created:

{
"arn": "arn:aws:emr-containers:us-west-2:142939128734:/virtualclusters/lwpylp3kqj061ud7fvh6sjuyk",
"id": "lwpylp3kqj061ud7fvh6sjuyk",
"name": "emr-on-eks-blog-cluster"
}

A virtual cluster is an Amazon EMR concept: it represents Amazon EMR registered to a Kubernetes namespace, where it can then run jobs. If you navigate to your Amazon EMR console, you can see the virtual cluster listed.
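
You can also confirm the virtual cluster from the command line. The following is a quick check with the AWS CLI (the --query filter is our own addition):

aws emr-containers list-virtual-clusters \
    --region us-west-2 \
    --query "virtualClusters[?name=='emr-on-eks-blog-cluster'].{id:id, state:state}"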

Set up Karpenter in Amazon EKS

To get started with Karpenter, ensure there is some compute capacity available, and install it using the Helm charts provided in the public repository. Karpenter also requires permissions to provision compute resources. For more information, refer to Getting Started.
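
The exact installation commands depend on the Karpenter version. The following is a minimal sketch of a Helm-based install for the v1alpha5-era releases used in this post; the chart values, version number, and instance profile name are assumptions, so follow the Getting Started guide for the authoritative steps:

export CLUSTER_NAME=emr-on-eks-blog-cluster
export KARPENTER_VERSION=v0.6.3   # assumed version from the v1alpha5 API era

helm repo add karpenter https://charts.karpenter.sh
helm repo update

# Values below follow the v0.6.x chart layout; the controller IAM role and
# node instance profile must be created beforehand per Getting Started.
helm upgrade --install karpenter karpenter/karpenter \
  --namespace karpenter --create-namespace \
  --version ${KARPENTER_VERSION} \
  --set clusterName=${CLUSTER_NAME} \
  --set clusterEndpoint=$(aws eks describe-cluster --name ${CLUSTER_NAME} --query "cluster.endpoint" --output text) \
  --set aws.defaultInstanceProfile=KarpenterNodeInstanceProfile-${CLUSTER_NAME} \
  --wait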

Karpenter’s single responsibility is to provision compute for your Kubernetes clusters, which is configured by a custom resource called a provisioner. Once installed in your cluster, Karpenter observes incoming Kubernetes pods that can’t be scheduled due to insufficient compute resources in the cluster, and automatically launches new resources to meet their scheduling and resource requirements.

For our use case, we provision two provisioners.

The first is a Karpenter provisioner for Spark driver pods to run on EC2 On-Demand Instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ondemand
spec:
  ttlSecondsUntilExpired: 2592000 

  ttlSecondsAfterEmpty: 30

  labels:
    karpenter.sh/capacity-type: on-demand

  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  provider: 
    subnetSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
    securityGroupSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster

The second is a Karpenter provisioner for Spark executor pods to run on EC2 Spot Instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ttlSecondsUntilExpired: 2592000 

  ttlSecondsAfterEmpty: 30

  labels:
    karpenter.sh/capacity-type: spot

  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  provider: 
    subnetSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
    securityGroupSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
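
Save each provisioner spec to its own file and apply both with kubectl (the file names here are our own choice):

kubectl apply -f karpenter-provisioner-ondemand.yaml
kubectl apply -f karpenter-provisioner-spot.yaml

# Confirm both provisioners are registered
kubectl get provisioners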

Note the requirements section of each provisioner config. There, we use the well-known labels for Amazon EKS and Karpenter to add constraints on how Karpenter launches nodes. We add constraints so that if a pod asks for the label karpenter.sh/capacity-type: spot, Karpenter uses this provisioner to launch an EC2 Spot Instance only in Availability Zone us-west-2b. Similarly, we apply the same constraint for the karpenter.sh/capacity-type: on-demand label. We can also be more granular and specify EC2 instance types in the provisioner, with different vCPU and memory ratios, giving you more flexibility and adding resiliency to your application. Karpenter launches nodes only when both the provisioner’s and the pod’s requirements are met. To learn more about the Karpenter provisioner API, refer to Provisioner API.

In the next step, we define pod requirements and align them with what we have defined in Karpenter’s provisioner.

Submit Spark job using Pod template

In Kubernetes, labels are key-value pairs that are attached to objects, such as pods. Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users. You can constrain a pod so that it can only run on a particular set of nodes. There are several ways to do this, and the recommended approaches all use label selectors to facilitate the selection.

Beginning with Amazon EMR versions 5.33.0 or 6.3.0, EMR on EKS supports Spark’s pod template feature. We use pod templates to add specific labels where Spark driver and executor pods should be launched.

Create a pod template file for the Spark driver pod and save it in your S3 bucket:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as Spark driver container

Create a pod template file for the Spark executor pod and save it in your S3 bucket:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  containers:
  - name: spark-kubernetes-executor # This will be interpreted as Spark executor container

Pod templates provide different fields to manage job scheduling. For additional details, refer to Pod template fields. Note the nodeSelector for the Spark driver pods and Spark executor pods, which match the labels we defined with the Karpenter provisioner.

For a sample Spark job, we use the following code, which creates multiple parallel threads and waits for a few seconds:

cat << EOF > threadsleep.py
import sys
from time import sleep
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("threadsleep").getOrCreate()

def sleep_for_x_seconds(x):
    sleep(x * 20)

sc = spark.sparkContext
sc.parallelize(range(1, 6), 5).foreach(sleep_for_x_seconds)
spark.stop()
EOF

Copy the sample Spark job into your S3 bucket:

aws s3 mb s3://<YourS3Bucket>
aws s3 cp threadsleep.py s3://<YourS3Bucket>
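
Also copy the two pod template files created earlier; the job submission below expects them at the root of the bucket (the local file names are assumptions about how you saved the templates):

aws s3 cp spark_driver_podtemplate.yaml s3://<YourS3Bucket>/
aws s3 cp spark_executor_podtemplate.yaml s3://<YourS3Bucket>/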

Before we submit the Spark job, let’s get the required values of the EMR virtual cluster and Amazon EMR job execution role ARN:

export S3blogbucket=s3://<YourS3Bucket>
export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?state=='RUNNING'].id" --region us-west-2 --output text)

export EMR_ROLE_ARN=$(aws iam get-role --role-name blog-emrJobExecutionRole --query Role.Arn --region us-west-2 --output text)

To enable the pod template feature with EMR on EKS, you can use configuration-overrides to specify the Amazon S3 path to the pod template:

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-threadsleep-single-az \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-5.33.0-latest \
--region us-west-2 \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "'${S3blogbucket}'/threadsleep.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=6 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=2"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"1G",
"spark.kubernetes.driver.podTemplateFile":"'${S3blogbucket}'/spark_driver_podtemplate.yaml", "spark.kubernetes.executor.podTemplateFile":"'${S3blogbucket}'/spark_executor_podtemplate.yaml"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-on-eks/emreksblog", 
        "logStreamNamePrefix": "threadsleep"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3blogbucket"'/logs/"
      }
    }
}'
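
After the job is submitted, you can track its state from the CLI. The following is a quick check using the list-job-runs API (the --query filter is our own addition):

aws emr-containers list-job-runs \
    --virtual-cluster-id $VIRTUAL_CLUSTER_ID \
    --region us-west-2 \
    --query "jobRuns[].{id:id, name:name, state:state}" \
    --output table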

In the Spark job, we request two cores for the Spark driver and one core for each Spark executor pod. Because we only had a single EC2 instance in our managed node group, Karpenter looks at the unschedulable Spark driver pods and uses the on-demand provisioner to launch EC2 On-Demand Instances for them in us-west-2b. Similarly, when the Spark executor pods are in a pending state because there are no Spot Instances, Karpenter launches Spot Instances in us-west-2b.

This way, Karpenter optimizes your costs by starting from zero Spot and On-Demand Instances and creating them dynamically only when required. Additionally, Karpenter batches pending pods and then bin-packs them based on the CPU, memory, and GPUs required, taking into account node overhead, the VPC CNI resources required, and daemon sets that will be packed when bringing up a new node. This makes sure you’re using your resources efficiently with minimal waste.
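
To observe this behavior while the job runs, you can list the nodes with the capacity-type and zone labels used by the provisioners, and tail the Karpenter controller logs (the pod label selector assumes the standard Helm chart labels):

# Show nodes with the labels referenced by the provisioners above
kubectl get nodes -L karpenter.sh/capacity-type -L topology.kubernetes.io/zone

# Tail Karpenter controller logs to see provisioning decisions
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -c controller --tail=50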

Clean up

Don’t forget to clean up the resources you created to avoid any unnecessary charges.

  1. Delete all the virtual clusters that you created:
    # List all the virtual cluster ids
    aws emr-containers list-virtual-clusters
    # Delete a virtual cluster by passing its virtual cluster id
    aws emr-containers delete-virtual-cluster --id <virtual-cluster-id>
    

  2. Delete the Amazon EKS cluster:
    eksctl delete cluster emr-on-eks-blog-cluster

  3. Delete the blog-emrJobExecutionRole IAM role and its policies.

Conclusion

In this post, we saw how to create an Amazon EKS cluster, configure Amazon EKS managed node groups, create an EMR virtual cluster on Amazon EKS, and submit Spark jobs. Using pod templates, we saw how to ensure Spark workloads are scheduled in the same Availability Zone and utilize Spot Instances with Karpenter auto scaling to reduce costs and optimize your Spark workloads.

To get started, try out the EMR on EKS workshop. For more resources, refer to the following:


About the author

Jamal Arif is a Solutions Architect at AWS and a containers specialist. He helps AWS customers in their modernization journey to build innovative, resilient, and cost-effective solutions. In his spare time, Jamal enjoys spending time outdoors with his family hiking and mountain biking.

AWS Week in Review – August 1, 2022

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-august-1-2022/

AWS re:Inforce returned to Boston last week, kicking off with a keynote from Amazon Chief Security Officer Steve Schmidt and AWS Chief Information Security Officer C.J. Moses:

Be sure to take some time to watch this video and the other leadership sessions, and to use what you learn to take some proactive steps to improve your security posture.

Last Week’s Launches
Here are some launches that caught my eye last week:

AWS Wickr uses 256-bit end-to-end encryption to deliver secure messaging, voice, and video calling, including file sharing and screen sharing, across desktop and mobile devices. Each call, message, and file is encrypted with a new random key and can be decrypted only by the intended recipient. AWS Wickr supports logging to a secure, customer-controlled data store for compliance and auditing, and offers full administrative control over data: permissions, ephemeral messaging options, and security groups. You can now sign up for the preview.

AWS Marketplace Vendor Insights helps AWS Marketplace sellers to make security and compliance data available through AWS Marketplace in the form of a unified, web-based dashboard. Designed to support governance, risk, and compliance teams, the dashboard also provides evidence that is backed by AWS Config and AWS Audit Manager assessments, external audit reports, and self-assessments from software vendors. To learn more, read the What’s New post.

GuardDuty Malware Protection protects Amazon Elastic Block Store (EBS) volumes from malware. As Danilo describes in his blog post, a malware scan is initiated when Amazon GuardDuty detects that a workload running on an EC2 instance or in a container appears to be doing something suspicious. The new malware protection feature creates snapshots of the attached EBS volumes, restores them within a service account, and performs an in-depth scan for malware. The scanner supports many types of file systems and file formats and generates actionable security findings when malware is detected.

Amazon Neptune Global Database lets you build graph applications that run across multiple AWS Regions using a single graph database. You can deploy a primary Neptune cluster in one Region and replicate its data to up to five secondary read-only database clusters, with up to 16 read replicas each. Clusters can recover in minutes in the event of an (unlikely) regional outage, with a Recovery Point Objective (RPO) of 1 second and a Recovery Time Objective (RTO) of 1 minute. To learn a lot more and see this new feature in action, read Introducing Amazon Neptune Global Database.

Amazon Detective now Supports Kubernetes Workloads, with the ability to scale to thousands of container deployments and millions of configuration changes per second. It ingests EKS audit logs to capture API activity from users, applications, and the EKS control plane, and correlates user activity with information gleaned from Amazon VPC flow logs. As Channy notes in his blog post, you can enable Amazon Detective and take advantage of a free 30-day trial of the EKS capabilities.

AWS SSO is Now AWS IAM Identity Center in order to better represent the full set of workforce and account management capabilities that are part of IAM. You can create user identities directly in IAM Identity Center, or you can connect your existing Active Directory or standards-based identity provider. To learn more, read this post from the AWS Security Blog.

AWS Config Conformance Packs now provide you with percentage-based scores that will help you track resource compliance within the scope of the resources addressed by the pack. Scores are computed based on the product of the number of resources and the number of rules, and are reported to Amazon CloudWatch so that you can track compliance trends over time. To learn more about how scores are computed, read the What’s New post.

Amazon Macie now lets you perform one-click temporary retrieval of sensitive data that Macie has discovered in an S3 bucket. You can retrieve up to ten examples at a time, and use these findings to accelerate your security investigations. All of the data that is retrieved and displayed in the Macie console is encrypted using customer-managed AWS Key Management Service (AWS KMS) keys. To learn more, read the What’s New post.

AWS Control Tower was updated multiple times last week. CloudTrail Organization Logging creates an org-wide trail in your management account to automatically log the actions of all member accounts in your organization. Control Tower now reduces redundant AWS Config items by limiting recording of global resources to home regions. To take advantage of this change you need to update to the latest landing zone version and then re-register each Organizational Unit, as detailed in the What’s New post. Lastly, Control Tower’s region deny guardrail now includes AWS API endpoints for AWS Chatbot, Amazon S3 Storage Lens, and Amazon S3 Multi Region Access Points. This allows you to limit access to AWS services and operations for accounts enrolled in your AWS Control Tower environment.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items and customer stories that you may find interesting:

AWS Open Source News and Updates – My colleague Ricardo Sueiras writes a weekly open source newsletter and highlights new open source projects, tools, and demos from the AWS community. Read installment #122 here.

Growy Case Study – This Netherlands-based company is building fully-automated robot-based vertical farms that grow plants to order. Read the case study to learn how they use AWS IoT and other services to monitor and control light, temperature, CO2, and humidity to maximize yield and quality.

Journey of a Snap on Snapchat – This video shows you how a snap flows end-to-end from your camera to AWS, to your friends. With over 300 million daily active users, Snap takes advantage of Amazon Elastic Kubernetes Service (EKS), Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon CloudFront, and many other AWS services, storing over 400 terabytes of data in DynamoDB and managing over 900 EKS clusters.

Cutting Cardboard Waste – Bin packing is almost certainly a part of every computer science curriculum! In the linked article from the Amazon Science site, you can learn how an Amazon Principal Research Scientist developed PackOpt to figure out the optimal set of boxes to use for shipments from Amazon’s global network of fulfillment centers. This is an NP-hard problem and the article describes how they built a parallelized solution that explores a multitude of alternative solutions, all running on AWS.

Upcoming Events
Check your calendar and sign up for these online and in-person AWS events:

AWS Global Summits – AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Registrations are open for the following AWS Summits in August:

IMAGINE 2022 – The IMAGINE 2022 conference will take place on August 3 at the Seattle Convention Center, Washington, USA. It’s a no-cost event that brings together education, state, and local leaders to learn about the latest innovations and best practices in the cloud. You can register here.

That’s all for this week. Check back next Monday for another Week in Review!

Jeff;

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

How Epos Now modernized their data platform by building an end-to-end data lake with the AWS Data Lab

Post Syndicated from Debadatta Mohapatra original https://aws.amazon.com/blogs/big-data/how-epos-now-modernized-their-data-platform-by-building-an-end-to-end-data-lake-with-the-aws-data-lab/

Epos Now provides point of sale and payment solutions to over 40,000 hospitality and retail businesses across 71 countries. Their mission is to help businesses of all sizes reach their full potential through the power of cloud technology, with solutions that are affordable, efficient, and accessible. Their solutions allow businesses to leverage actionable insights, manage their business from anywhere, and reach customers both in-store and online.

Epos Now currently provides real-time and near-real-time reports and dashboards to their merchants on top of their operational database (Microsoft SQL Server). With a growing customer base and new data needs, the team started to see some issues in the current platform.

First, they observed performance degradation in serving the reporting requirements from the same OLTP database with the current data model. Metrics that needed to be delivered in real time (seconds after a transaction completed) and metrics that needed to be reflected in the dashboard in near-real time (minutes) took several attempts to load.

This started to cause operational issues for their merchants. The end consumers of reports couldn’t access the dashboard in a timely manner.

Cost and scalability also became a major problem because one single database instance was trying to serve many different use cases.

Epos Now needed a strategic solution to address these issues. Additionally, they didn’t have a dedicated data platform for doing machine learning and advanced analytics use cases, so they decided on two parallel strategies to resolve their data problems and better serve merchants:

  • The first was to rearchitect the near-real-time reporting feature by moving it to a dedicated Amazon Aurora PostgreSQL-Compatible Edition database, with a specific reporting data model to serve end consumers. This would improve performance and uptime while reducing cost.
  • The second was to build out a new data platform for reporting, dashboards, and advanced analytics. This will enable use cases for internal data analysts and data scientists to experiment and create multiple data products, ultimately exposing these insights to end customers.

In this post, we discuss how Epos Now designed the overall solution with support from the AWS Data Lab. Having developed a strong strategic relationship with AWS over the last 3 years, Epos Now opted to take advantage of the AWS Data Lab program to speed up the process of building a reliable, performant, and cost-effective data platform. The AWS Data Lab program offers accelerated, joint-engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives.

Working with an AWS Data Lab Architect, Epos Now commenced weekly cadence calls to come up with a high-level architecture. After the objective, success criteria, and stretch goals were clearly defined, the final step was to draft a detailed task list for the upcoming 3-day build phase.

Overview of solution

As part of the 3-day build exercise, Epos Now built the following solution with the ongoing support of their AWS Data Lab Architect.

Epos Now Arch Image

The platform consists of an end-to-end data pipeline with three main components:

  • Data lake – As a central source of truth
  • Data warehouse – For analytics and reporting needs
  • Fast access layer – To serve near-real-time reports to merchants

We chose three different storage solutions:

  • Amazon Simple Storage Service (Amazon S3) for raw data landing and a curated data layer to build the foundation of the data lake
  • Amazon Redshift to create a federated data warehouse with conformed dimensions and star schemas for consumption by Microsoft Power BI, running on AWS
  • Aurora PostgreSQL to store all the data for near-real-time reporting as a fast access layer

In the following sections, we go into each component and supporting services in more detail.

Data lake

The first component of the data pipeline involved ingesting the data from an Amazon Managed Streaming for Apache Kafka (Amazon MSK) topic using Amazon MSK Connect to land the data into an S3 bucket (landing zone). The Epos Now team used the Confluent Amazon S3 sink connector to sink the data to Amazon S3. To make the sink process more resilient, Epos Now added the required configuration for dead-letter queues to redirect the bad messages to another topic. The following code is a sample configuration for a dead-letter queue in Amazon MSK Connect:
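
A representative sketch, using the standard Kafka Connect dead-letter-queue properties (the topic name is illustrative):

errors.tolerance=all
errors.deadletterqueue.topic.name=s3-sink-dlq
errors.deadletterqueue.topic.replication.factor=3
errors.deadletterqueue.context.headers.enable=true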

Because Epos Now was ingesting from multiple data sources, they used Airbyte to transfer the data to a landing zone in batches. A subsequent AWS Glue job reads the data from the landing bucket, performs data transformation, and moves the data to a curated zone of Amazon S3 in an optimal format and layout. This curated layer then became the source of truth for all other use cases. Epos Now then used an AWS Glue crawler to update the AWS Glue Data Catalog, augmented by Amazon Athena for data analysis. To optimize for cost, Epos Now defined an optimal data retention policy on different layers of the data lake to save money as well as keep the dataset relevant.

Data warehouse

After the data lake foundation was established, Epos Now used a subsequent AWS Glue job to load the data from the S3 curated layer to Amazon Redshift. We used Amazon Redshift to make the data queryable in both Amazon Redshift (internal tables) and Amazon Redshift Spectrum. The team then used dbt as an extract, load, and transform (ELT) engine to create the target data model and store it in target tables and views for internal business intelligence reporting. The Epos Now team wanted to use their SQL knowledge to do all ELT operations in Amazon Redshift, so they chose dbt to perform all the joins, aggregations, and other transformations after the data was loaded into the staging tables in Amazon Redshift. Epos Now is currently using Power BI for reporting, which was migrated to the AWS Cloud and connected to Amazon Redshift clusters running inside Epos Now’s VPC.

Fast access layer

To build the fast access layer to deliver the metrics to Epos Now’s retail and hospitality merchants in near-real time, we decided to create a separate pipeline. This required developing a microservice running a Kafka consumer job to subscribe to the same Kafka topic in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The microservice received the messages, conducted the transformations, and wrote the data to a target data model hosted on Aurora PostgreSQL. This data was delivered to the UI layer through an API also hosted on Amazon EKS, exposed through Amazon API Gateway.

Outcome

The Epos Now team is currently building both the fast access layer and a centralized lakehouse architecture-based data platform on Amazon S3 and Amazon Redshift for advanced analytics use cases. The new data platform is best positioned to address scalability issues and support new use cases. The Epos Now team has also started offloading some of the real-time reporting requirements to the new target data model hosted in Aurora. The team has a clear strategy around the choice of different storage solutions for the right access patterns: Amazon S3 stores all the raw data, and Aurora hosts all the metrics to serve real-time and near-real-time reporting requirements. The Epos Now team will also enhance the overall solution by applying data retention policies in different layers of the data platform. This will address the platform cost without losing any historical datasets. The data model and structure (data partitioning, columnar file format) we designed greatly improved query performance and overall platform stability.

Conclusion

Epos Now revolutionized their data analytics capabilities, taking advantage of the breadth and depth of the AWS Cloud. They’re now able to serve insights to internal business users, and scale their data platform in a reliable, performant, and cost-effective manner.

The AWS Data Lab engagement enabled Epos Now to move from idea to proof of concept in 3 days using several previously unfamiliar AWS analytics services, including AWS Glue, Amazon MSK, Amazon Redshift, and Amazon API Gateway.

Epos Now is currently in the process of implementing the full data lake architecture, with a rollout to customers planned for late 2022. Once live, they will deliver on their strategic goal to provide real-time transactional data and put insights directly in the hands of their merchants.


About the Authors

Jason Downing is VP of Data and Insights at Epos Now. He is responsible for the Epos Now data platform and product direction. He specializes in product management across a range of industries, including POS systems, mobile money, payments, and eWallets.

Debadatta Mohapatra is an AWS Data Lab Architect. He has extensive experience across big data, data science, and IoT, across consulting and industrials. He is an advocate of cloud-native data platforms and the value they can drive for customers across industries.

Amazon Detective Supports Kubernetes Workloads on Amazon EKS for Security Investigations

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-detective-supports-kubernetes-workloads-on-amazon-eks-for-security-investigations/

In March 2020, we introduced Amazon Detective, a fully managed service that makes it easy to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities.

Amazon Detective continuously extracts temporal events such as login attempts, API calls, and network traffic from Amazon GuardDuty, AWS CloudTrail, and Amazon Virtual Private Cloud (Amazon VPC) Flow Logs into a graph model that summarizes the resource behaviors and interactions observed across your entire AWS environment. We have added new features such as AWS IAM role session analysis, enhanced IP address analytics, Splunk integration, Amazon S3 and DNS finding types, and support for AWS Organizations.

Customers are rapidly moving to containers to deploy Kubernetes workloads with Amazon Elastic Kubernetes Service (Amazon EKS). Its highly programmatic nature allows thousands of individual container deployments and millions of configuration changes to occur in seconds. To effectively secure EKS workloads, it is important to monitor container deployments and configurations that are captured in the form of EKS audit logs and to correlate activities to user activity and network traffic happening across AWS accounts.

Today we announce new capabilities in Amazon Detective to expand security investigation coverage for Kubernetes workloads running on Amazon EKS. When you enable this new feature, Amazon Detective automatically starts ingesting EKS audit logs to capture chronological API activity from users, applications, and the control plane in Amazon EKS for clusters, pods, container images, and Kubernetes subjects (Kubernetes users and service accounts).

Detective automatically correlates user activity using CloudTrail, and network activity using Amazon VPC Flow logs, without the need for you to enable, store, or retain logs manually. The service gleans key security information from these logs and retains them in a security behavioral graph database that enables fast cross-referenced access to twelve months of activity. Detective provides a data analysis and visualization layer purpose-built to answer common security questions backed by a behavioral graph database that allows you to quickly investigate potential malicious behavior associated with your EKS workloads.

You can rapidly respond to security issues rather than focusing on log management, operational systems, or ongoing security tooling maintenance. Detective’s EKS capabilities come with a free 30-day trial for all customers that allows you to ensure that the capabilities meet your needs and to fully understand the cost for the service on an ongoing basis.

Getting Started with Security Investigations for EKS Audit Logs
To get started, enable Amazon Detective with just a few clicks in the AWS Management Console. GuardDuty is a prerequisite for Amazon Detective: when you try to enable Detective, it checks whether GuardDuty has been enabled for your account. If GuardDuty was not already enabled, you must enable it and then wait 48 hours before enabling Detective. This allows GuardDuty to assess the data volume that your account produces.

You can enable your account by attaching the AWS IAM policy or delegate it to an administrator of your organization. To learn more, refer to Setting up Detective in the AWS documentation.

To enable EKS support in Detective as an existing customer, navigate to the Settings menu in the left panel and select General. Under Optional source packages, enable EKS audit logs.

If you are a new customer of Detective, the EKS protection feature will be enabled by default. If you do not want to trial EKS audit logs right away, you can disable this feature within the first week of enabling Detective and preserve the full 30-day free trial period to use in the future.

Once enabled, Detective will begin monitoring the Kubernetes audit logs that are generated by Amazon EKS, extracting and correlating information for security usage. You do not need to enable any log sources or make any configuration changes to your existing EKS clusters or future deployments.

You can see recent monitoring results of your EKS clusters on the Summary page.

When you choose one of the EKS clusters, you will see the details of containers running in the cluster, Kubernetes API activities, and network activities that occurred on this resource around the scope time.

In the Overview tab, you also see details about all containers running in the cluster, including their pod, image and security context.

In the Kubernetes API activity tab, you can get an overview of the full API activities involving the EKS cluster. You can choose a time range to drill down based on specific API methods within the EKS cluster. When you select a specific time, you can see API subjects, IP addresses, and the number of API calls by the success, failure, unauthorized, or forbidden state.

You can also see details of Kubernetes API calls observed inside this cluster for the first time, as well as subjects with increased call volume inside the cluster.

Enabling GuardDuty EKS Protection
In January 2022, Amazon GuardDuty expanded coverage to EKS cluster activity to identify malicious or suspicious behavior that represents potential threats to container workloads.

When the optional GuardDuty EKS Protection is enabled, GuardDuty will continuously monitor your EKS deployments and alert you to threats detected in your workloads. You can view and investigate these security findings in Detective.

With Detective for EKS enabled, you can quickly access information about the resources involved in the finding, such as their CloudTrail and Kubernetes API activity, and netflow information. This can aid in investigation and help you determine root cause, impact, and other related resources that may also be compromised.

To learn more, see How to use new Amazon GuardDuty EKS Protection findings in the AWS Security Blog.

Now Available
You can now use Amazon Detective for EKS protection in all Regions where Amazon Detective is available. This feature is priced based on the volume of audit logs processed and analyzed by Detective.

Detective provides a free 30-day trial to all customers that enable EKS coverage, allowing customers to ensure that Detective’s capabilities meet security needs and to get an estimate of the service’s monthly cost before committing to paid usage. To learn more, see the Detective pricing page.

For technical documentation, visit the Amazon Detective User Guide. Please send feedback to AWS re:Post for Amazon Detective or through your usual AWS support contacts.

Learn all the details about Amazon Detective for EKS protection and get started today.

Channy

Stream Amazon EMR on EKS logs to third-party providers like Splunk, Amazon OpenSearch Service, or other log aggregators

Post Syndicated from Matthew Tan original https://aws.amazon.com/blogs/big-data/stream-amazon-emr-on-eks-logs-to-third-party-providers-like-splunk-amazon-opensearch-service-or-other-log-aggregators/

Spark jobs running on Amazon EMR on EKS generate logs that are very useful for identifying issues with Spark processes and for viewing Spark outputs. You can access these logs from a variety of sources. On the Amazon EMR virtual cluster console, you can access logs from the Spark History UI. You also have flexibility to push logs into an Amazon Simple Storage Service (Amazon S3) bucket or Amazon CloudWatch Logs. In each method, these logs are linked to the specific job in question. The common practice of log management in DevOps culture is to centralize logging by forwarding logs to an enterprise log aggregation system like Splunk or Amazon OpenSearch Service (successor to Amazon Elasticsearch Service). This enables you to see all the applicable log data in one place: you can identify key trends, anomalies, and correlated events, troubleshoot problems faster, and notify the appropriate people in a timely fashion.

EMR on EKS Spark logs are generated by Spark and can be accessed via the Kubernetes API and the kubectl CLI. Therefore, although it’s possible to install log forwarding agents in the Amazon Elastic Kubernetes Service (Amazon EKS) cluster to forward all Kubernetes logs, including Spark logs, this can become quite expensive at scale because you collect Kubernetes information that may not be important to Spark users. In addition, from a security point of view, the EKS cluster logs and access to kubectl may not be available to the Spark user.

To solve this problem, this post proposes using pod templates to create a sidecar container alongside the Spark job pods. The sidecar containers are able to access the logs contained in the Spark pods and forward these logs to the log aggregator. This approach allows the logs to be managed separately from the EKS cluster and uses a small amount of resources because the sidecar container is only launched during the lifetime of the Spark job.

Implementing Fluent Bit as a sidecar container

Fluent Bit is a lightweight, highly scalable, and high-speed logging and metrics processor and log forwarder. It collects event data from any source, enriches that data, and sends it to any destination. Its lightweight and efficient design coupled with its many features makes it very attractive to those working in the cloud and in containerized environments. It has been deployed extensively and trusted by many, even in large and complex environments. Fluent Bit has zero dependencies and requires only 650 KB in memory to operate, as compared to FluentD, which needs about 40 MB in memory. Therefore, it’s an ideal option as a log forwarder to forward logs generated from Spark jobs.

When you submit a job to EMR on EKS, there are at least two Spark containers: the Spark driver and the Spark executor. The number of Spark executor pods depends on your job submission configuration. If you set spark.executor.instances to more than one, you get the corresponding number of Spark executor pods. What we want to do here is run Fluent Bit as a sidecar container alongside the Spark driver and executor pods. Diagrammatically, it looks like the following figure. The Fluent Bit sidecar container reads the indicated logs in the Spark driver and executor pods, and forwards these logs to the target log aggregator directly.

Architecture of Fluent Bit sidecar

Pod templates in EMR on EKS

A Kubernetes pod is a group of one or more containers with shared storage, network resources, and a specification for how to run the containers. Pod templates are specifications for creating pods. They’re part of the desired state of the workload resources used to run the application. Pod template files can define the driver or executor pod configurations that aren’t supported in standard Spark configuration. That being said, Spark is opinionated about certain pod configurations, and some values in the pod template are always overwritten by Spark. Using a pod template only allows Spark to start with a template pod instead of an empty pod during the pod building process. Pod templates are enabled in EMR on EKS when you configure the Spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile. Spark downloads these pod templates to construct the driver and executor pods.
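
For example, a start-job-run request enables pod templates through those two properties in its spark-defaults classification (the bucket and prefix are placeholders):

"applicationConfiguration": [
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.kubernetes.driver.podTemplateFile": "s3://<bucket>/<prefix>/emr_driver_template.yaml",
      "spark.kubernetes.executor.podTemplateFile": "s3://<bucket>/<prefix>/emr_executor_template.yaml"
    }
  }
]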

Forward logs generated by Spark jobs in EMR on EKS

A log aggregation system like Amazon OpenSearch Service or Splunk must be available to accept the logs forwarded by the Fluent Bit sidecar containers. If you don’t have one, this post provides scripts to help you launch a log aggregation system, either Amazon OpenSearch Service or Splunk installed on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

We use several services to create and configure EMR on EKS. We use an AWS Cloud9 workspace to run all the scripts and to configure the EKS cluster. To prepare to run a job script that requires certain Python libraries absent from the generic EMR images, we use Amazon Elastic Container Registry (Amazon ECR) to store the customized EMR container image.

Create an AWS Cloud9 workspace

The first step is to launch and configure the AWS Cloud9 workspace by following the instructions in Create a Workspace in the EKS Workshop. After you create the workspace, create the AWS Identity and Access Management (IAM) resources: create an IAM role for the workspace, attach the role to the workspace, and update the workspace IAM settings.

Prepare the AWS Cloud9 workspace

Clone the following GitHub repository and run the following script to prepare the AWS Cloud9 workspace to install and configure Amazon EKS and EMR on EKS. The shell script prepare_cloud9.sh installs all the necessary components for the AWS Cloud9 workspace to build and manage the EKS cluster, including the kubectl command line tool, the eksctl CLI tool, and jq, and it updates the AWS Command Line Interface (AWS CLI).

$ sudo yum -y install git
$ cd ~ 
$ git clone https://github.com/aws-samples/aws-emr-eks-log-forwarding.git
$ cd aws-emr-eks-log-forwarding
$ cd emreks
$ bash prepare_cloud9.sh

All the necessary scripts and configuration to run this solution are found in the cloned GitHub repository.

Create a key pair

As part of this particular deployment, you need an EC2 key pair to create an EKS cluster. If you already have an existing EC2 key pair, you may use that key pair. Otherwise, you can create a key pair.
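
If you need to create one, the AWS CLI can generate it (the key pair name here is our own choice):

aws ec2 create-key-pair \
    --key-name emreksdemo-keypair \
    --query 'KeyMaterial' \
    --output text > emreksdemo-keypair.pem

chmod 400 emreksdemo-keypair.pem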

Install Amazon EKS and EMR on EKS

After you configure the AWS Cloud9 workspace, in the same folder (emreks), run the following deployment script:

$ bash deploy_eks_cluster_bash.sh 
Deployment Script -- EMR on EKS
-----------------------------------------------

Please provide the following information before deployment:
1. Region (If your Cloud9 desktop is in the same region as your deployment, you can leave this blank)
2. Account ID (If your Cloud9 desktop is running in the same Account ID as where your deployment will be, you can leave this blank)
3. Name of the S3 bucket to be created for the EMR S3 storage location
Region: [xx-xxxx-x]: < Press enter for default or enter region > 
Account ID [xxxxxxxxxxxx]: < Press enter for default or enter account # > 
EC2 Public Key name: < Provide your key pair name here >
Default S3 bucket name for EMR on EKS (do not add s3://): < bucket name >
Bucket created: XXXXXXXXXXX ...
Deploying CloudFormation stack with the following parameters...
Region: xx-xxxx-x | Account ID: xxxxxxxxxxxx | S3 Bucket: XXXXXXXXXXX

...

EKS Cluster and Virtual EMR Cluster have been installed.

The last line indicates that installation was successful.

Log aggregation options

There are several log aggregation and management tools on the market. This post suggests two of the more popular ones in the industry: Splunk and Amazon OpenSearch Service.

Option 1: Install Splunk Enterprise

To manually install Splunk on an EC2 instance, complete the following steps:

  1. Launch an EC2 instance.
  2. Install Splunk.
  3. Configure the EC2 instance security group to permit access to ports 22, 8000, and 8088.

This post, however, provides an automated way to install Splunk on an EC2 instance:

  1. Download the RPM install file and upload it to an accessible Amazon S3 location.
  2. Upload the following YAML script into AWS CloudFormation.
  3. Provide the necessary parameters, as shown in the screenshots below.
  4. Choose Next and complete the steps to create your stack.

Splunk CloudFormation screen - 1

Splunk CloudFormation screen - 2

Splunk CloudFormation screen - 3

Alternatively, run an AWS CLI script like the following:

aws cloudformation create-stack \
--stack-name "splunk" \
--template-body file://splunk_cf.yaml \
--parameters ParameterKey=KeyName,ParameterValue="< Name of EC2 Key Pair >" \
  ParameterKey=InstanceType,ParameterValue="t3.medium" \
  ParameterKey=LatestAmiId,ParameterValue="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2" \
  ParameterKey=VPCID,ParameterValue="vpc-XXXXXXXXXXX" \
  ParameterKey=PublicSubnet0,ParameterValue="subnet-XXXXXXXXX" \
  ParameterKey=SSHLocation,ParameterValue="< CIDR Range for SSH access >" \
  ParameterKey=VpcCidrRange,ParameterValue="172.20.0.0/16" \
  ParameterKey=RootVolumeSize,ParameterValue="100" \
  ParameterKey=S3BucketName,ParameterValue="< S3 Bucket Name >" \
  ParameterKey=S3Prefix,ParameterValue="splunk/splunk-8.2.5-77015bc7a462-linux-2.6-x86_64.rpm" \
  ParameterKey=S3DownloadLocation,ParameterValue="/tmp" \
--region < region > \
--capabilities CAPABILITY_IAM

  5. After you build the stack, navigate to the stack’s Outputs tab on the AWS CloudFormation console and note the internal and external DNS for the Splunk instance.

You use these later to configure the Splunk instance and log forwarding.

Splunk CloudFormation output screen

  6. To configure Splunk, go to the Resources tab for the CloudFormation stack and locate the physical ID of EC2Instance.
  7. Choose that link to go to the specific EC2 instance.
  8. Select the instance and choose Connect.

Connect to Splunk Instance

  9. On the Session Manager tab, choose Connect.

Connect to Instance

You’re redirected to the instance’s shell.

  10. Install and configure Splunk as follows:
$ sudo /opt/splunk/bin/splunk start --accept-license
…
Please enter an administrator username: admin
Password must contain at least:
   * 8 total printable ASCII character(s).
Please enter a new password: 
Please confirm new password:
…
Done
                                                           [  OK  ]

Waiting for web server at http://127.0.0.1:8000 to be available......... Done
The Splunk web interface is at http://ip-xx-xxx-xxx-x.us-east-2.compute.internal:8000

  11. Enter the Splunk site using the SplunkPublicDns value from the stack outputs (for example, http://ec2-xx-xxx-xxx-x.us-east-2.compute.amazonaws.com:8000). Note the port number of 8000.
  12. Log in with the user name and password you provided.

Splunk Login

Configure HTTP Event Collector

To configure Splunk to be able to receive logs from Fluent Bit, configure the HTTP Event Collector data input:

  1. Go to Settings and choose Data input.
  2. Choose HTTP Event Collector.
  3. Choose Global Settings.
  4. Select Enabled, keep port number 8088, then choose Save.
  5. Choose New Token.
  6. For Name, enter a name (for example, emreksdemo).
  7. Choose Next.
  8. For Available item(s) for Indexes, add at least the main index.
  9. Choose Review and then Submit.
  10. In the list of HTTP Event Collect tokens, copy the token value for emreksdemo.

You use it when configuring the Fluent Bit output.

splunk-http-collector-list
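
Optionally, you can verify the token before configuring Fluent Bit by sending a test event to Splunk’s standard HEC endpoint (host and token are placeholders; -k skips certificate verification for the self-signed instance):

curl -k https://<SplunkPublicDns>:8088/services/collector \
  -H "Authorization: Splunk <your-HEC-token>" \
  -d '{"event": "emreksdemo HEC connectivity test"}'

# A successful response typically looks like: {"text":"Success","code":0}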

Option 2: Set up Amazon OpenSearch Service

Your other log aggregation option is to use Amazon OpenSearch Service.

Provision an OpenSearch Service domain

Provisioning an OpenSearch Service domain is very straightforward. In this post, we provide a simple script and configuration to provision a basic domain. To do it yourself, refer to Creating and managing Amazon OpenSearch Service domains.

Before you start, get the ARN of the IAM role that you use to run the Spark jobs. If you created the EKS cluster with the provided script, go to the CloudFormation stack emr-eks-iam-stack. On the Outputs tab, locate the IAMRoleArn output and copy this ARN. We also modify the IAM role later on, after we create the OpenSearch Service domain.

iam_role_emr_eks_job

If you’re using the provided opensearch.sh installer, before you run it, modify the file.

From the root folder of the GitHub repository, cd to opensearch and modify opensearch.sh (you can also use your preferred editor):

[../aws-emr-eks-log-forwarding] $ cd opensearch
[../aws-emr-eks-log-forwarding/opensearch] $ vi opensearch.sh

Configure opensearch.sh to fit your environment, for example:

# name of our Amazon OpenSearch cluster
export ES_DOMAIN_NAME="emreksdemo"

# Elasticsearch version
export ES_VERSION="OpenSearch_1.0"

# Instance Type
export INSTANCE_TYPE="t3.small.search"

# OpenSearch Dashboards admin user
export ES_DOMAIN_USER="emreks"

# OpenSearch Dashboards admin password
export ES_DOMAIN_PASSWORD='< ADD YOUR PASSWORD >'

# Region
export REGION='us-east-1'

Run the script:

[../aws-emr-eks-log-forwarding/opensearch] $ bash opensearch.sh

Configure your OpenSearch Service domain

After you set up your OpenSearch service domain and it’s active, make the following configuration changes to allow logs to be ingested into Amazon OpenSearch Service:

  1. On the Amazon OpenSearch Service console, on the Domains page, choose your domain.

Opensearch Domain Console

  2. On the Security configuration tab, choose Edit.

Opensearch Security Configuration

  3. For Access Policy, select Only use fine-grained access control.
  4. Choose Save changes.

The access policy should look like the following code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:xx-xxxx-x:xxxxxxxxxxxx:domain/emreksdemo/*"
    }
  ]
}

  5. When the domain is active again, copy the domain ARN.

We use it to configure the Amazon EMR job IAM role we mentioned earlier.

  6. Choose the link for OpenSearch Dashboards URL to enter Amazon OpenSearch Service Dashboards.

Opensearch Main Console

  7. In Amazon OpenSearch Service Dashboards, use the user name and password that you configured earlier in the opensearch.sh file.
  8. Choose the options icon and choose Security under OpenSearch Plugins.

opensearch menu

  9. Choose Roles.
  10. Choose Create role.

opensearch-create-role-button

  11. Enter the new role’s name, cluster permissions, and index permissions. For this post, name the role fluentbit_role and give cluster permissions to the following:
    1. indices:admin/create
    2. indices:admin/template/get
    3. indices:admin/template/put
    4. cluster:admin/ingest/pipeline/get
    5. cluster:admin/ingest/pipeline/put
    6. indices:data/write/bulk
    7. indices:data/write/bulk*
    8. create_index

opensearch-create-role-button

  12. In the Index permissions section, give write permission to the index fluent-*.
  13. On the Mapped users tab, choose Manage mapping.
  14. For Backend roles, enter the Amazon EMR job execution IAM role ARN to be mapped to the fluentbit_role role.
  15. Choose Map.

opensearch-map-backend

  16. To complete the security configuration, go to the IAM console and add the following inline policy to the EMR on EKS IAM role entered in the backend role. Replace the resource ARN with the ARN of your OpenSearch Service domain.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "es:ESHttp*"
            ],
            "Resource": "arn:aws:es:us-east-2:XXXXXXXXXXXX:domain/emreksdemo"
        }
    ]
}

The configuration of Amazon OpenSearch Service is complete and ready for ingestion of logs from the Fluent Bit sidecar container.

Configure the Fluent Bit sidecar container

We need to write two configuration files to configure a Fluent Bit sidecar container. The first is the Fluent Bit configuration itself, and the second is the Fluent Bit sidecar subprocess configuration that makes sure that the sidecar operation ends when the main Spark job ends. The suggested configuration provided in this post is for Splunk and Amazon OpenSearch Service. However, you can configure Fluent Bit with other third-party log aggregators. For more information about configuring outputs, refer to Outputs.

Fluent Bit ConfigMap

The following sample ConfigMap is from the GitHub repo:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-sidecar-config
  namespace: sparkns
  labels:
    app.kubernetes.io/name: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-application.conf
    @INCLUDE input-event-logs.conf
    @INCLUDE output-splunk.conf
    @INCLUDE output-opensearch.conf

  input-application.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/user/*/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  input-event-logs.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/apps/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  output-splunk.conf: |
    [OUTPUT]
        Name            splunk
        Match           *
        Host            < INTERNAL DNS of Splunk EC2 Instance >
        Port            8088
        TLS             On
        TLS.Verify      Off
        Splunk_Token    < Token as provided by the HTTP Event Collector in Splunk >

  output-opensearch.conf: |
    [OUTPUT]
        Name            es
        Match           *
        Host            < HOST NAME of the OpenSearch Domain | No HTTP protocol >
        Port            443
        TLS             On
        AWS_Auth        On
        AWS_Region      < Region >
        Retry_Limit     6

In your AWS Cloud9 workspace, modify the ConfigMap accordingly. Provide the values for the placeholder text by opening the file in the vi editor with the following commands. If preferred, you can use PICO or a different editor:

[../aws-emr-eks-log-forwarding] $  cd kube/configmaps
[../aws-emr-eks-log-forwarding/kube/configmaps] $ vi emr_configmap.yaml

# Modify the emr_configmap.yaml as above
# Save the file once it is completed

Complete either the Splunk output configuration or the Amazon OpenSearch Service output configuration.

Next, run the following commands to add the two Fluent Bit sidecar and subprocess ConfigMaps:

[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_configmap.yaml
[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_entrypoint_configmap.yaml

You don’t need to modify the second ConfigMap because it’s the subprocess script that runs inside the Fluent Bit sidecar container. To verify that the ConfigMaps have been installed, run the following command:

$ kubectl get cm -n sparkns
NAME                         DATA   AGE
fluent-bit-sidecar-config    6      15s
fluent-bit-sidecar-wrapper   2      15s

Set up a customized EMR container image

The sample PySpark script requires the Boto3 package, which isn't available in the standard EMR container images. If you want to run your own script and it doesn't require a customized EMR container image, you can skip this step.

Run the following script:

[../aws-emr-eks-log-forwarding] $ cd ecr
[../aws-emr-eks-log-forwarding/ecr] $ bash create_custom_image.sh <region> <EMR container image account number>

The EMR container image account number for your Region can be obtained from How to select a base image URI. For example, the registry account number for us-east-1 is 755674844232.
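For reference, the customization that create_custom_image.sh performs amounts to a small Dockerfile. A minimal sketch, assuming the us-east-1 base image and that your script only needs Boto3, looks like the following:

# Base EMR 6.5.0 Spark image (us-east-1 registry shown; substitute your Region's registry)
FROM 755674844232.dkr.ecr.us-east-1.amazonaws.com/spark/emr-6.5.0:latest
USER root
# Install the Boto3 package required by the sample PySpark script
RUN pip3 install boto3
# Return to the unprivileged user that EMR on EKS expects
USER hadoop:hadoop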

To verify the repository and image, run the following commands:

$ aws ecr describe-repositories --region < region > | grep emr-6.5.0-custom
            "repositoryArn": "arn:aws:ecr:xx-xxxx-x:xxxxxxxxxxxx:repository/emr-6.5.0-custom",
            "repositoryName": "emr-6.5.0-custom",
            "repositoryUri": " xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/emr-6.5.0-custom",

$ aws ecr describe-images --region < region > --repository-name emr-6.5.0-custom | jq .imageDetails[0].imageTags
[
  "latest"
]

Prepare pod templates for Spark jobs

Upload the Spark driver and Spark executor pod templates to an S3 bucket and prefix. Both pod templates can be found in the GitHub repository:

  • emr_driver_template.yaml – Spark driver pod template
  • emr_executor_template.yaml – Spark executor pod template

The pod templates provided here should not be modified.

Submit a Spark job with a Fluent Bit sidecar container

This Spark job example uses the bostonproperty.py script. To use this script, upload it to an accessible S3 bucket and prefix and complete the preceding steps to use an EMR customized container image. You also need to upload the CSV file from the GitHub repo, which you need to download and unzip. Upload the unzipped file to the following location: s3://<your chosen bucket>/<first level folder>/data/boston-property-assessment-2021.csv.

The following commands assume that you launched your EKS cluster and virtual EMR cluster with the parameters indicated in the GitHub repo.

The following variables are required. Each description indicates where to find the information or the value to use:

  • EMR_EKS_CLUSTER_ID – Amazon EMR console virtual cluster page
  • EMR_EKS_EXECUTION_ARN – IAM role ARN
  • EMR_RELEASE – emr-6.5.0-latest
  • S3_BUCKET – The bucket you create in Amazon S3
  • S3_FOLDER – The preferred prefix you want to use in Amazon S3
  • CONTAINER_IMAGE – The URI in Amazon ECR where your container image is
  • SCRIPT_NAME – emreksdemo-script or a name you prefer
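
To see roughly what the provided script does under the hood, a manual submission with the AWS CLI would look like the following sketch. The Spark configuration shown (executor count, container image, and pod template locations) is illustrative; check the repo's script for the exact parameters it uses:

aws emr-containers start-job-run \
  --virtual-cluster-id < EMR_EKS_CLUSTER_ID > \
  --name < SCRIPT_NAME > \
  --execution-role-arn < EMR_EKS_EXECUTION_ARN > \
  --release-label emr-6.5.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://< S3_BUCKET >/< S3_FOLDER >/scripts/bostonproperty.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.kubernetes.container.image=< CONTAINER_IMAGE > --conf spark.kubernetes.driver.podTemplateFile=s3://< S3_BUCKET >/< S3_FOLDER >/pod_templates/emr_driver_template.yaml --conf spark.kubernetes.executor.podTemplateFile=s3://< S3_BUCKET >/< S3_FOLDER >/pod_templates/emr_executor_template.yaml"
    }
  }'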

Alternatively, use the provided script to run the job. Change the directory to the scripts folder in emreks and run the script as follows:

[../aws-emr-eks-log-forwarding] $ cd emreks/scripts
[../aws-emr-eks-log-forwarding/emreks/scripts] $ bash run_emr_script.sh < S3 bucket name > < ECR container image > < script path >

Example: bash run_emr_script.sh emreksdemo-123456 12345678990.dkr.ecr.us-east-2.amazonaws.com/emr-6.5.0-custom s3://emreksdemo-123456/scripts/scriptname.py

After you submit the Spark job successfully, you get a return JSON response like the following:

{
    "id": "0000000305e814v0bpt",
    "name": "emreksdemo-job",
    "arn": "arn:aws:emr-containers:xx-xxxx-x:XXXXXXXXXXX:/virtualclusters/upobc00wgff5XXXXXXXXXXX/jobruns/0000000305e814v0bpt",
    "virtualClusterId": "upobc00wgff5XXXXXXXXXXX"
}
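
To track the job's progress from the command line, you can describe the job run using the id and virtualClusterId values from that response; the state returned is a value such as SUBMITTED, RUNNING, COMPLETED, or FAILED. This sketch assumes jq is installed:

$ aws emr-containers describe-job-run \
    --virtual-cluster-id upobc00wgff5XXXXXXXXXXX \
    --id 0000000305e814v0bpt | jq .jobRun.state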

What happens when you submit a Spark job with a sidecar container

After you submit a Spark job, you can see what is happening by viewing the pods that are generated and the corresponding logs. First, using kubectl, get a list of the pods generated in the namespace where the EMR virtual cluster runs. In this case, it’s sparkns. The first pod in the following code is the job controller for this particular Spark job. The second pod is the Spark executor; there can be more than one pod depending on how many executor instances are asked for in the Spark job setting—we asked for one here. The third pod is the Spark driver pod.

$ kubectl get pods -n sparkns
NAME                                        READY   STATUS    RESTARTS   AGE
0000000305e814v0bpt-hvwjs                   3/3     Running   0          25s
emreksdemo-script-1247bf80ae40b089-exec-1   0/3     Pending   0          0s
spark-0000000305e814v0bpt-driver            3/3     Running   0          11s

To view what happens in the sidecar container, follow the logs in the Spark driver pod and refer to the sidecar. The sidecar container launches with the Spark pods and persists until the file /var/log/fluentd/main-container-terminated is no longer available. For more information about how Amazon EMR controls the pod lifecycle, refer to Using pod templates. The subprocess script ties the sidecar container to this same lifecycle and shuts down Fluent Bit when the EMR-controlled pod lifecycle ends.

$ kubectl logs spark-0000000305e814v0bpt-driver -n sparkns  -c custom-side-car-container --follow=true

Waiting for file /var/log/fluentd/main-container-terminated to appear...
AWS for Fluent Bit Container Image Version 2.24.0
Start wait: 1652190909
Elapsed Wait: 0
Not found count: 0
Waiting...
Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/05/10 13:55:09] [ info] [fluent bit] version=1.9.3, commit=9eb4996b7d, pid=11
[2022/05/10 13:55:09] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/10 13:55:09] [ info] [cmetrics] version=0.3.1
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #0 started
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #1 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #0 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #1 started
[2022/05/10 13:55:09] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/05/10 13:55:09] [ info] [sp] stream processor started
Waiting for file /var/log/fluentd/main-container-terminated to appear...
Last heartbeat: 1652190914
Elapsed Time since after heartbeat: 0
Found count: 0
list files:
-rw-r--r-- 1 saslauth 65534 0 May 10 13:55 /var/log/fluentd/main-container-terminated
Last heartbeat: 1652190918

…

[2022/05/10 13:56:09] [ info] [input:tail:tail.0] inotify_fs_add(): inode=58834691 watch_fd=6 name=/var/log/spark/user/spark-0000000305e814v0bpt-driver/stdout-s3-container-log-in-tail.pos
[2022/05/10 13:56:09] [ info] [input:tail:tail.1] inotify_fs_add(): inode=54644346 watch_fd=1 name=/var/log/spark/apps/spark-0000000305e814v0bpt
Outside of loop, main-container-terminated file no longer exists
ls: cannot access /var/log/fluentd/main-container-terminated: No such file or directory
The file /var/log/fluentd/main-container-terminated doesn't exist anymore;
TERMINATED PROCESS
Fluent-Bit pid: 11
Killing process after sleeping for 15 seconds
root        11     8  0 13:55 ?        00:00:00 /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf
root       114     7  0 13:56 ?        00:00:00 grep fluent
Killing process 11
[2022/05/10 13:56:24] [engine] caught signal (SIGTERM)
[2022/05/10 13:56:24] [ info] [input] pausing tail.0
[2022/05/10 13:56:24] [ info] [input] pausing tail.1
[2022/05/10 13:56:24] [ warn] [engine] service will shutdown in max 5 seconds
[2022/05/10 13:56:25] [ info] [engine] service has stopped (0 pending tasks)
[2022/05/10 13:56:25] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=54644346 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917120 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917121 watch_fd=2
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834690 watch_fd=3
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834692 watch_fd=4
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834689 watch_fd=5
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834691 watch_fd=6
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopped

View the forwarded logs in Splunk or Amazon OpenSearch Service

To view the forwarded logs, do a search in Splunk or on the Amazon OpenSearch Service console. If you're using a shared log aggregator, you may have to filter the results. In this configuration, the logs tailed by Fluent Bit are in /var/log/spark/*. The following screenshots show the logs generated specifically by the Kubernetes Spark driver stdout and forwarded to the log aggregators. You can compare the results with the logs obtained using kubectl:

kubectl logs < Spark Driver Pod > -n < namespace > -c spark-kubernetes-driver --follow=true

…
root
 |-- PID: string (nullable = true)
 |-- CM_ID: string (nullable = true)
 |-- GIS_ID: string (nullable = true)
 |-- ST_NUM: string (nullable = true)
 |-- ST_NAME: string (nullable = true)
 |-- UNIT_NUM: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- ZIPCODE: string (nullable = true)
 |-- BLDG_SEQ: string (nullable = true)
 |-- NUM_BLDGS: string (nullable = true)
 |-- LUC: string (nullable = true)
…

|02108|RETAIL CONDO           |361450.0            |63800.0        |5977500.0      |
|02108|RETAIL STORE DETACH    |2295050.0           |988200.0       |3601900.0      |
|02108|SCHOOL                 |1.20858E7           |1.20858E7      |1.20858E7      |
|02108|SINGLE FAM DWELLING    |5267156.561085973   |1153400.0      |1.57334E7      |
+-----+-----------------------+--------------------+---------------+---------------+
only showing top 50 rows

The following screenshot shows the Splunk logs.

splunk-result-driver-stdout

The following screenshots show the Amazon OpenSearch Service logs.

opensearch-result-driver-stdout

Optional: Include a buffer between Fluent Bit and the log aggregators

If you expect to generate a lot of logs because many concurrent Spark jobs create multiple individual connections that may overwhelm your Amazon OpenSearch Service or Splunk log aggregation clusters, consider employing a buffer between the Fluent Bit sidecars and your log aggregator. One option is to use Amazon Kinesis Data Firehose as the buffering service.

Kinesis Data Firehose has built-in delivery to both Amazon OpenSearch Service and Splunk. If using Amazon OpenSearch Service, refer to Loading streaming data from Amazon Kinesis Data Firehose. If using Splunk, refer to Configure Amazon Kinesis Firehose to send data to the Splunk platform and Choose Splunk for Your Destination.

To configure Fluent Bit to send data to Kinesis Data Firehose, add the following to your ConfigMap output. Refer to the GitHub ConfigMap example and add the @INCLUDE under the [SERVICE] section:

     @INCLUDE output-kinesisfirehose.conf
…

  output-kinesisfirehose.conf: |
    [OUTPUT]
        Name            kinesis_firehose
        Match           *
        region          < region >
        delivery_stream < Kinesis Firehose Stream Name >

Optional: Use data streams for Amazon OpenSearch Service

If you're in a scenario where the number of documents grows rapidly and you don't need to update older documents, you need to manage the indexes in your OpenSearch Service domain. This involves steps like creating a rollover index alias, defining a write index, and defining common mappings and settings for the backing indexes. Consider using data streams to simplify this process and enforce a setup that best suits your time series data. For instructions on implementing data streams, refer to Data streams.
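
As an illustrative sketch of that setup, the following requests create an index template that routes matching indexes into a data stream and then initialize the stream. It assumes the awscurl tool for SigV4 request signing, and the spark-logs name is a placeholder:

# Create an index template that designates matching indexes as a data stream
awscurl --service es --region < region > -X PUT \
  "https://< HOST NAME of the OpenSearch Domain >/_index_template/spark-logs" \
  -H "Content-Type: application/json" \
  -d '{"index_patterns": ["spark-logs*"], "data_stream": {}, "priority": 100}'

# Initialize the data stream itself
awscurl --service es --region < region > -X PUT \
  "https://< HOST NAME of the OpenSearch Domain >/_data_stream/spark-logs"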

Clean up

To avoid incurring future charges, delete the resources created by this deployment. First remove the EMR virtual cluster by running the delete-virtual-cluster command. Then delete all the CloudFormation stacks generated by the deployment script, which removes the EKS cluster.
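
A sketch of that order of operations with the AWS CLI follows; the identifiers are placeholders:

# Remove the EMR virtual cluster first
aws emr-containers list-virtual-clusters --region < region >
aws emr-containers delete-virtual-cluster --id < virtual cluster id > --region < region >

# Then delete the CloudFormation stacks generated by the deployment script
aws cloudformation delete-stack --stack-name < stack name > --region < region >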

If you launched an OpenSearch Service domain, you can delete the domain from the OpenSearch Service console. If you used the script to launch a Splunk instance, you can delete the CloudFormation stack that launched the Splunk instance, which removes the instance and associated resources.

You can also use the cleanup scripts provided in the GitHub repo to clean up resources.

Conclusion

EMR on EKS facilitates running Spark jobs on Kubernetes to achieve fast and cost-efficient Spark operations. This is made possible by scheduling transient pods that are launched and then deleted when the jobs are complete. To capture the logs of all these operations within the same lifecycle as the Spark jobs, this post provides a solution using pod templates and Fluent Bit that is lightweight and powerful. This approach offers a decoupled way of log forwarding at the Spark application level rather than at the Kubernetes cluster level. It also avoids routing through intermediaries like CloudWatch, reducing cost and complexity. In this way, you can address security concerns and ease DevOps and system administration while providing Spark users with insights into their Spark jobs in a cost-efficient and functional way.

If you have questions or suggestions, please leave a comment.


About the Author

Matthew Tan is a Senior Analytics Solutions Architect at Amazon Web Services and provides guidance to customers developing analytics solutions with AWS Analytics services.

AWS Week in Review – May 9, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-may-9-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Another week starts, and here’s a collection of the most significant AWS news from the previous seven days. This week is also the one-year anniversary of CloudFront Functions. It’s exciting to see what customers have built during this first year.

Last Week’s Launches
Here are some launches that caught my attention last week:

Amazon RDS supports PostgreSQL 14 with three levels of cascaded read replicas – That's 5 replicas per instance, supporting a maximum of 155 read replicas per source instance with up to 30X more read capacity. You can now build a more robust disaster recovery architecture with the capability to create Single-AZ or Multi-AZ cascaded read replica DB instances in the same Region or across Regions.

Amazon RDS on AWS Outposts storage auto scaling – AWS Outposts extends AWS infrastructure, services, APIs, and tools to virtually any data center. With Amazon RDS on AWS Outposts, you can deploy managed DB instances in your on-premises environments. Now, you can turn on storage auto scaling when you create or modify DB instances by selecting a checkbox and specifying the maximum database storage size.

Amazon CodeGuru Reviewer suppression of files and folders in code reviews – With CodeGuru Reviewer, you can use automated reasoning and machine learning to detect potential code defects that are difficult to find and get suggestions for improvements. Now, you can prevent CodeGuru Reviewer from generating unwanted findings on certain files like test files, autogenerated files, or files that have not been recently updated.

Amazon EKS console now supports all standard Kubernetes resources to simplify cluster management – To make it easy to visualize and troubleshoot your applications, you can now use the console to see all standard Kubernetes API resource types (such as service resources, configuration and storage resources, authorization resources, policy resources, and more) running on your Amazon EKS cluster. More info in the blog post Introducing Kubernetes Resource View in Amazon EKS console.

AWS AppConfig feature flag Lambda Extension support for Arm/Graviton2 processors – Using AWS AppConfig, you can create feature flags or other dynamic configuration and safely deploy updates. The AWS AppConfig Lambda Extension allows you to access this feature flag and dynamic configuration data in your Lambda functions. You can now use the AWS AppConfig Lambda Extension from Lambda functions using the Arm/Graviton2 architecture.

AWS Serverless Application Model (SAM) CLI now supports enabling AWS X-Ray tracing – With the AWS SAM CLI, you can initialize, build, package, test (locally and in the cloud), and deploy serverless applications. With AWS X-Ray, you have an end-to-end view of requests as they travel through your application, making them easier to monitor and troubleshoot. Now, you can enable tracing by simply adding a flag to the sam init command.

Amazon Kinesis Video Streams image extraction – With Amazon Kinesis Video Streams you can capture, process, and store media streams. Now, you can also request images via API calls or configure automatic image generation based on metadata tags in ingested video. For example, you can use this to generate thumbnails for playback applications or to have more data for your machine learning pipelines.

AWS GameKit supports Android, iOS, and MacOS games developed with Unreal Engine – With AWS GameKit, you can build AWS-powered game features directly from the Unreal Editor with just a few clicks. Now, the AWS GameKit plugin for Unreal Engine supports building games for the Win64, MacOS, Android, and iOS platforms.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates you might have missed:

🎂 One-year anniversary of CloudFront Functions – I can’t believe it’s been one year since we launched CloudFront Functions. Now, we have tens of thousands of developers actively using CloudFront Functions, with trillions of invocations per month. You can use CloudFront Functions for HTTP header manipulation, URL rewrites and redirects, cache key manipulations/normalization, access authorization, and more. See some examples in this repo. Let’s see what customers built with CloudFront Functions:

  • CloudFront Functions enables Formula 1 to authenticate users with more than 500K requests per second. The solution is using CloudFront Functions to evaluate if users have access to view the race livestream by validating a token in the request.
  • Cloudinary is a media management company that helps its customers deliver content such as videos and images to users worldwide. For them, Lambda@Edge remains an excellent solution for applications that require heavy compute operations, but lightweight operations that require high scalability can now be run using CloudFront Functions. With CloudFront Functions, Cloudinary and its customers are seeing significantly increased performance. For example, one of Cloudinary’s customers began using CloudFront Functions, and in about two weeks it was seeing 20–30 percent better response times. The customer also estimates that they will see 75 percent cost savings.
  • Based in Japan, DigitalCube is a web hosting provider for WordPress websites. Previously, DigitalCube spent several hours completing each of its update deployments. Now, they can deploy updates across thousands of distributions quickly. Using CloudFront Functions, they’ve reduced update deployment times from 4 hours to 2 minutes. In addition, faster updates and less maintenance work result in better quality throughout DigitalCube’s offerings. It’s now easier for them to test on AWS because they can run tests that affect thousands of distributions without having to scale internally or introduce downtime.
  • Amazon.com is using CloudFront Functions to change the way it delivers static assets to customers globally. CloudFront Functions allows them to experiment with hyper-personalization at scale and optimal latency performance. They have been working closely with the CloudFront team during product development, and they like how it is easy to create, test, and deploy custom code and implement business logic at the edge.

AWS open-source news and updates – A newsletter curated by my colleague Ricardo to bring you the latest open-source projects, posts, events, and more. Read the latest edition here.

Reduce log-storage costs by automating retention settings in Amazon CloudWatch – By default, CloudWatch Logs stores your log data indefinitely. This blog post shows how you can reduce log-storage costs by establishing a log-retention policy and applying it across all of your log groups.

Observability for AWS App Runner VPC networking – With X-Ray support in App Runner, you can quickly deploy web applications and APIs at any scale and take advantage of tracing without having to manage sidecars or agents. Here's an example of how you can instrument your applications with the AWS Distro for OpenTelemetry (ADOT).

Upcoming AWS Events
It's AWS Summits season, with virtual and in-person events that might be close to you.

You can now register for re:MARS to get fresh ideas on topics such as machine learning, automation, robotics, and space. The conference will be in person in Las Vegas, June 21–24.

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

How to use new Amazon GuardDuty EKS Protection findings

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/how-to-use-new-amazon-guardduty-eks-protection-findings/

If you run container workloads that use Amazon Elastic Kubernetes Service (Amazon EKS), Amazon GuardDuty now has added support that will help you better protect these workloads from potential threats. Amazon GuardDuty EKS Protection can help detect threats related to user and application activity that is captured in Kubernetes audit logs. Newly-added Kubernetes threat detections include Amazon EKS clusters that are accessed by known malicious actors or from Tor nodes, API operations performed by anonymous users that might indicate a misconfiguration, and misconfigurations that can result in unauthorized access to Amazon EKS clusters. By using machine learning (ML) models, GuardDuty can identify patterns consistent with privilege-escalation techniques, such as a suspicious launch of a container with root-level access to the underlying Amazon Elastic Compute Cloud (Amazon EC2) host. In this post, we give you an overview of the new GuardDuty EKS Protection feature; show you examples of new finding details; and help you understand, operationalize, and respond to these new findings.

Amazon GuardDuty is an automated threat detection service that continuously monitors for suspicious activity and potentially unauthorized behavior to help protect your AWS accounts, Amazon EC2 workloads, data stored in Amazon Simple Storage Service (S3), and now Amazon EKS workloads.

If you are already a GuardDuty customer, you can enable GuardDuty EKS Protection and efficiently navigate the console to begin to use this feature. Your delegated administrator accounts can enable this for existing member accounts and determine if new AWS accounts in an organization will be automatically enrolled. If you are new to GuardDuty, the EKS Protection feature is included as part of the service’s 30-day trial period. As part of the 30-day trial period, you can take full advantage of this new feature and gain insight into your Amazon EKS workloads.

Overview of GuardDuty EKS Protection

GuardDuty EKS Protection enables GuardDuty to detect suspicious activities and potential compromises of your EKS clusters by analyzing Kubernetes audit logs. Kubernetes audit logs provide a security-relevant, chronological set of records documenting the sequence of events from individual users, administrators, or system components that have affected your cluster. Audit logs can help answer questions such as: What happened? When did it happen? Who initiated it? GuardDuty EKS Protection analyzes Kubernetes audit logs from your Amazon EKS clusters, both new and existing, without the need to configure EKS control plane logging in your environment. GuardDuty collects these Kubernetes audit logs in addition to AWS CloudTrail, Amazon Virtual Private Cloud (Amazon VPC) flow logs, DNS queries, and Amazon S3 data events. GuardDuty EKS Protection performs analysis and looks for suspicious activity without the need for agents or adding resource constraints to your environment.

To detect threats using Kubernetes audit logs, GuardDuty uses a combination of machine learning, anomaly detection, and integrated threat intelligence to identify and prioritize potential threats. These findings primarily align to five root causes including compromised container images, configuration issues, Kubernetes user compromise, pod compromise, and node compromise. An example of a configuration issue is granting unnecessary privileges to the anonymous user by misconfiguring role-based access control (RBAC), which may inadvertently allow anonymous and unauthenticated calls to the Kubernetes API. A Kubernetes user compromise example could be a bad actor using stolen credentials to deploy containers with insecure settings, to use for a variety of activities from command and control to crypto-mining.

After a threat is detected, GuardDuty generates a security finding that includes container details such as the pod ID, container image ID, and tags associated with the Amazon EKS cluster. These finding details assist you with understanding the root cause, which you can use to identify basic steps to remediate findings specific to EKS clusters. For example, your response to a finding or group of findings associated with a compromised Kubernetes user might begin with revoking access. For more information, see Remediating Kubernetes security issues discovered by GuardDuty in the Amazon GuardDuty User Guide.

Understanding new GuardDuty EKS Protection findings

As adversaries continue to become more sophisticated, it becomes even more important for you to align to a common framework to understand the tactics, techniques, and procedures (TTPs) behind an individual event. GuardDuty aligns findings using the MITRE ATT&CK framework, which is a globally-accessible knowledge base of adversary tactics and techniques based on real-world observations. GuardDuty findings have a specific finding format that helps you understand details of each finding. If you examine the ThreatPurpose portion in the GuardDuty EKS Protection finding types, you see there are finding types associated with various MITRE ATT&CK tactics, including CredentialAccess, DefenseEvasion, Discovery, Impact, Persistence, and PrivilegeEscalation. This can help you identify and understand the type of activity associated with a finding.

For example, look at two different finding types that seem similar: Impact:Kubernetes/SuccessfulAnonymousAccess and Discovery:Kubernetes/SuccessfulAnonymousAccess. You can see the difference is the ThreatPurpose at the beginning. They are both involved with successful anonymous access, and the difference is the intent of the activity associated with each finding. GuardDuty has determined, based on the API or request URI invoked, that in this example the activity seen on one finding aligns with the Impact tactic, whereas the other finding aligns with the Discovery tactic.

With GuardDuty EKS Protection, you now have an additional mechanism to gain insight into your EKS clusters across your accounts to look for suspicious activity. You can be alerted to Kubernetes-specific suspicious activity including: allowing administrator access to the default service account, exposing a Kubernetes dashboard, and launching a container with sensitive host paths. With this new feature, GuardDuty is also able to extend support for finding types that you might already be familiar with that also apply to Amazon EKS workloads. These finding types include calls to a Kubernetes cluster API from a Tor node, or calls to a Kubernetes cluster from a known malicious IP address, which can indicate that there are interactions with your Kubernetes clusters from sources that are commonly associated with malicious actors.

Responding to GuardDuty EKS Protection findings

This section gives an overview of three new GuardDuty EKS Protection findings, how to prevent them, and how to investigate and respond if they happen in your environment. The patterns shown can also act as a guide for how to prevent, investigate, and respond to other GuardDuty EKS Protection findings.

Discovery:Kubernetes/SuccessfulAnonymousAccess

Finding documentation: Discovery:Kubernetes/SuccessfulAnonymousAccess

Severity: Medium

Overview: This finding (as shown in Figure 1) informs you that an API operation was successfully invoked by the system:anonymous user. API calls made by system:anonymous are unauthenticated. The observed API is commonly associated with the discovery stage of an attack when an adversary is gathering information on your Kubernetes cluster. This activity indicates that anonymous or unauthenticated access is permitted on the API action reported in the finding, and may be permitted on other actions. These API calls are possible because of a misconfiguration of the system:anonymous user or system:unauthenticated group.

Preventative measures: AWS recommends that you disable unnecessary anonymous authentication. For instructions, see Review and revoke unnecessary anonymous access in the Amazon EKS Best Practices Guides. It is important to note that Kubernetes versions older than 1.14 granted the system:discovery and system:basic-user roles to the system:anonymous user by default, and these permissions remain in place after updating unless you explicitly change them.

How to remediate: To respond to this finding, it is important to first identify the details of the activity, for example what cluster is involved? Who is the owner of this cluster? This information will assist you with the remediation steps that follow, to review and revoke unnecessary permissions, and also help you determine a root cause.
 

Figure 1: GuardDuty Console showing Discovery:Kubernetes/SuccessfulAnonymousAccess finding type

Remediation step 1: Examine permissions

The first step is to examine the permissions that have been granted to the system:anonymous user, and determine what permissions are needed. To accomplish this, you need to first understand what permissions the system:anonymous user has. You can use the rbac-lookup tool to list the Kubernetes roles and cluster roles bound to users, service accounts, and groups. An alternative method can be found at this GitHub page.

./rbac-lookup | grep -P 'system:(anonymous)|(unauthenticated)'
system:anonymous               cluster-wide        ClusterRole/system:discovery
system:unauthenticated         cluster-wide        ClusterRole/system:discovery
system:unauthenticated         cluster-wide        ClusterRole/system:public-info-viewer
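
If you don't have rbac-lookup available, a plain kubectl query can approximate the same check. This sketch assumes jq is installed:

kubectl get clusterrolebindings -o json | jq -r '
  .items[]
  | select(any(.subjects[]?; .name == "system:anonymous" or .name == "system:unauthenticated"))
  | .metadata.name + " -> ClusterRole/" + .roleRef.name'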

Remediation step 2: Disassociate groups

Next, you disassociate the system:unauthenticated group from system:discovery and system:basic-user ClusterRoles, which you do by editing the ClusterRoleBinding. Make sure to not remove system:unauthenticated from the system:public-info-viewer cluster role binding, because that will prevent the Network Load Balancer from performing health checks against the API server. For more information, see Network Load Balancer in the AWS Load Balancer Controller Guide and Identity and Access Management in Amazon EKS Best Practices Guide.

To disassociate the appropriate groups

  1. Run the command kubectl edit clusterrolebindings system:discovery. This command will open the current definition of system:discovery ClusterRoleBinding in your editor as shown in the sample .yaml configuration file:
    # Please edit the object below. Lines beginning with a '#' will be ignored,
    # and an empty file will abort the edit. If an error occurs while saving this file will be
    # reopened with the relevant failures.
    #
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      annotations:
        rbac.authorization.kubernetes.io/autoupdate: "true"
      creationTimestamp: "2021-06-17T20:50:49Z"
      labels:
        kubernetes.io/bootstrapping: rbac-defaults
      name: system:discovery
      resourceVersion: "24502985"
      selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/system%3Adiscovery
      uid: b7936268-5043-431a-a0e1-171a423abeb6
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:discovery
    subjects:
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:authenticated
    - apiGroup: rbac.authorization.k8s.io
      kind: Group
      name: system:unauthenticated
    

  2. Delete the entry for the system:unauthenticated group in the subjects section.
  3. Repeat the same steps for system:basic-user ClusterRoleBinding.

If there is no reason for the system:anonymous user to be used in your environment, AWS recommends that you set up automatic response and remediation for steps 1-3. For more information about the system:anonymous user, see Identity and Access Management in the Amazon EKS Best Practices Guide.

PrivilegeEscalation:Kubernetes/PrivilegedContainer

Finding documentation: PrivilegeEscalation:Kubernetes/PrivilegedContainer

Severity: Medium

Overview: This finding (as shown in Figure 2) informs you that a privileged container was launched on your Kubernetes cluster using an image that has never before been used to launch privileged containers in your cluster. A privileged container has root level access on the host. Adversaries commonly launch privileged containers to perform privilege escalation to gain access and compromise the underlying host.

Preventative measures: Create and enforce policy-as-code (PAC) or Pod Security Standards (PSS) that require that pods be created as non-privileged. For more information, see Pod Security in the Amazon EKS Best Practices Guide.

How to remediate: To respond to this finding, it is important to first identify the details of the activity and begin to answer questions that will help determine what happened. For example, what pod or workload was launched? Who was the user that launched this pod or workload? What cluster is involved?
 

Figure 2: GuardDuty Console showing PrivilegeEscalation:Kubernetes/PrivilegedContainer finding type

If this privileged container launch is unexpected, the credentials of the user identity used to launch the container may be compromised. You should then focus on remediating and reviewing access to your cluster, and remediating the user. To do this, follow the procedure in the Remediating a compromised Kubernetes user section of this post. Next, you should identify compromised pods using the procedure in the Identifying and remediating compromised pods section of this post.

If you know the specific circumstances in which a privileged container can be deployed in your environment, for example only in a specific namespace, it is likely you can automatically remediate any GuardDuty EKS Protection finding associated with a privileged container in any other namespace. For more information about automated response activities, see Incident response and forensics in the Amazon EKS Best Practices Guide.

Persistence:Kubernetes/ContainerWithSensitiveMount

Finding documentation: Persistence:Kubernetes/ContainerWithSensitiveMount

Severity: Medium

Overview: This finding (as shown in Figure 3) informs you that a container was launched with a configuration that included a sensitive host path with write access in the volumeMounts section. This makes the sensitive host path accessible and writable from inside the container. This technique is commonly used by adversaries to gain access to the host’s filesystem.

Preventative Measures: Create and enforce policy-as-code (PAC) or Pod Security Standards (PSS) that use the allowedHostPaths control to only allow required host paths for use in volumes and preferably with read-only access. For more information, see Pod Security in the Amazon EKS Best Practices Guide.

How to remediate: To respond to this finding, it is important to first identify the details of the activity and begin to answer questions that will help determine what happened. For example, what pod or workload was launched? Who was the user that launched this pod or workload? What cluster is involved?
 

Figure 3: GuardDuty Console showing Persistence:Kubernetes/ContainerWithSensitiveMount finding type

If the container launched is unexpected, the credentials of the user identity used to launch the container may be compromised. You should then focus on remediating and reviewing access to your cluster and remediating the user. To do this, follow the procedure in the next section, Remediating a compromised Kubernetes user.

If you can determine what containers should and should not be launched with writable hostPath mounts, then you can create automatic response and remediation for this use case. For example, you might want to revoke temporary security credentials assigned to the pod or worker node. For more information about revoking temporary security credentials and other response and remediation actions, see Incident response and forensics in the Amazon EKS Best Practices Guide.

Remediating a compromised Kubernetes user

If the compromised user has privileges to read secrets of one or more namespaces, rotate all of the affected secrets. For more information about the different types of secrets, see Secrets in the Kubernetes documentation. If the user has write privileges, AWS recommends auditing all changes made by the user in question. You can accomplish this by querying audit logs, if you have enabled EKS control plane logging on your EKS cluster. If you do not currently have logging enabled, follow the instructions for Enabling and disabling control plane logs in the Amazon EKS User Guide. Amazon EKS stores these control plane logs in Amazon CloudWatch Logs in your account. You can use CloudWatch Logs Insights to list all the mutating changes that the compromised user has made.
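
As a sketch of what rotating a generic secret can look like with kubectl (the secret name, namespace, and key here are placeholders), note that workloads consuming the secret must be restarted to pick up the new value:

# List the secrets in the affected namespace
kubectl get secrets -n < namespace >

# Rotate a secret by replacing its material
kubectl delete secret < secret name > -n < namespace >
kubectl create secret generic < secret name > -n < namespace > \
  --from-literal=< key >=< new value >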

Remediation step 1: Identify the user

All actions performed on a Kubernetes cluster have an associated identity. GuardDuty EKS Protection findings report details of the Kubernetes user identity that the malicious actor may have compromised. You can find details of the user identity in the GuardDuty console under the Kubernetes user details section in the finding details, or in the finding JSON under the resources.eksClusterDetails.kubernetesDetails.kubernetesUserDetails section. These user details include username, UID, and groups that the user belongs to.

Remediation step 2: Identify changes

  1. Identify the changes made by the attacker associated with the compromised user identity by using the code example below to query CloudWatch Logs Insights, replacing the placeholders with your values.
    fields @timestamp, @message
    | filter user.username == <username> 
    | filter verb == "create" or verb == "update" or verb == "patch"
    | filter responseStatus.code >= 200 and responseStatus.code <= 300
    | filter @timestamp >= <approximate start time of the attack in epoch milliseconds>

    For example:

    fields @timestamp, @message
    | filter user.username == "kubernetes-admin" 
    | filter verb == "create" or verb == "update" or verb == "patch"
    | filter responseStatus.code >= 200 and responseStatus.code <= 300
    | filter @timestamp >= 1628279482312
    

  2. An EKS cluster can have multiple types of user identities, for example the kubernetes-admin user, aws-auth ConfigMap defined user, and so on. You will need to take actions appropriate for the user type to properly revoke its access. For more information, see Remediating compromised Kubernetes users in the Amazon GuardDuty User Guide.
  3. (Optional) If the compromised user identity had extensive privileges and you determine that the attacker made extensive changes to the cluster, you should consider isolating the pod, followed by creating a new clean cluster and redeploying your applications to the new cluster. For instructions to isolate and redeploy EKS pods, see Isolate the Pod by creating a Network Policy that denies all ingress and egress traffic to the pod in the Amazon EKS Best Practices Guide.

Identifying and remediating compromised pods

If a GuardDuty EKS Protection finding is caused by activity related to a specific pod, the value of the finding JSON resource.kubernetesDetails.kubernetesWorkloadDetails.type field is pod. The finding includes the name of the pod and namespace in the resource.kubernetesDetails.kubernetesWorkloadDetails.name and resource.kubernetesDetails.kubernetesWorkloadDetails.namespace fields, which uniquely identify the pod.

In other cases, such as when a service account or a Kubernetes workload name is in the resource.kubernetesDetails.kubernetesUserDetails, you can follow the instructions in the Sample incident response plan to identify compromised pods using different pieces of information available in the GuardDuty EKS Protection findings.

After you have identified compromised pods, to remediate, use the instructions to isolate the pods, rotate the credentials, and gather data for forensic analysis in Isolate the Pod by creating a Network Policy that denies all ingress and egress traffic to the pod in the Amazon EKS Best Practices Guide.

Conclusion

In this post, you learned the details of the new Amazon GuardDuty EKS Protection feature and Kubernetes audit logs, and you saw examples of how to understand, operationalize, and respond to these new findings. You can enable this feature through the GuardDuty console or APIs to start monitoring your Amazon EKS clusters today. If you have created Amazon EventBridge rules to send findings from GuardDuty to a target, ensure that your rules are configured to deliver these newly added findings.
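
For example, an EventBridge event pattern along these lines would match GuardDuty findings for EKS resources. The resourceType field follows the GuardDuty finding format, but verify it against a sample event in your account:

{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "resource": {
      "resourceType": ["EKSCluster"]
    }
  }
}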

AWS is committed to continually improving GuardDuty, to make it more efficient for you to operate securely in AWS. At AWS, customer feedback drives change, so we encourage you to continue providing feedback. If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS re:Post or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Marshall Jones

Marshall is a worldwide security specialist solutions architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he helps enterprise customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

How SailPoint solved scaling issues by migrating legacy big data applications to Amazon EMR on Amazon EKS

Post Syndicated from Richard Li original https://aws.amazon.com/blogs/big-data/how-sailpoint-solved-scaling-issues-by-migrating-legacy-big-data-applications-to-amazon-emr-on-amazon-eks/

This post is co-written with Richard Li from SailPoint.

SailPoint Technologies is an identity security company based in Austin, TX. Its software as a service (SaaS) solutions support identity governance operations in regulated industries such as healthcare, government, and higher education. SailPoint distinguishes multiple aspects of identity as individual identity security services, including cloud governance, SaaS management, access risk governance, file access management, password management, provisioning, recommendations, and separation of duties, as well as access certification, access insights, access modeling, and access requests.

In this post, we share how SailPoint updated its platform for big data operations, and solved scaling issues by migrating legacy big data applications to Amazon EMR on Amazon EKS.

The challenge with the legacy data environment

SailPoint acquired a SaaS software platform that processes and analyzes identity, resource, and usage data from multiple cloud providers, and provides access insights, usage analysis, and access risk analysis. The original design of the platform focused on serving small to medium-sized companies. To quickly process these analytics insights, many of these processing workloads ran inside microservices over streaming connections.

After the acquisition, we set a goal to expand the platform's capability to handle customers with large cloud footprints across multiple cloud providers, sometimes spanning hundreds or even thousands of accounts that produce large amounts of cloud event data.

The legacy architecture took a simplistic approach to data processing, as shown in the following diagram. We processed the vast majority of event data in-service and ingested it directly into Amazon Relational Database Service (Amazon RDS), which we then merged with a graph database to form the final view.

We needed to convert this into a scalable process that could handle customers of any size. To address this challenge, we had to quickly introduce a big data processing engine in the platform.

How migrating to Amazon EMR on EKS helped solve this challenge

When evaluating the platform for our big data operations, several factors made Amazon EMR on EKS a top choice.

The amount of event data we receive at any given time is generally unpredictable. To stay cost-effective and efficient, we need a platform that is capable of scaling up automatically when the workload increases to reduce wait time, and can scale down when the capacity is no longer needed to save cost. Because our existing application workloads are already running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with the cluster autoscaler enabled, running Amazon EMR on EKS on top of our existing EKS cluster fits this need.

Amazon EMR on EKS can safely coexist on an EKS cluster that is already hosting other workloads, be contained within a specified namespace, and have controlled access through use of Kubernetes role-based access control and AWS Identity and Access Management (IAM) roles for service accounts. Therefore, we didn’t have to build new infrastructures just for Amazon EMR. We simply linked up Amazon EMR on EKS with our existing EKS cluster running our application workloads. This reduced the amount of DevOps support needed, and significantly sped up our implementation and deployment timeline.

Unlike Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2), because our EKS cluster spans multiple Availability Zones, we can control Spark pod placement using Kubernetes pod scheduling and placement strategies to achieve higher fault tolerance.

With the ability to create and use custom images in Amazon EMR on EKS, we could also utilize our existing container-based application build and deployment pipeline for our Amazon EMR on EKS workload without any modifications. This also gave us additional benefit in reducing job startup time because we package all job scripts as well as all dependencies with the image, without having to fetch them at runtime.

We also utilize AWS Step Functions as our core workflow engine. The native integration of Amazon EMR on EKS with Step Functions is another bonus: we didn't have to build custom code for job dispatch. Instead, we could utilize the Step Functions native integration to seamlessly incorporate Amazon EMR jobs into our existing workflow, with very little effort.
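
For illustration, the integration boils down to a task state that calls startJobRun synchronously. The parameter values in this sketch are placeholders rather than our production configuration:

"RunEmrOnEksJob": {
  "Type": "Task",
  "Resource": "arn:aws:states:::emr-containers:startJobRun.sync",
  "Parameters": {
    "VirtualClusterId": "< virtual cluster id >",
    "ExecutionRoleArn": "< execution role ARN >",
    "ReleaseLabel": "emr-6.5.0-latest",
    "JobDriver": {
      "SparkSubmitJobDriver": {
        "EntryPoint": "s3://< bucket >/scripts/job.py"
      }
    }
  },
  "End": true
}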

In merely 5 months, we were able to go from design, to proof of concept, to rolling out phase 1 of the event analytics processing. This vastly improved our event analytics processing capability by extending horizontal scalability, which gave us the ability to take on customers with significantly larger cloud footprints than the legacy platform was designed for.

During the development and rollout of the platform, we also found that the Spark History Server provided by Amazon EMR on EKS was very useful in terms of helping us identify performance issues and tune the performance of our jobs.

As of this writing, the phase 1 rollout, which includes the event processing component of the core analytics processing, is complete. We’re now expanding the platform to migrate additional components onto Amazon EMR on EKS. The following diagram depicts our future architecture with Amazon EMR on EKS when all phases are complete.

In addition, to improve performance and reduce costs, we're currently testing the Spark dynamic resource allocation support of Amazon EMR on EKS. This would automatically scale the job executors up and down based on load, and therefore boost performance when needed and reduce cost when the workload is low. Furthermore, we're investigating the possibility of reducing overall cost and increasing performance by utilizing the pod template feature, which would allow us to seamlessly transition our Amazon EMR job workload to AWS Graviton-based instances.

Conclusion

With Amazon EMR on EKS, we can now onboard new customers and process vast amounts of data in a cost-effective manner, which we couldn’t do with our legacy environment. We plan to expand our Amazon EMR on EKS footprint to handle all our transform and load data analytics processes.


About the Authors

Richard Li is a senior staff software engineer on the SailPoint Technologies Cloud Access Management team.

Janak Agarwal is a product manager for Amazon EMR on Amazon EKS at AWS.

Kiran Guduguntla is a WW Go-to-Market Specialist for Amazon EMR at AWS. He works with AWS customers across the globe to strategize, build, develop, and deploy modern data analytics solutions.

Financial Crime Discovery using Amazon EKS and Graph Databases

Post Syndicated from Severin Gassauer-Fleissner original https://aws.amazon.com/blogs/architecture/financial-crime-discovery-using-amazon-eks-and-graph-databases/

Discovering and solving financial crimes has become a challenge due to an increasing amount of financial data. While storing transactional payment data in a structured table format is useful for searching, filtering, and calculations, it is not always an ideal way to represent transactional data. For example, determining if there is a suspicious financial relationship between entity A and entity B is difficult to visualize in a table. Using tables, we would have to do SQL joins for every possible transaction from entity A to every possible subsequent transaction. We would then have to iterate this process until we found a relationship to entity B. Moreover, certain queries are challenging to run on a relational database management system (RDBMS). For example, it can be quite time consuming to discover which account received a minimum amount of $10,000,000 from other accounts.

Graph databases such as Amazon Neptune can be helpful with performing queries, because they can traverse the data and perform calculations simultaneously. Graphs enable us to represent transactions and parties over a multi-connected network, and discover patterns and chains of connections. It is common to use them in anti-money laundering (AML) applications, as they can help find patterns of suspicious transactions.

We needed a solution that could scale and process millions of transactions by effectively using high memory and CPU configurations to perform complex queries quickly. As part of our customer demonstration to show how graph databases can help discover financial crimes, we also sourced a large dataset on which to test the solution. We used a graph database running on Amazon Elastic Kubernetes Service (Amazon EKS), together with Amazon Neptune, to search for suspicious financial chains across large amounts of transactional data in minutes.

Overview of our conceptual financial crime discovery solution

Figure 1. Workflow for financial crime discovery

We first needed to find a rules engine that could perform transaction inferencing and reasoning. It had to be able to process various rules on our data, be efficient, and able to scale. Next, we needed a straightforward way to ingest data into the solution. Once we had the data available, we needed to initiate a task to begin our inference job. Finally, we needed a location to store the result for further analysis and persistence, see Figure 1.

Using the AWS Cloud to scale up a graph database

We used multiple AWS services to create a fully automated end-to-end batch-based transaction process, shown in Figure 2. We used RDFox, an AWS Marketplace product created by Oxford Semantic Technologies. RDFox is a high-performance in-memory graph database and semantic reasoner. To orchestrate RDFox, we used the Amazon EKS Cluster Autoscaler to spin up a cluster to instantiate the RDFox container. Amazon EKS can spin up multiple containers for different inference jobs and recycle the resources when the jobs finish. We also used Amazon Neptune, a managed distributed graph database from Amazon that can store up to 64 TiB of results for diagnosis and long-term retention.

The data is stored in Amazon S3 buckets, which provide a streamlined way to feed a large dataset for processing.

Figure 2. Architecture diagram for financial crime discovery

Financial crimes rules

The power of graphs can help discover financial crimes that are reflected in relationships and monetary transactions. To demonstrate this, we will write two rules to detect two scenarios:

  1. Given two suspicious parties X and Y, find out if there is a transactional relationship between them, and if so, provide the chain that connects them. This is a common scenario that financial institutions must detect.
  2. Given a chain identifying suspicious behavior, find out if the minimum transaction amount that reached the beneficiary exceeds a threshold of Z dollars.

Generating data

To generate data for testing, we used a synthetic data generator written in Python (see the GitHub repo in References). The generator built two sets of graph artifacts – parties and transactions. Every transaction is paired with two random parties, and this iterative process creates a network of connected transactions and parties.

We created a dataset with a small percentage (0.01%) of parties tagged as "Suspicious Party" to simulate the preceding business scenario. Note that those parties will have transactions going both in and out. In some cases, this will collide with other suspicious parties and establish a chain. This method enables us to get simulated data without engineering the suspicious chains. The test dataset used with this solution comprises 1M transactions and thousands of parties.

For more information on generating data, see GitHub: Transaction Chains Data Generator.

Walkthrough of the financial investigator workflow

Once deployed, this solution can assist investigators as follows:

  1. An investigator places the transactions and party data (nt triples) in a subdirectory within the input bucket. Typically, subdirectories can be named as a date or range of dates. In addition, the investigator uploads the particular rules (dlog files) and queries (rq files) that must be processed on the data.
  2. Once the data is ready, the investigator uploads a job spec file (simple JSON; see the References section and the illustrative sketch after this list). This contains the description of what resources the job requires (CPUs and memory), along with other configurations for the job.
  3. Once the job spec has been uploaded to the bucket’s subdirectory, the job is automatically initiated. The Kubernetes scheduler will allocate enough resources to initiate the RDFox pod. The containers in the pod will then load, process, query results, and upload them to the output bucket.
  4. Once the data reaches the output bucket, an AWS Lambda function is initiated. This invokes the Amazon Neptune Bulk Loader, which asynchronously loads the results to the Neptune cluster in a parallelized manner (a sketch of such a function follows this list).
  5. Once the load completes, the investigator gets an email notification that the job has been completed, and the results are ready for view.
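A minimal sketch of the Lambda function from step 4, assuming an S3 "object created" notification as the trigger and using the Neptune Bulk Loader HTTP API (the endpoint, role ARN, and Region below are placeholders, not the solution's actual values):

import json
import os
import urllib3

http = urllib3.PoolManager()
# Placeholders: your Neptune loader endpoint and an IAM role Neptune can assume
NEPTUNE_LOADER = "https://my-neptune.cluster-abc.us-east-1.neptune.amazonaws.com:8182/loader"
LOADER_ROLE_ARN = os.environ.get("LOADER_ROLE_ARN", "arn:aws:iam::123456789012:role/NeptuneLoadFromS3")

def handler(event, context):
    # invoked by an S3 "object created" notification on the output bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        payload = {
            "source": f"s3://{bucket}/{key}",
            "format": "ntriples",          # the query results are N-Triples in this solution
            "iamRoleArn": LOADER_ROLE_ARN,
            "region": "us-east-1",
            "queueRequest": "TRUE",        # let Neptune queue and parallelize loads
        }
        resp = http.request("POST", NEPTUNE_LOADER,
                            body=json.dumps(payload),
                            headers={"Content-Type": "application/json"})
        print(resp.status, resp.data)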

Additionally:

  • The investigator can upload multiple rules and queries; they will all be processed automatically.
  • The investigator can launch multiple jobs with different/same data, and with different rules and queries at the same time.
  • All job outputs are saved in a unique job ID subdirectory in the output bucket.

To create the solution in your account, follow the instructions here: GitHub repo

Prerequisites

For this walkthrough, you should have the following prerequisites:

Rules

We create two materialization rule sets to fetch the two scenarios described.

1. detect-suspicious-parties-pair.dlog

The purpose of this first set of rules is to detect chains that might exist between two suspicious parties. The idea of these chains is to represent all the possible relationships that contain monetary transfers between a suspicious originator and the beneficiary. This will include non-suspicious parties in the chain. The rule tags these chains with the “SuspiciousChain” flag.

2. detect-chains-exceed-100-dollars.dlog

This set of rules tags the chains identified by the first rule set in which the minimum amount passed to the beneficiary exceeds $100. We can change the amount to check for different compliance requirements. We tag those chains as “HighValueChain.”
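To make the intent of these rule sets concrete, here is a plain-Python rendering of the same logic (the actual solution expresses this as RDFox Datalog; the data model follows the toy generator sketch above, and the hop bound is an assumption for illustration):

from collections import defaultdict, deque

def suspicious_chains(parties, transactions, threshold=100.0, max_hops=5):
    outgoing = defaultdict(list)
    for t in transactions:
        outgoing[t["from"]].append(t)
    suspicious = {p["id"] for p in parties if p["suspicious"]}
    chains = []
    for origin in suspicious:
        # breadth-first search from each suspicious originator, tracking the
        # smallest amount seen so far ("minimum that reached the beneficiary")
        queue = deque([(origin, [], float("inf"))])
        while queue:
            node, path, min_amount = queue.popleft()
            if path and node in suspicious:
                # SuspiciousChain; HighValueChain when the minimum exceeds the threshold
                chains.append({"path": path, "high_value": min_amount > threshold})
                continue
            if len(path) >= max_hops:
                continue
            for t in outgoing[node]:
                queue.append((t["to"], path + [t["id"]], min(min_amount, t["amount"])))
    return chains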

Run the job and check your results

Now we can run our job with the given data, rules, and two additional queries (to extract “SuspiciousChain” and “HighValueChain,” respectively). The results of the queries are loaded to Amazon Neptune automatically for persistent storage and made durably available for further analysis.

Let’s look at the results. The following query can be run against the RDFox console or Amazon Neptune.

Visualization

PREFIX : <http://oxfordsemantic.tech/transactions/entities#>
PREFIX prop: <http://oxfordsemantic.tech/transactions/properties#>
PREFIX type: <http://oxfordsemantic.tech/transactions/classes#>
PREFIX tt: <http://oxfordsemantic.tech/transactions/tupletables#>

SELECT ?S ?P ?O WHERE {
    ?S a type:SuspiciousChain .
    ?S ?P ?O .
}

Figure 3. Visualizing suspicious chains

Whoa! Figure 3 might look complicated at first, but this is because we are visualizing every pair of suspicious parties that have a relationship with another suspicious party. Let’s filter the query to look at a single chain in which the minimum amount that reached the beneficiary exceeds $100. The following query can be run against the RDFox console or Amazon Neptune.

PREFIX : <http://oxfordsemantic.tech/transactions/entities#>
PREFIX prop: <http://oxfordsemantic.tech/transactions/properties#>
PREFIX type: <http://oxfordsemantic.tech/transactions/classes#>
PREFIX tt: <http://oxfordsemantic.tech/transactions/tupletables#>

SELECT * WHERE {
    ?S ?P ?O
    {
        SELECT ?S WHERE {
            ?S a type:HighValueChain .
        } LIMIT 1
    }
}

Figure 4. Visualizing a particular chain

In Figure 4, we can see that Allison, the suspicious originator of the chain, has sent a transaction to Troy. Troy, who is not suspicious, sent the transaction to Karina. Karina is the suspicious beneficiary. We can also see additional information, such as the transaction amount that Karina received, and the chain length of 2 in this case.

In our testing, we were able to scale up to 500M transactions with 50M parties and process them in less than two hours! And we did this at a significantly lower cost compared with running constantly provisioned, fixed hardware of similar capacity.

Cleaning up

Follow the Terraform cleanup instructions.

Conclusion

Graph databases are a powerful tool for applying reasoning to complex financial relationships. The combination of Amazon Web Services and the RDFox engine results in an automated, scalable, and cost-effective solution, thanks to the dynamic Kubernetes Cluster Autoscaler. Customers can use this solution to provide their investigators with a tool for experimenting and reasoning on financial transactions. This solution simplifies the process and makes it easier to try different rules and queries on large, complex data collections.

This blog post was written with Oxford Semantic Technologies.

References

Solution:

Further reading

RDFox:

Other:

Scaling DLT to 1M TPS on AWS: Optimizing a Regulated Liabilities Network

Post Syndicated from Erica Salinas original https://aws.amazon.com/blogs/architecture/scaling-dlt-to-1m-tps-on-aws-optimizing-a-regulated-liabilities-network/

SETL is an open source, distributed ledger technology (DLT) company that enables tokenisation, digital custody, and DLT for securities markets and payments. In mid-2021, they developed a blueprint for a Regulated Liabilities Network (RLN) that enables holding and managing a variety of tokenized value irrespective of its form. In a December 2021 collaboration with Amazon Web Services (AWS), SETL demonstrated that their RLN platform could support one million transactions per second. These findings were published in their whitepaper, The Regulated Liability Network: Whitepaper on scalability and performance.

This AWS-hosted architecture scaled each processing component and the complete transaction flow. It was built using Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Compute Cloud (EC2), and Amazon Managed Streaming for Apache Kafka (MSK).

This post discusses the technical implementation of the simulated network. We show how scaling characteristics were achieved, while maintaining the business requirements of atomicity and finality. We also discuss how each RLN component was optimized for high performance.

Background: What is an RLN?

In The Regulated Internet of Value, Tony McLaughlin of Citi proposed a single network to record ownership of multiple token types, each of which represents a liability of a regulated entity. The RLN would give commercial banks, e-money providers, and central banks the ability to issue liabilities and immutably record all balances and owners. Because it is designed to maintain a globally sequenced set of state changes, an unambiguous ledger of token balances can be computed and updated. Transactions, which result in two or more ledger updates, are proposed to the network. They are authorized by all impacted network participants, ordered for distribution to participant ledgers, and then persisted by multiple participant ledgers. This process is completed with five processing components.

Figure 1. Core components and queues of RLN

Given the existing performance limitations faced by many DLT platforms, a main concern with the network design was its ability to meet throughput requirements. SETL collaborated with AWS to define an architecture that integrated decoupled processing components, as shown in Figure 2. The target throughput was 1,000,000 TPS for each component through the simulated network.

Figure 2. Simulated RLN architecture in the AWS Cloud

A queue – process – queue architectural model

The basic architectural model applied was “queue – process – queue.” Each component consumed transactions from one or more Kafka topics, performed its requisite activities, and produced output to a second Kafka topic. Components of the message flow used Amazon EKS, Amazon EC2, and Amazon MSK.
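As an illustration, a minimal "queue – process – queue" loop might look like the following kafka-python sketch (the broker address, topic names, and the pass-through transform are assumptions; the actual RLN components are SETL's own implementations):

from kafka import KafkaConsumer, KafkaProducer

def process(payload: bytes) -> bytes:
    # component-specific logic goes here; pass-through as a stub
    return payload

consumer = KafkaConsumer(
    "transaction-queue",                  # input topic (name assumed)
    bootstrap_servers="msk-broker:9092",  # MSK bootstrap broker (placeholder)
    group_id="scheduler",
)
producer = KafkaProducer(
    bootstrap_servers="msk-broker:9092",
    compression_type="gzip",              # mirrors the producer gzip tuning described below
)

for message in consumer:                  # consume -> process -> produce
    producer.send("proposal-queue", process(message.value))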

Specific techniques were applied to achieve the scaling characteristics of components in the main message flow.

  • A distinct EKS-managed node group was used for each RLN component. Each node within the group had its own EC2 instance that housed at most two Kubernetes Pods. This technique decreased potential network throttling and enabled the monitoring of pod resource utilization using Amazon CloudWatch at the EC2 level. This aided in identifying bottlenecks, which can be challenging if large EC2 nodes house many Kubernetes Pods.
  • Each RLN queue component had its own dedicated Kafka cluster. By isolating the Kafka topics to their own cluster, we were able to scale and monitor the cluster individually. This decreased potential chokepoints.
  • A combination of eksctl, kubectl, and EC2 Auto Scaling groups was used to deploy and horizontally scale each RLN component. The tooling enabled rapid results by controlling pod configuration, automating deployments, and supporting multiple performance testing iterations.

Fine tuning RLN system components

The key metric tested was message throughput in transactions per second (TPS) consumed, processed, and produced for each component. An end-to-end test was performed, which measured throughput of the complete simulated network. Each RLN component was tuned to meet this metric as follows.

The Transaction Generator acted as the customer entering transactions into the system (see Figure 3.) The Kafka producer property buffer.memory was changed to 536870912 to simulate an increase in producer TPS for each component.

Figure 3. Simulated RLN Transaction Generator

The Scheduler’s scaling technique was dedicated compute (one pod per EC2 node) listening to one Kafka partition in the Transaction Queue, as seen in Figure 4. Additionally, turning on gzip compression (compression.type) on the producer resolved a network bottleneck for the downstream Kafka cluster.

Figure 4. Simulated RLN Scheduler

The Approver’s scaling technique was similar to the Scheduler’s; however, the Proposal Queue had 10 topics as opposed to one. The Approver is stateless, so it randomly selects a Kafka topic and partition to listen to (a sketch of this selection follows Figure 5). When scaling to 100 Kubernetes Pods, there would be roughly one Kafka partition per pod. As shown in Figure 5, each pod has its own EC2 instance. The Assembler’s scaling techniques were the same as for the Scheduler and Approver components.

Figure 5. Simulated RLN Approver
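The Approver's random listener selection can be illustrated with this hedged kafka-python sketch (the broker address and the proposal-queue-N topic naming are assumptions):

import random

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="msk-broker:9092")
topic = f"proposal-queue-{random.randrange(10)}"          # pick one of the 10 proposal topics
partitions = consumer.partitions_for_topic(topic) or {0}  # set of partition ids for that topic
consumer.assign([TopicPartition(topic, random.choice(sorted(partitions)))])
for message in consumer:
    pass  # approve the proposal and produce the result downstream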

The Sequencer, as designed, cannot scale horizontally due to the nature of the sequencing algorithm. However, without any special tuning of the Kafka consumer configuration or any significant increase in EC2 sizing, it was able to achieve one million TPS. Its architecture is shown in Figure 6.

Figure 6. Simulated RLN Sequencer

The State Updater scaled with each Kubernetes Pod listening to an assigned Approved Proposal topic (one-to-one mapping) and receiving all sequenced hashes from the Hash Queue. Achieving 1 million TPS throughput required 10 nodes configured with c5.4xlarge, as seen in Figure 7.

Figure 7. Simulated RLN State Updater

Successful throughput testing

Test results demonstrate that each component can sustain over 1 million TPS. While horizontal scaling was applied, even a single instance achieved over 10,000 TPS. Individual component performance test results can be seen in Table 1.

Component       Single Instance (TPS)   # of Instances   Multiple Instances (TPS)
Generator       ~629,219                2                1,258,438
Scheduler       ~11,019                 100              1,101,928
Approver        ~10,348                 100              1,034,868
Assembler       ~10,691                 100              1,069,100
Sequencer       ~1,052,845              1                N/A
State Updater   ~100,256                10               1,002,569

Table 1. Test results per individual component

Network tests were performed with all components integrated to create a simulated RLN network. The tests were executed with transaction generation of ~1,000,000 TPS. Throughput was measured at the output of the final component’s (State Updater) instances (see Figure 8). It was also measured in aggregate at over 1.2 million TPS (see Figure 9).

Figure 8. Individual TPS for State Update component nodes

Figure 9. Aggregate TPS for State Update component nodes

While this study was designed to demonstrate capacity and throughput, the tests did replicate Kafka partitions across three Availability Zones in the selected AWS Region. Thus, testing demonstrated that the proposed architecture supports geographic resilience while meeting the throughput benchmark. Resiliency and security are areas of focus for testing in the next phases of architecting a production-ready version of RLN.

Looking forward

During the month of joint work between the SETL and AWS technical teams, we stood up and tuned basic RLN functionality. We successfully demonstrated the scalability of the environment to at least 1M TPS. By using Kafka queues and the horizontal and vertical scalability of AWS services, we addressed the primary concern of reaching production-level throughput.

The industry group collaborating on the RLN concept now has a solid technical foundation on which to build a production-ready RLN architecture. Future experimentation will include integration with financial institutions and technology partners that will provide the capabilities needed to build a successful RLN ecosystem.

Further explorations include enabling digital signing, verification at scale, and expanding the network to include financial institution environments integrated with their own RLN partitions.

Amazon Elastic Kubernetes Service Adds IPv6 Networking

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/amazon-elastic-kubernetes-service-adds-ipv6-networking/

Starting today, you can deploy applications that use IPv6 address space on Amazon Elastic Kubernetes Service (EKS).

Many of our customers are standardizing Kubernetes as their compute infrastructure platform for cloud and on-premises applications. Amazon EKS makes it easy to deploy containerized workloads. It provides highly available clusters and automates tasks such as patching, node provisioning, and updates.

Kubernetes uses a flat networking model that requires each pod to receive an IP address. This simplified approach enables low-friction porting of applications from virtual machines to containers but requires a significant number of IP addresses that many private VPC IPv4 networks are not equipped to handle. Some cluster administrators work around this IPv4 space limitation by installing container network plugins (CNI) that virtualize IP addresses a layer above the VPC, but this architecture limits an administrator’s ability to effectively observe and troubleshoot applications and has a negative impact on network performance at scale. Further, to communicate with internet services outside the VPC, traffic from IPv4 pods is routed through multiple network hops before reaching its destination, which adds latency and puts a strain on network engineering teams who need to maintain complex routing setups.

To avoid IP address exhaustion, minimize latency at scale, and simplify routing configuration, the solution is to use IPv6 address space.

IPv6 is not new. In 1996, I bought my first book on “IPng, Internet Protocol Next Generation”, as it was called 25 years ago. IPv6 provides a 128-bit address space, allowing 3.4 x 10^38 possible IP addresses for our devices, servers, or containers. We could assign an IPv6 address to every atom on the surface of the planet and still have enough addresses left to do another 100-plus Earths.

There are a few advantages to using Amazon EKS clusters with an IPv6 network. First, you can run more pods on one single host or subnet without the risk of exhausting all available IPv4 addresses in your VPC. Second, it allows for lower-latency communications with other IPv6 services, running on-premises, on AWS, or on the internet, by avoiding an extra NAT hop. Third, it relieves network engineers of the burden of maintaining complex routing configurations.

Kubernetes cluster administrators can focus on migrating and scaling applications without spending efforts working around IPv4 limits. Finally, pod networking is configured so that the pods can communicate with IPv4-based applications outside the cluster, allowing you to adopt the benefits of IPv6 on Amazon EKS without requiring that all dependent services deployed across your organization are first migrated to IPv6.

As usual, I built a short demo to show you how it works.

How It Works
Before I get started, I create an IPv6 VPC. I use this CDK script to create an IPv6-enabled VPC in a few minutes (thank you Angus Lees for the code). Just install CDK v2 (npm install -g aws-cdk@next) and deploy the stack (cdk bootstrap && cdk deploy).

When the VPC with IPv6 is created, I use the console to configure auto-assignment of IPv6 addresses to resources deployed in the public subnets (I do this for each public subnet).

auto assign IPv6 addresses in subnet

I take note of the subnet IDs created by the CDK script above (they are listed in the output of the script) and define a couple of variables I’ll use throughout the demo. I also create a cluster IAM role and a node IAM role, as described in the Amazon EKS documentation. When you already have clusters deployed, these two roles exist already.

I open a Terminal and type:


CLUSTER_ROLE_ARN="arn:aws:iam::0123456789:role/EKSClusterRole"
NODE_ROLE_ARN="arn:aws:iam::0123456789:role/EKSNodeRole"
SUBNET1="subnet-06000a8"
SUBNET2="subnet-03000cc"
CLUSTER_NAME="AWSNewsBlog"
KEYPAIR_NAME="my-key-pair-name"

Next, I create an Amazon EKS IPv6 cluster. In a terminal, I type:


aws eks create-cluster --cli-input-json "{
\"name\": \"${CLUSTER_NAME}\",
\"version\": \"1.21\",
\"roleArn\": \"${CLUSTER_ROLE_ARN}\",
\"resourcesVpcConfig\": {
\"subnetIds\": [
    \"${SUBNET1}\", \"${SUBNET2}\"
],
\"endpointPublicAccess\": true,
\"endpointPrivateAccess\": true
},
\"kubernetesNetworkConfig\": {
    \"ipFamily\": \"ipv6\"
}
}"

{
    "cluster": {
        "name": "AWSNewsBlog",
        "arn": "arn:aws:eks:us-west-2:486652066693:cluster/AWSNewsBlog",
        "createdAt": "2021-11-02T17:29:32.989000+01:00",
        "version": "1.21",

...redacted for brevity...

        "status": "CREATING",
        "certificateAuthority": {},
        "platformVersion": "eks.4",
        "tags": {}
    }
}

I use the describe-cluster command while waiting for the cluster to be created. When the cluster is ready, it has "status" : "ACTIVE".

aws eks describe-cluster --name "${CLUSTER_NAME}"
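Alternatively, a short boto3 sketch can wait for the cluster to become ACTIVE instead of polling by hand (the Region and cluster name below match the demo values; adjust as needed):

import boto3

eks = boto3.client("eks", region_name="us-west-2")
eks.get_waiter("cluster_active").wait(name="AWSNewsBlog")  # polls DescribeCluster until ACTIVE
print(eks.describe_cluster(name="AWSNewsBlog")["cluster"]["status"])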

Then I create a node group:

aws eks create-nodegroup                       \
        --cluster-name ${CLUSTER_NAME}         \
        --nodegroup-name AWSNewsBlog-nodegroup \
        --node-role ${NODE_ROLE_ARN}           \
        --subnets "${SUBNET1}" "${SUBNET2}"    \
        --remote-access ec2SshKey=${KEYPAIR_NAME}
		
{
    "nodegroup": {
        "nodegroupName": "AWSNewsBlog-nodegroup",
        "nodegroupArn": "arn:aws:eks:us-west-2:0123456789:nodegroup/AWSNewsBlog/AWSNewsBlog-nodegroup/3ebe70c7-6c45-d498-6d42-4001f70e7833",
        "clusterName": "AWSNewsBlog",
        "version": "1.21",
        "releaseVersion": "1.21.4-20211101",

        "status": "CREATING",
        "capacityType": "ON_DEMAND",

... redacted for brevity ...

}		

Once the node group is created, I see two EC2 instances in the console. I use the AWS Command Line Interface (CLI) to verify that the instances received an IPv6 address:

aws ec2 describe-instances --query "Reservations[].Instances[? State.Name == 'running' ][].NetworkInterfaces[].Ipv6Addresses" --output text 

2600:1f13:812:0000:0000:0000:0000:71eb
2600:1f13:812:0000:0000:0000:0000:3c07

I use the kubectl command to verify the cluster from a Kubernetes point of view.

kubectl get nodes -o wide

NAME                                       STATUS   ROLES    AGE     VERSION               INTERNAL-IP                              EXTERNAL-IP    OS-IMAGE         KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-0-108.us-west-2.compute.internal   Ready    <none>   2d13h   v1.21.4-eks-033ce7e   2600:1f13:812:0000:0000:0000:0000:2263   18.0.0.205   Amazon Linux 2   5.4.149-73.259.amzn2.x86_64   docker://20.10.7
ip-10-0-1-217.us-west-2.compute.internal   Ready    <none>   2d13h   v1.21.4-eks-033ce7e   2600:1f13:812:0000:0000:0000:0000:7f3e   52.0.0.122   Amazon Linux 2   5.4.149-73.259.amzn2.x86_64   docker://20.10.7

Then I deploy a Pod. I follow these steps in the EKS documentation. It deploys a sample nginx web server.

kubectl create namespace aws-news-blog
namespace/aws-news-blog created

# sample-service.yml is available at https://docs.aws.amazon.com/eks/latest/userguide/sample-deployment.html
kubectl apply -f  sample-service.yml 
service/my-service created
deployment.apps/my-deployment created

kubectl get pods -n aws-news-blog -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP                           NODE                                       NOMINATED NODE   READINESS GATES
my-deployment-5dd5dfd6b9-7rllg   1/1     Running   0          17m   2600:0000:0000:0000:405b::2   ip-10-0-1-217.us-west-2.compute.internal   <none>           <none>
my-deployment-5dd5dfd6b9-h6mrt   1/1     Running   0          17m   2600:0000:0000:0000:46f9::    ip-10-0-0-108.us-west-2.compute.internal   <none>           <none>
my-deployment-5dd5dfd6b9-mrkfv   1/1     Running   0          17m   2600:0000:0000:0000:46f9::1   ip-10-0-0-108.us-west-2.compute.internal   <none>           <none>

I take note of the IPv6 address of my pods and try to connect to one from my laptop. As my awesome service provider doesn’t provide me with IPv6 at home yet, the connection fails. This is expected, as the pods do not have an IPv4 address at all. Notice the -g option, telling curl not to treat : in the IP address as the separator for the port number, and -6, telling curl to connect through IPv6 only (required when you provide curl with a DNS hostname).

curl -g -6 http://\[2600:0000:0000:0000:46f9::1\]
curl: (7) Couldn't connect to server

To test IPv6 connectivity, I start a dual stack (IPv4 and IPv6) EC2 instance in the same VPC as the cluster. I SSH connect to the instance and try the curl command again. I see I receive the default HTML page served by nginx. IPv6 connectivity to the pod works!

curl -g -6 http://\[2600:0000:0000:0000:46f9::1\]
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>

... redacted for brevity ...

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

If it does not work for you, verify the security group for the cluster EC2 nodes and be sure it has a rule allowing incoming connections on port TCP 80 from ::/0.

A Few Things to Remember
Before I wrap up, I’d like to answer some frequent questions received from customers who have already experimented with this new capability:

Pricing and Availability
IPv6 support for your Amazon Elastic Kubernetes Service (EKS) cluster is available today in all AWS Regions where Amazon EKS is available, at no additional cost.

Go try it out and build your first IPv6 cluster today.

— seb

How Meshify Built an Insurance-focused IoT Solution on AWS

Post Syndicated from Grant Fisher original https://aws.amazon.com/blogs/architecture/how-meshify-built-an-insurance-focused-iot-solution-on-aws/

The ability to analyze your Internet of Things (IoT) data can help you prevent loss, improve safety, boost productivity, and even develop an entirely new business model. This data becomes even more valuable with the ever-increasing number of connected devices. Companies use Amazon Web Services (AWS) IoT services to build innovative solutions, including secure edge device connectivity, ingestion, storage, and IoT data analytics.

This post describes Meshify’s IoT sensor solution, built on AWS, that helps businesses and organizations prevent property damage and avoid loss for the property-casualty insurance industry. The solution uses real-time data insights, which result in fewer claims, better customer experience, and innovative new insurance products.

Through low-power, long-range IoT sensors, and dedicated applications, Meshify can notify customers of potential problems like rapid temperature decreases that could result in freeze damage, or rising humidity levels that could lead to mold. These risks can then be averted, instead of leading to costly damage that can impact small businesses and the insurer’s bottom line.

Architecture building blocks

The three building blocks of this technical architecture are the edge portfolio, data ingestion, and data processing and analytics, shown in Figure 1.

Figure 1. Building blocks of Meshify’s technical architecture

I. Edge portfolio (EP)

Starting with the edge sensors, the Meshify edge portfolio covers two types of sensors:

  • LoRaWAN (Low power, long range WAN) sensor suite. This sensor provides the long connectivity range (> 1000 feet) and extended battery life (~ 5 years) needed for enterprise environments.
  • Cellular-based sensors. This sensor is a narrow band/LTE-M device that operates at LTE-M band 2/4/12 radio frequency and uses edge intelligence to conserve battery life.

II. Data ingestion (DI)

For the LoRaWAN solution, aggregated sensor data at the Meshify gateway is sent to AWS using AWS IoT Core and Meshify’s REST service endpoints. AWS IoT Core is a managed cloud platform that lets IoT devices easily and securely connect using multiple protocols like HTTP, MQTT, and WebSockets. It expands its protocol coverage through a new fully managed feature called AWS IoT Core for LoRaWAN. This gives Meshify the ability to connect LoRaWAN wireless devices with the AWS Cloud. AWS IoT Core for LoRaWAN delivers a LoRaWAN network server (LNS) that provides gateway management using the Configuration and Update Server (CUPS) and Firmware Updates Over-The-Air (FUOTA) capabilities.

III. Data processing and analytics (DPA)

Initial processing of the data is done at the ingestion layer, using Meshify REST API endpoints and the Rules Engine of AWS IoT Core. Meshify applies filtering logic to route relevant events to Amazon Managed Streaming for Apache Kafka (Amazon MSK). Amazon MSK is an AWS streaming data service that manages Apache Kafka infrastructure and operations, streamlining the process of running Apache Kafka applications on AWS.
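As a hedged illustration of this kind of Rules Engine filter (the rule name, SQL statement, topic filter, and Kafka destination ARN are all hypothetical; Meshify's actual rules are not public), a boto3 sketch might look like this:

import boto3

iot = boto3.client("iot", region_name="us-east-1")
iot.create_topic_rule(
    ruleName="route_sensor_events_to_msk",  # hypothetical rule name
    topicRulePayload={
        # keep only events that carry a reading, and only the fields needed downstream
        "sql": "SELECT deviceId, temperature, humidity, timestamp "
               "FROM 'sensors/+/events' WHERE temperature <> ''",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [{
            "kafka": {
                "destinationArn": "arn:aws:iot:us-east-1:123456789012:ruledestination/vpc/abc123",
                "topic": "sensor-events",
                "clientProperties": {
                    "bootstrap.servers": "msk-broker:9092",
                    "security.protocol": "SSL",
                },
            },
        }],
    },
)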

Meshify’s applications then consume the events from Amazon MSK per the configured topic subscription. They enrich and correlate the events with records stored in a managed database service, Amazon Relational Database Service (RDS). These applications run as scalable containers on another managed service, Amazon Elastic Kubernetes Service (EKS).

Bringing it all together – technical workflow

In Figure 2, we illustrate the technical workflow from the ingestion of field events to their processing, enrichment, and persistence. Finally, we use these events to power risk avoidance decision-making.

Figure 2. Technical workflow for Meshify IoT architecture

  1. After installation, Meshify-designed LoRa sensors transmit information to the cloud through Meshify’s gateways. LoRaWAN capabilities create connectivity between the sensors and the gateways. They establish a low power, wide area network protocol that securely transmits data over a long distance, through walls and floors of even the largest buildings.
  2. The Meshify Gateway is a redundant edge system, capable of sending data from various sensors to the Meshify cloud environment. Once the LoRa sensor information is received by the Meshify Gateway, it converts the incoming radio frequency (RF) signals into a format that supports a faster transfer rate to Meshify’s cloud environment.
  3. Data from the Meshify Gateway and sensors is initially processed at Meshify’s AWS IoT Core and REST service endpoints. These destinations for IoT streaming data help with the initial intake and introduce field data to the Meshify cloud environment. The initial ingestion points can scale automatically based upon the volume of sensor data received. This enables rapid scaling and ease of implementation.
  4. After the data has entered the Meshify cloud environment, Meshify uses Amazon EKS and Amazon MSK to process the incoming data stream. Amazon MSK producer and consumer applications within the EKS systems enrich the data streams for the end users and systems to consume.
  5. Producer applications running on EKS send processed events to the Amazon MSK service. These events include storing and retrieval of raw data, enriched data, and system-level data.
  6. Consumer applications hosted on the EKS pods receive events per the subscribed Amazon MSK topic. Web, mobile, and analytic applications enrich and use these data streams to display data to end users, business teams, and systems operations.
  7. Processed events are persisted in Amazon RDS. The databases are used for reporting, machine learning, and other analytics and processing services.

Building a scalable IoT solution

Meshify first began work on the Meshify sensors and hosted platform in 2012. In the ensuing decade, Meshify has successfully created a platform to auto-scale upon demand with steady, predictable performance. This gave Meshify both the ability to use only the resources needed, and still have the capacity to handle unexpected voluminous data.

As the platform scaled, so did the volume of sensor data, operations and diagnostics data, and metadata from installations and deployments. Building an end-to-end data pipeline that integrates these different data sources and delivers correlated insights at low latency was time well spent.

Conclusion

In this post, we’ve shown how Meshify is using AWS services to power their suite of IoT sensors, software, and data platforms. Meshify’s most important architectural enhancements have involved the introduction of managed services, notably AWS IoT Core for LoRaWAN and Amazon MSK. These improvements have primarily focused on the data ingestion, data processing, and analytics stages.

Meshify continues to power the data revolution at the intersection of IoT and insurance at the edge, using AWS. Looking ahead, Meshify and HSB are excited at the prospect of scaling the relationship with AWS from cloud computing to the world of edge devices.

Learn more about how emerging startups and large enterprises are using AWS IoT services to build differentiated products.

Meshify is an IoT technology company and subsidiary of HSB, based in Austin, TX. Meshify builds pioneering sensor hardware, software, and data analytics solutions that protect businesses from property and equipment damage.

Ingesting Automotive Sensor Data using DXC RoboticDrive Ingestor on AWS

Post Syndicated from Bryan Berezdivin original https://aws.amazon.com/blogs/architecture/ingesting-automotive-sensor-data-using-dxc-roboticdrive-ingestor-on-aws/

This post was co-written by Pawel Kowalski, a Technical Product Manager for DXC RoboticDrive and Dr. Max Böhm, a software and systems architect and DXC Distinguished Engineer.

To build a fully autonomous vehicle (SAE Level 5), auto manufacturers collect sensor data from test vehicle fleets across the globe in their testing facilities and driving circuits. Collecting, ingesting, and managing massive amounts of data is hard enough before the data is used for any meaningful processing or for training the self-driving AI algorithms. In this blog post, we outline a solution using the DXC RoboticDrive Ingestor to address the following challenges faced by automotive original equipment manufacturers (OEMs) when ingesting sensor data from the car to the data lake:

  • Each test drive alone can gather hundreds of terabytes of data per fleet. A few test vehicles can produce petabytes of data in a day.
  • Vehicles collect data from all the sensors into local storage with fast disks. Disks must offload data to a buffer station and then to a data lake as soon as possible so that they can be reused in further test drives.
  • Data is stored in binary file formats like ROSbag, ADTF or MDF4, that are difficult to read for any pre-ingest checks and security scans.
  • On-premises (ingest location) to Cloud data ingestion requires a hybrid solution with reliable connectivity to the cloud.
  • An ingest process that deposits files into a data lake without organizing them first can cause discovery and read performance issues later in the analytics phase.
  • Orchestrating and scaling an ingest solution can involve several steps, such as copy, security scans, metadata extraction, anonymization, annotation, and more.

Overview of the RoboticDrive Ingestor solution

The DXC RoboticDrive Ingestor (RDI) solution addresses the core requirements as follows:

  • Cloud ready:  Implements a recommended architectural pattern where data from globally located logger copy stations are ingested into the cloud data lake in one or more AWS Regions over AWS Direct Connect or Site-to-Site VPN connections.
  • Scalable: Runs multiple steps as defined in an Ingest Pipeline, containerized on Amazon EKS based Amazon EC2 compute nodes that can auto-scale horizontally and vertically on AWS. Steps can be, for example, Virus Scan, Copy to Amazon S3, Quality Check, Metadata Extraction, and others.
  • Performant: Uses the Apache Spark-based DXC RoboticDrive Analyzer component to process large amounts of sensor data files efficiently in parallel, which will be described in a future blog post. Analyzer can read and process native automotive file formats, extract and prepare metadata, and populate metadata into the DXC RoboticDrive Meta Data Store. The maximum throughput of network and disk is leveraged, and data files are placed on the data lake following a hierarchical prefix domain model that aligns with best practices for S3 performance.
  • Operations ready: Containerized deployment, custom monitoring UI, visual workflow management with MWAA, Amazon CloudWatch based monitoring for virtual infrastructure and cloud native services.
  • Modular: Programmatically add or customize steps into the managed Ingest Pipeline DAG, defined in Amazon Managed Workflows for Apache Airflow (MWAA).
  • Security: IAM based access policy and technical roles, Amazon Cognito based UI authentication, virus scan, and an optional data anonymization step.

The ingest pipeline copies new files to S3 followed by user configurable steps, such as Virus Scan or Integrity Checks. When new files have been uploaded, cloud native S3 event notifications generate messages in an Amazon SQS queue. These are then consumed by serverless AWS Lambda functions or by containerized workloads in the EKS cluster, to asynchronously perform metadata extraction steps.

Walkthrough

There are two main areas of the ingest solution:

  1. On-Premises devices – these are dedicated copy stations in the automotive OEM’s data center where data cartridges are physically connected. Copy stations are easily configurable devices that automatically mount inserted data cartridges and share its content via NFS. At the end of the mounting process, they call an HTTP post API to trigger the ingest process.
  2. Managed Ingest Pipeline – this is a Managed service in the cloud to orchestrate the ingest process. As soon as a cartridge is inserted into the Copy Station, mounted and shared via NFS, the ingest process can be started.

Figure 1 – Architecture showing the DXC RoboticDrive Ingestor (RDI) solution

Depending on the customer’s requirements, the ingest process may differ. It can be easily modified using predefined Docker images. A configuration file defines which features from the Ingest stack should be used. It can be a simple copy from the source to the cloud, but we can easily extend it with additional steps, such as a virus scan, special data partitioning, data quality checks, or even scripts provided by the automotive OEM.

The ingest process is orchestrated by a dedicated MWAA pipeline built dynamically from the configuration file. We can easily track the progress, see the logs and, if required, stop or restart the process. Progress is tracked at a detailed level, including the number and volume of data uploaded, remaining files, estimated finish time, and ingest summary, all visualized in a dedicated ingest UI.

Prerequisites

Ingest pipeline

To build an MWAA pipeline, we need to choose from prebuilt components of the ingestor and add them into the JSON configuration, which is stored in an Airflow variable:

{
    "tasks" : [
        {
            "name" : "Virus_Scan",
            "type" : "copy",
            "srcDir" : "file:///mnt/ingestor/CopyStation1/",
            "quarantineDir" : "s3a://k8s-ingestor/CopyStation1/quarantine/”
        },
        {
            "name" : "Copy_to_S3",
            "type" : "copy",
            "srcDir" : "file:///mnt/ingestor/car01/test.txt",
            "dstDir" : "s3a://k8s-ingestor/copy_dest/",
            "integrity_check" : false
        },
        {
            "name" : "Quality_Check",
            "type" : "copy",
            "srcDir" : "file:///mnt/ingestor/car01/test.txt",
            "InvalidFilesDir" : "s3a://k8s-ingestor/CopyStation1/invalid/",
        }
    ]
}

The final ingest pipeline, a Directed Acyclic Graph (DAG), is built automatically from the configuration and changes are visible within seconds:

Figure 2 – Screenshot showing the final ingest pipeline, a Directed Acyclic Graph (DAG)
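A hedged sketch of how such a DAG can be generated from the JSON variable above (the variable name, operator choice, and linear chaining are assumptions; the real pipeline runs DXC's prebuilt Docker images):

import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# "ingest_config" is an assumed variable name holding the JSON shown above
config = json.loads(Variable.get("ingest_config"))

with DAG("ingest_dag", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    previous = None
    for task_cfg in config["tasks"]:
        task = BashOperator(
            task_id=task_cfg["name"],
            bash_command=f"echo running {task_cfg['name']} on {task_cfg['srcDir']}",  # placeholder step
        )
        if previous is not None:
            previous >> task  # chain tasks in the order they appear in the config
        previous = task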

If required, external tools or processes can also call ingest components using the Ingest API, which is shown in the following screenshot:

Figure 3 – Screenshot showing a tool being used to call ingest components using the Ingest API

The usual ingest process consists of the following steps:

1. Ingest the data source (cartridge) into the Copy Station.

2. Wait for the Copy Station to report proper cartridge mounting. In case a dedicated device (Copy Station) is not used and you want to use an external disc as your data source, it can be mounted manually:

mount -o ro /dev/sda /media/disc

Here, /dev/sda is the disc containing your data and /media/disc is the mountpoint, which must also be added to /etc/exports.

3. Every time the MWAA ingest pipeline (DAG) is triggered, a Kubernetes job is created through a Helm template and Helm values which add the above data source as a persistent volume (PV) mount inside the job.

Helm template is as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: {{ .name }}
  namespace: {{ .Values.namespace }}
spec:
  capacity:
    storage: {{ .storage }}
  accessModes:
    - {{ .accessModes }}
  nfs:
    server: {{ .server }}
    path: {{ .path }}

Helm values:

name: "YourResourceName"

server: 10.1.1.44

path: "/cardata"

mountPath: "/media/disc"

storage: 100Mi

accessModes: ReadOnlyMany

4. Trigger the ingest manually from the MWAA UI – this can also be triggered automatically from a Copy Station by calling the MWAA API:

curl -X POST "https://roboticdrive/api/v1/dags/ingest_dag/dagRuns" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"dag_run_id\":\"IngestDAG\"}"

5. Watch the ingest progress – this step can be omitted, as the ingest results (success or failure) can be sent by email or reported to a ticketing system.

6. Once the upload succeeds, turn off the Copy Station manually, or make the shutdown the last task of the ingest DAG, as follows:

(…)

def ssh_operator(task_identifier: str, cmd: str):
    return SSHOperator(task_id=task_identifier,
                       command=cmd + ' ',
                       remote_host=CopyStation,
                       ssh_hook=sshHook,
                       trigger_rule=TriggerRule.NONE_FAILED,
                       do_xcom_push=True,
                       dag=IngestDAG)

host_shutdown = ssh_operator(task_identifier='host_shutdown',
                             cmd='sudo shutdown -h +5 &')

When launched, you can track the progress and basic KPIs on the Monitoring UI:

Figure 4 – Screenshot showing basic KPIs on the Monitoring UI

Conclusion

In this blog post, we showed how to set up the DXC RoboticDrive Ingestor. This solution was designed to overcome several data ingestion challenges and resulted in the following positive results:

  • The Time-to-Analyze KPI decreased by 30% in our tests, local offload steps became obsolete, and data offloaded from cars became accessible in a globally available data lake (Amazon S3) for further processing and analysis.
  • MWAA integrated with EKS made the solution flexible, fault tolerant, as well as easy to manage and maintain.
  • Autoscaling the compute capacity as needed for any pre-processing steps within ingestion, helped with boosting performance and productivity.
  • On-premises to AWS Cloud connectivity options provided flexibility in configuring and scaling (up and down), and delivered optimal performance and bandwidth.

We hope you found this solution insightful and look forward to hearing your feedback in the comments!

Dr. Max Böhm

Dr. Max Böhm is a software and systems architect and DXC Distinguished Engineer with over 20 years’ experience in the IT industry and 9 years of research and teaching at universities. He advises customers across industries on their digital journeys, and has created innovative solutions, including real-time management views for global IT infrastructures and data correlation tools that simplify consumption-based billing. He currently works as a solution architect for DXC Robotic Drive. Max has authored or co-authored over 20 research papers and four patents. He has participated in key industry-research collaborations including projects at CERN.

Pawel Kowalski

Pawel Kowalski is a Technical Product Manager for DXC RoboticDrive where he leads the Data Value Services stream. His current area of focus is to drive solution development for large-scale (PB) end-to-end data ingestion use cases, ensuring performance and reliability. With over 10 years of experience in Big Data Analytics & Business Intelligence, Pawel has designed and delivered numerous customer tailored solutions.

Automate Container Anomaly Monitoring of Amazon Elastic Kubernetes Service Clusters with Amazon DevOps Guru

Post Syndicated from Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/automate-container-anomaly-monitoring-of-amazon-elastic-kubernetes-service-clusters-with-amazon-devops-guru/

Observability in a container-centric environment presents new challenges for operators due to the increasing number of abstractions and supporting infrastructure. In many cases, organizations can have hundreds of clusters and thousands of services/tasks/pods running concurrently. This post will demonstrate new features in Amazon DevOps Guru to help simplify and expand the capabilities of the operator. The features include grouping anomalies by metric and container cluster to improve context and simplify access and support for additional Amazon CloudWatch Container Insight metrics. An example of these capabilities in action would be that Amazon DevOps Guru can now identify anomalies in CPU, memory, or networking within Amazon Elastic Kubernetes Service (EKS), notifying the operators and letting them more easily navigate to the affected cluster to examine the collected data.

Amazon DevOps Guru offers a fully managed AIOps platform service that lets developers and operators improve application availability and resolve operational issues faster. It minimizes manual effort by leveraging machine learning (ML) powered recommendations. Its ML models take advantage of the expertise of AWS in operating highly available applications for the world’s largest ecommerce business for over 20 years. DevOps Guru automatically detects operational issues, predicts impending resource exhaustion, details likely causes, and recommends remediation actions.

Solution Overview

In this post, we will demonstrate the new Amazon DevOps Guru features around cluster grouping and additionally supported Amazon EKS metrics. To demonstrate these features, we will show you how to create a Kubernetes cluster, instrument the cluster using AWS Distro for OpenTelemetry, and then configure Amazon DevOps Guru to automate anomaly detection of EKS metrics. A previous blog provides detail on the AWS Distro for OpenTelemetry collector that is employed here.

Prerequisites

EKS Cluster Creation

We employ the eksctl CLI tool to create an Amazon EKS cluster. Using eksctl, you can provide details on the command line or specify a manifest file. The following manifest is used to create a single managed node using Amazon Elastic Compute Cloud (EC2); the cluster is constrained to the specified Region via the metadata/region entry and to Availability Zones via the managedNodeGroups/availabilityZones entry. By default, this will create a new VPC with eight subnets.

# An example of ClusterConfig object using Managed Nodes
---
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: devopsguru-eks-cluster
      region: <SPECIFY_REGION_HERE>
      version: "1.21"

    availabilityZones: ["<FIRST_AZ>","<SECOND_AZ>"]
    managedNodeGroups:
      - name: managed-ng-private
        privateNetworking: true
        instanceType: t3.medium
        minSize: 1
        desiredCapacity: 1
        maxSize: 6
        availabilityZones: ["<SPECIFY_AVAILABILITY_ZONE(S)_HERE"]
        volumeSize: 20
        labels: {role: worker}
        tags:
          nodegroup-role: worker
    cloudWatch:
      clusterLogging:
        enableTypes:
          - "api"

  • To create an Amazon EKS cluster using eksctl and a manifest file, we use eksctl create as shown below. Note that this step will take 10 – 15 minutes to establish the cluster.
$ eksctl create cluster -f devopsguru-managed-node.yaml
2021-10-13 10:44:53 [i] eksctl version 0.69.0
…
2021-10-13 11:04:42 [✔] all EKS cluster resources for "devopsguru-eks-cluster" have been created
2021-10-13 11:04:44 [i] nodegroup "managed-ng-private" has 1 node(s)
2021-10-13 11:04:44 [i] node "<ip>.<region>.compute.internal" is ready
2021-10-13 11:04:44 [i] waiting for at least 1 node(s) to become ready in "managed-ng-private"
2021-10-13 11:04:44 [i] nodegroup "managed-ng-private" has 1 node(s)
2021-10-13 11:04:44 [i] node "<ip>.<region>.compute.internal" is ready
2021-10-13 11:04:47 [i] kubectl command should work with "/Users/<user>/.kube/config"
  • Once this is complete, you can use kubectl, the Kubernetes CLI, to access the managed nodes that are running.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
<ip>.<region>.compute.internal Ready <none> 76m v1.21.4-eks-033ce7e

AWS Distro for OpenTelemetry Collector Installation

We will use AWS Distro for OpenTelemetry Collector to extract metrics from a pod running in Amazon EKS. This will collect metrics within the Kubernetes cluster and surface them to Amazon CloudWatch. We start by defining a policy to allow access. The following information comes from the post here.

Attach the CloudWatchAgentServerPolicy IAM Policy to worker node

  • Open the Amazon EC2 console.
  • Select one of the worker node instances, and choose the IAM role in the description.
  • On the IAM role page, choose Attach policies.
  • In the list of policies, select the check box next to CloudWatchAgentServerPolicy. You can use the search box to find this policy.
  • Choose Attach policies.

Deploy AWS OpenTelemetry Collector on Amazon EKS

Next, you will deploy the AWS Distro for OpenTelemetry using a GitHub hosted manifest.

  • Deploy the artifact to the Amazon EKS cluster using the following command:
$ curl https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml | kubectl apply -f -
  • View the resources in the aws-otel-eks namespace.
$ kubectl get pods -l name=aws-otel-eks-ci -n aws-otel-eks
NAME READY STATUS RESTARTS AGE
aws-otel-eks-ci-jdf2w 1/1 Running 0 107m

View Container Insight Metrics in Amazon CloudWatch

Access Amazon CloudWatch and select Metrics, All metrics to view the published metrics. Under Custom Namespaces, ContainerInsights is selectable. Under this, one can view metrics at the cluster, node, pod, namespace, and service granularity. The following example shows pod level metrics of CPU:

The AWS Console with Amazon Cloudwatch Container Insights Pod Level CPU Utilization.

Amazon Simple Notification Service

Amazon DevOps Guru must be allowed to publish to Amazon SNS so that notifications can be delivered. During the setup process, an Amazon SNS Topic is created, and the following resource policy is applied:

{
    "Sid": "DevOpsGuru-added-SNS-topic-permissions",
    "Effect": "Allow",
    "Principal": {
        "Service": "region-id.devops-guru.amazonaws.com"
    },
    "Action": "sns:Publish",
    "Resource": "arn:aws:sns:region-id:topic-owner-account-id:my-topic-name",
    "Condition" : {
      "StringEquals" : {
        "AWS:SourceArn": "arn:aws:devops-guru:region-id:topic-owner-account-id:channel/devops-guru-channel-id",
        "AWS:SourceAccount": "topic-owner-account-id"
    }
  }
}

Amazon DevOps Guru

Amazon DevOps Guru can now be leveraged to monitor the Amazon EKS cluster and managed node group. To do this, select Amazon DevOps Guru, and then select Get started, as shown in the following figure.

The Amazon DevOps Guru service via the AWS Console.

Once selected, the Get started console displays, letting you specify the IAM role for DevOps Guru to access the appropriate resources.

The Get started dialog for Amazon DevOps Guru including instructions on how the service operates, IAM Role Permissions and Amazon DevOps Guru analysis coverage.

Under the Amazon DevOps Guru analysis coverage, Choose later is selected. This will let us specify the CloudFormation stacks to monitor. Select Create a new SNS topic, and provide a name. This will be used to collect notifications and allow for subscribers to then be notified. Select Enable when complete.

The Amazon DevOps Guru analysis coverage allowing the user to select all resources in a region or to choose later. In addition the image shows the dialog that requests the user specify an Amazon SNS topic for notification when insights occur.

On the Manage DevOps Guru analysis coverage, select Analyze all AWS resources in the specified CloudFormation stacks in this Region. Then, select the cluster and managed node group AWS CloudFormation stacks so that DevOps Guru can monitor Amazon EKS.

A dialog where the user is able to specify the AWS CloudFormation stacks in a region for analysis coverage. Two stacks are select including the eks cluster and eks cluster managed node group.

Once this is selected, the display will update indicating that two CloudFormation stacks were added.

Amazon DevOps Guru Settings including DevOps Guru analysis coverage and Amazon SNS notifications.

Amazon DevOps Guru will finally start analysis for those two stacks. This will take several hours to collect data and to identify normal operating conditions. Once this process is complete, the Dashboard will display that those resources have been analyzed, as shown in the following figure.

The completed analysis by DevOps guru of the two AWS Cloudformation stacks indicating a healthy status for both.

Enable Encryption on Amazon SNS Topic

The Amazon SNS Topic created by Amazon DevOps Guru will not enable encryption by default. It is important to enable this feature to encrypt notifications at rest. Go to Amazon SNS, select the topic that is created and then Edit topic. Open the Encryption dialog box and enable encryption as shown in the following figure, specifying an alias, or accepting the default.

The Encryption dialog for Amazon SNS topic when it is Edited.
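Enabling encryption at rest can also be scripted. A boto3 sketch follows (the topic ARN is a placeholder; the alias shown is the AWS managed SNS key):

import boto3

sns = boto3.client("sns", region_name="us-east-1")
sns.set_topic_attributes(
    TopicArn="arn:aws:sns:us-east-1:123456789012:devops-guru-topic",  # placeholder ARN
    AttributeName="KmsMasterKeyId",
    AttributeValue="alias/aws/sns",  # or a customer managed KMS key alias
)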

Deploy Sample Application on Amazon EKS To Trigger Insights

You will employ a sample application that is part of the AWS Distro for OpenTelemetry Collector to simulate failure. Using the following manifest, you will deploy a sample application that has pod resource limits for memory and CPU shares. These limits are artificially low and insufficient for the pod to run. The pod will exceed memory and will be identified for eviction by Amazon EKS. When it is evicted, it will attempt to be redeployed per the manifest requirement for a replica of one. In turn, this will repeat the process and generate memory and pod restart errors in Amazon CloudWatch. For this example, the deployment was left for over an hour, thereby causing the pod failure to repeat numerous times. The following is the manifest that you will create on the filesystem.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: java-sample-app
  namespace: aws-otel-eks
  labels:
    name: java-sample-app
spec:
  replicas: 1
  selector:
    matchLabels:
      name: java-sample-app
  template:
    metadata:
      labels:
        name: java-sample-app
    spec:
      containers:
        - name: aws-otel-emitter
          image: aottestbed/aws-otel-collector-java-sample-app:0.9.0
          resources:
            limits:
              memory: "128Mi"
              cpu: "200m"
          ports:
          - containerPort: 4567
          env:
          - name: OTEL_OTLP_ENDPOINT
            value: "localhost:4317"
          - name: OTEL_RESOURCE_ATTRIBUTES
            value: "service.namespace=AWSObservability,service.name=CloudWatchEKSService"
          - name: S3_REGION
            value: "us-east-1"
          imagePullPolicy: Always

To deploy the application, use the following command:

$ kubectl apply -f <manifest file name>
deployment.apps/java-sample-app created

Scenario: Improved context from DevOps Guru Container Cluster Grouping and Increased Metrics

For our scenario, Amazon DevOps Guru is monitoring additional Amazon CloudWatch Container Insight Metrics for EKS. The following figure shows the flow of information and eventual notification of the operator, so that they can examine the Amazon DevOps Guru Insight. Starting at step 1, the container agent (AWS Distro for OpenTelemetry) forwards container metrics to Amazon CloudWatch. In step 2, Amazon DevOps Guru is continually consuming those metrics and performing anomaly detection. If an anomaly is detected, then this generates an Insight, thereby triggering Amazon SNS notification as shown in step 3. In step 4, the operators access Amazon DevOps Guru console to examine the insight. Then, the operators can leverage the new user interface capability displaying which cluster, namespace, and pod/service is impacted along with correlated Amazon EKS metric(s).

New EKS Container Metrics in DevOps Guru

As part of the release, the following pod and node metrics are now tracked by DevOps Guru (a query sketch follows this list):

  • pod_number_of_container_restarts – number of times that a pod is restarted (e.g., image pull issues, container failure).
  • pod_memory_utilization_over_pod_limit – memory that exceeds the pod limit called out in resource memory limits.
  • pod_cpu_utilization_over_pod_limit – CPU shares that exceed the pod limit called out in resource CPU limits.
  • pod_cpu_utilization – percent CPU Utilization within an active pod.
  • pod_memory_utilization – percent memory utilization within an active pod.
  • node_network_total_bytes – total bytes over the network interface for the managed node (e.g., EC2 instance).
  • node_filesystem_utilization – percent file system utilization for the managed node (e.g., EC2 instance).
  • node_cpu_utilization – percent CPU Utilization within a managed node (e.g., EC2 instance).
  • node_memory_utilization – percent memory utilization within a managed node (e.g., EC2 instance).
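
These metrics are emitted in the ContainerInsights namespace in Amazon CloudWatch. If you want to confirm that your cluster is publishing them before DevOps Guru raises an insight, you can list them with the AWS CLI (an optional sanity check, using the cluster name from this walkthrough):

$ aws cloudwatch list-metrics \
    --namespace ContainerInsights \
    --metric-name pod_number_of_container_restarts \
    --dimensions Name=ClusterName,Value=devopsguru-eks-cluster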

Operator Scenario

The Kubernetes operator in the following figure is informed of an insight via Amazon SNS. The Amazon SNS message content appears in the following code, showing the originator and the fields identifying the InsightDescription, InsightSeverity, the name of the container metric, and the pod and EKS cluster:

{
  "AccountId": "XXXXXXX",
  "Region": "<REGION>",
  "MessageType": "NEW_INSIGHT",
  "InsightId": "ADFl69Pwq1Aa6M373DhU0zkAAAAAAAAABuZzSBHxeiNexxnLYD7Lhb0vuwY9hLtz",
  "InsightUrl": "https://<REGION>.console.aws.amazon.com/devops-guru/#/insight/reactive/ADFl69Pwq1Aa6M373DhU0zkAAAAAAAAABuZzSBHxeiNexxnLYD7Lhb0vuwY9hLtz",
  "InsightType": "REACTIVE",
  "InsightDescription": "ContainerInsights pod_number_of_container_restarts Anomalous In Stack eksctl-devopsguru-eks-cluster-cluster",
  "InsightSeverity": "high",
  "StartTime": 1636147920000,
  "Anomalies": [
    {
      "Id": "ALAGy5sIITl9e6i66eo6rKQAAAF88gInwEVT2WRSTV5wSTP8KWDzeCYALulFupOQ",
      "StartTime": 1636147800000,
      "SourceDetails": [
        {
          "DataSource": "CW_METRICS",
          "DataIdentifiers": {
            "name": "pod_number_of_container_restarts",
            "namespace": "ContainerInsights",
            "period": "60",
            "stat": "Average",
            "unit": "None",
            "dimensions": "{\"PodName\":\"java-sample-app\",\"ClusterName\":\"devopsguru-eks-cluster\",\"Namespace\":\"aws-otel-eks\"}"
          }
        ....
  "awsInsightSource": "aws.devopsguru"
}

The Amazon DevOps Guru console collects the insights under the Insights selection, as shown in the following figure. Select Insights to view the details.

Amazon DevOps Guru Insights. An insight is displayed with a status of Ongoing and Severity of High.
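
You can also retrieve ongoing reactive insights with the AWS CLI (an optional alternative to the console; the output includes the InsightId and severity shown above):

$ aws devops-guru list-insights \
    --status-filter '{"Ongoing": {"Type": "REACTIVE"}}'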

The Aggregated Metrics view identifies the EKS container metrics that have errored, in this case pod_memory_utilization_over_pod_limit and pod_number_of_container_restarts.

Aggregated Metrics panel with pod_memory_utilization_over_pod_limit and pod_number_of_container_restarts for the Amazon EKS cluster named devopsguru-eks-cluster. A timeline with time and date conveys the duration of the anomaly.

Further details can be identified by selecting and expanding each insight as shown in the following figure.

The cluster metrics can be expanded to provide further information on the PodName, Namespace, and ClusterName. A search bar allows filtering by name, stack, or service name.

Note that the display provides information about the Cluster, PodName, and Namespace. This helps operators who maintain large numbers of EKS clusters to quickly isolate the offending pod, its namespace, and the EKS cluster it belongs to. A search bar provides further filtering by name, stack, or service name.

Cleaning Up

Follow these steps to delete the resources and prevent additional charges from being posted to your account.

Amazon EKS Cluster Cleanup

Follow these steps to detach the customer managed policy and delete the cluster.

  • Detach the customer managed policy, AWSDistroOpenTelemetryPolicy, via the IAM console.
  • Delete the cluster using eksctl:
$ eksctl delete cluster devopsguru-eks-cluster --region <region>
2021-10-13 14:08:28 [i] eksctl version 0.69.0
2021-10-13 14:08:28 [i] using region <region>
2021-10-13 14:08:28 [i] deleting EKS cluster "devopsguru-eks-cluster"
2021-10-13 14:08:30 [i] will drain 0 unmanaged nodegroup(s) in cluster "devopsguru-eks-cluster"
2021-10-13 14:08:32 [i] deleted 0 Fargate profile(s)
2021-10-13 14:08:33 [✔] kubeconfig has been updated
2021-10-13 14:08:33 [i] cleaning up AWS load balancers created by Kubernetes objects of Kind Service or Ingress
2021-10-13 14:09:02 [i] 2 sequential tasks: { delete nodegroup "managed-ng-private", delete cluster control plane "devopsguru-eks-cluster" [async] }
2021-10-13 14:09:02 [i] will delete stack "eksctl-devopsguru-eks-cluster-nodegroup-managed-ng-private"
2021-10-13 14:09:02 [i] waiting for stack "eksctl-devopsguru-eks-cluster-nodegroup-managed-ng-private" to get deleted
2021-10-13 14:12:30 [i] will delete stack "eksctl-devopsguru-eks-cluster-cluster"
2021-10-13 14:12:30 [✔] all cluster resources were deleted

Conclusion

The previous scenarios demonstrated the new cluster organization and the additional container metrics. Both features make it easier for an operator to identify issues within a container cluster when Amazon DevOps Guru detects anomalies. You can start building your own solutions that employ the Amazon CloudWatch agent or the AWS Distro for OpenTelemetry agent together with Amazon DevOps Guru by reading the documentation, which provides a conceptual overview and practical examples to help you understand the features of Amazon DevOps Guru and how to use them.

About the authors

Rahul Sharad Gaikwad

Rahul Sharad Gaikwad is a Lead Consultant – DevOps with AWS. He helps customers and partners on their Cloud and DevOps adoption journey. He is passionate about technology and enjoys collaborating with customers. In his spare time, he focuses on his PhD Research work. He also enjoys gymming and spending time with his family.

Leo Da Silva

Leo Da Silva is a Partner Solution Architect Manager at AWS and uses his knowledge to help customers better utilize cloud services and technologies. Over the years, he had the opportunity to work in large, complex environments, designing, architecting, and implementing highly scalable and secure solutions to global companies. He is passionate about football, BBQ, and Jiu Jitsu — the Brazilian version of them all.

Chris Riley

Chris Riley is a Senior Solutions Architect working with Strategic Accounts, providing support in industry segments including healthcare, financial services, public sector, automotive, and manufacturing via managed AI/ML, IoT, and serverless services.

Unify log aggregation and analytics across compute platforms

Post Syndicated from Hari Ohm Prasath original https://aws.amazon.com/blogs/big-data/unify-log-aggregation-and-analytics-across-compute-platforms/

Our customers want to make sure their users have the best experience running their applications on AWS. To make this happen, you need to monitor and fix software problems as quickly as possible. Doing so becomes challenging as the volume of data that must be quickly detected, analyzed, and stored keeps growing. In this post, we walk you through an automated process to aggregate and monitor application log data in near-real time, so you can remediate application issues faster.

This post shows how to unify and centralize logs across different computing platforms. With this solution, you can unify logs from Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Kinesis Data Firehose, and AWS Lambda using agents, log routers, and extensions. We use Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) with OpenSearch Dashboards to visualize and analyze the logs collected across the different computing platforms and get application insights. You can deploy the solution using the AWS Cloud Development Kit (AWS CDK) scripts provided as part of the solution.

Customer benefits

A unified aggregated log system provides the following benefits:

  • A single point of access to all the logs across different computing platforms
  • A standard way to define and transform logs before they get delivered to downstream systems like Amazon Simple Storage Service (Amazon S3), Amazon OpenSearch Service, Amazon Redshift, and other services
  • The ability to use Amazon OpenSearch Service to quickly index logs, and OpenSearch Dashboards to search and visualize them across routers, applications, and other devices

Solution overview

In this post, we use the following services to demonstrate log aggregation across different compute platforms:

  • Amazon EC2 – A web service that provides secure, resizable compute capacity in the cloud. It’s designed to make web-scale cloud computing easier for developers.
  • Amazon ECS – A web service that makes it easy to run, scale, and manage Docker containers on AWS, designed to make the Docker experience easier for developers.
  • Amazon EKS – A managed service that makes it easy to run Kubernetes on AWS without needing to install and operate your own Kubernetes control plane.
  • Kinesis Data Firehose – A fully managed service that makes it easy to stream data to Amazon S3, Amazon Redshift, or Amazon OpenSearch Service.
  • Lambda – A compute service that lets you run code without provisioning or managing servers.
  • Amazon OpenSearch Service – A fully managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more.

The following diagram shows the architecture of our solution.

The architecture uses various log aggregation tools, such as log agents, log routers, and Lambda extensions, to collect logs from multiple compute platforms and deliver them to Kinesis Data Firehose. Kinesis Data Firehose streams the logs to Amazon OpenSearch Service. Log records that fail to be persisted in Amazon OpenSearch Service are written to Amazon S3. To scale this architecture, each compute platform streams its logs to a different Firehose delivery stream, which is added as a separate index and rotated every 24 hours.

The following sections demonstrate how the solution is implemented on each of these computing platforms.

Amazon EC2

The Kinesis agent collects and streams logs from the applications running on EC2 instances to Kinesis Data Firehose. The agent is a standalone Java software application that offers an easy way to collect and send data to Kinesis Data Firehose. The agent continuously monitors files and sends logs to the Firehose delivery stream.


The AWS CDK script provided as part of this solution deploys a simple PHP application that generates logs under the /etc/httpd/logs directory on the EC2 instance. The Kinesis agent is configured via /etc/aws-kinesis/agent.json to collect data from access_logs and error_logs, and stream them periodically to Kinesis Data Firehose (ec2-logs-delivery-stream).
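
For reference, the relevant part of agent.json looks roughly like the following (a sketch, assuming the us-east-1 Firehose endpoint; the CDK script writes the actual file for you):

{
  "firehose.endpoint": "firehose.us-east-1.amazonaws.com",
  "flows": [
    {
      "filePattern": "/etc/httpd/logs/access_log*",
      "deliveryStream": "ec2-logs-delivery-stream"
    },
    {
      "filePattern": "/etc/httpd/logs/error_log*",
      "deliveryStream": "ec2-logs-delivery-stream"
    }
  ]
}

If you edit this file by hand, restart the agent (sudo service aws-kinesis-agent restart) so that it picks up the new flows.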

Because Amazon OpenSearch Service expects data in JSON format, you can add a call to a Lambda function to transform the log data to JSON format within Kinesis Data Firehose before streaming to Amazon OpenSearch Service. The following is a sample input for the data transformer:

46.99.153.40 - - [29/Jul/2021:15:32:33 +0000] "GET / HTTP/1.1" 200 173 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"

The following is our output:

{
    "logs" : "46.99.153.40 - - [29/Jul/2021:15:32:33 +0000] \"GET / HTTP/1.1\" 200 173 \"-\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36\""
}

We can enhance the Lambda function to extract the timestamp, HTTP request, and browser information from the log data, and store them as separate attributes in the JSON document.
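
For example, such an enhanced transformer might emit a document along these lines (the field names are illustrative, not taken from the solution's code):

{
    "ip": "46.99.153.40",
    "timestamp": "29/Jul/2021:15:32:33 +0000",
    "request": "GET / HTTP/1.1",
    "status": "200",
    "bytes": "173",
    "user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

Indexing these as separate attributes lets you filter and aggregate on status codes or browsers in OpenSearch Dashboards instead of searching raw strings.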

Amazon ECS

In the case of Amazon ECS, we use FireLens to send logs directly to Kinesis Data Firehose. FireLens is a container log router for Amazon ECS and AWS Fargate that gives you the extensibility to use the breadth of services at AWS or partner solutions for log analytics and storage.


The architecture hosts FireLens as a sidecar, which collects logs from the main container running an httpd application and sends them to Kinesis Data Firehose, which in turn streams them to Amazon OpenSearch Service. The AWS CDK script provided as part of this solution deploys an httpd container hosted behind an Application Load Balancer. The httpd logs are pushed to Kinesis Data Firehose (ecs-logs-delivery-stream) through the FireLens log router.
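
To illustrate how this routing is declared, the application container's log configuration in the ECS task definition points the awsfirelens log driver at the Fluent Bit firehose output, roughly as follows (a fragment, assuming us-east-1; the CDK script generates the actual task definition):

"logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
        "Name": "firehose",
        "region": "us-east-1",
        "delivery_stream": "ecs-logs-delivery-stream"
    }
}

With this in place, everything the httpd container writes to stdout and stderr is routed through the FireLens sidecar to the delivery stream.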

Amazon EKS

With the recent announcement of Fluent Bit support for Amazon EKS, you no longer need to run a sidecar to route container logs from Amazon EKS pods running on Fargate. With the new built-in logging support, you can select a destination of your choice to send the records to. Amazon EKS on Fargate uses a version of Fluent Bit for AWS, an upstream conformant distribution of Fluent Bit managed by AWS.


The AWS CDK script provided as part of this solution deploys an NGINX container hosted behind an internal Application Load Balancer. The NGINX container logs are pushed to Kinesis Data Firehose (eks-logs-delivery-stream) through the Fluent Bit plugin.
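
With the built-in Fargate logging, the Fluent Bit output is declared in a ConfigMap named aws-logging in the aws-observability namespace (which must exist and carry the label aws-observability: enabled). A minimal sketch, assuming us-east-1, looks like the following; the CDK script applies the equivalent configuration:

kind: ConfigMap
apiVersion: v1
metadata:
  name: aws-logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name firehose
        Match *
        region us-east-1
        delivery_stream eks-logs-delivery-stream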

Lambda

For Lambda functions, you can send logs directly to Kinesis Data Firehose using the Lambda extension. You can also prevent the records from being written to Amazon CloudWatch, for example by denying the function's execution role the CloudWatch Logs permissions.


After deployment, the workflow is as follows:

  1. On startup, the extension subscribes to receive logs for the platform and function events. A local HTTP server is started inside the external extension, which receives the logs.
  2. The extension buffers the log events in a synchronized queue and writes them to Kinesis Data Firehose via PUT records.
  3. Kinesis Data Firehose delivers the logs to the downstream system, in this case Amazon OpenSearch Service.

The Firehose delivery stream name gets specified as an environment variable (AWS_KINESIS_STREAM_NAME).

For this solution, because we’re only focusing on collecting the run logs of the Lambda function, the data transformer of the Kinesis Data Firehose delivery stream keeps only the records of type function ("type":"function") and drops the rest before sending them to Amazon OpenSearch Service.

The following is a sample input for the data transformer:

[
   {
      "time":"2021-07-29T19:54:08.949Z",
      "type":"platform.start",
      "record":{
         "requestId":"024ae572-72c7-44e0-90f5-3f002a1df3f2",
         "version":"$LATEST"
      }
   },
   {
      "time":"2021-07-29T19:54:09.094Z",
      "type":"platform.logsSubscription",
      "record":{
         "name":"kinesisfirehose-logs-extension-demo",
         "state":"Subscribed",
         "types":[
            "platform",
            "function"
         ]
      }
   },
   {
      "time":"2021-07-29T19:54:09.096Z",
      "type":"function",
      "record":"2021-07-29T19:54:09.094Z\tundefined\tINFO\tLoading function\n"
   },
   {
      "time":"2021-07-29T19:54:09.096Z",
      "type":"platform.extension",
      "record":{
         "name":"kinesisfirehose-logs-extension-demo",
         "state":"Ready",
         "events":[
            "INVOKE",
            "SHUTDOWN"
         ]
      }
   },
   {
      "time":"2021-07-29T19:54:09.097Z",
      "type":"function",
      "record":"2021-07-29T19:54:09.097Z\t024ae572-72c7-44e0-90f5-3f002a1df3f2\tINFO\tvalue1 = value1\n"
   },   
   {
      "time":"2021-07-29T19:54:09.098Z",
      "type":"platform.runtimeDone",
      "record":{
         "requestId":"024ae572-72c7-44e0-90f5-3f002a1df3f2",
         "status":"success"
      }
   }
]
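
To see which records such a filter keeps, you can reproduce the selection locally with jq, if you have it installed (assuming the sample above is saved as events.json; in the deployed solution this filtering happens inside the delivery stream's transformer function):

jq '[ .[] | select(.type == "function") ]' events.json

Only the two "type":"function" records survive; the platform.* records are dropped before indexing.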

Prerequisites

To implement this solution, you need the following prerequisites:

  • An AWS account with permissions to create the required resources
  • The AWS Command Line Interface (AWS CLI), configured with credentials for that account
  • Node.js with npm and yarn, to build the AWS CDK code
  • The AWS CDK Toolkit
  • Git, to check out the code

Build the code

Check out the AWS CDK code by running the following command:

mkdir unified-logs && cd unified-logs
git clone https://github.com/aws-samples/unified-log-aggregation-and-analytics .

Build the Lambda extension by running the following command:

cd lib/computes/lambda/extensions
chmod +x extension.sh
./extension.sh
cd ../../../../

Make sure to replace the default AWS Region specified as the value of the firehose.endpoint attribute inside lib/computes/ec2/ec2-startup.sh.

Build the code by running the following command:

yarn install && npm run build

Deploy the code

If you’re running AWS CDK for the first time, run the following command to bootstrap the AWS CDK environment (provide your AWS account ID and AWS Region):

cdk bootstrap \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    aws://<AWS Account Id>/<AWS_REGION>

You only need to bootstrap the AWS CDK one time (skip this step if you have already done this).

Run the following command to deploy the code:

cdk deploy --require-approval never

You get the following output:

 ✅  CdkUnifiedLogStack

Outputs:
CdkUnifiedLogStack.ec2ipaddress = xx.xx.xx.xx
CdkUnifiedLogStack.ecsloadbalancerurl = CdkUn-ecsse-PY4D8DVQLK5H-xxxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.ecsserviceLoadBalancerDNS570CB744 = CdkUn-ecsse-PY4D8DVQLK5H-xxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.ecsserviceServiceURL88A7B1EE = http://CdkUn-ecsse-PY4D8DVQLK5H-xxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.eksclusterClusterNameCE21A0DB = ekscluster92983EFB-d29892f99efc4419bc08534a3d253160
CdkUnifiedLogStack.eksclusterConfigCommand515C0544 = aws eks update-kubeconfig --name ekscluster92983EFB-d29892f99efc4419bc08534a3d253160 --region us-east-1 --role-arn arn:aws:iam::xxx:role/CdkUnifiedLogStack-clustermasterroleCD184EDB-12U2TZHS28DW4
CdkUnifiedLogStack.eksclusterGetTokenCommand3C33A2A5 = aws eks get-token --cluster-name ekscluster92983EFB-d29892f99efc4419bc08534a3d253160 --region us-east-1 --role-arn arn:aws:iam::xxx:role/CdkUnifiedLogStack-clustermasterroleCD184EDB-12U2TZHS28DW4
CdkUnifiedLogStack.elasticdomainarn = arn:aws:es:us-east-1:xxx:domain/cdkunif-elasti-rkiuv6bc52rp
CdkUnifiedLogStack.s3bucketname = cdkunifiedlogstack-logsfailederrcapturebucket0bcc-xxxxx
CdkUnifiedLogStack.samplelambdafunction = CdkUnifiedLogStack-LambdatransformerfunctionFA3659-c8u392491FrW

Stack ARN:
arn:aws:cloudformation:us-east-1:xxxx:stack/CdkUnifiedLogStack/6d53ef40-efd2-11eb-9a9d-1230a5204572

The AWS CDK takes care of building the required infrastructure, deploying the sample applications, and setting up the log collection from the different sources into Amazon OpenSearch Service.

The following is some of the key information about the stack:

  • ec2ipaddress – The public IP address of the EC2 instance, deployed with the sample PHP application
  • ecsloadbalancerurl – The URL of the Amazon ECS Load Balancer, deployed with the httpd application
  • eksclusterClusterNameCE21A0DB – The Amazon EKS cluster name, deployed with the NGINX application
  • samplelambdafunction – The sample Lambda function using the Lambda extension to send logs to Kinesis Data Firehose
  • elasticdomainarn – The ARN of the Amazon OpenSearch Service domain

Generate logs

To visualize the logs, you first need to generate some sample logs.

  1. To generate Lambda logs, invoke the function using the following AWS CLI command (run it a few times):
aws lambda invoke \
--function-name "<<samplelambdafunction>>" \
--payload '{"payload": "hello"}' /tmp/invoke-result \
--cli-binary-format raw-in-base64-out \
--log-type Tail

Make sure to replace samplelambdafunction with the actual Lambda function name. The file path needs to be updated based on the underlying operating system.

The function should return "StatusCode": 200, with the following output:

{
    "StatusCode": 200,
    "LogResult": "<<Encoded>>",
    "ExecutedVersion": "$LATEST"
}
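
The LogResult field is base64-encoded. If you want to read the tail of the execution logs directly from the CLI response, you can decode it in one step (a convenience, not required by the solution; on macOS, use base64 -D instead of base64 --decode):

aws lambda invoke \
--function-name "<<samplelambdafunction>>" \
--payload '{"payload": "hello"}' /tmp/invoke-result \
--cli-binary-format raw-in-base64-out \
--log-type Tail \
--query 'LogResult' --output text | base64 --decode
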
  2. Run the following command a couple of times to generate Amazon EC2 logs:
curl http://ec2ipaddress:80

Make sure to replace ec2ipaddress with the public IP address of the EC2 instance.

  3. Run the following command a couple of times to generate Amazon ECS logs:
curl http://ecsloadbalancerurl:80

Make sure to replace ecsloadbalancerurl with the DNS name of the Application Load Balancer from the stack output.

We deployed the NGINX application with an internal load balancer, so the load balancer hits the application's health check endpoint, which is sufficient to generate the Amazon EKS access logs.

Visualize the logs

To visualize the logs, complete the following steps:

  1. On the Amazon OpenSearch Service console, choose the hyperlink provided for the OpenSearch Dashboards URL.
  2. Configure access to the OpenSearch Dashboards.
  3. In OpenSearch Dashboards, on the Discover menu, start creating a new index pattern for each compute log.

We can see separate indexes for each compute log partitioned by date, as in the following screenshot.


The following screenshot shows the process to create index patterns for Amazon EC2 logs.


After you create the index patterns, you can start analyzing the logs using the Discover menu of OpenSearch Dashboards in the navigation pane. This tool provides a single, searchable, and unified interface for the records from all the compute platforms. You can switch between different logs using the Change index pattern submenu.


Clean up

Run the following command from the root directory to delete the stack:

cdk destroy

Conclusion

In this post, we showed how to unify and centralize logs across different compute platforms using Kinesis Data Firehose and Amazon OpenSearch Service. This approach allows you to analyze logs quickly and identify the root cause of failures, using a single platform rather than a different one for each service.

If you have feedback about this post, submit your comments in the comments section.

About the author

Hari Ohm Prasath is a Senior Modernization Architect at AWS, helping customers with their modernization journey to become cloud native. Hari loves to code and actively contributes to open source initiatives. You can find him on Medium, GitHub, and Twitter (@hariohmprasath).

Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay Area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.